class: center, middle

### W4995 Applied Machine Learning

# NMF; Outlier detection

04/02/18

Andreas C. Müller

???

Today, I want to talk about non-negative matrix factorization and outlier detection.

FIXME matrix factorization should be approximately equal, not equal
FIXME list KL divergence loss for NMF
FIXME related gaussian model and Kernel density to GMM?
FIXME talk more about all outliers are different in different ways vs classification (add example where outlier is better?)

---
class: center, middle

# Non-Negative Matrix Factorization

???

NMF is basically in line with what we talked about with dimensionality reduction but also related to clustering. It’s a particular algorithm in a wider family of matrix factorization algorithms.

---
#Matrix Factorization
.center[
![:scale 90%](images/matrix_factorization.png)
]

???

So a matrix factorization algorithm works like this. We have our data matrix X, which is number of samples times number of features, and we want to factor it into two matrices, A and B. A is number of samples times k, and B is k times number of features. If we do this in some way, we get a new k-dimensional representation for each sample: each row of A corresponds to one sample, and the columns of B correspond to the input features. So B encodes how the entries of A correspond to the original data. Often this is called a latent representation. These new features in A somehow represent the rows of X, and how they relate to the original features is encoded in B.

---
#Matrix Factorization
.center[
![:scale 90%](images/matrix_factorization_2.png)
]

???

---
#PCA
.center[
![:scale 80%](images/pca.png)
]

???

The one example of this that we already saw is principal component analysis. In PCA, B is basically the rotation and then a projection, and A is just the projected data. This is just one particular factorization you can do. I wrote this with an equals sign, but more generally you're trying to find matrices A and B so that their product is as close to equal to X as possible. In particular, for PCA we try to find matrices A and B so that the rows of B are orthogonal and the rank k is restricted; k is the number of components. If we restrict the rows of B to be orthogonal and ask for the product of A and B to be as close to X as possible in the least-squares sense, we get principal component analysis.

---
class:spacious
# Other Matrix Factorizations

- PCA: principal components orthogonal, minimize squared loss
- Sparse PCA: components orthogonal and sparse
- ICA: independent components
- Non-negative matrix factorization (NMF): latent representation and latent features are nonnegative.

???

There’s a bunch of other matrix factorizations. They basically change the requirements we put on the matrices A and B and the loss that we optimize. As I said, PCA wants the rows of B to be orthogonal and minimizes the squared loss to the data. In sparse PCA you do the same, but you ask that the entries of B also be sparse, so you put an L1 norm on them, and that maybe gives you more easily interpretable features, or at least more sparse features. Independent component analysis tries to make sure that the projections given by B are statistically independent. NMF is used more commonly in practice. In this, we restrict ourselves to the matrices being non-negative; what I mean by that is that each entry is a non-negative number.
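To make the factorization view concrete, here is a minimal sketch, with made-up data, of PCA written as X ≈ AB, where A is the transformed data and the rows of B are the orthogonal components; the variable names are just for illustration:

```python
# PCA as a matrix factorization: X is approximated by A @ B (plus the mean, since PCA centers the data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))          # made-up data: 100 samples, 5 features

pca = PCA(n_components=2).fit(X)
A = pca.transform(X)                   # (100, 2): the new k-dimensional representation
B = pca.components_                    # (2, 5):   rows are the orthogonal components
X_approx = A @ B + pca.mean_           # rank-2 reconstruction of X
print(np.mean((X - X_approx) ** 2))    # least-squares reconstruction error
```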
---
#NMF

.center[
![:scale 80%](images/nmf.png)
]

???

In NMF, we approximate X as a product of W and H. W stands for weights, and H stands for hidden representation. We want to find W and H so that their product is as close as possible to X.

---
class: spacious
# Why NMF?

--

- Meaningful signs

--

- Positive weights

--

- No “cancellation” like in PCA

--

- Can learn over-complete representation

???

Since everything is now positive, you have meaningful signs. In PCA, the signs were arbitrary, so you could flip all the signs in a component and it would still be the same thing. In NMF, everything is positive, so there's no flipping of signs. All the weights are positive, which makes it much easier for us to conceptualize. You can think of each data point as being represented as a positive combination of the columns of W, so as a positive combination of these basic parts. This is often thought of as representing something like parts: you basically build up each data point out of these columns of W with positive contributions. If you have both positive and negative coefficients, it's usually much harder to interpret. In PCA, you can get the cancellation effect: you can have components that vary a lot in different directions, and once you add them all up, the variation vanishes. Since you have positive weights, you can think of NMF a little bit like soft clustering, slightly related to GMMs: you can think of the columns of W as being something like prototypes, and you express each data point as a combination of these prototypes. Since they're all positive, it's more similar to clustering. Another thing that’s quite interesting is that you can learn overcomplete representations. What that means is that you can learn representations where you have more components than features. Before, we had this k and it was drawn as being smaller than the number of features; if you have more components than features, you can always trivially make W the identity matrix, and nothing interesting happens. But if you add particular restrictions to W and H, then you can learn something where H is actually wider than X, which might be interesting if you want to extract interesting features. In NMF, you can do this by, for example, imposing sparsity constraints.

- positive weights -> parts / components
- Can be viewed as “soft clustering”: each point is positive linear combination of weights.
- (n_components > n_features) by asking for sparsity (in either W or H)

---

.center[
PCA (ordered by projection, not eigenvalue)

![:scale 70%](images/pca_projection.png)
]

.center[
NMF (ordered by hidden representation)

![:scale 70%](images/nmf_hid_rep.png)
]

???

Here's an example of what this looks like in practice. Here we have NMF and PCA. This is the digit dataset, so the images are 28x28 grayscale images. I took the digits 0 and 1 and tried to express them as a linear combination of the PCA components and as a linear combination of the NMF components. Here, I ordered the components by which one is the most important for each of these data points. For both 0 and 1, the most active component is this one, which is, I guess, either a 0 or a 1, depending on whether it's positive or negative. You can see that the later components hardly look like a 0 or a 1 at all. What you’re seeing here is the cancellation effect taking place. The idea with NMF is that, hopefully, you get something that’s more interpretable, in particular since everything is positive.

Usually, we can't actually look at our data this easily, and so being able to interpret the components might be quite important. Often, people use this for gene analysis, for example, where you have very high-dimensional vectors and you want to see which genes act together; having these positive combinations really helps with interpretability.
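If you want to reproduce this kind of comparison, a rough sketch along these lines is below; it uses scikit-learn's small 8x8 digits as a stand-in for the 28x28 images shown here, and the number of components is an arbitrary choice:

```python
# Sketch: fit PCA and NMF on a digits dataset and look at the learned components side by side.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA

X, y = load_digits(return_X_y=True)        # 8x8 images, pixel values are non-negative

pca = PCA(n_components=5).fit(X)
nmf = NMF(n_components=5, init='nndsvd', max_iter=1000).fit(X)

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, comp in zip(axes[0], pca.components_):
    ax.imshow(comp.reshape(8, 8))          # PCA components can have negative entries
for ax, comp in zip(axes[1], nmf.components_):
    ax.imshow(comp.reshape(8, 8))          # NMF components are non-negative "parts"
plt.show()
```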
---
# Downsides of NMF

- Can only be applied to non-negative data

--

- Interpretability is hit or miss

--

- Non-convex optimization, requires initialization

--

- Not orthogonal

--

.center[
![:scale 60%](images/nmf_downsides.png)
]

???

There are a couple of downsides. The most obvious one is that it can only be applied to non-negative data: if your data has negative entries, then obviously you can't express it as a product of two non-negative matrices. Also, 0 needs to mean something; 0 needs to be the absence of signal for this to make sense. Obviously, you could always make your data positive by subtracting the minimum of the data and shifting the 0 point, but that's not going to work very well. 0 really needs to mean something for NMF to make sense. One kind of data where NMF is commonly used is text data. In text data, you look at word counts, which are always non-negative, so they're a very natural candidate.

As with all things interpretable, the interpretability of the model is hit or miss. It depends a lot on the application and on the data whether you actually get something out that you can understand. This is an unsupervised model, and so the outcome might be anything.

Another issue is that this is a non-convex optimization and it requires initialization. Often this is initialized with PCA, or you could think about initializing it with KMeans or something else; there's no direct formula to compute W and H. In PCA, we can just do an SVD and it gives us the rotation and the representation of the data. In NMF, we basically have to do something like gradient descent. There are several different ways to do this, but basically all of them come down to some form of minimization problem, trying to find W and H. So depending on how you initialize them and what kind of optimization you run, you'll get different outcomes: you will get some local optimum, but you're not going to get the global optimum. That's a bit disappointing.

Even if you did compute this for a dataset: you have your data matrix X, you decompose it into W and H, and now you might want to take new data and project it into the same space. In PCA, you can just apply the rotation to new data. In NMF, you actually have to run the optimization again. There's no forward process that can give you H given W, so you need to keep W fixed and then run the optimization over H. The only way to get the hidden representation for new data is by running an optimization. Given the weights W, the optimization is convex, though, so it's not really problematic and the outcome will always be the same, but it takes a lot longer to run an optimization than to just apply a rotation.

The components are not orthogonal as in PCA, which can sometimes be a bit confusing, because you can't really think about it in terms of projections. Finally, another thing that's quite different in NMF compared to PCA is: a) the components are not ordered, and b) if you want fewer components, they will not be a subset of the larger solution.
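Coming back to the point about transforming new data: in scikit-learn, this re-optimization is what NMF's transform method does. A minimal sketch with made-up non-negative data (note that scikit-learn's naming may be flipped relative to the slides: the fixed parts live in components_, and transform returns the per-sample coefficients):

```python
# Sketch: getting the hidden representation for new data from a fitted NMF.
# transform() keeps the learned components fixed and re-runs the optimization
# for the coefficients of the new samples (there is no closed-form projection).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 10))        # made-up non-negative training data
X_new = rng.uniform(size=(5, 10))      # made-up new data

nmf = NMF(n_components=3, init='nndsvd', max_iter=1000).fit(X)
coeffs_new = nmf.transform(X_new)      # solves for the new representation given nmf.components_
print(coeffs_new.shape)                # (5, 3)
```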
So here is a solution for this positive dataset. If I do NMF with two components, I get these two arrows; they basically point towards the ends of the data, where the extreme points are, so it can express everything in there as a positive combination of the two. If I do only one component, I get the mean of the data, because that's the best I can do with one component. If I change the number of components, I might get completely different results every time. So picking the right number of components is quite critical, because it really influences what the solution looks like. If I compute 20 components and then I compute 10 components, the 10 will be completely different from any subset of the 20 components.

---

.center[
NMF with 20 components on MNIST

![:scale 60%](images/nmf_20_mnist.png)
]

.center[
NMF with 5 components on MNIST

![:scale 60%](images/nmf_5_mnist.png)
]

???

Here's another illustration of this. If I do 20 components, I get the top; if I do 5 components, I get the bottom. You can see these are quite different. If I used more components, they would probably be even more localized and just be sort of small pieces of strokes. With 5 components, you get something more like digits, and all of these together seem to sort of cover the space somewhat.

---
class:spacious
# Applications of NMF

- Text analysis (next week)
- Signal processing
- Speech and Audio (see
librosa
)
- Source separation
- Gene expression analysis

???

This is often used in text analysis. It's also used in signal processing, in particular for speech and audio. It's often used for source separation, when there are multiple speakers and you want to separate out one of the speakers, or if you want to remove the singer's voice and keep the music intact. This can be done using the library librosa. There are also plugins for common audio players where you can just remove the voice with NMF. It's also commonly used in gene expression analysis.

---
class:center,middle
#Outlier Detection

???

The idea in outlier detection is to find points that are different. There are two quite related tasks summarized under outlier detection: outlier detection and novelty detection.

---
#Motivation

.padding-top[
.left-column[
![:scale 100%](images/outlier_detection.png)
]
.right-column[
![:scale 100%](images/novelty_detection.png)
]
]

???

Both are unsupervised methods. The idea is to find things that are different. In outlier detection, your training dataset usually also contains outliers, which makes it a bit dirty, while in novelty detection, you get a dataset and then, later on, someone gives you new data and asks you what is new here. So in novelty detection, your dataset would be clean. In both, you want to identify something that's different from the standard distribution; only in the outlier detection case, there are already different samples in your training dataset. This is one of the relatively few unsupervised problems that are pretty heavily used in practice.

- Find points that are “different” within the training set (and in the future).
- “Novelty detection” - no outliers in the training set.
- Outliers are not labeled! (otherwise it’s just imbalanced classification)
- Often outlier detection and novelty detection used interchangeably in practice.

---
class:spacious
# Applications

- Fraud detection (credit cards, click fraud, ...)
- Network failure detection
- Intrusion detection in networks
- Defect detection (engineering etc…)
- News? Intelligence?

???

- usual assumption: all outliers are different in a different way.
- Often people use classification datasets for outlier detection: that's a bit strange. See homework results.

---
class:spacious
# Basic idea

- Model data distribution $p(X)$

--

- Outlier: $p(X) < \varepsilon$

--

- For outlier detection: be robust in modelling $p(X)$

???

The main idea is that you model your data distribution, p(X), and then you look at the data points that are unlikely under the model. If a point is unlikely under the model, then it's probably an outlier. If you're doing outlier detection, that means your sample is going to be contaminated: there are outliers in the dataset X already. If that is the case, you want to be robust in modeling p(X), so you want to model p(X) in a way that is robust to outliers. Both of these tasks are generally ill-defined: unless you actually know the real data distribution, how well something does is hard to measure. Usually, you don't have ground truth of what the outliers are; if you had that ground truth, you could just do an imbalanced classification task. So, similar to clustering, what we're doing here is building a model, trying to find some outliers, and then checking whether they're actually outliers. If we are satisfied with the things that our model finds, then we can put it into production.
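Before getting into the caveats, here is a rough sketch of that basic recipe from the slide: fit some density model, score every point, and flag the ones whose likelihood falls below a threshold. The single (non-robust) Gaussian and the 10% cutoff are just placeholder choices:

```python
# Basic recipe: model p(X), then flag points with p(x) < epsilon.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))                      # stand-in data

# Fit a (non-robust) Gaussian density to the data.
density = multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))
log_p = density.logpdf(X)                          # log p(x) for every sample
epsilon = np.percentile(log_p, 10)                 # threshold: the 10% least likely points
is_outlier = log_p < epsilon
print(is_outlier.sum(), "points flagged as outliers")
```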
But there's no guarantee that it's going to find some fraction x of the outliers, and since we don't have labeled data, we usually can't really measure our recall: we won't ever know about the samples that we didn't find. With respect to the homework: in the homework we actually have ground-truth data. It's similar to the clustering setting, where researchers are basically cheating and evaluating methods that are unsupervised in a supervised manner. Once you have ground-truth labels, you can evaluate your outlier detection using, for example, AUC, which is what you're going to do in your homework. But in a real-world setting, if you had the labels, you would never use outlier detection.

Again, similar to clustering, what makes an outlier in a particular dataset is ill-defined. It depends on the application what you want to consider an outlier or not. In the homework, there are ill and healthy people, and the ill people are the outliers. But it could also be that the people that are much older than everybody else are the outliers. If you know what the density of the data is supposed to look like, then you can define outliers as the things that don't fit this density. But usually you don't know what the density is supposed to look like, and so there's no clearly defined notion of what an outlier is, similar to how there's no clearly defined notion of what should be a cluster. So we're going to talk through a couple of different algorithms that make different assumptions about what makes a data point an outlier. As I said, we usually start with a model of the data distribution p(X).

- Task is generally ill-defined (unless you know the real data distribution).

---
#Elliptic Envelope

`$$p(X) = \mathcal{N}(\mu, \Sigma)$$`

.center[
![:scale 60%](images/elliptic_envelope.png)
]

???

This leads to the elliptic envelope outlier detection. Basically, it fits a Gaussian to the data and then looks at the points that are not fit well by the Gaussian; those points are the outliers. Since this is meant for the outlier detection task, what we're actually trying to do is find a robust estimate of the mean and covariance matrix, so that we can tolerate some outliers in the training dataset and still sort of recover the real covariance matrix. In this illustration, the black points, the inliers, are what you expect the data to look like, and the outlier distribution is plotted in red. They overlap, so with an unsupervised method you will never be able to label all of these red points as outliers, but we can try to figure out that the points out here are outliers. The way the elliptic envelope works is that it finds a robust version of the covariance matrix: basically, it finds the covariance matrix with the smallest determinant that still covers a large portion of the data. In most outlier detection methods, you have to specify how many outliers you expect. Say we specify 10% of the data to be outliers; then it will try to construct the covariance matrix that covers 90% of the data but has the lowest possible determinant. If you do this, you get the red dotted contours, which come from the covariance fitted with the robust estimate, and the blue contours, which come from the covariance fitted with the maximum likelihood estimate using all the data.

Obviously, these outliers can disturb the covariance matrix, and so the blue one is sort of too thick in one direction whereas the red one is not. So now, if you have this red model, you can basically say that all the data lies within these two standard deviations, and everything outside is an outlier.

Fit robust covariance matrix and mean

FIXME add slide on Covariance: Minimum Covariance Determinant (MinCovDet)
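The robust covariance estimate behind the elliptic envelope is the Minimum Covariance Determinant estimator (MinCovDet in scikit-learn). A small sketch on made-up contaminated data, comparing it with the plain maximum-likelihood estimate:

```python
# Sketch: robust (minimum covariance determinant) vs. maximum-likelihood covariance
# on made-up data contaminated with uniform outliers.
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(42)
inliers = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=450)
outliers = rng.uniform(low=-8, high=8, size=(50, 2))
X = np.vstack([inliers, outliers])

mle = EmpiricalCovariance().fit(X)           # classical estimate, pulled around by the outliers
robust = MinCovDet(random_state=42).fit(X)   # smallest-determinant estimate covering most of the data
print("MLE covariance:\n", mle.covariance_)
print("robust covariance:\n", robust.covariance_)
```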
---
#Elliptic Envelope

- Preprocessing with PCA might help sometimes.

.smaller[
```python
from sklearn.covariance import EllipticEnvelope
ee = EllipticEnvelope(contamination=.1).fit(X)
pred = ee.predict(X)
print(pred)
print(np.mean(pred == -1))
```
]

.smaller[
```python
[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 -1 1 1 1 -1 -1 1 -1 1 1 -1 1 -1 -1 -1 1 1 -1 -1 -1
 -1 1 -1 1 1]
0.104
```
]

???

This is how we do this in scikit-learn. The covariance module has all the robust covariance methods. As I said, you have to specify the contamination, which is how many outliers you expect; here, I set it to 10%. If I predict, I get 1s and -1s. All the outlier models in scikit-learn use 1 for inliers and -1 for outliers. The mean of the prediction tells me that about 10% of the training data was labeled as an outlier, which is what we would expect, given that we set the contamination to 10%. You can also get a continuous score saying how much of an outlier each point is with the score_samples method. Here, in the elliptic envelope, the contamination parameter changes how the model is fit; for some other models, it will only change the threshold. It's important here to have the right contamination parameter to get the right covariance fit. For a real-world model, set it based on what your expectations are.

---
# Failure-case: Non-Gaussian Data

.center[
![:scale 80%](images/elliptic_envelope_plot.png)
]

???

If the data is not Gaussian, we get this. In this dataset, my intuition was that the three isolated points are the outliers and the rest is the normal data. Since the data is non-Gaussian, it gives you the 10% of points that are furthest away from the mean as outliers, and this is not what I wanted at all. So if your data is very non-Gaussian, then the method clearly will not work well. Now, we could obviously use a more complicated density model. We talked about Gaussian mixture models; instead of fitting a single Gaussian, we could try to fit multiple Gaussians. If you just use the Gaussian mixture models in scikit-learn, they will not do a robust fit, so that might make more sense in the novelty detection setting than in the outlier detection setting, because it will try to fit all of the data. You can still try it on the outlier detection task: if I fit three Gaussians here, it'll probably give me the right outliers. But then again, I need to know the number of components for my GMM for this to work well, so I still need to make assumptions about what the density is.
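As a rough sketch of that GMM idea (the made-up data, the three components, and the 10% threshold are all just illustrative choices):

```python
# Sketch: use a Gaussian mixture as the density model and threshold its log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)   # note: not a robust fit
log_p = gmm.score_samples(X)                    # log-likelihood of each point
outliers = log_p < np.percentile(log_p, 10)     # flag the 10% least likely points
print(outliers.sum())
```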
Another approach, instead of having a parametric density model like this, is to use a non-parametric density model like a kernel density estimate.

Could do mixtures of gaussians obviously!

---
class: center, spacious
#Kernel Density

![:scale 80%](images/kde_vs_histogram.png)

???

KDE is the simplest non-parametric estimate of a probability distribution. The ticks at the bottom of the left plot are data points. One way to visualize the distribution is to do a histogram. For the KDE estimate on the right, we put a small Gaussian blob around each data point. Each data point at the bottom corresponds to one of these Gaussians, and then we just sum them all up, which gives you this smooth density. So this is a little bit like a GMM where we have as many components as data points. You could also use other kernels. The word kernel here means something slightly different than in SVMs: kernel here means kernel in the sense of signal processing. Another commonly used kernel is the top hat kernel, which is a kernel in the signal processing sense but not a kernel in the support vector machine sense; the Gaussian kernel is a kernel in both senses. A kernel can mean a lot of different things, and here it basically means a windowing function. Any windowing function would work for this: you put a little bit of probability mass around each data point. An obvious problem with this is that you need to pick the bandwidth.

- Non-parametric density model
- Gaussian blob on each data point
- Doesn’t work well in high dimensions

---
class:center
# Kernel Bandwidth

![:scale 50%](images/kde_bandwidth.png)

???

Depending on what bandwidth you pick, you can either oversmooth or undersmooth the data. Here, in red, you've picked too small a bandwidth, in green you've picked too large a bandwidth, and black is probably a decent bandwidth. Again, this is an unsupervised problem, so it's very hard to do this. You can actually use cross-validation to adjust the bandwidth, by looking at the score of the held-out data. This is also not a robust estimate, so if you have outliers in your data, they might influence your estimate. The other problem is that KDEs don't work well in higher dimensions. You can obviously do this not only in one dimension but in any number of dimensions, but then you run into the curse of dimensionality: in a higher-dimensional space, you need more and more data points to actually fill the space with these small Gaussian blobs. If you wanted to do a histogram in, say, 10 dimensions and you don't have enough data, then most of the histogram cells will be empty. This is only a smooth version of the histogram, so it has the same problem. If your space is high-dimensional, most of the space will be empty and this is not going to work very well.

- Need to adjust kernel bandwidth
- Unsupervised model, so how to pick kernel bandwidth?
- cross-validation can be used to pick the bandwidth, but if there's outliers in the training data, could go wrong?

---

.smaller[
```python
from sklearn.neighbors import KernelDensity

kde = KernelDensity(bandwidth=3)
kde.fit(X_train_noise)
pred = kde.score_samples(X_train_noise)
pred = (pred > np.percentile(pred, 10)).astype(int)
```
]

.center[
![:scale 70%](images/kernel_density_bw3.png)
]

???

If your space is low-dimensional and you can plot it, it might work nicely. Here, I might have used cross-validation to find out that a bandwidth of 3 is good, and then I can look at the scores. KDE is not an outlier detection method in scikit-learn, but I can use it as one by looking at score_samples, which gives the log probabilities of all the data points under this probability model, and saying that everything that's higher than the 10th percentile is an inlier. So basically, I now label 10% of the data as outliers, and I actually get the right three points that I wanted, plus a couple more points. Obviously, I get more points, because I told it I want 10% of my data to be outliers. So this really is a very simple method, but it doesn't work well in higher dimensions and it gets very slow, since you have a lot of these kernels that you need to evaluate; you basically need to compute the distances between all pairs of points.
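As mentioned on the kernel bandwidth slide, one way to pick the bandwidth is to cross-validate the held-out log-likelihood. A minimal sketch of that idea, with made-up data and an arbitrary grid of bandwidths:

```python
# Sketch: choose the KDE bandwidth by cross-validated log-likelihood.
# GridSearchCV scores each candidate with KernelDensity.score on the held-out fold.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))          # stand-in data

grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.logspace(-1, 1, 20)},
                    cv=5)
grid.fit(X)
print(grid.best_params_)               # the bandwidth with the best held-out log-likelihood
```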
---
# One Class SVM

- Also uses Gaussian kernel to cover data
- Only select support vectors (not all points)
- Specify outlier ratio (contamination) via nu

.smaller[
```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

scaler = StandardScaler()
X_train_noise_scaled = scaler.fit_transform(X_train_noise)
oneclass = OneClassSVM(nu=.1).fit(X_train_noise_scaled)
pred = oneclass.predict(X_train_noise_scaled)
```
]

???

A more sophisticated variant of this is the one-class SVM. This also uses Gaussian kernels to basically cover the data, but it selects only support vectors, not all points, as basis points. You get a similar function as in KDE, but the density function is supported only by some support vectors. Again, you need to specify the bandwidth parameter gamma, so this only makes sense with an RBF kernel. It's quite similar to what KDE does, but it only selects support vectors and does it in a more robust way. You also have to set the number of outliers you expect via nu. Again, nu is part of the optimization process, so setting the outlier fraction differently will change how the model is fit.

- Need to adjust kernel bandwidth
- nu is "training mistakes"

---

.center[
![:scale 80%](images/one_class_svm_plot.png)
]

???

This was with a particular setting of gamma, and you can see that it seems like a somewhat reasonable model. If I made gamma a little bit smaller, it would probably have found the right points. But here it's even harder to adjust the gamma parameter, because there's no real way to do it: for KDE, I can do cross-validation to see how good the probability model is, while the one-class SVM doesn't have a probability model, so I can't do cross-validation or anything. I just need to pick gamma in some way that makes sense to me, which is not great. If you can visualize the data, that obviously helps, but in higher dimensions you need to look at projections or other things to see how to set gamma. So in a sense, this is also a sort of non-parametric density estimate, but it doesn't really have a probabilistic model.

---
class:center,middle
#Isolation Forests

???

The final model I want to talk about is also a non-parametric estimate that doesn't have a probability model; it's called the isolation forest. By far, it's my favorite, since it has no parameters to tune.

---
#Idea

- Outliers are easy to isolate from the rest

.center[
![:scale 80%](images/isolation_forests.png)
]

???

The idea of the isolation forest is that if you build a random tree over a dataset and ask how easy it is to split off a particular point, it's much easier to split off a point that's an outlier, on the outside of the data, than a point that's somewhere where the data is very dense. So you build many completely random trees; it's completely unsupervised, it just keeps splitting the data in some way, and we look at how deep we need to go to isolate a data point from the other data points. If on average we have to go very deep into the tree, the point is probably in a dense region and not an outlier. If on average we split off the point very early, it's probably an outlier.

- Measure as Path length!
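If you want to see this effect directly, here is a toy sketch of the isolation idea (not the actual isolation forest algorithm, just repeated random axis-aligned splits until a chosen point is alone); the data and all names here are made up:

```python
# Toy sketch: count how many random splits it takes to isolate a given point.
import numpy as np

def isolation_depth(X, idx, rng):
    """Random axis-aligned splits until point `idx` is alone; returns the number of splits."""
    mask = np.ones(len(X), dtype=bool)
    depth = 0
    while mask.sum() > 1:
        feature = rng.randint(X.shape[1])
        lo, hi = X[mask, feature].min(), X[mask, feature].max()
        if lo == hi:
            break
        threshold = rng.uniform(lo, hi)
        side = X[:, feature] <= threshold
        # keep only the side that contains our point
        mask &= side if side[idx] else ~side
        depth += 1
    return depth

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])    # dense blob plus one outlier
inlier_depths = [isolation_depth(X, 0, rng) for _ in range(100)]
outlier_depths = [isolation_depth(X, len(X) - 1, rng) for _ in range(100)]
print(np.mean(inlier_depths), np.mean(outlier_depths))      # the outlier isolates in far fewer splits
```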
---
class:center,middle

.left-column[
![:scale 100%](images/isolation_forests.png)
]
.right-column[
![:scale 100%](images/avgpathlen_numtrees.png)
]

???

If you add more and more completely random trees, you get a relatively stable score that tells you whether a point is an outlier or not. You can think of this as sort of trying to model the density of the data, but there's no probabilistic model here. For x1, over 1000 trees, you need a very long path length on average: you go very deep into a tree before it's isolated from the other points, which means it's in a very dense region. Whereas x0, on average, is split off quite early from the rest of the data points, so it's probably an outlier. Basically, if you're on the outside of the data, then no matter what feature is picked, you're likely to be split off, given that you're an outlier with respect to any of these features.

---
#Normalizing the Path Length

Average path length of unsuccessful search in Binary Search Tree:

`$$ c(n) = 2H(n-1) - \left(\frac{2(n-1)}{n}\right) \text{ (H = Harmonic number)}$$`

`$$ s(x,n) = 2^{-\frac{E(h(x))}{c(n)}} \text{ (h = depth in tree)}$$`

- s < 0.5 : definite inlier
- s close to 1: outlier

???

To make this more coherent, we need to normalize the path lengths: depending on how many data points there are, you expect to go deeper into the tree to separate something. You can calculate the average path length of an unsuccessful search in a binary search tree, which is similar to trying to isolate a point, and you can compute this number c(n). The outlier score that we actually compute is 2 to the minus the average path length over all the trees, divided by this average path length of an unsuccessful search. It only depends on n, the number of data points, so basically we are only normalizing the score so that it makes sense independent of the dataset size. If this score is smaller than 0.5, then you're definitely an inlier; if it's close to 1, it's an outlier. The way you determine outliers is by thresholding the score. So setting the number of expected outliers doesn't change the algorithm at all, it only changes the threshold for this score function, and picking the number of outliers in advance is not as important.

---
class:spacious
# Building the forest

- Subsample dataset for each tree
- Default sample size of 256 works surprisingly well
- Stop growing tree at depth log_2(sample size) – so 8
- No bootstrapping
- More trees are better – default 100
- Need to specify contamination rate

???

It has no parameters to tune, and it's quite simple to do this. We subsample the dataset for each tree and pick 256 samples; the default value of 256 samples always works, no matter what the dataset size is. We stop growing the tree at depth log_2(sample size), which is 8. So you repeatedly draw, without replacement, 256 samples from your data, grow trees of depth 8, and then look at the average path length needed to isolate a point. Obviously, as with all random forests, more trees are better; the default in scikit-learn is 100. In principle, these are free parameters of the algorithm, like how much to subsample for each tree and how deep to grow each tree, but people don't tune these parameters and it just works well. The contamination rate only picks the threshold on this final score.
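For reference, this is roughly what using it in scikit-learn looks like; the data here is made up and the parameter values just echo the defaults mentioned above:

```python
# Sketch: IsolationForest usage; predictions follow the usual 1 = inlier, -1 = outlier convention.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(300, 2)), rng.uniform(-6, 6, size=(30, 2))])   # stand-in data

iso = IsolationForest(n_estimators=100, contamination=0.1, random_state=0).fit(X)
pred = iso.predict(X)              # 1 for inliers, -1 for outliers
scores = iso.score_samples(X)      # the lower the score, the more anomalous the point
print(np.mean(pred == -1))         # roughly the contamination rate
```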
FIXME all the text

---

.center[![:scale 90%](images/building_forest_1.png)]

???

Here is what the algorithm produces, and it worked as well as I thought it would. Remember, this is a toy dataset; it doesn't really tell you that much about how it works in the real world.

---

.center[![:scale 90%](images/building_forest_2.png)]

???

Here, I plotted the score for each data point. You could do this for the other models, too. Since it's tree-based, it's not as nice and smooth as it is for the support vector machine or the KDE. I didn't have to tune any kernel bandwidth or anything, and that's kind of nice. Again, it also depends on what your assumptions are about outliers. This will not work very well for the homework, and you can think about why it does not work very well for your homework.

---
class:spacious
# Other density-based models

- PCA
- GMMs
- Robust PCA (not in sklearn :-()
- Any other probabilistic model - “robust” is better.

???

You can use other density-based models. You can use just PCA, which is also like a high-dimensional Gaussian model in some sense, where you drop some of the directions. You can use GMMs. There's a robust variant of PCA that is, unfortunately, not in scikit-learn. You can use any probabilistic model that you want, but you need to think about whether the model is appropriate for the data that you're trying to model. And if you expect there are a lot of outliers in your training dataset already, then you might need to think about how to make the model robust. PCA is not robust, while robust PCA is; if you have very big outliers in your dataset, they will skew your PCA results, and so that might not work as well.

robust only needed for outlier detection, not novelty detection.

---

.center[
![:scale 60%](images/other_density_models_1.png)
]
.center[
![:scale 60%](images/other_density_models_2.png)
]
.center[
![:scale 60%](images/other_density_models_3.png)
]

???

Here is a comparison of three of the four models that we talked about: isolation forest, one-class SVM, and robust covariance. Basically, the robust covariance works perfectly for an isotropic Gaussian, because that's what it fits. If you have multiple Gaussians, it kind of breaks down: if they're close enough together, maybe it makes sense to model them as a joint Gaussian, but if you put them further apart, it will change the covariance shape and so it will have a bad model of your data. The isolation forest does reasonably well in all cases; this is a 2D dataset, so it can just find dense regions without a problem. The one-class SVM gives slightly strange results, probably because of which support vectors it picked to cover the data. Ideally, the one-class SVM is supposed to be robust to contamination in the training set, but as we can see, it's not that robust; it might work better when you have a clean training dataset. There's no definition of true outliers, obviously, but here we have drawn the data very densely from either one or two Gaussian distributions (the inliers are white) and then added some outliers drawn uniformly over the whole square. The idea is basically that there are three different distributions that the data is drawn from: some Gaussian points and some points that are just uniform, and you want to isolate the very dense points from the uniform, not very dense points.

---
class:spacious
# Summary

- Isolation Forest works great!
- Density models are great if they are correct.
- Estimating bandwidth can be tricky in the unsupervised setting.
- Validation of results often requires manual inspection.

???

As with all unsupervised methods, for outlier detection, validating the model and tuning the parameters are really hard. The more your model depends on parameters (the one-class SVM, for example, depends a lot on its parameters), the trickier this gets. As with clustering, validation often means looking into the data, looking at single data points and why they are outliers, and trying to interpret the results, because if you had labels, why wouldn't you just learn a classifier? One possible approach that I didn't talk about: if you have a big dataset that's not labeled, you can run an outlier detection algorithm, find, say, the 10% most outlier-like points according to your algorithm, confirm manually whether they are outliers or not, and then run a classifier. It depends a little bit on whether the outliers are all outliers in a similar way or in different ways. If all your outliers are outliers in a different way, then running a classifier will actually not work, and in that setting you might be better off with an outlier detection method. Even if you have labels, if all of your outliers are outliers in a very different way, it might be better to just build the model on the non-outlier data and then call everything else an outlier, instead of trying to learn a classifier. If there's no dense region of outliers, then you can't learn a classifier for that.

---
class: middle

# Questions ?