class: center, middle

### W4995 Applied Machine Learning

# NMF; Outlier detection

04/02/18

Andreas C. Müller

???

Today, I want to talk about non-negative matrix factorization and outlier detection.

FIXME matrix factorization should be approximately equal, not equal
FIXME list KL divergence loss for NMF
FIXME related gaussian model and Kernel density to GMM?
FIXME talk more about all outliers are different in different ways vs classification (add example where outlier is better?)

---
class: center, middle

# Non-Negative Matrix Factorization

???

NMF is basically in line with what we talked about with dimensionality reduction but also related to clustering. It’s a particular algorithm in a wider family of matrix factorization algorithms.

---
#Matrix Factorization
.center[
![:scale 90%](images/matrix_factorization.png)
]

???

So a matrix factorization algorithm works like this. We have our data matrix X, which is number of samples times number of features, and we want to factor it into two matrices, A and B. A is number of samples times k, and B is k times number of features. If we do this in some way, we get a new k-dimensional representation for each sample: each row of A corresponds to one sample, and the columns of B correspond to the input features. So B encodes how the entries of A correspond to the original data. Often this is called a latent representation. These new features in A somehow represent the rows of X, and how they relate to the original features is encoded in B.

---
#Matrix Factorization
.center[
![:scale 90%](images/matrix_factorization_2.png)
]

???

---
#PCA
.center[
![:scale 80%](images/pca.png)
]

???

The one example of this that we already saw is principal component analysis. In PCA, B is basically the rotation and then a projection, and A is just the projected data. This is just one particular factorization you can do. I wrote this with an equals sign, but more generally you're trying to find matrices A and B so that their product is as close to equal to X as possible. In particular, for PCA we try to find matrices A and B so that the rows of B are orthogonal and the rank k is restricted; k is the number of components. If we restrict the rows of B to be orthogonal and ask for the product of A and B to be as close to X as possible in the least-squares sense, we get principal component analysis.

---
class:spacious
# Other Matrix Factorizations

- PCA: principal components orthogonal, minimize squared loss
- Sparse PCA: components orthogonal and sparse
- ICA: independent components
- Non-negative matrix factorization (NMF): latent representation and latent features are nonnegative.

???

There’s a bunch of other matrix factorizations. They basically change the requirements we put on the matrices A and B and the loss that we optimize. As I said, PCA wants the rows of B to be orthogonal and minimizes the squared loss to the data. In sparse PCA you do the same, but you ask that the entries of B also be sparse, so you put an L1 norm on them, and that maybe gives you more easily interpretable features, or at least more sparse features. Independent component analysis tries to make sure that the projections given by B are statistically independent. NMF is used more commonly in practice. In this, we restrict ourselves to the matrices being non-negative; what I mean by that is that each entry is a non-negative number.
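To make the factorization view concrete, here is a minimal sketch, with made-up data, of PCA written as X ≈ AB, where A is the transformed data and the rows of B are the orthogonal components; the variable names are just for illustration:

```python
# PCA as a matrix factorization: X is approximated by A @ B (plus the mean, since PCA centers the data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))          # made-up data: 100 samples, 5 features

pca = PCA(n_components=2).fit(X)
A = pca.transform(X)                   # (100, 2): the new k-dimensional representation
B = pca.components_                    # (2, 5):   rows are the orthogonal components
X_approx = A @ B + pca.mean_           # rank-2 reconstruction of X
print(np.mean((X - X_approx) ** 2))    # least-squares reconstruction error
```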
---
#NMF

.center[
![:scale 80%](images/nmf.png)
]

???

In NMF, we approximate X as a product of W and H. W stands for weights, and H stands for hidden representation. We want to find W and H so that their product is as close as possible to X.

---
class: spacious
# Why NMF?

--

- Meaningful signs

--

- Positive weights

--

- No “cancellation” like in PCA

--

- Can learn over-complete representation

???

Since everything is now positive, you have meaningful signs. In PCA, the signs were arbitrary, so you could flip all the signs in a component and it would still be the same thing. In NMF, everything is positive, so there's no flipping of signs. All the weights are positive, which makes it much easier for us to conceptualize. You can think of each data point as being represented as a positive combination of the columns of W, so as a positive combination of these basic parts. This is often thought of as representing something like parts: you basically build up each data point out of these columns of W with positive contributions. If you have both positive and negative coefficients, it's usually much harder to interpret. In PCA, you can get the cancellation effect: you can have components that vary a lot in different directions, and once you add them all up, the variation vanishes. Since you have positive weights, you can think of NMF a little bit like soft clustering, slightly related to GMMs: you can think of the columns of W as being something like prototypes, and you express each data point as a combination of these prototypes. Since they're all positive, it's more similar to clustering. Another thing that’s quite interesting is that you can learn overcomplete representations. What that means is that you can learn representations where you have more components than features. Before, we had this k and it was drawn as being smaller than the number of features; if you have more components than features, you can always trivially make W the identity matrix, and nothing interesting happens. But if you add particular restrictions to W and H, then you can learn something where H is actually wider than X, which might be interesting if you want to extract interesting features. In NMF, you can do this by, for example, imposing sparsity constraints.

- positive weights -> parts / components
- Can be viewed as “soft clustering”: each point is positive linear combination of weights.
- (n_components > n_features) by asking for sparsity (in either W or H)

---

.center[
PCA (ordered by projection, not eigenvalue)

![:scale 70%](images/pca_projection.png)
]

.center[
NMF (ordered by hidden representation)

![:scale 70%](images/nmf_hid_rep.png)
]

???

Here's an example of what this looks like in practice. Here we have NMF and PCA. This is the digit dataset, so the images are 28x28 grayscale images. I took the digits 0 and 1 and tried to express them as a linear combination of the PCA components and as a linear combination of the NMF components. Here, I ordered the components by which one is the most important for each of these data points. For both 0 and 1, the most active component is this one, which is, I guess, either a 0 or a 1, depending on whether it's positive or negative. You can see that the later components hardly look like a 0 or a 1 at all. What you’re seeing here is the cancellation effect taking place. The idea with NMF is that, hopefully, you get something that’s more interpretable, in particular since everything is positive.

Usually, we can't actually look at our data this easily, and so being able to interpret the components might be quite important. Often, people use this for gene analysis, for example, where you have very high-dimensional vectors and you want to see which genes act together; having these positive combinations really helps with interpretability.
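If you want to reproduce this kind of comparison, a rough sketch along these lines is below; it uses scikit-learn's small 8x8 digits as a stand-in for the 28x28 images shown here, and the number of components is an arbitrary choice:

```python
# Sketch: fit PCA and NMF on a digits dataset and look at the learned components side by side.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA

X, y = load_digits(return_X_y=True)        # 8x8 images, pixel values are non-negative

pca = PCA(n_components=5).fit(X)
nmf = NMF(n_components=5, init='nndsvd', max_iter=1000).fit(X)

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, comp in zip(axes[0], pca.components_):
    ax.imshow(comp.reshape(8, 8))          # PCA components can have negative entries
for ax, comp in zip(axes[1], nmf.components_):
    ax.imshow(comp.reshape(8, 8))          # NMF components are non-negative "parts"
plt.show()
```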
---
# Downsides of NMF

- Can only be applied to non-negative data

--

- Interpretability is hit or miss

--

- Non-convex optimization, requires initialization

--

- Not orthogonal

--

.center[
![:scale 60%](images/nmf_downsides.png)
]

???

There are a couple of downsides. The most obvious one is that it can only be applied to non-negative data: if your data has negative entries, then obviously you can't express it as a product of two non-negative matrices. Also, 0 needs to mean something; 0 needs to be the absence of signal for this to make sense. Obviously, you could always make your data positive by subtracting the minimum of the data and shifting the 0 point, but that's not going to work very well. 0 really needs to mean something for NMF to make sense. One kind of data where NMF is commonly used is text data. In text data, you look at word counts, which are always non-negative, so they're a very natural candidate.

As with all things interpretable, the interpretability of the model is hit or miss. It depends a lot on the application and on the data whether you actually get something out that you can understand. This is an unsupervised model, and so the outcome might be anything.

Another issue is that this is a non-convex optimization and it requires initialization. Often this is initialized with PCA, or you could think about initializing it with KMeans or something else; there's no direct formula to compute W and H. In PCA, we can just do an SVD and it gives us the rotation and the representation of the data. In NMF, we basically have to do something like gradient descent. There are several different ways to do this, but basically all of them come down to some form of minimization problem, trying to find W and H. So depending on how you initialize them and what kind of optimization you run, you'll get different outcomes: you will get some local optimum, but you're not going to get the global optimum. That's a bit disappointing.

Even if you did compute this for a dataset: you have your data matrix X, you decompose it into W and H, and now you might want to take new data and project it into the same space. In PCA, you can just apply the rotation to new data. In NMF, you actually have to run the optimization again. There's no forward process that can give you H given W, so you need to keep W fixed and then run the optimization over H. The only way to get the hidden representation for new data is by running an optimization. Given the weights W, the optimization is convex, though, so it's not really problematic and the outcome will always be the same, but it takes a lot longer to run an optimization than to just apply a rotation.

The components are not orthogonal as in PCA, which can sometimes be a bit confusing, because you can't really think about it in terms of projections. Finally, another thing that's quite different in NMF compared to PCA is: a) the components are not ordered, and b) if you want fewer components, they will not be a subset of the larger solution.
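Coming back to the point about transforming new data: in scikit-learn, this re-optimization is what NMF's transform method does. A minimal sketch with made-up non-negative data (note that scikit-learn's naming may be flipped relative to the slides: the fixed parts live in components_, and transform returns the per-sample coefficients):

```python
# Sketch: getting the hidden representation for new data from a fitted NMF.
# transform() keeps the learned components fixed and re-runs the optimization
# for the coefficients of the new samples (there is no closed-form projection).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 10))        # made-up non-negative training data
X_new = rng.uniform(size=(5, 10))      # made-up new data

nmf = NMF(n_components=3, init='nndsvd', max_iter=1000).fit(X)
coeffs_new = nmf.transform(X_new)      # solves for the new representation given nmf.components_
print(coeffs_new.shape)                # (5, 3)
```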
So here is a solution for this positive dataset. If I do NMF with two components, I get these two arrows; they basically point towards the ends of the data, where the extreme points are, so it can express everything in there as a positive combination of the two. If I do only one component, I get the mean of the data, because that's the best I can do with one component. If I change the number of components, I might get completely different results every time. So picking the right number of components is quite critical, because it really influences what the solution looks like. If I compute 20 components and then I compute 10 components, the 10 will be completely different from any subset of the 20 components.

---

.center[
NMF with 20 components on MNIST

![:scale 60%](images/nmf_20_mnist.png)
]

.center[
NMF with 5 components on MNIST

![:scale 60%](images/nmf_5_mnist.png)
]

???

Here's another illustration of this. If I do 20 components, I get the top; if I do 5 components, I get the bottom. You can see these are quite different. If I used more components, they would probably be even more localized and just be sort of small pieces of strokes. With 5 components, you get something more like digits, and all of these together seem to sort of cover the space somewhat.

---
class:spacious
# Applications of NMF

- Text analysis (next week)
- Signal processing
- Speech and Audio (see
librosa
)
- Source separation
- Gene expression analysis

???

This is often used in text analysis. It's also used in signal processing, in particular for speech and audio. It's often used for source separation, when there are multiple speakers and you want to separate out one of the speakers, or if you want to remove the singer's voice and keep the music intact. This can be done using the library librosa. There are also plugins for common audio players where you can just remove the voice with NMF. It's also commonly used in gene expression analysis.

---
class:center,middle
#Outlier Detection

???

The idea in outlier detection is to find points that are different. There are two quite related tasks summarized under outlier detection: outlier detection and novelty detection.

---
#Motivation

.padding-top[
.left-column[
![:scale 100%](images/outlier_detection.png)
]
.right-column[
![:scale 100%](images/novelty_detection.png)
]
]

???

Both are unsupervised methods. The idea is to find things that are different. In outlier detection, your training dataset usually also contains outliers, which makes it a bit dirty, while in novelty detection, you get a dataset and then, later on, someone gives you new data and asks you what is new here. So in novelty detection, your dataset would be clean. In both, you want to identify something that's different from the standard distribution; only in the outlier detection case, there are already different samples in your training dataset. This is one of the relatively few unsupervised problems that are pretty heavily used in practice.

- Find points that are “different” within the training set (and in the future).
- “Novelty detection” - no outliers in the training set.
- Outliers are not labeled! (otherwise it’s just imbalanced classification)
- Often outlier detection and novelty detection used interchangeably in practice.

---
class:spacious
# Applications

- Fraud detection (credit cards, click fraud, ...)
- Network failure detection
- Intrusion detection in networks
- Defect detection (engineering etc…)
- News? Intelligence?

???

- usual assumption: all outliers are different in a different way.
- Often people use classification datasets for outlier detection: that's a bit strange. See homework results.

---
class:spacious
# Basic idea

- Model data distribution $p(X)$

--

- Outlier: $p(X) < \varepsilon$

--

- For outlier detection: be robust in modelling $p(X)$

???

The main idea is that you model your data distribution, p(X), and then you look at the data points that are unlikely under the model. If a point is unlikely under the model, then it's probably an outlier. If you're doing outlier detection, that means your sample is going to be contaminated: there are outliers in the dataset X already. If that is the case, you want to be robust in modeling p(X), so you want to model p(X) in a way that is robust to outliers. Both of these tasks are generally ill-defined: unless you actually know the real data distribution, how well something does is hard to measure. Usually, you don't have ground truth of what the outliers are; if you had that ground truth, you could just do an imbalanced classification task. So, similar to clustering, what we're doing here is building a model, trying to find some outliers, and then checking whether they're actually outliers. If we are satisfied with the things that our model finds, then we can put it into production.
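Before getting into the caveats, here is a rough sketch of that basic recipe from the slide: fit some density model, score every point, and flag the ones whose likelihood falls below a threshold. The single (non-robust) Gaussian and the 10% cutoff are just placeholder choices:

```python
# Basic recipe: model p(X), then flag points with p(x) < epsilon.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))                      # stand-in data

# Fit a (non-robust) Gaussian density to the data.
density = multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))
log_p = density.logpdf(X)                          # log p(x) for every sample
epsilon = np.percentile(log_p, 10)                 # threshold: the 10% least likely points
is_outlier = log_p < epsilon
print(is_outlier.sum(), "points flagged as outliers")
```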
But there's no guarantee that it's going to find some fraction x of the outliers, and since we don't have labeled data, we usually can't really measure our recall: we won't ever know about the samples that we didn't find. With respect to the homework: in the homework we actually have ground-truth data. It's similar to the clustering setting, where researchers are basically cheating and evaluating methods that are unsupervised in a supervised manner. Once you have ground-truth labels, you can evaluate your outlier detection using, for example, AUC, which is what you're going to do in your homework. But in a real-world setting, if you had the labels, you would never use outlier detection.

Again, similar to clustering, what makes an outlier in a particular dataset is ill-defined. It depends on the application what you want to consider an outlier or not. In the homework, there are ill and healthy people, and the ill people are the outliers. But it could also be that the people that are much older than everybody else are the outliers. If you know what the density of the data is supposed to look like, then you can define outliers as the things that don't fit this density. But usually you don't know what the density is supposed to look like, and so there's no clearly defined notion of what an outlier is, similar to how there's no clearly defined notion of what should be a cluster. So we're going to talk through a couple of different algorithms that make different assumptions about what makes a data point an outlier. As I said, we usually start with a model of the data distribution p(X).

- Task is generally ill-defined (unless you know the real data distribution).

---
#Elliptic Envelope

`$$p(X) = \mathcal{N}(\mu, \Sigma)$$`

.center[
![:scale 60%](images/elliptic_envelope.png)
]

???

This leads to the elliptic envelope outlier detection. Basically, it fits a Gaussian to the data and then looks at the points that are not fit well by the Gaussian; those points are the outliers. Since this is meant for the outlier detection task, what we're actually trying to do is find a robust estimate of the mean and covariance matrix, so that we can tolerate some outliers in the training dataset and still sort of recover the real covariance matrix. In this illustration, the black points, the inliers, are what you expect the data to look like, and the outlier distribution is plotted in red. They overlap, so with an unsupervised method you will never be able to label all of these red points as outliers, but we can try to figure out that the points out here are outliers. The way the elliptic envelope works is that it finds a robust version of the covariance matrix: basically, it finds the covariance matrix with the smallest determinant that still covers a large portion of the data. In most outlier detection methods, you have to specify how many outliers you expect. Say we specify 10% of the data to be outliers; then it will try to construct the covariance matrix that covers 90% of the data but has the lowest possible determinant. If you do this, you get the red dotted contours, which come from the covariance fitted with the robust estimate, and the blue contours, which come from the covariance fitted with the maximum likelihood estimate using all the data.

Obviously, these outliers can disturb the covariance matrix, and so the blue one is sort of too thick in one direction whereas the red one is not. So now, if you have this red model, you can basically say that all the data lies within these two standard deviations, and everything outside is an outlier.

Fit robust covariance matrix and mean

FIXME add slide on Covariance: Minimum Covariance Determinant (MinCovDet)
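The robust covariance estimate behind the elliptic envelope is the Minimum Covariance Determinant estimator (MinCovDet in scikit-learn). A small sketch on made-up contaminated data, comparing it with the plain maximum-likelihood estimate:

```python
# Sketch: robust (minimum covariance determinant) vs. maximum-likelihood covariance
# on made-up data contaminated with uniform outliers.
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(42)
inliers = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=450)
outliers = rng.uniform(low=-8, high=8, size=(50, 2))
X = np.vstack([inliers, outliers])

mle = EmpiricalCovariance().fit(X)           # classical estimate, pulled around by the outliers
robust = MinCovDet(random_state=42).fit(X)   # smallest-determinant estimate covering most of the data
print("MLE covariance:\n", mle.covariance_)
print("robust covariance:\n", robust.covariance_)
```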
---
#Elliptic Envelope

- Preprocessing with PCA might help sometimes.

.smaller[
```python
from sklearn.covariance import EllipticEnvelope
ee = EllipticEnvelope(contamination=.1).fit(X)
pred = ee.predict(X)
print(pred)
print(np.mean(pred == -1))
```
]

.smaller[
```python
[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 -1 1 1 1 -1 -1 1 -1 1 1 -1 1 -1 -1 -1 1 1 -1 -1 -1
 -1 1 -1 1 1]
0.104
```
]

???

This is how we do this in scikit-learn. The covariance module has all the robust covariance methods. As I said, you have to specify the contamination, which is how many outliers you expect; here, I set it to 10%. If I predict, I get 1s and -1s. All the outlier models in scikit-learn use 1 for inliers and -1 for outliers. The mean of the prediction tells me that about 10% of the training data was labeled as an outlier, which is what we would expect, given that we set the contamination to 10%. You can also get a continuous score saying how much of an outlier each point is with the score_samples method. Here, in the elliptic envelope, the contamination parameter changes how the model is fit; for some other models, it will only change the threshold. It's important here to have the right contamination parameter to get the right covariance fit. For a real-world model, set it based on what your expectations are.

---
# Failure-case: Non-Gaussian Data

.center[
![:scale 80%](images/elliptic_envelope_plot.png)
]

???

If the data is not Gaussian, we get this. In this dataset, my intuition was that the three isolated points are the outliers and the rest is the normal data. Since the data is non-Gaussian, it gives you the 10% of points that are furthest away from the mean as outliers, and this is not what I wanted at all. So if your data is very non-Gaussian, then the method clearly will not work well. Now, we could obviously use a more complicated density model. We talked about Gaussian mixture models; instead of fitting a single Gaussian, we could try to fit multiple Gaussians. If you just use the Gaussian mixture models in scikit-learn, they will not do a robust fit, so that might make more sense in the novelty detection setting than in the outlier detection setting, because it will try to fit all of the data. You can still try it on the outlier detection task: if I fit three Gaussians here, it'll probably give me the right outliers. But then again, I need to know the number of components for my GMM for this to work well, so I still need to make assumptions about what the density is.
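As a rough sketch of that GMM idea (the made-up data, the three components, and the 10% threshold are all just illustrative choices):

```python
# Sketch: use a Gaussian mixture as the density model and threshold its log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3, 0, 3)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)   # note: not a robust fit
log_p = gmm.score_samples(X)                    # log-likelihood of each point
outliers = log_p < np.percentile(log_p, 10)     # flag the 10% least likely points
print(outliers.sum())
```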
Another approach, instead of having a parametric density model like this, is to use a non-parametric density model like a kernel density estimate.

Could do mixtures of gaussians obviously!

---
class: center, spacious
#Kernel Density

![:scale 80%](images/kde_vs_histogram.png)

???

KDE is the simplest non-parametric estimate of a probability distribution. The ticks at the bottom of the left plot are data points. One way to visualize the distribution is to do a histogram. For the KDE estimate on the right, we put a small Gaussian blob around each data point. Each data point at the bottom corresponds to one of these Gaussians, and then we just sum them all up, which gives you this smooth density. So this is a little bit like a GMM where we have as many components as data points. You could also use other kernels. The word kernel here means something slightly different than in SVMs: kernel here means kernel in the sense of signal processing. Another commonly used kernel is the top hat kernel, which is a kernel in the signal processing sense but not a kernel in the support vector machine sense; the Gaussian kernel is a kernel in both senses. A kernel can mean a lot of different things, and here it basically means a windowing function. Any windowing function would work for this: you put a little bit of probability mass around each data point. An obvious problem with this is that you need to pick the bandwidth.

- Non-parametric density model
- Gaussian blob on each data point
- Doesn’t work well in high dimensions

---
class:center
# Kernel Bandwidth

![:scale 50%](images/kde_bandwidth.png)

???

Depending on what bandwidth you pick, you can either oversmooth or undersmooth the data. Here, in red, you've picked too small a bandwidth, in green you've picked too large a bandwidth, and black is probably a decent bandwidth. Again, this is an unsupervised problem, so it's very hard to do this. You can actually use cross-validation to adjust the bandwidth, by looking at the score of the held-out data. This is also not a robust estimate, so if you have outliers in your data, they might influence your estimate. The other problem is that KDEs don't work well in higher dimensions. You can obviously do this not only in one dimension but in any number of dimensions, but then you run into the curse of dimensionality: in a higher-dimensional space, you need more and more data points to actually fill the space with these small Gaussian blobs. If you wanted to do a histogram in, say, 10 dimensions and you don't have enough data, then most of the histogram cells will be empty. This is only a smooth version of the histogram, so it has the same problem. If your space is high-dimensional, most of the space will be empty and this is not going to work very well.

- Need to adjust kernel bandwidth
- Unsupervised model, so how to pick kernel bandwidth?
- cross-validation can be used to pick the bandwidth, but if there's outliers in the training data, could go wrong?

---

.smaller[
```python
from sklearn.neighbors import KernelDensity

kde = KernelDensity(bandwidth=3)
kde.fit(X_train_noise)
pred = kde.score_samples(X_train_noise)
pred = (pred > np.percentile(pred, 10)).astype(int)
```
]

.center[
![:scale 70%](images/kernel_density_bw3.png)
]

???

If your space is low-dimensional and you can plot it, it might work nicely. Here, I might have used cross-validation to find out that a bandwidth of 3 is good, and then I can look at the scores. KDE is not an outlier detection method in scikit-learn, but I can use it as one by looking at score_samples, which gives the log probabilities of all the data points under this probability model, and saying that everything that's higher than the 10th percentile is an inlier. So basically, I now label 10% of the data as outliers, and I actually get the right three points that I wanted, plus a couple more points. Obviously, I get more points, because I told it I want 10% of my data to be outliers. So this really is a very simple method, but it doesn't work well in higher dimensions and it gets very slow, since you have a lot of these kernels that you need to evaluate; you basically need to compute the distances between all pairs of points.
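As mentioned on the kernel bandwidth slide, one way to pick the bandwidth is to cross-validate the held-out log-likelihood. A minimal sketch of that idea, with made-up data and an arbitrary grid of bandwidths:

```python
# Sketch: choose the KDE bandwidth by cross-validated log-likelihood.
# GridSearchCV scores each candidate with KernelDensity.score on the held-out fold.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))          # stand-in data

grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.logspace(-1, 1, 20)},
                    cv=5)
grid.fit(X)
print(grid.best_params_)               # the bandwidth with the best held-out log-likelihood
```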
---
# One Class SVM

- Also uses Gaussian kernel to cover data
- Only select support vectors (not all points)
- Specify outlier ratio (contamination) via nu

.smaller[
```python
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

scaler = StandardScaler()
X_train_noise_scaled = scaler.fit_transform(X_train_noise)
oneclass = OneClassSVM(nu=.1).fit(X_train_noise_scaled)
pred = oneclass.predict(X_train_noise_scaled)
```
]

???

A more sophisticated variant of this is the one-class SVM. This also uses Gaussian kernels to basically cover the data, but it selects only support vectors, not all points, as basis points. You get a similar function as in KDE, but the density function is supported only by some support vectors. Again, you need to specify the bandwidth parameter gamma, so this only makes sense with an RBF kernel. It's quite similar to what KDE does, but it only selects support vectors and does it in a more robust way. You also have to set the number of outliers you expect via nu. Again, nu is part of the optimization process, so setting the outlier fraction differently will change how the model is fit.

- Need to adjust kernel bandwidth
- nu is "training mistakes"

---

.center[
![:scale 80%](images/one_class_svm_plot.png)
]

???

This was with a particular setting of gamma, and you can see that it seems like a somewhat reasonable model. If I made gamma a little bit smaller, it would probably have found the right points. But here it's even harder to adjust the gamma parameter, because there's no real way to do it: for KDE, I can do cross-validation to see how good the probability model is, while the one-class SVM doesn't have a probability model, so I can't do cross-validation or anything. I just need to pick gamma in some way that makes sense to me, which is not great. If you can visualize the data, that obviously helps, but in higher dimensions you need to look at projections or other things to see how to set gamma. So in a sense, this is also a sort of non-parametric density estimate, but it doesn't really have a probabilistic model.

---
class:center,middle
#Isolation Forests

???

The final model I want to talk about is also a non-parametric estimate that doesn't have a probability model; it's called the isolation forest. By far, it's my favorite, since it has no parameters to tune.

---
#Idea

- Outliers are easy to isolate from the rest

.center[
![:scale 80%](images/isolation_forests.png)
]

???

The idea of the isolation forest is that if you build a random tree over a dataset and ask how easy it is to split off a particular point, it's much easier to split off a point that's an outlier, on the outside of the data, than a point that's somewhere where the data is very dense. So you build many completely random trees; it's completely unsupervised, it just keeps splitting the data in some way, and we look at how deep we need to go to isolate a data point from the other data points. If on average we have to go very deep into the tree, the point is probably in a dense region and not an outlier. If on average we split off the point very early, it's probably an outlier.

- Measure as Path length!
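If you want to see this effect directly, here is a toy sketch of the isolation idea (not the actual isolation forest algorithm, just repeated random axis-aligned splits until a chosen point is alone); the data and all names here are made up:

```python
# Toy sketch: count how many random splits it takes to isolate a given point.
import numpy as np

def isolation_depth(X, idx, rng):
    """Random axis-aligned splits until point `idx` is alone; returns the number of splits."""
    mask = np.ones(len(X), dtype=bool)
    depth = 0
    while mask.sum() > 1:
        feature = rng.randint(X.shape[1])
        lo, hi = X[mask, feature].min(), X[mask, feature].max()
        if lo == hi:
            break
        threshold = rng.uniform(lo, hi)
        side = X[:, feature] <= threshold
        # keep only the side that contains our point
        mask &= side if side[idx] else ~side
        depth += 1
    return depth

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])    # dense blob plus one outlier
inlier_depths = [isolation_depth(X, 0, rng) for _ in range(100)]
outlier_depths = [isolation_depth(X, len(X) - 1, rng) for _ in range(100)]
print(np.mean(inlier_depths), np.mean(outlier_depths))      # the outlier isolates in far fewer splits
```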
---
class:center,middle

.left-column[
![:scale 100%](images/isolation_forests.png)
]
.right-column[
![:scale 100%](images/avgpathlen_numtrees.png)
]

???

If you add more and more completely random trees, you get a relatively stable score that tells you whether a point is an outlier or not. You can think of this as sort of trying to model the density of the data, but there's no probabilistic model here. For x1, over 1000 trees, you need a very long path length on average: you go very deep into a tree before it's isolated from the other points, which means it's in a very dense region. Whereas x0, on average, is split off quite early from the rest of the data points, so it's probably an outlier. Basically, if you're on the outside of the data, then no matter what feature is picked, you're likely to be split off, given that you're an outlier with respect to any of these features.

---
#Normalizing the Path Length

Average path length of unsuccessful search in Binary Search Tree:

`$$ c(n) = 2H(n-1) - \left(\frac{2(n-1)}{n}\right) \text{ (H = Harmonic number)}$$`

`$$ s(x,n) = 2^{-\frac{E(h(x))}{c(n)}} \text{ (h = depth in tree)}$$`

- s < 0.5 : definite inlier
- s close to 1: outlier

???

To make this more coherent, we need to normalize the path lengths: depending on how many data points there are, you expect to go deeper into the tree to separate something. You can calculate the average path length of an unsuccessful search in a binary search tree, which is similar to trying to isolate a point, and you can compute this number c(n). The outlier score that we actually compute is 2 to the minus the average path length over all the trees, divided by this average path length of an unsuccessful search. It only depends on n, the number of data points, so basically we are only normalizing the score so that it makes sense independent of the dataset size. If this score is smaller than 0.5, then you're definitely an inlier; if it's close to 1, it's an outlier. The way you determine outliers is by thresholding the score. So setting the number of expected outliers doesn't change the algorithm at all, it only changes the threshold for this score function, and picking the number of outliers in advance is not as important.

---
class:spacious
# Building the forest

- Subsample dataset for each tree
- Default sample size of 256 works surprisingly well
- Stop growing tree at depth log_2(sample size) – so 8
- No bootstrapping
- More trees are better – default 100
- Need to specify contamination rate

???

It has no parameters to tune, and it's quite simple to do this. We subsample the dataset for each tree and pick 256 samples; the default value of 256 samples always works, no matter what the dataset size is. We stop growing the tree at depth log_2(sample size), which is 8. So you repeatedly draw, without replacement, 256 samples from your data, grow trees of depth 8, and then look at the average path length needed to isolate a point. Obviously, as with all random forests, more trees are better; the default in scikit-learn is 100. In principle, these are free parameters of the algorithm, like how much to subsample for each tree and how deep to grow each tree, but people don't tune these parameters and it just works well. The contamination rate only picks the threshold on this final score.
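For reference, this is roughly what using it in scikit-learn looks like; the data here is made up and the parameter values just echo the defaults mentioned above:

```python
# Sketch: IsolationForest usage; predictions follow the usual 1 = inlier, -1 = outlier convention.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(300, 2)), rng.uniform(-6, 6, size=(30, 2))])   # stand-in data

iso = IsolationForest(n_estimators=100, contamination=0.1, random_state=0).fit(X)
pred = iso.predict(X)              # 1 for inliers, -1 for outliers
scores = iso.score_samples(X)      # the lower the score, the more anomalous the point
print(np.mean(pred == -1))         # roughly the contamination rate
```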
FIXME all the text

---

.center[![:scale 90%](images/building_forest_1.png)]

???

Here is what the algorithm produces, and it worked as well as I thought it would. Remember, this is a toy dataset; it doesn't really tell you that much about how it works in the real world.

---

.center[![:scale 90%](images/building_forest_2.png)]

???

Here, I plotted the score for each data point. You could do this for the other models, too. Since it's tree-based, it's not as nice and smooth as it is for the support vector machine or the KDE. I didn't have to tune any kernel bandwidth or anything, and that's kind of nice. Again, it also depends on what your assumptions are about outliers. This will not work very well for the homework, and you can think about why it does not work very well for your homework.

---
class:spacious
# Other density-based models

- PCA
- GMMs
- Robust PCA (not in sklearn :-()
- Any other probabilistic model - “robust” is better.

???

You can use other density-based models. You can use just PCA, which is also like a high-dimensional Gaussian model in some sense, where you drop some of the directions. You can use GMMs. There's a robust variant of PCA that is, unfortunately, not in scikit-learn. You can use any probabilistic model that you want, but you need to think about whether the model is appropriate for the data that you're trying to model. And if you expect there are a lot of outliers in your training dataset already, then you might need to think about how to make the model robust. PCA is not robust, while robust PCA is; if you have very big outliers in your dataset, they will skew your PCA results, and so that might not work as well.

robust only needed for outlier detection, not novelty detection.

---

.center[
![:scale 60%](images/other_density_models_1.png)
]
.center[
![:scale 60%](images/other_density_models_2.png)
]
.center[
![:scale 60%](images/other_density_models_3.png)
]

???

Here is a comparison of three of the four models that we talked about: isolation forest, one-class SVM, and robust covariance. Basically, the robust covariance works perfectly for an isotropic Gaussian, because that's what it fits. If you have multiple Gaussians, it kind of breaks down: if they're close enough together, maybe it makes sense to model them as a joint Gaussian, but if you put them further apart, it will change the covariance shape and so it will have a bad model of your data. The isolation forest does reasonably well in all cases; this is a 2D dataset, so it can just find dense regions without a problem. The one-class SVM gives slightly strange results, probably because of which support vectors it picked to cover the data. Ideally, the one-class SVM is supposed to be robust to contamination in the training set, but as we can see, it's not that robust; it might work better when you have a clean training dataset. There's no definition of true outliers, obviously, but here we have drawn the data very densely from either one or two Gaussian distributions (the inliers are white) and then added some outliers drawn uniformly over the whole square. The idea is basically that there are three different distributions that the data is drawn from: some Gaussian points and some points that are just uniform, and you want to isolate the very dense points from the uniform, not very dense points.

---
class:spacious
# Summary

- Isolation Forest works great!
- Density models are great if they are correct.
- Estimating bandwidth can be tricky in the unsupervised setting.
- Validation of results often requires manual inspection.

???

As with all unsupervised methods, for outlier detection, validating the model and tuning the parameters are really hard. The more your model depends on parameters (the one-class SVM, for example, depends a lot on its parameters), the trickier this gets. As with clustering, validation often means looking into the data, looking at single data points and why they are outliers, and trying to interpret the results, because if you had labels, why wouldn't you just learn a classifier? One possible approach that I didn't talk about: if you have a big dataset that's not labeled, you can run an outlier detection algorithm, find, say, the 10% most outlier-like points according to your algorithm, confirm manually whether they are outliers or not, and then run a classifier. It depends a little bit on whether the outliers are all outliers in a similar way or in different ways. If all your outliers are outliers in a different way, then running a classifier will actually not work, and in that setting you might be better off with an outlier detection method. Even if you have labels, if all of your outliers are outliers in a very different way, it might be better to just build the model on the non-outlier data and then call everything else an outlier, instead of trying to learn a classifier. If there's no dense region of outliers, then you can't learn a classifier for that.

---
class: middle

# Questions ?