class: center, middle

### W4995 Applied Machine Learning

# Evaluating Clustering

03/28/18

Andreas C. Müller

???
Last time we talked about clustering, and an obvious question is: which algorithm do we use, and how do we set the parameters? That's a bit more of an art than a science, unfortunately. But I want to give you some tools to do better than just guessing.

FIXME illustration of clustering picking up on different concepts?
FIXME talk about http://datamining.rutgers.edu/publication/internalmeasures.pdf
FIXME Calinski-Harabaz
FIXME S_Dbw? https://pdfs.semanticscholar.org/dc44/df745fbf5794066557e52074d127b31248b2.pdf
FIXME checkout https://www.coursera.org/learn/cluster-analysis/lecture/baJNC/6-5-external-measure-2-entropy-based-measures

---
class: center, middle

# Evaluation Measures

???
The first thing I want to talk about is evaluation measures. There are basically two types: supervised, where you have a ground-truth clustering, and unsupervised, where you don't. Let's start with the supervised methods.

---
class: spacious

# Supervised evaluation

- Compare clustering against "Ground Truth".
- Unclear what the assumptions are.
- Impossible to do in practice.

???
- Usually classification datasets.
- Often used for papers on clustering algorithms.
- Compare partitions, not labels!

Here you assume that you are provided some "true" cluster labels that you want to compare against. What people often do is just use a classification dataset and treat it as an unsupervised clustering problem. This is particularly done by researchers in clustering who want to show that their algorithm performs better than some other algorithm. Doing this, you are measuring how well a completely unsupervised algorithm can identify the classes in the dataset as clusters.

The problem here is that it's very unclear what kind of assumptions about the clusters you are making with this labeling of the data. There might be many valid ways to cluster a dataset, and it's unclear why the class labels should somehow correspond to the "best" clustering, or in what sense this clustering is "best". It's very unlikely that there's a coherent definition of cluster such that the classes correspond to clusters for any particular dataset, unless the classes are really well separated.

The final downside is that this is actually impossible in practice, and therefore basically useless there. Why would you use a clustering algorithm if you have ground truth? You'd always be better off using a classification algorithm. Even if you have very little labeled data, using it in some way is probably still much better than running a completely unsupervised clustering algorithm. So basically, if you're running a supervised evaluation method, why are you doing clustering at all? This makes sense if you're studying clustering algorithms, but not really in any practical application. I still want to talk you through how these methods work.

One thing to keep in mind for all the cluster evaluation metrics is that they compare partitions, not labelings. While clustering outcomes in scikit-learn (and other libraries) are always given as integers, you don't want to compare the actual integers, as you do in supervised learning. You want to compare the partitions given by the cluster labels.

---
class: spacious

# Thinking in Partitions

- Partition: set expressed as a union of disjoint sets.
- Y=[0, 0, 0, 1, 1, 1] → C={{0, 1, 2}, {3, 4, 5}}
- Y=[0, 1, 0, 1, 2, 2] → C={{0, 2}, {1, 3}, {4, 5}}
- Partitions given by [1, 0, 1, 0] and [0, 1, 0, 1] are the same!

???
Here's a small example. Partitioning the data means expressing it as a union of disjoint sets, so a partitioning is a set of sets. In this example, the labeling 0,0,0,1,1,1 means that the first three data points are in the same cluster, and the remaining data points are in a different cluster. I wrote that down as a partition here, using the sample indices. Here's another example, 0,1,0,1,2,2, which means the zeroth and second samples are in one set, the first and third samples are in a different set, and the fourth and fifth are in a different one again. So the labels really have no semantics, and the partitions given by 0,1,0,1 and 1,0,1,0 are the same: if you compare these two clusterings, they should be regarded as exactly the same clustering.

---
# Contingency Matrix

For clusterings `$X_i$` and `$Y_j$`, `$n_{ij}=|X_i \cap Y_j|$`

`$$
\begin{array}{c|cccc|c}
{{} \atop X}\!\diagdown\!^Y & Y_1& Y_2& \ldots& Y_s& \text{Sums} \\
\hline
X_1& n_{11}& n_{12}& \ldots& n_{1s}& a_1 \\
X_2& n_{21}& n_{22}& \ldots& n_{2s}& a_2 \\
\vdots& \vdots& \vdots& \ddots& \vdots& \vdots \\
X_r& n_{r1}& n_{r2}& \ldots& n_{rs}& a_r \\
\hline
\text{Sums}& b_1& b_2& \ldots& b_s&
\end{array}$$`

???
The most basic way to compare two different clustering results, say yours and the ground truth, or two different algorithms, is the contingency matrix. You might know contingency matrices from statistics, where they are used to relate discrete probability distributions. The contingency matrix is quite similar to a confusion matrix, in that it counts how often a data point was given label i in one clustering and label j in the other clustering. So rows correspond to one clustering, and columns correspond to the other. However, because labels don't mean anything, the diagonal has no meaning, and both of the contingency matrices on the next slide mean that the two clusterings produce the same partition, i.e. they are identical. Also, contingency matrices do not need to be square: if you run clustering on a dataset, you don't usually know the number of clusters, so you want to compare clusterings with different numbers of clusters.

FIXME needs example

Now that we have this basic representation of cluster overlap, we can define derived metrics.

---
# Contingency Matrix

- Both of these mean identical partitions:

.center[
![:scale 20%](images/contingency_matrix_1.png)
![:scale 20%](images/contingency_matrix_2.png)
]

---
# Mutual Information

`$$\Large MI(U,V)=\sum\limits_{i=1}^{|U|}\sum\limits_{j=1}^{|V|}P(i,j)\log\frac{P(i,j)}{P(i)P'(j)}$$`

`$$\Large =\sum\limits_{i=1}^{|U|}\sum\limits_{j=1}^{|V|}\frac{|U_i \cap V_j|}{N} \log\frac{N|U_i \cap V_j|}{|U_i||V_j|}$$`

???
- Non-negative, but not normalized or adjusted. If V is a sub-partition of U, MI is still maximal.

A very common metric is mutual information, which is the mutual information of the discrete distributions given by the partitions (or the contingency table).

FIXME unify notation (second equation only?)
FIXME define P(i), P(i, j)

Mutual information is a standard information-theoretic measure for relating two distributions, providing a measure of how interdependent they are. Here we just create distributions corresponding to the partitions and compare those. We can rewrite this using the contingency matrix: the probability P(i, j) is just the entry n_ij divided by the number of samples, and P(i) and P(j) are the row and column sums divided by the number of samples.
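As a quick sanity check, here is a minimal sketch (not from the lecture) that computes the mutual information directly from the contingency counts and compares it with scikit-learn's `mutual_info_score`; the two labelings are just the toy examples from the partition slide.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

# Toy labelings from the partition examples above
u = [0, 1, 0, 1, 2, 2]
v = [0, 0, 0, 1, 1, 1]

C = contingency_matrix(u, v)   # entry (i, j): samples with label i in u and j in v
N = C.sum()

mi = 0.0
for i in range(C.shape[0]):
    for j in range(C.shape[1]):
        n_ij = C[i, j]
        if n_ij > 0:
            # P(i, j) * log( P(i, j) / (P(i) P(j)) ), using natural log (nats)
            mi += (n_ij / N) * np.log(N * n_ij / (C[i, :].sum() * C[:, j].sum()))

print(mi, mutual_info_score(u, v))  # the two values agree
```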
Mutual information is always non-negative; larger values mean more agreement. However, if one partition is a sub-partitioning of the other partition, the mutual information is still maximal, so over-partitioning is not penalized.

---
# NMI and AMI

`$$ NMI(U,V) = \frac{MI(U,V)}{\sqrt{H(U)H(V)}}$$`

`$$ AMI = \frac{MI - E[MI]}{\max(H(U),H(V))-E[MI]} $$`

???
Similar to the rand index, the mutual information depends on the number of samples and the size of the clusters, and it's common to normalize it or adjust it for chance, to give it a more universal scale. Two common variants are the normalized mutual information NMI, which divides the mutual information by the square root of the product of the entropies of the marginal distributions, and the adjusted mutual information AMI, which also has an adjustment for chance.

FIXME definition of entropy?

Both use the entropy to penalize over-partitioning, but only AMI is adjusted for chance, so it is zero in expectation. These two usually behave pretty similarly. There are other variants of NMI, for example dividing by the maximum of the entropies, as I did here for AMI.

- Penalizes over-partitioning (via entropy)
- Adjusts for chance – any two random partitions have expected AMI of 0

FIXME use max both times?
FIXME explain expectation better!

---
# Rand Index

- Computed for partitions U, V over the same elements (say {0, …, n-1})

`$$ R(U, V) = \frac{r+s}{\binom{n}{2}} $$`

- $r$ : Number of pairs of points in the same set in U and in V
- $s$ : Number of pairs of points in different sets in both U and V
- $\binom{n}{2}$ : Number of possible pairs
- Between 0 and 1

???
One of the most useful ones is the Rand index. For two partitions U and V, it computes the number of pairs of points that are in the same set in U and in the same set in V, called r here, and the number of pairs of points that are in different sets in both U and V, called s here. This is divided by n choose 2, the number of all possible pairs of points. The result is always between zero and one, though the actual possible minimum and maximum depend on the number of clusters and the number of points.

Let's do an example. We're comparing the labeling 0,0,1,1 and the labeling 1,1,0,0. They give the same partition, so hopefully the rand index is going to be 1. First, I write these as partitions. Then I check which pairs of points are in the same set in both partitions. The pair 0,1 is in the same set in both partitions, so we count that. The pair 2,3 is also in the same set in both partitions, so we count that as well. No other pairs are in the same set in either partition, so we're done counting for r. Now we count the pairs of points that are in different sets. Zero and 2 are in different sets in partition one, and also in different sets in partition two. Same for 0 and 3, 1 and 2, and 1 and 3. So that's four pairs in total. The numerator is 2 + 4 = 6. The denominator is 4 choose 2, which is also six, so the rand index is 1, as expected.

For a second example, let's look at the two labelings 0,1,0,1 and 0,0,1,2. Here there's no pair of points that is in the same set in both of them. There are three pairs of points that are in different sets in both of them: (1,2), (2,3) and (0,3). So the rand index is 3/6, that is 1/2.
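To make the pair counting concrete, here is a tiny illustrative sketch (the helper `rand_index` is just made up for this note) that reproduces the two worked examples above and also shows that the adjusted variant ignores the actual label values.

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def rand_index(u, v):
    """Unadjusted Rand index via explicit pair counting."""
    n = len(u)
    same, diff = 0, 0
    for i, j in combinations(range(n), 2):
        if u[i] == u[j] and v[i] == v[j]:
            same += 1          # pair together in both partitions
        elif u[i] != u[j] and v[i] != v[j]:
            diff += 1          # pair separated in both partitions
    return (same + diff) / (n * (n - 1) / 2)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0, identical partitions
print(rand_index([0, 1, 0, 1], [0, 0, 1, 2]))   # 0.5, as in the example
# The adjusted variant is also invariant to relabeling:
print(adjusted_rand_score([0, 1, 0, 1], [1, 0, 1, 0]))  # 1.0
```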
Alright, so the rand index is about matching pairs of points, which is slightly tricky to do in your head, but not that complicated really.

---
# Rand Index Example

`$$R([0,0,1,1],[1,1,0,0]) = R\left(\{\{0,1\},\{2,3\}\},\{\{0,1\},\{2,3\}\} \right)$$`

`$$ = \frac{2+4}{6} = 1 $$`

`$$R([0,1,0,1],[0,0,1,2]) = R(\{\{0,2\},\{1,3\}\},\{\{0,1\},\{2\},\{3\}\})$$`

`$$ = \frac{0+3}{6} = \frac{1}{2} $$`

???

---
# Adjusted Rand Index (ARI)

`$$ AdjustedIndex = \frac{\text{Index}-\text{ExpectedIndex}}{\max\text{Index}-\text{ExpectedIndex}}$$`

`$$ARI = \frac{\sum_{ij}\binom{n_{ij}}{2} - \frac{\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}}{\binom{n}{2}}} {\frac{1}{2}\left(\sum_{i}\binom{a_{i}}{2} + \sum_{j}\binom{b_{j}}{2}\right) -\frac{\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}}{\binom{n}{2}}}$$`

???
- Maximum of 1, 0 in expectation (can become negative!)

[Don't learn the formula]

Because the range of the rand index depends on the data, in practice people prefer to use the adjusted rand index, which is adjusted for chance. We take the index we just computed, subtract the expected index for this dataset, and divide by the maximum index on this dataset minus the expected index. This means the expectation over random cluster assignments is zero, and the maximum over cluster assignments is 1. However, it also means the adjusted rand index can become negative, namely if the partitions are less similar than they would be in expectation, i.e. by chance.

What this works out to is this formula. The n_ij are the entries of the contingency table we saw before, and the a_i and b_j are the row and column sums of that contingency table. I really don't wanna remember that formula, so don't worry, it's not gonna be on the test. Just keep in mind that it measures exactly the same thing as the rand index, only adjusted for chance. You can see that all of the normalization terms only depend on the row and column sums, so on the number of samples in each of the clusters. Only the first part depends on the full contingency table. The most common use of this is to compare a particular clustering to a ground-truth clustering.

---
# KMeans on digits dataset

.center[![:scale 70%](images/ari_nmi_ami_digits.png)]

???
One of the most common applications is picking a number of clusters or tuning a clustering algorithm. So we could try to use the supervised methods to figure out what a good number of clusters is for clustering the digits dataset with KMeans. Here are the results, using adjusted rand index, adjusted mutual information and normalized mutual information. As you can see, they all have basically the same peak, interestingly, at 11. If you run KMeans with 11 clusters, this gives us the best mutual information and the best rand index with respect to the ground-truth labeling of the digits. You can also see that for the normalized mutual information, over-partitioning isn't penalized that strongly: if you pick more and more clusters, it doesn't really degrade that much. The adjusted rand index and the adjusted mutual information degrade more quickly, so it's very clear where the maximum is.

Q: What are the two partitions I compare? One is KMeans and the other one is the class labels.
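For reference, a scan like the one in this figure can be produced roughly as follows; this is a sketch, not the exact script behind the plot, and the random state and cluster range are arbitrary choices.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             normalized_mutual_info_score)

digits = load_digits()
n_clusters_range = range(2, 30)
ari, ami, nmi = [], [], []
for k in n_clusters_range:
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(digits.data)
    # compare the KMeans partition against the class labels
    ari.append(adjusted_rand_score(digits.target, labels))
    ami.append(adjusted_mutual_info_score(digits.target, labels))
    nmi.append(normalized_mutual_info_score(digits.target, labels))

plt.plot(n_clusters_range, ari, label="ARI")
plt.plot(n_clusters_range, ami, label="AMI")
plt.plot(n_clusters_range, nmi, label="NMI")
plt.xlabel("n_clusters")
plt.legend()
```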
---
class: center, middle

# Unsupervised Evaluation

???
Now let's talk about unsupervised evaluation. These measures are often a bit vaguer, because you need to define what a good clustering is given only the data and the partition. I find them a little bit less helpful, but you can actually use them in practice.

---
# Silhouette Score

- For each sample:
`$$ s = \frac{b-a}{\max(a,b)} $$`
- a is the mean distance to samples in the same cluster
- b is the mean distance to samples in the nearest other cluster
- For whole clustering: average over all samples
- Prefers compact clusters (like k-means)

???
One of the classical ones is the silhouette score. For each sample, you compute the mean distance a to the other samples in the same cluster and the mean distance b to the samples in the nearest other cluster, and take the difference b - a divided by the maximum of the two. This score can be computed for every data point; if you want a score for the whole partition, you just average it over all the samples. One of the downsides is that it prefers compact clusters and not complex cluster shapes: KMeans will usually do pretty well on this, whereas DBSCAN will not.

---
.center[
![:scale 60%](images/ari_silhoutte.png)]

???
Here, the ARI compares to the ground truth, where one half moon has one label and the other half moon has the other. The silhouette score is completely unsupervised and measures the compactness of the clusters, in a sense. You can see that they basically completely disagree. In the top left, since the clusters are quite compact, the silhouette score is the highest of the bunch, even though there's one cluster that spans both of the half moons. The one in the bottom right recovers the ground truth, so the ARI is very high, but the silhouette score is really low; it says this is the worst clustering. The silhouette score will not be able to pick up on these more complex cluster shapes, it will always favor more compact clusters. Depending on your application, that might or might not be okay.

---
.center[
![:scale 80%](images/number_clusters.png)]

- What's the right number of clusters?

???
This is kind of a trivial example, but still, in higher dimensions it might not be as obvious. So how many clusters are there? It's drawn from four Gaussians. And actually, instead of just looking at the silhouette score averaged over everything, I can look at the individual silhouette scores of the points.

---
# Silhouette Plots

.center[
![:scale 70%](images/silhouette_plots.png)
]

- Plot sorted silhouette score for each sample in each cluster
- http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

???
Here at the top, I have KMeans run with eight clusters. The thing that people often use is these silhouette plots, which show the sorted silhouette score for each point in each of the clusters. People find these diagrams help them figure out whether a clustering is good or not. What you're looking for is a smooth curve, which indicates that most of the points have a high silhouette score and only a couple are on the fringes of the cluster. If you see something that's not so smoothly curved, it indicates that there are a couple of samples with a very high silhouette score, then a sudden drop, and then some more points. This might also indicate that there are two clusters within one cluster, which is actually the case here. And if you over-partition your data, you might get something like the red one in the fourth diagram, where you don't have a nice curve: the red cluster has a very small silhouette score compared to the other ones, and the scores also fall off very quickly.
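For reference, the per-sample values behind these plots come from scikit-learn's `silhouette_samples`; here is a minimal sketch on synthetic blob data (the dataset and parameters are placeholders, not the ones used for the figures).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # average over all samples
sample_values = silhouette_samples(X, labels)
for k in np.unique(labels):
    # the sorted per-cluster values are what the silhouette plots display
    print(k, np.sort(sample_values[labels == k])[::-1][:5])
```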
---
.center[
![:scale 60%](images/number_clusters_gaussians.png)
]

???
So how does it look if we have a more complicated dataset? This is generated as a mixture of 10 Gaussians, but that doesn't mean that 10 is a good number of clusters; I think 5 would probably be my best guess. There's not really a right answer here, there are no very clear, distinct clusters.

---
.center[
![:scale 60%](images/more_silhouette_plots.png)
]

???
We can also look at the silhouette plots for this. Looking at these, it's very hard to say what a good clustering is.

Q: If the score can't help us, why are we using it? Because there's no better score. The question is, is this something that we care about? Is this something that's going to happen in practice? If you say yes, this is something you care about and you want to recover these two moons, then never ever use this.

---
# Digits Dataset

.center[
![:scale 80%](images/silhouette_digits.png)
]

???
Here's the same thing for the digits dataset. Of the ones that I tried, 15 clusters is the best. But again, to me the diagrams are not very helpful by themselves. They can definitely show you the density of a cluster, in some sense.

---
# Cluster Stability

- Try multiple random samplings / perturbations
- Measure consistency
- Clustering stability: an overview (U. v. Luxburg) https://arxiv.org/abs/1007.1075

???
There's one other measure that I want to talk about, called cluster stability. Cluster stability is not really a measure of how good a single partition is; it's more a measure of how good a particular parameter setting or a particular algorithm is. The idea behind cluster stability is that if you change your data a little bit and your clustering is good, then the result should stay the same. So you perturb the data a little: the usual method is bootstrap sampling, or you could add some random noise to the data. Ideally, if there are actually good clusters in the data, bootstrap sampling or adding noise shouldn't really influence how the data is partitioned by your clustering. So you're not looking at a single partition or comparing two partitions; instead, you're looking at an algorithm, you're adding noise into the procedure, and you're checking whether the results are stable. If the results are stable, that's good; if the results are not stable, that's not so good.

- For comparing clustering algorithms / parameters (with different numbers of clusters, for example)
- Try multiple random samplings / perturbations
- The configuration that yields the most consistent result among perturbations is best.

---
# Stability Intuition

.center[
![:scale 80%](images/stability_intuition.png)
]

???
Say we want to pick the number of clusters. This figure is from the paper I just cited. Let's say you have these four different clusters, and the dotted line is the clustering the algorithm found. If I run the algorithm asking for two clusters, it might divide the data horizontally or vertically, so it's not going to be very stable. If I pick too large a number of clusters, it will probably find the four clusters, but then it will randomly split up one of them; it will split up a different one every time, so this will also not be stable. But if I ask for four clusters, the four clusters will always be the same, independent of the bootstrap sampling or the noise that was added to the data.
---
.smaller[
```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import adjusted_rand_score
from sklearn.utils import check_random_state


def cluster_stability(X, est, n_iter=20, random_state=None):
    # rng controls both the bootstrap sampling and the estimator seeding
    rng = check_random_state(random_state)
    labels = []
    indices = []
    for i in range(n_iter):
        # draw bootstrap samples, store indices
        sample_indices = rng.randint(0, X.shape[0], X.shape[0])
        indices.append(sample_indices)
        est = clone(est)
        if hasattr(est, "random_state"):
            # randomize estimator if possible
            est.random_state = rng.randint(100000)
        X_bootstrap = X[sample_indices]
        est.fit(X_bootstrap)
        # store clustering outcome using original indices
        relabel = -np.ones(X.shape[0], dtype=int)
        relabel[sample_indices] = est.labels_
        labels.append(relabel)
    scores = []
    for l, i in zip(labels, indices):
        for k, j in zip(labels, indices):
            # we also compute the diagonal which is a bit silly
            in_both = np.intersect1d(i, j)
            scores.append(adjusted_rand_score(l[in_both], k[in_both]))
    return np.mean(scores)
```]

???
You need your dataset and your model, and you repeatedly fit the clustering algorithm on bootstrap samples. I want to store the results in the order of the original dataset, so for each original sample I either store minus one, if it wasn't part of the bootstrap sample, or the cluster label if it was. Then, for each pair of clusterings, I look at the samples that appear in both bootstrap samples, the intersection of the indices, and compute, in this case, the adjusted rand score between the two partitions restricted to those samples. In the end, I compute the mean over all these pairs of partitions. So I created a bunch of partitions from perturbed versions of the data and computed the adjusted rand score between all pairs of them, and then took the mean. The idea is that if they are very consistent, the adjusted rand score between all the different pairs will be high, and so the final number will be high.

FIXME too complex to show

---
.center[![:scale 70%](images/cluster_stability.png)]

- Outcome depends critically on initialization and n_iter.
- This is kmeans++ with n_iter=10

???
Here, I look at KMeans with different numbers of clusters. You can see that the cluster stability is 1 for the three settings in the top row, because the points in each of those clusters are closer to each other than to the other points. But if I use five or more clusters, it becomes less stable. This depends on how I initialize my KMeans: here I used the KMeans++ initialization and restarted it a bunch of times. If I do random initialization and don't do restarts, then everything will be slightly less stable because of the randomness of KMeans.

---
.center[![:scale 80%](images/cluster_stability_2.png)]

???
You can see that two is basically the stable number of clusters. But this is how data looks in the real world, and you never know what the answer should be. That's why I think it's interesting to look at how these measures behave on relatively unstructured data.

---
# Digits Dataset

.center[
![:scale 70%](images/digits_dataset_ncluster_scan.png)
]

???
Here is the digits dataset, looking at the different scores. The best number of clusters is 12 according to ARI, 8 according to stability, and 16 according to the silhouette score. You can see that the silhouette score is pretty noisy, and that stability goes down pretty rapidly if you add too many clusters. ARI is supervised, while the other two methods are unsupervised.

---
# Stability on Digits Dataset

.center[
![:scale 70%](images/cluster_stability_different_algos.png)
]

???
Here, I'm looking at stability for different clustering algorithms. DBSCAN looks a little bit weird, since there's no number of clusters in DBSCAN, so what I actually scanned was the epsilon (a rough sketch of such a scan follows).
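Roughly, such an eps scan might look like the sketch below; the preprocessing and the eps grid here are placeholders, not necessarily what was used for the figure.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

X = scale(load_digits().data)  # placeholder preprocessing; the figure's may differ

for eps in np.linspace(0.1, 2.0, 20):
    labels = DBSCAN(eps=eps).fit_predict(X)
    # -1 marks noise points, so it is not counted as a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print("eps=%.2f: %d clusters, %d noise points" % (eps, n_clusters, n_noise))
```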
And if you set epsilon very high, say 2, then everything will be in the same cluster. If you set it to 1.8, there will be only one cluster and a bunch of noise points, and so on. If I set it to 1.4, you start to get two clusters, and so on. If I set it too small, there will be mostly noise points. So if I set it to 0.029, there's going to be one cluster, and the rest will be labeled as noise. Overall, I probably like KMeans the best here. It has its peak at the same place (10 clusters) as agglomerative clustering, with a reasonable score. DBSCAN doesn't seem to be able to find any stable clusters with any of the settings.

---
# Adult Dataset

.center[
![:scale 70%](images/adult_dataset_ncluster_scan.png)
]

???
This is the adult dataset, which records whether adults make more or less than $50k a year. It's not very nice to cluster, since it has a lot of categorical data. I ran KMeans, and KMeans probably won't do that well on this kind of data. But what I found interesting is that the silhouette score is very noisy and goes up with more clusters, whereas ARI and stability go down. Here, I like the results of stability much more than the results of the silhouette score.

---
# Labeled Faces in the Wild

.center[![:scale 70%](images/labfaces_ncluster_scan.png)]

- With PCA, 100 components. Without whitening: no stable clusters!

???
Since this dataset is quite high-dimensional, I'm using PCA with 100 components and whitening. If I don't do whitening, basically no clustering works: if you try to cluster the original input space, there are no stable clusters. This is also something I can find out by using the stability score, for example. Here, ARI and stability behave similarly, while the silhouette score is quite different.

---
class: spacious

# Qualitative Evaluation (aka eyeballing)

- Look at low-dim visualization
- Look at individual points
- Look at cluster centers (if available)

???
Eyeballing the data is the most important thing about clustering. Supervised and unsupervised scores can sometimes be helpful, but what you usually really want out of clustering is exploratory data analysis. If I use my stability score and find that five is the perfect number of clusters for KMeans on a dataset, what does that tell me? Probably not that much. What I'm interested in is: what are the five clusters in my dataset, and do they mean something? The things I usually do are to look at low-dimensional visualizations, look at individual points, and look at cluster centers if I'm doing KMeans. So that's what I want to do here.

---
# Digits

.center[
![:scale 100%](images/digits_eyeball.png)
]

???
I think here I picked the best parameters, according to the graphs I showed you, for DBSCAN, agglomerative clustering, and KMeans. I find this quite informative, because I can see that DBSCAN doesn't find the things that I think should be the clusters: there's a big orange cluster that's all over the place in both the PCA and the t-SNE projection, while the agglomerative and KMeans clusters are more localized in both. Unlike DBSCAN, agglomerative clustering and KMeans seem to be doing something reasonable that I can look into more.
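A comparison figure along these lines can be produced with something like the following sketch; the parameter settings are placeholders rather than the exact ones behind the figure.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import scale

X = scale(load_digits().data)
X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(random_state=0).fit_transform(X)

# placeholder parameters; pick them with the scans above
algorithms = [("DBSCAN", DBSCAN()),
              ("Agglomerative", AgglomerativeClustering(n_clusters=10)),
              ("KMeans", KMeans(n_clusters=10, random_state=0))]

fig, axes = plt.subplots(2, 3, figsize=(12, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})
for (name, algo), ax_pca, ax_tsne in zip(algorithms, axes[0], axes[1]):
    labels = algo.fit_predict(X)
    ax_pca.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=5)
    ax_tsne.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, s=5)
    ax_pca.set_title(name + " (PCA)")
    ax_tsne.set_title(name + " (t-SNE)")
```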
---
# K-Means cluster centers

.center[
![:scale 70%](images/kmeans_cluster_centers_digits.png)
]

.smaller[
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
fig, axes = plt.subplots(2, 5, subplot_kw={'xticks': (), 'yticks': ()},
                         figsize=(10, 5))
km = KMeans(n_clusters=10).fit(digits.data)
for ax, center in zip(axes.ravel(), km.cluster_centers_):
    ax.imshow(center.reshape(8, 8), cmap='gray_r')
fig.suptitle("K-Means cluster centers")
```]

???
One of the things I can do for KMeans is look at the cluster centers. Unfortunately, for the homework this will not be as nice: for the digits, I can just plot the centers and visualize them in a meaningful way, whereas if you have a 100-dimensional feature vector, you usually cannot visualize it meaningfully. But here, where the data points are images, we can look at the cluster centers, and they actually seem quite natural. Apart from the two in the last column, the rest look pretty clear. So this looks like a reasonable clustering, and it seems to have reasonably discovered the digits in the data.

---
# Agglomerative Clusters

.center[
![:scale 90%](images/agglomerative_clusters.png)
]

???
For agglomerative clustering and DBSCAN you can't do that, because there are no cluster centers. What I like to do instead is just pick random points from the clusters. In this example with 10 clusters, each column corresponds to a cluster and the rows are random samples from that cluster. Now I can, again, try to semantically identify these clusters, and most of them look pretty natural to me.

- Each column corresponds to one of 10 clusters, "first" 5 samples are shown.
- Number on top gives cluster size

---
# DBSCAN Clusters

.center[
![:scale 90%](images/dbscan_clusters.png)
]

???
I did the same thing for DBSCAN, showing the number of points in each cluster. Here you can see that this is a very specialized cluster, it only has 27 samples in it. This one is also very specialized, since it has the 7s with the hook on the front. The third column, which has nearly 1000 data points, clearly has all different digits in it, so this is not very satisfactory.

Basically, I looked at three different algorithms with three different parameter settings, looked at stability scores, and looked at 2D visualizations and data points. So now I have a relatively good idea what these three algorithms did. This was all for exploratory data analysis; a sketch of how to pull such per-cluster samples is shown below.
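Here is a rough sketch of how such per-cluster sample grids can be built for algorithms without cluster centers; the clusterer and the number of samples shown per cluster are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits

digits = load_digits()
labels = AgglomerativeClustering(n_clusters=10).fit_predict(digits.data)

rng = np.random.RandomState(0)
n_show = 5
fig, axes = plt.subplots(n_show, 10, figsize=(10, 5),
                         subplot_kw={'xticks': (), 'yticks': ()})
for cluster, col in zip(np.unique(labels), axes.T):
    members = np.where(labels == cluster)[0]
    col[0].set_title(str(len(members)))   # cluster size on top
    picks = rng.choice(members, size=min(n_show, len(members)), replace=False)
    for idx, ax in zip(picks, col):
        ax.imshow(digits.images[idx], cmap='gray_r')
```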
---
# For feature extraction

.smaller[
```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

km = KMeans(n_init=1, init="random")
pipe = make_pipeline(km, LogisticRegression())
param_grid = {'kmeans__n_clusters': [10, 50, 100, 200, 500]}
grid = GridSearchCV(pipe, param_grid, cv=5, verbose=True)
grid.fit(X, y_people)  # X, y_people: faces data and person labels prepared earlier
```]

.center[
![:scale 40%](images/param_kmeans_nclusters.png)
]

???
If you want to do feature extraction, if you want to use clustering in a supervised pipeline, this is much easier. Here's an example of using clustering in a supervised pipeline on the faces dataset, where I'm trying to classify faces into different people. What I'm doing is using KMeans as a feature extractor: I make a pipeline of KMeans and logistic regression, and then I can just grid-search the number of clusters for KMeans, because now I have an actual evaluation criterion. I want an accurate model; I don't care about finding semantic clusters, I only care about finding an accurate model. So if I were actually going to use KMeans as part of a supervised pipeline, I could throw all these metrics and all the introspection out of the window and just measure what I'm interested in. This is often what you want to do: often you want clustering as part of a bigger pipeline, and then you can just measure the whole pipeline. In this case, basically, as soon as I use 200 clusters, everything is good. The stability score told me that I would get the most stable clustering with something like 40 or 50 clusters, but here we don't really care about how stable the clusters are, we just want a good feature space. So if I use 200, 500 or 1000 clusters, it's going to be fine.

---
class: spacious

# So what is n_clusters?

- As preprocessing
  – The one that makes the whole pipeline work best.
  – Larger is often better.
- For exploratory analysis:
  – The one that tells you the most about the data.
- For most datasets, there is no "correct" number of clusters.

???
So the question is: what is the right number of clusters, how do you set the parameters, how do you pick? It really depends a lot on what your ultimate goal is. Of the two goals I talked about, most of it was exploratory data analysis, and there you should do whatever tells you the most about the data; we have these different metrics and different ways to check whether a clustering actually tells us something about the data. For preprocessing, just tune your whole pipeline. Usually more is better: as you can see here, adding more and more clusters gives me more and more features and somewhat better results. And in the end, there's usually no correct number of clusters.

If you want clusters that are interpretable, I think having stable clusters makes a lot of sense. If you want to introspect them, you would want them to be stable. If you run clustering, get the results, go in and do all the manual analysis I just talked about, look at the samples and so on, and then you run clustering again and it gives you completely different results, what would the results even mean? So I think looking for stability is a reasonable thing to do if you want interpretable clusterings.

---
class: center, middle

If you want semantics from clustering, you need manual confirmation.

Clustering might not pick up on what you thought it would.

???
Finally, if you want any semantics from your clustering, you always need to look at the data. Even on the digits, where I found exactly 10 clusters, no one tells me that those are the right 10 clusters. If I want to know whether I've actually found the digits, I need to look at the cluster centers and at the data points, as I did before, and identify what each cluster corresponds to and what it means. That's the only way to actually extract any semantic information.

---
class: middle

# Questions ?