class: center, middle

### W4995 Applied Machine Learning

# Dimensionality Reduction

## PCA, Discriminants, Manifold Learning

04/01/20

Andreas C. Müller

???
Today we're going to talk about dimensionality reduction, mostly about principal component analysis. We're also going to talk about discriminant analysis and manifold learning. These all have slightly different goals, but the main idea is to take a high-dimensional input dataset and reduce its dimensionality for further processing. This can happen either inside a machine learning pipeline or, very often, for visualization.

FIXME PCA components and scaling: xlabel for components 0 and 1
FIXME LDA projection slide with explanation and diagram!
FIXME compare LargeVis, t-SNE, UMAP
FIXME print output in PCA slides: I need to know the dimensionality (29?)
FIXME trim tsne animation (why?)
FIXME write down math for inverse transform? (and for pca transform?)
FIXME show unwhitened vs whitened plot

---
class: center, middle

# Principal Component Analysis

???

---
.center[
![:scale 70%](images/pca-intuition.png)
]

???
Here we have a 2D input space with some points scattered around. The color is there to show you where the data goes in the transformations. PCA finds the directions of maximum variance in the data. So starting with this blob of data, you look at the direction in which it is most elongated. Then you look for the next component that captures the second-most variance and is orthogonal to the first component. Here in two dimensions there is only one possible direction orthogonal to the first component, so that's our second component. In higher dimensions there are obviously infinitely many directions that could be orthogonal, and so you would iterate. What PCA returns is basically an orthogonal basis of the space in which the components are ordered by how much of the variance in the data they cover, so we get a new basis of the original space. We can iterate this as many times as there are dimensions; in this example I can do it twice, after which there are no orthogonal directions left. This new basis corresponds to rotating the input space: I can rotate the input space so that the first component is my x-axis and the second component is my y-axis, and this rotation is the transformation that PCA learns. This doesn't do any dimensionality reduction yet, it is just a rotation of the input space. If I want to use it for dimensionality reduction, I can now start dropping some of these dimensions. They are ordered by decreasing variance, so in this example I could drop the second component and only retain the first. Basically, I dropped the y-axis and projected everything onto the x-axis. That would be dimensionality reduction with PCA, going from two dimensions to one. The idea is that your data lives, or essentially lives, on a lower-dimensional space embedded in a high-dimensional space; here you could think of the data as mostly a Gaussian blob aligned along this direction, with a little bit of noise in the orthogonal direction, so most of the information about the data is along this line. The idea is that this one-dimensional projection now covers as much of the variance in the data as possible. You can also look at PCA as a de-noising algorithm, by taking this reduced representation in the rotated space and rotating it back to the original space.
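To make this concrete, here is a minimal sketch of that rotate / drop / rotate-back procedure in scikit-learn (using a made-up elongated 2D blob, not the data in the figure):

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical elongated 2D blob standing in for the figure's data
rng = np.random.RandomState(42)
X = rng.normal(size=(300, 2)) @ np.array([[2.0, 1.0], [0.0, 0.3]])

pca = PCA(n_components=2).fit(X)
X_rotated = pca.transform(X)                   # rotation only, no reduction

pca_1d = PCA(n_components=1).fit(X)
X_proj = pca_1d.transform(X)                   # keep only the first component
X_denoised = pca_1d.inverse_transform(X_proj)  # rotate back: de-noised version of X
```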
So this is basically transforming the data using PCA (the rotation, dropping one dimension, rotating back), and you end up with this. This is the information that is retained in this projection, basically. And, arguably, you still have a lot of the information from this dataset.

---
# PCA objective(s)

`$$\large\min_{X', \text{rank}(X') = r}\|X-X'\|$$`

.center[
![:scale 45%](images/pca-intuition.png)
]

???
The maximum variance view can be formalized like this. Basically, you look at the variance of a projection: you look over all possible vectors u1 that have norm 1, you look at the variance of projecting the data X along this axis, and the maximum of this is the maximum variance of the data. You can also rewrite this as shown. Once you compute u1, you subtract the projection onto u1 from the data; then you get everything orthogonal to u1. With this dataset where the component along u1 has been subtracted, you can again find the direction of maximum variance, which gives you u2, and so on, iterating this. One thing to keep in mind is that the sign of u1, or of any PCA component, doesn't matter: in these two definitions of the first principal component, replacing u1 with –u1 gives the same maximum. It only depends on the axis spanned by u1, not on its sign. So whenever you look at principal components, the sign is completely meaningless.

Restricted rank reconstruction

---
class:split-40
# PCA objective(s)

.left-column[
`$$\large\max\limits_{u_1 \in R^p, \| u_1 \| = 1} \text{var}(Xu_1)$$`

`$$\large\max\limits_{u_1 \in R^p, \| u_1 \| = 1} u_1^T \text{cov} (X) u_1$$`
]
.smaller.right-column[
.center[
![:scale 90%](images/pca-intuition.png)
]
]

???
A second way to describe PCA is via the reconstruction error: I can look at how well the reconstruction matches the input. If I want a reconstruction X' that is the best possible in the least-squares sense, so if I look at the squared norm of X – X', then this is minimized exactly when X' is the PCA reconstruction. This is written in a slightly different way, because if you solve this minimization problem you don't get the rotation back, you get the reconstructed data back. But the problem is solved exactly by rotating with the PCA directions, dropping components, and rotating back. Maximizing the variance is the same thing as minimizing the reconstruction error, given a fixed number of dimensions.

Find directions of maximum variance. (Find projection (onto one vector) that maximizes the variance observed in the data.)

Subtract projection onto u1, iterate to find more components.

Only well-defined up to sign / direction of arrow!

---
# PCA Computation

- Center X (subtract mean).
- In practice: Also scale to unit variance.
- Compute singular value decomposition:

![:scale 100%](images/pca-computation.png)

???
If I want to apply PCA to my dataset, the first thing I have to do, which is sort of part of the algorithm, is subtract the mean. Any package will subtract the mean for you before it does the SVD. What you should usually also do in practice is scale the data to unit variance. Then the PCA algorithm just computes the singular value decomposition: we decompose the data X into an orthogonal matrix, a diagonal matrix, and another orthogonal matrix. V^T is n_features by n_features, U is n_samples by n_samples, and D is a diagonal matrix containing the singular values. The singular values correspond to the variance of the data along each of these directions.
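As a minimal sketch of this computation, assuming a 2-D NumPy array `X` and skipping the unit-variance scaling, PCA via the SVD might look like this:

```python
import numpy as np

def pca_svd(X, n_components):
    """Toy PCA via the SVD: center, decompose, project (sketch only)."""
    X_centered = X - X.mean(axis=0)                    # center each feature
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                     # rows are the principal directions
    explained_variance = (S ** 2) / (len(X) - 1)       # variance along each direction
    X_transformed = X_centered @ components.T          # rotate / project the data
    return X_transformed, components, explained_variance[:n_components]
```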
Usually, in SVD you sort them by singular values, so the largest singular value corresponds to the first principal component. There's an additional processing step you can do on top of this most basic form of PCA: you can also use the entries of the diagonal to rescale the data in the rotated space so that the variance is the same across each of the new dimensions.

---
class: center
# Whitening

![:scale 100%](images/whitening.png)

???
So I take my original data, I rotate it, and then I scale each of the dimensions so that they have standard deviation one. It's basically the same as doing PCA and then applying StandardScaler. You can see that the data now visually looks like a ball, whereas before it was an elongated Gaussian. In signal processing this is generally called whitening. If you want to use PCA as a feature transformation, to extract features, you might want to do this, because otherwise the magnitude of the first principal component will be much bigger than the magnitude of the other components, since so far you have only rotated the data. The small-variance directions will have only very small variance, because that's how they are defined. The question here is: do you think the principal components are of similar importance, or do you think the original features are of similar importance? If you think the principal components are of similar importance, then you should do this. So if you want to go on and put this into a classifier, it might be helpful. I don't think I can give you a clear rule for when you would want to do this and when not. But it kind of makes sense: if you think of it as doing StandardScaler in a transformed space, that seems like something that might help a classifier.

Same as using PCA without whitening, then doing StandardScaler.

---
class: split-40
# PCA for Visualization

.tiny-code.left-column[
```python
from sklearn.decomposition import PCA
print(cancer.data.shape)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(cancer.data)
print(X_pca.shape)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target)
plt.xlabel("first principal component")
plt.ylabel("second principal component")

components = pca.components_
plt.imshow(components.T)
plt.yticks(range(len(cancer.feature_names)), cancer.feature_names)
plt.colorbar()
```
![:scale 80%](images/pca-for-visualization-cancer-data.png)
]
.right-column[
![:scale 90%](images/pca-for-visualization-components-color-bar.png)]

???
I want to show you an example on a real dataset. The first dataset I'm going to use is the breast cancer dataset. The way I use PCA most often is probably for visualization, and I often do PCA with two components, because my monitor has two dimensions. I fit-transform the data and scatter it, which gives me the scatter plot of the first principal component versus the second principal component. This is a binary classification problem; PCA is obviously a completely unsupervised method, so it doesn't know the class labels, but I can still look at whether this seems like a hard or an easy problem to solve. Looking at this plot, it looks like a relatively easy problem, with the points being fairly well separated. Now I look at the principal component vectors. There are two principal components, and each of them has as many dimensions as there are features. This is shown here as a heat map, and you can see that there are basically two features that dominate the principal components, which are mean area and worst area.
And all the rest of the entries are approximately zero. So actually, I didn't rotate my data at all, I just looked at two features. That's because I didn't scale my data: if I don't scale my data and a particular feature has a much larger magnitude, that feature will basically be the first principal component. If I multiply a feature by 10,000, this feature will have the largest variance of everything, and there will be no direction with a larger variance than that.

---
class:spacious
# Scaling!

.tiny-code[
```python
pca_scaled = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca_scaled = pca_scaled.fit_transform(cancer.data)
plt.scatter(X_pca_scaled[:, 0], X_pca_scaled[:, 1], c=cancer.target, alpha=.9)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
```
]
.center[![:scale 45%](images/scaled-pca-for-visualization-cancer-data.png)]

???
So now I scale the data and then compute two components on the cancer dataset. Now the data actually looks even better separated. Even though this is an unsupervised method, it very clearly shows me the separation of the classes. I can also see things like the yellow class looking much denser than the purple class, and there may be some outliers here, but this definitely gives me a good idea of the dataset. This is much more compact than looking at a 30x30 scatter matrix, which I couldn't comprehend. So we can look at the components again.

Imagine one feature with very large scale. Without scaling, it's guaranteed to be the first principal component!

---
class:split-40
# Inspecting components

.tiny-code[
```python
components = pca_scaled.named_steps['pca'].components_
plt.imshow(components.T)
plt.yticks(range(len(cancer.feature_names)), cancer.feature_names)
plt.colorbar()
```
]
.smaller.left-column[
![:scale 70%](images/inspecting-pca-scaled-components.png)
]
.right-column[
![:scale 100%](images/inspecting-pca-scaled-components-2.png)]

???
The components are stored in the `components_` attribute of the PCA object, and you can see that now all of the features contribute to the principal components. If you're an expert in the area and know what these features mean, it's sometimes possible to read something into this; it's often helpful to look at these component vectors. Another way to visualize the component vectors is a scatter plot, though that basically only works for the first two principal components. Each dot here is one of the original features, and you can look at how much this feature contributes to principal component one versus principal component two. I can see that the second principal component is dominated by mean fractal dimension, and the first principal component is dominated by worst perimeter. Worst perimeter contributes negatively to the second principal component. Worst concave points is also very strong in the first principal component but is about zero in the second principal component. This was the first approach: using PCA for two-dimensional visualization and trying to understand what the components mean.

sign of component is meaningless!
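A rough sketch of how such a feature scatter plot over the first two components could be made (assuming the fitted `pca_scaled` pipeline and the `cancer` data from the slides above):

```python
import matplotlib.pyplot as plt

components = pca_scaled.named_steps['pca'].components_
# one dot per original feature: its weight in PC1 (x) vs. its weight in PC2 (y)
plt.scatter(components[0], components[1])
for i, name in enumerate(cancer.feature_names):
    plt.annotate(name, (components[0, i], components[1, i]), fontsize=6)
plt.xlabel("weight in first principal component")
plt.ylabel("weight in second principal component")
```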
---
# PCA for regularization

.tiny-code[
```python
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    stratify=cancer.target, random_state=0)
lr = LogisticRegression(C=10000).fit(X_train, y_train)
print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))
```
```
0.993
0.944
```
```python
pca_lr = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(C=10000))
pca_lr.fit(X_train, y_train)
print(pca_lr.score(X_train, y_train))
print(pca_lr.score(X_test, y_test))
```
```
0.961
0.923
```
]

???
One thing PCA is often used for, in particular by statistics-oriented people, is regularizing a model. You can use PCA to reduce the dimensionality of your dataset and then fit a model on the reduced dataset. Since it has fewer features, the model will have less complexity, less power, so you can avoid overfitting by reducing the dimensionality. I've given an example here on the same dataset. I'm using logistic regression, and I basically turn off regularization: instead of using the L2 regularization of logistic regression, I try to reduce the input space. You can see that if I do that, I basically overfit the data: on the training set I have 99.3% accuracy, and on the test set I have 94.4% accuracy. Now if I use PCA with two components, I reduce this down to just two features. That reduces overfitting a lot, so I get a lower training score, but I also get a lower test score. Generally, if you find that you only need two dimensions to solve a problem, it's a very simple problem. But if you're shooting for accuracy, clearly you've regularized too much here; you've restricted the model too much by using just two components. One way to figure out a good number of components is obviously a grid search with cross-validation. Another way is to look at the variance that is covered by each of the components, which is basically given by the singular values.

---
# Variance covered

.center[
![:scale 55%](images/variance-covered.png)
]
.tiny-code[
```python
pca_lr = make_pipeline(StandardScaler(), PCA(n_components=6), LogisticRegression(C=10000))
pca_lr.fit(X_train, y_train)
print(pca_lr.score(X_train, y_train))
print(pca_lr.score(X_test, y_test))
```
```
0.981
0.958
```
]

???
So this is based on the singular values: here is the explained variance ratio, basically how much of the variance in the data is explained by each of the components. The components are sorted by decreasing variance, so obviously this always goes down. But now I can look for kinks in this variance-covered curve and see how many components it makes sense to keep for this dataset, if I don't want to do a grid search. Looking at the top plot, it looks like six components are enough. And if I actually use six components, I overfit slightly less on the training set and get a better test result than I did without PCA. People from statistics often like this better than regularization, since it doesn't bias the coefficients. I like it slightly less, because PCA is completely unsupervised and might discard important information. But it's definitely a valid approach to regularizing a linear model. If you're using a linear model, what's nice is that you can still interpret the coefficients afterward: the linear model is just the dot product with the weights, and PCA is just a rotation.
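(As a brief aside, the variance-covered curve above can be produced with a sketch like the following, assuming the `X_train` split and the imports used earlier on this slide.)

```python
import numpy as np

# sketch: plot per-component and cumulative explained variance to pick n_components
pca_all = make_pipeline(StandardScaler(), PCA()).fit(X_train)
evr = pca_all.named_steps['pca'].explained_variance_ratio_
plt.plot(evr, 'o-', label="per component")
plt.plot(np.cumsum(evr), 'o-', label="cumulative")
plt.xlabel("component index")
plt.ylabel("explained variance ratio")
plt.legend()
```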
And so I can take the coefficients and rotate them back into the original space.

Could obviously also do cross-validation + grid-search

---
class:split-40
# Interpreting coefficients

.tiny-code[
```python
pca = pca_lr.named_steps['pca']
lr = pca_lr.named_steps['logisticregression']
coef_pca = pca.inverse_transform(lr.coef_)
```
]

.center[Comparing PCA + Logreg vs plain Logreg:]
.left-column[ ![:scale 100%](images/PCA+logreg.png) ] .right-column[ ![:scale 100%](images/logreg+noPCA.png) ] ??? So that's what I'm doing here. I use the logistic regression coefficients in the PCA space, I think this was six components. And so then I can scatter them and look at them, what do these components mean, in terms of the original features. Here, I’m looking at with no PCA, which is just the logistic regression on the original data set, versus doing PCA, doing logistic regression on the six-dimensional data set, and then rotating the six coefficients back into the input space. Even though I did this transformation of the data, I can still see how this model in the reduced feature space uses these features. The left plot shows plain logistic regression coefficients and the back-rotated coefficients with PCA. The plot on the right shows for each feature the coefficient in the standard model and in the model including PCA. And you can see that they're sort of related. They're not going to be perfectly related because I reduced the dimensionality of this space and also the model with PCA did better in this case. So clearly they’re different. Rotating coefficients back into input space. Makes sense because model is linear! Otherwise more tricky. --- class:split-40 # PCA is Unsupervised! .left-column[ ![:scale 100%](images/pca-is-unsupervised-1.png) ] -- .right-column[ ![:scale 100%](images/pca-is-unsupervised-2.png) ] ??? Look at this dataset. It's a very fabricated data set. This is a 3d classification problem in two dimensions. This looks very unstructured. So if I do a PCA, I get this. And so if I now look at the first two principal components, the problem is actually completely random, because that's how I synthetically constructed it. And if I look at the third principal component, the third component perfectly classifies the dataset. Even though I constructed to be like that, there's nothing that prevents the data set in reality not to look like that. Just because the data is very small, or has small variance in a particular direction doesn't mean this direction is not important for classification. So basically, if you’re dropping components with small variance, basically what you're doing is you're doing feature selection in the PCA space based on the variance of the principal components. And so that's not necessarily a good thing. Because small variance in principal component doesn't necessarily mean uninformative. Dropping the first two principal components will result in random model! All information is in the smallest principal component! --- # PCA for feature extraction .center[ ![:scale 85%](images/pca-for-feature-extraction.png) ] ??? Next, I want to give an example of where PCA has been used, at least historically, for faces. This is a data set collected in the 90s. The idea is you have a face detector, given the face, you want to identify who this person is. Working on the original pixel space kind of seems like a pretty bad idea. The images here have about 6000 pixels, they're relatively low resolution but still has like 5000 features, just looking at the pixel values, which means very high dimensional. You can do a PCA and this is sort of the components that you get from this. You can see here, for example, here, from which direction does light come from, how much eyebrows do you have and do you have a mustache. These are the PCA components constructed on this data set. 
The idea is that these PCA components can be better features for classification than the original pixels. This is also an application where whitening helps: rescaling the components so that they all have the same variance actually improves classification. A different way to think about these components, instead of thinking about them as a rotation, is to think, equivalently, of each point as a linear combination of all the components, because the components form a basis of the input space. You can think of each picture, for example Winona Ryder here, as a linear combination of all the components, and the coefficients of this linear combination are the projection into the PCA space; they are the coordinates in the PCA basis after the rotation. To me, this makes it more interpretable what these components might mean. So if there is light from the person's right, I would expect x1 to be high, because then this component becomes active. If the light comes from a different direction, I would expect x1 to be negative. And if there is no left-right gradient, I would expect this coefficient to be about zero.

---
# 1-NN and Eigenfaces

.tiny-code[
```python
from sklearn.neighbors import KNeighborsClassifier
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)
print(X_train.shape)

# build a KNeighborsClassifier using one neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
```
```
(1547, 5655)
0.23
```
]

--

.tiny-code[
```python
pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
X_train_pca.shape
```
```
(1547, 100)
```
```python
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_pca, y_train)
knn.score(X_test_pca, y_test)
```
```
0.31
```
]

???
Alright, so now let's look at what it does. On this dataset I'm using a one-nearest-neighbor classifier. The reason is that 1NN is often used when people want to demonstrate how good a representation of the data is: it just looks at your nearest neighbor, so it relies entirely on the representation. If the 1NN classifier does well, your representation of the data is good; that's the intuition. If I used a penalized linear model, PCA would probably not make a lot of difference. Here, with the number of neighbors equal to one, you get about 23% accuracy. You can also see that the dataset is very wide: there are a lot more features than samples. If there are more features than samples, PCA can only produce as many components as the minimum of the number of features and the number of samples. So in this case I can project onto at most 1547 directions, because that is the rank of the matrix and I can only get that many eigenvectors. If I do the same thing with PCA, I go down from 5655 dimensions to only 100 dimensions, and the 1NN classifier actually does quite a bit better: now it's 31% accurate. You could do a grid search to find the number of components, or you could look at the explained variance ratio.

---
# Reconstruction

.center[
![:scale 70%](images/reconstruction.png)
]

???
Another way to select how many components you want, instead of looking at the explained variance ratio, is to look at the reconstructions.
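A sketch of how such reconstructions for different numbers of components could be computed (assuming the `X_train` / `X_test` faces split above, with images of the usual 87x65 LFW crop size):

```python
# sketch: project faces onto a reduced basis and rotate back to pixel space
for n_components in [10, 50, 100, 500]:
    pca = PCA(n_components=n_components, whiten=True, random_state=0).fit(X_train)
    X_test_reconstructed = pca.inverse_transform(pca.transform(X_test))
    # each row is a flattened image; reshape (assumed 87x65 here) to display it
    plt.imshow(X_test_reconstructed[0].reshape(87, 65), cmap='gray')
    plt.title(f"{n_components} components")
    plt.show()
```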
That is most useful if you can look at the data points in some meaningful way. So for example, here, the bottom right image in my illustration is what happens if I rotate to 10 dimensions and then I rotate back to the original space. For example, you can see here if I use only 10 components, these people are not recognizable at all. So I wouldn't expect 10 components to make to make a lot of sense. On the other hand, if I look at 500 components, even though there’s high-frequency noise the person is recognizable. Using 500, I'm probably not going to lose anything here. This is also one way to see how many components you might need for a particular result. And so if I just use two components, you would see probably nothing. And yet, if I used as many components as there are samples in this case, if I used like 1500 components, it would be completely perfect, because I would basically rotate in space, not drop any dimensions and rotate back. Another thing that's interesting when you’re doing reconstructions is there’s a different application for PCA called outlier detection. You can think of PCA as a density model in the sense, you're fitting a big Gaussian and then you dropping directions of small variance. And so you can look at what are the faces that are reconstructed well versus those that are not reconstructed well. And so if they're not reconstructed as well, that means they are quite different from most of the data. --- class:split-40 # PCA for outlier detection .tiny-code[```python pca = PCA(n_components=100).fit(X_train) reconstruction_errors = np.sum((X_test - pca.inverse_transform(pca.transform(X_test))) ** 2, axis=1) ``` ] .left-column[ ![:scale 90%](images/best-reconstructions.png)
Best reconstructions ] .right-column[ ![:scale 90%](images/worst-reconstructions.png)
Worst reconstructions ] ??? So here, I used 100 components. And now I can look at what's the reconstruction error of rotating the data, dropping some components and rotating back, and the ones with good reconstructions are the ones that look like most of the data. And the ones that have bad reconstructions are the ones that are sort of outliers. If something looks a lot like the data, that's probably something the principal component will cover. Whereas if something looks very different than the data, that's something that principal component will not cover. If you look at these images, what is an outlier and what is not an outlier has nothing to do with the person in it but has only to do with the cropping of the image. So these images are all not very well aligned, they have like tilted head, wearing glasses, cropped too wide, cropped not wide enough, wearing caps and so on. Since they look very different from most of the images, and that's why they're outliers. But you can use this in general, as a density model, basically, looking at how well can the data point be reconstructed with PCA. There are more uses of PCA. These are like a couple of different applications of PCA that all have sort of quite different goals and quite different ways of setting dimensionality. --- class: middle # Discriminant Analysis ??? So the next thing I want to talk about is discriminate analysis. In particular, linear discriminate analysis. This is also a classifier, but it also can do dimensionality reduction. --- class: spacious # Linear Discriminant Analysis aka Fisher Discriminant `$$ P(y=k | X) = \frac{P(X | y=k) P(y=k)}{P(X)} = \frac{P(X | y=k) P(y = k)}{ \sum_{l} P(X | y=l) \cdot P(y=l)}$$` ??? Linear Discriminant Analysis is also called LDA, but never call it LDA because LDA also means Latent Dirichlet Allocation and so it might be confusing. Linear Discriminant Analysis is a linear classifier that builds on Bayes rule, in that, it assumes the density model for each of the classes, if you have a density model for each of the classes, you can build a classifier using Bayes rule. This is Bayes rule of having a joint distribution of X and Y. If you then look at the conditional probability of X, you can say that the probability of Y being a particular class given X is the probability of X given Y is class, times probability of being in this class divided by P(X). And so you can expand this now here, the P(x) as a sum over all classes. And so the only thing you really need is a probability distribution of P(X) given a particular class. So if you fix any possibilities distribution for particular class Bayes rule gives you like a classifier. Linear Discriminant Analysis does basically one particular distribution, which is the Gaussian distribution. - Generative model: assumes each class has Gaussian distribution - Covariances are the same for all classes. - Very fast: only compute means and invert covariance matrix (works well if n_features << n_samples) - Leads to linear decision boundary. - Imagine: transform space by covariance matrix, then nearest centroid. - No parameters to tune! - Don’t confuse with Latent Dirichlet Allocation (LDA) --- class: spacious # Linear Discriminant Analysis aka Fisher Discriminant `$$ P(y=k | X) = \frac{P(X | y=k) P(y=k)}{P(X)} = \frac{P(X | y=k) P(y = k)}{ \sum_{l} P(X | y=l) \cdot P(y=l)}$$` `$$ p(X | y=k) = \frac{1}{(2\pi)^n |\Sigma|^{1/2}}\exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma^{-1} (X-\mu_k)\right) $$` ??? 
So what we're doing is basically, we're fitting a Gaussian to each of the classes, but we're sharing the covariance matrix. So for each class, we estimate the mean of the class, and then we estimate the joint covariance matrix, so we subtract the mean of the class from all the data points, and then we compute the covariance matrix of that. So now we have fitted Gaussian into each class, they all have the same covariance matrix and then we can apply this rule. And this gives us a classifier. --- class: spacious # Linear Discriminant Analysis aka Fisher Discriminant `$$ P(y=k | X) = \frac{P(X | y=k) P(y=k)}{P(X)} = \frac{P(X | y=k) P(y = k)}{ \sum_{l} P(X | y=l) \cdot P(y=l)}$$` `$$ p(X | y=k) = \frac{1}{(2\pi)^n |\Sigma|^{1/2}}\exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma^{-1} (X-\mu_k)\right) $$` `$$ \log\left(\frac{P(y=k|X)}{P(y=l | X)}\right) = 0 \Leftrightarrow (\mu_k-\mu_l)\Sigma^{-1} X = \frac{1}{2} (\mu_k^t \Sigma^{-1} \mu_k - \mu_l^t \Sigma^{-1} \mu_l) $$` ??? You can actually write this in a slightly different way. The decision boundary between the two classes is the same as this formula on the right. What this is basically, the single data point lies on the decision boundary, if inverse square matrix times x is some constant. This is actually just a nearest centroid classifier in the space that was rotated by this joint covariance matrix or rotated and stretched by the joint covariance matrix. So basically, I compute the joint covariance matrix and then I look at what's the closest mean and I adjust it based on the prior probabilities of the class. --- class: spacious # Quadratic Discriminant Analysis LDA: `$$ p(X | y=k) = \frac{1}{(2\pi)^n |\Sigma|^{1/2}}\exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma^{-1} (X-\mu_k)\right) $$` QDA: `$$ p(X | y=k) = \frac{1}{(2\pi)^n |\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma_k^{-1} (X-\mu_k)\right) $$` ??? - Each class is Gaussian, but separate covariance matrices! - More flexible (quadratic decision boundary), but less robust: have less points per covariance matrix. - Can’t think of it as transformation of the space. ??? Quadratic Discriminate Analysis is exactly the same thing, only that you have a separate covariance matrix for each of the classes. This makes a more flexible classifier, but also, you have fewer points to estimate each of the various matrices. So you have many more parameters, which makes it a little bit harder to fit. --- class:center ![:scale 55%](images/linear-vs-quadratic-discriminant-analysis.png) ??? If your data points are actually Gaussian, then obviously, this works well because it assumes each class is a Gaussian. If they have the same covariance matrix, Linear Discriminant Analysis obviously finds the perfect boundary. If they have different covariance matrix, then Quadratic Discriminant Analysis does better. What does this all have to do with dimensionality reduction? So we can use it as a classifier. Linear Discriminant Analysis is just a linear classifier with a particular probabilistic model. If I want a linear classifier, I’ll probably use a logistic regression instead. But what’s nice about this is the joint covariance function. --- # Discriminants and PCA - Both fit Gaussian model - PCA for the whole data - LDA multiple Gaussians with shared covariance - Can use LDA to transform space! - At most as many components as there are classes - 1 (needs between class variance) ??? Linear Discriminant Analysis compared to PCA, they both fit a Gaussian to the data. 
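The slide's point that LDA can be used to transform the space looks like this in scikit-learn (a sketch, assuming some multi-class training data `X_train`, `y_train` with at least three classes):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import matplotlib.pyplot as plt

# supervised projection: at most n_classes - 1 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_train, y_train)   # unlike PCA, this uses the labels
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y_train)
```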
So PCA basically fits a Gaussian to the whole dataset, while Linear Discriminant Analysis fits multiple Gaussians with a shared covariance structure. That means we can use this covariance structure to rotate or stretch our space using Linear Discriminant Analysis. Because of the way it's defined, we can get at most as many components as there are classes minus one, because you basically look at the vectors that connect the means of the classes in the rotated space. So you rotate the space, and then you look at the space spanned by the means.

---
class:center
# PCA vs Linear Discriminants

![:scale 100%](images/pca-lda.png)

???
Here's an example of Linear Discriminant Analysis versus PCA on a synthetic dataset where I have three classes that all share a covariance matrix; obviously that's the case Linear Discriminant Analysis was designed for. If I do PCA, it can't really do a lot here, it more or less just flips the axes, because jointly all three classes look like one Gaussian, and PCA is completely unsupervised, so it can't distinguish between them. If I use Linear Discriminant Analysis, it gives me this: the first discriminant direction is the one that best separates the class means. So it extracts the interesting dimensions that help me classify.

---
class:center
# Data where PCA failed

![:scale 100%](images/pca-fail.png)

???
Now let's look at the dataset that I created so that PCA would fail. This was the original dataset, where all the information was in the third principal component. If I do Linear Discriminant Analysis, I get only one component, and the one component found by Linear Discriminant Analysis is exactly the one that perfectly separates the data. So this is a supervised transformation of the data that tries to find the features that best separate the classes. I find this very helpful for visualizing classification datasets in a way that's nearly as simple as PCA but can take the class information into account.

---
# Summary

- PCA good for visualization, exploring correlations
- PCA can sometimes help with classification as regularization or for feature extraction.
- LDA is a supervised alternative to PCA.

???
PCA I really like for visualization and for exploring correlations; you can also use it together with a classifier for regularization if you have lots of features, though I often find that regularizing the classifier works better. t-SNE probably makes the nicest pictures of the bunch. And if you want a linear transformation of your data in a supervised setting, Linear Discriminant Analysis is a good alternative to PCA. LDA also yields a rotation of the input space.

---
class: center, middle
# Manifold Learning

???
Manifold learning is basically a class of algorithms that compute much more complicated transformations of the data. A manifold is a geometric structure embedded in a higher-dimensional space: the idea is that you have a low-dimensional space that is sort of embedded in a high-dimensional space. The classical example is the Swiss roll.

---
class: center, middle

![:scale 100%](images/manifold-learning-structure.png)

???
Here, this is a two-dimensional dataset that was embedded in a weird way in a three-dimensional space, and you hope to recover the structure of the data. The structure of the data here would be the 2D layout of the unrolled Swiss roll. PCA can't do that, because PCA is only linear.
So if you projected this down to two dimensions, it's going to smoosh all of these together, and the blue points will end up very close to the red points if you project along this axis, for example. That's not really what the geometry of the data is like. For this, you need nonlinear transformations of the input space that try to capture the structure in the data.

Learn underlying “manifold” structure, use for dimensionality reduction.

---
# Pros and Cons

- For visualization only
- Axes don’t correspond to anything in the input space.
- Often can’t transform new data.
- Pretty pictures!

???
Manifold learning is for visualization purposes only. One of the problems is that the axes mostly don't correspond to anything, and often you can't transform new data. But you can get pretty pictures for your publications or to show your boss. These methods are mostly used for exploratory data analysis and less in a machine learning pipeline, but they're still quite useful.

---
# Algorithms in sklearn

- KernelPCA – does PCA, but with kernels!
Eigenvalues of kernel-matrix - Spectral embedding (Laplacian Eigenmaps)
Uses eigenvalues of graph laplacian - Locally Linear Embedding - Isomap “kernel PCA on manifold” - t-SNE (t-distributed stochastic neighbor embedding) ??? So some of the most commonly used algorithms which are in scikit-learn are… KernelPCA, which is just PCA but with kernels. So you set up the covariance matrix, you look at the kernel matrix. Spectral embedding, which looks at the eigenvalues of the graph Laplacian. It's somewhat similar to kernel methods but you compute the eigenvectors off like a similarity matrix. There’s locally linear embedding. There's Isomap. Finally, t-SNE. t-SNE stands for t-distribution stochastic neighbor embedding, this is sort of the one that maybe has the least strong theory behind it. But they're all kind of heuristics and a little bit of hacky. t-SNE is something that people found quite useful in practice for inspecting datasets. --- # t-SNE `$$p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)}$$` -- `$$p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2N}$$` -- `$$q_{ij} = \frac{(1 + \lVert \mathbf{y}_i - \mathbf{y}_j\rVert^2)^{-1}}{\sum_{k \neq i} (1 + \lVert \mathbf{y}_i - \mathbf{y}_k\rVert^2)^{-1}}$$` -- `$$KL(P||Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$` ??? The main idea of t-SNE is that you define a probability distribution over your data points, given one data point, you want to say, “If I'm going to pick one of my neighbors at random, and I'm going to do this weighted by distance, how likely is it I'm going to pick neighbor I?” This is just a Gaussian distribution with a particular sigma that's dependent on the data point. So around each data point, you look at the density of their neighbors and you make it symmetric. Assuming, we have embedded our data into some Yi and now it can look at the distance between two of these Ys but we're not using Gaussian, we’re using t-distributions. So what you're doing then is, you initialize the Yi’s randomly, and then you do gradient descent to basically make these two distributions similar. This is quite different from other embedding methods. Here, we're not learning transformation at all, we’re just assigning these Yi’s randomly in our target space, and then move them around so that the distances are similar to the distances in the original space. What this means is basically things that were similar in your original space should be similar in the new space as well. The reasons that they use a student-t here instead of a Gaussian is because they didn't want to focus too much on things that are too far away. You only want to focus on the points that are pretty close to you. This is used for embedding into two and three-dimensional spaces. So 99% of the time you're using this with the Yi being in two-dimensional space and the Xi being the original data points. The sigma is adjusted locally depending on how dense the data is in a particular region. But there's the overall parameter which is called the perplexity which basically regulates how the local sigmas are set. So perplexity is some transformation of the entropy of the distribution. So you set the Sigma such that the entropy of the distribution is the parameter you're setting. If you set perplexity parametric low, you going to look at only the close neighbors. If you set it large, you're going to look at a lot of neighbors. This will indirectly influence the sigma. 
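To make the p_{j|i} formula above concrete, here is a tiny NumPy sketch using one fixed sigma for all points (the actual algorithm calibrates each sigma_i from the perplexity):

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """p_{j|i} with a single fixed sigma (sketch; t-SNE tunes sigma_i per point)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    affinities = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)                           # exclude k = i from the sum
    return affinities / affinities.sum(axis=1, keepdims=True)   # normalize each row i
```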
- Starts with a random embedding
- Iteratively updates points to make “close” points close.
- Global distances are less important, neighborhood counts.
- Good for getting coarse view of topology.
- Can be good for finding interesting data points.
- t distribution heavy-tailed so no overcrowding.
- (low perplexity: only close neighbors)

---
class:center

![:scale 63%](images/tsne-embeddings-digits.png)

???
(Read the next slide and come back to this.) So, one way it can help me understand the data: this is actually with a different random seed, so it looks slightly different. Here I show some of the images so you can get an idea of what's happening. You can see that if a one is more angled and has a bar at the bottom, it's grouped over here, and if the one is straighter and doesn't have a bar, it's grouped over here. The nines actually got split up into three groups. These are pretty natural clusters: it found out that there are different ways to write nine, with a bar at the bottom and without. Writing the code for this is a little bit of a hassle, but I often find it quite helpful to look at these t-SNE embeddings and at the outliers; it helps me understand my dataset. One of the downsides is that I didn't write anything on the axes, because there's nothing on the axes: these are randomly initialized points that have been moved around to have similar distances. The result is completely rotation invariant, so if I rotated this picture or flipped it, it would still mean the same thing. There's no possible interpretation of the axes.

---
class:split-40

.smaller[
```python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data / 16.
X_tsne = TSNE().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X)
```
]
.left-column[![:scale 95%](images/pca-digits.png)]

--

.right-column[![:scale 95%](images/tsne-digits.png)]

???
Now I want to show you some of the outcomes. This is a PCA visualization of the digits dataset; both PCA and t-SNE are completely unsupervised. This is just PCA with two components, first and second principal components, colored by the class labels. PCA obviously doesn't know anything about the class labels; I just put them in so we can see what's happening. You can see that some of these classes are pretty well separated, while others are kind of smooshed together. This is t-SNE. When I first saw this I thought it was pretty impressive, because the algorithm is completely unsupervised, but it still manages to group nearly all of these classes into very compact clusters. With this completely unsupervised method, starting from the original 64-dimensional space, I could sort of draw boundaries by hand in 2D and group the data. This 2D view can help me understand the data.

---
class: center, compact
# Showing algorithm progress

![:scale 50%](images/digits_tsne_animation.gif)

.smallest[https://github.com/oreillymedia/t-SNE-tutorial]

---
class:split-40
# Tuning t-SNE perplexity

.left-column[
![:scale 70%](images/tsne-tuning-2.png)
![:scale 70%](images/tsne-tuning-5.png)]
.right-column[
![:scale 70%](images/tsne-tuning-30.png)
![:scale 70%](images/tsne-tuning-300.png)]

???
The one parameter I mentioned is perplexity. I want to give a couple of illustrations of the influence of this perplexity parameter. This is again the digits dataset. The results will be slightly different for each run because of the randomness.
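A sketch of how such a perplexity sweep could be run (assuming the digits `X` from the code above):

```python
# sketch: compare t-SNE embeddings for several perplexity values
for perplexity in [2, 5, 30, 300]:
    X_embedded = TSNE(perplexity=perplexity, random_state=0).fit_transform(X)
    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=digits.target, s=5)
    plt.title(f"perplexity = {perplexity}")
    plt.show()
```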
The t-SNE authors suggest a perplexity of 30 and say it always works well. Here it's set to 2, 5, 30 and 300, and you can see that 30 actually does produce the most compact clusters. What gives you the best result depends on the dataset. Generally, for small datasets people have found that lower perplexity works better, while for larger datasets a larger perplexity works better. Again, there's a tradeoff between looking only at very close neighbors and looking at neighbors further away.

- Important parameter: perplexity
- Intuitively: bandwidth of neighbors to consider
- (low perplexity: only close neighbors)
- smaller datasets: try lower perplexity
- authors say 30 always works well.

---
class:spacious

![:scale 25%](images/tsne-moons.png)
![:scale 55%](images/tsne-perplexity.png)

???
Here's the same for a very small two-dimensional dataset. I wanted to see how well t-SNE can separate these out. Again, perplexity equal to 30 probably looks best to me.

---
class: center, middle, spacious
# Play around online

http://distill.pub/2016/misread-tsne/

???
There's a really cool article at this link. (The question was how to reconcile categorical data with this.) Indeed, everything here assumes continuous data. You can apply it after one-hot encoding, but sometimes the results will look strange; you could probably come up with something that more specifically uses discrete distributions. All the things I talked about are really for continuous data. Is t-SNE more computationally expensive than PCA? Yes, by a lot. PCA computes an SVD, which is a very standard linear algebra operation and very fast, and a lot of people have spent the last 50 years optimizing it. Still, there are a bunch of tricks to speed up t-SNE.

---
# UMAP
https://github.com/lmcinnes/umap

.smaller[
- Construct graph from simplices, use graph layout algorithm
- Much faster than t-SNE: MNIST takes 2 minutes vs 45 minutes for t-SNE]

![:scale 45%](images/umap_mnist.png)
![:scale 50%](images/umap_fashion_mnist.png)

---
# LargeVis
https://github.com/elbamos/largeVis

.smaller[
- Use nearest neighbor graph constructed very efficiently (and approximately)
- Much faster than t-SNE
]

![:scale 47%](images/largevis_mnist.png)
![:scale 47%](images/largevis_wikidoc.png)

---
class: center, middle
# Questions ?