Advanced Neural Networks

class: center, middle

### W4995 Applied Machine Learning

# Advanced Neural Networks

04/27/20

Andreas C. Müller

???
FIXME add model inspection?
FIXME add attention
FIXME VGG images low res
FIXME update with state-of-the-art instead of VGG16
FIXME my resnet is broken
FIXME add finetuning with keras example
FIXME dropout: make clear different for every sample in batch
FIXME residual connections: plot the graph for the network I'm running
---
# MNIST and Permuted MNIST
![:scale 90%](images/mnist_org.png)
.smallest[

```python
rng = np.random.RandomState(42)
perm = rng.permutation(784)
X_train_perm = X_train.reshape(-1, 784)[:, perm].reshape(-1, 28, 28)
X_test_perm = X_test.reshape(-1, 784)[:, perm].reshape(-1, 28, 28)
```

]
![:scale 90%](images/mnist_permuted.png)
---
# Fully Connected vs Convolutional
.tiny[
.left-column[
```python
model = Sequential([
    Dense(512, input_shape=(784,), activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile("adam", "categorical_crossentropy",
              metrics=['accuracy'])
```
```
_____________________________________________________
Layer (type)          Output Shape         Param #
=====================================================
dense_7 (Dense)       (None, 512)           401920
_____________________________________________________
dense_8 (Dense)       (None, 10)            5130
=====================================================
Total params: 407,050
Trainable params: 407,050
_____________________________________________________
```
]
]
--

.tiny[
.right-column[
```python
num_classes = 10
cnn = Sequential()
cnn.add(Conv2D(32, kernel_size=(3, 3),
               activation='relu',
               input_shape=input_shape))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Conv2D(32, (3, 3), activation='relu'))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Flatten())
cnn.add(Dense(64, activation='relu'))
cnn.add(Dense(num_classes, activation='softmax'))
```
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 32)        9248
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 32)          0
_________________________________________________________________
flatten_1 (Flatten)          (None, 800)               0
_________________________________________________________________
dense_9 (Dense)              (None, 64)                51264
_________________________________________________________________
dense_10 (Dense)             (None, 10)                650
=================================================================
Total params: 61,482
Trainable params: 61,482
_________________________________________________________________
```
]

]
---
# Training curves

.left-column[
## Original Data

![:scale 90%](images/mnist_org_curve.png)

]
--
.right-column[
## Shuffled Data

![:scale 90%](images/mnist_perm_curve.png)
]

Any algorithm that doesn't take 2d layout into account is invariant to shuffling of pixels!
(including anything in sklearn!)
---
![:scale 35%](images/carpet_snake.png)
--
<br />
![:scale 22%](preview/snek_0_1271.jpeg)
![:scale 22%](preview/snek_0_3411.jpeg)
![:scale 22%](preview/snek_0_4863.jpeg)
![:scale 22%](preview/snek_0_5876.jpeg)
![:scale 22%](preview/snek_0_9549.jpeg)
![:scale 22%](preview/snek_0_4484.jpeg)
![:scale 22%](preview/snek_0_6377.jpeg)
![:scale 22%](preview/snek_0_4599.jpeg)

---
# Data Augmentation

- Rotation
- Random crops
- Mirroring
- ...

Make sure these reflect realistic and relevant variability!

---
class: smaller
# Image data augmentation with Keras

## https://keras.io/preprocessing/image/

```python
datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)

# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)

# fits the model on batches with real-time data augmentation:
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                    steps_per_epoch=len(x_train) / 32, epochs=epochs)
```
.smaller[
Also check [``flow_from_directory``](https://keras.io/preprocessing/image/).
]
---
class: center, middle
# Interpreting Neural Networks
---
# Deconvolution
.center[
![:scale 100%](images/deconvolution_1.png)
]
https://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
???
If you do it on natural images, you get something like this.
This is one of the standard neural networks trained on image
net dataset.

So you can look at the filters in the input layer and
because they're in image space, you can just visualize them.
Here, this network has 9 filters in the first layer. If you
look at the higher layers, if you want to see the filters,
it's kind of hard, because they don't work on the input
space, they work on the space created by these filters. But
you can kind of try to project them down into the original
image. And if you do that, then you get something like
these. So this is sort of a visualization of what the second
layer learned, this is sort of what they try to detect in
the input. And for each of these, here, you can see the
patches in the training set that most corresponded to them.

---
class: center, middle

![:scale 100%](images/deconvolution_2.png)

???
The goal is that with each layer going up, it learns more
and more abstract things. And so these are different units
in the dataset and what they correspond to.

This network learned to detect people even though there’s no
person class in image net.

---
class: center, middle

![:scale 100%](images/deconvolution_3.png)

???
If you go even deeper into the network, you can see that the
units correspond to even more abstract things.

These are all learned for the class of image classification.
So this was just trained on the output, there was no
information about the location of anything in the image, it
was just trained. The dogs in these images are of different
breeds and different vehicles, and so on. So there was never
any location information in the training, but it's still
picked up a lot of things.
---
class: middle
![:scale 100%](images/distill-feature-vis.png)

.smaller[https://distill.pub/2017/feature-visualization/]
???
This is a paper on feature visualization in neural networks.
This is from Google net, also trained on the same set as
image net. What they’re doing is they’re taking the network,
they’re fixing the activation in a particular feature map
and then they do backpropagation through the network to
adjust the input image so that this particular feature map
is most activated. This way, you want to create an image
that activates this particular feature map or particular
filter the most.

This is one of the papers in a series of papers about how to
restrict gradient descent on the input image to get some
nice outputs.

These are deeper into the network. In the beginning, it’s
like edges of different frequencies and orientations, then
you get something like textures, which are already quite
rich, and then you get the larger textures, and then you get
sort of more object like things and if you go further up,
you get something that is very close to object.

This sort of allows you to try to understand what the higher
layers of these networks detect and it's quite interesting
because a lot of these become quite specific. And this is
all learned in a completely supervised way so the only
feedback that the model got was learning to label these
images correctly. But we get very fine-grained features in
the network.

---
class: middle
# Drop-out
???
Drop out is a technique that’s relatively new compared to
all the other things. It's a particular way to regularize
the network, which cannot be done with scikit-learn.
---
# Drop-out Regularization

.center[
![:scale 75%](images/dropout_reg.png)
]
???
For each sample, during training, you actually remove
certain hidden nodes. So basically, you just X out the
activations.

You do this every time you trade over a training sample
anew. So, this will be different for every example. The idea
at the very high level is to add noise into the training
procedure so that you basically make it harder to overfit.
Since you drop out these parts, it gets much harder for the
model to learn the training set by heart because you always
remove different information.

The goal is that you then learn weights that are more robust
and since you're trying to prevent overfitting that possibly
generalizes better.

The dropout rate is often pretty high, sometimes it 50%,
which means you set 50% of your units to zero.

This is only during learning. You only want to add noise in
the learning process. When you want to do predictions, you
use all the weights but down weight them by the drop out
rate.

Here, we're not only perturbing the input image, but we're
also perturbing the hidden units in a very particular way.
As it turns out that actually works much better than just
adding noise to the input.

.smallest[
- https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

- Rate often as high as .5, i.e. 50% of units set to zero!

- Predictions: use all weights, down-weight by 1 - dropout rate
]

???

- Randomly set activations to zero.

Drop out is a very successful regularization technique
developed in 2014. It is an extreme case of adding
noise to the input, a previously established method to
avoid overfitting.
Instead of adding noise, we actually set given inputs to
0. And not only on the input layer, also the
intermediate layer.
For each sample, and each iteration we pick different
nodes. Randomization avoids overfitting to particular
examples.
---
class:spacious
#Ensemble Interpretation

- Every possible configuration represents different
network.

- With p=.5 we jointly learn `$\binom{n}{\frac{n}{2}}$` networks

- Networks share weights

- For last layer dropout: prediction is approximate
geometric mean of predictions of sub-networks.
???
---
#Implementing Drop-Out

.smaller[
```python
from keras.layers import Dropout

model_dropout = Sequential([
    Dense(1024, input_shape=(784,), activation='relu'),
    Dropout(.5),
    Dense(1024, activation='relu'),
    Dropout(.5),
    Dense(10, activation='softmax'),
])
model_dropout.compile("adam", "categorical_crossentropy", metrics=['accuracy'])
history_dropout = model_dropout.fit(X_train, y_train, batch_size=128,
                            epochs=20, verbose=1, validation_split=.1)
```
]

???
---
class:spacious
# When to use drop-out

- Avoids overfitting

- Allows using much deeper and larger models

- Slows down training somewhat

- Wasn’t able to produce better results on MNIST (I
don’t have a GPU) but should be possible

???
This is basically a regularizer.

You're can learn much, much bigger networks with drop out so
you can do deeper and wider networks without overfitting.
That was the original intent.

It slows down training even for the same size because you
basically zero out a portion of the weights and these
weights will not be updated. And so you need to do more
iterations before you've learned all the weights.

The question was, can we paralyze over different networks.

It's not really implemented as different networks, it's more
helpful to think of it as different networks. There are way
too many networks as to store them separately.

Generally, drop out is one of the techniques that people use
that usually improve generalization and decrease
overfitting.

Another technique that you should look up is residual
networks. It’s another generic strategy to speed up training
and it allows to learn deeper networks.

Not all state of the art neural networks you will see use
drop out, but some of them might. And similarly, you will
see some that use residual networks or batch normalization.

There’s basically a toolbox of different tricks that people
use and plot together to get good networks.

Another technique that is often used with neural networks is
augmenting the training data. So adding noise to training
data, or changing the training data in some way so that you
get more data is something that is often very helpful.

We'll talk about images next. With natural images, you can
flip it on the y-axis and it will still semantically be the
same image unless you want to do OCR. And so if you do that,
you just doubled the size of your training set. And for
neural networks, the bigger training set, the better.

If you do something like add noise, this actually gives you
an infinite sized training set because every time you see a
sample, you add noise in a different way and so you on the
fly, generate arbitrarily big training datasets. Drop out is
sort of a way to do this generically and do this on the
hidden layer. But if you have domain knowledge, like for
images, and you can do this in a way that's specific to the
domain, then that is often more helpful.

Another thing that is used with images, for example, is
looking at different crops/scales of the image. So if you
can crop your image in different ways, like 3 pixels to left
and 3 pixels to the right and each of them is a new training
sample, this way, you can also get infinitely many samples.
Obviously, they're all correlated but it's better than just
using the same sample over and over and makes overfitting
harder. And these are actually very, very important
techniques.

So if you look at state of the art models they will always
use these techniques to augment the training dataset,
because getting images is easy, for example, but getting
labeled images is very hard. And so if you know how you
interpret the image, you know what you should do with the
label. If you're just doing classification and cropping it
differently or rotating, it will not change the label.

Unless your label is a person holding something in the left
hand versus holding something in the right hand and then you
know you need to switch the label.
class: center, middle
# Deep Residual Networks (ResNet)
[He et. al. - Deep Residual Learning for Image Recognition (2015)](https://arxiv.org/pdf/1512.03385.pdf)
???
One that made quite a big jump in performance is deep
residual networks.
---
class:center,middle
# Batch Normalization
???
The next thing I want to talk about is one more trick that
we can use to improve learning called batch normalization.
This is a heuristic people found that speeds up learning and
often gives better results.

The idea here is that you want the hidden units to be scaled
well. So what this does is kind of trying to scale hidden
units to zero mean unit variance and doing this per batch.
---
# Problem
![:scale 90%](images/resnet-no-deep-nets.png)
???
We can't fit deep networks well - not even on training set!
"vanishing gradient problem" - was motivation for relu, but not solved yet.

The deeper the network gets, usually the performance gets
better. But if you make your network too deep, then you
can't learn it anymore.

This is on CIFAR-10, which is a relatively small dataset.
But if you try to learn a 56-layer convolutional, you cannot
even optimize it on the training set. So basically, it's not
that we can't generalize, we can't optimize. So these are
universal approximators, so ideally, we should be able to
overfit completely the training set.

But here, if we make it too deep, we cannot overfit the
training set anymore. It's kind of a bad thing. Because we
can’t really optimize the problem. So this is sort of
connected to this problem of vanishing gradient that it's
very hard to backpropagate the error through a very deep net
because basically, the idea is that the further you get from
the output, the gradients become less and less informative.
We talked about RELU units, which sort of helped to make
this a little bit better. Without RELU units, you had like 4
or 5 layers, with RELU units, you have like 20 layers. But
if you do 56 layers, it's not going to work anymore.

---
#Batch Normalization

.center[
![:scale 80%](images/batch_norm.png)
]

.smallest[
[Ioffe, Szegedy: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
]

???
Another relatively recent advance in neural networks is
batch normalization. The idea is that neural networks
learn best when the input is zero mean and unit
variance. We can scale the data to get that.
But each layer inside a neural network is itself a neural
network with inputs given by the previous layer. And
that output might have much larger or smaller scale
(depending on the activation function).
Batch normalization re-normalizes the activations for a
layer for each batch during training (as the
distribution over activation changes). The avoids
saturation when using saturating functions.
To keep the expressive power of the model, additional
scale and shift parameters are learned that are
applied after the per-batch normalization.
---
# Convnet with Batch Normalization
.smaller[```python
from keras.layers import BatchNormalization

num_class = 10
cnn_small_bn = Sequential()
cnn_small_bn.add(Conv2D(8, kernel_size=(3, 3),
                 input_shape=input_shape))
cnn_small_bn.add(Activation("relu"))
cnn_small_bn.add(BatchNormalization())
cnn_small_bn.add(MaxPooling2D(pool_size=(2, 2)))
cnn_small_bn.add(Conv2D(8, (3, 3)))
cnn_small_bn.add(Activation("relu"))
cnn_small_bn.add(BatchNormalization())
cnn_small_bn.add(MaxPooling2D(pool_size=(2, 2)))
cnn_small_bn.add(Flatten())
cnn_small_bn.add(Dense(64, activation='relu'))
cnn_small_bn.add(Dense(num_classes, activation='softmax'))
```]
???
very small to make it fit on a slide
---
# Learning speed and accuracy
.center[![:scale 80%](images/learning_speed.png)
]
???
FIXME label axes!

The solid lines are with batch normalization and the dotted
lines are without batch normalization.

You can see the learning is much faster and also gets to a
better accuracy.

---
# For larger net (64 filters)
.center[![:scale 80%](images/learning_speed_larger.png)
]
???
FIXME label axes

Here's the same thing with a larger network where we have 64
filters instead of 8. This learns even faster and learns to
overfit really well.
---
# Problem
![:scale 90%](images/resnet-no-deep-nets.png)
???
We can't fit deep networks well - not even on training set!
"vanishing gradient problem" - was motivation for relu, but not solved yet.

The deeper the network gets, usually the performance gets
better. But if you make your network too deep, then you
can't learn it anymore.

Even it's not going to work anymore, even on the training
set. So this has been like a big problem. And it has a
surprisingly simple solution, which is the RES-NET layer.

---
class: middle
# Residual Neural Networks
---
class: center
# Solution
![:scale 50%](images/residual-layer.png)

`$$\text{out} = F(x, \{W_i\}) + x  \quad \text{ for same size layers}$$`

`$$\text{out} = F(x, \{W_i\}) + W_sx \quad \text{ for different size layers}$$`
--

`$$F(x) = \text{out} - x \quad \text{learing the residual}$$`

???
instead of learning a function. learn the difference to identity.
if sizes different, add linear projection.

Here's how the residual layer looks like. And the idea is,
let's say you have a bunch of weight layers, instead of
learning a function, we learn how the function is different
from the identity.

So you're not trying to model the whole relationship between
x and y, you want to model how is y different from x.

In practice, what happens is you have multiple weight
layers, usually like 2, and you have skip connection that
gives the identity from before these layers to after these
layers.

So basically, if you set these weights all to zero, you have
a pass-through layer. And this allows information to be
back-propagated much more easily because you have all these
identity matrices, so something always gets backpropagated.
So this obviously only works if y and x have the same shape.

So in CNNs, often the convolutional layers have sort of the
same shape. But then you also have max-pooling layers. And
so what you can do is, instead of having the identity, you
use a linear transformation.

And this way, the gradients can propagate better. So just
seems like a very simple idea but people have tried it
before and it really made a big difference.

---
class: center, middle
![:scale 25%](images/resnet-architecture.png)
???
dotted lines are linear projection, others are identity.

MTG19, which a state of the art neural network before and
these are all the layers.

Next to it is a 34 layered, convolutional neural network.

And then is the residual network. For each pair of two
layers, you add in a skip connection. This black arrow is
just an identity mapping because things are the same size.
If there's pooling, there's this dashed arrow, which is a
linear transformation. But basically, they're sort of a
pathway that allows you to project the identity from the
very beginning to the very end. Or from the output signal
you get back towards the beginning.

---
class: center, middle
![:scale 100%](images/resnet-success.png)
???
Here’s the result. The solid line is the training error and
the bold line is the test error. The lines are depicted over
the number of iterations.

So what you can see here is the 18 layers works well than
the 34 layers. And the training set is worse than the test
set everywhere. But with the 34 layers, even the training
set can’t beat the test set of the 18 layers.

The next one is exactly the same architecture, but we put in
all these identity matrices. And now we can see that the 18
layer is pretty unchanged. But the 34 layer is now actually
better than the 18 layers. And in particular, we can overfit
the dataset a little bit. This is a much better result than
before. And when you publish this, this was state of the art
and quite a big jump.

---
class: center
![:scale 70%](images/resnet-results.png)

This was Dec 2015. Current state-of-the-art: 11.5% top-1 error
https://paperswithcode.com/sota/image-classification-on-imagenet
(and this is less relevant as benchmark)

???
uses batch normalization and dropout.

152 layers is a whole lot of layers. ResNet is by Microsoft.
So after this came out, people in Google stole the ResNet
trick from Microsoft and then they got better. So residual
networks become widely used after they published this.
---
# ResNets in Keras
.smallest[
Requires functinal API:
https://keras.io/getting-started/functional-api-guide/

```python
from keras.layers import Input, Dense
from keras.models import Model

# This returns a tensor
inputs = Input(shape=(784,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels)  # starts training
```
]
---
# CNN with Functional API
.tiny[
.left-column[
```python
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten
from keras.models import Model

num_classes = 10
inputs = Input(shape=(28, 28, 1))
conv1_1 = Conv2D(32, (3, 3), activation='relu',
                 padding='same')(inputs)
conv1_2 = Conv2D(32, (3, 3), activation='relu',
                 padding='same')(conv1_1)
maxpool1 = MaxPooling2D(pool_size=(2, 2))(conv1_2)
conv2_1 = Conv2D(32, (3, 3), activation='relu',
                 padding='same')(maxpool1)
conv2_2 = Conv2D(32, (3, 3), activation='relu',
                 padding='same')(conv2_1)
maxpool2 = MaxPooling2D(pool_size=(2, 2))(conv2_2)
flat = Flatten()(maxpool2)
dense = Dense(64, activation='relu')(flat)
predictions = Dense(num_classes, activation='softmax')(dense)

model = Model(inputs=inputs, outputs=predictions)
```
]
.right-column[
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_3 (InputLayer)         (None, 28, 28, 1)         0
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 28, 28, 32)        320
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 28, 28, 32)        9248
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 14, 14, 32)        0
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 14, 14, 32)        9248
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 14, 14, 32)        9248
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 7, 7, 32)          0
_________________________________________________________________
flatten_3 (Flatten)          (None, 1568)              0
_________________________________________________________________
dense_13 (Dense)             (None, 64)                100416
_________________________________________________________________
dense_14 (Dense)             (None, 10)                650
=================================================================
Total params: 129,130
Trainable params: 129,130
Non-trainable params: 0
_________________________________________________________________
```
]
]

---
class: compact
# Adding Skip connections
.tiny[
.left-column[
```python
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten
from keras.models import Model

num_classes = 10
inputs = Input(shape=(28, 28, 1))
conv1_1 = Conv2D(32, (3, 3), activation='relu',
                 padding='same')(inputs)
conv1_2 = Conv2D(32, (3, 3), activation='relu',
                 padding='same')(conv1_1)
maxpool1 = MaxPooling2D(pool_size=(2, 2))(conv1_2)
conv2_1 = Conv2D(32, (3, 3), activation='relu',
                 padding='same')(maxpool1)
conv2_2 = Conv2D(32, (3, 3), activation='relu',
                 padding='same')(conv2_1)
# Actually doesn't work, this net is too small
skip2 = add([maxpool1, conv2_2])
maxpool2 = MaxPooling2D(pool_size=(2, 2))(skip2)
flat = Flatten()(maxpool2)
dense = Dense(64, activation='relu')(flat)
predictions = Dense(num_classes, activation='softmax')(dense)

model = Model(inputs=inputs, outputs=predictions)
```
]
.right-column[
.tiniest[
```
_________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
=========================================================================================
input_14 (InputLayer)           (None, 28, 28, 1)    0
_________________________________________________________________________________________
conv2d_57 (Conv2D)              (None, 28, 28, 32)   320         input_14[0][0]
_________________________________________________________________________________________
conv2d_58 (Conv2D)              (None, 28, 28, 32)   9248        conv2d_57[0][0]
_________________________________________________________________________________________
max_pooling2d_23 (MaxPooling2D) (None, 14, 14, 32)   0           conv2d_58[0][0]
_________________________________________________________________________________________
conv2d_59 (Conv2D)              (None, 14, 14, 32)   9248        max_pooling2d_23[0][0]
_________________________________________________________________________________________
conv2d_60 (Conv2D)              (None, 14, 14, 32)   9248        conv2d_59[0][0]
_________________________________________________________________________________________
add_11 (Add)                    (None, 14, 14, 32)   0           max_pooling2d_23[0][0]
                                                                 conv2d_60[0][0]
_________________________________________________________________________________________
max_pooling2d_24 (MaxPooling2D) (None, 7, 7, 32)     0           add_11[0][0]
_________________________________________________________________________________________
flatten_14 (Flatten)            (None, 1568)         0           max_pooling2d_24[0][0]
_________________________________________________________________________________________
dense_35 (Dense)                (None, 64)           100416      flatten_14[0][0]
_________________________________________________________________________________________
dense_36 (Dense)                (None, 10)           650         dense_35[0][0]
=========================================================================================
Total params: 129,130
Trainable params: 129,130
Non-trainable params: 0
_________________________________________________________________________________________
```
]
]
]

---
class: center, middle

## https://keras.io/applications/

<table class='alternating'>
<thead>
<tr>
<th align='left'>Model</th>
<th align="right">Size</th>
<th align="right">Top-1 Accuracy</th>
<th align="right">Top-5 Accuracy</th>
<th align="right">Parameters</th>
<th align="right">Depth</th>
</tr>
</thead>
<tbody>
<tr>
<td align='left'><a href="#xception">Xception</a></td>
<td align="right">88 MB</td>
<td align="right">0.790</td>
<td align="right">0.945</td>
<td align="right">22,910,480</td>
<td align="right">126</td>
</tr>
<tr>
<td align='left'><a href="#vgg16">VGG16</a></td>
<td align="right">528 MB</td>
<td align="right">0.713</td>
<td align="right">0.901</td>
<td align="right">138,357,544</td>
<td align="right">23</td>
</tr>
<tr>
<td align='left'><a href="#vgg19">VGG19</a></td>
<td align="right">549 MB</td>
<td align="right">0.713</td>
<td align="right">0.900</td>
<td align="right">143,667,240</td>
<td align="right">26</td>
</tr>
<tr>
<td align='left'><a href="#resnet">ResNet50</a></td>
<td align="right">98 MB</td>
<td align="right">0.749</td>
<td align="right">0.921</td>
<td align="right">25,636,712</td>
<td align="right">-</td>
</tr>
<tr>
<td align='left'><a href="#resnet">ResNet101</a></td>
<td align="right">171 MB</td>
<td align="right">0.764</td>
<td align="right">0.928</td>
<td align="right">44,707,176</td>
<td align="right">-</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#resnet">ResNet152</a></td>
<td align="right">232 MB</td>
<td align="right">0.766</td>
<td align="right">0.931</td>
<td align="right">60,419,944</td>
<td align="right">-</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#resnet">ResNet50V2</a></td>
<td align="right">98 MB</td>
<td align="right">0.760</td>
<td align="right">0.930</td>
<td align="right">25,613,800</td>
<td align="right">-</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#resnet">ResNet101V2</a></td>
<td align="right">171 MB</td>
<td align="right">0.772</td>
<td align="right">0.938</td>
<td align="right">44,675,560</td>
<td align="right">-</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#resnet">ResNet152V2</a></td>
<td align="right">232 MB</td>
<td align="right">0.780</td>
<td align="right">0.942</td>
<td align="right">60,380,648</td>
<td align="right">-</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#inceptionv3">InceptionV3</a></td>
<td align="right">92 MB</td>
<td align="right">0.779</td>
<td align="right">0.937</td>
<td align="right">23,851,784</td>
<td align="right">159</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#inceptionresnetv2">InceptionResNetV2</a></td>
<td align="right">215 MB</td>
<td align="right">0.803</td>
<td align="right">0.953</td>
<td align="right">55,873,736</td>
<td align="right">572</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#mobilenet">MobileNet</a></td>
<td align="right">16 MB</td>
<td align="right">0.704</td>
<td align="right">0.895</td>
<td align="right">4,253,864</td>
<td align="right">88</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#mobilenetv2">MobileNetV2</a></td>
<td align="right">14 MB</td>
<td align="right">0.713</td>
<td align="right">0.901</td>
<td align="right">3,538,984</td>
<td align="right">88</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#densenet">DenseNet121</a></td>
<td align="right">33 MB</td>
<td align="right">0.750</td>
<td align="right">0.923</td>
<td align="right">8,062,504</td>
<td align="right">121</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#densenet">DenseNet169</a></td>
<td align="right">57 MB</td>
<td align="right">0.762</td>
<td align="right">0.932</td>
<td align="right">14,307,880</td>
<td align="right">169</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#densenet">DenseNet201</a></td>
<td align="right">80 MB</td>
<td align="right">0.773</td>
<td align="right">0.936</td>
<td align="right">20,242,984</td>
<td align="right">201</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#nasnet">NASNetMobile</a></td>
<td align="right">23 MB</td>
<td align="right">0.744</td>
<td align="right">0.919</td>
<td align="right">5,326,716</td>
<td align="right">-</td>
</tr>
<tr>
<td align='left'><a href="https://keras.io/applications/#nasnet">NASNetLarge</a></td>
<td align="right">343 MB</td>
<td align="right">0.825</td>
<td align="right">0.960</td>
<td align="right">88,949,818</td>
<td align="right">-</td>
</tr>
</tbody>
</table>
???
Here are all the things that are inside Keras. Inception is
created by Google by copying from Microsoft. These are
pretty well-performing models but this list changes daily.
---
class: middle
# Transfer learning
---
class: center
# Transfer Learning

.center[
![:scale 80%](images/pretrained_network.png)
]

.smaller[See http://cs231n.github.io/transfer-learning/]
???

- Train on “large enough” data.

- Apply to new “small” dataset.

- Take activations of last or second to last fully
connected layer.

Often we have a small but specific image dataset for a
particular application. Training a neural net is not
feasible unless we have tens of thousands or
hundreds of thousands of images.
However, if we have a convolutional neural net that
was already trained on a large dataset that is similar
enough, we can hope that the features it learned are
also helpful for our task.
The easiest way to adapt a trained network to a new
task is to just apply it to our dataset and take the
activations of the second to last or last layer.
If the original task was rich enough – say 10000
different classes as in imagenet – these layers
contain a lot of information about the image.
We can then use these activations as features for
another classifier like a linear model or smaller dense
neural network.

The main point is that we don’t need to retrain all the
weights in the network. You can think of it as
retraining only the last layer – the classification layer
– of the network, while holding all the convolutional
filters fixed. If they learned generic patterns like
edges and patterns, these will still be useful for your
task.
You can download pre-trained neural networks for
many architectures online.
Using a pre-trained network is sometimes also known
as transfer learning.
This potentially doesn’t work with images from a very
different domain, like medical images.

What you do in transfer learning is you basically get rid of
the last layer, you look at the second to last layer and you
just use the representation in this layer as a feature
representation. So this is basically a way to embed the
image into vector space, similar to what word2vec did, only
now we're embedding this whole image into space, and we get
4069-dimensional representation.

People have tried to use neural networks for feature
learning for many years, but it never really worked really
well. What made this work really well in practice is, this
is basically a supervised task, so this whole thing is
learned to do this classification in these 1000 classes. So
this is a very specific task to optimize, the model tries to
be discriminative with respect to these classes. But on the
other hand, these 1000 classes spend a whole variety of
natural images, a whole variety of different kinds of
objects, and scenes, and everything so that to actually
perform well on this task, you need a representation of the
world that is somewhat generic.

So the hope is that these 4000-dimensional feature vector
doesn't encode only information about the 1000 classes that
we are actually interested in here but more general
information about the image. Last time, I showed you neurals
that reactive to faces, clothing, there are no categories
like faces or clothing in the dataset but still, there are
some filters that respond to that.

This has been very successful. The reason why we want to do
this is it's very hard to train CNNs, and you need a lot of
data. So if you have a specific image recognition task, it's
very likely that the things you're interested in are not one
of the 1000 classes, because you have mostly different
breeds of dogs. But you probably also don't have a million
training images to actually train an architecture like this
yourself. And you really need a lot of training data to make
this work. So here, what we're doing is we will use this
network learned on this very large, very generic database to
learn representation. And then we can use as a
representation for different tasks. And this is what you're
supposed to do in the last task in the homework.

---
class: spacious
# Ball snake vs Carpet Python
.center[
![:scale 100%](images/ball_snake_vs_python.png)
]
???
I want to now classify ball snakes and carpet pythons.
---
.smaller[
```python
import flickrapi
import json
flickr = flickrapi.FlickrAPI(api_key, api_secret, format='json')
json.loads(flickr.photos.licenses.getInfo().decode("utf-8"))
def get_url(photo_id="33510015330"):
    response = flickr.photos.getsizes(photo_id=photo_id)
    sizes = json.loads(response.decode('utf-8'))['sizes']['size']
    for size in sizes:
        if size['label'] == "Small":
            return size['source']

get_url()
ids = search_ids("ball snake", per_page=100)
urls_ball = [get_url(photo_id=i) for i in ids]

from urllib.request import urlretrieve
import os
for url in urls_carpet:
    urlretrieve(url, os.path.join("snakes", "carpet", os.path.basename(url)))
```]
???
---
.center[
![:scale 80%](images/carpet_python_snake.png)
]

???
I get 100 carpet snake pictures and 100 ball snake pictures
from Flickr.

This would be way too small of a training dataset to train a
convolutional neural network. If we extract features using
an existing convolutional neural network, we can learn
something like a linear classifier on top of these
representations.

There is noise in the dataset and similar looking things.

---
# Extracting Features using VGG
.smaller[
```python
from keras.preprocessing import image

X = np.array([image.img_to_array(img) for img in images_carpet + images_ball])

# load VGG16
model = applications.VGG16(include_top=False, weights='imagenet')

# preprocessing for VGG16
from keras.applications.vgg16 import preprocess_input

X_pre = preprocess_input(X)
features = model.predict(X_pre)
print(X.shape)
print(features.shape)
features_ = features.reshape(200, -1)
```
```
(200, 224, 224, 3)
(200, 7, 7, 512)
```]
???
VGG16 like each convnet has particular input requirements, for example 224x224 images.

Include top=false means I don’t want the last layer which
does the 1000 class classification.

I need to load the images and convert them into an array in
the shape that keras wants it. So you need to make sure that
the input has the right representation.

You need to also make sure that it's preprocessed in the
same way as all the images that VGG16 was trained with.
VGG16 was trained in a particular way that image net was
cropped. These were 224x224 images, so you need to make sure
that whatever you input into your model is 224x224.

And the way you bring it to this should probably be the same
way it was done for the whole training set. So here for each
of the applications, there's also a preprocess input, which
basically does exactly the same preprocessing as what was
done for the training of the network. And this is really
important.

Model.predict will take a while for a big dataset because
you need to run through these whole neural networks.

X_pre is after the preprocessing, I have 100 images of
snakes of each kind, it's 224x224x3 and this the right input
for the model. And then after I call predict, after all
these convolutional layers, I get 7x7 feature map, and
there's 512 of them.

You could probably also do more smarter things, you could
look at the output of multiple layers, like the last layer,
the second to last layer, or something like that, I just
used the last layer and just flatten everything to get out
of it. So here's there’s a little bit of input structure
left so still a 7x7 image, but I just ignored the image
structure.

---
# Classification with LogReg
.smaller[
```python
from sklearn.linear_model import LogisticRegressionCV
lr = LogisticRegressionCV().fit(X_train, y_train)
print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, lr.predict(X_test))
```
```
1.0
0.82
array([[24,  1],
       [ 8, 17]])
```]
???
On the training set, I get 100% accuracy and on the test
set, I get 82% accuracy. It's a balanced dataset, so the
chance is 50%. If I did any classification directly on the
image, it would be impossible basically.

These images are so high dimensional and so varied, that it
will be impossible to learn classifier on these images
directly.

If you look at this image, it's 224x224x3 which means it's
like about 100,000, if you think of it as a feature vector.
So you would have 100,000 features and 200 examples and
training a linear model on this, in particular given how the
data is represented is basically hopeless. But using this
pre-trained convolutional neural network, we can actually do
something quite successful here on this tiny dataset. And
this is how convolutional networks are very commonly used in
practice.

Just select a relatively small dataset that is specific to
your domain and you use a pre-trained neural network as a
feature extraction mechanism.

---
#Finetuning

.center[
![:scale 90%](images/finetuning.png)
]

???

- Start with pre-trained net

- Back-propagate error through all layers

- “tune” filters to new data.

A more complicated variant of this is to load a network
trained on some other dataset, and replace the last
layer with your classification task.
Instead of training only the last layer, we can also keep
training all the previous layers, backpropagating the
gradient through the network and adjusting the
previously learned filters for our task.
You can think of this as warm-starting a neural network
from one that was trained on another dataset.
If you do that, we often want to train the last layer a
little bit before we backpropagate through the
network. Otherwise the random initialization of the
last layer might destroy the filters that we used for
initialization.

Another option is, instead of just taking the output
representation by the network, we can do fine tuning. In
fine-tuning, you keep the network but you don't only use it
for feature extraction, you basically use the weights
learned on the different dataset as initialization. You
learn all the weights, you throw away the last layer,
because that was specific to the classification task you
had. And now you learn like a linear classifier on top, but
you not only learn the weight of the linear classifier in
the back, you also backpropagate the error through the whole
network.

Usually, you want to do like a little bit of a burn at the
beginning where you tune just the last layer, but then you
basically keep learning the whole network, it's probably not
going to change the filters too much but it's going to
adjust the network to your specific task. But it's much,
much easier than trying to learn from scratch.

The problem with is that if you have too little data, you
can overfit your data very easily. Neural networks have very
many parameters so if you try to train them or even fine
tune them on a very small dataset, then you're just going to
overfit. So depending on dataset size and how similar to
image net may be, either fine-tuning or just extracting the
last layer might be the best way to go forward.

You should really think about convolutional neural networks
as doing something fundamentally different than fully
connected networks or any other classifier because they use
the 2d structure of the input. If you think of the MNest
digits, if you shuffled all the pixels in the same way for
all of the images, any classifier that we looked at so far,
will have exactly the same result. The ordering of the
pixels is completely ignored by fully connected neural
networks, random forest, SVM, and linear model.

If you shuffled all the pixels in the same way for all
images, it would look like complete garbage and it would be
impossible for a human to actually classify them. The
machine learning algorithms do exactly the same thing if
they’re not shuffled. They completely ignore all the
neighborhood structure and they’re similarly accurate.

On convolutional neural networks, it really makes use of
this neighborhood structure of the networks. So if we
shuffled all the pixels, the convolutional network will
completely fail because the neighboring pixels don't have
anything to do with each other anymore and the convolutional
network could not learn anything.

So this is sort of a crucial difference between how the
convolutional networks work versus how any of the other
classifiers work. They really make use of the topology of
the input space to this structure. And if you didn't do
that, so for MNest you could do something even if you ignore
the pixels structure, you will not get anything. So it's
really important to use something that's aware of the image
structure.

---
class: center, middle
# Recurrent Neural Networks
???
Recurrent networks are usually used for time series, or
generally for sequential data.
---
class: center, middle
![:scale 100%](images/recurrent-neural-net.png)

.smallest[
Images from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
]
???
In RNN, we have an input, hidden layer and an output.
Basically, the hidden layer is connected to itself. So at
the first time step, you get some input here, compute the
hidden layer, and get some output. At the second time step,
you get some input again and compute some output, but you
can also look back to the last hidden layer state. But the
weights are all shared. So the matrix, U, V, and W are
constant all the time.

This is done if you have a time series or any series that
you want to tag everything in the series. So for example, in
NLP people like to know is something a noun or a verb. So if
you feed in a sentence, then for each word it would say what
kind of word it is, as a part of speech tagging. Or if you
want to make a prediction over time, if you have an input
signal that evolves over time, and you want to predict an
output signal, then you could also use that.

Here basically, this works if you have two parallel
sequences and if you want an output for each element in the
input sequence.

The problem is that if you have long term dependencies, this
doesn't work very well. If the output here depends on the
input way back then the network will have forgotten about
this. Because you always have the same weight matrix V and
it just propagates information forward and forward. It's
very hard for a network to remember things that happened
previously.

So as to overcome this problem, people started using LSTMs.

---
class: center, middle

![:scale 100%](images/LSTM.png)
.smallest[[Hochreiter, Schmidhuber - Long Short-Term Memory (1997)](http://www.bioinf.jku.at/publications/older/2604.pdf)]
???
Which stands for long short term memory. This is from 1997.
It was ignored for 10 years. But now it's all the rage.

The idea is that now the hidden layer looks like this. The
idea is that you have multiple parts of this hidden layer,
which influence how you use the last state, how you use the
input, and how you're going to propagate it forward.

Here at the current time step, there's weight matrix Ft,
this is a multiplicative influence on the last Ct, which is
sort of one part of the memory. There's two parts of the
memory, Ct and Ht.

Lt is how much the current activation influence Ct. TanH is
a layer that's the new activation of Ct. The final output
gate that says how much should Ct influence the output in H.

So basically, Ct is sort of a control stream, and Ht is the
activations.

So instead of having one hidden layer, you have the hidden
layers made out of some memory, some forget, some waiting
and some gating.

So each layer would be replaced by one of these.

---
class: center, middle

![:scale 100%](images/GRU.png)
.smallest[[Cho et. al. - Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation(2014)](https://arxiv.org/pdf/1406.1078)]
???
Since they're so complicated, a friend of mine came up with
this thing called the GRU which is simpler.

Here, you only have a single string of hidden data, and you
have something like a forget gate. The idea between these
forget gates is you want to be able to propagate error more
easily and they are something similar to shortcut
connections over time.

Basically, if there's no influence coming from Xt on σ and σ
channel, then the information just flows through directly.
And this basically is similar to the shortcut units in the
RES-NET only like 20 years earlier.

So here the idea is to allow backpropagation through more
time by having some quite intricate structure. This a little
bit faster and a little bit easier to understand.

---
class: center, middle

![:scale 100%](images/lstm-unrolled.png)
???
But I think most people are using the LSTM these days. Each
hidden layers basically replaced by one of these. So here,
for each time step, you put in the LSTM layer that has these
two streams of data coming from the last LSTM layer, and you
produce an output.

This is a state of the art for doing things like parts of
speech tagging, or just predicting on time series or
predicting on any kinds of series.

This is pretty easy to do with keras or any other deep
learning toolbox.

Unfortunately, not that many problems have this form. Often
you want to have interactions between sequences where they
don't have the same length, for example, text. Also, they
might have long term dependencies.

So for example, if you want to translate German to English,
the word that you say at the end, might depend on the word
that was said in the beginning, and the other way around. So
if I tried to do something like this, the output here might
actually depend on the input later, in a sentence and so you
can do it with this architecture. Also, they will have
different lengths, everything is longer in German.

---
class: center, middle

![:scale 100%](images/seq2seq.png)
.smallest[[Sutskever et. al. - Sequence to Sequence Learning with Neural Networks (2014)](https://arxiv.org/pdf/1409.3215.pdf)]
???
Here, you have an input sequence, A, B, C, and you want to
predict the sequence W, X, Y, Z. And so what you do is you
start reading the input sequence token by token or word by
word, in this example, you start reading A, B, and C and
read the end of sentence. Once the model reads end of
sentence, you start trying to predict. And then you predict,
predict, predict until the model predicts end of sentence.

For each new prediction step, you also feed in what was the
output of the last step. And then again, you train this
thing with backpropagation. It's called back progression
through time, because you need to backpropagate along this
whole sequence.

The surprising thing about this is this actually works.

---
class: center
# Machine Translation
![:scale 45%](images/seq2seq-machine-translation.jpg)
???
This is how it looks in practice. This an example of machine
translation taken from TensorFlow tutorial.

This model is just trained doing backpropagation. And it
learns to translate languages properly. This is how Google
Translate works.

The structure of the two sentences is quite similar. But
these sentences are not aligned in any way. You just have an
input sequence and an output sequence. It works for any 2
sequences basically. This is called sequence to sequence
learning.

With this architecture, you can learn to predict any
arbitrary sequences.

---
class: center
# Question answering
![:scale 80%](images/general-qa.png)
.smallest[[Chen et. al. - Reading Wikipedia to Answer Open-Domain Questions](https://arxiv.org/pdf/1704.00051v2.pdf)
https://rajpurkar.github.io/SQuAD-explorer/]
???
These are actually training examples.

There are datasets of questions and answers. The model is
supposed to learn these using a context paragraph. Making
this actually work is more than just doing a sequence to
sequence but the main architecture, the main component, is a
sequence to sequence. So here, it tries to predict from the
question the answer using some context.

This is actually from a training set but the model actually
performs reasonably well on this.

Of course, these networks need a lot of data to train, they
take very long to train and you need to fiddle around with
it a lot, you need to do things like batch normalization
drop-out and so on and you need these weird LSTM layers. But
there's a whole lot of research happening in this kind of
space.

People have been using recurrent neural networks also to
learn word representations that are sort of more powerful
than word2vec.
---
# Adverserial Samples

.center[
![:scale 80%](images/adverserial.png)
]

.smallest[
<br />
[Szegedy et. al.: Intriguing properties of neural networks](https://arxiv.org/abs/1312.6199)
]
???
Since convolutional neural nets are so good at image
recognition, some people think they are pretty
infallible. But they are not. There is this interesting
paper about intriguing properties of neural networks,
that introduces adverserial samples.
Adverserial samples are samples that were created by
an adversary or attacker to fool your model. Here,
they changed images to be classified as Ostrich by
AlexNet trained on imagenet.
The picture on the left is change just slightly, and went
from correctly classified to classified as Ostrich.
This technique uses gradient descent on the input and
requires access to all the weights in the network to
create the samples.
Given how high-dimensional the input space is, this is
not very surprising from a mathematical perspective,
but it might be somewhat unexpected.

requires the weights, done by gradient descent.
FIXME where should this go?

---
class: middle
# Questions ?