class: center, middle

### W4995 Applied Machine Learning

# Keras & Convolutional Neural Nets

04/17/19

Andreas C. Müller

???
HW: don't commit cache! Don't commit data! Most <1mb, some 7gb.
Say something about GPUs.

FIXME needs much better explanation for convolution and padding strategies
FIXME explain or define "feature map"
FIXME explain several filters being connected to same output map
FIXME mention that CNNs have few weights, lots of activations
FIXME add explanation of convolution as matrix multiplication, relation to weight sharing
FIXME update with state-of-the-art instead of VGG16
FIXME 1d filter slide not clear. do animation again or something similar?
FIXME slide full vs valid vs same convolutions (yes actually)
FIXME update syntax conv2d!
FIXME doc screenshots resolution
FIXME new convnet architectures? AmoebaNet
FIXME add slide on imagenet?
FIXME y_i in batch normalization confusing.
FIXME add something about data augmentation - next lecture!
FIXME show how permuting pixels doesn't affect fully connected network, but does affect convolutional network
FIXME remove dropout maybe, and talk more about convolutional nets? and then also move batch normalization to next lecture?
FIXME data augmentation before dropout!!
FIXME dropout: make clear different for every sample in batch
FIXME How do I think about batchsize? optimization trade-off, limited ram, ...
FIXME Keras fit doesn't reset!
FIXME show how receptive field size increases with depth
FIXME why is there a fully connected layer at the end?
FIXME this lecture should really be only convolutions! put keras in previous, get rid of graph stuff?
FIXME better explanation for change of size in valid convolution, put formula on slide
FIXME explain number of parameters in convolutional layers
FIXME call to summary shows too many params? weird?
FIXME residual connections: plot the graph for the network I'm running
FIXME explain weight sharing by writing convolution as matrix multiplication
FIXME what do we call "parameters"

---
class:center,middle

#Introduction to Keras

???

---
#Keras Sequential

.smaller[
```python
from keras.models import Sequential
from keras.layers import Dense, Activation
```

```
Using TensorFlow backend.
```

```python
model = Sequential([
    Dense(32, input_shape=(784,)),
    Activation('relu'),
    Dense(10),
    Activation('softmax')])

# or

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))

# or

model = Sequential([
    Dense(32, input_shape=(784,), activation='relu'),
    Dense(10, activation='softmax')])
```
]

???
There are two interfaces to Keras, the sequential and the functional one, but we'll only discuss the sequential interface.
Sequential is for feed-forward neural networks where one layer follows the other.
You specify the layers as a list, similar to an sklearn pipeline.
Dense layers are just matrix multiplications.
Here we have a neural net with 32 hidden units for the MNIST dataset with 10 outputs.
The hidden layer nonlinearity is relu, the output is softmax for multi-class classification.
You can also instantiate an empty Sequential model and then add steps to it.
For the first layer we need to specify the input shape so the model knows the sizes of all the matrices.
The following layers can infer the sizes from the previous layers.

---
.smaller[
```python
model.summary()
```
]

.center[
![:scale 100%](images/model_summary.png)
]

???
FIXME parameter calculation!
model.summary() gives you information about all the layers you have.
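The parameter counts in the summary can be checked by hand: a Dense layer going from n inputs to m outputs has n * m weights plus m biases. A quick sanity check for the two Dense layers above (just arithmetic, not part of the original notebook):

```python
# Dense layer parameters = n_inputs * n_outputs weights + n_outputs biases
hidden = 784 * 32 + 32   # first Dense layer: 25,120 parameters
output = 32 * 10 + 10    # second Dense layer: 330 parameters
print(hidden, output, hidden + output)  # 25120 330 25450
```

The Activation layers have no parameters of their own.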
Checking the summary allows you to see if you've assembled the model in the way that you wanted to. Once you have assembled the Keras model, you need to compile it.

---
# Setting Optimizer

.center[
![:scale 90%](images/optimizer.png)
]

.smaller[
```python
model.compile("adam", "categorical_crossentropy", metrics=['accuracy'])
```
]

???
The compile method picks the optimization procedure and the loss. This basically just sets some options for the optimizer and then builds the compute graph for the model. Each of these parameters can be either a string, or you can provide a callable or an object that implements the right interface. The metrics here are just for monitoring.

---
# Training the model

.center[
![:scale 100%](images/training_model.png)
]

???
fit takes many more parameters than in sklearn.

---
#Preparing MNIST data

.smaller[
```python
from keras.datasets import mnist
import keras

(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

num_classes = 10
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
```

```
60000 train samples
10000 test samples
```
]

???
Running it on MNIST. To use the labels with Keras, you need to convert them to a one-hot encoding. The output layer is 10-dimensional and each dimension corresponds to one of the possible outputs. Since the number of classes is 10, this gives 60,000x10 labels for the training set and 10,000x10 labels for the test set.

---
# Fit Model

```python
model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=1)
```

.center[
![:scale 80%](images/model_fit.png)
]

???
Once the model is set up, I can call fit. Each epoch reports the loss and the accuracy. The loss is the quantity that is actually being optimized. Here, I didn't specify a validation set, so the accuracy is the training set accuracy. If I gave it a bigger model and let it run long enough, the training accuracy would become 1; this model is probably too small to overfit that completely.

The question is: why did I scale the data between 0 and 1 instead of to mean 0 and standard deviation 1? There are 3 reasons for that. First, that's what people do. Secondly, the minimum and maximum values mean something, namely black and white, so having them at 0 and 1 is meaningful. Thirdly, the different pixels should all be on the same scale: each of them is between 0 and 255. But in the dataset they actually have different variances; some pixels near the border are constant, so if I wanted to scale them to standard deviation one, I would get a division by zero. I could maybe fix that, but some pixels are only very slightly gray sometimes, and those would be scaled up enormously even though that doesn't mean a lot. So I used my prior knowledge that all these pixels should have the same scale, and scaled them all in the same way.

---
#Fit with Validation

```python
model.fit(X_train, y_train, batch_size=128,
          epochs=10, verbose=1, validation_split=.1)
```

.center[
![:scale 100%](images/validation_fit.png)
]

???
This is with a validation set. This is very similar to scikit-learn, only fit has many more options.
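Note that validation_split simply takes the last 10% of the rows as the validation set. If you want an explicit, shuffled holdout set instead, you can pass it via validation_data; a minimal sketch (the train_test_split call and the variable names here are just illustrative, not from the original notebook):

```python
from sklearn.model_selection import train_test_split

# Hold out an explicit validation set instead of the last 10% of the rows.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
                                            test_size=0.1, random_state=0)
model.fit(X_tr, y_tr, batch_size=128, epochs=10, verbose=1,
          validation_data=(X_val, y_val))
```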
---
#Evaluating on Test Set

```python
score = model.evaluate(X_test, y_test, verbose=0)
print("Test loss: {:.3f}".format(score[0]))
print("Test Accuracy: {:.3f}".format(score[1]))
```

```
Test loss: 0.120
Test Accuracy: 0.966
```

???

---
# Loggers and Callbacks

.smaller[
```python
history_callback = model.fit(X_train, y_train, batch_size=128,
                             epochs=100, verbose=1, validation_split=.1)
pd.DataFrame(history_callback.history).plot()
```
]

.center[
![:scale 70%](images/logger_callback_plot.png)
]

???
We can look a little bit deeper into what's happening using callbacks. Keras has very powerful callbacks to do things like early stopping. By default, it just records the history of what happens: this History callback is returned from model.fit(), and it has an attribute called history that contains all the data of what happened during training.

Accuracy on the training set goes up all the time, while the loss on the training set goes down all the time. Accuracy on the validation set stops improving and stays stable after a certain point, and the loss on the validation set actually gets worse. So the loss overfits while the accuracy doesn't. This gives you a lot of information to debug and check: is overfitting happening, how long does it take to train, should you train longer?

This interface is pretty simple and pretty similar to scikit-learn, but it's not entirely compatible. Sometimes you want to use these Keras models with scikit-learn tools like GridSearchCV, cross-validation or pipelines.

---
#Wrappers for sklearn

.smaller[
```python
from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
from sklearn.model_selection import GridSearchCV

def make_model(optimizer="adam", hidden_size=32):
    model = Sequential([
        Dense(hidden_size, input_shape=(784,)),
        Activation('relu'),
        Dense(10),
        Activation('softmax'),
    ])
    model.compile(optimizer=optimizer, loss="categorical_crossentropy",
                  metrics=['accuracy'])
    return model

clf = KerasClassifier(make_model)
param_grid = {'epochs': [1, 5, 10],  # epochs is a fit parameter, not in make_model!
              'hidden_size': [32, 64, 256]}
grid = GridSearchCV(clf, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
```
]

???
See https://keras.io/scikit-learn-api/
Useful for grid-search. You need to define a callable that returns a compiled model. You can search parameters that in Keras would be passed to "fit", like the number of epochs. Searching over epochs in this way is not necessarily a good idea, though.

Keras also has wrappers for scikit-learn. These objects behave exactly like scikit-learn estimators. To use them, you need to define a function that returns the model. I can then give this make_model function to the KerasClassifier. The resulting clf behaves like a scikit-learn estimator, and that enables me to grid-search things: I can grid-search the parameters of the make_model function, for example the hidden size. But I could have arbitrary Python code in this make_model function, so I could also grid-search what the activations are, or how many layers there are, or something like that.

You can also grid-search the things that are actually part of the fit function in Keras. Remember that the number of epochs is part of fit in Keras, but for the KerasClassifier it's part of the constructor, because that's the scikit-learn API. So you can search both over the fit parameters and over the parameters of the make_model function. In practice, grid-searching the number of epochs is maybe not the smartest thing to do.
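One alternative, sketched here with the same make_model as above and an arbitrary patience value, is to let a callback stop training once the validation loss stops improving:

```python
from keras.callbacks import EarlyStopping

model = make_model(hidden_size=64)
# Stop once the validation loss has not improved for 3 epochs in a row.
model.fit(X_train, y_train, batch_size=128, epochs=100, verbose=1,
          validation_split=.1,
          callbacks=[EarlyStopping(monitor='val_loss', patience=3)])
```

That way the number of epochs doesn't need to be treated as a hyperparameter at all.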
Looking at a validation set, for example with an early-stopping callback like the one sketched above, is usually better. But these are things you could do. And now I can use GridSearchCV with the KerasClassifier object, a parameter grid and cross-validation, and everything works as you would expect.

For later homework, doing full cross-validation might be too slow for the bigger nets. You can still use this approach if you set cv to something like StratifiedShuffleSplit and do a single split of the data. StratifiedShuffleSplit with n_splits=1 is basically the same as using a single validation holdout set, and you can use that for the grid search if you want.

---
.smaller[
```python
res = pd.DataFrame(grid.cv_results_)
res.pivot_table(index=["param_epochs", "param_hidden_size"],
                values=['mean_train_score', "mean_test_score"])
```
]

.center[
![:scale 70%](images/keras_api_results.png)
]

???
Training longer overfits more, and more hidden units overfit more, but both also lead to better results. We should probably train much longer, actually.

Setting the number of epochs via cross-validation is a bit silly, since it means starting from scratch again each time. Using early stopping would be better.

---
class: middle

# Drop-out

???
Dropout is a relatively new technique compared to most of the other things we have seen. It's a particular way to regularize the network, and it's something that cannot be done with scikit-learn.

---
# Drop-out Regularization

.center[
![:scale 75%](images/dropout_reg.png)
]

???
For each sample, during training, you actually remove certain hidden nodes; basically, you zero out their activations. You do this anew every time you iterate over a training sample, so it is different for every example. The idea, at a very high level, is to add noise into the training procedure so that it becomes harder to overfit. Since you drop out these parts, it gets much harder for the model to learn the training set by heart, because you always remove different information. The goal is to learn weights that are more robust and, since you're preventing overfitting, hopefully generalize better.

The dropout rate is often pretty high, sometimes it's 50%, which means you set 50% of your units to zero. This happens only during learning: you only want to add noise in the learning process. When you do predictions, you use all the weights, but down-weight them by the keep probability (1 - dropout rate).

Here, we're not only perturbing the input image, we're also perturbing the hidden units in a very particular way. As it turns out, that works much better than just adding noise to the input.

--

.smallest[
- https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
- Rate often as high as .5, i.e. 50% of units set to zero!
- Predictions: use all weights, down-weight by 1 - dropout rate
]

???
- Randomly set activations to zero.

Dropout is a very successful regularization technique developed in 2014. It is an extreme case of adding noise to the input, a previously established method to avoid overfitting. Instead of adding noise, we actually set given inputs to 0, and not only in the input layer but also in the intermediate layers. For each sample and each iteration we pick different nodes. The randomization avoids overfitting to particular examples.

---
class:spacious
#Ensemble Interpretation

- Every possible configuration represents a different network.
- With p=.5 we jointly learn `$\binom{n}{\frac{n}{2}}$` networks
- Networks share weights
- For last layer dropout: prediction is approximate geometric mean of predictions of sub-networks.

???
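To make the "shared weights, different sub-network per sample" point concrete, here is a minimal numpy sketch of dropout during training; this is just the idea, not Keras's internal code. Note that each sample in the batch gets its own mask. Most implementations, including Keras's Dropout layer, use the "inverted" variant that rescales at training time, so nothing needs to be down-weighted at prediction time:

```python
import numpy as np

rng = np.random.RandomState(0)
activations = rng.rand(4, 8)      # a batch of 4 samples with 8 hidden units each
rate = 0.5                        # dropout rate

# A different binary mask for every sample in the batch:
mask = rng.binomial(1, 1 - rate, size=activations.shape)
dropped = activations * mask / (1 - rate)   # "inverted" dropout rescaling
```

Each row's mask selects a different sub-network, but all of these sub-networks share the same underlying weights.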
---
#Implementing Drop-Out

.smaller[
```python
from keras.layers import Dropout

model_dropout = Sequential([
    Dense(1024, input_shape=(784,), activation='relu'),
    Dropout(.5),
    Dense(1024, activation='relu'),
    Dropout(.5),
    Dense(10, activation='softmax'),
])
model_dropout.compile("adam", "categorical_crossentropy",
                      metrics=['accuracy'])
history_dropout = model_dropout.fit(X_train, y_train, batch_size=128,
                                    epochs=20, verbose=1, validation_split=.1)
```
]

???

---
class:spacious
# When to use drop-out

- Avoids overfitting
- Allows using much deeper and larger models
- Slows down training somewhat
- Wasn't able to produce better results on MNIST (I don't have a GPU) but should be possible

???
This is basically a regularizer. You can learn much, much bigger networks with dropout, so you can use deeper and wider networks without overfitting. That was the original intent. It slows down training even for the same network size, because in each step you zero out a portion of the units, so the corresponding weights don't get updated in that step, and you need more iterations before you've learned all the weights.

The question was whether we can parallelize over the different networks. It's not really implemented as different networks; thinking of it as different networks is just a helpful interpretation, and there are way too many of them to store separately.

Generally, dropout is one of the techniques that usually improve generalization and decrease overfitting. Another technique that you should look up is residual networks; that's another generic strategy that speeds up training and allows learning deeper networks. Not all state-of-the-art neural networks you will see use dropout, but some of them might, and similarly you will see some that use residual connections or batch normalization. There's basically a toolbox of different tricks that people use and plug together to get good networks.

Another technique that is often used with neural networks is augmenting the training data: adding noise to the training data, or changing the training data in some way so that you get more of it, is often very helpful. We'll talk about images next. With natural images, you can flip them along the y-axis and they will still semantically be the same image, unless you want to do OCR. If you do that, you just doubled the size of your training set, and for neural networks, the bigger the training set, the better. If you do something like adding noise, this effectively gives you an infinitely large training set, because every time you see a sample you add noise in a different way, so you generate arbitrarily big training datasets on the fly. Dropout is sort of a way to do this generically, on the hidden layers. But if you have domain knowledge, as for images, and you can do this in a way that's specific to the domain, that is often more helpful.

Another thing that is used with images is looking at different crops and scales of the image. If you crop your image in different ways, say 3 pixels to the left and 3 pixels to the right, each crop is a new training sample, and this way you can also get very many samples. Obviously they're all correlated, but it's better than just using the same sample over and over, and it makes overfitting harder. These are actually very important techniques: if you look at state-of-the-art models, they will always use them to augment the training dataset, because getting images is easy, for example, but getting labeled images is very hard.
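As a concrete sketch of such augmentation in Keras: the specific transformations and ranges below are just illustrative, and it assumes image-shaped input like the X_train_images array and the cnn model used in the conv-net section later (we come back to data augmentation next lecture):

```python
from keras.preprocessing.image import ImageDataGenerator

# Randomly shift and rotate the training images a little; each epoch then sees
# slightly different variants. (Flipping would be a bad idea for digits,
# but is common for natural images.)
datagen = ImageDataGenerator(rotation_range=10,
                             width_shift_range=0.1, height_shift_range=0.1)
train_iter = datagen.flow(X_train_images, y_train, batch_size=128)
cnn.fit_generator(train_iter,
                  steps_per_epoch=len(X_train_images) // 128, epochs=10)
```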
If you know how you are transforming the image, you also know what you should do with the label. If you're just doing classification, cropping the image differently or rotating it will not change the label, unless your label is, say, a person holding something in the left hand versus the right hand, in which case you know you need to switch the label when flipping.

---
class: center, middle

# Deep Residual Networks (ResNet)

[He et al. - Deep Residual Learning for Image Recognition (2015)](https://arxiv.org/pdf/1512.03385.pdf)

???
One architecture that made quite a big jump in performance is deep residual networks.

---
class:center,middle

#Convolutional neural networks

???

---
class:spacious
# Idea

- Translation invariance
- Weight sharing

???
FIXME figure with illustration
There are 2 main ideas. One is that if you shift your focus within an image, the semantics of the image stay the same: if I look a pixel or so further to the right, this shouldn't really change my interpretation of the image. It also implies that detecting something on the right side of the image is sort of the same as detecting it on the left side, for many natural image tasks. If I want to decide whether there's a person in the image, it doesn't matter whether the person is at the left, right, top or bottom; detecting a person is always the same task.

Weight sharing means the same detector is applied everywhere, so you can detect something no matter where it is in the image.

---
#Definition of Convolution

`$$ (f*g)[n] = \sum\limits_{m=-\infty}^\infty f[m]g[n-m] $$`

`$$ = \sum\limits_{m=-\infty}^\infty f[n-m]g[m] $$`

.center[
![:scale 80%](images/convolution.png)
]

???
The definition is symmetric in f and g, but usually one is the input signal, say f, and g is a fixed "filter" that is applied to it. You can imagine the convolution as g sliding over f. If the support of g is smaller than the support of f (it's a shorter non-zero sequence), then you can think of each entry of f * g as depending on all entries of g multiplied with a local window in f.

Note that the output is then shorter than the input, by the size of g minus one; this is called a valid convolution. We could also extend f with zeros and get a result that is larger than f by the size of g minus one; that's called a full convolution. We can also pad just enough to get an output of the same size as f; that's called a "same" convolution.

Also note that the filter g is flipped, as it's indexed with -m.

It's easier to think about this in a signal-processing context, where we have some signal and a filter that we want to apply to the signal.

---
#1d example: Gaussian smoothing

.center[
![:scale 80%](images/Gaussian_Smoothing.png)
]

???
This is how it looks for a 1D signal. I can try to smooth this signal out by, for example, using a Gaussian filter. Here, the top left would be f and the top right would be g. Computing the convolution of f and g means that each point in the new series is a Gaussian-weighted average of the surrounding points. The result is displayed at the bottom. Smoothing is one of the simplest things we can do with convolutions.

---
#2d Smoothing

.center[
![:scale 80%](images/2dsmoothing.png)
]

???
This can also be done in 2D.

---
#2d Gradients

.center[
![:scale 80%](images/2dgradient.png)
]

???
Gradients can also be computed with convolutions. One way to compute a smooth gradient of the image is using this filter. After applying the filter I get vertical edges at a particular scale, whereas areas of constant color give a flat response. You can also compute a Laplacian, and with filters like these you can detect simple patterns in images.
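To make the valid / same / full distinction from the convolution slide concrete, a tiny numpy sketch (the signal and filter values are arbitrary):

```python
import numpy as np

f = np.array([0., 1., 2., 3., 2., 1., 0.])   # signal of length 7
g = np.array([1., 2., 1.]) / 4.              # small smoothing filter of length 3

print(np.convolve(f, g, mode='valid').shape)  # (5,)  no padding: shorter by len(g)-1
print(np.convolve(f, g, mode='same').shape)   # (7,)  padded to keep the input length
print(np.convolve(f, g, mode='full').shape)   # (9,)  zero-padded on both sides
```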
So in a convolutional neural network, most layers apply convolutions of this form, and what we want to learn are the entries of these filters. The weights I'm going to learn are not a separate weight for each input, as in a dense layer, but the filters that are applied over the whole image. This gives us the two things we wanted, weight sharing and translation invariance: since the same filters are applied everywhere, detecting an edge at the top is the same as detecting an edge at the bottom. And because of the way this is computed, if I shift the image by one pixel, the output shifts by one pixel, but the correspondence between input and output stays the same.

---
#Convolutional Neural Networks

.center[
![:scale 100%](images/CNET1.png)
]

- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner: Gradient-based learning applied to document recognition

???
Here is the architecture of an early convolutional net from 1998. The basic architecture of current networks is still the same. You have multiple layers of convolutions and resampling operations. You start by convolving the image, which extracts local features. Each convolution creates new "feature maps" that serve as input to later convolutions. To allow more global operations, the image resolution is reduced after the convolutions; back then it was subsampling, today it is max-pooling. So you end up with more and more feature maps of lower and lower resolution. At the end, you have some fully connected layers to do the classification.

---
#Max Pooling

.center[
![:scale 100%](images/maxpool.png)
]

???
- Need to remember position of maximum for back-propagation.
- Again not differentiable → subgradient descent

One of the many things that have changed since this 1998 paper is that instead of averaging, we now usually do max pooling. Looking at the 4x4 example: instead of subsampling, which basically means averaging blocks of pixels, what people do now is take the maximum of each block. This is again used to reduce the size of the input; for example, with 2x2 non-overlapping max pooling we go from 4x4 to 2x2. People also sometimes do bigger or overlapping max pooling, and there are many different variants, but max pooling is now the usual way to decrease the resolution. (A tiny numpy sketch of 2x2 max pooling is included in the notes two slides further down.)

Max pooling is not differentiable, again similar to the relu, so strictly you need subgradient descent, but in reality you don't really need to worry about it. For backpropagation you do need to store where the maximum came from, though: if you only store the output, you can't backpropagate through it; you need to know that the 20 came from that particular position.

Q: Can we take the minimum instead of the maximum? Probably not if you're using rectified linear units, because the minimum will usually be zero; relu caps at zero. If you use tanh, everything is symmetric and it probably doesn't matter.

---
.center[
![:scale 60%](images/cnn_digits.png)
]

???
These are the activations.

---
# Deconvolution

.center[
![:scale 100%](images/deconvolution_1.png)
]

https://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf

???
If you do this on natural images, you get something like this. This is one of the standard neural networks, trained on the ImageNet dataset. You can look at the filters in the input layer, and because they live in image space, you can just visualize them. Here, this network has 9 filters in the first layer.
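(Aside, going back to the max-pooling slide: a minimal numpy sketch of 2x2 non-overlapping max pooling; the array values are arbitrary.)

```python
import numpy as np

x = np.arange(16).reshape(4, 4)
# Split the 4x4 map into non-overlapping 2x2 blocks and take each block's maximum.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled.shape)  # (2, 2)
```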
Back to the filter visualizations: for the higher layers it's harder to look at the filters, because they don't operate on the input space but on the space created by the previous filters. You can, however, try to project them back down into the original image, and if you do that, you get something like these. So this is a visualization of what the second layer learned, roughly what these units try to detect in the input, and for each of them you can see the patches in the training set that activated it most.

---
class: center, middle

![:scale 100%](images/deconvolution_2.png)

???
The goal is that with each layer going up, the network learns more and more abstract things. These are different units in the network and what they correspond to. This network learned to detect people even though there is no person class in ImageNet.

---
class: center, middle

![:scale 100%](images/deconvolution_3.png)

???
If you go even deeper into the network, you can see that the units correspond to even more abstract things. These are all learned just for the task of image classification: the network was only trained on the class labels, and there was no information about the location of anything in the image. The dogs in these images are of different breeds, the vehicles of different kinds, and so on. So there was never any location information during training, but the network still picked up a lot of structure.

---
class: middle

![:scale 100%](images/distill-feature-vis.png)

.smaller[https://distill.pub/2017/feature-visualization/]

???
This is a paper on feature visualization in neural networks. It uses GoogLeNet, also trained on ImageNet. What they're doing is taking the network, fixing the activation of a particular feature map, and then doing backpropagation through the network to adjust the input image so that this particular feature map is activated the most. In other words, you try to create an image that most activates a particular feature map or filter. This is one paper in a series about how to regularize this gradient descent on the input image to get nice-looking outputs.

These go deeper and deeper into the network: in the beginning you get edges of different frequencies and orientations, then something like textures, which are already quite rich, then larger textures, then more object-like things, and if you go further up, something that is very close to objects. This allows you to try to understand what the higher layers of these networks detect, and it's quite interesting because a lot of these become quite specific. And this is all learned in a completely supervised way: the only feedback the model got was learning to label the images correctly, yet we get very fine-grained features in the network.

---
.center[
![:scale 80%](images/other_architectures.png)
]

???
Here are two more recent architectures, AlexNet from 2012 and VGG net from 2015. These nets are typically very deep, but often have very small convolutions. In VGG there are 3x3 convolutions and even 1x1 convolutions, which serve to summarize multiple feature maps into one. There are often multiple convolutions without pooling in between, but pooling is definitely essential.

---
class:center,middle

#Conv-nets with keras

???
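Before we build our own small conv-net: if you want to inspect one of the big architectures from the previous slide yourself, Keras ships pretrained versions in keras.applications. A minimal sketch (downloading the ImageNet weights requires an internet connection and several hundred MB):

```python
from keras.applications.vgg16 import VGG16

vgg = VGG16(weights='imagenet')  # the 16-layer VGG network trained on ImageNet
vgg.summary()                    # lists all convolution, pooling and dense layers
```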
---
#Preparing Data

.smaller[
```python
batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

X_train_images = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
X_test_images = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
```
]

???
For convolutional nets the data has shape (n_samples, width, height, channels). MNIST has one channel because it's grayscale; often you have RGB channels, or possibly Lab. The position of the channels is configurable, using the "channels_first" and "channels_last" options – but you shouldn't have to worry about that.

---
# Create Tiny Network

```python
from keras.layers import Conv2D, MaxPooling2D, Flatten

num_classes = 10
cnn = Sequential()
cnn.add(Conv2D(32, kernel_size=(3, 3),
               activation='relu',
               input_shape=input_shape))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Conv2D(32, (3, 3), activation='relu'))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Flatten())
cnn.add(Dense(64, activation='relu'))
cnn.add(Dense(num_classes, activation='softmax'))
```

???
For convolutional nets we need 3 new layer types: Conv2D for 2d convolutions, MaxPooling2D for max pooling, and Flatten to reshape the input for a dense layer. There are many other options, but these are the most commonly used ones.

---
#Number of Parameters

.left-column[
Convolutional Network for MNIST
![:scale 100%](images/cnn_params_mnist.png)
]

.right-column[
Dense Network for MNIST
![:scale 100%](images/dense_params_mnist.png)
]

???
Convolutional networks have many more activations, but they have far fewer weights to learn, because the filter weights are shared over the whole image.

---
#Train and Evaluate

.smaller[
```python
cnn.compile("adam", "categorical_crossentropy", metrics=['accuracy'])
history_cnn = cnn.fit(X_train_images, y_train,
                      batch_size=128, epochs=20, verbose=1,
                      validation_split=.1)
cnn.evaluate(X_test_images, y_test)
```

```
 9952/10000 [============================>.] - ETA: 0s
[0.089020583277629253, 0.98429999999999995]
```
]

.center[
![:scale 50%](images/train_evaluate.png)
]

???
You get some overfitting, but generally the results are pretty decent.

---
#Visualize Filters

.smaller[
```python
weights, biases = cnn_small.layers[0].get_weights()
weights2, biases2 = cnn_small.layers[2].get_weights()
print(weights.shape)
print(weights2.shape)
```

```
(3,3,1,8)
(3,3,8,8)
```
]

.center[![:scale 40%](images/visualize_filters.png)]

???
These are the filters of the smaller cnn_small model, which has 8 filters per convolutional layer. The first layer learned eight 3x3 filters. The second layer maps these 8 feature maps to 8 new maps, so it has 8x8 = 64 filters of size 3x3. Each row corresponds to one output map.

---
.center[![:scale 80%](images/digits.png)]

???

---
class:center,middle

# Batch Normalization

???
The next thing I want to talk about is one more trick we can use to improve learning, called batch normalization. It's a heuristic people found that speeds up learning and often gives better results. The idea is that you want the hidden units to be well scaled, so batch normalization rescales the hidden units to zero mean and unit variance, computed per batch.

---
#Batch Normalization

.center[
![:scale 80%](images/batch_norm.png)
]

.smallest[
[Ioffe, Szegedy: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
]

???
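To make the slide concrete before going into the details, a minimal numpy sketch of what one batch-normalization step computes for a batch of activations during training (gamma and beta stand for the learned scale and shift; at test time, running averages of the batch statistics are used instead):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: activations for one batch, shape (batch_size, n_units)
    mean = x.mean(axis=0)                    # per-unit mean over the batch
    var = x.var(axis=0)                      # per-unit variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift
```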
Another relatively recent advance in neural networks is batch normalization. The idea is that neural networks learn best when the input has zero mean and unit variance, and we can scale the data to get that. But each layer inside a neural network is itself a neural network whose input is given by the previous layer, and that input might have a much larger or smaller scale (depending on the activation function). Batch normalization re-normalizes the activations for a layer for each batch during training (as the distribution over activations changes during training). This avoids saturation when using saturating activation functions. To keep the expressive power of the model, additional scale and shift parameters are learned that are applied after the per-batch normalization.

---
# Convnet with Batch Normalization

.smaller[
```python
from keras.layers import BatchNormalization

num_classes = 10
cnn_small_bn = Sequential()
cnn_small_bn.add(Conv2D(8, kernel_size=(3, 3), input_shape=input_shape))
cnn_small_bn.add(Activation("relu"))
cnn_small_bn.add(BatchNormalization())
cnn_small_bn.add(MaxPooling2D(pool_size=(2, 2)))
cnn_small_bn.add(Conv2D(8, (3, 3)))
cnn_small_bn.add(Activation("relu"))
cnn_small_bn.add(BatchNormalization())
cnn_small_bn.add(MaxPooling2D(pool_size=(2, 2)))
cnn_small_bn.add(Flatten())
cnn_small_bn.add(Dense(64, activation='relu'))
cnn_small_bn.add(Dense(num_classes, activation='softmax'))
```
]

???
This network is very small to make it fit on a slide.

---
# Learning speed and accuracy

.center[![:scale 80%](images/learning_speed.png)
]

???
FIXME label axes!
The solid lines are with batch normalization and the dotted lines are without. You can see that learning is much faster and also reaches a better accuracy.

---
# For larger net (64 filters)

.center[![:scale 80%](images/learning_speed_larger.png)
]

???
FIXME label axes
Here's the same thing with a larger network that has 64 filters instead of 8. It learns even faster, and learns to overfit really well.

---
class: middle

# Questions ?