Deep Learning Training Tricks¶
As neural networks become deeper, training them becomes more and more challenging. One major problem is so-called vanishing or exploding gradients. This post gives a good introduction to those problems.
To make training deep networks more efficient, there are a few techniques that can be used.
Keeping values in reasonable interval¶
To make numerical computations more stable, we want all values within our neural network to stay within a reasonable scale, typically [-1..1] or [0..1]. It is not a very strict requirement, but floating point arithmetic cannot accurately combine values of very different magnitudes. For example, if we add $10^{-10}$ and $10^{10}$, we are likely to get exactly $10^{10}$, because the smaller value is "converted" to the same order as the larger one, and its mantissa digits are lost.
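The absorption effect described above can be checked directly. This is a minimal sketch, assuming standard 64-bit Python floats (roughly 16 significant decimal digits); the variable names are illustrative only.

```python
big = 1e10
small = 1e-10

# The exact sum would need about 20 significant digits, but a 64-bit float
# keeps only about 16, so the small term is rounded away completely.
total = big + small
absorbed = (total == big)
print(absorbed)

# Values of comparable magnitude are combined without such loss.
close = 0.5 + 0.25
print(close)
```

With 32-bit floats, which are common in deep learning frameworks, the effect appears at much smaller differences in magnitude.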
Most activation functions have non-linearities around [-1..1], and thus it makes sense to scale all input data to [-1..1] or [0..1] interval.
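One common way to bring inputs into these intervals is min-max scaling. The sketch below is an assumption about how one might do it, not the only option; the helper names `scale_to_unit` and `scale_to_symmetric` are hypothetical.

```python
def scale_to_unit(values):
    """Linearly map values to the [0, 1] interval (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scale_to_symmetric(values):
    """Linearly map values to the [-1, 1] interval."""
    return [2.0 * v - 1.0 for v in scale_to_unit(values)]


pixels = [0, 64, 128, 255]          # e.g. raw 8-bit pixel intensities
unit = scale_to_unit(pixels)        # values in [0, 1]
sym = scale_to_symmetric(pixels)    # values in [-1, 1]
```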
Initial Weight Initialization¶
Ideally, we want the values to be in the same range after passing through network layers. Thus it is important to initialize weights in such a way as to preserve the distribution of values.
Drawing weights from the standard normal distribution N(0,1) is not a good idea: with n inputs of unit variance, the variance of the output would be n (a standard deviation of √n), and values are likely to jump out of the [0..1] interval.
The following initializations are often used:
- Uniform distribution -- `uniform`
- $N(0, 1/n)$ -- `gaussian`
- $N(0, 1/\sqrt{n_{in}})$ guarantees that for inputs with zero mean and a standard deviation of 1, the same mean and standard deviation are preserved
- $N(0, \sqrt{2/(n_{in}+n_{out})})$ -- so-called Xavier initialization (`glorot`); it helps to keep the signals in range during both forward and backward propagation
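The effect of the weight scale on the output can be estimated empirically. The following sketch simulates a single linear neuron with random unit-variance inputs and compares weights with standard deviation 1 against weights with standard deviation 1/√n_in. The function `output_std`, the trial count and the seed are illustrative assumptions.

```python
import math
import random
import statistics


def output_std(n_in, weight_std, trials=2000, seed=0):
    """Estimate the std of sum(w_i * x_i) for random inputs and weights."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]          # inputs ~ N(0, 1)
        w = [rng.gauss(0.0, weight_std) for _ in range(n_in)]   # weights ~ N(0, weight_std)
        outputs.append(sum(xi * wi for xi, wi in zip(x, w)))
    return statistics.pstdev(outputs)


n_in = 100
std_naive = output_std(n_in, 1.0)                       # expected near sqrt(n_in) = 10
std_scaled = output_std(n_in, 1.0 / math.sqrt(n_in))    # expected near 1
print(std_naive, std_scaled)
```

The estimates are statistical, so they will only be approximately equal to the expected values.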
Batch Normalization¶
Even with proper weight initialization, weights can get arbitrarily large or small during training, and they will bring signals out of the proper range. We can bring signals back by using one of the normalization techniques. While there are several of them (Weight Normalization, Layer Normalization), the most often used is Batch Normalization.
The idea of batch normalization is to take into account all values across the minibatch, and perform normalization (i.e. subtract the mean and divide by the standard deviation) based on those values. It is implemented as a network layer that does this normalization after applying the weights, but before the activation function. As a result, we are likely to see higher final accuracy and faster training.
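The normalization step itself can be sketched in a few lines. This simplified version only normalizes each feature over the minibatch; it omits the learnable scale and shift parameters and the running statistics that real batch normalization layers also maintain. The function name and the `eps` constant are assumptions made for illustration.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) over the minibatch (rows)."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    # eps keeps the division stable when a feature has near-zero variance.
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(n_features)]
        for row in batch
    ]


# Two features on very different scales end up on the same scale.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(batch)
```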
Here is the original paper on batch normalization, the explanation on Wikipedia, and a good introductory blog post (and the one in Russian).
Dropout¶
Dropout is an interesting technique that removes a certain percentage of random neurons during training. It is also implemented as a layer with one parameter (percentage of neurons to remove, typically 10%-50%), and during training it zeroes random elements of the input vector, before passing it to the next layer.
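A dropout layer can be sketched as follows. Note one assumption that goes beyond the description above: the surviving values are divided by the keep probability ("inverted dropout"), a common convention that keeps the expected activation unchanged so that nothing needs to be rescaled at inference time. The function name and signature are hypothetical.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Zero each element with probability `rate` during training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time the layer passes its input through unchanged.
        return list(values)
    keep = 1.0 - rate
    # Assumption: inverted dropout, survivors are scaled by 1 / keep.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```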
While this may sound like a strange idea, you can see the effect of dropout on training an MNIST digit classifier in the Dropout.ipynb notebook. It speeds up training and allows us to achieve higher accuracy in fewer training epochs.
This effect can be explained in several ways:
- It can be considered a random shock to the model, which takes the optimization out of a local minimum
- It can be considered implicit model averaging, because we can say that during dropout we are training a slightly different model at each step
Some people say that when a drunk person tries to learn something, he will remember it better the next morning than a sober person would, because a brain with some malfunctioning neurons tries harder to adapt and grasp the meaning. We never tested ourselves whether this is true or not.
Preventing overfitting¶
One very important aspect of deep learning is being able to prevent overfitting. While it might be tempting to use a very powerful neural network model, we should always balance the number of model parameters with the number of training samples.
Make sure you understand the concept of overfitting we have introduced earlier!
There are several ways to prevent overfitting:
- Early stopping -- continuously monitor the error on the validation set and stop training when the validation error starts to increase.
- Explicit Weight Decay / Regularization -- add an extra penalty to the loss function for high absolute values of weights, which prevents the model from producing very unstable results.
- Model Averaging -- train several models and then average the results. This helps to minimize the variance.
- Dropout (Implicit Model Averaging)
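As an illustration of the first item in the list, early stopping can be sketched as a small function over a sequence of validation losses. The `patience` parameter (how many non-improving epochs to tolerate before stopping) is an assumption commonly used in practice, not something prescribed above, and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0          # validation error improved, reset the counter
        else:
            bad_epochs += 1         # validation error did not improve
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1      # never triggered: train to the end


# Validation loss improves for three epochs, then starts to increase.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_at = early_stopping_epoch(losses, patience=2)
print(stop_at)
```

In a real training loop one would usually also keep a copy of the weights from the best epoch and restore them after stopping.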
Optimizers / Training Algorithms¶
Another important aspect of training is to choose a good training algorithm. While classical gradient descent is a reasonable choice, it can sometimes be too slow, or result in other problems.
In deep learning, we use Stochastic Gradient Descent (SGD), which is gradient descent applied to minibatches randomly selected from the training set. Weights are adjusted using this formula:
$w^{t+1} = w^t - \eta\nabla\mathcal{L}$
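The update rule above can be tried on a toy problem. The sketch below fits a single weight `w` in the model y = w·x to synthetic data generated with w = 2, using mean squared error on random minibatches. The data, batch size, learning rate and step count are all illustrative assumptions.

```python
import random


def sgd_step(w, grad, lr):
    """One SGD update: move against the gradient, scaled by the learning rate."""
    return w - lr * grad


rng = random.Random(0)
# Synthetic training set for y = 2 * x.
data = [(i / 10, 2.0 * i / 10) for i in range(1, 21)]

w = 0.0
for step in range(200):
    batch = rng.sample(data, 4)     # randomly selected minibatch
    # Gradient of the mean squared error (w * x - y)^2 with respect to w.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    w = sgd_step(w, grad, lr=0.05)

print(w)    # close to the true value 2
```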
Momentum¶
In momentum SGD, we keep a portion of the gradient from previous steps. It is similar to moving with inertia: when we receive a punch in a different direction, our trajectory does not change immediately, but keeps some part of the original movement. Here we introduce another vector v to represent the speed:
- $v^{t+1} = \gamma v^t - \eta\nabla\mathcal{L}$
- $w^{t+1} = w^t + v^{t+1}$
Here parameter γ indicates the extent to which we take inertia into account: γ=0 corresponds to classical SGD; γ=1 is a pure motion equation.
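The two momentum equations translate directly into code. In this sketch they are applied to a one-dimensional toy loss (w - 3)², whose gradient is 2(w - 3); the loss, learning rate and γ = 0.9 are assumptions chosen for illustration.

```python
def momentum_step(w, v, grad, lr, gamma):
    """One momentum SGD update, returning the new weight and velocity."""
    v_next = gamma * v - lr * grad   # keep a portion of the previous velocity
    w_next = w + v_next
    return w_next, v_next


# Toy loss (w - 3)^2 with its minimum at w = 3.
w, v = 0.0, 0.0
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    w, v = momentum_step(w, v, grad, lr=0.1, gamma=0.9)

print(w)    # close to 3
```

Setting `gamma=0` discards the previous velocity, so the step reduces to the plain SGD update.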
Adam, Adagrad, etc.¶
Since in each layer we multiply signals by some matrix $W_i$, depending on $\|W_i\|$, the gradient can either diminish towards 0 or grow without bound. This is the essence of the exploding/vanishing gradients problem.
One of the solutions to this problem is to use only the direction of the gradient in the update and ignore its absolute value, i.e.
$w^{t+1} = w^t - \eta\frac{\nabla\mathcal{L}}{\|\nabla\mathcal{L}\|}$, where $\|\nabla\mathcal{L}\| = \sqrt{\sum(\nabla\mathcal{L})^2}$
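This idea underlies Adagrad, which in practice divides each component of the gradient by the square root of that component's squared gradients accumulated over previous steps. Other algorithms that use the same idea are RMSProp and Adam.
Adam is considered to be a very efficient algorithm for many applications, so if you are not sure which one to use, use Adam.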
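To make the "direction only" idea more concrete, here is a rough sketch of an Adagrad-style step in its usual per-parameter form, where each gradient component is divided by the square root of its accumulated squared values. The function name, the `eps` constant and the example numbers are assumptions made for illustration.

```python
import math


def adagrad_step(w, grad, accum, lr=0.5, eps=1e-8):
    """One Adagrad-style update over lists of weights and gradients."""
    # Accumulate squared gradients separately for every parameter.
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_w = [
        wi - lr * g / (math.sqrt(a) + eps)
        for wi, g, a in zip(w, grad, new_accum)
    ]
    return new_w, new_accum


w = [1.0, 1.0]
accum = [0.0, 0.0]
grad = [1000.0, 0.001]      # components of wildly different magnitude
w1, accum = adagrad_step(w, grad, accum)
print(w1)
```

On the first step both weights move by roughly the same amount (about `lr`), even though the gradient components differ by six orders of magnitude.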
Gradient clipping¶
Gradient clipping is an extension of the idea above. When $\|\nabla\mathcal{L}\| \le \theta$, we use the original gradient in the weight update, and when $\|\nabla\mathcal{L}\| > \theta$, we divide the gradient by its norm. Here θ is a parameter; in most cases we can take θ=1 or θ=10.
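A possible implementation of clipping by norm is sketched below. It uses the common variant in which a too-large gradient is rescaled so that its norm becomes exactly θ; for θ=1 this coincides with simply dividing the gradient by its norm, as described above. The function name is hypothetical.

```python
import math


def clip_by_norm(grad, theta):
    """Leave small gradients untouched; rescale large ones to norm theta."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= theta:
        return list(grad)
    # Assumption: rescale to norm theta (equals g / norm when theta == 1).
    return [g * theta / norm for g in grad]


small = clip_by_norm([0.3, 0.4], theta=1.0)     # norm 0.5, kept as is
large = clip_by_norm([30.0, 40.0], theta=1.0)   # norm 50, rescaled
print(small, large)
```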
Learning rate decay¶
Training success often depends on the learning rate parameter η. It is logical to assume that larger values of η result in faster training, which is something we typically want at the beginning of training, while smaller values of η allow us to fine-tune the network. Thus, in most cases we want to decrease η as training progresses.
This can be done by multiplying η by some number (e.g. 0.98) after each epoch of training, or by using a more complicated learning rate schedule.
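The simple multiplicative schedule can be written as a one-line function. The initial learning rate of 0.1 and the number of epochs are assumptions for the sake of the example; the factor 0.98 is the one mentioned above.

```python
def exponential_decay(initial_lr, decay, epoch):
    """Learning rate after multiplying by `decay` once per completed epoch."""
    return initial_lr * decay ** epoch


# Learning rate for each of the first 50 epochs.
schedule = [exponential_decay(0.1, 0.98, e) for e in range(50)]
print(schedule[0], schedule[-1])
```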
Different Network Architectures¶
Selecting the right network architecture for your problem can be tricky. Normally, we would take an architecture that has proven to work for our specific task (or a similar one). Here is a good overview of neural network architectures for computer vision.
It is important to select an architecture that will be powerful enough for the number of training samples that we have. Selecting too powerful a model can result in overfitting.
Another good way would be to use an architecture that will automatically adjust to the required complexity. To some extent, the ResNet and Inception architectures are self-adjusting. More on computer vision architectures