Techniques to Improve Neural Networks
Explore effective techniques to improve neural networks, including smart weight initialization, advanced gradient descent optimizers like momentum and RMSprop, and powerful regularization methods such as dropout and batch normalization. Understand how these strategies help stabilize training, prevent common issues, and enhance accuracy in deep learning models.
Picking the right activation functions is crucial, but when we design a neural network, we face plenty more decisions. We decide how to initialize the weights, which GD algorithm to use, what kind of regularization to apply, and so forth. We have a wide range of techniques to choose from, and new ones come up all the time.
It would be pointless to go into too much detail about all the popular techniques available today. We could fill entire volumes with them. Besides, some of them might be old-fashioned and quaint by the time we complete this course.
For those reasons, this lesson is not comprehensive. We’ll learn about a handful of techniques that generally work well — a starter’s kit in our journey to ML mastery. At the end of this chapter, we’ll also get a chance to test these techniques firsthand.
Let’s start with weight initialization.
Better weight initialization
In Initializing the Weights, we learned that to avoid squandering a neural network’s power, we should initialize its weights with values that are random and small.
However, that random and small principle does not give us concrete numbers. For that, we can use a formula such as Xavier initialization, also known as Glorot initialization. (Both names come from Xavier Glorot, the researcher who proposed it.)
Xavier initialization comes in a few variants. They all give us an approximate range to initialize the weights, based on the number of nodes connected to them.
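For example, Keras’s default glorot_uniform initializer samples each weight uniformly from a range proportional to √(6 / (nin + nout)), where nin and nout are the numbers of nodes feeding into and out of the layer. Here is a minimal numpy sketch of that rule (not Keras’s actual code):

```python
import numpy as np

def glorot_uniform(n_in, n_out):
    # Glorot/Xavier uniform initialization: sample weights from a
    # range that shrinks as the number of connected nodes grows.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

# A small layer gets a wide range, a big layer a narrow one:
glorot_uniform(10, 10)      # weights roughly within (-0.55, 0.55)
glorot_uniform(1000, 1000)  # weights roughly within (-0.055, 0.055)
```

Note how the range depends only on the layer’s size, not on the data.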
The core concept of Xavier initialization is that the more nodes a layer has, the smaller its weights. Intuitively, that means the weighted sum of the nodes stays about the same size, no matter how many nodes the layer contains. Without Xavier initialization, a layer with many nodes would generate a large weighted sum, and that large number could cause problems such as dead neurons and vanishing or exploding gradients.
Even though we have not mentioned Xavier initialization until now, we have already been using it: it is the default initializer in Keras. If we want to replace it with another initialization method (Keras has a few), we can use the kernel_initializer argument. For example, here is a layer that uses an alternative weight initialization method called he_normal:
model.add(Dense(100, kernel_initializer='he_normal'))
Changing gradient descent
If one thing has stayed unchanged throughout this course, it’s the gradient descent algorithm. We changed the way we compute the gradient, from simple derivatives to backpropagation, but so far, the “descent” part is the same as we introduced in the first chapters: multiply the gradient by the learning rate and take a step in the opposite direction.
However, modern GD can be subtler than that. In Keras, we can pass additional parameters to the SGD optimizer:
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.1, decay=1e-6, momentum=0.9),
              metrics=['accuracy'])
This code includes two new hyperparameters that tweak SGD. To understand decay, remember that the learning rate is a trade-off: the smaller it is, the smaller each step of GD, which makes the algorithm more precise but also slower. When we use decay, the learning rate decreases a bit at each step. A well-configured decay causes GD to take big leaps at the beginning of training, when we usually need speed, and baby steps near the end, when we would rather have precision. This twist on GD is called learning rate decay.
The momentum hyperparameter is even subtler. When we introduced GD, we learned that the algorithm has trouble with certain surfaces. For example, it might get stuck in local minima, that is, “holes” in the loss surface. Another troublesome situation can happen around canyons like the one shown in the following diagram:
GD always moves downhill in the direction of the steepest gradient. In the upper part of this surface, the canyon’s walls are steeper than the path toward the minimum, so GD ends up bouncing back and forth between those walls, barely moving toward the minimum at all.
For this example, we drew a hypothetical path on a three-dimensional surface. However, cases such as this one are common in real life on higher-dimensional loss surfaces. When they happen, the loss might stop decreasing for many epochs in a row, leading us to believe that GD has reached a minimum and to give up on training.
That’s where the momentum algorithm enters the scene. It counters this situation by adding an acceleration component to GD, which makes for a smoother, less jagged path, as shown in the diagram here:
Momentum can speed up training tremendously. In some cases, it may even help GD zip over local minima, propelling it toward the lowest loss. The result is not only faster training, but also higher accuracy.
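To see the idea in code, here is one plain SGD-with-momentum update, written as a sketch rather than Keras’s actual implementation:

```python
def momentum_step(w, gradient, velocity, lr=0.1, momentum=0.9):
    # Instead of stepping along the raw gradient, keep a running
    # "velocity" that accumulates past gradients. Oscillating
    # components cancel out; consistent ones build up speed.
    velocity = momentum * velocity - lr * gradient
    return w + velocity, velocity
```

When the gradient keeps pointing the same way (as along a canyon floor), the velocity grows and GD accelerates; when the gradient flips sign at each step (as when bouncing between canyon walls), the contributions largely cancel.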
In Keras, decay and momentum are additional parameters to the standard SGD algorithm. However, Keras also comes with entirely different implementations of GD, which it calls optimizers. One of those alternatives to SGD is the RMSprop optimizer, which implements a concept similar to momentum. RMSprop often makes training markedly faster and more efficient than plain SGD:
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['accuracy'])
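The core of RMSprop can be sketched in a few lines (again, this is an illustration of the idea, not Keras’s code):

```python
import numpy as np

def rmsprop_step(w, gradient, cache, lr=0.001, rho=0.9, eps=1e-7):
    # Keep a running average of squared gradients, and divide each
    # weight's step by its square root: weights with consistently
    # large gradients get smaller steps, and vice versa.
    cache = rho * cache + (1.0 - rho) * gradient ** 2
    w = w - lr * gradient / (np.sqrt(cache) + eps)
    return w, cache
```

The effect is a per-weight learning rate that adapts automatically as training progresses.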
That’s it about optimizers. If we want to do some research on our own, another optimizer worth checking out is called Adam. It’s very popular these days and merges momentum and RMSprop into one mean algorithm.
Advanced regularization
When it comes to overfitting, deep neural networks need all the help they can get. In the previous chapter, we learned about the classic L1 and L2 regularization techniques. However, more modern techniques often work better. One, in particular, called dropout, is very effective and also somewhat weird.
Dropout reduces overfitting by randomly turning off some nodes in the network. We can picture dropout as a filter attached to a layer that randomly disconnects some of its nodes during each training iteration, as illustrated by the following diagram:
Disconnected nodes do not impact the next layers and are ignored by backpropagation. It’s like they cease to exist until the next iteration.
To use dropout in Keras, we add a Dropout layer after a regular hidden layer, specifying the fraction of nodes to turn off at each iteration, such as 25%.
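In Keras, that means inserting a line like model.add(Dropout(0.25)) after the hidden layer. Mechanically, what a dropout filter does during training looks roughly like this numpy sketch (“inverted” dropout, which also rescales the surviving nodes so the layer’s expected output stays the same):

```python
import numpy as np

def dropout(activations, rate=0.25, training=True):
    # A sketch of an (inverted) dropout filter, not Keras's code.
    if not training:
        return activations  # during prediction, all nodes stay on
    # Randomly disconnect `rate` of the nodes...
    mask = np.random.rand(*activations.shape) >= rate
    # ...and scale up the survivors so the expected output is unchanged:
    return activations * mask / (1.0 - rate)
```

Note the training flag: dropout only happens while training, never during prediction.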
We have just learned how dropout works, but not why it works. It’s hard to understand intuitively why dropout reduces overfitting, but here is one way to think about it: dropout forces the network to learn slightly differently at each iteration of training. In a sense, dropout reshapes one big network into many smaller networks, each of which might learn a different facet of the data. Where a big network is prone to memorizing the training set, each small network ends up learning the data its own way, and their combined knowledge is less likely to overfit the data.
That’s only one of a few possible ways to explain the effect of dropout. Whatever our intuitive understanding is, dropout works, and that’s what counts. It’s one of the first regularization techniques we reach for in the presence of overfitting.
Speaking of things that work even though it’s hard to see why, there is one last technique we’ll discuss in this chapter: batch normalization.
When it was introduced (around 2015), batch normalization was hailed as a breakthrough. We might say that it’s an advanced technique, but even beginners are keen to use it because it works so well. It often improves a network’s accuracy and sometimes even speeds up training and reduces overfitting. There is no such thing as an easy win in deep learning, but batch normalization comes as close as anything.
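In Keras, batch normalization is just another layer (BatchNormalization) that we insert between other layers. Under the hood, the training-time computation is roughly this sketch: standardize each feature over the current batch, then rescale it with two learned parameters:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Training-time batch normalization, as a sketch. gamma and beta
    # are learned alongside the weights; eps avoids dividing by zero
    # on constant features.
    mean = x.mean(axis=0)       # per-feature mean over the batch
    variance = x.var(axis=0)    # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(variance + eps)
    return gamma * x_hat + beta
```

By keeping each layer’s inputs on a standard scale, batch normalization helps gradients flow and makes training less sensitive to initialization and learning rate.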
Summary
Deep neural networks can be hard to train. In this chapter, we learned a few useful techniques to tame them.
We started with a lengthy discussion of activation functions. We learned that nonlinear activation functions are a necessity in neural networks, but we must choose them carefully. So far, we had used sigmoids inside the network, but sigmoids can cause a number of problems as the network gets deeper: dead neurons, vanishing gradients, and exploding gradients. For that reason, we looked at a few alternatives to the sigmoid, in particular the popular ReLU activation function.
After that discussion of activation functions, we surveyed a number of other techniques that help us tame deep neural networks:
- Xavier initialization to initialize a neural network’s weights
- A handful of advanced GD algorithms: learning rate decay, momentum, RMSprop, and Adam
- A brilliantly counterintuitive regularization technique called dropout
- The extremely useful technique called batch normalization