The upcoming chapters are very short, so it is more informative to combine them into this single section.
By understanding the key components of a deep learning model and their interactions, we can gain valuable insights into how they influence model performance and the likelihood of overfitting. Through careful experimentation and analysis, we can optimize our machine learning training workflows to achieve better results.
Without regularization, a deep learning model can become overly specialized to the training data, leading to overfitting. As discussed further on 26th August, overfitting occurs when a model has memorized its training data too well, making it struggle to generalize or perform well on new data.
Using the two neural networks below as a reference, notice their weights (W): the left network has larger weights than the right one. Regularization encourages reducing the magnitude of a model's weights to prevent overfitting, which in turn helps the model generalize better to new data.
You will recall two variants of regularization from 18th July:
Lasso Regression (L1 Regularization): encourages sparsity in the model's weights, meaning many weights become zero. Can be used for feature selection, as the regularization term tends to shrink the weights of less important features to zero.
Ridge Regression (L2 Regularization): encourages smaller weights, which can help prevent overfitting and improve generalization. Adds a penalty term proportional to the squared magnitude of the weights. Tends to produce smoother decision boundaries. (Both penalty terms are written out just after this list.)
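In standard notation, where L_0 is the unregularized data loss, w_i are the model's weights, and λ controls the regularization strength, the two penalties are:

```latex
L_{\mathrm{L1}} = L_0 + \lambda \sum_i \lvert w_i \rvert
\qquad\qquad
L_{\mathrm{L2}} = L_0 + \lambda \sum_i w_i^2
```

The absolute value gives L1 a gradient of constant magnitude, which is what pushes small weights all the way to zero, while the squared term shrinks large weights proportionally harder without zeroing them out.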
For this exercise, we will begin with L1 regularization. Simply add a new kernel_regularizer parameter in the layers of the previously mentioned multilayer perceptron (MLP) function template.
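As a rough sketch (the builder name, layer sizes, and input shape below are illustrative assumptions rather than the exact template from the earlier post), the change amounts to passing kernel_regularizer to each hidden Dense layer:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_mlp(reg_value):
    # Hypothetical stand-in for the MLP template from the earlier post;
    # layer sizes and input shape are assumptions for illustration.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),  # e.g. a flattened 28x28 image
        layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l1(reg_value)),
        layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l1(reg_value)),
        layers.Dense(10, activation="softmax"),  # assumed 10-class output
    ])
    return model
```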
Each regularization type will be run through 4 different, increasingly smaller experiment values. In theory, a lower regularization value for both L1 and L2 allows the model to learn more complex patterns, improving its performance on the training set.
As for the other hyperparameters, this neural network will run on an SGD optimizer with a learning rate of 0.001 (1e-3), a batch size of 256, and Nesterov momentum set to 0.95, for 50 epochs.
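Putting it all together, the experiment loop might look like the following sketch; the four regularization values and the X_train/y_train arrays are placeholders, not the actual values from this post:

```python
for reg_value in [1e-1, 1e-2, 1e-3, 1e-4]:  # assumed: 4 increasingly smaller values
    model = build_mlp(reg_value)
    model.compile(
        # Hyperparameters as stated above.
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3,
                                          momentum=0.95, nesterov=True),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(X_train, y_train, batch_size=256, epochs=50, validation_split=0.1)
```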
Here are the outputs of L1 regularization at each experiment value:
If you want to run L2 regularization for your layers, just change every instance of l1 to l2 in the kernel_regularizer argument.
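For example, on a single illustrative layer (the 0.01 is an arbitrary example value):

```python
from tensorflow.keras import layers, regularizers

# The only change from the L1 version: regularizers.l1 becomes regularizers.l2.
layer = layers.Dense(128, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01))
```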
Now here are the outputs of L2 regularization at each experiment value:
Dropout is a regularization technique in deep learning that randomly sets a fraction of neurons to zero during training, forcing the network to learn more robust features and preventing it from relying too heavily on any particular neuron, which would otherwise lead to overfitting.
In the neural network pictured below, the squares represent dropout masks that randomly set some neurons to zero during training, effectively 'dropping out' their contributions. By doing so, you test whether each neuron has been trained to be resilient and adaptable.
Unlike the kernel_regularizer parameter, dropout is added as its own line of code written underneath the layer code. Before that, you need to set up a dropout rate variable to determine the percentage of neurons that will be set to zero; you can experiment with other percentages to discover the optimal dropout rate for your model.
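A minimal sketch, again assuming the same hypothetical MLP shape as in the regularization exercise (the 0.2 starting rate is also an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

dropout_rate = 0.2  # assumed starting rate: 20% of neurons zeroed during training

def build_mlp_dropout(rate=dropout_rate):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),  # assumed input shape
        layers.Dense(128, activation="relu"),
        layers.Dropout(rate),  # the extra line underneath the layer code
        layers.Dense(64, activation="relu"),
        layers.Dropout(rate),
        layers.Dense(10, activation="softmax"),
    ])
    return model
```

Training then proceeds exactly as in the regularization loop above, iterating over a list of dropout rates instead of regularization values.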
Like in the regularization exercise above, we will run our dropout-enabled neural network through different dropout rates.
Here are the outputs of each model at each dropout rate (DRP):
As valuable as regularization is in deep learning, its methods (L1 and L2) still share shortcomings in certain areas:
Sensitivity to Weight Initialization: the effectiveness of regularization can depend on the initial values of the model's weights.
Covariate Shift: changes in the distribution of input data during training can affect the model's performance, even with regularization.
Internal Covariate Shift: variations in the distribution of activations within the network can also impact training.
A technique known as batch normalization can address the problems above by providing the following:
Normalizing Activations: despite its name, it standardizes the activations of each layer to have a mean of 0 and a standard deviation of 1.
Reducing Internal Covariate Shift: helps to stabilize the training process and improve convergence.
Regularization Effect: acts as a form of regularization, reducing the sensitivity of the model to small changes in input data.
Batch normalization progresses through these steps (a sketch of them follows the list):
Calculate mean and variance: for each batch of training data, the mean and variance of the activations are calculated.
Normalize activations: the activations are normalized using the calculated mean and variance to have a mean of 0 and a standard deviation of 1 (aka standardization).
Scale and shift: the normalized activations are scaled and shifted using learnable parameters.
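A minimal numpy sketch of these three steps; gamma and beta stand in for the learnable scale and shift parameters, and eps is a small constant that guards against division by zero:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features); statistics are computed per feature.
    mean = x.mean(axis=0)                    # step 1: batch mean
    var = x.var(axis=0)                      # step 1: batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # step 2: standardize to mean 0, std 1
    return gamma * x_hat + beta              # step 3: scale and shift
```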
More on the misleading name of batch normalization: while the nouns normalization, standardization, and scaling have recognized typical meanings, they are used interchangeably in machine learning, where their definitions change according to local context.
The method of implementing batch normalization is similar to that of dropout: you store the dense layer's output in a variable, pass it through the batch normalization layer, and refresh that variable with the post-normalization result.
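A sketch in the Keras functional style, again with assumed layer sizes, where the variable x is refreshed with the normalized output:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mlp_batchnorm():
    inputs = tf.keras.Input(shape=(784,))  # assumed input shape
    x = layers.Dense(128, activation="relu")(inputs)  # store the dense output in x
    x = layers.BatchNormalization()(x)  # refresh x with the post-normalization variant
    x = layers.Dense(64, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    outputs = layers.Dense(10, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```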
For this exercise, we will run 5 different batch sizes to determine their effect on the progression and output of our SGD-with-Nesterov-momentum neural network.
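A sketch of the experiment loop; the five batch sizes below are assumptions, and the rest of the setup reuses the earlier sketches:

```python
for batch_size in [32, 64, 128, 256, 512]:  # assumed: 5 batch sizes under test
    model = build_mlp_batchnorm()
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3,
                                          momentum=0.95, nesterov=True),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(X_train, y_train, batch_size=batch_size,
              epochs=50, validation_split=0.1)
```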
Here are the outputs from each batch size (BS):