My mind cannot forget major mistakes, though it has the awareness not to be halted by them – even when it wants to be.
When training deep learning models, it is often beneficial to have tools that provide real-time feedback and allow you to intervene or make adjustments as needed.
Keras callbacks are a set of tools that let programmers customize the behavior of their model during training, evaluation, or inference. Specifically, they can be used to track a model's training progress, periodically save its weights for later reuse, stop training early to prevent overfitting, dynamically adjust hyperparameters based on its performance, and visualize its training process.
Several callback functions include:
EarlyStopping: stops training if the validation loss doesn't improve for a certain number of epochs.
ModelCheckpoint: saves the model at regular intervals or when a performance metric improves.
ReduceLROnPlateau: reduces the learning rate when the validation loss plateaus.
EarlyStopping monitors a model's validation loss during training and stops the training process if the loss does not improve for a specified number of epochs. Doing so saves time while preventing the model from memorizing the training data too closely, which improves its generalization performance.
In practice, EarlyStopping does not improve the model or its outputs; it merely stops the model before it starts producing worse results.
An attractive feature of EarlyStopping is that it can be told which metric to monitor, stopping the current run before the loss function starts to increase and the accuracy starts to decrease.
A callback is activated by passing it to the callbacks parameter of model.fit().
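For example, a minimal sketch might look like the following (the model and training data are placeholders assumed to be defined elsewhere):

```python
from tensorflow import keras

# Stop training when validation loss has not improved for 5 epochs,
# and restore the weights from the best epoch seen so far.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)

# `model`, `x_train`, and `y_train` are hypothetical placeholders.
model.fit(
    x_train, y_train,
    validation_split=0.2,
    epochs=50,
    callbacks=[early_stopping],
)
```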
Compare the abruptly ended output on the left with the outcome of the callback-free model on the right. The run with EarlyStopping active converges less fully, but it comes close to the 50-epoch model in much less time.
ModelCheckpoint allows you to save the model at regular intervals or when a performance metric improves. If the training process is ever interrupted, you can resume training from the last saved checkpoint.
You can also save checkpoints for different hyperparameter configurations and compare their performance.
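As a rough sketch (again with a placeholder model and data), saving and later restoring a checkpoint could look like this:

```python
from tensorflow import keras

# Save the model whenever the monitored metric improves;
# save_best_only=True keeps only the best checkpoint on disk.
checkpoint = keras.callbacks.ModelCheckpoint(
    filepath="best_model.keras",
    monitor="val_loss",
    save_best_only=True,
)

# `model`, `x_train`, and `y_train` are hypothetical placeholders.
model.fit(
    x_train, y_train,
    validation_split=0.2,
    epochs=50,
    callbacks=[checkpoint],
)

# If training is ever interrupted, resume from the saved checkpoint.
model = keras.models.load_model("best_model.keras")
```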
This exercise involves two tasks: comparing the outputs of two ModelCheckpoint callbacks with save_best_only set to True or False, and testing the saved models on the CIFAR-10 dataset with different seeds.
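A possible setup for the comparison, assuming the CIFAR-10 data and the compiled model are already prepared, is sketched below; the file names and the seed value are illustrative:

```python
from tensorflow import keras

keras.utils.set_random_seed(42)  # illustrative seed for reproducibility

# Two checkpoints: one keeps only the best epoch, the other
# overwrites its file at the end of every epoch.
best_ckpt = keras.callbacks.ModelCheckpoint(
    "best.keras", monitor="val_accuracy", save_best_only=True)
last_ckpt = keras.callbacks.ModelCheckpoint(
    "last.keras", save_best_only=False)

model.fit(x_train, y_train, validation_split=0.2,
          epochs=30, callbacks=[best_ckpt, last_ckpt])

# Evaluate both saved models on the CIFAR-10 test split,
# assuming the model was compiled with an accuracy metric.
for path in ("best.keras", "last.keras"):
    saved = keras.models.load_model(path)
    loss, acc = saved.evaluate(x_test, y_test, verbose=0)
    print(path, acc)
```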
Here are the validation results of the best saved model for two different seeds from the same CIFAR-10 dataset:
The non-best saved model performs only marginally less accurately, but the gap in accuracy is nevertheless present.
ReduceLROnPlateau automatically reduces the learning rate during training if the validation loss plateaus (stagnant convergence) to prevent overfitting and improve convergence.
ReduceLROnPlateau and the learning rate decay parameter both adjust the learning rate during training, but they do so differently:
Condition: ReduceLROnPlateau reduces the learning rate when the monitored metric (such as the validation loss) plateaus, while learning rate decay does so on a predefined schedule.
Flexibility: ReduceLROnPlateau offers more flexibility in terms of defining the reduction criteria and the amount of reduction, while learning rate decay follows a fixed reduction factor at regular intervals.
Influence: unlike ReduceLROnPlateau, learning rate decay does not factor in model performance during training (see the fixed-schedule sketch below).
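For contrast, a fixed decay schedule is attached directly to the optimizer rather than passed as a callback; this sketch uses illustrative values:

```python
from tensorflow import keras

# Exponential decay: the learning rate is multiplied by 0.9 every
# 1000 steps, regardless of how the model is performing.
schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.9,
)
optimizer = keras.optimizers.SGD(learning_rate=schedule)
```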
Much like EarlyStopping, ReduceLROnPlateau requires a metric to monitor and a patience number (of epochs). Two other parameters of the function, illustrated in the sketch after this list, are:
Factor: the multiplier applied to the learning rate when it is reduced (new_lr = lr * factor).
Minimum learning rate (min_lr): specifies the lowest learning rate that the optimizer will use.
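Putting these parameters together, a minimal sketch (placeholder model and data again) might be:

```python
from tensorflow import keras

# Halve the learning rate if validation loss has not improved
# for 3 epochs, but never drop below 1e-6.
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=3,
    min_lr=1e-6,
)

# `model`, `x_train`, and `y_train` are hypothetical placeholders.
model.fit(
    x_train, y_train,
    validation_split=0.2,
    epochs=50,
    callbacks=[reduce_lr],
)
```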
From a surface-level perspective, changes to the callback's factor and patience have barely any visible influence on the outputs of an SGD-optimized neural network.
With Adam, only the outputs at a factor of 0.75 differ from the others.
Finally, RMSprop's outputs remain essentially unchanged when plotted as curves.
Looking at past and present exercise results on CIFAR-10, the similarity of the outputs despite the use of different optimizers and ReduceLROnPlateau could be explained by:
Sufficient Learning Rate: the initial learning rate might already be well-suited for the model and dataset, making ReduceLROnPlateau less necessary.
Optimizer Behavior: the chosen optimizer (SGD, Adam, RMSprop) might inherently handle learning rate adjustments effectively, reducing the impact of ReduceLROnPlateau.
Plateau Length: the patience parameter might be longer than the plateaus that actually occur, preventing the callback from triggering a learning rate reduction when necessary.
Data Characteristics: the CIFAR-10 dataset might have specific characteristics that make it less sensitive to learning rate adjustments.