When you relax, are you truly attuned to your desire to relax? In real-life practice, that is not always the case.
At the heart of training complex systems to learn from data lies a critical component that quantifies the efficacy of their predictions. This metric, often expressed as a numerical value, serves as a guiding star, directing the learning process towards refinement.
In machine learning, every algorithm aims to optimize an objective function, which measures the model's performance. The objective function is typically minimized, and in this context, it is referred to as a loss function. The loss function evaluates the model's predictive ability by quantifying the difference between predicted outputs and actual targets.
Loss functions can be broadly categorized into two types:
Classification Loss Functions: used for classification problems, where the goal is to predict a discrete label or class. Examples include cross-entropy loss, binary cross-entropy loss, and hinge loss.
Regression Loss Functions: used for regression problems, where the goal is to predict a continuous value. Examples include mean squared error (MSE), mean absolute error (MAE), and Huber loss.
We expect the model to predict values that match the actual targets as closely as possible. In practice, however, there will almost always be some discrepancy between the predicted values (ŷ) and the actual values (y).
In regression problems, this discrepancy is measured as residuals, representing the difference between predicted and actual continuous values. In classification problems, this discrepancy is measured as the error rate, representing the proportion of misclassified instances.
As mentioned earlier, the loss function quantifies this discrepancy by calculating the difference between actual and predicted values. Training then aims to minimize this value, making the model's predictions as accurate as possible.
The mean squared error (MSE), closely related to the method of least squares, measures the average squared difference between predicted values (ŷ) and actual values (y). It is suited for regression problems, where the goal is to predict continuous numerical results. Other related functions include the following (a small numerical sketch follows the list):
Mean Absolute Error (MAE): measures the average absolute difference between predicted and actual values.
Mean Absolute Percentage Error (MAPE): measures the average absolute percentage difference between predicted and actual values.
Mean Squared Logarithmic Error (MSLE): measures the average squared difference between the logarithms of predicted and actual values.
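To make these definitions concrete, here is a minimal NumPy sketch that computes each of the regression losses above on a handful of made-up values (the numbers are purely illustrative):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (y)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # predicted values (y-hat)

mse  = np.mean((y_true - y_pred) ** 2)                      # mean squared error
mae  = np.mean(np.abs(y_true - y_pred))                     # mean absolute error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100    # mean absolute percentage error
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)  # mean squared logarithmic error

print(f"MSE={mse:.3f}  MAE={mae:.3f}  MAPE={mape:.1f}%  MSLE={msle:.3f}")
```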
The cross-entropy loss function measures the difference between predicted probabilities (ŷ) and actual labels (y). When the predicted value is closer to the actual value, the loss function value decreases, and vice versa. It is well-suited for classification problems, where the goal is to predict probabilities rather than continuous values.
Several variants of cross-entropy exist, each best suited to a particular scenario (a short code sketch follows the list):
Sparse Categorical Cross-Entropy
Use when the target variable is an integer-encoded categorical label (e.g., 0, 1, 2, etc.)
Useful when the number of classes is large, since it avoids building one-hot target vectors (which would be mostly zeros).
Loss function only considers the probability of the true class, ignoring the probabilities of the other classes.
Categorical Cross-Entropy
Use when the target variable is a one-hot encoded categorical label (e.g., [1, 0, 0], [0, 1, 0]).
Useful when the number of classes is small to moderate, and the labels are one-hot encoded.
Loss function operates on the full one-hot vector, but since only the true class has a non-zero label, it yields the same value as the sparse variant; the practical difference lies in how the labels are encoded.
Binary Cross-Entropy
Use when the target variable is a binary label (0 or 1).
This variant is useful for binary classification problems, such as logistic regression.
The loss function measures the difference between the predicted probability and the true binary label.
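For a concrete sense of how these variants differ, here is a small sketch using built-in loss functions from tf.keras (a TensorFlow/Keras setup is assumed; the labels and probabilities below are made up for illustration):

```python
import numpy as np
import tensorflow as tf

# Sparse categorical cross-entropy: integer-encoded labels.
y_int = np.array([0, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]], dtype=np.float32)
print(tf.keras.losses.sparse_categorical_crossentropy(y_int, probs).numpy())

# Categorical cross-entropy: one-hot labels; gives the same values as above.
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]], dtype=np.float32)
print(tf.keras.losses.categorical_crossentropy(y_onehot, probs).numpy())

# Binary cross-entropy: one probability per example, labels 0 or 1.
y_bin = np.array([[1.0], [0.0]], dtype=np.float32)
p_bin = np.array([[0.9], [0.2]], dtype=np.float32)
print(tf.keras.losses.binary_crossentropy(y_bin, p_bin).numpy())
```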
The hinge loss, also known as the maximum-margin loss, is a one-sided loss function: it penalizes a prediction only when it falls on the wrong side of the margin. This loss function is suitable for problems where the goal is to maximize the distance between classes. It is also the default loss function for maximum-margin classification algorithms such as SVMs (support vector machines).
Two variants of hinge loss are listed below (a small sketch follows the list):
Squared Hinge Loss (L(y, ŷ) = max(0, 1 - y * ŷ)^2)
Squares the loss value.
Makes the loss function more sensitive to large errors, as the squared term amplifies the effect of mistakes.
Used to increase the penalty for misclassifications and encourage the model to find a larger margin between classes.
Categorical Hinge Loss
Calculates the loss for each class separately and sums them up.
For each incorrect class, it adds max(0, 1 + the predicted score for that class − the predicted score for the true class).
Used to handle more than two classes and to account for the relationships between classes.
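Here is a small NumPy sketch of these hinge variants, assuming binary labels encoded as -1/+1 and, for the categorical case, the Weston-Watkins formulation (one margin term per incorrect class, summed); some libraries implement categorical hinge slightly differently:

```python
import numpy as np

def hinge(y, s):
    """Standard hinge loss; labels y are assumed to be encoded as -1/+1."""
    return np.mean(np.maximum(0.0, 1.0 - y * s))

def squared_hinge(y, s):
    """Squared hinge loss: squaring amplifies large margin violations."""
    return np.mean(np.maximum(0.0, 1.0 - y * s) ** 2)

def categorical_hinge(y_onehot, scores):
    """Weston-Watkins multiclass hinge: one margin per incorrect class, summed."""
    true_score = np.sum(y_onehot * scores, axis=1, keepdims=True)
    margins = np.maximum(0.0, 1.0 + scores - true_score) * (1.0 - y_onehot)
    return np.mean(np.sum(margins, axis=1))

y = np.array([1, -1, 1, -1], dtype=float)   # binary labels in {-1, +1}
s = np.array([0.8, -0.5, -0.3, 0.9])        # raw model scores
print(hinge(y, s), squared_hinge(y, s))

y_oh = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)  # one-hot labels
sc   = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 1.0]])   # per-class scores
print(categorical_hinge(y_oh, sc))
```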
In real-world problems, a standard loss function may not accurately capture the nuances of the situation. Therefore, it is essential to design a custom loss function that aligns with the specific problem requirements.
Suppose we are predicting sales volume, and the costs and profits are asymmetric. Overestimating sales (predicting higher sales than actual) incurs a loss due to excess production costs, while underestimating sales (predicting lower sales than actual) means lost profit opportunities.
In this case, a custom loss function – with the goal of penalizing the model for errors in a way that mirrors real-world consequences – can be designed to reflect the unequal costs and profits.
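As a sketch of what such a custom loss might look like in tf.keras, the function below weights over-predictions more heavily than under-predictions; the function name and the cost weights are hypothetical and would be tuned to the actual business figures:

```python
import tensorflow as tf

def asymmetric_sales_loss(y_true, y_pred):
    # Hypothetical cost weights: overestimating costs twice as much per unit
    # as underestimating; real values would come from the business case.
    over_cost, under_cost = 2.0, 1.0
    error = y_pred - y_true                       # positive error = overestimate
    over  = tf.maximum(error, 0.0) * over_cost    # wasted production cost
    under = tf.maximum(-error, 0.0) * under_cost  # missed profit
    return tf.reduce_mean(over + under)

# model.compile(optimizer="sgd", loss=asymmetric_sales_loss)
```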
A deep learning model without a loss function is akin to a ship without a rudder – it lacks a clear objective and direction. Without it, there is no way to guide (or stop) the training process. Without it, model parameters might generate unpredictable and potentially useless outputs. Without it, we would find it impossible to assess the model's quality.
With the significance of the loss function briefly reinforced, let us begin today's exercise by creating a multi-layered Sequential model. To recap the purposes of the deep learning model layers covered on 8th August:
Convolutional layers extract features from input data (e.g., images) by applying a set of learnable filters to the input.
Pooling layers reduce the dimensionality of data while preserving essential information.
While we are at it, here is a quick reminder of other common layers and what they do in deep learning models (a model sketch follows the list):
Dropout randomly deactivates a fraction of neurons during training, introducing noise that improves generalization.
Flatten converts multi-dimensional input (e.g., image) into a one-dimensional vector.
Dense connects all neurons in one layer to all neurons in the next layer.
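Putting those pieces together, here is a minimal sketch of the kind of multi-layered Sequential model described above, assuming tf.keras and 28x28 single-channel inputs (the exact dataset and input shape are placeholders, since the text does not specify them):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  input_shape=(28, 28, 1)),   # learnable filters extract features
    layers.MaxPooling2D((2, 2)),              # downsample while keeping salient info
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                     # randomly drop activations during training
    layers.Flatten(),                         # feature maps -> one-dimensional vector
    layers.Dense(128, activation="relu"),     # fully connected layer
    layers.Dense(1, activation="sigmoid"),    # single-probability output for a binary task
])
model.summary()
```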
While there is a general correlation between deep learning model complexity and accuracy, it is not a straightforward one-to-one relationship, much like a lot of noisy relationships in real life.
Increasing complexity (more layers and/or neurons) often leads to higher accuracy up to a certain point. But beyond that point, increasing complexity can lead to overfitting, where the model becomes too specialized to the training data and performs poorly on new data.
Now is the time to put your model into action. First, we will use mean squared error (MSE) as the loss function for this attempt. The optimization algorithm stochastic gradient descent (SGD), which calculates the gradient from a single random data point (or a small batch) rather than the entire dataset, will be used to update the Sequential network's parameters for this run and the next.
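A hedged sketch of how this first run might be set up (the dataset variables x_train, y_train, x_val, and y_val are placeholders, and the learning rate and batch size are assumptions):

```python
# First run: MSE as the loss, SGD as the optimizer, 12 epochs as in the text.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="mean_squared_error",
              metrics=["accuracy"])

history_mse = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=12, batch_size=32)
```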
Looking at the plot below, the accuracy values from using MSE as the loss function are rather low. Not only that, but the validation accuracy is notably higher than the training accuracy, which is an unexpected trend for typical deep learning models. To be fair, though, 12 epochs is WAY too few to train advanced neural networks properly.
Continuing with the setting of 12 epochs, this run will use binary cross-entropy as our Sequential model's loss function. To recap, it calculates the error between the predicted probability and the true label (0 or 1), then encourages the model to output probabilities closer to the true label.
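The second run differs only in the loss passed to compile (same placeholder data and assumed hyperparameters as before):

```python
# Second run: identical architecture and optimizer, binary cross-entropy as the loss.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy",
              metrics=["accuracy"])

history_bce = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=12, batch_size=32)
```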
Disappointingly, binary cross-entropy not only produced an even lower (though not by much) accuracy for both sets, but also a larger gap between the training and validation accuracies, with the former consistently higher than the latter. Another observation that sets this plot apart is the minimal improvement in accuracy over the epochs.
The most likely reason for the abysmal (probably an exaggeration) performance of the two runs is the low number of permitted epochs. Other contributing factors are also worth investigating.