Intelligent objects made by humans tend to bear some cognitive resemblance to their creators, whether by conscious design or otherwise. The same idea is relevant to creator bias in AI development.
Today's topic is the consequences of training a machine learning model on overspecialized or underspecialized data, along with an introduction to several techniques that address those consequences.
A key objective in machine learning is to train a model that can accurately distinguish between two categories. Imagine data points representing the two classes scattered in a 2D space. Our goal is to find a decision boundary — often visualized as a line — that effectively separates the classes.
While the ideal scenario would be a line that perfectly separates the two classes with no data points on the wrong side, real-world data often presents challenges. There might be inherent overlap between the classes, or the data might not be linearly separable.
Consider how the choice of a decision boundary (cut line) can impact the model's performance on unseen data.
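To make this concrete, here is a minimal sketch, assuming scikit-learn and a synthetic two-class dataset (both are illustrative choices, not tied to any figure in this article), of fitting a linear decision boundary:

```python
# A minimal sketch of fitting a linear decision boundary to two classes in 2D.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two overlapping clusters of points, one per class (synthetic, for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

clf = LogisticRegression().fit(X, y)

# The learned boundary is the line w0*x0 + w1*x1 + b = 0.
w, b = clf.coef_[0], clf.intercept_[0]
print(f"decision boundary: {w[0]:.2f}*x0 + {w[1]:.2f}*x1 + {b:.2f} = 0")
print("training accuracy:", clf.score(X, y))
```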
Look at the model with the red dotted line below. While it might appear to perform well on the training data, this could be a case of overfitting. The model has memorized specific details and patterns within the training set, potentially becoming too sensitive to slight variations.
This would result in:
Poor Generalizability: when presented with new/unseen data, particularly points near the decision line, the red line model might struggle to make accurate predictions. It is focused on replicating the training data rather than capturing the underlying relationships between features and classes.
Unreliable Performance: the red line's accuracy might fluctuate depending on the specific training data it encounters. This lack of consistency makes it less reliable for real-world applications.
From the new example below, we can see that when training a model, the goal is not simply to make the training error as small as possible. We must also use a test set to verify the model's predictive power, since the real goal is low error on unseen data, which means keeping the training and test errors both low and close to each other.
While a more complex model (red line) might fit the training data very well, having low training error, it can become unstable on unseen data. Conversely, a simpler model might not perfectly capture all the nuances in the training data but could generalize better to unseen data due to its less complex decision boundary.
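The sketch below illustrates this comparison, assuming scikit-learn, a synthetic dataset, and decision trees as the stand-in for simple versus complex models; none of these specifics come from the original example.

```python
# A rough sketch comparing a simple and an overly complex model on the same data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, None):  # shallow (simple) vs unlimited depth (complex)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")

# Typically the unlimited-depth tree reaches ~1.00 training accuracy but a lower
# test accuracy than the shallow tree: low training error, poor generalization.
```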
Overfitting occurs when a model becomes too specialized to the training data and fails to generalize well to unseen data. Here are some methods for tackling it, with a combined code sketch after the list:
Diagnosing the Problem: a high training error suggests underfitting (the model is too simple), while a very low training error combined with a significant drop in performance on unseen data indicates overfitting.
Gather More Informative Data (if possible): increasing the amount and quality of training data can help reduce variance without affecting bias. More diverse data exposes the model to a wider range of patterns, making it less reliant on specific details in the training set.
Regularization Techniques: introduce penalties for overly complex models, discouraging the model from fitting random noise in the data. This helps prevent the model from becoming too specialized to the training data.
Cross-validation: divides the data into training and validation sets. The model is trained on the training set and evaluated on the validation set. This helps identify models that perform well on unseen data, reducing the risk of overfitting.
Early Stopping: monitors the model's performance on a validation set during training. If the validation error stops improving or starts increasing, training is stopped. This prevents the model from memorizing noise in the training data after a certain point.
Ensembling: combining predictions from multiple models trained on different subsets of the data (ensemble methods) can lead to better generalization compared to a single model. This approach leverages the strengths of different models, potentially reducing overfitting.
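Here is a rough combined sketch of regularization, cross-validation, early stopping, and ensembling in scikit-learn; the dataset, estimators, and hyperparameter values are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic classification data, purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regularization: C is the inverse regularization strength for logistic
# regression; a smaller C means a stronger penalty on large weights.
regularized = LogisticRegression(C=0.1, max_iter=1000)

# Cross-validation: estimate performance on held-out folds rather than
# trusting the training error.
scores = cross_val_score(regularized, X_train, y_train, cv=5)
print("cross-validated accuracy:", scores.mean())

# Early stopping: stop training when the score on an internal validation
# split stops improving.
early_stopped = SGDClassifier(early_stopping=True, validation_fraction=0.2,
                              n_iter_no_change=5, random_state=0)
early_stopped.fit(X_train, y_train)
print("SGD iterations before stopping:", early_stopped.n_iter_)

# Ensembling: a random forest averages many trees trained on bootstrap
# samples of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("forest test accuracy:", forest.score(X_test, y_test))
```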
The phenomenon of underfitting — opposite to overfitting — occurs when a model is too simple or lacks the flexibility to capture the underlying relationships within the data. Underfitting can happen due to:
Limited Model Complexity: a simple model might not have enough capacity to learn the complexities of the data.
Excessive Regularization: regularization techniques can prevent overfitting, but using too much regularization can restrict the model's ability to learn effectively (illustrated in the sketch after this list).
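As a small illustration of the second cause, the sketch below applies an extreme regularization penalty (ridge regression with a very large alpha, both assumptions made for this example) and shows that even the training fit degrades:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 4.0]) + rng.normal(scale=0.5, size=200)

for alpha in (1.0, 1e6):  # moderate vs extreme penalty
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:g}: train R^2 = {model.score(X, y):.3f}")

# With alpha=1e6 the coefficients are shrunk toward zero, so even the
# training fit is poor: the model is too constrained to learn the signal.
```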
Some symptoms of underfitting include:
High Bias: the model fails to learn the true patterns in the data, leading to consistently large errors across the entire dataset.
Low Variance: since the model is not learning much from the data, its predictions will tend to be similar, resulting in low variance (the variability of the model's predictions, not the overall error).
When a model suffers from underfitting, its predictions are consistently inaccurate because it is too simple to capture the complexities of the data. Here are some strategies to tackle underfitting, followed by a short code sketch:
Increase Model Complexity: move from a simple model (e.g., linear regression) to a more complex one (e.g., polynomial regression, decision trees with deeper structures). This allows the model to learn more intricate relationships between features and the target variable.
Feature Engineering: create new features from existing ones that might better represent the underlying patterns in the data. This can involve feature selection or feature creation.
Careful Regularization: while regularization is generally used to prevent overfitting, applying too much of it can contribute to underfitting. Finding the right balance is crucial.
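The sketch below, which assumes scikit-learn and a synthetic non-linear dataset, shows the first two strategies at work: a plain linear model underfits a curved target, while the same model on polynomial features fits it well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=300)  # U-shaped, non-linear target

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))      # a straight line underfits
print("quadratic R^2:", round(poly.score(X, y), 3))     # added features capture the shape
```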
Bias refers to the systematic difference between a model's average predictions and the true values. A large bias suggests the model is under-representing the true relationship between features and target values; for example, a simple model like linear regression might not be complex enough to capture the patterns in a non-linear dataset.
Variance refers to the variability of a model's predictions for a given input. A high variance model is very sensitive to the specific training data it is exposed to. An overly complex model can try to fit every detail in the training data, including random noise or outliers, which can lead to the model performing well on the training data but failing to generalize to unseen data.
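One way to see bias and variance directly is to refit the same model on many independently drawn training sets and inspect its predictions at a fixed point. The sketch below does this for a linear model and an unpruned decision tree; the sine-shaped data-generating function and the specific models are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_query = np.array([[1.5]])
true_value = np.sin(1.5)  # the value the models are trying to predict

for name, make_model in [("linear", LinearRegression),
                         ("deep tree", DecisionTreeRegressor)]:
    preds = []
    for _ in range(200):  # 200 independent training sets
        X = rng.uniform(-3, 3, size=(50, 1))
        y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=50)
        preds.append(make_model().fit(X, y).predict(x_query)[0])
    preds = np.array(preds)
    bias = preds.mean() - true_value   # systematic offset from the truth
    variance = preds.var()             # spread of predictions across training sets
    print(f"{name}: bias={bias:+.3f}, variance={variance:.3f}")

# Typically the linear model shows larger bias but small variance, while the
# unpruned tree shows small bias but larger variance.
```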
The ideal scenario is a model with low bias and low variance, but there is a tradeoff between the two: reducing one often increases the other. The goal is to find a model that strikes a good balance between them, using techniques such as the following (a code sketch follows the list):
Choosing the right model complexity: select a model with enough capacity to learn the data without being overly complex.
Tuning hyperparameters: hyperparameters control the learning process of a model. Adjusting them can help reduce bias/variance.
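As a sketch of hyperparameter tuning for this balance, the example below runs a cross-validated grid search over tree depth; the dataset and the depth grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

# Shallow trees lean toward high bias, very deep trees toward high variance;
# cross-validation picks the depth that generalizes best.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 4, 8, 16, None]},
                      cv=5)
search.fit(X, y)
print("best depth:", search.best_params_,
      "cv accuracy:", round(search.best_score_, 3))
```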