Too much of anything is a bad thing for us humans. How boring is that?
Overfitting is a common challenge in machine learning, particularly for regression models. To address it, regression models can leverage a powerful family of techniques known as regularization. These methods penalize overly complex models, encouraging them to learn simpler patterns that are more likely to generalize well to new data.
Linear models aim to find a line or, in higher dimensions, a hyperplane that best fits the data. This is achieved by minimizing an objective function, which is typically built around a loss function. But as the number of features (independent variables) in a linear model increases, the risk of encountering collinearity grows.
Collinearity occurs when features are highly correlated, meaning they contain redundant information. This can negatively impact the model's performance in several ways:
Difficulty in Interpreting Coefficients: when features are correlated, it becomes difficult to isolate the independent effect of each feature on the target variable.
Increased Variance: collinear features can inflate the variance of the model's coefficients, leading to unstable estimates.
Overfitting: models with collinear features can be more susceptible to overfitting, memorizing the training data rather than learning patterns that generalize to new observations.
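To make the increased-variance point concrete, here is a minimal sketch on made-up synthetic data (the variable names, noise levels, and number of resamples are arbitrary choices): two nearly identical features are refit on bootstrap resamples, and their individual coefficients swing widely even though their combined effect stays stable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two nearly identical (collinear) features and a target that depends on their sum.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # x2 is almost a copy of x1
y = 3 * x1 + 3 * x2 + rng.normal(scale=0.5, size=n)

# Refit on bootstrap resamples: the individual coefficients vary wildly,
# even though their sum (the combined effect) stays stable.
for _ in range(3):
    idx = rng.choice(n, size=n, replace=True)
    model = LinearRegression().fit(np.column_stack([x1[idx], x2[idx]]), y[idx])
    print(model.coef_, "sum:", model.coef_.sum())
```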
The objective function of a machine learning model often combines two key elements: a loss function and a regularization term. Minimizing the loss function guides the model toward more accurate predictions, while regularization penalizes complexity to discourage overfitting.
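Conceptually, the objective simply adds the two pieces together. Here is a minimal sketch, using a hypothetical ridge_objective helper and the L2 penalty as the example:

```python
import numpy as np

def ridge_objective(X, y, coef, alpha):
    """Example objective: MSE loss plus an L2 penalty scaled by alpha."""
    predictions = X @ coef
    loss = np.mean((y - predictions) ** 2)   # guides the model toward accurate predictions
    penalty = alpha * np.sum(coef ** 2)      # discourages overly large (complex) coefficients
    return loss + penalty
```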
Normalization is a common technique used in conjunction with regularization to ensure features are on a similar scale, further aiding model training. Briefly put, it scales the data to a specific range, typically between 0 and 1. This is useful when dealing with data that has different units or scales.
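In sklearn, scaling features into the 0-to-1 range is typically done with MinMaxScaler; a brief sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Features on very different scales (e.g., different units).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = MinMaxScaler()             # defaults to the range [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)                      # each column now spans 0 to 1
```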
While linear regression models typically use mean squared error (MSE) or mean absolute error (MAE) as loss functions, these alone do not prevent overfitting or collinearity. To address these issues, we can incorporate regularization techniques into the objective function.
Regularization adds a penalty term to the objective function that discourages models from becoming overly complex. This is often achieved through L1 or L2 regularization:
Lasso regression (L1 penalty): uses the sum of the absolute values of the coefficients (the L1 norm) as the penalty term. Doing so can shrink some coefficients exactly to 0, effectively removing them from the model. This not only reduces overfitting but also performs feature selection by identifying unimportant features, though a very strong penalty can shrink all coefficients to 0. Use it if feature selection is desirable or you suspect some features might be irrelevant or redundant.
Ridge regression (L2 penalty): uses the sum of the squared coefficients (the squared L2 norm) as the penalty term. This shrinks the coefficients towards zero but does not necessarily set them to 0. Ridge regression reduces overfitting but does not perform explicit feature selection like lasso. Use it if feature selection is not a priority or you want to keep all features but reduce their influence on the model.
Both lasso and ridge regression involve a hyperparameter, often denoted by lambda (λ), which controls the strength of the penalty term. A higher λ value leads to stronger regularization. In sklearn, this parameter is often called alpha. When λ is 0, the model is equivalent to regular linear regression.
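In code, the penalty strength is simply a constructor argument. A minimal sketch showing alpha passed to sklearn's Lasso and Ridge (the value 1.0 is an arbitrary placeholder, not a tuned choice):

```python
from sklearn.linear_model import Lasso, Ridge

# alpha plays the role of lambda: larger values mean stronger regularization.
lasso = Lasso(alpha=1.0)
ridge = Ridge(alpha=1.0)

# Note: for Lasso, sklearn recommends using LinearRegression rather than setting alpha=0.
```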
It is important to note that normalization is a separate technique often used in conjunction with regularization. Normalization improves the stability and convergence of the model during training, but it does not directly address overfitting or collinearity.
Returning to sklearn's diabetes dataset, we start with a training MSE of approximately 2937.07 and a test MSE of approximately 2588.02, plus a list of non-zero coefficients (see below).
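A rough sketch of how such baseline numbers can be computed is below; the exact MSE values depend on the train/test split, and the split parameters here are assumptions rather than the ones used for the figures above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Plain linear regression as the baseline.
linreg = LinearRegression().fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, linreg.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, linreg.predict(X_test)))
print("coefficients:", linreg.coef_)
```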
Using lasso regression, the MSE of the training set rises to around 3296.99 and that of the test set to around 3019.33. The narrower gap between the two MSEs indicates a lower risk of overfitting.
Several coefficients in the model being zero indicates that lasso regression has performed feature selection; the dropped features likely have minimal impact on predicting disease progression (the target variable). For features with non-zero coefficients, a positive coefficient suggests a positive correlation with disease progression, while a negative coefficient suggests a negative correlation.
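Continuing from the baseline sketch, a lasso fit and a quick check of which coefficients were zeroed might look like this (the alpha value is again an arbitrary placeholder, not the one behind the numbers above):

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, lasso.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, lasso.predict(X_test)))

# Features whose coefficients were shrunk exactly to zero have been dropped by lasso.
dropped = np.where(lasso.coef_ == 0)[0]
print("zeroed coefficients at feature indices:", dropped)
```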
Compared to the MSEs produced by lasso regression, ridge produced a training MSE of 3163.28 and a test MSE of 2939.56. Both are slightly lower than lasso's MSEs but higher than linear regression's. Of the three regression methods, ridge shows the smallest gap between training and test error and therefore the lowest risk of overfitting.
Unlike lasso, ridge regression retains all coefficients with non-zero values. This allows the model to consider all features while reducing their influence through shrinkage. As above, positive coefficients suggest a positive correlation and negative coefficients a negative correlation, though interpreting individual coefficients in isolation can be challenging.
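And the corresponding ridge fit, reusing the same split (alpha again an arbitrary placeholder):

```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, ridge.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))

# Unlike lasso, every coefficient stays non-zero; shrinkage reduces their magnitude instead.
print("coefficients:", ridge.coef_)
```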