Who says the pursuit of happiness is the ultimate goal of humankind? Where would all our other goals fit into the frame?
When confronted with a dataset characterized by complex patterns, noise, and a need for precise predictions, a sequential approach that builds upon previous errors can yield superior results – sometimes more so than methods that train each of their models independently from scratch.
Gradient boosting is an ensemble technique that sequentially builds models, each correcting the errors of its predecessor. Unlike random forest, whose trees are trained independently on random subsets of the data, gradient boosting builds a series of models in which each one learns from the mistakes of the previous ones. The core idea is to minimize a loss function – such as mean squared error for regression or log-loss for classification – by adding weak learners (typically decision trees) iteratively. The term 'gradient' refers to the use of gradient descent to minimize the loss function.
Returning to the topic of bagging and boosting: both are powerful ensemble techniques for improving model performance. Whereas bagging excels at reducing variance by creating diverse models, boosting focuses on reducing bias by iteratively correcting errors. The choice between the two depends on the specific characteristics of the dataset and the desired outcome. Often, a combination of both techniques can yield even better results.
Below this paragraph is a visual representation of the gradient boosting decision tree (GBDT). Before deconstructing its components, we first need to understand the equation at the top.
F_{n+1}(x) = F_n(x) + γ_n · h(x, −∂L/∂F_n(x)) represents how the next model F_{n+1}(x) is built upon the previous model F_n(x). γ_n is a learning rate, or step size, controlling the influence of the new model on the overall prediction. h(x, −∂L/∂F_n(x)) is a weak learner that aims to predict the negative gradient of the loss function L with respect to the previous model's output.
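To make the 'gradient' part concrete: for squared-error loss, L = ½(y − F_n(x))², the negative gradient −∂L/∂F_n(x) works out to y − F_n(x), the residual between the actual value and the current prediction. That is why the steps below talk about fitting each new tree to residuals.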
Here is a more detailed breakdown of the technique's process:
1. Start with a simple model, often the mean of the target variable.
2. Determine the errors (residuals) between the predicted values and the actual values.
3. Build a decision tree to predict the residuals from the previous step.
4. Add the prediction of the new tree to the overall prediction, scaled by a learning rate.
5. Go back to step 2 and iterate until a stopping criterion is met (e.g., a maximum number of trees); a minimal sketch of this loop follows below.
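Here is that sketch: a minimal from-scratch version of the loop for squared-error regression, using shallow sklearn decision trees as the weak learners. The function names and parameter values are illustrative choices for this post, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting for squared-error regression."""
    # Step 1: start with a simple model, the mean of the target variable.
    base_prediction = np.mean(y)
    predictions = np.full(len(y), base_prediction)
    trees = []
    for _ in range(n_trees):
        # Step 2: the residuals are the negative gradient of squared-error loss.
        residuals = y - predictions
        # Step 3: fit a small tree to predict those residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 4: add the new tree's prediction, scaled by the learning rate.
        predictions = predictions + learning_rate * tree.predict(X)
        trees.append(tree)
        # Step 5: repeat until the maximum number of trees is reached.
    return base_prediction, trees

def gradient_boost_predict(X, base_prediction, trees, learning_rate=0.1):
    # Sum the base prediction and every tree's scaled contribution.
    predictions = np.full(X.shape[0], base_prediction, dtype=float)
    for tree in trees:
        predictions += learning_rate * tree.predict(X)
    return predictions
```

Scikit-learn's GradientBoostingRegressor and GradientBoostingClassifier implement the same idea, with more general loss functions and additional regularization options.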
The digits dataset from sklearn consists of 8x8-pixel images of handwritten digits from 0 to 9. Each image is represented as a flattened array of 64 pixel values; the corresponding target value is the digit that the image represents. It is a common benchmark for image classification algorithms.
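The exact code behind this experiment isn't reproduced in this post, so the snippet below is only a sketch of how it might look with sklearn's GradientBoostingClassifier; the train/test split and the default hyperparameters are assumptions rather than the settings that produced the scores discussed next.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Load the 8x8 handwritten-digit images, already flattened to 64 features each.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# Default settings are assumed here (100 trees, learning_rate=0.1, max_depth=3).
gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)

print("Training set accuracy:", gbc.score(X_train, y_train))
print("Test set accuracy:", gbc.score(X_test, y_test))
```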
The results of the gradient boosting model are very similar to those of the random forest model from 31st July: a perfect training set score and a high test set score. Perhaps this model is also complex enough to have overfitted and captured noise in the training data.
The make_regression function from sklearn generates a random regression problem, making it useful for creating synthetic datasets for testing and experimentation.
For this scenario, we will generate 100 samples with 1 feature, 1 target variable, and Gaussian noise with a standard deviation of 6.
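As with the digits example, the following is only a sketch of how this setup might look, using sklearn's make_regression and GradientBoostingRegressor; the random_state values and default hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Generate 100 samples with one feature, one target, and noise with std 6.
X, y = make_regression(n_samples=100, n_features=1, noise=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default settings are assumed here (100 trees, learning_rate=0.1, max_depth=3).
gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(X_train, y_train)

print("Training R2:", gbr.score(X_train, y_train))
print("Test R2:", gbr.score(X_test, y_test))
print("Test MSE:", mean_squared_error(y_test, gbr.predict(X_test)))
```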
Once again, just like the random forest regression results, the R2 scores for both the training and test sets are around or above 0.9, indicating that the model fits the training data well while still generalizing well to unseen data.
However, the high test MSE indicates larger prediction errors on unseen data, suggesting potential overfitting despite the high R2 score of around 0.92.
The handful of gradient boosting predictions lie relatively close to the rising cluster of training data points, and we can see that the model has done an adequate job of capturing the positive relationship between X and y.
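A plot along those lines could be drawn with a few lines of matplotlib, reusing the X_train, y_train, X_test, and gbr objects from the regression sketch above; this is illustrative rather than the original plotting code.

```python
import matplotlib.pyplot as plt

# Scatter the training data and overlay the model's predictions on the test inputs.
plt.scatter(X_train[:, 0], y_train, color="steelblue", alpha=0.6, label="training data")
plt.scatter(X_test[:, 0], gbr.predict(X_test), color="darkorange", marker="x",
            label="gradient boosting predictions")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
```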