Just as complex ideas are often built by combining simpler, well-defined concepts, ensemble learning combines multiple learning models to create a more robust and accurate prediction model.
Imagine a scenario where you are tasked with predicting the likelihood of a specific event. The data is complex, with numerous factors influencing its occurrence. A single model, relying solely on one set of rules, offers a myopic view of the situation: it could easily overlook crucial patterns or be misled by outliers. Instead, consider a strategy that looks at the problem through the lenses of multiple different models. By combining the insights from these diverse perspectives, a more comprehensive and accurate prediction can be achieved.
Random forest is an ensemble learning algorithm that creates multiple decision trees. By combining the predictions from these individual trees, random forest effectively reduces overfitting and improves predictive accuracy compared to a single decision tree.
The steps in random forest regression are:
Randomly select a subset of data (with replacement) from the original dataset to create a new training set.
Randomly choose a subset of features to consider at each node of the decision tree.
Build a decision tree using the sampled data and selected features.
Repeat steps 1–3 multiple times to create an ensemble of decision trees.
For each data point to be predicted, pass it through every decision tree to obtain a predicted value, then average the predicted values from all trees to get the final prediction.
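To make the procedure concrete, here is a minimal sketch of these steps using scikit-learn's DecisionTreeRegressor on a synthetic dataset; the dataset, ensemble size, and max_features setting are illustrative assumptions rather than fixed parts of the algorithm.

```python
# A minimal sketch of the regression steps above: bootstrap sampling,
# per-node random feature selection (via max_features), and averaging.
import numpy as np
from sklearn.datasets import make_regression  # illustrative synthetic data
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
n_trees = 50  # illustrative ensemble size

trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample (with replacement) of the training data
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2-3: grow a tree; max_features="sqrt" considers a random
    # feature subset at each split
    tree = DecisionTreeRegressor(max_features="sqrt")
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Final step: average the per-tree predictions for the final output
y_pred = np.mean([t.predict(X) for t in trees], axis=0)
```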
The steps in random forest classification are:
Randomly select a subset of data (with replacement) from the original dataset to create a new training set.
Randomly choose a subset of features to consider at each node of the decision tree.
Build a decision tree using the sampled data and selected features.
Repeat steps 1–3 multiple times to create an ensemble of decision trees.
For each data point to be classified, pass it through every decision tree to obtain a predicted class, then take the most frequent predicted class among all trees. This is the final prediction.
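For classification the construction is identical; only the aggregation changes, from averaging to a majority vote. A compact sketch, again with an illustrative synthetic dataset:

```python
# Same construction as the regression sketch, but the final prediction is
# the majority vote across trees.
import numpy as np
from sklearn.datasets import make_classification  # illustrative synthetic data
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))           # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt")   # random feature subsets
    trees.append(tree.fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])          # shape: (n_trees, n_samples)
y_pred = np.array([np.bincount(col).argmax() for col in votes.T])  # majority vote
```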
Random forest introduces randomness in two key ways to build a diverse ensemble of decision trees:
Bootstrap Aggregation (Bagging): each tree in the forest is constructed using a random subset of the original training data. This involves sampling data with replacement, meaning some instances may appear multiple times in a given subset while others might not be included at all.
Random Feature Selection: when building each decision tree, only a random subset of features is considered at each node for splitting. This helps to decorrelate the trees and improve the overall model's generalization ability.
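In scikit-learn, both mechanisms are exposed directly as constructor arguments; a brief sketch (the specific values are illustrative, not recommendations):

```python
# Both sources of randomness map onto scikit-learn parameters.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # bagging: each tree sees a bootstrap sample
    max_features="sqrt",  # random feature subset considered at each split
    random_state=42,      # fix the randomness for reproducibility
)
```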
Random Forest is a powerful and versatile algorithm with several advantages, especially in handling complex datasets and reducing overfitting. Its key strengths are:
Robustness: random forest is less prone to overfitting compared to individual decision trees due to its ensemble nature and random feature selection.
Versatility: random forest can handle both classification and regression problems effectively.
Efficiency: random forest training and prediction can be parallelized across multiple cores, improving computational efficiency.
Feature Importance: random forest provides insights into the relative importance of features in the dataset.
However, do be aware of its shortcomings:
Complexity: random forests can be computationally expensive to train, especially with large datasets and a high number of trees.
Interpretability: while feature importance can be determined, understanding the exact decision-making process of the entire random forest model can be challenging due to its black-box nature.
Bias: if the underlying data is biased, the random forest model will inherit that bias.
The random forest algorithm can be used for both classification and regression problems. For the former, we will use the iris dataset from sklearn to test random forest's capabilities.
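A sketch of loading the data and holding out a test set; the split ratio and random seed here are illustrative choices.

```python
# Load the iris dataset and hold out a test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42
)
```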
Remember when we learnt about decision trees back on 28th July? Since random forest is built from decision trees, you need to pick an impurity measure. To briefly recap, impurity quantifies the disorder or randomness of the data points in a specific node. Below, the Gini index measures the probability that a randomly chosen instance would be misclassified.
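A minimal sketch of fitting such a model with Gini splits, assuming the train/test split above (n_estimators and random_state are illustrative):

```python
# Fit a random forest classifier that splits nodes using the Gini index.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)
forest.fit(X_train, y_train)
```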
Analyzing the results below, the training set score of 1.0 tells us that the iris random forest model has overfitted: it has essentially memorized the training dataset. The testing set score of about 0.88, while still high, shows that the model generalizes somewhat less well to unseen data, though the moderate gap between the two scores means the overfitting is not severe.
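These scores, and the feature importance array discussed next, can be read straight off the fitted model; this is a sketch, and the exact numbers depend on the split and random seed.

```python
# Compare performance on the memorized training data versus unseen data.
print("Training set score:", forest.score(X_train, y_train))
print("Testing set score:", forest.score(X_test, y_test))

# Relative contribution of each feature to the forest's splits.
print("Feature importances:", forest.feature_importances_)
```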
As for the feature importance array, it tells us how much influence each feature has on the predictions behind the scores listed above. Examining the relationship between the most important features and the target variable in numerical form shows which of them drive the model's predictions most prominently.
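One rough way to do that, as a sketch, is to sort the importances by feature name and then look at simple correlations between each feature and the numerically encoded target:

```python
# Pair each importance with its feature name, then check how the features
# relate to the (numerically encoded) target via a simple correlation.
import pandas as pd

importances = pd.Series(forest.feature_importances_, index=iris.feature_names)
print(importances.sort_values(ascending=False))

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
print(df.corr()["target"].sort_values(ascending=False))
```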
Going far back to 17th July, you can use scatter plots to compare your model's predictions against the actual data side by side. The subplot on the left contains the actual classifications of the data points, while the right one contains the classes your model predicts for them after training. While both plots below agree on most data points, there are some noticeable mismatches in the mixed cluster of versicolor and virginica data points.
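A sketch of producing such a comparison, plotting actual versus predicted classes on the held-out test set with the first two iris features as axes; the feature choice and the use of the test split are illustrative assumptions.

```python
# Plot the actual classes next to the model's predicted classes.
import matplotlib.pyplot as plt

y_pred = forest.predict(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
ax1.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="viridis")
ax1.set_title("Actual classes")
ax2.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap="viridis")
ax2.set_title("Predicted classes")
for ax in (ax1, ax2):
    ax.set_xlabel(iris.feature_names[0])
    ax.set_ylabel(iris.feature_names[1])
plt.tight_layout()
plt.show()
```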
Besides manually reading scatter subplots, a more compact, numbers-based way to summarize your model's accuracy is to create a confusion matrix.
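Continuing the same sketch, scikit-learn can compute the matrix and draw it directly:

```python
# Summarize prediction quality per class in a confusion matrix.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

# Optionally render it as a heatmap labelled with the class names.
ConfusionMatrixDisplay(cm, display_labels=iris.target_names).plot()
plt.show()
```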
Random forest regression differs from a single regression model (e.g., linear regression) in that it combines multiple decision trees to form an output, is more resistant to overfitting, and handles complex relationships and non-linear patterns in data more effectively. Its drawbacks are that it is more computationally expensive to train and less interpretable than a single regression model.
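The regression workflow mirrors the classification one. Here is a sketch using a synthetic single-feature dataset; since the scores quoted below came from a different dataset, these exact numbers will not be reproduced.

```python
# A sketch of the random forest regression workflow with illustrative data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=1, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)

# Report R2 and MSE on both the training and the test split.
for name, X_part, y_part in [("Training", X_train, y_train), ("Test", X_test, y_test)]:
    pred = regressor.predict(X_part)
    print(f"{name} R2: {r2_score(y_part, pred):.2f}, MSE: {mean_squared_error(y_part, pred):.2f}")
```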
Based on the training set's R2 score of roughly 0.99 and the test set's R2 score of 0.94, we could argue that the model is slightly overfitted, given how nearly perfect the training score is. As for the MSEs, a low 7.7 on the training set indicates a very good fit, while the noticeably higher test MSE of 53.34 suggests the model still has room for improvement in prediction accuracy.
The scatter plot below shows a positive correlation between the X and y variables. Additionally, the proximity of the predicted values to the actual values suggests that the random forest regression model is performing well in capturing this positive relationship.
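A sketch of producing such a plot, continuing from the regression code above (overlaying actual and predicted values against the single feature):

```python
# Plot actual versus predicted values to visualise how closely the model
# tracks the positive relationship between X and y.
import matplotlib.pyplot as plt

y_pred = regressor.predict(X_test)
plt.scatter(X_test[:, 0], y_test, label="Actual", alpha=0.6)
plt.scatter(X_test[:, 0], y_pred, label="Predicted", alpha=0.6)
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
```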