Whether or not you like jumping back and forth between topics from different dates, I find it absolutely necessary for unpacking the often befuddling contents of machine learning.
Back on 16th June, we learnt how to train a machine learning model on a dataset. Today, we return to that topic and take a closer look at splitting data into training and testing sets.
Machine learning models require data to train. If we train the model on all the data, it might simply memorize the training examples instead of learning generalizable patterns. This leads to overfitting, where the model performs well on the training data but poorly on unseen data.
A common approach is to split the initial data into training, validation, and test sets with a typical ratio of 60% training, 20% validation, and 20% test. The validation set is used to monitor the model's performance during training, so that we can adjust its hyperparameters (settings that control the learning process) to prevent overfitting. Finally, the test set provides a final assessment of the model's generalizability, reflecting how well the model is likely to perform on real-world data.
By splitting the data, we gain a more reliable evaluation of the model's performance and reduce the risk of overfitting. This ensures the model can generalize well to unseen data, which is crucial for real-world applications.
The train_test_split function from sklearn's model_selection module can only perform one two-way split per call, so it cannot produce all three sets at once. First split the data 60/40 into a training set and a temporary set, then split that temporary set in half, leaving 20% of the original data for validation and 20% for the final test.
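As a minimal sketch of those two calls, assuming the breast cancer data is loaded from sklearn's built-in dataset (and an arbitrary random_state for reproducibility):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features (X) and the malignant/benign target (y) as a DataFrame/Series.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# First call: 60% training, 40% temporarily held out.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Second call: split the held-out 40% in half -> 20% validation, 20% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```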
You can then pick any machine learning algorithm to train your model with. In this case, I am using an algorithm we have not covered before: logistic regression. It is widely used for binary classification tasks, where the target has exactly two categories.
Down below, we see that the trained model's accuracy on the validation set is approximately 88.5%. A solid score from using only radius mean and texture mean to predict breast cancer status (malignant or benign).
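A rough sketch of that training step, assuming the split from above and that the two columns are named "mean radius" and "mean texture" as in sklearn's built-in dataset (your exact score may differ):

```python
from sklearn.linear_model import LogisticRegression

# Use only the two features discussed above.
features = ["mean radius", "mean texture"]

# Fit logistic regression on the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train[features], y_train)

# Accuracy on the validation set, used while tuning hyperparameters.
print("Validation accuracy:", model.score(X_val[features], y_val))
```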
Finally, we can feed our test data into the model to get a final accuracy score. The model does not print these results on its own, so you have to compute them with functions from sklearn.metrics. Here is a brief rundown in relation to my image below, with a short code sketch after the list:
Accuracy Score: the most basic metric, representing the proportion of correct predictions made by the model. It is calculated as the number of correctly classified samples divided by the total number of samples.
Accuracy can be misleading, especially for imbalanced datasets where the model might perform well on the majority class but poorly on the minority class. For example, if 95% of the samples were benign, a model that always predicted benign would score 95% accuracy while never catching a single malignant case.
Confusion Matrix: a table that provides a more detailed picture of the model's performance across different classes. By analyzing it, you can identify areas where the model might be struggling, such as high false positives or false negatives for specific classes. It shows the number of:
True positives: correctly predicted positives.
False positives: incorrectly predicted positives.
True negatives: correctly predicted negatives.
False negatives: incorrectly predicted negatives.
Classification Report: a text summary derived from the confusion matrix. It provides several metrics for each class, including:
Precision: proportion of predicted positives that are actually correct (out of all positive predictions).
Recall: proportion of actual positives that are correctly identified by the model (out of all actual positive cases).
F1-Score: harmonic mean of precision and recall, combining their importance.
Support: the number of samples that actually belong to a specific class in the evaluated data.
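A sketch of that final evaluation on the test set, assuming the model and split from the snippets above (the exact numbers in my images will differ):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on the test set, which the model has never seen before.
y_pred = model.predict(X_test[features])

# Proportion of correct predictions.
print("Accuracy:", accuracy_score(y_test, y_pred))

# True/false positives and negatives per class.
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# Precision, recall, F1-score and support for each class.
print(classification_report(y_test, y_pred))
```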
Validation results of the k-nearest neighbors (KNN) algorithm:
Final results of KNN:
Validation results of the decision tree algorithm:
Final results of decision tree:
Validation results of the random forest algorithm:
Final results of random forest:
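For reference, a hedged sketch of how the same pipeline could be reused for those three algorithms (default hyperparameters assumed here; the settings behind my actual results may differ):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Same train/validation/test workflow, swapping only the estimator.
for name, clf in [
    ("KNN", KNeighborsClassifier()),
    ("Decision tree", DecisionTreeClassifier(random_state=42)),
    ("Random forest", RandomForestClassifier(random_state=42)),
]:
    clf.fit(X_train[features], y_train)
    print(name, "validation accuracy:", clf.score(X_val[features], y_val))
    print(name, "test accuracy:", clf.score(X_test[features], y_test))
```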