With modern assistive technologies, newcomers can spend more time learning why existing AI products work the way they do, rather than first having to figure out from scratch how those products accomplish what they were built for.
In machine learning, evaluating regression models goes beyond simply measuring how 'correct' predictions are. Here, we focus on metrics that assess how well the model captures the underlying relationship between input features and the continuous target variable.
Below are the key evaluation metrics:
Mean Absolute Error (MAE): calculates the average of the absolute differences between the predicted and actual values. It provides a good sense of the overall magnitude of errors, without penalizing large errors more severely. Also known as L1 loss.
Mean Squared Error (MSE): MSE squares the errors before averaging them. This emphasizes larger errors, making them contribute more significantly to the overall score. MSE is often preferred as a loss for gradient-based optimization because it is smooth and differentiable everywhere, which keeps the gradient simple to compute. Also known as L2 loss.
Root Mean Squared Error (RMSE): takes the square root of the MSE, putting the errors back on the same scale as the original data. This makes RMSE easier to interpret in the context of your data's units.
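To make these three metrics concrete, here is a minimal sketch using NumPy and scikit-learn; the actual and predicted values are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual vs. predicted values for a regression task.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # average of |error| (L1 loss)
mse = mean_squared_error(y_true, y_pred)    # average of error^2 (L2 loss)
rmse = np.sqrt(mse)                         # back on the same scale as the data

print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  RMSE: {rmse:.3f}")
```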
R² (the coefficient of determination) is calculated using the following concepts:
Total Sum of Squares (TSS): represents the total variance in the target variable. Calculated by summing the squared deviations of each actual value from the mean of the actual values.
Regression Sum of Squares (SSR): represents the variance explained by the model. Calculated by summing the squared deviations of each predicted value from the mean of the actual values.
R² is essentially the ratio of SSR to TSS; a higher R² suggests the model captures a greater proportion of the variance in the data.
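As a quick check of that ratio, the sketch below (assuming NumPy and scikit-learn, with toy data made up for illustration) fits an ordinary least-squares line and compares SSR / TSS against scikit-learn's r2_score; the two agree for a least-squares fit with an intercept evaluated on its training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data: y is roughly linear in x, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x.ravel() + rng.normal(0, 2, size=100)

y_pred = LinearRegression().fit(x, y).predict(x)

tss = np.sum((y - y.mean()) ** 2)        # Total Sum of Squares
ssr = np.sum((y_pred - y.mean()) ** 2)   # Regression (explained) Sum of Squares

print("R^2 as SSR / TSS :", ssr / tss)
print("R^2 from sklearn :", r2_score(y, y_pred))  # matches for an OLS fit with intercept
```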
In machine learning, evaluating the performance of classification models is crucial. ROC curves (Receiver Operating Characteristic Curves) and AUC (Area Under the Curve) are valuable tools for this purpose.
An ROC curve plots the True Positive Rate (TPR) on the y-axis and the False Positive Rate (FPR) on the x-axis. TPR represents the proportion of positive cases correctly identified by the model, while FPR represents the proportion of negative cases incorrectly classified as positive. The ideal scenario is a classifier that perfectly distinguishes positive and negative cases. The closer the ROC curve is to the top-left corner, the better the classifier's performance. It indicates a high ability to correctly identify positive cases while minimizing false positives.
AUC summarizes the ROC curve's performance into a single metric between 0 and 1. An AUC of 1 represents a perfect classifier, while an AUC of 0.5 indicates a model performing no better than random guessing. Generally, an AUC greater than 0.5 suggests the model has some predictive power, while a higher AUC signifies better classification ability.
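A minimal sketch of computing ROC curve points and AUC with scikit-learn; the labels and predicted probabilities below are hypothetical values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.45, 0.6, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points along the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)
```

Plotting tpr against fpr (for example with matplotlib) reproduces the ROC curve described above; the closer it hugs the top-left corner, the closer the AUC gets to 1.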
In classification tasks, especially those dealing with imbalanced datasets, focusing solely on overall accuracy can be misleading. For instance, in defect detection, it is crucial to identify all defects (high recall) even if it means some false positives (imperfect precision). This is where the F1-Score comes in:
Precision: measures the proportion of positive predictions that are actually correct (out of all positive predictions). In defect detection, it tells you how many of the samples flagged as defective are truly defective.
Recall: measures the proportion of actual positive cases that are correctly identified by the model (out of all actual positive cases). In defect detection, it tells you what proportion of all actual defects were identified by the model.
Just like precision and recall, the F1-Score is derived from the four values of the confusion matrix: true positives, false positives, true negatives, and false negatives. The F1-Score itself is the harmonic mean of precision and recall. See the figure below.
Speaking of the confusion matrix, it is itself a viable evaluation tool for classification problems, as introduced in the previous section.
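To tie precision and recall back to those four values, here is a short sketch using scikit-learn's confusion_matrix; the defect labels (1 = defective, 0 = non-defective) are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical defect-detection labels: 1 = defective, 0 = non-defective.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # matches precision_score(y_true, y_pred)
recall = tp / (tp + fn)      # matches recall_score(y_true, y_pred)

print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Precision:", precision, precision_score(y_true, y_pred))
print("Recall   :", recall, recall_score(y_true, y_pred))
```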
To further demonstrate calculating F1- and F2-Scores in practice, the example below calculates both from two 100-element ground truth and prediction arrays containing random values.
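A sketch of that example, assuming NumPy and scikit-learn; the ground truth and predictions are random 0/1 arrays of length 100, regenerated on every run.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# Two 100-element arrays of random 0/1 labels standing in for ground truth and predictions.
rng = np.random.default_rng()
y_true = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)             # harmonic mean of precision and recall
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta=2 weights recall more than precision

print(f"Precision: {precision:.2f}  Recall: {recall:.2f}")
print(f"F1: {f1:.2f}  F2: {f2:.2f}")
```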
On each run, the ratio between precision and recall lands somewhere between roughly 4:6 and 6:4. If you want, you can instead split your original training dataset and calculate F-Scores on the held-out portion.
Compared to the F1-Score, the F2-Score gives recall more weight than precision. Weighting recall more heavily is recommended in cases where false negatives are considered worse than false positives.