Today is a great day to persevere.
The last few sections contained a lot of written exercises, demonstrative graphs, and algebra-centered equations. Today, we will practice measuring the accuracy of linear and logistic regression models without relying on pandas. The different natures of classification and regression problems warrant different evaluation metrics.
Borrowing from sklearn.datasets, each dataset stores its features numerically in its own data variable and its labels in a target variable; the diabetes target measures disease progression one year after baseline.
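For reference, here is a minimal sketch of loading both datasets; the variable names are my own choice, not anything prescribed by sklearn.

```python
from sklearn.datasets import load_diabetes, load_iris

# Each loader returns a Bunch with the features in .data and the labels in .target.
diabetes = load_diabetes()
iris = load_iris()

print(diabetes.data.shape, diabetes.target.shape)  # (442, 10) (442,)
print(iris.data.shape, iris.target.shape)          # (150, 4) (150,)
```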
When gauging model accuracy with mean squared error (MSE), a lower MSE score signifies a better fit between the model's predictions and the actual target values. Here are some traits of this model's results to be mindful of:
Potential for Overfitting: the training MSE of 2588 is lower than the test MSE of 2937. Perhaps the model memorized the training data too well and lost some of its ability to generalize to unseen data like the test set, which would explain the noticeable MSE gap.
High MSE Value: even the test MSE is a relatively high value. While the interpretation of MSE depends on the scale of the target variable, an MSE in the high 2000s corresponds to a typical prediction error of roughly 50 (its square root) on a target whose values only run from about 25 to 346, so the model's predictions might not be very accurate on unseen data.
When evaluating model accuracy with R-squared, a higher value means the model explains a larger share of the variance in the target. Both training (0.03) and test (0.08) R-squared values for the diabetes model are very low. This suggests that the model captures only a very small portion of the variation in disease progression, implying limited explanatory power.
Combined with the MSE scores, here's a more comprehensive picture: the model might have learned some patterns in the training data (lower training MSE), but these patterns don't generalize well to unseen data (higher test MSE and low test R-squared). This further supports the possibility of overfitting.
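Here is a sketch of how these numbers can be produced with sklearn's metrics; the split ratio and random_state below are assumptions on my part, so the exact scores will differ from the ones quoted above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split; the original isn't stated
)

model = LinearRegression()
model.fit(X_train, y_train)

# Score both splits to compare how well the model generalizes.
for name, features, target in [("train", X_train, y_train), ("test", X_test, y_test)]:
    preds = model.predict(features)
    print(f"{name}: MSE = {mean_squared_error(target, preds):.0f}, "
          f"R^2 = {r2_score(target, preds):.2f}")
```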
sklearn's linear regression model utilizes the least squares method to find the best-fitting linear relationship between features and the target variable. The model also provides two useful attributes:
coef_: a NumPy array containing the estimated coefficients (weights) for each feature in the model. These coefficients indicate the strength and direction of the linear relationship between each feature and the target variable.
intercept_: stores the estimated intercept (bias term) of the model. The intercept represents the predicted value when all features are zero.
Understanding the coefficients and intercept is crucial for interpreting a linear regression model, since they show how a change in each feature affects the predicted outcome.
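Reusing the model fitted in the sketch above, both attributes can be read off directly; pairing each coefficient with the dataset's feature names is just one way to present them.

```python
from sklearn.datasets import load_diabetes

print("intercept_:", model.intercept_)  # predicted value when every feature is zero
print("coef_:", model.coef_)            # one weight per feature

# Pair each weight with its feature name to read off strength and direction.
for name, weight in zip(load_diabetes().feature_names, model.coef_):
    print(f"{name:>4}: {weight:8.2f}")
```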
The target variable of the iris dataset is the species of the iris flower. The model's goal here is to predict each data point's species from its numerical features.
To recap, linear regression deals with regression problems while logistic regression deals with classification problems. Using the accuracy score metric, this model scored high marks in training (0.96) and test (0.93). These scores indicate that the model can accurately predict the target variable based on the features in the training data and generalizes reasonably well to unseen data.
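A minimal sketch of that classification step follows; the max_iter value and the split are assumptions, so the accuracy scores may come out slightly different from 0.96 and 0.93.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed split
)

clf = LogisticRegression(max_iter=200)  # raised max_iter to avoid convergence warnings
clf.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, clf.predict(X_test)))
```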
If you want a more visual comparison between the training and testing data, how about creating two adjacent subplots?
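One possible layout, reusing the classifier and splits from the sketch above: scatter the first two iris features for each split, colored by the model's predicted class. The feature choice and styling here are my own assumptions.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, features, title in [(axes[0], X_train, "Training"), (axes[1], X_test, "Test")]:
    # Color each point by the class the model predicts for it.
    ax.scatter(features[:, 0], features[:, 1], c=clf.predict(features), cmap="viridis")
    ax.set_title(f"{title} predictions")
    ax.set_xlabel("sepal length (cm)")
    ax.set_ylabel("sepal width (cm)")
plt.tight_layout()
plt.show()
```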
Both plots look very similar to each other, aside from a few misclassified data points (about 2 from what I can see).