More equations and texts coming your way!
Back on 30th June, we learnt about linear regression. Today, we will re-explore the statistical method of linear regression and its types, then introduce logistic regression, a method I used to train a machine learning model back on 11th July.
Simple Linear Regression involves only one independent variable (x) affecting a single dependent variable (y). The classic equation for a simple linear regression line is y = a * x + b, where:
a represents the slope of the line, indicating the amount of change in y for every 1-unit increase in x.
b represents the y-intercept, the point where the regression line crosses the y-axis (when x is zero).
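As a quick illustration, here is a minimal sketch using NumPy that fits such a line and recovers a and b (the data points are made up for this example):

```python
import numpy as np

# Toy data that roughly follows y = 2x + 1 with a little noise (made-up values)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.2])

# np.polyfit with degree 1 fits a straight line y = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)
print(f"slope a = {a:.2f}, intercept b = {b:.2f}")  # close to a = 2, b = 1
```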
Multiple Linear Regression is used when there are 2 or more independent variables (x1, x2, …, xn) influencing y. This allows us to model more complex relationships between multiple factors and a single outcome.
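Here is a minimal sketch of multiple linear regression with scikit-learn (the two features and their values are invented purely for illustration, constructed so that y = 3*x1 + 2*x2 + 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two made-up features (x1, x2) influencing a single outcome y
X = np.array([[1.0, 3.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 2.0],
              [5.0, 5.0]])
y = np.array([10.0, 9.0, 18.0, 17.0, 26.0])  # exactly 3*x1 + 2*x2 + 1

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)   # one weight per feature, about [3, 2]
print("intercept:", model.intercept_)  # about 1
```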
In linear regression problems, there are two main approaches for estimating parameters (coefficients):
Closed-form Solutions: provide a direct mathematical formula for calculating the optimal parameters of a linear model. The formula considers all the data points at once and is efficient, especially for models with few features (a short sketch of both approaches follows after this list).
Least Squares Method: a closed-form solution that minimizes the overall discrepancy between predicted and actual y values.
Gradient Descent: starts with an initial guess for the model parameters and then progressively updates them in the direction that minimizes the difference between predicted and actual y values. This update process continues until the error reaches a minimum, or a stopping criterion is met.
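To make both approaches concrete, here is a minimal sketch of each, assuming X is a NumPy feature matrix (one row per data point) and y is the target vector; the learning rate and iteration count below are arbitrary choices:

```python
import numpy as np

def least_squares_fit(X, y):
    """Closed-form fit via the normal equation: theta = (X^T X)^(-1) X^T y."""
    X_b = np.c_[np.ones(len(X)), X]                  # prepend a 1s column for the intercept
    theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)  # solve the normal equation
    return theta                                     # [b, a1, a2, ...]

def gradient_descent_fit(X, y, lr=0.01, n_iters=1000):
    """Iterative fit: repeatedly step the parameters opposite the MSE gradient."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)   # initial guess for the weights
    b = 0.0                    # initial guess for the intercept
    for _ in range(n_iters):
        error = X @ w + b - y                       # predicted minus actual y
        w -= lr * (2 / n_samples) * (X.T @ error)   # d(MSE)/dw step
        b -= lr * (2 / n_samples) * error.sum()     # d(MSE)/db step
    return w, b
```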
Linear regression therefore aims to find, among all possible lines, the one with the least overall error, typically measured by mean squared error (MSE) and minimized either in closed form or via gradient descent. In other words, it looks for the line that most accurately represents the linear relationship between the data points.
Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of an event belonging to a specific category.
Logistic regression also aims to find a best-fitting line to model the relationship between variables, but the goal is not necessarily a boundary that separates every data point perfectly. Real-world data often contains inherent noise and complexities that make perfect separation unrealistic.
Logistic regression addresses these real-world conundrums by introducing the concept of probability. It uses a sigmoid function (S-shaped function) to map the linear combination of features to a probability value between 0 and 1 — a value closer to 1 indicates a higher probability of belonging to a specific class, while a value closer to 0 suggests a lower probability.
The notation fw,b(x) can represent the output of the logistic regression model for a given data point x with specific weights (w) and bias (b). This output can be interpreted as the predicted probability of x belonging to a particular class (posterior probability).
Expanding on the sigmoid function section, a logistic regression model assigns a weight (w_n) to each feature and adds a bias term (b). These values are combined with the feature values through a linear function to get a score (z). This score is then fed into a sigmoid function, which transforms it into a probability value between 0 and 1.
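Concretely, this amounts to a couple of lines of code (a minimal sketch; the function names are my own):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """f_{w,b}(x): linear score z = w.x + b passed through the sigmoid."""
    z = np.dot(w, x) + b
    return sigmoid(z)
```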
A common loss function used in logistic regression is cross-entropy. It penalizes confident but wrong predictions heavily and milder errors only lightly.
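The binary cross-entropy loss can be written as a short function (a sketch; the small epsilon is only there to keep log(0) from blowing up):

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy: -[y*log(p) + (1 - y)*log(1 - p)]."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # keep the logarithms finite
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
```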
To achieve the best possible fit, we need to find the combination of weights (w) and bias (b) that minimizes the overall loss function across all training data. Through gradient descent, the parameters are repeatedly updated until the loss function reaches a minimum, signifying the best possible fit for the model based on the training data. The update rule for each weight can be written as w_j := w_j - α * ∂L/∂w_j, where α is the learning rate (the bias b is updated in the same way).
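Putting the pieces together, a bare-bones training loop might look like this (a sketch reusing the sigmoid idea above; the learning rate and iteration count are arbitrary):

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Fit w and b by gradient descent on the cross-entropy loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        error = p - y                           # gradient of the loss w.r.t. the score z
        w -= lr * (X.T @ error) / n_samples     # w := w - α * ∂L/∂w
        b -= lr * error.sum() / n_samples       # b := b - α * ∂L/∂b
    return w, b
```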
Logistic regression excels at binary classification (two classes). However, for problems with more than two classes (multi-class), we can employ multinomial logistic regression (MLR). MLR extends the binary logistic regression concept to handle multiple classes by internally building multiple binary classifiers. Here are two common approaches used in MLR:
One-vs-Rest (OvR) Strategy: trains a separate binary classifier for each class. In each classifier, one class is designated as the positive class, and all other classes are combined into a single negative class (a short scikit-learn sketch follows after the pros and cons below).
Simple to implement and understand, and works well with imbalanced datasets (unequal class sizes) but…
It may underperform for some multi-class problems (see MvM section) and ignores relationships between classes, assuming independence.
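A minimal OvR sketch with scikit-learn (the iris dataset is used only as a convenient 3-class stand-in, and the hyperparameters are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

# One binary logistic regression per class, each trained as "this class vs. the rest"
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
```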
Many-vs-Many (MvM) Strategy: trains k(k-1)/2 binary classifiers for k classes, each one distinguishing between a specific pair of classes; this pairwise scheme is often called One-vs-One (OvO). For example, with 3 classes (A, B, and C), you would train 3 classifiers: A vs. B, A vs. C, and B vs. C. A short sketch follows after the pros and cons below.
Can potentially achieve higher accuracy than OvR for some problems and considers relationships between classes but…
It is more complex to implement and computationally expensive for large datasets, and it is not a built-in option of scikit-learn's LogisticRegression (although the separate OneVsOneClassifier wrapper can apply the same pairwise scheme).
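For completeness, here is a minimal sketch of the pairwise scheme using scikit-learn's OneVsOneClassifier wrapper (again, the iris dataset is just a 3-class stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X, y = load_iris(return_X_y=True)  # 3 classes -> 3*(3-1)/2 = 3 pairwise classifiers

ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovo.estimators_), "pairwise classifiers")  # expect 3
print(ovo.predict(X[:5]))
```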