Warning: this section will be very wordy and technical. Further research is advised.
Machines digest numerical information to learn, if not truly understand, what we ask of them. You already know what machine learning models need to learn; now it is time to see how they learn from the data they are fed.
Back in the temporal data section on 27th June, we learnt that linear regression fills in missing data values with guesstimates.
In essence, linear regression is a statistical model for describing the relationship between a dependent variable (what you want to predict) and one or more independent variables (features you believe influence the dependent variable).
Let us use the plot below to learn more about linear regression. The dataset (D) is a collection of observations, e.g., (x^i, y^i) pairs, where the superscript i refers to the i-th example. To train y = f(x), a prediction model f that receives x and predicts y, observe the data: plot the x^i and y^i values on their respective axes and look at the trend. What kind of function could describe that trend? Is it a linear function?
If the observed trend is indeed linear, you will need to adjust the prediction model into a linear model: y = f(x) = wx + b. For each example, this model outputs a prediction ŷ^i, which is then compared against the label y^i. In this formula, x and y are given by the data; the unknown parts are the model's parameters w and b.
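As a rough illustration, such a linear model can be written as a tiny Python function. The parameter values and input below are made up purely for demonstration; they are not learned from any data yet.

```python
# A minimal sketch of the linear model y = f(x) = w*x + b.
def predict(x, w, b):
    """Return the model's prediction ŷ for an input x."""
    return w * x + b

# Hypothetical parameter values and a single feature value:
w, b = 2.0, 0.5
x_i = 3.0
y_hat_i = predict(x_i, w, b)  # 6.5
```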
At this point, you might wonder: how does the model know what the unknown parameters should be?
In most cases with linear plots, we want the line to describe the trend of the data as well as possible, so we need to fit the line to the data, a process called fitting. We cannot expect every point to lie exactly on the line; instead, we relax the requirement and only ask that the line stay as close as possible to each data point. Such a line is the best description of the data trend.
But how does one fit to begin with? Here, we will define the error function. First, subtract the model prediction ŷ^i from the data value y^i and square the difference, so that positive and negative errors do not cancel each other out. Finally, by averaging the squared errors over all n data points, we get the following formula:

(1/n) Σ_{i=1}^{n} (y^i − ŷ^i)²
The above formula is our error function, which we call the mean squared error; it is the average of the squared vertical distances from each data point to the trend line.
If you are unfamiliar with the rotated M-looking symbol in the middle, that is the sigma (Σ) symbol, which is generally used to denote a sum of multiple terms.
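As a sketch of how that average looks in code (plain Python, with invented numbers purely for illustration):

```python
def mean_squared_error(y, y_hat):
    """Average of the squared differences between labels y and predictions y_hat."""
    assert len(y) == len(y_hat)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / len(y)

# Invented example values:
y = [1.0, 2.0, 3.0]
y_hat = [1.1, 1.9, 3.2]
loss = mean_squared_error(y, y_hat)  # = 0.02
```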
The loss function is used to describe the degree of mismatch between a model and the data. The error function can be seen as a type of loss function, so you can write it like this:

L(y, ŷ) = (1/n) Σ_{i=1}^{n} (y^i − ŷ^i)²
You would want the mismatch between the model and the data to be as low as possible, the state in which the former fits the latter best. This can be expressed with the equation below:

w*, b* = argmin_{w, b} L(y, ŷ)
The method above is the least squares method. In the equation, argmin returns the input(s) for which the output is minimal, and its subscript w and b (weight and bias) marks the parts we are allowed to change. In other words, among all the values the model parameters w and b can freely take, we pick the ones that minimize the loss function L(y, ŷ).
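For simple one-feature linear regression, this argmin even has a closed-form answer via the standard least squares formulas w = cov(x, y) / var(x) and b = mean(y) − w · mean(x). The sketch below uses made-up data just to show that the minimizing parameters can be computed directly:

```python
def least_squares_fit(x, y):
    """Closed-form least squares estimates of w and b for y ≈ w*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

# Made-up data that roughly follows y = 2x + 1:
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]
w, b = least_squares_fit(x, y)  # w ≈ 1.94, b ≈ 1.09
```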
The loss function can take many shapes depending on the error function. If you draw the model parameters w and b on the x and y axes and the corresponding loss value on the z axis, you can observe the loss function landscape below:
In linear regression, we aim to find the optimal model parameters that minimize the loss function. However, we cannot directly observe the entire landscape of the loss function, only evaluate it for specific parameter values. Additionally, while we can adjust the model parameters, linear regression typically deals with two main parameters: w and b.
To navigate the loss function and find the minimum, we often use an iterative optimization technique like gradient descent. This method takes small steps in the direction of the steepest decrease in the loss function. To illustrate the concept, let us temporarily fix w as a constant. This essentially reduces the problem to finding the best value of b that minimizes the loss function for that specific configuration of the weights. In this simplified scenario, the loss function becomes a function of only b, which can be expressed as:

L(b) = b² + c₁b + c₂, where c₁ and c₂ are constants that depend only on the data and the fixed w.
Interestingly, the simplified loss function obtained after fixing the weights is the equation of a parabola with its opening facing up. This is because the b² term has a positive coefficient, while the constant term c₂ does not affect the shape of the curve. Although the complete loss function with all parameters might be more complex, this connection to parabolas helps us visualize how the loss function changes as we adjust b.
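To see where that quadratic form comes from, you can expand the mean squared error while holding w fixed (a short derivation in the same notation as above; c₁ and c₂ simply collect the terms that do not involve b):

L(b) = (1/n) Σ_{i=1}^{n} (y^i − w x^i − b)²
     = b² − (2/n) Σ_{i=1}^{n} (y^i − w x^i) · b + (1/n) Σ_{i=1}^{n} (y^i − w x^i)²
     = b² + c₁b + c₂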
Imagine a parabola opening upwards, where the bottom of the curve represents the minimum value. Adjusting b (the bias) moves us along this curve, and the goal of gradient descent is to iteratively adjust the bias until we reach the lowest point of the parabola, minimizing the loss function.
It is important to remember that this is a simplified analogy. The actual loss function in linear regression with all parameters might not be a perfect parabola, but the core idea of minimizing the function to find the optimal parameters remains the same.
Earlier, we explored the loss function of linear regression. Under certain assumptions (like convexity), linear regression typically has a single minimum point that represents the optimal solution for the model parameters.
Gradient descent is a widely used optimization algorithm in machine learning and deep learning. It helps find the aforementioned optimal solution for the model parameters. Here is a simplified view: imagine you have a ball on a bumpy landscape representing the loss function. The goal is to find the lowest valley (minimum loss). Gradient descent acts like the ball, and it iteratively rolls downhill.
Here is how gradient descent technically works:
Step 1: the algorithm starts with an initial guess for the model parameters (weights and biases). This is like placing the ball at a random point on the landscape.
Step 2: the gradient of the loss function tells you the direction of steepest ascent, so its negative points in the direction of steepest descent (the steepest slope downhill).
Step 3: the model takes a small step in the direction opposite the gradient (downhill). The size of the step is controlled by a learning rate.
Step 4: steps 2 and 3 are repeated for all data points in the training set (one epoch). The process continues for multiple epochs until the loss function converges (the ball settles at the bottom of the valley) or a stopping criterion is met.
Back to the graph, the point represents the current values of the model parameters, typically denoted by the symbol θ (theta). Given these parameters, we can calculate the corresponding loss function value: the higher the point on the landscape, the higher the loss. The gradient of the loss function with respect to the parameters, ∇L(y, ŷ; θ), tells us the direction of steepest ascent at that point.
Because the gradient points uphill, toward higher loss, we use its negative, −∇L(y, ŷ; θ), to update the parameters. This ensures we move the parameters in the direction that actually decreases the loss function. The formula for the complete gradient descent update is θ_{t+1} = θ_t − η∇L(y, ŷ; θ_t).
The updated parameter θ_{t+1} is obtained from the current parameter θ_t by taking a step in the downhill direction, −η∇L(y, ŷ; θ_t), where the learning rate η represents the size of the step.
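Putting the update rule together with the mean squared error of the linear model, a bare-bones gradient descent loop might look like the sketch below. The data and hyperparameters are invented for illustration, and the gradient expressions are the standard partial derivatives of the MSE with respect to w and b:

```python
def gradient_descent(x, y, lr=0.05, epochs=500):
    """Fit y ≈ w*x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0  # initial guess for the parameters (theta)
    n = len(x)
    for _ in range(epochs):
        # Predictions and errors under the current parameters.
        y_hat = [w * xi + b for xi in x]
        errors = [yhi - yi for yhi, yi in zip(y_hat, y)]
        # Gradients of the MSE with respect to w and b.
        grad_w = (2 / n) * sum(e * xi for e, xi in zip(errors, x))
        grad_b = (2 / n) * sum(errors)
        # Step opposite the gradient, scaled by the learning rate (eta).
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data that roughly follows y = 2x + 1:
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]
w, b = gradient_descent(x, y)  # approaches w ≈ 1.94, b ≈ 1.09
```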
While gradient descent is a powerful optimization algorithm, it has some limitations. The guarantee of finding the global minimum only applies to convex loss functions. In many real-world applications, the loss function might not be convex, and there might be multiple local minima (valleys) in the landscape. Gradient descent can get stuck in one of these local minima, leading to a suboptimal solution.
Despite the above, gradient descent remains a fundamental and widely used technique in machine learning and deep learning. By understanding its core principles and being aware of its limitations, you can effectively apply gradient descent to train various models and achieve good results.
To recap and reorganize our thoughts, here is a breakdown of the key components of a machine learning model:
The model: a core structure that identifies patterns and relationships within the data. It acts like a flexible mold that can be shaped to fit the data's underlying trends.
The loss function: serves as a performance metric, measuring how much the model's predictions deviate from the actual values. Different tasks have specific loss functions tailored to their goals (e.g., minimizing squared errors for regression).
The learning algorithm: the engine that drives the model's training process. It iteratively adjusts the model's internal parameters (weights and biases) based on the chosen loss function. The goal is to minimize the loss, leading to better predictions.
By swapping or combining these elements, we can create diverse machine learning models. This approach of understanding machine learning models as an assembly of interchangeable components provides a foundational perspective for exploring different algorithms and their applications.
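As a toy illustration of this interchangeability (the function names here are hypothetical, not from the text above), the same generic training loop can accept any model, any loss function, and a simple parameter-update rule:

```python
def linear_model(x, params):
    """The model: maps a feature to a prediction using parameters (w, b)."""
    w, b = params
    return w * x + b

def squared_error(y, y_hat):
    """The loss: how far a single prediction deviates from the actual value."""
    return (y - y_hat) ** 2

def train(model, loss, data, params, lr=0.05, epochs=1000, eps=1e-6):
    """The learning algorithm: repeatedly nudge the parameters downhill, using a
    crude finite-difference estimate of the gradient of the average loss."""
    params = list(params)
    for _ in range(epochs):
        base = sum(loss(y, model(x, params)) for x, y in data) / len(data)
        grads = []
        for j in range(len(params)):
            bumped = params.copy()
            bumped[j] += eps
            shifted = sum(loss(y, model(x, bumped)) for x, y in data) / len(data)
            grads.append((shifted - base) / eps)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]  # made-up points
w, b = train(linear_model, squared_error, data, params=[0.0, 0.0])
```

Swapping in a different model function or loss function, while keeping the same loop, is exactly the kind of recombination described above.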
Machine learning involves using data to train models that can make predictions or decisions. Here's a breakdown of the key steps:
Data (D): the journey starts with a dataset (D) containing the information the model will learn from. This data typically consists of features (independent variables) and target variables (what you want to predict).
Model Selection: we choose a model architecture, which can be thought of as a specific formula or function (f) that relates the features to the target variables. Different machine learning tasks require different model types.
Model Parameters (θ): each model has internal parameters (θ) that act like knobs we can adjust. These parameters control how the model interprets the data and makes predictions.
Training: the heart of machine learning lies in the training process. Here, a learning algorithm uses the data (D) to adjust the model's parameters (θ). The goal is to find the best combination of parameters that allows the model to make accurate predictions on unseen data.
Trained Model: after training, we obtain a "trained model." This essentially refers to the original model architecture (f) with the specific parameter values (θ) learned from the data. This trained model can now be used to make predictions on new data.
Statistical learning theory provides the foundation for this framework. It explores the mathematical concepts behind how models learn from data and generalize to unseen examples.