Technology has advanced to the point where, in modern working environments, fluency in algebra matters less than problem-solving skill.
Back on 30th June, we covered what gradient descent is, how it works, and what it offers machine learning algorithms. Today, we will look further into the components that shape a model's performance and keep it from producing undesirable outputs.
In machine learning, optimization algorithms play a crucial role in refining model parameters to minimize the loss function. Gradient descent is one of the most widely used of these methods: it iteratively adjusts model weights in the direction of the negative gradient.
To briefly recap gradient descent, try to imagine that you are standing on a mountain, and your goal is to reach the lowest point (the minimum value) in the valley below. The process of gradient descent can be visualized as follows:
Starting Point: you begin at a random point on the mountain (initial parameters).
Looking Around: you observe the terrain around you, calculating the steepness of the slope (gradients) in all directions. Under the parameter update rule, a steep slope yields a large gradient, while a gentle slope yields a small one.
Downhill Step: you take a step in the direction of the steepest descent (negative gradient), moving downhill.
Adjusting Pace: your pace (learning rate η) determines how far you move in that direction. A small learning rate lets you descend carefully toward the bottom, but it can also leave you stuck partway down in a shallow valley (a local minimum). A large one can overshoot the minimum, but it can also carry you out of shallow valleys and on toward the bottom (convergence).
Reevaluating: you reassess the terrain, calculating new gradients and adjusting your direction.
Iterative Descent: you repeat steps 3-5, iteratively moving downhill, until you reach the valley floor (convergence).
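The steps above can be sketched as a short loop. The quadratic f(x) = x² below is a stand-in "terrain" chosen for illustration; the starting point and learning rate are arbitrary, not values from this series.

```python
def f(x):
    return x ** 2          # stand-in convex "terrain" (the mountain)

def grad(x):
    return 2 * x           # slope of f at x

x = 8.0                    # step 1: a starting point on the mountain
eta = 0.1                  # step 4: learning rate (your pace)

for _ in range(50):        # step 6: iterate
    g = grad(x)            # steps 2 and 5: look around (compute the gradient)
    x = x - eta * g        # step 3: step in the direction of steepest descent

print(round(x, 4))         # close to 0, the valley floor
```

Each pass multiplies the distance to the minimum by (1 − 2η), so with η = 0.1 the walker closes 20% of the remaining gap per step.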
With everything we have covered about gradient descent so far, let us build our own gradient descent model. For the first time, we will use a precision value: the smallest change in the loss function below which the model stops iterating, even before reaching the iteration limit.
Instead of running epochs, each of which spans many iterations, gradient descent here performs its calculations once per iteration.
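A minimal sketch of such a model is below. The objective f(x) = (x + 5)², the starting point x = 3, and the learning rate 0.1 are my assumptions, not values stated in the post; they are chosen because, with the precision of 0.000001 used here, they stop after 66 iterations with x ≈ −5, the same order as the results discussed next.

```python
def grad(x):
    return 2 * (x + 5)            # derivative of f(x) = (x + 5)^2, minimum at x = -5

x = 3.0                           # initial parameter (assumed)
lr = 0.1                          # learning rate (assumed)
precision = 1e-6                  # stop once an update moves x by less than this
max_iters = 10_000                # hard cap on iterations
step = float("inf")
iters = 0

while step > precision and iters < max_iters:
    x_new = x - lr * grad(x)      # one gradient descent update
    step = abs(x_new - x)         # size of this update
    x, iters = x_new, iters + 1

print(iters, round(x, 2))         # stops after 66 iterations with x ≈ -5.0
```

Note that the loop has two exits: the precision check (convergence) and the iteration cap (a safety net when convergence is too slow).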
It took 66 iterations for the first model to reach an x value (the local minimum) of -4.99, rounded to two decimals. Convergence has been achieved, but it is possible that the model reached a local minimum rather than the global minimum.
While x decreased quickly over the first few iterations, its convergence rate slowed considerably afterwards. Overall, the learning rate appears stable.
The plot below says it all: the loss function decreases rapidly over the first 10 iterations, begins to slow at around 20 iterations, and finally converges at a local minimum of around -4.99 (iteration stops at a precision of 0.000001) after more than 60 iterations.
Let us analyze the plot's key traits:
Getting a very low, or even negative, loss value in a gradient descent model can be a sign of overfitting, where the model itself is too complex and fits the training data too closely. (Bear in mind that losses that are nonnegative by construction, such as mean squared error, should never actually go below zero.)
Early convergence suggests the problem is simple and that the model could find its minimum quickly, though it can also mean the learning rate is too high, which risks overshooting.
A second gradient descent model has been set up, this time with a much lower learning rate of 0.0001. All other settings retain their previous values.
The numerical results tell us that the model has settled at a higher local minimum, converging slowly enough to hit the 10,000-iteration cap. In model-fitting terms, this resembles underfitting, where the model is not complex enough, or not trained long enough, to capture the underlying patterns in the data.
The plot below shows that convergence has become much slower at a learning rate of 0.0001. The model outputs a local minimum of around -3.91 at the cap of 10,000 iterations, which makes overshooting less likely than in the previous run, but the global minimum has not been reached.
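Assuming the same kind of toy setup as before (f(x) = (x + 5)², starting at x = 3; these are my assumptions, not values stated in the post), dropping the learning rate to 0.0001 does indeed exhaust the 10,000-iteration cap with x still around -3.9:

```python
def grad(x):
    return 2 * (x + 5)            # assumed toy objective f(x) = (x + 5)^2

x, lr, precision, max_iters = 3.0, 0.0001, 1e-6, 10_000
step, iters = float("inf"), 0

while step > precision and iters < max_iters:
    x_new = x - lr * grad(x)      # identical update rule, 1000x smaller steps
    step = abs(x_new - x)
    x, iters = x_new, iters + 1

print(iters, round(x, 2))         # hits the 10,000-iteration cap, x still far from -5
```

Because each step only closes 0.02% of the remaining gap, the precision threshold is never triggered and the cap becomes the stopping rule.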
Imagine you are driving a car through a winding mountain road, representing the optimization landscape. The learning rate is like the speed of your car. At first, you accelerate quickly (high learning rate) to cover a lot of ground and make progress, but as you enter the twists and turns (local minima), you need to slow down (decay learning rate) to avoid overshooting and losing control.
As you navigate the turns, you continue to slow down, adjusting your speed (learning rate) to match the terrain. This allows you to fine-tune your path, avoiding obstacles and finding the optimal route.
Finally, as you approach the destination (global minimum), you slow down to a crawl (very low learning rate), ensuring you arrive precisely and smoothly.
The learning rate of a gradient descent model can be decreased automatically using a decay factor, which helps the model converge to the optimal solution more efficiently. Different initial values may lead to different local minima, but if the loss function is convex, gradient descent with a suitably chosen learning rate is guaranteed to find the global optimum.
A smaller decay value results in slower learning rate decay, with decay = 0 maintaining a constant learning rate. Conversely, a larger decay value leads to faster decay, with decay = 1 producing the fastest decay.
In practice, iterative learning rate decay involves a simple equation as written below.
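One common form of iterative (inverse-time) decay, consistent with the behavior described above (decay = 0 keeps the rate constant; larger decay shrinks it faster), is η_t = η₀ / (1 + decay · t). This is an assumption on my part; the post's exact formula may differ.

```python
def decayed_lr(eta0, decay, t):
    """Learning rate at iteration t under inverse-time decay.

    decay = 0 keeps the rate constant; larger decay shrinks it faster.
    """
    return eta0 / (1 + decay * t)

# How a starting rate of 0.1 evolves at iterations 0, 10, and 100:
for decay in (0.0, 0.01, 0.9):
    rates = [round(decayed_lr(0.1, decay, t), 4) for t in (0, 10, 100)]
    print(decay, rates)
```

With decay = 0.9, the rate falls by roughly an order of magnitude within the first ten iterations, matching the "slow down for the turns" intuition from the driving analogy.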
To learn more about the effects different learning rates and decay rates have on the convergence of their models, we will generate up to 16 subplots from a convex gradient descent model. The results will not be obvious to the eye unless a second, overlapping plot displays the 'value intensity' of each update.
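The grid experiment can be sketched numerically (without the plotting) as below. The objective f(x) = x², the starting point, the iteration budget, and the full set of decay values are my assumptions; only the learning rates {0.1, 0.3, 0.9, 0.99} and the decays 0.01 and 0.9 are named in the discussion that follows.

```python
def final_x(lr, decay, x0=5.0, iters=100):
    """Run gradient descent on f(x) = x^2 with inverse-time learning rate decay."""
    x = x0
    for t in range(iters):
        eta = lr / (1 + decay * t)   # decayed learning rate at iteration t
        x = x - eta * 2 * x          # gradient of x^2 is 2x
    return x

learning_rates = [0.1, 0.3, 0.9, 0.99]   # named in the text
decays = [0.0, 0.01, 0.1, 0.9]           # assumed grid; only 0.01 and 0.9 are named

for lr in learning_rates:
    row = [round(final_x(lr, d), 4) for d in decays]
    print(f"lr={lr}: {row}")
```

Printing the final x per (learning rate, decay) pair gives a rough stand-in for the 'value intensity' overlay: values near 0 indicate convergence to the global minimum of x².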
For the tamer learning rate categories of 0.1 and 0.3, the convergence rate of each plot is stable but rather slow, showing a balance between convergence speed and stability.
Take the lr=0.1, de=0.01 plot, for example. Its final point looks like a strong candidate for the function's global minimum compared to the outputs in the same row with higher decay rates.
As for the more aggressive learning rates of 0.9 and 0.99, higher decay rates generate drastic updates to the local minima after each iteration. Both rows' performances are unstable at every decay rate.
Perhaps the most drastic examples of all the plots below are the lr=0.9, de=0.9 and lr=0.99, de=0.9 plots: both reach the presumed global minimum of near 0 in very few iterations.