Compared to the harshness of human life, computer programs get off easy when it comes to errors. Whereas a real-life mistake can cripple a person completely, a program's errors can often be measured, corrected, and refined so that it keeps running.
Consider a self-driving vehicle equipped with a vision system tasked with identifying objects on the road. The system receives images as input and must accurately classify objects such as pedestrians, traffic signs, and other vehicles. To enhance the system's performance, a learning algorithm needs to be employed – one that iteratively adjusts the model's parameters based on the errors observed during training.
Backpropagation, short for 'error backpropagation', is a fundamental algorithm in neural network training that computes the gradient of the loss function with respect to all model weights. This gradient information is then utilized by optimization methods, such as gradient descent, to update the weights and minimize the loss function.
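To make that relationship concrete, here is a minimal sketch of the gradient-descent update that consumes the gradients backpropagation produces. The weights, gradient values, and learning rate below are made-up numbers purely for illustration.

```python
import numpy as np

# Toy values purely for illustration; in practice the gradients come from
# backpropagation and the learning rate is tuned per problem.
weights = np.array([0.5, -0.3])     # current model weights
gradients = np.array([0.2, -0.1])   # dLoss/dWeight from backpropagation
learning_rate = 0.1

# Gradient descent: step each weight against its gradient to reduce the loss.
weights -= learning_rate * gradients
print(weights)  # approximately [0.48, -0.29]
```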
Backpropagation requires:
Known outputs: for each input value, the corresponding output must be known to calculate the loss function gradient.
Differentiable activation functions: the activation functions used in the artificial neurons (nodes) must be differentiable, enabling the computation of gradients.
As a result, backpropagation is typically employed in supervised learning settings, where labeled data is available. The algorithm efficiently computes gradients for each layer in the network, facilitating the optimization of weights to achieve minimal loss.
Consider a neural network for predicting fruit sales. The network has two input nodes dubbed Part and UnitPrice. These inputs are multiplied by their respective weights and summed, and the sum passes through an activation function (represented by the circle) to produce the output Out(1). Out(1) is then multiplied by TAX to calculate the final output f(x), which represents the final price.
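Here is a sketch of that forward pass in code. The concrete input values, weights, TAX rate, and the choice of an identity activation are all assumptions made for illustration; the figure itself does not fix them.

```python
# Illustrative forward pass through the fruit-sale network. All numeric
# values and the identity activation are assumptions, not taken from the figure.
part, unit_price = 2.0, 100.0   # assumed values for the two input nodes
w_part, w_price = 1.0, 1.0      # assumed weights on the input connections
tax = 1.1                       # assumed TAX multiplier

def activation(x):
    return x                    # assumed identity activation for simplicity

out1 = activation(part * w_part + unit_price * w_price)  # hidden output Out(1)
f_x = out1 * tax                # final output f(x), the predicted final price
print(f_x)                      # approximately 112.2 with these assumed values
```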
When running backpropagation, the difference between the predicted output Out(2) and the true target value is calculated. This error signal is, as the name suggests, propagated backward through the network. The weights on the connections between nodes are then adjusted based on the error signal and the gradient of the loss function, with the goal of minimizing the error and improving the model's accuracy.
In the fruit-sale network, suppose the true total price is 240. If the final output is not 240, the error signal propagates backward through the network, and the TAX weight is then adjusted to minimize the error.
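As a sketch of what that single-weight adjustment might look like, the snippet below runs one gradient step on TAX against a squared-error loss. The target of 240 comes from the example; the value of Out(1), the starting TAX, and the learning rate are assumptions.

```python
# One illustrative backpropagation step on the TAX weight alone,
# using a squared-error loss. Only the target of 240 comes from the text.
out1 = 200.0                  # assumed output of the hidden node Out(1)
tax = 1.1                     # assumed current TAX weight
target = 240.0                # true total price from the example

f_x = out1 * tax              # predicted final price, approximately 220
error = f_x - target          # approximately -20: the prediction is too low

# Gradient of the squared error (f_x - target)**2 with respect to TAX.
grad_tax = 2 * error * out1   # approximately -8000

learning_rate = 1e-5
tax -= learning_rate * grad_tax
print(tax)                    # approximately 1.18: TAX is nudged up, shrinking the error
```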
When you change the data fed into a model, its output changes as well, producing a difference between the predicted output and the expected target output. This difference is quantified as the error rate.
To measure the magnitude of the error rate, a loss function is used. A common choice is the Mean Squared Error (MSE), which calculates the average squared difference between the predicted and target outputs: MSE = (1/n) * Σ (target output - predicted output)^2, where n is the number of data points.
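As a quick sanity check of the formula, here is a small computation of the MSE over a handful of made-up targets and predictions.

```python
import numpy as np

# Made-up targets and predictions, purely to illustrate the MSE formula.
target = np.array([240.0, 150.0, 90.0])
predicted = np.array([230.0, 155.0, 95.0])

# MSE = (1/n) * sum((target - predicted)^2)
mse = np.mean((target - predicted) ** 2)
print(mse)  # (100 + 25 + 25) / 3 = 50.0
```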
For this exercise, we will be using the sigmoid function and its derivative to build a neural network.
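As a sketch, the two helpers below show the sigmoid and the usual form of its derivative; note the common convention of writing the derivative in terms of the sigmoid's own output.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(s):
    # Assumes s is already a sigmoid output, so the derivative is s * (1 - s).
    return s * (1.0 - s)
```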
In forward propagation, the data in the input layer L0 is passed through the hidden layer L1, and a predicted output L2 is calculated and compared to the target value y. The difference between L2 and y is then stored in l2_error.
Backpropagation occurs when the error signal is propagated backward through the network to update the weights in each layer. Separate variables represent the error signal for the output and hidden layers: l2_delta for the former and l1_delta for the latter.
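Putting the two passes together, here is a minimal end-to-end sketch in the spirit of this exercise. The toy dataset, layer sizes, random initialization, and the bare (learning-rate-free) weight update are assumptions made for illustration; only the variable names l0, l1, l2, l2_error, l2_delta, and l1_delta follow the description above. Because the data and initialization are made up, the loss values it prints will not match the numbers reported below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(s):
    return s * (1.0 - s)      # s is already a sigmoid output

# Toy inputs and targets, assumed for illustration.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1   # weights L0 -> L1
syn1 = 2 * np.random.random((4, 1)) - 1   # weights L1 -> L2

for i in range(10000):
    # Forward propagation: L0 -> L1 -> L2
    l0 = X
    l1 = sigmoid(l0.dot(syn0))
    l2 = sigmoid(l1.dot(syn1))

    # How far off is the prediction?
    l2_error = y - l2

    # Backpropagation: error signal for the output layer...
    l2_delta = l2_error * sigmoid_derivative(l2)
    # ...propagated back to the hidden layer.
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * sigmoid_derivative(l1)

    # Weight updates
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

print(np.mean(l2_error ** 2))   # mean squared error after training
```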
The model starts off with an already low loss of around 0.0025. From 0 to 3,000 iterations, the loss shows a consistent but minuscule decrease over each interval, sitting at around 0.0024.
This trend continues through 7,000 iterations, where the loss still hovers around 0.0024.
At 8,000 iterations, however, the loss drops by a notably larger amount, falling from around 0.0024 to around 0.0019 before the run ends at 10,000 iterations.