I learn and remember best with the story format. Unluckily for me, the inner machinations of neural networks are a saga that my 'mind gut' finds harder to digest than fried squid (anything fried, really).
To harness the full potential of artificial neural networks, we need to give careful consideration to the mathematical operations performed within their nodes and to the techniques used to refine their learning process. By strategically selecting and combining these elements, a network can extract complex patterns from vast amounts of data, enabling machines to output intelligent results.
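As a quick illustration of what one of those nodes actually computes, here is a minimal NumPy sketch of a single neuron: a weighted sum of its inputs plus a bias, pushed through an activation function. The weights, bias, and inputs here are made up purely for the example.

```python
import numpy as np

def tanh_neuron(inputs, weights, bias):
    """One node: a weighted sum of its inputs plus a bias,
    passed through the Tanh activation function."""
    z = np.dot(weights, inputs) + bias   # linear combination
    return np.tanh(z)                    # non-linearity

# Hypothetical values, purely for illustration
x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.2                          # learned bias
print(tanh_neuron(x, w, b))      # a single activation value
```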
Underneath is a spiral scatter plot, a more complex structure than the concentric-circle and clear-cluster plots. To create a model with minimal loss for it, we need a neural network that feeds all 7 available features into 8 neurons in a single hidden layer.
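The playground wires this up for us, but an equivalent architecture sketched in Keras might look roughly like the following. Only the 7 input features, the single hidden layer of 8 Tanh neurons, and the binary spiral classification come from the setup above; the sigmoid output unit, SGD optimizer, and learning rate value are assumptions for the sketch.

```python
import tensorflow as tf

# Roughly the setup described above: 7 input features feeding a single
# hidden layer of 8 Tanh neurons. The sigmoid output unit, SGD optimizer,
# and 0.03 learning rate are assumptions made for this sketch.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(7,)),
    tf.keras.layers.Dense(8, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.03),
    loss="binary_crossentropy",
)
```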
Running the network with that many features and neurons provided some interesting insights into the loss output:
Convergence at 0 epochs: starts at around 0.61 test loss and 0.65 training loss.
Convergence around 160 to 290 epochs: from around 0.26 test loss and 0.13 training loss to around 0.07 test loss and 0.03 training loss. The output pattern is erratic but follows a rapid downward trend. This might suggest effective learning in the early stages, but also potential overfitting, as the model starts to focus on noise in the training data rather than the underlying patterns.
Convergence around 900 epochs: around 0.02 test loss and 0.01 training loss. The total loss is now changing only around the third decimal place, though calling this convergence might be premature.
Convergence around 1,100 epochs and after: around 0.02 test loss and 0.01 training loss. The convergence rate greatly decreases, and afterwards the test loss continuously increases while the training loss keeps decreasing. This strongly indicates overfitting.
The learning rate of a neural network determines how far its weights change at each optimization step while minimizing the loss function. In the context of this plot, a lower learning rate increases the model's stability and ability to converge.
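Written as a tiny sketch (the function and variable names are illustrative, not taken from any particular library), the role the learning rate plays in each weight update looks like this:

```python
def gradient_descent_step(weights, gradients, learning_rate):
    """Nudge each weight against its loss gradient.
    A large learning_rate takes big steps that can overshoot the minimum;
    a small one converges more stably but more slowly."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

# Illustrative numbers: the same gradients, two very different step sizes
print(gradient_descent_step([0.5, -0.3], [0.2, -0.1], learning_rate=1.0))
print(gradient_descent_step([0.5, -0.3], [0.2, -0.1], learning_rate=0.03))
```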
Setting the previous neural network's learning rate to 1 generated the following insights:
Convergence at 0 epochs: around 0.54 test and 0.58 training loss.
Convergence around 250 epochs: around 0.39 test and 0.23 training loss, a substantial drop in loss. Might suggest that the high learning rate allows the model to make significant progress initially.
Convergence around 500 epochs: around 0.40 test and 0.24 training loss. Might suggest that the model is struggling to converge due to the large steps taken during optimization.
Convergence around 1000 epochs: around 0.33 test and 0.21 training loss. Overall, the fluctuating test and training loss values suggest potential overfitting, as the model is likely capturing noise in the training data.
The erratic behavior and inconsistent convergence pattern observed with such a high learning rate indicate a hyperparameter that is too aggressive for the given dataset and model architecture. For a complex plot such as the spiral dataset, we likely need a more conservative learning rate.
To recap, an activation function is a mathematical function applied to the output of a neuron. It introduces non-linearity into the neural network, enabling it to learn complex patterns in the data. Without one, a neural network would simply be a linear model, regardless of its depth.
In further exploration of minimizing loss, we can switch the activation function from Tanh to a different one. This run will use the Sigmoid function, which outputs values between 0 and 1, making it suitable for probability-based outputs or binary classification tasks.
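For reference, the two activation functions being compared can each be written in a line of NumPy; the key difference for this experiment is the output range. This is a generic sketch, not the playground's internal code.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes any real input into the range (-1, 1)."""
    return np.tanh(z)

print(sigmoid(0.0), tanh(0.0))  # 0.5 and 0.0 -- the two functions are centered differently
```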
Adjusting the previous neural network's activation function to Sigmoid generated the following insights:
Convergence at 0 epochs: around 0.52 test loss and 0.48 training loss. Comparatively lower than both previous neural networks.
Convergence around 250 epochs: around 0.05 test loss and 0.05 training loss. This strongly indicates effective learning and a suitable learning rate.
Convergence around 500 epochs: around 0.02 test loss and 0.01 training loss, where convergence rate slows down afterwards. Might indicate that the model is approaching its optimal performance, although training loss decreasing faster than test loss suggests a risk of overfitting.
This Sigmoid neural network converges to a lower total loss sooner than the previous model did.
Back on 18th July, we talked about what regularization is: a technique used to prevent overfitting. It typically involves adding a penalty term to the loss function, which discourages the model from learning overly complex patterns.
To refresh our brains, L1 refers to lasso regression, which encourages sparsity and sets some coefficients to 0, and L2 refers to ridge regression, which shrinks coefficients without setting any to 0.
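In code, the only real difference between the two is which norm of the weights gets added to the loss. Here is a minimal sketch, assuming a regularization rate called lam and a precomputed base loss (both names are mine, not the playground's):

```python
import numpy as np

def regularized_loss(base_loss, weights, lam, kind="l2"):
    """Add an L1 (sum of absolute values) or L2 (sum of squares) penalty,
    scaled by the regularization rate lam, to the unregularized loss."""
    if kind == "l1":
        penalty = np.sum(np.abs(weights))  # pushes some weights to exactly 0
    else:
        penalty = np.sum(weights ** 2)     # shrinks weights without zeroing them
    return base_loss + lam * penalty
```

Note that a regularization rate of 0 makes the penalty term vanish entirely, so the rate-0 runs below behave like unregularized networks.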
Starting with an L2 network at a regularization rate of 0, here are the key observations:
Convergence at 0 epochs: around 0.56 test loss and 0.55 training loss.
Convergence around 140 to 180 epochs: loss holds steady at around 0.04 test loss and 0.09 training loss.
Convergence at 350 epochs: around 0.04 test loss and 0.08 training loss. A sudden, notable decrease, then a massive increase, before reaching optimal loss.
For an L2 network with a regularization rate of 0.001:
Convergence at 0 epochs: around 0.70 test loss and 0.64 training loss.
Convergence around 30 to 80 epochs: from around 0.30 test loss and 0.21 training loss to around 0.22 test loss and 0.15 training loss. The output drops diagonally, if somewhat spikily, then becomes consistently erratic within the same loss range, and finally spikes upward before dropping back to pre-ascent values.
Convergence around 120 to 880 epochs: from around 0.23 test loss and 0.23 training loss to 0.01 test loss and 0.01 training loss. The output is extremely erratic at the start, progresses to smaller and less frequent spikes in loss, and culminates in optimal training convergence alongside a slowly decreasing test loss.
For an L1 network with a regularization rate of 0.001:
Convergence at 0 epochs: around 0.65 test loss and 0.67 training loss.
Convergence around 40 to 260 epochs: from 0.33 test loss and 0.20 training loss to 0.27 test loss and 0.16 training loss. The output becomes unpredictably erratic, dropping by as much as 0.1 before reascending by a near-equal amount.
Convergence around 290 to 380 epochs: from 0.28 test loss and 0.15 training loss to 0.28 test loss and 0.14 training loss. The loss value remains consistent for the epoch being, aside from a few short-lived upward spikes.
Convergence around 440 to 630 epochs: from 0.40 test loss and 0.23 training loss to 0.39 test loss and 0.21 training loss. The loss value skyrockets, then holds at a consistent level until around the end of this range.
Convergence around 700 to 2,500 epochs: from 0.36 test loss and 0.20 training loss to 0.38 test loss and 0.19 training loss. After a slow decrease in loss, the output suddenly settles into a consistent pattern of erratic spikes in both directions. The highest spike of this erratic convergence is around 0.42 test loss and 0.27 training loss.
For an L1 network with a regularization rate of 0:
Convergence at 0 epochs: around 0.66 test loss and 0.73 training loss.
Convergence around 30 to 110 epochs: from around 0.40 test loss and 0.24 training loss to around 0.16 test loss and 0.07 training loss. Marks the beginning of a slightly erratic descent in loss, punctuated by short-lived pulses of spikes.
Convergence around 110 to 340 epochs: from around 0.23 test loss and 0.23 training loss to 0.18 test loss and 0.05 training loss. The loss value remains consistent over the next couple of hundred epochs.
Convergence around 350 to 410 epochs: from around 0.17 test loss and 0.05 training loss to 0.16 test loss and 0.06 training loss. A sudden, erratic increase in loss occurs but returns to the previous value after several epochs.
Convergence around 420 to 1,000 epochs: from around 0.15 test loss and 0.06 training loss to 0.15 test loss and 0.05 training loss. Other than a brief upward spike in loss, an unchanging optimal convergence has been achieved.
Comparing the L1 and L2 networks' results, the L1 neural network with a regularization rate of 0 generated the most stable convergence, with consistent loss values. It also has the lowest optimal loss values: 0.15 for testing and 0.05 for training.