Hyperparameters are settings external to the model that are fixed before a neural network (NN) starts training and that control its training process.
Nonlinear activation functions are crucial in NNs because they allow the model to learn and represent complex and nonlinear relationships in data, which would be impossible with only linear transformations.
If an NN has no activation function, the output of a neuron is simply Σ_i w_(i,j) x_i + b_j, which is just a linear model. Refer to McCulloch and Pitts' simple neuron below.
Prominent desirable characteristics for an activation function in NNs include:
No vanishing gradient problem: Gradients should not shrink towards 0.
Zero-centred: Symmetric around 0 so the gradient is not biased in a particular direction.
Computationally inexpensive: The activation function is computed many times, especially in large DNNs.
Differentiable: To be able to calculate the gradient for the backpropagation in the gradient descent process.
Gradients are calculated in the backpropagation process to update the weights in the desired direction. As the gradients travel back towards the earliest layers, they can become smaller and smaller in value until some are close to 0. This is the vanishing gradient problem, where very small gradients can slow down or stop the learning process.
At the polar opposite of vanishing gradients, exploding gradients occur when gradients become larger and larger as backpropagation progresses. This event can cause learning to become unstable and diverge.
When applying a saturating activation function in a large NN, the gradients might vanish and the network might take a long time to train. This is because such activation functions have small gradients in their 'saturated' regions, leading to minimal weight updates during backpropagation, especially in the layers farthest from the output.
Saturating activation functions such as sigmoid flatten out for very large or very small input values: the output becomes close to a constant (0 or 1 for sigmoid), and the derivative (which determines the gradient) becomes very small.
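As a rough numerical illustration (a minimal NumPy sketch with made-up layer counts and local derivative values, not a real network), multiplying many small per-layer derivatives during backpropagation drives the overall gradient towards 0, while multiplying many large ones makes it blow up:

```python
import numpy as np

# Toy illustration: the gradient reaching an early layer is (roughly) a product
# of per-layer local derivatives collected during backpropagation.
layers = 30

small_local_grads = np.full(layers, 0.25)   # e.g. sigmoid's maximum derivative
large_local_grads = np.full(layers, 1.8)    # hypothetical derivatives > 1

print(np.prod(small_local_grads))  # ~8.7e-19 -> vanishing gradient
print(np.prod(large_local_grads))  # ~4.6e+07 -> exploding gradient
```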
The linear function represents the simplest possible 'activation' or lack thereof, as it outputs the input directly. In NNs using the linear function, stacks of linear layers collapse into a single linear transformation W_2(W_1 x) = (W_2 W_1) x, making deep networks no more powerful than shallow ones.
This limitation forced researchers to develop nonlinear activation functions to unlock the power of neural networks.
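A quick NumPy sketch (with arbitrary small matrices) of why stacking linear layers adds no expressive power: two linear layers applied in sequence are equivalent to one linear layer whose weight matrix is their product:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector
W1 = rng.normal(size=(4, 3))    # first linear layer (no activation)
W2 = rng.normal(size=(2, 4))    # second linear layer (no activation)

stacked = W2 @ (W1 @ x)          # two "layers" applied in sequence
collapsed = (W2 @ W1) @ x        # one equivalent linear layer

print(np.allclose(stacked, collapsed))  # True: the stack is just one linear map
```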
Sigmoid takes a real value as input and outputs another value between 0 and 1. As the function is nonlinear in nature, combinations of this function are also nonlinear. The output of the activation function is always going to be in range (0, 1) compared to (-inf, inf) of the linear function, so activations are bound in a range and do not blow up.
However, towards either end of the sigmoid function, the output responds very little to changes in x, a sign of the vanishing gradient problem. Its output is also not zero-centred, which can make gradient updates go too far in different directions. Having 0 < output < 1 makes optimization harder as sigmoid saturates and kills gradients. These issues culminate in a network that refuses to learn further or learns drastically slowly, depending on the use case, until gradients or computations hit floating-point value limits.
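For concreteness, a minimal NumPy sketch of sigmoid and its derivative; the sample inputs are arbitrary and chosen only to show how the gradient collapses in the saturated tails:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # maximum 0.25 at x = 0

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    # Outputs stay in (0, 1); gradients shrink towards 0 away from the origin.
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```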
An abbreviation for Rectified Linear Unit, ReLU uses the formula f(x) = max(0, x) to generate an output that is either nothing or something (0 or any positive value). As a result, the output has a range of [0, ∞). It avoids the vanishing gradient problem for positive inputs, is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations, and learns comparatively faster because the slope does not go to 0.
The most critical weakness of ReLU is the dying ReLU problem, where neurons become permanently inactive during training and receive no further updates during backpropagation. ReLU's output range [0, ∞) also allows activations to grow unbounded, especially in deep networks, and makes it unsuitable for output layers in tasks requiring bounded predictions (e.g., classification probabilities or bounded regression targets). Finally, ReLU's surviving neurons can overcompensate and create non-smooth gradients (e.g., a neuron switching between active [x > 0] and inactive [x ≤ 0]), leading to unstable 'zigzagging' training dynamics.
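A minimal sketch of ReLU and its gradient over a few arbitrary inputs; the gradient is exactly 0 for x ≤ 0, which is what lets a neuron stuck in the negative region stop updating (the dying ReLU problem):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)   # 1 for x > 0, 0 otherwise

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.] -> no gradient flows for x <= 0
```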
An abbreviation for Exponential Linear Unit, ELU is a descendant of ReLU that uses the formula R(z) = {z, if z > 0; α(e^z - 1), if z <= 0}. It is built to address its parent's weaknesses by introducing negative values and a smooth curve for negative inputs, potentially mitigating the vanishing gradient problem and improving training.
However, ELU only mitigates the problems it inherits from ReLU and can still saturate for very negative inputs. Its more computationally expensive exponential function also means ELU trains slower than its parent.
Leaky ReLU is a variant of ReLU that outputs the input directly if it is positive, but for negative inputs it outputs a small negative value. Its formula is f(x) = {x, if x > 0; αx, if x <= 0}. The small gradient (α) for x ≤ 0 keeps neurons active, avoids the zigzag updates seen in ReLU by allowing gradient flow for all inputs, and its negative values (controlled by α) improve normalization.
Keep Leaky ReLU's weaknesses in mind, though. Users have to tune the hyperparameter α per function instance, though it is often fixed (e.g., α = 0.01). Compared to its parent ReLU, its non-zero gradients for x ≤ 0 reduce sparsity. The last weakness is a relatively minor one: since the Leaky ReLU formula involves an extra multiplication by α, there is a (negligible) decrease in performance in practice.
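Both ReLU variants, sketched in NumPy with commonly used default α values (illustrative choices, not prescriptions); unlike ReLU, both keep a non-zero output, and Leaky ReLU a non-zero gradient, for negative inputs:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Smooth negative branch: alpha * (e^x - 1) for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def leaky_relu(x, alpha=0.01):
    # Small linear slope alpha for x <= 0 keeps the neuron trainable
    return np.where(x > 0, x, alpha * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))         # negative inputs saturate towards -alpha
print(leaky_relu(x))  # negative inputs are scaled by alpha, never zeroed out
```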
The softmax activation function converts a vector of raw scores (logits) into a probability distribution. It ensures that each output value is in the range (0, 1) and that the sum of all outputs equals 1. Because the outputs are coupled through this normalization, softmax directly models a single choice among competing classes, which makes it easy to implement and interpret for single-label, multi-class outputs.
Not every use case has mutually exclusive classes, though. Softmax's nature means that it cannot model independent, multi-label outputs directly (e.g., tagging images with multiple labels). It also saturates for large positive or negative logits, facing the same vanishing gradient problem as other saturating activations, and its always positive outputs can complicate optimization (e.g., gradient updates oscillate).
Sigmoid and softmax share core similarities in their ability to model probabilities, reliance on exponentials, and challenges with gradient saturation and numerical stability. Their shared properties make them foundational tools for classification tasks in neural networks.
To choose which one to use: sigmoid excels in binary and multi-label tasks with independent outputs but suffers from vanishing gradients and optimization challenges, whereas softmax is tailored for multi-class problems with mutual exclusivity but requires careful handling of numerical stability and computational cost.
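A minimal NumPy sketch of softmax over a hypothetical logit vector; subtracting the maximum logit before exponentiating is the standard trick for the numerical stability concern mentioned above:

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)      # numerical stability: avoid exp overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])         # hypothetical raw scores for 3 classes
probs = softmax(logits)
print(probs)            # approximately [0.659 0.242 0.099]
print(probs.sum())      # 1.0: a proper distribution over mutually exclusive classes
```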
Mentioned briefly in the ReLU discussion, zigzag learning describes training instability in which gradient updates repeatedly alternate direction, manifesting as volatile loss curves and slow convergence. The model settles into a nonlinear, iterative pattern of learning and development, characterized by periods of progress, setbacks, and changes in direction, rather than a straight, upward trajectory.
Zigzag learning happens when biased outputs force gradients to correct in alternating directions, extreme initial values push neurons into saturation, inputs or activations with varying scales create conflicting gradient magnitudes, or all the above.
If all weights and biases in a network are initialized to the same value, it suffers from symmetry issues, severely hindering its ability to learn. To summarize its effects for each component in a standard NN:
Forward propagation: All neurons in a layer compute the same output, rendering the network incapable of capturing distinct patterns or features in the data.
Backpropagation: Gradients for all weights and biases in a layer are identical because their contributions to the loss are the same. Even after updates, weights remain the same, and neurons stay redundant.
Network degradation: The network behaves as if it has far fewer unique neurons, drastically limiting its ability to model complex relationships. This lack of diversity in neurons leads to suboptimal local minima, where the network fails to learn meaningful features.
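A small NumPy sketch of the symmetry problem (two inputs, two hidden units, made-up data): with identical initial weights, both hidden units compute the same output and receive the same gradient, so they remain clones after every update:

```python
import numpy as np

x = np.array([0.5, -1.0])            # one hypothetical input sample
W = np.full((2, 2), 0.3)             # every weight initialized to the same value
b = np.zeros(2)

h = np.tanh(W @ x + b)               # both hidden units produce identical outputs
print(h)                             # e.g. [-0.149 -0.149]

upstream = np.array([1.0, 1.0])      # pretend gradient flowing back from the loss
grad_W = np.outer(upstream * (1 - h**2), x)
print(grad_W)                        # identical rows -> identical updates, neurons stay redundant
```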
Random initialization improves neural networks' generalization by:
Symmetry breaking: Random initialization assigns unique starting weights to each neuron, ensuring diverse feature learning by specializing neurons in detecting different patterns (e.g., edges and textures) and avoiding duplicated computations to maximize network capacity.
Gradient stability: Techniques like He or Xavier initialization scale weights based on layer size to maintain a consistent variance of activations and gradients across layers (see the sketch after this list). Smoothing gradient flow during backpropagation in this way prevents vanishing or exploding gradients and enables efficient optimization, helping the network converge to better minima.
Loss landscape exploration: Random starting points allow the network to explore diverse regions of the loss landscape, escaping suboptimal solutions. Stochasticity from random initialization, combined with mini-batch noise, encourages discovery of flatter minima, which generalize better to unseen data.
Implicit regularization: By preventing neurons from co-adapting too strongly, random initialization promotes simpler, more general models. Instead of memorizing noise, the network is forced to learn robust, generalizable features.
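A minimal sketch of the two scaling rules named above, Xavier/Glorot and He, as NumPy draws; the layer sizes are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)
fan_in, fan_out = 512, 256           # placeholder layer sizes

# Xavier/Glorot: variance scaled by both fan-in and fan-out (suits tanh/sigmoid)
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

# He: variance scaled by fan-in only (suits ReLU-family activations)
W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

print(W_xavier.std(), W_he.std())    # roughly 0.051 and 0.062
```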
Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. As a result, the model is useful in reference only to its initial data set, and not to any other data sets.
Think of an average student painstakingly memorizing the answers to worked examples in the learning material instead of truly learning the underlying knowledge. In a test that generalizes across the topic, they are unlikely to answer most, if not all, of the questions correctly.
One way to deal with overfitting is with regularization. This type of technique prevents overfitting by constraining the model's complexity, either by penalizing large parameter values or altering the network's structure during training. A key mechanism involves keeping some neurons "close to 0," reducing their influence and forcing the network to rely on a broader set of features.
Limiting weight values in a neural network ensures that the pre-activation values (weighted sums of inputs) remain within the linear region of activation functions like tanh. Doing so preserves gradient flow during backpropagation, stabilizes training, and enhances generalization — striking a balance between model flexibility and robustness.
L1 regularization (or lasso regression) adds the absolute values of coefficients λ * Σ|coefficients| to the loss function, where λ is the regularization coefficient. This tends to force some coefficients to become exactly zero, eliminating less important features to produce models with few non-zero coefficients (sparse models).
L2 regularization (or ridge regression) adds the squared values of coefficients λ * Σ(coefficients²) to the loss function. This shrinks coefficients towards zero, but rarely forces them to be exactly zero, reducing the impact of less important features without eliminating them entirely to help prevent high correlation between features (multicollinearity).
L2 regularization can be combined with L1 regularization into Elastic Net regularization. It can handle situations where features are highly correlated like L2 regularization and force some coefficients to be exactly zero like L1 regularization.
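For illustration, a hedged scikit-learn sketch (assuming scikit-learn is available; the synthetic data and the λ values passed as alpha are made up) showing the three penalties side by side:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # synthetic features
y = X[:, 0] * 3.0 + rng.normal(size=100)       # only the first feature matters

lasso = Lasso(alpha=0.1).fit(X, y)                      # L1: drives coefficients to exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)                      # L2: shrinks coefficients, rarely to 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of both penalties

print(np.count_nonzero(lasso.coef_),
      np.count_nonzero(ridge.coef_),
      np.count_nonzero(enet.coef_))            # Lasso is typically the sparsest
```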
While Elastic Net aims to combine the best of both regularization techniques, it does come with some potential drawbacks compared to using either alone:
Increased complexity: Elastic Net introduces an additional hyperparameter, l1_ratio, which users need to optimize manually. While still a linear model, the presence of both regularization terms can make it a bit harder to interpret than a model with just one.
Higher computational cost: Needing to optimize two regularization terms is computationally more expensive than addressing just one pure L1 or L2 model.
Potential for overfitting in specific scenarios: In situations where either L1 or L2 is clearly the optimal choice, Elastic Net might introduce unnecessary complexity and potentially lead to slightly worse performance.
Less clear-cut feature selection: While Elastic Net can perform feature selection, it might not be as aggressive as Lasso in forcing coefficients to be exactly zero, especially when the l1_ratio leans more towards L2. This can be a disadvantage if you need a very sparse model with a small number of features.
Dropout involves randomly deactivating neurons of a model throughout the training process. This method prevents overfitting in neural networks by forcing the model to learn more robust and generalizable features, as it effectively trains multiple 'subnetworks' and prevents reliance on specific neurons.
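A rough Keras sketch (assuming TensorFlow/Keras is installed; the layer sizes and the 0.5 rate are illustrative) of dropout inserted between dense layers; the deactivation only happens during training:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A tiny illustrative classifier with dropout between hidden layers.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                # randomly zeroes 50% of activations per training step
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```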
Early stopping prevents overfitting by halting the training process when the model's performance on a validation set starts to degrade, rather than training for a fixed number of iterations.
The hyperparameters for the early stopping method in the Keras module can be adjusted, but here are some factors to help you decide on their value:
Monitor performance: Track validation loss or task-specific metrics (e.g., accuracy, precision, or F1-score) as the primary indicator of model generalization. A rising loss signals overfitting.
Trigger condition: The simplest trigger is an increase in the monitored loss compared to the previous iterations. More elaborate triggers include no improvement over several epochs, or an absolute or average change in a metric over several epochs.
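A hedged Keras sketch of the EarlyStopping callback discussed above; the monitored metric, patience, and min_delta values are illustrative choices, not recommendations:

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",         # track validation loss as the generalization signal
    min_delta=1e-4,             # smallest change counted as an improvement
    patience=5,                 # stop after 5 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch seen so far
)

# Passed to training alongside a validation split, e.g.:
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```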
Overfitting can happen if there is not enough data to train all parameters. Preprocessing techniques such as data augmentation can increase the quantity of training data without explicitly modifying the learning algorithm. These operations aim to increase the diversity of the data, especially for images, and include rotating, flipping, scaling, and adding noise to images.
Keep in mind that data augmentation can conversely result in underfitting if the generated data is irrelevant to the task at hand.
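A brief Keras sketch of image augmentation (again assuming TensorFlow/Keras; the transformation ranges are arbitrary examples) that generates rotated, flipped, rescaled, and noised variants of the training images on the fly:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Augmentation pipeline applied to image batches; these layers are active only during training.
augment = keras.Sequential([
    layers.RandomFlip("horizontal"),     # flipping
    layers.RandomRotation(0.1),          # rotating by up to +/-10% of a full turn
    layers.RandomZoom(0.1),              # scaling
    layers.GaussianNoise(0.05),          # noising
])

# Usage inside a model, e.g.: x = augment(inputs) before the convolutional layers.
```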
Terms:
Optimal error rate: Minimum possible error rate achievable by any model for a given task, determined by inherent noise in the data or fundamental ambiguities (e.g., overlapping classes). This is the unavoidable baseline error; no model can perform better than this.
Avoidable bias: Difference between training error and optimal error rate. Occurs when the model is too simple (e.g., underfitting) or lacks capacity to learn the true patterns in the training data. High value means the model is underperforming even on the training set.
Variance: Difference between validation and training error. Occurs when the model is too complex (e.g., overfitting), capturing noise in the training data that does not generalize. High value means the model performs well on training data but poorly on unseen validation data.
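As a hypothetical worked example: if the optimal error rate is 2%, the training error is 8%, and the validation error is 12%, then the avoidable bias is 8% - 2% = 6% (address underfitting first) and the variance is 12% - 8% = 4% (then address overfitting).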
Batch normalization is a technique used in NNs to improve training speed and stability by normalizing the activations of each layer during training. It works by calculating the mean and variance of the activations within a mini-batch and then using these statistics to normalize the activations. To break down its processes:
Compute mean of the layer's inputs μ.
Compute variance of the layer's inputs σ^2.
Compute normalized input value z_norm = (z - μ) / √(σ^2 + ϵ) for each neuron in the layer.
Compute the scaled and shifted output ~z = γ * z_norm + β, where γ and β are learnable parameters.
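The four steps above as a minimal NumPy sketch over one hypothetical mini-batch (the γ, β, and ϵ values are illustrative; a real layer would also track running statistics for inference):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                        # 1. per-neuron mean over the mini-batch
    var = z.var(axis=0)                        # 2. per-neuron variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)     # 3. normalize each neuron's input
    return gamma * z_norm + beta               # 4. scale and shift with learnable gamma, beta

z = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))  # mini-batch of 32, 4 neurons
out = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 mean, unit std
```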
Batch normalization has a regularizing effect due to its use of mini-batch statistics (mean and variance) rather than the entire dataset. This introduces noise into the neuron and activation values, similar to the effect of dropout. Consequently, batch normalization also reduces the impact of weight initialization on the training process.
The absence of batch normalization in deep neural networks can lead to several significant consequences that hinder training and model performance:
Slower learning: Without normalization, the distribution of inputs to each layer can change significantly during training as the parameters of the preceding layers are updated. This phenomenon is known as internal covariate shift. Each layer then has to constantly adapt to a shifting input distribution, making the learning process much slower and requiring careful tuning of learning rates.
Vanishing or exploding gradients: Unnormalized inputs can lead to activations that saturate nonlinear activation functions (e.g., sigmoid or tanh). In saturation regions, gradients either vanish and become very small or explode and become very large, making it difficult for the error signal to propagate effectively through the network and for weights to be updated appropriately.
Sensitivity to weight initialization: Performance of networks without batch normalization becomes highly dependent on initial values assigned to weights. Poor initialization can exacerbate either gradient problem, making training unstable or even impossible.
Difficulty in using higher learning rates: Shifting input distributions and potential for exploding gradients limit the use of higher learning rates. Smaller learning rates are necessary to maintain stability, further slowing down training process.
Network stagnation: In some cases, combination of internal covariate shifts and gradient issues can cause the network to get stuck in suboptimal solutions or plateaus, preventing it from learning effectively.
Imagine you are trying to teach a group of students a complex subject, and each student is in a classroom with drastically different lighting conditions that change randomly throughout the lesson. Matching the scenario's factors with each risk:
Slower learning: If lighting keeps changing (internal covariate shift), each student has to constantly adjust their eyes and focus, making it harder for them to concentrate on the actual material being taught. They spend more energy adapting to the environment than learning the subject, slowing down their overall progress.
Vanishing or exploding gradients: If lighting becomes extremely dim or blindingly bright (saturation), students will either struggle to see the board or be overwhelmed, making it impossible for them to absorb information and understand core concepts (gradients become too small or too large to guide learning).
Sensitivity to weight initialization: If some students start with very poor eyesight (bad weight initialization), fluctuating lighting will make it even harder for them to learn compared to students with good initial vision.
Difficulty in using higher learning rates: You cannot speak too quickly or introduce new concepts too rapidly (high learning rate) because students are already struggling with constantly changing and potentially extreme lighting conditions. You have to proceed very slowly and carefully to ensure they do not get completely lost.
Network stagnation: Constant struggle with environment might lead to students becoming frustrated and giving up on learning the subject altogether, even if they had the potential to understand it under more stable conditions.
Learning rate α is often considered the most critical hyperparameter to tune in neural network training; it decides how much a model's parameters (like NN weights) are adjusted during each step of the training process.
Determining the optimal value beforehand is challenging, necessitating empirical exploration through trial and error within a common range (10^-6 < α < 1.0). Techniques such as grid search, random search, and more advanced optimization methods are employed to find a suitable learning rate. Optimal learning rate is intrinsically linked to the curvature or slope of loss function:
Small learning rate: Leads to slow learning and increases risk of getting stuck in local minima.
Large learning rate: Increases risk of overshooting the minimum, especially in regions with steep slopes.
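A toy Python sketch (a 1-D quadratic loss with arbitrary rates) of this trial-and-error behaviour: a tiny rate crawls, a moderate one converges, and an overly large one overshoots and diverges:

```python
def run_gd(lr, steps=50):
    w = 5.0                       # start far from the minimum of loss(w) = w^2 at w = 0
    for _ in range(steps):
        grad = 2.0 * w            # d/dw of w^2
        w -= lr * grad
    return w

for lr in (1e-4, 1e-1, 1.1):
    print(f"lr={lr}: w after 50 steps = {run_gd(lr):.4g}")
# lr=0.0001 barely moves, lr=0.1 lands near 0, lr=1.1 blows up
```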
Two ways to adjust learning rates during model training runtime are learning rate decay and adaptive learning rate strategies. To briefly describe what both can do for NNs:
Learning rate decay: Gradually decreases learning rate over time. For example, linear decay or linear decay followed by a constant rate.
Adaptive learning rate strategy: Monitors model's performance and dynamically adjusts learning rate. Typically reduces learning rate when performance plateaus and may increase it if performance fails to improve for a certain number of iterations.
A common and often beneficial practice is to decay learning rate as training progresses. As the optimization process approaches a minimum in cost function, gradient (slope) typically becomes less steep. Decaying learning rate then allows the algorithm to take smaller, more precise steps, preventing overshooting and facilitating convergence to a better optimum.
Keep in mind that this progression is not straightforward. Two of various strategies propose:
Time-based: Linear decay, exponential decay, etc., each requiring tuning of specific coefficients.
Step-based: Reducing learning rate by a fixed factor (e.g., 50%) after a set number of epochs (e.g., every 10 epochs).
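Two such schedules, sketched as plain Python functions with made-up coefficients (the initial rate, decay constant, drop factor, and epoch interval would all need tuning in practice):

```python
def time_based_decay(epoch, lr0=0.1, k=0.01):
    # Time-based: the learning rate shrinks smoothly with the epoch index.
    return lr0 / (1.0 + k * epoch)

def step_decay(epoch, lr0=0.1, drop=0.5, epochs_per_drop=10):
    # Step-based: halve the learning rate every 10 epochs.
    return lr0 * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 10, 20, 30):
    print(epoch, round(time_based_decay(epoch), 4), round(step_decay(epoch), 4))
```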
Consequently, an excessively aggressive decay schedule can prematurely slow down progress towards the optimum, while a slow-paced decay schedule may lead to chaotic updates and only marginal improvements.
Fortunately, adaptive learning rate optimization algorithms can dynamically adjust the learning rate during training, often mitigating the need for manual decay schedule tuning. Such algorithms include Adagrad, RMSProp, AdaDelta, and ADAM. These adaptive strategies generally outperform training with fixed learning rates, especially when said fixed rate is not optimally tuned.
Oscillations during the learning process can significantly slow down training and increase the risk of overshooting the minimum, particularly when the learning rate is set too high. The momentum optimization technique addresses this issue by incorporating an exponentially decaying weighted average of past gradients into the weight update, which helps smooth out oscillations and accelerate convergence, especially when used with mini-batch gradient descent.
Without momentum, weight update w ← w - α∇_w g(w) directly follows the gradient of loss function at the current step, leading to potentially erratic movements (oscillations).
With momentum, the update involves a velocity term v_t that accumulates gradients over time, computed as v_t = β * v_(t - 1) + (1 - β) * ∇_w g(w), where v_(t - 1) is the velocity from the previous time step, β is the momentum hyperparameter controlling the decay rate of past gradients, and ∇_w g(w) is the current gradient. The weight is then updated with w ← w - α * v_t along the direction of the accumulated velocity, resulting in a smoother trajectory towards the minimum.
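The momentum update above as a minimal NumPy sketch with illustrative values of α and β; grad_fn stands in for whatever computes the current gradient ∇_w g(w):

```python
import numpy as np

def momentum_step(w, v, grad_fn, alpha=0.01, beta=0.9):
    grad = grad_fn(w)                      # current gradient of the loss
    v = beta * v + (1.0 - beta) * grad     # exponentially decaying average of past gradients
    w = w - alpha * v                      # move along the accumulated velocity
    return w, v

# Toy usage on loss(w) = ||w||^2, whose gradient is 2w.
w, v = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, v = momentum_step(w, v, lambda w: 2.0 * w)
print(w)   # close to the minimum at the origin
```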
The central vector diagram in the example below shows how 'momentum step' (scaled previous velocity) combines with 'gradient step' to produce a more direct 'actual step.'
Imagine pushing a heavy ball up a slightly bumpy hill. Without momentum, you push directly uphill at each step. If there is a small bump or you waver, the ball might roll back slightly, making slow, jerky progress.
With momentum, however, you give the ball an initial strong push. Even when you encounter small bumps or briefly ease off the force, said ball's existing momentum helps it roll over them and continue moving upwards more smoothly and directly towards the top. Momentum averages out small inconsistencies in your pushing, leading to a faster and more stable ascent.