The perceptron, which builds on the McCulloch–Pitts (MCP) neuron, is an algorithm for supervised learning of binary classifiers. A binary classifier is a function that decides whether or not an input, represented by a vector of numbers, belongs to some specific class.
In the equation of a perceptron:
Input x_i comes from the output of neuron i to this neuron j (or from 'outside').
Each input link has a weight w_(i, j).
There is an additional fixed input x_0 with bias weight w_(0, j) or b_j.
The total input is in_j = Σ_i w_(i, j) x_i or Σ_i w_(i, j) x_i + b_j.
The output is out_j = g(in_j) = g(Σ_i w_(i, j) x_i ) or g( Σ_i w_(i, j) x_i + b_j).
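As a minimal sketch of these equations in plain Python (the input values, weights, and bias below are made up purely for illustration):

```python
# Minimal sketch of the perceptron equations above; all numbers are made up.
inputs = [1.0, 0.5, -0.3]      # x_i, outputs of upstream neurons (or raw features)
weights = [0.4, -0.2, 0.7]     # w_(i, j), one weight per input link
bias = 0.1                     # b_j, the bias weight on the fixed input x_0 = 1

# Total input: in_j = sum_i w_(i, j) * x_i + b_j
in_j = sum(w * x for w, x in zip(weights, inputs)) + bias

# Output: out_j = g(in_j), with g a step activation for a classic perceptron
out_j = 1 if in_j > 0 else 0
print(in_j, out_j)
```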
Another, mathematically equivalent way to write the perceptron is the linear algebra equation (LAE) form. While harder to grasp conceptually than the sum-notation formula, it scales better to larger inputs and is better optimized for performance via vectorized operations, which operate on many values at once instead of looping through each value individually. Hence, it is better suited for implementation and for more advanced models.
To elaborate on the formulaic components of LAE, left to right:
x_1, x_2, x_3: input features forming the input vector x = [x_1, x_2, x_3]. Each represents the data fed into the perceptron.
z = w^T x + b: computes the weighted sum of inputs and bias to output a scalar pre-activation value z. Represents the 'raw' output before applying the activation function. Defines a hyperplane in 3D space since there are 3 inputs.
w = [w_1, w_2, w_3]: refers to the weight vector assigned to each input.
T: denotes transpose operation, converting a column vector into a row vector. Ensures proper dimensionality alignment for the dot product in the perceptron equation.
w^T x: dot product (linear combination) of weights and inputs.
b: a scalar bias term that shifts the decision boundary away from the origin.
a = g(z): applies a non-linear function to z to produce the final output a. Modern variants may use sigmoid, ReLU, or other functions for different tasks.
g(z): a step function in a classic perceptron (e.g., 1 if z > 0, else 0).
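A minimal vectorized sketch of the same computation, assuming NumPy is available (again with made-up numbers):

```python
import numpy as np

# Vectorized form z = w^T x + b, a = g(z); values are made up for illustration.
x = np.array([1.0, 0.5, -0.3])   # input vector [x_1, x_2, x_3]
w = np.array([0.4, -0.2, 0.7])   # weight vector [w_1, w_2, w_3]
b = 0.1                          # scalar bias

z = w @ x + b                    # dot product w^T x plus bias (pre-activation)
a = 1 if z > 0 else 0            # step activation g(z) for a classic perceptron
print(z, a)
```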
Imagine a perceptron as a chef preparing a dish:
The input vector x = [x_1, x_2, …, x_n] represents the ingredients used in the recipe. Each x_i is a specific ingredient (e.g., flour, sugar, eggs) and the number of ingredients (n) determines the complexity of the dish.
The weight vector w = [w_1, w_2, …, w_n] represents the proportions of each ingredient in the recipe. Each w_i determines how much of ingredient x_i to use (e.g., 2 cups of flour, 1 teaspoon of salt), analogous to a chef's secret recipe.
The bias b is like the base flavor of the dish (e.g., a broth or seasoning). It adjusts the overall taste, ensuring the dish is not too bland or overpowering. Without the bias, the dish might lack depth (e.g., no salt in a soup).
The dot product w^T x is like mixing the ingredients in the right proportions. The chef combines flour, sugar, eggs, etc., according to the recipe. This step ensures all ingredients are properly integrated.
The pre-activation z = w^T x + b is the raw mixture before cooking — the combined result of mixing ingredients and adding the base flavor. At this stage, the dish is not yet ready to serve — it needs further processing.
The activation function g(z) is like the cooking process that transforms the raw mixture into a finished dish.
For a step function: the dish is either fully cooked (1) or not (0).
For sigmoid/ReLU: the dish is partially cooked, with varying degrees of doneness.
The output a = g(z) is the final dish served to the customer — the result of all the previous steps. The customer (or the next layer in a neural network) receives the dish and decides what to do with it.
Training the perceptron is like the chef refining the recipe over time. If the dish does not satisfy the customers' tastes, the chef adjusts the proportions (w) or base flavor (b). This iterative process continues until the dish is perfect (until the perceptron makes accurate predictions).
A shallow neural network is a neural network with one or a few hidden layers between the input and output layers. It has limited depth and, in practice, tends to learn relatively simple patterns.
In formula terms, in_(z_i)^k is the input of neuron i in layer k after the input function, while out_(z_i)^k is the output of neuron i in layer k after activation.
Deep neural networks (DNNs) differ from shallow neural networks in that their neurons are distributed across many more layers: each layer typically has fewer neurons, but the network as a whole is much deeper.
This depth lets DNNs excel at learning hierarchical representations. Early layers detect edges and textures, while deeper layers capture abstract features (e.g., faces, objects).
The Universal Function Approximation Theorem (UFAT) is a foundational result in neural network theory, asserting that a two-layer neural network (one hidden layer) with a sufficient number of neurons can approximate any continuous function to arbitrary precision. This theorem requires the activation function to be nonconstant, bounded, and continuous (e.g., sigmoid or tanh).
In simpler words, a neural network with at least one hidden layer of a sufficient number of neurons, and a non-linear activation function, can approximate any continuous function to an arbitrary level of accuracy.
Vienna University of Economics and Business researcher Kurt Hornik's Theorem #1 generalizes UFAT to broader function spaces (L_p) while requiring weaker conditions on the activation function. A practical use of this theorem is approximating discontinuous functions like piecewise-constant signals.
Hornik's Theorem #2 refines UFAT for multilayer networks, ensuring uniform convergence on compact domains. Its scope is the uniform approximation of continuous functions on compact subsets of R^k. A practical use of this theorem is modeling smooth sensor data or physical systems with high precision.
These theorems collectively validate neural networks as universal approximators, with Kurt Hornik's work extending UFAT's scope to diverse mathematical settings. The choice of theorem depends on the problem's requirements.
In essence, the Hornik Theorem generalizes the idea that even simple, nonlinear building blocks (step activation neurons) in a hidden layer can be combined to create a function that can approximate a target function (in this case, a piecewise constant one). It shows that with enough hidden neurons and a suitable nonlinear activation function, a neural network can approximate a very wide range of functions.
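As a small illustrative sketch of this idea (the weights below are hand-picked rather than trained, and steep sigmoids stand in for step units), two shifted sigmoid hidden units can be combined into a "bump", the building block of a piecewise-constant target:

```python
import numpy as np

# Hand-picked (not trained) one-hidden-layer network with sigmoid units that
# approximates a bump: roughly 1 on the interval (2, 4) and roughly 0 elsewhere.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, steepness=20.0):
    h1 = sigmoid(steepness * (x - 2.0))   # hidden unit 1: steps up near x = 2
    h2 = sigmoid(steepness * (x - 4.0))   # hidden unit 2: steps up near x = 4
    return h1 - h2                        # output layer with weights +1 and -1

xs = np.linspace(0, 6, 7)
print(np.round(bump(xs), 3))   # close to 0 outside (2, 4) and close to 1 inside
```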
Learning a function is an optimization problem — specifically the optimization of the parameters (weights) to minimize a cost/loss/error function.
One way to solve this optimization problem is gradient descent, an optimization algorithm used in machine learning to find the minimum of a function (often a cost or loss function) by repeatedly moving in the direction of steepest descent, guided by the function's gradient.
Too high a learning rate may cause the model to overshoot the point where the loss value is minimal (the optimum) and fail to converge to a good solution, while too low a learning rate makes training take a very long time to reach that point and, if training is stopped early, leaves the model failing to capture essential patterns (underfitting).
A commonly used analogy for gradient descent is hiking down from an initial hilltop: at each point you check the local slope and take a small step in the steepest downhill direction, working your way toward the lowest flat ground.
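A minimal sketch of gradient descent on a toy convex loss, assuming the usual update rule w ← w − learning_rate · dL/dw (all values are illustrative):

```python
# Gradient descent on the toy convex loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3).
w = 0.0                 # initial parameter (the "hilltop")
learning_rate = 0.1     # step size; too large overshoots, too small is slow

for step in range(50):
    grad = 2 * (w - 3)           # gradient of the loss at the current w
    w -= learning_rate * grad    # move in the direction of steepest descent

print(w)   # approaches the minimum at w = 3
```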
The variants of GD are:
Batch GD: The foundational method in optimization theory and early machine learning. Computes the gradient of the loss function using the entire dataset in every iteration, which gives a precise estimate of the gradient direction and stable convergence (to the global minimum for convex losses), but is the most computationally expensive and slowest method on large datasets.
Stochastic GD: Updates parameters using one randomly selected training example per iteration. Developed later to address the inefficiency of batch GD on large datasets; the noise it introduces into updates can help escape local minima and enables faster iterations. Computationally inexpensive and memory-efficient, but convergence is noisy and fluctuating, which can make it unstable.
Mini-batch GD: Uses a small random subset (mini-batch) of the data for each update. A hybrid approach introduced to balance the stability of batch updates with the efficiency of stochastic GD. Combines the strengths of both, but because each update is based on only a subset of the data, it is not guaranteed to converge exactly to the global minimum.
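A hedged sketch of how the three variants differ only in how much data feeds each update, using a toy linear model on synthetic data (names and sizes are illustrative):

```python
import numpy as np

# Toy linear regression y ~ w * x fitted with gradient descent; batch_size
# selects the variant: len(X) = batch GD, 1 = stochastic GD, 32 = mini-batch GD.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 2.5 * X + rng.normal(scale=0.1, size=200)

def fit(batch_size, lr=0.05, epochs=50):
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                      # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            xb, yb = X[batch], y[batch]
            grad = -2 * np.mean(xb * (yb - w * xb))   # d(MSE)/dw on the batch
            w -= lr * grad
    return w

print(fit(batch_size=len(X)), fit(batch_size=1), fit(batch_size=32))
```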
In calculus, a derivative measures how a function changes as its input changes. Mathematically, for a function f(x), the derivative df/dx represents the rate of change of f with respect to x. This concept is foundational in AI training for optimizing models. Most components that train and optimize neural networks use derivatives.
Derivatives outside neural networks rely on foundational calculus rules. Mastery of these guidelines ensures accurate computation for optimization, physics, economics, and more.
Automatic differentiation (AD) is a computational technique used to evaluate the derivatives of functions efficiently and accurately. It computes derivatives by breaking down functions into a sequence of elementary operations and systematically applying the chain rule. Key features of AD include:
Computational graphs: AD represents functions as computational graphs, where nodes correspond to elementary operations (e.g., addition, multiplication, trigonometric functions) and edges represent data flow. As an example, for f(x) = x^2 + sin(x), the graph includes nodes for squaring x, computing sin(x), and adding the results.
Forward mode: Computes derivatives alongside the function evaluation by propagating perturbations from inputs to outputs. Efficient for functions with few inputs and many outputs.
Reverse mode: Computes derivatives by traversing the graph backward from outputs to inputs. Efficient for functions with many inputs and few outputs. Common in deep learning.
Chain rule application: AD decomposes complex functions into elementary operations and applies the chain rule iteratively to compute gradients.
AD works in neural networks with:
Forward pass: Compute function's output (e.g., predictions) while recording operations in a computational graph.
Backward pass: Propagate gradients from loss backward through the graph using reverse-mode AD. At each node, compute the local derivative and multiply it by the incoming gradient (chain rule).
For an example of AD, consider f(x, y) = x^2 * y + y + 2 at x = 3, y = 4, using reverse-mode AD (backpropagation) to compute ∂f/∂x and ∂f/∂y, as worked through below.
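Working this example through by hand (the intermediate names v1 and v2 are just labels for nodes of the computational graph):

```python
# Worked reverse-mode pass for f(x, y) = x**2 * y + y + 2 at x = 3, y = 4.
x, y = 3.0, 4.0

# Forward pass: record intermediate values
v1 = x * x           # v1 = x^2        -> 9
v2 = v1 * y          # v2 = x^2 * y    -> 36
f = v2 + y + 2       # f               -> 42

# Backward pass: propagate df/d(node) from the output toward the inputs
df_dv2 = 1.0                          # f = v2 + y + 2
df_dy_direct = 1.0                    # the "+ y" term contributes directly
df_dv1 = df_dv2 * y                   # v2 = v1 * y     -> 4
df_dx = df_dv1 * 2 * x                # v1 = x * x      -> 24
df_dy = df_dv2 * v1 + df_dy_direct    # x^2 + 1         -> 10

print(f, df_dx, df_dy)   # 42.0 24.0 10.0
```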
The AD technique can handle functions with millions of parameters (e.g., deep learning models) and flexibly work with control flow (loops, conditionals) as long as operations are differentiable. It is also embedded in frameworks like TensorFlow, PyTorch, and JAX for seamless gradient computation.
The backpropagation algorithm is used to train artificial neural networks by adjusting weights to minimize errors, working backward from outputs to inputs.
Before backpropagation, forward propagation begins with feeding input data into the neural network. As the input data travels through the network, layer by layer, at each layer the input is multiplied by the weights, added to the biases, and passed through an activation function. The forward pass continues until the output layer is reached, which produces the network's predictions for the given input.
Once the network has made its predictions, these predictions are compared to the actual, correct values (i.e., the 'ground truth' or 'labels') using a loss function. This loss function quantifies the difference between predicted values and actual values. If the network is processing a batch of data, the loss function calculates the error for each data point in the batch and averages (or sums) them into a total error: a single scalar value that represents the overall error of the network's predictions, i.e., how 'wrong' they are.
Starting from the total error, the output layer first calculates the partial derivative of the loss function with respect to the output node's activation, then the partial derivative of the output node's activation with respect to its input, and finally the partial derivative of that input with respect to the weight.
After the above, the chain rule is applied to multiply the partial derivatives calculated via ∂E_tot/∂w_5 = (∂E_tot/∂out_z1[out]) * (∂out_z1[out]/∂in_z1[out]) * (∂in_z1[out]/∂w_5). The calculated ∂E_tot/∂w_5 is then used to update weight w_5 with an optimization algorithm (e.g., gradient descent). With that done, backpropagation propagates the error gradients backward through the network, layer by layer, until it reaches the input layer, completing one full backpropagation pass.
To summarize, a network only learns by changing its weights; if they stay static, it cannot improve. The repeated weight updates that backpropagation performs gradually minimize the final total error, and with it the chance of producing wrong outputs.
Overall, the chain rule is used to compute gradients of the loss with respect to inputs of a function within a neural network. This is essential for backpropagation, as it allows the network to determine how much each input (and ultimately, each weight and bias) contributes to overall error. These gradients are then used to update the network's parameters to minimize loss.
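A minimal sketch of this chain-rule computation for a single sigmoid output neuron with squared-error loss, assuming NumPy; the values and the name w5 are illustrative only, chosen to mirror the ∂E_tot/∂w_5 notation above:

```python
import numpy as np

# Chain rule for one weight of a single sigmoid output neuron with
# squared-error loss (w5, in_out, out_out are illustrative names).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_hidden = 0.6      # output of the previous (hidden) neuron feeding this weight
w5 = 0.4            # the weight we want to update
b = 0.1             # bias of the output neuron
target = 1.0        # ground-truth label

# Forward pass
in_out = w5 * x_hidden + b            # pre-activation of the output neuron
out_out = sigmoid(in_out)             # activation of the output neuron
E_tot = 0.5 * (target - out_out) ** 2

# Backward pass: dE/dw5 = dE/dout * dout/din * din/dw5
dE_dout = -(target - out_out)            # derivative of 0.5*(t - o)^2 w.r.t. o
dout_din = out_out * (1.0 - out_out)     # sigmoid derivative
din_dw5 = x_hidden                       # derivative of w5*x + b w.r.t. w5
dE_dw5 = dE_dout * dout_din * din_dw5

# Gradient descent update
learning_rate = 0.5
w5 -= learning_rate * dE_dw5
print(E_tot, dE_dw5, w5)
```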
PyTorch comes with an implementation of auto-differentiation called Autograd (automatic gradients). It can be used to compute the derivative of a function, i.e., its gradient. For Autograd to track operations on a tensor, you have to set requires_grad=True when creating it.
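A small sketch of what this looks like, reusing the earlier AD example (assuming PyTorch is installed):

```python
import torch

# Same function as the AD example, differentiated with PyTorch Autograd.
x = torch.tensor(3.0, requires_grad=True)   # track gradients for x
y = torch.tensor(4.0, requires_grad=True)   # track gradients for y

f = x ** 2 * y + y + 2   # forward pass builds the computational graph
f.backward()             # reverse-mode pass populates .grad on the leaves

print(f.item(), x.grad.item(), y.grad.item())   # 42.0 24.0 10.0
```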
Depending on the size of your model's dataset, modern practice varies the ratio of data fed to each split. In terms of training, validation, and test set ratio:
For small datasets (on the order of 100 to 10,000 samples): 60%/20%/20%
For large datasets (> 1,000,000 samples): 98%/1%/1%
The training set and the validation/test sets usually need to come from the same distribution, although it is not a huge issue if this varies a bit when gathering a lot of training data. Just make sure the validation and test sets come from the same distribution.
Should the dataset be too small, we can use cross-validation instead of holdout sets. Holdout sets use a single train/test split, while cross-validation uses multiple splits to train and test — providing a more robust estimate of model generalization.
Leave-one-out cross-validation (LOOCV) is a specific type of cross-validation where each data point is used as a test set exactly once, while the remaining data points form the training set, repeated for every data point in the dataset.
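A hedged sketch of both approaches using scikit-learn, assuming it is available (the toy data and classifier are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Tiny synthetic binary classification problem (placeholder data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression()

# k-fold cross-validation: multiple train/test splits instead of one holdout.
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# Leave-one-out: each sample is the test set exactly once.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```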
The loss function measures how well the model does by comparing predicted outputs to expected ones. It is usually used as the objective function of parameter optimization (i.e., model training), usually to be minimized. Commonly used types of loss functions are:
Mean Squared Error (MSE or L2 loss): Measures the average squared difference between predicted values and true values. Sensitive to outliers due to squaring, which penalizes large errors heavily. Used for regression tasks (i.e., numeric predictions). The loss curve tends to be convex and smooth, enabling efficient gradient-based optimization. Outputs are unbounded, so the loss can grow very large.
y_i: Predicted value for the i-th sample.
ŷ_i: Target value for the i-th sample.
N: Number of instances.
Mean Absolute Error (MAE or L1 loss): Measures average absolute difference between predictions and true values. Is robust to outliers compared to MSE, as it does not square errors. Also used for regression tasks, although more suitable for sets where outliers are problematic (e.g., financial forecasting). Less sensitive to outliers but has non-smooth gradients at zero. Outputs are in the same units as the target variable.
y_i: Predicted value for the i-th sample.
ŷ_i: Target value for the i-th sample.
N: Number of instances.
Binary Cross-Entropy loss (Log loss): Measures discrepancy between predicted probabilities and true binary labels. Penalizes confident incorrect predictions heavily (e.g., predicting 0.9 when the true label is 0). Used for binary classification tasks (e.g., spam detection). Requires predictions to be probabilities; use sigmoid activation in the final layer. Equivalent to log loss for binary outcomes.
y: True binary label — 1 for the positive class, 0 otherwise.
p: Probability of the positive class, between 0 and 1, predicted by the model.
Categorical Cross-Entropy loss (Softmax loss): Measures discrepancy between predicted class probabilities and true one-hot encoded labels. Penalizes models when the predicted probability for the true class is low. Used for multi-class classification tasks (e.g., image classification with C classes). Requires predictions to be probability distributions; use softmax activation in the final layer. Generalizes BCE to multiple classes.
C: Number of classes.
y_i: Binary indicator (0 or 1) specifying if label i is the correct classification.
p_i: Predicted probability for the instance to be of class i.
Each loss function is tailored to specific problem types, balancing sensitivity to errors, robustness, and mathematical properties. To summarize on how to choose a regression loss function:
Use MSE if outliers are rare and large errors should be heavily penalized.
Use MAE if outliers are common and robustness is critical.
As for classification loss functions:
Use BCE for binary tasks (e.g., yes/no predictions).
Use CCE for multi-class tasks (e.g., digit recognition with 10 classes).
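Minimal NumPy sketches of the four losses discussed above (the helper names and example values are illustrative):

```python
import numpy as np

# Minimal versions of the four losses; eps avoids log(0) in the cross-entropies.
def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

def mae(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))

def binary_cross_entropy(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)        # p: predicted probability of the positive class
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(p, y_onehot, eps=1e-12):
    p = np.clip(p, eps, 1.0)            # p: predicted class distribution per sample
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

# Made-up example values, for illustration only.
print(mse(np.array([2.5, 0.0]), np.array([3.0, -0.5])))
print(mae(np.array([2.5, 0.0]), np.array([3.0, -0.5])))
print(binary_cross_entropy(np.array([0.9, 0.2]), np.array([1, 0])))
print(categorical_cross_entropy(np.array([[0.7, 0.2, 0.1]]), np.array([[1, 0, 0]])))
```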
Whereas loss functions guide model training by quantifying prediction errors, evaluation metrics measure the final performance on unseen data. Both aim to optimize model quality but at different stages (training vs. evaluation). The former can act as surrogates for evaluation metrics when the latter are not directly optimizable. As an exception, regression tasks often use the same metric (e.g., MSE, MAE) for both.
The confusion matrix is a foundational tool for evaluating model performance in binary classification tasks by comparing actual labels with predicted labels. Each prediction falls into one of four categories:
True positives (TP): Correctly predicted positive instances.
True negatives (TN): Correctly predicted negative instances.
False positives (FP): Negative instances incorrectly predicted as positive.
False negatives (FN): Positive instances incorrectly predicted as negative.
For multi-class problems, the matrix expands to an N * N table for N classes, where each cell C_(i, j) represents instances of class i predicted as class j.
Key metrics derived from the confusion matrix include:
Accuracy: Measures overall correctness, but is misleading for imbalanced datasets (e.g., 95% negative class). Formula form is (TP + TN) / (TP + TN + FP + FN).
Precision: Proportion of predicted positives that are truly positive. Focuses on reducing false positives (e.g., spam detection). Formula form is TP / (TP + FP).
Recall (sensitivity): Proportion of actual positives that are correctly predicted. Critical in scenarios that require minimizing false negatives. Formula form is TP / (TP + FN).
Specificity: Measures proportion of actual negatives correctly identified. Formula form is TN / (TN + FP).
F1 score: The harmonic mean of precision and recall, balancing the two. Ideal for imbalanced datasets. Formula form is 2[(Precision * Recall) / (Precision + Recall)].
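A small sketch computing these metrics from made-up confusion-matrix counts (the counts are illustrative only):

```python
# Metrics derived from made-up confusion-matrix counts.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)          # sensitivity
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)
```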
Bad fitting occurs when a model fails to generalize effectively to new data. Underfitting (high bias) arises when the model is too simplistic, missing key patterns in training data, leading to poor performance on both training and test sets. Overfitting (high variance) happens when the model is excessively complex, memorizing noise or outliers in training data and performing well on training data but poorly on unseen data.
Both scenarios result in unreliable predictions, highlighting the need to balance model complexity, use regularization, or gather more data to achieve optimal generalization.
In classification models, underfitting manifests as overly simple (e.g., linear) decision boundaries that misclassify many items. Boundaries produced by overfitting, on the other hand, can capture outliers but are noisy and too closely molded to the training data to classify test data accurately.
For regression models, underfitting manifests as a straight line fitted to a sinusoidal (wave-like) trend. In cases of overfitting, the fitted curve is a high-degree polynomial passing through every point erratically.
Credits to fellow AI Masters student Mohamed Elwakdy for sharing these slides.