Less building, more thinking. History demands greater intellect for the future.
The quest to imbue machines with human-like intelligence has driven advancements in computational paradigms. From its nascent stages as a simplified model of the brain to its current status as a transformative technology, this field has evolved through iterative refinement and exploration of complex architectures.
The genesis of artificial neural networks can be traced back to the perceptron, introduced by Frank Rosenblatt in the 1950s. While these early models were limited in their capabilities, they laid the groundwork for subsequent advancements.
Despite initial enthusiasm, the limitations of perceptrons and the subsequent inability to solve complex problems led to a period known as the "AI winter" in the 1970s and 1980s. Research in neural networks declined significantly during this time.
The development of the backpropagation algorithm in the 1980s marked a turning point. This algorithm enabled the training of multilayer perceptrons (MLPs), which are neural networks with multiple layers of interconnected neurons. This breakthrough reignited interest in neural networks and laid the foundation for modern deep learning.
The term "deep learning" – including deep neural networks (DNNs) – emerged in the 2000s to describe neural networks with multiple hidden layers. Advances in computational hardware, particularly GPUs, and the availability of large datasets fueled the rapid growth of deep learning.
The core principle of deep learning involves training artificial neural networks to approximate underlying functions that map inputs to desired outputs, effectively solving a given problem.
Deep learning excels in handling complex patterns and large datasets that often overwhelm traditional machine learning models. While shallow networks can achieve reasonable accuracy on certain tasks, deep architectures with multiple layers enable the extraction of hierarchical features, leading to superior performance.
This is particularly evident in domains like image recognition, natural language processing, and speech recognition where intricate data structures require sophisticated modeling capabilities.
First, in a neural network, each layer of neurons receives input from the previous layer, computes W*x + b, applies an activation function, and produces an output vector a1. This output becomes the input for the next layer.
The output vector a1 represents a function that transforms the input x. Changes in x, the weights W, or the bias terms b alter the function the layer computes, affecting the behavior of the entire neural network.
A neural network with more neurons has a larger set of candidate functions, enabling it to better approximate complex relationships and fit the problem at hand.
The input layer consists of a vector [x1, x2, ..., xn], where each xi represents a feature of the input data. Each neuron in the subsequent layer processes this input by multiplying the input values by its corresponding weights [w1, w2, ..., wn], adding a bias term (offset) b, and applying an activation function to produce an output value.
The calculation can be represented in matrix form: the pre-activation is z = W × X + b, where W is the weight matrix (one row of weights [w1, w2, ..., wn] per neuron), X is the input vector [x1, x2, ..., xn], and b is the vector of bias terms. Applying the activation function to z yields the output vector a1, which represents a functional transformation of the input data.
The output of each layer then serves as the input to the next layer, creating a cascading effect through the network.
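To make the cascade concrete, here is a minimal sketch of a forward pass in NumPy. The layer sizes, the ReLU activation, and the random initialization are illustrative assumptions on my part, not values from any exercise.

```python
import numpy as np

def relu(z):
    # Element-wise ReLU activation: max(0, z)
    return np.maximum(0, z)

rng = np.random.default_rng(0)

# Illustrative shapes: 4 input features -> 3 hidden neurons -> 2 outputs
x = rng.normal(size=(4,))        # input vector [x1, ..., x4]
W1 = rng.normal(size=(3, 4))     # weight matrix of the first layer
b1 = np.zeros(3)                 # bias terms of the first layer
W2 = rng.normal(size=(2, 3))     # weight matrix of the second layer
b2 = np.zeros(2)

# Layer 1: pre-activation z1 = W1·x + b1, then activation a1 = f(z1)
z1 = W1 @ x + b1
a1 = relu(z1)

# Layer 2: the previous output a1 becomes this layer's input
z2 = W2 @ a1 + b2
a2 = relu(z2)
print(a2)
```

Changing any of x, W1, b1, W2, or b2 changes the function the network computes, which is exactly what training exploits.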
The loss function measures the overall performance of the neural network by quantifying the difference between predicted outcomes and true results.
A commonly used loss function for classification tasks is Cross-Entropy, which is closely related to the Kullback-Leibler (K-L) divergence (the two coincide when the true labels are one-hot). It evaluates the difference between true labels (Y), the actual outcomes represented as one-hot encoded vectors (e.g., [1, 0, 0] for a three-class problem), and predicted probabilities (Y-hat), the model's output represented as a probability distribution over all classes (e.g., [0.7, 0.2, 0.1] for a three-class problem).
The cross-entropy loss function calculates the difference between these two, resulting in a loss value that indicates the model's performance.
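As a quick sketch, using the example vectors from above (the epsilon guard is my own addition to avoid log(0)):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])   # one-hot true label for a 3-class problem
y_pred = np.array([0.7, 0.2, 0.1])   # predicted probability distribution

# Cross-entropy: negative sum over classes of y * log(y_hat)
eps = 1e-12
loss = -np.sum(y_true * np.log(y_pred + eps))
print(loss)  # ~0.357; lower values mean predictions closer to the true label
```

Had the model predicted [0.99, 0.005, 0.005] instead, the loss would drop to about 0.01, reflecting a far more confident correct prediction.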
The gradient descent method is employed to iteratively adjust the weights (W) in the neural network, minimizing the errors generated during training.
Without going too much into gradient descent again (read 30th June 2024), its ultimate goal in deep learning networks is to find the optimal weights (W) that combine with the model to form the best function, minimizing the loss value and maximizing prediction accuracy.
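A minimal sketch of gradient descent on a single weight, fitting a toy linear problem (the data, learning rate, and squared-error loss are all illustrative assumptions):

```python
import numpy as np

# Toy data: y = 3x, so the loss-minimizing weight is w = 3
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 6.0, 9.0])

w = 0.0     # initial weight
lr = 0.05   # learning rate

for step in range(100):
    y_hat = w * x                         # model prediction
    grad = np.mean(2 * (y_hat - y) * x)   # d/dw of the mean squared error
    w -= lr * grad                        # step against the gradient
print(w)  # converges towards 3.0
```

A real network applies the same update to every entry of every weight matrix, with backpropagation supplying the gradients.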
Deep neural networks have revolutionized various fields due to their ability to learn complex patterns from data. Key advantages include:
Versatility: capable of handling diverse data types (images, text, audio) and tasks (classification, regression, generation).
Feature Learning: automatically extract relevant features from raw data, reducing manual feature engineering.
High Accuracy: often outperform traditional machine learning methods on complex problems.
Handling Large Datasets: can effectively process massive amounts of data.
While offering significant advantages, deep neural networks also present challenges:
Computational Cost: require substantial computational resources for training and inference.
Data Hunger: typically require large amounts of data to achieve optimal performance.
Overfitting: prone to overfitting, requiring regularization techniques.
Black Box Nature: difficult to interpret the decision-making process of deep models.
Training Time: can be computationally expensive and time-consuming.
For this exercise, we will be importing the MNIST dataset. For your information, it is a large database of handwritten digits that is commonly used for training various image processing systems.
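Loading it through Keras looks like this (the normalization step is my own addition; the loader itself is the standard tf.keras API):

```python
from tensorflow.keras.datasets import mnist

# 60,000 training and 10,000 test images of 28x28 grayscale digits
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Scale pixel values from [0, 255] to [0, 1] for stabler training
x_train, x_test = x_train / 255.0, x_test / 255.0
```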
This Sequential network runs through just 2 Dense layers, using ReLU and softmax activations respectively; a sketch of the network follows below. Compared to 14th July's exercise, here are the expanded parameters that you can set for each Dense layer.
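A sketch of how such a network might be assembled and trained; the 128-unit hidden layer, the adam optimizer, and the validation split are assumptions on my part, not the exercise's exact settings:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential([
    Flatten(input_shape=(28, 28)),    # unroll each image into a 784-vector
    Dense(128, activation="relu"),    # hidden layer (128 units assumed)
    Dense(10, activation="softmax"),  # one probability per digit class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels, no one-hot needed
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    epochs=10,             # matches the 10-epoch runtime discussed below
                    validation_split=0.1)  # hold out part of the training set
```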
The resulting model is simple and shallow, has few parameters, and no specialized layers. It is more suitable for simple classification problems with small- to medium-sized datasets.
Over the 10-epoch runtime, the Sequential model produced a consistent and rapid increase in accuracy and decrease in loss. Comparing the training and validation sets, however, there is a noticeable gap between the two, with the validation set scoring 0.0207 lower in accuracy and 0.0908 higher in loss.
If you remember the line plot above the TensorFlow Playground output from 9th August, you might recall how most neural networks produce more accurate results on training sets than on testing/validation sets. That is a near-constant observation, since the model's weights are optimized directly on the former and only estimated to carry over to the latter.
The line plot below displays a gap of around 0.02 between the training and validation accuracy scores at the 10th epoch. Both are good scores, but improvements can be made to close the gap.
As for loss, both sets have relatively low values. While this implies the model has made very few wrong predictions, the 0.0908 gap between the training and validation losses is much more prominent than the accuracy gap.
After running through a final evaluation, the Sequential model scored a high accuracy of 0.9735 (about 97%), with a corresponding loss of 0.1030. The former demonstrates the model's strong performance on the test/validation set, while the low loss suggests the model is making confident, correct predictions.
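Those final scores would come from an evaluation call like this one (a sketch reusing the variable names from the loading snippet above):

```python
# Evaluate on the held-out test set; returns the loss and each compiled metric
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"accuracy: {test_acc:.4f}, loss: {test_loss:.4f}")
```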
All in all, even though the Sequential model has produced strong results, further analysis can be done to identify potential areas for increasing accuracy.