We can either be shackled by the past or inspired by it, regardless of whose past it is. Let me know if you can think of another thing the past can do.
Deep Neural Networks (DNNs) have revolutionized the field of artificial intelligence (AI), enabling breakthroughs across many domains. The journey from early DNNs to the sophisticated Convolutional Neural Networks (CNNs) we know today is marked by significant advancements, driven in large part by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), which accelerated the development and adoption of CNNs and pushed the boundaries of computer vision.
The field of DNNs has evolved significantly since the introduction of AlexNet, with newer architectures like ResNet and Inception-v4 demonstrating substantial improvements in both accuracy and efficiency.
The plot below presents a comparative analysis of various CNN models (DNNs designed to process image data), focusing on their top-1 accuracy and computational complexity. The axes represent:
Top-1 Accuracy: the y-axis represents the top-1 accuracy of each model, indicating the percentage of test images correctly classified into their respective categories.
Operations (G-Ops): the x-axis represents the computational complexity of each model, measured in giga-operations (G-Ops). This metric quantifies the number of floating-point operations required to process an image.
In 1998, Yann LeCun and his collaborators proposed the LeNet family of CNN architectures, LeNet-5 being the most prominent of them. It introduced the concepts of convolutional layers and pooling layers, which have become essential components of modern CNNs.
LeNet-5 used 5*5 convolutional kernels, average pooling (subsampling) layers, and radial basis function (RBF) units in the output layer. An RBF unit computes the distance between its input vector and a learned prototype vector in a high-dimensional space.
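For readers who prefer code, here is a minimal LeNet-5-style sketch in PyTorch (the framework choice is mine, not the original paper's); the original used tanh/sigmoid-style activations and RBF output units, which are approximated here with tanh and a plain linear classifier:

```python
import torch
import torch.nn as nn

# LeNet-5-style network (sketch). The original paper's RBF output layer
# is replaced with a standard linear layer for simplicity.
class LeNet5(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S2: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),    # C3: 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S4: 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5),  # C5: 5x5 -> 1x1
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                 # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),         # stands in for the RBF output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Expects 32x32 grayscale inputs, as in the original LeNet-5.
logits = LeNet5()(torch.randn(1, 1, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```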
AlexNet, introduced in 2012, was a groundbreaking deep CNN architecture that significantly outperformed previous approaches on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC).
AlexNet was deeper and more capable than previous CNNs (a code sketch follows this list), in that it:
Consisted of 8 learned layers: 5 convolutional layers and 3 fully connected layers, which allowed it to learn more complex features from image data.
Used ReLU (Rectified Linear Unit) activation functions, which helped to address the vanishing gradient problem and improve training efficiency.
Introduced dropout as a regularization technique to prevent overfitting by randomly dropping out neurons during training.
Employed data augmentation techniques, such as cropping and flipping, to increase the size and diversity of the training dataset.
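To make these points concrete, here is a minimal AlexNet-style sketch in PyTorch. The channel counts follow the common single-GPU (torchvision-style) variant rather than the original two-GPU split, so treat the exact numbers as illustrative:

```python
import torch
import torch.nn as nn

# AlexNet-style sketch: 5 convolutional layers + 3 fully connected layers,
# ReLU activations, and dropout in the classifier.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# 224x224 RGB input -> 1000 class logits
print(AlexNetSketch()(torch.randn(1, 3, 224, 224)).shape)
```

Data augmentation, such as random crops and horizontal flips, would typically live in the input pipeline (for example, torchvision transforms) rather than inside the model itself.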
Visual Geometry Group (VGG) was a family of CNN architectures introduced in 2014 by Karen Simonyan and Andrew Zisserman. These models are characterized by their use of multiple convolutional layers with small 3*3 filters, followed by pooling layers.
Comparing VGG to the 2012 champion AlexNet (a block sketch follows this comparison):
Architecture: VGG networks typically have multiple convolutional layers stacked on top of each other, while AlexNet has fewer convolutional layers but uses larger filters.
Filter Size: VGG networks use smaller 3*3 filters throughout the architecture, while AlexNet uses larger filters in the early layers.
Depth: VGG networks can be deeper than AlexNet, allowing them to learn more complex features.
Performance: VGG networks generally outperform AlexNet on image classification and object detection.
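As an illustration of the "stack small filters" idea, here is a hedged sketch of a VGG-style block in PyTorch; the channel counts are illustrative rather than the exact VGG-16 configuration. Two stacked 3*3 convolutions cover the same receptive field as one 5*5 convolution while using fewer parameters and adding an extra nonlinearity.

```python
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, num_convs: int = 2) -> nn.Sequential:
    """A VGG-style block: several 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halve the spatial size
    return nn.Sequential(*layers)

# Stacking blocks with growing channel counts (illustrative, VGG-like):
features = nn.Sequential(
    vgg_block(3, 64),
    vgg_block(64, 128),
    vgg_block(128, 256),
)
```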
To summarize these CNNs for those who learn best by analogy, you can imagine them as detectives with different specializations:
LeNet: a detective investigating a simple crime scene. It starts with a basic understanding of the scene (input image) and gradually builds a more complex picture by analyzing different parts (features) of it.
AlexNet: a more experienced detective with advanced tools. It can analyze the crime scene in greater detail and identify more complex patterns, thanks to its deeper architecture and larger filters.
VGG: a meticulous investigator who examines the crime scene with a magnifying glass. It uses smaller filters to analyze the scene at a finer level of detail, capturing subtle nuances that may be missed by other detectives.
The choice of kernel size in a CNN is influenced by the distribution of information within input images. Images dominated by global information (i.e., large-scale patterns or objects) often benefit from larger kernels, which can capture broader spatial relationships; conversely, smaller kernels are better suited to capturing local features.
The Inception (GoogLeNet) architecture addresses this by combining multiple convolutional kernels of different sizes within a single layer. This allows the network to extract features at multiple scales, capturing both global and local information. By combining these kernels' outputs, the Inception module can effectively process images with varying levels of information distribution.
Inception modules consist of multiple parallel branches, each applying different convolutional filters to the input feature map.
1*1 Convolution: used to reduce the channel (depth) dimensionality of feature maps, improving computational efficiency; followed by a ReLU, it also introduces extra nonlinearity.
3*3 and 5*5 Convolution: extract features at different scales, capturing both local and global information.
Max Pooling: summarizes local neighborhoods while preserving the strongest activations; within the Inception module it is applied with stride 1, so the spatial size is unchanged.
To ensure that the output feature maps from every branch have the same spatial size and can be concatenated, padding is used. The 1*1 convolutional layers need no padding (a 1*1 kernel does not change spatial dimensions), while the 3*3 and 5*5 convolutional layers use 'same' padding to maintain the original size.
Compared with producing fewer, larger feature maps using only larger kernels, inserting tiny 1*1 kernels before them results in fewer parameters, because the 1*1 convolutions reduce the channel (depth) dimensionality of the feature maps before the larger convolutions are applied. Keep in mind, however, that a 1*1 convolution is a per-pixel linear projection across channels and does not, by itself, capture spatial relationships between features.
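Putting the branches and the 1*1 reductions together, here is a minimal GoogLeNet-style Inception module in PyTorch; the per-branch channel counts are illustrative choices on my part rather than a faithful reproduction of the paper:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel dim."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        # Branch 1: 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution ('same' padding)
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Branch 3: 1x1 reduction, then 5x5 convolution ('same' padding)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU(inplace=True),
        )
        # Branch 4: 3x3 max pooling (stride 1, padded), then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Every branch preserves the spatial size, so concatenation is valid.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Illustrative channel counts: 192 input channels -> 64 + 128 + 32 + 32 = 256
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```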
Imagine a kitchen mixer's components. A large mixing bowl can hold and process a variety of ingredients (features), but it may be inefficient for smaller tasks. Meanwhile, smaller mixing attachments can blend specific ingredients (features) more efficiently and precisely.
Auxiliary classifiers were introduced in Inception networks to supervise the training of intermediate layers and prevent the vanishing gradient problem. They are trained with their own classification loss, which is added to the main classification loss of the network to provide extra supervision to the intermediate layers.
This extra supervision can act as a form of regularization, preventing overfitting by encouraging the network to learn more generalizable features. They can help to address the vanishing gradient problem by providing additional gradient signals to the earlier layers of the network.
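A rough sketch of how the auxiliary losses are combined with the main loss during training; the 0.3 weight matches the value used in the original GoogLeNet paper, while the tensor names here are placeholders of my own:

```python
import torch.nn.functional as F

# main_logits comes from the final classifier; aux1_logits and aux2_logits
# come from the two auxiliary classifiers attached to intermediate layers.
def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets):
    main_loss = F.cross_entropy(main_logits, targets)
    aux_loss = F.cross_entropy(aux1_logits, targets) + F.cross_entropy(aux2_logits, targets)
    # Auxiliary losses are down-weighted (0.3 in the GoogLeNet paper) and only
    # used during training; at inference the auxiliary heads are discarded.
    return main_loss + 0.3 * aux_loss
```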
In 2015, Google researchers introduced Inception-v2, which improved the computational efficiency and accuracy of the original model. Its key changes included:
Replaced the 5*5 convolution in the Inception module with two stacked 3*3 convolutions (stride 1), which reduced the number of parameters and improved computational efficiency.
Incorporated batch normalization after each convolutional layer to improve training stability and reduce the need for careful initialization. Batch normalization also acted as a regularizer, reducing (and in some cases eliminating) the need for dropout.
In the same year, Inception-v3 was released, further improving upon Inception-v2 by:
Factorizing convolutions asymmetrically: a 3*3 convolution in the Inception block, for example, becomes a 1*3 convolution followed by a 3*1 convolution, which is roughly 33% less computationally expensive than a single 3*3 convolution (see the parameter-count sketch after this list). The paper notes, however, that this factorization does not work well in the earliest layers.
Introducing label smoothing, a regularization technique that assigns a small probability to incorrect classes, preventing the model from becoming too confident in its predictions.
Reintroducing auxiliary classifiers, trained with a lower learning rate to prevent them from dominating the training process.
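The sketch below simply counts weights to illustrate both factorizations: replacing a 5*5 convolution with two 3*3 convolutions (Inception-v2) and replacing a 3*3 convolution with a 1*3 followed by a 3*1 convolution (Inception-v3). The channel count is an arbitrary choice, and biases are omitted to keep the comparison clean.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

C = 64  # same number of input and output channels, chosen arbitrarily

conv5x5    = nn.Conv2d(C, C, 5, padding=2, bias=False)
two_3x3    = nn.Sequential(nn.Conv2d(C, C, 3, padding=1, bias=False),
                           nn.Conv2d(C, C, 3, padding=1, bias=False))
conv3x3    = nn.Conv2d(C, C, 3, padding=1, bias=False)
asymmetric = nn.Sequential(nn.Conv2d(C, C, (1, 3), padding=(0, 1), bias=False),
                           nn.Conv2d(C, C, (3, 1), padding=(1, 0), bias=False))

print(count_params(conv5x5), count_params(two_3x3))     # 102400 vs 73728 (~28% fewer)
print(count_params(conv3x3), count_params(asymmetric))  # 36864 vs 24576 (~33% fewer)
```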
In ILSVRC 2015, ResNet (residual neural network) won with a record-breaking top-5 error of 3.57%, setting a new standard for deep learning accuracy and for how very deep networks are trained.
Before ResNet, training very deep neural networks was challenging due to the vanishing gradient problem. To counteract this, ResNet introduced residual (skip) connections, which let a block learn a residual function F(x) = H(x) - x, the difference between the desired output H(x) and the input x, instead of the full mapping H(x) directly. That way, the block only needs to learn a small correction to add to its input, rather than the entire output from scratch.
Imagine a hiker (the neural network) trying to climb an increasingly steep mountain (a deep network cursed with vanishing gradients). The hiker could take the direct route (train the layers directly), but the journey gets harder and harder the higher they go. Instead, the hiker can take shortcuts with steps (skip connections), reaching the summit (learning the necessary mapping) far more easily.
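To make the residual idea concrete, here is a minimal residual block sketch in PyTorch, assuming the input and output have the same shape so the identity shortcut can be added directly (real ResNets use a 1*1 projection on the shortcut when shapes differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, where F is a small stack of conv-bn layers."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # The skip connection: the block only has to learn the residual,
        # i.e. the correction to add to its input.
        return F.relu(residual + x)

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```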
Returning to the Inception family, the paper that introduced Inception-v4 also studied the impact of residual connections on Inception-style networks. Inception-v4 itself remains a purely Inception-based design with deeper, more uniform modules; the residual idea is incorporated by its sibling architecture, Inception-ResNet, described below.
One year after the original ResNet was proposed, ResNet-v2 was released. This variant implemented:
Pre-activation: in ResNet-v2, batch normalization and ReLU activation are applied before each convolutional layer rather than after it, which helps stabilize training and improve convergence (a combined sketch follows this list).
Bottleneck Architecture: uses 1*1 convolutions to reduce the dimensionality of feature maps before the main 3*3 convolution (and to restore it afterwards), improving computational efficiency without sacrificing accuracy.
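Combining both points, here is a hedged sketch of a ResNet-v2-style pre-activation bottleneck block; the channel numbers are illustrative assumptions rather than a particular published configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBottleneck(nn.Module):
    """BN -> ReLU -> conv ordering (pre-activation), with a 1x1-3x3-1x1 bottleneck."""
    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.bn1   = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, bottleneck, 1, bias=False)   # reduce channels
        self.bn2   = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False)
        self.bn3   = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, channels, 1, bias=False)   # restore channels

    def forward(self, x):
        # Pre-activation: normalization and nonlinearity come before each convolution.
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        out = self.conv3(F.relu(self.bn3(out)))
        return out + x  # identity shortcut; no ReLU after the addition in ResNet-v2

block = PreActBottleneck(channels=256, bottleneck=64)
print(block(torch.randn(1, 256, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```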
A hybrid model, Inception-ResNet, combined the strengths of the two architectures in its name: it is an Inception architecture that incorporates ResNet-style residual connections into its Inception modules.
Imagine a hybrid car, which combines the efficiency of an electric motor with the power of a gasoline engine. Here, the gasoline engine (Inception) is the architectural foundation, the electric motor (residual connections) adds extra drive, and combining the two yields a hybrid car (Inception-ResNet) more capable than either of its predecessors.
To explore the Inception-ResNet architecture in more depth, let us look at the key technical differences between Inception-ResNet(-v1) and Inception-v4:
Complexity: Inception-v4 has a more complex stem structure, allowing it to extract richer low-level (shallow) features.
Pooling Layers: Inception-v4 uses both max pooling and average pooling, while Inception-ResNet-v1 only uses max pooling.
Convolutional Layers: Inception-v4 uses a variety of convolutional layers with different sizes and strides, while Inception-ResNet-v1 primarily uses 3*3 convolutional layers.