The latest chapters of this marathon are simple recaps of previous CNN topics, so today I will be documenting my own, more in-depth findings on those chapters' main topics.
At the heart of CNNs lies a carefully crafted architecture composed of various layers, each serving a unique purpose. In this exploration, we will delve into three key components of CNNs: pooling, flattening, and batch normalization, and understand how each contributes to a deep learning model's ability to extract meaningful features from images.
To recap, pooling layers reduce the dimensionality of feature maps while preserving essential information and introducing invariance to small transformations. This reduction is achieved by downsampling the feature map while retaining its most relevant features.
Imagine yourself as a news editor reviewing a lengthy article. The editor chooses the most relevant sentences (features) to represent the article's main ideas. By selecting key sentences, the editor effectively summarizes the article, reducing its length while preserving essential information. If the article is slightly modified (e.g., a sentence is rearranged), the editor might still be able to identify the main points, showing invariance to small changes.
The first variant, max pooling, selects the maximum value within each pooling region. Its purpose is to preserve the most salient (important) features. Because it keeps only the single largest value, it can be sensitive to noise and outliers.
The second variant, average pooling, calculates the average value within each pooling region. Its purpose is to reduce noise and smooth the feature maps; as a result, its outputs tend to have blurred features.
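To make the contrast concrete, here is a minimal NumPy sketch (with made-up values) applying both max and average pooling to the same 4x4 feature map, using 2x2 windows and a stride of 2.

```python
import numpy as np

# A made-up 4x4 feature map, purely for illustration.
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 8, 1, 5],
    [7, 2, 9, 6],
    [0, 3, 4, 4],
], dtype=float)

# Split the map into four non-overlapping 2x2 windows.
windows = feature_map.reshape(2, 2, 2, 2)

print(windows.max(axis=(1, 3)))   # max pooling:     [[8. 5.] [7. 9.]]
print(windows.mean(axis=(1, 3)))  # average pooling: [[4. 2.] [3. 5.75]]
```

Notice how max pooling keeps the single strongest response in each window, while average pooling blends the four values together, which is exactly why its outputs look smoother.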
The third variant, and one not covered in this blog before, is global average pooling, which calculates the average value across each entire feature map. It is typically used in the final layers of CNNs for classification tasks, where it converts the feature maps into a single vector that can be fed directly into a fully connected layer.
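Here is a minimal sketch of global average pooling, assuming a NumPy array of hypothetical feature maps laid out as (batch, height, width, channels):

```python
import numpy as np

# Hypothetical batch of feature maps: 8 images, a 7x7 spatial grid, 64 channels.
feature_maps = np.random.rand(8, 7, 7, 64)

# Global average pooling: collapse the whole spatial grid of each channel
# into a single average, leaving one 64-dimensional vector per image.
gap = feature_maps.mean(axis=(1, 2))
print(gap.shape)   # (8, 64) -- ready to feed into a fully connected layer
```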
Flattening reshapes the multi-dimensional tensor into a two-dimensional matrix. The first dimension is the batch size (the number of images being processed simultaneously), and the second is the flattened feature vector obtained by concatenating all elements of the feature maps into a single vector, which the fully connected layer can then process to make predictions or classifications.
Imagine a librarian scanning a stack of books (feature maps). To digitize the content, they first flatten the books into a single pile. This flattening process allows the scanner to read the text sequentially.
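In code, flattening is just a reshape. A minimal NumPy sketch with hypothetical shapes:

```python
import numpy as np

# Hypothetical batch of feature maps: 32 images, each 7x7 with 64 channels.
feature_maps = np.random.rand(32, 7, 7, 64)

# Flattening: keep the batch dimension, concatenate everything else
# into one long feature vector per image.
flattened = feature_maps.reshape(32, -1)
print(flattened.shape)   # (32, 3136), since 7 * 7 * 64 = 3136
```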
Hailing from the 2015 paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, batch normalization is a technique that stabilizes (and, as a side effect, regularizes) the training of deep neural networks by normalizing a layer's activations to have zero mean and unit variance.
Imagine a classroom full of students taking a test. The difficulty (activation values) of the test can vary greatly, affecting students' (neurons) scores. To ensure a fair comparison, the teacher might adjust (normalize) the difficulty of the test based on student performance.
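As a rough sketch of where the layer usually sits (assuming TensorFlow/Keras, with a toy architecture invented purely for illustration), batch normalization is typically placed after a convolution and before the non-linearity:

```python
import tensorflow as tf

# A toy model, not a recommended architecture -- it only illustrates placement.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", input_shape=(28, 28, 1)),
    tf.keras.layers.BatchNormalization(),        # normalize the conv activations
    tf.keras.layers.Activation("relu"),          # non-linearity after normalization
    tf.keras.layers.GlobalAveragePooling2D(),    # global average pooling from above
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```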
Batch normalization also helps to address the vanishing gradient problem in deep neural networks, a phenomenon in which gradients become very small during backpropagation, making it difficult for the earlier layers of the network to learn.
Imagine a group of people playing a whispering game in a long hallway. The first person whispers a message (the gradient) to the second person, who whispers it to the third, and so on down the line (the network's layers). By the time the message reaches the person at the end of the hallway, it is often distorted or has shrunk away to nothing.
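A tiny numerical illustration of the same idea: if each layer multiplies the gradient by a small factor (a hypothetical 0.25 here, the maximum slope of a sigmoid), the signal all but disappears after a couple of dozen layers.

```python
# Toy illustration of vanishing gradients (the 0.25 per-layer factor is an
# assumption chosen for illustration; real networks vary).
gradient = 1.0
for layer in range(20):
    gradient *= 0.25
print(gradient)   # about 9.1e-13 -- essentially nothing reaches the early layers
```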
Here is how normalization helps your models:
Scaling: By scaling the activations to have unit variance, the magnitudes of the activations are brought into a more reasonable range. This prevents the activations from becoming too large or too small, which can lead to vanishing gradients.
Shifting: By shifting the activations to have zero mean, the distribution of activations is centered around zero. This helps to prevent the network from getting stuck in local minima and improves the convergence of the training process.
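To make the scaling and shifting above concrete, here is a minimal NumPy sketch of the batch-norm transform on a single feature, with made-up activation values (gamma and beta are the learnable scale and shift from the paper, left at their identity values here):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])       # hypothetical activations across a mini-batch
eps = 1e-5                                # small constant for numerical stability

mean, var = x.mean(), x.var()             # batch statistics
x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance

gamma, beta = 1.0, 0.0                    # learnable scale and shift (identity here)
y = gamma * x_hat + beta
print(y.round(3))                         # [-1.342 -0.447  0.447  1.342]
```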