If you are just noticing an abundance of analogies in my blogs, know that I am a creative person trying to learn a mathematical topic.
Convolutional Neural Networks (CNNs) have revolutionized the field of computer vision, demonstrating exceptional capabilities in tasks such as image classification, object detection, and image segmentation. Their ability to extract meaningful features from images has led to significant advancements in various domains, from autonomous driving to medical imaging.
First introduced back on 8th September, CNNs are a type of artificial neural network specifically designed for processing and analyzing image data. They are inspired by the biological processes of the visual cortex in the brain, where neurons are organized in a hierarchical structure to process visual information.
Imagine a CNN as a detective investigating a crime scene. The detective examines the scene from different perspectives (e.g., close-up, from afar, under different lighting conditions), much like a CNN applying filters at different scales. The detective identifies clues (features) and pieces them together to form a picture of what happened (make a prediction).
The key components of a CNN include:
Convolutional Layers: apply filters to the input image, extracting features at different levels of abstraction. These filters are essentially small matrices that slide across the image, performing element-wise multiplications and summations.
Pooling Layers: downsample the feature maps, reducing the spatial dimensions while preserving the most important information. This helps to reduce computational complexity and make the network more invariant to small translations and rotations.
Non-Linear Activation Functions: introduce non-linearity into the network, allowing it to learn complex relationships between features. Common activation functions include ReLU and tanh.
Fully Connected Layers: the final layers of a CNN, which combine the extracted features into a final output. They can be used for classification, regression, and more. A minimal sketch stacking all four components follows this list.
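To make these components concrete, here is a minimal Keras sketch; the layer sizes and the 28 * 28 grayscale input are illustrative assumptions, not taken from the exercises below.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative architecture: sizes are assumptions, chosen only to show each component.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),               # 28 * 28 grayscale image
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolutional layer + ReLU non-linearity
    layers.MaxPooling2D((2, 2)),                   # pooling layer downsamples the feature maps
    layers.Flatten(),                              # flatten feature maps for the dense layer
    layers.Dense(10, activation="softmax"),        # fully connected layer produces the output
])
model.summary()
```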
To further illustrate the strengths of the CNN, we will compare it to the fully connected neural network (FCNN). To summarize the former's advantages over the latter:
Spatial Invariance: thanks to weight sharing and pooling, CNNs are largely invariant to small spatial transformations, such as translations and, to a lesser extent, rotations. This means they can recognize objects in different positions within an image, which is particularly important for tasks like object detection and image segmentation.
Feature Learning: CNNs are capable of automatically learning relevant features from the input data. This eliminates the need for manual feature engineering, which can be time-consuming and error-prone.
Hierarchical Representation: CNNs learn hierarchical representations of images, capturing increasingly complex features as the network deepens. This allows them to recognize patterns at different levels of abstraction, making them suitable for tasks like object classification and scene understanding.
Parameter Efficiency: CNNs typically have far fewer parameters than FCNNs, making them more efficient to train and deploy. This is due to the parameter sharing within convolutional layers.
This exercise compares the complexity of a CNN model with that of an FCNN model, then explores the formulas that determine each layer's number of parameters.
The convolutional layer's parameter count is kernel count * kernel size * input channels + biases. Each kernel/filter has a bias term, which adds 32 more parameters in the example below: 32 kernels, each with 3 * 3 * 1 = 9 weights, give 32 * 9 = 288 weights, and the 32 biases bring the total to 320 parameters.
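A quick way to verify this count is a one-layer Keras model; the 28 * 28 grayscale input and the 3 * 3 kernel size are assumptions chosen to reproduce the 320 figure.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed setup: 32 filters of size 3 * 3 over a single-channel 28 * 28 image.
conv_model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3)),  # weights: 32 * 3 * 3 * 1 = 288, biases: 32
])
conv_model.summary()            # reports 320 trainable parameters
```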
As for FCNNs, since they connect each neuron to every single input feature, the parameter count is neuron count * input size + biases (each neuron has its own bias term). In the example below, 288 neurons each connect to 784 input features, giving 288 * 784 = 225,792 weights; adding the 288 biases results in the FCNN having 226,080 parameters.
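The same check for the fully connected layer, again assuming the 28 * 28 single-channel input is flattened into 784 features:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed setup: the 28 * 28 image flattened into 784 features, feeding 288 neurons.
fc_model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Flatten(),   # 28 * 28 * 1 = 784 input features
    layers.Dense(288),  # weights: 288 * 784 = 225,792, biases: 288
])
fc_model.summary()      # reports 226,080 trainable parameters
```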
If you return to the September 10th section, you will recall the topic of padding in CNN convolution. Simply put, it is a technique that adds a border of zeroes around the image grid to increase the number of times the outer pixels are touched and scanned by the convolutional filter/kernel, which moves from left to right and top to bottom.
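Padding's effect on the output size can be seen with a quick Keras check; the 13 * 13 single-channel input here is a hypothetical stand-in that matches the exercise later in this section.

```python
import numpy as np
from tensorflow.keras import layers

# Hypothetical 13 * 13 single-channel input (batch size of 1), matching the exercise below.
x = np.zeros((1, 13, 13, 1), dtype="float32")

valid = layers.Conv2D(32, (6, 6), padding="valid")(x)  # no zero border
same = layers.Conv2D(32, (6, 6), padding="same")(x)    # zero border preserves the 13 * 13 size

print(valid.shape)  # (1, 8, 8, 32): 13 - 6 + 1 = 8
print(same.shape)   # (1, 13, 13, 32)
```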
The other topic is stride, the number of pixels the filter moves after each calculation. A higher stride value creates feature maps faster, though at the cost of capturing less detail.
This exercise explores how the stride value influences the CNN model's output. The experiment will be run on a 13 * 13 image with same padding (which adds a border of zeroes around the image), scanned by a 6 * 6 kernel.
Starting with the default stride value (1, 1), 32 feature maps, each of size 13 * 13, are generated.
When the kernel is set to stride 2 pixels per scan, the feature map size is reduced to (7, 7). With same padding, the output size is simply the input size divided by the stride, rounded up: ceil(13 / 2) = 7. Equivalently, using the general formula floor((input size + total padding - kernel size) / stride) + 1, and noting that same padding adds 5 pixels of zero padding in total for a 6 * 6 kernel, the result is floor((13 + 5 - 6) / 2) + 1 = 6 + 1 = 7.
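A minimal Keras sketch of this setup (assuming a single-channel input and a batch size of 1) reproduces both shapes:

```python
import numpy as np
from tensorflow.keras import layers

# Sketch of the exercise: 13 * 13 single-channel input, 32 filters of size 6 * 6, same padding.
x = np.zeros((1, 13, 13, 1), dtype="float32")

stride_1 = layers.Conv2D(32, (6, 6), strides=(1, 1), padding="same")(x)
stride_2 = layers.Conv2D(32, (6, 6), strides=(2, 2), padding="same")(x)

print(stride_1.shape)  # (1, 13, 13, 32): 32 feature maps of size 13 * 13
print(stride_2.shape)  # (1, 7, 7, 32): ceil(13 / 2) = 7
```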
A kernel will not overstep its image's boundaries, with or without padding; it stops at the rightmost border like a hiker halting at the edge of a cliff to avoid a fall. Padding in no way changes the fundamental behavior of the convolution operation; it only enlarges the grid the kernel is allowed to scan.
The relationship between feature map size and information content is not always straightforward. It depends on the specific features extracted, the downsampling techniques used, and the requirements of the task.
A large feature map is like a larger-aperture telescope, collecting more light and potentially capturing more detail, while a smaller feature map is like a higher-magnification telescope, focusing on specific features but possibly losing some global context.