Our desire to create machines that can do what we cannot, under certain conditions, implies a secondary desire, conscious or not, for our work to surpass us.
Traditional computer vision methods often relied on handcrafted features, requiring domain experts to carefully design and engineer specific features for image analysis. This manual process was time-consuming, labor-intensive, and limited in its ability to capture complex patterns and relationships within images. The limitations of traditional computer vision created a need for a new approach that could automatically learn features from data and handle more complex tasks.
Convolutional Neural Networks (CNNs), on the other hand, are a type of deep learning neural network specifically designed to automatically process and analyze structured, grid-like data such as images.
The key components of a CNN include:
Convolutional Layers: apply filters to the input data, extracting features at different levels of abstraction.
Pooling Layers: reduce the dimensionality of feature maps while preserving important information, downsizing the input while retaining the most relevant features. This reduces computational complexity and improves invariance to small shifts or distortions.
Flatten Layers: transform the multi-dimensional feature maps generated by convolutional and pooling layers into a one-dimensional vector that fully connected layers can accept. They introduce no new parameters; they only reshape the data.
Fully Connected Layers: combine the extracted features to make predictions or classifications, similar to traditional neural networks.
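To make these components concrete, below is a minimal Keras sketch of a CNN that stacks all four layer types; the layer sizes and activations are illustrative assumptions, not values taken from the figures in this section.

```python
import tensorflow as tf

# A minimal, illustrative CNN; the layer sizes are assumptions for demonstration.
model = tf.keras.Sequential([
    # Convolutional layer: 16 filters of size 3*3 extract local features.
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # Pooling layer: downsizes each feature map by taking the max over 2*2 regions.
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Flatten layer: reshapes the 3D feature maps into a 1D vector (no parameters).
    tf.keras.layers.Flatten(),
    # Fully connected layer: combines the extracted features into a prediction.
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```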
The image below shows the performance of several image recognition models in the 2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The winner was ResNet (residual network), a CNN architecture that can train very deep networks without suffering from the vanishing gradient problem, achieving a top-5 error rate of 3.57% compared to the higher estimated human-level error rate of about 5.1%.
Convolution is a mathematical operation that involves combining two functions to produce a third function. In the context of deep learning, convolution is used to extract features from input data, such as images or audio signals.
The image demonstrates the process of convolution in a CNN. Here is a breakdown of the components:
Input Image: the original image is a 5*5 matrix of numerical values.
Filter: a 2*2 matrix, representing the filter or kernel used for convolution, is applied to the input image. The filter size (together with the stride and padding) determines the size of the output feature map.
Element-wise Multiplication: the filter elements are multiplied with the corresponding elements of the input image.
Summation: the resulting products are summed to produce a single value.
Sliding Window: the filter is slid across the input image, applying the same operation at each position.
Feature Map: the final output is a 4*4 feature map, representing the activation of the filter at different locations in the input image.
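The steps above can be reproduced directly in code. Below is a minimal NumPy sketch of the sliding-window computation; the input and filter values are made up purely for illustration.

```python
import numpy as np

# Illustrative 5*5 input image and 2*2 filter; the values are arbitrary.
image = np.arange(25).reshape(5, 5)
kernel = np.array([[1, 0],
                   [0, -1]])

kh, kw = kernel.shape
out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
feature_map = np.zeros((out_h, out_w))  # 4*4 output

# Slide the filter across the image: element-wise multiplication, then summation.
for i in range(out_h):
    for j in range(out_w):
        window = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(window * kernel)

print(feature_map.shape)  # (4, 4)
```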
In practice, CNNs often use multiple filters to extract different features from the input data. Each filter allows the network to identify regions of the image that share similar characteristics with that filter.
As illustrated in the figure, different filters can extract distinct features from the input image, resulting in diverse feature maps. The convolutional process proceeds in the following order:
Input Image: the original image serves as the input to the convolutional layer.
Filter Application: the first filter, defined by its weight matrix, is applied to the input image through a sliding window operation.
Element-wise Multiplication: the filter elements are multiplied with the corresponding elements of the input image.
Summation: the resulting products are summed to produce a single value.
Sliding Window: the filter is moved across the input image, repeating the above steps for each position.
Feature Map: the final output is a feature map, which represents the activation of the filter at different locations in the input image.
By examining the filters at different layers in a convolutional neural network, we can observe how the model learns to extract progressively complex features. Lower-level features might represent basic elements like lines and colors, mid-level features might represent shapes and contours, and higher-level features might represent more abstract concepts such as objects or scenes.
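One way to observe this hierarchy is to read out the feature maps produced by a convolutional layer. The sketch below assumes a trained Keras model named `model`, such as the one defined earlier, and is only meant to illustrate the idea.

```python
import numpy as np
import tensorflow as tf

# Assumes a trained Keras CNN named `model` (e.g. the sketch defined earlier).
# Build a helper model that returns the feature maps of its first convolutional layer.
first_conv = next(layer for layer in model.layers
                  if isinstance(layer, tf.keras.layers.Conv2D))
feature_extractor = tf.keras.Model(inputs=model.inputs, outputs=first_conv.output)

image = np.random.rand(1, 28, 28, 1).astype("float32")  # placeholder input image
feature_maps = feature_extractor(image)
print(feature_maps.shape)  # e.g. (1, 26, 26, 16): one map per filter
```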
Neural networks view images in a fundamentally different way from their human creators. Humans rely on top-down processing and contextual understanding, while neural networks use bottom-up processing and learn features automatically.
To better describe the convolution process, below is a visual representation of it coupled with a list of steps:
Filter Placement: the 3*3 filter is placed over the top-left corner of the 10*10 input image.
Element-wise Multiplication: the elements of the filter are multiplied with the corresponding elements of the input image.
Summation: the resulting products are summed to produce a single value.
Sliding Window: the filter is shifted one position to the right and the process is repeated for each position of the filter within the input image, resulting in an 8*8 feature map.
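The size of the resulting feature map follows directly from the input and filter sizes. The small helper below assumes a stride of 1 and no padding, matching the examples in this section.

```python
def output_size(input_size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(output_size(10, 3))  # 8 -> a 10*10 input and a 3*3 filter give an 8*8 feature map
print(output_size(5, 2))   # 4 -> matches the earlier 5*5 input with a 2*2 filter
```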
In CNNs, convolution is typically applied to multi-channel images such as RGB color images. For these images, the input data is a 3D tensor, representing the height, width, and channels of the image.
The filter used for convolution must have the same depth as the input image, so for a 3-channel (RGB) image, the filter must also have 3 channels. Convolution is performed on each channel of the input image, and the per-channel results are summed into a single output value. For the creative thinker, think of convolution as a game in which you slide a small box (the filter) across the tiled floor of a room (the image); the goal is to have the box touch every tile (pixel) at least once.
The multi-channel convolution process differs from convolution with a single channel in:
Number of Channels: the input image (red, green, and blue) and the filter have the same number of channels.
Channel-wise Convolution: the filter is applied to each channel of the input image, and the per-channel results are summed into a single output feature map.
Bias: a bias term is added to the output feature map, which can help to shift the activation values.
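Below is a hedged NumPy sketch of multi-channel convolution for a single filter; the image, filter, and bias values are arbitrary and only serve to show the channel-wise multiplication, the summation over channels, and the bias addition.

```python
import numpy as np

# Illustrative 3-channel (RGB) image and one 3*3 filter with the same depth.
rng = np.random.default_rng(0)
image = rng.random((10, 10, 3))   # height x width x channels
kernel = rng.random((3, 3, 3))    # the filter depth matches the input depth
bias = 0.1                        # bias term added to every output value

out_h, out_w = image.shape[0] - 3 + 1, image.shape[1] - 3 + 1
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        window = image[i:i + 3, j:j + 3, :]
        # Channel-wise multiplication, a sum over all channels, then the bias.
        feature_map[i, j] = np.sum(window * kernel) + bias

print(feature_map.shape)  # (8, 8): a single-channel map for this one filter
```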
Compared to the fully connected layers of a deep neural network (DNN), convolutional layers in a CNN share weights across the spatial dimensions of the input, reducing the number of parameters significantly. Fewer parameters mean less overfitting and greater training efficiency.
In the example below, the fully connected layer on the left has 9 * 4 = 36 parameters, as each neuron is connected to every neuron in the previous layer. Meanwhile, the convolutional layer has only 4 parameters (W11, W12, W21, W22), as the same weights are shared across different spatial locations.
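The difference can be verified with a short Keras sketch; the 9-input/4-output sizes mirror the example above, and bias terms are disabled so the counts match the 36-versus-4 comparison.

```python
import tensorflow as tf

# Fully connected: each of the 9 inputs connects to every one of the 4 outputs.
dense = tf.keras.Sequential([
    tf.keras.layers.Dense(4, use_bias=False, input_shape=(9,)),
])
print(dense.count_params())  # 36 = 9 * 4

# Convolutional: one 2*2 filter shared across every position of a 3*3 input.
conv = tf.keras.Sequential([
    tf.keras.layers.Conv2D(1, (2, 2), use_bias=False, input_shape=(3, 3, 1)),
])
print(conv.count_params())   # 4 = the shared weights W11, W12, W21, W22
```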
As mentioned in this section, CNNs are well-suited for processing image data, which often contains three-dimensional information such as height, width, and color channels. Unlike traditional DNNs, CNNs preserve spatial information, allowing them to extract features based on the spatial relationships between pixels.
Consider two images of a bird below as an example: one with the beak in the upper left corner and the other with the beak in the middle. While their beaks are in different positions, a CNN can extract the relevant features using convolution, regardless of the object's location within the image.
This exercise aims to analyze the effects of convolution (convolutional and pooling layers) on a neural network, particularly on each model's number of parameters.
Each convolutional layer will use the same following parameters: 32 filters, a kernel size of 3*3, and an input shape of 28*28*1.
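As a reference point, one possible way to define such a CNN in Keras is sketched below; the exact layer stacking after the convolutional layer is an assumption.

```python
import tensorflow as tf

# One possible CNN for this exercise; the layers after the convolution are assumptions.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
cnn.summary()  # the Conv2D layer has 3*3*1*32 + 32 = 320 parameters
```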
Looking at the summary below, the CNN is a relatively small model given its input shape. Fewer parameters can help prevent memorization of the complete training dataset (overfitting), improve generalization, make the model easier to inspect and evaluate, and lower the computational cost of deployment and usage.
Based on the results below, pooling layers do not introduce additional parameters into a CNN, as pooling operations are purely computational and do not involve any learnable weights or biases.
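A sketch of the same model with a pooling layer added illustrates this; the pooling layer reports zero parameters in the summary.

```python
import tensorflow as tf

# The same sketch with a pooling layer added; MaxPooling2D contributes 0 parameters.
cnn_pooled = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),  # purely computational, no weights or biases
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
cnn_pooled.summary()
```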
Dense layers have no filter parameter; they are configured only with a number of units and, for the first layer, an input shape. Instead of using softmax, as we did for the CNN output layers, the DNNs will use ReLU, which is generally preferred in deep networks due to its computational efficiency and its ability to mitigate the vanishing gradient problem.
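A sketch of one such 'pure' DNN is shown below; the layer widths are assumptions chosen only to illustrate the parameter count.

```python
import tensorflow as tf

# A 'pure' DNN on the flattened 28*28 input; the layer widths are assumptions.
dnn = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
])
dnn.summary()  # the first Dense layer alone has 784 * 128 + 128 = 100,480 parameters
```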
Looking at the summary below, the number of parameters in this 'pure' DNN is notably greater than in the CNNs from before. Although a larger parameter count has its drawbacks, having too few parameters can limit a network's learning capacity (underfitting) and hurt its ability to generalize.