Images are inherently 2D matrices of pixels.
Traditional artificial neural networks (ANNs) require vectorized inputs, so a 2D image must be flattened into a 1D vector by concatenating its rows (e.g., a 28×28 grayscale image becomes a 784-element vector). However, doing so loses spatial relationships between pixels: vertically neighboring pixels in the original image are no longer adjacent in the vector.
For large images, this causes rapid growth in the number of trainable parameters when fully connected (dense) layers receive high-dimensional inputs (e.g., a 784-element vector feeding a first hidden layer of 1,000 neurons requires 784 × 1,000 weights + 1,000 biases = 785,000 parameters). High parameter counts make models memorize noise in the training data, reducing generalization. Training also becomes slower and more memory-intensive.
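To make this arithmetic concrete, here is a minimal Python sketch; the 224×224×3 input is an assumed ImageNet-style size added for comparison, not from the example above:

```python
# Parameter count for a dense layer: one weight per input-neuron pair, plus one bias per neuron.
def dense_params(n_inputs: int, n_neurons: int) -> int:
    return n_inputs * n_neurons + n_neurons

print(dense_params(28 * 28, 1000))        # 785,000 for a flattened 28x28 grayscale image
print(dense_params(224 * 224 * 3, 1000))  # 150,529,000 for a flattened 224x224 RGB image
```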
The convolutional neural network (CNN) is designed specifically for grid-like data (e.g., images, videos, audio spectrograms). It is built from convolutional layers, pooling layers, and, optionally, dense layers. Without needing to flatten images into 1D vectors, it processes 2D/3D data directly (e.g., images as height × width × channels).
This design produces far fewer parameters through weight sharing: each convolutional filter reuses the same weights at every position in the input. It also captures local patterns (e.g., edges and textures) and spatial hierarchies, and because the inputs keep their 2D structure, it can detect features regardless of their position.
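For comparison with the dense-layer sketch above, a convolutional layer's parameter count depends only on its filters, not on the input's spatial size. A minimal sketch, assuming 32 filters of size 3×3 on a single-channel input (sizes chosen for illustration):

```python
# Parameter count for a conv layer: each filter has kh x kw x in_channels weights plus one bias.
def conv_params(kh: int, kw: int, in_channels: int, n_filters: int) -> int:
    return (kh * kw * in_channels + 1) * n_filters

print(conv_params(3, 3, 1, 32))  # 320 parameters, whether the image is 28x28 or 2800x2800
```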
Edge detection in CNNs uses specialized filters (kernels) to identify abrupt intensity changes in images, which correspond to edges. Here is how it works with an example (a code sketch follows these steps):
An image with a sharp vertical edge between columns 3 and 4 (a transition from pixel value 10 to 0) is input into the CNN.
The CNN applies a kernel that acts as a vertical edge detector: it highlights vertical transitions by contrasting the left side (positive weights) against the right side (negative weights).
The convolution output contains non-zero values (30) wherever the kernel aligns with the vertical edge (columns 3–4 of the input).
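A minimal NumPy/SciPy sketch of this example; the exact 6×6 image and kernel values are assumed for illustration, consistent with the 10-to-0 transition described above:

```python
import numpy as np
from scipy.signal import correlate2d

# 6x6 image: bright (10) on the left, dark (0) on the right -> vertical edge at columns 3-4
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

# Vertical edge detector: positive weights on the left, negative weights on the right
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# 'valid' cross-correlation (what CNN layers actually compute): 6x6 input, 3x3 kernel -> 4x4 output
output = correlate2d(image, kernel, mode="valid")
print(output)  # every row is [0, 30, 30, 0]; the 30s mark where the kernel straddles the edge
```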
There are several different techniques for edge detection. Below is a breakdown of their characteristics and typical outputs, with a code sketch after the list:
Sobel operator: Detects horizontal and vertical edges using gradient approximations. Facilitates simple, fast, directional edge detection, but is sensitive to noise and might miss diagonal edges.
Laplacian: Detects edges by highlighting regions of rapid intensity change using the second derivative. Detects edges regardless of orientation, but is also noise-sensitive and produces double edges.
Canny: Multi-stage algorithm for robust edge detection, including noise reduction, gradient calculation, non-maximum suppression, and hysteresis thresholding. Balances noise reduction with edge preservation and detects continuous edges, but is more computationally intensive than the other two techniques and requires parameter tuning.
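All three detectors are available in OpenCV. A minimal sketch, assuming a grayscale image file `input.png` exists; the Canny thresholds are illustrative and would need tuning per image:

```python
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Sobel: first-derivative gradients along x (vertical edges) and y (horizontal edges)
sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

# Laplacian: second derivative, responds to edges of any orientation
laplacian = cv2.Laplacian(img, cv2.CV_64F)

# Canny: multi-stage pipeline; 100 and 200 are illustrative hysteresis thresholds
edges = cv2.Canny(img, 100, 200)
```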
The convolutional layer
To preserve edge information in images, padding adds extra pixels around the input image or feature map to control the spatial dimensions of the output after convolution. This prevents the feature map from shrinking at every layer and keeps information at the borders from being lost.
Variants of padding include (see the sketch after this list):
Valid (no padding): No pixels added, so the output is smaller than the input. For example, a 6×6 input with a 3×3 kernel → 4×4 output.
Same padding: Adds pixels so the output size matches the input size (for stride 1), achieved by symmetrically padding zeros around the input. For example, a 5×5 input with a 3×3 kernel → pad 1 pixel on all sides → 5×5 output.
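Both cases follow the general output-size formula o = floor((n + 2p - f) / s) + 1 for input size n, padding p, filter size f, and stride s. A minimal sketch verifying the two examples above:

```python
def conv_output_size(n: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output size for an n x n input, f x f kernel, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3, p=0))  # 4: valid padding shrinks 6x6 to 4x4
print(conv_output_size(5, 3, p=1))  # 5: same padding keeps 5x5 at 5x5
```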
For very large images, kernels can stride (take steps larger than one pixel) through the 2D matrix, outputting a smaller feature map. Striding reduces spatial dimensions, lowers the computational workload, and increases the effective area of the input each neuron 'sees' (its receptive field).
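Reusing `conv_output_size` from the sketch above with an assumed 7×7 input shows the effect of striding:

```python
print(conv_output_size(7, 3, s=1))  # 5: stride 1 keeps most of the spatial extent
print(conv_output_size(7, 3, s=2))  # 3: stride 2 skips every other position, shrinking the map
```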
A 1×1 convolution is a specialized CNN layer that operates on the channel dimension of the input while preserving its spatial dimensions. At each spatial location (pixel), the model applies a weighted sum across all input channels, and each filter produces one output channel.
Imagine a 1×1 convolution as a bartender mixing drinks. Using multiple ingredients (input channels), they combine precise amounts (weights) of each ingredient to create new flavors (output channels) in custom cocktails (feature maps) tailored to the recipe (task).
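A minimal NumPy sketch under assumed sizes (64 input channels, 32 filters, an 8×8 spatial grid, random values):

```python
import numpy as np

x = np.random.rand(64, 8, 8)  # input: 64 channels over an 8x8 spatial grid
w = np.random.rand(32, 64)    # 32 filters, each holding one weight per input channel

# At every pixel, mix the 64 input channels into 32 output channels
y = np.einsum("oc,chw->ohw", w, x)
print(y.shape)                # (32, 8, 8): channel count changed, spatial size preserved
```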
CNNs can also take volumetric (3D) inputs, such as RGB images. Convolution operations process these volumes using filters (kernels) that slide across the spatial dimensions while spanning all channels. To break down the process in terms of RGB color channels: a 3×3 kernel on an RGB image is actually 3×3×3; at each position, the per-channel products are summed into a single number, so each filter still produces one 2D feature map.
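A minimal NumPy sketch of a single output value of this RGB convolution (random values, channels-last layout assumed):

```python
import numpy as np

image = np.random.rand(6, 6, 3)   # H x W x 3 RGB input
kernel = np.random.rand(3, 3, 3)  # one kernel spanning all three channels

# One output value: multiply the 3x3x3 window element-wise by the kernel,
# then sum over height, width, AND channels -- three channels in, one number out
patch = image[0:3, 0:3, :]
print(np.sum(patch * kernel))     # scalar result for output position (0, 0)
```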
Adding multiple kernels within a single convolutional layer (a.k.a. stacking kernels) allows detection of a wide range of features. Multiple kernels can handle variations in the input data (e.g., lighting, orientation) by detecting redundant or complementary features.
For example, the first kernel detects vertical edges, the second horizontal edges, and the third diagonal edges. Each kernel also operates on the input concurrently, improving computational efficiency.
Ideal use cases for kernel stacking include shallow networks and tasks requiring diverse low-level feature extraction (e.g., texture classification).
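A sketch of three stacked kernels producing three feature maps from one grayscale input (kernel values assumed for illustration):

```python
import numpy as np
from scipy.signal import correlate2d

image = np.random.rand(6, 6)            # a single grayscale input
kernels = [
    np.array([[1, 0, -1]] * 3),         # vertical-edge detector
    np.array([[1, 0, -1]] * 3).T,       # horizontal-edge detector
    np.eye(3) - np.flip(np.eye(3), 1),  # crude diagonal-contrast detector
]

# Each kernel produces its own feature map; the layer's output stacks them
feature_maps = np.stack([correlate2d(image, k, mode="valid") for k in kernels])
print(feature_maps.shape)               # (3, 4, 4): one 4x4 map per kernel
```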
On the other hand, adding multiple sequential convolutional layers to a network (a.k.a. stacking convolutional layers) introduces hierarchical feature learning, where early layers detect simple features (e.g., edges, corners) and deeper layers combine them into complex patterns (e.g., shapes, objects).
The higher layers of such a hierarchy allow the model to learn semantic concepts (e.g., "car wheels" or "animal faces"), since each layer aggregates information from a broader spatial region of the input.
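A minimal sketch of how that spatial region (the receptive field) grows with stacked 3×3, stride-1 layers; each layer adds k - 1 pixels of context:

```python
def receptive_field(num_layers: int, k: int = 3) -> int:
    """Receptive field of a neuron after num_layers stride-1 convolutions."""
    r = 1
    for _ in range(num_layers):
        r += k - 1  # each layer widens the view by (k - 1) input pixels
    return r

for n in (1, 2, 3):
    print(n, receptive_field(n))  # 1 -> 3, 2 -> 5, 3 -> 7: deeper neurons see more of the input
```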
The pooling layer in a CNN downsamples the spatial dimensions (height and width) of feature maps while retaining critical information. Shrinking the feature maps means subsequent layers need fewer parameters, which reduces the risk of memorizing noise in the training data. It also makes the model more robust to small translations (e.g., slight shifts in object position).
Imagine reading a detailed report and condensing each section into bullet points. Just as bullet points capture critical ideas without verbosity, pooling extracts dominant features (e.g., edges, shapes) while discarding less relevant details. This summary is easier to process for subsequent layers, much like a condensed report aids faster decision-making.
Two variants of pooling are:
Max pooling: Selects maximum value from each sub-region (kernel window) of input. Preserves strongest activations (e.g., edges, textures), enhancing feature detection. Common in early CNN architectures for sharper feature retention.
Average pooling: Computes average value of each sub-region. Smooths features, reducing sensitivity to noise. Useful for downsampling while retaining contextual information (e.g., background regions).
For a more concrete view of what the pooling layer computes:
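Below is a minimal NumPy sketch of both variants on an assumed 4×4 feature map, using a 2×2 window with stride 2:

```python
import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 8, 1],
                 [3, 4, 5, 6]])

# Group the 4x4 map into non-overlapping 2x2 windows, then reduce each window
windows = fmap.reshape(2, 2, 2, 2)

print(windows.max(axis=(1, 3)))   # max pooling:     [[6 4] [7 8]]
print(windows.mean(axis=(1, 3)))  # average pooling: [[3.75 2.25] [4.   5.  ]]
```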
The fully connected layer is the final stage in a CNN, responsible for classification: it takes the high-level features from the preceding layers, flattens them, and transforms them into class probabilities (typically via a softmax).
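A minimal NumPy sketch of this final stage under assumed sizes (a 32-channel 4×4 feature volume mapped to 10 classes, random weights for illustration):

```python
import numpy as np

features = np.random.rand(32, 4, 4)  # high-level feature maps from the preceding layers
x = features.flatten()               # 512-element vector

W = np.random.rand(10, x.size) * 0.01  # one weight row per class (hypothetical values)
b = np.zeros(10)

logits = W @ x + b
probs = np.exp(logits) / np.exp(logits).sum()  # softmax turns scores into probabilities
print(probs.sum())                             # 1.0
```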
Landmark CNN architectures and their successes:
LeNet-5 (1998): Yann LeCun's pioneering CNN for handwritten digit recognition (famously applied to reading bank checks); established the convolution → pooling → fully connected template.
AlexNet (2012): Won the ImageNet classification challenge by a wide margin, reviving interest in deep learning; popularized ReLU activations, dropout, and GPU training.
VGGNet (2014): Demonstrated the value of depth by stacking many small 3×3 convolutions into uniform 16- to 19-layer networks.
GoogLeNet (2014): Introduced the Inception module, which applies kernels of several sizes in parallel and uses 1×1 convolutions to keep computation manageable.
ResNet (2015): Introduced residual (skip) connections, enabling networks with over 100 layers to train without degradation; won the 2015 ImageNet challenge.
Feature maps