In the vast landscape of image processing, the ability to extract meaningful features is paramount. These features, like the unique fingerprints of an image, hold the key to unlocking its secrets. From object recognition to image retrieval, feature extraction serves as the cornerstone of countless computer vision applications.
The SIFT (Scale-Invariant Feature Transform) algorithm is a computer vision technique used for feature detection and description. It is based on local appearance features and is designed to be invariant to scale and rotation, meaning that SIFT features can be reliably extracted from images even when they are scaled or rotated.
In addition to this geometric invariance, SIFT features are generally more robust to noise and to variations in lighting than many other feature detection methods.
Imagine exploring a foggy mountain range to find the highest peaks (keypoints). You vary the fog's thickness (scale) and compare the heights, the differences revealing the most prominent peaks.
The first stage of SIFT, scale-space extrema detection, identifies locations and scales in the image that can be detected repeatably even when the image is rescaled. It performs the following steps (a code sketch follows the list):
Build Gaussian Pyramid Octaves: create a series of blurred images at different scales (σ) using Gaussian filters. Each subsequent octave begins with a blurred version of the previous octave's image, downsampled by a factor of 2.
Build Levels: within each octave, create multiple levels by applying Gaussian filters with different σ values.
Calculate Differences of Gaussians (DoG): subtract adjacent levels within each octave to obtain the DoG images.
Identify Extrema: find the local extrema (minima and maxima) within each DoG image, comparing each pixel with its 26 neighbors – 8 neighbors in the current image, 9 in the scale above, and 9 in the scale below.
Refine Locations: use sub-pixel interpolation (e.g., quadratic interpolation) to refine the location of each keypoint.
Discard Weak Keypoints: remove keypoints with low contrast (a small |DoG| value), with a strong edge response (those lying along edges), or located too close to the image boundaries.
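To make these steps concrete, here is a minimal NumPy/OpenCV sketch of the pyramid construction and the 26-neighbor extremum check. The function names (build_dog_pyramid, is_extremum) and the parameter defaults (4 octaves, σ = 1.6) are illustrative choices, not part of any library API; a full implementation would also upsample the input image, use s + 3 blurred levels per octave, and handle borders more carefully.

```python
import cv2
import numpy as np

def build_dog_pyramid(image, num_octaves=4, levels=5, sigma=1.6):
    """Build Gaussian and Difference-of-Gaussians (DoG) pyramids (simplified)."""
    k = 2 ** (1.0 / (levels - 2))  # scale multiplier between adjacent levels
    dog_pyramid = []
    base = image.astype(np.float32)
    for _ in range(num_octaves):
        # Blur the octave's base image with progressively larger sigmas.
        gaussians = [cv2.GaussianBlur(base, (0, 0), sigma * k ** i)
                     for i in range(levels)]
        # Subtract adjacent levels to obtain the DoG images.
        dog_pyramid.append([g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])])
        # The next octave starts from a blurred level, downsampled by 2.
        base = gaussians[-1][::2, ::2]
    return dog_pyramid

def is_extremum(below, current, above, r, c):
    """Return True if pixel (r, c) is larger or smaller than all 26 neighbors."""
    patch = np.stack([below[r-1:r+2, c-1:c+2],
                      current[r-1:r+2, c-1:c+2],
                      above[r-1:r+2, c-1:c+2]])
    neighbors = np.delete(patch.ravel(), 13)  # index 13 is the center pixel itself
    return current[r, c] > neighbors.max() or current[r, c] < neighbors.min()
```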
After finding some promising peaks (keypoints) in the foggy mountains, you would want to confirm that they are truly significant: pinpoint each exact summit (sub-pixel interpolation), discard peaks on unstable, narrow ridges (edge points) since they are likely to collapse, and keep only those that stand notably higher than their surroundings (contrast thresholding).
SIFT's keypoint localization stage refines the keypoints detected in the previous stage and eliminates those that are not suitable for further processing. This is accomplished through the following steps (sketched in code after the list):
Sub-pixel Interpolation: refine each keypoint's location by fitting a 3D quadratic function to the DoG values in its neighborhood and finding the local extremum of the fit. This improves localization accuracy.
Edge Response Test: discard keypoints that lie on edges by examining the principal curvatures of the DoG image (computed from its 2×2 Hessian). If one curvature is much larger than the other, the keypoint sits on an edge and is discarded.
Contrast Thresholding: discard keypoints with low contrast (a small |DoG| value at the refined extremum). This eliminates responses caused by noise and artifacts.
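Below is a minimal sketch of the edge-response and contrast tests, following the criteria in Lowe's 2004 paper (curvature ratio r = 10 and contrast threshold 0.03 for pixel values in [0, 1]). The function names and the finite-difference Hessian are illustrative simplifications.

```python
import numpy as np

def passes_edge_test(dog, r, c, edge_ratio=10.0):
    """Reject edge-like keypoints using the ratio of principal curvatures."""
    # Estimate the 2x2 Hessian of the DoG image with finite differences.
    dxx = dog[r, c+1] - 2 * dog[r, c] + dog[r, c-1]
    dyy = dog[r+1, c] - 2 * dog[r, c] + dog[r-1, c]
    dxy = (dog[r+1, c+1] - dog[r+1, c-1] - dog[r-1, c+1] + dog[r-1, c-1]) / 4.0
    trace, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:          # curvatures of opposite sign: saddle/edge, reject
        return False
    # Equivalent to requiring the larger curvature < edge_ratio x the smaller one.
    return trace * trace / det < (edge_ratio + 1) ** 2 / edge_ratio

def passes_contrast_test(dog_value, threshold=0.03):
    """Keep only keypoints whose |DoG| response exceeds the contrast threshold."""
    return abs(dog_value) >= threshold
```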
When you have found that promising mountain peak (keypoint) and want to now describe its orientation, you could measure the steepness (gradient) of its slopes around the peak in different directions and single out the steepest one, which is the peak's dominant orientation.
In SIFT, orientation assignment determines the dominant orientation of each keypoint, making the feature descriptor invariant to image rotation. Here is a breakdown of how it works (a code sketch follows below):
Gradient Calculation: compute the gradient magnitude and orientation for pixels in a Gaussian-weighted circular neighborhood around the keypoint, with a radius proportional to the keypoint's scale. The gradients can be computed with simple pixel differences, the Sobel operator, or Gaussian derivative filters.
Orientation Histogram: create a histogram (36 bins covering 360° in Lowe's implementation, i.e., 10° per bin) to represent the distribution of gradient orientations. Each bin corresponds to a range of angles.
Histogram Filling: for each pixel in the neighborhood, add its gradient magnitude to the bin corresponding to its gradient orientation.
Peak Finding: find the bin with the highest value; it represents the dominant orientation of the keypoint. In Lowe's implementation, any other peak within 80% of the highest spawns an additional keypoint with that orientation.
Interpolation: since the true peak rarely falls exactly at a bin center, refine the orientation by fitting a parabola to the peak bin and its two neighbors.
The circular shape in SIFT orientation assignment ensures rotation invariance. Because a circle is a rotationally symmetric shape that remains unchanged when rotated by any angle, it treats all directions around the keypoint equally.
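Here is a simplified NumPy sketch of the orientation-assignment idea. It omits the Gaussian weighting and the circular mask, uses a square patch for brevity, and the function name dominant_orientation is an illustrative choice rather than a library call.

```python
import numpy as np

def dominant_orientation(patch, num_bins=36):
    """Estimate a keypoint's dominant orientation from a patch centered on it."""
    # Gradients via simple pixel differences.
    dy, dx = np.gradient(patch.astype(np.float32))
    magnitude = np.hypot(dx, dy)
    orientation = np.degrees(np.arctan2(dy, dx)) % 360.0
    # Accumulate gradient magnitudes into an orientation histogram.
    hist, _ = np.histogram(orientation, bins=num_bins, range=(0, 360),
                           weights=magnitude)
    peak = int(np.argmax(hist))
    # Parabolic interpolation over the peak and its two neighbors refines the angle.
    left, right = hist[(peak - 1) % num_bins], hist[(peak + 1) % num_bins]
    denom = left - 2 * hist[peak] + right
    offset = 0.0 if denom == 0 else 0.5 * (left - right) / denom
    return ((peak + offset) * (360.0 / num_bins)) % 360.0
```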
After you have found a promising peak (keypoint) and determined its orientation (direction), you would want to capture its unique features to fully describe it.
The SIFT keypoint descriptor is a 128-dimensional feature vector that captures the local image appearance around a keypoint. It is designed to be invariant to changes in scale, rotation, and illumination. The descriptor is produced by the following steps (see the sketch after the list):
Orientation Alignment: rotate the local region to the keypoint's dominant orientation (determined in the previous stage) so the descriptor is computed in a rotation-invariant frame.
Grid Division: divide the 16×16 pixel region around the keypoint into a 4×4 grid of sub-regions, each 4×4 pixels.
Gradient Magnitude and Orientation Calculation: for each sub-region, calculate the gradient magnitude and orientation of its pixels.
Histogram Creation: create an 8-bin histogram for each sub-region to represent its gradient orientation distribution.
Feature Vector: concatenate the 16 histograms (4×4 sub-regions × 8 bins) into a single 128-dimensional vector.
Normalization: normalize the vector to unit length for illumination invariance; large components are then clamped (at 0.2 in Lowe's implementation) and the vector is renormalized.
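The following sketch builds such a descriptor from a 16×16 patch assumed to be already rotated to the keypoint's dominant orientation. It omits the Gaussian weighting and the trilinear interpolation between bins that real SIFT uses, and sift_descriptor is an illustrative name.

```python
import numpy as np

def sift_descriptor(patch):
    """Build a 128-D descriptor from a 16x16, orientation-aligned patch."""
    assert patch.shape == (16, 16)
    dy, dx = np.gradient(patch.astype(np.float32))
    magnitude = np.hypot(dx, dy)
    orientation = np.degrees(np.arctan2(dy, dx)) % 360.0
    descriptor = []
    for i in range(0, 16, 4):          # 4x4 grid of 4x4-pixel sub-regions
        for j in range(0, 16, 4):
            hist, _ = np.histogram(orientation[i:i+4, j:j+4],
                                   bins=8, range=(0, 360),
                                   weights=magnitude[i:i+4, j:j+4])
            descriptor.append(hist)
    vec = np.concatenate(descriptor)        # 16 sub-regions x 8 bins = 128 values
    vec /= (np.linalg.norm(vec) + 1e-7)     # unit length for illumination invariance
    vec = np.minimum(vec, 0.2)              # clamp large values (Lowe's 0.2 cap)
    vec /= (np.linalg.norm(vec) + 1e-7)     # renormalize after clamping
    return vec
```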
This exercise walks through the basic setup of the SIFT algorithm in a Python Jupyter notebook and compares its output on grayscale images corrupted by different kinds of noise.
The original SIFT algorithm, introduced by David G. Lowe in 1999, converts images to grayscale before extracting any features. Extensions that build color-aware SIFT descriptors have since become available.
In OpenCV, SIFT_create() constructs the detector; the image (grayscale or color) and an optional mask are then passed to its detect() or detectAndCompute() methods. OpenCV converts color images to grayscale internally, but converting beforehand is often recommended for clarity and efficiency.
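A minimal usage example, assuming OpenCV ≥ 4.4 (where SIFT ships with the main package) and a placeholder image path:

```python
import cv2

# "image.png" is a placeholder path for your own test image.
image = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                                      # build the detector
keypoints, descriptors = sift.detectAndCompute(image, None)   # mask = None

print(f"{len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")

# Draw keypoints with their scales and orientations for inspection.
vis = cv2.drawKeypoints(image, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", vis)
```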
Here is a comparison of the SIFT results on the original and noise-corrupted images (a sketch for reproducing it follows the observations):
The original image yielded 1,100 keypoints, while the noisy image yielded 1,184. This suggests that SIFT is relatively robust to noise, since the number of detected keypoints changed only slightly.
The keypoints detected in the two images are largely similar, with some variation caused by the noise.
Keypoints are distributed throughout the images but concentrate in areas of high contrast or texture (e.g., the fur above the hat and the arch).
The noisy image produced more keypoints, likely because noise introduces spurious local extrema. These extra keypoints are probably less informative than those detected in the original image.
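A sketch of how such a comparison might be reproduced, assuming a grayscale test image at the placeholder path "image.png" and zero-mean Gaussian noise (σ = 15 is an arbitrary choice):

```python
import cv2
import numpy as np

def count_keypoints(img):
    """Detect SIFT keypoints on a grayscale image and return their count."""
    sift = cv2.SIFT_create()
    return len(sift.detect(img, None))

original = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

# Corrupt the image with zero-mean Gaussian noise, clipping back to [0, 255].
noise = np.random.normal(0, 15, original.shape)
noisy = np.clip(original.astype(np.float32) + noise, 0, 255).astype(np.uint8)

print("original:", count_keypoints(original), "keypoints")
print("noisy:   ", count_keypoints(noisy), "keypoints")
```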