Imagine what we could perceive with a computer hooked to our brain.
Computer vision object detection is a powerful technique that enables computers to identify and locate objects within images or videos. It has a wide range of applications, from self-driving cars and medical image analysis to surveillance systems and augmented reality. By understanding the principles and techniques behind object detection, you can unlock new possibilities for computer vision applications.
Object detection is a computer vision task that involves identifying and localizing objects within an image. Unlike basic CNN classifiers that only predict the class of an entire image, object detection algorithms aim to pinpoint the exact location of objects within the image and classify them.
Of course, such an arduous task invites numerous challenges:
Bounding Box Localization: must accurately localize the bounding box that encloses each object.
Multiple Object Detection: must be able to handle scenarios where multiple objects of different classes appear in the same image. Techniques like non-maximum suppression (NMS) are used to filter out redundant detections (a minimal sketch follows this list).
Scale Invariance: should be able to detect objects at different scales, from small objects to large objects. This requires the ability to handle objects of varying sizes and aspect ratios.
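To make non-maximum suppression concrete, here is a minimal NumPy sketch of the greedy algorithm: keep the highest-scoring box, drop every remaining box that overlaps it beyond an IoU threshold, and repeat. The 0.5 threshold is an illustrative default, not a universal setting.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) array of confidence scores
    Returns indices of the boxes to keep.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep
```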
Object detection algorithms can be broadly categorized into two main approaches:
One-Stage Detectors: directly predict object bounding boxes and class probabilities from the input image, without requiring a separate region proposal step. Examples include YOLO (You Only Look Once) and the Single Shot Detector (SSD).
Two-Stage Detectors: first generate potential regions of interest within the image, known as region proposals, which are then classified, and their bounding boxes are refined to accurately localize the detected objects. Examples include Faster R-CNN (introduced later in this section) and R-FCN.
The boxes you see framing objects on screens linked to object detection algorithms begin as region proposals. They serve as a starting point for object detection algorithms, reducing the search space and improving efficiency. Various methods can be used to generate region proposals, including:
Selective Search: groups similar image regions based on color, texture, size, and shape compatibility. It consists of two main parts: a hierarchical grouping algorithm and diversification strategies.
Region Proposal Networks (RPNs): use CNNs to directly predict region proposals from the input image.
Sliding Windows: slides a window of a fixed size across the image and checks for objects within each window.
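As a rough illustration of the sliding-window idea, the sketch below enumerates fixed-size windows across an image; the window size and stride are arbitrary choices for the example. Each window would then be handed to a classifier:

```python
def sliding_windows(image_height, image_width, window=(128, 128), stride=32):
    """Yield (x1, y1, x2, y2) windows covering the image.

    The window size and stride here are illustrative; real systems
    sweep multiple window sizes to handle scale variation.
    """
    win_h, win_w = window
    for y in range(0, image_height - win_h + 1, stride):
        for x in range(0, image_width - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)

# Each window would then be fed to a classifier to check for objects.
boxes = list(sliding_windows(480, 640))
```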
The Region-Based Convolutional Neural Network (R-CNN) is a type of object detection algorithm that uses region proposals to identify and localize objects within images. Its acronym contains the same letters as CRNN, but it is not designed for sequence-to-sequence tasks.
R-CNN is like a detective investigating a crime scene. The detective collects evidence (features) from the scene (image), then lists out potential suspects or areas of interest (region proposals), and finally determines whether each suspect is guilty (classifies the object). At its core, R-CNN analyzes visual features to recognize and locate objects within an image.
The components in R-CNN, executed in sequence, include:
Region Proposal Generation: generate region proposals with Selective Search or any other method (see above), considering color, texture, and size compatibility.
Feature Extraction: use a pre-trained CNN to extract features from each region proposal (see the sketch after this list).
Fixed-Size Input: resize region proposals to a fixed size to ensure consistent input to the CNN.
Object Classification: multiple Support Vector Machines (SVMs) are trained to classify each region proposal into different object categories.
Bounding Box Regression: a linear regression model is trained to refine each detected object's predicted box toward the ground-truth bounding box.
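Putting the middle steps together, here is a rough PyTorch sketch of per-proposal feature extraction. R-CNN originally warped proposals to a fixed size and fed them to an AlexNet-style CNN whose features went to SVMs; this sketch swaps in torchvision's ResNet-18 purely for convenience, so treat it as an illustration rather than the original pipeline:

```python
import torch
import torchvision
from torchvision import transforms

# Pre-trained CNN backbone; the classification head is dropped so the
# network acts as a pure feature extractor (ResNet-18 is our stand-in).
backbone = torchvision.models.resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()
backbone.eval()

# Warp every proposal to the fixed input size the CNN expects.
to_fixed_input = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def extract_features(image, proposals):
    """image: HxWx3 uint8 numpy array; proposals: list of (x1, y1, x2, y2)."""
    feats = []
    with torch.no_grad():
        for x1, y1, x2, y2 in proposals:
            crop = to_fixed_input(image[y1:y2, x1:x2]).unsqueeze(0)
            feats.append(backbone(crop).squeeze(0))  # 512-dim vector
    return torch.stack(feats)  # one feature vector per proposal, fed to SVMs
```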
Despite these strengths, R-CNN has several shortcomings when performing object detection:
Slow Inference Speed: the two-stage nature of R-CNNs can lead to slow inference times, making them less suitable for real-time applications.
Computational Cost: the region proposal step, typically based on Selective Search, can be computationally expensive.
Non-End-to-End Training: R-CNN requires separately training the CNN feature extractor, the SVM classifiers, and the bounding box regressors, which can introduce additional complexity and potential inefficiencies.
If R-CNN is a basic detective, its successor Fast R-CNN would be an all-human detective team. The team's informant (Selective Search) provides a list of potential suspects (region proposals) based on their appearance and behavior. Next, the team sends all evidence to a forensic expert (RoI pooling) who extracts key evidence (features) from each potential suspect. Finally, the team's analysts scan all the evidence to determine each suspect's guilt (object class) and how to apprehend them (refine the bounding box).
Proposed 2 years after R-CNN, this first faster variant introduced several significant advancements over the traditional R-CNN in terms of:
Speed: up to 100 times faster than R-CNN at inference. Achieved by sharing convolutional features across region proposals, eliminating the need for disk storage of feature maps, and using efficient RoI (region of interest) pooling for feature extraction (see the sketch after this list).
Improved Accuracy: from fine-tuning the CNN for detection and RoI pooling for refined feature extraction.
Simplified Training: by applying single-stage training instead of multi-stage, as well as end-to-end training of the entire network.
Reduced Overfitting: by sharing features across region proposals and using dropout and regularization techniques.
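The RoI pooling mentioned above ships directly in torchvision, which makes the idea easy to demonstrate: every proposal, whatever its size, is pooled from a shared feature map into a fixed-size grid. The tensor shapes and the 8x downsampling factor below are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_pool

# One feature map shared by all proposals: batch of 1, 256 channels,
# spatial size 50x50 (e.g., a 400x400 image downsampled 8x).
feature_map = torch.randn(1, 256, 50, 50)

# Proposals in image coordinates: (batch_index, x1, y1, x2, y2).
rois = torch.tensor([
    [0,  40.0,  40.0, 200.0, 160.0],
    [0, 120.0,  80.0, 360.0, 320.0],
])

# Every RoI, whatever its size, is pooled to a fixed 7x7 grid.
# spatial_scale maps image coordinates onto the downsampled feature map.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 8)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```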
But the improvements brought forth by Fast R-CNN also come with a new list of shortcomings:
Sensitivity to Background Clutter: decreased classification performance in cluttered scenes, creating false positives and negatives. Caused by similarity between object and background features, insufficient contextual information, and/or limited robustness to occlusions.
Noisy Proposals: decreased classification performance due to low-quality proposals. Caused by Selective Search algorithm limitations and/or insufficient proposal refinement.
Scale Variance: struggles with objects of varying scales. Small objects tend to get lost in feature maps or be misclassified, while large objects are hard to localize precisely. Caused by fixed-scale feature extraction and/or insufficient contextual information.
Region Proposal Reliance: model performance depends on computationally expensive region proposal techniques, which can be bottlenecks in processing speed and a source of errors (noisy proposals).
Class Imbalance: model becomes biased, favoring background classification, resulting in reduced object detection accuracy. Caused by the presence of too many negative samples.
Soon came Faster R-CNN. Continuing with our analogy series, imagine a detective team with a canine partner. Compared to an informant (Selective Search), the canine (RPN) can sniff out potential suspects more quickly using the same criteria. After that, the forensic expert (RoI pooling) extracts key evidence from the suspects, and the team's analysts analyze said evidence to determine each suspect's guilt (object class) and how to apprehend them (refine the bounding box).
An even quicker variant of R-CNN, developed in the same year as its predecessor, it enhanced object detection in terms of:
Improved Speed: the newly introduced RPN generates proposals in a single pass over shared features, reducing computation time – about 10 to 100 times faster than Fast R-CNN.
Enhanced Accuracy: the RPN and anchor boxes handle varying scales and improve region proposal quality and feature extraction, without being limited by Selective Search or fixed-scale features.
Efficient Region Proposal Generation: RPN shares convolutional features with Fast R-CNN, eliminating redundant computations to generate proposals in a single pass.
Simplified Architecture: RPN integrates proposal generation and feature extraction, resulting in a streamlined architecture with fewer components than Fast R-CNN, which has separate proposal and detection networks.
Better Handling of Small Objects: anchor boxes of varying scales and aspect ratios serve as reference locations at which the RPN searches for objects, improving feature extraction and proposal quality for small objects (see the anchor sketch after this list).
Reduced Hyperparameter Tuning: fewer hyperparameters to adjust, as the model's RPN and shared convolutional features reduce hyperparameter dependence.
Improved Robustness: the RPN's learned features and anchor boxes reduce sensitivity to noise and clutter; the RPN starts from anchor boxes and, using learned features, refines them into high-quality proposal boundaries.
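To ground the anchor box idea, here is a small sketch that generates the anchors tiled at a single feature-map location, using the 3 scales x 3 aspect ratios from the Faster R-CNN paper's defaults; a real RPN shifts these anchors to every feature-map position:

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the anchor boxes tiled at one feature-map location.

    Sizes and ratios follow the Faster R-CNN paper's defaults
    (3 scales x 3 aspect ratios = 9 anchors per location).
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the anchor area fixed while varying the aspect ratio.
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)  # centered at (0, 0); shifted per location

print(make_anchors().shape)  # (9, 4)
```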
Coincidentally, 2 years after Faster R-CNN's proposal, Mask R-CNN was released. Imagine the Faster R-CNN detective team investigating a complex crime scene with many suspects. The jobs done by their canine partner (RPN), forensic expert (RoIAlign), and analysts are the same. However, a sketch artist (mask generation) joins the team and draws detailed masks (instance masks) to identify suspects' precise movements and associations.
Used for both object detection and instance segmentation tasks in computer vision, Mask R-CNN is actually an extension of Faster R-CNN, which was primarily designed for object detection. The extension contains the following components:
Instance Segmentation: the core capability of the extension. It can not only detect objects in an image but also accurately segment each individual instance of an object, creating a pixel-level mask around it.
Object Detection: as an extension of Faster R-CNN, Mask R-CNN inherits its ability to detect objects in images, providing bounding boxes and class labels for each detected object.
Feature Pyramid Network (FPN): incorporates an FPN, which combines feature maps at multiple resolutions into a feature pyramid, allowing the model to effectively detect objects at different scales and improving performance on objects of varying sizes.
RoIAlign: to address the issue of misalignment between features and proposed RoIs, the extension uses RoIAlign instead of RoI pooling, as the former preserves spatial information more accurately, leading to better segmentation results.
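Since all of these components ship pre-assembled in torchvision, a minimal sketch of running a pre-trained Mask R-CNN looks like this (the random image is just a stand-in for a real tensor with values in [0, 1]):

```python
import torch
import torchvision

# Pre-trained Mask R-CNN with a ResNet-50 + FPN backbone from torchvision.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy 3-channel image; replace with a real image tensor in [0, 1].
image = torch.rand(3, 480, 640)

with torch.no_grad():
    output = model([image])[0]

# Each detection comes with a box, label, score, and a pixel-level mask.
print(output["boxes"].shape)   # (N, 4) bounding boxes
print(output["labels"].shape)  # (N,) class labels
print(output["masks"].shape)   # (N, 1, 480, 640) instance masks
```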
As a direct descendant of Faster R-CNN, Mask R-CNN does some things better than its predecessor:
Instance Segmentation: Mask R-CNN is capable of instance segmentation, which means it can not only detect objects in an image but also accurately segment each individual instance of an object. Faster R-CNN only provides bounding boxes.
Improved Segmentation Quality: Mask R-CNN's RoIAlign layer helps to preserve spatial information more accurately, leading to better segmentation results compared to Faster R-CNN's RoI pooling.
Single-Stage Training: Mask R-CNN is trained in a single stage, which can be more efficient than the two-stage training process of Faster R-CNN.
Unified Framework: Mask R-CNN provides a unified framework for both object detection and instance segmentation, making it a more versatile tool.
First introduced by Joseph Redmon et al. in 2015, YOLOv1 (You Only Look Once) is a single-stage object detection algorithm that directly predicts the bounding boxes and class probabilities for objects in an image without relying on region proposal techniques. It is composed of the following components:
Input Image: the YOLOv1 network takes a single image as input.
Feature Extraction: image is processed through a convolutional neural network (CNN) to extract features.
Grid: feature map is divided into a grid of cells.
Bounding Box Predictors: each cell directly predicts a small, fixed set of bounding boxes of different sizes and aspect ratios (YOLOv1 has no predefined anchor boxes; those arrived in YOLOv2).
Predictions: the network predicts bounding box coordinates, a confidence score for each predicted box, and class probabilities for each cell (see the decoding sketch after this list).
Non-Maximum Suppression (NMS): after predictions are made, NMS is applied to filter out redundant detections and keep the most confident ones.
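The sketch below decodes one grid cell of a hypothetical YOLOv1 output tensor, using the paper's defaults of S=7, B=2, and C=20; the random tensor stands in for real network output:

```python
import torch

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1 defaults)

# Hypothetical raw network output, one (B*5 + C)-vector per grid cell:
# for each box: (x, y, w, h, confidence), then C class probabilities.
pred = torch.rand(S, S, B * 5 + C)

def decode_cell(pred, row, col):
    """Decode the highest-confidence box of one grid cell.

    x, y are offsets within the cell; w, h are relative to the image.
    """
    cell = pred[row, col]
    boxes = cell[: B * 5].reshape(B, 5)
    best = boxes[boxes[:, 4].argmax()]        # box with highest confidence
    x = (col + best[0]) / S                   # cell offset -> image coords
    y = (row + best[1]) / S
    w, h = best[2], best[3]
    class_probs = cell[B * 5 :]               # shared across the cell's boxes
    score = best[4] * class_probs.max()       # confidence x class probability
    return (x, y, w, h), int(class_probs.argmax()), float(score)

print(decode_cell(pred, 3, 3))
```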
The grid approach by YOLOv1 offers a series of significant advantages:
Single-Stage Detection: being a single-stage detector, it performs object detection and classification in a single pass through the network. Makes it significantly faster than two-stage methods like R-CNN, which require a separate region proposal stage.
Global Context: allows YOLOv1 to consider the entire image during object detection. This helps the model better understand the context of objects and their relationships to each other.
Fixed Number of Predictions: predicts a fixed number of bounding boxes for each grid cell. This simplifies the training process and makes inference more efficient.
Unified Framework: YOLOv1 provides a unified framework for both object detection and classification, which can be beneficial for certain applications.
That said, if YOLOv1 were flawless, it would not have been refined further. Here are its shortcomings compared to older object detection algorithms:
Localization Accuracy: struggles with localizing small objects or objects that are close together. This results from the fixed grid size and the fact that each grid cell can only predict a limited number of bounding boxes.
Background Errors: The model may sometimes predict background regions as objects, leading to false positives.
Difficulty with Occlusions: can have difficulty dealing with occluded objects, as the grid-based approach may not capture the full context of an object.
Single Shot MultiBox Detector (SSD) is a single-stage object detection algorithm that directly predicts bounding boxes and class probabilities for objects in an image. It is similar to YOLOv1 in that it avoids the two-stage process of region proposal and classification, making it faster. The key differences between the two include:
Multiple Feature Maps: SSD uses multiple feature maps at different scales to detect objects of varying sizes, in contrast to YOLOv1, which only uses the final feature map.
Default Boxes: SSD uses default boxes (similar to the anchor boxes in Faster R-CNN) at each feature map level to predict bounding boxes. The number and aspect ratios of default boxes vary across feature map levels (see the scale sketch after this list).
Convolutional Prediction: SSD directly predicts bounding box offsets and class probabilities using convolutional layers applied to feature maps.
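To illustrate how default box scales vary across feature map levels, here is the scale formula from the SSD paper, s_k = s_min + (s_max - s_min)(k - 1)/(m - 1), computed with the paper's defaults; the aspect ratio set shown is a representative subset:

```python
import math

# Default box scale per feature map level, following the SSD paper:
# s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), for k = 1..m
s_min, s_max, m = 0.2, 0.9, 6  # paper defaults; m = number of feature maps

scales = [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]

# Each scale combines with several aspect ratios to form default boxes:
# width = s * sqrt(ar), height = s / sqrt(ar), all relative to image size.
aspect_ratios = (1.0, 2.0, 0.5)
boxes_level0 = [(scales[0] * math.sqrt(ar), scales[0] / math.sqrt(ar))
                for ar in aspect_ratios]
print(boxes_level0)
```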
SSD addresses the problem of multi-scale object detection by utilizing multiple feature maps at different scales, making it more sensitive to small objects than methods that rely on a single feature map. Despite this, SSD also faces some challenges:
Class Imbalance: the number of positive and negative samples can be highly imbalanced, especially in datasets with low object density. This can lead to the model focusing too much on easy-to-classify background regions.
Online Hard Example Mining (OHEM): while OHEM can help address class imbalance, it can also discard easy-to-classify samples entirely, which can degrade performance.
Shallow Feature Maps for Small Objects: although SSD uses multiple feature maps for multi-scale detection, the shallow feature maps used for detecting small objects may not contain rich enough information. This can limit its sensitivity to small objects compared to methods (e.g., Faster R-CNN) that use deeper feature maps.
Just as Mask R-CNN is an extension of Faster R-CNN, RetinaNet is the same to SSD. It is a single-stage object detection algorithm that builds upon SSD to enhance its ability to detect objects of different sizes and to address the class imbalance problem. Here are its key components and what each contributes:
ResNet: the backbone feature extraction network; RetinaNet typically uses a ResNet architecture to extract features from the input image.
Feature Pyramid Network (FPN): combines feature maps from different levels of ResNet to create a feature pyramid. Allows RetinaNet to detect objects of varying sizes more effectively.
Class Subnet: responsible for predicting class probabilities for each anchor box. Typically consists of a series of convolutional layers and a final classification layer.
Box Subnet: responsible for predicting bounding box coordinates for each anchor box. Typically consists of a series of convolutional layers and a regression layer.
Focal loss is a modified cross-entropy loss function used in RetinaNet to address the class imbalance problem. It accomplishes this by introducing a modulating factor that focuses training on hard examples and downweights loss for easy-to-classify samples, making the model focus more on the difficult samples that contribute more to the overall performance.
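A minimal PyTorch sketch of binary focal loss, using alpha = 0.25 and gamma = 2.0 as in the RetinaNet paper, makes the modulating factor explicit; setting gamma to 0 recovers ordinary (alpha-weighted) cross-entropy:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    alpha=0.25 and gamma=2.0 are the defaults from the RetinaNet paper.
    logits, targets: same shape; targets are 0/1 floats.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)       # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma downweights easy, well-classified examples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(8, 10), torch.randint(0, 2, (8, 10)).float())
```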