Just recovered from a migraine. Blogging has been notably delayed.
YOLO (You Only Look Once) revolutionized the field of object detection with its innovative approach and impressive performance. Since its inception, YOLO has undergone several iterations, each building upon the strengths of its predecessors.
YOLO loss is a custom loss function designed specifically for its namesake algorithm. It combines several loss terms to optimize YOLO's performance in terms of both classification and localization.
The YOLO loss function is designed to penalize errors in both the localization of the bounding box and the classification of the object. The center coordinates of the bounding box are used to calculate the localization loss. Here is how they contribute to the task:
Localization loss measures the difference between the predicted center coordinates (^b_xi, ^b_yi) and the true center coordinates (b_xi, b_yi) of the bounding box.
A larger difference in center coordinates indicates a greater localization error.
Localization loss is typically calculated using mean squared error (MSE) or a similar metric.
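To make this concrete, here is a minimal numpy sketch of the center-coordinate term as described above: a sum of squared differences counted only in grid cells that contain an object. The array shapes and the obj_mask variable are illustrative assumptions, not YOLO's actual implementation.

```python
import numpy as np

def center_localization_loss(pred_xy, true_xy, obj_mask):
    """Sum of squared errors between predicted and true box centers.

    pred_xy, true_xy: (S, S, 2) arrays holding (x, y) per grid cell.
    obj_mask: (S, S) array, 1 where a cell contains an object, 0 elsewhere.
    """
    sq_err = np.sum((pred_xy - true_xy) ** 2, axis=-1)  # per-cell squared error
    return np.sum(obj_mask * sq_err)                    # count only cells with objects
```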
In addition to center coordinates, YOLO loss function includes terms to measure the difference between predicted and true bounding box dimensions. This helps to ensure that the model not only accurately predicts the object's location but also its size and shape. Here is how they contribute to calculating YOLO loss:
Localization loss also includes terms to measure the difference between the predicted (^b_wi, ^b_hi) and true (b_wi, b_hi) width and height of the bounding box. A larger difference in dimensions indicates a greater localization error.
Objectness loss may also consider the size and shape of the bounding box when determining whether a grid cell contains an object. For example, a larger bounding box might be more likely to contain an object.
The objectness score is a predicted value that indicates the likelihood of a grid cell containing an object. The objectness loss is calculated as the squared difference between the predicted and true objectness scores.
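Here is a similarly simplified sketch of the dimension and objectness terms, again with assumed shapes and names. (The original YOLOv1 paper actually regresses the square roots of width and height so that errors on small boxes count relatively more; the sketch keeps the plain squared error described above and only notes the difference.)

```python
import numpy as np

def dimension_and_objectness_loss(pred_wh, true_wh, pred_obj, true_obj, obj_mask):
    """Squared-error terms for box size and objectness.

    pred_wh, true_wh: (S, S, 2) arrays of box width and height per cell.
    pred_obj, true_obj: (S, S) arrays of objectness scores (true values are 0 or 1).
    obj_mask: (S, S) array, 1 where a cell contains an object.
    """
    # Size term, counted only for cells that contain an object.
    # (YOLOv1 itself uses sqrt(w) and sqrt(h) to soften the penalty on large boxes.)
    size_err = np.sum((pred_wh - true_wh) ** 2, axis=-1)
    size_loss = np.sum(obj_mask * size_err)

    # Objectness term: squared difference between predicted and true scores.
    obj_loss = np.sum((pred_obj - true_obj) ** 2)

    return size_loss + obj_loss
```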
Another component of the YOLO loss, the classification loss, is designed to penalize the model for incorrectly predicting the class of an object. This loss is only calculated for grid cells that contain objects; for each such cell, the model predicts the class of the object it contains.
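A minimal sketch of that masking behavior, with the same assumed shapes as before: class probabilities are compared only where the object mask is 1.

```python
import numpy as np

def classification_loss(pred_cls, true_cls, obj_mask):
    """Squared-error classification loss, restricted to cells containing objects.

    pred_cls, true_cls: (S, S, C) arrays of class probabilities (true is one-hot).
    obj_mask: (S, S) array, 1 where a cell contains an object.
    """
    cls_err = np.sum((pred_cls - true_cls) ** 2, axis=-1)  # per-cell class error
    return np.sum(obj_mask * cls_err)                      # ignore empty cells
```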
This exercise explores how we can prepare training data for object detection models, including how computers detect objects' locations in images.
Object detection models learn from ground-truth bounding boxes, much as other supervised networks learn from labeled training data. The boxes are manually annotated around objects to show the model what each object looks like, with class labels indicating what it is.
The YOLO tensor shape defines the output structure of the object detection model, specifically the number of grid cells and the number of classes that can be detected. Whereas classes refer to predefined categories of objects (e.g., dog), objects refer to specific instances of those categories (e.g., dog 2).
In the context of this model, the total number of classes is set to 20, well beyond the 3 classes actually used here, to allow for more flexibility in the model's predictions. If we used only the default number of classes, the model would be less generalizable to new, unseen objects and scenarios.
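For reference, YOLOv1's output tensor has shape S × S × (B·5 + C): an S×S grid, B boxes per cell (each contributing x, y, w, h and an objectness score), and C class probabilities per cell. A quick sanity check with S = 7, B = 2, and the 20 classes mentioned above:

```python
S, B, C = 7, 2, 20        # grid size, boxes per cell, number of classes
depth = B * 5 + C         # each box carries x, y, w, h, objectness
print((S, S, depth))      # -> (7, 7, 30), YOLOv1's output tensor shape
```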
The model reads these annotations as a list of arrays of 0s and 1s. In this one, array values of 1 indicate that a bounding box's center falls within that grid cell.
The array data above should match the plotted output.
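As a concrete illustration of that encoding, the sketch below builds an S×S grid of 0s and 1s from a list of box centers given in normalized [0, 1) image coordinates; the grid size and variable names are assumptions for this example.

```python
import numpy as np

def build_center_grid(box_centers, S=7):
    """Mark which grid cells contain a bounding-box center.

    box_centers: list of (x, y) pairs, normalized to [0, 1).
    Returns an (S, S) array of 0s and 1s.
    """
    grid = np.zeros((S, S))
    for x, y in box_centers:
        col = int(x * S)   # which column of the grid the center falls into
        row = int(y * S)   # which row
        grid[row, col] = 1.0
    return grid

print(build_center_grid([(0.52, 0.48), (0.10, 0.90)]))
```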
We briefly covered the strengths and shortcomings of YOLOv1 on 10th October. What do the later variants of YOLO have over their predecessor?
YOLOv2 incorporated several enhancements to improve accuracy and speed, including:
Batch Normalization: uses batch normalization to stabilize training and improve generalization.
High-Resolution Classifier: trains a high-resolution classifier (448*448) before fine-tuning the detector, leading to improved performance.
Multi-Scale Training: uses multi-scale training, where the network is trained on images of different sizes to make it more robust to objects of varying scales (a toy sketch of the resizing schedule follows this list).
Weighted Bounding Box Loss: uses a weighted bounding box loss that penalizes errors in predicting the center coordinates more heavily than errors in predicting the width and height.
New Network Architecture: introduces a new network architecture, Darknet-19, which is faster and more accurate than the network used in YOLOv1.
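To illustrate the multi-scale point above, the YOLOv2 paper re-samples the input resolution every ten batches from multiples of 32 (the network's downsampling factor) between 320 and 608. A toy sketch of that schedule, with the actual image resizing left out:

```python
import random

SCALES = list(range(320, 609, 32))    # 320, 352, ..., 608: all multiples of 32

def training_sizes(num_batches, every=10):
    """Yield an input resolution for each batch, re-sampled every `every` batches."""
    size = 416                         # a common starting resolution
    for batch_idx in range(num_batches):
        if batch_idx > 0 and batch_idx % every == 0:
            size = random.choice(SCALES)
        yield size

print(list(training_sizes(30)))        # the size changes at batches 10 and 20
```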
The table below compares the performance of models from Fast R-CNN to YOLOv2 on the PASCAL VOC 2007 + 2012 dataset. Although YOLOv2 with input size 544*544 ran slower than its predecessor YOLOv1 by 5 frames per second (FPS), it had the highest mean average precision (mAP) of 78.6 – a substantial improvement from 63.4.
YOLO9000 builds on YOLOv2, extending its ability to detect a much larger number of object categories. Comparing the two variants:
Number of Object Categories: YOLOv2 is trained on the COCO dataset with 80 object categories, whereas YOLO9000 is trained on a combination of COCO and ImageNet, encompassing over 9000 object categories.
Hierarchical Classification: unlike YOLOv2, YOLO9000 employs a hierarchical classification strategy called WordTree to handle the vast number of categories (sketched after this list).
Fine-Grained Classification: YOLOv2 can perform basic object detection, but YOLO9000 is capable of fine-grained classification, distinguishing between different breeds of dogs or types of cars.
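The WordTree idea can be sketched as follows: each node's absolute probability is the product of conditional probabilities along the path from that node up to the root, so predicting 'Norfolk terrier' also commits to 'terrier' and 'dog'. The tiny tree and the probability values below are made up purely for illustration.

```python
# A made-up fragment of a WordTree: child -> parent.
PARENT = {"norfolk terrier": "terrier", "terrier": "dog", "dog": "animal", "animal": None}

# Made-up conditional probabilities P(node | parent) as a network might predict them.
COND_PROB = {"norfolk terrier": 0.6, "terrier": 0.8, "dog": 0.9, "animal": 0.95}

def absolute_probability(node):
    """Multiply conditional probabilities along the path from the node to the root."""
    prob = 1.0
    while node is not None:
        prob *= COND_PROB[node]
        node = PARENT[node]
    return prob

print(absolute_probability("norfolk terrier"))  # 0.6 * 0.8 * 0.9 * 0.95 ≈ 0.41
```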
Returning to the 'mainstream' YOLO line, YOLOv3 improves upon YOLO9000 in the following ways:
Multi-Scale Predictions: introduces a multi-scale prediction strategy, where predictions are made at three different feature map resolutions. Allows model to detect objects of varying sizes more effectively.
Improved Backbone: uses a Darknet-53 backbone, which is deeper and more powerful than Darknet-19 in YOLO9000. Improves accuracy.
Dimension Clustering: uses dimension clustering to select a smaller set of anchor boxes. Improves efficiency and accuracy (see the sketch after this list).
Convolutional Layer Modifications: introduces modifications to convolutional layers, such as using residual blocks and adding more layers, to enhance feature extraction.
Class-Specific Predictions: predicts class probabilities for each anchor box, allowing for more accurate classification.
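Dimension clustering, mentioned above, runs k-means over the training set's box widths and heights, using 1 − IoU rather than Euclidean distance so the chosen anchors favor good overlap. A stripped-down sketch, assuming boxes arrive as (width, height) pairs:

```python
import numpy as np

def iou_wh(box, centroid):
    """IoU of two boxes assumed to share the same center, each given as (w, h)."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=20, seed=0):
    """Cluster (w, h) pairs with a 1 - IoU distance to pick k anchor boxes."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign every box to the centroid it overlaps with most (smallest 1 - IoU).
        dists = np.array([[1 - iou_wh(b, c) for c in centroids] for b in boxes])
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean width/height of its assigned boxes.
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = boxes[assign == i].mean(axis=0)
    return centroids

anchors = kmeans_anchors(np.random.rand(100, 2))   # toy data just to show usage
```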
While YOLOv3 is not solely focused on achieving the highest possible speed, it does prioritize real-time performance as a key design goal. This means that YOLOv3 aims to maintain a balance between accuracy and speed, ensuring that it can be used in real-time applications while still achieving competitive accuracy.
Tiny YOLOv3 is a smaller and faster model that is designed for applications where speed is more important than absolute accuracy. It is particularly well-suited for real-time applications on embedded devices or mobile platforms.
In more technical detail, Tiny YOLOv3 utilizes:
The older Darknet-19 (Darknet-53 in YOLOv3).
Three residual blocks in the 256-channel layer (eight in YOLOv3), one in the 512-channel layer (four in YOLOv3).
One upsampling layer (two in YOLOv3).
Two detection layers across two scales (three layers across three scales in YOLOv3).
Fewer convolutional layers for feature extraction than YOLOv3.