"Being a square keeps you from going around in circles."
– John Vernon McGee
In the realm of object detection, where algorithms strive to accurately identify and locate objects within images, the efficacy of a model can be evaluated based on the quality of its generated bounding boxes. Metrics such as Intersection over Union (IoU) and loss functions play a pivotal role in assessing this performance.
IoU (Intersection over Union) is a metric used to measure the overlap between two bounding boxes. It is commonly used in object detection tasks to evaluate the accuracy of predicted bounding boxes against the ground truth. The formula is IoU = intersection_area / union_area.
The values of IoU can be interpreted as follows:
IoU = 1: perfect overlap between the two bounding boxes.
IoU = 0: no overlap between the two bounding boxes.
IoU closer to 1: better overlap and higher localization accuracy.
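For example, take two 4×4 boxes whose top-left corners are offset by (2, 2): the intersection is a 2×2 region with area 4, the union area is 16 + 16 - 4 = 28, and the IoU is 4 / 28 ≈ 0.14, indicating a fairly poor match.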
During training, IoU is calculated between predicted and ground truth bounding boxes. This helps measure the model's performance and guide the training process.
At inference time, IoU is used to filter out redundant detections. Non-Maximum Suppression (NMS) is a common technique that uses IoU to keep only the most confident bounding box for each object, reducing the number of overlapping predictions.
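As a preview of how NMS leans on IoU, here is a minimal sketch in plain Python; the [x1, y1, x2, y2] box format, the compute_iou helper, and the 0.5 threshold are illustrative assumptions rather than part of any particular library.

```python
def compute_iou(box_a, box_b):
    # Boxes are [x1, y1, x2, y2]; the intersection corners come from max/min of the inputs.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop remaining boxes that overlap it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if compute_iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two near-duplicate detections of one object plus one separate detection: only two survive.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```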
This exercise shows how we can write programs that calculate IoU and the quantities it depends on. Two premade bounding boxes are used, since building a full object detection algorithm would be costly work.
Matplotlib's patches.Rectangle takes the (x, y) coordinates of one corner plus a width and height to draw a rectangular frame. In this exercise's context, the opposite corner is obtained by adding the width and height values to the x- and y-coordinates of the top-left corner.
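A minimal sketch of that drawing step, assuming two example boxes given as (x, y, width, height) with a top-left image-style origin; the coordinate values are made up for illustration:

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Example boxes as (x_top_left, y_top_left, width, height); the values are illustrative.
ground_truth = (2, 2, 5, 4)
prediction = (4, 3, 5, 4)

fig, ax = plt.subplots()
for (x, y, w, h), color in [(ground_truth, "green"), (prediction, "red")]:
    # Rectangle takes one corner plus width and height, then draws the frame.
    ax.add_patch(patches.Rectangle((x, y), w, h, linewidth=2, edgecolor=color, fill=False))

ax.set_xlim(0, 12)
ax.set_ylim(12, 0)  # invert the y-axis so the origin sits at the top-left, as in image coordinates
plt.show()
```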
We humans can see the overlap between bounding boxes clearly, but how can we let our machines detect that overlap? Can we also quantify how much they overlap in total?
When calculating the area of overlap between bounding boxes, using max(0, …) ensures that the intersection width and height are never negative. If the intersection rectangle is empty (i.e., the boxes do not overlap), the raw width and height come out negative, and max(0, …) clamps them to 0.
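A minimal sketch of that calculation, assuming boxes in the same (x, y, width, height) format as the drawing sketch above; the example values are made up:

```python
def iou(box_a, box_b):
    """IoU for boxes given as (x_top_left, y_top_left, width, height)."""
    # Convert to corner coordinates: the bottom-right corner is top-left + (width, height).
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]

    # Intersection rectangle; max(0, ...) clamps negative widths/heights to zero
    # when the boxes do not overlap.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union

print(iou((2, 2, 5, 4), (4, 3, 5, 4)))  # partial overlap, roughly 0.29
```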
To repeat myself, IoU is a metric for evaluating the accuracy of object detection and tracking algorithms.
Bounding box loss is a metric used to quantify the difference between predicted and ground truth bounding boxes in object detection tasks. Whereas IoU measures the overlap between the boxes and is useful for evaluation and comparison, bounding box loss measures the coordinate difference and is used to guide the training process. They are also interpreted in opposite directions: a higher IoU and a lower loss are preferred.
Common bounding box loss functions include:
L1 Loss (Mean Absolute Error): calculates the element-wise absolute difference between predicted and ground truth bounding box coordinates.
L2 Loss (Mean Squared Error): calculates the squared difference between predicted and ground truth bounding box coordinates.
Smooth L1 Loss: combines the advantages of L1 and L2 losses, behaving like L2 loss for small errors and like L1 loss for large errors.
IoU Loss: directly based on the Intersection over Union (IoU) between predicted and ground truth bounding boxes, typically computed as 1 - IoU (see the sketch after this list).
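A minimal NumPy sketch of these four losses, assuming boxes and offsets as 4-element arrays, boxes in [x1, y1, x2, y2] format, and a beta of 1.0 for Smooth L1; the example values are made up:

```python
import numpy as np

def l1_loss(pred, target):
    # Mean absolute error over the 4 box coordinates.
    return np.mean(np.abs(pred - target))

def l2_loss(pred, target):
    # Mean squared error over the 4 box coordinates.
    return np.mean((pred - target) ** 2)

def smooth_l1_loss(pred, target, beta=1.0):
    # Quadratic for small errors (|diff| < beta), linear for large errors.
    diff = np.abs(pred - target)
    return np.mean(np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta))

def iou_loss(pred, target):
    # 1 - IoU for boxes given as [x1, y1, x2, y2].
    ix1, iy1 = np.maximum(pred[:2], target[:2])
    ix2, iy2 = np.minimum(pred[2:], target[2:])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (target[2] - target[0]) * (target[3] - target[1]) - inter)
    return 1 - inter / union

pred = np.array([2.0, 2.0, 7.0, 6.0])
target = np.array([4.0, 3.0, 9.0, 7.0])
print(l1_loss(pred, target), l2_loss(pred, target),
      smooth_l1_loss(pred, target), iou_loss(pred, target))
```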
This exercise explores building a calculator function for bounding box loss in Python, including a brief look at the underlying formulas.
We will borrow the last exercise's hyperparameters for generating our example ground truth and prediction bounding boxes.
Positive or negative offsets for the top-left corner coordinates typically indicate whether the region proposal's corner sits to the left or right of, or above or below, the ground truth corner, depending on the sign convention used.
To test our bounding box loss function, we can create a fake quartet of offsets. The formula below is MSE, or L2 loss. Different loss functions produce different results, each with its own traits and trade-offs in accuracy.
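A minimal sketch of that test, using a hypothetical quartet of corner offsets; the values are made up:

```python
import numpy as np

# Fake offsets between predicted and ground truth box coordinates (illustrative values).
offsets = np.array([1.5, -2.0, 0.5, 3.0])

# MSE / L2 loss: mean of the squared offsets.
mse_loss = np.mean(offsets ** 2)
print(mse_loss)  # (1.5**2 + 2.0**2 + 0.5**2 + 3.0**2) / 4 = 3.875
```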
After a backbone network extracts features from an input image, the region proposal network (RPN) slides a window over the feature map, then predicts objectness scores and bounding box coordinates for each anchor box. Without an RPN, the Faster R-CNN model would essentially reduce to Fast R-CNN, relying on a slower external proposal step that makes it more expensive and typically less accurate.
Imagine a team of detectives investigating a crime scene. Without a trained canine partner (the RPN), they would have to painstakingly examine every inch of the scene (the image), looking for clues and potential suspects (features).
The architecture of the RPN is primarily composed of two branches:
Classification Branch: predicts an objectness score for each anchor box, indicating whether it is likely to contain an object (foreground) or not (background).
Regression Branch: predicts bounding box coordinates [x, y, w, h] for each anchor box.
We can use the figure below as a narrative base to describe what each component of a simple RPN does to generate region proposals:
The RPN receives a 3×3 feature map as input from a backbone network.
The feature map is processed by a 1×1 convolutional layer at the start of each branch (classification and regression), reducing the map's channel dimensionality.
In the classification branch:
The output of the convolutional layer is reshaped to a 1×18 tensor.
The reshaped output is passed through a softmax activation function to obtain foreground/background probabilities.
The output is then reshaped again.
In the regression branch:
The output of the convolutional layer is reshaped to a 1×36 tensor.
Region proposals are generated based on predicted objectness scores and bounding box coordinates.
The final output includes the predicted bounding box coordinates and, potentially, other information such as objectness scores.
Occasionally named but never explained in the last section, anchor boxes are predefined bounding boxes used in the RPN of object detection models like Faster R-CNN. They serve as a starting point for the RPN to generate region proposals, which are potential locations of objects in an image.
Did we mention that predicted bounding boxes are simply refined anchor boxes?
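To make that concrete, here is a minimal sketch of the box refinement parametrization used in Faster R-CNN, where regression outputs (tx, ty, tw, th) shift and scale an anchor into a predicted box; the anchor and delta values are made up:

```python
import numpy as np

def refine_anchor(anchor, deltas):
    """Apply predicted deltas (tx, ty, tw, th) to an anchor given as (cx, cy, w, h)."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = deltas
    # Faster R-CNN parametrization: shift the center by fractions of the anchor size,
    # and scale the width/height exponentially.
    new_cx = cx + tx * w
    new_cy = cy + ty * h
    new_w = w * np.exp(tw)
    new_h = h * np.exp(th)
    return new_cx, new_cy, new_w, new_h

# A 128x128 anchor nudged right and down and slightly reshaped (illustrative values).
print(refine_anchor((64, 64, 128, 128), (0.1, 0.2, 0.05, -0.1)))
```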
Faster R-CNN originally used 9 anchor boxes per feature map location. Their characteristics included (see the sketch after this list):
Scales and Aspect Ratios: scales are typically divided into small, medium, and large categories, while aspect ratios cover 1:1, 2:1, and 1:2. This combination helps the model cover a variety of object shapes and sizes.
Center Points: anchors are placed at the center of each grid cell in the feature map.
Scaling: anchors are resized to match the size of the corresponding region in the original image, based on the stride of the feature map.
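A minimal sketch of generating those 9 anchors per location from 3 scales and 3 aspect ratios; the specific scale values and the stride of 16 are illustrative assumptions:

```python
import numpy as np

def generate_anchors(feature_size, stride=16, scales=(64, 128, 256), ratios=(1.0, 2.0, 0.5)):
    """Return anchors as (cx, cy, w, h) for each cell of a feature_size x feature_size map."""
    anchors = []
    for row in range(feature_size):
        for col in range(feature_size):
            # Center of the grid cell, mapped back to image coordinates via the stride.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep the area near scale**2 while adjusting the width:height ratio.
                    w = scale * np.sqrt(ratio)
                    h = scale / np.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

anchors = generate_anchors(feature_size=3)
print(anchors.shape)  # (81, 4): 3 x 3 locations with 9 anchors each
```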
This exercise explores how we can build a basic RPN for a CNN output with specific hyperparameters.
The CNN for this exercise will output a 64×64×512 feature map.
Each RPN branch has a number of output channels equal to the number of anchor boxes multiplied by the number of values it predicts per anchor. The classification branch predicts a probability distribution over the two classes, foreground and background; the regression branch predicts the four regression values x, y, w, and h.
If we follow the RPN template from before, with 9 anchor boxes per location, the output depths of the basic RPN should be 9 × 2 = 18 for the classification branch and 9 × 4 = 36 for the regression branch.
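A minimal PyTorch sketch of such an RPN head under those hyperparameters; the class and layer names are illustrative, and the softmax step assumes the first 9 channels hold background scores and the last 9 hold foreground scores:

```python
import torch
import torch.nn as nn

class BasicRPN(nn.Module):
    """Two-branch RPN head following the template above, with 9 anchors per location."""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        # Classification branch: 2 scores (foreground/background) per anchor -> 18 channels.
        self.cls_conv = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)
        # Regression branch: 4 values (x, y, w, h) per anchor -> 36 channels.
        self.reg_conv = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, feature_map):
        cls_scores = self.cls_conv(feature_map)  # (N, 18, 64, 64)
        reg_values = self.reg_conv(feature_map)  # (N, 36, 64, 64)
        # Reshape-softmax-reshape, mirroring the classification branch steps above.
        # Assumes channels are laid out as [9 background scores, 9 foreground scores].
        n, _, h, w = cls_scores.shape
        cls_probs = torch.softmax(cls_scores.view(n, 2, -1), dim=1).view(n, -1, h, w)
        return cls_probs, reg_values

rpn = BasicRPN()
features = torch.randn(1, 512, 64, 64)  # the 64x64x512 feature map from the backbone CNN
cls_probs, reg_values = rpn(features)
print(cls_probs.shape, reg_values.shape)  # (1, 18, 64, 64) and (1, 36, 64, 64)
```

The original Faster R-CNN design also inserts a shared 3×3 convolution before the two 1×1 branch convolutions; it is omitted here to stay close to the simpler template above.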