The less we use something, the more likely we are to forget about it.
While MobileNetV1 marked a significant advancement in lightweight neural networks, it was not without its limitations. Despite its efficiency, MobileNetV1 often struggled to achieve the same level of accuracy as larger, more complex models. This limitation hindered its ability to tackle certain demanding tasks, such as object detection in challenging environments. This would lead to the development of MobileNetV2.
At the end of the day, MobileNet is still a neural network designed to be portable by trading away some performance and complexity. The first version in particular had a list of moderate shortcomings:
Limited Accuracy: while a lightweight and efficient model, it may not achieve the same level of accuracy as larger, more complex models, especially on challenging datasets.
Sensitivity to Hyperparameters: its performance can be sensitive to the choice of hyperparameters. Finding optimal hyperparameter values can be time-consuming.
Difficulty in Detecting Small Objects: due to its smaller size and reduced feature extraction capability, it may struggle to detect small objects accurately.
Limited Adaptability: designed for a specific range of tasks, it may not be as flexible as other architectures for more complex applications.
Not only that, the ReLU activation function commonly used in neural networks can give rise to the dying ReLU problem. This phenomenon occurs when the input to a ReLU unit is consistently negative: the unit outputs zero and receives zero gradient, so the affected neurons become permanently deactivated and cease to contribute to the learning process.
Imagine a classroom full of students. If some students (neurons) are not paying attention (dying ReLUs), they will not be able to learn the material. With dying ReLUs, though, the inattention is effectively permanent.
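To see the mechanics concretely, here is a minimal sketch assuming PyTorch; the input values are made up purely for illustration. Negative inputs are clamped to zero and also receive zero gradient, so the weights feeding them stop updating.

    import torch
    import torch.nn as nn

    relu = nn.ReLU()
    x = torch.tensor([-3.0, -0.5, 0.0, 2.0], requires_grad=True)

    y = relu(x)
    y.sum().backward()

    print(y)       # values [0., 0., 0., 2.] -- negative inputs are clamped to zero
    print(x.grad)  # values [0., 0., 0., 1.] -- and they also get zero gradient,
                   # so the weights feeding them stop being updated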
A neural network layer we have yet to cover, the bottleneck layer is designed to reduce the computational cost of a neural network, especially in deep architectures. It may sound like a max pooling layer, but a bottleneck reduces the channel dimensionality of feature maps directly (typically with a 1x1 convolution), rather than downsampling them to shrink their spatial dimensions.
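The difference is easiest to see in the tensor shapes. A short sketch, assuming PyTorch and arbitrary sizes chosen for illustration:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 256, 32, 32)            # (batch, channels, height, width)

    # Bottleneck: a 1x1 convolution that shrinks the channel dimension.
    bottleneck = nn.Conv2d(256, 64, kernel_size=1)
    print(bottleneck(x).shape)                 # torch.Size([1, 64, 32, 32])

    # Max pooling: shrinks the spatial dimensions and leaves the channels untouched.
    pool = nn.MaxPool2d(kernel_size=2)
    print(pool(x).shape)                       # torch.Size([1, 256, 16, 16])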
While bottleneck layers can be beneficial for the overall performance of a neural network, they are not a direct solution to the dying ReLU problem: they improve how information flows through the network, but they do not stop neurons from deactivating.
An improvement over the original MobileNet architecture, MobileNetV2 incorporates several new key features to improve accuracy and efficiency:
Inverted Residual Blocks: uses inverted residual blocks, which consist of a dimensionality expansion layer, a depthwise convolution, and a dimensionality reduction (projection) layer. Allows the model to learn more complex features while maintaining computational efficiency.
Linear Bottlenecks: the inverted residual blocks end in linear bottlenecks, projection layers that reduce the number of channels at the end of the block without applying a non-linearity, to further improve efficiency and avoid losing information in the narrow representation.
Pointwise ReLU: applies ReLU6 activations after the expansion (pointwise) convolution and the depthwise convolution, but not after the final projection. Keeping the non-linearities in the wide layers helps to prevent the vanishing gradient problem without losing information in the narrow ones (a sketch of a full block follows this list).
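Putting the three ideas together, here is a minimal sketch of an inverted residual block, assuming PyTorch; the expansion factor of 6 and the layer sizes follow typical MobileNetV2 settings but are simplified for illustration.

    import torch
    import torch.nn as nn

    class InvertedResidual(nn.Module):
        """Simplified MobileNetV2-style inverted residual block."""
        def __init__(self, in_ch, out_ch, stride=1, expansion=6):
            super().__init__()
            hidden = in_ch * expansion
            self.use_residual = stride == 1 and in_ch == out_ch
            self.block = nn.Sequential(
                # 1) Expansion: 1x1 convolution widens the channel dimension.
                nn.Conv2d(in_ch, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.ReLU6(inplace=True),
                # 2) Depthwise 3x3 convolution: one filter per channel (groups=hidden).
                nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
                nn.BatchNorm2d(hidden),
                nn.ReLU6(inplace=True),
                # 3) Projection (linear bottleneck): 1x1 convolution back down, no ReLU.
                nn.Conv2d(hidden, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
            )

        def forward(self, x):
            out = self.block(x)
            # Residual connection only when the input and output shapes match.
            return x + out if self.use_residual else out

    block = InvertedResidual(24, 24)
    print(block(torch.randn(1, 24, 56, 56)).shape)  # torch.Size([1, 24, 56, 56])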
This exercise explores the basic architecture of MobileNetV2, taking a detailed look at how its components contribute to lightweight and accurate object detection.
The first stage of the block, the expansion layer, is essential for:
Increased Representation Capacity: by expanding the number of channels, the network can learn more complex and nuanced representations of input data.
Improved Feature Reuse: the expanded feature map can be used to compute multiple features in the subsequent depthwise separable convolution layer, reducing need for redundant computations.
Reduced Computational Cost: although the expansion layer increases the number of channels, the subsequent depthwise convolution remains cheap because it processes each channel independently, so the block stays efficient overall, as sketched below.
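A short sketch of the expansion step, assuming PyTorch; the channel counts (24 in, expansion factor 6) are typical values used here only for illustration:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 24, 56, 56)                 # narrow input feature map
    expand = nn.Conv2d(24, 24 * 6, kernel_size=1)  # expansion factor t = 6
    print(expand(x).shape)                         # torch.Size([1, 144, 56, 56])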
The next set of layers, the depthwise separable convolution layer, helps by separating the standard convolution operation into two smaller operations we covered in the last section:
Depthwise Convolution: applies a single spatial filter to each channel of the feature map independently, without mixing channels. Allows the network to capture spatial information within each channel while reducing the number of parameters and computations.
Pointwise Convolution: applies a 1x1 filter across all channels at each spatial position, combining information across channels and adjusting the channel dimension. Allows the network to capture information across channels while further reducing the number of parameters and computations, as the parameter-count sketch below illustrates.
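To make the savings concrete, here is a sketch comparing parameter counts, assuming PyTorch and the 144-channel width used in the earlier example:

    import torch.nn as nn

    def param_count(m):
        return sum(p.numel() for p in m.parameters())

    # Standard 3x3 convolution: 144 -> 144 channels.
    standard = nn.Conv2d(144, 144, 3, padding=1, bias=False)

    # Depthwise (one 3x3 filter per channel) followed by pointwise (1x1 across channels).
    depthwise = nn.Conv2d(144, 144, 3, padding=1, groups=144, bias=False)
    pointwise = nn.Conv2d(144, 144, 1, bias=False)

    print(param_count(standard))                            # 186624 = 144 * 144 * 3 * 3
    print(param_count(depthwise) + param_count(pointwise))  # 22032  = 144 * 3 * 3 + 144 * 144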
After that, the projection layer (linear bottleneck) is important for:
Dimensionality Reduction: projects the feature map to a lower-dimensional space. Helps to reduce the number of parameters and computations required in the subsequent layers.
Efficient Downsampling: reduces the number of channels in the output feature map when the network downsamples the spatial dimensions. Helps to reduce the computational cost and memory usage of the network.
Reduced Computational Cost: reducing the number of channels in the feature map also reduces the number of parameters and computations required in subsequent layers, making the network more efficient and scalable.
Improved Scalability: makes the network more scalable by reducing the number of parameters and computations required, allowing it to be deployed on devices with limited computational resources.
Maintaining Feature Information: applies a purely linear transformation, preserving the feature information in the input feature map. Ensures that the network can still capture important features and patterns in the input data (a short sketch of the projection follows this list).
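A minimal sketch of the projection step, assuming PyTorch and the same illustrative channel counts as before; note the deliberate absence of a ReLU after the 1x1 convolution:

    import torch
    import torch.nn as nn

    expanded = torch.randn(1, 144, 56, 56)       # wide feature map from inside the block

    # Projection: 1x1 convolution back down to a narrow representation,
    # with no ReLU afterwards (the "linear" in linear bottleneck).
    project = nn.Conv2d(144, 24, kernel_size=1, bias=False)
    print(project(expanded).shape)               # torch.Size([1, 24, 56, 56])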
Finally, the model's residual connection helps to deal with the vanishing gradient problem by:
Preserving Gradient Flow: when the output feature map is added to the input feature map, the gradient of the loss function with respect to the input feature map is preserved. Because gradient flow is not interrupted, gradients can flow backwards through the network more easily.
Reducing Gradient Magnitude Decay: when gradients are backpropagated through a deep network, their magnitude tends to decay exponentially with the depth of the network. By adding the output feature map to the input feature map, the gradient magnitude is better preserved and the decay is reduced.
Providing An Alternative Path For Gradients: provides an alternative path for gradients to flow through the network. This allows gradients to bypass the convolutional layers and flow directly from the output feature map to the input feature map. Helps to reduce the impact of vanishing gradients and improve the stability of the training process.
Enabling Deeper Networks: by reducing the impact of vanishing gradients, residual connections enable the training of deeper networks. Gradients can flow more easily through the network, and the training process is more stable (the sketch after this list shows the identity path at work).
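A small sketch of that identity path, assuming PyTorch; the convolutional path's weights are scaled down artificially here just to mimic the gradient decay of a deep stack:

    import torch
    import torch.nn as nn

    # Convolutional path with deliberately tiny weights, so its gradient
    # contribution is close to zero (standing in for a decayed gradient).
    conv_path = nn.Conv2d(24, 24, 3, padding=1, bias=False)
    with torch.no_grad():
        conv_path.weight.mul_(1e-4)

    x = torch.randn(1, 24, 8, 8, requires_grad=True)

    # With the residual connection, the identity path still delivers a gradient
    # of 1 to every input element, regardless of what the conv path contributes.
    (x + conv_path(x)).sum().backward()
    print(x.grad.mean())   # close to 1.0 -- the skip path preserves gradient flow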
While by now an outdated neural network for mobile devices with limited power, MobileNetV2 remains an interesting model to study: it adds efficiency-increasing features over V1 while keeping a simpler architecture than V3.