Because the exercise materials use heavily outdated syntax and will take considerable time to rewrite, they will be temporarily unavailable in recent blog sections.
Recurrent Neural Networks (RNNs) have revolutionized the field of sequence modeling, enabling us to process and understand sequential data such as text, time series, and speech. Coupled with the Connectionist Temporal Classification (CTC) loss function, RNNs become even more powerful tools for tasks involving variable-length sequences, like speech recognition and handwriting recognition.
Before we move on to the main topic of the day, it helps to first explore a few CNN case studies from the past. CAPTCHA recognition, street-view house number recognition, and license plate recognition are all challenging tasks that can be effectively addressed with CNNs. The key challenges in these tasks are:
Variable String Length: the length of the string to be recognized can vary, making it difficult to design a fixed-size CNN architecture.
Variable Character Width: the width of individual characters within the image can vary, making it challenging to isolate and recognize each character independently.
A proposed solution to this dilemma is to have a CNN extract features along the X-axis of each input image, which helps the network capture the spatial relationships between pixels and characters. Once all the features have been extracted, they are treated as a sequence of time steps, and a recurrent neural network (RNN) models the temporal dependencies between them.
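To make this concrete, here is a minimal sketch in PyTorch (the layer sizes and input shape are my own illustrative assumptions) of a CNN that collapses the image height while preserving its width, so each column of the resulting feature map becomes one time step for a downstream RNN:

```python
import torch
import torch.nn as nn

class ColumnFeatureExtractor(nn.Module):
    """Pools the height axis down to 1 while keeping the width axis,
    turning a (B, C, H, W) image into a (B, W', feat_dim) sequence."""
    def __init__(self, in_channels=1, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),    # halves H and W
            nn.Conv2d(32, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),          # collapse height to 1, keep width
        )

    def forward(self, x):                             # x: (B, C, H, W)
        f = self.conv(x)                              # (B, feat_dim, 1, W')
        f = f.squeeze(2)                              # (B, feat_dim, W')
        return f.permute(0, 2, 1)                     # (B, W', feat_dim): one vector per column

features = ColumnFeatureExtractor()(torch.randn(1, 1, 32, 100))
print(features.shape)  # torch.Size([1, 50, 64]) -> 50 time steps of 64-dim features
```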
Imagine trying to translate a foreign language text with unfamiliar words. You might first break down the text into individual characters (extract features), then analyze the sequence of characters to understand the meaning (using an RNN to model the temporal dependencies).
Whereas the CNN is designed to process images, the recurrent neural network (RNN) is designed to process sequential data such as text, time series, and speech. Its focus is capturing long-term dependencies in that data: connections between inputs separated by long intervals of time.
An RNN contains the following components; a minimal sketch of the recurrence follows the list:
Hidden State: a vector that stores information about the network's previous state. It is updated at each time step from the current input and the previous hidden state, and it influences the network's output and future hidden states.
Input: a sequence of input vectors representing the data to be processed. Each input vector is processed at a specific time step.
Output: a sequence of output vectors representing the network's predictions or classifications. The output at a given time step can depend on the current input and the previous hidden state.
Weights: parameters that determine the network's behavior, learned during training to optimize the network's performance.
Recurrent Connections: connections that allow information to flow from previous time steps to the current one, helping the network capture long-term dependencies in the input data.
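Putting these components together, here is a minimal sketch of the recurrence itself (the dimensions and the weight names W_x, W_h, W_y are illustrative assumptions, not a standard API):

```python
import torch

input_dim, hidden_dim, output_dim = 8, 16, 4
W_x = torch.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1  # recurrent (hidden-to-hidden) weights
W_y = torch.randn(output_dim, hidden_dim) * 0.1  # hidden-to-output weights
b_h = torch.zeros(hidden_dim)

inputs = torch.randn(5, input_dim)               # a sequence of 5 input vectors
h = torch.zeros(hidden_dim)                      # initial hidden state
outputs = []
for x_t in inputs:                               # one iteration per time step
    h = torch.tanh(W_x @ x_t + W_h @ h + b_h)    # recurrent connection: h depends on previous h
    outputs.append(W_y @ h)                      # output depends on the current hidden state

print(torch.stack(outputs).shape)                # torch.Size([5, 4])
```

In a real network these weights are learned by backpropagation through time; the loop above only shows how information flows.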
One paper combined the CNN with the RNN to build the Convolutional Recurrent Neural Network (CRNN). It combines the strengths and components of its 'parents' to perform tasks that involve both spatial and sequential data, such as optical character recognition, scene text recognition, and handwritten text recognition.
The components of a CRNN, sketched in code after the list, include:
Convolutional Layers: extract features from the input image, capturing spatial information.
Recurrent Layers: model the temporal dependencies between the extracted features, similar to an RNN.
Transcription Layer: converts the output of the RNN into a sequence of characters or symbols.
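Here is a minimal CRNN sketch that wires the three components together (the layer sizes are illustrative assumptions, not the exact configuration from the paper):

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes, in_channels=1, feat_dim=64, hidden=128):
        super().__init__()
        # Convolutional layers: extract per-column features from the image.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),              # (B, feat_dim, 1, W')
        )
        # Recurrent layers: model dependencies along the width axis.
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        # Transcription layer: per-time-step scores over the character set.
        self.transcribe = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                                 # x: (B, C, H, W)
        f = self.conv(x).squeeze(2).permute(0, 2, 1)      # (B, W', feat_dim)
        seq, _ = self.rnn(f)                              # (B, W', 2 * hidden)
        return self.transcribe(seq)                       # (B, W', num_classes)

logits = TinyCRNN(num_classes=11)(torch.randn(2, 1, 32, 100))  # e.g. 10 digits + CTC blank
print(logits.shape)  # torch.Size([2, 50, 11])
```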
For our CAPTCHA recognition problem, a CRNN offers the following benefits:
End-to-End Learning: CRNN allows for end-to-end learning, where the network learns to extract features and recognize the CAPTCHA text directly from the input image.
Handling Variable-Length Sequences: CRNN can handle CAPTCHAs whose text length varies, as the RNN component can process sequences of different sizes.
Robustness: CRNN is relatively robust to variations in CAPTCHA images, such as different fonts, sizes, and distortions.
CTC (Connectionist Temporal Classification) loss is a loss function designed for sequence-to-sequence tasks where the lengths of the input and output sequences may differ. It does the following:
Alignment-Free Error Calculation: computes the error over all valid alignments between the input and the label sequence, rather than requiring a fixed character-by-character correspondence, which allows it to handle sequences of different lengths.
Blank Symbol: introduces a 'blank' symbol to represent the absence of a character. This allows the model to handle cases where characters are repeated or missing (see the collapse sketch after this list).
Optimization: maximizes the probability of the true output sequence (i.e., minimizes its negative log-likelihood), which in practice drives down the edit distance between the predicted and true sequences.
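To see what the blank symbol buys us, here is a small sketch of CTC's many-to-one collapse rule (the alignments and the '-' blank marker are made-up examples): merge repeated symbols, then drop blanks.

```python
BLANK = "-"

def collapse(alignment):
    """Collapse a frame-level alignment into a label sequence:
    merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for ch in alignment:
        if ch != prev and ch != BLANK:   # keep a symbol only when it changes
            out.append(ch)
        prev = ch
    return "".join(out)

print(collapse("cc-aaa-t-"))  # 'cat'  (repeats merged, blanks dropped)
print(collapse("c-aat--t-"))  # 'catt' (a blank separates a genuine double letter)
```

Many different alignments collapse to the same label sequence, which is exactly what lets CTC handle repeated characters and inputs far longer than their labels.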
In a speech recognition application trained with a cross-entropy loss, the input signal must first be segmented into words or sub-words. With CTC loss, however, it suffices to provide one label sequence per input sequence, and the network learns both the alignment and the labeling.
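Here is a minimal sketch of computing CTC loss with PyTorch's built-in nn.CTCLoss; the shapes, label values, and lengths are illustrative assumptions:

```python
import torch
import torch.nn as nn

T, B, C = 50, 2, 11                  # time steps, batch size, classes (10 digits + blank)
log_probs = torch.randn(T, B, C).log_softmax(2)        # stand-in for CRNN outputs
targets = torch.tensor([[1, 2, 3, 4],                  # one label sequence per input,
                        [5, 6, 7, 0]])                 # no frame-level segmentation needed
input_lengths = torch.full((B,), T, dtype=torch.long)  # all 50 frames are valid
target_lengths = torch.tensor([4, 3])                  # true label lengths; padding is ignored

ctc = nn.CTCLoss(blank=0)            # class index 0 is reserved for the blank symbol
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Note that only the label sequences and their lengths are supplied; the alignment between frames and characters is marginalized out inside the loss.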