Sequential data refers to data where the order or sequence of elements is crucial, meaning each data point is dependent on, or influenced by, others in the sequence (e.g., time series data or text).
An example of working with sequential data is translating a sentence from French to English, where the model processes the French sentence (a sequence of words) and outputs the equivalent English sentence (also a sequence of words).
Handling sequential data is challenging because the order of the data points, and often the timing of their occurrences (their temporal relationship), influences the data's meaning and analysis, and traditional machine learning models struggle to capture these dependencies.
Recurrent neural networks (RNNs) process sequential data by using feedback loops to maintain a 'memory' of previous inputs, unlike traditional neural networks that process each input independently. This memory, or hidden state, allows RNNs to capture temporal dependencies and patterns within sequences.
To break down the architecture of a typical RNN for sequential data processing:
Hidden state (a^<t>): Captures information from previous time steps. Updated at each step using a^<t> = g_1(W_aa * a^<t−1> + W_ax * x^<t> + b_a), where:
W_aa and W_ax: Weight matrices for the recurrent (hidden-to-hidden) connection and the input-to-hidden connection, respectively.
g_1: Activation function (e.g., tanh, ReLU, sigmoid).
a^<t−1>: Hidden state from the previous time step t−1.
x^<t>: Input at the current time step t.
Output (ŷ^<t>): Generated at each time step t by transforming the hidden state a^<t> into an output vector, which can be interpreted as probabilities (for classification) or raw values (for regression).
To break down the process of the RNN:
Initialization: Initialize hidden state a^<0> (often to 0s).
Hidden state update: With the formula a^<t> = tanh(W_aa * a^<t−1> + W_ax * x^<t> + b_a), combine the previous hidden state a^<t−1> with the current input x^<t>, where W_aa refers to hidden-to-hidden weights and W_ax to input-to-hidden weights.
Calculate output: Done through ŷ^<t> = softmax(W_ya * a^<t> + b_y), where W_ya refers to hidden-to-output weights.
Weight sharing: Same weights W_aa, W_ax, and W_ya are reused across all time steps.
Think of the RNN as a conveyor belt where each worker (hidden state a^<t>) updates their task based on the previous worker's output and new materials (x^<t>), passing results downstream.
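A minimal sketch of this forward pass in NumPy (the dimension sizes and random initialization are illustrative assumptions, not values from the notes):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative sizes: 4-dimensional inputs, 8 hidden units, 3 output classes.
n_x, n_a, n_y = 4, 8, 3
rng = np.random.default_rng(0)
W_aa = rng.normal(scale=0.1, size=(n_a, n_a))       # hidden-to-hidden weights
W_ax = rng.normal(scale=0.1, size=(n_a, n_x))       # input-to-hidden weights
W_ya = rng.normal(scale=0.1, size=(n_y, n_a))       # hidden-to-output weights
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

a = np.zeros(n_a)                                   # a^<0>: initial hidden state
xs = rng.normal(size=(5, n_x))                      # a toy sequence of 5 inputs

for x in xs:                                        # same weights reused at every step
    a = np.tanh(W_aa @ a + W_ax @ x + b_a)          # hidden state update
    y_hat = softmax(W_ya @ a + b_y)                 # output at this time step
    print(y_hat)
```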
Typical RNNs cannot take future time steps into account and can suffer from the vanishing gradient problem.
RNNs are designed for sequential data processing, with architectures tailored to different input-output relationships. Several key variants of these relationships include:
1-to-1 (vanilla): One input produces one output. Not inherently sequential; included for completeness. Can be used for image classification: image in, class label out.
1-to-many: One input produces a sequence of outputs. An initial input (e.g., image features) triggers sequential generation (e.g., words). Can be used for image captioning: image in, descriptive text sequence out.
Many-to-1: A sequence of inputs produces one output. Aggregates sequential inputs (e.g., words) into a final prediction. Can be used for sentiment analysis: sentence in, positive/negative score out.
Many-to-many (aligned): A sequence of inputs produces a sequence of outputs of the same length. Each input step maps directly to an output step, as in part-of-speech (POS) tagging, i.e., assigning a grammatical tag to each word. Can be used for video frame-by-frame labeling: video frames in, a label for each frame out.
Many-to-many (encoder-decoder): A sequence of inputs produces a sequence of outputs of a different length. Can be used for machine translation: English sentence in, French sentence out.
Encoder: Compresses input sequence into a context vector.
Decoder: Generates output sequence from context vector.
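A minimal sketch of the encoder-decoder idea, assuming PyTorch and using GRU layers (the sizes and layer choices are illustrative, not prescribed by the notes):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: compress the input sequence into a context
    vector, then unroll the decoder from that context."""
    def __init__(self, in_dim=16, hid_dim=32, out_dim=10):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(out_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, out_dim)

    def forward(self, src, tgt):
        _, context = self.encoder(src)      # context vector = encoder's final hidden state
        dec_out, _ = self.decoder(tgt, context)
        return self.proj(dec_out)           # per-step output scores

model = Seq2Seq()
src = torch.randn(2, 7, 16)     # batch of 2 source sequences, length 7
tgt = torch.randn(2, 5, 10)     # target sequences of a different length (5)
print(model(src, tgt).shape)    # torch.Size([2, 5, 10])
```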
(Diagram: an unrolled RNN showing the input, the input-to-hidden connections, and the hidden-to-output connections at each time step.)
Backpropagation through time: computing the gradient involves many repeated multiplications by the same weight matrices, so the gradient contributions from distant time steps tend to either shrink toward zero (vanish) or blow up (explode).
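As a back-of-the-envelope illustration (the factors 0.9 and 1.1 are arbitrary stand-ins for how much each recurrent multiplication shrinks or amplifies the gradient):

```python
# Repeatedly multiplying by a factor slightly below (or above) 1 makes the
# gradient contribution from distant time steps vanish (or explode).
for steps in (10, 50, 100):
    print(steps, 0.9 ** steps, 1.1 ** steps)
# 10   0.349      2.59
# 50   0.00515    117.4
# 100  2.66e-05   13780.6
```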
Long short-term memory (LSTM) is a type of RNN aimed at mitigating the vanishing gradient problem commonly encountered by traditional RNNs.
The gated recurrent unit (GRU) is a related gated architecture that uses fewer gates and parameters than the LSTM while also mitigating the vanishing gradient problem.
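A minimal sketch, assuming PyTorch, of swapping a plain recurrent layer for an LSTM or GRU (the sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 20, 8)                    # (batch, time steps, features)

rnn  = nn.RNN(8, 16, batch_first=True)       # plain recurrent layer
lstm = nn.LSTM(8, 16, batch_first=True)      # adds gates and a separate cell state
gru  = nn.GRU(8, 16, batch_first=True)       # gated, but no separate cell state

out, h_n        = rnn(x)                     # out: (2, 20, 16)
out, (h_n, c_n) = lstm(x)                    # LSTM also returns a cell state
out, h_n        = gru(x)
print(out.shape)                             # torch.Size([2, 20, 16])
```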
Lag: a lag is the value of the series at an earlier time step; lagged values are commonly used as input features when forecasting is framed as supervised learning.
MAE vs. MAPE: mean absolute error (MAE) measures the average absolute error in the data's original units, while mean absolute percentage error (MAPE) expresses the error relative to the true values, which makes it easier to compare across series of different scales.
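For concreteness, a small sketch of the two metrics in NumPy (the toy values are made up):

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mae  = np.mean(np.abs(y_true - y_pred))                    # in the data's units
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # in percent

print(mae, mape)   # 16.67  8.33
```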
Worked example: a toy time series dataset.
Batch effect.
Scaling values: scale (e.g., normalize) the values before training, keeping the fitted scaler so predictions can be converted back later.
Convert to a time series set: turn the series into windows of past values (inputs) paired with the next value (target); a sketch of these two steps follows.
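A sketch of the scaling and windowing steps under common assumptions (scikit-learn's MinMaxScaler for scaling and a simple sliding window for the conversion; the window length and toy series are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

series = np.arange(20, dtype=float).reshape(-1, 1)   # toy univariate series

# 1. Scale values to [0, 1]; keep the scaler to invert predictions later.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(series)

# 2. Convert to a supervised "time series set": windows of n_lags past
#    values as inputs, the next value as the target.
def make_windows(values, n_lags=3):
    X, y = [], []
    for i in range(len(values) - n_lags):
        X.append(values[i:i + n_lags])
        y.append(values[i + n_lags])
    return np.array(X), np.array(y)

X, y = make_windows(scaled)
print(X.shape, y.shape)   # (17, 3, 1) (17, 1)

# Later, values in scaled units can be mapped back to the original scale
# (the "re-scale back" step) with the fitted scaler:
y_orig = scaler.inverse_transform(y)
```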
Flattening a multi-dimensional array, like the output of convolutional layers, into a 1D vector (a 'flattened' vector) is done before feeding it into a fully connected (or 'dense') layer because these layers expect a 1D input for their calculations.
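A tiny sketch of that flattening step, assuming PyTorch (the shapes are illustrative):

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(2, 8, 5, 5)           # (batch, channels, height, width)
flat = torch.flatten(feature_maps, start_dim=1)  # -> (batch, 8*5*5) = (2, 200)
dense = nn.Linear(200, 10)                       # fully connected layer expects a 1D vector per sample
print(dense(flat).shape)                         # torch.Size([2, 10])
```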
Re-scale back: apply the scaler's inverse transform so predictions are reported in the original units.
Result: compare the re-scaled predictions against the actual series.
Multivariate: the same workflow extends to several input series.
Tweak: adjust the model and preprocessing as needed.
Concatenate: combine the individual series into one multivariate input, as sketched below.
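One way to read the multivariate and concatenate steps (an assumption about the notes' intent, not something stated in them): additional variables are concatenated along the feature axis, so each window has shape (time steps, features):

```python
import numpy as np

temperature = np.random.rand(100, 1)   # toy series 1
humidity    = np.random.rand(100, 1)   # toy series 2

# Concatenate along the last axis -> one multivariate series of shape (100, 2);
# windowing it as before yields inputs of shape (samples, time steps, 2).
multivariate = np.concatenate([temperature, humidity], axis=1)
print(multivariate.shape)   # (100, 2)
```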
Traditional RNNs suffer drawbacks such as the vanishing gradient problem, difficulty capturing long-range dependencies, and strictly sequential processing that cannot be parallelized across time steps.
The Transformer architecture addresses these issues by using the self-attention mechanism.
Do not confuse Transformers with RNNs: Transformers perform no sequential processing; instead, they add positional encodings to the embedded representations to encode each token's position in the sequence.
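A minimal sketch of one standard choice, the sinusoidal positional encoding from the original Transformer paper (the sequence length and model dimension below are arbitrary):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe                                             # added to the input embeddings

print(positional_encoding(seq_len=50, d_model=64).shape)  # (50, 64)
```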
Self-attention allows the model to weigh the importance of different parts of the input sequence when making predictions.
In its basic form, self-attention has no learnable parameters of its own; the learnable weights enter through the query, key, and value projections described below.
The Transformer architecture uses an implementation of self-attention called scaled dot-product attention. A core component of Transformer models, it calculates attention weights by taking the dot product of query and key vectors, scaling the result, and applying a softmax function, allowing the model to focus on relevant parts of the input.
Scaling prevents the dot products from growing too large: applying the softmax to large values would yield very small gradients, so the scaling helps avoid vanishing gradients.
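A minimal sketch of scaled dot-product attention in NumPy (single sequence, no masking; the shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                            # weighted sum of value vectors

Q = np.random.rand(5, 8)    # 5 query vectors of dimension d_k = 8
K = np.random.rand(5, 8)
V = np.random.rand(5, 16)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 16)
```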
Multi-head attention is a mechanism that enhances attention by running the attention process multiple times in parallel, each with a different set of learned parameters, and then concatenating the outputs to capture diverse relationships within the input data.
Multi-head attention uses multiple scaled dot-product attention mechanisms in parallel, each with its own query, key, and value projections. Each 'head' attends to different aspects of the input, allowing the model to capture a wider range of relationships and dependencies. The outputs of all the heads are then concatenated and passed through a linear transformation to produce the final output.
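A minimal sketch using PyTorch's built-in nn.MultiheadAttention layer (the embedding size and head count are illustrative choices):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 64)                 # (batch, sequence length, embedding dim)
out, attn_weights = mha(x, x, x)           # self-attention: query = key = value = x
print(out.shape, attn_weights.shape)       # torch.Size([2, 10, 64]) torch.Size([2, 10, 10])
```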