Where is the Parallelism?

Comparing RNN (Sequential) vs Transformer (Parallel) processing

RNN (Sequential)

O(N) Time

"I can't start Word 3 until Word 2 tells me what happened."

Transformer (Parallel)

O(1) Time*

"I see all words at once. I just need to multiply two matrices."

*O(1) sequential steps per layer, given enough parallel hardware; the total amount of work is still there, it just happens all at once.

The "100 Decoders" Misconception

Myth: 100 Decoders

[Diagram: 100 separate decoder boxes, D1, D2, D3, …, D100, one per token]

"Hiring 100 teachers to grade 1 exam each."

❌ Inefficient & redundant

Reality: 1 Matrix + GPU

1 Decoder Stack
T1   0    0    0    0
T1   T2   0    0    0
T1   T2   T3   0    0

Zeroes = Causal Mask (No looking ahead)

💡 The "Aha!" Moment

1. Training vs Inference

During training, we have the full sentence. Instead of feeding words one by one, we feed the whole matrix. The hardware (GPU) computes all positions in a single parallel pass, with no sequential dependency between them.
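
A minimal sketch of that contrast in NumPy (the shapes, the names X, W, h, and the toy recurrence are all illustrative, not any specific architecture):

```python
import numpy as np

seq_len, d_model = 100, 512
X = np.random.randn(seq_len, d_model)   # the whole training sentence as one matrix
W = np.random.randn(d_model, d_model)   # a toy layer's weights

# RNN-style: N sequential steps; step t cannot start before step t-1 finishes
h = np.zeros(d_model)
for t in range(seq_len):
    h = np.tanh(X[t] + h @ W)           # one word at a time

# Transformer-style: every position goes through in one matrix multiply
H = np.tanh(X @ W)                      # shape (100, 512), all positions at once
```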

2. The Context Vector

In an RNN, the context is a "summarized bucket" passed forward. In a Transformer, the Self-Attention Score is a matrix of word-to-word relationships calculated all at once.
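
A toy version of that score matrix in NumPy (Q, K, and the sizes are illustrative; real self-attention also applies learned projections and a softmax):

```python
import numpy as np

N, d = 5, 8                    # 5 words, 8-dimensional embeddings (toy sizes)
Q = np.random.randn(N, d)      # one query row per word
K = np.random.randn(N, d)      # one key row per word

scores = Q @ K.T / np.sqrt(d)  # (5, 5): every word-to-word relationship at once
print(scores.shape)            # (5, 5), with no "bucket" passed step by step
```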

3. Matrix Math (Not 100 Decoders!)

You don't clone the decoder 100 times. You pack all 100 tokens into a single block of data (a matrix). The GPU uses linear algebra to sweep across the entire grid simultaneously. Think of it as one super-teacher grading all exams at once.
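
Sketched in NumPy (hypothetical sizes): the decoder's weights exist exactly once, and the 100 tokens are simply 100 rows of one input matrix.

```python
import numpy as np

d_model = 512
W_decoder = np.random.randn(d_model, d_model)  # ONE set of decoder weights

tokens = np.random.randn(100, d_model)         # 100 tokens = 100 rows of one matrix
out = tokens @ W_decoder                       # one sweep across the whole grid

print(out.shape)  # (100, 512): token 1 and token 100 used the same weights
```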

4. The Causal Mask

If we feed all 100 words at once, how do we stop Word 5 from looking at Word 6? The model applies a mask (the triangle of zeros from the diagram above) over future words. Word 5 can only multiply with Words 1-5; it can't "cheat" by looking ahead.
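
A minimal sketch of the mask in NumPy. One detail the diagram simplifies: in practice the "zeros" are produced by setting future scores to negative infinity before the softmax, which turns them into exact zeros after it.

```python
import numpy as np

N = 5
scores = np.random.randn(N, N)                     # raw word-to-word scores

mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal = the future
scores[mask] = -np.inf                             # forbid looking ahead

# Row-wise softmax: -inf becomes exactly 0 after exponentiation
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

print(np.round(weights, 2))  # upper triangle is all zeros: Word 5 sees only Words 1-5
```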