Self-Attention Mechanism

Interactive step-by-step architecture flow

Step 0: The Input Sequence

The process begins with our input sequence. Each word ("I", "like", "studying") is converted into a continuous vector representation called an embedding, forming the input matrix X.
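As a minimal sketch of this step: each token is looked up in an embedding table, and the resulting vectors are stacked into the matrix X. The vocabulary, embedding size, and values below are illustrative placeholders, not the ones used in the canvas above.

```python
import numpy as np

# Hypothetical embedding table: one row per vocabulary word.
# In a real model these values are learned; here they are random.
rng = np.random.default_rng(0)
vocab = {"I": 0, "like": 1, "studying": 2}
d_model = 4  # illustrative embedding dimension
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["I", "like", "studying"]
# Stack one embedding per token to form the input matrix X.
X = np.stack([embedding_table[vocab[t]] for t in tokens])
print(X.shape)  # one row per token, d_model columns: (3, 4)
```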

Deep Dive: Softmax Transformation & Search Analogy

Search Engine Analogy

Self-attention mimics a retrieval system:

  • Query (Q): Your search terms.
  • Key (K): Video titles/tags.
  • Value (V): The actual video content.

Pro-Tip: In the canvas above, hover over matrix rows in Step 1 to see how these analogies apply to specific words.
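The retrieval analogy can be sketched in code: Q, K, and V are each a linear projection of the same input X. The weight matrices below are random stand-ins for what would be learned parameters, and the dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 4, 4  # illustrative dimensions
X = rng.normal(size=(3, d_model))  # one row per token: "I", "like", "studying"

# Learned projection matrices in a real model; random here.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # queries: what each token is searching for
K = X @ W_k  # keys: what each token offers to be matched against
V = X @ W_v  # values: the content that actually gets mixed together

scores = Q @ K.T  # raw dot products QK^T, shape (3, 3)
print(scores.shape)
```

Each entry of `scores` measures how well one token's query matches another token's key, which is exactly what the raw dot-product table below displays.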

1. Raw Dot Products (QKᵀ)

                I      like   studying
    I          2.0     0.8     0.1
    like       4.0     1.0     1.0
    studying   0.5     4.0     0.5

              ↓ SOFTMAX (applied row-wise) ↓

2. Attention Weights (A')

                I      like   studying
    I          69%     21%    10%
    like       91%      4%     5%
    studying    3%     94%     3%

Notice: Softmax forces each row of the attention matrix to sum to exactly 1.0 (100%), so every token's attention is a probability distribution over the sequence.
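This row-wise normalization can be verified directly. The sketch below applies softmax to the raw dot-product scores from the table above and checks that each row sums to 1; the rounded percentages come out close to the table (small differences are rounding).

```python
import numpy as np

# Raw dot-product scores from the table above.
# Rows and columns are ordered: "I", "like", "studying".
scores = np.array([[2.0, 0.8, 0.1],
                   [4.0, 1.0, 1.0],
                   [0.5, 4.0, 0.5]])

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant per row.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(scores)
print(np.round(A * 100, 1))  # approximately the percentage table above
print(A.sum(axis=1))         # each row sums to 1.0
```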