Self-Attention Mechanism

Interactive step-by-step architecture flow

Step 0: The Input Sequence

The process begins with our input sequence. Each word ("I", "like", "studying") is converted into a continuous vector representation called an embedding, forming the input matrix X.
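As a minimal sketch of this step: each token is looked up in an embedding table, and the resulting vectors are stacked into the matrix X. The vocabulary, embedding size, and values below are illustrative placeholders, not the ones used in the canvas above.

```python
import numpy as np

# Hypothetical embedding table: one row per vocabulary word.
# In a real model these values are learned; here they are random.
rng = np.random.default_rng(0)
vocab = {"I": 0, "like": 1, "studying": 2}
d_model = 4  # illustrative embedding dimension
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["I", "like", "studying"]
# Stack one embedding per token to form the input matrix X.
X = np.stack([embedding_table[vocab[t]] for t in tokens])
print(X.shape)  # one row per token, d_model columns: (3, 4)
```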

Deep Dive: Softmax Transformation & Search Analogy

Search Engine Analogy

Self-attention mimics a retrieval system:

  • Query (Q): Your search terms.
  • Key (K): Video titles/tags.
  • Value (V): The actual video content.

Pro-Tip: In the canvas above, hover over matrix rows in Step 1 to see how these analogies apply to specific words.
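The retrieval analogy can be sketched in code: Q, K, and V are each a linear projection of the same input X. The weight matrices below are random stand-ins for what would be learned parameters, and the dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 4, 4  # illustrative dimensions
X = rng.normal(size=(3, d_model))  # one row per token: "I", "like", "studying"

# Learned projection matrices in a real model; random here.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # queries: what each token is searching for
K = X @ W_k  # keys: what each token offers to be matched against
V = X @ W_v  # values: the content that actually gets mixed together

scores = Q @ K.T  # raw dot products QK^T, shape (3, 3)
print(scores.shape)
```

Each entry of `scores` measures how well one token's query matches another token's key, which is exactly what the raw dot-product table below displays.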

1. Raw Dot Products (QKᵀ)

                I      like   studying
    I          2.0     0.8     0.1
    like       4.0     1.0     1.0
    studying   0.5     4.0     0.5

              ↓ SOFTMAX (applied row-wise) ↓

2. Attention Weights (A')

                I      like   studying
    I          69%     21%    10%
    like       91%      4%     5%
    studying    3%     94%     3%

Notice: Softmax forces each row of the attention matrix to sum to exactly 1.0 (100%), so every token's attention is a probability distribution over the sequence.
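This row-wise normalization can be verified directly. The sketch below applies softmax to the raw dot-product scores from the table above and checks that each row sums to 1; the rounded percentages come out close to the table (small differences are rounding).

```python
import numpy as np

# Raw dot-product scores from the table above.
# Rows and columns are ordered: "I", "like", "studying".
scores = np.array([[2.0, 0.8, 0.1],
                   [4.0, 1.0, 1.0],
                   [0.5, 4.0, 0.5]])

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant per row.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(scores)
print(np.round(A * 100, 1))  # approximately the percentage table above
print(A.sum(axis=1))         # each row sums to 1.0
```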