Self-Attention Mechanism
A step-by-step walkthrough of the architecture flow
Step 0: The Input Sequence
The process begins with the input sequence. Each word ("I", "like", "studying") is converted into a continuous vector representation called an embedding; stacked row by row, these vectors form the input matrix X, with one row per token.
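To make this concrete, here is a minimal sketch of Step 0 in Python, assuming NumPy, a hypothetical three-word vocabulary, and a 4-dimensional embedding space (real models learn these vectors during training):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and embedding width; both are assumptions here.
vocab = {"I": 0, "like": 1, "studying": 2}
d_model = 4

# In a trained model this lookup table is learned; random values stand in.
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["I", "like", "studying"]
X = np.stack([embedding_table[vocab[t]] for t in tokens])
print(X.shape)  # (3, 4): one row per token
```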
Deep Dive: Softmax Transformation & Search Analogy
Search Engine Analogy
Self-attention mimics a retrieval system (a code sketch follows the list):
- Query (Q): Your search terms.
- Key (K): Video titles/tags.
- Value (V): The actual video content.
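Here is one way these three roles might look in code: a sketch assuming the matrix X from Step 0 and randomly initialized projection matrices W_q, W_k, W_v standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 4
X = rng.normal(size=(3, d_model))          # stand-in for the Step 0 embeddings

W_q = rng.normal(size=(d_model, d_model))  # maps each token to its "search terms"
W_k = rng.normal(size=(d_model, d_model))  # maps each token to its "titles/tags"
W_v = rng.normal(size=(d_model, d_model))  # maps each token to its "content"

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # each has shape (3, d_model)
scores = Q @ K.T                           # raw dot products, shape (3, 3)
```

Each token thus emits its own query, key, and value, and the dot product QKᵀ scores how well every query matches every key; that score matrix is exactly what Step 1 below displays.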
1. Raw Dot Products (QKᵀ)

|              | I   | like | studying |
|--------------|-----|------|----------|
| **I**        | 2.0 | 0.8  | 0.1      |
| **like**     | 4.0 | 1.0  | 1.0      |
| **studying** | 0.5 | 4.0  | 0.5      |

Applying a softmax to each row converts these raw scores into normalized attention weights:

2. Attention Weights (A')

|              | I    | like | studying |
|--------------|------|------|----------|
| **I**        | 69%  | 21%  | 10%      |
| **like**     | 91%  | 4.5% | 4.5%     |
| **studying** | 3%   | 94%  | 3%       |
Notice: Softmax forces each row of the matrix to sum to exactly 1.0 (100%).
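You can verify both tables with a few lines of NumPy. The score matrix below is copied from the first table, and softmax(x_i) = e^{x_i} / Σ_j e^{x_j} is applied to each row:

```python
import numpy as np

# Raw QKᵀ scores, copied from the first table above.
scores = np.array([[2.0, 0.8, 0.1],    # "I"
                   [4.0, 1.0, 1.0],    # "like"
                   [0.5, 4.0, 0.5]])   # "studying"

def softmax(x, axis=-1):
    # Subtracting the row max improves numerical stability without
    # changing the result, since softmax is shift-invariant.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(scores)
print(np.round(A * 100, 1))  # [[68.9 20.8 10.3] [90.9 4.5 4.5] [2.8 94.3 2.8]]
print(A.sum(axis=1))         # [1. 1. 1.] — every row sums to exactly 1.0
```

Note that the two identical raw scores in the "like" row (1.0 and 1.0) must receive identical weights of 4.5% each, since softmax depends only on the input values.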