Self-Attention: Encoder vs Decoder

Interactive Tutorial: Understanding Bidirectional vs. Causal Masking

Example tokens: The, cat, sat, on, mat

Attention Matrix

💡 Softmax applied: Each Target row sums to 1.00
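The row normalization noted above can be sketched in a few lines of plain Python. The raw scores below are made up for illustration (they are not taken from the tutorial's widget), and `softmax`, `scores`, and `row_sums` are hypothetical names. In a real model, the scores would come from query-key dot products.

```python
import math

def softmax(row):
    """Turn a list of raw scores into non-negative weights that sum to 1."""
    peak = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "on", "mat"]

# Hypothetical raw scores: scores[i][j] says how strongly target token i
# matches source token j before normalization.
scores = [
    [2.0, 0.5, 0.1, 0.0, 0.3],
    [0.4, 1.8, 0.9, 0.1, 0.6],
    [0.2, 1.2, 1.5, 0.7, 0.8],
    [0.1, 0.3, 0.9, 1.1, 1.4],
    [0.3, 0.8, 0.6, 1.0, 1.9],
]

# Softmax is applied to each target row independently.
attention = [softmax(row) for row in scores]
row_sums = [sum(row) for row in attention]

for token, row in zip(tokens, attention):
    print(f"{token:>4}", " ".join(f"{w:.2f}" for w in row))
```

Each printed row is one target token's distribution over the source tokens, which is why the rows, and not the columns, sum to 1.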

Step-by-Step Logic

Hover over or tap any cell in the Attention Matrix to see how tokens gather information under the current architecture.

Application: Masked Language Modeling

The cat [MASK] on mat

BERT (Encoder-Only)

  • Architecture: Stack of Transformer Encoders.
  • Attention Mechanism: Unmasked Bidirectional Self-Attention. Each token's representation is computed by attending to every token in the sequence at once, both earlier and later positions.
  • Training Objective: Masked Language Modeling (MLM). Some input tokens are hidden (for example, replaced with [MASK]), and the model learns to reconstruct them from the context on both sides.
  • Best For: Natural Language Understanding (NLU) tasks such as text classification, sentiment analysis, and extractive question answering.
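As a rough illustration of the bidirectional case, the sketch below computes the attention row for the [MASK] position in "The cat [MASK] on mat". The scores are invented for the example, and `attention_row`, `left_weight`, and `right_weight` are hypothetical names, not part of BERT or any library.

```python
import math

def attention_row(scores, allowed):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "[MASK]", "on", "mat"]
mask_pos = tokens.index("[MASK]")

# Bidirectional attention: no position is hidden from the [MASK] token.
allowed = [True] * len(tokens)

# Hypothetical raw scores for the [MASK] row (assumed values).
scores = [0.2, 1.5, 0.0, 1.1, 0.9]

weights = attention_row(scores, allowed)
left_weight = sum(weights[:mask_pos])        # context before [MASK]
right_weight = sum(weights[mask_pos + 1:])   # context after [MASK]

print(f"left context:  {left_weight:.2f}")
print(f"right context: {right_weight:.2f}")
```

Both sides of the sentence contribute non-zero weight, which is the property that lets an encoder use "on mat" as well as "The cat" when filling in the hidden word.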

GPT (Decoder-Only)

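  • Architecture: Stack of Transformer Decoder blocks, used without the encoder-decoder cross-attention layer because there is no encoder to attend to.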
  • Attention Mechanism: Causal (Masked) Self-Attention. A mask over the upper triangle of the attention matrix (every entry above the diagonal) stops each token from attending to later positions, so a prediction cannot peek at the tokens it is supposed to generate.
  • Training Objective: Autoregressive Next-Token Prediction. The model is trained to output a probability distribution over the next token given only the tokens that precede it.
  • Best For: Natural Language Generation (NLG) tasks such as document summarization, coding assistance, and conversational AI.
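The causal mask described above can be sketched in plain Python as follows. This is a minimal illustration, not GPT's actual implementation: `causal_mask` and `masked_softmax` are hypothetical helper names, and the raw scores are all set to zero so that the effect of the mask alone is visible.

```python
import math

def causal_mask(n):
    """allowed[i][j] is True when target i may attend to source j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(row, allowed):
    """Softmax in which blocked positions are scored as -inf, giving weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
    peak = max(masked)  # finite, because the diagonal is always allowed
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "cat", "sat", "on", "mat"]
n = len(tokens)

# Assumed scores: all zeros, so every visible token gets an equal share.
scores = [[0.0] * n for _ in range(n)]
allowed = causal_mask(n)

attention = [masked_softmax(scores[i], allowed[i]) for i in range(n)]

for token, row in zip(tokens, attention):
    print(f"{token:>4}", " ".join(f"{w:.2f}" for w in row))
```

Every entry above the diagonal comes out as zero, while each row still sums to 1 because the softmax renormalizes over the positions that remain visible.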