Interactive Tutorial: Understanding Bidirectional vs. Causal Masking
Tokens: The, cat, sat, on, mat
Attention Matrix
💡 Softmax applied: Each Target row sums to 1.00
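The row normalization noted above can be sketched in a few lines of NumPy. This is a minimal illustration, not the tutorial's own implementation: the helper name `softmax_rows` and the random scores are assumptions made for the example, standing in for real query-key scores.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax: each target row becomes a probability distribution."""
    # Subtracting the row maximum keeps exp() numerically stable.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = ["The", "cat", "sat", "on", "mat"]

# Hypothetical raw scores (rows = target tokens, columns = source tokens);
# these are random placeholders, not real model output.
rng = np.random.default_rng(0)
scores = rng.normal(size=(len(tokens), len(tokens)))

attention = softmax_rows(scores)
row_sums = attention.sum(axis=-1)  # one sum per target row
```

Each entry of `row_sums` is 1 up to floating-point error, which is what the "sums to 1.00" note on the matrix refers to.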
Step-by-Step Logic
Hover or tap over any cell in the Attention Matrix to see how tokens gather information based on the current architecture.
Application: Masked Language Modeling
The cat [MASK] on mat
Application: Autoregressive Generation
The cat
BERT (Encoder-Only)
Architecture: A stack of Transformer encoder layers.
Attention Mechanism: Unmasked Bidirectional Self-Attention. Each token computes its representation by attending to every token in the sequence at once, both earlier and later positions.
Training Objective: Masked Language Modeling (MLM). The model learns to reconstruct hidden tokens from the context on both sides of each masked position.
Best For: Natural Language Understanding (NLU) tasks such as text classification, sentiment analysis, and extractive question answering.
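As a rough sketch of the masked language modeling setup described in this card, the snippet below hides one token and shows that, with no attention mask, the hidden position can still see every other token. The helper `mask_tokens`, the `IGNORE` marker, and the hand-picked position are assumptions for illustration; real training selects positions at random.

```python
import numpy as np

MASK = "[MASK]"
IGNORE = None  # hypothetical marker: this position does not contribute to the loss

def mask_tokens(tokens, positions):
    """Replace the chosen positions with [MASK]; labels keep the original token there."""
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else IGNORE for i, tok in enumerate(tokens)]
    return inputs, labels

tokens = ["The", "cat", "sat", "on", "mat"]
inputs, labels = mask_tokens(tokens, positions={2})  # position chosen by hand

# Bidirectional attention applies no mask: every target may attend to every source.
allowed = np.ones((len(tokens), len(tokens)), dtype=bool)

# Tokens the [MASK] position (row 2) is allowed to attend to.
visible_to_mask = [tok for tok, ok in zip(inputs, allowed[2]) if ok]
```

Here `inputs` matches the "The cat [MASK] on mat" example, `labels` holds "sat" only at the hidden position, and `visible_to_mask` contains tokens from both the left and the right of the mask.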
GPT (Decoder-Only)
Architecture: A stack of Transformer decoder layers.
Attention Mechanism: Causal (Masked) Self-Attention. A mask over the upper triangle of the attention matrix blocks each token from attending to later positions, so a prediction can never draw on the tokens it is meant to produce.
Training Objective: Autoregressive Next-Token Prediction. The model is trained to predict a probability distribution over the next token using only the preceding tokens.
Best For: Natural Language Generation (NLG) tasks such as document summarization, coding assistance, and conversational AI.
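The causal mask and the next-token objective from this card can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: `causal_attention` is a hypothetical helper, and the all-zero scores are placeholders chosen so the effect of the mask is easy to read.

```python
import numpy as np

def causal_attention(scores):
    """Softmax over each row after hiding future (upper-triangular) positions."""
    n = scores.shape[-1]
    # True strictly above the diagonal: source position is later than the target.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)  # exp(-inf) == 0, so future positions get zero weight
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = ["The", "cat", "sat", "on", "mat"]

# Hypothetical uniform scores, not real model output.
scores = np.zeros((len(tokens), len(tokens)))
attention = causal_attention(scores)

# Next-token prediction: the input at position i is paired with the token at i + 1.
inputs, targets = tokens[:-1], tokens[1:]
pairs = list(zip(inputs, targets))
```

With uniform scores, the row for "cat" splits its weight evenly between "The" and "cat" and gives zero weight to "sat", "on", and "mat". The `pairs` list starts with ("The", "cat") and ends with ("on", "mat").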