BERT vs GPT Attention Visualizer

💡 Softmax applied: Each Target row sums to 1.00
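
As a rough illustration of the note above, the following sketch applies a row-wise softmax to a matrix of raw attention scores and checks the row sums. The scores are random placeholders and the function name softmax_rows is an assumption made for this example, not something taken from the visualizer.

```python
import numpy as np

def softmax_rows(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax along the last axis (one row per target token)."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores: 5 target tokens (rows) x 5 source tokens (columns).
rng = np.random.default_rng(0)
raw_scores = rng.normal(size=(5, 5))

attention = softmax_rows(raw_scores)
row_sums = attention.sum(axis=-1)
print(np.round(row_sums, 2))  # each entry should be 1.0 up to floating-point error
```

Normalizing along the source axis is what makes each Target row readable as a probability distribution over the tokens it attends to.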

Hover over or tap any cell in the Attention Matrix to see how each token gathers information under the selected architecture.

The cat [MASK] on mat

Architecture (BERT): Stack of Transformer encoder blocks.
Attention Mechanism: Unmasked bidirectional self-attention. Each token computes its representation by attending to every token in the sequence, at both earlier and later positions, in a single pass.
Training Objective: Masked Language Modeling (MLM). The model learns to reconstruct hidden tokens, such as [MASK] in the example sentence, from the context on both sides.
Best For: Natural Language Understanding (NLU) tasks such as text classification, sentiment analysis, and extractive question answering.
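
The encoder-style attention described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions: the embeddings and projection matrices are random stand-ins rather than learned weights, and the head dimension d = 8 is arbitrary. The point is only that, with no mask, the [MASK] row spreads weight over tokens on both sides.

```python
import numpy as np

tokens = ["The", "cat", "[MASK]", "on", "mat"]
n, d = len(tokens), 8  # d is an arbitrary illustrative head dimension

rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))  # stand-in token embeddings (random, not learned)
# Hypothetical query/key/value projections, scaled to keep scores moderate.
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

q, k, v = x @ w_q, x @ w_k, x @ w_v
scores = q @ k.T / np.sqrt(d)  # (n, n): row = target token, column = source token

# No mask is applied: the softmax runs over every column, past and future.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

context = weights @ v  # each row mixes value vectors from the whole sequence

# Attention paid by the [MASK] position to every token in the sentence.
mask_row = weights[tokens.index("[MASK]")]
for tok, w in zip(tokens, mask_row):
    print(f"{tok:>6}: {w:.2f}")
```

Every entry of the weight matrix is strictly positive here, which is the property the MLM objective relies on: the prediction at [MASK] can draw on "The cat" as well as "on mat".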

Architecture (GPT): Stack of decoder-only Transformer blocks (no encoder and no cross-attention).
Attention Mechanism: Causal (masked) self-attention. A mask over the upper triangle of the attention matrix blocks each token from attending to later positions, so the model cannot peek at the tokens it is being trained to predict, and training matches left-to-right generation.
Training Objective: Autoregressive next-token prediction. The model is trained to predict a probability distribution over the next token using only the preceding tokens.
Best For: Natural Language Generation (NLG) tasks such as document summarization, coding assistance, and conversational AI.
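
The causal mask and the next-token objective described above can be sketched as follows. This is an illustrative example, not the visualizer's own code: the sentence is a hypothetical one chosen for the example, and the raw scores are random placeholders standing in for query-key products.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "mat"]  # hypothetical example sentence
n = len(tokens)

rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # placeholder raw scores (row = target, column = source)

# True strictly above the diagonal marks future positions for each target row.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)  # -inf becomes exactly 0 after softmax

masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # lower-triangular: no weight on later tokens

# Next-token prediction pairs each position with the token that follows it.
inputs, targets = tokens[:-1], tokens[1:]
for i, target in enumerate(targets):
    print(f"context={inputs[: i + 1]} -> predict {target!r}")
```

Because the first row can only see itself, its single weight is 1.0; later rows spread their weight over progressively longer prefixes while still summing to 1.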

Self-Attention: Encoder vs Decoder