How Masked Attention Works

Step-by-Step Computation of the Causal Mask During Training

Step 1: We multiply the Query matrix by the transpose of the Key matrix to get the raw attention scores. Notice that "robot" (Row 1) currently has a high score for "orders" (Col 4). But "robot" is the first word—it hasn't seen "orders" yet!
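This step can be sketched with NumPy. The Q and K values below are illustrative toy numbers, not real learned projections; the causal mask sets every score above the diagonal (a query attending to a later key) to negative infinity:

```python
import numpy as np

# Toy setup: 4 tokens ("robot", "must", "obey", "orders"), d_k = 3.
# Q and K are random placeholders for the learned query/key projections.
np.random.seed(0)
d_k = 3
Q = np.random.randn(4, d_k)
K = np.random.randn(4, d_k)

# Step 1: raw attention scores = Q K^T, scaled by sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)

# Causal mask: position i may only attend to positions j <= i,
# so everything strictly above the diagonal is masked out.
mask = np.triu(np.ones((4, 4), dtype=bool), k=1)
masked_scores = np.where(mask, -np.inf, scores)

print(masked_scores)  # upper triangle is -inf, lower triangle unchanged
```

Using -inf (rather than 0) matters: after the softmax in the next step, those positions receive exactly zero attention weight.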

Input Sequence Timeline

Our example sequence is "robot must obey orders". At each time step, the available context is only the current token and the tokens before it.
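The available context at each time step can be enumerated directly, which is exactly what the causal mask enforces inside the attention matrix:

```python
tokens = ["robot", "must", "obey", "orders"]

# At time step t, the model may only see tokens[0..t].
for t in range(len(tokens)):
    print(f"t={t}: predicting from context {tokens[:t + 1]}")
```

Row t of the masked attention matrix has exactly t + 1 unmasked cells, one per token in this context.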


Without this mask, the decoder could cheat during training by looking ahead at the next word in the target sentence, memorizing it rather than learning to predict it from the past context alone.
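The effect of the mask becomes concrete after the softmax: masked positions get exactly zero weight, so "robot" cannot draw any information from "orders" no matter how high their raw score is. A minimal sketch, with hand-picked illustrative scores:

```python
import numpy as np

def causal_softmax(scores):
    """Mask future positions with -inf, then softmax each row.

    After the softmax, every masked (future) position has weight 0,
    so each token attends only to itself and earlier tokens.
    """
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)
    # Numerically stable softmax per row
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative raw scores: "robot" (row 1) scores 2.0 on "orders" (col 4).
scores = np.array([[0.1, 2.0, 0.3, 1.5],
                   [0.5, 0.2, 0.9, 0.1],
                   [0.3, 0.8, 0.4, 0.6],
                   [1.0, 0.1, 0.7, 0.2]])

weights = causal_softmax(scores)
# The 2.0 for "orders" is gone; "robot" attends only to itself.
print(weights[0])  # [1. 0. 0. 0.]
```

Note that each row still sums to 1: the mask removes future positions before normalization, so the remaining (past) positions share all of the attention weight.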