Step-by-step computation of the Causal Mask during Training
Step 1: We multiply the Queries by the Keys to get the raw attention scores. Notice that "robot" (Row 1) currently has a high score for "orders" (Col 4). But "robot" is the first word, so at that time step it hasn't seen "orders" yet!
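As a rough sketch of this step, the snippet below computes the raw score matrix (each Query dotted with each Key) for a four-position sequence. Only "robot" and "orders" come from the example above; the two middle placeholder tokens, the 2-dimensional vectors, and the helper name `raw_scores` are made-up assumptions for illustration, and the usual scaling by the square root of the key dimension is left out for brevity.

```python
def raw_scores(queries, keys):
    """Return scores where scores[i][j] is the dot product of queries[i] and keys[j]."""
    return [
        [sum(q * k for q, k in zip(q_vec, k_vec)) for k_vec in keys]
        for q_vec in queries
    ]

# Hypothetical tokens and vectors (not taken from the original example).
tokens = ["robot", "<w2>", "<w3>", "orders"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
keys = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [2.0, 0.0]]

scores = raw_scores(queries, keys)

# Python indexes from 0, so Row 1 / Col 4 in the text is scores[0][3] here.
# With these made-up vectors, "robot" scores highest on "orders",
# a word that comes later in the sequence.
print(tokens[0], "->", tokens[3], "score:", scores[0][3])
```

Nothing in the plain matrix product knows about word order, which is why the first row can end up with a large value in the last column.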
Without this mask, the decoder could cheat during training by looking ahead at the next word of the target sentence and copying it, rather than learning to predict it from the past context.
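One common way to apply such a mask, sketched here under stated assumptions, is to set every score above the diagonal (a future position) to negative infinity before the softmax, so those positions receive exactly zero attention weight. The helper names `causal_mask` and `masked_softmax` and the score values are hypothetical, and the sketch assumes a square score matrix with one row and one column per position.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores):
    """Row-wise softmax over a square score matrix, with future positions masked out."""
    mask = causal_mask(len(scores))
    weights = []
    for i, row in enumerate(scores):
        # Replace scores for future positions with -inf so exp() maps them to 0.
        masked = [s if mask[i][j] else -math.inf for j, s in enumerate(row)]
        row_max = max(masked)  # always finite: the diagonal entry is never masked
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores; row 0 has a large score for the last (future) position.
scores = [
    [0.1, 0.3, 0.2, 2.0],
    [0.2, 0.1, 0.4, 0.0],
    [0.3, 0.4, 0.6, 2.0],
    [0.15, 0.2, 0.3, 1.0],
]

weights = masked_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

After masking, the first row puts all of its weight on the first position, so the large raw score for the future word has no effect on the output.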