Click any input pixel to see how filters transform it into output channels.
The same 3×3 filter is much cheaper to apply on a thin (reduced) volume than a thick one.
3×3 filter must process 192 input channels. Very expensive.
Same 3×3 filter now sees only 16 channels. 12× faster.
Final feature maps after the cheaper convolution.
In a real network, feature maps are much larger (e.g., 28×28 spatial size)
(64 filters = 64 output channels. This number is a design choice.)
Bottlenecks enable efficiency at scale