Visualizing why we reduce channels before applying $3 \times 3$ filters.
A $3 \times 3$ filter applied here must penetrate **6 channels**. This is computationally expensive.
The same $3 \times 3$ filter applied here only penetrates **3 channels**. The work is halved.
The output volume after spatial features are extracted from the "thin" Red volume.
By comparing the Blue and Red stacks, you can see the impact of dimensionality reduction. Both stacks are being scanned by a **$3 \times 3$ spatial filter**.
However, the filter on the Red stack is "shallower." In real-world networks like GoogLeNet, we might reduce 192 channels down to 16, making the spatial convolution **12x faster**.