[Animation: the R, G, and B kernel slices slide over their channels together, producing three per-channel feature maps (R feature, G feature, B feature) that are summed (∑) into one output.]
How RGB Convolution Works
An RGB image has shape (H × W × 3), and each filter has depth 3 to match. The kernel slides only along H and W; at each position it multiplies all 3 channels by the matching kernel slices and sums the products into one output value.
Input: (H, W, 3) e.g. (224, 224, 3)
Kernel: (k, k, 3) e.g. (3, 3, 3)
Output: (H-2, W-2, 1) per filter (3×3 kernel, stride 1, no padding), e.g. (222, 222, 1)
Sliding: H × W only ←→ ↑↓
With 64 filters → feature tensor (H-2, W-2, 64). Each filter detects a different pattern.
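The shapes above can be checked with a minimal NumPy sketch. It assumes stride 1, no padding, a channels-last layout, and random placeholder values for the image and the filters; `conv2d_rgb` is a hypothetical helper name, not part of any library.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_rgb(image, filters):
    """Stride-1, no-padding convolution (cross-correlation, as used in CNNs).

    image:   (H, W, C)
    filters: (F, k, k, C)
    returns: (H-k+1, W-k+1, F)
    """
    k = filters.shape[1]
    # windows has shape (H-k+1, W-k+1, C, k, k): the window axes come last.
    windows = sliding_window_view(image, (k, k), axis=(0, 1))
    # For each filter f, sum over window offsets i, j and channels c.
    return np.einsum("yxcij,fijc->yxf", windows, filters)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))    # placeholder RGB image
filters = rng.standard_normal((64, 3, 3, 3))  # 64 placeholder 3x3x3 filters
features = conv2d_rgb(image, filters)
print(features.shape)  # (222, 222, 64)
```

The kernel never moves along the channel axis: channels are consumed by the sum, which is why the output has one slice per filter rather than per input channel.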
Key insight: depth=3 in the kernel
The kernel must match input depth. For RGB (depth=3), every kernel is also depth=3.
output[y,x] = Σ_c Σ_i Σ_j kernel[i,j,c] × input[y+i, x+j, c]
↑ sum over R,G,B channels (c = 0,1,2) and kernel offsets (i, j = 0…k-1)
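As a rough illustration of the triple sum, the sketch below computes one output value with explicit loops and checks that it equals the sum of the three per-channel responses. It assumes a channels-last layout (H, W, 3), matching the shapes listed earlier; `conv_at` and `per_channel_at` are hypothetical helper names, and the image and kernel values are placeholders.

```python
import numpy as np

def conv_at(image, kernel, y, x):
    """One output value: sum over channels c and kernel offsets i, j."""
    k = kernel.shape[0]
    total = 0.0
    for c in range(image.shape[2]):
        for i in range(k):
            for j in range(k):
                total += kernel[i, j, c] * image[y + i, x + j, c]
    return total

def per_channel_at(image, kernel, y, x):
    """Per-channel responses at (y, x), before the channel sum."""
    k = kernel.shape[0]
    patch = image[y:y + k, x:x + k, :]
    return (patch * kernel).sum(axis=(0, 1))  # shape (3,): R, G, B

image = np.arange(5 * 5 * 3, dtype=float).reshape(5, 5, 3)  # placeholder image
kernel = np.ones((3, 3, 3))                                  # placeholder kernel
r, g, b = per_channel_at(image, kernel, 0, 0)

# "Convolve each channel separately, then sum" gives the same value
# as the single sum over c, i, j.
assert np.isclose(conv_at(image, kernel, 0, 0), r + g + b)
```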
3D CNN, Step 1: Video frames → 3D volume
8 consecutive frames (t=0…7)
Frames stacked → 3D volume (T × H × W)
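Stacking can be sketched in NumPy as below. The frame size and pixel values are placeholders chosen for illustration; real frames would come from a video decoder.

```python
import numpy as np

T, H, W = 8, 4, 4

# Placeholder grayscale frames: frame t is filled with the value t.
frames = [np.full((H, W), float(t)) for t in range(T)]
volume = np.stack(frames, axis=0)          # new leading time axis
print(volume.shape)                        # (8, 4, 4)

# With RGB frames of shape (H, W, 3), the same call yields (T, H, W, 3).
rgb_frames = [np.zeros((H, W, 3)) for _ in range(T)]
rgb_volume = np.stack(rgb_frames, axis=0)
print(rgb_volume.shape)                    # (8, 4, 4, 3)
```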
3D CNN, Step 2: 3D kernel slides over T × H × W
[Animation: the 3D kernel slides across the T, H, and W dimensions.]
How 3D Convolution Works
A video has shape (T × H × W × C). The 3D kernel spans d consecutive frames in time and slides along T, H, and W, so it can detect motion patterns across frames. As in the 2D case, it does not slide along the channel axis.
Input: (T, H, W, 3) e.g. (16, 112, 112, 3)
Kernel: (d, k, k, 3) e.g. (3, 3, 3, 3)
Output: (T-d+1, H-2, W-2) per filter (k=3, stride 1, no padding), e.g. (14, 110, 110)
Sliding: T × H × W ←→ ↑↓ ⏎
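The shapes listed above can be reproduced with a small NumPy sketch, assuming stride 1, no padding, a channels-last layout, and random placeholder data; `conv3d` is a hypothetical helper name, not a library function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d(video, kernel):
    """Stride-1, no-padding 3D convolution (cross-correlation).

    video:  (T, H, W, C)
    kernel: (d, k, k, C)
    returns (T-d+1, H-k+1, W-k+1)
    """
    d, k, _, _ = kernel.shape
    # windows has shape (T-d+1, H-k+1, W-k+1, C, d, k, k).
    windows = sliding_window_view(video, (d, k, k), axis=(0, 1, 2))
    # Sum over time offset d, spatial offsets i, j, and channels c.
    return np.einsum("tyxcdij,dijc->tyx", windows, kernel)

rng = np.random.default_rng(0)
video = rng.standard_normal((16, 112, 112, 3))  # placeholder clip
kernel = rng.standard_normal((3, 3, 3, 3))      # d=3, k=3, C=3
out = conv3d(video, kernel)
print(out.shape)  # (14, 110, 110)
```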
The 3D kernel detects temporal patterns — "wave crashing", "flame flickering", "cloud drifting" — that are invisible in a single frame.
Why a 3D CNN needs multiple frames
A single frame of ocean waves looks similar to a single frame of a still lake. The 3D kernel sees d=3 consecutive frames simultaneously — detecting the wave motion pattern that distinguishes the two.
output[t,y,x] = Σ_c Σ_dt Σ_i Σ_j kernel[dt,i,j,c] × input[t+dt, y+i, x+j, c]
↑ also sums over the time dimension (dt = 0,1,2 for d=3)
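To make the "still lake vs. waves" point concrete, here is a toy sketch. Two synthetic single-channel clips share an identical first frame, but only one of them moves. The kernel is a hand-set temporal difference (last frame minus first frame at the window center), chosen purely for illustration; a trained network would learn its own weights. `conv3d` is a hypothetical helper, and the stripe pattern is a stand-in for a wave crest.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d(video, kernel):
    """Stride-1, no-padding 3D convolution; video (T,H,W,C), kernel (d,k,k,C)."""
    d, k, _, _ = kernel.shape
    windows = sliding_window_view(video, (d, k, k), axis=(0, 1, 2))
    return np.einsum("tyxcdij,dijc->tyx", windows, kernel)

T, H, W = 8, 6, 12
frame0 = np.zeros((H, W))
frame0[:, 2:4] = 1.0  # bright vertical stripe

# Still scene: the same frame repeated. Moving scene: stripe shifts 1 px per frame.
still = np.stack([frame0] * T)[..., None]                                    # (T, H, W, 1)
moving = np.stack([np.roll(frame0, t, axis=1) for t in range(T)])[..., None]  # (T, H, W, 1)

# Temporal-difference kernel: +1 on the last frame, -1 on the first, center pixel only.
kernel = np.zeros((3, 3, 3, 1))
kernel[0, 1, 1, 0] = -1.0
kernel[2, 1, 1, 0] = 1.0

still_energy = np.abs(conv3d(still, kernel)).sum()
moving_energy = np.abs(conv3d(moving, kernel)).sum()

assert np.array_equal(still[0], moving[0])  # indistinguishable from one frame
print(still_energy, moving_energy)          # zero response vs. a positive response
```

A 2D kernel applied to the first frame alone would give identical outputs for both clips; only a kernel that spans several frames can tell them apart.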