RGB CNN vs 3D CNN

RGB CNN: Real image, channel decomposition

Panels: Original, R channel, G channel, B channel

Each channel convolved separately → summed into one output

Panels: R feature, G feature, B feature, ∑ Sum output


All 3 channels convolved simultaneously → summed

The R, G, and B kernel slices slide together, giving three per-channel feature maps that are summed into the final output.

How RGB Convolution Works

An RGB image is (H × W × 3). Each filter has depth=3. The kernel slides only in H and W — processing all 3 channels simultaneously and summing into one value per position.

Input: (H, W, 3), e.g. (224, 224, 3)
Kernel: (k, k, 3), e.g. (3, 3, 3)
Output: (H-k+1, W-k+1, 1) per filter, i.e. (H-2, W-2, 1) for k=3 with stride 1 and no padding
Sliding: H × W only (left-right, up-down)

With 64 filters → feature tensor (H-2, W-2, 64). Each filter detects a different pattern.
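The shape arithmetic above can be checked with a small helper. This is a minimal sketch that assumes stride 1 and no padding, as the (H-2, W-2) figures imply for k=3; the function name `conv2d_output_shape` is made up for illustration.

```python
def conv2d_output_shape(height, width, kernel_size, num_filters):
    """Output shape of a stride-1, unpadded 2D convolution (channels last)."""
    out_h = height - kernel_size + 1
    out_w = width - kernel_size + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("kernel is larger than the input")
    # Each filter collapses the input depth into one value per position,
    # so the output depth equals the number of filters.
    return (out_h, out_w, num_filters)


print(conv2d_output_shape(224, 224, 3, 1))   # (222, 222, 1)
print(conv2d_output_shape(224, 224, 3, 64))  # (222, 222, 64)
```

Note that the kernel depth (3) does not appear in the output shape: it is consumed by the sum over channels.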


Key insight: depth=3 in the kernel

The kernel must match input depth. For RGB (depth=3), every kernel is also depth=3.

output[y,x] = Σ_c Σ_i Σ_j kernel[c,i,j] × input[c, y+i, x+j]
where Σ_c sums over the R, G, B channels (c = 0, 1, 2).
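The sum above can be written out directly as nested loops. The sketch below uses plain Python lists and follows the channel-first indexing of the formula (`image[c][y][x]`, `kernel[c][i][j]`), assuming stride 1 and no padding. The function name and the toy values are illustrative only. It also checks that convolving each channel separately and then summing gives the same result as one multi-channel pass.

```python
def conv2d_multichannel(image, kernel):
    """Valid, stride-1 convolution: image[c][y][x], kernel[c][i][j]."""
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    k = len(kernel[0])
    out_h, out_w = height - k + 1, width - k + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            total = 0.0
            for c in range(channels):          # sum over R, G, B
                for i in range(k):             # kernel rows
                    for j in range(k):         # kernel columns
                        total += kernel[c][i][j] * image[c][y + i][x + j]
            output[y][x] = total
    return output


# Toy 4x4 image: channel c is filled with the constant c + 1.
image = [[[c + 1] * 4 for _ in range(4)] for c in range(3)]
# One 3x3 slice of ones per channel.
kernel = [[[1] * 3 for _ in range(3)] for _ in range(3)]

combined = conv2d_multichannel(image, kernel)

# Same computation, one channel at a time, then summed.
per_channel = [conv2d_multichannel([image[c]], [kernel[c]]) for c in range(3)]
summed = [[sum(fm[y][x] for fm in per_channel) for x in range(2)]
          for y in range(2)]

print(combined)            # [[54.0, 54.0], [54.0, 54.0]]
print(combined == summed)  # True
```

Each output value is 9 × (1 + 2 + 3) = 54: nine kernel positions per channel, times the constant channel values.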

3D CNN, Step 1: Video frames → 3D volume


8 consecutive frames (t=0…7)

Frames stacked → 3D volume (T × H × W)
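Stacking can be sketched in a few lines. The example assumes single-channel frames stored as nested lists, so the volume is indexed `volume[t][y][x]`; the helper names and the 4 × 5 frame size are placeholders chosen for illustration.

```python
def stack_frames(frames):
    """Stack equally sized H x W frames into a volume indexed volume[t][y][x]."""
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same H x W size")
    # Copy the rows so the volume does not alias the original frames.
    return [[list(row) for row in frame] for frame in frames]


def volume_shape(volume):
    """Return (T, H, W) for a nested-list volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# 8 consecutive 4x5 frames; frame t is filled with the value t.
frames = [[[t] * 5 for _ in range(4)] for t in range(8)]
volume = stack_frames(frames)
print(volume_shape(volume))  # (8, 4, 5)
```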

3D CNN, Step 2: 3D kernel slides T×H×W


How 3D Convolution Works

A video has shape (T × H × W × C). The 3D kernel spans d frames in time. It slides in T, H, and W simultaneously — detecting motion patterns across consecutive frames.

Input: (T, H, W, 3), e.g. (16, 112, 112, 3)
Kernel: (d, k, k, 3), e.g. (3, 3, 3, 3)
Output: (T-d+1, H-k+1, W-k+1) per filter, i.e. (T-d+1, H-2, W-2) for k=3 with stride 1 and no padding
Sliding: T × H × W (left-right, up-down, and through time)
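As a quick check of the output shape, the helper below applies the same arithmetic to the example clip. It is a sketch assuming stride 1 and no padding along all three sliding axes; `conv3d_output_shape` is a hypothetical name.

```python
def conv3d_output_shape(frames, height, width, depth, kernel_size):
    """Per-filter output shape of a stride-1, unpadded 3D convolution."""
    out_t = frames - depth + 1
    out_h = height - kernel_size + 1
    out_w = width - kernel_size + 1
    if min(out_t, out_h, out_w) <= 0:
        raise ValueError("kernel is larger than the input")
    return (out_t, out_h, out_w)


# 16 frames of 112x112, kernel spanning d=3 frames with k=3.
print(conv3d_output_shape(16, 112, 112, depth=3, kernel_size=3))  # (14, 110, 110)
```

Unlike the 2D case, the output keeps a time axis: each filter produces a feature volume rather than a single feature map.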

The 3D kernel detects temporal patterns — "wave crashing", "flame flickering", "cloud drifting" — that are invisible in a single frame.


Why 3D CNN needs multiple frames

A single frame of ocean waves looks similar to a single frame of a still lake. The 3D kernel sees d=3 consecutive frames simultaneously — detecting the wave motion pattern that distinguishes the two.

output[t,y,x] = Σ_c Σ_dt Σ_i Σ_j kernel[c,dt,i,j] × input[c, t+dt, y+i, x+j]
where Σ_dt is the additional sum over the time dimension (dt = 0, 1, 2).
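The formula can be turned into nested loops in the same way as the 2D case. The sketch below uses plain Python lists with the formula's indexing (`clip[c][t][y][x]`, `kernel[c][dt][i][j]`), assuming stride 1 and no padding. To keep the numbers readable, the toy example departs from the d=3, k=3 setting above: it uses one channel, 1 × 4 frames, and a hand-written temporal-difference kernel with d=2 and k=1. All names and values are illustrative assumptions, not learned weights.

```python
def conv3d(clip, kernel):
    """Valid, stride-1 3D convolution: clip[c][t][y][x], kernel[c][dt][i][j]."""
    channels = len(clip)
    frames, height, width = len(clip[0]), len(clip[0][0]), len(clip[0][0][0])
    depth, k_h, k_w = len(kernel[0]), len(kernel[0][0]), len(kernel[0][0][0])
    out_t = frames - depth + 1
    out_h, out_w = height - k_h + 1, width - k_w + 1
    output = [[[0.0] * out_w for _ in range(out_h)] for _ in range(out_t)]
    for t in range(out_t):
        for y in range(out_h):
            for x in range(out_w):
                total = 0.0
                for c in range(channels):
                    for dt in range(depth):        # extra sum over time
                        for i in range(k_h):
                            for j in range(k_w):
                                total += (kernel[c][dt][i][j]
                                          * clip[c][t + dt][y + i][x + j])
                output[t][y][x] = total
    return output


# Two single-channel clips of three 1x4 frames. Their middle frames are identical.
still = [[[[0, 1, 0, 0]], [[0, 1, 0, 0]], [[0, 1, 0, 0]]]]
moving = [[[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 1, 0]]]]

# Temporal difference: next frame minus current frame (d=2, k=1).
temporal_diff = [[[[-1]], [[1]]]]

print(conv3d(still, temporal_diff))
# [[[0.0, 0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0, 0.0]]]
print(conv3d(moving, temporal_diff))
# [[[-1.0, 1.0, 0.0, 0.0]], [[0.0, -1.0, 1.0, 0.0]]]
```

The still clip produces an all-zero response, while the moving clip produces nonzero values that track the bright pixel. A 2D kernel applied to the shared middle frame alone could not tell the two clips apart.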

RGB CNN: R, G, B convolved separately → sum → feature map

3 channel kernels slide H×W independently → outputs summed → one feature map

3D CNN: Frames stacked → 3D kernel slides T×H×W

3D kernel spans d frames, slides through time + space → 3D feature volume

Key Differences

Property	RGB CNN (2D)	3D CNN
Input	(H, W, 3) — one image	(T, H, W, C) — video clip
Kernel	(k, k, 3)	(d, k, k, C) — extra time dim d
Sliding	H and W only (2D)	T, H, and W (3D)
Captures	Spatial: edges, textures, shapes	Spatiotemporal: motion, temporal change
Use cases	Image classification, detection	Video classification, action recognition
Models	VGG, ResNet, EfficientNet	C3D, I3D, SlowFast
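One practical consequence of the kernel shapes in the table is the number of weights per filter. The sketch below simply multiplies out the kernel dimensions, ignoring bias terms; the function names are illustrative.

```python
def weights_per_filter_2d(k, channels):
    """Weights in one (k, k, C) kernel, bias excluded."""
    return k * k * channels


def weights_per_filter_3d(d, k, channels):
    """Weights in one (d, k, k, C) kernel, bias excluded."""
    return d * k * k * channels


print(weights_per_filter_2d(3, 3))     # 27
print(weights_per_filter_3d(3, 3, 3))  # 81
```

With d=3, a 3D filter carries three times the weights of the matching 2D filter, one k × k × C slice per frame it spans.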
