Operational Logic (Standard):
In this setup, the weight register holds f[s] constant while the inner loop streams i[w] and updates o[q]. Each weight is therefore fetched once and reused for every iteration of the inner loop, which minimizes weight memory bandwidth.
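
The loop order described above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: it assumes a 1D convolution with stride 1 and no padding, so that the input index is w = q + s and the output length is Q = W - S + 1. The function name, the weight_reg variable, and the weight_reads counter are hypothetical and exist only to make the reuse visible.

```python
def weight_stationary_conv1d(i, f):
    """1D convolution sketch using a weight-stationary loop order.

    Assumptions (not stated in the source): stride 1, no padding,
    input index w = q + s, output length Q = W - S + 1.
    """
    W, S = len(i), len(f)
    Q = W - S + 1
    o = [0] * Q
    weight_reads = 0  # hypothetical counter: fetches from weight memory

    for s in range(S):            # outer loop: one weight at a time
        weight_reg = f[s]         # weight fetched once and held constant
        weight_reads += 1
        for q in range(Q):        # inner loop: stream inputs, update outputs
            w = q + s             # assumed index relation
            o[q] += i[w] * weight_reg

    return o, weight_reads


if __name__ == "__main__":
    o, reads = weight_stationary_conv1d([1, 2, 3, 4, 5], [1, 0, -1])
    print(o, reads)  # [-2, -2, -2] 3
```

In this sketch the weight memory is read S times in total, once per outer iteration, while each o[q] is read and written S times. That is the usual cost of this ordering: weight traffic is low, and partial-sum traffic is higher.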