Done. Lessons 1 and 2 checked off.
Lesson 3: Matrix Multiplication Kernel
This is the big one. Matmul is the heart of every neural network. When people say "GPUs are good for AI" — they mean GPUs are good at matmul.
Why matmul maps to GPUs: To multiply two matrices, every cell in the output is an independent dot product. A 128x10 output matrix = 1,280 dot products that don't depend on each other. That's 1,280 threads that can run in parallel. CPUs have 8-16 cores. GPUs have thousands.
Let's start small. Before we touch the MNIST-sized matrices, we'll multiply two tiny matrices and verify correctness.
Test case — multiply a 2x3 matrix by a 3x2 matrix:
The result at [row, col] is the dot product of row `row` of A and column `col` of B.
For example: result[0][0] = 1*7 + 2*9 + 3*11 = 58.
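Before writing any GPU code, it helps to have a CPU reference to check against. Here's a minimal JavaScript sketch that multiplies flat, row-major arrays; the full test matrices (beyond the row and column implied by result[0][0] = 58) are my assumption, chosen to be consistent with the example above.

```javascript
// CPU reference matmul: A is MxK, B is KxN, both flat row-major arrays.
function matmulCPU(a, b, M, N, K) {
  const out = new Float32Array(M * N);
  for (let row = 0; row < M; row++) {
    for (let col = 0; col < N; col++) {
      let sum = 0;
      for (let k = 0; k < K; k++) {
        sum += a[row * K + k] * b[k * N + col];
      }
      out[row * N + col] = sum; // row-major: row * N + col
    }
  }
  return out;
}

// Assumed matrices: A = [[1,2,3],[4,5,6]], B = [[7,8],[9,10],[11,12]]
const A = [1, 2, 3, 4, 5, 6];       // 2x3, row-major
const B = [7, 8, 9, 10, 11, 12];    // 3x2, row-major
const R = matmulCPU(A, B, 2, 2, 3); // M=2, N=2, K=3
console.log(Array.from(R));         // [58, 64, 139, 154]
```

R[0] is 58, matching the hand computation above; later, the GPU result buffer can be compared element by element against this reference.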
The shader needs to know:
- Matrix dimensions (M, N, K) — A is MxK, B is KxN, result is MxN
- Which output element this thread computes — that's `@builtin(global_invocation_id)`, a vec3 that tells each thread its unique index
The key idea: each GPU thread computes exactly one element of the output matrix. Thread (row, col) loops over the shared dimension K, accumulating the dot product.
Here's what you need to change from the add shader:
Buffers: Replace the three scalar buffers with three matrix buffers (A, B, result) plus a uniform buffer for dimensions (M, N, K). Uniform buffers are for small read-only data that every thread reads — perfect for dimensions.
Shader: Declare a uniform buffer in WGSL holding the dimensions (M, N, K).
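A minimal sketch of that uniform declaration (the struct name and binding number are my choices; binding 0 pairs with the storage buffers at 1–3):

```wgsl
// Dimensions for the multiply: A is MxK, B is KxN, result is MxN.
struct Dims {
  M : u32,
  N : u32,
  K : u32,
}

@group(0) @binding(0) var<uniform> dims : Dims;
```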
Storage buffers: A, B, and result at bindings 1, 2, 3. All `array<f32>` — matrices stored flat, row-major (row 0 first, then row 1, etc.). To index element `[row][col]` in a matrix with N columns: `row * N + col`.
Workgroup size: Use `@workgroup_size(1)` for now (we can optimize later). But dispatch `dispatchWorkgroups(N, M)` — one thread per output element.
The main function: Gets `@builtin(global_invocation_id) id : vec3<u32>`. Use `id.x` as column, `id.y` as row. Loop `K` times accumulating the dot product.
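Putting those pieces together, here's one possible shape for the whole shader — a sketch, not the only answer; the struct name, bindings, and bounds check are my choices:

```wgsl
struct Dims {
  M : u32,
  N : u32,
  K : u32,
}

@group(0) @binding(0) var<uniform> dims : Dims;
@group(0) @binding(1) var<storage, read> a : array<f32>;
@group(0) @binding(2) var<storage, read> b : array<f32>;
@group(0) @binding(3) var<storage, read_write> result : array<f32>;

@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) id : vec3<u32>) {
  let col = id.x;
  let row = id.y;
  // Guard against threads outside the output matrix.
  if (row >= dims.M || col >= dims.N) {
    return;
  }
  var sum = 0.0;
  for (var k = 0u; k < dims.K; k = k + 1u) {
    // A is row-major MxK, B is row-major KxN.
    sum = sum + a[row * dims.K + k] * b[k * dims.N + col];
  }
  result[row * dims.N + col] = sum;
}
```

With `dispatchWorkgroups(N, M)` and `@workgroup_size(1)`, the guard never fires here, but it becomes necessary once you round the dispatch up for larger workgroup sizes.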
Start fresh — new shader string, new buffers. You can keep the WebGPU init code (adapter + device). Go for it.
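On the host side, the buffer contents are just typed arrays. A small sketch of preparing them before upload (variable names are mine; the assumed test matrices match the 2x3 × 3x2 example above):

```javascript
// Flatten a 2D array into a row-major Float32Array for a storage buffer.
function flatten(matrix) {
  return new Float32Array(matrix.flat());
}

const M = 2, K = 3, N = 2;
const aData = flatten([[1, 2, 3], [4, 5, 6]]);      // MxK
const bData = flatten([[7, 8], [9, 10], [11, 12]]); // KxN
// Uniform buffer contents: three u32 values (M, N, K).
const dimsData = new Uint32Array([M, N, K]);
// Result buffer size: M * N floats, written by the GPU.
const resultBytes = M * N * Float32Array.BYTES_PER_ELEMENT;
```

These arrays go to the GPU via `device.queue.writeBuffer` (or a mapped buffer), the same way the scalar inputs did in the add shader.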