
Matrix Multiplication

Matmul, GEMM
The fundamental mathematical operation underlying all neural networks. Multiplying a weight matrix by an input vector (or matrix) produces an output vector. Every linear layer and every attention computation is ultimately a matrix multiplication, and even an embedding lookup is mathematically equivalent to multiplying a one-hot vector by the embedding table (though implemented as an index into it). The performance of AI hardware (GPUs, TPUs) is judged largely by how fast it can multiply matrices.
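A minimal NumPy sketch of both claims above (NumPy and the toy dimensions are assumptions for illustration): a linear layer is one matrix multiplication, and an embedding lookup gives the same result as multiplying a one-hot vector by the embedding table.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear layer: output = input @ weights (toy dimensions)
W = rng.standard_normal((4, 3))   # weight matrix: 4 inputs -> 3 outputs
x = rng.standard_normal(4)        # input vector
y = x @ W                         # output vector, shape (3,)

# An embedding lookup is equivalent to multiplying by a one-hot vector
E = rng.standard_normal((10, 3))  # embedding table: 10 tokens, dim 3
one_hot = np.zeros(10)
one_hot[7] = 1.0                  # select token id 7
assert np.allclose(one_hot @ E, E[7])  # lookup == one-hot matmul
```

In practice frameworks implement the lookup as a gather, not a matmul, precisely because multiplying by a mostly-zero vector would waste work.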

Why it matters

Understanding that neural networks are just sequences of matrix multiplications (with non-linearities in between) demystifies the entire field. It explains why GPUs are essential (they're parallel matrix multiplication machines), why model size is measured in parameters (the number of values in the weight matrices), and why FLOPs is the unit of compute (it counts the multiply-add operations in these matrix multiplications).

Deep Dive

A linear layer with input dimension 4096 and output dimension 4096 multiplies a (batch_size × 4096) input matrix by a (4096 × 4096) weight matrix, producing a (batch_size × 4096) output. Each output element is the dot product of an input row and a weight column: 4096 multiplications and 4095 additions. For one example, that is 4096 × 4096 ≈ 16.8 million multiply-add operations, and that is a single layer. A 32-layer Transformer performs several such multiplications in every one of its layers (the attention projections, the attention itself, and the feed-forward block).
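The arithmetic above can be checked directly; this sketch (NumPy assumed, batch size chosen arbitrarily) runs the layer and counts the multiply-adds:

```python
import numpy as np

batch, d_in, d_out = 8, 4096, 4096
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, d_in)).astype(np.float32)   # input matrix
W = rng.standard_normal((d_in, d_out)).astype(np.float32)   # weight matrix

y = x @ W                       # (8, 4096) @ (4096, 4096) -> (8, 4096)
assert y.shape == (batch, d_out)

# Each of the batch * d_out outputs is a dot product of length d_in,
# so one example costs d_in * d_out multiply-add operations:
macs_per_example = d_in * d_out
print(macs_per_example)         # 16777216, i.e. ~16.8 million
```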

Why GPUs

Matrix multiplication is "embarrassingly parallel": every output element can be computed independently. A CPU computes them mostly sequentially (fast per element, but serial). A GPU computes thousands simultaneously (slower per element, but massively parallel). An NVIDIA H100 performs on the order of 1000 TFLOP/s of FP16 matrix multiplication, roughly a quadrillion floating-point operations (about half a quadrillion multiply-adds) per second. This parallelism is a central reason deep learning became practical.
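The independence claim is easy to demonstrate: in this sketch (NumPy assumed), each output element is computed in isolation from all the others, mimicking what independent GPU threads would do, and the result matches the library matmul.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# C[i, j] depends only on row i of A and column j of B, so every one of
# the 3 * 5 output elements could be assigned to a separate worker
# (on a GPU: a separate thread) with no coordination between them.
C = np.empty((3, 5))
for i in range(3):
    for j in range(5):
        C[i, j] = A[i, :] @ B[:, j]   # one independent dot product

assert np.allclose(C, A @ B)          # matches the vectorized result
```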

GEMM Optimization

GEMM (General Matrix Multiply) is so central that hardware vendors optimize it obsessively. GPUs devote much of their silicon to it: NVIDIA's Tensor Cores multiply small matrix tiles (for example, 4×4 blocks on early generations) in a single operation. The entire memory hierarchy (registers, shared memory, L1/L2 cache, HBM) is organized to keep data flowing to the matmul units. When people say AI inference is "memory-bandwidth bound," they mean the hardware can multiply faster than it can read the matrices from memory.
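A back-of-envelope sketch of why "memory-bandwidth bound" happens, using the matrix-vector shape that dominates token-by-token inference (the dimensions and FP16 byte size are illustrative assumptions; real hardware ratios vary):

```python
# Arithmetic intensity = useful FLOPs per byte read from memory.
d = 4096
bytes_per_elem = 2                    # FP16 weights

flops = 2 * d * d                     # one multiply + one add per weight
bytes_moved = d * d * bytes_per_elem  # weight-matrix traffic dominates

intensity = flops / bytes_moved       # FLOPs per byte
print(intensity)                      # 1.0
```

One FLOP per byte is far below the hundreds of FLOPs per byte that modern matmul units can sustain relative to HBM bandwidth, so matrix-vector workloads stall waiting on memory, while large batched matmuls (which reuse each weight across many inputs) can keep the compute units busy.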
