Basics

Matrix Multiplication

Matmul, GEMM
The basic mathematical operation underlying every neural network: a weight matrix is multiplied by an input vector (or matrix) to produce an output vector. Every linear layer, every attention operation, and every embedding lookup ultimately reduces to matrix multiplication, and the performance of AI hardware (GPUs, TPUs) is measured by how fast it can multiply matrices.

Why It Matters

Understanding that a neural network is just a sequence of matrix multiplications (with nonlinearities in between) demystifies the entire field. It explains why GPUs are indispensable (they are parallel matrix-multiplication machines), why model size is measured in parameters (the number of values in the weight matrices), and why the FLOP is the unit of compute (it counts the multiply-add operations inside those matrix multiplications).
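A minimal sketch of how both of those counts reduce to counting weight-matrix entries. The dimensions and batch size here are illustrative choices, not from any particular model:

```python
# Parameter count and FLOP count for one linear layer (hypothetical sizes).
d_in, d_out, batch = 4096, 4096, 8

params = d_in * d_out                 # entries in the weight matrix
flops_per_example = 2 * d_in * d_out  # one multiply + one add per entry
flops_per_batch = batch * flops_per_example

print(f"parameters:        {params:,}")             # 16,777,216
print(f"FLOPs per example: {flops_per_example:,}")  # 33,554,432
print(f"FLOPs per batch:   {flops_per_batch:,}")
```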

Deep Dive

A linear layer with input dimension 4096 and output dimension 4096 multiplies a (batch_size × 4096) input matrix by a (4096 × 4096) weight matrix, producing a (batch_size × 4096) output. Each output element is the dot product of an input row and a weight column: 4096 multiplications and 4095 additions. For one example, that's 4096 × 4096 ≈ 16.8 million multiply-add operations. And that is a single layer: a 32-layer Transformer performs several such multiplications in every layer (attention projections, MLP), hundreds per forward pass.
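A NumPy sketch of that layer, using the 4096 × 4096 dimensions from the text; the batch size is an arbitrary choice:

```python
import numpy as np

batch_size, d_in, d_out = 8, 4096, 4096

x = np.random.randn(batch_size, d_in).astype(np.float32)  # input matrix
W = np.random.randn(d_in, d_out).astype(np.float32)       # weight matrix

y = x @ W  # (batch_size x 4096) output

# One output element = dot product of an input row with a weight column.
assert np.allclose(y[0, 0], np.dot(x[0, :], W[:, 0]), rtol=1e-3, atol=1e-1)

# Multiply-add count for a single example: one per weight entry.
print(d_in * d_out)  # 16_777_216, i.e. ~16.8 million
```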

Why GPUs

Matrix multiplication is "embarrassingly parallel": every output element can be computed independently. A CPU computes them sequentially (fast per element, but serial). A GPU computes thousands simultaneously (slower per element, but massively parallel). An NVIDIA H100 performs ~1000 TFLOP/s of FP16 matrix multiplication — roughly 1 quadrillion multiply-adds per second. This parallelism is the entire reason deep learning became practical.
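To see the independence concretely, here is a sketch in which every output element is computed on its own; a GPU spreads exactly these independent dot products across thousands of threads (the Python loop below is sequential and deliberately small, it only illustrates the structure):

```python
import numpy as np

# Each y[i, j] depends only on row i of x and column j of W, so all of them
# can be computed independently of one another.
m, k, n = 64, 128, 32
x = np.random.randn(m, k)
W = np.random.randn(k, n)

y = np.empty((m, n))
for i in range(m):          # every (i, j) pair is an independent dot product
    for j in range(n):
        y[i, j] = np.dot(x[i, :], W[:, j])

assert np.allclose(y, x @ W)  # matches the library matmul
```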

GEMM Optimization

GEMM (General Matrix Multiply) is so central that hardware vendors optimize it obsessively. CUDA cores are designed for matmul. Tensor Cores (NVIDIA) perform 4×4 matrix multiplications in a single clock cycle. The entire memory hierarchy (registers, shared memory, L1/L2 cache, HBM) is organized to keep data flowing to the matmul units. When people say AI inference is "memory-bandwidth bound," they mean the hardware can multiply faster than it can read the matrices from memory.
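A back-of-envelope way to see the "memory-bandwidth bound" claim is to compare arithmetic intensity (FLOPs per byte read) with the chip's balance point. The ~1000 TFLOP/s figure is from the text above; the ~3 TB/s HBM bandwidth is an approximate assumption for illustration:

```python
# Roofline-style estimate for a batch-1 (matrix-vector-like) inference matmul.
PEAK_FLOPS = 1.0e15   # ~1000 TFLOP/s FP16 (from the text)
PEAK_BW = 3.0e12      # ~3 TB/s HBM bandwidth (assumed, rough)
machine_balance = PEAK_FLOPS / PEAK_BW  # FLOPs the chip can do per byte read

d, batch = 4096, 1
flops = 2 * batch * d * d                   # multiply-adds for x @ W
bytes_moved = 2 * (d * d + 2 * batch * d)   # FP16 bytes: weights + input + output

intensity = flops / bytes_moved
print(f"machine balance: {machine_balance:.0f} FLOP/byte")  # ~333
print(f"batch=1 matmul:  {intensity:.1f} FLOP/byte")        # ~1 -> memory-bound
```

Because the weight matrix must be read once regardless of batch size, small-batch inference moves roughly one byte per multiply-add, far below the hundreds of FLOPs per byte the compute units could sustain.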

Related Concepts
