Basics

Matrix Multiplication

Matmul, GEMM
The fundamental mathematical operation underlying every neural network. A weight matrix is multiplied by an input vector (or matrix) to produce an output vector. Every linear layer, every attention computation, every embedding lookup ultimately comes down to matrix multiplication. The performance of AI hardware (GPUs, TPUs) is measured by how fast it can multiply matrices.
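
As a minimal sketch (the shapes below are illustrative only, not tied to any particular model), the whole operation fits in a couple of NumPy lines:

```python
import numpy as np

# Toy shapes for illustration: a batch of 2 inputs with 3 features each,
# mapped to 4 output features by a weight matrix.
x = np.random.randn(2, 3)   # input matrix  (batch_size x in_features)
W = np.random.randn(3, 4)   # weight matrix (in_features x out_features)

y = x @ W                   # the matrix multiplication: one output vector per input row
print(y.shape)              # (2, 4)
```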

Why It Matters

Understanding that a neural network is just a sequence of matrix multiplications (with nonlinearities in between) demystifies the whole field. It explains why GPUs are indispensable (they are parallel matrix-multiplication machines), why model size is measured in parameters (the number of values in the weight matrices), and why the FLOP is the unit of compute (it counts the multiply-add operations inside these matrix multiplications).

Deep Dive

A linear layer with input dimension 4096 and output dimension 4096 multiplies a (batch_size × 4096) input matrix by a (4096 × 4096) weight matrix, producing a (batch_size × 4096) output. Each output element is the dot product of an input row and a weight column: 4096 multiplications and 4095 additions. For one example, that's 4096 × 4096 ≈ 16.8 million multiply-add operations. And that is a single layer. A 32-layer Transformer performs several such multiplications per layer (attention projections, MLP blocks), repeated across all 32 layers.
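
The arithmetic can be checked directly; this small sketch uses the dimensions from the paragraph above and an arbitrary batch size of 1:

```python
import numpy as np

batch_size, d_in, d_out = 1, 4096, 4096

x = np.random.randn(batch_size, d_in).astype(np.float32)   # input  (batch_size x 4096)
W = np.random.randn(d_in, d_out).astype(np.float32)        # weight (4096 x 4096)
y = x @ W                                                   # output (batch_size x 4096)

# Each of the 4096 output elements is a dot product of length 4096:
macs_per_example = d_in * d_out
print(y.shape, f"{macs_per_example:,} multiply-adds")       # (1, 4096) 16,777,216
```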

Why GPUs

Matrix multiplication is "embarrassingly parallel": every output element can be computed independently. A CPU computes them sequentially (fast per element, but serial). A GPU computes thousands simultaneously (slower per element, but massively parallel). An NVIDIA H100 performs ~1000 TFLOP/s of FP16 matrix multiplication, roughly one quadrillion floating-point operations per second. This parallelism is the entire reason deep learning became practical.
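
A sketch that makes the independence explicit (pure-Python loops are used only for clarity; they are far slower than the library call): each output element C[i, j] depends only on row i of A and column j of B, so nothing prevents all of them from being computed at the same time.

```python
import numpy as np

n = 64
A = np.random.randn(n, n)
B = np.random.randn(n, n)
C = np.zeros((n, n))

# Every (i, j) iteration below is independent of every other one.
# A CPU runs them one after another; a GPU assigns each output element
# (or a tile of them) to its own thread and runs thousands at once.
for i in range(n):
    for j in range(n):
        C[i, j] = A[i, :] @ B[:, j]   # one dot product per output element

assert np.allclose(C, A @ B)          # same result as the optimized library call
```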

GEMM Optimization

GEMM (General Matrix Multiply) is so central that hardware vendors optimize it obsessively. CUDA cores handle general-purpose parallel arithmetic, while Tensor Cores (NVIDIA) are dedicated units that perform small matrix multiply-accumulates (4×4 tiles in their first generation) in a single operation. The entire memory hierarchy (registers, shared memory, L1/L2 cache, HBM) is organized to keep data flowing to the matmul units. When people say AI inference is "memory-bandwidth bound," they mean the hardware can multiply faster than it can read the matrices from memory.
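
A rough back-of-the-envelope sketch of the bandwidth point (the compute and bandwidth figures are ballpark, H100-class numbers, not exact specs): at batch size 1, reading the FP16 weight matrix from HBM takes far longer than the arithmetic itself.

```python
d_in = d_out = 4096
batch = 1

flops = 2 * batch * d_in * d_out         # one multiply and one add per output element
weight_bytes = d_in * d_out * 2          # FP16 weights: 2 bytes per value

compute_rate = 1e15                      # ~1000 TFLOP/s of FP16 matmul (ballpark)
hbm_bandwidth = 3e12                     # ~3 TB/s of HBM bandwidth (ballpark)

compute_time = flops / compute_rate          # ~34 nanoseconds of arithmetic
memory_time = weight_bytes / hbm_bandwidth   # ~11 microseconds just to read the weights

print(f"compute: {compute_time*1e9:.0f} ns, memory: {memory_time*1e6:.1f} us")
# The matmul units spend most of their time waiting for weights to arrive:
# this layer, at this batch size, is memory-bandwidth bound.
```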
