Matrix Multiplication: Definition & Meaning — AI Wiki

The fundamental mathematical operation underlying all neural networks. Multiplying a weight matrix by an input vector (or matrix) produces an output vector (or matrix). Every linear layer, every attention computation, and every embedding lookup is ultimately a matrix multiplication (an embedding lookup is equivalent to multiplying a one-hot vector by the embedding table). The performance of AI hardware (GPUs, TPUs) is measured by how fast it can perform matrix multiplications.
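As a minimal sketch of the definition above, the following pure-Python function multiplies two matrices stored as lists of rows, then uses it for a tiny linear layer and for an embedding lookup written as a one-hot multiplication. The function name `matmul` and all matrix values are hypothetical, chosen only for illustration; real frameworks use heavily optimized kernels instead of Python loops.

```python
def matmul(A, B):
    """Multiply A (n x k) by B (k x m); both are lists of rows."""
    n, k, m = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    # Each output element is the dot product of a row of A and a column of B.
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

# Linear layer: one input example (1 x 3) times a hypothetical weight matrix (3 x 2).
X = [[3.0, 4.0, 5.0]]
W = [[1.0, 0.5],
     [0.0, -1.0],
     [2.0, 0.0]]
Y = matmul(X, W)  # [[13.0, -2.5]]

# Embedding lookup: a one-hot row vector selects one row of the table.
E = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]
one_hot = [[0.0, 1.0, 0.0]]      # token id 1
embedding = matmul(one_hot, E)   # same values as E[1]
```

In practice an embedding lookup is implemented as a direct row read, since multiplying by a one-hot vector would waste almost all of the arithmetic; the two are equivalent in result only.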

Why it matters

Understanding that neural networks are just sequences of matrix multiplications (with non-linearities in between) demystifies the entire field. It explains why GPUs are essential (they are parallel matrix-multiplication machines), why model size is measured in parameters (the number of values in the weight matrices), and why FLOPs are the unit of compute (they count the multiplications and additions in those matrix multiplications).
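To make "matrix multiplications with non-linearities in between" concrete, here is a small sketch of a two-layer network and its parameter count. It assumes ReLU as the non-linearity and omits bias terms for brevity; the names `forward` and `count_parameters` and the weight values are hypothetical.

```python
def matmul(A, B):
    """Multiply A (n x k) by B (k x m); both are lists of rows."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def relu(M):
    """Element-wise non-linearity: negative values become zero."""
    return [[max(0.0, v) for v in row] for row in M]

def forward(X, weights):
    """Apply each weight matrix in turn, with ReLU between layers (not after the last)."""
    H = X
    for index, W in enumerate(weights):
        H = matmul(H, W)
        if index < len(weights) - 1:
            H = relu(H)
    return H

def count_parameters(weights):
    """Model size: the total number of values in the weight matrices."""
    return sum(len(W) * len(W[0]) for W in weights)

# Hypothetical weights: a 2 -> 3 layer followed by a 3 -> 1 layer.
W1 = [[1.0, -1.0, 0.5],
      [0.0, 1.0, -1.0]]
W2 = [[2.0], [3.0], [4.0]]

output = forward([[1.0, 2.0]], [W1, W2])   # [[5.0]]
n_params = count_parameters([W1, W2])      # 6 + 3 = 9
```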

Deep Dive

A linear layer with input dimension 4096 and output dimension 4096 multiplies a (batch_size × 4096) input matrix by a (4096 × 4096) weight matrix, producing a (batch_size × 4096) output. Each output element is the dot product of an input row and a weight column: 4096 multiplications and 4095 additions. For a single example, that is 4096 × 4096 ≈ 16.8 million multiply-add operations, and that is just one layer. A 32-layer Transformer performs several matrix multiplications of this scale in every layer (the attention projections and the feed-forward block).
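The arithmetic above can be checked with a short sketch. The helper name `linear_layer_ops` is hypothetical, and the "2 FLOPs per multiply-add" figure is the common counting convention rather than an exact tally; the exact count (4096 multiplications and 4095 additions per output element) is computed alongside for comparison.

```python
def linear_layer_ops(batch_size, d_in, d_out):
    """Return (multiply-adds, approximate FLOPs, exact FLOPs) for one linear layer."""
    macs = batch_size * d_in * d_out
    approx_flops = 2 * macs  # convention: one multiply-add = 2 floating-point operations
    # Exact: each output element needs d_in multiplications and d_in - 1 additions.
    exact_flops = batch_size * d_out * (d_in + (d_in - 1))
    return macs, approx_flops, exact_flops

macs, approx_flops, exact_flops = linear_layer_ops(batch_size=1, d_in=4096, d_out=4096)
print(f"{macs:,} multiply-adds")             # 16,777,216
print(f"{approx_flops:,} FLOPs (approx.)")   # 33,554,432
print(f"{exact_flops:,} FLOPs (exact)")      # 33,550,336
```

The cost scales linearly with batch size, so a batch of 8 examples through the same layer needs 8 times as many operations.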

Why GPUs

Matrix multiplication is "embarrassingly parallel": every output element can be computed independently. A CPU works through them on a small number of cores (fast per element, but with limited parallelism). A GPU computes thousands simultaneously (slower per element, but massively parallel). An NVIDIA H100 performs roughly 1000 TFLOP/s of FP16 matrix multiplication: about 1 quadrillion floating-point operations per second, or half that many multiply-adds, since each multiply-add counts as two operations. This parallelism is a major reason deep learning became practical.
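The independence of output elements can be illustrated with a small sketch that hands each element to a thread pool as a separate task. This is only meant to show that no element depends on another; Python threads will not make it faster, and it is not how a GPU kernel is written. The names `output_element` and `parallel_matmul` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def output_element(A, B, i, j):
    """Dot product of row i of A and column j of B; needs no other output element."""
    return sum(A[i][p] * B[p][j] for p in range(len(B)))

def parallel_matmul(A, B, max_workers=4):
    """Compute every output element as an independent task."""
    n, m = len(A), len(B[0])
    coords = [(i, j) for i in range(n) for j in range(m)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with coords.
        values = list(pool.map(lambda ij: output_element(A, B, ij[0], ij[1]), coords))
    # Reshape the flat list of results back into n rows of m values.
    return [values[i * m:(i + 1) * m] for i in range(n)]

result = parallel_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```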

GEMM Optimization

GEMM (General Matrix Multiply) is so central that hardware vendors optimize it obsessively. CUDA cores are general-purpose arithmetic units that can run matmul, while NVIDIA's Tensor Cores are dedicated matrix units: the first generation performs a 4×4 matrix multiply-accumulate in a single clock cycle, and later generations operate on larger tiles. The entire memory hierarchy (registers, shared memory, L1/L2 cache, HBM) is organized to keep data flowing to the matmul units. When people say AI inference is "memory-bandwidth bound," they mean the hardware can multiply faster than it can read the matrices from memory.
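A rough way to see when a matmul is memory-bandwidth bound is to compare its arithmetic intensity (FLOPs per byte moved) with the hardware's ratio of peak compute to memory bandwidth. The sketch below makes several assumptions: FP16 values (2 bytes each), every matrix read or written exactly once, the roughly 1000 TFLOP/s figure quoted above, and a memory bandwidth of 3 TB/s that is an assumed round number, not a quoted specification. The function names are hypothetical.

```python
PEAK_FLOPS = 1000e12        # ~1000 TFLOP/s, the figure quoted in the text
BANDWIDTH_BYTES = 3e12      # ASSUMPTION: 3 TB/s, a round illustrative value

def matmul_intensity(batch_size, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for (batch x d_in) @ (d_in x d_out), each matrix moved once."""
    flops = 2 * batch_size * d_in * d_out
    values_moved = batch_size * d_in + d_in * d_out + batch_size * d_out
    return flops / (values_moved * bytes_per_value)

def is_memory_bound(batch_size, d_in, d_out):
    """True when the hardware could compute faster than memory can supply data."""
    machine_balance = PEAK_FLOPS / BANDWIDTH_BYTES  # FLOPs the chip can do per byte read
    return matmul_intensity(batch_size, d_in, d_out) < machine_balance

single = matmul_intensity(1, 4096, 4096)      # just under 1 FLOP per byte
batched = matmul_intensity(1024, 4096, 4096)  # about 683 FLOPs per byte
print(is_memory_bound(1, 4096, 4096))     # True: weights are read for very little work
print(is_memory_bound(1024, 4096, 4096))  # False under these assumed numbers
```

Under these assumptions, processing one example at a time reuses each weight only once, which is why small-batch inference tends to be limited by memory reads, while larger batches amortize each weight read over many multiply-adds.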
