Infrastructure

Flash Attention

FlashAttention, FlashAttention-2
A GPU-optimized implementation of the attention mechanism that is 2–4x faster and uses significantly less memory than standard attention. Flash Attention achieves this not by changing what attention computes, but by restructuring how the computation is carried out on GPU hardware, minimizing slow memory transfers between the GPU's HBM and its on-chip SRAM.
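
In practice, most users get Flash Attention through a library kernel rather than writing it themselves. A minimal sketch, assuming PyTorch 2.x and a CUDA GPU, using torch.nn.functional.scaled_dot_product_attention, which can dispatch to a Flash Attention backend when the hardware and dtype allow it (the shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 2, 8 heads, 4096 tokens, head dim 64.
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)

# PyTorch fuses the attention computation and, when supported, routes it
# to a Flash Attention kernel: the 4096x4096 score matrix is never
# materialized in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```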

Why It Matters

Flash Attention is arguably the most impactful systems optimization in modern AI. It made long-context models practical by reducing attention's memory usage from quadratic to near-linear (in practice), directly enabling the jump from 4K to 128K+ context windows. Every major LLM uses it. Without Flash Attention, today's long-context models would be prohibitively expensive.

Deep Dive

The key insight (Dao et al., 2022): standard attention materializes the full N×N attention matrix in GPU HBM (high-bandwidth memory), which is both memory-intensive (quadratic in sequence length) and slow (HBM bandwidth is the bottleneck). Flash Attention never materializes this matrix. Instead, it computes attention in tiles, loading small blocks of Q, K, and V into fast on-chip SRAM, computing partial results, and accumulating them, a combination of techniques known as tiling and kernel fusion.
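
The tile-by-tile accumulation relies on an "online" softmax: a running row maximum and denominator are updated as each block of keys arrives, so earlier partial results are rescaled rather than recomputed. A compact single-head sketch in PyTorch, as an educational reference rather than the fused CUDA kernel (the function name tiled_attention and the block size are illustrative):

```python
import torch

def tiled_attention(Q, K, V, block_size=128):
    """Educational sketch of Flash Attention's tiling: computes exact
    softmax(Q K^T / sqrt(d)) V one key/value block at a time, never
    forming the full N x N score matrix. Real kernels do this in SRAM."""
    N, d = Q.shape
    scale = d ** -0.5
    out = torch.zeros_like(Q)
    # Running max and softmax denominator for each query row.
    row_max = torch.full((N, 1), float("-inf"), dtype=Q.dtype, device=Q.device)
    row_sum = torch.zeros(N, 1, dtype=Q.dtype, device=Q.device)

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]          # load one K/V tile
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                    # scores for this tile only
        new_max = torch.maximum(row_max, S.max(dim=-1, keepdim=True).values)
        # Rescale previously accumulated results to the new running max,
        # then fold in this tile's contribution (the online-softmax trick).
        correction = torch.exp(row_max - new_max)
        P = torch.exp(S - new_max)
        row_sum = row_sum * correction + P.sum(dim=-1, keepdim=True)
        out = out * correction + P @ Vb
        row_max = new_max

    return out / row_sum
```

A quick sanity check against the naive computation:

```python
Q, K, V = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax((Q @ K.T) / 64 ** 0.5, dim=-1) @ V
assert torch.allclose(tiled_attention(Q, K, V), ref, atol=1e-4)
```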

The Memory Savings

Standard attention stores the N×N attention matrix, requiring O(N²) memory. For a 128K-token context that is a 131,072 × 131,072 score matrix per head, roughly 34 GB in fp16, and terabytes once multiplied across 128 heads. Flash Attention uses O(N) memory by computing the softmax incrementally and never storing the full matrix. This is what made 128K–1M context windows feasible on existing hardware. FlashAttention-2 further improved throughput by parallelizing more effectively across GPU thread blocks.
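
A back-of-the-envelope calculation shows the scale of the difference (head count and precision here are illustrative, not tied to a specific model):

```python
# Memory for the attention score matrix at 128K context, one layer,
# fp16 (2 bytes per score), 128 heads.
N, heads, bytes_per = 128 * 1024, 128, 2
full_matrix = N * N * heads * bytes_per          # standard attention
print(f"{full_matrix / 1e12:.1f} TB")            # ~4.4 TB per layer

# Flash Attention instead keeps O(N) running statistics per head
# (a max and a sum for each query row).
running_stats = N * heads * bytes_per * 2
print(f"{running_stats / 1e6:.0f} MB")           # ~67 MB
```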

IO-Aware Algorithm Design

Flash Attention exemplifies a broader principle: on modern hardware, the bottleneck is often memory bandwidth, not compute. GPUs can perform hundreds of trillions of operations per second but can move only a few terabytes per second between HBM and the compute units. Algorithms that minimize memory traffic (even at the cost of extra computation) often win. This "IO-aware" approach is influencing how the entire field thinks about algorithm design for AI workloads.
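
A rough roofline-style estimate makes this concrete; the figures below are illustrative, loosely in the range of an A100-class GPU rather than exact specifications:

```python
# Illustrative accelerator specs (roughly A100-class; not exact).
flops_per_s = 312e12       # ~312 TFLOP/s fp16 tensor-core compute
hbm_bytes_per_s = 2e12     # ~2 TB/s HBM bandwidth

# To keep the compute units busy, each byte read from HBM must feed
# at least this many floating-point operations:
break_even_intensity = flops_per_s / hbm_bytes_per_s
print(break_even_intensity)  # ~156 FLOPs per byte

# Standard attention writes the N x N score matrix to HBM and reads it
# back for the softmax and the P @ V product, so its FLOPs-per-byte
# ratio is far below break-even and the kernel is bandwidth-bound.
# Tiling keeps those intermediates in SRAM, raising arithmetic
# intensity without changing the math.
```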
