
Transformer

The neural network architecture behind virtually all modern LLMs and many image/audio models. Introduced by Google in the 2017 paper "Attention Is All You Need," Transformers use self-attention to process all parts of an input simultaneously rather than sequentially, enabling massive parallelism during training.
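The core of that self-attention mechanism can be sketched as scaled dot-product attention: every position's query is compared against every position's key at once, and the resulting weights mix the value vectors. This is a minimal, single-head NumPy illustration (the function and weight names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X.

    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projections.
    All positions are processed in one batch of matrix multiplies --
    the parallelism described above.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted mix of value vectors

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 8, 4, 5
X = rng.normal(size=(seq_len, d_model))
out = self_attention(X,
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)))
print(out.shape)  # (5, 4): one d_k-dimensional output per input position
```

Real Transformers run many such heads in parallel and stack the results through feed-forward layers, but the attention step itself is just these three matrix products plus a softmax.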

Why it matters

Transformers are the architecture that made the current AI boom possible. GPT, Claude, Gemini, Llama, Mistral — they're all Transformers under the hood. Understanding this architecture helps you understand why models have the capabilities and limitations they do.
