Basics

Large Language Model

Also known as: LLM
A neural network trained on massive amounts of text to understand and generate human language. "Large" refers to the parameter count (measured in billions) and the scale of the training data (measured in trillions of tokens). Claude, GPT, Gemini, Llama, and Mistral are all LLMs.

Why It Matters

LLMs are the technology behind every AI chatbot, code assistant, and text generator you use. Understanding what they are (statistical pattern matchers, not conscious beings) helps you use them more effectively and see their limits more clearly.

Deep Dive

At its core, an LLM is a function that takes a sequence of tokens and outputs a probability distribution over the next token. That is the entire trick. During training, the model sees trillions of tokens of text and adjusts its billions of parameters to get better at predicting what comes next. When you chat with Claude or GPT, the model generates one token at a time, each time feeding its own previous output back in as input. This autoregressive process is why you see responses streaming in word by word — the model genuinely does not know what it will say next until it gets there.
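The loop described above can be sketched in a few lines. This is a minimal, model-agnostic illustration: `model_logits_fn` is a hypothetical stand-in for a real network that scores every vocabulary token given the context.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def generate(model_logits_fn, prompt_tokens, max_new_tokens, rng):
    """Autoregressive decoding: score the context, sample one token,
    append it, and repeat -- the model's own output becomes its input."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model_logits_fn(tokens)      # one score per vocab token
        probs = softmax(logits)               # probability of each next token
        next_token = rng.choice(len(probs), p=probs)
        tokens.append(int(next_token))        # feed the sample back in
    return tokens
```

Sampling from the distribution (rather than always taking the most likely token) is why the same prompt can yield different responses; temperature and top-p are just ways of reshaping `probs` before the draw.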

The Transformer Backbone

Most modern LLMs are built on the Transformer architecture, introduced by Google researchers in 2017. The Transformer's key innovation is the attention mechanism, which lets the model look at every other token in the input when deciding what a given token means. This solves a problem that plagued earlier architectures (RNNs, LSTMs): they struggled with long-range dependencies because information had to flow sequentially through every intermediate step. Attention lets a model directly connect "it" in paragraph five to "the database server" in paragraph one, regardless of how much text sits between them. Some newer architectures like Mamba use state-space models instead of attention, trading some flexibility for much better efficiency on long sequences, but Transformers remain the dominant paradigm for the largest models.
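The attention mechanism itself is compact. Below is a sketch of single-head scaled dot-product attention (the core operation from the 2017 Transformer paper), without the masking and multi-head plumbing a real model adds:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query row attends to every key row; the output for each
    position is a weighted average of the value rows."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    # Softmax over the key axis: weights for each query sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Because `scores` compares every position with every other position directly, "it" in paragraph five can attend to "the database server" in paragraph one in a single step; the price is the quadratic cost in sequence length that architectures like Mamba try to avoid.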

Why Scale Matters

The "Large" in LLM is doing real work. Scale turns out to matter in ways researchers did not fully expect. A 1-billion-parameter model can handle basic grammar and simple facts. A 70-billion-parameter model can write working code and reason through multi-step problems. The largest models (hundreds of billions of parameters, trained on trillions of tokens) exhibit emergent capabilities — skills that appear suddenly at scale rather than improving gradually. Chain-of-thought reasoning, multilingual transfer, and in-context learning are all capabilities that only reliably show up once models cross certain size thresholds. This scaling behavior is described by "scaling laws" that relate model size, dataset size, and compute budget to performance in surprisingly predictable ways.
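The predictability is concrete enough to write down. A widely used form is the Chinchilla fit (Hoffmann et al., 2022), where loss decomposes into an irreducible term plus power-law penalties for finite model size and finite data; the constants below are the published fit and should be read as approximate:

```python
def chinchilla_loss(n_params, n_tokens):
    """Approximate Chinchilla scaling law: predicted pre-training loss
    as a function of parameter count N and training tokens D.
    Constants are the fit reported by Hoffmann et al. (2022)."""
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta
```

Plugging in a 1B-parameter model trained on 20B tokens versus a 70B model trained on 1.4T tokens shows the expected loss dropping substantially with scale, which is exactly the kind of prediction labs use to budget compute before training.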

From Predictor to Assistant

After pre-training, raw LLMs are not particularly useful to talk to — they just want to complete text, so they might continue your question with more questions instead of answering. This is where alignment comes in. Techniques like RLHF (reinforcement learning from human feedback) and constitutional AI train the model to be helpful, harmless, and honest rather than just a text predictor. This is the difference between a base model (like raw Llama) and a chat model (like Claude or ChatGPT). The base model has the knowledge; alignment teaches it how to use that knowledge in a conversation.
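One visible artifact of this alignment step is the chat template: conversations are serialized into a structured text format the model was fine-tuned on. The delimiter tokens below are hypothetical (real formats differ per model family), but the shape is representative:

```python
def to_chat_prompt(messages):
    """Serialize a conversation into a (hypothetical) chat template.
    A base model would just continue this text; an aligned chat model
    has been trained to produce a helpful reply in the assistant slot."""
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}\n")
    parts.append("<|assistant|>\n")   # generation starts here
    return "".join(parts)
```

This is why sending raw unformatted text to a chat model (or a chat template to a base model) degrades output: the model is pattern-matching against the exact format it saw during fine-tuning.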

The Reliability Gap

A practical gotcha that catches many developers: LLMs do not "know" things the way a database does. They have encoded statistical patterns from training data, which means they can confidently state things that are subtly or completely wrong — hallucination. They also have a knowledge cutoff date and cannot access real-time information unless given tools. The best practitioners treat LLMs as very capable but unreliable collaborators: great for drafting, brainstorming, and code generation, but requiring verification for factual claims. Retrieval-augmented generation (RAG), structured output parsing, and tool use are the engineering patterns that make LLM-powered applications reliable in production.
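The core of the RAG pattern is simple: embed the documents and the query, rank by similarity, and put the winners into the prompt so the model answers from retrieved text rather than from its parameters. A minimal sketch, assuming embeddings are already computed as vectors:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarity per document
    top = np.argsort(scores)[::-1][:k]    # indices of the k best matches
    return [docs[i] for i in top]

def build_prompt(question, query_vec, doc_vecs, docs):
    """Ground the answer in retrieved text instead of parametric memory."""
    context = "\n".join(retrieve(query_vec, doc_vecs, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Production systems add a vector database, chunking, and reranking on top of this, but the reliability win is the same: the model's claims can be checked against the retrieved sources.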
