Basics

Large Language Model

Also known as: LLM
A neural network trained on massive amounts of text to understand and generate human language. The "Large" refers to the parameter count (measured in billions) and the scale of the training data (measured in trillions of tokens). Claude, GPT, Gemini, Llama, and Mistral are all LLMs.

Why It Matters

LLMs are the technology behind every AI chat, coding assistant, and text generator you use. Understanding what they are (statistical pattern matchers, not conscious beings) helps you use them more effectively and see their limits more clearly.

Deep Dive

At its core, an LLM is a function that takes a sequence of tokens and outputs a probability distribution over the next token. That is the entire trick. During training, the model sees trillions of tokens of text and adjusts its billions of parameters to get better at predicting what comes next. When you chat with Claude or GPT, the model generates one token at a time, each time feeding its own previous output back in as input. This autoregressive process is why you see responses streaming in word by word — the model genuinely does not know what it will say next until it gets there.
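The autoregressive loop above can be sketched with a toy "model" — here just a hand-written lookup table mapping the previous token to a next-token distribution, which is an assumption for illustration; a real LLM computes that distribution with billions of parameters, but the generate-sample-feed-back interface is the same:

```python
import random

# Toy stand-in for a language model: previous token -> probability
# distribution over the next token. (Hand-written for illustration;
# a real LLM computes this distribution from the full context.)
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
}

def generate(max_tokens=10, seed=0):
    """Autoregressive generation: sample one token at a time,
    feeding each sampled token back in as the next input."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        choices, weights = zip(*dist.items())
        nxt = rng.choices(choices, weights=weights)[0]
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the <start> marker

print(" ".join(generate()))
```

The model never "plans" the whole sentence: each step only produces the next token, which is exactly why responses stream in one piece at a time.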

The Transformer Backbone

Most modern LLMs are built on the Transformer architecture, introduced by Google researchers in 2017. The Transformer's key innovation is the attention mechanism, which lets the model look at every other token in the input when deciding what a given token means. This solves a problem that plagued earlier architectures (RNNs, LSTMs): they struggled with long-range dependencies because information had to flow sequentially through every intermediate step. Attention lets a model directly connect "it" in paragraph five to "the database server" in paragraph one, regardless of how much text sits between them. Some newer architectures like Mamba use state-space models instead of attention, trading some flexibility for much better efficiency on long sequences, but Transformers remain the dominant paradigm for the largest models.
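The attention mechanism described above reduces to a short computation: each output is a weighted average of all value vectors, with weights set by how well a token's query matches every other token's key. A minimal pure-Python sketch of scaled dot-product attention (toy 2-dimensional vectors; real models use hundreds of dimensions and many parallel heads):

```python
import math

def softmax(xs):
    """Numerically stable softmax: turns scores into weights summing to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention.

    For each query vector, score it against every key, normalize the
    scores with softmax, and return the weighted average of the value
    vectors. This is how a token can 'look at' every other token
    directly, regardless of distance in the sequence.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# A query aligned with the first key attends mostly to the first value:
print(attention([[1.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[10.0, 0.0], [0.0, 10.0]]))
```

The division by sqrt(d) is the "scaled" part: it keeps dot products from growing with dimension and saturating the softmax.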

Why Scale Matters

The "Large" in LLM is doing real work. Scale turns out to matter in ways researchers did not fully expect. A 1-billion-parameter model can handle basic grammar and simple facts. A 70-billion-parameter model can write working code and reason through multi-step problems. The largest models (hundreds of billions of parameters, trained on trillions of tokens) exhibit emergent capabilities — skills that appear suddenly at scale rather than improving gradually. Chain-of-thought reasoning, multilingual transfer, and in-context learning are all capabilities that only reliably show up once models cross certain size thresholds. This scaling behavior is described by "scaling laws" that relate model size, dataset size, and compute budget to performance in surprisingly predictable ways.
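The "surprisingly predictable" relationship can be made concrete. A sketch using the loss formula L(N, D) = E + A/N^α + B/D^β from the Chinchilla scaling-law paper (Hoffmann et al., 2022), with that paper's fitted constants; treat the numbers as illustrative rather than authoritative:

```python
def chinchilla_loss(n_params, n_tokens):
    """Predicted pre-training loss L(N, D) = E + A/N^alpha + B/D^beta.

    Constants are the fitted values reported in the Chinchilla paper;
    E is the irreducible loss floor, and the two power-law terms shrink
    as model size (N) and data size (D) grow.
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Bigger models trained on more data predictably lose less:
small = chinchilla_loss(1e9, 20e9)     # ~1B params, 20B tokens
large = chinchilla_loss(70e9, 1.4e12)  # ~70B params, 1.4T tokens
print(round(small, 2), round(large, 2))
```

Note the tension with emergence: the loss curve is smooth, yet specific capabilities can still appear abruptly, because a small improvement in next-token loss can cross the threshold where a multi-step skill starts working end to end.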

From Predictor to Assistant

After pre-training, raw LLMs are not particularly useful to talk to — they just want to complete text, so they might continue your question with more questions instead of answering. This is where alignment comes in. Techniques like RLHF (reinforcement learning from human feedback) and constitutional AI train the model to be helpful, harmless, and honest rather than just a text predictor. This is the difference between a base model (like raw Llama) and a chat model (like Claude or ChatGPT). The base model has the knowledge; alignment teaches it how to use that knowledge in a conversation.
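One place the base-vs-chat distinction shows up mechanically is the chat template: a chat-tuned model is trained on conversations serialized into a specific plain-text format, and it learns that text after the assistant marker should be an answer, not more completion. A sketch with a hypothetical template (the markers below are made up; real models such as Llama or ChatGPT each define their own special tokens):

```python
def to_chat_prompt(messages):
    """Serialize a conversation into the flat text a chat-tuned model
    actually sees. The <|role|> markers here are hypothetical — every
    model family defines its own — but the idea is universal: alignment
    training conditions the model to answer after the assistant marker.
    """
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    parts.append("<|assistant|>\n")  # generation starts here
    return "".join(parts)

print(to_chat_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]))
```

Feed the same user text to a base model without this structure and it may well continue with more questions, because plain completion is all it was trained to do.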

The Reliability Gap

A practical gotcha that catches many developers: LLMs do not "know" things the way a database does. They have encoded statistical patterns from training data, which means they can confidently state things that are subtly or completely wrong — hallucination. They also have a knowledge cutoff date and cannot access real-time information unless given tools. The best practitioners treat LLMs as very capable but unreliable collaborators: great for drafting, brainstorming, and code generation, but requiring verification for factual claims. Retrieval-augmented generation (RAG), structured output parsing, and tool use are the engineering patterns that make LLM-powered applications reliable in production.
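The RAG pattern mentioned above can be sketched in a few lines. This toy retriever ranks documents by word overlap with the query — an assumption for brevity; production systems use embedding similarity — but the shape of the pattern is the same: fetch relevant text, then instruct the model to answer only from it:

```python
def retrieve(query, documents, k=2):
    """Toy retriever: rank documents by word overlap with the query.
    (Real RAG systems use vector embeddings and a similarity index.)"""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query, documents):
    """Assemble a prompt that grounds the model in retrieved context,
    reducing hallucination and sidestepping the knowledge cutoff."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer:")

docs = ["The database server runs Postgres 16.",
        "Cats sleep a lot.",
        "Backups run nightly at 02:00."]
print(build_rag_prompt("when do database backups run", docs))
```

The key design choice is that the model is no longer asked to recall facts from its weights; it is asked to read, which is a task it is far more reliable at.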
