Fundamentals

Layer

Hidden Layer, Neural Network Layer
A group of neurons in a neural network that processes data at a particular level of abstraction. The input layer receives raw data. Hidden layers (the ones in between) learn increasingly abstract representations. The output layer produces the final result. The "deep" in deep learning means many hidden layers: modern LLMs have 32 to 128+ layers.

Why It Matters

Layers create the hierarchy that makes deep learning powerful. Early layers learn simple patterns (edges in images, word fragments in text). Middle layers combine these into concepts (faces, phrases). Late layers combine concepts into high-level understanding (scene recognition, reasoning). A network's depth determines the complexity of the patterns it can learn.

Deep Dive

In a Transformer, each layer (called a "block") consists of two sub-layers: a multi-head attention layer (which mixes information across tokens) and a feedforward network (which processes each token independently). Each sub-layer has a residual connection (the input is added back to the output) and normalization. A 32-layer Transformer applies this attention+FFN pattern 32 times, each time refining the representation.
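
To make the structure concrete, here is a minimal sketch of one such block in PyTorch. The dimensions (d_model=512, 8 heads, d_ff=2048) and the pre-norm arrangement are illustrative assumptions, not specifics from this article:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One Transformer layer: an attention sub-layer and a feedforward
    sub-layer, each with a residual connection and normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: multi-head attention mixes information across tokens.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out  # residual: the input is added back to the output
        # Sub-layer 2: the FFN processes each token independently.
        x = x + self.ffn(self.norm2(x))
        return x

# A 32-layer Transformer applies this attention+FFN pattern 32 times.
model = nn.Sequential(*[TransformerBlock() for _ in range(32)])
tokens = torch.randn(1, 16, 512)  # (batch, sequence length, model dimension)
print(model(tokens).shape)        # torch.Size([1, 16, 512])
```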

What Happens Across Layers

Research has revealed a rough pattern in LLMs: early layers handle syntax and surface patterns, middle layers handle semantic meaning and entity recognition, and late layers handle task-specific reasoning and output formatting. This isn't a hard boundary — information flows through all layers via residual connections — but it explains why some fine-tuning techniques only modify certain layers and why pruning middle layers often hurts more than pruning early or late ones.
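
As a rough illustration of layer-selective fine-tuning, the sketch below freezes all but the last few blocks of a toy 32-layer stack. The stand-in linear layers and the choice of 4 trainable layers are assumptions for demonstration, not a specific technique from this article:

```python
import torch.nn as nn

# Toy stand-in: a stack of 32 identical blocks (plain linear layers here,
# just to show the freezing pattern; a real model would use Transformer blocks).
layers = nn.ModuleList([nn.Linear(512, 512) for _ in range(32)])

N_TRAINABLE = 4  # assumption for illustration: tune only the last 4 layers
for i, layer in enumerate(layers):
    trainable = i >= len(layers) - N_TRAINABLE
    for p in layer.parameters():
        p.requires_grad = trainable  # frozen layers get no gradient updates

# Count the parameters that will actually be updated during fine-tuning.
print(sum(p.numel() for p in layers.parameters() if p.requires_grad))
```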

Width vs. Depth

A network's "width" is the number of neurons per layer (the model dimension). Its "depth" is the number of layers. Both matter, but they contribute differently: wider layers can represent more features simultaneously, while deeper networks can learn more complex, compositional patterns. Modern LLMs tend to be both wide (dimensions of 4096–8192) and deep (32–128 layers). Scaling laws suggest that width and depth should be scaled together for optimal performance.
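
To see how the two dimensions contribute to model size, here is a back-of-the-envelope parameter count, assuming the common d_ff = 4 × d_model convention (attention ≈ 4·d², FFN ≈ 8·d² per layer, so roughly 12·d² per layer). The (depth, width) pairs are hypothetical examples:

```python
# Parameter count grows linearly with depth but quadratically with width.
def approx_params(depth: int, width: int) -> int:
    # ~12 * d_model^2 parameters per Transformer layer (attention + FFN),
    # ignoring embeddings, norms, and biases.
    return depth * 12 * width ** 2

for depth, width in [(32, 4096), (64, 6144), (128, 8192)]:
    billions = approx_params(depth, width) / 1e9
    print(f"{depth} layers x width {width}: ~{billions:.1f}B params")
# 32 layers x width 4096: ~6.4B params
# 64 layers x width 6144: ~29.0B params
# 128 layers x width 8192: ~103.1B params
```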
