Fundamentals

Layer

Hidden Layer, Neural Network Layer
A group of neurons that processes data at a specific level of abstraction in a neural network. The input layer receives raw data. The hidden layers (the ones in the middle) learn increasingly abstract representations. The output layer produces the final result. "Deep" learning means many hidden layers; modern LLMs have 32 to 128+ layers.
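As a minimal sketch of this input → hidden → output structure, here is a tiny feedforward network in NumPy. The layer sizes (8 inputs, two hidden layers of 16 neurons, 3 outputs) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    """One dense layer: linear transform followed by a ReLU activation."""
    return np.maximum(0, x @ w + b)

# Arbitrary toy sizes: 8 raw input features, two hidden layers of
# 16 neurons each, and a 3-value output.
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)   # input -> hidden 1
w2, b2 = rng.normal(size=(16, 16)), np.zeros(16)  # hidden 1 -> hidden 2
w3, b3 = rng.normal(size=(16, 3)), np.zeros(3)    # hidden 2 -> output

x = rng.normal(size=(1, 8))   # one sample of raw input data
h1 = layer(x, w1, b1)         # first hidden representation
h2 = layer(h1, w2, b2)        # a more abstract representation
out = h2 @ w3 + b3            # output layer (no activation here)
print(out.shape)              # (1, 3)
```

Each call to `layer` is one level of the hierarchy: the output of one layer becomes the input of the next.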

Why it matters

Layers create the hierarchy that makes deep learning powerful. Early layers learn simple patterns (edges in images, word fragments in text). Middle layers combine these into concepts (faces, phrases). Deep layers combine concepts into high-level understanding (scene recognition, reasoning). A network's depth determines the complexity of the patterns it can learn.

Deep Dive

In a Transformer, each layer (called a "block") consists of two sub-layers: a multi-head attention layer (which mixes information across tokens) and a feedforward network (which processes each token independently). Each sub-layer has a residual connection (the input is added back to the output) and normalization. A 32-layer Transformer applies this attention+FFN pattern 32 times, each time refining the representation.

What Happens Across Layers

Research has revealed a rough pattern in LLMs: early layers handle syntax and surface patterns, middle layers handle semantic meaning and entity recognition, and late layers handle task-specific reasoning and output formatting. This isn't a hard boundary — information flows through all layers via residual connections — but it explains why some fine-tuning techniques only modify certain layers and why pruning middle layers often hurts more than pruning early or late ones.

Width vs. Depth

A network's "width" is the number of neurons per layer (the model dimension). Its "depth" is the number of layers. Both matter, but they contribute differently: wider layers can represent more features simultaneously, while deeper networks can learn more complex, compositional patterns. Modern LLMs tend to be both wide (dimensions of 4096–8192) and deep (32–128 layers). Scaling laws suggest that width and depth should be scaled together for optimal performance.
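The different scaling behavior of width and depth can be seen in a back-of-envelope parameter count for the Transformer block above (ignoring embeddings, biases, and normalization parameters; the 4x feedforward expansion is a common but not universal choice):

```python
def transformer_params(depth, width, ffn_mult=4):
    """Rough per-block parameter count: four width x width attention
    projections plus two feedforward matrices of width x (ffn_mult * width),
    multiplied by the number of layers."""
    attn = 4 * width * width
    ffn = 2 * width * ffn_mult * width
    return depth * (attn + ffn)

base = transformer_params(depth=32, width=4096)
print(base)                                   # ~6.4B parameters
print(transformer_params(depth=64, width=4096))   # 2x depth -> 2x params
print(transformer_params(depth=32, width=8192))   # 2x width -> 4x params
```

Parameters grow linearly with depth but quadratically with width, which is one reason scaling both together tends to be more efficient than scaling either alone.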
