Basics

Encoder

Encoder Network, Feature Extractor
A neural network component that transforms input data into a compressed, information-rich representation (an encoding). In a Transformer, the encoder processes the full input with bidirectional attention, producing contextualized representations. In an autoencoder, the encoder compresses the input into a latent bottleneck. In image generation, a VAE encoder maps images into a latent space. The encoder is the "understanding" half of many architectures.
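The core idea above can be sketched in a few lines: an encoder maps a high-dimensional input to a smaller latent code. This is a minimal illustration with random, untrained weights and toy dimensions (8 → 3), not a real trained model.

```python
# Minimal "encoder" sketch: compress an input vector into a smaller latent
# code. Weights are random placeholders, not trained values.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W, b):
    """One linear layer plus a tanh nonlinearity: input -> latent code."""
    return np.tanh(W @ x + b)

input_dim, latent_dim = 8, 3            # bottleneck: 8 dims in, 3 dims out
W = rng.normal(size=(latent_dim, input_dim)) * 0.1
b = np.zeros(latent_dim)

x = rng.normal(size=input_dim)          # a toy input vector
z = encode(x, W, b)                     # the compressed representation
print(z.shape)                          # (3,)
```

A real encoder stacks many such layers (or attention blocks) and learns the weights, but the shape of the computation is the same: wide input in, compact representation out.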

Why It Matters

Encoders are everywhere: BERT is an encoder, CLIP has a text encoder and an image encoder, Stable Diffusion has a VAE encoder, and RAG systems use encoder models to produce embeddings. Understanding what an encoder does (compressing input into a useful representation) helps you understand all of these systems. The quality of the encoding determines the quality of everything downstream.

Deep Dive

In a Transformer encoder (BERT, the left half of T5), every token attends to every other token bidirectionally. This means the representation of the word "bank" incorporates information from both "river" (left context) and "fishing" (right context) simultaneously. This bidirectional attention is why encoder representations are richer than decoder (left-to-right only) representations for understanding tasks.
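The bidirectional attention described above can be made concrete: with no mask, every token's output mixes information from every position, left and right. This sketch uses random stand-ins for the learned Q/K/V projections and a 5-token toy sequence; only the masking behavior is the point.

```python
# Bidirectional (encoder-style) self-attention: no mask, so each token
# attends to ALL positions. Q, K, V are random stand-ins for learned
# projections of token embeddings.
import numpy as np

rng = np.random.default_rng(1)
seq_len, d = 5, 4                       # e.g. a 5-token sentence

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))

attn = softmax(Q @ K.T / np.sqrt(d))    # (seq_len, seq_len), no causal mask
out = attn @ V                          # each row blends values from every position

# Every weight is positive: a middle token like "bank" receives context
# from tokens both before and after it.
print((attn > 0).all())                 # True
```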

Encoder vs. Decoder

The key distinction: encoders process input (understanding), decoders generate output (creation). Encoders see everything at once (bidirectional). Decoders see only past tokens (causal/left-to-right). This is why encoder models (BERT) are better for classification and search, while decoder models (GPT, Claude) are better for generation. Encoder-decoder models (T5, BART) use an encoder for input understanding and a decoder for output generation, connected by cross-attention.
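The causal masking that distinguishes decoders can be shown by modifying the same attention computation: future positions are set to negative infinity before the softmax, so token i can only attend to tokens at positions ≤ i. Dimensions and weights are again toy random values.

```python
# Causal (decoder-style) self-attention: mask out future positions so each
# token sees only itself and earlier tokens. Random toy Q and K.
import numpy as np

rng = np.random.default_rng(2)
seq_len, d = 5, 4

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))

scores = Q @ K.T / np.sqrt(d)
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # strictly upper triangle
scores[future] = -np.inf                # block attention to future tokens
attn = softmax(scores)

# All weight above the diagonal is zero: no information flows from the future.
print(np.allclose(np.triu(attn, k=1), 0.0))  # True
```

Removing this one mask is, mechanically, the whole difference between a decoder block's self-attention and an encoder block's.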

Encoders in Multimodal Systems

Multimodal systems typically use separate encoders for each modality: a vision encoder (ViT) for images, a text encoder (BERT/CLIP) for text, and potentially audio encoders for speech. These produce embeddings in a shared space, enabling cross-modal understanding. The quality of each encoder determines how well the system understands that modality. This is why CLIP's training (aligning image and text encoders) was so impactful — it created a bridge between vision and language understanding.
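The shared-space idea can be sketched as follows, under a large simplifying assumption: both "encoders" here are just random linear projections into a common embedding dimension, whereas CLIP trains real networks so that matching image/text pairs land close together. The mechanics of comparing modalities via cosine similarity are the same.

```python
# CLIP-style cross-modal matching sketch: project image features and text
# features into one shared space, normalize, and compare by dot product.
# Both "encoders" are hypothetical random projections, not trained models.
import numpy as np

rng = np.random.default_rng(3)
shared_dim = 6

def normalize(v):
    """Scale each row to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

W_img = rng.normal(size=(shared_dim, 10))   # stand-in image encoder (10-dim features)
W_txt = rng.normal(size=(shared_dim, 7))    # stand-in text encoder (7-dim features)

img_feats = rng.normal(size=(3, 10))        # 3 toy images
txt_feats = rng.normal(size=(3, 7))         # 3 toy captions

img_emb = normalize(img_feats @ W_img.T)    # (3, shared_dim), unit vectors
txt_emb = normalize(txt_feats @ W_txt.T)

sim = img_emb @ txt_emb.T                   # (3, 3) cosine-similarity matrix
print(sim.shape)                            # (3, 3)
```

In a trained system, row i of `sim` would peak at column i for matching pairs; that alignment objective is exactly what CLIP's contrastive training optimizes.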
