Core Concepts

Encoder

Encoder Network, Feature Extractor
A neural network component that converts input data into a compressed, information-rich representation (an encoding). In Transformers, the encoder uses bidirectional attention to process the entire input and produce contextual representations. In autoencoders, the encoder compresses the input into a latent bottleneck. In image generation, a VAE encoder maps images into a latent space. Encoders are the “understanding” half of many architectures.
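The compression idea above can be sketched in a few lines. This is a toy, untrained encoder (random weights, assumed dimensions), not a real autoencoder; it only shows the shape of the operation: a high-dimensional input mapped to a smaller latent code.

```python
import numpy as np

# A minimal sketch of an encoder as a dimensionality-reducing map:
# input (dim 8) -> latent "bottleneck" (dim 3). Weights are random
# here; in a real autoencoder they would be learned jointly with a
# decoder that reconstructs the input from the latent code.
rng = np.random.default_rng(0)

def encode(x, W, b):
    # Linear projection followed by a nonlinearity (tanh).
    return np.tanh(x @ W + b)

input_dim, latent_dim = 8, 3
W = rng.normal(size=(input_dim, latent_dim))
b = np.zeros(latent_dim)

x = rng.normal(size=(input_dim,))
z = encode(x, W, b)
print(z.shape)  # (3,): the compressed representation
```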

Why it matters

Encoders are everywhere: BERT is an encoder, CLIP has a text encoder and an image encoder, Stable Diffusion has a VAE encoder, and RAG systems use encoder models to produce embeddings. Understanding what an encoder does (compressing an input into a useful representation) helps you understand all of these systems. The quality of the encoding determines the quality of everything downstream.

Deep Dive

In a Transformer encoder (BERT, the left half of T5), every token attends to every other token bidirectionally. This means the representation of the word "bank" incorporates information from both "river" (left context) and "fishing" (right context) simultaneously. This bidirectional attention is why encoder representations are richer than decoder (left-to-right only) representations for understanding tasks.
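The bidirectional mixing described above can be made concrete with a stripped-down self-attention step. This is a toy sketch (random vectors, Q = K = V = X, no learned projections or multi-head structure), but it shows the key property: with no mask, every token's attention weights over all positions, left and right, are nonzero.

```python
import numpy as np

# Toy bidirectional self-attention: every token's output mixes
# information from ALL tokens, because no mask restricts attention.
rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Q = K = V = X for simplicity; real layers use learned projections.
    scores = X @ X.T / np.sqrt(X.shape[-1])   # (seq, seq) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ X, weights

seq_len, d = 4, 8   # imagine tokens: "river", "bank", "for", "fishing"
X = rng.normal(size=(seq_len, d))
out, attn = self_attention(X)

# Token 1 ("bank") attends to tokens both before AND after it:
print(attn[1])  # all four weights are strictly positive
```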

Encoder vs. Decoder

The key distinction: encoders process input (understanding), decoders generate output (creation). Encoders see everything at once (bidirectional). Decoders see only past tokens (causal/left-to-right). This is why encoder models (BERT) are better for classification and search, while decoder models (GPT, Claude) are better for generation. Encoder-decoder models (T5, BART) use an encoder for input understanding and a decoder for output generation, connected by cross-attention.
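Mechanically, the bidirectional-vs-causal distinction comes down to the attention mask. A small sketch (toy sequence length, assumed masking convention where disallowed scores are set to negative infinity before the softmax):

```python
import numpy as np

# Encoder: full attention pattern, position i can attend to every j.
# Decoder: causal (lower-triangular) mask, position i sees only j <= i.
seq_len = 4
encoder_mask = np.ones((seq_len, seq_len), dtype=bool)           # bidirectional
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal

# In attention, masked-out positions get a score of -inf before the
# softmax, so their attention weight becomes exactly zero.
scores = np.zeros((seq_len, seq_len))
causal_scores = np.where(decoder_mask, scores, -np.inf)

print(decoder_mask.astype(int))
# Row i has i+1 ones: token i can see tokens 0..i, never the future.
```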

Encoders in Multimodal Systems

Multimodal systems typically use separate encoders for each modality: a vision encoder (ViT) for images, a text encoder (BERT/CLIP) for text, and potentially audio encoders for speech. These produce embeddings in a shared space, enabling cross-modal understanding. The quality of each encoder determines how well the system understands that modality. This is why CLIP's training (aligning image and text encoders) was so impactful — it created a bridge between vision and language understanding.
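The shared-space idea can be sketched with two toy encoders. Everything here is hypothetical (random weights, made-up dimensions); the point is the structure CLIP-style systems share: separate per-modality encoders whose outputs are L2-normalized into one space where cosine similarity compares across modalities.

```python
import numpy as np

# Sketch of CLIP-style cross-modal embedding (toy, untrained encoders):
# each modality has its own encoder, but both project into a shared
# embedding space where cosine similarity is meaningful.
rng = np.random.default_rng(2)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def toy_encoder(x, W):
    # Stand-in for a real image/text encoder: project, then L2-normalize.
    return normalize(np.tanh(x @ W))

shared_dim = 16
W_image = rng.normal(size=(32, shared_dim))  # "vision encoder" weights
W_text = rng.normal(size=(24, shared_dim))   # "text encoder" weights

image_emb = toy_encoder(rng.normal(size=(3, 32)), W_image)  # 3 images
text_emb = toy_encoder(rng.normal(size=(2, 24)), W_text)    # 2 captions

# Cosine similarity between every caption and every image; CLIP's
# contrastive training pushes matched pairs toward high similarity.
sims = text_emb @ image_emb.T
print(sims.shape)  # (2, 3)
```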

Related Concepts
