Core Concepts

BPE

Byte Pair Encoding, Subword Tokenization
The most common algorithm for building tokenizer vocabularies. BPE starts from individual bytes or characters and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, common words become single tokens (“the”, “function”) while rare words are split into subword pieces (“un” + “common”). Used by GPT, Claude, Llama, and most modern LLMs.

Why It Matters

BPE is the reason your tokenizer behaves the way it does. It explains why common words are cheap (one token), why rare words are expensive (several tokens), and why non-English text costs more (non-English character pairs get fewer merges allocated to them). Understanding BPE helps you predict token counts, optimize prompts, and see why different tokenizers produce different results for the same text.
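
To get a feel for this, a quick sketch like the one below prints how many tokens a few words cost. It uses OpenAI's tiktoken package and its cl100k_base encoding purely as an example; the exact splits differ from tokenizer to tokenizer.

# Sketch: count tokens for a few words (assumes `pip install tiktoken`).
# cl100k_base is just one example encoding; splits vary across tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "function", "uncommonness", "antidisestablishmentarianism"]:
    ids = enc.encode(word)                      # token ids for this word
    pieces = [enc.decode([i]) for i in ids]     # the subword piece behind each id
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")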

Deep Dive

The algorithm: (1) start with a base vocabulary of individual bytes (256 entries) or characters, (2) scan the training corpus and count every adjacent pair of tokens, (3) merge the most frequent pair into a new token and add it to the vocabulary, (4) repeat steps 2–3 until the vocabulary reaches the target size (typically 32K–128K). The merge order defines a priority: "th" might be merge #50 while "ing" is merge #200, meaning "th" is a more fundamental unit in this tokenizer.
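
A minimal sketch of that loop, run over characters of a toy word-frequency dictionary for readability (production tokenizers run the same steps over raw bytes and far larger corpora; all names here are illustrative):

# Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def get_pair_counts(corpus):
    """Count every adjacent pair of symbols across all words."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a {word: frequency} dict."""
    # Step 1: base vocabulary = individual characters (bytes in byte-level BPE).
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):                # Step 4: repeat until the target size
        counts = get_pair_counts(corpus)       # Step 2: count adjacent pairs
        if not counts:
            break
        best = counts.most_common(1)[0][0]     # Step 3: merge the most frequent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)                    # merge order doubles as merge priority
    return merges

merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
print(merges)  # on this toy corpus the first merges come out as ('e','s'), ('es','t'), ...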

SentencePiece

SentencePiece (Google) is a popular BPE implementation that treats the input as a raw stream of text rather than as pre-tokenized words. This means it can handle any language without language-specific preprocessing — no need for word segmentation in Chinese or morphological analysis in Turkish. Most modern LLMs use SentencePiece or a byte-level BPE variant. The alternative, WordPiece (used by BERT), is similar but uses a slightly different merge criterion (it scores candidate merges by likelihood gain rather than raw pair frequency).
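
If you want to try it, a rough sketch of training and loading a small SentencePiece BPE model might look like the following (assumes `pip install sentencepiece`; the file name corpus.txt and the vocab_size are placeholders):

# Sketch: train a small SentencePiece BPE model on a raw text file.
# No language-specific pre-tokenization is needed; SentencePiece works on raw text.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder: raw text, one sentence per line
    model_prefix="bpe_demo",   # writes bpe_demo.model and bpe_demo.vocab
    vocab_size=8000,           # placeholder target vocabulary size
    model_type="bpe",          # learn BPE merges (a unigram model is the alternative)
)

sp = spm.SentencePieceProcessor(model_file="bpe_demo.model")
print(sp.encode("Byte pair encoding needs no word segmentation.", out_type=str))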

The Training Corpus Matters

BPE merges reflect the training corpus's statistics. A tokenizer trained on English text and code gets efficient merges for "function," "return," and "const" but fragments Hindi or Arabic text. This is why multilingual tokenizers need balanced training corpora — the merge table must allocate enough merges to every language's common patterns. Llama 3's tokenizer was explicitly trained on more balanced multilingual data, improving non-English token efficiency by 2–3x compared to Llama 2.
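
One way to check this on whichever tokenizer you use is to compare tokens per character across languages. A rough sketch, again using tiktoken's cl100k_base purely as an example encoding and with illustrative sample sentences:

# Sketch: compare tokenizer efficiency across languages as tokens per character.
# A higher ratio means the merge table has fewer merges for that script, so the
# same amount of text costs more tokens. Samples and encoding are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The function returns a constant value.",
    "Hindi":   "यह फ़ंक्शन एक स्थिर मान लौटाता है।",
    "Arabic":  "هذه الدالة ترجع قيمة ثابتة.",
}

for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang:8s} {n_tokens:3d} tokens, {n_tokens / len(text):.2f} tokens/char")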
