Using AI

Language Detection

Language Identification, LangID
Automatically identifying which language a text is written in. “Olá mundo” → Portuguese. “こんにちは世界” → Japanese. Modern models can distinguish 100+ languages, often from only a few words, handle mixed-language text (code-switching), and tell apart closely related languages (Norwegian vs. Danish, Malay vs. Indonesian).

Why It Matters

Language detection is the essential first step in any multilingual pipeline: you need to know what language the input is in before you can translate it, route it to the right model, or apply language-specific processing. It is used in search engines, customer support routing, content moderation, and any system that handles text from users around the world.

Deep Dive

Simple approaches use character n-gram statistics: each language has characteristic character sequences ("th" is common in English, "tion" in French, "ung" in German), and the relative frequencies of these sequences form a fingerprint. FastText's language identification model combines character n-grams with a shallow neural network and can identify 176 languages with high accuracy from as little as one sentence. For very short text (a few words), accuracy drops because a handful of n-grams carries too little signal.
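The character n-gram idea can be sketched in a few lines of plain Python. This is a minimal illustration, not how FastText works internally: it builds a trigram frequency profile per language from tiny hand-written samples (the `SAMPLES` texts and all function names here are assumptions made for the example) and picks the language whose profile has the highest cosine similarity to the input.

```python
import math
from collections import Counter

# Tiny illustrative samples (hypothetical). Real systems train on large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "this is what they think about the weather",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "die zeitung bringt eine meldung über die regierung",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et "
          "la nation attend une solution de la direction",
}


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padded so word edges count."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def profile(text, n=3):
    """Build a unit-length n-gram frequency vector as a dict."""
    counts = Counter(char_ngrams(text, n))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {gram: c / norm for gram, c in counts.items()}


PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}


def score(text):
    """Cosine similarity between the input profile and each language profile."""
    query = profile(text)
    return {
        lang: sum(w * prof.get(gram, 0.0) for gram, w in query.items())
        for lang, prof in PROFILES.items()
    }


def detect(text):
    """Return the language code with the highest similarity score."""
    scores = score(text)
    return max(scores, key=scores.get)
```

With samples this small the sketch only works on text that resembles the training sentences, which also illustrates why very short inputs are hard: fewer n-grams means fewer chances to overlap with any profile.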

Hard Cases

Some languages are extremely difficult to distinguish: Serbian, Croatian, and Bosnian share most vocabulary and grammar, and Serbian is written in both Cyrillic and Latin script, so script alone does not settle the question. Simplified vs. Traditional Chinese requires examining specific character choices. Short ambiguous text like "no" could be English, Spanish, Italian, or Portuguese. Code-switched text ("I went to the tienda to buy leche") mixes languages within a sentence. Robust systems handle these edge cases by reporting statistical confidence scores rather than forcing a hard classification.
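One way to act on confidence scores instead of a hard label is to refuse to commit when the top candidate is weak or too close to the runner-up. The sketch below assumes the detector already returns a probability per language; the function name and the two thresholds are illustrative choices, not values taken from any particular library. It uses "und", the ISO 639 code for an undetermined language, as the fallback label.

```python
def classify_with_confidence(probs, min_prob=0.6, min_margin=0.2):
    """Pick a language only when the evidence is strong enough.

    probs: dict mapping language code -> probability (assumed to sum to ~1).
    min_prob and min_margin are hypothetical thresholds; tune them on real data.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    top_lang, top_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0

    # Too weak, or too close to call: report "undetermined" and keep candidates.
    if top_p < min_prob or (top_p - runner_up_p) < min_margin:
        return {"language": "und", "candidates": ranked[:3]}
    return {"language": top_lang, "candidates": ranked[:3]}
```

For an input like "no", a detector might spread its probability across English, Spanish, Italian, and Portuguese; the function then returns "und" with the top candidates attached, so downstream code can ask for more context or fall back to a default.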

In Practice

For most applications, Google's CLD3, FastText's lid.176.bin, or the langdetect Python library provide sufficient accuracy. LLMs can also detect language as a side effect of their training, though using a 70B model for language detection is like using a chainsaw to cut butter. The practical architecture: fast language detection first (FastText, <1ms), then route to language-specific processing.
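The detect-then-route architecture can be sketched as a small dispatcher. Everything named here (`make_router`, `toy_detect`, the handler labels, the 0.5 threshold) is hypothetical; in a real system the detector would wrap a library such as FastText, CLD3, or langdetect, and the toy detector below exists only so the example runs without a model file.

```python
def make_router(detect, handlers, fallback, min_confidence=0.5):
    """Build a function that detects the language, then dispatches.

    detect:   callable text -> (language_code, confidence)
    handlers: dict language_code -> callable text -> result
    fallback: callable used for unknown languages or low confidence
    """
    def route(text):
        lang, confidence = detect(text)
        handler = handlers.get(lang) if confidence >= min_confidence else None
        return (handler or fallback)(text)
    return route


def toy_detect(text):
    """Stand-in detector (an assumption for this sketch, not a real model).

    A production version might call a loaded FastText model's predict()
    and strip the "__label__" prefix from the returned label.
    """
    # Hiragana/katakana code points are a strong signal for Japanese.
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return ("ja", 0.99)
    # Pretend everything else is English, with low confidence for one word.
    confidence = 0.9 if len(text.split()) >= 2 else 0.4
    return ("en", confidence)


handlers = {
    "ja": lambda t: f"ja-pipeline:{t}",
    "en": lambda t: f"en-pipeline:{t}",
}
route = make_router(toy_detect, handlers, fallback=lambda t: f"generic:{t}")
```

Calling `route("こんにちは世界")` dispatches to the Japanese handler, while a one-word input like `route("no")` falls below the confidence threshold and goes to the generic fallback. Keeping the detector as a plain callable makes it easy to swap in a real model later without touching the routing logic.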
