Using AI

Language Detection

Language Identification, LangID
Automatically identifying which language a text is written in. "Hola mundo" → Spanish. "こんにちは世界" → Japanese. Modern models can distinguish 100+ languages, often from just a few words, handle mixed-language text (code-switching), and tell apart closely related languages (Norwegian vs. Danish, Malay vs. Indonesian).

Why It Matters

Language detection is the essential first step in any multilingual pipeline: you need to know which language the input is in before you can translate it, route it to the right model, or apply language-specific processing. It is used in search engines, customer support routing, content moderation, and any system that handles text from users around the world.

Deep Dive

Simple approaches use character n-gram statistics: each language has distinctive character patterns ("th" is common in English, "tion" in French, "ung" in German). FastText's language identification model feeds character n-grams into a shallow neural network and can identify 176 languages with high accuracy from as little as one sentence. On very short text (a few words), accuracy drops because there are too few n-grams to provide a reliable signal.
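To make the n-gram idea concrete, here is a minimal sketch of a character-trigram classifier. It is not how FastText works internally; it is a plain smoothed frequency model, and the training sentences, function names, and the choice of trigrams are all illustrative assumptions.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_profiles(samples, n=3):
    """Build one n-gram frequency table per language."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def score(text, profile, vocab_size, n=3):
    """Log-likelihood of the text under one profile, with add-one smoothing."""
    total = sum(profile.values())
    return sum(
        math.log((profile[gram] + 1) / (total + vocab_size))
        for gram in char_ngrams(text, n)
    )


def detect(text, profiles, n=3):
    """Pick the language whose profile gives the text the highest score."""
    vocab_size = len(set().union(*profiles.values()))
    return max(profiles, key=lambda lang: score(text, profiles[lang], vocab_size, n))


# Toy training data (hypothetical; real systems train on large corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the other thing",
    "de": "der schnelle braune fuchs springt über den faulen hund und die zeitung",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y la otra cosa",
}
PROFILES = train_profiles(SAMPLES)

print(detect("the other dog", PROFILES))
```

With so little training text the model is fragile, which mirrors the point about short inputs: fewer n-grams means less evidence on both the training and the prediction side.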

Hard Cases

Some languages are extremely difficult to tell apart: Serbian, Croatian, and Bosnian share most of their vocabulary and grammar, and Serbian itself is written in both Cyrillic and Latin script. Simplified vs. Traditional Chinese requires examining specific character choices. Short ambiguous text like "no" could be English, Spanish, Italian, or Portuguese. Code-switched text ("I went to the tienda to buy leche") mixes languages within a sentence. Robust systems handle these edge cases by reporting statistical confidence scores rather than forcing a single hard classification.
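One way to act on confidence scores is to accept the top language only when it is both confident and clearly ahead of the runner-up, and otherwise return "und" (the ISO 639 code for an undetermined language). The sketch below assumes a detector that returns (language, probability) pairs; the function name and both thresholds are hypothetical and would need tuning on real data.

```python
def decide(candidates, min_conf=0.8, min_margin=0.2):
    """Turn scored candidates into a decision, or "und" when the evidence is weak.

    candidates: list of (lang, probability) pairs from any detector.
    Returns (decision, ranked_candidates).
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return "und", []
    top_lang, top_p = ranked[0]
    runner_up_p = ranked[1][1] if len(ranked) > 1 else 0.0
    # Require both an absolute confidence and a clear lead over the runner-up.
    if top_p >= min_conf and top_p - runner_up_p >= min_margin:
        return top_lang, ranked
    return "und", ranked


# A clear case versus an ambiguous short input such as "no".
print(decide([("en", 0.95), ("es", 0.03)])[0])
print(decide([("es", 0.40), ("en", 0.35), ("it", 0.25)])[0])
```

Returning the ranked list alongside the decision lets downstream code fall back to other signals (user locale, surrounding messages) instead of discarding the ambiguous input.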

In Practice

For most applications, Google's CLD3, FastText's lid.176.bin, or the langdetect Python library provide sufficient accuracy. LLMs can also detect language as a side effect of their training, though using a 70B model for language detection is like using a chainsaw to cut butter. A practical architecture runs fast language detection first (FastText, under 1 ms per input), then routes the text to language-specific processing.
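The detect-then-route architecture can be sketched as a small dispatcher. The detector here is a toy stand-in so the example runs without any model file; in a real system it would be replaced by a wrapper around a library such as FastText, CLD3, or langdetect. The names make_router and toy_detect, the handler strings, and the 0.5 threshold are all assumptions for illustration.

```python
from typing import Callable, Dict, Tuple

Detector = Callable[[str], Tuple[str, float]]  # text -> (language code, confidence)
Handler = Callable[[str], str]                 # text -> processed result


def make_router(detect: Detector, handlers: Dict[str, Handler],
                fallback: Handler, min_conf: float = 0.5) -> Handler:
    """Build a function that detects the language, then dispatches to a handler."""
    def route(text: str) -> str:
        lang, conf = detect(text)
        # Low-confidence or unsupported languages go to the fallback handler.
        handler = handlers.get(lang) if conf >= min_conf else None
        return (handler or fallback)(text)
    return route


def toy_detect(text: str) -> Tuple[str, float]:
    """Stand-in detector (hypothetical); swap in a real library in practice."""
    if any("\u3040" <= ch <= "\u30ff" for ch in text):  # hiragana or katakana
        return "ja", 0.99
    if "hola" in text.lower():
        return "es", 0.90
    return "en", 0.30  # deliberately unsure


router = make_router(
    toy_detect,
    handlers={
        "ja": lambda t: f"[ja pipeline] {t}",
        "es": lambda t: f"[es pipeline] {t}",
    },
    fallback=lambda t: f"[generic pipeline] {t}",
)

print(router("Hola mundo"))
print(router("こんにちは世界"))
print(router("no"))
```

Keeping the detector behind a plain callable means the routing logic does not change when the detection library does.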
