Simple approaches use character n-gram statistics: each language has distinctive character patterns ("th" is common in English, "tion" in French, "ung" in German). FastText's language identification model uses character n-grams with a shallow neural network and can identify 176 languages with high accuracy from as little as one sentence. For very short text (a few words), accuracy drops because there's not enough signal.
Some language pairs are extremely difficult to distinguish: Serbian (Cyrillic) vs. Serbian (Latin) vs. Croatian vs. Bosnian share most vocabulary and grammar. Simplified vs. Traditional Chinese requires examining specific character choices. Short ambiguous text like "no" could be English, Spanish, Italian, or Portuguese. Code-switched text ("I went to the tienda to buy leche") mixes languages within a sentence. Robust systems handle these edge cases through statistical confidence scores rather than hard classification.
For most applications, Google's CLD3, FastText's lid.176.bin, or the langdetect Python library provide sufficient accuracy. LLMs can also detect language as a side effect of their training, though using a 70B model for language detection is like using a chainsaw to cut butter. The practical architecture: fast language detection first (FastText, <1ms), then route to language-specific processing.