
Text-to-Speech

TTS, Speech Synthesis, Voice AI
Converts written text into natural-sounding spoken audio. Modern TTS systems use neural networks to generate speech that is nearly indistinguishable from human voices, with control over emotion, pacing, emphasis, and even cloning of specific voices. ElevenLabs, OpenAI TTS, and open models like Bark and XTTS have made high-quality voice synthesis widely accessible.

Why it matters

TTS completes the voice AI loop: speech recognition converts speech to text, an LLM processes it, and TTS converts the response back into speech. This enables voice assistants, audiobook narration, accessibility tools, content localization, and AI characters in games and media. Modern TTS quality has crossed the uncanny valley: synthesized speech now sounds natural.
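The loop above can be sketched as three stages wired together. This is a minimal illustration, not a real implementation: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stubs standing in for an actual STT engine, LLM, and TTS backend.

```python
def transcribe(audio: bytes) -> str:
    """STT stub: a real system would run a speech recognizer here."""
    return "what's the weather today?"

def generate_reply(text: str) -> str:
    """LLM stub: a real system would call a language model here."""
    return f"You asked: {text} It looks sunny."

def synthesize(text: str) -> bytes:
    """TTS stub: a real system would return audio samples here."""
    return text.encode("utf-8")  # placeholder standing in for a waveform

def voice_turn(user_audio: bytes) -> bytes:
    """One conversational turn: speech in, speech out."""
    text = transcribe(user_audio)
    reply = generate_reply(text)
    return synthesize(reply)

response_audio = voice_turn(b"...mic capture...")
```

In a production system each stage would be a network or model call, which is why the latency discussion below matters: the three stages run in series, so their delays add up.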

Deep Dive

Modern TTS typically works in two stages: a text-to-spectrogram model (converting text to a visual representation of audio frequencies) and a vocoder (converting the spectrogram to actual audio waveforms). Some newer approaches are end-to-end, directly generating audio tokens from text using Transformer-based architectures similar to LLMs but operating on audio tokens instead of text tokens.
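The two-stage split can be made concrete with a deliberately toy sketch: stage one maps text to frames of frequency energies (standing in for a mel-spectrogram), and stage two inverts those frames into a waveform. Both functions are illustrative; real systems use trained neural networks for each stage (e.g. a neural vocoder in place of the sinusoid sum here).

```python
import math

SAMPLE_RATE = 16_000
HOP = 256  # audio samples produced per spectrogram frame

def text_to_spectrogram(text: str) -> list[list[float]]:
    """Stage 1 (toy): one frame of 80 frequency-bin energies per
    character. A real acoustic model predicts mel-spectrogram frames
    from phonemes with a neural network."""
    frames = []
    for ch in text:
        energies = [0.0] * 80
        energies[ord(ch) % 80] = 1.0  # one active bin, purely illustrative
        frames.append(energies)
    return frames

def vocoder(frames: list[list[float]]) -> list[float]:
    """Stage 2 (toy): expand each frame into HOP samples by summing
    sinusoids weighted by the frame's energies. Real vocoders learn
    this spectrogram-to-waveform inversion."""
    waveform = []
    for frame in frames:
        for n in range(HOP):
            t = n / SAMPLE_RATE
            sample = sum(
                e * math.sin(2 * math.pi * (100 + 50 * k) * t)
                for k, e in enumerate(frame) if e > 0
            )
            waveform.append(sample)
    return waveform

spec = text_to_spectrogram("hello")
audio = vocoder(spec)  # 5 frames * 256 samples per frame
```

The end-to-end approaches mentioned above collapse these two stages: a single Transformer predicts discrete audio tokens directly, and a codec decoder turns tokens into waveforms.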

Voice Cloning

Voice cloning creates a synthetic version of a specific person's voice from a short audio sample (sometimes as little as 15 seconds). This enables personalization, dubbing, and preserving voices of people who have lost the ability to speak. It also creates obvious risks: impersonation, fraud, and non-consensual voice replication. Most providers implement consent verification and watermarking to mitigate misuse.
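One common cloning recipe is conditioning: a speaker encoder compresses the short reference clip into a fixed-size embedding, and the TTS decoder is conditioned on that embedding so its output matches the reference voice. The sketch below is a hypothetical stand-in, not any provider's API; the "encoder" here just averages samples into buckets, where a real encoder is a network trained so clips of the same speaker map to nearby embeddings.

```python
def speaker_embedding(reference_audio: list[float], dim: int = 4) -> list[float]:
    """Toy speaker encoder: average the reference samples into `dim`
    fixed-size buckets to get a constant-length vector."""
    chunk = max(1, len(reference_audio) // dim)
    return [
        sum(reference_audio[i:i + chunk]) / chunk
        for i in range(0, chunk * dim, chunk)
    ]

def synthesize_with_voice(text: str, embedding: list[float]) -> dict:
    """Toy conditioned synthesis: a real model feeds the embedding into
    its decoder at every step to shape timbre and speaking style."""
    return {"text": text, "voice": embedding}

ref = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # stands in for a ~15 s clip
emb = speaker_embedding(ref)
out = synthesize_with_voice("Hello there", emb)
```

Because the embedding is small and fixed-size, cloning a new voice needs no retraining, which is exactly what makes both the legitimate uses and the misuse risks described above so cheap.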

The Latency Challenge

For conversational AI, TTS latency matters as much as quality. A user asking a voice assistant a question expects a response within 1–2 seconds. Full TTS generation can take longer, so streaming TTS (generating and playing audio in chunks as the LLM produces text) is essential. The pipeline — STT + LLM + TTS — must stay under ~2 seconds total for natural conversation, which constrains model sizes and infrastructure choices.
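A common way to implement the streaming described above is to buffer the LLM's token stream until a sentence boundary, then synthesize that sentence while the LLM keeps generating. The sketch below uses hypothetical stubs (`llm_stream`, `synthesize_chunk`) in place of a real streaming LLM client and TTS engine; the chunking logic in `stream_response` is the part being illustrated.

```python
from typing import Iterator

def llm_stream() -> Iterator[str]:
    """Stub: yields reply text incrementally, like a streaming LLM API."""
    yield "Sure, "
    yield "the weather is sunny. "
    yield "Highs near 25 degrees."

def synthesize_chunk(text: str) -> bytes:
    """Stub: a real TTS engine would return audio for this chunk."""
    return text.encode("utf-8")

def stream_response() -> Iterator[bytes]:
    """Buffer tokens until a sentence ends, then synthesize that
    sentence. Sentence-sized chunks keep prosody natural while the
    first audio starts playing long before the full reply exists."""
    buffer = ""
    for token in llm_stream():
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield synthesize_chunk(buffer)
            buffer = ""
    if buffer:  # flush any trailing partial sentence
        yield synthesize_chunk(buffer)

chunks = list(stream_response())  # audio becomes available chunk by chunk
```

Perceived latency is then set by the time to the first chunk, not the whole reply, which is what keeps the pipeline inside the ~2-second conversational budget.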
