
Text-to-Speech

TTS, Speech Synthesis, Voice AI
Converts written text into natural-sounding spoken audio. Modern TTS systems use neural networks to generate speech that is nearly indistinguishable from human voices, with control over emotion, pacing, emphasis, and even cloning of specific voices. ElevenLabs, OpenAI TTS, and open models like Bark and XTTS have made high-quality voice synthesis widely accessible.

Why it matters

TTS completes the voice AI loop: speech recognition converts speech to text, an LLM processes it, and TTS converts the response back into speech. This enables voice assistants, audiobook narration, accessibility tools, content localization, and AI characters in games and media. Modern TTS quality has crossed the uncanny valley: synthesized speech now sounds natural.
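The loop above can be sketched as three stages wired together. This is a minimal illustration, not a real implementation: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stubs standing in for an actual STT engine, LLM, and TTS backend.

```python
def transcribe(audio: bytes) -> str:
    """STT stub: a real system would run a speech recognizer here."""
    return "what's the weather today?"

def generate_reply(text: str) -> str:
    """LLM stub: a real system would call a language model here."""
    return f"You asked: {text} It looks sunny."

def synthesize(text: str) -> bytes:
    """TTS stub: a real system would return audio samples here."""
    return text.encode("utf-8")  # placeholder standing in for a waveform

def voice_turn(user_audio: bytes) -> bytes:
    """One conversational turn: speech in, speech out."""
    text = transcribe(user_audio)
    reply = generate_reply(text)
    return synthesize(reply)

response_audio = voice_turn(b"...mic capture...")
```

In a production system each stage would be a network or model call, which is why the latency discussion below matters: the three stages run in series, so their delays add up.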

Deep Dive

Modern TTS typically works in two stages: a text-to-spectrogram model (converting text to a visual representation of audio frequencies) and a vocoder (converting the spectrogram to actual audio waveforms). Some newer approaches are end-to-end, directly generating audio tokens from text using Transformer-based architectures similar to LLMs but operating on audio tokens instead of text tokens.
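The two-stage split can be made concrete with a deliberately toy sketch: stage one maps text to frames of frequency energies (standing in for a mel-spectrogram), and stage two inverts those frames into a waveform. Both functions are illustrative; real systems use trained neural networks for each stage (e.g. a neural vocoder in place of the sinusoid sum here).

```python
import math

SAMPLE_RATE = 16_000
HOP = 256  # audio samples produced per spectrogram frame

def text_to_spectrogram(text: str) -> list[list[float]]:
    """Stage 1 (toy): one frame of 80 frequency-bin energies per
    character. A real acoustic model predicts mel-spectrogram frames
    from phonemes with a neural network."""
    frames = []
    for ch in text:
        energies = [0.0] * 80
        energies[ord(ch) % 80] = 1.0  # one active bin, purely illustrative
        frames.append(energies)
    return frames

def vocoder(frames: list[list[float]]) -> list[float]:
    """Stage 2 (toy): expand each frame into HOP samples by summing
    sinusoids weighted by the frame's energies. Real vocoders learn
    this spectrogram-to-waveform inversion."""
    waveform = []
    for frame in frames:
        for n in range(HOP):
            t = n / SAMPLE_RATE
            sample = sum(
                e * math.sin(2 * math.pi * (100 + 50 * k) * t)
                for k, e in enumerate(frame) if e > 0
            )
            waveform.append(sample)
    return waveform

spec = text_to_spectrogram("hello")
audio = vocoder(spec)  # 5 frames * 256 samples per frame
```

The end-to-end approaches mentioned above collapse these two stages: a single Transformer predicts discrete audio tokens directly, and a codec decoder turns tokens into waveforms.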

Voice Cloning

Voice cloning creates a synthetic version of a specific person's voice from a short audio sample (sometimes as little as 15 seconds). This enables personalization, dubbing, and preserving voices of people who have lost the ability to speak. It also creates obvious risks: impersonation, fraud, and non-consensual voice replication. Most providers implement consent verification and watermarking to mitigate misuse.
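One common cloning recipe is conditioning: a speaker encoder compresses the short reference clip into a fixed-size embedding, and the TTS decoder is conditioned on that embedding so its output matches the reference voice. The sketch below is a hypothetical stand-in, not any provider's API; the "encoder" here just averages samples into buckets, where a real encoder is a network trained so clips of the same speaker map to nearby embeddings.

```python
def speaker_embedding(reference_audio: list[float], dim: int = 4) -> list[float]:
    """Toy speaker encoder: average the reference samples into `dim`
    fixed-size buckets to get a constant-length vector."""
    chunk = max(1, len(reference_audio) // dim)
    return [
        sum(reference_audio[i:i + chunk]) / chunk
        for i in range(0, chunk * dim, chunk)
    ]

def synthesize_with_voice(text: str, embedding: list[float]) -> dict:
    """Toy conditioned synthesis: a real model feeds the embedding into
    its decoder at every step to shape timbre and speaking style."""
    return {"text": text, "voice": embedding}

ref = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # stands in for a ~15 s clip
emb = speaker_embedding(ref)
out = synthesize_with_voice("Hello there", emb)
```

Because the embedding is small and fixed-size, cloning a new voice needs no retraining, which is exactly what makes both the legitimate uses and the misuse risks described above so cheap.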

The Latency Challenge

For conversational AI, TTS latency matters as much as quality. A user asking a voice assistant a question expects a response within 1–2 seconds. Full TTS generation can take longer, so streaming TTS (generating and playing audio in chunks as the LLM produces text) is essential. The pipeline — STT + LLM + TTS — must stay under ~2 seconds total for natural conversation, which constrains model sizes and infrastructure choices.
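A common way to implement the streaming described above is to buffer the LLM's token stream until a sentence boundary, then synthesize that sentence while the LLM keeps generating. The sketch below uses hypothetical stubs (`llm_stream`, `synthesize_chunk`) in place of a real streaming LLM client and TTS engine; the chunking logic in `stream_response` is the part being illustrated.

```python
from typing import Iterator

def llm_stream() -> Iterator[str]:
    """Stub: yields reply text incrementally, like a streaming LLM API."""
    yield "Sure, "
    yield "the weather is sunny. "
    yield "Highs near 25 degrees."

def synthesize_chunk(text: str) -> bytes:
    """Stub: a real TTS engine would return audio for this chunk."""
    return text.encode("utf-8")

def stream_response() -> Iterator[bytes]:
    """Buffer tokens until a sentence ends, then synthesize that
    sentence. Sentence-sized chunks keep prosody natural while the
    first audio starts playing long before the full reply exists."""
    buffer = ""
    for token in llm_stream():
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield synthesize_chunk(buffer)
            buffer = ""
    if buffer:  # flush any trailing partial sentence
        yield synthesize_chunk(buffer)

chunks = list(stream_response())  # audio becomes available chunk by chunk
```

Perceived latency is then set by the time to the first chunk, not the whole reply, which is what keeps the pipeline inside the ~2-second conversational budget.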
