
Text-to-Speech

TTS, Speech Synthesis, Voice AI
Converts written text into natural-sounding spoken audio. Modern TTS systems use neural networks to generate speech that is nearly indistinguishable from a human voice, with control over emotion, pacing, and emphasis, and even cloning of specific voices. ElevenLabs, OpenAI TTS, and open-source models like Bark and XTTS have made high-quality speech synthesis widely accessible.

Why It Matters

TTS closes the loop for voice AI: speech recognition turns audio into text, an LLM processes it, and TTS turns the response back into speech. This enables voice assistants, audiobook narration, accessibility tools, content localization, and AI characters in games and media. Modern TTS quality has crossed the uncanny valley: synthetic speech now sounds natural.

Deep Dive

Modern TTS typically works in two stages: a text-to-spectrogram model (converting text into a time-frequency representation of the audio) and a vocoder (converting that spectrogram into an actual audio waveform). Some newer approaches are end-to-end: Transformer-based architectures similar to LLMs generate discrete audio tokens directly from text.
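The two-stage split can be illustrated with a deliberately toy sketch: stage one maps each character to a single dominant frequency (standing in for an acoustic model producing mel-spectrogram frames), and stage two renders each frame as a short sine burst (standing in for a neural vocoder such as HiFi-GAN). All names and the character-to-frequency mapping here are invented for illustration; real models learn these mappings from data.

```python
import numpy as np

SAMPLE_RATE = 16_000   # samples per second
FRAME_SEC = 0.1        # toy assumption: one "frame" per character

def text_to_spectrogram(text: str) -> np.ndarray:
    """Stage 1 (toy acoustic model): map each letter to one
    dominant frequency, standing in for a mel-spectrogram frame."""
    freqs = [200 + 20 * (ord(c) % 26) for c in text.lower() if c.isalpha()]
    return np.array(freqs, dtype=float)

def vocoder(spectrogram: np.ndarray) -> np.ndarray:
    """Stage 2 (toy vocoder): render each frame as a short sine
    burst and concatenate the bursts into one waveform."""
    t = np.linspace(0, FRAME_SEC, int(SAMPLE_RATE * FRAME_SEC),
                    endpoint=False)
    chunks = [np.sin(2 * np.pi * f * t) for f in spectrogram]
    return np.concatenate(chunks) if chunks else np.zeros(0)

spec = text_to_spectrogram("hello")   # 5 frames, one per letter
audio = vocoder(spec)                 # 5 * 1600 = 8000 samples
print(len(spec), len(audio))
```

The point of the separation is that the two stages can be trained and swapped independently; end-to-end systems collapse them into a single model.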

Voice Cloning

Voice cloning creates a synthetic version of a specific person's voice from a short audio sample (sometimes as little as 15 seconds). This enables personalization, dubbing, and preserving voices of people who have lost the ability to speak. It also creates obvious risks: impersonation, fraud, and non-consensual voice replication. Most providers implement consent verification and watermarking to mitigate misuse.

The Latency Challenge

For conversational AI, TTS latency matters as much as quality. A user asking a voice assistant a question expects a response within 1–2 seconds. Full TTS generation can take longer, so streaming TTS (generating and playing audio in chunks as the LLM produces text) is essential. The pipeline — STT + LLM + TTS — must stay under ~2 seconds total for natural conversation, which constrains model sizes and infrastructure choices.
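A minimal sketch of the chunked-streaming idea described above, using hypothetical `llm_tokens` and `synthesize` stand-ins (a real system would consume an LLM token stream and call a streaming TTS API): tokens are buffered into phrase-sized chunks and each chunk is synthesized as soon as it is ready, so playback can begin well before the full answer exists.

```python
def llm_tokens():
    """Stand-in for an LLM streaming its answer token by token."""
    for tok in "The weather today is sunny and mild .".split():
        yield tok

def synthesize(chunk: str) -> bytes:
    """Stand-in for one TTS call; a real system would return audio
    samples from a streaming TTS service."""
    return chunk.encode()

def streaming_tts(tokens, min_chars=12):
    """Buffer tokens into phrase-sized chunks, flushing on punctuation
    or when the buffer grows long enough, and synthesize each chunk
    immediately so audio playback overlaps LLM generation."""
    buf = []
    for tok in tokens:
        buf.append(tok)
        text = " ".join(buf)
        if tok in {".", ",", "?", "!"} or len(text) >= min_chars:
            yield synthesize(text)
            buf = []
    if buf:  # flush whatever remains at end of stream
        yield synthesize(" ".join(buf))

chunks = list(streaming_tts(llm_tokens()))
print(len(chunks))  # several small chunks instead of one big utterance
```

The first chunk is ready after only a few tokens, which is what keeps perceived latency low even when total synthesis time is longer.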
