
Text-to-Speech

TTS, Speech Synthesis, Voice AI
Converts written text into natural-sounding spoken audio. Modern TTS systems use neural networks to generate speech that is nearly indistinguishable from a human voice, with control over emotion, pacing, and emphasis, and even cloning of specific voices. ElevenLabs, OpenAI TTS, and open-source models like Bark and XTTS have made high-quality speech synthesis widely accessible.

Why It Matters

TTS closes the loop for voice AI: speech recognition turns audio into text, an LLM processes it, and TTS turns the response back into speech. This enables voice assistants, audiobook narration, accessibility tools, content localization, and AI characters in games and media. Modern TTS quality has crossed the uncanny valley: synthetic speech now sounds natural.

Deep Dive

Modern TTS typically works in two stages: a text-to-spectrogram model (converting text to a visual representation of audio frequencies) and a vocoder (converting the spectrogram to actual audio waveforms). Some newer approaches are end-to-end, directly generating audio tokens from text using Transformer-based architectures similar to LLMs but operating on audio tokens instead of text tokens.
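The two-stage pipeline can be sketched with stub models. This is a minimal illustration of the data flow, not a real TTS implementation: `text_to_spectrogram` and `vocoder` are hypothetical stand-ins whose shapes mimic the real stages (text frames in, mel-like frames out, then frames expanded into waveform samples).

```python
# Sketch of the two-stage TTS pipeline: text -> spectrogram -> waveform.
# Both stages are toy stubs; real systems use neural models here.

def text_to_spectrogram(text: str, n_mels: int = 80) -> list[list[float]]:
    """Stage 1 stub: produce one mel-like frame per character."""
    return [[float(ord(ch) % 10)] * n_mels for ch in text]

def vocoder(spectrogram: list[list[float]], hop_length: int = 256) -> list[float]:
    """Stage 2 stub: expand each frame into hop_length audio samples."""
    samples: list[float] = []
    for frame in spectrogram:
        level = sum(frame) / len(frame)  # crude "energy" of the frame
        samples.extend([level] * hop_length)
    return samples

def tts(text: str) -> list[float]:
    """Full pipeline: text in, raw audio samples out."""
    return vocoder(text_to_spectrogram(text))

audio = tts("hello")  # 5 characters -> 5 frames -> 5 * 256 samples
```

The key point the sketch preserves: the intermediate spectrogram decouples linguistic modeling (stage 1) from waveform generation (stage 2), which is why the two stages can be trained and swapped independently.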

Voice Cloning

Voice cloning creates a synthetic version of a specific person's voice from a short audio sample (sometimes as little as 15 seconds). This enables personalization, dubbing, and preserving voices of people who have lost the ability to speak. It also creates obvious risks: impersonation, fraud, and non-consensual voice replication. Most providers implement consent verification and watermarking to mitigate misuse.

The Latency Challenge

For conversational AI, TTS latency matters as much as quality. A user asking a voice assistant a question expects a response within 1–2 seconds. Full TTS generation can take longer, so streaming TTS (generating and playing audio in chunks as the LLM produces text) is essential. The pipeline — STT + LLM + TTS — must stay under ~2 seconds total for natural conversation, which constrains model sizes and infrastructure choices.
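The chunking idea behind streaming TTS can be sketched as follows. Everything here is illustrative: `fake_llm_stream` stands in for an LLM token stream and `synthesize_chunk` for a real TTS call; the point is that audio for the first sentence can start playing while later sentences are still being generated.

```python
# Sketch of streaming TTS: buffer LLM tokens until a sentence boundary,
# then synthesize that chunk immediately instead of waiting for the
# full response. Stand-in functions, not a real TTS/LLM API.
import re
from typing import Iterator

def fake_llm_stream() -> Iterator[str]:
    """Stand-in for an LLM emitting tokens incrementally."""
    yield from ["Sure, ", "the ", "weather ", "is ", "sunny. ",
                "Enjoy ", "your ", "day."]

def synthesize_chunk(text: str) -> bytes:
    """Stand-in for a real TTS call; returns dummy audio bytes."""
    return text.encode("utf-8")

def stream_tts(token_stream: Iterator[str]) -> Iterator[bytes]:
    """Yield an audio chunk per sentence as soon as it is complete."""
    boundary = re.compile(r"[.!?]\s*$")  # end-of-sentence heuristic
    buffer = ""
    for token in token_stream:
        buffer += token
        if boundary.search(buffer):
            yield synthesize_chunk(buffer)  # playback can start here
            buffer = ""
    if buffer:  # flush any trailing partial sentence
        yield synthesize_chunk(buffer)

chunks = list(stream_tts(fake_llm_stream()))
# two sentences -> two audio chunks, the first ready before the LLM finishes
```

In production the sentence-boundary heuristic is usually smarter (abbreviations, numbers), and each chunk is streamed to the audio device as it arrives, so perceived latency is the time to the first chunk rather than to the full response.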
