Supertonic v3 on-device TTS: 99M params, 31 langs, MIT + OpenRAIL-M, Zubnet AI News

Supertone, a speech-AI company, released Supertonic v3, an on-device text-to-speech model with 31-language support, expression tags, and a deployment footprint small enough to run on an e-reader. The architecture combines a speech autoencoder, a flow-matching text-to-latent module, and a duration predictor, and it integrates Length-Aware Rotary Position Embedding (LARoPE) and a Self-Purifying Flow Matching training technique. Parameter count is roughly 99M (v2 was 66M), disk footprint is 404 MB, and inference completes in 2 flow-matching steps. The code is MIT-licensed; the model weights are under OpenRAIL-M. The Python SDK ships via `pip install supertonic`, with ONNX assets auto-downloaded from Hugging Face on first run.

The hardware target is the headline. Supertone reports a Real-Time Factor (RTF) of 0.3 on an Onyx Boox Go 6 e-reader, an Android-based e-paper device with an ARM SoC and very modest compute relative to a phone or laptop. RTF is synthesis time divided by the duration of the audio produced, so RTF 0.3 means the model generates one second of audio in 300 ms on that class of hardware, which leaves comfortable headroom for streaming playback even with significant overhead for tokenization and buffering. The language list spans several families, including Indo-European, Uralic, Turkic, Semitic, Koreanic, Japonic, Austronesian, and Austroasiatic: English, Korean, Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Spanish, Estonian, Finnish, French, Hindi, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, plus an "na" fallback for unknown languages. Supertone reports word error rate (WER) and character error rate (CER) competitive with VoxCPM2, which is a significantly larger model.
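The RTF arithmetic is easy to reproduce on your own device. The sketch below is a minimal timing harness, not Supertone's benchmark method: `synthesize` is a hypothetical stand-in for whatever call returns audio samples in your integration, and the sample rate is whatever your pipeline actually uses.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical: text -> mono samples
    text: str,
    sample_rate: int,
    runs: int = 5,
) -> float:
    """Median RTF over several timed runs, after one untimed warm-up call."""
    synthesize(text)  # warm-up so model loading and first-call setup are not timed
    rtfs = []
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize(text)
        elapsed = time.perf_counter() - start
        rtfs.append(real_time_factor(elapsed, len(samples) / sample_rate))
    rtfs.sort()
    return rtfs[len(rtfs) // 2]  # upper median when runs is even


# Worked example using the reported figure: RTF 0.3 on a 10-second utterance.
audio_seconds = 10.0
synthesis_seconds = 0.3 * audio_seconds               # 3 s of compute
headroom_seconds = audio_seconds - synthesis_seconds  # 7 s of slack for other work
```

An RTF below 1.0 means synthesis outpaces playback; the further below 1.0, the more room there is for buffering, text preprocessing, and other work on the same core.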

The expression tags are simple and useful: three tags can be embedded inline in input text, and the model produces the prosodic cue without a separate preprocessing step or a second model layer. That is the deployment-side detail that matters most for product integrators: embedding three tags in the input pipeline is trivial compared to running a second model for expressiveness, and the tags are explicit enough to control deterministically. The other deployment-friendly choice is that v3 preserves the v2 ONNX inference contract, so a deployed product can roll forward to v3 without code changes or a rewritten audio pipeline.
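The contract-compatibility claim is worth confirming before swapping models in production. One way to do that, sketched here under assumptions, is to compare the input and output tensor names, dtypes, and shapes of the old and new ONNX files. The file paths in the usage comment are hypothetical, and `load_signature` assumes `onnxruntime` is installed; the comparison itself is plain Python.

```python
from typing import Dict, List, Tuple

Signature = Dict[str, Tuple[str, tuple]]  # tensor name -> (dtype string, shape)


def load_signature(model_path: str) -> Dict[str, Signature]:
    """Read input/output names, dtypes, and shapes from an ONNX file."""
    import onnxruntime as ort  # lazy import so contract_diff works without it

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return {
        "inputs": {a.name: (a.type, tuple(a.shape)) for a in session.get_inputs()},
        "outputs": {a.name: (a.type, tuple(a.shape)) for a in session.get_outputs()},
    }


def contract_diff(old: Dict[str, Signature], new: Dict[str, Signature]) -> List[str]:
    """List every difference between two signatures; empty means they match."""
    problems = []
    for side in ("inputs", "outputs"):
        for name, spec in old[side].items():
            if name not in new[side]:
                problems.append(f"{side}: '{name}' missing in new model")
            elif new[side][name] != spec:
                problems.append(
                    f"{side}: '{name}' changed from {spec} to {new[side][name]}"
                )
        for name in new[side]:
            if name not in old[side]:
                problems.append(f"{side}: '{name}' added in new model")
    return problems


# Usage (paths are placeholders, not the real asset names):
# diff = contract_diff(load_signature("v2/model.onnx"), load_signature("v3/model.onnx"))
# print(diff or "contracts match")
```

A matching signature is necessary but not sufficient: it shows the tensors line up, not that value ranges or text preprocessing are unchanged, so a listening test on real inputs is still advisable.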

For builders shipping anything with voice on the edge (mobile apps, accessibility tools, robotics, IoT, e-readers, vehicle infotainment), Supertonic v3 is now in the candidate set alongside Kokoro, Piper, and the larger Coqui line. Two questions are worth answering with your own evals: whether the WER on your target language matches the headline competitiveness with VoxCPM2, and whether RTF on your specific target hardware (not the Onyx Boox Go 6) gives you the latency budget for your use case. The licensing is permissive enough for commercial use; the OpenRAIL-M terms on the weights are the only constraint to read carefully if you are building a commercial product. ONNX Runtime portability is the other thing to verify, since most edge deployments will be ARM CPU or NPU rather than GPU.
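For the WER side of that evaluation, the usual approach for TTS is to transcribe the synthesized audio with a speech recognizer and score the transcript against the input text. The sketch below covers only the scoring step, using word-level and character-level edit distance; the recognizer is out of scope, and the normalization (lowercasing, whitespace splitting) is a simplifying assumption that may not match how Supertone computed its reported numbers.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace; useful where word
    boundaries are not marked by spaces (for example Japanese)."""
    ref_chars = list("".join(reference.lower().split()))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list("".join(hypothesis.lower().split()))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substituted word out of six reference words.
score = wer("the cat sat on the mat", "the cat sat on a mat")
```

Scores depend heavily on the recognizer and the text normalization, so compare candidate models with the same recognizer and the same test sentences rather than against published figures.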
