Qwen3-TTS: Japanese Tuning Techniques Learned from Reading and Experimenting with the Paper!

AICU media

Alibaba's Qwen team open-sourced the "Qwen3-TTS" series, which boasts powerful speech generation capabilities, on January 22, 2026. The team bills the suite as the most comprehensive speech generation toolkit to date, integrating voice cloning, voice design, ultra-high-quality human-like speech generation, and precise speech control through natural language. This article is an experimental report on "training a next-generation TTS to keep sounding like the same character" with a model that infers autonomously.

https://www.youtube.com/watch?v=LBZoPBV-wSA

Nao Verde Takes on Qwen3-TTS.

Hello, I'm Nao Verde, in charge of music and development technology at AICU. Today, a release too significant to ignore, "Qwen3-TTS", was announced in the Qwen series! Qwen is a large language model (LLM) series developed by Alibaba Cloud. It has several notable features, but the first I want to highlight is its diversity: there are multimodal models like Qwen-VL and Qwen-Audio that can handle images and audio. The ability to process all kinds of information, not just text, is exciting for creators, isn't it? On top of that, the models are genuinely high-performance, and many versions are released as open source, so they are easy to try and highly customizable.

And this time, a high-performance Text-to-Speech (TTS) model has been released as open source! This is a welcome point for developers, and I have a feeling it will become a powerful weapon for Elena and Mei to further expand their expressiveness in our AICU project.

Don't settle for one-shot demos on HuggingFace! We have also built a demo site that can be used for a limited time and a limited number of runs!

Qwen3-TTS Demo (AICU, Gradio): try the app at
https://qwen3tts.aicu.jp/


The Full Picture of the Qwen3-TTS Family: A Two-Tier Structure of 1.7B and 0.6B

Qwen3-TTS is available in two sizes optimized for different usage scenarios.

  • 1.7B model: Boasts the highest performance and powerful control capabilities, making it ideal for professional content creation.
  • 0.6B model: Excels in the balance between performance and efficiency, demonstrating its true value in environments where on-device operation and real-time performance are required.

These models support 10 major languages, including Japanese, English, and Chinese, and also handle dialects. They are capable of not only reading text aloud but also deeply understanding context and adapting tone, rhythm, and emotional expression to it.

The Core of the Technology: Innovative 12Hz Tokenizer and Dual-Track Structure

Supporting the overwhelming expressiveness of Qwen3-TTS is the uniquely developed Qwen3-TTS-Tokenizer-12Hz.

This multi-codebook audio encoder efficiently compresses the audio signal while enabling high-level semantic modeling, which allows a lightweight non-DiT (Diffusion Transformer) decoder to reconstruct audio quickly and with high fidelity. In addition, dual-track hybrid streaming generation achieves an astonishingly low end-to-end first-packet latency: as low as 97 ms for the 0.6B model and 101 ms for the 1.7B model.
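To make the latency claim concrete, here is a minimal sketch of how you could measure time-to-first-packet yourself. The stream_generate method is a hypothetical streaming interface I'm assuming for illustration; the real Qwen3-TTS Python API may expose streaming differently.

import time

# Hypothetical streaming interface: we assume `model.stream_generate(text)` yields
# audio chunks as soon as they are produced (assumed API, see the note above).
def first_packet_latency(model, text: str) -> float:
    start = time.perf_counter()
    for _chunk in model.stream_generate(text):
        return time.perf_counter() - start  # stop timing at the very first chunk
    raise RuntimeError("model produced no audio")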

Qwen3-TTS is officially live. We’ve open-sourced the full family—VoiceDesign, CustomVoice, and Base—bringing high quality to the open community.

- 5 models (0.6B & 1.7B)
- Free-form voice design & cloning
- Support for 10 languages
- SOTA 12Hz tokenizer for high compression
- … pic.twitter.com/BSWpaYoZWj

— Qwen (@Alibaba_Qwen)

Overwhelming Performance: Numbers Exceeding ElevenLabs and MiniMax

Qwen3-TTS has recorded SOTA results on many metrics. On the multilingual test set, it achieved a word error rate (WER) of 1.835% and a speaker similarity score of 0.789, surpassing ElevenLabs.

In the field of voice design (creating voices from instructions), faithfulness to the prompt is also very high. The model handles detailed requests such as "an elated male voice, loud enough to scream a sense of urgency," and by reading a character's background (age, occupation, personality) it can even generate lines that embody that person.


Imbuing AI with a "Soul": A Secret Recipe for Japanese Training with Qwen3-TTS

By thoroughly reading the latest paper and actually running it on my Mac, I have discovered the "tips for perfectly taming Japanese" in Qwen3-TTS.

It's not just about feeding it text. Here are Nao Verde's Japanese training techniques for understanding the model's internal structure and drawing out exactly the "voice" you intend.

1. How to Select Tokenizers for Different Purposes

Qwen3-TTS has two hearts: "12Hz" and "25Hz". If you get this wrong, you won't get the ideal voice no matter how good your prompt is.

  • "12Hz model" is the only choice for real-time conversations

    • Reason: With a completely causal design, it can generate audio without waiting for future tokens. The 0.6B model can output the first packet at an astonishing speed of just 97 ms.
  • "25Hz model" for long-term stability

    • Reason: It has a design that emphasizes semantic tokens and uses Diffusion Transformer (DiT). The report's evaluation also shows that 25Hz is more stable for long readings of over 10 minutes.
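As promised, a small selection sketch. The 12Hz repo name is the one used in the verification section later in this article; the 25Hz name is my assumed analogue and may differ.

# Pick the model variant by use case: 12Hz = causal, low latency; 25Hz = DiT, long-form.
def pick_model(realtime: bool, size: str = "1.7B") -> str:
    if realtime:
        # Fully causal 12Hz track: no lookahead, ~97-101 ms to the first packet.
        return f"Qwen/Qwen3-TTS-12Hz-{size}-CustomVoice"
    # 25Hz track (assumed repo name): semantic-token-heavy, DiT-based,
    # reported more stable on readings longer than 10 minutes.
    return f"Qwen/Qwen3-TTS-25Hz-{size}-CustomVoice"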

2. Increase the Resolution of "Instructions"

The strength of Qwen3-TTS is precise control using natural language (VoiceDesign). The key to engraving Japanese nuances is to incorporate the following elements into the instructions.

  • Stimulate "thinking patterns": To get the model to absorb complex direction, "thinking patterns that are activated probabilistically" were introduced during training. Rather than a flat "bright," a multifaceted description such as "an intelligent and accurate pronunciation like a newscaster, but with a little friendliness" is far more effective.
  • Control punctuation and pauses: Qwen3-TTS deeply understands the structure of the text and adaptively adjusts rhythm and emotional expression based on punctuation. By intentionally adding commas (、), you can create natural, human-like "pauses." (Example instruction strings follow below.)
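Here is what I mean in practice. These strings are illustrative examples in the spirit of the advice above, not prompts taken from the paper.

# A flat instruction vs. a multifaceted one that gives the model more to work with.
FLAT_INSTRUCT = "明るい女性の声。"  # "A bright female voice." -- too low-resolution

LAYERED_INSTRUCT = (
    "ニュースキャスターのような知的で正確な発音。"  # newscaster-like, intelligent, accurate
    "ただし少し親しみやすく、普段よりわずかに速め。"  # but a little friendly, slightly faster
)

# Punctuation control: deliberately added commas (、) become natural pauses.
TEXT = "本日の天気は、晴れ、のち、くもりです。"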

3. If You Want to Maximize Cloning Accuracy, Use "ICL Mode"

There are two methods for voice cloning: using speaker embeddings and using in-context learning (ICL).

  • If you only want the timbre, use embeddings: Real-time performance is high, but the intonation (the melody of speech) may not be reproduced.
  • If you want to copy the "way of speaking," use ICL: Feed in a reference audio clip of 3 seconds or more together with its transcript. This enables more accurate cloning, including Japanese-specific pitch-accent quirks and emotional expression. (A minimal sketch follows this list.)
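A minimal sketch of the difference, assuming a hypothetical clone method; the actual Base-model CLI/API may differ.

# ICL cloning: pass >= 3 s of reference audio together with its exact transcript.
def clone_with_icl(model, text: str, ref_audio_path: str, ref_transcript: str):
    return model.clone(                  # assumed method name, see the note above
        text=text,
        reference_audio=ref_audio_path,  # >= 3 seconds of the target speaker
        reference_text=ref_transcript,   # supplying the transcript enables ICL mode;
    )                                    # omit it to fall back to embedding-only timbre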

4. Ensure "Margin" for Hardware

In my experiments, running in float16 was the most efficient in the Apple Silicon (MPS) environment. On CUDA GPUs, FlashAttention 2 is essential for saving memory and improving throughput; note that it is not available on MPS.
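A device/precision selection sketch reflecting this. Whether Qwen3-TTS loads through Hugging Face's from_pretrained is an assumption on my part, but the device logic below is generic PyTorch.

import torch

# float16 on Apple Silicon (MPS) was the most efficient in my runs; FlashAttention 2
# exists only for CUDA, so fall back to PyTorch's built-in SDPA everywhere else.
if torch.backends.mps.is_available():
    device, dtype, attn_impl = "mps", torch.float16, "sdpa"
elif torch.cuda.is_available():
    device, dtype, attn_impl = "cuda", torch.float16, "flash_attention_2"
else:
    device, dtype, attn_impl = "cpu", torch.float32, "sdpa"
# These would then be passed to the loader, e.g. from_pretrained(..., torch_dtype=dtype,
# attn_implementation=attn_impl), if the model ships as a Transformers checkpoint.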

Qwen official blog: https://qwen.ai/blog?id=qwen3tts-0115

On-Site Verification: Sampling Report by Nao Verde (2026-01-23)

"Enough with the theory, what's it really like?" So, I immediately ran it in my Mac (Apple Silicon) environment. It's a little hard to believe, but because "the model is autonomously inferring," the voice changes every time, which is difficult, but by properly training it, I was able to truly experience the next generation of TTS. I will share the results of checking the Japanese voice generation in the style of a news announcer and its cloning accuracy.

1. Verification Setup

  • Model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
  • Speaker used: Ono_Anna (Japanese)
  • Instruction (instruct): "An intelligent and accurate pronunciation like a Japanese news announcer. A little faster. Kanji is read correctly in Japanese, and English words are pronounced in English."
  • Execution environment: macOS 15.6.1 (arm64), Python 3.12, torch 2.10.0

Reproduction Command (source code at the end)

.venv312/bin/python examples/custom_voice_cli.py "おはようございます!ミナ・アズールです。日本語でお送りします。本日の天気は晴れ、気温は8度です。今日のニュースです。中国アリババの研究所から新しいTTSモデルQwen3-TTSが公開されました。" && open -a VLC output_custom_voice.wav

This is the result of generating the same text 10 times with the same seed. It's terrible; it doesn't sound like the same person at all!!

2. Analysis of Inference Results

Nao's Memo: Even with do_sample=False (a deterministic setting), the waveforms do not match exactly in the MPS environment; the 10 takes separated into 3 distinct clusters. Also, with sampling enabled, MPS occasionally produced "hesitations" where generation ran past 35 seconds (outliers). The trade-off between stability and resources is interesting.

The finding from this on-site verification that "results vary even with do_sample=False" may be a discovery unique to AICU AIDX Lab that is not covered in the official report; it may be behavior specific to the MPS environment. (The similarity check behind the cluster count is sketched below.)
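For reference, here is the kind of similarity check behind the "3 clusters" observation: generate 10 takes, then compare every pair by normalized correlation. Mono WAV files and the take_XX.wav naming are assumptions; numpy and soundfile do the work.

import numpy as np
import soundfile as sf

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalized correlation at lag 0, after truncating to the shorter take.
    n = min(len(a), len(b))
    a, b = a[:n] - a[:n].mean(), b[:n] - b[:n].mean()
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0

takes = [sf.read(f"take_{i:02d}.wav")[0] for i in range(10)]  # assumed file names
sim = np.array([[similarity(x, y) for y in takes] for x in takes])
print(np.round(sim, 2))  # blocks of ~1.0 off the diagonal reveal the clusters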

The secret behind the magic of "sound from the very first character" is the MTP module adopted by the 12Hz model. It's like an accelerator pedal that instantly predicts the next sound, with no lookahead required!

3. Voice Clone's Ability (Base Model)

When I used the generated "Mina Azul" voice as a reference and cloned it with the Base model, the result was perfect. With sampling turned off, it output the same waveform 10 times in a row: extremely high reproducibility, with a correlation of 1.0. Average generation time was around 27 seconds, which is a practical level.

How about it? Is it a little better?

Nao Verde's Summary

Qwen3-TTS has evolved from "AI that just speaks" into "AI that understands and performs context and intent."

If we creators master these "training techniques," a future where game characters come to life and our own voices convert perfectly into multiple languages is already within reach.

Let's make full use of the latest open-source technology and show off creativity that only we can deliver.

I'll share more interesting technologies when I find them.

Well then, happy developing (and music production)!