Minimax Speech 2.0

Original link: https://arxiv.org/pdf/2505.07916

Introduction

I have a fairly positive impression of Minimax as a company. I have listened to several podcasts with their CEO and CTO, and I came away feeling that they have not only a business strategy but also real technical ambition, especially around linear attention. So I was looking forward to their new model. The new TTS model does turn out to be excellent in use, especially for Chinese speech. The technical report itself, however, is only average. More than half of it is about how good the results are, so it reads more like flexing than sharing. My guess is that the company may be under some financing pressure.

Minimax Speech 2.0 is an autoregressive Transformer-based TTS (text-to-speech) model that reports SOTA results. Its main innovation is a learnable speaker encoder, trained jointly with the rest of the model, which makes zero-shot voice cloning possible; one-shot cloning is supported as well. On top of that, the model uses flow matching and a Flow-VAE decoder to improve generation quality.

Data

Unfortunately, the technical report gives no details about the training data. It only talks vaguely about some data components and preprocessing steps that everyone already knows: training uses data from 32 languages; two independent ASR (automatic speech recognition) models transcribe each audio clip, and if their outputs are close the transcript is considered accurate, otherwise the clip is processed further; VAD (voice activity detection) is combined with ASR output timestamps to refine punctuation; steady background noise in recordings is kept to improve the model’s robustness in real environments; and a speaker verification (SV) model is applied. The most complete description of speech-model training data that I can remember still comes from OpenAI Whisper (an ASR model rather than a TTS model) a few years ago. I feel domestic companies are still relatively conservative in this respect.
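
The dual-ASR agreement check can be sketched roughly as follows. This is only an illustration of the idea: the report does not say how "close" is measured, so the character-level edit distance, the normalization, and the 5% threshold (`max_disagreement`) are all my own assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (rolling-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def transcripts_agree(hyp_a, hyp_b, max_disagreement=0.05):
    """True when two ASR hypotheses are close enough to be trusted.

    Assumption: disagreement is the character-level edit distance divided
    by the longer transcript length, ignoring case and whitespace.
    """
    a = "".join(hyp_a.lower().split())
    b = "".join(hyp_b.lower().split())
    longest = max(len(a), len(b))
    if longest == 0:
        return False  # nothing was transcribed, so there is nothing to keep
    return edit_distance(a, b) / longest <= max_disagreement
```

Clips for which `transcripts_agree` returns False would go to the "processed further" branch, whatever that is in the real pipeline.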

Model architecture

The model architecture is a classic multimodal architecture: compress different modalities into a unified space, then decode the output. Specifically, text is tokenized with classic BPE, while speech goes through a Speaker Encoder and an Audio Tokenizer: the former extracts voice characteristics and the latter extracts content. Unlike other TTS models, Minimax does not use a pre-trained speaker encoder, but trains this encoder jointly with the AR Transformer. The advantage is that the model no longer depends on a pre-trained encoder whose training corpus may not be rich enough. My personal guess is that such pre-trained encoders may not work well for Chinese, and that this time Minimax strengthened the Chinese corpus on the data side.

This architectural innovation lets Minimax achieve high-quality zero-shot voice cloning: users only need to upload a clip of reference speech, with no transcript, and can then synthesize arbitrary text in that voice. By comparison, traditional speech models need a paired speech-text example (one-shot) or fine-tuning to achieve decent results.
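
To make the zero-shot versus one-shot distinction concrete, here is a small sketch of how the conditioning inputs might be assembled. Everything in it is hypothetical (the `ARInput` container, the `build_ar_input` helper, and the choice to prepend the prompt as a prefix); it only illustrates that zero-shot needs the speaker vector alone, while one-shot additionally needs a transcript paired with the reference audio.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class ARInput:
    """Hypothetical bundle of what the AR transformer is conditioned on."""
    speaker_vector: List[float]   # from the speaker encoder, audio only
    text_tokens: List[int]        # prompt transcript (if any) + target text
    audio_prefix: List[int] = field(default_factory=list)  # prompt audio tokens


def build_ar_input(speaker_vector: Sequence[float],
                   target_text: Sequence[int],
                   prompt_text: Optional[Sequence[int]] = None,
                   prompt_audio: Optional[Sequence[int]] = None) -> ARInput:
    """Zero-shot when no prompt pair is given, one-shot otherwise."""
    if (prompt_text is None) != (prompt_audio is None):
        raise ValueError("one-shot needs both the transcript and its audio tokens")
    if prompt_text is None:
        # Zero-shot: the speaker vector is the only voice information.
        return ARInput(list(speaker_vector), list(target_text))
    # One-shot: the paired example is prepended as an in-context prompt.
    return ARInput(list(speaker_vector),
                   list(prompt_text) + list(target_text),
                   list(prompt_audio))
```

The practical difference for a user is that the zero-shot path never requires transcribing the reference clip.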

Flow matching

Flow matching is a family of generative models. In essence, it learns a continuous transformation that turns a simple distribution, such as a standard normal, into a complex continuous data distribution. In TTS models it is usually used to convert the discrete tokens generated by an AR Transformer into continuous speech features.
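
For reference, one common formulation of flow matching uses a straight-line path between a noise sample and a data sample. This is the textbook form, written here as an assumption; the report may use a different path or parameterization. The symbols are mine: x_0 is noise, x_1 is a data sample (a speech latent), c is the conditioning (tokens, speaker vector), and v_θ is the learned velocity field.

```latex
% Interpolation path between noise x_0 and data x_1
\[
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I), \quad x_1 \sim p_{\text{data}}, \quad t \sim \mathcal{U}[0, 1]
\]

% Training objective: regress the velocity of the path
\[
\mathcal{L}_{\text{FM}}(\theta) =
\mathbb{E}_{t,\,x_0,\,x_1}
\left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2
\]

% Sampling: integrate the learned ODE from t = 0 to t = 1
\[
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t, c), \qquad x_{t=0} = x_0
\]
```

At inference time the model draws x_0 from the normal prior and integrates the ODE to obtain a sample from the data distribution.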

1. Autoregressive Transformer: generating discrete audio tokens

Input conditions:
- Text-encoded tokens.
- The condition vector output by the speaker encoder, used to specify the target speaker’s timbre and style.
Processing logic:
The autoregressive Transformer takes text tokens as input, combines them with the speaker condition vector, and generates discrete audio tokens one step at a time through the attention mechanism, with each new token conditioned on the ones before it. This mirrors the temporal nature of human speech production and is good at capturing natural changes in rhythm and intonation.
Advantages:
Compared with non-autoregressive models, the autoregressive architecture does not need to explicitly model phoneme duration alignment, and can generate more natural speech rhythm through implicit learning.
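
The generation loop described above can be sketched like this. The "model" is a deterministic toy stand-in (`toy_logits`), not anything from the report, and the `EOS` id, vocabulary size, and sampling details are assumptions. The point it illustrates is that output length is decided by when the model emits an end token, not by a separate duration predictor.

```python
import math
import random

EOS = 0  # hypothetical end-of-speech token id


def toy_logits(text_tokens, speaker_vector, audio_so_far, vocab_size=8):
    """Stand-in for the AR transformer; returns fake next-token logits.

    The toy ignores speaker_vector and simply stops after 3 audio tokens
    per text token. A real model would learn when to stop.
    """
    step = len(audio_so_far)
    logits = [0.0] * vocab_size
    if step >= 3 * len(text_tokens):
        logits[EOS] = 10.0
    else:
        token = 1 + (text_tokens[step // 3] + step) % (vocab_size - 1)
        logits[token] = 10.0
    return logits


def sample(logits, temperature, rng):
    """Greedy when temperature is 0, otherwise softmax sampling."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    weights = [math.exp(value - top) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(text_tokens, speaker_vector, logits_fn=toy_logits,
             max_steps=200, temperature=0.0, seed=0):
    """Autoregressively emit audio tokens until EOS or max_steps."""
    rng = random.Random(seed)
    audio = []
    for _ in range(max_steps):
        logits = logits_fn(text_tokens, speaker_vector, audio)
        token = sample(logits, temperature, rng)
        if token == EOS:
            break
        audio.append(token)
    return audio
```

Calling `generate([4, 2], [0.0])` with the toy model yields six audio tokens, because the stand-in stops after three per text token.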

2. Latent Flow Matching module: from discrete tokens to continuous speech features

The discrete audio tokens output by the autoregressive Transformer then enter the Latent Flow Matching module, which contains two key components:

(1) Flow-VAE: optimizing latent feature representation

Structure and function:
- Encoder: Converts discrete audio tokens into continuous speech features, or latent variables, capturing acoustic details such as pitch and timbre.
- Flow Model: Applies invertible transformations to the distribution of latent variables and maps them to a standard normal distribution, enhancing feature expression and distribution fitting ability.
- Decoder (neural vocoder): Restores latent variables into the audio waveform. During training, a KL divergence term constrains the latent distribution alongside the reconstruction objective.
Innovation:
Traditional VAEs assume the latent space follows a standard normal distribution, while Flow-VAE uses invertible transformations from flow models, such as affine transformations and permutations, to learn more complex posterior distributions. This allows it to capture multimodal features in speech data more accurately.
According to the report's experiments, Flow-VAE has lower waveform reconstruction error than a traditional VAE, and the resulting speech features are more compact and information-rich.
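
To show what "invertible affine transformations and permutations" means in practice, here is a generic affine coupling layer in the style of RealNVP. It is not the report's actual flow (no layer details are given there); the `log_scale` and `shift` functions below are arbitrary toy choices standing in for small neural networks.

```python
import math


def coupling_forward(z, log_scale_fn, shift_fn):
    """Affine coupling: keep the first half, transform the second half.

    Returns the transformed vector and log|det Jacobian|, which is what a
    flow needs for the change-of-variables density.
    """
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = log_scale_fn(z_a), shift_fn(z_a)
    y_b = [b * math.exp(ls) + ti for b, ls, ti in zip(z_b, log_s, t)]
    return z_a + y_b, sum(log_s)


def coupling_inverse(y, log_scale_fn, shift_fn):
    """Exact inverse of coupling_forward."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = log_scale_fn(y_a), shift_fn(y_a)
    z_b = [(b - ti) * math.exp(-ls) for b, ls, ti in zip(y_b, log_s, t)]
    return y_a + z_b


def permute(z):
    """Reverse the order so the next layer transforms the other half."""
    return z[::-1]


# Toy conditioner functions (assumptions; real ones are learned networks).
def log_scale(a):
    return [0.5 * math.tanh(x) for x in a]


def shift(a):
    return [2.0 * x for x in a]
```

Stacking several such layers with permutations in between gives a transformation that is still exactly invertible but can represent far richer distributions than a single Gaussian.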

(2) Flow Matching Model

Input conditions:
- The discrete audio tokens generated by the autoregressive Transformer, encoded by Flow-VAE into latent variables.
- Speaker condition vectors and text-encoded context information, used to guide style and content alignment in synthesized speech.
Processing logic:
The flow matching model is based on a Transformer architecture. It models the distribution of the latent variables by learning a continuous mapping from a prior distribution, such as a standard normal distribution, to the data distribution, and uses that mapping to generate high-quality continuous speech features. This process needs no explicit duration modeling; temporal dependencies in speech are captured implicitly.
Advantages:
Compared with directly predicting the next token, flow matching models avoid quantization error in discrete space through distribution modeling in continuous latent space, and can more flexibly handle the dynamic range and detail changes of speech.
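
Sampling from a flow matching model amounts to integrating an ODE from the prior to the data distribution. The sketch below does this with plain Euler steps on a 1D toy problem where the target is a Gaussian, so the velocity field has a closed form. In a real system the velocity comes from the trained Transformer and is conditioned on tokens and the speaker vector; the closed-form field, the straight-line path, and the step count here are all assumptions for illustration.

```python
def gaussian_velocity(x, t, mu, sigma):
    """Exact velocity field for the straight-line path from N(0, 1)
    to N(mu, sigma^2) in one dimension (toy replacement for a network)."""
    var_t = (1.0 - t) ** 2 + (t * sigma) ** 2
    slope = (t * sigma ** 2 - (1.0 - t)) / var_t
    return mu + slope * (x - t * mu)


def euler_sample(x0, velocity_fn, steps=1000):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


def sample_target(x0, mu=2.0, sigma=0.5, steps=1000):
    """Map one prior sample x0 to (approximately) a target sample."""
    return euler_sample(x0, lambda x, t: gaussian_velocity(x, t, mu, sigma), steps)
```

For this toy field the exact flow maps x0 to mu + sigma * x0, so `sample_target(1.0)` should land close to 2.5. Because the output is a real number rather than an index into a codebook, no rounding to a discrete token happens at this stage, which is the quantization-error point made above.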