YC's Blog

Minimax Speech 2.0


Original link: https://arxiv.org/pdf/2505.07916

Introduction

I have a fairly positive impression of Minimax as a company. I have listened to several podcasts with their CEO and CTO, and it is clear that they have not only a business strategy but also a firm technical agenda, especially around linear attention. So I was looking forward to their new model. The new TTS model turns out to be excellent in practice, especially for Chinese speech. The technical report itself, however, is only average: more than half of it is about how good the results are, so it reads more like showing off than sharing. My guess is that the company may be under some fundraising pressure.

Minimax Speech 2.0 is a TTS (Text to Speech) model built on an autoregressive (AR) transformer, and it achieves SOTA results. Its main innovation is a speaker encoder trained jointly with the AR transformer, which makes zero-shot voice cloning possible; one-shot cloning is also supported. In addition, the model uses flow matching and a Flow-VAE decoder to improve generation quality.

Data

Unfortunately, the technical report gives no details about the training data. It only vaguely mentions some data components and preprocessing steps that are already common practice. Training uses data from 32 languages. Audio is transcribed by two independent ASR (Automatic Speech Recognition) models; if the two results are close, the transcript is considered accurate, otherwise the sample is processed further. VAD (Voice Activity Detection) is combined with ASR timestamps to refine punctuation. Steady background noise in recordings is kept to improve the model's robustness in real environments. A speaker verification (SV) model is also used. The most complete description of training data for a speech model that I can remember still comes from OpenAI Whisper (an ASR model rather than a TTS model) a few years ago. I feel Chinese companies are still relatively conservative in this respect.
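The dual-ASR check can be sketched roughly as follows. This is only an illustration of the idea: the report does not say how closeness is measured, so the character-level edit distance, the function names, and the 5% threshold are all my own assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def transcripts_agree(transcript_a, transcript_b, max_disagreement=0.05):
    """Accept a sample when two ASR transcripts differ by a small fraction.

    max_disagreement is a hypothetical threshold, not a value from the report.
    """
    a = transcript_a.strip().lower()
    b = transcript_b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= max_disagreement


# Samples that fail the check would go on to further processing.
print(transcripts_agree("hello world", "Hello world"))    # identical after normalization
print(transcripts_agree("hello world", "goodbye world"))  # clearly different
```

In a real pipeline the comparison would more likely be made on words or on language-specific units, but the accept-or-escalate structure stays the same.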

Model architecture

The model follows a classic multimodal design: map the different modalities into a unified space, then decode the output. Specifically, text is tokenized with standard BPE, while speech goes through a Speaker Encoder and an Audio Tokenizer, the former extracting voice characteristics and the latter extracting content. Unlike other TTS models, Minimax does not use a pre-trained speaker encoder; it trains the encoder jointly with the AR transformer. The advantage is that the encoder is not limited by the corpus a pre-trained encoder was trained on, which may not be rich enough. My personal guess is that pre-trained encoders may not work well for Chinese, and that Minimax strengthened the Chinese portion of its training data this time.

This design lets Minimax achieve high-quality zero-shot voice cloning: users only need to upload a segment of reference speech, with no transcript, and the model can then speak any input text in that voice. By comparison, traditional speech models need paired speech and text for one-shot cloning, or fine-tuning, to achieve decent results.
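To make the conditioning idea concrete, here is a toy sketch. It assumes the speaker encoder reduces a reference clip of any length to one fixed-size vector, which is then placed in front of the text embeddings fed to the AR transformer. The mean pooling, the random projection, and all names and dimensions are placeholders of mine, not the report's actual encoder.

```python
import numpy as np


def speaker_vector(reference_frames, projection):
    """Collapse (num_frames, feat_dim) features into one (model_dim,) vector.

    Mean pooling stands in for a learned encoder; it only illustrates that
    the output size does not depend on the length of the reference audio.
    """
    pooled = reference_frames.mean(axis=0)
    return pooled @ projection


def build_ar_input(speaker_vec, text_token_ids, text_embedding_table):
    """Prepend the speaker vector to the embedded text tokens."""
    text_embeddings = text_embedding_table[text_token_ids]
    return np.vstack([speaker_vec[None, :], text_embeddings])


rng = np.random.default_rng(0)
projection = rng.standard_normal((80, 16))       # hypothetical 80-dim features, 16-dim model
embedding_table = rng.standard_normal((100, 16))  # hypothetical BPE vocabulary of 100 tokens

short_clip = rng.standard_normal((50, 80))
long_clip = rng.standard_normal((400, 80))
text_ids = np.array([5, 17, 42])

ar_input = build_ar_input(speaker_vector(long_clip, projection), text_ids, embedding_table)
print(speaker_vector(short_clip, projection).shape, ar_input.shape)
```

Note that no transcript of the reference clip appears anywhere in this input, which is the property that makes the zero-shot setting possible.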

Flow matching

Flow Matching is a type of generative model. In essence, it learns a continuous transformation that turns a simple distribution (such as Gaussian noise) into a complex continuous data distribution. In TTS models it is typically used to convert the discrete tokens generated by an AR transformer into continuous speech features.
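A minimal numerical sketch of the idea, assuming the common straight-line (linear interpolation) path between a noise sample and a data sample. The report's exact formulation is not covered here, so treat the path choice, the toy target map, and every name below as assumptions for illustration. In a real model a neural network would predict the velocity; here a closed-form oracle plays that role so the example can be checked.

```python
import numpy as np


def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its target velocity."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0  # constant along a straight path
    return x_t, target_velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


# Toy setup: each noise sample x0 is paired with the "data" point A * x0 + B.
A, B = 0.5, 3.0


def oracle_velocity(x, t):
    """Exact velocity field for the toy pairing, standing in for a trained network."""
    x0 = (x - t * B) / (1.0 - t + A * t)  # recover the start of the path through (x, t)
    return (A - 1.0) * x0 + B


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))
x1 = A * x0 + B
t = rng.uniform(size=4)

x_t, target = cfm_training_pair(x0, x1, t)
loss = cfm_loss(oracle_velocity(x_t, t.reshape(-1, 1)), target)
samples = euler_sample(oracle_velocity, x0, steps=8)
print(loss, np.allclose(samples, x1))
```

Training minimizes this loss over random pairs and random times; sampling then starts from noise and follows the learned velocity field, which in a TTS system would also be conditioned on the tokens from the AR transformer.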


1. Autoregressive Transformer: generating discrete audio tokens


2. Latent Flow Matching module: from discrete tokens to continuous speech features

The discrete audio tokens output by the autoregressive Transformer then enter the Latent Flow Matching module, which contains two key components:

(1) Flow-VAE: optimizing latent feature representation


(2) Flow Matching Model

