High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

Authors: Joun Yeop Lee1*, Myeonghun Jeong2*, Minchan Kim2*, Ji-Hyun Lee1, Hoon-Young Cho1, Nam Soo Kim2
1Samsung Research, Seoul, Republic of Korea 2Department of ECE and INMC, Seoul National University, Seoul, Republic of Korea

Abstract

We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and a speech prompt into semantic tokens focusing on linguistic content and alignment, and the Speaking module, which captures the timbre of the target voice to generate acoustic tokens from semantic tokens, enriching speech reconstruction. The Interpreting stage employs a transducer for its robustness in aligning text to speech. In contrast, the Speaking stage utilizes a Conformer-based architecture integrated with a Grouped Masked Language Model (G-MLM) to boost computational efficiency. Our experiments verify that this structure surpasses conventional models in the zero-shot scenario in terms of speech quality and speaker similarity.


*These authors contributed equally to this work



Zero-Shot Adaptive TTS

The models are trained on LibriTTS (train-clean-100, train-clean-360, train-other-500).
Audio samples are drawn from the test-clean subset, so all speakers are unseen during training.
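The zero-shot condition above amounts to the training and evaluation speaker sets being disjoint. A minimal sketch of such a check follows. It assumes utterance IDs whose first underscore-separated field is the speaker ID, which is the convention LibriTTS file names appear to follow. The helper names `speaker_of` and `shared_speakers` and the IDs themselves are hypothetical and only serve as an illustration.

```python
from typing import Iterable, Set


def speaker_of(utterance_id: str) -> str:
    # Assumption: the speaker ID is the first underscore-separated field.
    return utterance_id.split("_")[0]


def shared_speakers(train_ids: Iterable[str], test_ids: Iterable[str]) -> Set[str]:
    """Return the speakers that appear in both splits.

    An empty result means every test speaker is unseen during training.
    """
    train_speakers = {speaker_of(u) for u in train_ids}
    test_speakers = {speaker_of(u) for u in test_ids}
    return train_speakers & test_speakers


# Made-up IDs for illustration only, not real corpus entries.
train_ids = ["100_0001_000000_000000", "200_0002_000000_000001"]
test_ids = ["300_0003_000000_000000", "300_0003_000001_000000"]

overlap = shared_speakers(train_ids, test_ids)
print(sorted(overlap))  # an empty list when no test speaker was seen in training
```

Running the same check over the real file lists of the training and test subsets would confirm the split before evaluation.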


Text

A circle of a few hundred feet in circumference was drawn, and each of the party took a segment for his portion.

Ground Truth

Reference (Prompt)


VITS

VALLE-X

Kim et al.

Proposed

He was of sturdy, athletic build and dressed neatly in a suit that was of coarse material but well brushed and cared for.


Then the curtain rises, and it is apparent that we are assisting at an At Home of considerable splendour.


I like trees because they seem more resigned to the way they have to live than other things do.



Text

It has no beauty whatsoever, no specialty of picturesqueness; and all its lines are cramped and poor.

Ground Truth

Reference (Prompt)


VITS

VALLE-X

Kim et al.

Proposed

Whatever you meditate, he probably anticipates it-you know best-and you will find him prepared.


I then remembered that the Professor, before starting, had estimated the length of this underground sea at thirty leagues.


"But it is not with a view to distinction that you should cultivate this talent, if you consult your own happiness.




Controllability

To verify the paralinguistic controllability of each stage, we examined samples generated with mismatched $p_s$ and $p_a$.
- $p_s$ : speech prompt used for semantic token generation (Interpreting).
- $p_a$ : speech prompt used for acoustic token generation (Speaking).
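The role of the two prompts can be sketched as a two-stage call chain. This is only an illustration of the data flow described above, not the authors' implementation: `interpret`, `speak`, `synthesize`, and the `Prompt` class are hypothetical stand-ins, and the "tokens" are placeholder strings that merely record which prompt influenced them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prompt:
    name: str  # hypothetical identifier for a reference utterance


def interpret(text: str, p_s: Prompt) -> List[str]:
    """Stand-in for the Interpreting stage: text + p_s -> semantic tokens."""
    # Placeholder tokens; a real model would emit discrete token IDs.
    return [f"sem({word}|{p_s.name})" for word in text.split()]


def speak(semantic_tokens: List[str], p_a: Prompt) -> List[str]:
    """Stand-in for the Speaking stage: semantic tokens + p_a -> acoustic tokens."""
    return [f"ac({tok}|{p_a.name})" for tok in semantic_tokens]


def synthesize(text: str, p_s: Prompt, p_a: Prompt) -> List[str]:
    # The two stages take their prompts independently, so they may differ.
    return speak(interpret(text, p_s), p_a)


prompt_a = Prompt("utterance_A")
prompt_b = Prompt("utterance_B")

matched = synthesize("hello world", p_s=prompt_a, p_a=prompt_a)
mismatched = synthesize("hello world", p_s=prompt_a, p_a=prompt_b)
print(mismatched[0])
```

In the mismatched case, the semantic tokens depend only on the first prompt and the acoustic stage only adds the influence of the second, which is the separation the samples in this section are meant to demonstrate.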


$p_s$ controllability <Interpreting>

Here, we fix $p_a$ and control the paralinguistic information using $p_s$.
Thus, the generated samples share the acoustic details of $p_a$ (such as speaker identity and timbre),
while the prosody attributes of $p_s$ (such as temporal dynamics and speech rate) are reflected in the generated speech.


Text

It seemed to him that the newspaper managers didn't want genius, but mere plodding and grubbing.

$p_a$ (fixed)


$p_s$ #1

Kim et al. #1

Proposed #1

$p_s$ #2

Kim et al. #2

Proposed #2

$p_s$ #3

Kim et al. #3

Proposed #3

$p_s$ #4

Kim et al. #4

Proposed #4


Text

What one useful thing could I do for a living, for the support of mother and the children?

$p_a$ (fixed)


$p_s$ #1

Kim et al. #1

Proposed #1

$p_s$ #2

Kim et al. #2

Proposed #2

$p_s$ #3

Kim et al. #3

Proposed #3

$p_s$ #4

Kim et al. #4

Proposed #4


Text

But I thought I would make the offer, seeing that youth commonly loves life.

$p_a$ (fixed)


$p_s$ #1

Kim et al. #1

Proposed #1

$p_s$ #2

Kim et al. #2

Proposed #2

$p_s$ #3

Kim et al. #3

Proposed #3

$p_s$ #4

Kim et al. #4

Proposed #4



$p_a$ controllability <Speaking>

Here, we fix the semantic tokens generated using $p_s$ and control the paralinguistic information using $p_a$.
Thus, the generated samples share the prosody attributes of $p_s$ (such as temporal dynamics and speech rate),
while the differing acoustic details of each $p_a$ are reflected in the generated speech.


Text

There cannot be a doubt he received you kindly, for, in fact, you returned without his permission.

$p_s$ (fixed)


$p_a$ #1

Kim et al. #1

Proposed #1

$p_a$ #2

Kim et al. #2

Proposed #2

$p_a$ #3

Kim et al. #3

Proposed #3

$p_a$ #4

Kim et al. #4

Proposed #4


Text

And this was why Kenneth and Beth discovered him conversing with the young woman in the buggy.

$p_s$ (fixed)


$p_a$ #1

Kim et al. #1

Proposed #1

$p_a$ #2

Kim et al. #2

Proposed #2

$p_a$ #3

Kim et al. #3

Proposed #3

$p_a$ #4

Kim et al. #4

Proposed #4


Text

The scout turned to Heyward, and regarded him a moment with unconcealed amazement.

$p_s$ (fixed)


$p_a$ #1

Kim et al. #1

Proposed #1

$p_a$ #2

Kim et al. #2

Proposed #2

$p_a$ #3

Kim et al. #3

Proposed #3

$p_a$ #4

Kim et al. #4

Proposed #4