Guided Discrete Flow Matching for Step-Efficient Zero-Shot Text-to-Speech

Heejin Choi, Joun Yeop Lee, Min-Kyung Kim, Hongsun Yang, Hoon-Young Cho
Samsung Research, Seoul, Republic of Korea

Abstract

We present DFM-TTS, a non-autoregressive framework that applies discrete flow matching directly to multi-level codec tokens and augments training with two lightweight guidance signals. First, a text-guided consistency head (auxiliary CTC) tightens semantic–acoustic alignment. Second, a coarse-to-fine, time-dependent weighting stabilizes multi-codebook learning without modifying the sampler. Under zero-shot evaluation on LibriTTS, DFM-TTS improves intelligibility and speaker similarity over a retrained F5-TTS baseline while maintaining competitive naturalness, especially in the few-step regime (NFE ∈ {4, 8, 16}). Because the guidance acts only during training, per-step inference cost remains unchanged. Overall, the approach offers a simple, parallel path to step-efficient zero-shot TTS in discrete token space.
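
As a rough illustration of what a coarse-to-fine, time-dependent weighting over codebook levels could look like, the sketch below assigns each codebook a loss weight that depends on the flow time t. This page does not specify the schedule DFM-TTS uses, so everything here is an assumption made for illustration: the function names `coarse_to_fine_weights` and `weighted_codebook_loss`, the softmax-over-distance form, the `sharpness` parameter, and the convention that t = 0 is the noisy end and t = 1 the data end.

```python
import numpy as np


def coarse_to_fine_weights(t, num_codebooks, sharpness=4.0):
    """Hypothetical per-codebook loss weights at flow time t in [0, 1].

    Codebook k gets a center c_k spread evenly over [0, 1], with the
    coarsest codebook at 0 and the finest at 1. Weights are a softmax
    over -sharpness * |t - c_k|, so they are positive and sum to 1.
    """
    centers = np.linspace(0.0, 1.0, num_codebooks)
    logits = -sharpness * np.abs(t - centers)
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()


def weighted_codebook_loss(per_codebook_loss, t, sharpness=4.0):
    """Combine per-codebook losses (e.g. cross-entropies) with the weights above."""
    losses = np.asarray(per_codebook_loss, dtype=float)
    w = coarse_to_fine_weights(t, losses.shape[0], sharpness)
    return float(np.dot(w, losses))


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0):
        print(t, np.round(coarse_to_fine_weights(t, 4), 3))
```

Under these assumptions, early time steps put most of the loss weight on the coarse codebooks and later ones shift it toward the fine codebooks. Because such a weighting only rescales training loss terms, it leaves the sampler untouched, which is consistent with the abstract's statement that per-step inference cost is unchanged.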



Baseline Comparison Samples


Text
1. The chaos in which his ardour extinguished itself was a cold indifferent knowledge of himself.
2. What a very bad notion that was of his, I thought to myself, to take soundings just here!
3. Just as he made his cast, he saw the fleeing drake and the pursuing hawk come round the bend.

Samples per text: Reference (Prompt), Ground Truth, CosyVoice (AR), FireRedTTS (AR), MaskGCT (NAR; NFE=32), F5-TTS (NAR; NFE=32), DFM-TTS (NAR; NFE=32)



Text
1. In a heightened degree, therefore, the livery comes to be a badge of servitude, or rather servility.
2. Without his scrapbooks, his chemicals, and his homely untidiness, he was an uncomfortable man.
3. I'll pay twice sixty dollars for the delivery to me of the forged check, and the withdrawal of the prosecution.

Samples per text: Reference (Prompt), Ground Truth, CosyVoice (AR), FireRedTTS (AR), MaskGCT (NAR; NFE=32), F5-TTS (NAR; NFE=32), DFM-TTS (NAR; NFE=32)



Step-Wise Samples


Text
Yes, but if she should have understood, and understood too well, she may talk.

Samples: Reference (Prompt), Ground Truth

NFE steps: 4, 8, 16
Systems per NFE step: F5-TTS, DFM-TTS, DFM-TTS + TG, DFM-TTS + TG + CFTG


Text
Whoever, therefore, is ambitious of distinction in this way ought to be prepared for disappointment.

Samples: Reference (Prompt), Ground Truth

NFE steps: 4, 8, 16
Systems per NFE step: F5-TTS, DFM-TTS, DFM-TTS + TG, DFM-TTS + TG + CFTG