Efficient Streaming TTS Acoustic Model with Depthwise RVQ Decoding Strategies in a Mamba Framework

Joun Yeop Lee, Sangjun Park, Byoung Jin Choi, Ji-Hyun Lee, Min-Kyung Kim, Hoon-Young Cho
Samsung Research, Seoul, Republic of Korea

Abstract

Recent advances in neural codec-based text-to-speech (TTS) systems have achieved remarkable audio quality; however, their large model sizes and heavy computational requirements limit CPU-based on-device deployment. In this work, we present a Mamba-based streaming acoustic model with two novel depthwise decoding strategies for Residual Vector Quantization (RVQ): a Masked Language Model (MLM) approach and an Implicit Neural Representation (INR) approach. The MLM strategy iteratively refines tokens along the code depth axis to enhance audio quality, whereas the INR approach predicts all quantization levels in parallel to reduce computational cost. We further incorporate a zero-shot speaker embedding conditioning mechanism, enabling robust performance on unseen speakers. Experimental results demonstrate comparable or even superior performance on both objective and subjective metrics relative to larger TTS baseline models.
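
For readers unfamiliar with the "code depth axis" mentioned in the abstract, the following is a minimal sketch of residual vector quantization in general. It is not the authors' implementation: the codebook sizes, the randomly generated codebooks, and the function names (`rvq_encode`, `rvq_decode`) are illustrative assumptions. Each frame is represented by one token per depth level, and a depthwise decoding strategy has to predict those tokens, whether over several refinement passes or in a single parallel pass.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(codebooks, vec):
    """Quantize vec depth by depth; each level encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next level only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks, codes):
    """Reconstruct a vector by summing the selected entry from every depth level."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (hypothetical sizes): 4 depth levels, 8 entries each, 3-dim vectors.
rng = random.Random(0)
DEPTH, ENTRIES, DIM = 4, 8, 3
codebooks = [
    [[rng.uniform(-1, 1) / (2 ** d) for _ in range(DIM)] for _ in range(ENTRIES)]
    for d in range(DEPTH)
]

target = [0.3, -0.5, 0.8]
codes = rvq_encode(codebooks, target)   # one token index per depth level
recon = rvq_decode(codebooks, codes)    # sum of the selected codebook entries
print(codes, recon)
```

In this sketch the list `codes` holds the tokens along the depth axis for a single frame; an acoustic model for a codec-based TTS system predicts such a list for every frame.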



Zero-Shot TTS Samples

The proposed models (SMAM+MLM, SMAM+INR, and SMAM+noMLM) are trained on the LibriTTS training subsets (train-clean-100, train-clean-360, and train-other-500).
Audio samples are drawn from the test-clean subset; using speakers unseen during training ensures a zero-shot condition.


Text

Yes, his mother was hostile to the idea, as he had read from her listless silence.

Ground Truth

Reference (Prompt)


VITS

VALLE-X

Lee et al.

Small-E

SMAM+MLM

SMAM+INR

SMAM+noMLM

I am not good enough for you, and you must be kept from the contamination of too intimate society.


It won't be much, but I'm grateful to find a friend. I'm guilty, you know



Text

He came down to her slowly, with fixed, hungry eyes, threading his way amid the Fleece.

Ground Truth

Reference (Prompt)


VITS

VALLE-X

Lee et al.

Small-E

SMAM+MLM

SMAM+INR

SMAM+noMLM

Then the curtain rises, and it is apparent that we are assisting at an At Home of considerable splendour.


It has no beauty whatsoever, no specialty of picturesqueness; and all its lines are cramped and poor.