Efficient Streaming TTS Acoustic Model with Depthwise RVQ Decoding Strategies in a Mamba Framework
Abstract
Recent advances in neural codec-based text-to-speech (TTS) systems have achieved remarkable audio quality; however, their reliance on large model size and heavy computational requirements limits CPU-based on-device deployment. In this work, we present a Mamba-based streaming acoustic model with two novel depthwise decoding strategies for Residual Vector Quantization: a Masked Language Model (MLM) approach and an Implicit Neural Representation (INR) approach. The MLM strategy iteratively refines tokens along the code depth axis to enhance audio quality, whereas the INR approach predicts all quantization levels in parallel to reduce computational cost. We further incorporate a zero-shot speaker embedding conditioning mechanism, enabling robust performance on unseen speakers. Experimental results demonstrate comparable or even superior improvements in both objective and subjective metrics compared to other larger TTS baseline models.
Click here for other works from Samsung Research TTS Team.
Zero-Shot TTS Samples
The proposed models (SMAM+MLM, SMAM+INR, and SMAM+noMLM) are trained on LibriTTS (train-clean-100, train-clean-360, train-other-500).
Audio samples are sampled from the test-clean subset, ensuring a zero-shot condition by using unseen speakers.
Text
Yes, his mother was hostile to the idea, as he had read from her listless silence.
Ground Truth
Reference (Prompt)
VITS
VALLE-X
Lee et al.
Small-E
SMAM+MLM
SMAM+INR
SMAM+noMLM
I am not good enough for you, and you must be kept from the contamination of too intimate society.
It won't be much, but I'm grateful to find a friend. I'm guilty, you know
Text
He came down to her slowly, with fixed, hungry eyes, threading his way amid the Fleece.
Ground Truth
Reference (Prompt)
VITS
VALLE-X
Lee et al.
Small-E
SMAM+MLM
SMAM+INR
SMAM+noMLM
Then the curtain rises, and it is apparent that we are assisting at an At Home of considerable splendour.
It has no beauty whatsoever, no specialty of picturesqueness; and all its lines are cramped and poor.