Hierarchical Timbre-Cadence Speaker Encoder for Zero-shot Speech Synthesis


Joun Yeop Lee, Jae-Sung Bae, Seongkyu Mun, Jihwan Lee, Ji-Hyun Lee, Hoon-Young Cho, Chanwoo Kim
Samsung Research, Seoul, Republic of Korea

Abstract

Although recent zero-shot text-to-speech (zs-TTS) models achieve high speech quality, their speaker similarity is not yet on par. Speaker similarity can be decomposed into two components: an intra-speaker consistent component (timbre) and an inter-utterance varying component (cadence). In this paper, we propose a timbre-cadence speaker encoder for zs-TTS that improves speaker similarity by modeling these two components. To disentangle timbre and cadence more effectively, we employ a hierarchical structure. The cadence embedding is encoded first, using VICReg to enlarge the spread of inter-utterance embeddings within a batch. The timbre embedding is then extracted after subtracting the cadence embedding, with a loss between the timbre embedding and a speaker-ID-based speaker embedding. Additionally, we propose an effective data augmentation method called speaker mixing augmentation, in which two short utterances from different speakers are concatenated to train a more robust zs-TTS model.
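The speaker mixing augmentation described above can be sketched as follows. This is a minimal illustration under assumptions: the paper excerpt only says two short utterances from different speakers are concatenated, so the random ordering and the plain concatenation (no trimming or cross-fading) are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def speaker_mixing_augmentation(wav_a, wav_b, rng=None):
    """Concatenate two short utterances from different speakers.

    The mixed utterance serves as a reference during training so the
    zero-shot TTS model becomes robust to references whose speaker
    identity is not perfectly clean. (Sketch; the paper may also trim
    or cross-fade the waveforms.)
    """
    rng = rng or np.random.default_rng()
    # Randomly pick the concatenation order so neither speaker
    # systematically occupies the start of the reference.
    if rng.random() < 0.5:
        wav_a, wav_b = wav_b, wav_a
    return np.concatenate([wav_a, wav_b])

# Usage: two short utterances, assumed to share one sample rate.
mixed = speaker_mixing_augmentation(np.zeros(16000), np.ones(16000))
```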



Zero-shot Experiments

Text1: The economic embargo against that country, and the general policy of the United States with regard to Cuba.

Reference 1 of speaker 1

Samples: reference, TiCa, TiCa-noSMAug, REF, META, EXTERN [audio players not rendered in this text version]

Reference 2 of speaker 1

Samples: reference, TiCa, TiCa-noSMAug, REF, META, EXTERN [audio players not rendered in this text version]

Reference 1 of speaker 2

Samples: reference, TiCa, TiCa-noSMAug, REF, META, EXTERN [audio players not rendered in this text version]

Reference 2 of speaker 2

Samples: reference, TiCa, TiCa-noSMAug, REF, META, EXTERN [audio players not rendered in this text version]

Text2: Some might be in abeyance, but they had never been repealed, and some were quite freshly imported upon the Statute Book.

Reference 1 of speaker 3

Samples: reference, TiCa, TiCa-noSMAug, REF, META, EXTERN [audio players not rendered in this text version]

Reference 2 of speaker 3

Samples: reference, TiCa, TiCa-noSMAug, REF, META, EXTERN [audio players not rendered in this text version]

Reference 1 of speaker 4

Samples: reference, TiCa, TiCa-noSMAug, REF, META, EXTERN [audio players not rendered in this text version]

Reference 2 of speaker 4

Samples: reference, TiCa, TiCa-noSMAug, REF, META, EXTERN [audio players not rendered in this text version]

Model Configurations

- Text encoder: 512-dimensional phoneme embedding table, three convolution blocks with 512 channels and kernel size 5, and 128 LSTM units per direction.

- Decoder: linear layers with 512 units, three bidirectional LSTM layers with 256 units per direction, and one linear projection layer with 22 units.

- Postnet: five convolution blocks with 256 channels, kernel size 5, and tanh activation.

- Duration predictor: two convolution blocks with 256 channels and kernel size 3, each followed by layer normalization, and a linear projection layer with 1 unit.

- Timbre-Cadence Speaker Encoder: four convolution blocks with [32, 32, 64, 128] channels, kernel size 3, and strides of [2, 2, 1, 1], followed by a unidirectional GRU with 512 units.

- Cadence embedding part: a convolution block with 256 channels and kernel size 1, and an attention pooling layer followed by a 32-dimensional projection layer.

- Timbre embedding part: two convolution blocks with 256 channels, kernel size 1, and [LeakyReLU, ReLU] activations, and an attention pooling layer followed by a 32-dimensional projection layer.

- META, REF: same configuration as in the original papers, except for the embedding dimension (32).

- EXTERN: uses a pre-trained model with a 512-dimensional embedding.
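The hierarchical structure of the cadence and timbre branches can be sketched in numpy. This is a shape-level illustration under assumptions: the random `frames` array stands in for the output of the convolution/GRU stack, the attention-pooling weights are placeholders, and the cadence-subtraction point is inferred from the abstract; only the 32-dimensional embedding size is taken from the configuration above. The VICReg variance term shown is the standard hinge-on-std formulation that enlarges the spread of embeddings within a batch.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32  # embedding dimension, matching the projection layers above

def attention_pool(frames, w):
    """Attention pooling: softmax-weighted average over time frames."""
    scores = frames @ w                      # (T,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ frames                    # (DIM,)

# Placeholder frame-level features standing in for the output of the
# convolution + GRU stack (T frames, DIM channels).
frames = rng.standard_normal((100, DIM))

# Cadence branch: pooled first; regularized with VICReg during training.
cadence = attention_pool(frames, rng.standard_normal(DIM))

# Timbre branch: the cadence embedding is subtracted from the
# frame-level features before pooling, so the residual captures the
# stable, intra-speaker-consistent component.
timbre = attention_pool(frames - cadence, rng.standard_normal(DIM))

def vicreg_variance(embs, eps=1e-4):
    """VICReg variance term: hinge loss pushing each dimension's batch
    std above 1, enlarging inter-utterance spread within a batch."""
    std = np.sqrt(embs.var(axis=0) + eps)
    return np.maximum(0.0, 1.0 - std).mean()

batch = rng.standard_normal((8, DIM))  # a batch of cadence embeddings
loss_var = vicreg_variance(batch)
```

In this sketch the subtraction is what makes the structure hierarchical: the timbre branch only sees what remains after the cadence branch has explained the utterance-varying component.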