Hierarchical Timbre-Cadence Speaker Encoder for Zero-shot Speech Synthesis
Abstract
Although recent zero-shot text-to-speech (zs-TTS) models achieve high speech quality, their speaker similarity still falls short. Speaker similarity can be decomposed into two components: an intra-speaker consistent component (timbre) and an inter-utterance varying component (cadence). In this paper, we propose a timbre-cadence speaker encoder for zs-TTS that improves speaker similarity by modeling both components. To disentangle timbre and cadence more efficiently, we employ a hierarchical structure. The cadence embedding is encoded first, using VICReg to increase the variance of the embeddings across utterances within a batch. Next, the timbre embedding is extracted after subtracting the cadence embedding, and is trained with a loss between the timbre embedding and a speaker-ID-based speaker embedding. Additionally, we propose an effective data augmentation method called speaker mixing augmentation, in which two short utterances from different speakers are concatenated to make the zs-TTS model more robust.
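As a rough illustration of two ideas mentioned in the abstract, the sketch below shows (1) the variance hinge term from VICReg, which penalizes embedding dimensions whose standard deviation across a batch falls below a target, and (2) a minimal form of speaker mixing, where utterances from two different speakers are concatenated. The function names, the `gamma` and `eps` defaults, and the data layout are assumptions made for illustration. This page does not state which VICReg terms or weights are used, nor how mixed utterances are labeled during training.

```python
import math
import random


def vicreg_variance_term(batch, gamma=1.0, eps=1e-4):
    """Variance hinge term in the style of VICReg, averaged over dimensions.

    batch: list of equal-length embedding vectors, one per utterance.
    gamma, eps: illustrative defaults, not values taken from this page.
    """
    n, d = len(batch), len(batch[0])
    total = 0.0
    for j in range(d):
        column = [z[j] for z in batch]
        mean = sum(column) / n
        # Unbiased variance of dimension j across the batch.
        var = sum((x - mean) ** 2 for x in column) / (n - 1)
        # Penalize only when the standard deviation is below gamma.
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def speaker_mix(utterances, rng):
    """Concatenate one utterance from each of two different speakers.

    utterances: dict mapping a speaker ID to a list of waveforms,
    each waveform being a list of samples (hypothetical layout).
    """
    spk_a, spk_b = rng.sample(sorted(utterances), 2)  # two distinct speakers
    wav_a = rng.choice(utterances[spk_a])
    wav_b = rng.choice(utterances[spk_b])
    return (spk_a, spk_b), wav_a + wav_b
```

With this term, a batch whose cadence embeddings have collapsed to the same vector receives a penalty close to `gamma`, while a batch whose embeddings are well spread across utterances receives none.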
Click here for other works from Samsung Research TTS Team.
Zero-shot Experiments
Text1: The economic embargo against that country, and the general policy of the United States with regard to Cuba.
Reference 1 of speaker 1
Spk. Encoder | Sample | Spk. Encoder | Sample |
---|---|---|---|
reference | | TiCa | |
TiCa-noSMAug | | REF | |
META | | EXTERN | |
Reference 2 of speaker 1
Spk. Encoder | Sample | Spk. Encoder | Sample |
---|---|---|---|
reference | | TiCa | |
TiCa-noSMAug | | REF | |
META | | EXTERN | |
Reference 1 of speaker 2
Spk. Encoder | Sample | Spk. Encoder | Sample |
---|---|---|---|
reference | | TiCa | |
TiCa-noSMAug | | REF | |
META | | EXTERN | |
Reference 2 of speaker 2
Spk. Encoder | Sample | Spk. Encoder | Sample |
---|---|---|---|
reference | | TiCa | |
TiCa-noSMAug | | REF | |
META | | EXTERN | |
Text2: Some might be in abeyance, but they had never been repealed, and some were quite freshly imported upon the Statute Book.
Reference 1 of speaker 3
Spk. Encoder | Sample | Spk. Encoder | Sample |
---|---|---|---|
reference | | TiCa | |
TiCa-noSMAug | | REF | |
META | | EXTERN | |
Reference 2 of speaker 3
Spk. Encoder | Sample | Spk. Encoder | Sample |
---|---|---|---|
reference | | TiCa | |
TiCa-noSMAug | | REF | |
META | | EXTERN | |
Reference 1 of speaker 4
Spk. Encoder | Sample | Spk. Encoder | Sample |
---|---|---|---|
reference | | TiCa | |
TiCa-noSMAug | | REF | |
META | | EXTERN | |
Reference 2 of speaker 4
Spk. Encoder | Sample | Spk. Encoder | Sample |
---|---|---|---|
reference | | TiCa | |
TiCa-noSMAug | | REF | |
META | | EXTERN | |
Model Configurations
- Text encoder: 512-dimensional phoneme embedding table, three convolution blocks with 512 channels and kernel size 5, and a bidirectional LSTM with 128 units per direction.
- Decoder: linear layers with 512 units, three bidirectional LSTM layers with 256 units per direction, and one linear projection layer with 22 units.
- Postnet: five convolution blocks with 256 channels, kernel size 5, and tanh activation.
- Duration Prediction: two convolution blocks with 256 channels and kernel size 3 followed by layer norm, and a linear projection layer with 1 unit.
- Timbre-Cadence Speaker Encoder: four convolution blocks with [32, 32, 64, 128] channels, kernel size 3, and [2, 2, 1, 1] strides, and a unidirectional GRU with 512 units.
- Cadence embedding part: one convolution block with 256 channels and kernel size 1, then an attention pooling layer followed by a projection layer with 32 dimensions.
- Timbre embedding part: two convolution blocks with 256 channels and kernel size 1 with [LeakyReLU, ReLU] activations, then an attention pooling layer followed by a projection layer with 32 dimensions.
- META, REF: same configuration as in the respective original papers, except for the embedding dimension (32).
- EXTERN: uses a pre-trained model with an embedding dimension of 512.
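
For readers who want to reproduce the speaker encoder, its hyperparameters from the list above could be collected in a plain configuration object, as in the sketch below. The key names and helper functions are hypothetical. The frame-count helper additionally assumes that the strides apply along the time axis and that each block uses "same"-style padding, neither of which is stated on this page.

```python
from math import ceil, prod

# Hyperparameters copied from the configuration list; key names are illustrative.
SPEAKER_ENCODER = {
    "conv_channels": [32, 32, 64, 128],
    "conv_kernel_size": 3,
    "conv_strides": [2, 2, 1, 1],
    "gru_units": 512,
    "cadence": {
        "conv_channels": 256,
        "conv_kernel_size": 1,
        "embedding_dim": 32,
    },
    "timbre": {
        "conv_blocks": 2,
        "conv_channels": 256,
        "conv_kernel_size": 1,
        "activations": ["LeakyReLU", "ReLU"],
        "embedding_dim": 32,
    },
}


def downsampling_factor(strides):
    """Overall reduction factor along a strided axis."""
    return prod(strides)


def frames_after_convs(num_frames, strides):
    """Frames left after the strided blocks.

    Assumes "same"-style padding, so a block with stride s maps
    T frames to ceil(T / s) frames.
    """
    for s in strides:
        num_frames = ceil(num_frames / s)
    return num_frames
```

Under these assumptions the four blocks reduce the reference sequence by a factor of 4 before the GRU, so a 100-frame reference would reach the GRU as 25 frames.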