Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis


Jae-Sung Bae, Joun Yeop Lee, Ji-Hyun Lee, Seongkyu Mun, Taehwa Kang, Hoon-Young Cho, Chanwoo Kim
Samsung Research
Seoul, Republic of Korea

Abstract

Previous work on zero-shot text-to-speech (ZS-TTS) has attempted to improve these systems by enlarging the training data through crowd-sourcing or by augmenting existing speech data. However, the use of low-quality data has led to a decline in overall system performance. To avoid such degradation, instead of directly augmenting the input data, we propose latent filling (LF), a method that applies simple but effective latent space data augmentation in the speaker embedding space of the ZS-TTS system. By incorporating a consistency loss, LF can be seamlessly integrated into existing ZS-TTS systems without the need for additional training stages. Experimental results show that LF significantly improves speaker similarity while preserving speech quality.
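
As a rough illustration of the idea in the abstract, the sketch below augments a batch of speaker embeddings by interpolating between speakers and adding noise (the two operations named in the ablation labels on this page), and computes a consistency term between a target embedding and the embedding of the generated speech. This page does not give the exact formulation, so everything here is an assumption made for illustration: the function names `latent_fill` and `consistency_loss`, the probabilities `p_interp` and `p_noise`, the noise scale `noise_std`, the uniform interpolation weight, and the cosine-distance form of the loss.

```python
import numpy as np


def latent_fill(spk_emb, rng, noise_std=0.05, p_interp=0.5, p_noise=0.5):
    """Augment speaker embeddings of shape (batch, dim).

    Hypothetical sketch: all default values are illustrative assumptions,
    not settings taken from the paper.
    """
    out = spk_emb.copy()
    batch = out.shape[0]

    # Interpolation: mix each embedding with another one from the batch.
    if rng.random() < p_interp:
        perm = rng.permutation(batch)
        alpha = rng.uniform(0.0, 1.0, size=(batch, 1))
        out = alpha * out + (1.0 - alpha) * out[perm]

    # Noise adding: perturb the embedding with zero-mean Gaussian noise.
    if rng.random() < p_noise:
        out = out + rng.normal(0.0, noise_std, size=out.shape)

    return out


def consistency_loss(target_emb, generated_emb, eps=1e-8):
    """Mean cosine distance between two batches of embeddings.

    Assumed form of a consistency term: it is small when the embedding
    re-extracted from the generated speech matches the target embedding.
    """
    num = np.sum(target_emb * generated_emb, axis=-1)
    den = (np.linalg.norm(target_emb, axis=-1)
           * np.linalg.norm(generated_emb, axis=-1) + eps)
    return float(np.mean(1.0 - num / den))


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))       # stand-in for real speaker embeddings
aug = latent_fill(emb, rng)         # embeddings that need no matching audio
loss = consistency_loss(aug, aug)   # near zero for identical embeddings
```

In a real system the second argument of `consistency_loss` would come from a speaker encoder applied to the synthesized speech; that encoder is outside the scope of this sketch.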

Contents

  1. Intra-lingual Zero-shot Experiments (English -> English)
  2. Cross-lingual Zero-shot Experiments (Korean -> English)
  3. Ablation Study

1. Intra-lingual Zero-shot Experiments (English -> English)


Text: These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.

Reference | GT | GT-re | SC-GlowTTS | YourTTS | Baseline | Baseline+CS | Baseline+PS | Baseline+LF (Proposed)

Text: The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows.

Reference | GT | GT-re | SC-GlowTTS | YourTTS | Baseline | Baseline+CS | Baseline+PS | Baseline+LF (Proposed)

Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

Reference | GT | GT-re | SC-GlowTTS | YourTTS | Baseline | Baseline+CS | Baseline+PS | Baseline+LF (Proposed)

Text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.

Reference | GT | GT-re | SC-GlowTTS | YourTTS | Baseline | Baseline+CS | Baseline+PS | Baseline+LF (Proposed)



2. Cross-lingual Zero-shot Experiments (Korean -> English)

* denotes Korean speech samples.


Text: Hawkeye started, and dropped his rifle, when, directed by the finger of his companion, the stranger came under his view.

Reference* | GT* | GT-re* | SC-GlowTTS | YourTTS | Baseline | Baseline+CS | Baseline+PS | Baseline+LF (Proposed)

Text: "I'm glad you like it," says Wylder, chuckling benignantly on it, over his shoulder.

Reference* | GT* | GT-re* | SC-GlowTTS | YourTTS | Baseline | Baseline+CS | Baseline+PS | Baseline+LF (Proposed)

Text: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.

Reference* | GT* | GT-re* | SC-GlowTTS | YourTTS | Baseline | Baseline+CS | Baseline+PS | Baseline+LF (Proposed)

Text: Even though the show is seriously exacting, physically demanding.

Reference* | GT* | GT-re* | SC-GlowTTS | YourTTS | Baseline | Baseline+CS | Baseline+PS | Baseline+LF (Proposed)



3. Ablation Study


Text: Ms. Anderson yesterday put a brave face on the departure.

Reference | GT | GT-re | Baseline+LF | w/o noise adding | w/o interpolation

Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

Reference | GT | GT-re | Baseline+LF | w/o noise adding | w/o interpolation

Text: Aristotle thought that the rainbow was caused by reflection of the sun's rays by the rain.

Reference | GT | GT-re | Baseline+LF | w/o noise adding | w/o interpolation