Audio samples: "MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-to-Speech System via Disentangled Style Token"


Authors

Heejin Choi (Samsung Research, Seoul, Republic of Korea)

Jae-Sung Bae (Samsung Research, Seoul, Republic of Korea)

Joun Yeop Lee (Samsung Research, Seoul, Republic of Korea)

Seongkyu Mun (Samsung Research, Seoul, Republic of Korea)

Jihwan Lee (Samsung Research, Seoul, Republic of Korea)

Hoon-Young Cho (Samsung Research, Seoul, Republic of Korea)

Chanwoo Kim (Samsung Research, Seoul, Republic of Korea)



Audio Samples

Comparison with the baseline methods

EN2KR - Synthesized Korean emotional speech from English neutral-only speakers


Speaker: p339
Conditions: Neutral, Happy, Sad, Angry
Texts (one sentence per condition, in order):
1. 좌익수로 가장 많은 경기에 나섰다. (jwaigsulo gajang manh-eun gyeong-gie naseossda.) "He appeared in the most games as a left fielder."
2. 동생 뼈는 아주 튼튼하구나! (dongsaeng ppyeoneun aju teunteunhaguna!) "Little brother, your bones are really strong!"
3. 편지로라도 그의 어머니에게 하소연할 수도 없게 된 오늘날 신세이다. (pyeonjilolado geuui eomeoniege hasoyeonhal sudo eobsge doen oneulnal sinseida.) "Such is my situation today that I cannot even pour out my troubles to his mother by letter."
4. 비단을 더러운 발로 아주 못쓰게 해놓았다. (bidan-eul deoleoun ballo aju mos-sseuge haenoh-assda.) "The silk was completely ruined by dirty feet."
Systems compared: LB, GST, GST-C, Proposed

Speaker: p363
Conditions: Neutral, Happy, Sad, Angry
Texts (one sentence per condition, in order):
1. 뺨과 턱에다 붙이고 활 시위를 당기세요. (ppyamgwa teog-eda but-igo hwal siwileul dang-giseyo.) "Hold it against your cheek and chin and draw the bowstring."
2. 난 널 싫어한 적이 없어. (nan neol silh-eohan jeog-i eobs-eo.) "I have never hated you."
3. 저는 그런 적 없어요. (jeoneun geuleon jeog eobs-eoyo.) "I have never done that."
4. 이제 집에 가고 싶단 말이야! (ije jib-e gago sipdan mal-iya!) "I'm telling you, I want to go home now!"
Systems compared: LB, GST, GST-C, Proposed



KR2EN - Synthesized English emotional speech from Korean neutral-only speakers


Speaker: 0015_OES
Conditions: Neutral, Happy, Sad, Angry
Text: Many complicated ideas about the rainbow have been formed.
Systems compared: LB, GST, GST-C, Proposed

Speaker: 0010_CST
Conditions: Neutral, Happy, Sad, Angry
Text: Sentencing is also being considered in that review.
Systems compared: LB, GST, GST-C, Proposed

Hyper-parameters

The architecture of our proposed model was based on LPCTron, with the CBHG block in the encoder replaced by 3 convolution layers with 512 channels and a kernel size of 5, followed by a single bi-directional LSTM layer with 128 cells.
For the reference encoder, we used the same architecture as in GST. The multi-head attention in the emotion encoder used 4 heads and an output embedding dimension of 256.
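As a rough illustration of how these sizes fit together, the following sketch runs GST-style multi-head attention in which the reference embedding acts as the query over a small set of style tokens. It is a minimal, dependency-free approximation, not the paper's implementation: the tokens here serve directly as keys and values (a real model would apply learned projections), and all values are random placeholders.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Sizes taken from the description: 4 heads, 256-d output embedding,
# 5 style tokens of dimension 64. Everything else is a placeholder.
NUM_HEADS, EMB_DIM, NUM_TOKENS, TOKEN_DIM = 4, 256, 5, 64
HEAD_DIM = EMB_DIM // NUM_HEADS  # 64 dimensions per head

random.seed(0)
# Reference-encoder output acts as the single query vector per utterance.
query = [random.gauss(0, 1) for _ in range(EMB_DIM)]
# Learnable style tokens; training would update these parameters.
tokens = [[random.gauss(0, 1) for _ in range(TOKEN_DIM)]
          for _ in range(NUM_TOKENS)]

# Each head scores a 64-d slice of the query against every token,
# mixes the tokens with softmax weights, and the heads are
# concatenated back into a 256-d style embedding.
style_embedding = []
for h in range(NUM_HEADS):
    q = query[h * HEAD_DIM:(h + 1) * HEAD_DIM]
    scores = [dot(q, t) / math.sqrt(HEAD_DIM) for t in tokens]
    weights = softmax(scores)
    mixed = [sum(w * t[i] for w, t in zip(weights, tokens))
             for i in range(TOKEN_DIM)]
    style_embedding.extend(mixed)

assert len(style_embedding) == EMB_DIM
```

The softmax weights give a soft, differentiable selection over the token set, which is what lets the tokens specialize to styles during training.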
The token embedding dimension was set to 64. To use the outputs of the speaker and language look-up tables as disentangled tokens, their dimensions matched those of the token embeddings. The emotion and residual token sets each contained five tokens, which were initialized orthogonally. The speaker and language embeddings were projected into 256 dimensions by a linear layer of the corresponding dimension size.
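The orthogonal initialization of the five tokens can be sketched with classical Gram-Schmidt: start from random 64-d vectors and orthonormalize them so every pair of tokens begins at a zero dot product. This is a simple stand-in for a framework initializer (e.g. an orthogonal init routine), not the authors' code.

```python
import math
import random

def gram_schmidt(vectors):
    """Orthonormalize a list of vectors via classical Gram-Schmidt."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis

random.seed(0)
NUM_TOKENS, TOKEN_DIM = 5, 64  # five tokens per set, 64-d embeddings
raw = [[random.gauss(0, 1) for _ in range(TOKEN_DIM)]
       for _ in range(NUM_TOKENS)]
tokens = gram_schmidt(raw)

# Every pair of tokens is (numerically) orthogonal, so each token
# starts out pointing in an independent direction of the style space.
for i in range(NUM_TOKENS):
    for j in range(i + 1, NUM_TOKENS):
        d = sum(x * y for x, y in zip(tokens[i], tokens[j]))
        assert abs(d) < 1e-9
```

Starting the tokens in mutually independent directions encourages each one to capture a distinct style factor rather than collapsing onto the same axis.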