Tao Li, Xinsheng Wang, Qicong Xie, Mingqi Jiang, Yunlin Chen, Lei Xie
Northwestern Polytechnical University
Xi’an Jiaotong University
Mobvoi AI Lab

0. Contents

  1. Abstract
  2. Demos -- Comparison with other methods
  3. Demos -- The necessity of emotion embedding and prosody compensation
  4. Demos -- The effectiveness of GC block
  5. Additional demos -- Transfer emotion to a new target speaker who only has 200 sentences through simple fine-tuning

1. Abstract

Cross-speaker emotion transfer speech synthesis aims to synthesize emotional speech for a target speaker by transferring the emotion from a reference recorded by another (source) speaker. In this task, extracting a speaker-independent emotion embedding from the reference plays an important role. However, the emotional information in such an embedding tends to be weakened in the process of squeezing out the source speaker's timbre information. In this paper, a prosody compensation module (PCM) is proposed to compensate for the emotional information lost from the disentangled emotion embedding. Specifically, the PCM tries to obtain speaker-independent emotional information from the intermediate feature of a pre-trained ASR model. To this end, a prosody compensation encoder with global context (GC) blocks is introduced to obtain global emotional information from the ASR model's intermediate feature. Experiments demonstrate that the proposed PCM can effectively supply emotional information to the emotion embedding while maintaining the timbre of the target speaker. Comparisons with state-of-the-art models show that the proposed method is clearly superior on the cross-speaker emotion transfer task.

1.1 The structure of the proposed model:

1.2 The structure of global context (GC) blocks:
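As a rough illustration of what a global context block computes, the sketch below follows the generic GC block design (attention pooling over all positions, a bottleneck transform, and a broadcast add) applied to a sequence of frames. It is an assumption that the blocks in the prosody compensation encoder follow this generic design; the function names, weight shapes, and sizes (T, C, R) are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def layer_norm(vector, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


def gc_block(frames, w_key, w_down, w_up):
    """Apply one generic global context block to a (T x C) frame sequence.

    frames: list of T frames, each a list of C floats
    w_key:  length-C vector that scores each frame for attention pooling
    w_down: (C/R x C) bottleneck projection
    w_up:   (C x C/R) projection back to C channels
    Returns the fused frames and the attention weights over frames.
    """
    # 1) Context modeling: attention-pool all frames into one global vector.
    scores = [sum(k * x for k, x in zip(w_key, frame)) for frame in frames]
    weights = softmax(scores)
    channels = len(frames[0])
    context = [
        sum(a * frame[c] for a, frame in zip(weights, frames))
        for c in range(channels)
    ]

    # 2) Transform: bottleneck -> layer norm -> ReLU -> expand.
    hidden = [max(0.0, h) for h in layer_norm(matvec(w_down, context))]
    delta = matvec(w_up, hidden)

    # 3) Fusion: add the same global vector to every frame (broadcast add).
    fused = [[x + d for x, d in zip(frame, delta)] for frame in frames]
    return fused, weights


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
T, C, R = 5, 8, 2
frames = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(T)]
w_key = [rng.gauss(0.0, 0.1) for _ in range(C)]
w_down = [[rng.gauss(0.0, 0.1) for _ in range(C)] for _ in range(C // R)]
w_up = [[rng.gauss(0.0, 0.1) for _ in range(C // R)] for _ in range(C)]
out, weights = gc_block(frames, w_key, w_down, w_up)
```

Because the same pooled vector is added to every frame, each output frame carries utterance-level information, which is one plausible reading of how a GC block helps capture global emotional information.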

2. Demos -- Comparison with other methods

Corresponding to section 4.1 in the paper, several samples synthesized by the proposed CSPC and other compared methods on the emotion transfer task are listed below.

Emotion Emotion reference Target speaker reference Multi-R CSET PB CSPC (Proposed)
Surprise Text: 什么?大个居然变成石头啦!(English: What? The big guy actually turned to stone!)
Happy Text: 谢谢您,我会争取做得更好,也欢迎你随时来提问。 (English: Thank you. I will try to do better, and you are welcome to ask questions at any time.)
Sadness Text: 新买的键盘,回车键就坏了。 (English: The Enter key on my newly bought keyboard is already broken.)
Angry Text: 这人如此狂妄。(English: This man is so arrogant.)
Disgust Text: 见到她能绕道走,就尽量绕道走。(English: When you see her, take a detour if you possibly can.)
Fear Text: 孩子如果你害怕的话就别看吧。(English: Son, if you're scared, don't look.)

Short summary: Compared with Multi-R, CSET, and PB, the proposed CSPC achieves a good balance between maintaining the target speaker's timbre and enriching the transferred emotional expression.

3. Demos -- The necessity of emotion embedding and prosody compensation

Corresponding to section 4.2.1 in the paper, samples synthesized on the emotion transfer task by CSPC and by its variants without the emotion embedding (w/o EE) or without the prosody compensation embedding (w/o PCE) are listed below.

Emotion Emotion reference Target speaker reference w/o EE w/o PCE CSPC (Proposed)
Surprise Text: 天哪!梦中的一切成真了!(English: My god! Everything in the dream came true!)
Happy Text: 散散步,呼吸下新鲜空气,整个人都神清气爽了呢。(English: After a walk and some fresh air, I feel completely refreshed.)
Sadness Text: 我的显示器又花屏了。(English: My monitor is blurred again.)
Angry Text: 快跟我回去!(English: Come back with me!)
Disgust Text: 看见你我就想踹飞你。(English: When I see you, I want to kick you.)
Fear Text: 走开,不要靠近我。(English: Go away, don't come near me.)

Short summary: The prosody compensation embedding provides extra emotional information to the emotion embedding, and the proposed prosody compensation method effectively improves emotion transfer performance while maintaining the target speaker’s voice.

4. Demos -- The effectiveness of GC block

In terms of emotion similarity, CSPC clearly outperforms the variant without the GC block (w/o GC).

In terms of speaker similarity, dropping the GC block brings only a slight increase, which is not significant.

Corresponding to section 4.2.2 in the paper, samples synthesized to demonstrate the effectiveness of the GC block are listed below.

Emotion Emotion reference Target speaker reference w/o GC CSPC (Proposed)
Surprise Text: 啊!真残忍!(English: Aah! How cruel!)
Happy Text: 姐姐说话温柔细声细气的,给人的感觉态度很好哦。(English: My elder sister speaks softly and gently, which gives the impression of a very good attitude.)
Sadness Text: 我的显示器又花屏了。(English: My monitor is blurred again.)
Angry Text: 还有什么事!(English: Is there anything else!)
Disgust Text: 我受不了了,赶紧滚!(English: I can't take it anymore. Get out of here!)
Fear Text: 我害怕极了,连大气都不敢出。(English: I'm so scared, I don't dare to breathe.)

Short summary: In terms of emotional expressiveness, CSPC significantly outperforms the w/o GC variant, with no significant effect on speaker similarity.

5. Additional demos -- Transfer emotion to a new target speaker who only has 200 sentences through simple fine-tuning

The following samples are synthesized by transferring the emotion to a new target speaker who has only 200 utterances.

Target speaker reference Surprise Happy Sad Angry Disgust Fear
Source speaker record Source speaker record Source speaker record Source speaker record Source speaker record Source speaker record
Text: 这么大的一颗子弹!(English: Such a big bullet!) Text: 有西瓜吃喽,西瓜西瓜,我的最爱,我都流口水了。(English: There is watermelon to eat, watermelon, watermelon, my favorite, I am drooling.) Text: 我的作文,被老师当作笑话看待。(English: My composition is treated as a joke by the teacher.) Text: 谁跟你开玩笑!(English: Who's joking with you!) Text: 把这恶心的东西拿走。(English: Take this disgusting thing away.) Text: 求你把枪放下。(English: Please put down the gun.)
Transfer sample Transfer sample Transfer sample Transfer sample Transfer sample Transfer sample
Text: 什么,大个居然变成石头啦!(English: What? The big guy actually turned to stone!) Text: 孩纸还是去上学吧,九年义务教育是一定要完成哒!(English: Children should go to school. The nine-year compulsory education must be completed!) Text: 因为他的原因,球队输掉了比赛。(English: Because of him, the team lost the game.) Text: 讨厌的东西。(English: Annoying things.) Text: 面无表情,看都不想看你。(English: Expressionless, I don't even want to look at you.) Text: 我不敢再看,双腿发软。(English: I didn't dare to look again, my legs felt weak.)
Transfer sample Transfer sample Transfer sample Transfer sample Transfer sample Transfer sample
Text: 天哪!梦中的一切成真了!(English: My god! Everything in the dream came true!) Text: 散散步,呼吸下新鲜空气,整个人都神清气爽了呢。(English: After a walk and some fresh air, I feel completely refreshed.) Text: 我新买的手机,就被我摔碎了屏幕。(English: I smashed the screen of my new mobile phone.) Text: 我偏偏要捣乱。(English: I just want to make trouble.) Text: 别烦我,从我面前消失。(English: Don't bother me, get out of my sight.) Text: 我那颗忐忑不安的心越跳越快。(English: My uneasy heart beats faster and faster.)
Transfer sample Transfer sample Transfer sample Transfer sample Transfer sample Transfer sample

Short summary: With simple fine-tuning, the proposed model generalizes well to a new target speaker who has only a small amount of data, transferring the emotion to the new speaker while maintaining that speaker’s timbre.