Controllable Emotion Transfer For End-to-End Speech Synthesis

Abstract:

Learning an emotion embedding space from reference audio is a straightforward approach to emotion transfer in encoder-decoder emotional text-to-speech (TTS) systems. However, the emotion transferred to the synthetic speech is often neither accurate nor expressive enough, with confusion between emotion categories. Moreover, it is hard to select a reference that delivers the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers, one after the reference encoder and one after the decoder output, to enhance the emotion-discriminative ability of the emotion embedding and of the predicted mel-spectrum. Second, we adopt a style loss to measure the difference between the generated and reference mel-spectra. Because the emotion embedding can be viewed as a feature map of the mel-spectrum, the emotion strength of the synthetic speech can be controlled by adjusting the value of the emotion embedding. Experiments on emotion transfer and strength control show that speech synthesized by the proposed method is more accurate and expressive, with fewer emotion category confusions, and that its control of emotion strength is more salient to listeners.


1. The architecture of the proposed model:

[Figure: architecture of the proposed model]
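The abstract mentions a style loss that measures the difference between the generated and reference mel-spectra, but this page does not give its formula. The sketch below assumes a Gram-matrix style loss of the kind used in neural style transfer; that choice, the function names `gram_matrix` and `style_loss`, and the list-of-frames input format are all assumptions for illustration, not the authors' implementation.

```python
def gram_matrix(mel):
    """Gram matrix of a mel-spectrogram given as a list of frames.

    Each frame is a list of n_mels floats. Entry (i, j) is the
    time-averaged product of mel channels i and j, so spectrograms
    of different lengths yield matrices of the same shape.
    """
    n_frames = len(mel)
    n_mels = len(mel[0])
    return [
        [sum(frame[i] * frame[j] for frame in mel) / n_frames
         for j in range(n_mels)]
        for i in range(n_mels)
    ]


def style_loss(generated, reference):
    """Mean squared difference between the two Gram matrices (assumed form)."""
    g_gen = gram_matrix(generated)
    g_ref = gram_matrix(reference)
    n = len(g_gen)
    return sum(
        (g_gen[i][j] - g_ref[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    ) / (n * n)


# Toy example with 2 mel channels; real mel-spectra have many more.
reference = [[1.0, 0.0], [0.0, 1.0]]
generated = [[1.0, 1.0]]
print(style_loss(generated, reference))
```

Averaging over time in the Gram matrix is one way to compare utterances of different durations; whether the proposed model does this is not stated here.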

2. Demo of style transfer for emotional TTS:

To enable a fair comparison, we synthesize the same text in six emotions, so that listeners can focus on the emotion conveyed by the acoustics. The text (in Chinese) is: 让那些小主顾们等一等到吧。 (roughly, "Let those little customers wait a moment.")
emotion | Reference audio | Prosody Tacotron | + Lcls_src | + Lcls_tgt | + Lcls_src + Lcls_tgt | Ltotal
surprise
happy
sad
angry
disgust
fear

3. Demo of emotion strength control in emotional TTS:

To enable a fair comparison, we synthesize the same text in six emotions and three strengths, so that listeners can focus on the emotion conveyed by the acoustics. The text (in Chinese) is: 让那些小主顾们等一等到吧。 (roughly, "Let those little customers wait a moment.")
emotion | RA-Tacotron (Low) | proposed (Low) | RA-Tacotron (Medium) | proposed (Medium) | RA-Tacotron (Strong) | proposed (Strong)
surprise
happy
sad
angry
disgust
fear
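
According to the abstract, emotion strength is controlled by adjusting the value of the emotion embedding. A minimal sketch of that idea follows, assuming the adjustment is a plain scalar multiplication of the embedding vector; the function name `scale_emotion_embedding` and the example values are hypothetical, and the actual mapping from Low, Medium, and Strong to numeric values is not given on this page.

```python
def scale_emotion_embedding(embedding, strength):
    """Return a copy of the embedding multiplied by a strength factor.

    Assumption: a larger strength yields a stronger emotion, matching
    the "larger is stronger" convention used in the story demo.
    """
    if strength < 0:
        raise ValueError("strength must be non-negative")
    return [strength * value for value in embedding]


# Hypothetical 2-dimensional embedding; real embeddings are larger.
embedding = [0.2, -0.4]
print(scale_emotion_embedding(embedding, 2.0))  # [0.4, -0.8]
```

The scaled vector would then replace the original embedding as the conditioning input to the decoder.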

4. Demo of continuous emotion strength control:

A synthetic story generated with the proposed emotion strength control, illustrating how the approach can make audiobooks more expressive. Each sentence is associated with an emotion type and strength (larger is stronger).
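【neutral】有一天我走路去学校上学。 (One day I was walking to school.)
【neutral】从家到学校要走一段好长的山路。 (The way from home to school includes a long stretch of mountain road.)
【happy-0.1】还好路边风景秀丽,开满了野花。 (Luckily the roadside scenery was beautiful, with wildflowers in full bloom.)
【happy-1.0】我开心极了,边走边哼着小曲儿。 (I was so happy, humming a tune as I walked.)
【happy-1.5】一想到明天就是周六,我高兴的快跳了起来。 (At the thought that tomorrow was Saturday, I was so happy I almost jumped for joy.)
【happy-2.0】约好了和朋友们明天出去玩,真的太棒了! (I had plans to go out with my friends tomorrow. It was really great!)
【surprise-2.0】突然,路边出现了一个金光闪闪的箱子。 (Suddenly, a glittering golden box appeared by the roadside.)
【surprise-1.0】好奇怪啊,我每天都走这条路,从来没见过这个箱子啊。 (How strange. I walk this road every day and have never seen this box.)
【surprise-0.5】我好奇的打开了它。 (Curious, I opened it.)
【fear-0.5】一个猴子从箱子里跳了出来,把我吓了一跳。 (A monkey jumped out of the box and gave me a fright.)
【fear-1.0】它样子怪怪的,还拦住了我的去路。 (It looked odd, and it blocked my way.)
【surprise-1.0】猴子突然说: (The monkey suddenly said:)
【sad-0.5】求求你,救救我吧,我是王子,被老巫婆施加了魔咒。 (Please, save me. I am a prince, and an old witch put a spell on me.)
【sad-1.0】你不救我,我就永远是一只猴子了。 (If you don't save me, I will be a monkey forever.)
【sad-2.0】求求你了,求求你了,你只要把我头上的紧箍咒拿掉就好。 (Please, please, you only need to take the tight band off my head.)
【sad-2.5】你救了我,我会答应你所有的愿望。 (If you save me, I will grant all your wishes.)
【disgust-2.0】“好吧,好吧,我来救你”。 ("All right, all right, I'll save you.")
【neutral】我取掉了猴子头上的紧箍咒。 (I removed the band from the monkey's head.)
【surprise-2.0】猴子果然变成了一个英俊的王子。 (Sure enough, the monkey turned into a handsome prince.)
【happy-2.0】太感谢你了!太感谢你了! (Thank you so much! Thank you so much!)
【happy-1.0】王子高兴的说。 (said the prince happily.)
【happy-1.5】你有什么愿望呢?我都能帮你实现。 (What is your wish? I can make any wish come true.)
【surprise-2.5】能让我会飞吗? (Can you make me able to fly?)
【angry-1.5】王子生气的说: (The prince said angrily:)
【angry-2.0】除了这个忙,我都能帮你! (I can help you with anything except that!)
【neutral】我睁开了眼,原来这是一场梦。 (I opened my eyes. It had all been a dream.)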
Story with continuous emotion strength control
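
The story script above tags each sentence as 【emotion-strength】, with 【neutral】 carrying no strength. As a rough illustration of how such a script could be turned into per-sentence synthesis inputs, the sketch below parses one tagged line into an emotion label, an optional strength, and the text. The function name `parse_tagged_line` and the choice to return `None` for a missing strength are assumptions, not part of the original system.

```python
import re

# 【emotion】 or 【emotion-strength】 followed by the sentence text.
TAG_PATTERN = re.compile(r"^【([a-z]+)(?:-(\d+(?:\.\d+)?))?】(.*)$")


def parse_tagged_line(line):
    """Split a tagged story line into (emotion, strength, text).

    strength is a float, or None when the tag has no numeric part
    (as with 【neutral】).
    """
    match = TAG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"no emotion tag found in: {line!r}")
    emotion, strength, text = match.groups()
    return emotion, (float(strength) if strength is not None else None), text


print(parse_tagged_line("【happy-1.5】你有什么愿望呢?我都能帮你实现。"))
print(parse_tagged_line("【neutral】我取掉了猴子头上的紧箍咒。"))
```

Each parsed triple could then select an emotion reference and a strength for one sentence of the audiobook.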