Controllable Emotion Transfer For End-to-End Speech Synthesis
Abstract:
An emotion embedding space learned from references is a straightforward approach to emotion transfer in encoder-decoder emotional text-to-speech (TTS) systems. However, the emotion transferred to the synthetic speech is not accurate or expressive enough, and emotion categories are often confused. Moreover, it is hard to select an appropriate reference that delivers the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we add two emotion classifiers, one after the reference encoder and one after the decoder output, to enhance the emotion-discriminative ability of the emotion embedding and the predicted mel-spectrum. Second, we adopt a style loss to measure the difference between the generated and reference mel-spectra. Because the emotion embedding can be viewed as a feature map of the mel-spectrum, the emotion strength of the synthetic speech can be controlled by adjusting the value of the emotion embedding. Experiments on emotion transfer and strength control show that speech synthesized by the proposed method is more accurate and expressive, with fewer emotion category confusions, and that the control of emotion strength is more salient to listeners.
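To make the style-loss idea concrete, here is a minimal NumPy sketch. This page does not give the formula, so the sketch assumes the Gram-matrix formulation commonly used in image style transfer and, for simplicity, applies it directly to toy mel-spectrograms rather than to learned feature maps. The function names `gram_matrix` and `style_loss`, the normalization, and the array shapes are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def gram_matrix(feats: np.ndarray) -> np.ndarray:
    """Gram matrix of a (channels, frames) feature map, normalized by its size."""
    channels, frames = feats.shape
    return feats @ feats.T / (channels * frames)


def style_loss(generated: np.ndarray, reference: np.ndarray) -> float:
    """Mean squared difference between the Gram matrices of two feature maps.

    Assumed formulation (borrowed from image style transfer); the two inputs
    may have different numbers of frames because the Gram matrix is always
    (channels, channels).
    """
    diff = gram_matrix(generated) - gram_matrix(reference)
    return float(np.mean(diff ** 2))


# Toy data standing in for mel-spectrograms: 80 mel bins, differing lengths.
rng = np.random.default_rng(0)
reference_mel = rng.standard_normal((80, 120))
generated_mel = rng.standard_normal((80, 95))

print(style_loss(reference_mel, reference_mel))  # identical inputs: zero loss
print(style_loss(generated_mel, reference_mel))  # different inputs: positive loss
```

Because the Gram matrix discards the time axis, a loss of this kind compares global channel correlations rather than frame-by-frame content, which is why it suits a generated utterance and a reference of different lengths.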
1. The architecture of the proposed model:
2. Demo of style transfer for emotional TTS:
To facilitate fair comparison, we use the same text to synthesize speech in six emotions. This lets listeners focus on the emotion conveyed by the acoustics. The text is (in Chinese): 让那些小主顾们等一等到吧。 (roughly, "Let those little customers wait a moment.")
Columns: emotion | Reference audio | Prosody Tacotron | + Lcls_src | + Lcls_tgt | + Lcls_src + Lcls_tgt | Ltotal
Rows (one audio sample per cell): surprise, happy, sad, angry, disgust, fear
3. Demo of emotion strength control in emotional TTS:
To facilitate fair comparison, we use the same text to synthesize speech in six emotions and three strengths. This lets listeners focus on the emotion conveyed by the acoustics. The text is (in Chinese): 让那些小主顾们等一等到吧。 (roughly, "Let those little customers wait a moment.")
Columns: emotion | RA-Tacotron (Low) | proposed (Low) | RA-Tacotron (Medium) | proposed (Medium) | RA-Tacotron (Strong) | proposed (Strong)
Rows (one audio sample per cell): surprise, happy, sad, angry, disgust, fear
4. Demo of continuous emotion strength control:
A synthetic story generated with the proposed emotion strength control. The proposed approach can make audiobooks more expressive. Each sentence is associated with an emotion type and a strength (larger is stronger).
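The per-sentence control described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract says strength is controlled "by adjusting the value of the emotion embedding", and the sketch interprets that as multiplying the embedding by a scalar. The helper `scale_emotion_embedding`, the random toy embeddings, the placeholder sentences, and the strength values are all hypothetical; in the real system the embeddings would come from the reference encoder.

```python
import numpy as np


def scale_emotion_embedding(embedding: np.ndarray, strength: float) -> np.ndarray:
    """Scale an emotion embedding by a non-negative strength (assumed mechanism)."""
    if strength < 0:
        raise ValueError("strength must be non-negative")
    return strength * embedding


# Hypothetical stand-ins for learned emotion embeddings (8-dimensional toy vectors).
rng = np.random.default_rng(0)
emotions = ["surprise", "happy", "sad", "angry", "disgust", "fear"]
toy_embeddings = {name: rng.standard_normal(8) for name in emotions}

# Each sentence carries an emotion type and a strength (larger is stronger).
# The sentences are placeholders, not the story used in the demo.
story = [
    ("<sentence 1>", "happy", 0.5),
    ("<sentence 2>", "happy", 1.5),
    ("<sentence 3>", "sad", 1.0),
]

# Pair each sentence with the scaled embedding that would condition the decoder.
conditioned = [
    (text, scale_emotion_embedding(toy_embeddings[emotion], strength))
    for text, emotion, strength in story
]

for (text, emb), (_, emotion, strength) in zip(conditioned, story):
    print(text, emotion, strength, round(float(np.linalg.norm(emb)), 3))
```

With scalar scaling, the embedding norm grows linearly with the strength value, so any continuous strength between the demonstrated levels can be requested without choosing a new reference utterance.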