Demo for paper: Controllable cross-speaker emotion transfer for end-to-end speech synthesis

Introduction:

Cross-speaker emotion transfer in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During emotion transfer, the identity information of the source speaker can also affect the synthesized results, causing speaker leakage: the synthetic speech may carry the voice identity of the source speaker rather than that of the target speaker. This paper proposes a new method that synthesizes controllable, emotionally expressive speech while maintaining the target speaker's identity in the cross-speaker emotion TTS task. The proposed method is a Tacotron2-based framework that uses an emotion embedding as the conditioning variable to provide emotion information. It contains two emotion disentangling modules that 1) obtain a speaker-independent and emotion-discriminative embedding, and 2) explicitly constrain the emotion and speaker identity of the synthetic speech to match the expected ones. Moreover, this work makes the first attempt to control the strength of the transferred emotion in cross-speaker emotion transfer. Specifically, the learned emotion embedding is adjusted with a flexible scalar value, which controls the emotion strength conveyed by the embedding. Extensive experiments on a Mandarin disjoint corpus demonstrate that the proposed method can synthesize reasonable emotional speech for the target speaker. Compared to state-of-the-art methods that learn a reference embedding, our method performs best on the cross-speaker emotion transfer task, indicating that it achieves new state-of-the-art performance in learning a speaker-independent emotion embedding.
Furthermore, the strength ranking test and pitch trajectory plots demonstrate that the proposed method can effectively control the emotion strength, leading to prosody-diverse synthetic speech.

1. The architecture of the proposed model:


The architecture of the proposed cross-speaker model. The input text is represented as a phone sequence, and speech is represented by a Mel-spectrogram, which can be converted to a waveform via a vocoder. The input "Mel" comes from the reference audio and provides the speaker-independent emotion embedding via the emotion encoder. The two emotion disentangling modules (EDMs) are used only during training; during inference, only the emotion encoder with "Mel" as input is kept.


2. Comparisons with other methods on the task of emotion transfer TTS:

Corresponding to Section 5.1 of the paper, the following demos are synthesized by different methods with the same text as input, i.e., 让那些小主顾们等一等到吧 (English Translation: Let those little customers wait a while). The task is to synthesize speech with the same emotion as the emotion reference audio (the first column) and the same voice as the target speaker reference audio (the second column). Compared to both Mspk-GST and Mspk-VAE, the proposed method achieves a good balance between maintaining the target speaker's identity and enriching the transferred emotional expression.

emotion	Emotion reference audio	Target speaker reference audio	Mspk-GST	Mspk-VAE	Proposed
fear
disgust
angry
sad
happy
surprise

3. Synthesized speech controlled by different emotion strengths:

Corresponding to Section 5.2 of the paper, the following demos present synthetic speech with different emotion strengths (weak, medium, and strong) for the same input text, i.e., 让那些小主顾们等一等到吧 (English Translation: Let those little customers wait a while). The emotion strength is controlled by a flexible emotion scalar with values of 1, 2, and 3. The strength differences are clearly reflected in the speech produced with different scalar values, and the speaker leakage issue is not obvious in speech with strong emotions.
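The strength control described above can be illustrated with a minimal sketch. It assumes that "adjusting the embedding with a scalar" means plain element-wise multiplication, which this page does not spell out. The function names, the toy 4-dimensional embedding, and the `STRENGTHS` dictionary are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Sequence


def scale_emotion_embedding(embedding: Sequence[float], strength: float) -> List[float]:
    """Multiply every dimension of an emotion embedding by one scalar.

    Assumption: strength control is a plain multiplication of the embedding.
    """
    if strength < 0:
        raise ValueError("strength is expected to be non-negative")
    return [strength * value for value in embedding]


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean length of a vector, used here only to inspect the scaling."""
    return sum(v * v for v in vector) ** 0.5


# Toy embedding; a real one would come from the emotion encoder.
reference_embedding = [0.5, -1.0, 0.25, 2.0]

# Hypothetical mapping of the demo's strength labels to the scalar values 1, 2, 3.
STRENGTHS = {"weak": 1.0, "medium": 2.0, "strong": 3.0}

scaled = {
    label: scale_emotion_embedding(reference_embedding, scalar)
    for label, scalar in STRENGTHS.items()
}

for label, vector in scaled.items():
    print(label, vector, round(l2_norm(vector), 3))
```

Under this assumption the direction of the embedding stays the same and only its magnitude grows with the scalar, which is one simple way to read "weak", "medium", and "strong" versions of the same emotion.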

emotion	Emotion reference audio	Target speaker reference audio	Proposed (Weak)	Proposed (Medium)	Proposed (Strong)
fear
disgust
angry
sad
happy
surprise

4. Ablation study

Corresponding to Section 5.3 of the paper. Input text: 让那些小主顾们等一等到吧 (English Translation: Let those little customers wait a while).

emotion	Emotion reference audio	Target speaker reference audio	w/o 2ort	w/o ort	Proposed
fear
disgust
angry
sad
happy
surprise

5. Extra samples on controllable emotional speech synthesis

In particular, we choose sentences whose text conveys an obvious emotion and synthesize emotional speech at different strengths.

【Fear】我，我再也不敢动了。(English Translation: I-I don't dare to move anymore.)

emotion	Emotion reference audio	Target speaker reference audio	Weak	Medium	Strong
fear

【Disgust】但要理解一个遥远的国家对他来说还太困难。(English Translation: But it is too difficult for him to understand a faraway country.)

emotion	Emotion reference audio	Target speaker reference audio	Weak	Medium	Strong
disgust

【Angry】阿三是个工人的妻子，她丈夫失了业。(English Translation: A San is the wife of a worker, and her husband lost his job.)

emotion	Emotion reference audio	Target speaker reference audio	Weak	Medium	Strong
angry

【Sad】不管怎样都要注意身体，否则我的心会难受的。(English Translation: Take care of your health no matter what, or my heart will ache.)

emotion	Emotion reference audio	Target speaker reference audio	Weak	Medium	Strong
sad

【Happy】你说的没错，自信是成功的一半，所以要先树立起你的自信哦，嘻嘻。(English Translation: You are right, self-confidence is half of success, so first build up your self-confidence, hee hee.)

emotion	Emotion reference audio	Target speaker reference audio	Weak	Medium	Strong
happy

【Surprise】鲁宾孙乘这艘船在海上航行半年后！(English Translation: After Robinson had sailed the seas on this ship for half a year!)

emotion	Emotion reference audio	Target speaker reference audio	Weak	Medium	Strong
surprise