Although cross-lingual TTS based on monolingual corpora has improved significantly in recent years, the generated cross-lingual speech still suffers from a foreign accent, which limits its naturalness. In addition, current cross-lingual methods do not model emotion, which is indispensable paralinguistic information in speech delivery.
In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers.
Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition.
To counter the weakened emotional expressiveness caused by disentangling speaker information from the emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn a speaker-irrelevant but emotion-discriminative embedding.
Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery.
Cross-lingual emotion transfer experiments show that DiCLET-TTS outperforms various competitive models and that OP-EDM is effective at learning a speaker-irrelevant but emotion-discriminative embedding.
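As a rough illustration of what it means to parameterize the terminal distribution of the forward diffusion process, the sketch below uses a Grad-TTS-style forward process in which a clean feature vector `x0` drifts toward a prior mean `mu` and ends close to N(mu, I). The linear noise schedule, the values `BETA_0 = 0.05` and `BETA_1 = 20.0`, the toy vectors, and all function names are assumptions made for illustration. In DiCLET-TTS the prior mean would come from the prior text encoder conditioned on the emotion embedding, which is not modeled here, so this should not be read as the authors' implementation.

```python
import math
import random

# Assumed linear noise schedule beta(t) = BETA_0 + (BETA_1 - BETA_0) * t on [0, 1].
BETA_0, BETA_1 = 0.05, 20.0


def beta_integral(t):
    """Integral of beta(s) over [0, t] for the assumed linear schedule."""
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t


def forward_marginal(x0, mu, t):
    """Mean and (isotropic) variance of x_t given x_0 when the process drifts toward mu."""
    decay = math.exp(-0.5 * beta_integral(t))
    mean = [decay * x + (1.0 - decay) * m for x, m in zip(x0, mu)]
    variance = 1.0 - math.exp(-beta_integral(t))
    return mean, variance


def sample_xt(x0, mu, t, rng=random):
    """Draw one noisy sample x_t from the forward marginal."""
    mean, variance = forward_marginal(x0, mu, t)
    std = math.sqrt(variance)
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]


# Hypothetical toy values: x0 stands in for mel-spectrogram entries,
# mu for the output of a (not modeled) prior text encoder.
x0 = [2.0, -1.0, 0.5]
mu = [0.3, 0.1, -0.2]
mean_T, var_T = forward_marginal(x0, mu, 1.0)
```

At `t = 0` the marginal collapses onto `x0`, and at `t = 1` its mean is close to `mu` with variance close to 1, so the reverse process can start from noise centered on the prior rather than on zero.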
1.1 The structure of the proposed model:
Note:
For a fair comparison, we deliberately chose a sentence whose text carries no apparent emotion to synthesize speech in each emotion, because a sentence with apparent emotion could bias listeners' judgments of emotional expressiveness.
2. Demos -- Comparison with other methods
2.1 Corresponding to Section V in the paper, the samples below were synthesized by the proposed DiCLET-TTS and the compared methods when transferring emotion from the reference audio to the intra-lingual target speaker "CN1".
Columns: Emotion | Emotion reference | Target speaker reference | M3 | CSET | Grad-TTS | DiCLET-TTS
English (Neutral)
Text: Normally if there is a problem they would rebook there and then.
Text Content: 于是在住所前的空地上插下杨柳桩子。 (So he inserted willow stakes in the open space in front of the residence.)
Chinese (Neutral)
Surprise
Happy
Sadness
Angry
Disgust
Fear
2.2 Corresponding to Section V in the paper, the samples below were synthesized by the proposed DiCLET-TTS and the compared methods when transferring emotion from the reference audio to the cross-lingual target speaker "EN1".
Columns: Emotion | Emotion reference | Target speaker reference | M3 | CSET | Grad-TTS | DiCLET-TTS
English (Neutral)
Text: Look, a steady arcade gig is nothing to sneeze at.
Text Content: 与杰西告别时,查理握着杰西的手。 (Charlie holds Jesse's hand as he says goodbye.)
Chinese (Neutral)
Surprise
Happy
Sadness
Angry
Disgust
Fear
Short summary: Compared with M3, CSET, and Grad-TTS, the proposed DiCLET-TTS achieves a good balance between preserving the target speaker's timbre and enriching the transferred emotional expression, in both intra-lingual and cross-lingual scenarios.
3. Demos -- The necessity of content loss and emotion adaptor
Corresponding to Section VI-A in the paper, the samples below were synthesized on the emotion transfer task by DiCLET-TTS and by its variants without the content loss (w/o CTL) or without the emotion adaptor (w/o EA).
Columns: Emotion | Emotion reference | Target speaker reference | w/o CTL | w/o EA | DiCLET-TTS
English (Neutral)
Text: Even last week was pretty comfortable with ultra low humidity.
Text Content: 楼三室的对联,借用了古人的诗句。 (The couplets in the third room of the building borrowed poems from the ancients.)
Chinese (Neutral)
Surprise
Happy
Sadness
Angry
Disgust
Fear
4. Demos -- The necessity of emotion adaptor and condition-enhanced DPM decoder
Corresponding to Section VI-B in the paper, the samples below were synthesized on the emotion transfer task by DiCLET-TTS and by its variants without the emotion adaptor (w/o EA) or without the condition-enhanced DPM decoder (w/o CE-D).
Columns: Emotion | Emotion reference | Target speaker reference | w/o EA | w/o CE-D | DiCLET-TTS (Proposed)
Text Content: 它们便渐渐敢伸出小脑袋瞅瞅我。 (They gradually dared to look at me with their little heads out.)
Surprise
Happy
Sadness
Angry
Disgust
Fear
Short summary: In terms of emotional expressiveness, DiCLET-TTS significantly outperforms the variants w/o EA and w/o CE-D, with no significant effect on speaker similarity.
5. Demos -- The advantages of emotion embedding space with orthogonal projection
In terms of emotion similarity, DiCLET-TTS clearly outperforms the variant trained without the Orthogonal Projection Loss (w/o OPL).
In terms of speaker similarity, there is no significant difference between w/o OPL and DiCLET-TTS; most listeners chose "No preference".
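To make the idea of an orthogonality constraint on the emotion embedding space more concrete, here is a minimal sketch of one common formulation of an orthogonal projection loss: embeddings with the same emotion label are pulled toward a cosine similarity of 1, and embeddings with different labels are pushed toward a cosine similarity of 0. Whether DiCLET-TTS uses exactly this form is an assumption; the weight `gamma`, the function names, and the toy embeddings are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def orthogonal_projection_loss(embeddings, labels, gamma=1.0):
    """(1 - mean same-label similarity) + gamma * |mean different-label similarity|."""
    same, diff = [], []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            (same if labels[i] == labels[j] else diff).append(sim)
    # Fall back to the loss-free values when a batch has no pairs of one kind.
    s = sum(same) / len(same) if same else 1.0
    d = sum(diff) / len(diff) if diff else 0.0
    return (1.0 - s) + gamma * abs(d)


# Hypothetical 2-D emotion embeddings for illustration only.
labels = ["happy", "happy", "sad", "sad"]
clustered = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]    # emotions on orthogonal axes
entangled = [[1.0, 0.2], [1.0, -0.2], [1.0, 0.1], [1.0, -0.1]]  # emotions nearly collinear

loss_clustered = orthogonal_projection_loss(clustered, labels)
loss_entangled = orthogonal_projection_loss(entangled, labels)
```

Under this formulation the loss is near zero when each emotion occupies its own orthogonal direction and grows when embeddings of different emotions point the same way, which is the kind of emotion-discriminative structure the w/o OPL comparison is probing.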
Corresponding to Section VI-C in the paper, the samples below illustrate the effectiveness of OPL.
Columns: Emotion | Emotion reference | Target speaker reference | w/o OPL | DiCLET-TTS
Text Content: 阿三是个工人的妻子,她丈夫失了业。 (Ah San is a worker's wife, and her husband lost his job.)
Surprise
Happy
Sadness
Angry
Disgust
Fear
Short summary: In terms of emotional expressiveness, DiCLET-TTS significantly outperforms the variant w/o OPL, with no significant effect on speaker similarity.