DiCLET-TTS: Diffusion model based cross-lingual emotion transfer for text-to-speech

Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie
Northwestern Polytechnical University
Tsinghua University
Speech, Audio, and Music Intelligence (SAMI) Group, ByteDance

0. Contents

  1. Abstract
  2. Demos -- Comparison with other methods
  3. Demos -- The necessity of content loss and emotion adaptor
  4. Demos -- The necessity of emotion adaptor and condition-enhanced DPM decoder
  5. Demos -- The advantages of emotion embedding space with orthogonal projection

1. Abstract

While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness problem caused by speaker disentanglement in emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the good design of OP-EDM in learning speaker-irrelevant but emotion-discriminative embedding.
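The idea of a forward diffusion process whose terminal distribution is a learned prior rather than standard noise can be illustrated with a short sketch. It assumes a Grad-TTS-style forward process with a linear noise schedule; the function name `forward_diffuse`, the schedule values `beta0` and `beta1`, and the placeholder `mu` (standing in for the output of the emotion-conditioned prior text encoder) are illustrative assumptions, not details taken from this page.

```python
import numpy as np

def forward_diffuse(x0, mu, t, beta0=0.05, beta1=20.0, rng=None):
    """Sample x_t from a forward process that drifts from x0 toward N(mu, I).

    Assumed (Grad-TTS-style) closed form, with a linear schedule
    beta(s) = beta0 + (beta1 - beta0) * s on s in [0, 1]:
        x_t ~ N(mu + (x0 - mu) * exp(-0.5 * B_t), (1 - exp(-B_t)) * I)
    where B_t is the integral of beta(s) from 0 to t.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Integral of the linear schedule from 0 to t.
    cum_beta = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2
    decay = np.exp(-0.5 * cum_beta)
    # The mean interpolates from the data x0 (t = 0) to the prior mean mu (t = 1).
    mean = mu + (x0 - mu) * decay
    # The standard deviation grows from 0 toward 1.
    std = np.sqrt(1.0 - np.exp(-cum_beta))
    sample = mean + std * rng.standard_normal(x0.shape)
    return sample, mean, std

# Hypothetical shapes: 80 mel bins, 10 frames.
x0 = np.zeros((80, 10))   # stands in for a target mel-spectrogram
mu = np.ones((80, 10))    # stands in for the emotion-conditioned prior mean
x_t, mean_t, std_t = forward_diffuse(x0, mu, t=1.0, rng=np.random.default_rng(0))
```

Under these assumptions, the sample at t = 1 is distributed almost exactly as N(mu, I), so the reverse process starts from a point that already carries the linguistic and emotional content encoded in `mu`.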

1.1 The structure of the proposed model:



Note:

To facilitate a fair comparison, we deliberately select a sentence whose text carries no apparent emotion and synthesize it with different emotions, because a sentence with apparent emotion in its text may bias listeners' judgment of the emotional expressiveness.

2. Demos -- Comparison with other methods

2.1 Corresponding to section V in the paper, several samples synthesized by the proposed DiCLET-TTS and other compared methods on transferring the emotion from reference audio to intra-lingual target speaker "CN1" are listed below.

Emotion Emotion reference Target speaker reference M3 CSET Grad-TTS DiCLET-TTS
English (Neutral) Text: Normally if there is a problem they would rebook there and then.
Text Content: 于是在住所前的空地上插下杨柳桩子。 (So he inserted willow stakes in the open space in front of the residence.)
Chinese (Neutral)
Surprise
Happy
Sadness
Angry
Disgust
Fear

2.2 Corresponding to section V in the paper, several samples synthesized by the proposed DiCLET-TTS and other compared methods on transferring the emotion from reference audio to cross-lingual target speaker "EN1" are listed below.

Emotion Emotion reference Target speaker reference M3 CSET Grad-TTS DiCLET-TTS
English (Neutral) Text: Look, a steady arcade gig is nothing to sneeze at.
Text Content: 与杰西告别时,查理握着杰西的手。 (Charlie holds Jesse's hand as he says goodbye.)
Chinese (Neutral)
Surprise
Happy
Sadness
Angry
Disgust
Fear

Short summary: Compared with M3, CSET, and Grad-TTS, the proposed DiCLET-TTS achieves a good balance between maintaining the target speaker's timbre and enriching the transferred emotional expression, in both intra-lingual and cross-lingual scenarios.

3. Demos -- The necessity of content loss and emotion adaptor

Corresponding to section VI-A in the paper, samples synthesized by DiCLET-TTS and by its variants without the content loss (w/o CTL) or without the emotion adaptor (w/o EA) on the emotion transfer task are listed below.

Emotion Emotion reference Target speaker reference w/o CTL w/o EA DiCLET-TTS
English (Neutral) Text: Even last week was pretty comfortable with ultra low humidity.
Text Content: 楼三室的对联,借用了古人的诗句。 (The couplets in the third room of the building borrowed poems from the ancients.)
Chinese (Neutral)
Surprise
Happy
Sadness
Angry
Disgust
Fear

4. Demos -- The necessity of emotion adaptor and condition-enhanced DPM decoder

Corresponding to section VI-B in the paper, samples synthesized by DiCLET-TTS and by its variants without the emotion adaptor (w/o EA) or without the condition-enhanced DPM decoder (w/o CE-D) on the emotion transfer task are listed below.

Emotion Emotion reference Target speaker reference w/o EA w/o CE-D DiCLET-TTS (Proposed)
Text Content: 它们便渐渐敢伸出小脑袋瞅瞅我。 (They gradually dared to look at me with their little heads out.)
Surprise
Happy
Sadness
Angry
Disgust
Fear

Short summary: In terms of emotional expressiveness, DiCLET-TTS significantly outperforms the variants w/o EA and w/o CE-D, without a significant influence on speaker similarity.

5. Demos -- The advantages of emotion embedding space with orthogonal projection

In terms of emotion similarity, DiCLET-TTS clearly outperforms the variant trained without the Orthogonal Projection Loss (w/o OPL).

In terms of speaker similarity, there is no significant difference between "w/o OPL" and DiCLET-TTS; most listeners give "No preference".

Corresponding to section VI-C in the paper, samples synthesized to demonstrate the effectiveness of OPL are listed below.

Emotion Emotion reference Target speaker reference w/o OPL DiCLET-TTS
Text Content: 阿三是个工人的妻子,她丈夫失了业。 (Ah San is a worker's wife, and her husband lost his job.)
Surprise
Happy
Sadness
Angry
Disgust
Fear

Short summary: In terms of emotional expressiveness, DiCLET-TTS significantly outperforms the variant w/o OPL, without a significant influence on speaker similarity.
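To give a feel for what an orthogonal-projection-style loss on emotion embeddings might compute, the sketch below implements one common formulation: embeddings of the same emotion are pulled toward cosine similarity 1, while embeddings of different emotions are pushed toward cosine similarity 0 (orthogonality). This page does not give the exact form of the OPL used in DiCLET-TTS, so the function name `orthogonal_projection_loss`, the weight `gamma`, and the formula itself are assumptions made for illustration only.

```python
import numpy as np

def orthogonal_projection_loss(embeddings, labels, gamma=0.5):
    """Illustrative OPL-style loss (assumed form, not taken from the paper).

    embeddings: array of shape (batch, dim), one emotion embedding per row.
    labels:     emotion class index per row.
    """
    # L2-normalize so that dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    same_mask = same & off_diag   # same emotion, excluding self-pairs
    diff_mask = ~same             # different emotions
    # Mean similarity of same-emotion pairs (ideal value: 1).
    s = sim[same_mask].mean() if same_mask.any() else 1.0
    # Mean absolute similarity of different-emotion pairs (ideal value: 0).
    d = np.abs(sim[diff_mask]).mean() if diff_mask.any() else 0.0
    return (1.0 - s) + gamma * d

# Toy batch: two emotions whose embeddings lie on orthogonal axes.
separated = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
loss = orthogonal_projection_loss(separated, [0, 0, 1, 1])
```

In this toy batch the two emotion classes are already mutually orthogonal and internally aligned, so the loss is zero; embeddings that mix the classes would be penalized through both terms.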