IMPROVING EMOTIONAL SPEECH SYNTHESIS BY USING SUS-CONSTRAINED VAE AND TEXT ENCODER AGGREGATION

Abstract: Learning an emotion embedding from reference audio is a straightforward approach to multi-emotion speech synthesis in encoder-decoder systems. However, how to obtain a better emotion embedding, and how to inject it into the TTS acoustic model more effectively, remain open questions. In this paper, we propose an innovative constraint to help the VAE extract emotion embeddings with better cluster cohesion. In addition, the obtained emotion embedding is used as a query to aggregate the latent representations of all encoder layers via attention. Moreover, queries derived from the encoder layers themselves are also helpful. Experiments show that the proposed methods enhance the encoding of comprehensive syntactic and semantic information and produce more expressive emotional speech.
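The abstract describes the aggregation step only at a high level, so the following is a minimal PyTorch sketch of one plausible reading: the VAE emotion embedding acts as the attention query, and the hidden states of all text-encoder layers act as keys/values, so the decoder receives a layer-weighted representation. The module name, tensor shapes, and the mean-pooling of each layer over time are all my assumptions, not the paper's exact formulation; the variant that additionally uses queries from the encoder layers themselves (the SA-WAC system) is not shown.

```python
import torch
import torch.nn as nn

class EmotionQueryAggregation(nn.Module):
    """Hypothetical sketch of emotion-embedding-as-query aggregation
    over the outputs of all text-encoder layers (the SA-WA idea as
    described in the abstract; details are assumptions)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.key_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, layer_outputs: torch.Tensor, emotion_emb: torch.Tensor):
        # layer_outputs: (n_layers, batch, time, d_model), one slice per encoder layer
        # emotion_emb:   (batch, d_model), from the SUS-constrained VAE
        q = self.query_proj(emotion_emb)                  # (batch, d_model)
        # Summarize each layer by mean-pooling over time, then score it
        # against the emotion query to get one weight per layer.
        keys = self.key_proj(layer_outputs.mean(dim=2))   # (n_layers, batch, d_model)
        scores = torch.einsum('bd,lbd->lb', q, keys) * self.scale
        weights = torch.softmax(scores, dim=0)            # (n_layers, batch)
        # Weighted sum across layers -> aggregated encoder representation.
        return torch.einsum('lb,lbtd->btd', weights, layer_outputs)
```

Under this reading, lower layers (closer to syntax) and higher layers (closer to semantics) can contribute differently per utterance and per emotion, which is one way to interpret the claim that aggregation "enhances the encoding of comprehensive syntactic and semantic information."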

1. Comparing the emotional expressiveness of different systems with non-parallel transfer:

1.1. 当然是多锻炼、长壮点!就问一句,感动不? (Of course, exercise more and get stronger! Just one question: aren't you touched?)
[Audio samples] BASE | BASE-SUS | SA-WA | SA-WAC
1.2. 在和你聊天啊,想要我帮忙就说。 (I'm chatting with you! Just say the word if you want my help.)
[Audio samples] BASE | BASE-SUS | SA-WA | SA-WAC
1.3. 蓝牙已打开。 (Bluetooth is turned on.)
[Audio samples] BASE | BASE-SUS | SA-WA | SA-WAC
1.4. 她起初在班里不爱说话,现在自然多了。 (At first she hardly spoke in class, but now she is much more at ease.)
[Audio samples] BASE | BASE-SUS | SA-WA | SA-WAC
1.5. 你,过来,这个多少钱? (You, come here. How much is this?)
[Audio samples] BASE | BASE-SUS | SA-WA | SA-WAC
1.6. 没有什么想法,就是不想去! (No particular reason, I just don't want to go!)
[Audio samples] BASE | BASE-SUS | SA-WA | SA-WAC
1.7. 你负责为什么还会被辞退了? (If you were the one in charge, why did you still get fired?)
[Audio samples] BASE | BASE-SUS | SA-WA | SA-WAC
1.8. 真想跟你过几天安生日子啊。 (I really wish I could spend a few peaceful days with you.)
[Audio samples] BASE | BASE-SUS | SA-WA | SA-WAC