Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis

Abstract: Attention-based seq2seq text-to-speech (TTS) systems, especially those using self-attention networks (SAN), have achieved state-of-the-art performance. However, an expressive corpus with rich prosody remains challenging to model because 1) the prosodic aspects that span different sentential granularities and largely determine acoustic expressiveness are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from the text encoder, which easily collapses to an averaged expression on expressive content. We propose a context extractor, built upon a SAN-based text encoder, to fully exploit the sentential context of an expressive corpus for seq2seq TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates it into a comprehensive sentence representation that enhances the expressiveness of the generated speech. Specifically, we investigate two methods of context aggregation: 1) direct aggregation, which directly concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contribution of each SAN layer. Experiments on two expressive corpora show that our approach produces more natural speech with much richer prosodic variation, and that weighted aggregation is superior in modeling expressivity.
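The two aggregation strategies above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: direct aggregation concatenates per-layer encoder outputs along the feature axis, while weighted aggregation is shown here in a simplified scalar-weight form (a softmax over per-layer logits standing in for the multi-head-attention weighting described in the abstract); the function names, shapes, and the `logits` parameter are all assumptions made for illustration.

```python
import numpy as np

def direct_aggregation(layer_outputs):
    """Concatenate per-layer SAN encoder outputs along the feature axis.

    layer_outputs: list of n_layers arrays, each shaped (seq_len, d_model).
    Returns an array shaped (seq_len, n_layers * d_model).
    """
    return np.concatenate(layer_outputs, axis=-1)

def weighted_aggregation(layer_outputs, logits):
    """Softmax-weighted sum of per-layer outputs.

    Scalar-weight simplification of the attention-based weighting; in the
    paper the per-layer contributions would be learned, here `logits`
    (one value per layer) is a hypothetical stand-in for those weights.
    Returns an array shaped (seq_len, d_model).
    """
    weights = np.exp(logits - np.max(logits))
    weights = weights / weights.sum()                    # softmax over layers
    stacked = np.stack(layer_outputs, axis=0)            # (n_layers, seq_len, d_model)
    return np.tensordot(weights, stacked, axes=([0], [0]))

# Toy example: 3 SAN layers, a 4-token sequence, 8-dim features.
rng = np.random.default_rng(0)
outs = [rng.standard_normal((4, 8)) for _ in range(3)]
print(direct_aggregation(outs).shape)                 # (4, 24)
print(weighted_aggregation(outs, np.zeros(3)).shape)  # (4, 8)
```

With equal logits the weighted variant reduces to a plain mean over layers; training would move the weights toward the layers that carry the most prosody-relevant context.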

1. Comparing the expressiveness of different systems over the "Voice Assistant" corpus:

1.1. 哼,想你有什么用,你又不来陪我玩。 (Hmph, what's the use of missing you? You never come to play with me anyway.)
Audio: BASE | SA | SA-DA | SA-WA
1.2. 我的天呀!换个说法好不好吗! (Oh my goodness! Could you phrase that differently, please?)
Audio: BASE | SA | SA-DA | SA-WA
1.3. 哎呀呀,小爱未解汝意,但我会在你的陪伴下变聪明哦。 (Oh dear, Xiao Ai didn't understand you, but I'll grow smarter with your company.)
Audio: BASE | SA | SA-DA | SA-WA
1.4. 嘤嘤嘤,那还好呀,今天我也看看,买本读读看哦。 (Boo-hoo, that's not too bad; I'll take a look today too and maybe buy a copy to read.)
Audio: BASE | SA | SA-DA | SA-WA
1.5. 哎呦喂,小爱未解汝意,但我会在你的陪伴下变聪聪哦。 (Oh my, Xiao Ai didn't understand you, but I'll grow smarter and smarter with your company.)
Audio: BASE | SA | SA-DA | SA-WA
1.6. 呜呜呜,好桑心啊,答案就在嘴边,但我居然想不起来啦。 (Sob, how sad; the answer is on the tip of my tongue, yet I just can't recall it.)
Audio: BASE | SA | SA-DA | SA-WA
1.7. 得到小主人的认可真是太开心了,那就确定是这个咯,不改了哦。 (I'm so happy to have my little master's approval; then it's settled, this is the one, no more changes.)
Audio: BASE | SA | SA-DA | SA-WA
1.8. 呜呜,这个问题容小爱我再学习学习哈。 (Sob, let me, Xiao Ai, study this question a little more.)
Audio: BASE | SA | SA-DA | SA-WA
1.9. 你也是呢,哦,对了,悄悄告诉你小爱在体验最新的签到技能呢,还能换礼品,快对我说,我要签到。 (You too! Oh, by the way, let me quietly tell you that Xiao Ai is trying out the newest check-in skill, and you can even redeem gifts; quick, say to me, "I want to check in.")
Audio: BASE | SA | SA-DA | SA-WA
1.10. 还单身呢,咋啦,想表白吗,给你个机会。 (Still single? What, want to confess your feelings? I'll give you a chance.)
Audio: BASE | SA | SA-DA | SA-WA

2. Comparing the expressiveness of different systems over the "Talk-Show" corpus:

2.1. 对大众来说,其实只有一百年左右啊。 (For the general public, it has actually only been about a hundred years.)
Audio: BASE | SA | SA-DA | SA-WA
2.2. 已经过了十一点,这三个孩子怎么还没回来? (It's already past eleven; why aren't these three kids back yet?)
Audio: BASE | SA | SA-DA | SA-WA
2.3. 说起来也挺奇妙,就像大人所说的缘分吧。 (It's rather wondrous, really, like what grown-ups call fate.)
Audio: BASE | SA | SA-DA | SA-WA
2.4. 防汛那么紧,正动员劳力呢,你怎么往回跑? (Flood control is so urgent and we're mobilizing labor; why are you running back?)
Audio: BASE | SA | SA-DA | SA-WA
2.5. 那就不犯法,你大大方方说吧。 (Then it's not against the law; go ahead and speak up openly.)
Audio: BASE | SA | SA-DA | SA-WA
2.6. 老铁,这回真的扎心了,我没有听懂呢。 (Buddy, this time it really stings; I didn't understand.)
Audio: BASE | SA | SA-DA | SA-WA
2.7. 哎,这是有理论基础的。 (Hey, there's a theoretical basis for this.)
Audio: BASE | SA | SA-DA | SA-WA
2.8. 但是决定你能不能升职加薪的老板和雇主呢,他们不是在看这些,他们在看什么呢? (But the bosses and employers who decide whether you get promoted and get a raise, they aren't looking at these things; so what are they looking at?)
Audio: BASE | SA | SA-DA | SA-WA
2.9. 哎,想要说些什么呢? (Ah, what is it you want to say?)
Audio: BASE | SA | SA-DA | SA-WA
2.10. 省得跟着我受苦。 (So you won't suffer along with me.)
Audio: BASE | SA | SA-DA | SA-WA

3. Comparing the expressiveness of different systems over the "Reading-Style" corpus:

3.1. 会出现这种抢台词背错台词的错误吗? (Do mistakes like stealing lines or reciting the wrong lines ever happen?)
Audio: BASE | SA-WA
3.2. 行啊没问题,不过你也知道的,周末路上都很堵。 (Sure, no problem, but as you know, the roads are always jammed on weekends.)
Audio: BASE | SA-WA
3.3. 捕鸟人痛得丢下粘竿,鸽子立即惊跑了。 (The bird-catcher dropped his lime stick in pain, and the pigeon fled in fright at once.)
Audio: BASE | SA-WA
3.4. 最终又归了谁的腰包里去了? (And into whose pocket did it end up in the end?)
Audio: BASE | SA-WA
3.5. 你看看几点了,还不滚回来! (Look what time it is; get yourself back here right now!)
Audio: BASE | SA-WA
3.6. 会一点点,我很喜欢和朋友一起玩脑筋急转弯呢。 (A little bit; I really like playing brain teasers with my friends.)
Audio: BASE | SA-WA
3.7. 一定是在欣赏我啦! (They must be admiring me!)
Audio: BASE | SA-WA
3.8. 臭小子,谁让你改姓的? (You rascal, who said you could change your surname?)
Audio: BASE | SA-WA
3.9. 那再问问你们爷爷的爷爷是不是农民? (Then let me ask again: was your grandfather's grandfather a farmer?)
Audio: BASE | SA-WA
3.10. 我们援助的结果是什么? (What was the result of our aid?)
Audio: BASE | SA-WA