Improve bilingual TTS using language and phonology embedding with embedding strength modulator

Abstract: Bilingual TTS typically has to handle three types of input script: first language only, second language only, and second language embedded in the first language. In the latter two cases, accurately modeling the pronunciation and intonation of the second language across contexts, without interference between the two languages, is a major challenge. This paper builds a Mandarin-English TTS system that produces more standard spoken English from a monolingual Chinese speaker. We introduce a phonology embedding, applied with embedding masks, to capture how English differs under different phonologies. An embedding strength modulator (ESM) is designed to capture the dynamic strength of the language and phonology embeddings. Experiments show that our approach produces significantly more natural and standard spoken English for the monolingual Chinese speaker. Our analysis further shows that suitable phonology control improves performance in different scenarios.
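To make the idea of a masked embedding with a per-token strength more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function name `modulate`, the linear-plus-sigmoid strength predictor, the parameters `w` and `b`, and all tensor shapes are hypothetical choices made for this example.

```python
import numpy as np

def modulate(hidden, ids, table, mask, w, b):
    """Add a strength-scaled embedding to each encoder output (illustrative only).

    hidden: (T, D) text-encoder outputs
    ids:    (T,)   embedding index per token (e.g. a language or phonology ID)
    table:  (N, D) embedding table
    mask:   (T,)   1 where the embedding applies, 0 where it is masked out
    w, b:   parameters of a hypothetical linear-plus-sigmoid strength predictor
    """
    emb = table[ids]                                      # (T, D) lookup
    strength = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))    # (T,), each in (0, 1)
    strength = strength * mask                            # masked tokens get 0
    return hidden + strength[:, None] * emb, strength

# Toy input: two Mandarin tokens (masked) followed by two English tokens.
rng = np.random.default_rng(0)
T, D = 4, 3
hidden = rng.normal(size=(T, D))
table = rng.normal(size=(2, D))          # assumed: 0 = Mandarin, 1 = English
ids = np.array([0, 0, 1, 1])
mask = np.array([0.0, 0.0, 1.0, 1.0])    # apply the embedding to English only
w = rng.normal(size=D)
out, strength = modulate(hidden, ids, table, mask, w, 0.0)
```

In this sketch the masked Mandarin positions pass through unchanged, while each English position receives its embedding scaled by a token-dependent strength. A static variant would replace the predicted `strength` with a single constant.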

1. Comparing the speech naturalness and speaker similarity of different bilingual systems:

English:

1.1. Drag the bars to adjust the volume level.
BASE | EM | ESM
1.2. It is mostly detected during childhood and student years itself.
BASE | EM | ESM
1.3. In other words, my employer is always telling me to do things.
BASE | EM | ESM
1.4. No food was available at the Press Club.
BASE | EM | ESM
1.5. Therefore, one cartridge case of this type was not recovered.
BASE | EM | ESM

Code-switching:

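1.6. 最良心的新款SUV! (English gloss: "The best-value new SUV!")
BASE | EM | ESM
1.7. Something Just Like This 1分钟试听版送给你。 (English gloss: "Here is the 1-minute preview of Something Just Like This for you.")
BASE | EM | ESM
1.8. 好的,小爱想陪你一起听,绿钻会员歌曲:See You Again。 (English gloss: "OK, Xiao Ai would like to listen with you. Green Diamond member song: See You Again.")
BASE | EM | ESM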

Mandarin:

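1.9. 下午啦,快醒醒,要开始工作啦。 (English gloss: "It's afternoon. Wake up, it's time to start working.")
BASE | EM | ESM
1.10. 好的,小爱帮你找回记忆中的旋律。 (English gloss: "OK, Xiao Ai will help you find the melody from your memories.")
BASE | EM | ESM
1.11. 抱歉,查询有失败,请稍后再试吧。 (English gloss: "Sorry, the query failed. Please try again later.")
BASE | EM | ESM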

2. ESM component analysis:

2.1. Children this age fight to get into the bathroom.
Base combination | Reference combination | Dynamic phonology embedding | Static phonology embedding | Dynamic language embedding | Static language embedding
(Articulation) | (Intonation) | (Speaking rate) | (Pause duration)
2.2. Next, the pores plug up and trap the oil inside.
Base combination | Reference combination | Dynamic phonology embedding | Static phonology embedding | Dynamic language embedding | Static language embedding
(Articulation) | (Intonation) | (Speaking rate) | (Pause duration)

3. Controlling phonology to enhance expressiveness:

3.1. Genes of intracellular calcium metabolism and blood pressure control in primary hypertension.
Base combination | Reference combination | Double Adjustment
3.2. The American people have learned from the depression.
Base combination | Reference combination | Double Adjustment

4. Controlling phonology for smooth transitions:

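4.1. 好呀,我们一起来听I fall in love too easily。 (English gloss: "Sure, let's listen to I Fall in Love Too Easily together.")
Base combination | Smooth transition | Reference combination
4.2. 好的,为你播放抖音神曲twinkle twinkle little star。 (English gloss: "OK, playing the Douyin hit Twinkle Twinkle Little Star for you.")
Base combination | Smooth transition | Reference combination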