Demos for "Speech reconstruction from silent tongue and lip articulation by pseudo target generation and domain adversarial training".

This paper has been accepted by ICASSP 2023. A presentation video is available here.

Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Abstract: This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. This task falls under the umbrella of articulatory-to-acoustic conversion, and may also be referred to as a silent speech interface. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the speech recovered from silent tongue and lip articulation. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in silent speaking mode compared to the baseline TaLNet model. When using an automatic speech recognition (ASR) model to measure intelligibility, the word error rate (WER) of our proposed method decreases by over 15% compared to the baseline. In addition, our proposed method also outperforms the baseline on the intelligibility of the speech reconstructed in vocalized articulating mode, reducing the WER by approximately 10%.
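For readers unfamiliar with the intelligibility metric mentioned in the abstract, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions made for illustration; this is not the evaluation script or the ASR model used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: text is lowercased and split on whitespace,
    with no punctuation handling (an assumption, not the paper's setup).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical ASR output for one of the demo sentences.
    ref = "they told wild tales to frighten him"
    hyp = "they told wild tails to frighten"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.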


1. Comparison of Silent Utterances

Note: TaLNet denotes the original TaLNet, and Proposed denotes our proposed method, which combines pseudo target generation and domain adversarial training with our iterative training strategy.

TaLNet Proposed Content

The chap slipped into the crowd and was lost.

The rainbow is a division of white light into many beautiful colours.

Her wardrobe consists of only skirts and blouses.

They told wild tales to frighten him.

2. Comparison of Vocalized Utterances

Note: TaLNet denotes the original TaLNet, and Proposed denotes our proposed method, which combines pseudo target generation and domain adversarial training with our iterative training strategy.

TaLNet Proposed Content

The rainbow is a division of white light into many beautiful colours.

These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.

These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.

There is, according to legend, a boiling pot of gold at one end.

3. Ablation Studies in Silent Speaking Mode

Note: Proposed denotes our proposed method, which combines pseudo target generation and domain adversarial training with our iterative training strategy; w/o ITS denotes our proposed method without the iterative training strategy; and w/o DAT denotes our proposed method without both the iterative training strategy and domain adversarial training.

Proposed w/o ITS w/o DAT Content

The rainbow is a division of white light into many beautiful colours.

People look but no one ever finds it.

The rainbow is a division of white light into many beautiful colours.

The barrel of beer was a brew of malt and hops.

4. Ablation Studies in Vocalized Speaking Mode

Note: Proposed denotes our proposed method, which combines pseudo target generation and domain adversarial training with our iterative training strategy; w/o ITS denotes our proposed method without the iterative training strategy; and w/o DAT denotes our proposed method without both the iterative training strategy and domain adversarial training.

Proposed w/o ITS w/o DAT Content

The rainbow is a division of white light into many beautiful colours.

The rainbow is a division of white light into many beautiful colours.

The rainbow is a division of white light into many beautiful colours.

These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.