This paper studies speech reconstruction from ultrasound tongue images and optical lip videos, with a particular focus on the silent speaking mode, in which speakers activate only their intra-oral and extra-oral articulators without producing audible speech. This task falls under the umbrella of articulatory-to-acoustic (A2A) conversion and may also be referred to as a silent speech interface (SSI). To overcome the domain discrepancy between silent and standard vocalized articulation, we introduce a novel pseudo target generation strategy. It uses the text modality to align with articulatory movements, thereby guiding the generation of pseudo acoustic features for supervised training on speech reconstruction from silent articulation. Furthermore, we propose to employ a denoising diffusion probabilistic model (DDPM) as the fundamental architecture for the A2A conversion task, and we train the model on these pseudo acoustic features using a combined training approach. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in the silent speaking mode compared to all baseline methods. Specifically, the word error rate (WER) of the reconstructed speech decreases by approximately 5% when measured using automatic speech recognition (ASR) models for intelligibility assessment, while the mean opinion scores (MOS) for naturalness improve by 0.14. Moreover, analytical experiments reveal that the proposed pseudo target generation strategy can generate pseudo acoustic features that synchronize better with articulatory movements than previous methods.
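The intelligibility metric mentioned above, WER, is conventionally the word-level edit distance between a reference transcript and an ASR transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper; the function name `word_error_rate` and the misrecognized hypothesis are invented for illustration, and text normalization (punctuation, casing) is assumed to be handled beforehand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical ASR output with one substitution and one deletion.
reference_text = "ten pins were set in order"
hypothesis_text = "ten pens were set order"
example_wer = word_error_rate(reference_text, hypothesis_text)
```

Here the hypothesis differs from the six-word reference by two edits, so `example_wer` is 2/6.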
An Example of Vocalized and Silent Speaking Mode
Vocalized Mode | Silent Mode |
---|---|
The larynx and lungs function as usual during speaking. Also called audible or modal speaking mode. | Only the oral and nasal articulators are activated, while laryngeal activity is suppressed. No sound is produced during speaking. |
Experimental Results
Speech Reconstructed in Vocalized Speaking Mode
For more samples, please refer to this folder.
Content | Ground-Truth | Vocoder-Recon | TaLNet | Proposed |
---|---|---|---|---|
The difference in the rainbow depends considerably upon the size of the water drops, and the width of the coloured band increases as the size of the drops increases. | | | | |
This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue. | | | | |
Some have accepted it as a miracle without physical explanation. | | | | |
The Norsemen considered the rainbow as a bridge over which the gods passed from the earth to their home in the sky. | | | | |
There is, according to legend, a boiling pot of gold at one end. | | | | |
These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. | | | | |
Speech Reconstructed in Silent Speaking Mode
For more samples, please refer to this folder.
Content | TaLNet | Zheng et al. (2023) | Proposed w/o Pseudo Targets Training | Proposed |
---|---|---|---|---|
There is, according to legend, a boiling pot of gold at one end. | | | | |
These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. | | | | |
The rainbow is a division of white light into many beautiful colours. | | | | |
There is, according to legend, a boiling pot of gold at one end. | | | | |
The rainbow is a division of white light into many beautiful colours. | | | | |
When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow. | | | | |
Pseudo Targets Generated by the Proposed Dubbing Strategy and the Previous DTW Strategy
For more samples, please refer to this folder.
The pseudo targets generated by the DTW strategy fail to synchronize with the lip and tongue movements, although they preserve the correct linguistic content.
Content | Dubbing | DTW |
---|---|---|
The friendly gang left the drug store. | | |
When sunlight strikes raindrops in the air, they act like a prism and form a rainbow. | | |
Ten pins were set in order. | | |
People look, but no one ever finds it. | | |
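For readers unfamiliar with the DTW baseline compared above, dynamic time warping finds a monotonic frame-to-frame alignment between two sequences of different lengths, which can then be used to resample one sequence onto the other's timeline. The sketch below is a generic textbook DTW in plain Python, not the implementation used by the previous method; the function names `dtw_path` and `warp_to_query`, the toy features, and the choice of Euclidean frame distance are all assumptions made for illustration.

```python
import math


def dtw_path(query, reference):
    """Return the minimum-cost monotonic alignment as (query_idx, reference_idx) pairs."""
    n, m = len(query), len(reference)
    # cost[i][j] is the best accumulated cost aligning query[:i] with reference[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_distance = math.dist(query[i - 1], reference[j - 1])
            cost[i][j] = frame_distance + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )

    # Backtrack from the final cell to the origin along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def warp_to_query(path, reference_frames, query_length):
    """Pick one reference frame per query frame (the last one aligned to it)."""
    chosen = [0] * query_length
    for query_idx, reference_idx in path:
        chosen[query_idx] = reference_idx
    return [reference_frames[k] for k in chosen]


# Toy one-dimensional features: the reference lingers on its first value.
toy_query = [[0.0], [1.0], [2.0]]
toy_reference = [[0.0], [0.0], [1.0], [2.0]]
toy_path = dtw_path(toy_query, toy_reference)
toy_warped = warp_to_query(toy_path, ["a0", "a1", "a2", "a3"], len(toy_query))
```

Because the alignment only minimizes accumulated frame distance, nothing in it forces the warped frames to match the actual timing of the articulators, which is consistent with the synchronization problems audible in the DTW column.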