Diff-A2A-demo

An Example of Vocalized and Silent Speaking Mode

Vocalized Mode	Silent Mode

The larynx and lungs function as expected during speaking. Also called audible or modal speaking mode.	Only the oral and nasal articulators are activated, while laryngeal activity is suppressed. No sound is produced during speaking.

Experimental Results

Speech Reconstructed in Vocalized Speaking Mode

For more samples, please refer to this folder.

Content	Ground-Truth	Vocoder-Recon	TaLNet	Proposed
The difference in the rainbow depends considerably upon the size of the water drops, and the width of the coloured band increases as the size of the drops increases.
This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.
Some have accepted it as a miracle without physical explanation.
The Norsemen considered the rainbow as a bridge over which the gods passed from the earth to their home in the sky.
There is, according to legend, a boiling pot of gold at one end.
These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.

Speech Reconstructed in Silent Speaking Mode

For more samples, please refer to this folder.

Content	TaLNet	Zheng et al. (2023)	Proposed w/o Pseudo Targets Training	Proposed
There is, according to legend, a boiling pot of gold at one end.
These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.
The rainbow is a division of white light into many beautiful colours.
There is, according to legend, a boiling pot of gold at one end.
The rainbow is a division of white light into many beautiful colours.
When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

Pseudo Targets Generated by the Proposed Dubbing Strategy and the Previous DTW Strategy

For more samples, please refer to this folder.

The pseudo targets generated by the DTW strategy preserve the correct linguistic content but fail to synchronize with the lip and tongue movements.

Content	Dubbing	DTW
The friendly gang left the drug store.
When sunlight strikes raindrops in the air, they act like a prism and form a rainbow.
Ten pins were set in order.
People look, but no one ever finds it.
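
For readers unfamiliar with DTW (dynamic time warping), the following is a minimal sketch of how a generic DTW alignment can warp vocalized targets onto a silent utterance's timeline. It is not the code behind the samples above, and it may differ from the previous DTW strategy in its details. The function names (`dtw_path`, `warp_targets`), the use of Euclidean frame distances, and the choice of input features are all assumptions made for illustration.

```python
import numpy as np


def dtw_path(x, y):
    """Return a minimum-cost DTW path between x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between every frame of x and y.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Accumulated cost, padded with an infinite border.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the last frame pair to the first.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def warp_targets(silent_feats, vocal_feats, vocal_targets):
    """Assign each silent frame a vocalized target frame via the DTW path.

    All three arguments are hypothetical: silent_feats and vocal_feats are
    articulatory feature sequences, and vocal_targets holds the acoustic
    frames (e.g. mel-spectrogram frames) parallel to vocal_feats.
    """
    path = dtw_path(silent_feats, vocal_feats)
    warped = np.zeros(
        (len(silent_feats),) + vocal_targets.shape[1:], dtype=vocal_targets.dtype
    )
    for i, j in path:
        # If several vocalized frames map to one silent frame, the last wins.
        warped[i] = vocal_targets[j]
    return warped


if __name__ == "__main__":
    silent = np.array([[0.0], [1.0], [2.0]])
    vocal = np.array([[0.0], [0.0], [1.0], [2.0]])
    targets = np.array([[10.0], [11.0], [12.0], [13.0]])
    print(dtw_path(silent, vocal))
    print(warp_targets(silent, vocal, targets)[:, 0])
```

Because the path is computed purely from frame-level distances, nothing in this kind of alignment forces the warped audio to follow the actual timing of the silent lip and tongue movements.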

Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation
