Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

Authors: Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech with the help of additional visual information, such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes incorporating ultrasound tongue images to further improve the performance of lip-based AV-SE systems. To address the challenge of acquiring ultrasound tongue images during inference, we first propose employing knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose introducing a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods exhibit strong generalization to unseen speakers and unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) results reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most.
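
Below is a minimal, illustrative PyTorch sketch of the two ideas described above. It is not the authors' implementation: the module structure, the slot count and feature dimensions, and the feature-level MSE distillation loss are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipTongueMemory(nn.Module):
    """Hypothetical lip-tongue key-value memory: lip features act as queries,
    learned lip keys address learned tongue values, so tongue-related features
    can be retrieved even when only lip video is available at inference time."""
    def __init__(self, num_slots=128, lip_dim=256, tongue_dim=256):
        super().__init__()
        self.lip_keys = nn.Parameter(torch.randn(num_slots, lip_dim))          # addressed by lip queries
        self.tongue_values = nn.Parameter(torch.randn(num_slots, tongue_dim))  # retrieved tongue features

    def forward(self, lip_feat):
        # lip_feat: (batch, time, lip_dim)
        scores = lip_feat @ self.lip_keys.t() / self.lip_keys.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)            # (batch, time, num_slots)
        retrieved_tongue = attn @ self.tongue_values    # (batch, time, tongue_dim)
        return torch.cat([lip_feat, retrieved_tongue], dim=-1)  # fused visual representation

def distillation_loss(student_feat, teacher_feat):
    """Hypothetical feature-level distillation term: the audio-lip student is
    pulled toward intermediate features of the frozen audio-lip-tongue teacher."""
    return F.mse_loss(student_feat, teacher_feat.detach())
```

In both cases the ultrasound tongue images are needed only during training: the distillation term is dropped at inference, and the memory is addressed by lip features alone.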


Results on the TaL80 Dataset

Spectrograms of the noisy input, the ground truth, and the enhanced output of each method are shown under four noise conditions: LivingRoom noise at 2.5 dB, Cafe noise at -2.5 dB, LivingRoom noise at -7.5 dB, and Cafeteria noise at -7.5 dB.

Methods: Noisy, Groundtruth, VSE, PVSE, IOAVSE, VisualVoice, KD-based, Memory-based, Audio-Lip, Audio-Lip-Tongue, Audio-Tongue, Audio-Only.

Results on the GRID Dataset

Spectrograms of the noisy input, the ground truth, and the enhanced output of each method are shown under three noise conditions: Restaurant noise at 2.5 dB, Cafeteria noise at -2.5 dB, and Cafeteria noise at -7.5 dB.

Methods: Noisy, Groundtruth, Audio-Lip, KD-based, Memory-based.