Hi there! My name is Rui-Chen Zheng (郑瑞晨).
I am currently a fourth-year PhD student at the National Engineering Research Center for Speech and Language Information Processing, University of Science and Technology of China, supervised by Prof. Zhen-Hua Ling. My main research interests lie in deep learning for speech synthesis, with a current focus on the articulatory-acoustic relationship in speech synthesis.
My CV can be downloaded here.

Looking for a 6-12 month visiting position in speech synthesis! Please contact me if you have a suitable opening :)

📖 Education

  • 2021.09 - 2026.06 (Expected), PhD Student in Information and Communication Engineering, University of Science and Technology of China
    • Supervised by Prof. Zhen-Hua Ling
    • GPA: 3.9/4.3 (Top 3%)
  • 2017.09 - 2021.06, Bachelor’s Degree in Electronic Information Engineering, University of Science and Technology of China
    • Thesis: Method and Practice on Text-to-speech Without Text
    • GPA: 3.89/4.3, 90.46/100 (Top 5%)
    • Minor in Business Administration

📝 Publications

🎈 Articulation-Acoustic Relationship

🔑 Articulation-to-Speech Synthesis

ACM Multimedia 2024

Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Demo Page

  • This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos, with a particular focus on a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing real speech. We introduce a novel pseudo target generation strategy, integrating the text modality to align with articulatory movements, thereby guiding the generation of pseudo acoustic features for supervised training on speech reconstruction from silent articulation. Furthermore, we propose to employ a diffusion model as the fundamental architecture for the articulation-to-acoustic (A2A) conversion task and train the model with the pseudo acoustic features generated by the proposed pseudo target generation strategy using a combined training approach. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in the silent speaking mode compared to all baseline methods.
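
As a rough illustration of how a diffusion model can act as the backbone for A2A conversion, below is a minimal sketch of a DDPM-style training step that corrupts a (real or pseudo) target mel-spectrogram with noise and regresses that noise conditioned on articulatory features. The GRU denoiser, dimensions, and timestep handling are simplified placeholder assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a conditional diffusion training step for A2A conversion,
# assuming frame-aligned articulatory features and mel-spectrogram targets.
# The GRU denoiser, dimensions, and timestep handling are simplified placeholders.
import torch
import torch.nn as nn

class A2ADenoiser(nn.Module):
    """Hypothetical denoiser: predicts the injected noise from a noisy mel
    sequence, conditioned on encoded lip/tongue articulatory features."""
    def __init__(self, mel_dim=80, art_dim=256, hidden=512):
        super().__init__()
        self.proj = nn.Linear(mel_dim + art_dim + 1, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, mel_dim)

    def forward(self, noisy_mel, art_feat, t):
        # Broadcast the scalar timestep to every frame (a crude timestep embedding)
        t_emb = t.float().view(-1, 1, 1).expand(-1, noisy_mel.size(1), 1)
        h, _ = self.rnn(self.proj(torch.cat([noisy_mel, art_feat, t_emb], dim=-1)))
        return self.out(h)

def diffusion_training_step(model, mel_target, art_feat, alphas_cumprod):
    """One DDPM-style step: corrupt the (possibly pseudo) target mel with
    Gaussian noise at a random timestep and regress that noise."""
    b = mel_target.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(mel_target)
    noisy_mel = a_bar.sqrt() * mel_target + (1 - a_bar).sqrt() * noise
    return nn.functional.mse_loss(model(noisy_mel, art_feat, t), noise)
```
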
ICASSP 2023

Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Demo Page

  • This paper studies the task of speech reconstruction from ultrasound tongue images and optical lip videos recorded in a silent speaking mode, where people only activate their intra-oral and extra-oral articulators without producing sound. We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the speech recovered from silent tongue and lip articulation. Experiments show that our proposed method significantly improves the intelligibility and naturalness of the reconstructed speech in silent speaking mode compared to the baseline TaLNet model.
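
For readers unfamiliar with domain adversarial training, the sketch below shows the standard gradient-reversal-layer formulation that could be used to make encoder features indistinguishable between silent and vocalized speaking modes; the module names and dimensions are illustrative assumptions, not taken from the paper's code.

```python
# Minimal sketch of domain adversarial training via a gradient reversal layer
# (GRL): a domain classifier tries to tell silent from vocalized articulation,
# while reversed gradients push the encoder toward mode-invariant features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and scale) the gradient flowing back into the feature encoder
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    """Predicts the speaking mode (silent vs. vocalized) from encoder features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, feat, lambd=1.0):
        return self.net(GradReverse.apply(feat, lambd))

# Hypothetical usage: total = recon_loss + ce_loss(domain_clf(enc_feat), mode_label)
```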

🔑 Audio-Articulation Speech Enhancement

IEEE/ACM TASLP 2024

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Demo Page

  • This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based audio-visual speech enhancement (AV-SE) systems. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. To better model the alignment between the lip and tongue modalities, we further propose the introduction of a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most.
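
The key-value memory idea can be illustrated with a minimal attention-style lookup in which lip features act as queries over learned keys and retrieve tongue-space values, so tongue-related information remains available when only lip video exists at inference time. The slot count, dimensions, and names below are illustrative assumptions that simplify the paper's actual module.

```python
# Minimal sketch of a lip-tongue key-value memory: lip features query learned
# keys and retrieve tongue-space values for the downstream enhancement network.
import torch
import torch.nn as nn

class LipTongueMemory(nn.Module):
    def __init__(self, lip_dim=256, tongue_dim=256, num_slots=512):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, lip_dim))       # lip-space keys
        self.values = nn.Parameter(torch.randn(num_slots, tongue_dim))  # tongue-space values

    def forward(self, lip_feat):
        # lip_feat: (batch, time, lip_dim)
        attn = torch.softmax(lip_feat @ self.keys.t() / self.keys.size(-1) ** 0.5, dim=-1)
        # Retrieved pseudo tongue features, usable even without ultrasound input
        return attn @ self.values  # (batch, time, tongue_dim)
```
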
INTERSPEECH 2023

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Demo Page

  • Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve lip-based AV-SE systems’ performance. Knowledge distillation is employed at the training stage to address the challenge of acquiring ultrasound tongue images during inference, enabling an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model. Experimental results demonstrate significant improvements in the quality and intelligibility of the speech enhanced by the proposed method compared to the traditional audio-lip speech enhancement baselines. Further analysis using phone error rates (PER) of automatic speech recognition (ASR) shows that palatal and velar consonants benefit most from the introduction of ultrasound tongue images.
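
A minimal sketch of the distillation objective is given below: the audio-lip student is trained on the usual enhancement target while also matching intermediate features of the frozen audio-lip-tongue teacher. The loss form, weight, and variable names are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of the distillation objective: enhancement loss plus feature
# matching toward a frozen tongue-aware teacher. Weight alpha is illustrative.
import torch.nn.functional as F

def kd_loss(student_feat, teacher_feat, student_out, clean_target, alpha=0.5):
    enhance = F.mse_loss(student_out, clean_target)             # enhancement loss
    distill = F.mse_loss(student_feat, teacher_feat.detach())   # feature matching
    return enhance + alpha * distill
```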

🎈 Speech Coding

Under Review

ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Demo Page

  • Current neural audio codecs typically use residual vector quantization (RVQ) to discretize speech signals. However, they often experience codebook collapse, which reduces the effective codebook size and leads to suboptimal performance. To address this problem, we introduce ERVQ, Enhanced Residual Vector Quantization, a novel enhancement strategy for the RVQ framework in neural audio codecs. ERVQ mitigates codebook collapse and boosts codec performance through both intra- and inter-codebook optimization. Intra-codebook optimization incorporates an online clustering strategy and a code balancing loss to ensure balanced and efficient codebook utilization. Inter-codebook optimization improves the diversity of quantized features by minimizing the similarity between successive quantizations. Our experiments show that ERVQ significantly enhances audio codec performance across different models, sampling rates, and bitrates, achieving superior quality and generalization capabilities. Further experiments indicate that audio codecs improved by the ERVQ strategy can improve unified speech-and-text large language models (LLMs).
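
For context, the sketch below shows the plain residual vector quantization loop that ERVQ builds on, where each stage quantizes the residual left by the previous stage. ERVQ's online clustering, code-balancing loss, and inter-codebook dissimilarity term are omitted, and all names and shapes are illustrative assumptions.

```python
# Minimal sketch of plain residual vector quantization (RVQ), the framework that
# ERVQ augments with intra-codebook (online clustering, code balancing) and
# inter-codebook (dissimilarity) optimization.
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=128):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, dim)) for _ in range(num_quantizers)]
        )

    def forward(self, x):
        # x: (batch, time, dim); each stage quantizes the residual of the previous one
        residual, quantized, indices = x, torch.zeros_like(x), []
        for codebook in self.codebooks:
            dist = ((residual.unsqueeze(2) - codebook) ** 2).sum(-1)  # (batch, time, codebook_size)
            idx = dist.argmin(dim=-1)                                 # nearest code per frame
            chosen = codebook[idx]                                    # (batch, time, dim)
            quantized = quantized + chosen
            residual = residual - chosen
            indices.append(idx)
        # Straight-through estimator so gradients reach the encoder
        return x + (quantized - x).detach(), indices
```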

🎖 Honors and Awards

  • 2021.06 Honor Rank for Top 5% Graduates of USTC.
  • 2020.12 Huawei Scholarship.
  • 2019.12 Top-Notch Program Funding.
  • 2019.12 USTC Outstanding Student Scholarship, Gold Award.

📚 Teaching Assistant Experience

  • 2022 Fall, Fundamentals of Speech Signal Processing, USTC (Prof. Zhen-Hua Ling)
  • 2021 Fall, Fundamentals of Speech Signal Processing, USTC (Prof. Zhen-Hua Ling)
  • 2020 Fall, Computer Programming Design A, USTC (Lecturer Hu Si)