Noise robustness remains a critical challenge in developing neural audio codecs, particularly for real-world speech communication. This paper presents a novel training strategy, progressive probabilistic top-K sampling, designed to enhance the noise robustness of audio codecs while training exclusively on clean speech. Unlike conventional residual vector quantization (RVQ), which always selects the closest codebook vector, our approach probabilistically samples from the top-K closest candidates, simulating noise at the code level and enabling the model to handle unseen noisy conditions. We further propose a progressive training schedule that gradually introduces this noise robustness from the final quantizer back to the first quantizer in the RVQ stack. Experiments on EnCodec, a state-of-the-art audio codec, demonstrate significant improvements in noise robustness, with PESQ rising from 2.399 to 2.466 on decoded noisy speech, while maintaining high quality on clean speech.
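The two ideas above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the softmax-over-negative-distances sampling rule, the temperature, and the linear progression schedule are all assumptions made for the example.

```python
import numpy as np

def topk_sample_quantize(residual, codebook, k=4, temperature=1.0, rng=None):
    """Quantize one residual vector by sampling among its top-K nearest
    codebook entries, rather than always taking the single closest one.
    (Hypothetical sketch of probabilistic top-K sampling.)"""
    rng = rng or np.random.default_rng()
    # Euclidean distance from the residual to every codebook entry.
    dists = np.linalg.norm(codebook - residual, axis=1)
    # Indices of the K closest candidates.
    topk = np.argsort(dists)[:k]
    # Closer candidates get higher probability (softmax of negative distance;
    # the exact weighting used in the paper is not specified here).
    logits = -dists[topk] / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = rng.choice(topk, p=probs)
    return idx, codebook[idx]

def sampling_quantizers(step, total_steps, n_q):
    """One possible progressive schedule: start with top-K sampling enabled
    only in the last RVQ stage, then extend it toward the first stage as
    training advances (a linear schedule, assumed for illustration)."""
    n_enabled = min(n_q, int(n_q * step / total_steps) + 1)
    return set(range(n_q - n_enabled, n_q))
```

In an RVQ loop, each stage would call `topk_sample_quantize` (or plain nearest-neighbor lookup, depending on `sampling_quantizers`) on the current residual, then subtract the returned code vector before passing the remainder to the next stage.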
Experimental Results: Coding Noisy Speech
EnCodec @ 6kbps for 24kHz Audio
| Input Noisy | Closest | Closest* | Proposed | Proposed† |
|---|---|---|---|---|
| *(audio samples omitted)* | *(audio samples omitted)* | *(audio samples omitted)* | *(audio samples omitted)* | *(audio samples omitted)* |
Experimental Results: Coding Clean Speech
EnCodec @ 6kbps for 24kHz Audio
| Groundtruth | Closest | Closest* | Proposed | Proposed† |
|---|---|---|---|---|
| *(audio samples omitted)* | *(audio samples omitted)* | *(audio samples omitted)* | *(audio samples omitted)* | *(audio samples omitted)* |