Current neural audio codecs typically use residual vector quantization (RVQ) to discretize speech signals. However, they often experience codebook collapse, which reduces the effective codebook size and leads to suboptimal performance. To address this problem, we introduce ERVQ, Enhanced Residual Vector Quantization, a novel enhancement strategy for the RVQ framework in neural audio codecs. ERVQ mitigates codebook collapse and boosts codec performance through both intra- and inter-codebook optimization. Intra-codebook optimization incorporates an online clustering strategy and a code balancing loss to ensure balanced and efficient codebook utilization. Inter-codebook optimization improves the diversity of quantized features by minimizing the similarity between successive quantizations. Our experiments show that ERVQ significantly enhances audio codec performance across different models, sampling rates, and bitrates, achieving superior quality and generalization capabilities. It also achieves 100% codebook utilization on one of the most advanced neural audio codecs. Further experiments indicate that audio codecs improved by the ERVQ strategy can improve unified speech-and-text large language models (LLMs). Specifically, there is a notable improvement in the naturalness of generated speech in downstream zero-shot text-to-speech tasks.
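To make the quantities in the abstract concrete, the following is a minimal pure-Python sketch of plain RVQ together with three diagnostics: codebook utilization, a normalized usage entropy as a rough proxy for code balance, and the cosine similarity between the outputs of successive quantizers. It is not the ERVQ implementation. The function names (`rvq_encode`, `utilization`, `usage_entropy`, `cosine`, `demo`), the random codebooks, and the choice of entropy and cosine similarity as measures are all assumptions made for illustration; the actual ERVQ losses and online clustering strategy are defined in the paper.

```python
import math
import random


def nearest(codebook, x):
    """Return the index of the codebook vector closest to x (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], x)),
    )


def rvq_encode(x, codebooks):
    """Plain RVQ: each stage quantizes the residual left by the previous stage."""
    residual = list(x)
    indices, stage_outputs = [], []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        stage_outputs.append(codebook[k])
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, stage_outputs, residual


def utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


def usage_entropy(indices, codebook_size):
    """Normalized usage entropy (assumes codebook_size > 1): 1.0 is balanced, 0.0 is collapsed."""
    n = len(indices)
    counts = [indices.count(k) for k in range(codebook_size)]
    entropy = sum(-(c / n) * math.log(c / n) for c in counts if c > 0)
    return entropy / math.log(codebook_size)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def demo(seed=0, dim=4, codebook_size=8, num_stages=3, num_frames=200):
    """Encode random frames with random codebooks (hypothetical data) and print diagnostics."""
    rng = random.Random(seed)
    codebooks = [
        [[rng.gauss(0.0, 1.0 / (s + 1)) for _ in range(dim)] for _ in range(codebook_size)]
        for s in range(num_stages)
    ]
    per_stage = [[] for _ in range(num_stages)]
    similarities = []
    for _ in range(num_frames):
        frame = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        indices, outputs, _ = rvq_encode(frame, codebooks)
        for s, k in enumerate(indices):
            per_stage[s].append(k)
        similarities.append(cosine(outputs[0], outputs[1]))
    for s, used in enumerate(per_stage):
        print(f"stage {s}: utilization={utilization(used, codebook_size):.2f}, "
              f"entropy={usage_entropy(used, codebook_size):.2f}")
    print(f"mean cosine(stage 0, stage 1) = {sum(similarities) / len(similarities):.3f}")


if __name__ == "__main__":
    demo()
```

In this sketch, codebook collapse would show up as utilization well below 1.0 and usage entropy close to 0.0, while highly similar successive quantizations would show up as a mean cosine similarity close to 1.0. These are the kinds of quantities that the intra- and inter-codebook optimization described above aim to improve.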
Experimental Results: Speech Coding
For more samples, please refer to this folder.
HiFi-Codec @ 2kbps for 16kHz Audio
GroundTruth | Baseline | ERVQ |
---|---|---|
DAC @ 3kbps for 24kHz Audio
GroundTruth | Baseline | ERVQ |
---|---|---|
APCodec @ 6kbps for 48kHz Audio
GroundTruth | Baseline | ERVQ |
---|---|---|
EnCodec @ 6kbps for 24kHz Audio
GroundTruth | Baseline | ERVQ |
---|---|---|
APCodec_S @ 6kbps for 48kHz Audio
GroundTruth | Baseline | ERVQ |
---|---|---|
Experimental Results: FunCodec + LauraGPT
For more samples, please refer to this folder.
Prompt Text | Prompt Audio | Baseline | ERVQ |
---|---|---|---|
You feel that, properly, Alexandra's house is the big out of doors, and that it is in the soil that she expresses herself best. | | | |
When I saw that he was absent, I withdrew at once. | | | |
Yes; but perhaps I frightened her. | | | |
Sir Harry Towne bowed and said that he had met Mr. Alexander and his wife in Tokyo. | | | |
She's apt to grow a bit stale after a time. | | | |
A stage meal is popular, because it proves to the audience that the actors, even when called Charles Hawtrey or Owen Nares, are real people just like you and me. | | | |