CycleCodec: Pretrain-Free Factorized Neural Speech Codec via Cycle-Consistent Speaker Swapping



Authors: Rui-Chen Zheng, Nicholas Sanders, Jinzuomu Zhong, Yang Ai, Zhen-Hua Ling, and Korin Richmond

Factorized neural speech codecs often rely on supervision from pretrained self-supervised learning (SSL) encoders or automatic speech recognition (ASR) systems to disentangle content and speaker information. This dependency limits their applicability to low-resource languages, for which reliable teacher models are unavailable at training time. In this paper, we propose CycleCodec, a pretrain-free factorized codec trained entirely from scratch. We introduce a cycle-consistent speaker swapping strategy that enforces functional separability between a time-varying content stream and a global speaker stream via self-supervised reconstruction constraints. We further enhance the speaker representation using a query-based Transformer aggregator. Experiments on an in-domain language (English) and two unseen languages (Mandarin and Vietnamese) show that CycleCodec significantly outperforms pretrain-free baselines and achieves more robust speaker-content disentanglement on the unseen languages.
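The idea of cycle-consistent speaker swapping can be illustrated with a toy sketch. This is not the CycleCodec implementation: the names encode, decode, l1, and cycle_loss are hypothetical, the mean/residual factorization stands in for the learned content and speaker encoders, and the L1 distance is an assumed choice of reconstruction loss. The sketch only shows the shape of the constraint: swap in another utterance's speaker vector, re-encode the result, swap the original speaker back, and compare with the original input.

```python
from typing import List, Tuple

Frames = List[List[float]]  # T frames, each a D-dimensional feature vector


def encode(x: Frames) -> Tuple[Frames, List[float]]:
    """Toy factorization (assumption): the global speaker vector is the
    per-dimension mean over time; the content stream is the residual."""
    t, dim = len(x), len(x[0])
    speaker = [sum(frame[d] for frame in x) / t for d in range(dim)]
    content = [[frame[d] - speaker[d] for d in range(dim)] for frame in x]
    return content, speaker


def decode(content: Frames, speaker: List[float]) -> Frames:
    """Toy decoder: add the global speaker vector back to every content frame."""
    return [[c + s for c, s in zip(frame, speaker)] for frame in content]


def l1(a: Frames, b: Frames) -> float:
    """Mean absolute difference between two equally shaped frame sequences."""
    total = sum(abs(u - v) for fa, fb in zip(a, b) for u, v in zip(fa, fb))
    count = sum(len(fa) for fa in a)
    return total / count


def cycle_loss(x_a: Frames, x_b: Frames) -> float:
    """Swap speaker A -> B, re-encode, swap back B -> A, compare with x_a."""
    c_a, s_a = encode(x_a)
    _, s_b = encode(x_b)
    x_ab = decode(c_a, s_b)    # content of A rendered with speaker B
    c_ab, _ = encode(x_ab)     # re-encode the swapped signal
    x_aba = decode(c_ab, s_a)  # restore speaker A
    return l1(x_aba, x_a)


# Two toy "utterances" of different lengths.
x_a = [[1.0, 2.0], [3.0, 6.0]]
x_b = [[10.0, 0.0], [20.0, 0.0], [30.0, 0.0]]
print(cycle_loss(x_a, x_b))  # 0.0 for this exactly separable toy factorization
```

In this toy case the two streams are exactly separable, so the cycle loss is zero. With learned encoders the loss is generally nonzero, and minimizing it penalizes speaker information leaking into the content stream (and content leaking into the speaker stream) without requiring any pretrained teacher.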






Experimental Results: Reconstruction

💡 Baseline Note

TiCodec is a distillation-free baseline: it does not rely on any pretrained teacher model, which makes it the fairest comparison to CycleCodec. LSCodec, in contrast, uses supervision from a pretrained model for content-speaker disentanglement and should therefore be viewed as a teacher-guided upper bound rather than a strictly distillation-free baseline.




LibriTTS test-clean (English)

Input | LSCodec | TiCodec | CycleCodec

Seed-tts-zh (Mandarin)

Input | LSCodec | TiCodec | CycleCodec

VieNeu-TTS-140h (Vietnamese)

Input | LSCodec | TiCodec | CycleCodec




Experimental Results: Disentanglement





LibriTTS test-clean (English)

Source Text | Source Speech | Reference Speech | LSCodec | TiCodec | CycleCodec

To be sure, I had confidence in this devoted lad.

Ten thousand souls won for God in a single month!

Miss Taylor did not know much about cotton, but at least one more remark seemed called for.

I am six feet high, and I could do it with an effort. No one less than that would have a chance.

If any boys have special confessors perhaps it will be better for them not to change.


Seed-tts-zh (Mandarin)

Source Text | Source Speech | Reference Speech | LSCodec | TiCodec | CycleCodec

他的建议迎合了一些,舍不得抛弃财产撤退的日侨,侥幸偷安的心理。

孙楠我是歌手退赛成就汪涵,揭秘汪涵的神坛之路。

我觉得你穿白衬衫配蓝领带看起来不错。

他没有钱啊,进城以后,房子越来越贵,租也租不到,就钻地下室。

玉帝说,众位爱卿,请安静,让我们好好商议一下,再做行动。


VieNeu-TTS-140h (Vietnamese)

Source Text | Source Speech | Reference Speech | LSCodec | TiCodec | CycleCodec

Các trường mầm non dạy những định kiến này cho trẻ em.

Nhưng lại là hai sự thật xác định bản chất của ngành này.

Tình huống phản thực tế chỉ hữu dụng khi có giới hạn ràng buộc.

Các thể chế và đạo luật đó không thể thay đổi được tính cách con người.

Một sự trao truyền riêng biệt nằm ngoài kinh sách.