Factorized neural speech codecs often rely on supervision from pretrained self-supervised learning (SSL) encoders or automatic speech recognition (ASR) systems to disentangle content and speaker information. However, this dependency limits their applicability to low-resource languages, where reliable teacher models are unavailable during training. In this paper, we propose CycleCodec, a pretrain-free factorized codec trained entirely from scratch. We introduce a cycle-consistent speaker swapping strategy that enforces functional separability between the time-varying content stream and the global speaker stream via self-supervised reconstruction constraints. We further enhance the speaker representation using a query-based Transformer aggregator. Experiments on an in-domain language (English) and two unseen languages (Mandarin and Vietnamese) show that CycleCodec significantly outperforms pretrain-free baselines and achieves more robust speaker-content disentanglement on unseen languages.
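To make the idea of cycle-consistent speaker swapping concrete, the following is a minimal toy sketch, not the CycleCodec architecture. It assumes a hypothetical setup where each utterance is a list of scalar frames, the "speaker" stream is the utterance mean, and the "content" stream is the mean-removed frames. All function names (`encode_content`, `encode_speaker`, `decode`, `cycle_loss`) are illustrative assumptions and do not come from the paper.

```python
from statistics import fmean

# Toy stand-ins for learned networks (assumptions for illustration only).

def encode_content(frames):
    """Time-varying content stream: frames with the utterance mean removed."""
    mean = fmean(frames)
    return [x - mean for x in frames]

def encode_speaker(frames):
    """Global speaker stream: one utterance-level statistic."""
    return fmean(frames)

def decode(content, speaker):
    """Recombine the time-varying stream with the global stream."""
    return [c + speaker for c in content]

def mse(a, b):
    """Mean squared error between two equal-length frame sequences."""
    return fmean((x - y) ** 2 for x, y in zip(a, b))

def cycle_loss(utt_a, utt_b):
    # Forward swap: content of A rendered with the speaker of B.
    swapped = decode(encode_content(utt_a), encode_speaker(utt_b))
    # Backward swap: re-encode the swapped signal and restore speaker A.
    cycled = decode(encode_content(swapped), encode_speaker(utt_a))
    # If the two streams are functionally separable, the cycle returns A.
    return mse(cycled, utt_a)

utt_a = [0.0, 1.0, 2.0, 1.0]  # hypothetical frames from "speaker A"
utt_b = [5.0, 5.5, 4.5, 5.0]  # hypothetical frames from "speaker B"
loss = cycle_loss(utt_a, utt_b)
```

In this toy case the two streams are separable by construction, so the cycle loss is zero. In a learned codec the same quantity would presumably act as a training signal that penalizes speaker information leaking into the content stream, or the reverse, without requiring a pretrained teacher.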
Experimental Results: Reconstruction
💡 Baseline Note
TiCodec is a distillation-free baseline that does not rely on any pretrained teacher model, making it the fairest comparison to CycleCodec. LSCodec, in contrast, uses pretrained-model supervision for content-speaker disentanglement, and should therefore be viewed as a teacher-guided upper-bound comparison rather than a strictly distillation-free baseline.
LibriTTS test-clean (English)
| Input | LSCodec | TiCodec | CycleCodec |
|---|---|---|---|
Seed-tts-zh (Mandarin)
| Input | LSCodec | TiCodec | CycleCodec |
|---|---|---|---|
VieNeu-TTS-140h (Vietnamese)
| Input | LSCodec | TiCodec | CycleCodec |
|---|---|---|---|
Experimental Results: Disentanglement
LibriTTS test-clean (English)
| Source Text | Source Speech | Reference Speech | LSCodec | TiCodec | CycleCodec |
|---|---|---|---|---|---|
| To be sure, I had confidence in this devoted lad. | | | | | |
| Ten thousand souls won for God in a single month! | | | | | |
| Miss Taylor did not know much about cotton, but at least one more remark seemed called for. | | | | | |
| I am six feet high, and I could do it with an effort. No one less than that would have a chance. | | | | | |
| If any boys have special confessors perhaps it will be better for them not to change. | | | | | |
Seed-tts-zh (Mandarin)
| Source Text | Source Speech | Reference Speech | LSCodec | TiCodec | CycleCodec |
|---|---|---|---|---|---|
| 他的建议迎合了一些,舍不得抛弃财产撤退的日侨,侥幸偷安的心理。 | | | | | |
| 孙楠我是歌手退赛成就汪涵,揭秘汪涵的神坛之路。 | | | | | |
| 我觉得你穿白衬衫配蓝领带看起来不错。 | | | | | |
| 他没有钱啊,进城以后,房子越来越贵,租也租不到,就钻地下室。 | | | | | |
| 玉帝说,众位爱卿,请安静,让我们好好商议一下,再做行动。 | | | | | |
VieNeu-TTS-140h (Vietnamese)
| Source Text | Source Speech | Reference Speech | LSCodec | TiCodec | CycleCodec |
|---|---|---|---|---|---|
| Các trường mầm non dạy những định kiến này cho trẻ em. | | | | | |
| Nhưng lại là hai sự thật xác định bản chất của ngành này. | | | | | |
| Tình huống phản thực tế chỉ hữu dụng khi có giới hạn ràng buộc. | | | | | |
| Các thể chế và đạo luật đó không thể thay đổi được tính cách con người. | | | | | |
| Một sự trao truyền riêng biệt nằm ngoài kinh sách. | | | | | |