Current speech tokenizers typically assign a fixed number of tokens per second, regardless of the varying information density or temporal fluctuations in the speech signal. This uniform token allocation mismatches the intrinsic structure of speech, where information is distributed unevenly over time. To address this, we propose VARSTok, a VAriable-frame-Rate Speech Tokenizer that adapts token allocation based on local feature similarity. VARSTok introduces a temporal-aware clustering algorithm that segments speech into variable-length units, followed by vector quantization. To support downstream language modeling without auxiliary duration predictors, we further design an implicit duration encoding scheme that integrates both content and temporal span into a single token index. Experimental results show that VARSTok achieves superior reconstruction naturalness and improved semantic representation compared to fixed-rate baselines while using fewer tokens. Additionally, it yields lower or comparable word error rates and better naturalness in a downstream text-to-speech task. To the best of our knowledge, this is the first work to demonstrate that a fully dynamic, variable-rate speech tokenizer can be directly and effectively integrated into downstream speech language models.
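The two ideas in the abstract, grouping adjacent frames by local similarity and folding the span length into the token index, can be illustrated with a minimal sketch. This is not the VARSTok implementation: the greedy cosine-similarity rule, the function names, the threshold, the maximum span of 4 frames, and the codebook size of 4096 are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_frames(frames, threshold=0.9, max_span=4):
    """Greedily group consecutive frames into variable-length segments.

    Assumed rule (hypothetical): a frame joins the current segment while it
    is similar enough to the segment's running mean and the segment is
    shorter than max_span. Returns a list of (start, length) pairs.
    """
    segments = []
    start = 0
    while start < len(frames):
        length = 1
        mean = list(frames[start])
        while (start + length < len(frames)
               and length < max_span
               and cosine(mean, frames[start + length]) >= threshold):
            nxt = frames[start + length]
            # Update the running mean of the segment with the new frame.
            mean = [(m * length + x) / (length + 1) for m, x in zip(mean, nxt)]
            length += 1
        segments.append((start, length))
        start += length
    return segments


def encode_token(code, span, codebook_size=4096):
    """Pack a codebook entry and a span (in frames) into one token index."""
    if not 0 <= code < codebook_size or span < 1:
        raise ValueError("code or span out of range")
    return (span - 1) * codebook_size + code


def decode_token(index, codebook_size=4096):
    """Recover (code, span) from a combined token index."""
    span_minus_one, code = divmod(index, codebook_size)
    return code, span_minus_one + 1
```

With an encoding of this kind, a language model predicts a single index per segment, and the decoder recovers the duration with an integer division, which is why no separate duration predictor is needed.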
Experimental Results: Speech Reconstruction
Test‑Clean Subset
GroundTruth | BigCodec @ 40Hz, 0.52 kbps | XCodec2.0 @ 50Hz, 0.80 kbps | WavTokenizer @ 75Hz, 0.90 kbps | WavTokenizer @ 40Hz, 0.48 kbps | VARSTok @ 36.81Hz, 0.52 kbps | VARSTok @ 30.95Hz, 0.43 kbps | VARSTok @ 26.29Hz, 0.37 kbps |
---|---|---|---|---|---|---|---|
Test‑Other Subset
GroundTruth | BigCodec @ 40Hz, 0.52 kbps | XCodec2.0 @ 50Hz, 0.80 kbps | WavTokenizer @ 75Hz, 0.90 kbps | WavTokenizer @ 40Hz, 0.48 kbps | VARSTok @ 36.81Hz, 0.52 kbps | VARSTok @ 30.95Hz, 0.43 kbps | VARSTok @ 26.29Hz, 0.37 kbps |
---|---|---|---|---|---|---|---|
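The token rates and bitrates in the column headers above are linked by the average number of bits carried per token (bitrate divided by token rate). The short sketch below works through that arithmetic using only the figures listed in the headers; the helper name `bits_per_token` is hypothetical, and the results are approximate because the published bitrates are rounded to two decimals.

```python
def bits_per_token(token_rate_hz, bitrate_kbps):
    """Average bits per token implied by a token rate and a bitrate."""
    return bitrate_kbps * 1000.0 / token_rate_hz


# (system, token rate in Hz, bitrate in kbps), copied from the table headers.
systems = [
    ("BigCodec", 40.0, 0.52),
    ("XCodec2.0", 50.0, 0.80),
    ("WavTokenizer", 75.0, 0.90),
    ("WavTokenizer", 40.0, 0.48),
    ("VARSTok", 36.81, 0.52),
    ("VARSTok", 30.95, 0.43),
    ("VARSTok", 26.29, 0.37),
]

for name, rate, kbps in systems:
    print(f"{name} @ {rate}Hz: {bits_per_token(rate, kbps):.2f} bits/token")
```

Under this arithmetic the fixed-rate baselines come out at 12 to 16 bits per token, while the three VARSTok settings all land near 14 bits per token, so the lower VARSTok bitrates come from emitting fewer tokens per second rather than from smaller tokens.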
Experimental Results: TTS using a single AR Model
Prompt Text | Prompt Speech | WavTokenizer @ 40Hz | VARSTok @ 36.81Hz | VARSTok @ 30.95Hz | VARSTok @ 26.29Hz |
---|---|---|---|---|---|
The rain drops were still dripping and gleaming from the trees, flashing back the heavy yellow sunlight. | | | | | |
But the middle class wife still carries on the business of vicarious leisure, for the good name of the household and its master. | | | | | |
This was why he held several long conferences with his friend Marshall, the manager at the mill. | | | | | |
The chaos in which his ardour extinguished itself was a cold indifferent knowledge of himself. | | | | | |
Yes, his mother was hostile to the idea, as he had read from her listless silence. | | | | | |
Yes, but if she should have understood, and understood too well, she may talk. | | | | | |