We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals
through which children communicate. Specifically, ChildVox follows the full developmental trajectory
from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical
syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across more than 15 child-centered
audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison.
We evaluate a representative range of audio and speech foundation models, including self-supervised models (e.g., SSAST),
ASR-oriented models (e.g., Whisper), and large audio-language models (e.g., Qwen2Audio), on tasks ranging from physiological sound classification,
through vocalization and canonical-syllable modeling, to speech quality assessment and recognition.
Benchmark results show that ChildVox yields a suite of high-performing models for recognizing a wide range
of acoustic signals from children, supporting downstream applications such as characterizing
children's language levels and tracking speech production with age.