Preprint 2026 · Under Review

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Tiantian Feng¹ Anfeng Xu¹ Xuan Shi¹ Aditya Kommineni¹ Shakhrul Iman Siam²
Megan Micheletti³ Zhonghao Shi⁴ Helen Tager-Flusberg⁵ Mi Zhang²
Lynn K. Perry⁶ Catherine Lord³ Daniel Messinger⁶ Shrikanth Narayanan¹

¹University of Southern California ²The Ohio State University
³University of California, Los Angeles ⁴Harvard University
⁵Boston University ⁶University of Miami
tiantiaf@usc.edu


15+ child-centered audio datasets

20+ evaluation sub-tasks

Foundation models of audio, speech, and LALMs

Multiple developmental stages, from birth to school-age

Physiological Sounds

Birth (and across Life)

Cardiac and respiratory sounds carry clinical signals before any spoken word or vocalization.

Vocalizations

Infancy

Cries, laughter, and whimpers: non-linguistic events with rich affective meaning.

Canonical Syllables

Toddler

Babbling, proto-speech, and emerging spoken-language skills.

Speech

School-age

Pronunciation, fluency, prosody, intelligibility, and conversational ASR.

ChildVox is a unified benchmark covering the full developmental trajectory through which children "express" themselves, ranging from physiological sounds at birth, through infant vocalizations and toddler babble, to school-age speech.

Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across more than 15 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison.

We evaluate a representative range of audio and speech foundation models, including self-supervised (e.g., SSAST), ASR-oriented (e.g., Whisper), and large audio-language models (e.g., Qwen2-Audio), on tasks ranging from physiological sound classification, through vocalization and canonical-syllable modeling, to speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Key Findings

Audio-SSL models capture what speech models miss.

SSAST and WavLM lead on cardiac murmur, crackle, wheeze, AudioSet-Child, ReCANVo, and SpeechMaturity. Generic audio pre-training preserves the fine-grained acoustic cues that speech-only pre-training may not capture.

Whisper-Large dominates speech-level tasks.

Whisper-Large is best on all three SpeechOcean762 dimensions (prosody, accuracy, fluency), on NLS intelligibility, and on diarization and ASR, reaching a word error rate (WER) of 14.80 on MyST and a diarization error rate (DER) of 17.70 on NLS.

Among LALMs, Qwen2-Audio achieves competitive performance.

Qwen2-Audio matches the strongest encoders on AudioSet and ReCANVo, whereas AudioFlamingo 3 follows instructions inconsistently and hallucinates transcripts.

Specialized fine-tuned models outperform frontier proprietary models.

Gemini 2.5 Flash and Gemini 3.5 Flash, prompted zero-shot, fall below Macro-F1 0.35 on CirCor, SPRSound, and ReCANVo. ChildVox-trained models lead on every comparable public task.

Benchmark Structure

Category I · Birth (and across Life)

Physiological Sound Classification

Three pediatric cardiac and respiratory corpora support four classification tasks: murmur detection (CirCor, 5,272 cardiac auscultation recordings), crackle and wheeze detection (ICBHI), and respiratory condition / sound pattern classification (SPRSound). These signals carry clinical information that arrives before any spoken word.

Category II · Infancy

Vocalization Event Classification

Ten-way generic child sound classification on AudioSet (speech, babble, giggle, shout, laughter, cry, whimper, sing, play, and music for children), plus cry-cause classification on Donate-a-Cry (hunger vs. other) and CryBank (hunger, loneliness, discomfort). Most of these events are non-linguistic but carry rich affective meaning.

Category III · Toddler

Canonical Syllables Classification

Affective status classification on ReCANVo (delighted, dysregulated, frustrated, request, self-talk, social) for minimally speaking individuals, and vocal-development classification on BabbleCor and SpeechMaturity (cry, laugh, canonical, non-canonical, junk). Tasks here trace the emergence of spoken language ability.

Category IV · School-age

Speech Quality, Diarization & ASR

Speech-quality assessment on PERCEPT-R (rhotic vs. derhotic), SpeechOcean762 (prosody, fluency, accuracy), and UltraSuite (place of articulation); speech emotion recognition on C-BESD; child-adult speaker diarization on the in-house NLS and ADOS2-Mod3 datasets; word-level ASR on MyST; and phoneme-level recognition on TinyVox.

Datasets

ChildVox integrates more than 15 child-centered audio and speech datasets into one computational benchmark. The benchmark includes publicly available pediatric heart, lung, and vocalization corpora alongside two in-house naturalistic interaction datasets (NLS and ADOS2-Mod3).

Dataset	Category	Evaluation Task	Labels
CirCor	Physiological	Murmur detection	Absent · Present · Unknown
ICBHI	Physiological	Crackle / Wheeze / Condition	Healthy · COPD · Other / Crackles / Wheezes
SPRSound	Physiological	Respiratory sound	Normal · CAS · DAS · CAS+DAS · Poor Q.
Donate-a-Cry	Vocalization	Cry classification	Hunger · Other
CryBank	Vocalization	Cry cause	Hunger · Loneliness · Discomfort
AudioSet-Child	Vocalization	10-way child sound	Speech · Babble · Giggle · Cry · Music · …
ReCANVo	Canonical	Affective status	Delighted · Dysregulated · Frustrated · Request · Self-talk · Social
BabbleCor	Canonical	Vocal development	Cry · Laugh · Canonical · Non-Canonical · Junk
SpeechMaturity	Canonical	Vocal development	Cry · Laugh · Canonical · Non-Canonical · Junk
C-BESD	Speech	Emotion recognition	Anger · Happy · Neutral · Sad
PERCEPT-R	Speech	Rhotic production	Rhotic · Derhotic
SpeechOcean762	Speech	Prosody · Fluency · Accuracy	Poor · Nearly correct · Correct (and variants)
UltraSuite	Speech	Articulator place	Bilabial · Dental · Velar · Palatal · …
NLS (in-house)	Speech	Diarization · Intelligibility	Speaker · Intelligible · Vocalization
ADOS2-Mod3 (in-house)	Speech	Diarization · ASR	Speaker labels · Transcript
MyST	Speech	Word-level ASR	Speech transcript
TinyVox	Speech	Phoneme recognition	Phoneme transcript

ChildVox-Balanced. For fine-tuning LALMs, we curate a label-balanced subset drawn only from publicly accessible resources, capped at 2,000 / 50 train/test samples per label and at 10,000 / 500 train/test samples for ASR. This yields 64,641 audio recordings across 14 subtasks.
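
The per-label cap described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `balanced_subset`, the `(audio_id, label)` sample format, the random selection, and the toy pool are all hypothetical and are not taken from the ChildVox release.

```python
import random
from collections import defaultdict


def balanced_subset(samples, max_per_label, seed=0):
    """Keep at most `max_per_label` randomly chosen samples for each label.

    `samples` is a list of (audio_id, label) pairs (a hypothetical format).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for audio_id, label in samples:
        by_label[label].append(audio_id)

    subset = []
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        # Labels with fewer than `max_per_label` samples keep everything.
        subset.extend((audio_id, label) for audio_id in ids[:max_per_label])
    return subset


# Toy pool: 5,000 "cry" clips and 300 "laugh" clips.
pool = [(f"cry_{i}", "cry") for i in range(5000)]
pool += [(f"laugh_{i}", "laugh") for i in range(300)]

train = balanced_subset(pool, max_per_label=2000)
counts = defaultdict(int)
for _, label in train:
    counts[label] += 1
print(dict(counts))  # {'cry': 2000, 'laugh': 300}
```

Because the cap is an upper bound, rare labels stay below it, so a subset built this way is balanced only up to the availability of each label.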

Foundation Models Evaluated

ChildVox evaluates three families of audio and speech foundation models, ranging in size from ~20M to ~8B parameters. Encoder-based models are adapted via a learnable weighted average over hidden layers, a point-wise 1D convolution, and LoRA fine-tuning (rank 64). LALMs are fine-tuned with LoRA on query, key, value, and feed-forward projections.
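
To make the encoder adaptation more concrete, the following dependency-free Python sketch shows the forward computation of a weighted average over hidden layers followed by a point-wise (kernel size 1) convolution. It rests on assumptions: the layer weights are softmax-normalized (a common choice, not confirmed by the source), all function names are hypothetical, and LoRA and training are not shown. In practice the layer logits and kernel would be learned parameters in a deep-learning framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_average(hidden_states, layer_logits):
    """Combine hidden_states[layer][frame][dim] into pooled[frame][dim].

    Assumption: layer weights are softmax(layer_logits), one logit per layer.
    """
    weights = softmax(layer_logits)
    num_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    return [
        [
            sum(w * hidden_states[l][t][d] for l, w in enumerate(weights))
            for d in range(dim)
        ]
        for t in range(num_frames)
    ]


def pointwise_conv1d(frames, kernel, bias):
    """Kernel-size-1 convolution: the same linear map applied to each frame.

    kernel[o][d] maps input dimension d to output channel o.
    """
    return [
        [sum(k * x for k, x in zip(row, frame)) + b for row, b in zip(kernel, bias)]
        for frame in frames
    ]


# Toy example: 3 layers, 2 frames, 2-dimensional features.
hidden = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
pooled = weighted_layer_average(hidden, layer_logits=[0.0, 0.0, 0.0])  # uniform
projected = pointwise_conv1d(pooled, kernel=[[1.0, 1.0]], bias=[0.0])
print(pooled)
print(projected)
```

With equal logits the pooled features are the plain mean over layers. Training the logits lets each task emphasize the layers that carry the most useful cues.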

Model	Family	Pre-training Objective	Params (used)
SSAST-Base	SSL	Self-supervised audio	~89M
voc2vec-HuBERT	SSL	Non-verbal vocalization SSL	~89M
WavLM-Large	SSL	Speech SSL	~316M
Whisper-Base	ASR	VAD · ASR · LID	~20M (Encoder)
Whisper-Small	ASR	VAD · ASR · LID	~88M (Encoder)
Whisper-Large v3	ASR	VAD · ASR · LID	~635M (Encoder)
Qwen2-Audio-Instruct	LALM	Multi-stage uni- + multi-modal	~7B
AudioFlamingo 3	LALM	Multi-stage uni- + multi-modal	~8B

Results

Performance varies widely across the four categories of child sounds. Audio-SSL models lead on physiological sounds and non-linguistic vocalizations; Whisper models perform best on pronunciation, fluency, and ASR. Different pre-trained encoders capture different aspects of child-centered acoustic signals.

Macro-F1 across categories (encoder-based models)

Task	Dataset	SSAST	voc2vec-HuBERT	WavLM-Large	Whisper-Base	Whisper-Small	Whisper-Large

Diarization & ASR (lower is better)

Model	DER ↓ NLS	DER ↓ ADOS	WER ↓ MyST	WER ↓ ADOS
WavLM-Large	22.10	45.60	16.56	59.34
Whisper-Base	24.20	48.20	16.99	55.85
Whisper-Small	18.70	44.00	15.93	59.91
Whisper-Large	17.70	42.50	14.80	40.20
Parakeet-TDT	-	-	15.82	45.60
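
For readers unfamiliar with the two metrics in this table, the sketch below computes WER from a word-level edit distance and a simplified frame-level DER. It makes several assumptions that the source does not confirm: both scores are reported as percentages, DER is computed on fixed-length frames with fixed child/adult labels (no overlap handling, no collar, no speaker mapping), and no text normalization is applied before WER. The function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER (%) = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def diarization_error_rate(ref_frames, hyp_frames):
    """Frame-level DER (%): (missed + false alarm + confusion) / speech frames.

    Each frame holds a speaker label such as "child" or "adult", or None
    for non-speech.
    """
    if len(ref_frames) != len(hyp_frames):
        raise ValueError("reference and hypothesis must have equal length")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(ref_frames, hyp_frames):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1
            elif hyp != ref:
                confusion += 1
        elif hyp is not None:
            false_alarm += 1
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return 100.0 * (missed + false_alarm + confusion) / speech


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
der = diarization_error_rate(
    ["child", "child", "adult", "adult", None, None],
    ["child", "adult", "adult", None, "child", None],
)
print(round(wer, 2), der)  # 16.67 75.0
```

In the toy example one of six reference words is deleted, and three of four reference speech frames are affected by a confusion, a miss, or a false alarm.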

ChildVox vs. Proprietary Zero-Shot Baselines

Macro-F1 on the ChildVox-Balanced test set. General-purpose LALMs struggle with fine-grained pediatric vocal and physiological audio; ChildVox-trained models lead on every task tested.
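
Macro-F1 averages the per-class F1 scores with equal weight, so a model cannot score well by predicting only the most frequent label. The sketch below is a minimal, dependency-free illustration. The function name and toy labels are hypothetical, and averaging over the union of reference and predicted labels is an assumption about the evaluation protocol.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy murmur-detection labels (illustrative only, not benchmark data).
y_true = ["absent", "absent", "present", "present", "unknown", "unknown"]
y_pred = ["absent", "absent", "absent", "present", "unknown", "absent"]

score = macro_f1(y_true, y_pred)
majority = macro_f1(y_true, ["absent"] * len(y_true))
print(round(score, 3), round(majority, 3))  # 0.667 0.167
```

The second value shows why the metric suits imbalanced clinical labels: always predicting "absent" yields a Macro-F1 of about 0.17 on this toy set even though a third of the predictions are correct.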

Downstream Applications

ChildVox models produce measurements that align with expert developmental judgments, not just leaderboard numbers. We highlight two exemplar applications.

Application I · NLS

Characterizing Language Level

Utterance rate from the ChildVox diarization model rises monotonically across expert-assigned spoken-language levels: pre-verbal (LL-1), first words (LL-2), and word combinations (LL-3). This indicates that diarization features carry a language-level signal.
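
As a rough illustration of how such a measure could be derived from diarization output, the sketch below counts child segments per minute and checks that the rate increases across levels. The definition of utterance rate used here (child segments per minute of session time), the `(speaker, start, end)` segment format, and the toy sessions are all assumptions made for illustration. They are not the paper's data or its exact procedure.

```python
def child_utterance_rate(segments, session_seconds):
    """Child utterances per minute, given (speaker, start, end) segments."""
    if session_seconds <= 0:
        raise ValueError("session_seconds must be positive")
    num_child = sum(1 for speaker, _, _ in segments if speaker == "child")
    return 60.0 * num_child / session_seconds


def is_strictly_increasing(values):
    """True if each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Toy diarization output for one 60-second session per language level.
toy_sessions = {
    "LL-1": [("child", 2.0, 2.8), ("adult", 3.0, 6.0)],
    "LL-2": [("child", 1.0, 1.5), ("adult", 2.0, 4.0), ("child", 5.0, 5.9)],
    "LL-3": [("child", 0.5, 2.0), ("child", 3.0, 4.5),
             ("adult", 5.0, 6.0), ("child", 7.0, 9.0)],
}

levels = ("LL-1", "LL-2", "LL-3")
rates = [child_utterance_rate(toy_sessions[lv], session_seconds=60.0)
         for lv in levels]
print(rates)                          # [1.0, 2.0, 3.0]
print(is_strictly_increasing(rates))  # True
```

With real data one would average the rate over all sessions at each level before comparing levels.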

Application II · PERCEPT-R

Tracking Rhotic Production with Age

The Whisper-Large rhotic / derhotic classifier shows a positive correlation (r = 0.576, p < 0.01) between the predicted probability of correct /r/ and chronological age in typically developing children, indicating that late-childhood articulation development can be captured automatically.
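
The statistic reported above is a Pearson correlation coefficient. The sketch below shows how it is computed from paired values. The ages and probabilities are made-up toy numbers, not PERCEPT-R data, and the function name is hypothetical. The significance test is omitted because the sample size is not stated here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values: chronological age (years) and predicted P(correct /r/).
ages = [4, 5, 6, 7, 8, 9]
prob_correct_r = [0.30, 0.25, 0.55, 0.50, 0.80, 0.75]

r = pearson_r(ages, prob_correct_r)
print(round(r, 2))
```

A positive r means that the classifier's confidence in a correct /r/ tends to rise with age, which is the pattern described above.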

Ethics & License

All models released in this work are derived from publicly available sources, and ChildVox adheres to the scope of each dataset's license. Models will be released under the Responsible AI License (RAIL). Users are expected to respect the privacy of data subjects and comply with applicable laws in their jurisdictions.

Use of the models for clinical or diagnostic applications, surveillance, privacy-invasive applications, or commercial purposes is strictly prohibited. We encourage use of these models for research aimed at developing robust speech processing technology for and about children.

Citation

@inproceedings{feng_childvox2026,
  title         = {ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood},
  author        = {Tiantian Feng and Anfeng Xu and Xuan Shi and Aditya Kommineni and Shakhrul Iman Siam and Megan Micheletti and Zhonghao Shi and Helen Tager-Flusberg and Mi Zhang and Lynn K. Perry and Catherine Lord and Daniel Messinger and Shrikanth Narayanan},
  eprint        = {2605.29257},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  year          = {2026},
  note          = {Under review}
}

Model Release

Models trained on the public data will be released soon.