Preprint 2026 · Under Review

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Tiantian Feng1 Anfeng Xu1 Xuan Shi1 Aditya Kommineni1 Shakhrul Iman Siam2
Megan Micheletti3 Zhonghao Shi4 Helen Tager-Flusberg5 Mi Zhang2
Lynn K. Perry6 Catherine Lord3 Daniel Messinger6 Shrikanth Narayanan1
1University of Southern California    2The Ohio State University
3University of California, Los Angeles    4Harvard University   
5Boston University 6University of Miami
tiantiaf@usc.edu
15+ child-centered audio datasets
20+ evaluation sub-tasks
5+ foundation models across audio, speech, and LALMs
Multiple developmental stages, from birth to school age
Physiological Sounds
Birth (and across Life)

Cardiac and respiratory sounds carry clinical signals before any spoken word or vocalization.

Vocalizations
Infancy

Cries, laughter, and whimpers: non-linguistic events with rich affective meaning.

Canonical Syllables
Toddler

Babbling, proto-speech, and emerging spoken-language skills.

Speech
School-age

Pronunciation, fluency, prosody, intelligibility, and conversational ASR.

ChildVox is a unified benchmark covering the full developmental trajectory through which children "express" themselves, ranging from physiological sounds at birth, through infant vocalizations and toddler babble, to school-age speech.

Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across more than 15 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison.

We evaluate a representative range of audio and speech foundation models, including self-supervised (e.g., SSAST), ASR-oriented (e.g., Whisper), and large audio-language models (e.g., Qwen2-Audio), on tasks ranging from physiological sound classification, through vocalization and canonical-syllable modeling, to speech quality assessment and recognition. Benchmark results show that ChildVox yields a suite of high-performing models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Key Findings

Audio-SSL models capture what speech models miss.
SSAST and WavLM lead on cardiac murmur, crackle, wheeze, AudioSet-Child, ReCANVo, and SpeechMaturity. Generic audio pre-training preserves the fine-grained acoustic cues that speech-only pre-training may not capture.
Whisper-Large dominates speech-level tasks.
Whisper-Large is best on all three SpeechOcean762 dimensions (prosody, accuracy, fluency), best on NLS intelligibility, and best on diarization and ASR, reaching 14.80 WER (Word Error Rate) on MyST and 17.7 DER (Diarization Error Rate) on NLS.
Among LALMs, Qwen2-Audio achieves competitive performance.
Qwen2-Audio matches the strongest encoders on AudioSet and ReCANVo, while AudioFlamingo 3 follows instructions inconsistently and hallucinates transcripts.
Specialized fine-tuned models outperform frontier proprietary models.
Gemini 2.5 Flash and Gemini 3.5 Flash, prompted zero-shot, fall below 0.35 Macro-F1 on CirCor, SPRSound, and ReCANVo. ChildVox-trained models lead on every comparable public task.

Benchmark Structure

1
Category I · Birth (and across Life)

Physiological Sound Classification

Three pediatric cardiac and respiratory corpora supporting four classification tasks: murmur detection (CirCor, 5,272 cardiac auscultation recordings), crackle and wheeze detection (ICBHI), and respiratory condition / sound pattern classification (SPRSound). These signals carry clinical information that arrives before any spoken word.

2
Category II · Infancy

Vocalization Event Classification

Ten-way generic child sound classification on AudioSet (speech, babble, giggle, shout, laughter, cry, whimper, sing, play, and music for children), plus cry-cause classification on Donate-a-Cry (hunger vs. other) and CryBank (hunger, loneliness, discomfort). Most of these events are non-linguistic but carry rich affective meaning.

3
Category III · Toddler

Canonical Syllables Classification

Affective status classification on ReCANVo (delighted, dysregulated, frustrated, request, self-talk, social) for minimally speaking individuals, and vocal-development classification on BabbleCor and SpeechMaturity (cry, laugh, canonical, non-canonical, junk). Tasks here trace the emergence of spoken language ability.

4
Category IV · School-age

Speech Quality, Diarization & ASR

Speech-quality assessment on PERCEPT-R (rhotic vs. derhotic), SpeechOcean762 (prosody, fluency, accuracy), and UltraSuite (place of articulation); speech emotion recognition on C-BESD; child-adult speaker diarization on the in-house NLS and ADOS2-Mod3 datasets; word-level ASR on MyST; and phoneme-level recognition on TinyVox.

Datasets

ChildVox integrates more than 15 child-centered audio and speech datasets into one computational benchmark. The benchmark includes publicly available pediatric heart, lung, and vocalization corpora alongside two in-house naturalistic interaction datasets (NLS and ADOS2-Mod3).

Dataset | Category | Evaluation Task | Labels
CirCor | Physiological | Murmur detection | Absent · Present · Unknown
ICBHI | Physiological | Crackle / Wheeze / Condition | Healthy · COPD · Other / Crackles / Wheezes
SPRSound | Physiological | Respiratory sound | Normal · CAS · DAS · CAS+DAS · Poor Q.
Donate-a-Cry | Vocalization | Cry classification | Hunger · Other
CryBank | Vocalization | Cry cause | Hunger · Loneliness · Discomfort
AudioSet-Child | Vocalization | 10-way child sound | Speech · Babble · Giggle · Cry · Music · …
ReCANVo | Canonical | Affective status | Delighted · Dysregulated · Frustrated · Request · Self-talk · Social
BabbleCor | Canonical | Vocal development | Cry · Laugh · Canonical · Non-Canonical · Junk
SpeechMaturity | Canonical | Vocal development | Cry · Laugh · Canonical · Non-Canonical · Junk
C-BESD | Speech | Emotion recognition | Anger · Happy · Neutral · Sad
PERCEPT-R | Speech | Rhotic production | Rhotic · Derhotic
SpeechOcean762 | Speech | Prosody · Fluency · Accuracy | Poor · Nearly correct · Correct (and variants)
UltraSuite | Speech | Articulator place | Bilabial · Dental · Velar · Palatal · …
NLS (in-house) | Speech | Diarization · Intelligibility | Speaker · Intelligible · Vocalization
ADOS2-Mod3 (in-house) | Speech | Diarization · ASR | Speaker labels · Transcript
MyST | Speech | Word-level ASR | Speech transcript
TinyVox | Speech | Phoneme recognition | Phoneme transcript

ChildVox-Balanced. For fine-tuning LALMs, we curate a label-balanced subset drawn only from publicly accessible resources, with at most 2,000 / 50 train/test samples per label and 10,000 / 500 for ASR. This yields 64,641 audio recordings across 14 subtasks.
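The per-label cap described above can be illustrated with a minimal sketch. It is not the authors' curation code: the function name `balanced_subset`, the `(audio_id, label)` pair format, and the random selection with a fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_subset(items, max_per_label, seed=0):
    """Keep at most `max_per_label` randomly chosen items per label.

    `items` is a list of (audio_id, label) pairs (a hypothetical format).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for audio_id, label in items:
        by_label[label].append(audio_id)

    subset = []
    for label in sorted(by_label):  # sorted for a reproducible order
        ids = by_label[label]
        rng.shuffle(ids)
        subset.extend((audio_id, label) for audio_id in ids[:max_per_label])
    return subset


# Toy pool: 5 "cry" clips and 2 "laugh" clips, capped at 3 per label.
pool = [(f"cry_{i}", "cry") for i in range(5)]
pool += [(f"laugh_{i}", "laugh") for i in range(2)]
train = balanced_subset(pool, max_per_label=3)
```

Under the caps stated above, the same call would use `max_per_label=2000` for training splits and `max_per_label=50` for test splits; labels with fewer samples than the cap are kept in full.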

Foundation Models Evaluated

ChildVox evaluates three families of audio and speech foundation models, ranging in size from ~20M to ~8B parameters. Encoder-based models are adapted via a learnable weighted average over hidden layers, a point-wise 1D convolution, and LoRA fine-tuning (rank 64). LALMs are fine-tuned with LoRA on query, key, value, and feed-forward projections.

Model | Family | Pre-training Objective | Params (used)
SSAST-Base | SSL | Self-supervised audio | ~89M
voc2vec-HuBERT | SSL | Non-verbal vocalization SSL | ~89M
WavLM-Large | SSL | Speech SSL | ~316M
Whisper-Base | ASR | VAD · ASR · LID | ~20M (Encoder)
Whisper-Small | ASR | VAD · ASR · LID | ~88M (Encoder)
Whisper-Large v3 | ASR | VAD · ASR · LID | ~635M (Encoder)
Qwen2-Audio-Instruct | LALM | Multi-stage uni- + multi-modal | ~7B
AudioFlamingo 3 | LALM | Multi-stage uni- + multi-modal | ~8B
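The encoder adaptation described above (a learnable weighted average over hidden layers followed by a point-wise 1D convolution) can be sketched as a forward pass in plain Python. This is an illustration only: the softmax normalization of the layer weights, the function names, and the toy shapes are assumptions, and a real implementation would use a deep learning framework with trainable parameters.

```python
import math


def softmax(logits):
    """Turn unconstrained layer logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_average(hidden_states, layer_logits):
    """Combine per-layer features with one scalar weight per layer.

    hidden_states[l][t][d] is the feature d of frame t at layer l.
    layer_logits holds one (learnable, here fixed) scalar per layer.
    """
    weights = softmax(layer_logits)
    num_frames = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, hidden_states))
            for d in range(dim)
        ]
        for t in range(num_frames)
    ]


def pointwise_conv1d(frames, kernel, bias):
    """Kernel-size-1 convolution: the same linear map applied to every frame.

    kernel[o][d] maps input feature d to output channel o.
    """
    return [
        [sum(k * x for k, x in zip(row, frame)) + b for row, b in zip(kernel, bias)]
        for frame in frames
    ]


# Toy input: 2 layers, 2 frames, 2 features per frame.
hidden_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, 3.0]],
]
layer_logits = [0.0, 0.0]  # equal logits give uniform layer weights
pooled = weighted_layer_average(hidden_states, layer_logits)
projected = pointwise_conv1d(pooled, kernel=[[1.0, 1.0]], bias=[0.0])
```

With equal logits the result is the plain mean over layers; during training the logits would shift weight toward the layers most useful for a given task.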

Results

Performance varies substantially across the four categories of child sounds. Audio-SSL models lead on physiological sounds and non-linguistic vocalizations; Whisper models perform best on pronunciation, fluency, and ASR. Different pre-trained encoders capture different aspects of child-centered acoustic signals.

 Macro-F1 across categories (encoder-based models)

Task Dataset SSAST voc2vec-HuBERT WavLM-Large Whisper-Base Whisper-Small Whisper-Large

Diarization & ASR (lower is better)

Model | DER ↓ (NLS) | DER ↓ (ADOS) | WER ↓ (MyST) | WER ↓ (ADOS)
WavLM-Large | 22.10 | 45.60 | 16.56 | 59.34
Whisper-Base | 24.20 | 48.20 | 16.99 | 55.85
Whisper-Small | 18.70 | 44.00 | 15.93 | 59.91
Whisper-Large | 17.70 | 42.50 | 14.80 | 40.20
Parakeet-TDT | - | - | 15.82 | 45.60
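For readers unfamiliar with the two metrics in this table, the sketch below shows how WER and a simplified DER can be computed. It is not the benchmark's scoring code: text normalization is reduced to whitespace splitting, the DER variant is frame-level with fixed speaker labels, no collar, and no overlapping speech, and both functions return percentages on the assumption that the table uses that scale.

```python
def word_error_rate(reference, hypothesis):
    """WER (%) = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def frame_der(reference, hypothesis):
    """Simplified frame-level DER (%).

    Both inputs are equal-length lists of per-frame labels such as
    "child", "adult", or None for non-speech. Missed speech, false
    alarms, and speaker confusions each count as one error frame.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    speech_frames = sum(1 for r in reference if r is not None)
    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")
    errors = sum(1 for r, h in zip(reference, hypothesis) if r != h)
    return 100.0 * errors / speech_frames


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# One speaker confusion and one missed frame over four speech frames.
der = frame_der(["child", "child", None, "adult", "adult"],
                ["child", "adult", None, "adult", None])
```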

ChildVox vs. Proprietary Zero-Shot Baselines

Macro-F1 on the ChildVox-Balanced test set. General-purpose LALMs struggle with fine-grained pediatric vocal and physiological audio; ChildVox-trained models lead on every task tested.
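Macro-F1, the metric used for these comparisons, averages per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The sketch below is a minimal standard-definition implementation, not the benchmark's evaluation script; the label strings are borrowed from the CirCor row of the dataset table and the predictions are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up predictions (not benchmark results).
y_true = ["Absent", "Absent", "Present", "Unknown"]
y_pred = ["Absent", "Present", "Present", "Absent"]
score = macro_f1(y_true, y_pred)
```

In this toy case the class that is never predicted ("Unknown") pulls the average down sharply, which is why Macro-F1 is informative on imbalanced clinical labels.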

Downstream Applications

ChildVox models produce measurements that align with expert developmental judgments, not just leaderboard numbers. We highlight two exemplar applications.

Application I · NLS

Characterizing Language Level

Utterance rate from the ChildVox diarization model rises monotonically across expert-assigned spoken-language levels: pre-verbal (LL-1), first words (LL-2), and word combinations (LL-3). This indicates that diarization-derived features carry a language-level signal.
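As an illustration of how such a feature could be derived, the sketch below counts child segments per minute from diarization output and checks whether group means increase strictly with language level. The exact definition of utterance rate is not given here, so the per-minute count, the `(speaker, start, end)` segment format, and every number in the example are assumptions for illustration, not results.

```python
def utterance_rate(segments, session_seconds, speaker="child"):
    """Number of segments from `speaker` per minute of recording.

    `segments` is a list of (speaker, start_sec, end_sec) tuples,
    a hypothetical diarization output format.
    """
    if session_seconds <= 0:
        raise ValueError("session_seconds must be positive")
    count = sum(1 for spk, _start, _end in segments if spk == speaker)
    return 60.0 * count / session_seconds


def is_strictly_increasing(values):
    """True if each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Toy session: two child segments in a two-minute recording.
segments = [("child", 0.0, 1.2), ("adult", 1.5, 3.0), ("child", 4.0, 5.0)]
rate = utterance_rate(segments, session_seconds=120.0)

# Made-up mean rates for LL-1, LL-2, LL-3, used only to show the check.
toy_means = [1.0, 2.5, 4.0]
monotonic = is_strictly_increasing(toy_means)
```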

Application II · PERCEPT-R

Tracking Rhotic Production with Age

The Whisper-Large rhotic / derhotic classifier shows a positive correlation (r = 0.576, p < 0.01) between the predicted probability of correct /r/ and chronological age in typically developing children, suggesting that late-childhood articulation development can be captured automatically.
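The correlation reported above is a Pearson coefficient between age and the classifier's predicted probability. A minimal sketch of that computation follows; the ages and probabilities are invented toy values, not data from PERCEPT-R, and the significance test is omitted.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values only: age in years vs. predicted probability of correct /r/.
ages = [5, 6, 7, 8]
prob_correct_r = [0.2, 0.5, 0.4, 0.9]
r = pearson_r(ages, prob_correct_r)  # roughly 0.88 for these toy values
```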

Ethics & License

All models released in this work are derived from publicly available sources, and ChildVox adheres to the scope of each dataset's license. Models will be released under the Responsible AI License (RAIL). Users are expected to respect the privacy of data subjects and comply with applicable laws in their jurisdictions.

Use of the models for clinical or diagnostic applications, surveillance, privacy-invasive applications, or commercial purposes is strictly prohibited. We encourage use of these models for research aimed at developing robust speech processing technology for and about children.

Citation

@inproceedings{feng_childvox2026,
  title         = {ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood},
  author        = {Tiantian Feng and Anfeng Xu and Xuan Shi and Aditya Kommineni and Shakhrul Iman Siam and Megan Micheletti and Zhonghao Shi and Helen Tager-Flusberg and Mi Zhang and Lynn K. Perry and Catherine Lord and Daniel Messinger and Shrikanth Narayanan},
  eprint        = {2605.29257},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  year          = {2026},
  note          = {Under review}
}

Model Release

The models trained with the public data will be released soon.