AgeVoicE: Evaluating Voice AI for Aging in Place
Older Adults Are Using AI — But Is AI Ready for Them?
The need is undeniable: 55% of adults age 50+ already use voice- or text-based AI technologies, relying on them for health information, reminders, safety, and staying socially connected. Yet 64% of older adults say today’s technology is not designed with their age in mind, revealing a growing gap between user needs and system design. And with the tech-for-aging market projected to reach $120 billion by 2030, the urgency to build AI that truly supports aging in place has never been greater.
Every day, new AI-powered tools promise to make life easier, yet not all voices benefit equally. As the aging population grows and the cost of assisted living skyrockets, more older adults want to remain independent and age in place. Voice-enabled technologies could help them do exactly that. But age-related changes in memory, speech planning, and vocal physiology - like slower cognitive processing, thinner vocal cords, weaker speech muscles, and breathier or shakier speech - can make it harder for AI systems to understand aging voices.
There has been extensive research on gender and accent bias in speech recognition, yet age-related performance gaps have been largely overlooked. This gap between technological promise and real-world accessibility is exactly what motivates our work: ensuring that voice AI evolves to serve an aging population that is already using these tools and increasingly depending on them.
Our Data Science Approach
From a data science perspective, this project centers on understanding how a large-scale speech recognition model performs across different populations and conditions - and how data, training choices, and model architecture influence that performance. To evaluate Whisper, we conducted a deep analysis of the small, base, and large versions of the model, examining their internal layer behavior, their training data composition, and how each model handled speech from older adults. This involved investigating the pre-training corpus, analyzing tokenization patterns, and studying where recognition errors surfaced within the model pipeline.
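As one small example of this kind of inspection, the sketch below loads Whisper's tokenizer and compares how a fluent and a disfluent phrasing tokenize. It assumes the Hugging Face `transformers` port of Whisper (the same analysis could be done with the `openai-whisper` package), and the sample sentences are invented for illustration.

```python
# A minimal sketch of tokenization inspection, assuming the Hugging Face
# `transformers` port of Whisper. Sample sentences are illustrative.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")

# Compare how a fluent phrasing and a disfluent one tokenize: disfluencies
# fragment into many rare tokens, one place where recognition errors surface.
samples = [
    "I take my medication every morning.",
    "I, um, I take my m-medication every, uh, morning.",
]
for text in samples:
    ids = tokenizer(text, add_special_tokens=False).input_ids
    print(f"{len(ids):3d} tokens: {tokenizer.convert_ids_to_tokens(ids)}")
```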
Data
Because no dedicated dataset of aging voices exists, we used Mozilla's Common Voice 23, the most accessible open dataset that included speakers over 50. However, this dataset comes with important limitations. The voices in Common Voice are not fully representative of the 50+ population, which ranges from highly tech-savvy individuals to those with limited digital access or experience. Participants self-recorded their samples on the Mozilla website, which required them to create an account, read scripted prompts, and submit recordings. This introduces selection and performance bias: the speakers who participated tend to be more comfortable with technology, and they can re-record multiple times until they sound their best - unlike real-world voice assistant usage.
To broaden the coverage of aging-related vocal characteristics, we also incorporated speech samples from DementiaBank, a clinical dataset containing recordings of individuals with Alzheimer’s disease and other cognitive impairments. These recordings capture speech patterns that are more representative of the real variability seen in older adults, including hesitations, reduced articulation clarity, and irregular pauses. Importantly, DementiaBank allowed us to supplement Common Voice with more naturalistic, conversational speech, rather than only short scripted prompts.
Measuring Whisper’s Stability: Bootstrapped Confidence Intervals Across Conditions
To evaluate Whisper's robustness on aging voices, we randomly selected 300 audio clips from speakers 65+ in our Common Voice dataset and measured baseline performance using the Whisper base model. We then created four modified versions of each clip (a code sketch of these perturbations follows the list):
- Slowed (0.5× speed)
  - Time-stretched the speech to 50% of the original speed
  - Kept the pitch the same (did not make the voice lower)
  - Result: slower, more hesitant, and breathier-sounding speech - common traits in older speakers
- Pause
  - Inserted a ~2-second silence into each audio clip
  - Placed the pause at a natural break in the speech (a quiet gap between words/phrases)
  - Avoided the very beginning or end of the clip to keep it realistic
- Stutter
  - Identified a syllable onset (a strong sound at the start of a word)
  - Took a 150 ms chunk of audio from that onset
  - Repeated that chunk twice, with tiny 60 ms gaps of silence between repeats
  - Added a small crossfade to avoid clicks at the splice points
  - Avoided placing the stutter too close to the very start or end of the clip
- Pause + Stutter
  - First inserted a ~2-second pause at a natural break in the sentence
  - Immediately after that pause, added a stutter effect (150 ms chunk repeated twice with small gaps)
  - Combined both effects while keeping transitions smooth (crossfades at joins)
  - Ensured neither modification happened at the edges of the audio
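Below is a minimal sketch of how these four perturbations can be implemented with librosa and numpy. The file name, onset selection, and crossfade details are illustrative simplifications, not our exact pipeline.

```python
# A minimal sketch of the perturbations above. Durations, onset picking,
# and crossfades are simplified relative to our actual pipeline.
import numpy as np
import librosa

SR = 16_000  # Whisper's expected sample rate

def slow_down(y: np.ndarray, rate: float = 0.5) -> np.ndarray:
    """Time-stretch to `rate` x speed without changing pitch."""
    return librosa.effects.time_stretch(y, rate=rate)

def insert_pause(y: np.ndarray, pause_s: float = 2.0) -> np.ndarray:
    """Insert silence at the quietest frame, away from the clip edges."""
    rms = librosa.feature.rms(y=y)[0]
    hop = 512  # librosa's default hop length for rms
    lo, hi = len(rms) // 5, 4 * len(rms) // 5  # skip the edges
    idx = (lo + np.argmin(rms[lo:hi])) * hop
    silence = np.zeros(int(pause_s * SR), dtype=y.dtype)
    return np.concatenate([y[:idx], silence, y[idx:]])

def add_stutter(y: np.ndarray, chunk_ms: int = 150, gap_ms: int = 60,
                repeats: int = 2, fade_ms: int = 10) -> np.ndarray:
    """Repeat a chunk from a loud onset, with gaps and short crossfades."""
    onsets = librosa.onset.onset_detect(y=y, sr=SR, units="samples")
    onsets = [o for o in onsets if SR // 2 < o < len(y) - SR // 2]
    if not onsets:
        return y
    start = onsets[0]
    chunk = y[start:start + int(chunk_ms / 1000 * SR)].copy()
    gap = np.zeros(int(gap_ms / 1000 * SR), dtype=y.dtype)
    fade = int(fade_ms / 1000 * SR)
    ramp = np.linspace(0.0, 1.0, fade)
    chunk[:fade] *= ramp            # fade in
    chunk[-fade:] *= ramp[::-1]     # fade out (avoids clicks at splices)
    stutter = np.concatenate([chunk, gap] * repeats)
    return np.concatenate([y[:start], stutter, y[start:]])

y, _ = librosa.load("clip.wav", sr=SR)  # illustrative file name
variants = {
    "slowed": slow_down(y),
    "pause": insert_pause(y),
    "stutter": add_stutter(y),
    "pause_stutter": add_stutter(insert_pause(y)),
}
```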
Evaluation Methodology
We evaluated our AgeVoicE model against the OpenAI Whisper-Large-v3 baseline using Word Error Rate (WER) as the primary metric. Word Error Rate is the percentage of words the model gets wrong - counting substitutions, deletions, and insertions - and is the industry standard for evaluating automatic speech recognition. Our test set consisted of both DementiaBank samples and CommonVoice samples. To ensure a fair comparison, we deduplicated oversampled data before evaluation and applied consistent text normalization to both predictions and ground-truth transcripts.
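Concretely, WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted words and N is the number of words in the reference. The sketch below shows per-clip WER with shared normalization, plus the percentile bootstrap behind the confidence intervals mentioned earlier; it assumes the `jiwer` package for WER, and the normalization rules shown are illustrative rather than our exact ones.

```python
# A minimal sketch: per-clip WER with shared normalization, and a
# percentile bootstrap for the confidence interval on mean WER.
# `jiwer` supplies the WER computation; normalization is illustrative.
import re
import numpy as np
import jiwer

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())

def clip_wers(references, hypotheses):
    """WER per clip, normalizing both sides identically."""
    return [jiwer.wer(normalize(r), normalize(h))
            for r, h in zip(references, hypotheses)]

def bootstrap_ci(wers, n_boot=10_000, alpha=0.05, seed=0):
    """Mean WER with a percentile-bootstrap (1 - alpha) confidence interval."""
    rng = np.random.default_rng(seed)
    wers = np.asarray(wers)
    means = np.array([rng.choice(wers, size=wers.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return wers.mean(), (lo, hi)

# Usage: run once per condition (baseline, slowed, pause, stutter, both)
# and compare whether the resulting intervals overlap.
# mean_wer, (lo, hi) = bootstrap_ci(clip_wers(refs, hyps))
```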
Results
Our fine-tuned AgeVoicE model achieved a 16.8 percentage point improvement in overall test WER, reducing errors from 33.3% to 16.5% - cutting Whisper's error rate in half. This level of improvement meaningfully shifts the user experience - what was previously an unreliable system that misheard one out of every three words now approaches practical usability for real-world voice assistant interactions.
| Dataset | Whisper | AgeVoicE | Δ Improvement (pts) |
|---|---|---|---|
| DementiaBank | 52.8% | 50.7% | +2.1 |
| CommonVoice 60+ | 21.5% | 9.9% | +11.6 |
| Overall Test WER | 33.3% | 16.5% | +16.8 |
Across individual datasets, we observed consistent gains. On CommonVoice 60+, which contains older-adult speech, AgeVoicE reduced WER from 21.5% to 9.9% (+11.6 points). On DementiaBank, one of the most challenging benchmarks due to disfluencies, slowed retrieval, and cognitive-linguistic impairments, our model still achieved a measurable improvement (+2.1 points). Even small gains on DementiaBank are notable given the severity of speech degradation in these samples, demonstrating that our approach generalizes beyond healthy aging voices.
To evaluate whether our model specifically benefits older speakers, we conducted an age-stratified analysis on CommonVoice. The results show strong improvements for both age groups:
- Speakers 59 and under improved from 31.3% → 12.3% (+19.0 points)
- Speakers 60 and older improved from 21.5% → 9.9% (+11.6 points)
This finding is important for two reasons:
- Our model does not overfit to older speakers. Even though the project is motivated by aging-voice accessibility, the model actually improves transcription accuracy for younger speakers as well.
- AgeVoicE is not just an “aging voice model,” but a stronger ASR model overall. The improvements generalize across age groups, suggesting that techniques used to make ASR more inclusive for older adults can simultaneously enhance performance for everyone.
| Age Group | Whisper | AgeVoicE | Δ Improvement (pts) |
|---|---|---|---|
| ≤59 | 31.3% | 12.3% | +19.0 |
| 60+ | 21.5% | 9.9% | +11.6 |
Together, these results show that AgeVoicE meaningfully reduces errors across diverse speech conditions and ages. Rather than trading off performance between user groups, our fine-tuning approach produces a model that is both more accurate and more equitable, helping advance voice technology that works reliably for people of all ages.
Key Learnings & Impact
- No Trade-off: Improvements Generalize Across All Age Groups
  - A common concern with demographic-targeted fine-tuning is that improving performance for one group may degrade performance for others. Our results dispel this notion, as all groups show substantial improvement. This demonstrates that training on diverse, age-inclusive data with careful attention to underrepresented speech patterns creates a universally better model - not one that trades off performance between groups. Inclusive design benefits everyone.
- “Noise” Can Be a Signal: Preserve Domain-Specific Markers
  - Standard ASR preprocessing strips filler words as noise. But in clinical speech, these disfluencies appear more frequently and carry diagnostic value. By adding special tokens instead of removing them, we preserved information that distinguishes our target domain (see the sketch after this list). Domain expertise should guide preprocessing decisions - what looks like noise in one context may be a signal in another.
- Fine-Tuning Requires Balancing Domain Adaptation with Knowledge Preservation
  - Fine-tuning large pretrained models presents a fundamental tension: the model must acquire domain-specific capabilities while retaining the generalizable representations learned from large-scale pretraining. Excessive adaptation leads to catastrophic forgetting, where the model loses robust performance on general speech; insufficient adaptation yields no measurable improvement on the target domain. One way to strike this balance is sketched below.
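The sketch below combines the two ideas from learnings 2 and 3. It assumes the Hugging Face `transformers` port of Whisper; the specific marker tokens and the choice to freeze the full encoder are illustrative choices, not our exact recipe.

```python
# A minimal sketch combining disfluency marker tokens with a freeze-the-
# encoder recipe, assuming the Hugging Face `transformers` port of Whisper.
# Marker tokens and frozen layers are illustrative, not our exact setup.
import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")

# Learning 2: keep disfluencies as signal. Marker tokens let fillers and
# pauses survive preprocessing instead of being stripped as noise.
tokenizer.add_tokens(["[um]", "[uh]", "[pause]", "[repeat]"],
                     special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Learning 3: balance adaptation against forgetting. Freezing the encoder
# preserves the acoustic representations learned in pretraining, while the
# decoder adapts to the new domain at a small learning rate.
for param in model.model.encoder.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```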
Acknowledgements
We would like to express our deepest gratitude to our professors, Korin Reid and Fred Nugen, whose guidance, feedback, and encouragement were essential to the success of this project. We also extend our sincere thanks to the DementiaBank participants and research collaborators. Finally, to our team - thank you for believing in the vision of this project and for the many hours, late nights, and weekends you dedicated to bringing it to life.
Resources
[1] Institute for Healthcare Policy and Innovation (2024). How Older Adults Use and Think About AI. National Poll on Healthy Aging, University of Michigan.
[2] Beach & Zhang (2018). Factors Affecting Seniors' Perceptions of Voice-enabled User Interfaces.
[3] Brewer, R. (2019). How Older Adults Use and Think About AI. AARP Research / University of Michigan.
[4] Benito-Gorron, Lozano-Diez, Toledano, Gonzalez-Rodriguez (2019). Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset. EURASIP Journal on Audio, Speech, and Music Processing.
