FairVoice: An Equitable Audio Deepfake Detector
Deepfake Detection That’s Both Accurate and Fair
Problem and Motivation
As synthetic audio technologies like text-to-speech (TTS) and voice cloning become increasingly accessible and realistic, audio deepfakes are rapidly being used in scams, misinformation, and impersonation attacks. Yet most detection systems today focus solely on improving overall accuracy—often overlooking performance disparities across accents, regions, and speaker identities.
Recent studies such as FairSSD¹ and Unveiling the Achilles’ Heel of Multilingual Deepfake Detection² have found that many state-of-the-art detectors unfairly flag certain voices—especially those with non-Western accents or less represented speaker profiles—while favoring others. This leads to higher false positive or false negative rates for certain speakers, creating blind spots that threaten both fairness and security.
Our project, FairVoice, is an equitable audio deepfake detector designed to address this critical gap. Our system aims to identify deepfake audio with high accuracy while promoting fairness—reducing the risk that any demographic group is disproportionately misclassified or left more vulnerable.
Data Sources and Data Science Approach
Curated Evaluation Dataset
We designed our evaluation dataset to enable fairness-aware assessment by including diverse speaker demographics across gender and region. Specifically, we aimed for a balanced distribution of male and female speakers across nine global regions:
- Regions included: British Isles, East Asia, Middle East, North America, South Asia, Southeast Asia, Africa, Western Europe, and Central/South America
- Gender balance: Male and female speakers were equally represented within each region (50 samples per gender when possible)
- Accent diversity: Within each region, we selected speech samples with varied accents to reduce linguistic homogeneity
- Speaker integrity: We ensured that speaker identities did not overlap across training, validation, and evaluation sets
We sourced this data from a combination of public datasets—including Mozilla Common Voice, the ASR Fairness dataset, and the Speech Accent Archive—to prevent overfitting to the style of any one corpus.
While we strove for balance across all regions, some datasets had limited representation. For example, the Middle East region had only 46 usable samples per gender. Even in these cases, we maintained proportional representation and data diversity to the greatest extent possible.
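As a concrete illustration of this balancing step, the sketch below draws up to 50 clips per gender within each region from a metadata table. The pandas-based workflow and the column names (region, gender) are assumptions made for the example, not a description of our exact pipeline.

```python
import pandas as pd

# Illustrative sketch: build a gender-balanced evaluation set with up to
# 50 clips per (region, gender) cell. Column names are assumed for the
# example; regions with fewer samples (e.g., Middle East) keep what they have.
TARGET_PER_CELL = 50

def sample_eval_set(metadata: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    cells = []
    for (region, gender), group in metadata.groupby(["region", "gender"]):
        n = min(TARGET_PER_CELL, len(group))
        cells.append(group.sample(n=n, random_state=seed))
    return pd.concat(cells, ignore_index=True)

# Usage (hypothetical file name):
# eval_df = sample_eval_set(pd.read_csv("eval_metadata.csv"))
```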
Training Dataset and Class Balance
To train our models effectively, we curated a high-quality dataset composed of both real and synthetic audio.
- Speaker separation: Unique speaker identities were used in training vs. validation to prevent data leakage
- Real audio was collected from:
- Mozilla Common Voice
- ASR Fairness dataset
- Fake-or-Real dataset (reals only)
- In-the-Wild dataset (reals only)
- Speech Accent Archive
- Singapore National Speech Corpus
- Deepfake audio was generated using the ElevenLabs TTS API and matched proportionally to the real samples in each demographic group
We maintained balance not just by gender and region, but also by accent whenever possible. This meant stratified sampling from all unique speakers and accents across sources, followed by proportional spoof sampling to match the demographic distributions found in our real data. The only partial exception was the Middle East, where real data was limited; in that case, spoof counts were adjusted to reflect available real samples.
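The proportional spoof sampling can be sketched the same way: for each (region, gender) cell, draw as many spoofed clips as there are real clips in that cell. As above, the DataFrame layout is an assumption made for illustration.

```python
import pandas as pd

# Illustrative sketch: match the spoofed-audio counts to the real-audio
# demographic distribution, cell by cell. Column names are assumed.
def match_spoofs_to_reals(real_df: pd.DataFrame,
                          spoof_pool: pd.DataFrame,
                          seed: int = 42) -> pd.DataFrame:
    real_counts = real_df.groupby(["region", "gender"]).size()
    matched = []
    for (region, gender), n_real in real_counts.items():
        pool = spoof_pool[(spoof_pool["region"] == region) &
                          (spoof_pool["gender"] == gender)]
        n = min(n_real, len(pool))  # fall back gracefully if a cell's pool is small
        matched.append(pool.sample(n=n, random_state=seed))
    return pd.concat(matched, ignore_index=True)
```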
Addressing Key Dataset Limitations
A key insight from The Data Addition Dilemma³ paper is that blindly adding more data—especially from dissimilar sources—can introduce instability or performance drops due to distribution shift. In our context, this means simply increasing demographic variety without careful consideration could actually worsen bias or hurt model generalization.
To address this, we were intentional about:
- Ensuring data consistency across source domains
- Avoiding overlap with datasets used to pre-train the baseline models (e.g., ASVspoof, VCTK); a sketch of this kind of overlap check follows this list
- Incorporating fairness-aware training techniques to improve model robustness without sacrificing generalizability
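For the overlap checks above, a simple set-based audit over speaker identities is enough in principle. The sketch below is illustrative rather than our exact tooling; it assumes speaker IDs are available as Python sets for each split and for the pre-training corpora.

```python
# Illustrative sketch: assert that no speaker appears in more than one of
# our splits, and that none of our speakers come from corpora used to
# pre-train the baseline models.
def check_speaker_integrity(splits: dict[str, set[str]],
                            pretrain_speakers: set[str]) -> None:
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            leaked = splits[a] & splits[b]
            assert not leaked, f"Speaker leakage between {a} and {b}: {leaked}"
    for name, speakers in splits.items():
        reused = speakers & pretrain_speakers
        assert not reused, f"{name} reuses pre-training speakers: {reused}"

# Usage (hypothetical ID sets):
# check_speaker_integrity(
#     {"train": train_ids, "val": val_ids, "eval": eval_ids},
#     pretrain_speakers=asvspoof_ids | vctk_ids,
# )
```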
Modeling Approach
To build an equitable and high-performing audio deepfake detector, we began with three state-of-the-art pre-trained models and fine-tuned them using our curated training dataset:
- AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks)⁴: Fuses spectral and temporal features through stacked graph attention (GAT) layers.
- RawNet2⁵: A sinc-convolution front end followed by residual blocks with feature map scaling (a lightweight attention mechanism) and a GRU layer, operating directly on raw waveforms.
- Res-TSSDNet (Time-Domain Synthetic Speech Detection Network)⁶: A compact time-domain CNN built from stacked convolutional layers with ResNet-style residual connections.
All three models are end-to-end architectures, meaning they learn directly from raw or minimally processed audio without relying on hand-crafted feature engineering. While their internal structures differ—from graph-based attention mechanisms to residual CNNs and gated recurrent units (GRUs)—they share a common goal: learning the subtle temporal and spectral cues that distinguish real speech from synthetic audio.
These models have previously achieved strong performance on benchmark datasets such as ASVspoof, evaluated primarily using Equal Error Rate (EER). However, in real-world applications, especially those involving diverse populations, performance metrics like EER alone are not sufficient. A model can achieve low EER overall while still disproportionately misclassifying voices from certain demographic groups.
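For reference, EER is the operating point at which the false positive rate (FPR) equals the false negative rate (FNR). The sketch below shows one minimal way to compute it from raw detection scores; it assumes binary labels with 1 for spoofed audio and higher scores meaning "more likely spoofed."

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal Error Rate: the threshold where FPR and FNR cross.

    Assumes labels of 1 for spoofed audio and 0 for bona fide speech,
    with higher scores indicating a more likely spoof.
    """
    fprs, fnrs = [], []
    for t in np.unique(scores):
        preds = scores >= t
        fprs.append(np.mean(preds[labels == 0]))   # bona fide flagged as spoof
        fnrs.append(np.mean(~preds[labels == 1]))  # spoof accepted as bona fide
    fprs, fnrs = np.array(fprs), np.array(fnrs)
    idx = np.argmin(np.abs(fprs - fnrs))           # closest crossing point
    return float((fprs[idx] + fnrs[idx]) / 2)
```

An EER computed this way is a single global number: it says nothing about how the underlying errors are distributed across accents, regions, or genders.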
To address this, our goal was to fine-tune these models on our balanced and diverse training dataset, adjusting their internal weights to improve generalization across accents and gender identities.
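At a high level, that fine-tuning step is a standard supervised training loop over the curated data. The sketch below is a generic PyTorch version, assuming a pre-trained model exposed as an nn.Module with a two-class output head and a DataLoader that yields (waveform, label) batches; it is not the exact recipe or hyperparameters we used.

```python
import torch
import torch.nn as nn

# Generic fine-tuning sketch (illustrative; model, loader, and
# hyperparameters are assumptions, not our exact configuration).
def fine_tune(model: nn.Module, train_loader, epochs: int = 5,
              lr: float = 1e-4, device: str = "cuda") -> nn.Module:
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for waveforms, labels in train_loader:
            waveforms, labels = waveforms.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(waveforms), labels)  # all weights are updated
            loss.backward()
            optimizer.step()
    return model
```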
Evaluation
In designing our system, we prioritized fairness alongside accuracy. While there are multiple frameworks for defining fairness in machine learning—such as equality of outcomes, equality of opportunity, and equalized odds—our project focuses specifically on equalized odds. In practice, we operationalize this as equal false positive rates (FPRs) across demographic groups such as accent and gender.
Our goal is for every speaker—regardless of identity or background—to have an equal chance of being correctly classified as authentic. No group should be disproportionately flagged as a deepfake simply because the model underperforms on their speech patterns.
To assess this, we used Mean Absolute Deviation (MAD) of FPRs as our fairness metric. The MAD score measures how much each group’s FPR deviates from the overall average:
- Lower MAD values reflect more consistent and equitable performance across groups.
- Higher MAD values indicate potential bias or uneven detection.
We chose MAD because it goes beyond accuracy, offering a more granular and responsible lens for evaluating fairness—especially in contexts where the stakes of misclassification are high.
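Concretely, this means computing the FPR for each demographic group at a fixed decision threshold and then taking the mean absolute deviation of those per-group rates. A minimal sketch is shown below; it assumes arrays of detection scores, binary labels (1 for spoof, 0 for bona fide), and one group label per clip, and it treats the mean of the per-group FPRs as the overall average.

```python
import numpy as np

def fpr_mad(scores: np.ndarray, labels: np.ndarray,
            groups: np.ndarray, threshold: float) -> float:
    """Mean Absolute Deviation of per-group false positive rates."""
    group_fprs = []
    for g in np.unique(groups):
        bona_fide = (groups == g) & (labels == 0)      # real clips in group g
        if bona_fide.any():
            group_fprs.append(np.mean(scores[bona_fide] >= threshold))
    group_fprs = np.array(group_fprs)
    return float(np.mean(np.abs(group_fprs - group_fprs.mean())))
```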
Of the three, RawNet2 showed the strongest fairness improvement after fine-tuning, achieving the lowest MAD score across accent groups. This indicates RawNet2 was not only accurate, but also the most consistent in how it treated different types of voices—making it our selected model for deployment in the demo.
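For the demo, scoring a new clip with the fine-tuned model reduces to loading the audio and running a forward pass. The sketch below is a generic, hypothetical inference path: it assumes the model accepts a batch of raw 16 kHz waveforms and that class index 1 corresponds to "spoof."

```python
import torch
import torchaudio

# Illustrative inference sketch (not the exact demo code). Assumes the
# model takes a (batch, samples) tensor of 16 kHz audio and outputs
# two-class logits with index 1 meaning "spoof".
def score_clip(model: torch.nn.Module, wav_path: str,
               target_sr: int = 16000, device: str = "cpu") -> float:
    waveform, sr = torchaudio.load(wav_path)              # (channels, samples)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    waveform = waveform.mean(dim=0, keepdim=True)          # mix down to mono
    model = model.to(device).eval()
    with torch.no_grad():
        logits = model(waveform.to(device))
        prob_spoof = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_spoof
```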
Key Learnings & Impact
Through the development of FairVoice, we’ve learned that designing deepfake detection systems that are both accurate and fair requires more than just high-performing models—it requires careful, principled attention to who these models serve, how they are evaluated, and what data they’re trained on.
- Representation matters.
Our results reaffirmed that when certain voices are underrepresented or inconsistently modeled, detection accuracy can vary drastically across demographic groups. By curating a demographically balanced dataset and evaluating performance with fairness-specific metrics like Mean Absolute Deviation (MAD), we demonstrated that equitable performance is both measurable and achievable.
- Model performance isn’t just technical—it’s ethical.
A model that performs well overall but fails certain groups reinforces existing inequities. We learned that integrating fairness early in the development process—through evaluation, data selection, and architecture choices—is essential for creating inclusive technologies.
- Future systems must evolve with deepfake techniques.
As deepfake generation methods grow more sophisticated, robust detectors must evolve in parallel. We identified the importance of expanding our training data to include a wider variety of deepfake generation techniques to better prepare models for real-world scenarios.
- Fairness is not a constraint—it’s an opportunity.
By designing for fairness, we didn’t just mitigate harm—we improved model consistency and trustworthiness. Our project demonstrates that fairness-aware machine learning is not only possible but also beneficial for producing resilient, real-world-ready models.
Acknowledgements
We would like to thank our Capstone instructors, Cornelia Paulik and Ramesh Sarukkai, for their guidance and support throughout this project. We’re also grateful to Professor Hany Farid for his expertise in deepfake detection, which helped shape our research direction and evaluation approach.
References
1. Yadav, Amit Kumar Singh, et al. “FairSSD: Understanding Bias in Synthetic Speech Detectors.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024, pp. 1–10.
2. Ranjan, Rishabh, et al. “Unveiling the Achilles’ Heel of Multilingual Deepfake Detection.” Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), 2024, pp. 1–8.
3. Shen, Judy Hanwen, et al. “The Data Addition Dilemma.” Proceedings of the Machine Learning for Healthcare Conference (MLHC), 2024, arXiv:2408.04154.
4. Jung, Jee-weon, et al. “AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks.” arXiv preprint arXiv:2110.01200, 2021.
5. Tak, Hemlata, et al. “End-to-End Anti-Spoofing with RawNet2.” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6369–6373.
6. Hua, Guang, et al. “Towards End-to-End Synthetic Speech Detection.” IEEE Signal Processing Letters, vol. 28, 2021, pp. 1265–1269.