MIDS Capstone Project Spring 2026

Eye2Voice: Gaze-to-Speech Application for Individuals with Disabilities

Communication is a fundamental human right – losing the ability to speak shouldn’t mean losing your voice

Eye2Voice is a gaze-to-speech communication tool that gives people with physical disabilities a voice using only their eye movements. A custom-trained CNN detects gaze direction through a standard smartphone or tablet camera, translating eye movement into screen selections across two modes: a guided conversation tree, or an LLM-powered mode where a caregiver speaks and responses are dynamically generated in real time. No specialized hardware, no setup — just open a browser on any device, removing a typically expensive barrier to communication.

See our application live here: Eye2Voice


Why We’re Building Eye2Voice

There are really two people in our story.

The first is the individual. Communication isn’t a feature or convenience – it is how we express pain, make decisions, and maintain our dignity. When someone loses the ability to speak, whether permanently from ALS or temporarily in the ICU, the need to be heard doesn’t go away.

The second is the caregiver – the family member, nurse, or partner who shows up every day, trying to understand what their loved one needs. Right now, they’re left guessing, and the tools that exist to help cost thousands of dollars and take weeks of specialized setup.

Eye2Voice exists because both of these people deserve better. The individual deserves a voice. The caregiver deserves a better experience and more accessible tools.


Our Data Science Approach

Our approach models gaze direction as a real-time 4-class classification problem using a two-stage, multi-input vision pipeline that combines client-side feature extraction with cloud-based inference.

We first process raw webcam input directly in the browser using MediaPipe, which detects facial landmarks and extracts three normalized image regions—face, left eye, and right eye—ensuring that no raw video leaves the device. These crops, along with geometric signals derived from facial landmarks (such as iris position and head pose), form a structured representation of gaze.
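One way to picture a geometric signal such as iris position is as a ratio inside the eye opening. The sketch below is illustrative only: it assumes normalized (x, y) landmark coordinates like those a face-landmark detector returns, and the function name, arguments, and clamping behavior are hypothetical rather than the project's actual feature code.

```python
from typing import Tuple

Point = Tuple[float, float]  # normalized (x, y); y grows downward as in image coordinates


def iris_ratios(iris_center: Point, inner_corner: Point, outer_corner: Point,
                upper_lid: Point, lower_lid: Point) -> Tuple[float, float]:
    """Return (horizontal, vertical) iris position inside the eye opening.

    Horizontal: 0.0 at the outer corner, 1.0 at the inner corner.
    Vertical:   0.0 at the upper lid,    1.0 at the lower lid.
    (Hypothetical feature definition, used here only for illustration.)
    """
    eye_width = inner_corner[0] - outer_corner[0]
    eye_height = lower_lid[1] - upper_lid[1]
    if abs(eye_width) < 1e-6 or abs(eye_height) < 1e-6:
        return 0.5, 0.5  # degenerate eye box (e.g. a blink): report "centered"

    horizontal = (iris_center[0] - outer_corner[0]) / eye_width
    vertical = (iris_center[1] - upper_lid[1]) / eye_height

    # Clamp so landmark jitter cannot push a ratio outside [0, 1].
    def clamp(value: float) -> float:
        return max(0.0, min(1.0, value))

    return clamp(horizontal), clamp(vertical)
```

Because the ratios are relative to the eye's own width and height, they stay comparable across camera distances and screen sizes, which is one reason a signal like this can complement the image crops.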

These inputs are then passed to our GazeNet model, a lightweight CNN-based architecture designed to fuse image features with geometric features for robust gaze direction classification. By incorporating explicit geometric signals, the model improves performance on challenging cases such as vertical gaze.

To ensure stability in real-world usage, predictions are passed through a temporal smoothing layer, which aggregates predictions over a rolling window and applies a confidence threshold before triggering an output. This reduces noise and prevents unintended activations.
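A minimal sketch of this kind of smoothing, assuming a majority vote over the rolling window combined with a mean-confidence threshold. The class name, window size, vote count, and threshold are all hypothetical; the actual Eye2Voice parameters and aggregation rule are not stated here.

```python
from collections import Counter, deque
from typing import Deque, Optional, Tuple


class TemporalSmoother:
    """Fire a direction only when recent predictions agree and are confident."""

    def __init__(self, window: int = 10, min_votes: int = 7,
                 min_confidence: float = 0.8) -> None:
        # All three defaults are illustrative assumptions.
        self.min_votes = min_votes
        self.min_confidence = min_confidence
        self._history: Deque[Tuple[str, float]] = deque(maxlen=window)

    def update(self, label: str, confidence: float) -> Optional[str]:
        """Add one per-frame prediction; return a label only when it triggers."""
        self._history.append((label, confidence))
        if len(self._history) < self._history.maxlen:
            return None  # window not full yet

        votes = Counter(lbl for lbl, _ in self._history)
        top_label, count = votes.most_common(1)[0]
        mean_conf = sum(c for lbl, c in self._history if lbl == top_label) / count

        if count >= self.min_votes and mean_conf >= self.min_confidence:
            self._history.clear()  # require a fresh window before firing again
            return top_label
        return None
```

Clearing the window after a trigger is one possible design choice: it keeps a single sustained gaze from firing the same selection on every subsequent frame.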

This end-to-end design—combining on-device preprocessing, feature-enhanced modeling, and temporal stabilization—enables accurate, low-latency gaze classification on standard consumer devices without specialized hardware.


Datasets

Our primary dataset is MIT GazeCapture

  • It was collected in the wild, on people's personal phones and tablets — so there's real variation in lighting, head pose, and device angle.
  • And it's large — around 850,000 samples in our training set — which gave us the scale we needed to train a model that generalizes.

We also incorporated supplemental data from XGaze, a public dataset created by the Swiss Federal Institute of Technology (ETH Zurich).

  • It was designed to improve the robustness of gaze estimation methods by focusing on extreme head pose and gaze variation, which is often encountered in real-world assistive contexts.

Evaluation

Our core objective is 4-class direction classification. We detect when a user is looking up, down, left, or right. To achieve this, our team built and tested 60 different model configurations.

We implemented three testing modalities to ensure Eye2Voice is reliable and impactful:

  • Model Evaluation - how well does the model perform on paper?

We evaluated GazeNet on a held-out test set of 29,770 GazeCapture samples using accuracy, macro F1, and per-class recall and precision. We chose these metrics because accuracy gives overall correctness, macro F1 helps account for class imbalance, recall measures how often intended user actions are captured, and precision measures how often the model avoids unintended triggers. Our baseline accuracy was 75% and our final model achieved 92% test accuracy, a gain of 17 percentage points (about a 23% relative improvement). We also used confusion matrices to analyze misclassifications and tracked the train-validation gap to monitor overfitting; that gap fell from 16.3% to 4.1%, indicating a substantial reduction in overfitting. Final recall by direction was 84% for Up, 93% for Down, 92% for Left, and 91% for Right. Together, these results indicate that the model is not only accurate but also generalizable and stable.
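For readers less familiar with these metrics, here is a small sketch of how accuracy, per-class precision and recall, and macro F1 can be computed from true and predicted labels. It is a generic illustration with hypothetical function and variable names, not the project's evaluation code. The last lines rework the reported 75% and 92% accuracies to separate the absolute gain from the relative one.

```python
from typing import Dict, List

CLASSES = ["up", "down", "left", "right"]


def classification_report(y_true: List[str], y_pred: List[str]) -> Dict[str, object]:
    """Compute accuracy, macro F1, and per-class precision/recall/F1."""
    pairs = list(zip(y_true, y_pred))
    per_class: Dict[str, Dict[str, float]] = {}
    for c in CLASSES:
        tp = sum(t == c and p == c for t, p in pairs)  # intended and detected
        fp = sum(t != c and p == c for t, p in pairs)  # unintended trigger
        fn = sum(t == c and p != c for t, p in pairs)  # missed user action
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Macro F1 averages the per-class F1 scores with equal weight per class.
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(CLASSES)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_class": per_class}


# Absolute vs. relative improvement for the reported accuracies.
baseline, final = 0.75, 0.92
absolute_gain = final - baseline          # about 0.17, i.e. 17 percentage points
relative_gain = absolute_gain / baseline  # about 0.227, i.e. roughly 23%
```

Macro F1 weights every direction equally, so a weaker class such as Up lowers the score even when overall accuracy is high.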

  • Functional Evaluation - how well does it perform in practice?

To ensure the model held up in real-world usage, we built a testing application that walks a user through 24 randomly generated direction prompts. For each prompt, the user looks at or slightly beyond the designated screen region, and the application records:

  • Prediction accuracy: how often the model is correct
  • Time to prediction: how long it takes to lock in a confident response

Our team conducted over 50 tests across multiple devices, body postures, and tripod-mounted setups to simulate practical use conditions. This helped us compare top models not just on benchmark accuracy, but on responsiveness and stability in realistic environments. The top functional models reached ~98% accuracy in practice, while the weaker baseline-style model lagged substantially, reinforcing that offline model gains translated into better user experience.
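The bookkeeping behind a functional test session like this can be sketched as follows. This is an assumption-laden illustration: the `Trial` record, the function names, and the handling of prompts that never reach a confident prediction are hypothetical, and the real testing application runs in the browser rather than in Python.

```python
import random
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional

DIRECTIONS = ["up", "down", "left", "right"]


@dataclass
class Trial:
    prompt: str                             # direction the user was asked to look
    predicted: Optional[str]                # None if nothing confident was produced
    seconds_to_prediction: Optional[float]  # None if the prompt timed out


def generate_prompts(n: int = 24, seed: Optional[int] = None) -> List[str]:
    """Draw n random direction prompts (24 per session, as described above)."""
    rng = random.Random(seed)
    return [rng.choice(DIRECTIONS) for _ in range(n)]


def summarize(trials: List[Trial]) -> Dict[str, Optional[float]]:
    """Report the two recorded measures: accuracy and mean time to prediction."""
    correct = sum(t.predicted == t.prompt for t in trials)
    times = [t.seconds_to_prediction for t in trials
             if t.seconds_to_prediction is not None]
    return {
        "accuracy": correct / len(trials),
        "mean_seconds_to_prediction": mean(times) if times else None,
    }
```

In this sketch a timed-out prompt counts as incorrect but is left out of the timing average, so a slow model cannot look fast simply by failing to answer.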

  • Small Usability Study - how well does it perform for our target users?

Finally, we evaluated the MVP with domain experts, including a speech-language pathologist and a teacher/caregiver familiar with AAC users. Their feedback helped us understand whether the system was not only technically accurate, but also usable and meaningful in real communication settings. One expert noted that a first-time user had better success controlling choices than with other apps and AAC-specific devices, while another highlighted the promise of the AI-assisted response flow for helping users express what they truly want to say.

Key challenges and remedies:

  • Participant imbalance driving overfitting
    → Mitigated through per-user frame caps and balanced sampling, improving generalization
  • Weak vertical gaze discrimination (Up/Down)
    → Enhanced with geometric features (head pose, iris ratios) to capture subtle eye-position signals
  • Overfitting across model iterations
    → Reduced via regularization, simplified architecture, and participant-level data splits
  • Ambiguity in the “Straight” class
    → Resolved by removing/reweighting the class, resulting in more stable 4-class predictions
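Two of the remedies above, per-user frame caps and participant-level splits, can be sketched in a few lines. The sample representation, function names, cap value, and test fraction are hypothetical choices for illustration; the project's actual values are not given here, and class-balanced sampling is not shown.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (participant_id, frame_id), a hypothetical layout


def cap_frames_per_user(samples: List[Sample], max_per_user: int,
                        seed: int = 0) -> List[Sample]:
    """Keep at most max_per_user frames per participant, chosen at random."""
    rng = random.Random(seed)
    by_user: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_user[sample[0]].append(sample)

    capped: List[Sample] = []
    for user in sorted(by_user):  # sorted for reproducibility
        frames = by_user[user]
        if len(frames) > max_per_user:
            frames = rng.sample(frames, max_per_user)
        capped.extend(frames)
    return capped


def split_by_participant(samples: List[Sample], test_fraction: float = 0.2,
                         seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Split so that no participant appears in both train and test."""
    users = sorted({s[0] for s in samples})
    random.Random(seed).shuffle(users)
    n_test = max(1, round(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train = [s for s in samples if s[0] not in test_users]
    test = [s for s in samples if s[0] in test_users]
    return train, test
```

Splitting by participant rather than by frame means the test score reflects performance on people the model has never seen, which is the situation a new user is in.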

Key Learnings

  • Model performance is not enough — stability matters more
    High offline accuracy did not guarantee usability. Temporal smoothing and functional testing were critical to convert predictions into reliable user actions.
  • Data design mattered as much as model design
    Participant-level splits and balanced sampling were essential to prevent overfitting and ensure generalization across unseen users.
  • Feature engineering unlocked the hardest problem (vertical gaze)
    Pure CNN approaches struggled with Up/Down detection. Incorporating geometric features (iris position, head pose) significantly improved performance.
  • Simpler, well-regularized models outperformed complex ones
    Reducing model size and reformulating the task as classification led to better generalization and lower overfitting.
  • Ambiguity in problem definition impacts model quality
    The “Straight” class introduced noise and instability. Refining the problem to a clear 4-class setup improved consistency.
  • End-to-end system thinking is critical
    Combining MediaPipe (client) + Lambda inference + smoothing layer was key to achieving real-time, usable performance—not just a good model.

Impact

  • High-performance gaze model
    Improved from 75% → 92% test accuracy, with strong per-class recall and reduced overfitting.
  • Real-time usability achieved
    Sub-500ms response times with stable predictions enabled practical interaction without lag.
  • Robust across real-world conditions
    Validated across devices, postures, and environments through 50+ functional tests.
  • Accessible, hardware-free solution
    Works on standard smartphones and tablets, removing the approximately $12K cost barrier of traditional AAC devices.
  • Privacy-first architecture
    No raw video leaves the device — only compressed features are transmitted, ensuring user trust.
  • Validated by domain experts
    Early usability feedback confirmed the system is intuitive, effective, and meaningful in caregiving scenarios.

Acknowledgements 

We’d like to express our gratitude to our professors, Korid Reid and Joyce Shen, whose guidance, feedback, connections and encouragement were essential to the success of our project.
We extend our sincere thanks to Bryan J. Goeller, a teacher and caregiver; Emma Goeller; and Megen Smith, a speech-language pathologist, for taking the time to speak with us and to provide feedback and guidance on our application development.
And to our team – thank you for the long hours, for the dedication to making something that helps people, and for never settling for just good when we could make something great. It has been a pleasure to build this together.

 

Last updated: May 20, 2026