A person using a phone and pointing to the street
MIDS Capstone Project Spring 2025

SafeWalk for the Visually Impaired

Introduction

The streets we live on are often filled with obstacles, and unexpected hazards can arise at any moment, making it challenging for visually impaired people to walk safely. SafeWalk is a tool designed to help visually impaired individuals navigate these streets securely. SafeWalk identifies obstacles near the user and provides timely audio alerts using a fine-tuned, on-device object detection model. Additionally, through an integrated visual language model, users can receive detailed descriptions of, or ask questions about, images captured in real time through their phone's camera.

Problem & Motivation

  • Problem statement: Over 36 million people globally live with severe visual impairments, a number projected to double within the next 30 years. Meanwhile, more than 270,000 pedestrians die on roads every year. Navigating outdoor environments is a daily challenge for many visually impaired individuals, who have limited access to reliable tools for real-time hazard detection and spatial awareness.
  • Target users: SafeWalk is designed for visually impaired individuals living in urban areas who have access to a smartphone. Our product supports voice interaction, multiple languages, and on-demand scene description to accommodate various levels of vision impairment and preferences for interaction.
  • Market impact: The assistive technology market was valued at $3.9 billion in 2021 and is expected to grow to $13.2 billion by 2030. We aim to support and empower this growing demographic by making mobility tools smarter, more personalized, and accessible to all.
  • Domain expert insight: In our user research, we interviewed a domain expert to better understand the daily challenges visually impaired individuals face. The conversation emphasized the need for proactive, adaptive, and context-aware assistance — affirming the direction of SafeWalk's real-time object detection and voice-guided features. Their feedback played a key role in shaping our alert rules and scene description functions.
  • Our mission: SafeWalk is dedicated to building a world equally accessible to all. Our mobile app combines cutting-edge object detection and visual-language models to empower visually impaired individuals with the safety, independence, and confidence to navigate their surroundings.

Data Source & Data Science Approach

Dataset Overview

  • Volume: ~15GB total, over 90,000 labeled images
  • Sources: COCO dataset, public images from Roboflow, and Kaggle
  • Labeling Tools: Roboflow Annotation Tool + Manual QA

Object Classes

  • Common Classes (14): Person, Bicycle, Car, Motorcycle, Bus, Truck, Traffic Light, Fire Hydrant, Stop Sign, Parking Meter, Bench, Bird, Cat, Dog
  • Additional Classes (13): Pothole, Scooter, Tree, Trash Bin, Bollard, Fence/Barrier, Traffic Cone, Bad Roads, Crosswalk, Pole, Stairs, Upstairs, Downstairs

Challenges & Solutions

  • Challenges

    • Mislabeling and noise in public datasets
    • Minority classes such as potholes were hard to source
    • Discrepancies between real-world and training data
  • Solutions

    • Manual relabeling and quality control
    • Data augmentation and undersampling strategies
    • Custom datasets captured with GoPro / phone cameras

Preprocessing Workflow

  • Data augmentation for minority classes (rotation, flipping, brightness); see the sketch after this list
  • Undersampling of dominant classes (e.g., person)
  • Manual quality checks and relabeling
  • Dataset split: 80% train / 10% validation / 10% test
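To make the augmentation and split steps concrete, here is a minimal sketch using the Albumentations library and a simple random split; the library choice, parameter values, and file paths are illustrative assumptions rather than our exact pipeline.

    # Augmentation sketch for minority classes (assumes Albumentations; values illustrative).
    import random
    import albumentations as A
    import cv2

    transform = A.Compose(
        [
            A.HorizontalFlip(p=0.5),
            A.Rotate(limit=15, p=0.5),
            A.RandomBrightnessContrast(p=0.5),
        ],
        bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
    )

    image = cv2.imread("images/pothole_0001.jpg")   # hypothetical minority-class image
    bboxes = [[0.52, 0.61, 0.20, 0.15]]             # YOLO format: x_center, y_center, w, h
    augmented = transform(image=image, bboxes=bboxes, class_labels=["pothole"])
    aug_image, aug_boxes = augmented["image"], augmented["bboxes"]

    # 80/10/10 split sketch over the list of image paths.
    paths = [f"images/img_{i:05d}.jpg" for i in range(90855)]  # placeholder paths
    random.seed(0)
    random.shuffle(paths)
    n = len(paths)
    train, val, test = paths[:int(0.8 * n)], paths[int(0.8 * n):int(0.9 * n)], paths[int(0.9 * n):]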

Project Objective 1: 

YOLOv5s - Obstacle Detection Model

We selected YOLOv5s as our base model for object detection due to its lightweight structure and fast inference speed.

Model Comparison

Model    | Precision | Recall | mAP 0.5 | mAP 0.5–0.95
YOLOv5s  | 0.65      | 0.48   | 0.52    | 0.31
YOLOv7s  | 0.63      | 0.33   | 0.37    | 0.20
YOLOv11s | 0.66      | 0.53   | 0.56    | 0.37
  • Deployment: Optimized using TorchScript for Android compatibility
  • mAP 0.5: Mean Average Precision calculated using a single Intersection over Union (IoU) threshold of 0.5.
  • mAP 0.5–0.95: Mean Average Precision averaged over multiple IoU thresholds, ranging from 0.5 to 0.95 in increments of 0.05 (a small IoU sketch follows this list).
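To make the IoU thresholds behind these metrics concrete, the small illustrative function below computes IoU for two corner-format boxes; it is not the evaluation code we used.

    def iou(box_a, box_b):
        """Intersection over Union for boxes given as [x1, y1, x2, y2] in pixels."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter) if inter else 0.0

    # A prediction counts as a true positive at mAP 0.5 when iou(pred, truth) >= 0.5;
    # mAP 0.5-0.95 repeats the calculation at thresholds 0.5, 0.55, ..., 0.95 and averages.
    print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.143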

Fine-Tuning Strategy

  • Changed the optimizer from SGD to Adam (training call sketched after this list)
  • Lowered learning rate from 0.01 to 0.001 for better convergence
  • Increased training epochs to 100
  • Implemented Class Weight Parameter
  • Addressed class imbalance with undersampling and augmented oversampling
  • Applied a relative confidence threshold per class, chosen by best F1 score
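These choices map onto a training call like the sketch below, written against the Ultralytics YOLOv5 repository; the dataset and hyperparameter file names are hypothetical, and argument names may vary slightly between repository versions.

    # Fine-tuning sketch (run from an ultralytics/yolov5 checkout).
    import train  # yolov5/train.py

    train.run(
        data="safewalk.yaml",       # hypothetical dataset config listing the 27 classes
        weights="yolov5s.pt",       # start from the pretrained small model
        epochs=100,                 # increased from the default
        optimizer="Adam",           # switched from SGD
        hyp="hyp.safewalk.yaml",    # hypothetical hyperparameter file with lr0: 0.001
        imgsz=640,
    )
    # The relative, per-class confidence thresholds were then selected offline from the
    # validation F1 curves rather than passed to train.run().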
PR Curve for fine-tuned YOLOv5s

Examples of relative threshold

Final Result

After fine-tuning, the model showed improvements across all key performance metrics:

Model            | Precision | Recall | F1 Score | mAP 0.5 | mAP 0.5–0.95
Base model       | 0.65      | 0.48   | 0.54     | 0.52    | 0.31
Fine-tuned model | 0.82      | 0.67   | 0.73     | 0.76    | 0.54

The precision-recall curve moved toward the top-right corner, showing the improvement after fine-tuning.

PR Curve for fine-tuned YOLOv5s

New class detected after fine-tuning

Project Objective 2: 

Alert Rule System

Key Objective

Our adaptive alert engine uses YOLO outputs to trigger spoken warnings based on real-time detection and object location. This system aims to enhance safety while minimizing alert fatigue.

  • To reduce user fatigue by alerting only when it matters
  • To enhance usability with directional guidance
  • To support safer decision-making with well-timed warnings based on front-end rule sets

Alert Rule Development

Essentially, SafeWalk does not alert on every object the model detects; it alerts only when users need the information to make safer decisions. To achieve this, the system checks several parameters of each object's bounding box, such as its center, width, height, top, and bottom. As an object gets closer to the user, its bounding box grows, meaning both its height and width increase. In addition, by analyzing the center position of the bounding box, the system can determine whether the object is to the user's right, to their left, or directly in front.
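As a minimal illustration of the directional part of this logic, the sketch below classifies a detection as left, in front, or right from its bounding-box center; the frame-split fractions are illustrative, not the tuned values used in the app.

    def direction_from_box(x_min, x_max, frame_width, left_frac=0.40, right_frac=0.60):
        """Return 'left', 'in front', or 'right' from the box center position.

        left_frac and right_frac split the frame into three zones; the real rules use
        tuned, class-aware values rather than these illustrative defaults.
        """
        center = (x_min + x_max) / 2.0 / frame_width   # normalized 0..1 across the frame
        if center < left_frac:
            return "left"
        if center > right_frac:
            return "right"
        return "in front"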


Bounding box logic for directional alerts

Adaptive Alert

There are two modes. Detection mode provides the initial alert: objects are announced along with directional guidance (right, left, or in front), so users are aware of them ahead of time. When an object moves closer to the user, a stronger alert is triggered, stating, “Be careful {obstacle}.” Different parameter conditions on the bounding box distinguish these two modes.
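A sketch of the two-stage decision, keyed on how much of the frame height the bounding box occupies; the fractions and message templates are illustrative assumptions, standing in for the per-class conditions tuned in the app.

    def alert_message(obj_class, box_height, frame_height, direction,
                      detect_frac=0.20, caution_frac=0.45):
        """Two-stage alert sketch: detection mode first, then a stronger warning."""
        size = box_height / frame_height
        if size >= caution_frac:
            return f"Be careful {obj_class}"    # object is close: stronger alert
        if size >= detect_frac:
            return f"{obj_class} {direction}"   # detection mode with direction
        return None                             # far away: stay silent to limit alert fatigue

    print(alert_message("bollard", box_height=420, frame_height=800, direction="on your left"))
    # -> "Be careful bollard"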


Two-stage adaptive alert system

Challenges & Solution

Developing consistent alert rules for all object classes required extensive testing and adjustment, because shapes and sizes differ drastically (wide cars, narrow and short bollards, moving people). We built a custom test mode to evaluate how bounding boxes behaved in real-world conditions and tuned our logic accordingly.


Custom test mode for bounding box calibration

Project Objective 3: 

Qwen2 - Scene Description Model

To provide clear and context-rich visual descriptions, we evaluated several Vision Language Models (VLMs). Qwen2 consistently outperformed Paligemma in both qualitative and quantitative metrics.

Overall Evaluation Summary

We scored the models on three custom-designed metrics across 30 real-world urban scenes:

  • Navigation Safety: Identifying obstacles, hazards, and safe paths
  • Spatial Orientation: Helping users build a mental map
  • Environmental Awareness: Providing relevant context (e.g., weather, time, ambiance)

Overall Evaluation Scores

Model     | Navigation Safety | Spatial Orientation | Environmental Awareness | Overall
Paligemma | 3.0               | 2.1                 | 2.2                     | 2.4
Qwen2     | 3.9               | 3.8                 | 3.9                     | 3.9
Gap       | +0.9              | +1.7                | +1.7                    | +1.4

Why Qwen2?

  • 67% better performance across all evaluation metrics
  • 4–6x more detailed descriptions (avg. 65–70 words vs. 11–12)
  • Higher accuracy in describing surroundings, layout, and spatial cues

Scene Example: Which Model Describes Better?

We prompted both models with the same question for the picture shown:

Q: Please describe what is in front of me
Example comparison between Qwen2 and Paligemma

Paligemma’s Response

“The image is a video of a crosswalk with a person and a car in the middle of the road.”

Qwen2’s Response

“The image shows a city street scene during the day time. There are several people crossing the street at the pedestrian crossing marked with yellow line. The street is relatively wide and has a few cars parked along the side. There are trees with blooming flowers likely cherry blossoms…”

Scoring Comparison

  • Navigation Safety
    • Paligemma (2.5): Mentions person & car; no crosswalk or signals; vague on safety context
    • Qwen2 (3.0): Notes pedestrian crossing; mentions people walking; no traffic signal info
  • Spatial Orientation
    • Paligemma (1.0): Only says “crosswalk”; no layout or direction; no reference points
    • Qwen2 (4.0): Describes street & parked cars; mentions crosswalk & layout; good scene structure
  • Environmental Awareness
    • Paligemma (1.0): No weather/time info; no surroundings; just car & person
    • Qwen2 (3.8): Mentions trees & cherry blossoms; notes daytime setting; adds calm ambiance
  • Overall
    • Paligemma (1.5): Object-level only; missing context; not helpful for navigation
    • Qwen2 (3.6): Balanced description; good spatial & environmental cues; supports user needs

Metrics-Based Caption Evaluation

We also evaluated the generated descriptions using NLP similarity metrics that compare model captions with human-written references. Qwen2 again scored higher across the board:

Caption similarity metrics (ROUGE, METEOR)

Qwen2 outperforms Paligemma on ROUGE and METEOR, showing greater overlap and fluency with respect to human references
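This kind of ROUGE/METEOR scoring can be reproduced with the rouge-score and NLTK packages, as in the sketch below; the reference and candidate sentences are made up for illustration and are not from our evaluation set.

    # pip install rouge-score nltk
    import nltk
    from nltk.translate.meteor_score import meteor_score
    from rouge_score import rouge_scorer

    nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

    reference = "Several people are crossing the street at a marked pedestrian crossing."
    candidate = "People cross the wide street at the pedestrian crossing near parked cars."

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)                 # precision/recall/F1 per metric
    meteor = meteor_score([reference.split()], candidate.split())

    print({name: round(score.fmeasure, 3) for name, score in rouge.items()}, "METEOR:", round(meteor, 3))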

How to Use the SafeWalk App

System Architecture

SafeWalk integrates mobile, vision, and language AI technologies into a seamless system that assists visually impaired individuals in real-time. Below is a breakdown of our core components.


Architecture overview: Front-end app orchestrates detection and description flows

1. Mobile Front-End

  • Developed in Flutter for cross-platform capability
  • Real-time video feed with overlaid obstacle detection results
  • Voice-based interaction and feedback with gesture control
  • Front/back camera toggle + fast alert trigger system

2. YOLOv5s Obstacle Detection

  • Trained on 90,855 labeled images across 27 object classes
  • Converted to TorchScript for on-device performance (export step sketched after this list)
  • Detects and classifies objects with bounding boxes
  • Provides spatial inputs for the voice alert logic
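The TorchScript conversion follows the YOLOv5 repository's export script; the sketch below shows that step with a hypothetical weights path, and argument names may differ across repository versions.

    # TorchScript export sketch (run from an ultralytics/yolov5 checkout).
    import export  # yolov5/export.py

    export.run(
        weights="runs/train/safewalk/weights/best.pt",  # hypothetical fine-tuned weights
        include=("torchscript",),                       # emit a .torchscript file for the Android app
        imgsz=(640, 640),
    )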

3. Qwen2 Visual Language Model (VLM)

  • Cloud-hosted using AWS SageMaker + Lambda
  • Receives a cropped image as input and returns natural-language captions (call pattern sketched after this list)
  • Optimized prompt engineering for environmental awareness
  • Audio output is synthesized using TTS for user-friendly feedback
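The sketch below shows the kind of server-side call the app's backend can make to the hosted model using the boto3 SageMaker runtime client; the endpoint name and the request/response JSON fields are hypothetical, not the deployed contract.

    import base64
    import json

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    def describe_scene(image_path, prompt="Please describe what is in front of me"):
        """Send a camera frame to the hosted Qwen2 endpoint and return its caption.

        The endpoint name "safewalk-qwen2-vlm" and the payload schema are assumptions;
        the deployed Lambda/SageMaker contract may differ.
        """
        with open(image_path, "rb") as f:
            payload = {"image": base64.b64encode(f.read()).decode(), "prompt": prompt}
        response = runtime.invoke_endpoint(
            EndpointName="safewalk-qwen2-vlm",
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        return json.loads(response["Body"].read())["caption"]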

Deployment Strategy

  • YOLOv5s: Runs on-device via TorchScript for low latency and offline use
  • Qwen2: Accessed via secure REST API for scalable and language-rich output
  • Designed for modular expansion: wearable camera support, iOS app, etc.

Key Learnings & Impact

  • A customized YOLO model: new outdoor classes, large datasets, decent performance, and reasonable training costs.
  • Informative alerts: color-coded cues for object position derived from bounding-box boundaries, plus VLM integration for detailed descriptions.
  • Efficient deployment: the YOLO model runs on-device for low latency, while the VLM is served via API to keep the app lightweight.

Future Work

  • Expanding Operating System Compatibility: Our app is currently available on Android. We could expand to an iOS version and enable functionality available on Apple devices.
  • Optimizing Alert Rules with AI: Our alert rules are currently refined manually based on test users' experiences. In the future, they could be optimized with AI models that adapt to the size of objects in the camera view.
  • Expanding Global Services: Our current customized YOLO model was trained on images sourced disproportionately from Asian countries. Customizing object detection models for specific countries or cities based on users' locations could improve performance. While we have already integrated multiple languages, such as Spanish, German, French, and Russian, we still want to improve translation quality and offer more multilingual services.
  • Wearable Device Integration: Ideally, our app could connect with wearable sports cameras for hands-free operation. Other cost-efficient solutions should also be considered.

Acknowledgements

We would like to sincerely thank Joyce Shen and Zona Kostic for their continued support and encouragement throughout this project.

Last updated: May 9, 2025