MIDS Capstone Project Spring 2025

ROAM

Problem & Motivation

ROAM is an AI tour guide assistant that empowers users to learn about the world around them. The platform enables users to upload an image of a landmark, and receive instant information about its history and significance. From active travelers to passive wanderers, our mission is to empower individuals to learn in a simple and intuitive way that will help foster understanding, empathy, and global awareness!

Our world is becoming increasingly interconnected as technology continues to develop and improve. It is our collective job to continue teaching and learning about those with different backgrounds.

When traveling to or exploring a new destination, it is often difficult to learn about the culture, history, or significance behind interesting landmarks without the assistance of a tour guide or local expert. By combining image detection with retrieval augmented generation (RAG), ROAM brings the knowledge and charisma of local historians to the palms of travelers’ hands.

With image detection, users will first be able to confidently identify what surrounds them. Incorporating RAG then provides succinct summaries about historical significance and other relevant information for the user to learn from. This knowledge will help build stronger relationships across disparate communities and promote inclusivity.

Data Source & Data Science Approach

High Level System Architecture

ROAM’s architecture is deployed entirely on AWS, using EC2 and SageMaker endpoints. The front-end application is hosted on a secure domain (roam-pic.com) and runs on an EC2 instance. Serving Streamlit over a secure domain lets users access the app safely from their phone via the web and grants access to camera and location information. When a user takes a photo of a landmark, the photo is sent to our AWS SageMaker endpoints, which include two computer vision models for landmark detection and a RAG model for historical retrieval. The results are then displayed back to the user in Streamlit.
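As a rough illustration of this flow, the sketch below shows how a Streamlit front end could forward a captured photo to a SageMaker endpoint with boto3. The endpoint name and the JSON response shape are placeholder assumptions, not ROAM’s actual interface.

```python
import json

import boto3
import streamlit as st

# Hypothetical endpoint name; ROAM's actual endpoint names may differ.
LANDMARK_ENDPOINT = "roam-landmark-classifier"

runtime = boto3.client("sagemaker-runtime")

st.title("ROAM")
photo = st.camera_input("Take a photo of a landmark")

if photo is not None:
    # Send the raw image bytes to the SageMaker endpoint.
    response = runtime.invoke_endpoint(
        EndpointName=LANDMARK_ENDPOINT,
        ContentType="application/x-image",
        Body=photo.getvalue(),
    )
    # Assume the endpoint returns JSON with a label and a confidence score.
    result = json.loads(response["Body"].read())
    st.write(f"Identified landmark: {result['label']} "
             f"(confidence {result['confidence']:.2f})")
```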

About the Data

For landmark detection, we are using the Google Landmarks Dataset v2. Researchers at Google filtered it down from ~4.2 million to ~1.6 million images spanning over 81 thousand landmarks, making the data much more suitable for training a high-precision landmark identification model. Still, several instances remained where the main object in the image did not align with the image’s label.

To programmatically filter out these noisy images, we use a visual question answering model called BLIP. For every image in each class (landmark), BLIP identifies the main object in the image. We then count the frequency of each unique BLIP label within a class and keep only the images that carry the class’s most frequent BLIP label.
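A simplified sketch of this cleaning step is shown below, using a BLIP visual question answering checkpoint from Hugging Face. The specific checkpoint, question wording, and helper names are illustrative assumptions rather than ROAM’s actual pipeline code.

```python
from collections import Counter
from pathlib import Path

from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Assumed checkpoint; the project may have used a different BLIP variant.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def main_object(image_path: Path) -> str:
    """Ask BLIP what the main object in the image is."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, "What is the main object in the image?", return_tensors="pt")
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True)

def filter_class(image_paths: list[Path]) -> list[Path]:
    """Keep only images whose BLIP label matches the class's most frequent label."""
    labels = {path: main_object(path) for path in image_paths}
    most_common_label, _ = Counter(labels.values()).most_common(1)[0]
    return [path for path, label in labels.items() if label == most_common_label]
```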

For the RAG pipeline, we primarily used two data sources: UNESCO and Wikipedia. The UNESCO dataset includes over 1,000 landmarks across 168 countries and is highly regulated, with strict submission guidelines ensuring data quality. To enrich this, we dynamically append the top two Wikipedia entries for the identified landmark using LangChain’s Wikipedia API. These entries are added only on demand, when a user queries the app, to keep the RAG database lightweight and retrieval times fast.
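As a minimal sketch of this on-demand step, the snippet below pulls the top two Wikipedia entries for an identified landmark with LangChain’s WikipediaLoader; the loader choice and the example landmark are assumptions for illustration.

```python
from langchain_community.document_loaders import WikipediaLoader

def fetch_wikipedia_context(landmark_name: str) -> list:
    """Fetch the top two Wikipedia entries for the identified landmark on demand."""
    loader = WikipediaLoader(query=landmark_name, load_max_docs=2)
    return loader.load()

# Example: enrich the static UNESCO corpus only when a user actually queries the app.
docs = fetch_wikipedia_context("Eiffel Tower")
for doc in docs:
    print(doc.metadata.get("title"), len(doc.page_content))
```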

ML Architecture

Let’s take a quick look at the machine learning architecture. The process starts with the user-provided image and location coordinates. The image is first passed to our Primary Classification model: a Vision Transformer from Google that we fine-tuned on the landmark dataset described above. The fine-tuned model takes in the image, splits it into fixed-size 16x16-pixel patches, embeds them, and prepends a CLS token. The CLS token holds the aggregated contextual representation of the entire image. The output of the Vision Transformer is then passed to a dense head containing one hidden layer of 200 neurons. Finally, a softmax activation function is applied, providing a probability distribution over all of the classes. The highest-probability class is then compared against our confidence threshold; if the probability is below the threshold, the image is passed to the Secondary Classification model to try again.
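The sketch below outlines this primary classifier under a few assumptions: the Hugging Face google/vit-base-patch16-224 backbone, a placeholder class count, and the hidden-layer size and confidence threshold described in this write-up.

```python
import torch
import torch.nn as nn
from transformers import ViTModel

NUM_CLASSES = 81_000      # placeholder: roughly the number of landmark classes
CONF_THRESHOLD = 0.75     # used to route low-confidence images to the secondary model

class LandmarkClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Pretrained ViT backbone; images are split into 16x16-pixel patches.
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
        # Dense head: one hidden layer with 200 neurons, then the class logits.
        self.head = nn.Sequential(
            nn.Linear(self.vit.config.hidden_size, 200),
            nn.ReLU(),
            nn.Linear(200, NUM_CLASSES),
        )

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        outputs = self.vit(pixel_values=pixel_values)
        cls_token = outputs.last_hidden_state[:, 0]   # aggregated CLS representation
        return torch.softmax(self.head(cls_token), dim=-1)

def route(probs: torch.Tensor):
    """For one image's probabilities, return the class id, or None to fall back."""
    confidence, label = probs.max(dim=-1)
    return int(label) if float(confidence) >= CONF_THRESHOLD else None
```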

For our secondary classification model, we use CLIP from OpenAI in combination with AWS Location Service. CLIP is a neural network trained on image-text pairs. The advantage of this model is that it supports zero-shot prediction: it can predict classes it was not directly trained on. By using the user’s location to narrow down the candidate landmarks, CLIP helps us generalize to lesser-known landmarks not seen by the first Vision Transformer. If the probability is above the threshold, the class label is displayed back to the user, and that label is used as the input to the RAG system to generate the landmark information for the user.
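A hedged sketch of this fallback is shown below, using an OpenAI CLIP checkpoint from Hugging Face for zero-shot scoring over a location-narrowed candidate list. The step that queries nearby landmark names from AWS Location Service is assumed rather than shown.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; ROAM's deployment may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_landmark(image: Image.Image, candidate_landmarks: list[str]):
    """Score an image against nearby landmark names and return the best match."""
    prompts = [f"a photo of {name}" for name in candidate_landmarks]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    best = int(probs.argmax())
    return candidate_landmarks[best], float(probs[best])

# `nearby_landmarks` would come from AWS Location Service using the user's coordinates.
# label, confidence = zero_shot_landmark(user_image, nearby_landmarks)
```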

RAG Pipeline

Once the vision model identifies the landmark from the user-provided image, the RAG pipeline retrieves relevant information to simulate an interactive tour guide experience. Using a custom prompt that includes the landmark and any additional user-provided details, the system queries a RAG database containing curated data from UNESCO World Heritage Sites and relevant Wikipedia entries.

The retrieved content from UNESCO and/or Wikipedia is then passed, along with the prompt, into the Mistral model’s context window. This generates engaging, tour guide-style responses tailored to the identified landmark.
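The sketch below illustrates the retrieval-and-prompt step under several assumptions: a FAISS vector store over the UNESCO and Wikipedia documents, an off-the-shelf sentence-transformer embedding model, a simplified prompt template, and a hypothetical generate_with_mistral call standing in for ROAM’s hosted Mistral endpoint.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Toy stand-ins for the curated UNESCO entries plus on-demand Wikipedia pages.
documents = [
    Document(page_content="Machu Picchu is a 15th-century Inca citadel in Peru...",
             metadata={"source": "UNESCO"}),
]

# Assumed embedding model; ROAM's actual choice may differ.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(documents, embeddings)

PROMPT_TEMPLATE = (
    "You are a friendly, knowledgeable tour guide. Using only the context below, "
    "tell the visitor about {landmark}.\n\nContext:\n{context}\n\nVisitor notes: {user_notes}"
)

def build_tour_guide_prompt(landmark: str, user_notes: str = "") -> str:
    """Retrieve relevant passages for the landmark and assemble the generation prompt."""
    docs = vector_store.similarity_search(landmark, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)
    return PROMPT_TEMPLATE.format(landmark=landmark, context=context, user_notes=user_notes)

# The assembled prompt is then sent to the Mistral model (hosted behind a SageMaker
# endpoint in ROAM's deployment) to produce the tour guide-style response, e.g.:
# response = generate_with_mistral(build_tour_guide_prompt("Machu Picchu"))
```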

Evaluation

For the Primary Classification model, we achieved an F1 score of 78.9% on over 21,000 test images. We incorporated a confidence threshold to determine when images should be sent to the Secondary Classification model for another attempt. If the highest probability after applying the softmax activation is above the confidence threshold, that landmark is returned to the user and used as the input to the RAG system.

If the highest probability is below that threshold, our model has low confidence in the prediction, and the image is sent to the Secondary Classification model for a second attempt. By setting the confidence threshold to 0.75, we tuned the Primary model so that none of the test images were misclassified and incorrectly passed on to the RAG system.
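As a hedged sketch of how such a threshold can be chosen, the snippet below sweeps candidate thresholds over held-out softmax outputs and reports the share of test images that would be misclassified yet still passed through; the probability and label arrays are placeholders for ROAM’s actual test-set predictions.

```python
import numpy as np

def misrouted_fraction(probs: np.ndarray, labels: np.ndarray, threshold: float) -> float:
    """Fraction of all test images that are misclassified AND pass the threshold."""
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    passed = confidence >= threshold
    return float(((predictions != labels) & passed).mean())

# probs: (n_images, n_classes) softmax outputs; labels: (n_images,) true class ids.
# for t in np.arange(0.50, 0.96, 0.05):
#     print(f"threshold={t:.2f}  misrouted={misrouted_fraction(probs, labels, t):.2%}")
```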

For the Secondary Classification model, evaluation was performed on a 5,000-image subset of the testing data for which Wikipedia location coordinates were available. It achieved a limited accuracy of 44%, but it allows the application to generalize and identify landmarks not seen by the primary classifier.

For the RAG system, we are using RAGAS (Retrieval-Augmented Generation Assessment) as our primary evaluation framework. RAGAS provides granular metrics to assess different aspects of generated responses, including faithfulness (how well the response aligns with the retrieved context) and relevance (whether the response correctly addresses the user query). In our evaluation, we're specifically checking whether the generated response is about the correct landmark and if it accurately reflects the retrieved content. Across 100 samples, our relevance score is high at 0.923, indicating that the system consistently discusses the correct landmark. However, faithfulness scores are lower at 0.71, which is expected—our prompts intentionally encourage the LLM to generate more engaging, tour guide-style responses. This often includes descriptive language or slight embellishments that go beyond the exact retrieved context.
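A minimal sketch of this evaluation setup is shown below, using RAGAS’s faithfulness and answer relevancy metrics over an illustrative record; the column names follow the classic ragas evaluate API, which varies across versions, and the sample data is not from our actual evaluation set.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Illustrative records: one per (query, retrieved context, generated response) triple.
records = {
    "question": ["Tell me about the Eiffel Tower."],
    "contexts": [["The Eiffel Tower is a wrought-iron lattice tower in Paris..."]],
    "answer": ["Welcome to the Eiffel Tower, Paris's iconic iron landmark..."],
}

results = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy],
)
print(results)  # e.g. {'faithfulness': ..., 'answer_relevancy': ...}
```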

Key Learnings & Impact

When fine-tuning the Primary Classification model, we noticed training images that did not describe the landmark well, such as pictures of a statue or plaque inside a church. If similar statues or plaques appeared in other classes, the model became quite confused. We considered these images “noisy” and worked to clean them from our training set. Issues like this often arise when working with large amounts of data.

The Secondary Classification model was limited by the coverage of AWS Location Service and by language differences. In geographical areas not covered by AWS, the Secondary model cannot generalize to detect nearby landmarks at all. Likewise, if landmarks are listed in different languages, CLIP is unlikely to match the landmark name with an image. Overall, the Secondary Classification model still benefits the application by allowing users to potentially identify landmarks not seen in the training dataset.

A limitation we encountered on the RAG side of the product is that our system primarily relies on English-language data sources. As a result, it struggles to retrieve relevant context when a landmark is referred to by its English-translated name rather than its native or locally used name. This mismatch often leads to erroneous responses. To address this, we can enhance the retrieval pipeline by incorporating multilingual data sources (e.g., native Wikipedia pages or localized tourism datasets) and integrating a translation or entity resolution step. This would help map translated landmark names to their original names, improving retrieval accuracy and overall response quality.

Despite the current limitations, ROAM makes cultural discovery more accessible by turning any smartphone into a knowledgeable, AI-powered tour guide. By combining image recognition with retrieval augmented generation, it allows users to instantly identify landmarks and learn about their historical and cultural significance. This approach empowers travelers to engage more deeply with their surroundings. The project supports global understanding and empathy by making learning about different cultures effortless and intuitive. Ultimately, ROAM helps bridge cultural gaps and fosters more inclusive and informed exploration. 

Acknowledgements

We would like to acknowledge our capstone instructors, Joyce Shen and Korin Reid, as well as the entire W210 course instructor team, for their tremendous guidance and support throughout the entire semester. Additionally, we would like to thank all the respondents of our initial user study survey, which played a huge part in directing the strategy and focus of this project.

Last updated: April 14, 2025