logo languagex
MIDS Capstone Project Summer 2023


Problem and Motivation

Over 20% of people in the United States speak a language other than English at home (72 million people), with 350 languages spoken in US homes. There are more than 120 million calls every year in the US that require interpretation due to language barriers. However, there are challenges of over-the-phone interpretation that arise from current solutions which utilize human interpreters. 

  1. The costs associated with human translators can be prohibitive, ranging from one to four dollars per minute. Moreover, they do not offer scalability, as interpreters can be unavailable during peak demand or for less commonly spoken languages.

  2. Some may wonder why Google Translate wouldn't suffice. The reason is that Google Translate is optimized for in-person translations and doesn't provide support for real-time phone call translations. It also necessitates the download of an app, creating an additional step for the user. 

  3. There are AI-based solutions like yous.ai that offer affordability and scalability; however, they still necessitate the use of an app or website from both parties, which introduces additional friction. 

In contrast, LanguageX provides an affordable, scalable solution that can handle any level of demand or language requirement and can be accessed simply through a phone call, rendering it the most seamless solution available.

Minimum Viable Product (MVP)

Our MVP consists of two primary components. The first is an AI interpreter integrated into a phone number we've set up using Twilio. When you dial this number, it triggers a request to our endpoint hosted on Google Cloud Platform (GCP), where our interpreter code resides, initiating the translation process. The second component is a monitoring dashboard that allows both our team and our users to track usage and performance metrics, ensuring a smooth user experience.

Within the scope of the Twilio phone number, we've designed the system to first allow users to select their desired language. Subsequently, speech from each caller is transcribed into text. This text is then processed through a translation model, which generates the translated text. Finally, we employ text-to-speech technology to vocalize the translated text, completing the cycle of real-time translation.

Model Components

In evaluating our model's performance, we focused on three key metrics: latency, transcription accuracy, and translation accuracy. Latency tests showed our model outperformed human translators, averaging a 2-second response time compared to the industry standard of 3 seconds. For transcription accuracy, GCP and Deepgram performed the best on the FLEURS benchmark dataset in terms of Word Error Rate (WER). For translation, GCP led in BLEU score metrics. Additionally, we carried out a head-to-head comparison between Deepgram and GCP for speech-to-text. Both performed comparably in WER, but GCP had more reliable API usage and a tighter latency range. Based on these findings, we've opted to proceed with GCP's Speech-to-Text model.

LLM Strategy

After integrating GCP's Speech-to-Text model, we identified limitations in transcription accuracy, particularly in sentences where phonetically similar words led to incorrect transcriptions. To address this, we incorporated a Large Language Model (LLM) into our workflow. Initially, our concern was that adding an LLM layer would increase latency. However, we optimized the process by parallelizing transcription correction and translation tasks, thus delivering quick yet accurate translations. We employed GPT-3.5 using prompt engineering and few-shot learning techniques to prompt the model. We first analyzed common transcription errors—like missing initial words in a sentence—and then designed tailored prompts to correct those specific issues.

Future Work

Although our current language model didn't make the cut for the prototype, we see significant potential for improvement. Following consultations with Professor Mark Butler, an NLP expert at MIDS, we plan to curate a dataset with programmatically introduced noise to train the model in denoising transcription errors. We intend to use open-source multilingual models like LLaMa or BLOOM for experiments, feeding them both sound and the curated dataset to fine-tune for our use case. Additionally, we'll separate the tasks of text correction and translation into two distinct models, introducing FastCorrect for the former and fine-tuning another for more natural, colloquial translations. To more accurately measure performance, we plan to use embedding-based metrics such as BERT score or Universal Sentence Encoder, ensuring that our evaluation captures semantic accuracy rather than just lexical similarity.


Our team is forever grateful to our Capstone professors Joyce Shen and Alberto Todeschini. We wouldn't have achieved this far without their strong support in our journey and ambition to launch the product.

In addition, the following individuals contributed significantly in various phases of the project.

  • Mark Butler, NLP Professor at MIDS
  • Dr. William Lee
  • Dr. Clancy Howard
  • Attorney Gary Mann
  • Fellow classmate Pedro


LanguageX Demo Video

LanguageX Demo Video

If you require video captions for accessibility and this video does not have captions, click here to request video captioning.

Last updated:

September 18, 2023