MIDS Capstone Project Fall 2023


Holding conversations is the main goal of most second-language learners, yet most existing solutions focus primarily on grammar and vocabulary. At Conversationally, we believe that natural conversations powered by generative AI are key to effective second-language learning.

92 million people in the United States are learning to speak a language other than English. By 2028 the language learning market is expected to grow to $190B. However, language learners are often stuck in grammar and vocabulary lessons and fail to make the leap to basic conversations. 

While immersion is the fastest way to learn a language, consistent conversational practice is the closest learning experience to immersion. To create this experience, we address the barriers to becoming conversationally fluent: 

  • Fear of making mistakes: We provide positive conversational experiences early in the learning process.
  • Low confidence: We sequence feedback and learning goals to increase exposure to common exchanges and delay learning edge cases.
  • Limited access to practice: We provide an always-on language tutor to support conversational practice.

Minimum Viable Product (MVP)

The goal of our MVP is to increase the speed and satisfaction of language acquisition while reducing friction. The web app chatbot allows users to practice conversations using modeled microlessons. Users receive immediate feedback on successes and mistakes with <2s latency.  

Our MVP consists of three components. The first component is a sentence transformer model that uses cosine similarity to evaluate whether the user’s response is on topic. If it is on topic, the response is passed to the next model. Otherwise, the user is asked to retry answering the question, up to three times.

The second component is a grammatical error correction (GEC) model with explanation that identifies and annotates grammatical errors in the user input. For the MVP, our model annotates number and gender agreement errors. The annotated sentence is then sent to the large language model.

The third component is a generative large language model that determines how to model the corrected response and provides encouraging feedback and subtle hints while moving the conversation forward. 
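The flow through the three components can be sketched in Python as follows. This is only an illustrative outline, not the project's actual code: the function names, the `TurnResult` type, the reply strings, and the behaviour after the last retry are all assumptions, and the three models are passed in as plain callables so that the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_RETRIES = 3  # the learner may be asked to retry up to three times


@dataclass
class TurnResult:
    accepted: bool                   # True if the answer passed the on-topic check
    reply: str                       # text shown to the learner
    annotated: Optional[str] = None  # GEC output, set only when accepted


def handle_turn(
    question: str,
    answer: str,
    retries_used: int,
    is_on_topic: Callable[[str, str], bool],   # component 1 (hypothetical interface)
    annotate_errors: Callable[[str], str],     # component 2 (hypothetical interface)
    tutor_reply: Callable[[str, str], str],    # component 3 (hypothetical interface)
) -> TurnResult:
    """Route one learner answer through the three components in order."""
    if not is_on_topic(question, answer):
        if retries_used < MAX_RETRIES:
            return TurnResult(False, "Let's try that again: " + question)
        # Assumption: once the retries are used up, the lesson moves on.
        return TurnResult(False, "Let's move on to the next question.")
    annotated = annotate_errors(answer)
    return TurnResult(True, tutor_reply(question, annotated), annotated)
```

Passing the models in as callables keeps the routing logic testable on its own and makes it easy to swap any one component.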

Model Components

In evaluating our models’ performance, we focused on the following evaluation metrics for each of the three components:

  • Content Classification Model: Accuracy and F1 score.
  • Grammatical Error Correction Model: Accuracy and F1 score.
  • Large Language Model: 
    • Is the response ethical? 
    • If there is an error, is the correct answer modeled?
    • Does it provide a scaffolding response? 
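
For reference, accuracy and F1 for a binary classifier can be computed as in the sketch below. The function name and the 0/1 label encoding are assumptions made for illustration; the write-up does not say how the scores were actually computed.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, where 1 is the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all.
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1
```

F1 is useful alongside accuracy because it stays informative when one class (for example, off-topic answers) is much rarer than the other.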

We assessed the user experience in terms of latency with the goal of <2 second response time. To evaluate our learning impact, we gave users pre- and post-session surveys that measured lesson learning objectives as well as a net promoter score.
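
A latency check against the 2-second goal might look like the following sketch. The helper name and return shape are assumptions; the zero-argument callable stands in for whatever produces a full response.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")
LATENCY_BUDGET_S = 2.0  # response-time goal stated in the text


def timed_call(fn: Callable[[], T]) -> Tuple[T, float, bool]:
    """Run fn and return (result, elapsed seconds, whether it met the budget)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed < LATENCY_BUDGET_S
```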

Model Strategy and Challenges

Content Classification Model

The primary goal of our classification model is to understand whether users are staying on topic during scripted practice conversations. By doing so, we can reduce the chances of hallucination with the responses generated by the Large Language Model (LLM) and ensure users stick to learning objectives. This is achieved in two steps: first, we compute the semantic similarity using cosine similarity; second, we establish a decision threshold that determines if a response is off-topic. 

We employ a sentence transformer model, specifically the multi-qa-MiniLM-L6-cos-v1, which is trained on Q&A pairs. This model is used to encode both the question and user input into sentence embeddings. The semantic proximity between these embeddings is then determined by calculating their cosine similarity. 
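
The two steps (cosine similarity, then a decision threshold) can be sketched as below. The sketch works on plain lists of floats so that it runs without a model; in practice the vectors would be the sentence embeddings of the question and the user input. The function names and the threshold value of 0.5 are placeholders, not the values used in the project.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def is_on_topic(
    question_vec: Sequence[float],
    answer_vec: Sequence[float],
    threshold: float = 0.5,  # placeholder value, an assumption
) -> bool:
    """Step 2: the answer counts as on topic if similarity reaches the threshold."""
    return cosine_similarity(question_vec, answer_vec) >= threshold
```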

Given the lack of an existing dataset with annotations of on-topic or off-topic conversations, we have crafted 240 Q&A pairs to aid in the evaluation process. This approach ensures that our model is rigorously tested and fine-tuned to maintain users' focus on their learning objectives.
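
One plausible way to use such labeled pairs is to sweep candidate thresholds and keep the one with the best accuracy, as sketched below. This procedure is an assumption for illustration; the write-up does not state how the decision threshold was chosen, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, accuracy) for the rule 'score >= threshold means on topic'.

    scores are cosine similarities; labels are 1 (on topic) or 0 (off topic).
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and the same length")
    # Candidates: every observed score, plus infinity (predict everything off topic).
    candidates = sorted(set(scores)) + [math.inf]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum(1 for s, y in zip(scores, labels) if (s >= t) == (y == 1))
        acc = correct / len(scores)
        if acc > best_acc:  # strict '>' keeps the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc
```

With only a few hundred pairs, a threshold tuned this way should ideally be checked on pairs that were held out from the sweep.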

Grammatical Error Correction Model

After testing multiple LLMs that had Spanish language capabilities, we faced a choice. Multi-language models are much larger and more expensive to train but would allow us to add more languages quickly. Models trained only on Spanish are smaller and more efficient. We chose the smaller Spanish-only model.

Our overall approach required us to identify the actual error instead of just generating a corrected version of the input. We therefore fine-tuned BETO, a Spanish BERT model, on the COWS-L2H dataset to create an NER-style (named entity recognition) model that classifies each token for errors.
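
To illustrate what token-level classification produces, the sketch below turns per-token labels into an annotated sentence. The label names ("O", "GENDER", "NUMBER") and the bracket format are hypothetical, chosen only for this example; the project's actual label set and annotation format are not given in the text.

```python
from typing import Sequence


def annotate(tokens: Sequence[str], labels: Sequence[str]) -> str:
    """Mark each token whose predicted label is not 'O' (no error)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must be the same length")
    marked = []
    for token, label in zip(tokens, labels):
        # Hypothetical format: wrap flagged tokens as [token|ERROR_TYPE].
        marked.append(token if label == "O" else f"[{token}|{label}]")
    return " ".join(marked)


# Example: "bonito" should agree in gender with "casa" (bonita).
example = annotate(["La", "casa", "es", "bonito"], ["O", "O", "O", "GENDER"])
```

Because each token carries its own label, the downstream model can see where the error is and what kind it is, not just a corrected sentence.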

Large Language Model

Conversationally prioritizes conversational fluency and aims to make the app satisfying to use. We employ Mistral 7B, a state-of-the-art generative language model, to keep the app's responses seamless and intuitive. Our primary challenge is fine-tuning this large language model to function like a language tutor: it guides students through the language lesson scripts while providing encouraging feedback and subtle hints. This approach ensures that students can maintain the flow of conversation, facilitating a more engaging and effective learning experience.

The scaffolding learning method supports students in the initial stages of language learning, and that support is gradually removed as students become more proficient in conversations. The key concept behind scaffolding, introduced by psychologist Lev Vygotsky, is the zone of proximal development: the difference between what a learner can do without help and what they can achieve with guidance and encouragement from a skilled partner. In our app, scaffolding indirectly corrects user grammar while keeping the conversation flowing.

Our language model aims to apply the scaffolding learning method through natural language processing (NLP). The challenge is instructing our LLM to respond in a tutor-like manner, with scaffolded hints and clues, while maintaining the conversation flow. We use multi-stage recommender calls to chain small instructions, which together act as a miniature chain of thought for our processes. The benefits are flexibility in chaining function calls and fewer hallucinations thanks to more rigid controls. Our model adapts its responses based on the user's current abilities and understanding, and then gradually reduces the level of assistance as the user's proficiency improves.
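
Chaining small instructions can be sketched as follows. The stage names and prompt wording are invented for illustration and are not the project's actual prompts; the model is represented by any callable that maps a prompt string to a text completion, so the sketch does not depend on a particular LLM API.

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], str]  # any function that maps a prompt to a completion

# Hypothetical stages: each is one small instruction that builds on earlier outputs.
STAGES: List[Tuple[str, str]] = [
    ("assessment", "Say whether this annotated answer contains an error: {annotated}"),
    ("hint", "Given this assessment: {assessment}\nWrite one subtle hint that models the correct form."),
    ("reply", "Using this hint: {hint}\nReply encouragingly, then ask: {next_question}"),
]


def run_chain(llm: LLM, annotated: str, next_question: str) -> Dict[str, str]:
    """Run the stages in order, feeding each stage's output into later prompts."""
    context: Dict[str, str] = {"annotated": annotated, "next_question": next_question}
    for name, template in STAGES:
        # str.format ignores context keys that a template does not use.
        context[name] = llm(template.format(**context))
    return context
```

Keeping each instruction small limits what the model is asked to do in a single call, which is one way to obtain the more rigid controls described above.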

Future Work

Here are the top areas we want to work on for future iterations. 

  • Add more dynamic responses from the large language model (LLM). This will provide a broader range of interaction and make the learning journey more engaging.
  • Support more error types, such as conjugation errors. Currently, we support only two error types (number and gender agreement) because of the limitations of the grammatical error correction dataset.
  • Accept user audio input and voice the chatbot's responses. This feature will help learners improve their pronunciation and make the learning process more immersive.
  • Create a learning curriculum with additional microlessons. This will give users a more structured and in-depth understanding of the language.
  • Incorporate evidence words into our summary explanations, so that explanations are backed by supporting evidence and enhance the learner's understanding.
  • Customize the learning experience according to the learner's language proficiency, gender, and learning goals, making the learning process more tailored and effective.


Our team is grateful to our Capstone professors Joyce Shen and Kira Wetzel. Their knowledge and dedicated support were invaluable to our work. 

In addition, the following individuals contributed significantly to various phases of the project.

  • Mark Butler, NLP Professor at MIDS
  • Dr. Lee Dennig, Linguistics Professor at Stanford
  • Polly Allen, AI Career Boost 
  • Frank Song, Ed Tech Entrepreneur
Last updated: December 14, 2023