MIDS Capstone Project Spring 2025

GenZen

Problem & Motivation 

36% of students seeking a bachelor’s degree have considered withdrawing from their program, with 70% of them citing emotional stress as a key factor (Lumina Foundation-Gallup Study 2022). Mental health challenges among college students have been steadily increasing due to academic pressures, social challenges, and limited timely psychological support. As of 2024, as many as one-third of college students in the United States reported moderate or severe anxiety (Healthy Minds 2024). Despite the growing need, campus mental health resources are often overburdened: students frequently face long wait times before receiving support, with the average delay for an initial counseling appointment reaching 8 days (AUCCCD Survey 2024). This gap in timely support can harm academic performance, deepen feelings of isolation, and stunt professional development, all of which raise the risk of student attrition.

Our Solution 

GenZen is a 24/7 AI mental health assistant offering personalized, empathetic support. It helps students manage stress, provides academic and career guidance, and flags users who need crisis care. GenZen does not replace therapy; instead, it bridges the gap in supporting common college stresses and anxieties. Through the app, we aim to make mental health support more scalable, proactive, and immediate for students during their most transformative years.

Data Source & Data Science Approach

Data Sources

We sourced data from two datasets based on Reddit (an American online forum and social media platform) to train our Suicide & Depression detection classifiers, and from an annotated counseling conversation dataset to train our Mental Health Expert:

Suicide & Depression Dataset

  • Source: Reddit (SuicideWatch, Depression, and Teenagers subreddits, 2009-2021)
  • Size: 348k posts
  • Classes: SuicideWatch, depression, teenagers
  • Purpose: Build a classifier to detect suicidal ideation or depression in user messages.

Depression Severity Level Dataset

  • Source: Reddit, annotated by clinical experts
  • Size: 41.8k posts
  • Classes: minimum, mild, moderate, severe (moderate & severe merged during processing)
  • Purpose: Build a classifier to assess depression severity in user messages.

ESConv Dataset

  • Source: Annotated 15-minute online counseling conversations
  • Size: 1,300 conversations (~13.4k message turns after processing)
  • Purpose: Fine-tune GenZen's responses to be empathetic and conversational, using metadata such as:
    • Counselor strategy (e.g., Affirmation, Question, Suggestion)
    • User-rated empathy
    • User emotion labels (from HarmonyNote Fall 2024 capstone classifier)

Academic & Career Resource Corpus

  • Source: Curated open-source academic and career planning materials by college program experts
  • Purpose: Grounding GenZen’s responses using Retrieval-Augmented Generation (RAG).

Data Processing

Reddit Datasets:

  • Cleaned to remove emojis, usernames, and URLs
  • Minimal syntax alteration to preserve natural language
  • Depression dataset: moderate & severe classes merged for response strategy
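The cleaning step above can be sketched roughly as follows. This is a hypothetical reconstruction, not the project's actual script; the exact regex patterns used are assumptions.

```python
import re

# Hedged sketch of the Reddit post cleaning described above: strip URLs,
# u/usernames, and emoji, while otherwise leaving the natural language intact.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"/?u/[A-Za-z0-9_-]+")
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002700-\U000027BF\u2600-\u26FF]"
)

def clean_post(text: str) -> str:
    text = URL_RE.sub("", text)
    text = USER_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```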

ESConv Dataset:

  • Conversations split into 1 message turn per row
  • Consecutive speaker messages merged
  • User history constructed from prior messages
  • Strategy tags aggregated per turn; first strategy extracted separately
  • User’s emotion was classified using HarmonyNote (92% accuracy from a Fall 2024 capstone)
  • Filtered to high-empathy conversations involving key strategies (OARS model-aligned: Affirmation, Questioning, Suggestion, Restatement)
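The ESConv turn processing above (merging consecutive speaker messages, building user history, and aggregating strategy tags) can be sketched as below. Field names such as "speaker", "content", and "strategy" mirror the ESConv layout but are assumptions here.

```python
# Hedged sketch of the ESConv preprocessing: merge consecutive messages from
# the same speaker, then emit one row per turn with the running history,
# aggregated strategy tags, and the first strategy extracted separately.

def merge_consecutive(turns):
    merged = []
    for t in turns:
        if merged and merged[-1]["speaker"] == t["speaker"]:
            merged[-1]["content"] += " " + t["content"]
            merged[-1]["strategy"] += t.get("strategy", [])
        else:
            merged.append({"speaker": t["speaker"],
                           "content": t["content"],
                           "strategy": list(t.get("strategy", []))})
    return merged

def to_rows(turns):
    rows, history = [], []
    for t in merge_consecutive(turns):
        rows.append({
            "speaker": t["speaker"],
            "content": t["content"],
            "history": " ".join(history),                # prior messages only
            "strategies": t["strategy"],                 # all tags this turn
            "first_strategy": t["strategy"][0] if t["strategy"] else None,
        })
        history.append(t["content"])
    return rows
```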

Data Science Approach

GenZen incorporates the following data science techniques within its main features: 

  • Text classification - Suicide & Depression detection
  • Supervised fine-tuning - Trained on ESConv to respond with empathy and conversational flow
  • RAG - Supplement LLM responses with academic and career resources
  • Agent Infrastructure using LangGraph - Dynamically determines user needs and routes to appropriate tools (e.g., classifier, RAG)

Agent Infrastructure 

When a user submits input, the system first runs the text through the piirahna-v1 model to detect and anonymize personally identifiable information (PII). The anonymized text is then passed to the GPT-4o assistant, which interprets the message and determines whether to respond directly or invoke an agent tool.
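The anonymize-then-route pipeline can be sketched as follows. The PII detector and topic classifier are stubbed stand-ins (in GenZen they are piirahna-v1 and the GPT assistant), and the tool names are placeholders.

```python
# Minimal sketch of the input pipeline: anonymize PII spans, then route the
# message to a tool based on the detected topic. Both callables are assumed.

def anonymize_pii(text, detector):
    """Replace each detected PII span with a typed placeholder like [NAME].

    `detector` is assumed to return (start, end, label) spans; we replace
    from the end of the string so earlier offsets stay valid.
    """
    for start, end, label in sorted(detector(text), reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

def route(message, classify_topic):
    """Toy router: map the detected topic to the agent tool to invoke."""
    topic = classify_topic(message)
    if topic in {"suicide", "depression"}:
        return "suicide_depression_tool"
    if topic in {"stress", "anxiety"}:
        return "mental_health_expert_tool"
    if topic in {"academics", "career"}:
        return "rag_tool"
    return "respond_directly"
```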

Suicide & Depression Detection Flow 

For each turn, the assistant invokes the Suicide & Depression Detection tool:

  • Initial Model: modernBERT
    • Predicts whether the user’s message indicates suicidal ideation, depression, or a miscellaneous topic.

  • Secondary Model: Ensemble: Avg Proba
    • Triggered only if depression is detected; an ensemble of mental-BERT, mental-roBERTa, and modernBERT models further classifies severity into minimum, mild, or moderate/severe.
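The ensemble's averaging step can be sketched as below; the model objects here are hypothetical stand-ins for the three BERT variants, each assumed to return a probability distribution over the severity classes.

```python
# Sketch of the "average probability" ensemble: average each base model's
# predicted class probabilities, then take the argmax as the severity label.

CLASSES = ["minimum", "mild", "moderate/severe"]

def ensemble_avg_proba(models, text):
    probs = [m(text) for m in models]  # each: list of probabilities per class
    avg = [sum(p[i] for p in probs) / len(probs) for i in range(len(CLASSES))]
    label = CLASSES[max(range(len(CLASSES)), key=avg.__getitem__)]
    return label, avg
```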

Response logic based on prediction:

  • No sensitive topic or minimum depression → Continue conversation or invoke another tool.
  • Mild depression → Send an affirming message encouraging the user to seek mental health support, with links to resources.
  • Moderate/severe depression or suicidal ideation → Send an urgent support message with crisis resources, including professional help guidance and the suicide hotline.
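The response logic above reduces to a small decision function; this is a hedged sketch where the returned action names are placeholders for the actual response templates.

```python
# Sketch of the response logic: map the detection and severity predictions
# to one of three response tiers described above.

def respond(detection, severity=None):
    if detection == "suicide" or severity == "moderate/severe":
        return "urgent_support"            # crisis resources + suicide hotline
    if severity == "mild":
        return "affirming_with_resources"  # encourage seeking support, links
    return "continue_conversation"         # no sensitive topic or minimum severity
```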

Mental Health Expert Tool

If the user displays stress or anxiety, the assistant calls the Mental Health Expert tool, powered by a fine-tuned DeepSeek model:

  • Internally infers the user's:
    • Emotion type and intensity
    • Problem type
    • Counseling strategy (e.g., affirmation, questioning, restatement, suggestion)
  • Generates an empathetic response based on the above reasoning.

If the user opts not to receive suggestions, the DeepSeek prompt is modified to focus solely on affirmation, questioning, and restatement.

If the strategy selected includes providing suggestions, the RAG (Retrieval-Augmented Generation) pipeline is triggered:

  • Relevant academic or mental health resources are retrieved from the vector database
  • The documents are passed as context for the agent’s final response

Contextual RAG Retrieval (Inspired by Anthropic) 

To improve the relevance of retrieved documents in GenZen, we implemented a Contextual RAG method based on Anthropic’s approach. Here's how we employed it:

1. Adding Context to Chunks

Using LLaMA 3.1-8B-Instruct, each chunk is paired with a summary based on the full document. This ensures that chunks carry meaningful context, not just isolated information, helping to improve search.
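This contextualization step can be sketched as follows. The LLM call is stubbed (in GenZen it is LLaMA 3.1-8B-Instruct), and the prompt wording is an assumption modeled on Anthropic's published contextual-retrieval approach.

```python
# Hedged sketch of contextual chunking: ask an LLM to situate each chunk
# within its full document, then prepend that context before embedding.

PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Give a short context situating this chunk within the overall document."
)

def contextualize(document, chunks, llm):
    out = []
    for chunk in chunks:
        context = llm(PROMPT.format(document=document, chunk=chunk))
        out.append(f"{context}\n{chunk}")  # context-prefixed chunk to embed
    return out
```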

2. Dual Embedding Strategy

Chunks are embedded with BAAI/bge-large-en-v1.5 and stored in a Qdrant vector database. At the same time, TF-IDF encodings are created to capture keyword importance and term frequency.

3. Two Search Methods, One Result

When a user submits a query, we run in parallel:

  • Vector search (cosine similarity via Qdrant)
  • BM25 search (TF-IDF-based relevance)

Afterward, we merge, deduplicate, and rank the top N results.
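The writeup does not specify the fusion method, so as an illustration here is one common choice, reciprocal rank fusion (RRF): fuse the two rankings, deduplicate by document id, and keep the top N.

```python
# Hypothetical merge step for the hybrid search: RRF over the vector and
# BM25 rankings. Documents appearing in both lists accumulate score.

def rrf_merge(vector_hits, bm25_hits, n=5, k=60):
    scores = {}
    for hits in (vector_hits, bm25_hits):   # each: ranked list of doc ids
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```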

4. Reranking for Precision

Finally, Cohere’s Rerank-English-v3.0 selects the most relevant K chunks, which are sent to the assistant as grounding context.

Scalable & Secure AWS Technical Architecture Overview

GenZen is built with a layered AWS architecture that ensures scalability, security, and speed.

Traffic Engineering

User requests flow through a multi-layered infrastructure: DNS resolution via Route53, TLS termination at AWS Load Balancer, and traffic management through Istio Gateway into our EKS cluster. This zero-trust architecture implements mTLS between services while the Next.js frontend handles API aggregation through dynamic route handlers that proxy requests to internal services.

Backend Microservices Architecture

Our Python-based backend leverages FastAPI for high-performance asynchronous API handling with comprehensive type safety. The system integrates:

  • Redis for distributed session management
  • Qdrant for vector embeddings storage for cosine similarity search
  • PostgreSQL with async connection pools for persistent storage
  • SageMaker endpoints running DeepSeek, the ML classification models, and a PII detection model
  • LangGraph for orchestrating multi-stage reasoning flows with hybrid RAG integration

Infrastructure & CI/CD

Services are containerized with multi-stage Docker builds, stored in ECR, and deployed via GitOps with Kustomize overlays for environment-specific configurations. The EKS cluster implements HPA based on custom Prometheus metrics. All persistent storage uses PostgreSQL with AES-256 encryption and IAM-based access controls.

Evaluation

Suicide & Depression Detection


As the base models’ architectures increase in complexity, the test metrics improve. Although the boosted-trees ensemble model has the best results on most metrics, the improvement over modernBERT is marginal. Running all base models plus the boosted-trees model would increase inference time for little gain, so modernBERT was selected for the initial detection of suicide and depression.

The confusion matrix shows that 94% of modernBERT’s misclassifications come from confusing depression with suicide. The consequence of this misclassification is not too severe: the user might receive a response with a different level of urgency but would still receive mental health resources.

Depression Severity Level

In contrast to the results above, ensembling the base models demonstrates noticeable gains: averaging the predicted probabilities of all base models yields a 2% increase across all metrics. Since the base models’ metric scores are lower here, improving performance was a priority, and the longer latency of running mental-BERT, mental-roBERTa, and modernBERT together for more accurate predictions seems justified.

The trend for misclassified text is similar to the suicide and depression classifier: most misclassifications come from confusing moderate/severe depression with mild depression. The consequence is likewise limited; the user would receive extra resources that might not be as relevant but would still point them toward addressing their mental health concerns.

Mental Health Tool (Fine-tuned DeepSeek)

To assess the performance of our fine-tuned DeepSeek model in generating empathetic responses, we adopted an LLM-as-a-judge evaluation method, inspired by Tobias Cabanski’s Assessing Mental Health Responses with AI. Using the same prompts and scoring criteria, we evaluated model outputs on three dimensions - empathy, appropriateness, and relevance - each rated on a scale from 1 to 5 with clearly defined descriptors for each level. To ensure evaluation diversity, we used the following LLMs as judges:

  • DeepSeek R1 Distilled Llama 8B
  • DeepSeek R1 Distilled Qwen 7B
  • Mistral 7B v0.3

Each model independently scored the DeepSeek outputs, and we averaged their ratings to compute final scores for each metric.
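The aggregation is a simple per-metric mean over the judges, sketched below; the dict layout is an assumption.

```python
# Sketch of the LLM-as-a-judge aggregation: each judge scores empathy,
# appropriateness, and relevance from 1 to 5; the final score per metric
# is the mean across judges.

METRICS = ("empathy", "appropriateness", "relevance")

def aggregate(judge_scores):
    """judge_scores: list of {metric: score} dicts, one per judge model."""
    return {m: sum(s[m] for s in judge_scores) / len(judge_scores)
            for m in METRICS}
```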

The fine-tuned DeepSeek model performed consistently well in generating appropriate and relevant responses, often scoring above 4 in those metrics. For empathy, it generally scored around a 4, with some variability in the upper and lower quartiles. Manual reviews of lower-empathy examples revealed a pattern: When empathy scores dipped, it was usually because the model provided advice without first acknowledging the user's emotional state. To address this, we built logic into the pipeline - if the DeepSeek model determines the best course of action is to provide suggestions, the assistant will defer to the RAG tool to enhance its response.

Key Learnings & Impact

Technical Insights

  • Deploying multiple models with similar inference structures under one endpoint significantly reduces inference costs.
  • For improved DeepSeek responses, separating historical conversation context from the current message enhanced performance. Additionally, increasing the rank and adjusting the lora_alpha during fine-tuning helped reduce loss and improve model stability.
  • Qdrant was favored over Postgres for vector storage due to its ease of use. However, future migrations to AWS RDS would improve scalability by offloading vector store initialization from each pod.
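As an illustration of the fine-tuning note above, a higher rank and matching lora_alpha would be set in a PEFT LoraConfig as below. The specific values and target modules here are hypothetical; the writeup does not state the ones used.

```python
# Hypothetical LoRA configuration sketch: raising r (rank) and lora_alpha,
# as described in the insight above. Values are illustrative only.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                                 # rank, raised from a smaller value
    lora_alpha=64,                        # scaling, adjusted alongside the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```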

Engineering Challenges & Solutions

  • Moving from local development to a scalable, cloud-based system required a gradual, layered approach. Starting simple and scaling complexity worked best.
  • FastAPI and Next.js integration presented challenges in managing environment variables. We had to refactor the Docker setup to pass secrets securely and support proxy-based routing in production.
  • When building agent tools, designing them as asynchronous functions from the start ensures smoother integration with modern web frameworks and enables better performance within the app.

Research Takeaways

  • Our EDA on the ESConv dataset showed that all users, regardless of perceived empathy, experienced reduced emotional intensity post-session. The greater the empathy and intensity, the bigger the emotional improvement. This insight affirms GenZen’s potential to positively impact student mental health while easing the burden on campus counseling centers.
  • Combining BM25 with semantic search (via Qdrant) in the RAG pipeline helped the model retrieve the knowledge base more accurately than a vector-only search.

Team Workflow

  • Dividing research and engineering tasks early helped us efficiently build ML classifiers, fine-tune DeepSeek, and implement contextual RAG.
  • Regular Slack communication, weekly standups, and focused working sessions helped keep us aligned and minimized scope creep.
  • DevOps practices like structured GitHub workflows were key to collaboration.

Acknowledgements 

We would like to acknowledge and thank our capstone instructors, Cornelia Paulik and Ramesh Sarukkai, for providing continued feedback and project scoping. Additionally, we would like to acknowledge and thank our mental health subject matter experts, Nikkolson Ang and Cassandra Aguirre, who assessed the ethics of our product and helped develop our featured mental-wellness resources.  

Last updated: April 30, 2025