MIDS Capstone Project Fall 2025

IntegraMS: A Multimodal Clinical Intelligence Platform Anchored in Proteomics for Predicting MS Progression

Introduction

Multiple Sclerosis (MS) is a chronic autoimmune disorder that affects nearly one million people in the US, with prevalence continuing to rise. In MS, the immune system, particularly antibody-producing B cells, targets myelin, the protective sheath insulating nerve fibers in the central nervous system. Damage to myelin disrupts normal nerve signaling and contributes to a wide variety of symptoms, including fatigue, numbness, balance issues, and cognitive changes that can impact daily functioning and long-term independence. Although B-cell-depleting therapies have transformed MS treatment in recent years, understanding and anticipating disease progression in individual patients remains a major clinical challenge because the disease course is variable and unpredictable.

As a result, MS researchers generate diverse data to better understand the biology of MS. While these data sources offer valuable insight in isolation, they are often analyzed separately from clinical records, which limits their value for characterizing patient disease trajectories over time. Few unified workflows currently integrate research-grade molecular data with clinical information using interpretable and reproducible analytical methods. IntegraMS was developed to close this gap. It is a cloud-based dashboard and modeling framework that integrates proteomic data and clinical metadata to help researchers model disability trajectories and examine immune-reactive proteins through a structured, repeatable pipeline.

Problem & Motivation

Multiple Sclerosis (MS) diagnosis and management are multifactorial, yet real-world research workflows often resemble a patchwork of disconnected systems. One increasingly important tool in immunology research and drug discovery is phage display–based antibody profiling, a high-throughput technique that enables rapid, unbiased screening of patient antibodies against large “libraries” of proteins. These experiments give a detailed view of immune activity at scale by revealing which specific proteins—or patterns of proteins—are recognized by patient antibodies. In parallel, circulating biomarkers such as neurofilament light chain (NfL) are increasingly measured in blood as a proxy for neuronal injury.

In a recent study, Zamecnik, Sowa et al. (2024) used phage display to screen antibodies from patients with MS and demographically matched healthy controls against a library spanning the entire human proteome. The authors identified a subset of patients with MS whose antibodies were enriched for a specific protein pattern detectable in blood samples collected up to five years before symptom onset. This work was an important advance, demonstrating that antibody-based proteomic signatures can reflect disease-associated immune activity well before clinical presentation. Although the study cohort included clinical information, the relationship between these proteomic patterns and downstream clinical outcomes such as disability progression was not explored.

IntegraMS began as an effort to systematically link these high-dimensional proteomic measurements with available clinical data from this study cohort. As this work progressed, it became clear that the challenges encountered—including data harmonization, time alignment, interpretability, and scalability—were not unique to a single study, but instead reflected a broader field-wide gap between cutting-edge proteomic research and clinically relevant insights.

A bird's-eye view:

  • Phage display experiments tell us how patient antibodies from before and after diagnosis react to 500,000+ peptides
  • Neurofilament light chain (NfL) levels tell us how neuronal injury changes from before to after diagnosis
  • Clinical information includes demographics, symptom history, and a Disability Status Scale (DSS) score
    • DSS is the gold standard used in MS care to measure disability progression

Some challenges:

  1. Harmonizing these siloed data streams
  2. Interpreting the high-dimensional phage display data
  3. Aligning measurements consistently across disease milestones
  4. Connecting immune fingerprints to clinical outcomes without a standardized pipeline

The result: fragmented, slow analysis and difficulty translating research findings across cohorts and institutions.

IntegraMS addresses these challenges by providing a cloud-based dashboard and modeling framework that integrates proteomic and clinical data within a structured, reproducible workflow. By combining peptide-level filtering, clinical timeline alignment, cohort construction, and interpretable modeling, the platform enables clinician-scientists to explore relationships between immune activity and disability trajectories in consented research populations. IntegraMS is designed to bridge the gap between research-grade molecular data and clinically relevant insight—supporting discovery, hypothesis generation, and translational research.

Data Source & Preprocessing

The core dataset for IntegraMS originates from the Department of Defense Serum Repository (DoDSR), as described in Zamecnik, Sowa et al. (2024). The cohort includes de-identified clinical and proteomic data for 250 MS patients and demographically matched healthy controls. Each patient has phage display proteomics measured at two timepoints, producing roughly 500,000 raw peptide signals per sample per timepoint. The team initially explored the Mann–Whitney test and principal component analysis (PCA)-based selection for identifying enriched peptides. The final feature engineering pipeline instead uses a filtering strategy grounded in biological thresholds and curated criteria, and it does not incorporate the Mann–Whitney test.

Clinical metadata includes age, sex, race, disease onset year (D_Onset), diagnosis year (D_Dx), DSS score history, and clinical progression timelines. Because measurements occur at non-uniform points in the disease course, carefully aligning years relative to diagnosis and disability progression was a central preprocessing challenge.
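As a minimal sketch of what this alignment step can look like, the example below re-indexes a patient's DSS history by years since diagnosis and computes the onset-to-diagnosis delay. The `Visit` structure and both function names are hypothetical and used only for illustration; `D_Onset` and `D_Dx` follow the field names above, and calendar-year resolution is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    year: int  # calendar year of the DSS assessment
    dss: int   # DSS score recorded at that visit (0-10)


def align_to_diagnosis(visits, d_dx):
    """Map each visit to years since diagnosis (negative = before diagnosis)."""
    aligned = {}
    for v in sorted(visits, key=lambda v: v.year):
        # If several visits share a year, the one listed last in the input wins.
        aligned[v.year - d_dx] = v.dss
    return aligned


def onset_to_diagnosis_delay(d_onset, d_dx):
    """Years from symptom onset to diagnosis; None if the years are inconsistent."""
    delay = d_dx - d_onset
    return delay if delay >= 0 else None


visits = [Visit(2012, 2), Visit(2009, 1), Visit(2015, 3)]
timeline = align_to_diagnosis(visits, d_dx=2010)
print(timeline)                              # {-1: 1, 2: 2, 5: 3}
print(onset_to_diagnosis_delay(2008, 2010))  # 2
```

Indexing by offset rather than calendar year lets patients diagnosed in different years be compared at the same point in their disease course.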

To expand the dataset, the team generated 1,020 synthetic patient records using a Hierarchical Modeling Algorithm (HMA) synthesizer. After applying a quality control requirement ensuring availability of DSS three years post-diagnosis, 930 synthetic records remained. These were split into training and testing sets; the training set contained 744 samples.
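The quality-control filter and split can be sketched as follows. This is an illustration rather than the project's code. It assumes each record carries a mapping from years since diagnosis to DSS (called `aligned_dss` here, a hypothetical name) and an 80% training fraction. The source does not state the split ratio, although 744 of 930 is exactly 80%.

```python
import random


def has_dss_at(aligned_dss, years_post_dx=3):
    """QC rule: keep a record only if a DSS value exists at the given offset."""
    return aligned_dss.get(years_post_dx) is not None


def qc_and_split(records, train_fraction=0.8, seed=0):
    """Apply the QC rule, then shuffle and split into train and test sets."""
    kept = [r for r in records if has_dss_at(r["aligned_dss"])]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(kept)
    n_train = round(len(kept) * train_fraction)
    return kept[:n_train], kept[n_train:]


# Toy cohort: every 10th record lacks a DSS value at year +3.
records = [
    {"id": i, "aligned_dss": {0: 1} if i % 10 == 0 else {0: 1, 3: 2}}
    for i in range(100)
]
train, test = qc_and_split(records)
print(len(train), len(test))  # 72 18
```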

To improve demographic representation, the training dataset was augmented using SMOTE (Synthetic Minority Over-sampling Technique), oversampling underrepresented gender and race groups until each reached 30% of the majority class.

  • Original training samples: 744
  • Balanced training samples: 1,107
  • New synthetic samples created: 363
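The counts above are consistent: 744 + 363 = 1,107. To illustrate the idea behind SMOTE, the sketch below computes how many samples a minority group needs to reach a target ratio, then creates each new point by interpolating between an existing minority sample and its nearest neighbor. It is a simplified stand-in, not the implementation used in the project: standard SMOTE picks among several nearest neighbors, while this version uses only the single closest one. All function names and the toy data are hypothetical.

```python
import math
import random


def n_synthetic_needed(n_minority, n_majority, target_ratio=0.30):
    """Samples to add so the minority reaches target_ratio * majority size."""
    return max(0, math.ceil(target_ratio * n_majority) - n_minority)


def nearest_neighbor(point, others):
    """Closest other point by Euclidean distance."""
    return min(others, key=lambda o: math.dist(point, o))


def smote_like(minority, n_new, seed=0):
    """Create n_new points, each on the segment from a sample to its nearest neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbor = nearest_neighbor(base, [m for m in minority if m is not base])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


minority = [(30.0, 2.0), (34.0, 3.0), (41.0, 2.0)]  # toy (age, DSS) pairs
n_new = n_synthetic_needed(n_minority=len(minority), n_majority=20)
new_points = smote_like(minority, n_new)
print(n_new, len(new_points))  # 3 3
```

Because every synthetic point lies between two real minority samples, the new records stay inside the range of observed feature values.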

This combination of the DoDSR dataset, curated peptide features, and demographically balanced synthetic data provides a unified foundation for downstream modeling.

The final feature space used in the model includes:

  • Demographics (age, sex, race)
  • Disease duration and milestone years (onset, diagnosis, current year)
  • Longitudinal DSS information aligned to diagnosis
  • Engineered peptide-level features representing top-ranked proteomic signals

Together, these integrated data sources form a cohesive dataset for predicting MS disability progression and analyzing proteomic–clinical relationships.

Data Science Approach

The IntegraMS modeling pipeline focuses on a single primary prediction target: the current Disability Status Scale score (DSS_Cur) on its 0–10 integer scale. To generate these predictions, the model incorporates demographic variables, disease timeline features such as time from symptom onset to diagnosis and time since diagnosis, aligned DSS history, and a curated peptide-level feature set derived from phage display proteomics. By integrating these diverse inputs, the system is able to model disability status in a way that reflects both clinical progression and underlying proteomic signatures.

Beyond point prediction of current disability, IntegraMS also implements extrapolation techniques to forecast disability progression over time. A key component of this is a survival-based model parameterized with the Weibull_min distribution, which estimates the time until patients reach clinically meaningful disability thresholds such as DSS6 or DSS8. We chose the Weibull distribution because it is widely used in medical progression modeling and naturally captures how disease evolves over time, including accelerating, decelerating, or constant hazard patterns. In parallel, an ensemble progression module averages predictions from exponential, logistic, and power-law progression curves, providing a flexible family of trajectories that can represent both steady and abrupt clinical changes. Together, these methods allow the system not only to estimate current status, but also to project plausible future trajectories under multiple progression dynamics.
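The two forecasting components can be sketched in a few lines. The Weibull functions below use the shape/scale parameterization (the same one as SciPy's `weibull_min` with location 0). The three progression curves, their functional forms, and every parameter value are assumptions chosen for illustration, since the fitted forms used in IntegraMS are not given here.

```python
import math


def weibull_survival(t, shape, scale):
    """P(threshold not yet reached by time t) under a Weibull time-to-event model."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous risk at t > 0: rising if shape > 1, falling if < 1, flat if == 1."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_median_time(shape, scale):
    """Time by which half of patients are expected to have reached the threshold."""
    return scale * math.log(2) ** (1 / shape)


def exponential_curve(t, dss0, rate):
    return dss0 * math.exp(rate * t)


def logistic_curve(t, midpoint, steepness, ceiling=10.0):
    return ceiling / (1 + math.exp(-steepness * (t - midpoint)))


def power_curve(t, dss0, coef, exponent):
    return dss0 + coef * t ** exponent


def ensemble_dss(t, dss0=2.0):
    """Average the three curves and clip the result to the 0-10 DSS range."""
    preds = [
        exponential_curve(t, dss0, rate=0.05),
        logistic_curve(t, midpoint=15.0, steepness=0.2),
        power_curve(t, dss0, coef=0.3, exponent=0.8),
    ]
    return min(10.0, max(0.0, sum(preds) / len(preds)))


# Hypothetical parameters for time (years since diagnosis) to reach DSS 6.
shape, scale = 1.5, 20.0
print(round(weibull_survival(10, shape, scale), 3))
print(round(weibull_median_time(shape, scale), 1))
print([round(ensemble_dss(t), 2) for t in (0, 5, 10, 20)])
```

A single shape parameter is what lets the Weibull model express accelerating, decelerating, or constant hazard. Averaging curves with different growth behavior keeps the forecast from committing to one progression pattern.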

To enable robust learning, the pipeline uses a combined dataset consisting of real DoDSR patient records and a large synthetic cohort, with demographic distributions balanced through SMOTE. This blended dataset helps the model capture a broader range of clinical and demographic variation. The modeling approach emphasizes interpretability, ensuring that researchers can understand how different features—both clinical and proteomic—as well as the parameters of the survival and trajectory models, contribute to disability prediction and progression forecasts, while still taking advantage of modern machine-learning methods capable of handling high-dimensional data.

Developing this model required overcoming several key challenges. One major hurdle was the extreme dimensionality and noise inherent in raw proteomic data, which necessitated careful filtering and feature engineering to distill meaningful biological signals. Another challenge was the inconsistent timing of clinical measurements, requiring rigorous temporal alignment to maintain a coherent disease timeline. The coarse, integer-based nature of DSS introduced additional modeling difficulty, as did the limited size of the real-world patient cohort used for validation. The current model configuration addresses these issues through targeted preprocessing, synthetic data augmentation, a carefully structured train/test split, and survival/curve-fitting procedures that are robust to sparse and irregularly spaced longitudinal data.

Looking forward, several planned extensions aim to deepen the clinical usefulness and predictive strength of the system. Future iterations may incorporate richer survival or time-to-event modeling to refine estimates of when patients are likely to reach critical disability milestones such as DSS6 (when a patient requires a cane or walking assistance) or DSS8 (when a patient is wheelchair bound), and to quantify uncertainty around those forecasts. Additional work will include systematic hyperparameter tuning, exploration of feature interactions to capture nuanced relationships between demographic and clinical factors, and grouping peptides into biologically meaningful pathways to enhance interpretability and clinical relevance. Together with the survival-based Weibull_min modeling and the exponential/logistic/power ensemble of progression curves, these enhancements will help evolve IntegraMS into an even more powerful tool for understanding and forecasting MS progression.

System & Architecture

IntegraMS is implemented as a cloud-native biomedical research platform using AWS and Firebase. The frontend, built with React and Tailwind, serves as the main interface for researchers. Through it, users can upload patient-level data, inspect curated peptide feature summaries, and view model-generated DSS predictions.

The backend is written in Node.js and is responsible for authentication, data validation, preprocessing, and coordination with AWS. Firebase Authentication secures user access, and the backend enforces schema validation for uploaded CSVs containing clinical and proteomic information. Once validated, data are stored and prepared for model inference.
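Schema validation for an uploaded CSV might look like the sketch below. The production backend is Node.js, so this Python version only illustrates the logic. The required column list and the consistency rule are assumptions: `D_Onset` and `D_Dx` come from the metadata described earlier, while `patient_id` and `age` are hypothetical column names.

```python
import csv
import io

# Hypothetical schema: column name -> parser that raises on bad input.
REQUIRED_COLUMNS = {
    "patient_id": str,
    "age": int,
    "D_Onset": int,
    "D_Dx": int,
}


def validate_csv(text):
    """Return (valid_rows, errors) for an uploaded CSV given as a string."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [], [f"missing columns: {', '.join(missing)}"]

    rows, errors = [], []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        try:
            row = {col: parse(raw[col]) for col, parse in REQUIRED_COLUMNS.items()}
        except (TypeError, ValueError):
            errors.append(f"line {line_no}: missing or non-numeric value")
            continue
        if row["D_Dx"] < row["D_Onset"]:
            # Illustrative consistency rule, not a documented project rule.
            errors.append(f"line {line_no}: diagnosis year precedes onset year")
        else:
            rows.append(row)
    return rows, errors


sample = (
    "patient_id,age,D_Onset,D_Dx\n"
    "p1,34,2008,2010\n"
    "p2,41,2012,2011\n"
    "p3,x,2015,2016\n"
)
rows, errors = validate_csv(sample)
print(len(rows), errors)
```

Returning errors alongside valid rows, instead of failing on the first problem, lets the dashboard report every bad line in one pass.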

Model inference is containerized and deployed on AWS ECS. Model images are stored on ECR, while SQS manages inference job queues and CloudWatch provides monitoring and logging. Raw and processed data—including proteomic inputs, model outputs, and intermediate artifacts—are stored in S3. Processed predictions and associated metadata are surfaced via Firestore to support responsive, queryable views in the dashboard.

The minimum viable product is designed to move researchers efficiently from raw data to model prediction:

  1. Upload a CSV file containing proteomic and clinical features.
  2. Allow the backend to validate, parse, and preprocess the input.
  3. Trigger the ECS-based inference service to generate DSS_Cur predictions.
  4. Display these predictions, along with key contextual information, in an integrated dashboard.
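The four steps above can be sketched end to end with in-memory stand-ins. Here `queue.Queue` plays the role of the SQS job queue and a plain dictionary plays the role of the Firestore results store. The placeholder model, the `years_since_dx` field, and the job message layout are all hypothetical; the real inference service runs in a container on ECS.

```python
import json
import queue

job_queue = queue.Queue()  # in-memory stand-in for the SQS inference queue
results = {}               # in-memory stand-in for the Firestore results store


def enqueue_job(job_id, rows):
    """Steps 1-2: after validation, serialize the job and place it on the queue."""
    job_queue.put(json.dumps({"job_id": job_id, "rows": rows}))


def placeholder_model(row):
    """Stand-in for the containerized model; NOT the project's predictor."""
    return min(10, max(0, round(0.2 * row["years_since_dx"])))


def run_worker():
    """Step 3: drain the queue, run inference, and store the predictions."""
    while not job_queue.empty():
        job = json.loads(job_queue.get())
        results[job["job_id"]] = [
            {"patient_id": r["patient_id"], "DSS_Cur_pred": placeholder_model(r)}
            for r in job["rows"]
        ]
        job_queue.task_done()


enqueue_job("job-1", [
    {"patient_id": "p1", "years_since_dx": 5},
    {"patient_id": "p2", "years_since_dx": 20},
])
run_worker()
print(results["job-1"])  # step 4: what the dashboard would read
```

Putting a queue between upload and inference means a slow model run never blocks the request that submitted it.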

This architecture provides a reproducible, scalable environment for running MS disability prediction models on proteomic and clinical data, and it lays the groundwork for future expansion to additional cohorts, features, and prediction tasks.

Key Learnings & Impact

In building IntegraMS end to end, the team learned that synthetic data plays a critical role in strengthening models trained on small, specialized clinical cohorts. Aggressive yet biologically grounded peptide filtering is essential to transform hundreds of thousands of raw proteomic signals into a feature space that ML models can learn from reliably. The DSS outcome variable, while clinically useful, poses modeling challenges due to its coarse, integer-based scale and its collection at irregular intervals, making temporal alignment indispensable.

Interactions with clinicians and domain experts emphasized the importance of interpretability and biological plausibility, not just numerical performance. These insights shaped how features were engineered and how the pipeline was designed and documented. Finally, transitioning from exploratory notebooks to a fully deployed cloud system underscored the importance of robust infrastructure—containerization, monitoring, data management, and a usable front-end.

Taken together, these components make IntegraMS a practical, extensible platform for investigating MS disability progression through the lens of proteomics and clinical data, and a foundation for future work in precision neurology.

Achievements & Next Steps 

Unifying Data. Forecasting Progression. Empowering Care. We have bridged the gap between scarce data and actionable intelligence. By fusing automated phage display analysis with clinical forecasting, the IntegraMS platform delivers a comprehensive view of Multiple Sclerosis, predicting long-term DSS trajectories while identifying the top molecular drivers unique to each patient.

Next-Generation Clinical Decision Support. We are now scaling this foundation into a fully AI-driven ecosystem. Future iterations will use generative AI to provide instant clinical summaries, along with a retrieval-augmented generation (RAG) engine that connects molecular findings to the latest research. Our goal is to move beyond simple analysis and create a dynamic, self-improving decision-support system for the next era of MS diagnostics.

Acknowledgements and References

We thank Dr. Michael Wilson and the team at UCSF Neurology for their guidance and for providing access to the primary dataset. We also thank Dr. Mitchell Wallin at the Department of Veterans Affairs for his guidance and advice on interpreting the clinical data. We thank Joyce Shen and Korin Reid for their guidance and expertise throughout this project.

Zamecnik, C.R., Sowa, G.M., Abdelhak, A. et al. An autoantibody signature predictive for multiple sclerosis. Nat Med 30, 1300–1308 (2024). https://doi.org/10.1038/s41591-024-02938-3 

Multiple sclerosis, Mayo Clinic. https://www.mayoclinic.org/diseases-conditions/multiple-sclerosis/symptoms-causes/syc-20350269 

How Many People Live With Multiple Sclerosis?, National MS Society. https://www.nationalmssociety.org/understanding-ms/what-is-ms/who-gets-ms/how-many-people 

Greenfield, A.L., Hauser, S.L. B-cell therapy for multiple sclerosis: Entering an era. Ann Neurol 83(1), 13–26 (2018). https://doi.org/10.1002/ana.25119

Last updated: December 19, 2025