MIDS Capstone Project Spring 2024


Problem & Motivation

Biomedical researchers face the daunting task of identifying which genes to target for disease treatment among the roughly 20,000 genes in the human genome. Traditionally, this process has been manual: researchers read through countless abstracts in databases like PubMed, which is slow and inefficient. PathoGeneMap automates the extraction of pertinent experiment data from research abstracts, accelerating the research process and reducing the workload on scientists.

Data Source & Data Science Approach

The challenge of automating data extraction from biomedical literature lies in the complexity of the language used to describe experiments. Many different expressions can describe identical experiments, which makes automated classification and extraction difficult. Until recently, large language models (LLMs), the technology capable of understanding such nuances, were neither sophisticated enough nor widely accessible. Additional barriers include a lack of labeled training data and a lack of funding for a large-scale resource for the research community.

To tackle these challenges, we initially considered using GPT-4, which achieved an F1 score of 0.84 on test data in a zero-shot setting. However, the estimated cost of processing 23 million abstracts, $170,000, was prohibitive. Instead, we designed a more cost-effective strategy using SciFive, a smaller T5-based model pretrained on biomedical text. Our approach involved developing a novel dataset of labeled abstracts, a task that required extensive manual annotation by our domain expert. This dataset not only provides the training foundation for our model but also adds significant value to the biomedical research community by offering new insights into gene perturbation studies.
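For readers less familiar with the metric, the F1 score quoted above is the harmonic mean of precision and recall for the positive class. The following is a minimal sketch of that calculation; the function name and the toy labels are illustrative assumptions and are not taken from the project's data or code.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute the F1 score (harmonic mean of precision and recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = abstract describes a relevant experiment, 0 = it does not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

toy_f1 = f1_score(y_true, y_pred)
print(f"F1 on toy labels: {toy_f1:.2f}")
```

With these toy labels there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and so is the F1 score.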


Our fine-tuned SciFive model outperformed GPT-4, achieving an F1 score of 0.94 on test data compared with 0.84, which validated its effectiveness in classifying relevant biomedical research abstracts. The challenge then was to scale this solution to handle the 23 million-article PubMed corpus efficiently. Through optimized model architecture and cost-effective computing strategies, we processed all articles for a total of only $820, a fraction of the initial estimate.
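To put the two cost figures side by side, the short calculation below derives a per-abstract cost and the overall savings factor. It uses only the numbers quoted in this write-up ($170,000 estimated for GPT-4, $820 actual for the fine-tuned model, 23 million abstracts); the variable names are illustrative, and the GPT-4 figure is an estimate rather than a measured spend.

```python
# Figures quoted in the text above.
N_ABSTRACTS = 23_000_000
GPT4_ESTIMATE_USD = 170_000   # estimated, never actually spent
SCIFIVE_TOTAL_USD = 820       # reported total for the fine-tuned model


def cost_per_abstract(total_usd, n_abstracts):
    """Average cost in US dollars of processing a single abstract."""
    return total_usd / n_abstracts


gpt4_unit = cost_per_abstract(GPT4_ESTIMATE_USD, N_ABSTRACTS)
scifive_unit = cost_per_abstract(SCIFIVE_TOTAL_USD, N_ABSTRACTS)
savings_factor = GPT4_ESTIMATE_USD / SCIFIVE_TOTAL_USD

print(f"GPT-4 estimate:  ${gpt4_unit:.6f} per abstract")
print(f"SciFive actual:  ${scifive_unit:.6f} per abstract")
print(f"Savings factor:  about {savings_factor:.0f}x")
```

On these figures the fine-tuned model comes out roughly 207 times cheaper than the GPT-4 estimate, at well under one hundredth of a cent per abstract.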

After processing the entire PubMed corpus, we undertook rigorous post-processing steps. We performed detailed sanity checks to ensure data accuracy, consolidated causal information from the abstracts, and updated the metadata. This led to the launch of our website, which condenses years of manual research into minutes of automated search and gives researchers easy access to essential gene perturbation experiments.

Moving forward, we are dedicated to the continuous improvement and expansion of PathoGeneMap. Our aim is to further enhance its features and database so that it becomes a vital tool for biomedical researchers worldwide. By nurturing its growth, we strive to sustain and increase PathoGeneMap’s impact, contributing to the progression of medical research and treatment methods.

Key Learnings & Impact

PathoGeneMap has demonstrated that with targeted data science strategies, it is possible to transform the accessibility and usability of biomedical research data. The platform significantly reduces the time taken to identify relevant studies, thereby accelerating the pace of research and development in the field. The broader impact of this tool is substantial, with potential benefits for hundreds of thousands of researchers globally.


This project was a collaborative effort involving team members Almicia, Andrew, Christian, Max, and Wesley, whose diverse expertise drove the development of PathoGeneMap. We would like to thank our fellow classmates and course instructors for their valuable feedback and continued support. We are grateful to the broader scientific community, whose ongoing research contributions have been integral to our data compilation. Biomedical research abstracts were obtained from PubMed. We used SciFive as the base model for our fine-tuned model.

Last updated: April 17, 2024