Dec 12, 2022

Berkeley Students Use Data Science to Model Air Pollution and Pediatric Health Outcomes

A team of Master of Information and Data Science (MIDS) students recently presented at the Stanford Maternal & Child Health Research Institute Symposium. Trevor Johnson, Matt Lyons, Anand Patel, and Michelle Shen, who will graduate in December 2022, came together from across the country to showcase their research on the impact of air pollution on certain pediatric health conditions.

The research began as an idea for their capstone project, taking inspiration from advisor Cornelia Ilin’s previous research on the relationship between wildfires and respiratory conditions. “It was a uniquely challenging situation for this team because they had to write code to analyze data they could not directly interact with,” Cornelia said. “Still, the team navigated this challenge successfully, running two-stage least-squares regression and building a two-stage XGBoost model for counterfactual analysis. This team is the first to present at an academic conference so early in the capstone semester, working very hard to deliver their work at the Maternal and Child Health Symposium. They went above and beyond what is expected of a team during the capstone.”

Alberto Todeschini, the other advisor for the project, also chimed in, “Current studies have shown that, after decades of improvements, air quality in California has been deteriorating rapidly because of smoke from the wildfires. Other studies have shown a correlation between increased temperatures and the severity of Valley fever, a potentially deadly disease that is common in the southwest of the United States and that is spread through air.” 

The four came from significantly different backgrounds (Trevor works as an actuary, Matt in clinical analytics, Anand as a software engineer, and Michelle in research), but they were all interested in discovering what data science could do to address such a prevalent issue in healthcare.

But first, the group had to tackle a crucial question. “How do we define pollution?” Anand asked. The question grew out of Cornelia’s previous focus on wildfires, which the group wanted to expand upon. “We’re going from a framework where there are a few wildfires, to a framework where there is always going to be pollution sources,” he added.

And expand they did, eventually choosing to look at PM 2.5 over PM 10. PM, or atmospheric particulate matter, is categorized by particle size: PM 10 refers to particles up to 10 micrometers in diameter, and PM 2.5 to finer particles up to 2.5 micrometers. PM 2.5, the size class the team chose to focus on, is small enough to enter the bloodstream and can therefore lead to conditions related to blood or blood circulation. The four then compiled a list of more than 1,300 pollution sources, which included petrochemical facilities, power plants, mining operations, and large facilities such as airports.

Using a framework called instrumental variable regression, the team used wind to predict PM 2.5 levels around schools in California. “Given that wind is impacting the amount of PM 2.5 around schools, we can use that predicted amount of PM 2.5 value to estimate the impact on healthcare outcomes,” Trevor explained.
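
For readers curious about the mechanics, the two-stage idea Trevor describes can be sketched on simulated data. This is a minimal illustration, not the team's code: the variable names, the coefficients, the sample size, and the simulated health outcome are all assumptions, and the real project worked with far richer data. Stage one predicts PM 2.5 from wind; stage two regresses the outcome on that predicted PM 2.5, which filters out a hidden factor that would otherwise bias a naive regression.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def two_stage_least_squares(instrument, exposure, outcome):
    """Estimate the effect of exposure on outcome using one instrument."""
    # Stage 1: predict the exposure (PM 2.5) from the instrument (wind).
    intercept, slope = fit_line(instrument, exposure)
    predicted = [intercept + slope * z for z in instrument]
    # Stage 2: regress the outcome on the predicted exposure only.
    _, effect = fit_line(predicted, outcome)
    return effect


# Simulated data (all values are illustrative assumptions).
rng = random.Random(0)
n = 20000
TRUE_EFFECT = 0.5  # effect of PM 2.5 on the outcome built into the simulation

wind = [rng.gauss(0, 1) for _ in range(n)]
hidden = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
pm25 = [w + h + rng.gauss(0, 1) for w, h in zip(wind, hidden)]
outcome = [
    TRUE_EFFECT * p + 2.0 * h + rng.gauss(0, 1)
    for p, h in zip(pm25, hidden)
]

# The naive regression absorbs the hidden factor and overstates the effect.
naive_estimate = fit_line(pm25, outcome)[1]
# Wind is unrelated to the hidden factor, so 2SLS lands near TRUE_EFFECT.
iv_estimate = two_stage_least_squares(wind, pm25, outcome)

print(f"naive OLS estimate: {naive_estimate:.2f}")
print(f"2SLS estimate:      {iv_estimate:.2f}")
```

The sketch only works because the simulated wind affects the outcome solely through PM 2.5 and is independent of the hidden factor; whether a real-world instrument meets those conditions is a judgment the researchers have to defend.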

Their project, named HealthCAir, was eventually recommended to the Stanford symposium by Cornelia, who was also presenting her own research. “It was really nice to be supported,” Michelle said. “The [I School’s] master’s travel grant made it possible for those of us on the East Coast to attend.”

The team has made significant progress in creating its own usable map to display pollution data at the zip code level. So far, they have found that increased PM 2.5 exposure has a statistically significant causal relationship with increases in diagnosis rates of all malignant cancers and of blood vessel diseases among children; the results for other conditions are less clear. They hope to eventually add a map feature that will give audiences an estimate of how to reduce the incidence rate of certain diseases and show where mitigation techniques like filtration will make the most difference, as well as identify other diagnosis groups that may have been missed previously.

“The research process is sort of a dialogue. We took some of the things from the project that inspired the instrument design, but we’re taking that and adding some of the other things we’re looking at. And while we’re just looking at California, I think that this process could be used over different time frames, over larger geographic areas,” Matt added.

With a well-cleaned data set and lots of dedication, the team members are well on their way to reaching their goals and making a difference in the realm of pediatric healthcare.

HealthCAir Project Logo
Matt Lyons, Michelle Shen, Trevor Johnson, and Anand Patel (left to right) at Stanford MCHRI


HealthCAir Instrument Explanation



Last updated: August 31, 2023