MIDS Capstone Project Spring 2025

AirAwareTX

Team members

Project Overview

AirAwareTX is an AI-driven web suite that gives Texans real-time, actionable insights into industrial chemical air emissions. By simplifying complex regulatory data with large language models (LLMs) and retrieval augmented generation (RAG) systems it delivers clear summaries and timely notifications. Using predictive machine learning (ML), AirAwareTX also forecasts potential future emissions, helping residents take proactive health and safety measures. The mission of AirAwareTX is to promote transparency, raise public awareness, and protect individuals and communities by empowering individuals to stay informed.

The MVP is online at airaware-tx.com and additional project information can be found online here.

Problem & Motivation

Chemical air emission events are the release of potentially hazardous substances into the environment, often occurring during routine manufacturing and refining processes at industrial facilities. These emissions pose significant health risks, particularly for individuals living near these operations. In Texas this is especially troublesome, as the state is home to a significant portion of the nation’s chemical manufacturing and refining infrastructure, making emissions a frequent and pressing concern for many communities.

Despite the seriousness of the problem, these events are often overlooked in traditional air quality reporting. Generalized metrics fail to capture the localized, event-specific nature of emissions, leaving residents unaware of what pollutants they are exposed to, how dangerous those chemicals may be, and what actions they can take to protect their health. While emissions data is available publicly, it is fragmented across multiple agencies, and often buried in dense, highly technical reports that are difficult for the average person to read, interpret, and take action on.

AirAwareTX is compelled to solve these challenges. The tool converts emission reporting data into an accessible, user-friendly platform that delivers real-time updates, plain-language summaries, and predictive insights. The tool fills the current gap in clarity, accessibility, and preparedness, offering a critical resource for individuals and communities living amongst Texas’s industrial landscape.

Data Source & Data Science Approach

TCEQ Dataset

At the core of the AirAwareTX is a dataset of Texas Commission on Environment Quality (TCEQ) industrial chemical emission reports. The team scraped and curated this dataset, conducting an exploratory data analysis (EDA) of the web scraped data and doing cleaning and preprocessing. Additionally, the tool is continuously updating the dataset by actively scraping TCEQ Emissions Reports, beginning with a bulk historical collection spanning 2015 to 2024 and extending to include newly published reports in real time (scraping every 15 minutes). Each of the regulatory reports includes details such as incident locations, chemicals and quantities released, and incident summaries—forming the foundation for both user-facing insights and model training.

To strengthen its predictive capabilities, AirAwareTX also integrates historical and real-time weather data via the OpenWeather API. This environmental context plays a key role in supporting the predictive machine learning model, which uses atmospheric conditions to forecast the likelihood of future emission events. By combining regulatory data with dynamic weather information, AirAwareTX delivers a comprehensive and context-aware understanding of Texas’s industrial emissions landscape.

Data Science Applications

AirAwareTX harnesses the power of machine learning and LLMs to turn complex industrial emissions data into clear, actionable insights for communities across Texas. At the heart of this capability is the Anthropic Claude 3.5 Haiku LLM, which distills dense regulatory reports into concise, easy-to-understand summaries. These summaries are then delivered as timely email notifications, ensuring users stay informed about air quality developments in their area.

To further enhance accessibility and understanding, AirAwareTX features a chemical-trained chatbot, also built on the Haiku LLM but integrated with a retrieval augmented generation (RAG) system populated with chemical fact sheets and wiki pages, chemical safety data sheets (SDS), and emission event information. This chatbot allows users to ask questions and receive contextually relevant answers about emissions, pollutants, exposure, and safety in real time.

Complementing these features is a machine learning model that predicts the likelihood of future emission events based on weather forecasts, empowering users to take proactive steps to protect their health and safety.

With an emphasis on usability, AirAwareTX offers a functional tool suite that allows users to effortlessly register, manage notifications, explore emissions data, and find a deeper understanding of emissions. The platform is purpose-built to support transparency, fostering public awareness, and equipping individuals with the knowledge they need to stay safe and make informed decisions about chemical air emissions.

Evaluation

Emission Prediction Model

As part of the AirAwareTX development process, the team conducted early user research to ensure the tool aligned with public interest. A survey posted across various Texas-related subreddits gathered feedback from over 120 respondents. One of the most compelling insights was that many users expressed strong interest in a feature that could predict the likelihood of emissions occurring in their area. This response directly informed the decision to pursue a predictive modeling component for the platform.

To support this effort, the team collected and prepared three core datasets: lagged emissions data, one-hot encoded historical county emission data, and weather data. The initial modeling objective was to predict a binary outcome: whether an emission event would occur in a given county on a given day. The team experimented with multiple machine learning models for this classification task, including neural networks and XGBoost, but surprisingly, Logistic Regression consistently outperformed the more complex approaches, even after hyperparameter tuning. However, despite achieving high accuracy (due to the rarity of emissions on most days), the model struggled with precision and recall, yielding only ~30% in both at optimal threshold levels which resulted in a high rate of false positives and missed emissions. This led to a strategic pivot: rather than forcing a binary prediction, the team chose to display raw probability scores directly to users, along with clear guidance to help interpret the data.

The final product focuses on short-term predictions (today, tomorrow, and the day after) as forecast reliability diminishes over time, particularly with weather uncertainty. The model, deployed on AWS Sagemaker, runs daily and consumes live weather data from the OpenWeatherAPI, calculates emission probabilities, and generates visual assets. These assets are published to a S3 bucket and seamlessly integrated into the core application, enabling users across Texas to access timely and meaningful insights about local air emissions.

LLM Emission Event Summarization

Unlike typical NLP tasks, our application did not have a standard set of "Gold Answers" since it directly sources publicly available data from TCEQ. To effectively evaluate and enhance the quality of AI-generated summaries, we conducted a two-phase optimization process.

In the first phase, participants ranked summaries generated by Claude 3.7 Sonnet, Claude 3.5 Haiku, ChatGPT-4o, and a generic summary directly from TCEQ. While participants slightly preferred 3.7 Sonnet, we selected 3.5 Haiku because it provided nearly identical quality at a significantly lower cost, achieving approximately 73% savings per token.

The second phase involved refining the prompts using an approach called "LLM-as-Judge." We tested three prompt variations: a baseline prompt, a one-shot prompt featuring the highest-ranked human summary as context, and a composite prompt that combined the one-shot approach with optimized language from additional prompt engineering. The composite prompt consistently yielded the most clear, accurate, and concise summaries.

RAG-based LLM Chatbot

While configuring both the back-end and front-end of a RAG system was a key technical milestone, ensuring the quality and appropriateness of its outputs was equally critical. To validate the system’s performance, the team conducted a structured RAG evaluation study (mirroring the 267 GenAI final course project) aimed at measuring the accuracy and relevance of the responses generated by the AI.

The evaluation began with the development of a ‘golden set’ of 30 expert-level question-and-answer pairs, created with the guidance of an experienced chemical engineer from the oil and gas industry. These pairs served as a benchmark against which the system’s responses could be assessed. To evaluate performance, the team applied a combination of well-established NLP metrics: BLEU, ROUGE, and BERTScore F1. BLEU and ROUGE offered complementary views on linguistic overlap while BERTScore F1 focused on semantic similarity, capturing how closely the generated responses matched the intent and meaning of the expert-provided answers. While all factors were considered, the BertScore F1 was used as the primary metric. This comprehensive approach ensured the chatbot was not only functional, but also capable of delivering reliable and domain-relevant insights to end users.

Key Learnings & Impact

Throughout the development of AirAwareTX, we learned the critical value of user feedback at every stage of the project. By using platforms like Reddit to gauge market interest, conducting surveys to refine features like summary optimization, and consulting with industry experts for specialized guidance, we were able to ensure the product met the needs of its intended users. This feedback loop was essential in shaping a product that delivers real, actionable value.

AWS proved to be both a powerful tool and a challenge. Its extensive capabilities were indispensable for managing the entire project, but we quickly realized how easy it was to get sidetracked by its vast array of options for implementation, integration, and optimization—especially when working towards an MVP. While AWS provided the flexibility we needed, it also required careful focus to avoid getting lost in its numerous features and possibilities.

Lastly, we discovered that meaningful machine learning predictions are possible even with limited data. By linking diverse data sources and working with incomplete in-factory information, we were able to successfully deploy an MVP that provides valuable insights to users. This experience demonstrated that, with the right approach, insightful predictions can still be made even when not all ideal data is available.

Next Steps for AirAwareTX

Looking beyond capstone, there are several key priorities to enhance the AirAwareTX platform. One major focus will be expanding user customization for notifications. Building on the MVP’s basic notification system, we aim to offer more granular control, allowing users to select specific geographies, set personalized delivery schedules, and integrate forecasting for more tailored alerts.

Another priority is to expand the predictive power of the model. The MVP was focused on weather relationships, and we plan to incorporate more datasets to broaden the range of emission factors considered in predictions. Additionally, by leveraging more computing resources and incorporating advanced physics and geospatial algorithms, we aim to improve the model’s ability to deliver hyper-localized predictions, such as at the zip code level.

Finally, we plan to expand the health and medical guidance capabilities of the tool. While we intentionally limited this aspect in the MVP to avoid the complexities of providing medical advice, we recognize the potential value in offering more health-related insights. By collaborating with experts, we aim to integrate accurate and reliable health and medical recommendations into the platform, further enhancing its value to users.

Acknowledgements

The AirAwareTX Project Team would like to acknowledge the support, feedback, and direction of our project professors - Uri Schonfeld and Todd Holloway. We would also like to thank the 120+ Texans and our Texas based beta-testers who engaged with our various surveys, discussions, and provided feedback on various versions of the AirAwareTX tool. As well as chemical engineer and industry expert, Ryan Manser, for insight on emission, chemical prioritization, golden question development, and the industry value of our tool’s design.

Course

Data Science 210. Capstone , Spring 2025

Class Project Gallery

More Information

AirAwareTX MVP Tool

AirAwareTX Informational Website

AirAwareTX Final Presentation

AirAwareTX

Video

Last updated: April 14, 2025