MIDS Capstone Project Summer 2025

Flight Failures Decoded: Mining Aviation Data To Prevent Future Fatal Air Crashes

Problem

Commercial aviation accidents, though statistically rare, represent some of the most catastrophic transportation failures in modern society. Each incident results in devastating loss of life, massive economic costs, and profound psychological impact on communities worldwide.

  • Scale Challenge: 2.9 billion+ passengers fly annually, making even rare events statistically significant.
  • Data Fragmentation: Reports scattered across multiple agencies with inconsistent formats.
  • Pattern Recognition: Complex interdependencies make prediction extremely difficult.

The Core Challenge: Aviation accident patterns are obscured by fragmented data with inconsistent formats and standards across FAA, NTSB, and international authorities, making it extremely difficult to identify critical safety trends before they cause tragedies.

Solution

  1. FlightAlert: A CatBoost machine learning model that predicts the likelihood of a flight incident or accident from input features such as airport and weather.
  2. FlightSage: A RAG-based AI agent that uses NTSB and FAA reports as a knowledge base to answer aviation safety questions.

We hope to equip both flight policymakers and air traffic controllers to triage a situation quickly and accurately, using both historical data (past reports, via FlightSage) and current flight data (entered into FlightAlert), so they can make well-informed decisions that improve airline safety.

Data Sources

1. Federal Aviation Administration Accident and Incident Database (FAA AIDS)

Link: FAA AIDS

Background from the FAA: The FAA Accident and Incident Data System (AIDS) database contains incident data records for all categories of civil aviation. Incidents are events that do not meet the aircraft damage or personal injury thresholds contained in the National Transportation Safety Board (NTSB) definition of an accident. For example, the database contains reports of collisions between aircraft and birds while on approach to or departure from an airport. While such a collision may not have resulted in sufficient aircraft damage to reach the damage threshold of an NTSB accident, the fact that the collision occurred is valuable safety information that may be used in the establishment of aircraft design standards or in programs to deter birds from nesting in areas adjacent to airports.

Project-specific context: FlightAlert was trained using the "a" files from 1980 to 2025; only data from commercial flights were used.

2. National Transportation Safety Board (NTSB) Investigation Reports

Link: NTSB Investigation Reports

Background from the NTSB: Accident Reports are one of the main products of an NTSB investigation. Reports provide details about the accident, analysis of the factual data, conclusions and the probable cause of the accident, and the related safety recommendations. Most reports focus on a single accident, though the NTSB also produces reports addressing issues common to a set of similar accidents.

Project specific context: FlightSage has the following reports in its repository and was tested (using RAGAS) on them:

Data Pipeline

FlightAlert Data Pipeline:

  • FAA text files are uploaded to S3
  • Parsed using Python in a Jupyter notebook
  • Labeled (using a separate mapping lookup file)
  • Exported as CSV back to the S3 bucket
  • SageMaker AI used for EDA and model hosting
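The parse-and-label steps above can be sketched roughly as follows. This is a minimal illustration, not the project's notebook: the comma delimiter, the column names in `COLUMN_LOOKUP`, and the sample rows are all hypothetical, and the S3 upload and download steps are omitted.

```python
import csv
import io

# Hypothetical mapping from column position to a readable name.
# The real FAA layout and the project's lookup file are not reproduced here.
COLUMN_LOOKUP = {0: "report_id", 1: "event_date", 2: "airport_code", 3: "weather_code"}


def label_rows(raw_text, lookup, delimiter=","):
    """Parse headerless delimited text and attach column names from a lookup."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    labeled = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        labeled.append(
            {name: row[idx].strip() for idx, name in lookup.items() if idx < len(row)}
        )
    return labeled


def to_csv(rows, fieldnames):
    """Serialize labeled rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample standing in for a raw FAA text file.
SAMPLE = "A001,19990102,SFO,RAIN\nA002,20010515,ORD,CLR\n"
rows = label_rows(SAMPLE, COLUMN_LOOKUP)
csv_text = to_csv(rows, list(COLUMN_LOOKUP.values()))
print(csv_text)
```

Keeping the position-to-name mapping in a separate lookup, as the pipeline does, means the parsing code stays unchanged if the column documentation is later corrected.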

FlightSage Data Pipeline:

  • NTSB reports: LLM used to create comprehensive summary, balancing technical details and brevity; Pinecone used as vector database
  • FAA reports: Report Id and text separated, and then embedded and uploaded onto Pinecone
  • OpenAI used for embeddings and to power the agent
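The FAA step above (separating the report ID from the text, then embedding) might look something like the sketch below. It makes several assumptions: the `|` separator and the sample line are invented, and `fake_embed` is a deterministic stand-in for the real embedding API call, so no network access or API key is needed. The resulting records use an id/values/metadata shape of the kind vector databases typically accept for upserts.

```python
import hashlib


def split_report(line, sep="|"):
    """Split one raw line into (report_id, text); the separator is an assumption."""
    report_id, _, text = line.partition(sep)
    return report_id.strip(), text.strip()


def fake_embed(text, dim=8):
    """Stand-in for an embedding API call: a deterministic pseudo-vector from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_records(lines, embed=fake_embed):
    """Build id/values/metadata records ready for a vector-store upsert."""
    records = []
    for line in lines:
        report_id, text = split_report(line)
        if not report_id or not text:
            continue  # skip lines that lack an ID or a text body
        records.append(
            {"id": report_id, "values": embed(text), "metadata": {"text": text}}
        )
    return records


# Made-up input: one well-formed line and one malformed line.
RAW = [
    "19990102001 | AIRCRAFT STRUCK BIRD ON APPROACH.",
    "bad line without separator",
]
records = build_records(RAW)
print(records[0]["id"], len(records[0]["values"]))
```

In a real pipeline, `embed` would be swapped for a function that calls the embedding model, and the records would be passed to the vector database client in batches.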

Front-end:

  • Streamlit used for front-end interface

Model Design and Evaluation

FlightAlert Modeling Approach

Model Chosen

Gradient Boosting over Decision Trees (CatBoost)

Reasoning for CatBoost Selection:

  • Handles categorical features automatically
  • Fast, efficient, accurate, robust to overfitting
  • Offers thorough interpretability
  • Good performance on imbalanced datasets

Model Features

Model features include 100+ attributes about an individual flight and flight conditions, including:

  • Weather conditions
  • Plane model
  • Pilot details
  • Actions taken

FlightAlert Model Evaluation

  • The model was evaluated using F1 score (0.94).
  • The original model had an F1 score of 0.99; later iterations aimed to reduce overfitting.
  • Current model represents a balance between high performance and generalization capability.
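For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, which makes it more informative than accuracy when one class is rare. The sketch below computes it from scratch on made-up labels; the data is purely illustrative and unrelated to the project's results.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute F1 for the positive class from paired true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (1 = accident, 0 = no accident).
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, F1={f1:.3f}")
```

Here the toy classifier is right on 8 of 10 flights, yet it finds only 2 of the 3 positive cases and raises one false alarm, so F1 comes out lower than accuracy.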

Feature Impact Analysis

Shapley Values Analysis

Shapley values for each feature value indicate what is most associated with accidents:

Flying Conditions:

  • Turbulence
  • Freezing rain
  • Mountain waves

Phase of Flight:

  • Maneuver
  • Forced landing
  • Unauthorized low flying level

Engine Certification Region:

  • Southern
  • Eastern
  • Great Lakes
  • Western Pacific
  • Northwest Mountain region
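To make the idea behind the Shapley analysis above concrete, the sketch below computes exact Shapley values for a tiny hand-written scoring function by averaging each feature's marginal contribution over every ordering. The `toy_risk` function and its weights are hypothetical and have nothing to do with the trained FlightAlert model; tree-model tooling uses much faster algorithms than this brute-force version.

```python
from itertools import permutations


def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    return {feature: total / len(orderings) for feature, total in totals.items()}


def toy_risk(present):
    """Hypothetical risk score with an interaction between two conditions."""
    score = 0.0
    if "turbulence" in present:
        score += 0.2
    if "freezing_rain" in present:
        score += 0.3
    if "turbulence" in present and "freezing_rain" in present:
        score += 0.1  # extra risk when both occur together
    return score


values = shapley_values(["turbulence", "freezing_rain", "clear_sky"], toy_risk)
for name, value in values.items():
    print(f"{name}: {value:.3f}")
```

The interaction term is split evenly between the two features that cause it, the irrelevant feature gets zero, and the values sum to the full score. Those properties are what make Shapley values useful for ranking which feature values are most associated with a prediction.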

FlightAlert Model Analysis

Challenges:

  • Determining predictors vs outcomes: Distinguishing features that were outcomes of an accident from features that could predict one
  • Data engineering complexity: FAA data is only downloadable as raw text files without column names
  • Explainability limitations: FAA feature values are undocumented abbreviations


FlightSage Agent Architecture and Evaluation

Evaluation Results (NTSB):

Type     Metric              Baseline   Final
RAGAS    Faithfulness        ~0.6       1.00
RAGAS    Answer relevancy    ~0.8       0.885
RAGAS    Context precision   ~0.5       0.955
RAGAS    Context recall      ~0.6       0.939
NVIDIA   Answer accuracy     ~0.4       0.826

Agent Performance Analysis

What the agent is good at:

  • Answering specific questions about NTSB reports
  • Giving a high-level overview in response to vague questions (supported by FAA/NTSB reports)

What the agent isn't good at:

  • Analytics-type questions

Challenges:

  • FAA report quality and volume: The large volume of low-quality FAA reports creates noise
  • Multi-turn isn't optimal: The agent has to re-query the vector database on every turn
  • Prompting limitations: Prompts with little context don't yield good results

Key Learnings & Impact

FlightAlert:

Performing EDA and building the FlightAlert product taught me (Dan) a lot about analysis in an industry I had no experience in. Coming from a tech background, where data is substantial and plentiful, I was surprised to learn how little useful accident data was available for reducing accident likelihood. The NTSB and FAA were the two primary sources we found, but neither had data that lent itself to analysis; both seemed to have built their data infrastructure near the turn of the century and not updated it since. Considering the magnitude and importance of airline safety, this seems like a major gap in the airline industry's practices. As a continuation of this project, I'm inclined to reach out to the FAA, NTSB, and other airline entities to share the work we've done. While the work is likely not fully baked enough to inform policy, the processes we used are repeatable and could inspire a new way to improve airline safety.

FlightSage: 

Working across FAA and NTSB reports taught me (Jon) how data shape dictates design: FAA sources are plentiful, short, and noisy, while NTSB reports are fewer, longer, and richer. I learned firsthand how RAG behaves differently on each: summarizing section-aware chunks worked best for NTSB reports, whereas FAA data needed little preparation. I also learned that sometimes simple is better: a pared-down agentic setup with just two focused tools, with an LLM call inside the retrieval tool to produce a strict JSON query for semantic similarity, delivered reliable, repeatable behavior without over-engineering.

Impact:

From the feedback we received from aerospace engineers, software engineers, systems engineers, and data analysts in the aerospace industry who tested our project, we learned that users really valued how easy the site was to navigate. They found the tabs useful in guiding them through complex data, and the interactive model made predictions feel transparent and engaging. They enjoyed being able to adjust parameters and see how the risk levels changed. One big takeaway for us was the need to define features more clearly so that more users can understand them. Overall, the experience confirmed we’re on the right track, and it gave us clear ideas on how to improve transparency and usability.

Future Work

FlightAlert

  • Combine FAA data with NTSB data to create a more robust model
  • Build a new model (e.g. XGBoost) to determine which combinations of variables are most risky
  • Include short descriptions next to variables to make them more accessible to less technical audiences

FlightSage

  • Create an analytics semantic layer for FAA reports
  • Flesh out and evaluate multi-turn conversations, making more use of the clarify-request-tool
  • Make FlightAlert a tool that FlightSage can call, for a more integrated experience

Acknowledgements

We would like to thank our capstone mentors, Joyce Shen and Korin Reid, for their guidance and insight on our project. We would like to thank the Software Engineers, Data Analysts, Systems Engineers, and Aircraft Maintenance Engineers who tested our product and gave us quality feedback to make future improvements. We would also like to thank Stanley (MIDS 2025) for his guidance on FlightSage's architecture.

Last updated: August 4, 2025