Flight Failures Decoded: Mining Aviation Data To Prevent Future Fatal Air Crashes
Problem
Commercial aviation accidents, though statistically rare, represent some of the most catastrophic transportation failures in modern society. Each incident results in devastating loss of life, massive economic costs, and profound psychological impact on communities worldwide.
- Scale Challenge: 2.9 billion+ passengers fly annually, so even rare events affect a large number of people in absolute terms.
- Data Fragmentation: Reports scattered across multiple agencies with inconsistent formats.
- Pattern Recognition: Complex interdependencies make prediction extremely difficult.
The Core Challenge: Aviation accident patterns are obscured by fragmented data with inconsistent formats and standards across FAA, NTSB, and international authorities, making it extremely difficult to identify critical safety trends before they cause tragedies.
Solution
- FlightAlert: A CatBoost machine learning model that predicts the likelihood of a flight incident or accident from input features such as airport and weather.
- FlightSage: A RAG-based AI agent that uses NTSB and FAA reports as a knowledge base to answer aviation safety questions.
We hope to equip both aviation policymakers and air traffic controllers to triage a situation quickly and accurately, using historical data (past reports via FlightSage) together with current flight data (entered into FlightAlert), so they can make well-informed decisions that improve airline safety.
Data Sources
Link: FAA AIDS
Background from the FAA: The FAA Accident and Incident Data System (AIDS) database contains incident data records for all categories of civil aviation. Incidents are events that do not meet the aircraft damage or personal injury thresholds contained in the National Transportation Safety Board (NTSB) definition of an accident. For example, the database contains reports of collisions between aircraft and birds while on approach to or departure from an airport. While such a collision may not have resulted in sufficient aircraft damage to reach the damage threshold of an NTSB accident, the fact that the collision occurred is valuable safety information that may be used in the establishment of aircraft design standards or in programs to deter birds from nesting in areas adjacent to airports.
Project specific context: FlightAlert in particular was trained using the "a" files from years 1980 to 2025; only data from commercial flights were used.
Link: NTSB Investigation Reports
Background from the NTSB: Accident Reports are one of the main products of an NTSB investigation. Reports provide details about the accident, analysis of the factual data, conclusions and the probable cause of the accident, and the related safety recommendations. Most reports focus on a single accident, though the NTSB also produces reports addressing issues common to a set of similar accidents.
Project specific context: FlightSage has the following reports in its repository and was tested (using RAGAS) on them:
Data Pipeline
FlightAlert Data Pipeline:
- FAA Text files are uploaded to S3
- Parsed using Python in a Jupyter notebook
- Labeled (using a separate mapping lookup file)
- Exported as CSV back to the S3 bucket
- SageMaker AI used for EDA and model hosting
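The parse, label, and export steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual notebook code: the column names, the tab delimiter, and the abbreviation codes in `CODE_LOOKUP` are all hypothetical placeholders, and the S3 upload and download steps are left out.

```python
import csv
import io

# Hypothetical column names and code lookup; the real FAA layout and codes differ.
COLUMN_NAMES = ["report_id", "event_date", "weather_code", "phase_code"]
CODE_LOOKUP = {
    "weather_code": {"TB": "Turbulence", "FR": "Freezing rain"},
    "phase_code": {"MN": "Maneuver", "FL": "Forced landing"},
}


def parse_and_label(raw_text, delimiter="\t"):
    """Attach column names to headerless rows and decode abbreviated values."""
    rows = []
    for values in csv.reader(io.StringIO(raw_text), delimiter=delimiter):
        if not values:
            continue  # skip blank lines
        row = dict(zip(COLUMN_NAMES, (v.strip() for v in values)))
        for column, lookup in CODE_LOOKUP.items():
            code = row.get(column, "")
            # Keep the raw code when the lookup has no entry for it.
            row[column] = lookup.get(code, code)
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize labeled rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMN_NAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows standing in for a headerless FAA text file.
sample = "A001\t1999-03-14\tTB\tMN\nA002\t2004-11-02\tFR\tXX\n"
labeled = parse_and_label(sample)
print(to_csv(labeled))
```

Keeping unknown codes unchanged (like `XX` here) instead of dropping the row makes gaps in the mapping file visible during EDA.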
FlightSage Data Pipeline:
- NTSB reports: LLM used to create comprehensive summary, balancing technical details and brevity; Pinecone used as vector database
- FAA reports: Report Id and text separated, and then embedded and uploaded onto Pinecone
- OpenAI used for embeddings and to power the agent
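To illustrate the FAA branch of this pipeline (separate the report ID from the text, embed the text, store it, then retrieve by similarity), here is a toy sketch that runs without any external service. It is an assumption-laden stand-in, not the project's code: `toy_embed` is a bag-of-words counter in place of OpenAI embeddings, `ToyVectorIndex` is an in-memory list in place of Pinecone, and the report lines and the "ID, space, narrative" layout are invented.

```python
import math
import re
from collections import Counter


def split_report(line):
    """Split 'REPORT_ID narrative text' into (report_id, text)."""
    report_id, _, text = line.strip().partition(" ")
    return report_id, text.strip()


def toy_embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (report_id, text, vector)

    def upsert(self, report_id, text):
        self.items.append((report_id, text, toy_embed(text)))

    def query(self, question, top_k=2):
        q = toy_embed(question)
        scored = [(cosine(q, vec), rid, text) for rid, text, vec in self.items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(rid, text) for _, rid, text in scored[:top_k]]


# Invented report lines for illustration only.
raw_lines = [
    "R1 Aircraft struck a bird on approach to the airport",
    "R2 Engine lost power during climb after takeoff",
    "R3 Pilot reported severe turbulence over mountains",
]
index = ToyVectorIndex()
for line in raw_lines:
    index.upsert(*split_report(line))

hits = index.query("bird strike on approach", top_k=1)
print(hits)
```

Storing the report ID alongside the vector is what lets the agent cite the specific report behind an answer.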
Front-end:
- Streamlit used for front-end interface
Model Design and Evaluation
Reasoning for CatBoost Selection:
- Handles categorical features automatically
- Fast, efficient, accurate, robust to overfitting
- Offers thorough interpretability
- Good performance on imbalanced datasets
Model Features
Model features include 100+ attributes about an individual flight and flight conditions, including:
- Weather conditions
- Plane model
- Pilot details
- Actions taken
FlightAlert Model Evaluation
- The model was evaluated using the F1 score; the current model scores 0.94.
- The original model scored 0.99, which pointed to overfitting; later iterations aimed to reduce it.
- The current model balances high performance with the ability to generalize.
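For readers unfamiliar with the metric, the sketch below computes F1 from scratch and shows why it is more informative than accuracy when the positive class is rare. The labels are made up, and treating `1` as the positive (rare) class is an assumption for illustration; this is not the project's evaluation code.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels: two positives out of ten examples.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
always_negative = [0] * 10                    # never predicts the rare class
some_model = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # one hit, one false alarm, one miss

print(accuracy(y_true, always_negative), f1_score(y_true, always_negative))
print(accuracy(y_true, some_model), f1_score(y_true, some_model))
```

Both predictors reach the same accuracy here, but only the second one earns a non-zero F1. Computing the same function on the training split and on a held-out split is one simple way to spot the kind of overfitting mentioned above.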
Feature Impact Analysis
Shapley Values Analysis
Shapley values computed for each feature value indicate which values the model most strongly associates with accidents:
Flying Conditions:
- Turbulence
- Freezing rain
- Mountain waves
Phase of Flight:
- Maneuver
- Forced landing
- Unauthorized low-level flying
Engine Certification Region:
- Southern
- Eastern
- Great Lakes
- Western Pacific
- Northwest Mountain region
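As background on what a Shapley value measures, the sketch below computes exact Shapley values for a tiny, invented risk function by averaging each feature's marginal contribution over every ordering. The feature names and risk numbers are hypothetical, and this brute-force approach only works for a handful of features; SHAP tooling for tree models uses much faster algorithms, so this is not how the values above were produced.

```python
from itertools import permutations


def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        present = set()
        for feature in ordering:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    return {feature: total / len(orderings) for feature, total in totals.items()}


def toy_risk(present):
    """Hypothetical risk score for a set of active conditions (made-up numbers)."""
    score = 0.0
    if "turbulence" in present:
        score += 0.2
    if "freezing_rain" in present:
        score += 0.1
    if "turbulence" in present and "freezing_rain" in present:
        score += 0.1  # interaction: the two conditions together add extra risk
    return score


FEATURES = ["turbulence", "freezing_rain", "clear_daylight"]
values = shapley_values(FEATURES, toy_risk)
for name, value in values.items():
    print(f"{name}: {value:.3f}")
```

The interaction term is split evenly between the two features that cause it, the feature with no effect receives zero, and the values sum to the difference between the full-feature score and the empty-set score.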
FlightAlert Model Analysis
Challenges:
- Separating predictors from outcomes: Some features describe the result of an accident rather than the conditions that preceded it, and these had to be identified so they were not treated as predictors
- Data engineering complexity: FAA data is available for download only as raw text files, without column names
- Explainability limitations: FAA feature values are abbreviations and not documented
FlightSage Evaluation Results (NTSB reports):
| Type | Metric | Baseline | Final |
| --- | --- | --- | --- |
| RAGAS | Faithfulness | ~0.6 | 1.00 |
| RAGAS | Answer relevancy | ~0.8 | 0.885 |
| RAGAS | Context precision | ~0.5 | 0.955 |
| RAGAS | Context recall | ~0.6 | 0.939 |
| NVIDIA | Answer accuracy | ~0.4 | 0.826 |
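To give a feel for the two retrieval metrics in the table, here is a simplified sketch of context precision and context recall. It is a reading of the metric definitions, not the RAGAS library's implementation: in RAGAS the relevance and support verdicts come from an LLM judge, whereas here they are hand-supplied booleans, and the example verdicts are invented.

```python
def context_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    `relevance` lists, in retrieval order, whether each chunk was judged relevant.
    """
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at this relevant rank
    return weighted / hits if hits else 0.0


def context_recall(claim_supported):
    """Fraction of reference-answer claims supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Invented verdicts: the same two relevant chunks, ranked differently.
relevant_first = [True, False, True]
relevant_last = [False, True, True]
print(context_precision(relevant_first))
print(context_precision(relevant_last))

# Invented verdicts: three of four reference claims found in the context.
print(context_recall([True, True, False, True]))
```

Context precision rewards placing relevant chunks near the top of the ranking, while context recall checks whether the retrieved context covers what the reference answer needs.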
Agent Performance Analysis
What the agent is good at:
- Answering specific questions about NTSB reports
- Giving high-level overviews in response to vague questions (supported by FAA/NTSB reports)
What the agent isn't good at:
- Analytics-type questions
Challenges:
- FAA report volume and quality: The large volume of low-quality FAA reports creates noise
- Multi-turn conversations aren't optimal: The agent has to re-query the vector database on every turn
- Prompting limitations: Prompts with little context don't yield good results
Key Learnings & Impact
FlightAlert:
Performing EDA and building the FlightAlert product taught us a lot about analysis in an industry I (Dan) had no experience in. Coming from a tech background, where data is substantial and plentiful, I was surprised to learn that there was very limited useful data on accidents that could be used to reduce accident likelihood. The NTSB and FAA were the two primary sources we found, but neither had any data that lent itself to analysis - both seemed like they had built a data infrastructure near the turn of the century and hadn't updated anything since. Considering the magnitude and importance of airline safety, this seems like a major gap in the airline industry's practices. As a continuation of this project, I'm inclined to reach out to the FAA, NTSB, and other airline entities to share the work we've done. While the work is likely not fully baked enough to inform policy, the processes we used are repeatable and it could inspire a new way to improve airline safety.
FlightSage:
Working across FAA and NTSB reports taught me (Jon) how data shape dictates design: FAA sources are plentiful, short, and noisy, while NTSB reports are fewer, long, and rich. It was valuable to learn firsthand how RAG behaves differently on each: summarizing section-aware chunks worked best for NTSB, whereas FAA data needed little preparation. I also learned that sometimes simple is better: a pared-down agentic setup with just two focused tools, with an LLM call inside the retrieval tool that produces a strict JSON query for semantic similarity, delivered reliable, repeatable behavior without over-engineering.
Impact:
From the feedback we received from aerospace engineers, software engineers, systems engineers, and data analysts in the aerospace industry who tested our project, we learned that users really valued how easy the site was to navigate. They found the tabs useful for guiding them through complex data, and the interactive model made predictions feel transparent and engaging. They enjoyed being able to adjust parameters and see how the risk levels changed. One big takeaway for us was the need to define features more clearly so that more users can understand them. Overall, the experience confirmed we’re on the right track, and it gave us clear ideas on how to improve transparency and usability.
Future Work
FlightAlert
- Combine FAA data with NTSB data to create a more robust model
- Build a new model (e.g. XGBoost) to determine which combinations of variables are most risky
- Include short descriptions next to variables to make them more accessible to less technical audiences
FlightSage
- Create an analytics semantic layer for FAA reports
- Flesh out and evaluate multi-turn conversations, making more use of clarify-request-tool
- Make FlightAlert a tool that FlightSage can call, for a more integrated experience
Acknowledgements
We would like to thank our capstone mentors, Joyce Shen and Korin Reid, for their guidance and insight on our project. We would also like to thank the Software Engineers, Data Analysts, Systems Engineers, and Aircraft Maintenance Engineers who tested our product and gave us quality feedback to guide future improvements. Finally, we would like to thank Stanley (MIDS 2025) for his guidance on FlightSage's architecture.
