Inference
Motivation
Emergency dispatchers and first responders face a structural information problem during active incidents. As an event unfolds, incoming data accumulates faster than operators can manually process it. Social media generates high volumes of conflicting, unverified reports. Official government feeds lag real-time conditions by hours. CAD systems, many over 20 years old, were built to log events but have no capacity to reason about them. Dispatchers routinely monitor thousands of signals per hour across more than 10 disconnected systems simultaneously.
The 2025 Palisades and Eaton fires brought this into focus. Independent reviews documented coordination failures, information gaps between field units and incident command, and decision delays that worsened response outcomes. Inference was developed in direct response to those documented failures.
Inference is a decision support platform for emergency dispatchers and first responders. It ingests multimodal data from social media and field sources, structures it through an embedding and clustering pipeline, and surfaces confidence-rated situational intelligence through an incident timeline, operational map, and AI-powered decision engine — reducing the time between information arrival and informed action.
Data Sources
The primary dataset used for development and evaluation is approximately 84,000 unique tweets from the January 2025 LA wildfires, approximately 15,000 of which include media attachments. Initial analysis revealed that traditional keyword and hashtag filters left over 50% of posts uncategorized, with relevance unknown — illustrating the core data challenge the system is designed to address.
Posts were classified across operational categories including infrastructure, medical and rescue, fire operations, evacuation and safety, and community support, alongside large volumes of political commentary, media coverage, and general noise. The central engineering question was how to reliably extract operationally relevant signals from this volume of mixed-relevance data.
In addition to social media, the system draws from NOAA/NWS active alerts via api.weather.gov, FEMA OpenFEMA declaration summaries, OpenSky Network aircraft telemetry over defined incident bounding boxes, and GeoJSON-encoded geospatial data including road closure LineStrings, fire perimeter Polygons, and ground asset Point features from CAL FIRE and LA County OES feeds.
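As a rough illustration of how the geospatial inputs described above might be separated into map layers, the sketch below groups features from a GeoJSON FeatureCollection by geometry type. The sample payload, the layer names, and the function name `split_features` are assumptions made for illustration; the actual CAL FIRE and LA County OES feed schemas are not shown in this write-up.

```python
import json
from collections import defaultdict

# Hypothetical sample payload for illustration; real feed schemas may differ.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "LineString",
                  "coordinates": [[-118.55, 34.04], [-118.53, 34.05]]},
     "properties": {"label": "example road closure"}},
    {"type": "Feature",
     "geometry": {"type": "Polygon",
                  "coordinates": [[[-118.60, 34.05], [-118.50, 34.05],
                                   [-118.50, 34.10], [-118.60, 34.05]]]},
     "properties": {"label": "example fire perimeter"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-118.52, 34.06]},
     "properties": {"label": "example ground asset"}}
  ]
}
"""

# Geometry type -> map layer, following the three feature kinds named in the text.
# The layer names themselves are assumed.
LAYER_BY_GEOMETRY = {
    "LineString": "road_closures",
    "Polygon": "fire_perimeters",
    "Point": "ground_assets",
}


def split_features(collection: dict) -> dict:
    """Group GeoJSON features into layers keyed by geometry type."""
    layers = defaultdict(list)
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        layer = LAYER_BY_GEOMETRY.get(geometry.get("type"))
        if layer is not None:  # skip geometry types the map does not render
            layers[layer].append(feature)
    return dict(layers)


layers = split_features(json.loads(SAMPLE_GEOJSON))
for name, features in layers.items():
    print(name, len(features))
```

Keying on geometry type keeps the sketch short, but a real feed would more likely carry an explicit property identifying each feature's role.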
Data Science Approach
The Ingestion Problem
The initial approach to signal extraction used Named Entity Recognition to identify and classify entities within raw tweet text. In practice, NER proved poorly suited to this problem. Its reliance on keyword patterns rather than semantic context produced low cohesion across clusters, binary cluster membership resulted in high sparsity, and sensitivity to informal language made it brittle across different incident types.
Embeddings + HDBSCAN
To address these limitations, the pipeline was redesigned around multimodal embeddings and density-based clustering. Tweets, including text, images, and video, are embedded using Amazon Nova 2 via AWS Bedrock, producing 1024-dimensional vector representations that capture semantic content rather than surface keywords. The embeddings are reduced in dimensionality with UMAP and then clustered using HDBSCAN, a density-based algorithm that groups semantically similar content into cohesive clusters while automatically routing low-density, irrelevant posts into a noise cluster that is discarded. Because the approach operates on embedding space rather than trained classification rules, it generalizes across incident types without retraining.
Cluster summaries are generated using Claude Haiku. The result is a set of dense, semantically coherent clusters representing distinct operational themes within the incident data.
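The noise-discarding step described above can be sketched without any clustering library. HDBSCAN implementations conventionally assign the label -1 to points that fall in low-density regions, so separating clusters from noise comes down to grouping by label. The post ids and labels below are made up for illustration; in practice the labels would come from an HDBSCAN implementation (for example, the `fit_predict` method of `sklearn.cluster.HDBSCAN`), and the helper name `group_by_cluster` is hypothetical.

```python
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN convention: -1 marks points assigned to no cluster


def group_by_cluster(post_ids, labels):
    """Group post ids by cluster label, dropping posts labeled as noise."""
    if len(post_ids) != len(labels):
        raise ValueError("post_ids and labels must have the same length")
    clusters = defaultdict(list)
    for post_id, label in zip(post_ids, labels):
        if label == NOISE_LABEL:
            continue  # low-density post: discarded rather than summarized
        clusters[label].append(post_id)
    return dict(clusters)


# Hypothetical output of a clustering run over five posts.
post_ids = ["t1", "t2", "t3", "t4", "t5"]
labels = [0, 0, -1, 1, -1]

clusters = group_by_cluster(post_ids, labels)
noise_share = labels.count(NOISE_LABEL) / len(labels)
print(clusters, noise_share)
```

Tracking the share of posts routed to noise is one simple way to see how aggressively a given parameter setting filters the stream.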
Event Generation
Clusters are matched against existing timeline events using vector similarity. If a cluster is sufficiently similar to an existing event, it is used to enrich that event with new information. If no match is found, a new event is created. This produces a continuously updated incident timeline that reflects the evolving state of the situation as new data arrives. Event generation is handled by Claude Sonnet.
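A minimal sketch of the match-or-create logic described above, assuming cosine similarity as the vector similarity measure and a fixed threshold. The threshold value of 0.8, the three-dimensional toy vectors (the production embeddings are 1024-dimensional), and the function names are all assumptions for illustration; the write-up does not state the actual metric or cutoff.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_cluster(cluster_vec, events, threshold=0.8):
    """Return the id of the most similar event at or above the threshold.

    Returns None when no event is similar enough, which signals that a
    new timeline event should be created. The threshold is hypothetical.
    """
    best_id, best_score = None, threshold
    for event_id, event_vec in events.items():
        score = cosine_similarity(cluster_vec, event_vec)
        if score >= best_score:
            best_id, best_score = event_id, score
    return best_id


# Toy event embeddings for illustration only.
events = {"evt-1": [1.0, 0.0, 0.0], "evt-2": [0.0, 1.0, 0.0]}

enrich_target = match_cluster([0.9, 0.1, 0.0], events)  # close to evt-1
new_event = match_cluster([0.0, 0.0, 1.0], events)      # similar to nothing

print(enrich_target, new_event)
```

When `match_cluster` returns an id, the cluster would enrich that event; when it returns None, a new event would be created.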
System Architecture
The pipeline runs asynchronously: raw tweet data is ingested, embedded, clustered, and summarized before being written to a PostgreSQL database. A synchronous backend then makes this structured data available to three services — event generation, decision generation, and chat — each of which queries the database through a RAG layer that allows Claude models to search and reason over live incident data.
Platform Features
Incident Timeline
The timeline presents a chronological reconstruction of the incident broken into discrete events, each tagged as Confirmed or Inferred based on available evidence. Selecting an event updates an AI-generated summary panel and a linked evidence section showing the underlying source material — bystander reports, news coverage, and official feeds — that supports it. A confidence score is displayed at the event level so operators always have a clear indication of information reliability. The timeline updates in real time as new clusters are processed and matched.
Operational Map
The map displays live coordinates of relevant field resources, color-coded by type: green for EMS, orange for fire engines, blue for police units. The affected area is highlighted, road closures are rendered as overlays, and shelter locations are plotted and color-coded by capacity. This gives dispatchers a current spatial picture of resource positions and access constraints without requiring them to consult a separate system.
Decision Engine
The decision engine takes the current timeline events and map state as inputs and runs two sequential LLM calls. The first call produces a structured situational assessment — what is happening, what is credible, and where information gaps exist. The second call takes that assessment along with map data and generates a prioritized set of recommended actions, each with a stated rationale, time horizon, desired outcomes, identified risks, and suggested resources. Two calls are used rather than one to reduce hallucination by lowering the complexity handled in each individual prompt. Output is validated through a Pydantic model before reaching the frontend.
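The two-call structure can be sketched as a small chain in which the second prompt receives the first call's output. To keep the sketch free of dependencies, a standard-library dataclass stands in for the Pydantic model named above, and a stub function stands in for the Claude calls. The field names, prompt prefixes, and the choice to pass only timeline events to the first call are assumptions, not the platform's actual schema or prompts.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecommendedAction:
    # Fields mirror the attributes listed in the text; the names are assumed.
    action: str
    rationale: str
    time_horizon: str
    desired_outcomes: list
    risks: list
    resources: list


def validate_actions(raw: str) -> list:
    """Parse the second call's output and reject anything off-schema.

    Unlike a Pydantic model, a dataclass does not check field types; it
    only raises TypeError on missing or unknown fields.
    """
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of actions")
    return [RecommendedAction(**item) for item in items]


def run_decision_engine(events, map_state, llm: Callable[[str], str]) -> list:
    # Call 1: situational assessment (what is happening, what is credible, gaps).
    assessment = llm("ASSESS\n" + json.dumps(events))
    # Call 2: recommended actions from the assessment plus map data.
    raw = llm("RECOMMEND\n" + assessment + "\n" + json.dumps(map_state))
    return validate_actions(raw)


def fake_llm(prompt: str) -> str:
    """Stub standing in for a model call; returns canned placeholder output."""
    if prompt.startswith("ASSESS"):
        return "placeholder assessment"
    return json.dumps([{
        "action": "placeholder action",
        "rationale": "placeholder rationale",
        "time_horizon": "0-1 hours",
        "desired_outcomes": ["placeholder outcome"],
        "risks": ["placeholder risk"],
        "resources": ["placeholder resource"],
    }])


actions = run_decision_engine([{"id": "evt-1"}], {"closures": []}, fake_llm)
print(actions[0].action, actions[0].time_horizon)
```

Validating before the output reaches the frontend means a malformed model response surfaces as an exception in the backend rather than as a broken panel in the interface.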
Natural Language Query
The chat interface allows dispatchers to query the platform in plain language at any point during an incident. Queries are processed through three retrieval tools available to the LLM: semantic search over Nova 2 embeddings in the database (which also surfaces relevant image content, since image and text embeddings share the same space), keyword search over cluster summaries, and text search over timeline event summaries. Retrieved evidence is then summarized and returned to the user. This allows operators to ask specific questions about the timeline, the map state, or the decisions — and receive grounded, source-backed answers.
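One way the retrieval tools might be exposed to the model is as named functions in a registry, with each result tagged by its source so the final answer can cite evidence. The sketch below covers only the two text-based tools and uses naive substring matching; the semantic search tool is omitted because it depends on the embedding store. All tool names, the in-memory data, and the matching rule are assumptions for illustration.

```python
from typing import Callable

# Hypothetical in-memory stand-ins for the database tables.
CLUSTER_SUMMARIES = {
    "c1": "Power lines down near canyon road",
    "c2": "Shelter at school accepting evacuees",
}
EVENT_SUMMARIES = {
    "e1": "Evacuation order issued for hillside zone",
}


def _contains_all(text: str, query: str) -> bool:
    """True when every whitespace-separated query term appears in the text."""
    lowered = text.lower()
    return all(term in lowered for term in query.lower().split())


def keyword_search_clusters(query: str) -> list:
    return [{"source": "cluster", "id": cid, "text": text}
            for cid, text in CLUSTER_SUMMARIES.items()
            if _contains_all(text, query)]


def text_search_events(query: str) -> list:
    return [{"source": "event", "id": eid, "text": text}
            for eid, text in EVENT_SUMMARIES.items()
            if _contains_all(text, query)]


# Registry of tools the model could call by name.
TOOLS = {
    "keyword_search_clusters": keyword_search_clusters,
    "text_search_events": text_search_events,
}


def run_tool(name: str, query: str) -> list:
    """Dispatch a tool call by name, failing loudly on unknown tools."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](query)


evidence = (run_tool("keyword_search_clusters", "shelter")
            + run_tool("text_search_events", "evacuation"))
for item in evidence:
    print(item["source"], item["id"], item["text"])
```

Keeping the source tag on every result is what lets the summarization step return answers that point back to specific clusters or timeline events.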
Application Architecture
The platform is built on Next.js (TypeScript) on the frontend and FastAPI (Python) on the backend, with a PostgreSQL database storing embedded tweet clusters and event records. Embeddings are generated using Amazon Nova 2 via AWS Bedrock at 1024 dimensions. Clustering uses UMAP for dimensionality reduction followed by HDBSCAN. Event and decision generation use Claude Sonnet; summarization and chat use Claude Haiku. The chat interface communicates with the frontend via WebSocket. The map layer renders geospatial data from GeoJSON and live resource coordinates. The data pipeline runs asynchronously, decoupled from the synchronous backend services that query it.
Evaluation
Three features were evaluated independently, with a shared goal: maximize high-value, actionable information while reducing noise.
Timeline Clarity
After clustering and event generation, an event filter scores each candidate event before it is added to the timeline. Tuning this filter reduced the number of displayed events from 55 to 7. A panel of firefighters from a local fire department evaluated timeline clarity on a 1–5 scale based on how clear and actionable the events were given their operational experience. The filter adjustments raised the average clarity score from 3.6 to 4.3.
Chat Relevancy
Chat performance was evaluated by comparing how well different model configurations extracted precise, quantifiable operational information from incident data: specific locations, incident scale, time sensitivity, and similar indicators. A baseline using open-source GPT models with NER scored 1.6 out of 5 under LLM-as-a-judge evaluation. The production configuration, Nova 2 embeddings paired with Claude Sonnet 4.6, scored 4.7 out of 5, with clear advantages in identifying key quantifiable indicators across incident categories.
Decision Accuracy
Decision accuracy was evaluated by comparing the system's recommended actions against the actions documented as actually taken during the January 2025 Palisades and Eaton fires. The system's top three recommendations were deploying emergency water tankers to the Sunset Blvd fire perimeter, opening additional shelter capacity at Paul Revere Middle School, and establishing search and rescue operations in the Altadena fire zone. All three corresponded directly to actions taken by LA emergency services during the fires, as confirmed by contemporaneous reporting from the LA Times, NBC News, and the New York Times.
Key Learnings & Impact
Three technical areas drove the most meaningful improvements in system performance.
The shift from NER to multimodal embeddings with HDBSCAN clustering was the single most impactful architectural decision. Embedding-based approaches capture semantic relationships that keyword-based methods miss, produce denser and more coherent groupings, and generalize across incident types without retraining.
Retrieval quality required iterative tuning. Balancing relevance and confidence scoring across the three retrieval tools — semantic search, cluster search, and timeline search — was necessary to improve response helpfulness without increasing noise. Small changes in retrieval thresholds had measurable effects on output quality.
Interface design proved as consequential as model performance. The platform handles a large volume of information, and presenting it without overwhelming the operator requires deliberate decisions about information hierarchy, progressive disclosure, and the layout of the timeline, map, and decision panels relative to each other.
Current data sources are open and public. The most significant near-term improvement would come from integration with proprietary agency systems — CAD feeds, radio transcripts, and internal dispatch logs — which represent the sources of truth operators already rely on. Enriching the decision engine with agency-specific standard operating procedures and best practice documentation would further improve recommendation relevance. Expanding the map layer with live geospatial data — outage zones, evacuation boundaries, gathering points — would give the decision engine a richer spatial input and operators a more complete operational picture.
Acknowledgements
We thank our capstone instructors, Frederick Nugen and Ramesh Sarukkai, for their guidance throughout this project. We also thank the first responders from our local fire department and emergency medical service for their operational perspectives, which directly shaped the design decisions behind this platform.
