MIDS Capstone Project Spring 2026

Citadel

Team members

Motivation

Risk hides in the gap between what an organization knows and what it sees.

A university knows federal regulations. It sees expense reports. A researcher knows atmospheric chemistry. She sees aircraft measurements. A grant writer knows the political landscape. He sees his own abstract.

In each case, the knowledge exists and the data exists, but connecting them is manual, slow, and error-prone. 

Citadel closes that gap. We encode domain knowledge as a graph, connect it to operational data, and let an AI reason across both. 

We believe it's important to meet customers where they are, and we engineered a solution to empower universities to face crises on multiple fronts. 

To prove Citadel's approach, we built multiple domains that have almost nothing in common except the platform underneath them.


How Citadel Works

Every domain on Citadel follows the same shape. That is: "One Platform, Any Ontology"

We demonstrate Citadel through a fictional university called Learning Center University, running three operational domains that reflect real needs among research universities.

These include atmospheric chemistry analysis for earth scientists studying atmospheric oxidation, grant policy defense for researchers navigating politically sensitive language, and grant audit compliance for internal auditors.

We allow users to create their own domains, and provide three quickstart domains to demonstrate Citadel's ability to generalize strongly across ontologies. These include university curriculum, clinical trial safety, and supply chain compliance.

Orchestrated Intelligence

A context enrichment pipeline runs for every query. It retrieves relevant knowledge from the domain's graph, enriches it with domain-specific tool calls that apply data science techniques, and generates a single response grounded in the user's context.

Our initial research identified poor semantic separation as a hallmark of AI reasoning failures. For instance, embeddings of regulatory language showed approximately 75% semantic similarity to one another, and silhouette scores for neural embeddings (a measure of how cleanly data forms clusters) were less than half of those produced by pure statistics with rules-based NLP.

We therefore felt that standard retrieval-augmented generation (RAG) was not an adequate solution, so we designed a more performant system and compared the two.

In a typical RAG system, the model receives whatever a single retrieval pass returns and generates from that context. In Citadel's orchestrated pipeline, either a ReAct agent selects which tools to invoke based on the query, or tools are called based on the specific UI context. The pipeline then retrieves information from the knowledge graph, calls domain-specific data functions, and synthesizes a response from structured results. The user can also ask follow-up questions rather than working only with what the first retrieval pass returned for what is typically a near-static query.
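The orchestration flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Citadel's implementation: the function name and the four callables (select_tools, retrieve, tools, synthesize) are hypothetical stand-ins for the agent, the knowledge graph retriever, the domain data functions, and the LLM.

```python
from typing import Callable, Dict, List


def answer_with_orchestration(
    query: str,
    retrieve: Callable[[str], List[str]],
    tools: Dict[str, Callable[[str], str]],
    select_tools: Callable[[str], List[str]],
    synthesize: Callable[[str, List[str], Dict[str, str]], str],
) -> str:
    """Select tools, retrieve graph context, run the tools, then synthesize one response."""
    # Tool selection: in the text this is either a ReAct agent or the UI context.
    chosen = select_tools(query)
    # Knowledge graph retrieval (hypothetical callable).
    context = retrieve(query)
    # Domain-specific data functions; unknown tool names are skipped.
    results = {name: tools[name](query) for name in chosen if name in tools}
    # A single response built from graph context plus structured tool results.
    return synthesize(query, context, results)


# Stub components, for illustration only.
demo_answer = answer_with_orchestration(
    "Is this cost allowable?",
    retrieve=lambda q: ["provision text"],
    tools={"ledger_lookup": lambda q: "3 matching transactions"},
    select_tools=lambda q: ["ledger_lookup"],
    synthesize=lambda q, ctx, res: f"{len(ctx)} provision(s), tools used: {sorted(res)}",
)
print(demo_answer)
```

A retrieval-only baseline corresponds to calling synthesize with the retrieved context and an empty results dictionary, which is the contrast the paragraph above draws.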

In the compliance domain's Q&A evaluation, orchestration reduced hallucination by 8% and improved semantic keyword quality by 39% compared to retrieval-only generation.

Citation Verification

For evaluation, each citation was classified as grounded (verified to exist in the graph and present in the retrieved context), known cross-reference (exists in the graph but found through traversal rather than direct retrieval), pre-training data (matches a real source but came from the model's training data), or hallucinated (does not exist in the knowledge graph or any known source). This classification runs on every response, in production, for every domain, so that users can see which claims are backed by source material and which are not.

In the Citadel UI, citations are checked against the knowledge graph in nearly every AI output and displayed as either verified or unverified.
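The four-way classification can be expressed as a short decision function. The sketch below assumes citations are comparable identifiers and that three sets are available: nodes in the graph, nodes in the retrieved context, and identifiers of known real sources. All names and the placeholder identifiers are hypothetical, and the actual matching logic may be more involved.

```python
from enum import Enum


class CitationStatus(Enum):
    GROUNDED = "grounded"
    KNOWN_CROSS_REFERENCE = "known cross-reference"
    PRE_TRAINING_DATA = "pre-training data"
    HALLUCINATED = "hallucinated"


def classify_citation(citation, graph_ids, retrieved_ids, known_source_ids):
    """Assign one of the four categories to a single citation identifier."""
    if citation in graph_ids:
        # In the graph: grounded if it was also in the retrieved context,
        # otherwise it was reached through traversal or cross-reference.
        if citation in retrieved_ids:
            return CitationStatus.GROUNDED
        return CitationStatus.KNOWN_CROSS_REFERENCE
    if citation in known_source_ids:
        # A real source that the knowledge graph did not supply.
        return CitationStatus.PRE_TRAINING_DATA
    return CitationStatus.HALLUCINATED


# Placeholder identifiers, for illustration only.
graph_ids = {"reg-1", "reg-2"}
retrieved_ids = {"reg-1"}
known_source_ids = {"reg-9"}

for cited in ["reg-1", "reg-2", "reg-9", "reg-x"]:
    status = classify_citation(cited, graph_ids, retrieved_ids, known_source_ids)
    print(cited, "->", status.value)
```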


Domain 1: Fiat Fulmen (Atmospheric Chemistry)

Citadel gives field scientists the ability to interpret complex experimental conditions that previously took years to turn into publishable insights.

Fiat Fulmen is an atmospheric chemistry analysis platform. The platform helps researchers identify and characterize atmospheric oxidation events from direct sensor measurements.

This matters because Earth Systems research is an increasingly pressing need amongst research universities. 

Moreover, our domain-agnostic platform is compatible with almost any edge computing solution, by land, air, sea, or space. 

Evaluation

We did not have access to expert-annotated ground truth for atmospheric chemistry Q&A. Creating a benchmark equivalent to RuleSense would require atmospheric chemists to write and score expected answers, which was outside the scope of this project. Instead, we evaluated the components we could verify independently. 

The dataset contains 41,036 observations at 10-second intervals across 18 research flights, with 57 measured variables per observation. The Atmos AI assistant is grounded in a verified knowledge graph of gas-phase reactions and JPL-recommended rate constants.

When Atmos answers a question or interprets sensor measurements, it is essential that we: 1) verify the citations surfaced in the response against our knowledge graph, and 2) evaluate the domain's ability to produce reliable quantitative and temporal measurements.

Therefore, we tested whether including false statistics in knowledge graph nodes caused the AI to cite those values as facts (e.g. "CH2O increases +155%"), rather than leveraging the quantities computed at runtime through tool-calling. In 105 queries, intentional knowledge graph poisoning caused the AI to cite inaccurate statistical claims in 34% of responses. After removing all statistical claims from the graph and restricting nodes to species properties and verified rate constants, zero statistical hallucinations were observed from a sampling of outputs. 

This evaluation informed our roadmap, as we realized citation verification alone is insufficient. For Citadel to be truly reliable, it must also verify quantitative metrics in the response against tool-calling and time-step-dependent measurements.


Domain 2: Grant Defender

Grant Defender exists because the current regulatory environment creates real consequences for researchers who use certain terms in federal grant applications. 

Since the publication of three executive orders in 2025 directing agencies to review grants for specific kinds of speech, more than fifteen billion dollars in research funding has been frozen or cancelled.

Thus, we give universities the ability to see which terms in their writing may trigger review, and we use a carefully constructed, knowledge graph-backed domain to offer alternative phrasing that preserves the scientific meaning.

Our UI respects the voice of the writer. Grant Defender generates a paraphrase similarity score (BERTScore) for every suggested rewrite, and has built-in functionality to co-edit drafts, track and change their statuses, and create new ones.

Our goal is to protect science, and protect researchers' ability to do their work.

Detection Pipeline

The scanner draws on 381 sensitive terms compiled from PEN America's aggregated list of federal government restricted words, sourced from reporting by NYT, Reuters, Politico, Washington Post, ProPublica, and Science, and from federal agency guidance issued by CDC, USDA, DOD, DOE, FEMA, FDA, NASA, NCI, NSA, and NSF. 

Detection scanning is deterministically conducted against a knowledge graph of 854 nodes and 618 edges, providing instant and reproducible results.

Many sensitive terms appear as substrings inside legitimate scientific words, like "trans" in "transcriptome" and "gender" in "engendered." 

Therefore, a whitelist of 460 scientific phrases suppresses these false matches, so that phrases like "gene expression" are retained exactly, minimizing the risk of altering researchers' intent.

Each whitelist phrase was extracted by human review of false positives, augmented through a pipeline that enriched the false positives with their surrounding spans. In other words, we found false positives, and created a filter for scientific terminology surrounding them.

We did this using 88 grant abstracts: 32 real NIH grants and 56 synthetically generated grants.
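The interaction between substring matching and the whitelist can be illustrated with a small deterministic scanner. This is a sketch under stated assumptions rather than the production scanner: it assumes case-insensitive substring matching and treats a hit as suppressed when it falls entirely inside a whitelisted phrase. The function name, the two-term list, and the sample sentence are illustrative only.

```python
def scan(text, sensitive_terms, whitelist_phrases):
    """Return sorted (start, end, term) hits that no whitelist phrase covers."""
    lowered = text.lower()

    # Collect character spans protected by whitelisted scientific phrases.
    protected = []
    for phrase in whitelist_phrases:
        needle = phrase.lower()
        start = lowered.find(needle)
        while start != -1:
            protected.append((start, start + len(needle)))
            start = lowered.find(needle, start + 1)

    hits = []
    for term in sensitive_terms:
        needle = term.lower()
        start = lowered.find(needle)
        while start != -1:
            end = start + len(needle)
            # Suppress the hit if it lies entirely inside a protected span.
            covered = any(p_start <= start and end <= p_end
                          for p_start, p_end in protected)
            if not covered:
                hits.append((start, end, term))
            start = lowered.find(needle, start + 1)
    return sorted(hits)


sample = "The transcriptome data engendered new questions about gender differences."
hits = scan(sample, ["trans", "gender"], ["transcriptome", "engendered"])
for start, end, term in hits:
    print(term, "flagged at", (start, end), repr(sample[start:end]))
```

Because the scanner uses only string operations, the same input always yields the same hits, which is what allows a rewrite to be re-scanned reproducibly.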

Evaluation

During evaluation, the scanner detected 100% of injected terms.

When the scanner flags a sentence, the LLM (Gemma 3 27B) rewrites it to remove the sensitive term while preserving scientific meaning, and the rewrite is then re-scanned with the same deterministic scanner. 

Across 679 evaluated rewrites, 99.7% successfully removed the primary trigger term, and 98.8% removed all trigger terms in the sentence. The 1.2% that retained secondary triggers would require a second rewrite pass.

Replacing a flagged term succeeds only when the rewritten sentence conveys the same scientific intent as the original.

Semantic preservation is quantified using BERTScore, which measures meaning similarity through contextual embeddings rather than surface-level word overlap.

Across all evaluations, the average BERTScore for rewrites was 94.9%, meaning rewrites retained, on average, approximately 95% of their original meaning as measured by this metric.

The minimum observed score of 77.5% represented the most aggressive rewrite in the dataset, where the model had to restructure the sentence significantly to remove the sensitive term. 


Domain 3: Grant Audit Compliance 

The compliance domain links a hand-crafted knowledge graph of approximately 32,000 nodes and 38,000 edges to grant expenditure data.

The graph encodes three bodies of regulatory knowledge: the Code of Federal Regulations (CFR) for federal grant spending rules, the NIH Grants Policy Statement (NIHGPS) for NIH-specific compliance requirements, and historical audit failures from the Federal Audit Clearinghouse (FAC), filtered for university single audit findings from 2016 to 2025. The AI then triages every transaction against this corpus and cites the specific regulatory provisions that apply. 

Because regulatory information is normatively hierarchical, a graph is its natural shape. However, hand-crafting it required optimizations for its intended purpose. For instance, the electronic CFR (eCFR) was obtained through XML, and the NIHGPS was obtained via raw HTML. Both were cleaned, parsed, converted, and finally joined based on their textual cross-references. 

Afterwards, nodes were divided into child nodes according to their sub-parts and sections, to allow for maximum resolution. At this resolution, where the data are composed of a larger number of individually less representative parts, both the internal representation of the nodes and the retrieval behavior had to balance scale against machine interpretability. We achieved this balance by automatically resolving each retrieved child node up to its normative parent (e.g. a retrieved sub-section is resolved by the pipeline to the section that contains it). In turn, parent nodes were augmented to contain up to 15,000 characters from their underlying children. This felt like the best balance, as some nodes contained over 40,000 words.

These enhancements allowed us to evaluate the system's ability to retrieve subsections with nuanced legal interpretations, which felt more honest as an assessment.
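The two graph enhancements described above, resolving retrieved children to their parent and augmenting parents with child text, can be sketched as follows. The sketch assumes a simple child-to-parent lookup table and that the 15,000-character cap applies to the concatenated child text; the function names and node identifiers are hypothetical.

```python
MAX_CHILD_CHARS = 15_000  # cap on child text per parent, as stated in the text


def resolve_to_parents(retrieved_ids, parent_of):
    """Map each retrieved node to its normative parent, keeping rank order without duplicates."""
    resolved = []
    for node_id in retrieved_ids:
        # Nodes without a recorded parent resolve to themselves.
        parent_id = parent_of.get(node_id, node_id)
        if parent_id not in resolved:
            resolved.append(parent_id)
    return resolved


def augment_parent(parent_text, child_texts, limit=MAX_CHILD_CHARS):
    """Append child text to a parent's own text, truncating the child portion at the limit."""
    child_excerpt = "\n".join(child_texts)[:limit]
    return parent_text + "\n" + child_excerpt if child_excerpt else parent_text


# Placeholder identifiers, for illustration only.
parent_of = {"sec-1.a": "sec-1", "sec-1.b": "sec-1", "sec-2.a": "sec-2"}
print(resolve_to_parents(["sec-1.b", "sec-2.a", "sec-1.a", "sec-3"], parent_of))
```

Deduplicating after resolution means two sub-sections of the same provision occupy one slot in the context window instead of two, which is one plausible reason this design pairs well with a fixed top_k.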

Evaluation

Special thanks to Maia Kennedy, Shreshta Keta, Aman Kumar, and Erica Landreth for creating synthetic general ledgers and evaluation queries for the grant compliance domain. 

We devised a benchmark to test regulatory reasoning across 144 domain-specific queries in 4 categories: Adversarial (questions designed to provoke hallucination), Discovery (open-ended regulatory exploration), Risk (questions about audit risk for specific scenarios), and Allowability (questions about whether specific costs are permitted under federal rules).

These 144 queries were evaluated across 6 models, 4 embedding strategies, 2 system modes, and 16 hyperparameter configurations, producing 17,732 total evaluation jobs with over 30 metrics captured per query.

What These Metrics Mean

Answer Quality:

Semantic Keyword F1 (90.2%) measures whether the response contains the right regulatory concepts. Each query has expected keywords derived from expert reference answers, and this score reflects how well the model's answer covers them.

BERTScore F1 (0.826) measures the overall fluency and semantic similarity between the generated response and a reference answer, using contextual embeddings rather than exact word matching. We note that the top scoring models for this metric were generally not the most accurate. Rather, smaller models with lower token budgets scored higher because of their conciseness, not their performance.

Citation & Retrieval Quality:

Grounded (65.1%) is the percentage of citations in the response that were verified to exist in the knowledge graph and were present in the retrieved context. These are citations the system can prove it found through retrieval.

Known Cross-Reference (5.1%) is the percentage of citations that exist in the knowledge graph but were not in the directly retrieved set. The model cited a real provision it found through graph traversal or cross-reference rather than through the initial retrieval pass.

Pre-training Data (25.3%) is the percentage of citations that match real regulatory provisions but came from the model's training data rather than from the knowledge graph. The citation is real, but the model knew it before seeing our corpus.

Hallucinated (4.5%) is the percentage of citations that do not exist in the knowledge graph or in any known regulatory source. These are fabricated references.

Key Findings

Orchestration outperforms RAG across the board. The orchestrated pipeline reduces hallucination rates and improves grounding at every measured quality level, with an 8% improvement on hallucination and a 39% improvement on semantic keyword quality.

Embedding choice mattered much more than model choice. BGE-Large consistently outperformed LegalBERT, Nomic, and MiniLM across all models tested. 

LegalBERT hallucinated citations most frequently, apparently because it had overfit on the superficial legalistic language present across legal texts, with approximately 75% cosine similarity overlap across the regulatory index. In practical terms, LegalBERT was retrieving text that sounded legal rather than text that was relevant to the query.

The hallucination-quality relationship is monotonic: as answer quality increases, citation reliability improves.

Higher-quality answers are both more fluent and more grounded in the knowledge graph, which means quality metrics serve as early warning signals for citation reliability.

Gemma 3 (27B) + BGE-Large-v1.5 was selected for production over the best performing Semantic model, GPT-OSS:20B. Gemma 3 achieved higher grounding (44.3% vs 37.9%), lower hallucination (4.4% vs 5.0%), and higher BERTScore F1 (82.4% vs 81%). 

At the deployed top_k=12 setting, Gemma 3 reached 65.1% grounded citations with 4.5% hallucination, which was the highest grounding rate of any configuration tested. 

In science or compliance, citation verifiability outweighs answer fluency. That is, a grounded citation a user can check is worth more than a fluent answer they cannot trace back to a source of truth.

Transaction Risk Triage

14,624 simulated transactions were evaluated using a two-stage pipeline. First, the orchestrated LLM agent evaluates each transaction against the knowledge graph. Second, the system uses "Autolearned" lenses to create a hybrid bridge between rules-based and semantic detection. 

AutoLearning is a roadmap feature that analyzes the gap between the LLM's predictions and known violations using chi-squared tests. A chi-squared test measures whether a pattern, like a specific word appearing in transaction descriptions, shows up more often in violations than you would expect by chance. If a word appears 157 times in missed violations and zero times in compliant transactions, that is not a coincidence. The test quantifies exactly how unlikely that distribution is under randomness, and patterns exceeding chi² > 50 (p < 0.0001) become AutoLearning "lenses" that run alongside the LLM.
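To make the test concrete, here is a minimal worked sketch of a chi-squared test on a 2x2 table (word present or absent, violation or compliant). It uses the uncorrected Pearson statistic, which is an assumption, since the text does not say which variant was used. The 157 and 0 counts come from the example above; the remaining cell counts are hypothetical and chosen only to complete the table.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom) for a 2x2 table.

    a: violations containing the word      b: violations without it
    c: compliant rows containing the word  d: compliant rows without it
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom, the chi-squared survival function reduces to erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


LENS_THRESHOLD = 50  # threshold stated in the text

# 157 and 0 are from the text; 343 and 9500 are hypothetical totals.
chi2, p_value = chi_squared_2x2(a=157, b=343, c=0, d=9500)
print(f"chi2={chi2:.1f}, p={p_value:.3g}, becomes lens: {chi2 > LENS_THRESHOLD}")
```

For one degree of freedom, a statistic of 50 already corresponds to a p-value far below 0.0001, so the threshold in the text is the stricter of the two conditions.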

The system is calibrated to flag aggressively, because the cost of an extra review is trivial compared to the cost of a missed violation reaching external audit. 

Of 111 synthetically generated grant descriptions, Citadel detected non-compliant transactions 74.4% of the time. 


Domain N: Bring Your Own Graph (BYOG) & Build Your Own Domain

Anyone can upload a knowledge graph (in GraphML format) and a spreadsheet or JSON file. Field mappings join them. The platform then produces a working analytical workspace with predefined risk triage, AI assistance, knowledge graph inspection, and portfolio analytics capabilities. Three quickstart templates are included to demonstrate the workflow.

The scaffold wizard is what makes Citadel a platform rather than a collection of applications. If the architecture only worked for three hand-picked domains, it would be three applications sharing a codebase. The user-scaffolded curriculum domain, running alongside the three primary domains, demonstrates that any ontology can be deployed without code changes.


Evaluation Philosophy

Every metric on this project has a system card behind it. System cards are published at fiatfulmen.com and contain the evaluation methodology, data sources, known limitations, AI tool specifications, and roadmap for each domain.

Where we have ground truth, we report grounding rates, hallucination rates, and precision-recall metrics across thousands of evaluation jobs. The compliance domain reports results across 17,732 evaluation jobs with over 30 metrics per query. 

Grant Defender reports detection recall across 263 injected terms and rewrite quality across 679 evaluated rewrites. Where we do not have ground truth, we evaluate what we can verify independently and scope the claims accordingly. 

Fiat Fulmen's system card contains details regarding evaluations where the knowledge graph was statistically poisoned, finding that ~34% of queries would produce the erroneous statistics. 

Graph Composition & Answer Quality

We asked: does where a regulatory node sits in the graph affect whether our embeddings retrieve it first? We took 1,440 query-node pairs and measured seven graph features, including how deep in the hierarchy a node is, how many citations it has, and how long its text is. We then ran Spearman correlation against retrieval rank for two embeddings: BGE-Large and LegalBERT.

BGE-Large showed three statistically significant signals: it slightly favors deeper, more specific sections; slightly favors nodes that audit findings cite; and slightly penalizes long text. LegalBERT showed nothing significant at all. The key point is that even BGE-Large's significant effects are slight, which implies the embeddings are responding to semantic content rather than to where a node lives in the graph. This is what we wanted from our context enrichment pipeline, which automatically resolves small subsections to their parent provision and ensures parents carry the written context of their children. It suggests our retrieval is not biased by structural artifacts.
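For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation computed on ranks, so it detects any monotonic relationship between a graph feature and retrieval rank. The sketch below is a dependency-free illustration with made-up numbers. It does not reproduce the analysis above, and it omits the significance test.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up values: node depth in the hierarchy vs. retrieval rank (1 = retrieved first).
node_depth = [1, 2, 2, 3, 4, 5]
retrieval_rank = [9, 7, 8, 4, 5, 2]
print(round(spearman(node_depth, retrieval_rank), 3))
```

A negative rho in this toy setup would mean deeper nodes tend to be retrieved earlier, which is the direction of the "favors deeper sections" signal described above.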

Model Linguistic Fingerprints

This radar chart gives each LLM a linguistic fingerprint. Six axes measure different aspects of how each model writes: length, vocabulary, how often it hedges with words like 'may' or 'could,' how often it cites specific knowledge graph provisions, and so on. A model with a tight polygon writes short and vague. One with a wide, specific shape is writing longer, more precise regulatory language. This tells you whether models differ in style, which matters when you're building an auditor that needs to cite exactly. We find that OpenAI's model tended to provide the most confident and generic responses, while models developed by Chinese companies provided the most succinct but inherently uncertain responses. Google's model provided the best overall balance, compromising on length while favoring answer quality.

Answer Quality vs. Latency Pareto Frontier for Best Model Configuration Selection 

This is a latency-quality Pareto chart. Every data point is a model-embedding-mode configuration. The X axis is how long it takes per response; the Y axis is overall composite quality. The dashed line is the Pareto frontier. Any point on it is optimal for its speed. Points below-right of the line are dominated because something faster and better exists. This told us directly which configurations were worth deploying, and what the relative latency profiles between them would likely look like.
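The frontier itself is straightforward to compute: sort configurations from fastest to slowest and keep each one that beats the best quality seen so far. The sketch below uses made-up configuration names and numbers, not results from our evaluation.

```python
def pareto_frontier(configs):
    """Return the configs that no faster-or-equal config matches or beats on quality.

    configs: list of (name, latency_seconds, quality) tuples.
    Lower latency is better; higher quality is better.
    """
    frontier = []
    best_quality = float("-inf")
    # Fastest first; among equal latencies, highest quality first.
    for name, latency, quality in sorted(configs, key=lambda c: (c[1], -c[2])):
        if quality > best_quality:
            frontier.append((name, latency, quality))
            best_quality = quality
    return frontier


# Made-up configurations, for illustration only.
configs = [
    ("config-a", 1.0, 0.60),
    ("config-b", 2.0, 0.80),
    ("config-c", 3.0, 0.70),  # dominated: config-b is faster and better
    ("config-d", 4.0, 0.90),
]
for name, latency, quality in pareto_frontier(configs):
    print(name, latency, quality)
```

Note that this sketch keeps only the first of two configurations with identical latency and quality; a chart would draw them as the same point.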


Architecture

A YAML configuration file tells the platform how to map any dataset onto the shape every Citadel domain shares. The platform reads that configuration and produces a working analytical workspace, including an inbox for risk triage, case detail views with event timelines, an AI assistant with citation verification, portfolio-level analytics, and a knowledge graph explorer with semantic search and neighborhood expansion. Every DomainObject receives these capabilities automatically, meaning no code changes are required to deploy new ones.


Roadmap

AutoLearning allows a domain to automatically discover patterns in operational data that predict the target field, without model training. It uses a chi-squared framework, which we demonstrated in the compliance domain, as the foundation. By providing tooling for customer-defined evaluations, users can benchmark their knowledge graphs directly through Citadel's API. 

Cross-domain reasoning will enable knowledge graph transfer learning between ontological domains. For example, newly written grant abstracts can be analyzed for compliance risks by crossing Grant Defender's unstructured data with the compliance domain's knowledge graph of historical audit failures, surfacing risk profiles that neither domain could produce alone.

User-defined agentic tool-calling will let domain operators define their own data retrieval functions that the AI can call during analysis. This extends our library per domain without modifying platform code. 

Workspace plugins will connect Citadel to existing ERP systems so that operational data streams directly into the platform. 

Customer-managed encryption will let customers manage the encryption keys for sensitive data at rest, and even in transit.


References

Fiat Fulmen

Brune, W. H., McFarland, P. J., Bruning, E., Waugh, S., MacGorman, D., Miller, D. O., Jenkins, J. M., Ren, X., Mao, J., & Peischl, J. (2021). Extreme oxidant amounts produced by lightning in storm clouds. Science, 372(6543), 711–715. https://doi.org/10.1126/science.abg0492

Brune, W. H., Ren, X., Zhang, L., Mao, J., Miller, D. O., Jenkins, J. M., Anderson, B. E., Diskin, G. S., Huey, L. G., Liu, S., Ryerson, T. B., Weinheimer, A. J., Wisthaler, A., Mikoviny, T., & Cubison, M. (2018). Atmospheric oxidation in the presence of clouds during DC3. Atmospheric Chemistry and Physics, 18(20), 14493–14510. https://doi.org/10.5194/acp-18-14493-2018

Jenkins, J. M., & Brune, W. H. (2025). Spatially separate production of HOx and NO in lightning. Atmospheric Chemistry and Physics, 25, 5041–5052. https://doi.org/10.5194/acp-25-5041-2025

Burkholder, J. B., Sander, S. P., Abbatt, J., Barker, J. R., Cappa, C., Crounse, J. D., Dibble, T. S., Huie, R. E., Kolb, C. E., Kurylo, M. J., Orkin, V. L., Percival, C. J., Wilmouth, D. M., & Wine, P. H. (2019). Chemical kinetics and photochemical data for use in atmospheric studies (JPL Publication 19-5). Jet Propulsion Laboratory. https://jpldataeval.jpl.nasa.gov

Grant Defender

PEN America. (2025). Federal government's growing banned words list. https://pen.org/banned-words-list

Levy, R. (2025, January 28). FDA staffers told that 'woman,' 'disabled' among banned words. Reuters. https://www.reuters.com/business/healthcare-pharmaceuticals/fda-staffer…

Grant Audit Compliance

U.S. Government Accountability Office. (2025). NIH grants: Oversight of questioned costs (GAO-25-107362). https://www.gao.gov/products/gao-25-107362

Uniform Administrative Requirements, Cost Principles, and Audit Requirements for Federal Awards, 2 C.F.R. § 200 (2026). https://www.ecfr.gov/current/title-2/subtitle-A/chapter-II/part-200

National Institutes of Health. (2026, March). NIH grants policy statement (Rev. March 2026). U.S. Department of Health and Human Services. https://grants.nih.gov/policy-and-compliance/nihgps

National Institutes of Health. (n.d.). NIH RePORTER: Research portfolio online reporting tools. U.S. Department of Health and Human Services. https://reporter.nih.gov/

Federal Audit Clearinghouse. (n.d.). Audit findings database. U.S. General Services Administration. https://www.fac.gov/


Acknowledgements

I'd like to thank all the friends and family who provided me with support and feedback throughout the wild process of bringing Citadel to life. Thank you!!!

Last updated: April 22, 2026