complyraAI
A compliance intelligence platform that helps university researchers prepare for audits with confidence. By auditing transaction-level spending and evaluating grant-specific risks, it surfaces issues early, strengthens defensibility, and reduces the likelihood of audit findings or funding loss.
Problem, Motivation, and Product Description
Compliance audits are a central part of how university researchers account for and maintain federal grant funding; noncompliance issues can lead to funding freezes or clawbacks. For example, between 2021 and 2023, independent audits found $9.3M in questioned costs for NIH grant awardees, $3.3M of which had to be paid back by the researchers (GAO-25-107362). Though large universities may provide support from an internal audit team, it is often the responsibility of principal investigators and research assistants to ensure the compliance of their spending and research practices with federal regulations. Since researchers aren’t typically experts in auditing, they may turn to external resources. But there has historically been no purpose-built tool for researchers: regulatory documents and blogs are laden with legal jargon, large language models provide generic or untrustworthy responses, and hiring an external auditor can be prohibitively expensive.
This is where complyraAI comes in. Reading the name backwards, it’s an AI tool for research assistants (RAs) to comply with federal rules and prepare for audits with confidence. By auditing transaction-level spending and evaluating grant-specific risks, it surfaces issues early, strengthens defensibility, and reduces the likelihood of audit findings and funding loss. The tool aims to bridge the gap between a researcher’s funding knowledge and that of an audit professional by translating legal jargon into understandable terms and providing comprehensive risk assessment.
The product includes two modules: Ledger Transaction Audit and Grant Risk Evaluator.
The Ledger Transaction Audit evaluates expense sheets at the line item level, flags noncompliance risks, and provides feedback for reasoning. Key features include document upload and result export, human-in-the-loop approvals and note-taking, and linked regulatory citations. See the demo in the sidebar.
The Grant Risk Evaluator calculates a risk level and recommends mitigation strategies for a specific NIH grant ID by linking grant features with past cases of noncompliance. Key features include pulling grant data directly from the official NIH website, both deterministic and predictive methods of risk calculation, and grant-specific feedback. See the demo in the sidebar.
Data Sources & Data Science Approach
We leveraged a variety of data sources to characterize grant content, regulatory context, and historical audit findings:
- NIH RePORTER: Provides information about NIH grant projects including researcher information, budget, and research topic
- Code of Federal Regulations (CFR): Legal compliance rulebook for federal grant spending in general
- OMB R&D Compliance Supplement: Provides guidance on internal control practices for research
- NIH Grants Policy Statement (GPS): Provides NIH-specific compliance rules
- Federal Audit Clearinghouse (FAC): Historical audit results, including questioned costs and corrective actions in cases of noncompliance
Synthetic Ledger Generation
The classifier model underlying the Ledger Transaction Audit module was trained on ground truth ledger data. Since procuring real ledgers from universities is difficult due to privacy considerations, we generated our own using large language models through a multi-step process. First, we prompted Gemini Deep Research to gather relevant information about a real NIH grant, including the abstract, budget, and research topic. We then instructed Claude or Gemini Thinking models to construct a comprehensive, compliant ledger from the Deep Research report. This gave us a set of clean ledgers. In parallel, we developed a corpus of noncompliant expenses, based on our exploratory analysis of both the CFR rules and historical FAC findings. Finally, we prompted Claude or Gemini Thinking models to incorporate a sample of noncompliant expenses into the clean ledgers, to generate the poisoned ledgers. We used this process to generate 54 poisoned ledgers with 7,402 total expenses, 679 of which were injected noncompliances.
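The final injection step can be illustrated with a small sketch. Note that this swaps in plain random sampling for the LLM prompting described above, purely to show the shape of the data; the function name, field names, vendors, and labels are all hypothetical.

```python
import random

def poison_ledger(clean_rows, noncompliant_pool, n_inject, seed=0):
    """Return a new ledger with n_inject noncompliant rows at random positions.

    Assumption: this uses random sampling, not the LLM-based incorporation
    the project actually used.
    """
    rng = random.Random(seed)
    injected = rng.sample(noncompliant_pool, n_inject)
    # Copy each clean row and label it, so the inputs are left untouched
    ledger = [dict(row, label="No Violation") for row in clean_rows]
    for row in injected:
        position = rng.randint(0, len(ledger))  # inclusive; len(ledger) appends
        ledger.insert(position, dict(row))
    return ledger

# Hypothetical example rows
clean = [
    {"vendor": "Acme Lab Supply", "type": "Supplies", "description": "Pipette tips"},
    {"vendor": "University Payroll", "type": "Personnel", "description": "Postdoc salary, March"},
]
pool = [
    {"vendor": "Local Wine Shop", "type": "Supplies",
     "description": "Wine for lab celebration", "label": "Alcoholic beverages"},
    {"vendor": "Event Co", "type": "Other",
     "description": "Holiday party catering", "label": "Entertainment costs"},
]
poisoned = poison_ledger(clean, pool, n_inject=1)
```

For scale, the reported totals (679 injected rows out of 7,402) mean roughly 9% of all expenses in the poisoned ledgers are noncompliant, which is why class imbalance matters in the modeling step.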
Modeling: Ledger Transaction Audit Module
Under the hood, the Ledger Transaction Audit Module is powered by a multiclass classifier based on a BERT-style model. The input consists of the vendor, transaction type, and description from an individual expense. The BERT model encodes this text into embedding feature vectors, and uses a sequence of fully-connected layers to determine the most probable class. This output maps to either “No Violation” or a specific section of the CFR rules that the line item is at risk of violating.
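As a rough illustration of the input and output ends of this pipeline, the sketch below joins the three expense fields into one text string and turns a vector of class logits into a predicted label. The separator, the label names, and the logits are assumptions made for the example; the encoder and fully-connected layers themselves are not shown.

```python
import math

# Hypothetical label set; the real classes map to specific CFR sections
LABELS = ["No Violation", "CFR section A", "CFR section B"]

def format_expense(vendor, transaction_type, description):
    """Join the three input fields into a single string for the encoder."""
    return f"{vendor} | {transaction_type} | {description}"

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the most probable class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

text = format_expense("Event Co", "Other", "Holiday party catering")
label, confidence = predict_label([0.2, 2.5, -1.0])  # made-up logits
```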
During development, we experimented with three base embedding models. Our baseline, MiniLM, is a task-agnostic sentence encoder. LegalBERT and FinBERT are BERT models that were trained on legal and financial data, respectively. In addition, we experimented with introducing a weighting scheme into the cross-entropy loss function, with weights inversely proportional to class frequency so the model pays more attention to under-represented classes.
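A minimal sketch of the weighting idea follows. It assumes the common "balanced" heuristic, weight = n / (k × class count), and normalizes the loss by the sum of the weights; the exact formula used in the project is not stated, so treat both choices as assumptions.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs: one dict per example, mapping class name to predicted probability.
    """
    numerator = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator

# Toy data: 9 compliant rows and 1 violation
labels = ["No Violation"] * 9 + ["Violation"]
weights = inverse_frequency_weights(labels)

# A model that always predicts 50/50 gets loss log(2) regardless of weights
uniform = [{"No Violation": 0.5, "Violation": 0.5}] * len(labels)
loss = weighted_cross_entropy(uniform, labels, weights)
```

With this toy split, the single violation row carries a weight of 5.0 against roughly 0.56 for each compliant row, so a mistake on the rare class contributes about nine times as much to the loss.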
Modeling: Grant Risk Evaluator Module
This module leverages a three-layer architecture. The entire engine is grounded in the CFR regulations, NIH GPS, and the OMB Compliance Supplement.
- Rules & Feature engine: Evaluates structured fields, such as award size, indirect cost ratio, and number of PIs, against predefined risk tiers that can be adjusted by the user
- Multi-level encoder model: Uses distillation techniques from frontier GPT models to analyze the grant title and abstract text, surfacing nuanced risk areas that aren’t captured by rules alone
- Prioritization layer: Fuses rules-based and encoder-based signals to produce a holistic risk assessment
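To make the first and third layers concrete, here is a hedged sketch of tier-based rule scoring and a simple fusion step. The tier thresholds, scores, field names, and the weighted-average fusion are all illustrative assumptions, not the project's actual values or method.

```python
# Hypothetical tiers as (upper bound, risk score); thresholds are illustrative
AWARD_SIZE_TIERS = [(250_000, 0.2), (1_000_000, 0.5), (float("inf"), 0.8)]
INDIRECT_RATIO_TIERS = [(0.30, 0.2), (0.55, 0.5), (float("inf"), 0.8)]

def tier_score(value, tiers):
    """Return the score of the first tier whose upper bound covers the value."""
    for upper_bound, score in tiers:
        if value <= upper_bound:
            return score
    raise ValueError(f"no tier covers value {value!r}")

def rules_score(grant):
    """Score each structured field and keep the worst (highest) tier score."""
    return max(
        tier_score(grant["award_size"], AWARD_SIZE_TIERS),
        tier_score(grant["indirect_ratio"], INDIRECT_RATIO_TIERS),
    )

def fuse(rule_score, encoder_score, rule_weight=0.5):
    """Blend the two signals with a weighted average (an assumed fusion rule)."""
    return rule_weight * rule_score + (1 - rule_weight) * encoder_score

grant = {"award_size": 1_500_000, "indirect_ratio": 0.25}
overall = fuse(rules_score(grant), encoder_score=0.4)
```

Keeping the tiers as plain data is one way to support the user-adjustable thresholds mentioned above: changing a tier means editing a list, not the scoring logic.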
In the development of the deterministic risk engine, we developed a grant compliance risk ontology, which includes:
The multi-level encoder layer is a fine-tuned DistilRoBERTa model with a sigmoid multi-label head. We compared it against a baseline encoder, DeBERTa-v3-small, and against a separate TF-IDF plus one-vs-rest logistic regression baseline.
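A sigmoid multi-label head scores each tag independently, so a grant can receive zero, one, or several tags. The sketch below shows only that decision step; the tag names, logits, and the 0.5 threshold are assumptions for illustration.

```python
import math

# Hypothetical compliance tags; the real ontology is not reproduced here
TAGS = ["tag_a", "tag_b", "tag_c"]

def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_tags(logits, threshold=0.5):
    """Keep every tag whose independent probability clears the threshold."""
    return [tag for tag, z in zip(TAGS, logits) if sigmoid(z) >= threshold]

predicted = predict_tags([2.0, -1.5, 0.3])  # made-up logits
```

Unlike a softmax, the per-tag probabilities here do not need to sum to one, which is what allows multiple tags to fire for the same grant.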
Evaluation
Ledger Transaction Audit Module
The key evaluation metric for the module was the F2 score, which weights recall more heavily than precision: missed noncompliance risks (false negatives) are penalized more than false alarms (false positives). Below is a comparison of the test set F2 score for our experimental cases. The left two bar groups show the results for the “No Violation” class and the right two for the violation classes, for both the unweighted and weighted loss cases. For our application, performance on the true violations matters most, and the weighted models perform best in this regard. In particular, the weighted FinBERT model has the strongest performance, with an F2 score of 0.76. This is the model that powers our MVP.
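The effect of choosing F2 over F1 can be seen in a short worked example using the standard F-beta definition. The counts below are made up and are not the project's results.

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta score from raw counts; beta > 1 favors recall over precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two made-up models that find 8 true violations each
high_recall = fbeta(tp=8, fp=4, fn=2)     # more false alarms, fewer misses
high_precision = fbeta(tp=8, fp=2, fn=4)  # fewer false alarms, more misses

# F1 (beta=1) cannot tell the two models apart
f1_a = fbeta(tp=8, fp=4, fn=2, beta=1.0)
f1_b = fbeta(tp=8, fp=2, fn=4, beta=1.0)
```

Here the two models tie on F1, but F2 ranks the high-recall model higher (about 0.77 versus 0.69), matching the stated preference for catching violations over avoiding false alarms.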
The model still tends to over-predict the more common violation classes. To address this, we’d like to explore further sampling and weighting schemes, or data augmentation, in future work.
Grant Risk Evaluator Module
Grant compliance tags were treated as a multilabel prediction problem and scored with Micro F1 (pooled F1 over all grant-label pairs), so the metric reflects overall tagging quality across labels jointly. On a 200-example holdout, with evaluation restricted to labels that appear at least once in a 796-example training split, DeBERTa-v3-small with a sigmoid multilabel head achieved Micro F1 = 0.678. DistilRoBERTa with the same head achieved 0.731, an absolute improvement of 0.053 and a 7.8% gain relative to the DeBERTa baseline. For context, a TF-IDF + one-vs-rest logistic regression baseline reached 0.722 (1.2% below DistilRoBERTa on a relative basis).
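For readers less familiar with pooled metrics, the sketch below computes Micro F1 over toy label sets and re-derives the relative differences quoted above from the reported scores. The toy labels are hypothetical; only the three scores come from the evaluation.

```python
def micro_f1(true_sets, pred_sets):
    """Pool true/false positives and false negatives over all grant-label pairs."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)
        fp += len(pred - true)
        fn += len(true - pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def relative_difference(value, reference):
    """Difference between two scores as a fraction of the reference score."""
    return (value - reference) / reference

# Toy example with hypothetical labels: 3 hits, 1 false alarm, 1 miss
toy_score = micro_f1(
    [{"a", "b"}, {"b"}, {"c"}],
    [{"a"}, {"b", "c"}, {"c"}],
)

# Reported scores from the evaluation
deberta, distilroberta, tfidf = 0.678, 0.731, 0.722
gain_over_deberta = relative_difference(distilroberta, deberta)
tfidf_shortfall = relative_difference(distilroberta, tfidf) * tfidf / distilroberta
```

The last line expresses the TF-IDF gap relative to DistilRoBERTa, (0.731 − 0.722) / 0.731, which is the basis used for the 1.2% figure.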
Qualitatively, we also benchmarked the system against common generative AI workflows that might be used by research assistants. Using Google Gemini, outputs tended to surface broad, generic compliance areas without grant-specific grounding. When ChatGPT was questioned about NIH data access, the response was a confusing “Yes–I don’t have direct/live integration with NIH RePORTER, but I can absolutely access and use the same information it contains.” This highlights a lack of direct integration with sources like NIH RePORTER and a general risk of hallucination or nonsensical responses. In contrast, the complyraAI module is purpose-built to resolve and analyze real grant data, delivering precise, context-aware compliance insights.
Key Learnings & Impact
Through this project, we contributed three key innovations to the field:
- Ledger Transaction Audit Module: Leveraged classical NLP techniques to design a classifier specific to the grant spending application
- Grant Risk Evaluator Module: Distilled frontier model insights to train an LLM classifier, while creating a unified ontology for risk mapping
- Researcher-focused design: Modeling approach and UI/UX designed to target the grant compliance risk problem from the researcher point of view
The power of complyraAI lies in its ability to support researchers through a human-in-the-loop architecture. We hope that this tool enables researchers to proactively mitigate compliance risks in order to reduce funding freezes and clawbacks, keeping research funding in the hands of researchers.
Acknowledgements
We’d like to thank our instructors Joyce Shen and Dr. Zona Kostic for feedback, encouragement, and support throughout the project, and the 210 TA team and Robert Wang from AWS for their technical guidance. We’re also grateful to Dr. Alex Hughes, Dr. Cornelia Paulik, Michael Lecaroz, and an anonymous research assistant for feedback on our ideas and prototypes, and to our Capstone peers for their support, encouragement, and inspiration. In particular, we’d like to thank Jonathan Hernandez, a key collaborator during the early stages of the project.
