Court Logic
Court Logic — AI-Powered Supreme Court Deliberation Simulator
Problem & Motivation
The United States Supreme Court has shaped the trajectory of American law since its establishment in 1789, yet its decision-making process remains opaque to the general public and even to experienced legal professionals. Oral arguments can run for hours and range across competing judicial philosophies, and written opinions are often lengthy. The result is a widening gap between the Court's output and public comprehension.
Many existing legal prediction tools treat the US Supreme Court as a black box: they ingest case metadata and output little more than a binary prediction. They rarely explain why justices might rule a certain way, how coalition dynamics could shift the outcome, or which doctrinal tensions might drive a split decision. As a result, to our knowledge there is no truly interactive system that lets users pose a constitutional question and receive a structured, multi-perspective simulation of how the current bench would deliberate. Court Logic is our attempt to address this gap directly.
Our primary target users include law students and educators seeking interactive tools to explore constitutional reasoning, legal researchers and practitioners looking to quickly simulate how the current bench might deliberate on emerging issues, and civic-minded professionals who want to understand Supreme Court dynamics without wading through hundreds of pages of legal opinions. By combining generative AI with retrieval-augmented generation grounded in real judicial writing, Court Logic bridges the gap between raw legal data and actionable, multi-perspective insight.
Data Source & Data Science Approach
Court Logic is a multi-agent AI platform that simulates Supreme Court oral arguments in real time. Users submit any legal question or case description and receive a full adversarial debate, either between ideologically distinct coalitions (Liberal versus Conservative) or among custom personas they select. A Chief Justice agent then presides over the proceedings, summarizes the arguments across all rounds, and delivers a structured, court-style opinion with vote tallies, majority reasoning, concurrences, and dissents.
The system is built on a retrieval-augmented generation (RAG) framework using a curated corpus of Supreme Court opinions sourced from CourtListener, including majority, concurring, and dissenting writings.
Each “justice” is modeled as an AI agent conditioned on judicial philosophy and historical reasoning patterns. These agents engage in a structured, multi-round debate, where each argument is grounded in retrieved legal context.
Multi-Agent Judicial Reasoning Framework
Court-Logic simulates Supreme Court deliberation using a structured, multi-agent architecture. Each agent embodies a justice-aligned perspective and participates in constrained, multi-round debate, producing adversarial yet coherent legal reasoning rather than single-pass outputs.
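The multi-round debate described above can be sketched as a simple loop. This is a minimal illustration, not the project's implementation: `JusticeAgent`, `Turn`, `run_debate`, and the stub retriever and responders are all hypothetical names, and the LLM call that would normally produce each argument is replaced by a plain function so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    round_no: int
    speaker: str
    argument: str


@dataclass
class JusticeAgent:
    name: str
    philosophy: str
    # respond(question, retrieved_context, transcript_so_far) -> argument text.
    # Assumption: a real system would wrap an LLM call here.
    respond: Callable[[str, List[str], List[Turn]], str]


def run_debate(question: str,
               agents: List[JusticeAgent],
               retrieve: Callable[[str], List[str]],
               rounds: int = 2) -> List[Turn]:
    """Run a fixed number of rounds; every agent speaks once per round."""
    transcript: List[Turn] = []
    for round_no in range(1, rounds + 1):
        for agent in agents:
            # Ground each turn in passages retrieved for this agent's perspective.
            context = retrieve(f"{question} {agent.philosophy}")
            argument = agent.respond(question, context, transcript)
            transcript.append(Turn(round_no, agent.name, argument))
    return transcript


# Stubs standing in for the retriever and the LLM-backed agents.
def stub_retrieve(query: str) -> List[str]:
    return ["[retrieved passage placeholder]"]


def make_stub_responder(label: str):
    def respond(question: str, context: List[str], transcript: List[Turn]) -> str:
        return (f"{label}: view on '{question}' citing {len(context)} passage(s) "
                f"after {len(transcript)} prior turn(s)")
    return respond


agents = [
    JusticeAgent("Agent A", "textualism", make_stub_responder("A")),
    JusticeAgent("Agent B", "living constitutionalism", make_stub_responder("B")),
]
transcript = run_debate("Is the statute constitutional?", agents, stub_retrieve, rounds=2)
for turn in transcript:
    print(turn.round_no, turn.speaker, "|", turn.argument)
```

Passing the running transcript into every turn is what makes the exchange adversarial rather than single-pass: each agent can respond to what was said before it.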
Hybrid RAG with GraphRAG (Beta)
The system integrates Retrieval-Augmented Generation (RAG) with a GraphRAG knowledge graph to capture relationships across cases, doctrines, and judicial opinions. This hybrid approach enables deeper contextual retrieval and more faithful reasoning over precedent. GraphRAG is currently in beta.
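One way to picture the hybrid approach is a two-step retrieval: rank passages by similarity, then expand the top hits through graph edges such as "cites" or "applies the same doctrine". The sketch below is an assumption-laden illustration. It uses word overlap as a stand-in for embedding similarity, a plain dictionary as a stand-in for the knowledge graph, and placeholder passage ids; none of these names or data come from the project.

```python
from typing import Dict, List, Set


def lexical_score(query: str, text: str) -> float:
    """Stand-in for embedding similarity: share of query words found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def hybrid_retrieve(query: str,
                    passages: Dict[str, str],
                    graph: Dict[str, Set[str]],
                    k: int = 2,
                    hops: int = 1) -> List[str]:
    """Return the top-k passage ids plus their graph neighbors within `hops` steps."""
    ranked = sorted(passages,
                    key=lambda pid: lexical_score(query, passages[pid]),
                    reverse=True)
    selected = ranked[:k]
    frontier = list(selected)
    for _ in range(hops):
        next_frontier: List[str] = []
        for pid in sorted(frontier):
            for neighbor in sorted(graph.get(pid, set())):
                if neighbor in passages and neighbor not in selected:
                    selected.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return selected


# Placeholder corpus and graph (hypothetical ids and text).
passages = {
    "op1": "free speech limits on public school students",
    "op2": "search and seizure of digital records",
    "op3": "doctrine of strict scrutiny for content based speech restrictions",
}
graph = {"op1": {"op3"}}  # op1 is linked to op3, e.g. by citation

result = hybrid_retrieve("student free speech", passages, graph, k=1, hops=1)
print(result)  # ['op1', 'op3']
```

Here `op3` scores lower than `op1` on similarity alone, but it is still surfaced because the graph links it to the top hit. That is the kind of relationship-aware retrieval the hybrid design aims for.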
Grounded in Real Legal Data (CourtListener)
All reasoning is anchored in a curated corpus derived from CourtListener, including majority opinions, dissents, and concurrences. This ensures outputs reflect authentic legal language, structure, and argumentative patterns.
Rigorous, Multi-Dimensional Evaluation
We evaluate system performance beyond surface-level text quality using four core metrics:
- Outcome Accuracy (vote prediction)
- Ideology Alignment (consistency with judicial philosophy)
- Key-Point Coverage (completeness of legal reasoning)
- Realism Score (LLM-as-judge assessment of judicial quality)
This framework combines quantitative evaluation (RAGAS) with qualitative judgment, aligning closely with real-world legal reasoning standards.
Explainable, Decision-Grade Outputs
Court-Logic prioritizes transparency by generating structured debates, final votes, and majority opinions. This transforms the system from a black-box predictor into an interpretable reasoning engine suitable for legal education, analysis, and decision support.
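As an illustration of what a structured, inspectable output could look like, the sketch below tallies individual votes into a decision record. The `Vote` type, the `summarize` function, and the field names are hypothetical, and treating the most common disposition as the decision is a simplifying assumption for a two-way affirm/reverse vote.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Vote:
    justice: str
    disposition: str  # e.g. "affirm" or "reverse"
    rationale: str


def summarize(votes: List[Vote]) -> Dict[str, object]:
    """Turn individual votes into a decision, a tally, and majority/dissent lists."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(v.disposition for v in votes)
    ranked = counts.most_common()
    tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
    if tied:
        return {"decision": "no majority", "tally": dict(counts),
                "majority": [], "dissent": []}
    decision = ranked[0][0]
    return {
        "decision": decision,
        "tally": dict(counts),
        "majority": [v.justice for v in votes if v.disposition == decision],
        "dissent": [v.justice for v in votes if v.disposition != decision],
    }


votes = [
    Vote("Agent A", "reverse", "statutory text controls"),
    Vote("Agent B", "affirm", "precedent supports the lower court"),
    Vote("Agent C", "reverse", "agency exceeded its authority"),
]
summary = summarize(votes)
print(summary["decision"], summary["tally"])
```

Keeping each vote's rationale next to the tally is what lets a reader trace the final decision back to individual arguments.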
Evaluation
Court-Logic is evaluated using a dual-framework approach designed to measure both predictive performance and quality of legal reasoning. Rather than relying on a single metric, we assess the system across multiple dimensions aligned with how judicial decisions are actually formed and evaluated.
Evaluation Design
We construct a 13-case gold-standard dataset from the 2024–2025 Supreme Court term, using publicly available opinions from CourtListener. Each case includes ground-truth outcomes, ideological leanings, and key legal arguments.
The system is evaluated over 22 experimental runs, iterating on retrieval strategy, chunking, embedding models, and prompt design.
Core Metrics
We evaluate performance across four dimensions:
- Outcome Accuracy – Measures whether the system correctly predicts the final court decision (affirm/reverse).
- Ideology Alignment – Assesses whether generated opinions reflect expected judicial philosophy and voting patterns.
- Key-Point Coverage – Evaluates how well the system captures the core legal arguments and reasoning present in the ground truth.
- Realism Score (LLM-as-Judge) – Uses an independent model to evaluate the coherence, structure, and authenticity of generated judicial opinions.
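The four metrics above can be sketched as simple aggregates over per-case results. This is a rough illustration under stated assumptions: `CaseResult` and its fields are hypothetical, key-point coverage is approximated with word overlap (the project may well use a semantic or LLM-based match), and the judge score is taken as a number already produced by an independent model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    predicted_outcome: str      # e.g. "affirm" or "reverse"
    actual_outcome: str
    predicted_leaning: str
    expected_leaning: str
    gold_key_points: List[str]
    generated_opinion: str
    judge_score: float          # assumed to come from an LLM-as-judge call


def key_point_coverage(gold_points: List[str], opinion: str,
                       threshold: float = 0.5) -> float:
    """Share of gold key points whose words mostly appear in the opinion."""
    if not gold_points:
        return 0.0
    opinion_words = set(opinion.lower().split())
    covered = 0
    for point in gold_points:
        tokens = set(point.lower().split())
        if tokens and len(tokens & opinion_words) / len(tokens) >= threshold:
            covered += 1
    return covered / len(gold_points)


def evaluate(results: List[CaseResult]) -> Dict[str, float]:
    n = len(results)
    if n == 0:
        raise ValueError("no results to evaluate")
    return {
        "outcome_accuracy": sum(r.predicted_outcome == r.actual_outcome for r in results) / n,
        "ideology_alignment": sum(r.predicted_leaning == r.expected_leaning for r in results) / n,
        "key_point_coverage": sum(
            key_point_coverage(r.gold_key_points, r.generated_opinion) for r in results) / n,
        "realism_score": sum(r.judge_score for r in results) / n,
    }


# Toy data for illustration only; not taken from the project's gold set.
results = [
    CaseResult("reverse", "reverse", "conservative", "conservative",
               ["statute is ambiguous", "agency exceeded authority"],
               "the agency exceeded its authority under the statute", 4.0),
    CaseResult("affirm", "reverse", "liberal", "conservative",
               ["standing is lacking"],
               "petitioner has standing", 3.0),
]
scores = evaluate(results)
print(scores)
```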
Retrieval Quality (RAGAS)
To isolate retrieval performance, we use RAGAS metrics including context precision and recall. Results indicate that retrieval quality is consistently high, suggesting that performance bottlenecks lie not in information access, but in how effectively the model synthesizes and applies retrieved evidence.
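For intuition, context precision and recall can be approximated with labelled passage ids. This is a simplified stand-in, not the RAGAS computation: RAGAS estimates these quantities with LLM-based judgments of the retrieved text, whereas the sketch below assumes a hand-labelled set of relevant passage ids. The function names and ids are hypothetical.

```python
from typing import List, Set


def context_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of retrieved passages that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for pid in retrieved if pid in relevant) / len(retrieved)


def context_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of relevant passages that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


retrieved = ["p1", "p2", "p3", "p4"]   # what the retriever returned
relevant = {"p1", "p3", "p9"}          # hand-labelled relevant passages

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(precision, round(recall, 3))  # 0.5 0.667
```

High values on both measures while outcome accuracy stays moderate is the pattern the text describes: the right passages are being found, and the remaining errors come from how they are used.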
Results & Performance Gains
Through systematic RAG optimization, performance improved substantially:
- Outcome Accuracy: 15% → 46.2%
- Ideology Alignment: 30% → 53.8%
- Key-Point Coverage: 0.442 → 0.527
- Realism Score: 3.77 → 3.92
Notably, these gains were achieved without major model changes, but through improvements in chunking strategy, retrieval depth (k), and prompt design.
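A quick arithmetic check helps put these figures in scale. Assuming the final percentages are computed over the 13-case gold set (the text does not state the denominator explicitly), 46.2% corresponds to 6 of 13 cases and 53.8% to 7 of 13, and a single case moves either metric by about 7.7 points. The baseline percentages are left as quoted; this sketch does not try to map them to case counts.

```python
N_CASES = 13  # size of the gold-standard set stated in the text


def as_percent(correct: int, total: int = N_CASES) -> float:
    """Convert a count of correct cases to a percentage with one decimal."""
    return round(100 * correct / total, 1)


final_outcome = as_percent(6)             # 46.2
final_ideology = as_percent(7)            # 53.8
per_case_step = round(100 / N_CASES, 1)   # 7.7 points per case

# Absolute gains for the two score-based metrics, using the quoted values.
coverage_gain = round(0.527 - 0.442, 3)   # 0.085
realism_gain = round(3.92 - 3.77, 2)      # 0.15

print(final_outcome, final_ideology, per_case_step, coverage_gain, realism_gain)
```

With a set this small, each additional correct case is a large step, so the percentages are best read alongside the underlying counts.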
RAGAS metrics confirmed strong retrieval performance, indicating that improvements were driven primarily by better reasoning and context utilization.
Key Insights & Impact
Context Engineering Drives Performance
The primary driver of improvement was not model selection but RAG system design. Optimizing chunk size, overlap, and retrieval depth led to substantial gains, suggesting that well-structured context can matter more than a larger model in reasoning-heavy tasks.
Preserving Reasoning Chains is Critical
The best results were achieved when chunking preserved multi-step legal arguments. Larger, overlapping chunks allowed the model to maintain continuity across doctrines and precedents, directly improving outcome prediction and argument quality.
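Overlapping chunking can be sketched as a sliding window over the text. The example below is a minimal illustration: `chunk_with_overlap` is a hypothetical helper, it splits on words rather than model tokens, and the sizes used are toy values, not the settings the project arrived at.

```python
from typing import List


def chunk_with_overlap(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split `words` into windows of `chunk_size` that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
# a b c d
# c d e f
# e f g h
# g h i j
```

Because consecutive windows share words, an argument that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough. That is the property the text credits for preserving reasoning chains.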
Retrieval is Solved—Application is Not
RAGAS evaluation shows high retrieval precision and recall, indicating that relevant information is successfully surfaced. The remaining challenge lies in reasoning over that information—specifically, synthesizing evidence into coherent, decision-aligned arguments.
Structured Debate Improves Explainability
The multi-agent framework transforms the system from a black-box predictor into a transparent reasoning engine. By exposing intermediate arguments, votes, and opinions, Court-Logic enables auditable, step-by-step decision making.
Impact: From Prediction to Decision Intelligence
Court-Logic shifts AI from predicting outcomes to generating decision-grade dialogue. This has clear applications in:
- Legal education (understanding competing arguments)
- Case analysis (surfacing precedent and reasoning paths)
- General decision systems (extending to domains like policy and investing)
Scalable Blueprint for Reasoning Systems
The architecture provides a reusable pattern for building AI systems that require structured reasoning over complex knowledge bases. With the integration of GraphRAG (beta), the system points toward future capabilities in relationship-aware, multi-hop reasoning at scale.
Court-Logic demonstrates that the frontier of applied AI is not just smarter models—but systems that can reason, debate, and explain.
Inspiration & Mission
Finally, we are inspired by the broader goal of making complex systems more accessible. Court-Logic is driven by the mission to bring structured, transparent reasoning to domains traditionally limited to experts—starting with the U.S. legal system and expanding beyond.
Mission: To bring the Supreme Court to everyone by turning complex judicial reasoning into clear, compelling AI-powered debate.
Acknowledgements
This project was developed as part of the UC Berkeley Master of Information and Data Science (MIDS) Capstone program.
We thank the instructional team, peers, and collaborators for their guidance and feedback throughout the project. Additional appreciation goes to the developers and researchers behind the open-source tools and model providers that made this work possible.
