RAG with Iterative Critic
RAG with Iterative Critic - Elevating RAG Accuracy
Problem & Motivation
Standard RAG is effective but limited by single-pass generation and lack of post-hoc validation
Most RAG systems follow a retrieve → generate paradigm, producing an answer in a single forward pass. While effective for simple queries, this approach lacks mechanisms for post-hoc evaluation or targeted refinement after the initial generation. As a result, the system cannot determine whether an answer is complete, accurate, or sufficiently grounded in the retrieved contexts.
These limitations are more pronounced for complex, multi-hop queries that require synthesizing information across multiple documents. Standard RAG often struggles with connecting evidence across documents, handling multi-step reasoning, and identifying answers that are incomplete or insufficiently grounded. Even when relevant documents are retrieved, the model may fail to integrate them correctly or omit necessary intermediate reasoning steps.
Without an explicit feedback mechanism, errors in retrieval or reasoning propagate directly to the final answer. The system cannot identify when an answer is insufficient, nor can it trigger additional retrieval or refinement steps. This leads to missing context, incomplete answers, and unverified reasoning, reducing reliability—particularly on queries that require multi-hop reasoning or deeper evidence aggregation.
Existing Solutions
While several approaches attempt to mitigate single-pass limitations, they often fall short when handling complex, multi-hop queries:
- Rerankers (Cross-Encoders): Improve context precision by re-scoring and filtering retrieved documents. However, they cannot fetch new information if the initial retrieval completely misses the required nested facts.
- Query Expansion (e.g., Multi-Query, HyDE): Rewrite the user prompt to cast a wider search net. While this improves initial recall, it remains a single-pass mechanism with no post-generation verification to ensure the answer is correct.
- Chain-of-Thought (CoT) Prompting: Enhances the model's reasoning over the retrieved text, but inevitably fails when the underlying retrieved context is disconnected or insufficient.
- Unstructured Agents (e.g., standard ReAct): Allow LLMs to iteratively search for data, but without a strict, structured evaluation phase they frequently suffer from high latency and unpredictable routing, and can get stuck in infinite loops.
Our Solution
Critic-Guided Iterative RAG with Conditional Decomposition
The system extends a standard RAG pipeline by introducing a critic-driven feedback loop on top of an initial retrieve → rerank → generate pass. As shown in the workflow, the process begins with initial retrieval and reranking, followed by generation of an initial answer using the retrieved contexts.
An LLM-based critic then evaluates the answer for grounding (faithfulness to retrieved contexts) and completeness. If the answer passes the critic check, it is returned as the final answer.
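One way to picture the critic check is as a small structured verdict. The sketch below is illustrative only: the `CriticVerdict` fields, the prompt wording, the JSON reply format, and the `llm` callable are all assumptions, not the system's actual interface. It assumes an answer passes only when it is both grounded and complete, and it treats unparseable critic output as a failure so that an unverified answer is never returned as passed.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CriticVerdict:
    grounded: bool  # answer is supported by the retrieved contexts
    complete: bool  # answer addresses every part of the question
    feedback: str   # short note on what is unsupported or missing

    @property
    def passed(self) -> bool:
        # Assumed pass rule: both checks must succeed.
        return self.grounded and self.complete


def critique(question: str, answer: str, contexts: List[str],
             llm: Callable[[str], str]) -> CriticVerdict:
    """Ask a (hypothetical) LLM callable for a JSON verdict and parse it."""
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    prompt = (
        "Judge the answer using only the contexts.\n"
        'Reply with JSON: {"grounded": bool, "complete": bool, "feedback": str}\n\n'
        f"Question: {question}\nAnswer: {answer}\nContexts:\n{numbered}"
    )
    try:
        data = json.loads(llm(prompt))
        return CriticVerdict(
            grounded=data["grounded"] is True,
            complete=data["complete"] is True,
            feedback=str(data.get("feedback", "")),
        )
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Fail closed: a malformed verdict sends the answer to the critic loop.
        return CriticVerdict(False, False, "critic output could not be parsed")
```

Keeping the two checks as separate fields, rather than a single pass/fail flag, lets later stages see which check failed.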
If the answer fails, the system enters the critic loop, where it performs decomposition and regeneration. The original query is broken into targeted sub-queries, additional evidence is retrieved, and a new answer is generated using the expanded context. This updated answer is then re-evaluated by the critic.
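The decomposition step can be sketched as follows. This is a minimal illustration under stated assumptions: `decompose` and `retrieve` are hypothetical callables standing in for the LLM-based decomposer and the retrieval stack, and `top_k` and the exact-match de-duplication are choices made for the example rather than details taken from the system.

```python
from typing import Callable, List, Tuple


def expand_contexts(
    question: str,
    contexts: List[str],
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    top_k: int = 3,
) -> Tuple[List[str], List[str]]:
    """Break the question into sub-queries and add newly retrieved passages.

    Returns the sub-queries and the expanded context list. Passages already
    present are skipped so the generator does not see duplicates.
    """
    sub_queries = decompose(question)
    merged = list(contexts)   # keep the original contexts, in order
    seen = set(contexts)
    for sub_query in sub_queries:
        for passage in retrieve(sub_query)[:top_k]:
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return sub_queries, merged
```

The original contexts are kept rather than replaced, so evidence from the first pass stays available during regeneration.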
This retrieve → generate → critique → (decompose + regenerate) loop continues for a bounded number of iterations (up to three rounds) or until the critic passes the answer.
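The control flow of the bounded loop can be sketched as below. All four stage callables (`retrieve`, `generate`, `critic_passes`, `decompose`) are hypothetical placeholders, and `retrieve` is assumed to include reranking. The sketch also assumes that "up to three rounds" counts decompose-and-regenerate passes after the initial answer, and that each regenerated answer is re-checked by the critic.

```python
from typing import Callable, List, NamedTuple

MAX_ROUNDS = 3  # bound on critic-loop iterations stated in the text


class LoopResult(NamedTuple):
    answer: str
    rounds: int   # number of decompose + regenerate passes performed
    passed: bool  # whether the final answer passed the critic


def answer_with_critic(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    critic_passes: Callable[[str, str, List[str]], bool],
    decompose: Callable[[str], List[str]],
    max_rounds: int = MAX_ROUNDS,
) -> LoopResult:
    # Initial pass: retrieve (and rerank), then generate.
    contexts = list(retrieve(question))
    answer = generate(question, contexts)
    passed = critic_passes(question, answer, contexts)

    rounds = 0
    while not passed and rounds < max_rounds:
        # Critic loop: gather extra evidence for each sub-query.
        for sub_query in decompose(question):
            for passage in retrieve(sub_query):
                if passage not in contexts:
                    contexts.append(passage)
        # Regenerate with the expanded context and re-evaluate.
        answer = generate(question, contexts)
        passed = critic_passes(question, answer, contexts)
        rounds += 1

    return LoopResult(answer, rounds, passed)
```

Returning `passed` alongside the answer lets a caller distinguish an answer the critic accepted from one returned only because the iteration budget ran out.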
By introducing this structured feedback mechanism, the system moves beyond single-pass generation and enables targeted recovery from common RAG failure modes, including missing or incomplete context and errors in multi-hop reasoning. The result is a more reliable and grounded answer generation process, particularly for complex queries requiring multi-step reasoning.
Evaluation
Significant Gains with Targeted Impact
Overall Performance (n=300)
Targeted Impact (Critic-Failed → Decomposed Cases, n=111)
Insights
- Human evaluation showed true accuracy reached 87%, confirming strong real-world performance on multi-hop, bridge-style questions.
- The system delivers substantial gains on failure cases, nearly doubling accuracy by recovering from missing context and multi-hop reasoning errors.
- Improvements are driven by higher recall and better reasoning, not just retrieval noise.
- Faithfulness increases significantly, indicating stronger grounding in retrieved evidence.
- A minor precision tradeoff suggests broader retrieval, but with a net positive impact on answer quality.
Impact
Selective Critique and Decomposition Improve RAG Reliability
Single-pass RAG is insufficient for complex queries: Performance degrades on multi-hop and compositional questions without mechanisms for validation or recovery.
The critic step enables targeted intervention: The system avoids unnecessary overhead by decomposing only when an answer fails the grounding or completeness check. The critic check correctly identified 59% of initial answer failures and routed them to the critic loop for correction.
Decomposition is effective when applied selectively: Breaking queries into sub-steps significantly improves performance on hard cases while preserving efficiency on simpler ones. In the evaluation, 23% of initial failures were fully corrected by the critic loop.
Improvements are driven by recall and reasoning gains: Higher context recall (up from 70% to 93%) and faithfulness (up from 73% to 86%) indicate better evidence aggregation and more reliable multi-hop reasoning.
Minor precision tradeoff, but net quality improves: Slight decreases in context precision (down 1%) reflect broader retrieval but result in substantially better final answers.
A generalizable pattern for production RAG systems: Iterative critique and refinement provide a scalable approach to improving reliability without requiring fully autonomous agents.
Applications & Generalizability
A Domain-Agnostic Framework for Multi-Hop Reasoning and Evidence Synthesis
This approach is designed for domains that require multi-hop reasoning and complex evidence synthesis. Because the pipeline is domain-agnostic, it generalizes across knowledge bases where answers depend on connecting multiple pieces of information, rather than retrieving a single fact. It is particularly effective in scenarios where single-pass RAG fails due to missing context, implicit relationships, or the need for iterative validation.
Enterprise Knowledge Search: Enables employees to retrieve accurate answers from large, fragmented internal knowledge bases, where information must be synthesized across multiple documents rather than directly retrieved.
Legal Research: Supports reasoning across statutes, case law, and precedents that must be connected to form valid arguments, reducing missed citations and improving confidence in complex interpretations.
Medical Q&A: Facilitates synthesis across clinical guidelines, research literature, and patient context, helping ensure answers are both evidence-based and complete.
Financial Analysis: Combines insights from reports, filings, and market data to support decision-making, enabling identification of relationships and trends not explicitly stated in a single source.
Academic Research: Supports deeper literature review by connecting findings across multiple papers, enabling more comprehensive synthesis of complex research topics.
