Agentic RAG: Multi-Turn Retrieval With Self-Editing Context and Specialist Subagent Models
Traditional RAG worked when the question and the answer both fit inside a single retrieval call. The interesting questions don't. Multi-hop queries, ambiguous user phrasing, follow-up searches that depend on what the first search returned — none of that is well-served by embed -> top-k -> stuff -> generate. The field has been quietly shifting toward what people now call agentic RAG: retrieval as a loop the model controls, with explicit tools for searching, reading, and — the new piece — un-retrieving. Once you accept the loop shape, the cost calculus changes too. Most of the tokens are tool I/O, not generation, and the model's job is decomposition, scanning, and pruning rather than world knowledge. Chroma's recent research note Context-1: Training a Self-Editing Search Agent is the cleanest worked example of where this leads: a 20-billion-parameter LoRA-tuned gpt-oss-20b that matches GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on multi-hop retrieval benchmarks at up to 10x faster inference, and where four parallel rollouts of the small model still come in cheaper than a single frontier API call.
This post is half primer, half case study. The first three sections are about agentic RAG as a pattern — what it is, what it inherits from ReAct, and the new failure mode (context bloat) that its loop shape creates. The rest of the post walks Chroma's design choices in Context-1: the prune_chunks tool, the token-budget regime, the CISPO training recipe with its asymmetric recall-biased reward, the synthetic-data verification trick, and the cost math that lands the whole thing on a different point of the price/performance frontier than any frontier API.
Traditional RAG: One-Shot Retrieval as a Stateless Pipeline
The original Retrieval-Augmented Generation formulation by Lewis et al. (2020) is structurally a stateless pipeline. The user asks a question, an embedding model maps it into the same space as the corpus chunks, an approximate-nearest-neighbour index returns the top-k, those chunks are concatenated into the prompt under a "use this context" instruction, and the LLM generates an answer.
┌───────────────────────────────────────────────────────────────┐
│ │
│ user query ──► embed ──► ANN search ──► top-k chunks ──┐ │
│ │ │
│ system prompt + chunks ◄────┘ │
│ │ │
│ ▼ │
│ generate ──► answer │
│ │
└───────────────────────────────────────────────────────────────┘
The shape works, and it works well, when three conditions hold:
- The relevant chunks are reachable from the query in one hop. The user's phrasing is semantically close enough to the supporting text that the embedding similarity ranks the right passages near the top.
- The answer fits inside what those chunks say. No chaining across documents; no "find X, then use X to find Y" structure.
- k is large enough to cover the answer with margin, but small enough not to wash out the prompt. Usually 4–20 chunks.
When all three hold, traditional RAG is hard to beat on cost. It's one embedding call, one index lookup, one generation call. Latency is dominated by generation, not retrieval.
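For concreteness, a minimal sketch of that stateless pipeline. The embed, index, and llm objects are stand-ins for whichever embedding model, ANN index, and generation endpoint you actually use, not any particular library's API:

```python
def one_shot_rag(query: str, embed, index, llm, k: int = 8) -> str:
    """Traditional RAG: one retrieval, one generation, no second chances.

    `embed`, `index`, and `llm` are placeholders for your embedding model,
    ANN index, and generation endpoint -- not a specific library's API.
    """
    # 1. Map the query into the corpus embedding space.
    query_vec = embed(query)

    # 2. One approximate-nearest-neighbour lookup, top-k chunks back.
    chunks = index.search(query_vec, top_k=k)

    # 3. Stuff the chunks into the prompt under a "use this context" instruction.
    context = "\n\n".join(c.text for c in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 4. Single generation call; if the answer isn't in `chunks`, it never will be.
    return llm.generate(prompt)
```

Everything below is a consequence of the single index.search call: if it misses, nothing downstream can recover.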
The cracks show up the moment any of the three conditions slips:
- Multi-hop questions. "Which company that filed for IPO in 2024 had its lead underwriter previously involved in a settled SEC case?" requires finding the IPO list, then finding underwriter relationships, then cross-referencing SEC enforcement history. No single ANN call surfaces all of that, because the chunks aren't lexically close to the original query.
- Ambiguous queries. "Show me what we know about the Q3 incident" — which incident, in which product, on which team? A single retrieval call can't disambiguate; it can only return the top-k for an under-specified query and hope.
- Vocabulary drift. The user asks about "rate limiting" and the relevant doc calls it "request throttling". Dense embeddings often bridge this; sparse retrieval doesn't. When neither does, top-k is wrong and the answer is wrong with confidence.
- Follow-up reasoning. "If the chunks I just got are about the wrong product, search again for the right one." Traditional RAG has no second turn — the prompt either does or doesn't contain the answer, and if it doesn't, the model hallucinates or refuses.
The standard mitigations — query rewriting, HyDE, multi-query expansion, reranking — all push on the first retrieval call, trying to make it richer. They don't change the loop shape: still one retrieval call, still stateless, still no recovery if the call missed.
Agentic RAG: Retrieval as a Loop the Model Drives
Agentic RAG inherits its loop primitive from ReAct (Yao et al., 2022): the model alternates between reasoning steps (free-text thoughts about what to do next) and acting steps (calling tools whose results come back as observations). Plug retrieval primitives into the tool slots and you have agentic RAG.
The minimum viable tool surface is roughly:
- A search call that takes a query and returns ranked chunks.
- A read-by-id call that fetches a full document by ID, for when search snippets aren't enough.
- A regex / exact-match call for when the agent knows the literal string it's looking for.
- A prune / discard call to remove material from the context — the new piece, and the one this post is mostly about.
Chroma's Context-1 harness instantiates these as search_corpus, read_document, grep_corpus, and prune_chunks respectively. The first three are unsurprising; the fourth is the one most agentic-RAG implementations skip, and skipping it is the structural mistake.
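As a sketch, here is that tool surface as Python stubs. The argument names and return shapes are illustrative assumptions, not Context-1's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    document_id: str
    text: str

# Argument names and return shapes below are illustrative, not Context-1's schema.

def search_corpus(query: str, exclude_ids: set[str] | None = None) -> list[Chunk]:
    """Ranked chunks for a natural-language query."""

def read_document(document_id: str) -> str:
    """Full document text, for when search snippets aren't enough."""

def grep_corpus(pattern: str) -> list[Chunk]:
    """Exact / regex match when the agent knows the literal string."""

def prune_chunks(chunk_ids: list[str]) -> None:
    """Drop previously retrieved chunks from the agent's context."""
```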
┌────────────────────────────────────────────────────────────────┐
│ │
│ user query │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ ┌────────► │ reason: what │ │
│ │ │ to do next? │ │
│ │ └───────┬───────┘ │
│ │ │ │
│ │ ┌──────────────┼──────────────┐ │
│ │ ▼ ▼ ▼ │
│ │ search_corpus read_document grep_corpus │
│ │ │ │ │ │
│ │ └──────► observation ◄────────┘ │
│ │ │ │
│ │ ▼ │
│ │ context window │
│ │ (filling up) │
│ │ │ │
│ │ ▼ │
│ │ prune_chunks ── (free space) │
│ │ │ │
│ └──────────────────┘ │
│ │ │
│ ▼ │
│ final answer │
│ │
└────────────────────────────────────────────────────────────────┘
The change in capability is qualitative. Decomposition emerges: the agent breaks the multi-hop question into sub-questions and searches each. Recovery emerges: a bad first retrieval is just an observation the agent can reason over and route around with a second call. Vocabulary drift gets fixed mid-trajectory: the agent reads what came back, notices the corpus uses different terms, and re-searches in the right vocabulary. None of this is magic — it's just what falls out of giving the model the loop and the tools.
The price is that you've replaced one generation call with five, six, often ten. And that price is mostly paid in tokens, not in headline generation cost — which is exactly why specialist small models eventually win.
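A minimal sketch of the loop itself, assuming the tool stubs above and a hypothetical model.step stand-in for a tool-calling chat completion; it shows the shape, not Chroma's harness:

```python
def agentic_rag(query: str, model, tools: dict, max_turns: int = 10) -> str:
    """ReAct-style retrieval loop.

    `model.step` is a stand-in for one tool-calling chat completion;
    `tools` maps tool names to callables (search_corpus, read_document,
    grep_corpus, prune_chunks). Both are assumptions, not a real API.
    """
    messages = [{"role": "user", "content": query}]

    for _ in range(max_turns):
        # Reason: the model decides what to do next (tool calls or a final answer).
        step = model.step(messages, tools=list(tools))
        if step.final_answer is not None:
            return step.final_answer

        # Keep the model's own turn in the transcript.
        messages.append({"role": "assistant", "content": step.reasoning,
                         "tool_calls": step.tool_calls})

        # Act: execute each requested tool call and append the observation.
        for call in step.tool_calls:
            observation = tools[call.name](**call.arguments)
            messages.append({"role": "tool", "name": call.name,
                             "content": str(observation)})

    return "No answer found within the turn budget."
```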
The New Failure Mode: Context Bloat Across Turns
Multi-turn search has a shape that single-shot RAG doesn't: chunks accumulate. Every search_corpus call returns somewhere between 1K and 2K tokens of new chunks; every read_document returns several thousand. After six turns of normal retrieval activity, a naive agent's context can easily look like this:
- ~500 tokens of system prompt and tool definitions
- ~1,000 tokens of original user query and early reasoning
- ~9,000 tokens of accumulated retrieval results, most of which were relevant to one sub-question and have nothing to do with the current one
- ~1,500 tokens of intermediate reasoning
That's 12K tokens before the agent has even started to formulate a final answer, and the useful fraction of those 9K retrieval tokens is often under 20%. By turn ten the budget is exhausted and the agent either truncates (losing the original query) or starts producing degraded output as the relevant signal is drowned by stale chunks.
There are two existing mitigations and neither is sufficient on its own:
Bigger context windows. The frontier-model answer. It works, in the sense that 200K-token windows accommodate a lot of accumulated retrieval. But cost scales linearly with input tokens, and prefix-cache invalidation gets worse the longer the context — every turn that adds new content forces the next turn to re-encode more. For an agent doing five to ten retrieval turns per query, the per-trajectory bill on a frontier API gets meaningful fast.
Harness-level summarisation. A middleware layer that, every N turns, replaces retrieval results with an LLM-generated summary. This loses information by definition, and the harness has worse signal than the model about which details still matter. Summarising "the chunk about Q3 earnings, mentioning underwriter X" down to "Q3 earnings info" loses the underwriter that the next turn was going to chain on.
The structural fix — and Chroma's contribution in Context-1 — is to recognise that the model itself has the best signal about what's still relevant and to give it an explicit tool to act on that signal. Don't summarise; un-retrieve. Don't expand the window; spend the existing window on what matters.
Self-Editing Context: Pruning as a First-Class Tool
The Context-1 research post makes the design choice explicit: alongside search_corpus, read_document, and grep_corpus, expose a prune_chunks(ids) tool, and train the model to use it well.
The token-budget regime they implement around it:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 0 ─────────────── 24,576 ──── 28,000 ──────── 32,768 │
│ │ │ │ │ │
│ │ free zone │ soft │ hard │ │
│ │ │ prune │ cutoff │ │
│ │ all tools allowed │ signal │ only │ │
│ │ │ │ prune_chunks │ │
│ │ │ │ or │ │
│ │ │ │ finish_ │ │
│ │ │ │ answer │ │
│ │ │ │ │ │
│ └────────────────────┴────────────┴────────────────┘ │
│ │
│ current usage appended after every turn │
│ │
└─────────────────────────────────────────────────────────────┘
Three thresholds, each doing different work. Below 24,576 tokens the agent operates freely. Between 24,576 and 28,000, the system surfaces a soft signal — the agent sees its current usage and is nudged to prune before the next big retrieval. Above 28,000, only prune_chunks and the final answer call are permitted; the agent has been backed into a corner where it has to free space before it can do anything else useful.
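In harness code, the regime is a few lines. The thresholds below are the ones from the diagram; the gating logic itself is reconstructed from the description, not taken from Chroma's code:

```python
# Thresholds from the Context-1 post; the gating code is a reconstruction.
FREE_ZONE_END = 24_576     # below this, no nudging at all
PRUNE_ONLY_START = 28_000  # above this, only prune or answer
HARD_CUTOFF = 32_768       # model context window

ALL_TOOLS = ["search_corpus", "read_document", "grep_corpus",
             "prune_chunks", "finish_answer"]

def allowed_tools(context_tokens: int) -> list[str]:
    """Tools the harness will accept on the next turn."""
    if context_tokens < PRUNE_ONLY_START:
        return ALL_TOOLS
    return ["prune_chunks", "finish_answer"]

def budget_note(context_tokens: int) -> str:
    """Usage line appended to the context after every turn."""
    note = f"Context usage: {context_tokens} / {HARD_CUTOFF} tokens."
    if FREE_ZONE_END <= context_tokens < PRUNE_ONLY_START:
        note += " Consider pruning before the next large retrieval."
    elif context_tokens >= PRUNE_ONLY_START:
        note += " Only prune_chunks or finish_answer will be accepted."
    return note
```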
The other thing the harness does — which is mechanically simple but easy to forget — is trajectory-wide deduplication. Every chunk ID the agent has ever seen is tracked, and on subsequent search_corpus calls those IDs are passed as exclusion filters. The agent literally cannot get the same chunk back twice. Without this, agentic RAG converges to "the model keeps re-retrieving the same near-duplicate cluster of chunks" — which is exactly what the base gpt-oss-20b does before training.
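The dedup bookkeeping is equally small; this sketch assumes the hypothetical exclude_ids parameter on search_corpus from earlier:

```python
class SeenChunkTracker:
    """Trajectory-wide dedup: a chunk ID, once returned, is never returned again."""

    def __init__(self):
        self.seen_ids: set[str] = set()

    def search(self, query: str, search_corpus) -> list:
        # Everything already seen goes in as an exclusion filter.
        results = search_corpus(query, exclude_ids=self.seen_ids)
        self.seen_ids.update(chunk.id for chunk in results)
        return results
```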
The training delta this enables, as Chroma reports it:
| Metric | Base gpt-oss-20b | Context-1 |
|---|---|---|
| Pruning accuracy | 0.824 | 0.941 |
| Avg trajectory length (turns) | 6.7 | 5.2 |
| Parallel tool calls per turn | 1.52 | 2.56 |
| Trajectory recall | 0.640 | 0.739 |
| Output recall | 0.361 | 0.641 |
Pruning accuracy rising from 0.824 to 0.941 is the headline number. It says the model learned not just to prune, but to prune the right chunks — keeping evidence that will still be useful three turns later, dropping chunks whose relevant content has already been quoted into the reasoning trace. Trajectory length dropping from 6.7 turns to 5.2 is the second-order effect: when you stop wasting turns on context-management thrashing, the average trajectory shortens by ~22%.
Why a Specialist 20B Beats a Frontier Generalist Here
The headline result from the Context-1 report is the comparison plot against frontier models. On the axes that matter for production agentic RAG — final-answer correctness and chunk-level F1 against gold evidence — Context-1 at a single rollout matches or exceeds:
- GPT-5.2 and GPT-5.4
- Claude Sonnet 4.5 and 4.6
- Claude Opus 4.5 and 4.6
- Gemini 3.1 Pro
- Kimi K2.5
- gpt-oss-120b (the same base family, six times the parameters)
The benchmarks span Chroma's four generated suites (web, finance, legal, email) and the public canon (BrowseComp-Plus, HotpotQA, FRAMES, SealQA's Seal-0 and LongSeal variants, and Humanity's Last Exam filtered to search-relevant questions).
Headline numbers worth keeping in mind:
- BrowseComp-Plus: 0.87 final-answer found, 0.65 F1 (Context-1, 1x)
- HotpotQA: 0.97 final-answer found — near saturation; this benchmark no longer separates frontier and specialist
- Generated web (BrowseComp-style, difficulty 2+): 0.88 / 0.64 F1
- Generated finance (SEC filings, difficulty 1+): 0.89 / 0.64 F1
- Generated legal (USPTO patent prior-art): 0.89 / 0.65 F1
- Generated email (Epstein + Enron): 0.92 / 0.75 F1
The latency story is just as important. With MXFP4 quantization and vLLM on a single Nvidia B200, Chroma reports 400–500 tokens per second, and "up to 10x faster inference" than the frontier comparators. In a five-turn trajectory, 10x generation speed is the difference between sub-second and ten-second user-perceived latency — and because latency compounds across turns, wall-clock time per trajectory matters more than raw per-token throughput.
The cost lever is the four-parallel-rollouts mode. Instead of one expensive call to a frontier model, you run four independent Context-1 rollouts in parallel and reconcile the answers. Chroma reports that this 4x ensemble approaches or surpasses frontier-model performance on the harder benchmarks while remaining cheaper than a single frontier API call. That's an unusual point on the cost/performance frontier — you don't usually get to take the small-model price and the ensemble's recall lift at the same time.
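A sketch of that 4x mode. The reconciliation step here is a simple majority vote over normalised answers, which is one reasonable choice; the post does not specify how Chroma reconciles rollouts:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ensemble_answer(query: str, run_trajectory, n_rollouts: int = 4) -> str:
    """Run independent rollouts in parallel and reconcile.

    `run_trajectory` is a stand-in for one full retrieval trajectory
    (query in, final answer out). Majority vote is an assumption; a parent
    agent could equally reconcile by reading all four answers.
    """
    with ThreadPoolExecutor(max_workers=n_rollouts) as pool:
        answers = list(pool.map(run_trajectory, [query] * n_rollouts))

    normalised = [a.strip().lower() for a in answers]
    winner, _ = Counter(normalised).most_common(1)[0]
    # Return the original-cased answer matching the winning normalised form.
    return next(a for a in answers if a.strip().lower() == winner)
```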
The intuition for why a 20B specialist wins is structural, not magical. In an agentic-RAG loop the model's job is:
- Decomposing the user query into sub-questions.
- Rewriting sub-questions into corpus-vocabulary search queries.
- Scanning ranked search results and deciding what to read in full.
- Pruning chunks that are no longer load-bearing.
- Stitching evidence into a final answer.
These are operationally narrow skills. The vast majority of a frontier model's parameter count is dead weight for retrieval scaffolding — you're paying for medical knowledge, multilingual fluency, code generation, multimodal grounding, none of which the loop exercises. A LoRA on gpt-oss-20b, trained specifically on this task shape, captures the load-bearing behaviour at a fraction of the parameter count.
The Cost Math, Concretely
Run the numbers on a typical multi-hop retrieval trajectory:
- Turns per trajectory (post-Context-1 training): ~5.2.
- Tokens per turn (search results + agent reasoning): ~2,000–4,000.
- Total trajectory I/O: ~10,000–20,000 input tokens, plus a few thousand output.
At frontier-model list prices, that's measurable per-query cost — and unlike a chat workload, the agent context can't be cached effectively because every turn appends new retrieval. Multiply by query volume and the bill is real.
At Context-1 economics, the same trajectory runs on a self-hosted B200 in roughly the time a frontier model takes to do one turn. Per-query cost is dominated by the GPU-hour amortisation, not by per-token pricing, and the throughput headroom on a B200 is large enough that batching multiple users' trajectories drives marginal cost toward zero.
The 1x-vs-4x decision then becomes a tunable knob, not an architectural commitment. On HotpotQA, where 0.97 final-answer-found is already near saturation, 1x is sufficient. On BrowseComp-Plus and the harder generated benchmarks where 4x ensembling lifts recall meaningfully, you spend the extra rollouts. Either way you're below frontier API cost on per-trajectory price and below frontier inference time on latency.
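If you want to reproduce the comparison for your own workload, the arithmetic is a dozen lines. Every price and throughput figure below is a placeholder to replace with your own numbers; only the token counts come from the trajectory profile above:

```python
# All prices and throughput figures are illustrative placeholders -- substitute
# current list prices and your actual GPU economics before drawing conclusions.

INPUT_TOKENS = 15_000   # midpoint of the ~10K-20K per-trajectory input range
OUTPUT_TOKENS = 3_000   # a few thousand output tokens across ~5 turns

# Hypothetical frontier API pricing, $ per 1M tokens.
FRONTIER_IN, FRONTIER_OUT = 3.00, 15.00
frontier_cost = (INPUT_TOKENS * FRONTIER_IN + OUTPUT_TOKENS * FRONTIER_OUT) / 1e6

# Hypothetical self-hosted B200: amortised hourly cost over batched trajectories.
GPU_HOUR = 6.00                # assumed all-in $/hour
TRAJECTORIES_PER_HOUR = 2_000  # assumed, from batching headroom at 400-500 tok/s
specialist_cost = GPU_HOUR / TRAJECTORIES_PER_HOUR

print(f"frontier per trajectory:   ${frontier_cost:.4f}")
print(f"specialist per trajectory: ${specialist_cost:.4f}")
print(f"4x specialist rollouts:    ${4 * specialist_cost:.4f}")
```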
How Context-1 Was Trained
The training pipeline in the Chroma research post has three pieces worth understanding individually.
Stage 1 — SFT warmup. Trajectories from a stronger model are used as supervised fine-tuning data to bootstrap the base gpt-oss-20b into producing well-formed tool-calling agentic-RAG trajectories at all. This is conventional and exists mostly to give RL a reasonable starting policy.
Stage 2 — CISPO RL. Clipped Importance-Sampled Policy Optimization is the on-policy RL algorithm. The training scale: 1,024 agent trajectories per training step, 8,000+ synthetic tasks across the four domains, convergence around step 230 over five epochs.
Stage 3 — Curriculum. Two phases: difficulty scaling (start with easier tasks, ramp up) and reward annealing (start with a recall-focused reward, anneal toward precision-focused as the policy stabilises).
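The reward-annealing phase can be expressed as a schedule over the recall weighting that appears in the reward sketch further down. The linear shape and the end point here are assumptions; the post does not specify either:

```python
def recall_weight(step: int, total_steps: int,
                  start: float = 16.0, end: float = 1.0) -> float:
    """Recall-vs-precision weighting for the reward, annealed from
    recall-heavy toward precision as training progresses.

    The 16.0 start mirrors the 16:1 bias; the linear shape and the 1.0
    end point are assumptions, not Chroma's reported values.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```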
The reward design is the part most worth internalising for any team training a retrieval agent:
- 16:1 F1 bias toward recall. This is the single most opinionated choice in the recipe. Standard F1 weights precision and recall equally; Chroma weights them roughly 16:1 in favour of recall, because Context-1 is positioned as a retrieval subagent whose output is consumed by a parent agent. Missing a relevant document is much worse than including an irrelevant one — the parent can filter; it can't conjure missing evidence. This is the kind of asymmetry that's obvious in retrospect and easy to get wrong if you copy a generic F1 reward from somewhere.
- Trajectory-recall credit. The agent gets credit for encountering a relevant chunk during search, even if it later prunes it. This decouples "can find" from "must keep" — the model is allowed to explore and discard without being penalised for the exploration. Without this, the model becomes risk-averse about pruning.
- Final-answer bonus. A +1.0 bonus when the agent directly retrieves an answer-containing chunk, reinforcing the search-then-answer trajectory shape.
- Degenerate-behaviour penalties. A penalty of 0.1 per excess prune_chunks call when the agent enters streaks longer than three consecutive prunes (preventing context-thrashing). A turn-count penalty that scales linearly from 0 at 64 turns to 0.5 at 128 turns (preventing unbounded trajectories).
The reward shape is the API by which the desired behaviour is communicated to the policy. Every term in this list maps to a specific failure mode the team observed and patched.
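Here is a sketch of a reward with that shape. The quoted coefficients (the 16:1 recall bias, the +1.0 answer bonus, the 0.1 prune penalty, the 0-to-0.5 turn penalty) come from the post; how the terms combine, and the 0.5 weight on trajectory recall, are my reconstruction rather than Chroma's implementation:

```python
def f_weighted(precision: float, recall: float, recall_weight: float = 16.0) -> float:
    """Weighted harmonic mean of precision and recall.
    recall_weight=16 is one reading of the 16:1 bias toward recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    w = recall_weight
    return (1 + w) * precision * recall / (w * precision + recall)

def trajectory_reward(kept_ids, seen_ids, gold_ids,
                      answered_from_gold_chunk: bool,
                      prune_streak_excess: int, turns: int,
                      recall_weight: float = 16.0) -> float:
    """Reconstructed reward shape for one retrieval trajectory.

    kept_ids:  chunk IDs still in context at the end
    seen_ids:  every chunk ID the agent ever retrieved, pruned or not
    gold_ids:  gold evidence chunk IDs
    """
    gold = set(gold_ids)

    # Output recall/precision: what survives to the final context.
    kept = set(kept_ids)
    out_recall = len(kept & gold) / max(len(gold), 1)
    out_precision = len(kept & gold) / max(len(kept), 1)
    reward = f_weighted(out_precision, out_recall, recall_weight)

    # Trajectory-recall credit: finding a gold chunk counts even if later pruned.
    traj_recall = len(set(seen_ids) & gold) / max(len(gold), 1)
    reward += 0.5 * traj_recall  # the 0.5 weight is an assumption

    # Final-answer bonus: the agent directly retrieved an answer-containing chunk.
    if answered_from_gold_chunk:
        reward += 1.0

    # Degenerate-behaviour penalties.
    reward -= 0.1 * prune_streak_excess        # prunes beyond a 3-long streak
    if turns > 64:                             # 0 at 64 turns -> 0.5 at 128
        reward -= 0.5 * min(turns - 64, 64) / 64   # cap past 128 is an assumption
    return reward
```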
The Synthetic Eval Generation Trick Worth Stealing
Independent of the model, Chroma's data-generation pipeline contains one trick that any retrieval team should adopt regardless of which model they're calling: extract-and-verify, instead of LLM-as-judge.
The conventional approach to verifying that a generated retrieval task is well-formed is to ask an LLM "does this document contain evidence for this clue?" and trust the binary answer. This is noisy. The judge LLM has its own opinions about what counts as evidence, applies them inconsistently across long-tail topics, and has no grounding in the actual textual span.
Chroma's alternative: have the LLM extract the literal span quotes from both the document and the clue, then have a deterministic system normalise both and check textual grounding. The verification reduces to a span-overlap check, not a yes/no judgement. Human verification then only has to confirm that the extracted spans actually support each other — a much narrower task than evaluating relevance from scratch.
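The deterministic half of that pipeline is short. The normalisation and the overlap threshold below are illustrative choices; the structure, where an LLM extracts spans and plain code checks them, is the part worth copying:

```python
import re

def normalise(span: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace for robust matching."""
    span = span.lower()
    span = re.sub(r"[^\w\s]", " ", span)
    return re.sub(r"\s+", " ", span).strip()

def spans_grounded(doc_text: str, doc_span: str, clue_span: str,
                   min_token_overlap: float = 0.6) -> bool:
    """Deterministic verification of an LLM-extracted evidence pair.

    1. The span extracted from the document must literally appear in it.
    2. The document span and the clue span must share enough content words.
    The 0.6 threshold is an illustrative choice, not Chroma's.
    """
    if normalise(doc_span) not in normalise(doc_text):
        return False  # the "evidence" isn't actually in the document

    doc_tokens = set(normalise(doc_span).split())
    clue_tokens = set(normalise(clue_span).split())
    if not clue_tokens:
        return False
    overlap = len(doc_tokens & clue_tokens) / len(clue_tokens)
    return overlap >= min_token_overlap
```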
The alignment numbers Chroma reports against human gold:
- Web domain: 84.4% alignment with human verification
- Finance domain: 93%
- Email domain: 87.5%
These numbers are meaningfully higher than typical LLM-as-judge baselines on retrieval relevance tasks. The implication for any team building synthetic retrieval evals: replace your "is this relevant?" judge with an extract-then-check pipeline, and a lot of the noise in your eval suite goes away.
This connects directly to the harness-engineering thread in an earlier post on this blog about synthetic data generation harnesses: the Chroma extract-and-verify is exactly the kind of deterministic-validator step that turns a noisy LLM-driven generation loop into a reliable one. If you're building synthetic retrieval evals and you're not yet doing span extraction, that's the cheapest reliability win available.
Where Agentic RAG (and This Approach) Still Doesn't Work
Context-1 is a focused result, not a universal one. The Chroma report is honest about the scope, and so should this post be:
Needle-in-haystack queries dominate the training distribution. The benchmark set (BrowseComp-Plus, HotpotQA, the generated suites) is overwhelmingly questions of the form "find the one document or chunk-pair that answers this." Breadth-search and aggregation queries — "list every company that filed a 10-K mentioning AI risk in 2024," "summarise all rulings on this topic in the last two years" — are not in the training set, and the pruning policy isn't shaped for them. An agent trained to prune aggressively will discard exactly the chunks that an aggregation answer needs.
The tool surface is search-only. No code execution, no SQL, no pandas, no metadata or schema introspection. Domains where the right tool is structured-data manipulation, not free-text search, fall outside what this kind of agent can do. There's no fundamental reason the approach can't extend to those tools, but Context-1 specifically doesn't.
The corpus shape is implicit. Pruning is a learned policy on the chunk-size and document-density distributions Chroma trained over. Behaviour on radically heterogeneous corpora — very long documents mixed with very short ones, mixed-modality content, or non-English text — is unevaluated.
Output shape is single-answer. The reward design assumes a clear gold answer to compare F1 against. Open-ended synthesis tasks ("write a summary of what we know") don't have that target shape and aren't covered.
The honest framing: agentic RAG with self-editing context is the right shape for the multi-hop, search-driven, single-answer subset of retrieval problems. That subset is large and economically important — most enterprise RAG sits inside it — but it's not all of retrieval, and a different problem shape may need a different agent shape.
Takeaways
A few that generalise beyond Context-1 specifically:
Pick by question shape, not by trend. Traditional RAG is still the right answer when the user's question and its answer fit in one retrieval call. Agentic RAG earns its complexity only when multi-hop, decomposition, or follow-up search is genuinely needed. Don't pay the loop cost on questions that don't need a loop.
In any loop, you eventually need to un-retrieve. Build a prune_chunks-equivalent into your harness even if your model isn't trained to use it well. Exposing the tool is a prerequisite for ever fixing context bloat — and the model can do something useful with it via prompting alone, with much more headroom available if you ever fine-tune.
Audit how much of your frontier-model spend is buying generalist capacity the loop never exercises. For a retrieval-heavy workload, most of the answer is "all of it." A LoRA-tuned 20B is now a real alternative for this class of work — Chroma's Context-1 is the existence proof, with weights and harness released under Apache 2.0.
Replace LLM-as-judge with extract-and-verify for retrieval eval generation. This is cheap to adopt regardless of what model you're calling; the alignment-with-humans gap is large enough to be worth the migration even before any other change.
Test 4x parallel rollouts before scaling the model. It's a cost-shaped lever that often gets you most of the lift of a bigger model at lower price. The Context-1 data shows this clearly on the harder benchmarks; it's plausibly true for any well-trained specialist agent.
The deeper claim under all of this is that purpose-trained specialists at the 20B scale are now competitive with frontier generalists on operationally narrow tasks — and agentic RAG is one of the cleanest examples of an operationally narrow task. The economics are no longer on the side of "send everything to the biggest model." The interesting engineering question is which other agent shapes admit the same treatment, and what the next prune_chunks-equivalent insight will be in those domains.