Chroma

Traditional RAG is a one-shot pipeline: embed the query, fetch top-k, stuff into a prompt. Agentic RAG turns retrieval into a loop the model drives: decompose, search, read, prune, search again. The shape of that loop creates a new failure mode (context window bloat across turns) and a new cost lever (a specialist 20B subagent can match frontier LLMs on multi-hop benchmarks at up to 10x lower latency). This post walks through the contrast between traditional and agentic RAG, explains why a learned `prune_chunks` tool is the missing piece, and uses Chroma's Context-1 research as the worked example, showing how a LoRA-tuned gpt-oss-20b with a 16:1 recall-biased CISPO reward beats GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on BrowseComp-Plus and HotpotQA.
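To make the contrast concrete, here is a minimal sketch of the search-prune-search-again loop. Everything in it is illustrative: the corpus, the lexical `search`, the heuristic `prune_chunks` (standing in for the learned pruner the post describes), and the stopping rule are all toy assumptions, not Chroma's actual API or Context-1's implementation.

```python
# Toy corpus standing in for a retrieval index (illustrative only).
CORPUS = {
    "d1": "Marie Curie won the Nobel Prize in Physics in 1903.",
    "d2": "Marie Curie won the Nobel Prize in Chemistry in 1911.",
    "d3": "The Eiffel Tower is located in Paris.",
}

def search(query: str, k: int = 2) -> list[str]:
    """Toy lexical retriever: rank docs by query-term overlap."""
    terms = set(query.lower().split())
    scored = sorted(
        CORPUS.items(),
        key=lambda kv: -len(terms & set(kv[1].lower().split())),
    )
    return [doc_id for doc_id, _ in scored[:k]]

def prune_chunks(query: str, chunk_ids: list[str]) -> list[str]:
    """Stand-in for a learned pruner: keep chunks sharing >= 2 query
    terms. In the post this role is played by a LoRA-tuned model
    trained with a recall-biased reward, not a heuristic."""
    terms = set(query.lower().split())
    return [
        cid for cid in chunk_ids
        if len(terms & set(CORPUS[cid].lower().split())) >= 2
    ]

def agentic_rag(question: str, max_turns: int = 3) -> list[str]:
    """Drive the loop: search, prune, accumulate evidence, repeat.

    One-shot RAG would stop after the first search() call; the loop
    instead prunes off-topic chunks each turn so the context window
    does not bloat across turns.
    """
    evidence: list[str] = []
    for _ in range(max_turns):
        hits = search(question)
        kept = prune_chunks(question, hits)  # drop off-topic chunks
        evidence += [c for c in kept if c not in evidence]
        if len(evidence) >= 2:  # toy stopping criterion
            break
        # A real agent would decompose or rewrite the query here.
    return evidence
```

A real agent would rewrite the query between turns and let the model decide when to stop; the sketch only shows where the `prune_chunks` call sits in the loop.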