Latest

A blog about AI, coding and tech

  • Published on
    Running Claude Code as an autonomous agent inside a GitHub-centric SDLC rests on one discipline: GitHub is the system of record, the agent's conversation is ephemeral, and everything that must survive across pull requests lives in repo files or GitHub itself. This post walks the full issue-to-merge loop — trigger modes, the Explore→Plan→Implement→Commit cycle, cross-session context persistence via CLAUDE.md, the four distinct automated-review surfaces, and the merge gates that keep an agent that literally cannot self-approve from ever merging by fiat. It is built from a fact-checked research pass (25 claims verified, 0 refuted, overwhelmingly first-party Anthropic sources) and is deliberate about separating verified mechanics from the viral stats that did not survive verification.
  • Published on
    For two years, getting useful work out of a coding agent meant being the loop yourself: prompt, read, prompt again. As models hold a hard problem for hours, the bottleneck moves: not 'can it write good code' but 'can it keep making progress on its own without losing the thread or declaring victory early.' Loop engineering is the discipline that answers that question. You design the system that prompts the agent: discover work, attempt, get a feedback signal, self-correct, verify in a separate context, persist state on disk, decide what's next. This post lays out the architecture, the five building blocks, a worked worker/verifier loop in Python, the loops worth building first, the best practices, and an honest look at the risks (the 'confident token furnace'), with every flow rendered as a diagram.
  • Published on
    Anthropic's Clio is a privacy-preserving pipeline — extract facets from each conversation with Haiku, embed with sentence-transformers, cluster bottom-up with k-means into a ~10/100/1000 three-level hierarchy, label each cluster with Sonnet, and enforce minimum unique-account thresholds at every step. The whole 100K-conversation run costs $48.81 and recovers a known taxonomy at 94% accuracy versus 5% for random guessing. The architecture lifts almost unchanged to agent traces, which is exactly what Distributional has been doing: traces become the unit of analysis, facets become tool-call sequences and failure fingerprints, and clusters surface the lazy-tool-call hallucinations and resource-conservation regressions that pre-defined evals never thought to look for. This post walks Clio's pipeline stage by stage, maps each stage onto the agent-trace setting, and pins down what the 'analytics' layer above telemetry and monitoring actually buys you.
  • Published on
    Traditional RAG is a one-shot pipeline: embed the query, fetch top-k, stuff into a prompt. Agentic RAG turns retrieval into a loop the model drives — decompose, search, read, prune, search again. The shape of that loop creates a new failure mode (context window bloat across turns) and a new cost lever (a specialist 20B subagent can match frontier LLMs on multi-hop benchmarks at up to 10x lower latency). This post walks the contrast between traditional and agentic RAG, explains why a learned `prune_chunks` tool is the missing piece, and uses Chroma's Context-1 research as the worked example showing how a LoRA-tuned gpt-oss-20b with a 16:1 recall-biased CISPO reward beats GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on BrowseComp-Plus and HotpotQA.
  • Published on
    LangChain's harness-engineering recipe treats evals as the training data for agents — but what do you do when you don't have evals, can't touch production customer data, and need to probe very specific corners of agent behaviour? You build a harness whose output is the dataset itself. This post walks through a synthetic-data-generation harness that runs Claude Code in a loop over Phoenix traces and generates edge-case companies designed to break the agent under test, grounded in what Meta-Harness and the LangChain canon actually say (and don't say) about how such loops should be built.
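
The Clio entry above mentions enforcing minimum unique-account thresholds at every step. As a rough illustration of that one idea only, here is a minimal sketch of dropping clusters backed by too few distinct accounts. The function name `filter_clusters`, the `(cluster_id, account_id)` input shape, and the threshold value are all assumptions made for this example; none of them come from Clio itself.

```python
from collections import defaultdict


def filter_clusters(assignments, min_unique_accounts):
    """Keep only clusters seen from at least `min_unique_accounts` distinct accounts.

    `assignments` is an iterable of (cluster_id, account_id) pairs (hypothetical shape).
    Returns a dict mapping each surviving cluster_id to its distinct-account count.
    """
    accounts_by_cluster = defaultdict(set)
    for cluster_id, account_id in assignments:
        # A set deduplicates repeat conversations from the same account.
        accounts_by_cluster[cluster_id].add(account_id)

    return {
        cluster_id: len(accounts)
        for cluster_id, accounts in accounts_by_cluster.items()
        if len(accounts) >= min_unique_accounts
    }


# Illustrative data: cluster "a" has two distinct accounts, cluster "b" has one.
example = [("a", "u1"), ("a", "u2"), ("a", "u1"), ("b", "u3")]
surviving = filter_clusters(example, min_unique_accounts=2)
```

Counting distinct accounts rather than conversations is the point of the sketch: one heavy user cannot push a cluster over the threshold alone.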