AI-Agents

Published on
July 13, 2026
superquant-bench: A Ground-Truth-Verifiable Benchmark for Autonomous Quantitative Research Agents
Benchmark AI-Agents Quantitative-Research Synthetic-Data Evaluation Look-ahead-Bias False-Discovery-Rate Alpha Claude-Code Frontier-Models Walk-forward Multiple-Testing Claude-Fable-5 Claude-Opus
On real market data there is no answer key, so you can never tell whether an agent discovered real structure or overfit noise. superquant-bench inverts the trade-off: a fully synthetic 100-asset price panel with 22 known alpha patterns injected into it, giving the grader the exact true conditional mean of every return. That buys three things reality cannot — a prediction score with a literal zero noise floor (oracle 100, iid noise 0), a directly measured false-discovery rate via a pattern-free twin universe, and statistical look-ahead enforcement. A red-team of seven exploits validates the metrics, then four frontier models run under one identical Claude Code scaffold. The spread is 10×, model identity explains 75% of prediction variance — and the strongest model is the least disciplined, spraying false discoveries on the universe that contains nothing.
Published on
June 17, 2026
Claude Code as a GitHub-Native Agent: The Issue-to-Merge Development Loop
Claude-Code AI-Agents GitHub-Actions SDLC Code-Review Context-Engineering Agentic-Workflows CI/CD Anthropic
Running Claude Code as an autonomous agent inside a GitHub-centric SDLC rests on one discipline: GitHub is the system of record, the agent's conversation is ephemeral, and everything that must survive across pull requests lives in repo files or GitHub itself. This post walks the full issue-to-merge loop — trigger modes, the Explore→Plan→Implement→Commit cycle, cross-session context persistence via CLAUDE.md, the four distinct automated-review surfaces, and the merge gates that keep an agent that literally cannot self-approve from ever merging by fiat. It is built from a fact-checked research pass (25 claims verified, 0 refuted, overwhelmingly first-party Anthropic sources) and is deliberate about separating verified mechanics from the viral stats that did not survive verification.
Published on
June 17, 2026
Loop Engineering: Designing the System That Drives the Agent Instead of Prompting It
Loop-Engineering AI-Agents Claude-Code Agentic-Workflows Context-Engineering Autonomous-Agents Sub-agents MCP Verification Anthropic
For two years, getting useful work out of a coding agent meant being the loop yourself — prompt, read, prompt again. As models hold a hard problem for hours, the bottleneck moves: not 'can it write good code' but 'can it keep making progress on its own without losing the thread or declaring victory early.' Loop engineering is the discipline that answers that — you design the system that prompts the agent: discover work, attempt, get a feedback signal, self-correct, verify in a separate context, persist state on disk, decide what's next. This post lays out the architecture, the five building blocks, a worked worker/verifier loop in Python, the loops worth building first, the best practices, and an honest look at the risks (the 'confident token furnace'), with every flow rendered as a diagram.
Published on
May 11, 2026
Hierarchical Clustering of Agent Traces for Discovering Unknown Failure Modes
Clio AI-Agents Observability Agent-Traces Anthropic Distributional Clustering Hierarchical-Clustering k-Means Privacy Telemetry OpenTelemetry
Anthropic's Clio is a privacy-preserving pipeline — extract facets from each conversation with Haiku, embed with sentence-transformers, cluster bottom-up with k-means into a ~10/100/1000 three-level hierarchy, label each cluster with Sonnet, and enforce minimum unique-account thresholds at every step. The whole 100K-conversation run costs $48.81 and recovers a known taxonomy at 94% accuracy versus 5% for random guessing. The architecture lifts almost unchanged to agent traces, which is exactly what Distributional has been doing: traces become the unit of analysis, facets become tool-call sequences and failure fingerprints, and clusters surface the lazy-tool-call hallucinations and resource-conservation regressions that pre-defined evals never thought to look for. This post walks Clio's pipeline stage by stage, maps each stage onto the agent-trace setting, and pins down what the 'analytics' layer above telemetry and monitoring actually buys you.
Published on
April 28, 2026
Agentic RAG: Multi-Turn Retrieval With Self-Editing Context and Specialist Subagent Models
Agentic-RAG RAG AI-Agents Retrieval Chroma Context-1 LLM Context-Engineering CISPO LoRA
Traditional RAG is a one-shot pipeline: embed the query, fetch top-k, stuff into a prompt. Agentic RAG turns retrieval into a loop the model drives — decompose, search, read, prune, search again. The shape of that loop creates a new failure mode (context window bloat across turns) and a new cost lever (a specialist 20B subagent can match frontier LLMs on multi-hop benchmarks at up to 10x lower latency). This post walks the contrast between traditional and agentic RAG, explains why a learned `prune_chunks` tool is the missing piece, and uses Chroma's Context-1 research as the worked example showing how a LoRA-tuned gpt-oss-20b with a 16:1 recall-biased CISPO reward beats GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on BrowseComp-Plus and HotpotQA.
Published on
April 23, 2026
A Synthetic Data Generation Harness: Hill-Climbing the Eval Set Itself
AI-Agents Synthetic-Data Evaluation Harness-Engineering Claude-Code Phoenix Meta-Harness LangChain
LangChain's harness-engineering recipe treats evals as the training data for agents — but what do you do when you don't have evals, can't touch production customer data, and need to probe very specific corners of agent behaviour? You build a harness whose output is the dataset itself. This post walks through a synthetic-data-generation harness that runs Claude Code in a loop over Phoenix traces and generates edge-case companies designed to break the agent under test, grounded in what Meta-Harness and the LangChain canon actually say (and don't say) about how such loops should be built.
Published on
April 21, 2026
Inside Hermes Agent: What 'Self-Improving AI Agent' Actually Means in Production
AI-Agents Self-Improving-AI Hermes Nous-Research LLM-Agents Agent-Architecture Memory-Systems Skill-Learning
A source-level audit of Nous Research's hermes-agent. The 'self-improving' label cashes out as four orthogonal mechanisms — skill auto-generation, persistent memory, offline RL fine-tuning, and a pull-based update protocol — none of which is autonomous code rewriting. The interesting engineering is in the operational plumbing that makes all four robust in messy real-world deployments.
Published on
March 2, 2026
Recursive Self-Improvement for Trading: How LLMs Can Teach Themselves to Invest
AI-Agents Recursive-Self-Improvement Quantitative-Finance LLM-Trading Mind-Evolution Evolutionary-AI Poetiq
How Recursive Self-Improvement turns LLMs from static prediction machines into trading systems that get better on their own, learning from news, research reports, and price data through evolutionary search, self-critique, and memory-based refinement loops.
Published on
February 19, 2026
Building an AI-Native Hedge Fund with OpenClaw: Multi-Agent Systems for Quantitative Trading
AI-Agents OpenClaw Quantitative-Finance Multi-Agent-Systems Autonomous-AI Risk-Management Open-Source
How to build a multi-agent trading firm with OpenClaw — specialized analyst agents running in parallel, adversarial bull/bear debate, dedicated risk management gates, sandboxed execution, and persistent memory that compounds institutional knowledge over time.
Published on
February 9, 2026
Build an AI Investment Analyst with OpenClaw: Automated Research, Financial Modeling, and Market Monitoring
AI-Agents OpenClaw Quantitative-Finance Autonomous-AI Investment-Research Financial-Modeling Open-Source
How to turn OpenClaw into a full-stack AI investment analyst — parsing sell-side research PDFs, building code-based financial models, running 24/7 market monitoring with heartbeat alerts, and conducting deep research with parallel subagents.
Published on
February 6, 2026
Automated Quant Research with AI Agents: How Microsoft's RD-Agent Achieves 2x Returns with 70% Fewer Factors
AI-Agents Quantitative-Finance LLM Autonomous-AI Machine-Learning Alpha-Generation
A deep dive into Microsoft's RD-Agent framework — the first multi-agent system that automates the full quant research pipeline from hypothesis generation to backtesting, achieving 2x higher annualized returns than classical factor libraries while using 70% fewer factors.
Published on
February 6, 2026
OpenClaw Agentic Framework: How Autonomous AI Agents Execute Long-Running Tasks with Heartbeat Monitoring
AI-Agents Open-Source Agent-Orchestration Heartbeat-Monitoring LLM Autonomous-AI
A deep dive into OpenClaw's architecture — how it runs persistent AI agents across messaging platforms with lane-based queuing, session persistence, context compaction, and a built-in heartbeat system for proactive monitoring.
Published on
December 27, 2025
Using AI Agents to Forecast Prediction Markets
AI-Agents Forecasting Prediction-Markets LLM
A synthesis of recent work on LLM forecasting agents, focusing on Bridgewater’s AIA Forecaster and why blending AI with market prices can beat either alone.
Published on
December 21, 2025
Automatic Debugging and Failure Detection in AI Agent Systems
AI-Agents LLM Debugging Observability Reliability
A survey of DoVer and related work on failure attribution, intervention-based debugging, and observability tooling for LLM agent systems.
Published on
December 10, 2025
Why You Don’t Need AI Agent Evaluations
AI-Agents Evaluation Observability LLMs Startups
A satirical look at why skipping AI agent evaluations makes perfect sense if you don't value maintainability, customers, or long-term sanity.
Published on
November 23, 2024
MLE-Bench: Benchmarking AI Agents in Machine Learning Engineering
AI-Benchmarking Machine-Learning-Engineering Kaggle-Competitions AIDE-Framework OpenAI-Research AI-Agents
MLE-Bench introduces a new benchmark to evaluate AI agents on real-world ML engineering tasks using Kaggle competitions. This post highlights key findings, including resource scaling effects, debugging challenges, and the performance of different agent frameworks.
