A Synthetic Data Generation Harness: Hill-Climbing the Eval Set Itself
Over the last six months three pieces of writing have quietly consolidated harness engineering as its own discipline, separable from prompt engineering, fine-tuning, and traditional ML ops:
- Vivek Trivedy's Better Harness: A Recipe for Harness Hill-Climbing with Evals (LangChain, 8 Apr 2026), which frames the whole activity as "evals are training data for agents" and lays out a six-step recipe built around hand-curation, trace-mining, optimisation/holdout splits, baselines, diagnose-experiment-validate loops, and human review.
- The same author's earlier Improving Deep Agents with Harness Engineering (17 Feb 2026), a case study that takes `deepagents-cli` from 52.8% to 66.5% on Terminal Bench 2.0 without ever changing the model (held fixed at `gpt-5.2-codex`), purely by iterating on middleware (`PreCompletionChecklistMiddleware`, `LoopDetectionMiddleware`, `LocalContextMiddleware`), the system prompt, and a reasoning-budget strategy called the "reasoning sandwich" (xhigh-high-xhigh).
- The academic anchor underneath both posts: Meta-Harness: End-to-End Optimization of Model Harnesses by Yoonho Lee, Roshen Nair, Qizheng Zhang (Stanford), Kangwook Lee (KRAFTON), Omar Khattab (MIT), and Chelsea Finn (Stanford) (arXiv:2603.28052, project page, code). Meta-Harness turns harness design into an outer-loop search: an agentic proposer — Claude Code with Opus-4.6, in the paper's setup — is given a filesystem of every prior candidate's source code, execution traces, and scores, and it navigates that history with `grep` and `cat`, proposing the next harness to try. A single iteration routinely inspects a median of 82 files and burns roughly 10 million feedback tokens, which the paper reports as three orders of magnitude more diagnostic context than competitors like GEPA or AlphaEvolve use.
The recipe is now legible: evals in, better harness out. This post is about the step before that, which none of the three sources above actually addresses.
The Problem LangChain Doesn't Discuss
Read the Better-Harness recipe closely and you'll notice something: it lists three channels for sourcing evals — hand-curation, production traces, and external datasets — and nothing else. The sibling post How We Build Evals for Deep Agents says the same thing in different words: dogfooding, external benchmarks like Terminal Bench 2.0 and BFCL, and hand-written cases. Synthetic data generation is not mentioned in either post. Meta-Harness is even more explicit: it treats the eval distribution as a fixed ground truth and only searches over the inference-time program. The retrieval corpus is real problems from eight open datasets; an entire appendix (§4.2, §C.2) is about decontaminating it against the eval benchmarks.
All three canonical references assume you already have evals.
In a lot of realistic agent-building situations, you don't — and the reasons are structural, not laziness:
- You can't access the real data. You're building a B2B agent that will operate on customer companies, but you're pre-sales, the data is under NDA, or compliance hasn't cleared you to touch production tenants.
- The real data doesn't cover the edge cases you care about. You might have twenty onboarded customers, but none is a holding company with a 40% minority stake in a subsidiary that files separately and has a different fiscal year. That's exactly the case that'll break the agent in month six.
- You need controlled variation to attribute failures. If the agent misbehaves on a real company, is that because of the agent, or because the company is unusual? With synthetic data you own every axis.
Meta-Harness and Better-Harness both treat the eval set as an input. In practice, constructing that input is often the hardest part, and it has the same computational shape as the harness itself: it is code and data that you can hill-climb.
The thesis of this post: you can build a synthetic-data-generation harness with exactly the same shape as a Better-Harness loop. The optimisation target is not "the agent passes more evals." It is "the generated dataset surfaces more distinct failure modes per unit of generation cost, while staying on the manifold of plausible domain data."
The Shape of the Loop
Concretely, the loop looks like this, with Claude Code as the proposer and Arize Phoenix as the observability backend:
```
1. Current synthetic company dataset (start from seed)

2. Run the agent-under-test over every company
   └── emits OpenInference-decorated OTel spans

3. Claude Code (proposer) reads:
   - Phoenix traces via @arizeai/phoenix-mcp
   - The agent-under-test's prompts and tool defs
   - The existing synthetic dataset
   - Generation rules / dependency skill files
   - Prior iterations' scores and failure clusters

4. Proposer emits N new company specs targeting specific
   edge-case clusters the current dataset under-covers

5. Deterministic validator runs the dependency rules over
   the new companies; rejected ones go back with reasons

6. Surviving companies merge into the dataset, tagged
   with the behaviour class they were generated to probe

7. Goto 2.
```
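The steps above can be sketched as a driver function. This is a minimal illustration of the loop's shape, not any of the cited codebases: every name is hypothetical, and the proposer, agent, and validator are stubbed out as callables.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CompanySpec:
    """One synthetic company, tagged with the behaviour class it probes (step 6)."""
    name: str
    behaviour_class: str
    fields: dict = field(default_factory=dict)

def hill_climb(seed: list[CompanySpec],
               run_agent: Callable,   # step 2: company -> trace
               propose: Callable,     # steps 3-4: (dataset, traces, history) -> candidates
               validate: Callable,    # step 5: candidate -> (ok, reasons)
               iterations: int = 3):
    dataset = list(seed)
    history = []  # prior iterations' scores and rejections, read by the proposer
    for _ in range(iterations):
        traces = [run_agent(c) for c in dataset]              # step 2
        candidates = propose(dataset, traces, history)        # steps 3-4
        accepted, rejected = [], []
        for cand in candidates:                               # step 5
            ok, reasons = validate(cand)
            (accepted if ok else rejected).append((cand, reasons))
        dataset.extend(c for c, _ in accepted)                # step 6
        history.append({"accepted": len(accepted), "rejected": rejected})
    return dataset, history                                   # step 7: loop again
```

In the real loop, `propose` is a Claude Code invocation with filesystem access and `run_agent` emits spans to Phoenix; the stubs just make the control flow explicit.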
This is structurally isomorphic to the Meta-Harness outer loop, with two swaps:
- The thing being evolved is the dataset, not the harness code.
- The score being optimised is coverage + failure-diversity + plausibility rather than raw task accuracy.
The proposer is still a coding agent with filesystem access to traces and prior candidates. Better-Harness calls its side of this "evals as training data for agents." Here we have the mirror image: the agent under test is training data for the dataset.
Why the Proposer Has To Be a Coding Agent (Not a Generator Call)
Meta-Harness's central architectural claim is the thing that matters most when you try to copy the pattern. From the paper (§3):
"The filesystem is typically far larger than the proposer's context window, so the proposer queries it through terminal tools such as `grep` and `cat` rather than ingesting it as a single prompt."
And from Table 1 (which I keep coming back to): ~10M feedback tokens per iteration for Meta-Harness, versus 0.002M for OPRO, 0.008M for GEPA, 0.022M for AlphaEvolve. That's not a knob being turned — it's a different regime entirely, and the paper attributes its dominance over program-search baselines (Best-of-N, OpenEvolve, TTT-Discover, GEPA, same proposer, same eval budget) to it.
Why does this matter for synthetic-data generation? Because traces are voluminous in exactly the way the paper's eval histories are. A single run of a moderately complex agent over one company can produce thousands of OpenInference spans. If you try to summarise those down to "what went wrong" with an LLM critic before handing them to a proposer, you destroy the signal you wanted. Meta-Harness has an empirical result on this point that is worth quoting (Table 3, §4):
"Summaries do not recover the missing signal, and may even hurt by compressing away diagnostically useful details."
Scores-plus-summaries scored below scores-alone on the median (34.6 vs 34.9); raw traces jumped the median to 50.0. This is a surprisingly strong result against the "compress feedback with an LLM critic" design pattern, and it's the result that tells you to give Claude Code the Phoenix MCP and let it grep, rather than piping an executive summary into a small generator.
The other reason Claude Code specifically: generation rules live in markdown skill files — the pattern explicitly endorsed by Anthropic Agent Skills and used by Meta-Harness's proposer to know where to write new harnesses and how to inspect prior ones. Appendix D of Meta-Harness flags the skill text itself as the single biggest lever in the whole system: iterating on the skill had a larger effect than tweaking iteration count or population size. You want the proposer reading and editing those skill files, not receiving them as inert context.
Phoenix as the Trace Store: What's Actually There
Picking Phoenix is a choice driven by the proposer's needs, not the UI. The integration surface is more complete than most teams assume:
- `@arizeai/phoenix-mcp` — the official Phoenix MCP server (source). Exposes tools including `list-traces`, `get-trace`, `get-spans`, `list-sessions`, `list-datasets`, `get-dataset-examples`, `list-experiments-for-dataset`, `get-experiment-by-id`, `get-prompt`, and `upsert-prompt`. You register it with `claude mcp add` and Claude Code calls it natively.
- `arize-phoenix-client` — the Python REST wrapper. `client.traces.get_traces(project_identifier=..., session_id=..., limit=..., include_spans=True)` and dataset/experiment methods for when the MCP surface doesn't cover something.
- OpenInference — Arize's OpenTelemetry semantic-convention layer. It's the thing that makes LangGraph, OpenAI Agents SDK, DSPy, CrewAI, AutoGen, Claude Agent SDK, and raw MCP calls all emit cleanly ingestable spans. Every major agent framework has an instrumentor in openinference-instrumentation/python as of early 2026.
- Phoenix is free to self-host, Apache 2.0, Docker/Helm (docs): "Phoenix is free to self-host with no feature limitations. Your data stays entirely within your infrastructure." That matters when the loop runs for hours and emits tens of thousands of traces per generation cycle.
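As a concrete sketch of what the proposer-facing side of this looks like: pull traces for the agent-under-test and bucket failing spans by name before handing anything to Claude Code. The `get_traces` signature is the one quoted above; the shape of the returned records and the default `localhost:6006` endpoint are assumptions about a self-hosted Phoenix, so treat this as a sketch for whichever `arize-phoenix-client` version you have installed.

```python
from collections import Counter

def failing_span_counts(traces: list[dict]) -> Counter:
    """Bucket ERROR spans by span name so the proposer can see which tool
    calls fail most often. Assumes each trace dict carries a "spans" list
    whose entries have OTel-style "name" and "status_code" fields."""
    counts: Counter = Counter()
    for trace in traces:
        for span in trace.get("spans", []):
            if span.get("status_code") == "ERROR":
                counts[span.get("name", "unknown")] += 1
    return counts

def fetch_failing_spans(project: str) -> Counter:
    """Network side: requires a running Phoenix instance."""
    from phoenix.client import Client  # pip install arize-phoenix-client
    client = Client(base_url="http://localhost:6006")
    traces = client.traces.get_traces(project_identifier=project,
                                      limit=100, include_spans=True)
    return failing_span_counts(traces)
```

In the loop proper you'd skip this pre-bucketing for the proposer itself (per the raw-traces result discussed below) and use it only for the human-facing scorecard.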
For a worked walkthrough of the Claude-Code-reads-Phoenix pattern, the canonical reference is Mikyo King (Head of Open Source, Arize) and Hamel Husain's Maven course, Automating Evals with Claude Code + Phoenix. The course description is explicit about the loop we care about: "agent-driven open and axial coding for error analysis… agent designs an experiment to validate a hypothesis… agent runs the experiment and analyses results."
A subtlety worth calling out: Arize themselves have already published a synthetic-eval-generation recipe in Generating Synthetic Datasets for LLM Evaluators & Agents. Their taxonomy for agent-specific eval generation — happy path / multi-step / edge cases / adversarial / noise — is a good starting vocabulary for what your proposer is trying to cover, and Arize's newer "Alyx" feature inside Arize AX ships an in-Playground generator agent. Nothing in my loop is trying to replace those; what's different here is the hill-climbing outer loop driven by traces from running the agent-under-test, rather than a one-shot generation prompt.
The Dependency Problem Is Where This Lives or Dies
The naive version of this loop generates nonsense companies: a SaaS business with negative ARR, a manufacturing firm with more customers than the population of its country, a holding company that owns its own parent. The agent-under-test will fail on those, but it fails because the data is implausible — not because the agent is buggy. You learn nothing.
This is exactly the overfitting pathology Trivedy flags in Better-Harness, just pointed at dataset generation instead of prompts:
"Holdout sets become a proxy for true generalization."
If the synthetic dataset lives on a distribution the real world will never sample from, every score it produces lies. So the data-dependency rules have to be first-class, not an afterthought.
The canonical place to put them is skill files — the Anthropic Agent Skills convention, which is already what Meta-Harness uses to govern its proposer. Markdown documents with YAML frontmatter that describe:
- Field-level constraints. `revenue > 0`; `employee_count ∈ [1, 10_000_000]`; `founded_year <= current_year`.
- Cross-field dependencies. `if industry == "SaaS" then gross_margin > 0.6`; `if public == true then ticker must be valid on its exchange`; `sum(segment_revenue) == total_revenue ± 1%`.
- Cross-entity dependencies. `parent_company.consolidated_revenue >= sum(subsidiary.standalone_revenue) - intercompany_eliminations`; minority stakes (20–50% ownership) show up as equity-method income on the parent, not consolidated line items; no cycles in the ownership graph.
- Plausibility heuristics. "SaaS companies with >$100M ARR have >50 employees 95% of the time." These aren't hard rules — they flag candidates for human review.
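The cross-entity rules are the ones that demand real code rather than a schema check. A sketch of two of them (the no-cycles rule and the consolidation inequality), assuming a simple parent-to-subsidiaries adjacency dict; all names here are illustrative, not from any cited codebase.

```python
def has_ownership_cycle(owns: dict[str, list[str]]) -> bool:
    """True if the ownership graph contains a cycle, i.e. a company that
    transitively owns its own parent. Three-colour recursive DFS."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour: dict[str, int] = {}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for child in owns.get(node, []):
            c = colour.get(child, WHITE)
            if c == GREY:                      # back-edge: cycle found
                return True
            if c == WHITE and visit(child):
                return True
        colour[node] = BLACK
        return False

    return any(visit(n) for n in owns if colour.get(n, WHITE) == WHITE)

def consolidation_holds(parent_consolidated: float,
                        subsidiary_standalone: list[float],
                        intercompany_eliminations: float) -> bool:
    """consolidated_revenue >= sum(standalone revenues) - eliminations."""
    return parent_consolidated >= sum(subsidiary_standalone) - intercompany_eliminations
```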
A representative skill file might look like:
```markdown
---
name: company-consolidation-rules
description: Consistency rules for parent/subsidiary company graphs
applies_to: company_entity
---

## Hard constraints

- A company's consolidated revenue must be >= the sum of its
  subsidiaries' standalone revenues, minus documented intercompany
  eliminations.
- Minority-stake relationships (20% <= ownership < 50%) do NOT
  consolidate; revenue shows up as equity-method income on the parent.
- No cycles in the ownership graph.

## Plausibility (flag for review, don't reject)

- Holding companies without operating entities are rare; if generated,
  tag with `plausibility_flag: holdco-without-operations`.
- Fiscal-year mismatches between parent and subsidiary should appear
  in ~15% of multi-entity groups in the dataset.
```
The validator in step 5 is a deterministic pass over the hard constraints. Rejections come back to the proposer as structured feedback, which is itself signal for the next iteration — exactly the same shape as Meta-Harness's interface-validation step before scoring. The plausibility flags bubble up to the human reviewer, same as Better-Harness's explicit human-review step for catching overfitting the metric can't see.
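A minimal sketch of that step-5 validator: hard rules reject with named reasons that go back to the proposer, while plausibility rules only attach flags for the human reviewer. The rule names and dict shapes here are my own conventions, not from any cited codebase.

```python
# Each rule is a (name, predicate-over-company-dict) pair.
HARD_RULES = [
    ("revenue_positive",   lambda c: c["revenue"] > 0),
    ("employees_in_range", lambda c: 1 <= c["employee_count"] <= 10_000_000),
    ("saas_margin",        lambda c: c.get("industry") != "SaaS"
                                      or c.get("gross_margin", 0) > 0.6),
]
PLAUSIBILITY_RULES = [
    ("holdco-without-operations",
     lambda c: not (c.get("is_holdco") and not c.get("operating_entities"))),
]

def validate_company(company: dict) -> dict:
    """Deterministic pass: rejections are structured feedback for the
    proposer; plausibility flags bubble up to the human reviewer."""
    reasons = [name for name, rule in HARD_RULES if not rule(company)]
    if reasons:
        return {"accepted": False, "reasons": reasons}
    flags = [name for name, rule in PLAUSIBILITY_RULES if not rule(company)]
    return {"accepted": True, "plausibility_flags": flags}
```

Keeping the rules as data (rather than inline ifs) is what lets the proposer extend them additively, which is where the next section says the gains come from.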
One finding from Meta-Harness that cashes out directly here, from the TerminalBench-2 trajectory (Appendix A.2): prompt-template edits regressed six times in a row before the proposer learned a heuristic — "modifications to prompts and completion flow are high risk, even when the local hypothesis sounds reasonable" — and the winning change was purely additive (an environment-snapshot block). The practical lesson for a synthetic-data-generation harness is the same: edits to the generation prompt are high-risk; additive edits to the skill files, the validator rules, and the behaviour-class coverage are where the gains come from. Keep the generation prompts almost invariant and evolve the structured artefacts around them.
What the Proposer Is Actually Optimising
Better-Harness is explicit that the optimisation target has to be something a holdout set can measure, and that reward hacking is the failure mode to guard against. For a synthetic-dataset harness, the analogous question is: what does "better dataset" mean?
A hierarchy of candidate objectives, roughly in order of how useful each is:
- Coverage of declared behaviour classes. Maintain an explicit list of behaviours the agent should handle — fiscal-year mismatches, discontinued operations restatements, stock-based comp in segment reporting, revenue-recognition corner cases. Measure what fraction has at least K synthetic examples. Arize's own happy-path/multi-step/edge-cases/adversarial/noise taxonomy is a reasonable starting backbone.
- Failure-diversity score. Cluster agent failures across the dataset (by trace similarity, tool-call pattern, or an LLM-judge label). A dataset that produces 30 distinct failure clusters is more valuable than one that produces 200 cases that all fail the same way.
- Redundancy penalty. Two companies whose traces are near-identical contribute almost nothing. Penalise near-duplicates — cheap to compute with an embedding model on the generated company descriptions.
- Plausibility rate. Fraction of generated companies that pass the hard constraints on first try. If this drops, the proposer is being too aggressive and the dataset is drifting off-manifold.
- Held-out validation. Meta-Harness's search-set recommendation is "50–100 examples in our classification experiments, 88 problems for math retrieval… a fast, discriminative eval is more valuable than a large one." Apply the same logic to the synthetic dataset: keep a blessed held-out slice of real companies (even if small) around from day one, and track whether agent scores on the synthetic optimisation set and the real held-out set are trending together. Divergence means the synthetic set is overfitting to its own generation process.
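These axes compose into a per-iteration scorecard that gets written alongside the dataset. A sketch, with token-set Jaccard overlap standing in for the embedding-based redundancy check; every name here is illustrative.

```python
from collections import Counter

def jaccard(a: str, b: str) -> float:
    """Cheap stand-in for embedding similarity on company descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def dataset_scorecard(companies: list[dict],
                      behaviour_classes: list[str],
                      failure_labels: list,        # one cluster label per run, None = pass
                      passed_first_try: list[bool],
                      k: int = 1,
                      dup_threshold: float = 0.9) -> dict:
    """companies: [{"behaviour_class": ..., "description": ...}, ...]."""
    per_class = Counter(c["behaviour_class"] for c in companies)
    coverage = sum(per_class[b] >= k for b in behaviour_classes) / len(behaviour_classes)
    failure_diversity = len({label for label in failure_labels if label is not None})
    near_duplicates = sum(
        1
        for i, a in enumerate(companies)
        for b in companies[i + 1:]
        if jaccard(a["description"], b["description"]) >= dup_threshold
    )
    plausibility_rate = sum(passed_first_try) / len(passed_first_try)
    return {"coverage": coverage,
            "failure_diversity": failure_diversity,
            "near_duplicates": near_duplicates,
            "plausibility_rate": plausibility_rate}
```

The held-out-divergence check is deliberately absent here: it needs agent scores on real companies, which live outside the generation loop.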
These are the scores that get written to the filesystem alongside the dataset, and that Claude Code reads on the next iteration. Same pattern as Meta-Harness; just different axes.
Precedents Worth Naming
The shape of this loop is not novel. Closest public prior art, in rough order of directness:
- Voyager (Wang et al., 2023) remains the cleanest reference for an agent with (a) an automatic curriculum generator, (b) an ever-growing skill library, (c) iterative self-verification. The curriculum-generator role is structurally the same as the proposer role here.
- Benchmark Self-Evolving (Wang et al., COLING 2025). Multi-agent system that dynamically extends existing benchmarks by reframing instances — the most on-nose academic prior art for "loops that write their own evals."
- τ²-bench (Sierra Research, 2025). Its abstract describes "a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity" — the single clearest published example of schema-aware programmatic agent-eval synthesis.
- Bloom (Anthropic Alignment, 2025). Agentic framework for behavioural evals where scenarios are auto-generated with configurable diversity; perturbations work "by varying substitutable elements like company names and dates." Directly relevant to fictional-company generation.
- Orca-AgentInstruct (Microsoft Research, 2024). Agentic pipeline — Content Transformation → Seed Instruction Generation → Instruction Refinement — across a 100+ subcategory taxonomy. A training-data pipeline, but its structure is a strong template.
- Bespoke Curator. The most usable off-the-shelf synthetic-data-curation library in 2026, with Pydantic structured outputs, async + caching, LiteLLM/vLLM backends. The library to reach for when you want to standardise the boring parts of your generator and focus on the loop.
The pattern that's not well-documented in public is the one this post is most directly about: agent-trace-driven eval-set hill-climbing, with dependency constraints encoded as skill files and a coding-agent proposer. Private shops (quant funds, AlphaSense-style research platforms, the Hebbia tier) almost certainly do some version of this internally; it's just not in the literature. If you're doing this, you're not alone — you're just among the first writing it down.
Connecting Back to the Auto-Harness Idea
In an earlier post I argued most "self-improving agents" in production are doing three things: writing durable behavioural artefacts, collecting trajectories for offline fine-tuning, and pulling upstream code improvements. None of those requires the agent to edit its own substrate.
The synthetic-data-generation harness lives in the same neighbourhood. The system as a whole gets better over time — the dataset expands, the skill files sharpen, the failure clusters shrink — but no individual component is rewriting itself. The proposer is Claude Code, reading a filesystem and writing a filesystem. The skill files are markdown that a human can audit and veto. The agent-under-test is unchanged across iterations. The Meta-Harness authors make the same point about their own system in §5: "brittle if-chains or hard-coded class mappings are visible on inspection" — code-space overfitting is acknowledged but called more inspectable than weight-space overfitting. That property survives when you port the pattern to dataset generation.
What you get at the end is a dataset of synthetic companies that is fit-for-purpose for evaluating a specific agent on specific behaviours, produced by a process that can be replayed, audited, and extended. That dataset then feeds the Better-Harness loop proper, closing the circle: you go from no data, to synthetic data, to a harness optimised against that data, to (eventually) real data replacing or augmenting the synthetic baseline.
The broader point: the harness-engineering frame is general. Once you see one hill-climbing loop over code and data with an agentic proposer, you start seeing them everywhere — over prompts, over tool selections, over datasets, over skill files. Meta-Harness is a paper about one such loop. Better-Harness is a recipe for another. The synthetic-data-generation loop in this post is a third. They are all the same shape, and I suspect the next few years of agent engineering are going to be a Cambrian explosion of loops of this shape stacked on top of each other, with the outputs of one becoming the inputs of the next.
References
Primary sources
- Vivek Trivedy, Better Harness: A Recipe for Harness Hill-Climbing with Evals, LangChain Blog, 8 Apr 2026.
- Vivek Trivedy, Improving Deep Agents with Harness Engineering, LangChain Blog, 17 Feb 2026.
- Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052 [cs.AI], 30 Mar 2026. Code: github.com/stanford-iris-lab/meta-harness. Project page: yoonholee.com/meta-harness.
Adjacent LangChain posts
- Trivedy, Daugherty, Yurtsev, Chase, How We Build Evals for Deep Agents, LangChain Blog, 26 Mar 2026.
- Sydney Runkle, How Middleware Lets You Customize Your Agent Harness, LangChain Blog, 26 Mar 2026.
- Harrison Chase, Your Harness, Your Memory, LangChain Blog, 11 Apr 2026.
- Curme & Daugherty, Context Management for Deep Agents, LangChain Blog, 28 Jan 2026.
Phoenix and Claude Code tooling
- Arize AI, Phoenix on GitHub. Self-hosting docs: arize.com/docs/phoenix/self-hosting.
- Arize AI, `@arizeai/phoenix-mcp`.
- Arize AI, Generating Synthetic Datasets for LLM Evaluators & Agents.
- Arize AI, OpenInference on GitHub.
- Mikyo King and Hamel Husain, Automating Evals With Claude Code + Phoenix (Maven, 2026).
- Anthropic, Agent Skills Overview.
Synthetic data and self-evolving evals
- Wang et al., Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023.
- Wang et al., Benchmark Self-Evolving, COLING 2025.
- Sierra Research, τ²-bench, 2025.
- Anthropic Alignment, Bloom: Automated Behavioral Evals, 2025.
- Microsoft Research, Orca-AgentInstruct, 2024.
- Bespoke Labs, Curator.
- Anthropic, Demystifying Evals for AI Agents.