Recursive Self-Improvement in Quant Research with Claude Code
What if your quant researcher never sleeps, never stops iterating, and knows when to be skeptical of its own discoveries?
This is the story of building an autonomous alpha research system that uses Claude Code as its core reasoning engine. The system generates factor hypotheses, writes DSL expressions, trains LightGBM models, interprets results, evolves strategies through mutation and crossover — and subjects everything to a gauntlet of statistical tests designed to catch false discoveries.
The result: statistically significant signals in commodity futures that survive permutation tests, multiple testing adjustments, and true out-of-sample evaluation. Not all of them survive. That's the point.
Why Commodity Futures?
Commodity futures offer a less crowded playing field than equities. The universe is small (20 instruments in our case: 5 energy, 15 agricultural), the signals are driven by real economic fundamentals (weather, supply chains, storage costs), and the cross-sectional structure is rich enough for factor-based approaches without requiring the massive universes that equity quants typically rely on. The instruments trade on deep, liquid markets with low slippage.
Our universe:
- Energy (5): Brent, WTI, Heating Oil, Natural Gas, Gasoline
- Agricultural (15): Cocoa, Cotton, Feeder Cattle, Lean Hogs, Coffee, KC Wheat, Live Cattle, Orange Juice, Sugar, Corn, Soybean Oil, Soybean Meal, Rough Rice, Soybeans, Chicago Wheat
System Architecture — How the Loop Works
The system runs in a walk-forward framework. For each time window:
```
Direction (human prompt)
        |
        v
Initial Population (16 individuals, each with 18 DSL factor expressions)
        |
        v
[For 8 evolution rounds]:
  |-- Mutation: Claude analyzes feature importances, replaces weak factors
  |-- Crossover: Claude combines best factors from two parents
  |-- Train LightGBM on factor features -> predict next-day returns
  |-- Evaluate by validation RankIC
  |-- Selection: keep top K
        |
        v
Best Model
        |
        v
Robustness & FDR Checks (permutation, deflated Sharpe, subsample, decay, CV)
        |
        v
Verdict: ROBUST / MARGINAL / UNSTABLE
        |
        v
Out-of-Sample Backtest (never seen during any optimization)
```
What's an "Individual"?
Each individual in the population is a set of 18 factor expressions written in a domain-specific language (DSL). For example:
```
TS_MEAN(($open - DELAY($close, 1)) / (DELAY($close, 1) + 1e-8), 20)
```
This computes the 20-day average overnight gap. When an individual is evaluated:
- All 18 expressions are computed into a feature matrix (instrument x date x 18 features)
- LightGBM is trained with features = factor values, target = next-day return
- The model predicts on the validation period
- RankIC of predictions = fitness score
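In code, this evaluation pipeline looks roughly like the following. This is a minimal sketch, not the production system: the `delay`/`ts_mean` helpers mirror the DSL's `DELAY`/`TS_MEAN` operators, and the column names and the `rank_ic` implementation are our assumptions.

```python
import numpy as np
import pandas as pd

def delay(s: pd.Series, d: int) -> pd.Series:
    """DELAY(x, d): the value d rows earlier (applied per instrument in practice)."""
    return s.shift(d)

def ts_mean(s: pd.Series, w: int) -> pd.Series:
    """TS_MEAN(x, w): rolling mean over the last w observations."""
    return s.rolling(w).mean()

def overnight_gap_factor(df: pd.DataFrame, window: int = 20) -> pd.Series:
    """TS_MEAN(($open - DELAY($close, 1)) / (DELAY($close, 1) + 1e-8), window)."""
    prev_close = delay(df["close"], 1)
    gap = (df["open"] - prev_close) / (prev_close + 1e-8)
    return ts_mean(gap, window)

def rank_ic(pred: pd.DataFrame, realized: pd.DataFrame) -> float:
    """Fitness: mean daily Spearman correlation between predictions and
    next-day returns. Both inputs are (date x instrument) frames; we rank
    across instruments within each date."""
    daily = pred.rank(axis=1).corrwith(realized.rank(axis=1), axis=1)
    return float(daily.mean())
```

A perfect predictor scores a RankIC of 1.0; in practice, daily cross-sectional RankICs in the 0.02-0.06 range are typical for a usable signal.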
Evolution Operators
Mutation: Claude receives the parent's factor set, feature importances, and metrics. It identifies weak factors (low importance) to replace and strong factors (high importance) to keep. Replacements are targeted — Claude reasons about what signal types are missing, what lookback periods are over-represented, and what category coverage gaps exist.
Crossover: Claude receives two parents' factor sets with importances. It takes the top ~8-10 factors from each parent, resolves near-duplicates, and fills remaining slots with bridging factors.
This isn't random search. Claude reads the importance scores, understands what each factor measures, and makes informed decisions about what to change.
Walk-Forward Validation
We use a strict temporal separation:
- Train: 6 years (e.g., 2018-2023)
- Validation: 1 year (e.g., 2024) — used for fitness evaluation and selection
- OOS: 1 year (e.g., 2025) — never touched until final evaluation
Within training, we run 3 expanding folds of purged time-series cross-validation with dynamic embargo. The embargo is computed as the maximum lookback window across all factor expressions in each individual — if your longest factor uses a 60-day window, the embargo is at least 60 trading days.
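A minimal sketch of what an expanding purged fold generator might look like. The fold sizing and index conventions here are illustrative assumptions; the real system derives `embargo` from each individual's maximum factor lookback, as described above.

```python
import numpy as np

def expanding_purged_folds(n_days: int, n_folds: int, embargo: int):
    """Yield (train_idx, val_idx) pairs for expanding walk-forward folds.

    Each fold trains on everything before its validation block, minus an
    embargo gap immediately preceding it, so factors with long lookbacks
    cannot leak validation information into training.
    """
    fold_size = n_days // (n_folds + 1)
    for k in range(1, n_folds + 1):
        val_start = k * fold_size
        val_end = min(val_start + fold_size, n_days)
        train_end = max(0, val_start - embargo)  # purge the embargo gap
        yield np.arange(0, train_end), np.arange(val_start, val_end)
```

With a 60-day max lookback, every training index ends at least 60 trading days before the first validation index of its fold.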
The False Discovery Gauntlet
Multiple hypothesis testing is the #1 concern in quantitative research. When you test 96 individuals across 8 evolution rounds, some will look good by chance. Our system addresses this with five complementary statistical tests, each targeting a different failure mode.
Permutation Test (1000 shuffles)
The simplest and most powerful test: shuffle the date labels on predictions 1000 times and recompute RankIC each time. If the real model's RankIC isn't in the extreme tail of this null distribution, the signal is indistinguishable from noise.
Result: All models passed (p < 0.001). The observed RankIC was always far beyond the 99.9th percentile of the null distribution. The signal is real.
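The test can be sketched as follows (function names are ours; the shuffle-dates-but-keep-the-cross-section convention follows the description above):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def permutation_pvalue(pred: pd.DataFrame, realized: pd.DataFrame,
                       n_shuffles: int = 1000, seed: int = 0) -> float:
    """Shuffle the date labels on the predictions and recompute the mean
    daily RankIC each time. The p-value is the fraction of shuffles whose
    RankIC matches or beats the observed one."""
    def mean_rank_ic(p: pd.DataFrame, r: pd.DataFrame) -> float:
        ics = [spearmanr(p.iloc[t], r.iloc[t])[0] for t in range(len(p))]
        return float(np.nanmean(ics))

    observed = mean_rank_ic(pred, realized)
    rng = np.random.default_rng(seed)
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        perm = rng.permutation(len(pred))  # shuffle dates, keep each day's cross-section intact
        null[i] = mean_rank_ic(pred.iloc[perm].reset_index(drop=True), realized)
    # add-one correction so the estimated p-value is never exactly zero
    return (1.0 + np.sum(null >= observed)) / (n_shuffles + 1.0)
```

With 1000 shuffles the smallest reportable p-value is about 0.001, which is why "p < 0.001" is the floor of this test's resolution.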
Deflated Sharpe Ratio (Bailey & Lopez de Prado, 2014)
The permutation test tells you if this model's signal is real. But when you've tested 96 models and selected the best one, you need to ask: is the selected model's Sharpe ratio impressive given how many you tested?
The Deflated Sharpe Ratio adjusts the observed Sharpe for the number of trials. With 96 individuals tested, the expected maximum Sharpe from pure noise is 2.58. Your observed Sharpe needs to clear this bar after adjusting for skewness, kurtosis, and sample size.
Result: Mixed. Models with very high validation Sharpe (>3.0) passed easily. Models with moderate Sharpe (~1.4-1.9) did not. This test is doing its job — it's harder to fool.
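Both quantities can be sketched directly from Bailey & Lopez de Prado's formulas. The `var_trials` scaling below is an illustrative assumption (unit cross-trial variance); with 96 trials it lands near the 2.58 noise benchmark quoted above.

```python
import numpy as np
from scipy.stats import norm

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials: int, var_trials: float = 1.0) -> float:
    """Expected maximum Sharpe among n_trials independent noise strategies
    (Bailey & Lopez de Prado, 2014). var_trials is the cross-trial variance
    of the Sharpe estimates; 1.0 is an illustrative assumption."""
    z1 = norm.ppf(1.0 - 1.0 / n_trials)
    z2 = norm.ppf(1.0 - 1.0 / (n_trials * np.e))
    return float(np.sqrt(var_trials) * ((1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2))

def deflated_sharpe_ratio(sr_hat: float, sr_benchmark: float, n_obs: int,
                          skew: float, kurt: float) -> float:
    """Probability the observed per-period Sharpe exceeds the noise benchmark,
    adjusting for skewness, kurtosis (raw, not excess), and sample length."""
    num = (sr_hat - sr_benchmark) * np.sqrt(n_obs - 1)
    den = np.sqrt(1.0 - skew * sr_hat + ((kurt - 1.0) / 4.0) * sr_hat ** 2)
    return float(norm.cdf(num / den))
```

A DSR above 0.95 is the usual pass threshold: the observed Sharpe is unlikely to be the lucky maximum of the trials tested.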
Instrument Subsample Stability (20 random splits)
Split the 20 instruments into random halves 20 times. Compute RankIC on each half. If the signal is broad-based across instruments, both halves should consistently show positive RankIC.
Result: All models showed 95-100% stability. The signal isn't concentrated in a few instruments — it works across the full commodity universe.
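A compact sketch of the check (split conventions and function names are our assumptions):

```python
import numpy as np
import pandas as pd

def subsample_stability(pred: pd.DataFrame, realized: pd.DataFrame,
                        n_splits: int = 20, seed: int = 0) -> float:
    """Split the instrument columns into random halves n_splits times.
    The stability score is the fraction of half-universes whose mean
    daily RankIC is positive."""
    cols = np.array(pred.columns)
    rng = np.random.default_rng(seed)
    positive = 0
    for _ in range(n_splits):
        rng.shuffle(cols)
        for half in (cols[: len(cols) // 2], cols[len(cols) // 2:]):
            p, r = pred[list(half)], realized[list(half)]
            ic = p.rank(axis=1).corrwith(r.rank(axis=1), axis=1).mean()
            if ic > 0:
                positive += 1
    return positive / (2 * n_splits)
```

A concentrated signal (say, one driven only by the energy complex) would fail whenever the random split puts most energy contracts in one half.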
Decay Analysis
Real predictive signals decay smoothly as the forecast horizon extends from 1 day to 20 days. Noise signals show erratic, non-monotonic patterns. We compute RankIC at horizons [1, 2, 5, 10, 20] days.
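The horizon sweep can be sketched as follows (the forward-return definition and column layout are our assumptions):

```python
import numpy as np
import pandas as pd

def ic_decay_profile(pred: pd.DataFrame, close: pd.DataFrame,
                     horizons=(1, 2, 5, 10, 20)) -> dict:
    """RankIC of today's predictions against forward returns at each horizon.
    A genuine signal should decay smoothly with horizon; noise is erratic."""
    profile = {}
    for h in horizons:
        fwd = close.shift(-h) / close - 1.0  # h-day forward return per instrument
        ic = pred.rank(axis=1).corrwith(fwd.rank(axis=1), axis=1).mean()
        profile[h] = float(ic)
    return profile
```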
CV Consistency
The fraction of internal cross-validation folds with positive validation RankIC. This tests whether the signal is stable across different training periods, not just the final validation window.
Verdicts
Pre-committed thresholds — set before seeing any results:
| Verdict | Criteria |
|---|---|
| ROBUST | CV consistency >= 75% AND permutation p < 0.05 AND subsample stability > 50% AND validation net Sharpe > 0.3 |
| MARGINAL | CV consistency >= 50% AND permutation p < 0.10 |
| UNSTABLE | Everything else |

Results: OOS 2024 — The Search for Robustness
Setup: Train 2017-2022, Validation 2023, OOS 2024. 20 agro+energy instruments. 7 independent runs, each evolving ~96 individuals over 8 evolution rounds.

| Run | Verdict | Val Net Sharpe | OOS Sharpe | OOS Net Sharpe | OOS Net Return | OOS Max DD |
|---|---|---|---|---|---|---|
| 1 | UNSTABLE | 1.65 | 1.33 | 0.10 | 2.7% | -15.6% |
| 2 | MARGINAL | 1.76 | 1.57 | 0.51 | 14.3% | -14.1% |
| 3 | MARGINAL | 1.84 | 1.07 | -0.02 | -0.5% | -12.9% |
| 4 | UNSTABLE | -0.21 | 1.21 | 0.06 | 1.5% | -12.5% |
| 5 | MARGINAL | 0.13 | 1.21 | 0.10 | 2.6% | -13.4% |
| 6 | UNSTABLE | -0.03 | -1.22 | -2.46 | -67.7% | -33.0% |
| 7 | MARGINAL | 0.67 | -0.39 | -1.55 | -38.0% | -27.5% |
Result: 0 ROBUST out of 7 attempts (4 MARGINAL, 3 UNSTABLE).
Why ROBUST Was Not Achieved
The sole blocker was CV consistency. The three internal CV folds validated on 2020, 2021, and 2022 — three dramatically different market regimes:
- 2020 (CV-val for Fold 1): COVID crash. WTI went negative. Unprecedented volatility.
- 2021 (CV-val for Fold 2): Post-COVID rally. Low volatility, momentum-dominated.
- 2022 (CV-val for Fold 3): Ukraine war. Commodity supercycle. Supply-shock driven.
No factor set consistently produced positive RankIC across all three. Fold 2 (2021) was the persistent failure point — negative in 6 of 7 attempts. The system was honest about this: it flagged every model as having insufficient CV consistency rather than letting them through.
Permutation tests passed 7/7. Subsample stability passed 7/7. The signal is real and broad-based. It just doesn't work across all three of these extreme regimes simultaneously.
The Best OOS 2024 Result
Run 2 delivered 14.3% net return with 0.51 net Sharpe after 10bps/side transaction costs — commercially viable despite not meeting the ROBUST bar. The best model (ID 74, generation 6 crossover) had a raw OOS Sharpe of 1.57.
Results: OOS 2025 — Shifting the Window
An observation from the OOS 2024 analysis: the 2020-2021-2022 regime sequence was unusually hostile to CV consistency. What if we shift the training window forward to exclude the COVID crash?
Setup: Train 2018-2023, Validation 2024, OOS 2025. Two independent runs.

| Model | Verdict | Val Net Sharpe | OOS Sharpe | OOS Net Sharpe | OOS Net Return | OOS Max DD | Turnover |
|---|---|---|---|---|---|---|---|
| Model 80 (Run A) | MARGINAL | 2.47 | 2.23 | 0.97 | 23.2% | -13.9% | 0.60 |
| Model 23 (Run B) | MARGINAL | 1.97 | 1.36 | 0.68 | 18.2% | -9.4% | 0.36 |
Both models are MARGINAL — same CV consistency bottleneck (67%, with 2020 still being the negative fold even from a different window position). But OOS performance is notably stronger.
Model 80 (a generation 7 mutation) achieved:
- OOS Sharpe of 2.23 (vs. best 2024 run's 1.57)
- OOS net return of 23.2% (vs. 14.3%)
- Passed all robustness tests except CV consistency
Model 23 (a generation 1 crossover) had lower turnover (0.36 vs 0.60), resulting in a shallower drawdown (-9.4%) despite a lower raw Sharpe.
What Did the LLM Discover?
The best model (ID 80) uses 18 factors across 5 categories. The top factors by LightGBM importance:

| Factor | Category | Importance | Expression |
|---|---|---|---|
| atr_20d | Price Structure | 8.2% | TS_MEAN(MAX($high-$low, ABS($high-DELAY($close,1)), ABS($low-DELAY($close,1))), 20) / $close |
| vol_ratio_5_20 | Volatility | 7.4% | TS_STD($return, 5) / (TS_STD($return, 20) + 1e-8) |
| momentum_skew_20d | Momentum | 7.2% | TS_SKEW($return, 20) |
| high_low_ratio_5d | Price Structure | 6.7% | TS_MEAN($high / ($low + 1e-8), 5) |
| vol_return_corr_20d | Volatility | 6.6% | TS_CORR(TS_STD($return, 5), $return, 20) |
The evolved factor set spans volatility (ATR, vol ratio, vol-return correlation), momentum (skewness, autocorrelation, decay-weighted returns), price structure (high-low ratios, overnight gaps), and microstructure (gap reversal). No single category dominates — the LightGBM model benefits from diverse signals.
Notably, the top factor is normalized ATR — a volatility measure. Commodity futures are heavily influenced by volatility regimes, and the model learned to weight this signal highest.
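For concreteness, the top factor's DSL expression translates into pandas roughly as follows (a sketch with assumed column names, not the production implementation):

```python
import pandas as pd

def normalized_atr(df: pd.DataFrame, window: int = 20) -> pd.Series:
    """atr_20d: TS_MEAN(MAX(high-low, |high-prev_close|, |low-prev_close|), 20) / close.

    True range captures overnight gaps as well as intraday range; dividing by
    close makes the measure comparable across instruments with very different
    price levels (e.g. Brent vs. Rough Rice)."""
    prev_close = df["close"].shift(1)
    true_range = pd.concat([
        df["high"] - df["low"],
        (df["high"] - prev_close).abs(),
        (df["low"] - prev_close).abs(),
    ], axis=1).max(axis=1)
    return true_range.rolling(window).mean() / df["close"]
```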
What Claude Code Actually Does — The Recursive Self-Improvement
The evolution isn't random. Here's what a concrete mutation step looks like:
Claude receives the parent individual's 18 factors with importance scores and validation metrics. It analyzes:
- Weak factors — `days_since_high_20d` has only 1.6% importance. It's a candidate for replacement.
- Category gaps — The parent has 6 momentum factors and only 1 liquidity factor. An additional liquidity signal could help.
- Redundancy — Two overnight gap factors are computing similar things. One could be replaced with a distinct microstructure signal.
Based on this analysis, Claude writes a new factor set:
- Keeps all 14 high-importance factors unchanged
- Replaces `days_since_high_20d` with `volume_zscore_20d` (adding a liquidity signal)
- Replaces one redundant factor with `gap_reversal` (overnight-to-intraday mean reversion)
- Modifies one factor's lookback period from 10 to 20 days
The child is then evaluated. If it improves on the parent's RankIC, it enters the population for the next round of selection.
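To make the feedback loop concrete, here is a hypothetical sketch of the structured context a mutation step consumes. Every field name and value below is invented for illustration; the post does not show the actual payload format.

```python
# Hypothetical shape of the mutation context handed to Claude.
# All field names and values here are illustrative, not the real schema.
mutation_context = {
    "metrics": {"val_rank_ic": 0.033, "val_net_sharpe": 1.2},
    "factors": [
        {"name": "atr_20d", "importance": 0.082, "category": "price_structure"},
        {"name": "days_since_high_20d", "importance": 0.016, "category": "momentum"},
        # ... remaining factors of the 18-factor set
    ],
    "instructions": (
        "Replace factors below 2% importance; keep high-importance factors "
        "unchanged; fill category gaps before adding more momentum signals."
    ),
}

def weak_factors(ctx: dict, threshold: float = 0.02) -> list:
    """Factors flagged as replacement candidates (importance below threshold)."""
    return [f["name"] for f in ctx["factors"] if f["importance"] < threshold]
```

The point is that the mutation prompt is quantitative and structured, which is what lets Claude make targeted edits rather than random perturbations.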
The "Never Stops" Property
When a run produces UNSTABLE, the system doesn't give up. You launch another run with a different random seed, and the evolutionary process explores a different region of factor space. The 7 OOS 2024 runs demonstrate this: each started from scratch and explored independently. Run 2 found a 14.3% net return model while Run 6 blew up at -67.7%. Both outcomes are informative.
Multiple runs function as a meta-evolution: each run's failures inform what to try next (different training windows, different initial prompts, different hyperparameters).
Evolution Works: The Fitness Trajectory

Both OOS 2025 runs show a clear upward trend in population fitness across generations. In Run A, the mean population RankIC rose from 0.003 at generation 0 to 0.031 at generation 8 — a 10x improvement. The best individual improved from 0.033 to 0.053.
This isn't random search. Random search would show flat mean fitness with occasional lucky spikes. Instead, we see systematic improvement in the population mean, which indicates that the selection + mutation loop is accumulating beneficial changes.
Validation vs. OOS: The Reality Check

The scatter plot of validation net Sharpe vs OOS net Sharpe across all 9 experiments reveals the classic pattern: substantial degradation from validation to OOS. Most points fall well below the diagonal. But the positive finding is that higher validation performance does loosely predict higher OOS performance — the best validation models tend to produce the best OOS results, even if the absolute levels are lower.
The OOS 2025 models (green squares) sit in the upper-right quadrant: high validation performance, and the OOS performance holds up better than the OOS 2024 models. Excluding the COVID year from training appears to produce more generalizable models.
Is OOS Performance Predictable? A Regression Analysis
The scatter plot above hints at a relationship between validation and OOS performance. But how strong is it really? With 10 independent experiments (8 targeting OOS 2024, 2 targeting OOS 2025), we can run a formal regression analysis to answer: can validation metrics predict out-of-sample results?
The Headline: RankIC Shrinks 64%, But Stays Positive

The mean validation RankIC across all 10 runs is 0.055. The mean OOS RankIC is 0.020 — a 64% decay. This shrinkage is expected: validation performance always overstates true predictive power because the model was selected to maximize it.
But the critical finding: 10 out of 10 experiments produced positive OOS RankIC. Even the worst-performing models retained some signal. The signal discovered by the evolutionary search is real — it just isn't as strong as it looks during validation.
Validation RankIC Does Not Predict OOS RankIC

This is a surprising and important result. Regressing OOS RankIC on validation RankIC yields:
OOS_RankIC = 0.020 + 0.003 * val_RankIC (R-squared = 0.00, p = 0.99)
The slope is essentially zero. A model that scores 0.087 on validation isn't expected to score higher OOS than one scoring 0.039. Adding the MARGINAL/UNSTABLE verdict as dummy variables doesn't help either (R-squared = 0.04). The verdict categories carry no additional predictive power for OOS RankIC beyond what's already in the (useless) validation RankIC.
Why? The validation RankIC reflects both genuine signal and overfitting to the validation year. Higher validation RankIC can mean either "found a stronger real signal" or "overfit harder to this specific year." These two effects cancel out in the regression.
But Validation Net Sharpe Does Predict OOS Net Sharpe

The net Sharpe regression tells a different story:
OOS_netSharpe = -0.98 + 0.69 * val_netSharpe (R-squared = 0.42, p = 0.044)
This is statistically significant. Each unit of validation net Sharpe translates to 0.69 units of OOS net Sharpe, with a large negative intercept (-0.98). The intercept means you need a validation net Sharpe above ~1.4 just to expect a positive OOS net Sharpe. This is consistent with what we observe: models with validation net Sharpe below 0.5 tend to lose money OOS.
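The break-even point implied by the fitted line is a one-line check on the ~1.4 figure:

```python
# OOS_netSharpe = a + b * val_netSharpe, with the coefficients fitted above
a, b = -0.98, 0.69
break_even = -a / b  # validation net Sharpe at which expected OOS net Sharpe is zero
print(round(break_even, 2))
```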
Why does net Sharpe predict while RankIC doesn't? Net Sharpe incorporates transaction costs, which penalize high-turnover models. Models that achieve high net Sharpe tend to do so through genuine low-frequency signals rather than high-frequency noise-fitting. Transaction costs act as a natural regularizer.
Verdict Categories: Marginal vs. Unstable

| Metric | MARGINAL (n=6) | UNSTABLE (n=4) |
|---|---|---|
| Mean OOS RankIC | 0.022 | 0.017 |
| Mean OOS net Sharpe | +0.12 | -0.50 |
| Frac OOS RankIC > 0 | 100% | 100% |
| Frac OOS net Sharpe > 0 | 67% | 75% |
MARGINAL models average higher OOS RankIC (0.022 vs 0.017) and much better OOS net Sharpe (+0.12 vs -0.50). However, with only 10 experiments, this difference isn't statistically significant (t-test p = 0.59). The trend is directionally correct — the robustness checks do filter for better OOS models — but we can't claim statistical significance yet.
The UNSTABLE category's 75% positive OOS net Sharpe rate is misleading: one UNSTABLE model (Run 6) blew up at -2.46 OOS net Sharpe, dragging the mean deep into negative territory despite the other UNSTABLE models being mildly positive.
What This Means for Model Selection
Don't chase high validation RankIC. It has zero predictive power for OOS RankIC. A model scoring 0.04 on validation is just as likely to perform well OOS as one scoring 0.08.
Do look at validation net Sharpe. It explains 42% of OOS net Sharpe variance. Require val net Sharpe > 1.4 as a minimum filter.
The verdict system works directionally but needs more data to prove statistical significance. MARGINAL models are safer bets than UNSTABLE ones on average, particularly for avoiding catastrophic OOS failures.
Expect 64% RankIC shrinkage. Budget for it. A validation RankIC of 0.05 should be expected to deliver ~0.02 OOS. Plan position sizing and risk management around the OOS expectation, not the validation number.
Addressing Skepticism Head-On
"It's just overfitting." No. Every reported OOS result comes from data the model never saw during training, validation, or any selection step. The permutation tests confirm the signal isn't explainable by chance (p < 0.001). The deflated Sharpe ratio adjusts for the 96 models tested. We report net-of-cost returns with 10bps/side transaction costs.
"The LLM is just doing random search." No. The evolution fitness chart shows systematic improvement from generation 0 to generation 8 — both in population best and population mean. Random search would show flat mean fitness. Claude's feature importance analysis guides mutations to replace specific weak factors with targeted alternatives.
"Commodity futures are too noisy." The subsample stability test shows 95-100% consistency across random 50/50 instrument splits, even though the universe is only 20 instruments. The signal is broad-based, not concentrated in a few contracts.
"You cherry-picked the good results." We show all 7 OOS 2024 attempts, including the -67.7% disaster (Run 6) and the -38.0% failure (Run 7). Four of seven runs had near-zero or negative OOS net returns. We also show that 0 of 7 achieved ROBUST verdict. If we were cherry-picking, we wouldn't lead with our failures.
"What about after transaction costs?" All reported Sharpe ratios and returns are net of 10bps/side (20bps round-trip) transaction costs. The best OOS 2025 model's raw Sharpe of 2.23 drops to net Sharpe of 0.97 after costs. This is realistic for institutional commodity futures trading.
Connection to "Evolving Deeper LLM Thinking" (Lee et al., 2025)
The closest academic framework to our approach is "Evolving Deeper LLM Thinking" (Lee et al., 2025), which introduces Mind Evolution — a method that uses LLMs as mutation and crossover operators in evolutionary search over natural language solutions. The paper demonstrates this on planning benchmarks (TravelPlanner, Natural Plan, StegPoet), achieving 95.6% success on TravelPlanner. Our system applies the same core idea to a very different domain: financial alpha discovery.
What We Share
The architectural parallels are deep:
| Dimension | Mind Evolution | QuantaAlphaLGBM |
|---|---|---|
| Representation | Natural language solutions | DSL factor expressions (structured text) |
| Mutation | LLM refines solution via RCC | Claude analyzes importances, replaces weak factors |
| Crossover | LLM merges 1-5 parent solutions | Claude combines top factors from two parents |
| Fitness | Programmatic constraint checker + score | Validation RankIC from LightGBM backtest |
| Self-critique | Critic-Author conversation (RCC) | Feature importance analysis guides mutations |
| Population | 4 islands x 20 candidates/gen | Single population, 16 initial + ~80 offspring |
| Total candidates | ~800 per problem | ~96 per run |
| Generations | 10 | 8 |
Both systems share a critical design principle: the LLM never executes — it only proposes. In Mind Evolution, solutions are "parsed and evaluated programmatically." In our system, Claude writes factor expressions, but all computation, training, and evaluation happens in deterministic Python code. This eliminates hallucination risks at the execution layer.
Both systems also share the insight that LLMs are better evolutionary operators than traditional bit-flip mutations because they understand the structure of the solution space. Mind Evolution's paper states: "the strong language understanding and generation capabilities of an LLM [can] be leveraged to implement powerful recombination." Our evolution fitness plot — where mean population RankIC improves 10x from generation 0 to generation 8 — is empirical confirmation of this claim in the financial domain.
The Critic Mechanism: Their RCC vs. Our Feature Importances
Mind Evolution's key innovation is Refine via Critical Conversation (RCC): a critic persona analyzes the parent solution and evaluation feedback, suggests corrections, then an author persona proposes refined solutions. Their ablation study shows this is essential:
| Components | Success Rate |
|---|---|
| No critic, no feedback | 46.1% |
| + Critic | 71.1% |
| + Structured prompts | 76.1% |
| + Textual feedback from evaluator | 91.1% |
| + Island resets with LLM | 95.6% (full system) |
Adding the critic alone jumps performance from 46.1% to 71.1%. But the biggest single gain comes from textual feedback — giving the LLM structured information about what went wrong with each solution.
Our system has an analog to both components. The "critic" is Claude reading feature importance scores and validation metrics. The "textual feedback" is the structured data Claude receives: which factors had less than 2% importance (effectively dead weight), which categories are over-represented, what the validation RankIC and Sharpe look like. This feedback is what enables targeted mutations rather than random perturbations.
We don't have the explicit two-persona separation (critic vs. author) — Claude does both in a single pass. Mind Evolution's ablation suggests we might benefit from adding an explicit critique step before mutation, though our single-pass approach may work because the feedback signal (feature importances) is already highly structured and quantitative, unlike the natural language constraint violations in their planning tasks.
Population Structure: Islands vs. Single Pool
Mind Evolution uses a 4-island model with cyclic migration (top 5 solutions cloned to the next island each generation) and periodic resets (every 3 generations, the 2 worst islands are replaced with global elites). Their ablation shows this matters: removing the island model drops trip planning success from 87.5% to 77.4%.
We use a single population with high offspring counts (6 mutations + 4 crossovers per round, 8 rounds). Our diversity comes from two sources: (1) the breadth of factor categories Claude draws from (momentum, volatility, price structure, microstructure, liquidity), and (2) running multiple independent experiments (7 runs for OOS 2024, 2 runs for OOS 2025).
In retrospect, Mind Evolution's island model is elegant and we may be leaving performance on the table. Their islands naturally maintain strategy diversity — in our context, this could mean one island exploring momentum-heavy factor sets, another exploring volatility regimes, another mean-reversion signals. The periodic reset mechanism (replacing weak islands with elites) maps to a problem we observe: within a single run, population diversity narrows in later generations as selection pressure dominates. Islands with migration could sustain exploration longer.
Where the Domains Diverge
Mind Evolution operates on constraint satisfaction problems with deterministic evaluators. A travel plan either satisfies the budget constraint or it doesn't. The fitness landscape is noisy only in the LLM's generation process, not in the evaluation.
Financial alpha is fundamentally different: the evaluator itself is noisy. A factor set's validation RankIC is a sample estimate from one year of data across 20 instruments. The same factor set might score 0.05 on one validation year and -0.01 on another. This means:
Selection is unreliable. Mind Evolution can confidently keep the highest-scoring solution because the score is deterministic. We can't — our "best" model might just be lucky on the validation set. This is why we need the False Discovery Gauntlet (permutation tests, deflated Sharpe) that Mind Evolution doesn't require.
More candidates can hurt. Mind Evolution benefits monotonically from more candidates — their scaling curves show steady improvement with more generations. In finance, testing more candidates inflates the multiple testing problem. Our deflated Sharpe calculation penalizes for all 96 individuals tested, and the expected maximum noise Sharpe (2.58) is already high. Mind Evolution's 800 candidates in a financial context would demand extreme observed performance to survive multiple testing correction.
Fitness feedback is weaker. Mind Evolution's evaluator returns structured textual feedback: "the plan violates the budget constraint by $200" or "meeting conflicts with Alice's schedule." Our evaluator returns a single number (RankIC = 0.033) with feature importances. There's no textual explanation of why a factor set failed. Claude must infer this from the importance scores and metrics, which is less informative than Mind Evolution's constraint-specific feedback. Adding richer evaluation feedback — perhaps explaining which instruments the model failed on, or which time periods had negative IC — could improve our mutation quality.
What Mind Evolution Suggests We Should Try
Multi-island populations with distinct initial themes (momentum island, volatility island, mean-reversion island, microstructure island) and periodic migration of top factors between islands.
Explicit critic-author separation in the evolution prompt. Currently Claude does analysis and generation in one pass. A two-step approach — first generating a detailed critique of the parent's weaknesses, then generating mutations conditioned on that critique — mirrors RCC and might improve mutation quality.
Richer textual feedback from the evaluator. Instead of just passing feature importances and metrics, we could generate natural language descriptions: "The model performed well on energy instruments but poorly on agricultural — consider factors that capture supply-season dynamics" or "RankIC was positive in Q1-Q3 but turned negative in Q4 during the volatility spike."
Island resets with LLM-guided diversity selection. Mind Evolution doesn't just reset weak islands with the global best — it uses the LLM to select diverse starting points from the top 15 candidates. This combats premature convergence more effectively than random restarts.
The Key Takeaway
Mind Evolution validates the core mechanism behind our system: LLMs are effective evolutionary operators because they understand solution structure. Their 95.6% success rate on planning benchmarks demonstrates the ceiling of this approach on deterministic problems. Our contribution is showing it also works on stochastic problems — financial alpha discovery — where the evaluator is noisy and false discoveries are the primary risk. The price of operating in this domain is the entire robustness apparatus that Mind Evolution doesn't need: permutation tests, deflated Sharpe ratios, subsample stability checks, and pre-committed verdict thresholds.
Mind Evolution solves the search problem. We solve the search problem and the validation problem. Both are necessary for finance.
Conclusion — What This Means for Quant Research
The system works: it finds statistically significant signals in commodity futures that survive rigorous false discovery rate checks and produce positive returns on unseen data. The recursive loop of hypothesis generation, evolution, and validation is genuinely autonomous — once launched, it runs for hours without human intervention.
Current state: MARGINAL models with positive OOS returns. The best result — 23.2% net return with 0.97 net Sharpe on OOS 2025 data — is promising but not yet ROBUST by our pre-committed criteria. The bottleneck is CV consistency across extreme regime changes (COVID, post-COVID, Ukraine war).
What the system demonstrates:
- LLMs can do more than generate code — they can reason about why a factor works, what's missing from a factor set, and how to improve it
- Evolutionary algorithms with LLM operators are more efficient than random search, as shown by the systematic fitness improvement across generations
- Rigorous statistical validation can be automated, making it harder to fool yourself — the system flagged its own models as MARGINAL rather than letting them pass
Next steps: Turnover penalties in the fitness function (currently pure RankIC selection), regime detection to handle structural breaks, and expanded universe beyond energy and agriculture.
The recursive loop means every iteration makes the system better at finding signals — and better at knowing when it hasn't.
All experiments run on commodity futures daily OHLCV data (2016-2025). Transaction costs: 10bps per side. Statistical significance: two-sided except where noted. Code and methodology available on request.