Automated Quant Research with AI Agents: How Microsoft's RD-Agent Achieves 2x Returns with 70% Fewer Factors
The Quant Research Bottleneck
Quantitative finance has always been a war of attrition. Finding alpha requires cycling through thousands of candidate factors, testing model architectures, running backtests, interpreting results, and iterating — often for weeks before a strategy proves viable. The people who do this well are rare. The process is slow. And the signals they find keep decaying faster as markets get more efficient.
What if you could automate the entire loop?
Not just the backtesting — that has been automated for years. The entire research and development cycle: reading financial literature, forming hypotheses about what drives returns, writing the code to compute new factors, evaluating them against real market data, deciding what to try next, and iterating until a strategy emerges.
That is exactly what Microsoft Research's RD-Agent framework does. And the results, published in a paper recently accepted at NeurIPS 2025, are striking: approximately 2x higher annualized returns than classical factor libraries like Alpha 158 and Alpha 360, while using over 70% fewer factors, at a cost of under $10 per optimization cycle.
This post breaks down how it works, what the results actually show, and what this means for the future of quantitative research.
What Is RD-Agent?
RD-Agent is an open-source multi-agent framework built by Microsoft Research. The name captures the core idea: it automates both the R (Research) and the D (Development) sides of data-driven R&D.
The framework is general-purpose — it supports Kaggle competitions, medical prediction, and general ML engineering — but its most mature and impressive application is quantitative trading, packaged as RD-Agent(Q).
RD-Agent(Q) is described in the paper as "the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization." That is a mouthful, but the key phrase is factor-model co-optimization: the system simultaneously discovers better features (factors) and better models, with each improving the other.
Why This Is Hard
Before diving into the architecture, it helps to understand why automating quant research is fundamentally different from, say, automating code generation or document summarization.
1. High dimensionality. Financial data is massive. Hundreds of stocks, years of daily bars, thousands of candidate features. The search space for useful signals is enormous.
2. Non-stationarity. Markets change. A factor that worked in 2015 may be worthless by 2020. Any automated system needs to handle regime shifts, not just static pattern matching.
3. Signal decay and redundancy. Most candidate factors are noise. Of the ones that aren't, many are correlated with factors you already have. Adding a redundant factor doesn't improve your strategy — it just overfits.
4. The research-development gap. Having a good hypothesis ("momentum reversal after earnings surprises") is useless unless you can translate it into correct, executable code that computes the factor from raw market data. This translation step is where most manual quant research time goes.
5. Fragmented optimization. In traditional workflows, factor mining and model selection happen in separate silos. A factor researcher hands off a feature set; a model researcher trains on it. But the optimal factors depend on the model, and the optimal model depends on the factors. Jointly optimizing both is what RD-Agent(Q) attempts.
Architecture: The R&D Loop
The framework operates as a continuous loop with two phases — Research and Development — connected by a feedback mechanism that decides what to try next.
```
┌─────────────────────────────────────────────────┐
│ RESEARCH PHASE │
│ │
│ ┌───────────────────┐ ┌───────────────────┐ │
│ │ Specification Unit │ │ Synthesis Unit │ │
│ │ │ │ │ │
│ │ • Task context │ │ • Knowledge forest│ │
│ │ • Data interfaces │ │ • SOTA tracking │ │
│ │ • Output formats │ │ • Hypothesis gen │ │
│ │ • Execution env │ │ • Complexity adj │ │
│ └───────────────────┘ └────────┬──────────┘ │
│ │ │
└──────────────────────────────────┼──────────────┘
│
Hypotheses (factor
or model proposals)
│
▼
┌─────────────────────────────────────────────────┐
│ DEVELOPMENT PHASE │
│ │
│ ┌───────────────────┐ │
│ │ Implementation │ Co-STEER Agent │
│ │ Unit │ • DAG task scheduling │
│ │ │ • Code generation │
│ │ │ • Knowledge base reuse │
│ └────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────┐ │
│ │ Validation Unit │ Qlib Backtesting │
│ │ │ • IC de-duplication │
│ │ │ • Factor/model pairing │
│ │ │ • Real market backtest │
│ └────────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────┐ │
│ │ Analysis Unit │ Multi-Armed Bandit │
│ │ │ • 8-dim performance │
│ │ │ • Thompson sampling │
│ │ │ • Factor vs model? │
│ └────────┬──────────┘ │
│ │ │
└───────────┼─────────────────────────────────────┘
│
│ Performance feedback
│ + action selection
│
└──────────────────┐
│
┌──────────────────────────────┘
│
▼
┌─────────────────┐
│ SOTA Factor │◄──── Best factors & models
│ & Model Set │ discovered so far
└────────┬────────┘
│
│ Feed back into
│ next iteration
│
└──────────► (back to Research Phase)
```
Research Phase
The Research phase is responsible for generating hypotheses. It has two main components:
Specification Unit — This formalizes the task context. It encodes background assumptions, data interfaces, output formats, and execution environments into structured specifications. This ensures that every candidate factor or model the system generates is compatible with the backtesting infrastructure (built on Qlib, Microsoft's open-source quant platform).
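To make this concrete, here is a sketch of what such a structured specification might contain. The field names are illustrative assumptions, not RD-Agent's actual schema; only the `$close`-style field names follow Qlib's conventions.

```python
# Hypothetical task specification; field names are illustrative, not RD-Agent's schema.
factor_spec = {
    "background": "Cross-sectional factor for CSI 300 daily bars",
    "data_interface": {
        "source": "qlib",                                # backtesting platform used by the framework
        "fields": ["$open", "$high", "$low", "$close", "$volume"],  # Qlib-style field names
        "frequency": "day",
    },
    "output_format": {
        "index": ["datetime", "instrument"],             # MultiIndex the backtester expects
        "columns": ["factor_value"],
        "dtype": "float64",
    },
    "execution_env": {"python": "3.10", "docker": True},
}
```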
Synthesis Unit — This is where new ideas come from. The system maintains a "knowledge forest" of all previous experiments — hypotheses, implementations, and their results. When generating a new hypothesis, it conditions on this history through a generation function $h_{t+1} = f(\mathcal{H}_t, \mathcal{F}_t)$, where $\mathcal{H}_t$ represents historical hypotheses and $\mathcal{F}_t$ represents the corresponding feedback. It also maintains a "State-of-the-Art" (SOTA) set of the best-performing factors and models discovered so far, and adaptively adjusts hypothesis complexity based on recent performance.
A key design choice: factor hypotheses decompose into multiple subtasks (e.g., compute a momentum signal, compute a volatility signal, combine them), while model hypotheses map to single coherent tasks. This reflects the different natures of the two problems — factors are compositional, models are holistic.
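As a rough illustration of that asymmetry, a factor hypothesis might carry a list of subtasks while a model hypothesis maps to a single task. These structures are hypothetical, not RD-Agent's internal classes:

```python
from dataclasses import dataclass, field

@dataclass
class FactorHypothesis:
    idea: str                                           # e.g. "volatility-adjusted momentum"
    subtasks: list[str] = field(default_factory=list)   # compositional: several factor computations

@dataclass
class ModelHypothesis:
    idea: str                                           # e.g. "GRU over 20-day feature windows"
    task: str = ""                                      # holistic: one coherent implementation task

h = FactorHypothesis(
    idea="volatility-adjusted momentum",
    subtasks=["compute 20d momentum", "compute 20d volatility", "combine as ratio"],
)
```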
Development Phase
The Development phase takes hypotheses and turns them into working code and validated results. Three components:
Implementation Unit (Co-STEER) — This is the code generation agent. It constructs a directed acyclic graph (DAG) of task dependencies, generates Python code for each task, and uses a knowledge base of previous task-code-feedback triples for similarity-based retrieval. When a new task resembles something it has coded before, it leverages that experience. The system uses guided chain-of-thought reasoning with execution feedback loops — if the code fails, it sees the error and iterates.
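The execution-feedback loop is the part that is easy to sketch. Here `llm` stands in for any chat-completion client that returns a code string; this is a minimal sketch of the pattern, not Co-STEER's actual implementation:

```python
import subprocess
import tempfile

def implement_task(task_desc: str, llm, max_attempts: int = 5) -> str | None:
    """Generate code for a task, run it, and feed errors back until it executes."""
    feedback = ""
    for _ in range(max_attempts):
        prompt = (
            f"Task: {task_desc}\n"
            f"Previous error (if any): {feedback}\n"
            "Write a complete Python script."
        )
        code = llm(prompt)  # hypothetical LLM call returning a code string
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=600)
        if result.returncode == 0:
            return code                    # executable implementation found
        feedback = result.stderr[-2000:]   # tail of the traceback guides the next attempt
    return None                            # give up after max_attempts
```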
Validation Unit — This is where the rubber meets the road. For factors, the system applies de-duplication by computing the information coefficient (IC) between the new factor and all existing factors. If the IC with any existing factor exceeds a redundancy threshold, the new factor is deemed redundant and discarded. Surviving candidates are paired with the current best model and backtested on real market data. For models, the process is reversed — each candidate model is paired with the current best factor set.
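A minimal version of the de-duplication check, assuming factors are pandas Series on a (datetime, instrument) MultiIndex; the 0.99 default threshold is an assumption, not a value confirmed by the paper:

```python
import pandas as pd

def is_redundant(new_factor: pd.Series, existing: list[pd.Series], threshold: float = 0.99) -> bool:
    """Discard a candidate factor whose IC with any existing factor is too high.
    The 0.99 default is an assumed value, not one confirmed by the paper."""
    for old in existing:
        # Align the two factors on their shared (datetime, instrument) index.
        joined = pd.concat([new_factor, old], axis=1, join="inner").dropna()
        # Daily cross-sectional correlation, averaged over days.
        ic = (
            joined.groupby(level="datetime")
            .apply(lambda day: day.iloc[:, 0].corr(day.iloc[:, 1]))
            .mean()
        )
        if abs(ic) > threshold:
            return True
    return False
```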
Analysis Unit — After validation, this component conducts multi-dimensional performance assessment and feeds results back to the Research phase. Critically, it uses a multi-armed bandit scheduler to decide what to optimize next.
The Multi-Armed Bandit Scheduler
This is one of the more elegant pieces of the system. At each iteration, the agent must decide: should I try to find a better factor, or a better model? This is a classic exploration-exploitation tradeoff.
RD-Agent(Q) formulates this as a contextual Thompson sampling problem. The system maintains an 8-dimensional performance state vector encoding:
- IC, ICIR, Rank IC, Rank ICIR (predictive power metrics)
- ARR, IR, MDD, SR (strategy performance metrics)
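For reference, the predictive metrics have standard definitions in the factor-investing literature. With $f_{i,t}$ the factor score and $r_{i,t+1}$ the next-period return of stock $i$:

```latex
\mathrm{IC}_t = \operatorname{corr}\left(f_{\cdot,t},\ r_{\cdot,t+1}\right), \qquad
\mathrm{IC} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{IC}_t, \qquad
\mathrm{ICIR} = \frac{\operatorname{mean}_t(\mathrm{IC}_t)}{\operatorname{std}_t(\mathrm{IC}_t)}
```

Rank IC and Rank ICIR are the same quantities computed on cross-sectional ranks (Spearman rather than Pearson correlation).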
The action space is simply {factor, model}. For each arm, a separate Bayesian linear model with Gaussian posteriors encodes uncertainty over reward coefficients. At each step, the system samples from each posterior, picks the action with the highest sampled reward, executes it, and updates the posteriors with the observed improvement.
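A minimal sketch of this mechanism, assuming a standard Bayesian linear-regression posterior per arm (the paper's exact parameterization may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # 8-dim performance state (IC, ICIR, ..., SR)
arms = ("factor", "model")

# Per-arm Gaussian posterior over reward coefficients, kept in precision form.
state = {a: {"A": np.eye(D), "b": np.zeros(D)} for a in arms}

def select_action(x: np.ndarray) -> str:
    """Thompson sampling: draw coefficients from each posterior, pick the best arm."""
    scores = {}
    for a in arms:
        cov = np.linalg.inv(state[a]["A"])           # posterior covariance
        mean = cov @ state[a]["b"]                   # posterior mean
        theta = rng.multivariate_normal(mean, cov)   # one posterior sample
        scores[a] = float(theta @ x)
    return max(scores, key=scores.get)

def update(action: str, x: np.ndarray, reward: float) -> None:
    """Conjugate update after observing the realized improvement."""
    state[action]["A"] += np.outer(x, x)
    state[action]["b"] += reward * x
```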
The ablation study confirms this matters. Compared to random action selection (IC: 0.0445, ARR: 8.97%) and LLM-based selection (IC: 0.0476, ARR: 10.09%), the bandit scheduler achieves IC: 0.0532 and ARR: 14.21% — a substantial improvement.
| Strategy | IC | ARR | SOTA Selections |
|---|---|---|---|
| Random | 0.0445 | 8.97% | 7 |
| LLM-based | 0.0476 | 10.09% | 5 |
| Bandit | 0.0532 | 14.21% | 8 |
Results: What the Numbers Actually Say
The paper evaluates RD-Agent(Q) on the CSI 300 index (300 large-cap Chinese A-shares) with training data from 2008-2014, validation from 2015-2016, and testing from 2017-2020.
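In Qlib, this kind of split is expressed as date segments in the dataset configuration. A sketch matching the dates above (the exact start/end days within each year are assumptions):

```python
# Qlib-style dataset segments; the precise day boundaries are assumptions.
segments = {
    "train": ("2008-01-01", "2014-12-31"),
    "valid": ("2015-01-01", "2016-12-31"),
    "test":  ("2017-01-01", "2020-12-31"),
}
```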
Baselines
The comparison set is comprehensive:
- Factor libraries: Alpha 101, Alpha 158, Alpha 360, AutoAlpha
- Machine learning models: Linear, MLP, LightGBM, XGBoost, CatBoost, DoubleEnsemble
- Deep learning models: GRU, LSTM, ALSTM, Transformer, PatchTST, iTransformer, Mamba, TRA, MASTER, GATs
Three Configurations
The paper tests three modes to isolate contributions:
- R&D-Factor — Fixed LightGBM model, optimize factors only (starting from Alpha 20 baseline)
- R&D-Model — Fixed Alpha 20 factors, search for better models only
- R&D-Agent(Q) — Joint optimization of both factors and models
Headline Results
| Component | IC | Rank IC | ARR | IR | MDD |
|---|---|---|---|---|---|
| Alpha 158 + LightGBM | 0.0420 | 0.0514 | 6.80% | 0.7890 | -12.53% |
| Alpha 360 + LightGBM | 0.0420 | 0.0514 | 7.92% | 0.9438 | -10.31% |
| TRA (best deep model) | — | — | 10.28% | 1.2064 | -8.23% |
| MASTER | — | — | 9.87% | 1.3406 | -7.18% |
| R&D-Factor (o3-mini) | 0.0497 | 0.0463 | 11.84% | 1.5214 | -8.26% |
| R&D-Model (o3-mini) | 0.0467 | 0.0546 | 10.99% | 1.2914 | -6.94% |
| R&D-Agent(Q) (o3-mini) | 0.0532 | 0.0495 | 14.21% | 1.7382 | -7.42% |
The joint optimization mode (R&D-Agent(Q)) achieves the best IC (0.0532, +26.7% over Alpha 360), the best annualized return (14.21%, +38% over TRA), and the best information ratio (1.7382, +30% over MASTER). Maximum drawdown is well-controlled at -7.42%.
What makes this more impressive: R&D-Factor achieves comparable IC to Alpha 158/360 using only ~22% of their factors. The iterative refinement process strips out redundant and regime-sensitive signals, keeping only the factors that actually contribute.
Cross-Market Generalization
The framework was also tested on CSI 500 and NASDAQ 100 with a more recent test period (2024 to mid-2025):
| Market | IC | ICIR | IR | MDD |
|---|---|---|---|---|
| CSI 500 | 0.0288 | 0.1828 | 2.1721 | -6.56% |
| NASDAQ 100 | 0.0162 | 0.1035 | 1.7737 | -6.34% |
These results confirm out-of-sample robustness across different markets and time periods.
How the Agent Discovers Factors
One of the most interesting analyses in the paper is the visualization of how factor hypotheses evolve over time. The authors cluster factor hypotheses by semantic similarity and reveal three patterns:
1. Local refinement then directional shift. The agent starts by iterating within a conceptual thread — say, momentum-based factors — refining its implementation across several trials. Then it shifts to an entirely different direction, like volatility-based features. This mirrors how a human researcher might exhaust one idea before pivoting.
2. Strategic revisitation. Later in the optimization, the agent sometimes returns to promising early hypotheses and refines them with knowledge gained from intermediate experiments. Trial 26 in the paper clusters with earlier trials 12-14, showing the system can "revisit and incrementally refine promising early hypotheses."
3. Diverse paths yield synergy. Of 36 trials in one run, 8 were selected into the final SOTA factor set, spanning 5 of 6 hypothesis clusters. The best strategy emerges from exploring multiple conceptual directions, not from deep-diving into one.
This is remarkably similar to how experienced quant researchers describe their own workflows — except the agent completes a full cycle in minutes rather than days.
Co-STEER: The Code Generation Engine
A hypothesis is worthless without a correct implementation. The Co-STEER agent handles this translation from natural language hypothesis to executable Python code.
The key innovation is the knowledge base. Every time Co-STEER implements a task, it records a triple: (task description, generated code, execution feedback). When a new task arrives, it retrieves similar past tasks and uses their code as a starting point.
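A minimal sketch of that retrieval pattern; the embedding function and in-memory store are assumptions, not RD-Agent's actual implementation:

```python
import numpy as np

class CodeKnowledgeBase:
    """Sketch of a task-code-feedback store with similarity-based retrieval.
    `embed` is a stand-in for any sentence-embedding function."""
    def __init__(self, embed):
        self.embed = embed
        self.entries = []          # list of (task_vec, task, code, feedback) tuples

    def add(self, task: str, code: str, feedback: str) -> None:
        self.entries.append((self.embed(task), task, code, feedback))

    def retrieve(self, task: str, k: int = 3):
        """Return the k most similar past tasks by cosine similarity."""
        q = self.embed(task)
        scored = sorted(
            self.entries,
            key=lambda e: -float(np.dot(q, e[0]) / (np.linalg.norm(q) * np.linalg.norm(e[0]))),
        )
        return [(t, c, f) for _, t, c, f in scored[:k]]
```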
Pass@k experiments show that both GPT-4o and o3-mini converge to high success rates within a few iterations, with o3-mini showing stronger chain-of-thought reasoning — "a clear advantage in structured, high-dependency coding scenarios."
The system also supports multiple LLM backends. Evaluation across six API variants shows o1 achieves top performance through "strong strategic breakthroughs," GPT-4.1 ranks second, and o3-mini and GPT-4o deliver comparable results — suggesting the framework is robust to the underlying model choice.
Cost and Practicality
Perhaps the most surprising number in the paper: the entire optimization cycle costs under $10 in API calls. For context, a single experienced quant researcher costs hundreds of dollars per hour, and a typical factor research cycle takes days to weeks.
The framework runs on:
- A Linux environment with Docker
- Python 3.10+ with a Conda environment
- API credentials for an LLM provider (OpenAI, Azure, DeepSeek, or alternatives)
- Qlib as the backtesting engine
This is accessible enough that a small fund or even a sophisticated individual could run it. The open-source nature (MIT license) removes the typical barriers to entry for institutional-grade quant research tools.
Limitations and Caveats
The paper is transparent about several limitations:
1. Reliance on LLM knowledge. The system currently uses only the LLM's internal financial knowledge to generate hypotheses. It does not ingest real-time news, alternative data feeds, or structured domain priors. Future versions could incorporate these, potentially expanding the hypothesis space significantly.
2. Data leakage prevention. The data-centric design prevents LLMs from accessing raw market data or explicit temporal splits, which mitigates information leakage. But this is a design constraint that limits certain types of reasoning.
3. Market impact and capacity. The backtests assume a long-short strategy on the CSI 300. Real-world deployment would face slippage, market impact, and capacity constraints that are not modeled.
4. No financial advice. The authors include an explicit disclaimer: the framework "does not provide financial opinions, nor is it designed to replace the role of qualified financial professionals." Users must independently validate any generated strategies.
What This Means for Quant Research
RD-Agent(Q) is not going to replace human quant researchers tomorrow. But it represents a genuine shift in how the research pipeline could work.
The strongest implication is not the returns — it is the efficiency. Achieving comparable or better performance with 70% fewer factors means the system is finding cleaner, more orthogonal signals. Fewer factors means less overfitting risk, lower turnover, and simpler production systems. This matters more than raw backtest returns in practice.
The co-optimization result is also significant. The joint factor-model optimization consistently outperforms optimizing either component alone. This validates the intuition that factors and models are deeply coupled — the right features depend on the model, and vice versa. Traditional quant workflows that separate these into different teams may be leaving alpha on the table.
The cost structure changes the calculus. At $10 per cycle, you can afford to run thousands of experiments. This transforms quant research from a high-cost, low-throughput activity into something closer to automated hyperparameter search. The limiting factor shifts from researcher time to idea quality — and the LLM handles idea generation too.
The MLE-Bench results also suggest broader applicability. RD-Agent is currently the top-performing ML engineering agent on this benchmark of 75 Kaggle competitions, achieving 30.22% overall with o3 for research and GPT-4.1 for development. The quant application is the flagship, but the underlying R&D loop is general.
Getting Started
The framework is open-source and available at github.com/microsoft/RD-Agent. The quant-specific scenario builds on Microsoft's Qlib platform for backtesting.
To run the quant scenario:
- Set up a Linux environment with Docker
- Install RD-Agent via pip in a Python 3.10+ Conda environment
- Configure your LLM API credentials (supports OpenAI, Azure, DeepSeek via LiteLLM)
- Run the quant factor/model evolution loop
The project has active development, a Discord community, and detailed documentation for each scenario.
Bottom Line
Microsoft's RD-Agent(Q) demonstrates that fully automated quant research is no longer theoretical. A multi-agent system can generate hypotheses, write code, run backtests, interpret results, and iterate — producing strategies that outperform both classical factor libraries and state-of-the-art deep learning models.
The key insight is not that AI can trade. It is that AI can do the research that leads to trading — the slow, expensive, human-intensive part of the pipeline. And it can do it for $10.
References:
- Wang, X., et al. "Data-Centric AI for Quantitative Investment: RD-Agent(Q) for Full-Stack Factor-Model Co-Optimization." NeurIPS 2025. arXiv:2505.15155
- Microsoft RD-Agent GitHub Repository
- Microsoft Qlib: Quantitative Investment Platform