Using AI Agents to Forecast Prediction Markets
Prediction markets are one of the cleanest ways we have to turn knowledge into numbers. Prices in markets like Polymarket or Manifold encode a crowd’s belief that some future event will happen – “Yes” shares trade around the probability the event resolves to “Yes.”
The new twist: what if one of the “traders” is an AI agent?
Over the last year, several groups have started to systematically let LLM agents forecast, or even bet directly, in prediction markets and evaluate how good they actually are. Four recent projects sit in this emerging space:
- AIA Forecaster (Bridgewater AIA Labs) – judgmental forecasting agent with search + ensembling + calibration.
- PrediBench (PresageLabs) – agents literally bet $1 per market on Polymarket.
- FutureBench / “Back to the Future” (Together AI) – agents forecast future events, many sourced from prediction markets.
- FutureX (ByteDance/Fudan/etc.) – a live benchmark for agents predicting future events, and a critique of prediction-market–only benchmarks.
Below is a synthesis of these projects, with a special focus on Bridgewater’s AIA Forecaster and the idea that AI agents plus market consensus can beat the market alone.
How do AI agents actually “trade” in prediction markets?
Across these projects the basic loop is similar:
Read the market question. E.g. “Will Zohran Mamdani’s RCV margin of victory be greater than 13% in the NYC mayoral Democratic primary?”
Do agentic research.
- Use web search over news, polls, expert commentary, historical base rates, etc.
- Fetch and scrape pages, not just search snippets.
- Optionally query APIs or structured data (econ releases, sports stats, etc.).
PrediBench agents get `web_search` and `visit_webpage` tools and can chain them through a smolagents framework. FutureBench agents similarly use Tavily search plus a web scraper to build a lightweight research pipeline.
Produce a probability and a trade.
- A probability, like “P(Yes) = 0.63”, plus an explanation.
- A position: buy “Yes”, buy “No”, or sometimes leave capital unallocated.
PrediBench wraps this in a `SingleInvestmentDecision` schema where the agent outputs a rationale, estimated probability, bet size (positive = Yes, negative = No), and confidence.
Evaluate over time.
- Forecast quality via Brier score (squared error between probability and outcome).
- Trading quality via realized PnL and Sharpe ratio over 1–7 day holding periods.
This is exactly what a good human trader/forecaster does – read, research, reason, then size a bet – but automated and repeatable.
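To make the loop concrete, here is a minimal sketch of the decision schema and scoring, loosely modeled on PrediBench’s `SingleInvestmentDecision` (the class and field names below are illustrative, not the benchmark’s actual API):

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """One decision per market, loosely modeled on PrediBench's
    SingleInvestmentDecision schema (field names here are illustrative)."""
    rationale: str       # the agent's written explanation
    probability: float   # the agent's P(Yes), in [0, 1]
    bet: float           # dollars: positive = buy Yes, negative = buy No, 0 = abstain
    confidence: float    # self-reported confidence in [0, 1]

def brier(probability: float, outcome: int) -> float:
    """Brier score for one binary question: squared error between the
    forecast and the realized outcome (1 = Yes, 0 = No). Lower is better;
    always answering 0.5 scores 0.25 on every question."""
    return (probability - outcome) ** 2

# An agent says P(Yes) = 0.63 and the market resolves Yes:
print(brier(0.63, 1))  # 0.1369
```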
AIA Forecaster: a “superforecaster in a box”
Bridgewater’s AIA Forecaster is currently the most sophisticated system in this space. Conceptually, it’s built from three ideas:
Agentic, adaptive search over high-quality sources. Multiple agents are spawned, each free to issue arbitrary search queries, read documents, and iterate based on previous results. Search isn’t just a single retrieval call; it’s a multi-step research process shaped by what the agent learns along the way.
When they ablate search, performance collapses: on a live benchmark of 64 markets, the Brier score with search is 0.1002, but without it worsens to 0.3609 – worse than just predicting 50% on everything, which scores (0.5 − outcome)² = 0.25 no matter how the question resolves.
Multi-agent ensembling with a supervisor.
- M independent forecasting agents each research and output a probability.
- A supervisor agent sees the full set of forecasts, looks for disagreements, does extra search, and reconciles them into an aggregated probability.
- This is crucial because single LLM runs are “noisy” – missing an article or over-anchoring can swing a probability a lot.
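As a toy illustration of why this helps – the real supervisor is itself an LLM agent doing extra research, not a fixed formula – median-aggregating M independent forecasts already damps single-run noise:

```python
import statistics

def aggregate(forecasts: list[float], supervisor_adjustment: float = 0.0) -> float:
    """Median of M independent agent forecasts, plus an optional nudge from
    a 'supervisor' that has reviewed the disagreements. A toy stand-in for
    AIA's supervisor agent, which reconciles forecasts via further research."""
    center = statistics.median(forecasts)
    return min(max(center + supervisor_adjustment, 0.0), 1.0)

# Five runs of the same question disagree; the median damps the outlier.
print(aggregate([0.55, 0.62, 0.60, 0.71, 0.58]))  # 0.6
```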
Statistical calibration and extremization. LLMs tend to hedge and avoid extreme probabilities. AIA shows they’re systematically too cautious and uses Platt scaling / extremization to push probabilities away from 50% where appropriate, improving Brier scores.
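The paper does not spell out its exact transform, but one standard extremization recipe, shown here as a sketch, pushes probabilities through a log-odds power function:

```python
def extremize(p: float, k: float = 1.5) -> float:
    """Log-odds extremization: k > 1 pushes probabilities away from 0.5,
    counteracting LLM hedging. (A standard recipe; AIA's actual Platt
    scaling / extremization parameters are not reproduced here.)"""
    return p**k / (p**k + (1 - p)**k)

# A hedged 70% becomes a bolder ~78%.
print(round(extremize(0.70), 2))  # 0.78
```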
How good is it?
On ForecastBench, which aggregates questions from prediction markets and other sources, the AIA Forecaster:
- Achieves Brier 0.0753 on the FB-Market subset,
- While human superforecasters achieve 0.0740,
- And the market price baseline is 0.0965,
- Previous LLM forecasters sit around 0.107, and OpenAI’s o3 reasoning model is 0.1096.
Within uncertainty intervals, AIA is statistically indistinguishable from human superforecasters and clearly better than both the crowd and prior LLM systems on this benchmark.
However, ForecastBench is relatively “easy” – many questions are not fully adversarial or deeply traded. On their more demanding MarketLiquid dataset, built from 1,610 questions in liquid public prediction markets, things get harder:
- Market consensus Brier: 0.1106
- AIA Forecaster Brier: 0.1258
Here markets still win, which is what you’d expect in domains where real money and lots of smart people are already competing.
AI + market consensus: better together
The most interesting result in the AIA paper is what happens when you combine AI forecasts with market prices instead of choosing one or the other.
They fit a simple convex regression of the true outcome on two predictors – the market price and AIA’s probability – learning optimal weights for the ensemble.
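A minimal sketch of that fit, assuming the simplest one-parameter version where a single blend weight is chosen to minimize Brier score (the paper’s actual regression may be richer):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_blend_weight(market: np.ndarray, ai: np.ndarray, outcomes: np.ndarray) -> float:
    """Find w for the convex combination p = w * market + (1 - w) * ai
    that minimizes mean Brier score on resolved questions."""
    def mean_brier(w: float) -> float:
        p = w * market + (1 - w) * ai
        return float(np.mean((p - outcomes) ** 2))
    return minimize_scalar(mean_brier, bounds=(0.0, 1.0), method="bounded").x

# On MarketLiquid-style data, the fitted weight lands near w ≈ 0.67:
# two-thirds market, one-third AI, and the blend beats both inputs.
```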
On ForecastBench FB-Market, the ensemble has Brier 0.079, slightly worse than AIA alone (0.0753). The regression basically says “put almost all weight on AIA; markets add little extra signal.”
On MarketLiquid, the picture flips:
- Market alone: 0.111
- AIA alone: 0.126
- Ensemble: 0.106
The learned weights are roughly 67% market / 33% AIA, and the ensemble’s Brier beats both components.
This is the key takeaway:
Even when AI agents are worse than the market in absolute terms, their forecasts contain independent information that improves performance when blended with the market.
In other words, AI agents are acting like a diversifying superforecaster in the crowd: not consistently better than the market, but different enough that you want their opinion in the mix.
PrediBench & FutureBench: agents actually betting
While AIA focuses on forecast quality and calibration, PrediBench and FutureBench look more like “live trading tournaments” for AI agents.
PrediBench: $1 on Polymarket
PresageLabs’ PrediBench literally lets AI models bet $1 on each of the top 10 trending Polymarket questions (excluding crypto). Events are chosen by weekly trading volume and must resolve within two months.
Agents:
- Run under a shared smolagents framework.
- Use `web_search` and `visit_webpage` tools; the best performers visit many pages, mimicking deep human research.
- Output a probability, rationale, confidence, and position for each market.
Results so far:
- Most models are not yet profitable, but about half beat a “market baseline” that just uses the current price as its probability and always bets with the majority outcome.
- The strongest frontier models (e.g. Grok-4 in their write-up) show positive returns (~+6% over a 7-day holding period in early data).
- Better Brier scores correlate with higher placement on broader LLM leaderboards (LMSys), suggesting that general model quality transfers into forecasting power.
- Returns improve with more web pages visited, reinforcing that deeper research leads to better bets, just like for human analysts.
So PrediBench is early but shows that agentic LLMs can at least match or slightly beat simple market-following baselines when pointed at real-money markets.
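For intuition, here is what that market-following baseline’s PnL looks like under a simplified binary-option model (a sketch that ignores fees, slippage, and Polymarket’s actual order books):

```python
def market_baseline_pnl(price_yes: float, resolved_yes: bool, stake: float = 1.0) -> float:
    """PnL of the market-following baseline on one market, simplified:
    bet the stake on whichever side the market currently favors.
    Yes shares cost price_yes and pay $1 on a Yes resolution;
    No shares cost (1 - price_yes) and pay $1 on a No resolution."""
    bet_yes = price_yes >= 0.5                       # follow the majority
    cost = price_yes if bet_yes else 1.0 - price_yes
    shares = stake / cost
    won = bet_yes == resolved_yes
    return shares - stake if won else -stake

# Market says 80% Yes, so the baseline buys Yes:
print(market_baseline_pnl(0.80, True))   # +0.25 on a $1 stake
print(market_baseline_pnl(0.80, False))  # -1.00
```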
FutureBench: mixed news + prediction markets
FutureBench (the “Back to the Future” blogpost) constructs questions both from:
- News scraping agents, which read front pages of major outlets and generate time-bound forecasting questions (e.g., “Will the Fed cut rates by ≥0.25% by July 1st?”).
- Prediction market feeds from Polymarket and others, filtered to avoid overly trivial or noisy questions.
Agent setup: all models use a common smolagents pipeline plus Tavily search and a simple scraper; some runs also evaluate “offline” models with no tools, as a prior.
Initial findings:
- Agentic models with tools clearly beat bare LLMs that answer from frozen knowledge.
- Stronger foundation models (DeepSeek-V3, GPT-4.1, Claude) deliver more stable, better-calibrated forecasts.
- Different models adopt different research styles: some lean heavily on search snippets, others do extensive scraping and cross-checking.
FutureBench is less about trading and more about systematically benchmarking agent frameworks, tools, and models on forward-looking questions, many of which also exist as prediction markets.
FutureX: zooming out beyond prediction markets
FutureX is a large, live benchmark that pulls questions from 195 websites across politics, finance, sports, entertainment, government stats, and real-time data feeds.
The authors explicitly position it as a response to prediction-market–only benchmarks like ForecastBench and FutureBench, arguing that:
- Prediction-market questions tend to be relatively simple and sometimes subjective, and
- Over-reliance on them limits domain diversity.
Still, FutureX acknowledges those earlier works as important early steps and notes that prediction markets are an attractive source of contamination-free, automatically scored questions.
From our angle, FutureX reinforces a point: the forecasting agent stack (search → reasoning → calibration → evaluation) generalizes beyond markets. Prediction markets are just the most convenient testbed where “ground truth” arrives quickly and in a clean numeric form.
How to practically use AI agents with prediction markets
Putting these results together, a practical recipe for using AI agents in prediction markets looks like:
Treat the market as a strong prior, not gospel.
- Use the current price as your baseline probability.
- Have the agent do independent research and output its own probability and rationale.
Look for divergences where the agent is confident.
- If the market says 80% and an AIA-style agent says 55% with a well-argued explanation, that’s a candidate mispricing (one way to size such gaps is sketched below).
- Even when markets ultimately win on average (as in MarketLiquid), these divergences carry useful incremental information.
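One convenient way to rank such divergences – an illustrative helper, not something from the papers – is to measure the gap in log-odds, so disagreements near 0 or 1 are not understated:

```python
import math

def divergence_in_logodds(p_market: float, p_agent: float) -> float:
    """Disagreement in log-odds: a move from 0.90 to 0.97 counts as a
    bigger disagreement than one from 0.50 to 0.70, despite the smaller
    raw gap."""
    def logit(p: float) -> float:
        return math.log(p / (1 - p))
    return abs(logit(p_agent) - logit(p_market))

# Market at 80%, agent at 55%: a sizeable, research-worthy gap.
print(round(divergence_in_logodds(0.80, 0.55), 2))  # 1.19
```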
Ensemble instead of betting everything on AI.
- A simple convex combination of market price and AI probability already beats the market on hard datasets (roughly 2/3 market, 1/3 AIA in MarketLiquid).
- You can tune weights over time as you gather your own data.
Invest in research depth and calibration.
- Give agents real tools (search, scraping, structured data) and enough budget to use them.
- Use multi-agent ensembling and supervisor agents to stabilize forecasts.
- Apply calibration/extremization on top of raw probabilities to fix LLM hedging.
Be honest about limitations.
- These systems are still brittle on niche topics, adversarial events, and subtle base-rate issues.
- Live market impact, transaction costs, and adversaries reacting to AI-driven flows are not yet modeled in the benchmarks.
Where this is going
The early evidence suggests:
- On curated forecasting benchmarks, AIA-style agents can match human superforecasters.
- In real, liquid prediction markets, markets still have an edge, but AI forecasts are meaningfully diversifying.
- Lightweight agent benchmarks like PrediBench and FutureBench show that frontier LLMs already reach or beat simple market baselines, and that deeper research (more pages visited, more careful tool use) reliably improves performance.
The likely future isn’t “AI replaces markets,” but AI becomes another powerful participant whose views you definitely want in your ensemble. For anyone designing trading systems, forecast aggregation platforms, or institutional decision processes, the Bridgewater result is the core design pattern:
Use AI forecasters as a calibrated, agentic layer feeding into the wisdom of crowds – not a replacement for it.
Papers discussed
- Alur, R. et al. (2025) – AIA Forecaster: Technical Report, Bridgewater AIA Labs.
- Azam, C. & Roucher, A. (2025) – PresageLabs PrediBench: Making Agents Bet on Polymarket.
- Bianchi, F. et al. (2025) – Back to The Future: Evaluating AI Agents on Predicting Future Events (FutureBench).
- Liu, J. et al. (2025) – FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction.