
Prediction Market Accuracy Report 2026: Calibration by Platform

Last Updated: March 4, 2026

Prediction markets claim to be well-calibrated probability machines. Our growing dataset of resolved markets across Polymarket, Kalshi, and Metaculus allows us to test that claim directly. This report presents calibration data by probability bin, category, and platform — along with honest notes on where our sample sizes remain thin.

How Do We Measure Prediction Market Accuracy?

Calibration analysis compares predicted probabilities against actual outcome frequencies across a large sample of resolved markets. A well-calibrated source assigns probabilities that match reality: events priced at 70% should resolve positively about 70% of the time, and events at 30% should resolve positively about 30% of the time.

We use two primary metrics:

Calibration error measures the gap between predicted probability and observed frequency within each probability bin. A market that prices events at 80% but sees them occur only 65% of the time has a calibration error of 15 percentage points in that bin. Perfect calibration means zero gap across all bins.

Brier score captures both calibration and sharpness in a single number. It computes the mean squared difference between the predicted probability and the binary outcome (0 or 1). The scale runs from 0 (perfect) to 1 (maximally wrong). A naive forecaster who always predicts 50% scores 0.25 on balanced binary events — this is the baseline to beat.
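The Brier score described above is straightforward to compute. A minimal sketch (function name and inputs are illustrative, not the report's actual pipeline):

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probabilities (0-1)
    and binary outcomes (0 or 1). Lower is better; 0 is perfect."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must align")
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

# A naive forecaster who always predicts 50% scores 0.25 on balanced
# binary events -- the baseline to beat.
naive = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # 0.25
```

Note that a single confident miss is expensive: predicting 90% on an event that fails contributes (0.9 − 0)² = 0.81 to the mean, which is why sharp forecasters are only rewarded when they are also calibrated.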

| Metric | Formula | Range | Interpretation |
| --- | --- | --- | --- |
| Calibration error | \|predicted % - observed %\| per bin | 0-100% | Lower is better; 0 = perfect alignment |
| Brier score | Mean of (predicted - outcome)^2 | 0-1 | Lower is better; <0.25 = better than coin flip |
| Sharpness | Distribution of predicted probabilities | 0-100% | Higher concentration near 0% and 100% = more informative |

A forecaster can be well-calibrated but uninformative (predicting 50% for everything) or sharp but poorly calibrated (predicting 90% on events that happen 60% of the time). The best prediction markets are both calibrated and sharp.

For a deeper explanation of calibration concepts, see our calibration and forecasting guide.

What Does the Calibration Data Show?

Our analysis bins resolved markets by the contract price captured at a standardized time point before resolution. The following table reflects our current dataset across all tracked platforms.

| Probability Bin | Expected Win Rate | Observed Win Rate | Calibration Error | Sample Depth |
| --- | --- | --- | --- | --- |
| 90-100% | ~95% | ~94-96% | 0-1 pp | Strong |
| 80-89% | ~85% | ~82-85% | 0-3 pp | Strong |
| 70-79% | ~75% | ~73-76% | 0-2 pp | Strong |
| 60-69% | ~65% | ~63-66% | 0-2 pp | Moderate |
| 50-59% | ~55% | ~53-57% | 0-2 pp | Moderate |
| 40-49% | ~45% | ~43-47% | 0-2 pp | Growing |
| 30-39% | ~35% | ~33-37% | 0-2 pp | Growing |
| 20-29% | ~25% | ~23-27% | 0-2 pp | Growing |
| 10-19% | ~15% | ~13-18% | 0-3 pp | Limited |
| 0-9% | ~5% | ~4-7% | 0-2 pp | Limited |

The data shows prediction markets track close to the calibration diagonal. Calibration errors remain within a few percentage points across bins with adequate sample size. The slight overconfidence visible in the 80-89% range — observed resolution rates sitting 1-3 points below predicted — is consistent with the favorite-longshot bias documented in academic literature.

Bins below 30% and above 90% carry thinner samples because fewer markets trade at extreme probabilities for extended periods. As our resolution dataset grows, these bins will tighten.
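The binning procedure behind a table like the one above can be sketched as follows. This is a simplified illustration using fixed 10% bins, not the report's production methodology (which applies liquidity filters and standardized snapshot timing):

```python
def calibration_report(predictions, outcomes):
    """Bucket resolved markets into 10% probability bins and compare each
    bin's mean predicted probability to its observed resolution frequency."""
    bins = {}
    for p, o in zip(predictions, outcomes):
        idx = min(int(p * 10), 9)  # decile bin index 0-9; clamp p == 1.0 into top bin
        bins.setdefault(idx, []).append((p, o))
    report = {}
    for idx, rows in sorted(bins.items()):
        expected = sum(p for p, _ in rows) / len(rows)
        observed = sum(o for _, o in rows) / len(rows)
        report[idx] = {
            "expected": expected,                        # mean predicted probability
            "observed": observed,                        # fraction that resolved YES
            "error_pp": abs(expected - observed) * 100,  # calibration error in points
            "n": len(rows),                              # sample depth
        }
    return report
```

A well-calibrated dataset produces small `error_pp` values in every bin with adequate `n`; a bin where `observed` sits persistently below `expected` is the overconfidence signature discussed above.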

The Odds Reference dashboard tracks active market prices in real time. Calibration analysis updates as markets resolve.

How Does Accuracy Vary by Category?

Prediction market accuracy is not uniform across topics. Our dataset reveals meaningful differences by category, driven primarily by information availability, market liquidity, and event complexity.

Politics and elections. Political markets attract the deepest liquidity on both Polymarket and Kalshi. High-profile elections draw thousands of traders and produce strong calibration results. Our dataset shows political markets tend to be well-calibrated outside of the final 24 hours before resolution, when last-minute information can cause sharp price swings that do not always align with realized probabilities.

Economics and policy. Markets on Federal Reserve decisions, inflation prints, and GDP releases perform well when tied to specific, verifiable data releases. Resolution criteria are typically unambiguous (did the Fed cut or not?), which reduces post-resolution disputes and keeps calibration clean. Kalshi carries particularly deep liquidity in this category.

Crypto and technology. Crypto markets show higher volatility and wider calibration errors. This reflects both genuine uncertainty and the influence of speculative momentum on prices. Markets on token prices or protocol events attract crypto-native traders whose risk preferences may differ from the broader forecasting population.

Science and geopolitics. Metaculus dominates these categories. Long-range questions about AI milestones, pandemic risks, and geopolitical events benefit from Metaculus’s deep forecaster community, many of whom have domain expertise. However, these markets often have multi-year horizons, meaning calibration data accumulates slowly.

Sports. Sports prediction markets benefit from abundant statistical data and frequent resolution. Game-level markets tend to be well-calibrated, with most of the signal concentrated in closing prices. Our analysis of prediction markets vs. sports betting covers the structural differences in more detail.

How Does Accuracy Vary by Platform?

Each platform’s accuracy profile reflects its user base, incentive structure, and market design.

| Platform | Strengths | Calibration Profile | Key Factor |
| --- | --- | --- | --- |
| Polymarket | Politics, crypto, major global events | Strong on high-liquidity markets; weaker on thin markets | CLOB depth drives accuracy |
| Kalshi | Economics, policy, regulated US events | Consistent across categories; clean resolution criteria | Regulatory clarity reduces ambiguity |
| Metaculus | Science, AI, long-range geopolitics | Strong track record in academic evaluations | Reputation incentives, expert community |

Polymarket benefits from being the largest prediction market by volume. Deeper order books mean more information gets incorporated into prices. Our data shows Polymarket calibration improves markedly above roughly $100K in total market volume. Below that threshold, prices can be noisy.

Kalshi operates as a CFTC-regulated exchange with standardized contract specifications. The regulatory framework forces clear resolution criteria, which reduces one source of calibration error: ambiguous outcomes. Kalshi’s accuracy data is most robust on economic indicators where the resolution source (BLS data, Fed announcements) is unambiguous.

Metaculus presents a distinct case. No real money changes hands — forecasters earn reputation points. Despite this, Metaculus has produced calibration results that compete with real-money markets, particularly on scientific and long-range questions. The platform’s forecaster community includes researchers and domain experts who bring specialized knowledge that money alone does not attract.

Cross-platform consensus — where Polymarket, Kalshi, and Metaculus agree on a probability — tends to produce the strongest calibration. When platforms diverge, the disagreement itself carries information. Our platform comparison tracks these divergences in real time.
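The report does not prescribe an aggregation rule for cross-platform consensus; a minimal sketch using an unweighted mean, with the divergence spread as a rough disagreement signal (both functions are hypothetical illustrations):

```python
def consensus_probability(platform_probs):
    """Unweighted mean of available platform probabilities. A real
    aggregator might weight by liquidity or track record instead."""
    probs = [p for p in platform_probs.values() if p is not None]
    if not probs:
        raise ValueError("no platform prices available")
    return sum(probs) / len(probs)

def divergence(platform_probs):
    """Spread between the highest and lowest platform price, a simple
    measure of cross-platform disagreement."""
    probs = [p for p in platform_probs.values() if p is not None]
    return max(probs) - min(probs)

prices = {"polymarket": 0.62, "kalshi": 0.60, "metaculus": 0.64}
consensus = consensus_probability(prices)  # 0.62
spread = divergence(prices)                # 0.04
```

When the spread is small, the platforms corroborate each other; when it widens, the disagreement itself is worth investigating before trusting any single price.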

What Are the Limitations of This Report?

This analysis has constraints that affect how strongly you should weight the findings.

Sample size is growing, not final. Many markets listed on our tracked platforms have not yet resolved. Long-range markets on 2027 or 2028 events will not contribute calibration data for years. Current results are weighted toward short-duration markets that resolve within weeks or months.

Resolution timing matters. We snapshot prices at standardized intervals before resolution, but the choice of snapshot time affects calibration results. A market that sits at 80% for three months but drops to 50% in the final hour looks very different depending on which price you use. We use a pre-resolution snapshot that balances stability against recency.

Platform differences complicate comparison. Polymarket and Kalshi trade real money in different regulatory environments with different user bases. Metaculus uses no money at all. Comparing calibration across these platforms is informative but not apples-to-apples. Incentive structures, market microstructure, and trader demographics all influence results.

Thin markets distort statistics. Markets with fewer than a dozen active traders can produce extreme prices that do not reflect genuine probability estimates. We apply minimum liquidity filters, but the threshold is a judgment call. Our methodology page details these filters.

We are still building resolution data. This report will update as our dataset grows. Current numbers should be read as early indicators, not definitive measurements. The academic literature on prediction market calibration — drawing on decades of data from the Iowa Electronic Markets and other sources — provides additional context that our dataset alone cannot yet match.

Key Takeaways

  • Prediction markets are well-calibrated on liquid events. Across our dataset, calibration errors stay within 1-3 percentage points for bins with adequate sample size.
  • Accuracy correlates with liquidity, not platform identity. High-volume markets on Polymarket, Kalshi, or Metaculus all show strong calibration. Thin markets on any platform show wider errors.
  • Category matters. Political and economic markets calibrate tightly. Crypto markets show more noise. Long-range scientific questions accumulate calibration data slowly but show promising early results on Metaculus.
  • Cross-platform consensus is the strongest signal. When multiple platforms agree on a probability, the combined forecast tends to outperform any single source.
  • This is a living report. Our dataset grows with every resolved market. Updated calibration data is available on the Odds Reference dashboard.

Frequently Asked Questions

Which prediction market is most accurate?
On high-liquidity events, Polymarket and Kalshi produce similar calibration results. Metaculus excels on long-range scientific and geopolitical questions where its forecaster community has deep expertise. Accuracy correlates more with market liquidity than with any specific platform.
How does OddsReference measure prediction market accuracy?
We compare predicted probabilities at various time points against actual outcomes using calibration analysis. A well-calibrated market prices events at 70% that resolve positively about 70% of the time. We also compute Brier scores for individual markets.
What is a Brier score?
The Brier score measures the mean squared error between predicted probabilities and actual outcomes. It ranges from 0 (perfect) to 1 (worst). A Brier score below 0.25 indicates better-than-random forecasting. Lower is better.
Are prediction markets more accurate than polls?
Multiple studies show prediction markets outperform polls on election outcomes, particularly close to the event. Markets aggregate diverse information sources and update continuously, while polls capture a snapshot of current opinion.