Appendix D: Data Sources
Access to high-quality data is the foundation of every analysis in this book. This appendix catalogs the data sources referenced throughout the chapters, provides guidance on accessing each one, and discusses the strengths and limitations of each source. URLs and API details are current as of early 2026, but readers should verify endpoints against official documentation as platforms evolve.
D.1 Prediction Market Data
Polymarket
URL: https://polymarket.com
API Base: https://clob.polymarket.com
Description: Polymarket is the largest real-money prediction market by volume, built on the Polygon blockchain. It uses a Central Limit Order Book (CLOB) model where traders submit limit and market orders. Markets cover politics, economics, sports, technology, and current events.
Data available:
- Current market prices and order books (real-time)
- Historical price data with configurable intervals (minutely to daily)
- Trade-level data (individual transactions)
- Market metadata (question text, resolution criteria, close dates)
- Volume and open interest
Access: Public REST API for read access; no API key required for market data. Authentication required for placing orders. Rate limits apply.
Strengths: Highest liquidity among prediction markets; real-money incentives produce well-calibrated prices; broad topic coverage. Limitations: Markets are denominated in USDC on the Polygon blockchain, which requires familiarity with cryptocurrency wallets and stablecoins; availability varies by jurisdiction; historical data before 2022 is limited.
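A minimal sketch of assembling a price-history request against the CLOB API. The endpoint and parameter names (`prices-history`, `market`, `startTs`, `endTs`, `fidelity`) reflect the public documentation at the time of writing and should be verified before production use:

```python
import urllib.parse

CLOB_BASE = "https://clob.polymarket.com"

def price_history_url(token_id: str, start_ts: int, end_ts: int,
                      fidelity_min: int = 60) -> str:
    """Build a request URL for Polymarket CLOB price history.

    Parameter names follow the public CLOB docs as of this writing;
    verify them against current documentation.
    """
    params = {
        "market": token_id,        # outcome token ID, not the market slug
        "startTs": start_ts,       # Unix seconds
        "endTs": end_ts,
        "fidelity": fidelity_min,  # candle interval in minutes
    }
    return f"{CLOB_BASE}/prices-history?{urllib.parse.urlencode(params)}"

# Fetching is then a plain GET, e.g. with the requests library:
#   resp = requests.get(price_history_url(token_id, start, end))
```

The URL builder is separated from the HTTP call so it can be unit-tested without network access.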
Kalshi
URL: https://kalshi.com
API Base: https://trading-api.kalshi.com/trade-api/v2
Description: Kalshi is the first CFTC-regulated prediction market exchange in the United States. It offers event contracts on economics, politics, weather, and other topics. Markets are structured as binary contracts that pay $1 if the event occurs.
Data available:
- Market prices (bid, ask, last trade)
- Historical candlestick data (OHLC at various intervals)
- Trade history and settlement data
- Event metadata and resolution sources
- Open interest and volume
Access: Public API for market data; authenticated API for trading. Requires account creation and email/password authentication for full API access.
Strengths: Regulatory compliance provides legal clarity in the US; structured market categories; clean API design. Limitations: Lower liquidity than Polymarket on many topics; US-only access; some market categories are restricted.
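Because each contract pays 100 cents if the event occurs, converting a Kalshi quote into an implied probability is a one-line calculation. A sketch using the bid-ask midpoint:

```python
def implied_probability(yes_bid: int, yes_ask: int) -> float:
    """Midpoint implied probability for a binary contract quoted in cents.

    A YES contract pays $1 (100 cents) if the event occurs, so a cent
    price divided by 100 is an implied probability. The bid-ask midpoint
    is a common point estimate; wide spreads deserve more care.
    """
    if not (0 < yes_bid <= yes_ask < 100):
        raise ValueError("expected cent prices with 0 < bid <= ask < 100")
    return (yes_bid + yes_ask) / 2 / 100

print(implied_probability(42, 46))  # 0.44
```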
Metaculus
URL: https://www.metaculus.com
API Base: https://www.metaculus.com/api2
Description: Metaculus is a forecasting platform that uses reputation-weighted community predictions rather than financial trading. Forecasters submit probability estimates and earn track-record scores. It is widely used in the effective altruism and rationalist communities.
Data available:
- Community prediction aggregates (median, quartiles)
- Individual question metadata (resolution criteria, categories)
- Prediction time series (how community forecasts evolved)
- Resolution history for closed questions
- Forecaster performance leaderboards
Access: Public API for reading question data and community predictions; no API key required. Account required for submitting forecasts.
Strengths: Long track record (since 2015); strong calibration culture; questions on science, technology, and existential risk that other platforms rarely cover; resolution criteria are well-specified. Limitations: No financial incentives (reputation only); smaller forecaster pool than real-money markets; API documentation is community-maintained.
Manifold Markets
URL: https://manifold.markets
API Base: https://api.manifold.markets/v0
Description: Manifold Markets is a play-money prediction market platform that uses an automated market maker (CPMM). Anyone can create a market on any topic. The platform is notable for its openness and the diversity of markets available.
Data available:
- Market prices and probabilities (real-time)
- Complete bet history for every market
- User profiles and portfolio data
- Market creator and resolution data
- Comments and discussion threads
Access: Generous public API with no authentication required for reading. API key available for programmatic betting. No rate limit issues for reasonable usage.
Strengths: Extremely broad topic coverage; anyone can create markets; fully open data; excellent API; active community. Limitations: Play-money reduces incentive alignment; lower forecasting accuracy than real-money markets on average; market quality varies widely due to open creation.
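The probability a CPMM market displays follows directly from its share pools. A simplified sketch for a plain constant-product maker; Manifold's production CPMM adds a weighting parameter on top of this, so treat it as illustrative rather than their exact formula:

```python
def cpmm_prob(yes_pool: float, no_pool: float) -> float:
    """Implied P(YES) for a simple constant-product market maker.

    With YES and NO share pools y and n kept at constant product y * n,
    the marginal price of a YES share (and hence the displayed
    probability) is n / (y + n): a large YES pool means YES is cheap.
    Manifold's production CPMM includes an extra weighting parameter;
    see their documentation for the exact formula.
    """
    return no_pool / (yes_pool + no_pool)

print(cpmm_prob(100.0, 300.0))  # 0.75
```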
Historical Datasets
Iowa Electronic Markets (IEM): The longest-running prediction market, operated by the University of Iowa since 1988. Historical data on US elections and other events is available at https://iemweb.biz.uiowa.edu/. Excellent for academic research on market accuracy.
PredictIt (archived): PredictIt operated from 2014 to 2023 under a CFTC no-action letter. Historical data is available through third-party archives and academic datasets. Useful for studying market microstructure with small position limits.
Intrade (archived): Intrade operated from 2001 to 2013 and was the dominant prediction market of its era. Academic papers contain substantial analyzed data from this platform.
D.2 Polling and Survey Data
FiveThirtyEight
URL: https://projects.fivethirtyeight.com/polls/
Description: FiveThirtyEight (now part of ABC News) maintains a comprehensive polling database covering US elections, presidential approval, and generic ballot tracking. Their data includes pollster ratings and methodological assessments.
Data available: Individual poll results, pollster grades, aggregated polling averages, historical forecast model outputs. Data files are available in CSV format on their GitHub repository.
Usage in prediction markets: Polling data is the single most important input for political prediction market models. Comparing polling aggregates to market prices reveals when markets diverge from polling consensus, which can indicate either market inefficiency or the market pricing in information beyond polls.
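One common way to make that comparison concrete is to convert a polling margin into a win probability under a normal error model and set it against the market price. A sketch in which the error standard deviation is an assumption you must calibrate yourself:

```python
import math

def poll_implied_win_prob(margin_pts: float, sd_pts: float = 4.0) -> float:
    """Rough win probability from a polling lead, assuming the final
    margin is normally distributed around the polling average.

    sd_pts (combined polling + fundamentals error, in points) is an
    assumption; historical errors suggest values of a few points, but
    it should be estimated from data, not taken from this default.
    """
    z = margin_pts / sd_pts
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A 2-point lead with a 4-point error sd is far from a lock:
p = poll_implied_win_prob(2.0)
print(round(p, 3))  # ~0.691
```

If the market prices the same candidate well above or below this figure, either the model's error assumption is off or the market is incorporating non-polling information.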
RealClearPolitics
URL: https://www.realclearpolitics.com/epolls/
Description: RealClearPolitics maintains polling averages for US elections using a simple averaging methodology. Their averages are widely cited and provide a useful baseline.
Data available: Current and historical polling averages, individual poll results, betting odds aggregation. Data is primarily accessed through web scraping, as no formal API exists.
Survey of Professional Forecasters (SPF)
URL: https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/survey-of-professional-forecasters
Description: The Federal Reserve Bank of Philadelphia conducts quarterly surveys of professional economic forecasters. The SPF is the oldest quarterly survey of macroeconomic forecasts in the United States, dating to 1968.
Data available: Point forecasts and probability distributions for GDP, inflation, unemployment, and other macroeconomic variables. Individual forecaster responses are anonymized.
Usage in prediction markets: SPF data provides a benchmark for comparing market-implied economic forecasts against expert consensus. Divergences between SPF medians and prediction market prices on economic questions can signal trading opportunities.
Michigan Consumer Sentiment Survey
URL: https://data.sca.isr.umich.edu/
Description: Monthly survey of consumer confidence and economic expectations. Useful as a feature in models predicting economic prediction market outcomes.
D.3 Economic Data
Federal Reserve Economic Data (FRED)
URL: https://fred.stlouisfed.org/
API: https://api.stlouisfed.org/fred/ (free API key required)
Description: FRED is the most comprehensive source of US economic data, maintained by the Federal Reserve Bank of St. Louis. It aggregates data from dozens of government agencies and international organizations.
Key series for prediction market analysis:
- GDP and GDP growth (GDPC1, A191RL1Q225SBEA)
- Unemployment rate (UNRATE)
- Consumer Price Index / Inflation (CPIAUCSL)
- Federal funds rate (FEDFUNDS, DFEDTARU)
- Treasury yields (DGS10, DGS2, T10Y2Y for spread)
- S&P 500 (SP500)
- Housing starts (HOUST)
- Initial jobless claims (ICSA)
Access: Free API key with generous rate limits. Python access is available via the fredapi package (pip install fredapi):

```python
from fredapi import Fred

# Requires a free API key from the FRED website
fred = Fred(api_key="your_key_here")

# Returns a pandas Series of real GDP, indexed by observation date
gdp = fred.get_series("GDPC1")
```
Bureau of Labor Statistics (BLS)
URL: https://www.bls.gov/data/
API: https://api.bls.gov/publicAPI/v2/timeseries/data/
Description: Primary source for employment, inflation, and wage data. Particularly important for prediction markets on jobs reports and CPI releases.
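The v2 endpoint accepts a JSON POST body listing series IDs and a year range. A sketch of assembling that body; the field names follow the public BLS docs and should be verified, and the series IDs shown (headline CPI-U, unemployment rate) are standard BLS identifiers:

```python
import json

def bls_payload(series_ids, start_year, end_year, api_key=None):
    """JSON body for a BLS v2 timeseries POST request.

    Field names follow the public BLS API documentation as of this
    writing. E.g. CUUR0000SA0 is headline CPI-U and LNS14000000 the
    unemployment rate.
    """
    body = {
        "seriesid": list(series_ids),
        "startyear": str(start_year),
        "endyear": str(end_year),
    }
    if api_key:
        body["registrationkey"] = api_key  # registered keys get higher limits
    return json.dumps(body)

# POST with Content-Type: application/json, e.g.:
#   requests.post("https://api.bls.gov/publicAPI/v2/timeseries/data/",
#                 data=bls_payload(["CUUR0000SA0"], 2023, 2025),
#                 headers={"Content-Type": "application/json"})
```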
Bureau of Economic Analysis (BEA)
URL: https://www.bea.gov/data
API: https://apps.bea.gov/api/
Description: Source for GDP, personal income, trade balance, and national accounts data. Free API key required.
World Bank Open Data
URL: https://data.worldbank.org/
API: https://api.worldbank.org/v2/
Description: Global development indicators covering 200+ countries. Useful for international prediction market questions on economic development, health outcomes, and demographic trends.
International Monetary Fund (IMF)
URL: https://www.imf.org/en/Data
Description: World Economic Outlook databases, Balance of Payments statistics, and International Financial Statistics. Provides cross-country macroeconomic data and forecasts that serve as baselines for prediction market models on international economic questions.
D.4 News and Media Data
News APIs
NewsAPI (https://newsapi.org/): Aggregates articles from 150,000+ sources. Free tier provides 100 requests/day. Useful for building news-sentiment features for prediction models.
GDELT Project (https://www.gdeltproject.org/): Monitors news media worldwide in real time, providing event databases, tone analysis, and network graphs. The full dataset is freely available on Google BigQuery.
Event Registry (https://eventregistry.org/): Structured news event database with entity extraction and sentiment analysis. Free tier available; premium tiers for higher volume.
RSS Feeds
Major news outlets provide RSS feeds that can be monitored for event-relevant headlines:
- Reuters: https://www.reuters.com/arc/outboundfeeds/
- Associated Press: https://www.ap.org/
- Government press releases (White House, federal agencies)
RSS feeds are particularly useful for building real-time alert systems that flag news relevant to open prediction market positions.
Social Media Data
X (Twitter) API: Provides access to tweets, user profiles, and trends. The API has undergone significant changes; current access tiers include Free (limited), Basic, and Pro. Useful for tracking public sentiment and breaking news that affects prediction markets.
Reddit API: Access to posts and comments across subreddits. Useful for tracking community sentiment on topics covered by prediction markets. Rate-limited but free for reasonable usage.
Google Trends (https://trends.google.com/): Provides normalized search interest data. The pytrends Python package enables programmatic access. Search volume spikes often correlate with prediction market activity.
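Trends reports interest on a 0-100 scale normalized to the peak within the requested window, which is why values from separately requested windows are not directly comparable. A sketch of that normalization applied to a raw series:

```python
def trends_normalize(series):
    """Rescale a raw interest series to Google Trends' 0-100 convention,
    where 100 marks peak interest within the requested window.

    This approximates what Trends does server-side; it is the reason
    two separately requested windows share no common scale.
    """
    peak = max(series)
    if peak == 0:
        return [0 for _ in series]
    return [round(100 * v / peak) for v in series]

print(trends_normalize([5, 20, 10]))  # [25, 100, 50]
```

When stitching together long histories from multiple requests, overlap the windows and rescale on the shared segment.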
D.5 Academic Datasets
Iowa Electronic Markets Historical Data
The IEM provides downloadable datasets covering all markets run since 1988. This is the gold standard for academic research on prediction market accuracy, particularly for US presidential elections. Data includes trade prices, volumes, and contract specifications.
URL: https://iemweb.biz.uiowa.edu/pricehistory/
Good Judgment Project Data
The Good Judgment Project (GJP), funded by IARPA, produced extensive data on individual forecaster accuracy from 2011 to 2015. Key findings about "superforecasters" are documented in Philip Tetlock's research. Some data is available through academic publications and data-sharing agreements.
Forecast Evaluation Benchmarks
Replication Data for "Prediction Market Accuracy in the Long Run" (2019): Contains historical prediction market and polling data for US elections spanning several decades.
Forecasting benchmarks: Machine learning benchmarks for evaluating automated forecasting systems. Available through the AI forecasting research community.
Research Paper Datasets
Many academic papers on prediction markets publish their data as supplementary materials. Key repositories:
- Harvard Dataverse (https://dataverse.harvard.edu/): Search for "prediction markets" to find replication datasets.
- Zenodo (https://zenodo.org/): Open-access research data repository.
- SSRN (https://www.ssrn.com/): Working papers often include data links.
D.6 Alternative Data
Weather Data
NOAA Climate Data Online (https://www.ncdc.noaa.gov/cdo-web/): Historical weather observations and forecasts. Relevant for weather-related prediction markets (temperature records, hurricane landfalls, snowfall totals).
Open-Meteo (https://open-meteo.com/): Free weather API with historical and forecast data. No API key required for moderate usage.
Satellite and Geospatial Data
NASA Earthdata (https://earthdata.nasa.gov/): Satellite imagery and derived datasets. Applications include monitoring crop conditions (relevant to agricultural commodity markets), night-light intensity (economic activity proxy), and environmental indicators.
Sentinel Hub (https://www.sentinel-hub.com/): European Space Agency satellite data with API access.
Social Media Sentiment
Aggregated sentiment indicators derived from social media can serve as features in prediction models:
- Stocktwits sentiment for financial markets
- Reddit comment sentiment using NLP models
- Twitter/X conversation volume and tone
Pre-built sentiment datasets are available from providers like Quandl (now part of Nasdaq Data Link) and various academic research groups.
Web Traffic and Digital Signals
Similarweb (https://www.similarweb.com/): Website traffic estimates. Can proxy for platform adoption and market interest.
Archive.org Wayback Machine (https://web.archive.org/): Historical snapshots of websites. Useful for verifying historical claims and resolution criteria.
Wikipedia page view statistics (https://pageviews.wmcloud.org/): Public attention proxy. Spikes in page views for topic-relevant articles often correlate with prediction market volume.
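Programmatic access to page views goes through the Wikimedia Analytics REST API. A sketch of building the per-article daily request URL; the path layout follows the public documentation and should be verified before relying on it:

```python
import urllib.parse

def pageviews_url(article, start, end, project="en.wikipedia"):
    """Wikimedia REST API URL for daily per-article page views.

    Path structure follows the public Wikimedia Analytics docs as of
    this writing; start/end are YYYYMMDD strings.
    """
    title = urllib.parse.quote(article.replace(" ", "_"), safe="")
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/all-agents/{title}/daily/{start}/{end}"
    )

print(pageviews_url("Prediction market", "20250101", "20250131"))
```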
D.7 Data Quality Checklist
Before incorporating any data source into a prediction market analysis, evaluate it against the following criteria. This checklist is used throughout the book whenever a new data source is introduced.
Reliability
- Source authority: Is the data produced by a credible institution (government agency, established research organization, regulated exchange)?
- Methodology transparency: Is the data collection methodology documented and reproducible?
- Historical accuracy: Has the data source been accurate in the past? Are there known biases or systematic errors?
- Error handling: How does the source handle corrections and revisions? (Economic data, for example, is frequently revised.)
- Consistency: Does the data maintain consistent definitions and formats over time, or do methodological changes create breaks in the series?
Freshness
- Update frequency: How often is the data updated (real-time, daily, weekly, monthly, quarterly)?
- Publication lag: How much delay exists between the measurement period and data availability?
- Relevance to market timing: Is the data available early enough to inform trading decisions before markets price it in? Stale data has limited trading value.
- Revision schedule: Is the initially published data preliminary, subject to later revision? If so, how large are typical revisions?
Coverage
- Temporal coverage: How far back does the historical data extend? Longer histories enable more robust backtesting.
- Geographic coverage: Does the data cover the relevant regions for the prediction market questions you are analyzing?
- Topic coverage: Does the data source cover the specific variables needed, or are proxies required?
- Completeness: Are there missing observations, gaps, or periods of unavailability? How should these be handled?
Accessibility
- API availability: Is programmatic access available, or is manual download required?
- Rate limits: What are the query limits, and are they sufficient for your use case?
- Cost: Is the data free, freemium, or paid? What is the cost at the volume you need?
- Terms of use: Do the terms of service permit the intended use (research, personal trading, commercial application)?
- Format: Is the data available in machine-readable formats (JSON, CSV, Parquet), or does it require parsing from PDF or HTML?
Relevance to Prediction Markets
- Predictive value: Has this data source been shown (in academic literature or your own analysis) to have predictive power for the outcomes you are modeling?
- Redundancy: Does this data source provide information beyond what is already incorporated in market prices? Adding a data source that markets already fully price in provides no edge.
- Timeliness advantage: Can you access and process this data faster than other market participants? Speed advantages, even minutes, can matter in active markets.
- Signal-to-noise ratio: How much useful signal does the data contain relative to noise? High-noise sources require more sophisticated processing and larger sample sizes.
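The predictive-value and redundancy criteria can be tested directly on resolved questions: score forecasts with and without the candidate source and compare. A minimal sketch using the Brier score; the forecasts and outcomes below are purely illustrative:

```python
def brier(probs, outcomes):
    """Mean Brier score: lower is better; a constant 0.5 forecast
    scores 0.25."""
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical resolved questions: market-only forecasts vs forecasts
# blended with the candidate data source.
outcomes        = [1, 0, 1, 1, 0]
market_only     = [0.70, 0.40, 0.60, 0.80, 0.30]
with_new_source = [0.80, 0.30, 0.70, 0.85, 0.20]

improvement = brier(market_only, outcomes) - brier(with_new_source, outcomes)
print(f"Brier improvement: {improvement:.4f}")  # positive => source adds value
```

In practice, use a holdout of questions the feature was not tuned on, and treat small improvements on small samples as noise.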
Practical Assessment Template
For each data source you consider using, document the following:
| Criterion | Assessment | Notes |
|---|---|---|
| Source authority | High / Medium / Low | |
| Update frequency | Real-time / Daily / Weekly / Monthly | |
| Publication lag | Minutes / Hours / Days / Weeks | |
| Historical depth | Years available | |
| API quality | Excellent / Good / Adequate / Poor | |
| Cost | Free / Freemium / Paid | |
| Predictive value (tested) | Yes / No / Unknown | |
| Unique information | Yes / Partial / No |
This structured evaluation prevents the common mistake of incorporating data simply because it is available, rather than because it genuinely improves forecasting accuracy. The most successful prediction market strategies are built on a small number of high-quality, genuinely informative data sources rather than a large number of marginally useful ones.
Summary. This appendix provides a comprehensive catalog of data sources relevant to prediction market analysis. Prediction market data itself (Section D.1) forms the core dataset. Polling and survey data (Section D.2) and economic data (Section D.3) provide the fundamental features for political and economic market models. News and social media data (Section D.4) enable real-time signal processing. Academic datasets (Section D.5) support rigorous backtesting and benchmarking. Alternative data sources (Section D.6) offer potential edges for advanced practitioners. Finally, the data quality checklist (Section D.7) provides a systematic framework for evaluating any data source before committing to its use in a model or strategy.