Further Reading
Books
Web Scraping and Data Collection
- Ryan Mitchell, Web Scraping with Python, 3rd edition (O'Reilly, 2024). The definitive guide to web scraping in Python. Covers BeautifulSoup, Selenium, and Scrapy in depth. The chapters on handling JavaScript-heavy sites and dealing with CAPTCHAs are particularly relevant for scraping modern prediction market platforms.
- Dimitrios Kouzis-Loukas, Learning Scrapy, 2nd edition (Packt, 2022). A comprehensive guide to the Scrapy framework, which is built for large-scale web scraping projects. If you need to scrape thousands of pages across multiple prediction market sites, Scrapy's built-in concurrency, middleware system, and item pipelines are superior to hand-rolled solutions.
Database Design and Data Engineering
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017). Essential reading for anyone building data systems. The chapters on data models, storage engines, and stream processing provide the theoretical foundation for everything we built in this chapter. Though it covers systems far larger than a typical prediction market database, the principles apply at every scale.
- Joe Reis and Matt Housley, Fundamentals of Data Engineering (O'Reilly, 2022). A practical overview of the data engineering landscape: ingestion, transformation, storage, serving, and orchestration. Covers the modern data stack and helps you understand where prediction market data pipelines fit within broader data architecture patterns.
- Maxime Beauchemin, The Rise of the Data Engineer (blog post, 2017). A foundational essay that articulates the role of data engineering as distinct from data science. Available at: https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/
API Design and HTTP
- Leonard Richardson and Mike Amundsen, RESTful Web APIs (O'Reilly, 2013). While primarily aimed at API designers, this book gives data consumers a deep understanding of how REST APIs work, why they are designed the way they are, and how to interact with them effectively.
Academic Papers
Prediction Market Data and Analysis
- Wolfers, J. and Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107-126. A foundational survey of prediction markets that discusses data sources and methodological considerations for analyzing market data. Provides context for why certain types of data (prices, volumes, resolution outcomes) are important.
- Berg, J., Nelson, F., and Rietz, T. (2008). "Prediction Market Accuracy in the Long Run." International Journal of Forecasting, 24(2), 285-300. Uses historical data from the Iowa Electronic Markets to assess long-run prediction accuracy. A model for how to collect, clean, and analyze historical prediction market data for calibration studies.
- Atanasov, P., Rescober, P., Stone, E., et al. (2017). "Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls." Management Science, 63(3), 691-706. Compares prediction market and prediction poll data from the IARPA ACE tournament. Discusses the data collection methodologies used in large-scale forecasting tournaments.
Data Quality and Validation
- Pipino, L., Lee, Y., and Wang, R. (2002). "Data Quality Assessment." Communications of the ACM, 45(4), 211-218. Introduces a framework for assessing data quality across multiple dimensions: accuracy, completeness, consistency, timeliness, and accessibility. These dimensions map directly to the validation checks discussed in Section 20.9.
- Rahm, E. and Do, H. (2000). "Data Cleaning: Problems and Current Approaches." IEEE Data Engineering Bulletin, 23(4), 3-13. A taxonomy of data quality problems (missing values, duplicates, conflicting data, wrong formats) and cleaning approaches. Still relevant despite its age.
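The quality dimensions Pipino et al. identify translate naturally into concrete checks on collected market records. The sketch below is a minimal illustration, not the book's implementation; the field names (`market_id`, `price`, `fetched_at`) and the one-hour staleness threshold are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

def check_record(record, max_age=timedelta(hours=1)):
    """Run simple quality checks on one market record and return the
    list of failed dimensions (an empty list means the record passed)."""
    failures = []
    # Completeness: every expected field must be present and non-None.
    required = ("market_id", "price", "fetched_at")
    if any(record.get(f) is None for f in required):
        failures.append("completeness")
    # Consistency: a prediction market price is a probability in [0, 1].
    price = record.get("price")
    if price is not None and not (0.0 <= price <= 1.0):
        failures.append("consistency")
    # Timeliness: flag stale snapshots instead of silently storing them.
    fetched = record.get("fetched_at")
    if fetched is not None and datetime.now(timezone.utc) - fetched > max_age:
        failures.append("timeliness")
    return failures
```

A record with `price` set to 1.7, for example, fails only the consistency check, so a pipeline can decide per-dimension whether to drop, repair, or quarantine it.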
Ethics and Legal Considerations
- Krotov, V. and Silva, L. (2018). "Legality and Ethics of Web Scraping." Communications of the Association for Information Systems, 47, 539-563. A thorough analysis of the legal and ethical dimensions of web scraping. Covers copyright law, the Computer Fraud and Abuse Act, terms of service, and the European Database Directive.
Online Resources
API Documentation
- Polymarket API Documentation: https://docs.polymarket.com/. Official documentation for both the Gamma and CLOB APIs. Includes endpoint references, authentication guides, and rate limit specifications.
- Kalshi API Documentation: https://trading-api.readme.io/. Comprehensive documentation for Kalshi's trading API. Includes market data endpoints, authentication flow, and webhook documentation.
- Metaculus API: https://www.metaculus.com/api/. Less formally documented than the other platforms' APIs, but the endpoints can be explored through the browser's developer tools or community documentation.
- Manifold Markets API Documentation: https://docs.manifold.markets/api. Well-documented REST API with generous rate limits. Includes examples for common use cases.
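Most of these APIs page their results rather than return everything in one response. The sketch below shows the cursor-pagination loop in a platform-agnostic form, with the HTTP call injected as a plain function; the parameter names (`limit`, `cursor`) and the assumption that each item carries an `"id"` field are illustrative, not taken from any one platform's documentation.

```python
def fetch_all(fetch_page, limit=500):
    """Drain a cursor-paginated endpoint into a single list.

    fetch_page(limit, cursor) must return a list of dicts, each with an
    "id" key; a page shorter than `limit` signals the end of the data.
    """
    items, cursor = [], None
    while True:
        page = fetch_page(limit=limit, cursor=cursor)
        items.extend(page)
        if len(page) < limit:
            return items
        cursor = page[-1]["id"]  # resume after the last item seen

# A stub standing in for a real HTTP call, to show the contract:
data = [{"id": str(i)} for i in range(7)]

def stub_page(limit, cursor):
    start = 0 if cursor is None else int(cursor) + 1
    return data[start:start + limit]

print(len(fetch_all(stub_page, limit=3)))  # all 7 items, in 3 requests
```

In real use, `fetch_page` would wrap a `requests.get` call against the platform's markets endpoint and translate its actual cursor parameter into this interface.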
Data Science Tools
- Requests Library Documentation: https://docs.python-requests.org/. Official documentation for the Python requests library. The "Advanced Usage" section covers sessions, prepared requests, and event hooks.
- httpx Documentation: https://www.python-httpx.org/. Documentation for httpx, a modern HTTP client supporting async/await and HTTP/2. A strong alternative to requests for high-performance data collection.
- BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. Complete guide to HTML parsing with BeautifulSoup. The sections on CSS selectors and navigating the parse tree are essential for web scraping.
- Selenium Documentation: https://www.selenium.dev/documentation/. Official Selenium documentation covering WebDriver API, browser management, and page interaction. Essential for scraping JavaScript-rendered pages.
- SQLAlchemy Documentation: https://docs.sqlalchemy.org/. Comprehensive documentation for SQLAlchemy, the Python SQL toolkit and ORM. The ORM tutorial and Core tutorial are particularly relevant.
- APScheduler Documentation: https://apscheduler.readthedocs.io/. Documentation for the Advanced Python Scheduler, used in Section 20.7 for scheduling data pipelines.
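Several of these libraries are used to build clients that respect platform rate limits, and the usual retry pattern is exponential backoff with jitter. Here is a minimal sketch of just the delay schedule, independent of any particular HTTP library; the function name and default values are our own choices, not from any of the documentation above.

```python
import random

def backoff_delays(retries=5, base=1.0, cap=60.0, jitter=False):
    """Yield wait times in seconds for successive retries: base * 2**n,
    capped at `cap`, with optional full jitter so that many clients
    retrying at once do not hammer the server in lockstep."""
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        yield random.uniform(0, delay) if jitter else delay

print(list(backoff_delays(retries=4)))  # [1.0, 2.0, 4.0, 8.0]
```

A client would `time.sleep()` on each yielded delay after a 429 or 5xx response, giving up once the generator is exhausted.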
Datasets and Data Repositories
- Iowa Electronic Markets Data: https://iemweb.biz.uiowa.edu/. Historical data from one of the oldest prediction markets, run by the University of Iowa since 1988. Particularly valuable for studying political prediction markets over a long time horizon.
- Good Judgment Project Data: Various datasets from the IARPA forecasting tournaments have been made available to researchers. Contact the Good Judgment team for access.
- Dune Analytics (for Polymarket on-chain data): https://dune.com/. Hosts community-maintained SQL queries for analyzing Polymarket data directly from the Polygon blockchain, alongside queries for many other blockchain protocols.
Tutorials and Guides
- Real Python: Web Scraping with Python: https://realpython.com/python-web-scraping-practical-introduction/. A practical introduction to web scraping that covers requests, BeautifulSoup, and common patterns.
- Real Python: Python's Requests Library: https://realpython.com/python-requests/. A thorough tutorial on making HTTP requests in Python, including sessions, authentication, and error handling.
- SQLite Tutorial: https://www.sqlitetutorial.net/. A comprehensive tutorial for SQLite, covering everything from basic queries to advanced features like window functions and common table expressions.
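The window-function material in that tutorial applies directly to stored price histories. The following self-contained sketch uses Python's built-in sqlite3 module to compute a three-row moving average over a toy prices table; the schema and the values are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (market_id TEXT, ts INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [("m1", 1, 0.40), ("m1", 2, 0.50), ("m1", 3, 0.60), ("m1", 4, 0.70)],
)

# Moving average over the current row and the two preceding rows,
# computed per market in timestamp order.
rows = conn.execute(
    """
    SELECT ts, price,
           AVG(price) OVER (
               PARTITION BY market_id ORDER BY ts
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS ma3
    FROM prices
    ORDER BY ts
    """
).fetchall()
for ts, price, ma3 in rows:
    print(ts, price, round(ma3, 3))
```

Window functions require SQLite 3.25 or later, which every currently supported Python release bundles.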
Related Chapters
- Chapter 6: Platform Landscape --- Provides background on the prediction market platforms whose APIs are covered in this chapter.
- Chapter 21: Feature Engineering for Prediction Markets --- Transforms the raw data collected here into features for machine learning models.
- Chapter 22: Machine Learning for Forecasting --- Uses the data infrastructure built here to train and evaluate forecasting models.
- Chapter 25: Backtesting Strategies --- Relies on the historical data pipeline from this chapter to evaluate trading strategies.