Key Takeaways
1. APIs First, Scraping Second
Always check for an API before resorting to web scraping. APIs are structured, documented, and reliable. Scraping is fragile, slower, and may violate terms of service. Use scraping only when APIs are incomplete or unavailable.
2. Respect Rate Limits --- They Protect Both Sides
Rate limiting is not an obstacle; it is an agreement. Implement proper rate limiting with exponential backoff and jitter. Exceeding rate limits risks getting your access revoked, and overloading a server harms every other user.
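As a minimal sketch (assuming a `requests.Session` and an API that signals throttling with HTTP 429), backoff with jitter looks like this:

```python
import random
import time

import requests


def get_with_backoff(session: requests.Session, url: str, max_retries: int = 5) -> requests.Response:
    """GET with exponential backoff and jitter on 429 and transient 5xx responses."""
    for attempt in range(max_retries):
        response = session.get(url, timeout=30)
        if response.status_code not in (429, 500, 502, 503, 504):
            response.raise_for_status()
            return response
        # Delay doubles each attempt (1s, 2s, 4s, ...) plus random jitter
        time.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```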
3. Normalize Everything to UTC
Different prediction market platforms report timestamps in different timezones, sometimes without explicit zone markers. Store all timestamps in UTC, convert only for display, and always use timezone-aware datetime objects.
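A short standard-library example: parse a naive timestamp whose zone you know from the platform's documentation (the zone below is only an illustration), attach that zone, and store the UTC result:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+


def to_utc(raw: str, documented_zone: str = "America/New_York") -> datetime:
    """Parse a naive ISO timestamp, attach its documented zone, and convert to UTC."""
    naive = datetime.fromisoformat(raw)                  # e.g. "2024-11-05 14:30:00"
    aware = naive.replace(tzinfo=ZoneInfo(documented_zone))
    return aware.astimezone(timezone.utc)


print(to_utc("2024-11-05 14:30:00"))  # 2024-11-05 19:30:00+00:00; store this, convert only for display
```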
4. The ETL Pattern Is Your Backbone
Separate your pipeline into Extract, Transform, and Load stages. This separation makes each component independently testable, replaceable, and debuggable. When a new platform appears, you only need to write a new extractor and transformer while reusing your existing loader.
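One way to sketch that separation in Python (the client, field names, and target table are all hypothetical):

```python
def extract(api_client):
    """Pull raw market records from one platform's API (platform-specific)."""
    return api_client.get("/markets")


def transform(raw_records):
    """Map platform-specific fields onto a common schema (platform-specific)."""
    return [
        {"market_id": r["id"], "question": r["question"], "price": float(r["yes_price"])}
        for r in raw_records
    ]


def load(db, rows):
    """Insert normalized rows; this stage is shared across all platforms."""
    db.executemany(
        "INSERT OR IGNORE INTO prices (market_id, question, price) "
        "VALUES (:market_id, :question, :price)",
        rows,
    )
```

Supporting a new platform then means writing its own `extract` and `transform` and plugging them into the same `load`.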
5. Schema Design Determines Query Capability
Separate slowly changing metadata (market questions, categories, resolution rules) from rapidly changing observations (prices, volumes). Use proper foreign keys, indexes on frequently queried columns, and unique constraints to prevent duplicates.
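A minimal SQLite sketch of that split, with illustrative table and column names:

```python
import sqlite3

conn = sqlite3.connect("markets.db")
conn.executescript("""
-- Slowly changing metadata: one row per market
CREATE TABLE IF NOT EXISTS markets (
    market_id        TEXT PRIMARY KEY,
    platform         TEXT NOT NULL,
    question         TEXT NOT NULL,
    category         TEXT,
    resolution_rules TEXT
);

-- Rapidly changing observations: many rows per market
CREATE TABLE IF NOT EXISTS price_observations (
    market_id   TEXT NOT NULL REFERENCES markets(market_id),
    observed_at TEXT NOT NULL,            -- UTC, ISO 8601
    yes_price   REAL,
    volume      REAL,
    UNIQUE (market_id, observed_at)       -- blocks duplicate loads
);

CREATE INDEX IF NOT EXISTS idx_obs_market_time
    ON price_observations (market_id, observed_at);
""")
```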
6. Data Quality Requires Active Monitoring
Data does not stay clean on its own. Implement automated validation checks --- price range verification, complement checks, timestamp validation, completeness monitoring --- and run them on every pipeline execution. Track quality metrics over time so you can detect degradation early.
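A few of those checks, sketched as a validator over row dicts (field names are illustrative, and `observed_at` is assumed to be a timezone-aware datetime):

```python
from datetime import datetime, timezone


def validate(rows):
    """Return a list of human-readable problems found in one batch of rows."""
    problems = []
    for row in rows:
        # Price range check: prediction market prices must lie in [0, 1]
        if not 0.0 <= row["yes_price"] <= 1.0:
            problems.append(f"{row['market_id']}: price out of range")
        # Complement check: YES and NO prices should sum to roughly 1
        if abs(row["yes_price"] + row["no_price"] - 1.0) > 0.05:
            problems.append(f"{row['market_id']}: yes/no prices do not sum to ~1")
        # Timestamp check: no observations from the future
        if row["observed_at"] > datetime.now(timezone.utc):
            problems.append(f"{row['market_id']}: timestamp in the future")
    return problems
```

Run the validator on every load and record the number of problems so the trend is visible over time.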
7. Alternative Data Provides Edge
Prediction market prices reflect publicly available information. By systematically collecting and analyzing alternative data --- news, economic indicators, polling data, weather forecasts --- you can identify information not yet priced in. The key is speed and systematic coverage.
8. Order Book Data Is Underappreciated
Beyond simple prices, order book data reveals market depth, liquidity concentration, and the cost of executing trades. Understanding the spread and depth at various price levels is essential for realistic backtesting and live trading.
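For instance, spread and near-touch depth can be computed from bid and ask levels (the book layout below is an assumption, not any particular platform's format):

```python
def book_stats(bids, asks, window=0.05):
    """Spread and depth near the touch from (price, size) levels.

    bids: best bid first (descending price); asks: best ask first (ascending price).
    """
    best_bid, best_ask = bids[0][0], asks[0][0]
    return {
        "spread": best_ask - best_bid,
        "bid_depth": sum(size for price, size in bids if price >= best_bid - window),
        "ask_depth": sum(size for price, size in asks if price <= best_ask + window),
    }
```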
9. Cross-Platform Data Is Valuable and Messy
The same event may be traded on Polymarket, Kalshi, and Manifold, each with different question wording, different price levels, and different liquidity profiles. Matching markets across platforms is imperfect but reveals arbitrage opportunities and calibration differences.
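As a rough starting point, question wording can be compared with the standard library; anything above a similarity threshold still deserves manual review before being treated as the same underlying event:

```python
from difflib import SequenceMatcher


def question_similarity(a: str, b: str) -> float:
    """Crude text similarity between two market questions, in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


print(question_similarity(
    "Will the Fed cut rates at the December meeting?",
    "Fed cuts interest rates in December?",
))
```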
10. Incremental Collection Scales Better Than Full Loads
Fetching all data from scratch on every pipeline run wastes resources and strains APIs. Implement high-water marks to track where you left off and fetch only new or updated data. Fall back to full loads only when the incremental state is lost.
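A minimal high-water-mark pattern, assuming the API accepts an `updated_after`-style filter (the parameter and field names are hypothetical):

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")


def load_high_water_mark():
    """Return the last-seen update timestamp, or None to force a full load."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get("last_updated")
    return None


def incremental_fetch(session, base_url):
    since = load_high_water_mark()
    params = {"updated_after": since} if since else {}   # full load when no state exists
    records = session.get(f"{base_url}/markets", params=params, timeout=30).json()
    if records:
        # ISO 8601 UTC strings sort chronologically, so max() gives the new mark
        STATE_FILE.write_text(json.dumps({"last_updated": max(r["updated_at"] for r in records)}))
    return records
```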
11. Error Handling Makes the Difference Between a Script and a System
A single failed API call or malformed record should not crash your entire pipeline. Implement per-record error handling, dead letter queues for failed records, and automatic retry logic. Log everything so you can diagnose problems after the fact.
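A per-record pattern with a simple on-disk dead letter file (the helper names are illustrative):

```python
import json
import logging

logger = logging.getLogger("pipeline")


def process_batch(records, transform, load_one, dead_letter_path="dead_letters.jsonl"):
    """Process records one by one; quarantine failures instead of crashing."""
    loaded, failed = 0, 0
    with open(dead_letter_path, "a") as dead_letters:
        for record in records:
            try:
                load_one(transform(record))
                loaded += 1
            except Exception as exc:
                failed += 1
                logger.exception("failed record %s", record.get("id"))
                dead_letters.write(json.dumps({"record": record, "error": str(exc)}) + "\n")
    logger.info("batch done: %d loaded, %d dead-lettered", loaded, failed)
    return loaded, failed
```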
12. Ethics Are Not Optional
Respect terms of service. Respect rate limits. Respect robots.txt. Anonymize personal data. Do not redistribute data you do not have rights to share. The prediction market community is small; reputation matters, and responsible behavior preserves access for everyone.
13. Pagination Is Not a Detail
Most APIs return only a subset of results per request. If you do not handle pagination properly --- whether offset-based or cursor-based --- you will silently miss data. Always verify that you have fetched all available pages.
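An offset-based loop as a sketch; cursor-based APIs replace the offset with the cursor returned by the previous page (the `limit` and `offset` parameter names are assumptions):

```python
def fetch_all(session, url, page_size=100):
    """Request pages until a short or empty page signals the end."""
    results, offset = [], 0
    while True:
        page = session.get(url, params={"limit": page_size, "offset": offset}, timeout=30).json()
        results.extend(page)
        if len(page) < page_size:   # last page reached
            break
        offset += page_size
    return results
```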
14. Session Objects and Connection Pooling Save Time
When making many requests to the same API, use a session object (e.g., requests.Session()) to reuse TCP connections. This eliminates the overhead of establishing new connections and can reduce latency by 50% or more for sequential requests.
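For example (the URL is a placeholder):

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "prediction-market-research/1.0"})

# Each request below reuses a pooled TCP connection instead of opening a new one.
for market_id in ("abc", "def", "ghi"):   # illustrative IDs
    response = session.get(f"https://api.example.com/markets/{market_id}", timeout=30)
    response.raise_for_status()
```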
15. Your Data Infrastructure Is a Product
The data pipeline you build is not a one-time script. It is infrastructure that must be maintained, monitored, and evolved. Design it as you would design software: with clean interfaces, error handling, logging, tests, and documentation. Future-you will thank present-you.
Summary Table
| Concept | Key Insight | Common Mistake |
|---|---|---|
| API Access | Use sessions, handle pagination, respect limits | Ignoring rate limits; missing pages of data |
| Web Scraping | Last resort; fragile by nature | Scraping when an API exists; ignoring robots.txt |
| ETL Pipelines | Separate concerns for maintainability | Monolithic scripts that mix extraction and loading |
| Database Design | Separate metadata from time-series data | Single flat table for everything |
| Data Quality | Automate validation; run on every load | Assuming data is clean; no quality checks |
| Timestamps | UTC everywhere, convert only for display | Naive datetimes; mixed timezones in one table |
| Alternative Data | Systematic collection of signals | Manual, ad-hoc data gathering |
| Ethics | Read ToS; respect limits; protect privacy | Aggressive scraping; redistributing private data |