Exercises
Section 20.2: Working with REST APIs
Exercise 1: Basic API Request
Write a Python function fetch_json(url, params=None) that makes a GET request to any URL, returns the JSON response as a dictionary, and raises an exception with a descriptive message (including the status code and response body) if the request fails.
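One possible shape for the solution, sketched with the requests library (the timeout parameter is an addition, not part of the exercise's signature):

```python
import requests

def fetch_json(url, params=None, timeout=10):
    """GET a URL and return the parsed JSON body as a dict."""
    response = requests.get(url, params=params, timeout=timeout)
    if not response.ok:
        # Include the status code and body so failures are easy to diagnose.
        raise RuntimeError(
            f"Request to {url} failed with status {response.status_code}: "
            f"{response.text[:500]}"
        )
    return response.json()
```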
Exercise 2: Pagination Handler
Implement a generic cursor-based pagination function fetch_all_cursor(base_url, headers, cursor_field="cursor", data_field="data") that fetches all pages from an API that uses cursor-based pagination. The function should accept the name of the cursor field in the response and the name of the data field, and it should return a single combined list.
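A minimal sketch of the pagination loop; the assumption that the cursor is sent back as a query parameter with the same name is an illustration, since real APIs vary:

```python
import requests

def fetch_all_cursor(base_url, headers, cursor_field="cursor", data_field="data"):
    items, cursor = [], None
    while True:
        params = {cursor_field: cursor} if cursor else {}
        resp = requests.get(base_url, headers=headers, params=params, timeout=10)
        resp.raise_for_status()
        payload = resp.json()
        items.extend(payload.get(data_field, []))
        cursor = payload.get(cursor_field)
        if not cursor:  # no further pages
            break
    return items
```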
Exercise 3: Rate Limiter with Sliding Window
The RateLimiter class in the chapter uses a simple token bucket approach. Implement a SlidingWindowRateLimiter class that tracks the timestamps of the last N requests and enforces a maximum of max_calls requests within any window_seconds time window. This is more accurate than the token bucket approach for bursty traffic patterns.
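A minimal sketch of the sliding-window idea: keep the timestamps of recent calls in a deque and sleep until the oldest one falls outside the window. The acquire method name is a choice, not a requirement of the exercise.

```python
import time
from collections import deque

class SlidingWindowRateLimiter:
    def __init__(self, max_calls, window_seconds):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.calls = deque()  # timestamps of recent calls

    def acquire(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Wait until the oldest call exits the window.
            time.sleep(max(self.window_seconds - (now - self.calls[0]), 0))
            self.calls.popleft()  # that call has now aged out
        self.calls.append(time.monotonic())
```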
Exercise 4: Retry with Jitter
Implement a retry_request function that retries failed HTTP requests using exponential backoff with full jitter. The function should:
- Accept a requests.Session, URL, and optional parameters
- Retry on status codes 429, 500, 502, 503, 504
- Use full jitter: delay = random.uniform(0, min(base_delay * 2^attempt, max_delay))
- Honor the Retry-After header when it is present on 429 responses
- Return the successful response or raise after max_retries attempts
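A sketch of the retry loop with full jitter. The parameter names (base_delay, max_delay, max_retries) are choices, and the sketch assumes a numeric Retry-After value:

```python
import random
import time
import requests

RETRYABLE = {429, 500, 502, 503, 504}

def retry_request(session, url, params=None, max_retries=5,
                  base_delay=1.0, max_delay=30.0):
    for attempt in range(max_retries):
        response = session.get(url, params=params, timeout=10)
        if response.status_code not in RETRYABLE:
            response.raise_for_status()  # non-retryable errors surface immediately
            return response
        retry_after = response.headers.get("Retry-After")
        if response.status_code == 429 and retry_after:
            delay = float(retry_after)  # honor the server's hint
        else:
            # Full jitter: uniform between 0 and the capped exponential delay.
            delay = random.uniform(0, min(base_delay * 2 ** attempt, max_delay))
        time.sleep(delay)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```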
Exercise 5: Async API Client
Rewrite the basic API client using httpx with async/await support. Create an AsyncAPIClient class with methods get, get_all_pages, and close. Demonstrate that it can fetch data from multiple endpoints concurrently using asyncio.gather.
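A skeleton to start from, assuming the httpx library; get_all_pages is omitted because its pagination scheme depends on the target API, and the /markets and /events paths are placeholders:

```python
import asyncio
import httpx

class AsyncAPIClient:
    def __init__(self, base_url, headers=None):
        self._client = httpx.AsyncClient(base_url=base_url, headers=headers)

    async def get(self, path, params=None):
        resp = await self._client.get(path, params=params)
        resp.raise_for_status()
        return resp.json()

    async def close(self):
        await self._client.aclose()

async def main():
    client = AsyncAPIClient("https://example.com/api")
    try:
        # Fetch two endpoints concurrently.
        markets, events = await asyncio.gather(
            client.get("/markets"), client.get("/events")
        )
    finally:
        await client.close()

# asyncio.run(main())
```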
Section 20.3-20.5: Platform APIs
Exercise 6: Polymarket Market Summary
Using the PolymarketGammaClient class, write a function get_top_markets(n=10) that fetches the top N markets by 24-hour volume and returns a formatted summary including: question, yes price, volume, and days until close. Handle the case where the close date is missing.
Exercise 7: Kalshi Event Series Analysis
Write a function that takes a Kalshi event series ticker and returns:
- Total number of events in the series
- Number of resolved events
- Average final settlement price
- Most common resolution outcome
Use the KalshiClient class from the chapter.
Exercise 8: Cross-Platform Price Comparison
Write a function compare_prices(topic) that searches for markets on the same topic across Manifold Markets and Metaculus, matches them by question similarity (using simple keyword overlap), and returns a table comparing their current probabilities. Discuss why prices might differ.
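A crude keyword-overlap measure (Jaccard similarity over lowercased word sets) is enough to get started; stopword handling and smarter matching are left to the exercise:

```python
def question_similarity(q1, q2):
    """Jaccard overlap between the word sets of two question strings."""
    words1, words2 = set(q1.lower().split()), set(q2.lower().split())
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)
```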
Exercise 9: Manifold Bet History Analyzer
Using the Manifold API, write a function that fetches all bets on a specific market and computes:
- Total number of bets
- Average bet size
- Percentage of YES vs NO bets
- Price impact of the largest bet (price before vs after)
- Time distribution of bets (histogram by hour of day)
Exercise 10: Metaculus Community Prediction Tracker
Write a function that fetches a Metaculus question and tracks how the community prediction has changed over time. Plot the community prediction as a time series alongside the eventual resolution (if resolved). Calculate the Brier score at each time point.
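For the scoring step, recall that the Brier score of a binary forecast at time t is the squared error between the forecast probability and the resolved outcome (1 for YES, 0 for NO):

```python
def brier_score(probability, outcome):
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (probability - outcome) ** 2

# Example: a 0.7 forecast on a question that resolved YES scores 0.09.
assert abs(brier_score(0.7, 1) - 0.09) < 1e-9
```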
Section 20.6: Web Scraping
Exercise 11: robots.txt Analyzer
Write a class RobotsAnalyzer that fetches and parses the robots.txt file for a given domain and provides methods to:
- List all disallowed paths for a specific user agent
- Check if a specific URL is allowed
- Report the specified crawl delay
- List all sitemaps referenced in the file
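A starting point built on the standard library's urllib.robotparser; note that listing disallowed paths per user agent is not exposed by RobotFileParser, so that method will require parsing the raw file yourself:

```python
from urllib.robotparser import RobotFileParser

class RobotsAnalyzer:
    def __init__(self, domain):
        self.robots_url = f"https://{domain}/robots.txt"
        self.parser = RobotFileParser(self.robots_url)
        self.parser.read()  # fetch and parse robots.txt

    def is_allowed(self, url, user_agent="*"):
        return self.parser.can_fetch(user_agent, url)

    def crawl_delay(self, user_agent="*"):
        return self.parser.crawl_delay(user_agent)  # None if unspecified

    def sitemaps(self):
        return self.parser.site_maps() or []
```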
Exercise 12: HTML Table Scraper
Write a generic function scrape_table(url, table_index=0) that fetches a web page, finds the nth HTML table, extracts headers and rows, and returns the data as a list of dictionaries. Test it on a publicly available data table.
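A sketch using requests and BeautifulSoup, assuming the table's first row holds the headers:

```python
import requests
from bs4 import BeautifulSoup

def scrape_table(url, table_index=0):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    table = soup.find_all("table")[table_index]
    rows = table.find_all("tr")
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            records.append(dict(zip(headers, cells)))
    return records
```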
Exercise 13: Multi-Page Scraper
Build a scraper class that handles websites with "next page" links. The class should:
- Follow pagination links automatically
- Collect items from each page using a configurable CSS selector
- Respect a configurable delay between requests
- Stop after a configurable maximum number of pages
- Return all collected items as a single list
Exercise 14: Link Graph Builder
Write a scraper that starts from a prediction market's homepage, follows internal links up to 2 levels deep, and builds a graph of the site's internal link structure. Store the results as a dictionary mapping each URL to its outgoing links. Visualize the link graph using networkx.
Exercise 15: Screenshot Capture
Using Selenium, write a function that navigates to a prediction market page, waits for all dynamic content to load, and captures a full-page screenshot. The function should handle cookie consent dialogs and popups that might overlay the content.
Section 20.7: Data Pipelines
Exercise 16: Incremental Extractor
Modify the ManifoldExtractor to support incremental extraction. The extractor should:
- Track the timestamp of the last successful extraction
- Only fetch markets that have been updated since the last run
- Store the high-water mark (latest timestamp) in a file or database
- Fall back to a full extraction if the high-water mark is missing
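A minimal high-water-mark store; the JSON file name, field name, and ISO-timestamp convention are arbitrary choices for the sketch:

```python
import json
from pathlib import Path

STATE_FILE = Path("extractor_state.json")

def load_high_water_mark():
    """Return the last extraction timestamp, or None to force a full run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text()).get("last_extracted_at")
    return None

def save_high_water_mark(timestamp_iso):
    STATE_FILE.write_text(json.dumps({"last_extracted_at": timestamp_iso}))
```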
Exercise 17: Dead Letter Queue
Implement a dead letter queue (DLQ) for your data pipeline. When a record fails to transform or load, instead of losing it, the pipeline should:
- Write the failed record to a separate "dead_letter" table or file
- Record the error message and timestamp
- Continue processing the remaining records
- Provide a method to inspect and retry failed records
Exercise 18: Pipeline Monitoring Dashboard
Create a simple pipeline monitoring system that:
- Logs each pipeline run's start time, end time, records processed, and errors
- Computes metrics: success rate, average records per run, average duration
- Detects anomalies: runs with zero records, runs that take more than 3x the average duration
- Generates a text-based summary report
Exercise 19: Parallel Extraction
Modify the DataPipeline class to extract data from multiple platforms in parallel using concurrent.futures.ThreadPoolExecutor. Measure the speedup compared to sequential extraction. Ensure thread safety in the loader.
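The core fan-out pattern looks roughly like this; the extractor objects and their extract() method stand in for the chapter's classes:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_all(extractors, max_workers=4):
    """extractors: dict mapping platform name -> object with an extract() method."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ex.extract): name for name, ex in extractors.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc  # record failures without stopping the others
    return results
```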
Exercise 20: Data Versioning
Implement a data versioning system that:
- Assigns a version number to each pipeline run
- Allows you to query the state of the database as of any previous run
- Supports rollback to a previous version if the latest run contains bad data
- Uses a valid_from / valid_to pattern (Type 2 slowly changing dimension)
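One way to express the Type 2 write path with SQLite; the market_versions table and its columns are illustrative, not the chapter's schema:

```python
import sqlite3

def upsert_market_version(conn, market_id, price, run_version):
    # Close out the currently valid row, if any...
    conn.execute(
        """UPDATE market_versions
           SET valid_to = ?
           WHERE market_id = ? AND valid_to IS NULL""",
        (run_version, market_id),
    )
    # ...then insert the new version with open-ended validity.
    conn.execute(
        """INSERT INTO market_versions (market_id, price, valid_from, valid_to)
           VALUES (?, ?, ?, NULL)""",
        (market_id, price, run_version),
    )
```

Querying "as of version V" then filters on valid_from <= V and (valid_to IS NULL or valid_to > V).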
Section 20.8: Database Design
Exercise 21: Schema Migration
Your original schema does not include an outcomes table for multi-outcome markets. Design a schema migration that:
- Creates an outcomes table with columns: id, market_id, name, token_id, current_price
- Creates an outcome_price_snapshots table for per-outcome price history
- Migrates existing binary market data into the new structure
- Implements the migration as a Python function using SQLAlchemy
Exercise 22: Query Optimization
Write the following queries against the schema from Section 20.8 and explain the indexes needed for each:
1. Find the 10 markets with the largest price movement in the last 24 hours
2. Find all markets that resolved differently from their price 7 days before resolution
3. Compute the daily average volume across all markets for a given platform
4. Find pairs of markets with correlated price movements
Exercise 23: Database Backup and Restore
Write functions backup_database(db_path, backup_dir) and restore_database(backup_path, db_path) for SQLite that:
- Create timestamped backup copies of the database file
- Verify backup integrity using SQLite's integrity check
- Support both full backups and incremental (WAL-based) backups
- Automatically clean up backups older than 30 days
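A sketch of the full-backup path using sqlite3's online backup API and PRAGMA integrity_check; incremental (WAL-based) backups and the 30-day cleanup are left to the exercise:

```python
import sqlite3
from datetime import datetime
from pathlib import Path

def backup_database(db_path, backup_dir):
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = backup_dir / f"backup_{stamp}.db"
    source = sqlite3.connect(db_path)
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)  # consistent copy even while the source is in use
    finally:
        source.close()
        dest.close()
    # Verify the copy before trusting it.
    conn = sqlite3.connect(backup_path)
    status = conn.execute("PRAGMA integrity_check").fetchone()[0]
    conn.close()
    if status != "ok":
        raise RuntimeError(f"Backup failed integrity check: {status}")
    return backup_path
```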
Section 20.9: Data Quality
Exercise 24: Anomaly Detector
Implement a PriceAnomalyDetector class that detects anomalous price movements using:
- Z-score method: Flag prices more than 3 standard deviations from the rolling mean
- IQR method: Flag prices outside 1.5 * IQR from the quartiles
- Return rate method: Flag single-period returns exceeding a threshold
Test your detector on synthetic price data with injected anomalies.
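The z-score check, sketched with pandas rolling windows; the IQR and return-rate methods follow the same shape:

```python
import pandas as pd

def zscore_anomalies(prices, window=24, threshold=3.0):
    """prices: pd.Series indexed by timestamp. Returns a boolean anomaly mask."""
    rolling_mean = prices.rolling(window).mean()
    rolling_std = prices.rolling(window).std()
    z = (prices - rolling_mean) / rolling_std
    return z.abs() > threshold
```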
Exercise 25: Data Completeness Monitor
Write a function check_completeness(db_path, expected_markets, expected_frequency_hours) that:
- Checks whether all expected markets have recent price data
- Identifies gaps in the time series (missing snapshots)
- Computes a completeness score (percentage of expected data points present)
- Returns a report listing markets with the worst completeness
Exercise 26: Cross-Platform Consistency Check
Write a function that compares prices for the same event across multiple platforms and flags significant discrepancies. The function should:
- Match markets across platforms using question text similarity
- Compare current prices and flag differences > 5%
- Compute the average absolute price difference across all matched pairs
- Generate a report sorted by discrepancy size
Section 20.10: Alternative Data
Exercise 27: News Sentiment Pipeline
Build a pipeline that:
- Fetches recent news articles related to a prediction market topic
- Extracts basic sentiment using keyword matching (count of positive and negative words)
- Computes an hourly sentiment score
- Stores the results in your SQLite database
- Correlates sentiment with concurrent price movements
Exercise 28: Economic Indicator Dashboard
Using the FRED API, build a dashboard that:
- Fetches the latest values for 10 key economic indicators
- Computes month-over-month changes
- Identifies which indicators have deviated most from their historical averages
- Suggests which Kalshi markets might be affected by each indicator's movement
Exercise 29: Social Media Monitor
Design (but do not implement with live APIs) a social media monitoring system for prediction markets. Write the class interfaces and data models for:
- A TwitterMonitor that tracks mentions and sentiment for market-related keywords
- A RedditMonitor that tracks relevant subreddit activity
- A SentimentAggregator that combines signals from multiple sources
Also discuss the rate limits, API costs, and privacy considerations for each component.
Section 20.11: Ethics and Legal
Exercise 30: Data Ethics Audit
Conduct a data ethics audit of your entire data collection pipeline. Create a document that:
1. Lists every data source you access
2. For each source, identifies: the terms of service, whether your use complies, what personal data (if any) is collected, how it is stored and protected, and the retention policy
3. Assesses the GDPR implications if any data subjects are EU residents
4. Proposes mitigation measures for any identified risks
5. Creates a data retention and deletion policy
Write this as a Python function that generates the audit template as a structured dictionary, pre-populated with the platforms and data types from this chapter.
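A possible skeleton for the generated template; the default source list and field names are starting points to adapt, not a fixed specification:

```python
def build_ethics_audit_template(sources=("Polymarket", "Kalshi", "Manifold",
                                          "Metaculus", "FRED")):
    """Return a dict keyed by data source, with blank audit fields to fill in."""
    return {
        source: {
            "terms_of_service_url": None,
            "use_complies": None,
            "personal_data_collected": [],
            "storage_and_protection": None,
            "retention_policy": None,
            "gdpr_notes": None,
            "mitigations": [],
        }
        for source in sources
    }
```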