Quiz

Question 1

What HTTP status code indicates that you have exceeded an API's rate limit?

A) 400 Bad Request
B) 403 Forbidden
C) 429 Too Many Requests
D) 503 Service Unavailable

Answer: C. The 429 status code specifically indicates rate limiting. 400 means your request is malformed, 403 means you lack permission, and 503 means the server is temporarily unavailable for reasons unrelated to your request rate.
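
For illustration, a minimal client-side sketch of handling a 429: if the server sends a Retry-After header, the client waits that long before retrying. This assumes the seconds form of the header (it can also be an HTTP date); the 5-second fallback and the retry budget are illustrative choices.

```python
import time
import requests

def get_honoring_retry_after(url: str, max_attempts: int = 3) -> requests.Response:
    """Fetch a URL, waiting out 429 responses before retrying."""
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            return response
        # Assumes the seconds form of Retry-After; it can also be an HTTP date.
        wait = int(response.headers.get("Retry-After", "5"))
        time.sleep(wait)
    return response  # Still rate limited after max_attempts tries
```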


Question 2

Which of the following is the FIRST thing you should check before scraping a prediction market website?

A) Whether the data is available through a public API
B) The website's CSS class names
C) How many pages need to be scraped
D) Whether Selenium is needed for rendering

Answer: A. APIs should always be your first choice. They are more reliable, faster, better structured, and designed for programmatic access. Only resort to scraping when the API is unavailable or incomplete.


Question 3

In the ETL pipeline pattern, what does the "Transform" step do?

A) Moves data from one server to another
B) Converts raw data into a normalized, validated format
C) Encrypts data for secure storage
D) Backs up the database before loading

Answer: B. The Transform step takes raw data from the Extract step and cleans, validates, normalizes, and reshapes it into a consistent format suitable for loading into the target storage system.
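
A sketch of what a Transform step might look like for a single record; the raw field names (id, question, yes_price) are hypothetical, and the point is the clean/validate/reshape sequence:

```python
from datetime import datetime, timezone
from typing import Optional

def transform_market(raw: dict) -> Optional[dict]:
    """Normalize one raw market record; return None if validation fails."""
    try:
        yes_price = float(raw["yes_price"])  # Clean: coerce to float
    except (KeyError, TypeError, ValueError):
        return None  # Drop records with missing or malformed prices
    if not 0.0 <= yes_price <= 1.0:
        return None  # Validate: binary market prices must lie in [0, 1]
    return {  # Reshape into the normalized schema the loader expects
        "external_id": str(raw["id"]),
        "question": raw.get("question", "").strip(),
        "yes_price": yes_price,
        "collected_at": datetime.now(timezone.utc).isoformat(),  # UTC, per Question 7
    }
```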


Question 4

When implementing exponential backoff for retrying API requests, what is the purpose of adding "jitter" to the delay?

A) To make the code more random and harder to reverse-engineer
B) To prevent multiple clients from retrying at exactly the same time (thundering herd)
C) To confuse the server's rate limiter
D) To ensure the delay is always at least 1 second

Answer: B. Jitter (random variation in the delay) prevents the thundering herd problem, where many clients that were rate-limited simultaneously all retry at the same exponential backoff intervals, creating synchronized bursts of requests.
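
A common implementation is sketched below with "full jitter": the sleep is drawn uniformly between zero and the exponentially growing base delay, so clients that were limited at the same moment spread their retries out. The set of retried status codes is an illustrative choice.

```python
import random
import time
import requests

def request_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    """Retry on 429/5xx with exponential backoff plus full jitter."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code not in (429, 500, 502, 503, 504):
            return response
        # Base delay doubles each attempt: 1s, 2s, 4s, 8s, ...
        base_delay = 2 ** attempt
        # Full jitter: sleep a uniform random fraction of the base delay,
        # so simultaneously rate-limited clients desynchronize.
        time.sleep(random.uniform(0, base_delay))
    raise RuntimeError(f"Giving up on {url} after {max_retries} retries")
```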


Question 5

What is the primary advantage of using requests.Session() over individual requests.get() calls when making many requests to the same API?

A) Sessions automatically handle rate limiting
B) Sessions reuse TCP connections, reducing latency and overhead
C) Sessions encrypt all requests
D) Sessions automatically parse JSON responses

Answer: B. A Session object maintains a connection pool and reuses TCP connections across requests to the same host. This avoids the overhead of establishing new connections for each request, significantly improving performance.
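
A minimal sketch of the pattern (the URL is a placeholder):

```python
import requests

# One Session keeps a pool of TCP (and TLS) connections to each host.
session = requests.Session()
session.headers.update({"User-Agent": "PredictionMarketResearch/1.0"})

for page in range(10):
    # Each call reuses a pooled connection instead of opening a new one.
    response = session.get(
        "https://api.example.com/markets", params={"page": page}, timeout=10
    )
    response.raise_for_status()
```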


Question 6

In the Polymarket ecosystem, what is the difference between the Gamma API and the CLOB API?

A) The Gamma API is for trading and the CLOB API is for data
B) The Gamma API provides market metadata and the CLOB API provides order book and trading data
C) The Gamma API is free and the CLOB API requires payment
D) There is no difference; they are the same API

Answer: B. The Gamma API provides market metadata, current prices, and event information. The CLOB (Central Limit Order Book) API provides access to the order book, trade history, and trading functionality.


Question 7

What is the recommended approach for storing timestamps from multiple prediction market platforms?

A) Store each timestamp in the platform's local timezone
B) Store all timestamps as Unix epoch integers
C) Store all timestamps in UTC and convert only for display
D) Store timestamps as strings in the original format

Answer: C. Storing all timestamps in UTC avoids timezone confusion and daylight saving time issues, and it simplifies comparisons across platforms. Conversion to local time should happen only at the display layer.
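
For example, using only the standard library (zoneinfo requires Python 3.9+; the display timezone is an arbitrary choice):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Record collection time in UTC and store the ISO-8601 string.
collected_at = datetime.now(timezone.utc)
stored = collected_at.isoformat()  # e.g. '2024-01-15T14:30:00+00:00'

# Convert to a local timezone only at the display layer.
local = datetime.fromisoformat(stored).astimezone(ZoneInfo("America/New_York"))
print(local.strftime("%Y-%m-%d %H:%M %Z"))
```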


Question 8

Which Python library would you use to scrape a prediction market page that renders its content using JavaScript?

A) BeautifulSoup alone
B) The requests library
C) Selenium with a headless browser
D) urllib

Answer: C. BeautifulSoup and requests only see the initial HTML source, which for JavaScript-rendered pages may be empty. Selenium drives a real browser (Chrome, Firefox) that executes JavaScript, producing the fully rendered page.
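
A minimal sketch with headless Chrome. The URL is a placeholder, the --headless=new flag applies to recent Chrome versions, and a real scraper would usually also wait explicitly (e.g., with WebDriverWait) for the elements it needs to appear:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # Run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/markets")
    html = driver.page_source  # Fully rendered HTML, after JavaScript has run
finally:
    driver.quit()  # Always release the browser process
```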


Question 9

In a SQLite database schema for prediction markets, why is the price_snapshots table separate from the markets table?

A) SQLite cannot handle tables with more than 10 columns
B) Prices change over time, so each market has many price observations (one-to-many relationship)
C) It is required by SQLAlchemy
D) Separating them makes the database smaller

Answer: B. Each market has many price snapshots taken at different times. This one-to-many relationship is properly modeled with separate tables. The markets table stores static or slowly-changing information, while price_snapshots stores the time series of observations.
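
A sketch of the two tables using the sqlite3 standard library; the column names are illustrative rather than a definitive schema:

```python
import sqlite3

conn = sqlite3.connect("markets.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS markets (
    id          INTEGER PRIMARY KEY,
    platform_id TEXT NOT NULL,
    external_id TEXT NOT NULL,
    question    TEXT
);
CREATE TABLE IF NOT EXISTS price_snapshots (
    id          INTEGER PRIMARY KEY,
    market_id   INTEGER NOT NULL REFERENCES markets(id),  -- many snapshots per market
    yes_price   REAL,
    no_price    REAL,
    snapshot_at TEXT NOT NULL  -- UTC timestamp (see Question 7)
);
""")
conn.commit()
```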


Question 10

What does the robots.txt file on a website specify?

A) The programming languages the website is built with
B) Which pages crawlers and scrapers are allowed or disallowed from accessing
C) The website's security certificates
D) The maximum number of concurrent connections allowed

Answer: B. The robots.txt file (placed at the root of a website) communicates to web crawlers which parts of the site should not be accessed. It may also specify a crawl delay and sitemap locations.
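
Python's standard library can parse and apply these rules; a minimal sketch (placeholder URLs):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "PredictionMarketResearch/1.0"
if rp.can_fetch(user_agent, "https://example.com/markets/some-market"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt")

delay = rp.crawl_delay(user_agent)  # None if robots.txt sets no Crawl-delay
```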


Question 11

When the Kalshi API returns a response with a "cursor" field in addition to the data, what does this indicate?

A) The response contains an error that needs to be corrected
B) There are more pages of data available, and the cursor should be passed in the next request
C) The data has been cached and the cursor identifies the cache entry
D) The cursor is the unique ID of the request for auditing purposes

Answer: B. Cursor-based pagination returns an opaque cursor token with each page. Passing this cursor in the next request retrieves the subsequent page. This is more efficient than offset-based pagination for large datasets.
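
A generic sketch of the loop. The parameter and field names (limit, cursor, markets) follow the pattern described above, but the exact names vary by API:

```python
import requests

def fetch_all_markets(base_url: str) -> list[dict]:
    """Walk cursor-paginated results until the API stops returning a cursor."""
    session = requests.Session()
    markets: list[dict] = []
    cursor = None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor  # Resume where the last page ended
        payload = session.get(base_url, params=params, timeout=10).json()
        markets.extend(payload.get("markets", []))
        cursor = payload.get("cursor")
        if not cursor:  # A missing or empty cursor means the last page was reached
            return markets
```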


Question 12

What is the primary risk of NOT implementing rate limiting in your API client?

A) Your code will run too slowly
B) The API provider may block your access or IP address
C) Your database will run out of storage
D) The data will be less accurate

Answer: B. Exceeding rate limits results in 429 responses, and if violations persist, the API provider may ban your API key or IP address. Rate limiting protects both the server and your continued access to the data.
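
A minimal proactive limiter, sketched as a fixed minimum interval between calls; the 2-requests-per-second figure is illustrative:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, requests_per_second: float):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0

    def wait(self) -> None:
        """Sleep just long enough to respect the configured rate."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()

limiter = RateLimiter(requests_per_second=2)
# Call limiter.wait() immediately before every API request.
```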


Question 13

In the data quality validation framework, what does the "price complement check" verify?

A) That the yes price is always between 0 and 1
B) That the yes price plus the no price approximately equals 1.0
C) That the price has not changed by more than 50% in one day
D) That the price matches the previous day's closing price

Answer: B. For binary markets, the yes price and no price should sum to approximately 1.0 (minus any spread). If yes_price + no_price significantly deviates from 1.0, it indicates data corruption, an error in the transformation, or a misunderstanding of the price format.
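
A sketch of the check. The 0.05 tolerance is an illustrative number meant to absorb normal bid-ask spread, not a universal threshold:

```python
def check_price_complement(yes_price: float, no_price: float,
                           tolerance: float = 0.05) -> bool:
    """Return True when yes + no is within tolerance of 1.0."""
    return abs((yes_price + no_price) - 1.0) <= tolerance

assert check_price_complement(0.62, 0.39)      # Sum 1.01: within tolerance
assert not check_price_complement(0.62, 0.62)  # Sum 1.24: flagged as suspect
```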


Question 14

Which of the following is a valid reason to prefer the Manifold Markets API for initial data collection practice?

A) It is the only platform with an API
B) It uses play money, has generous rate limits, and offers a well-documented open API
C) It has the most accurate predictions
D) It does not require any programming knowledge

Answer: B. Manifold Markets uses play money (reducing financial risk), has a well-documented API with generous rate limits, and its data is openly available. This makes it ideal for learning data collection techniques before working with real-money platforms.


Question 15

What is the purpose of the User-Agent header in HTTP requests?

A) To authenticate the request with an API key
B) To encrypt the request payload
C) To identify the client software making the request to the server
D) To specify the desired response format

Answer: C. The User-Agent header tells the server what client is making the request. For scraping, it is good practice to set a descriptive User-Agent (e.g., "PredictionMarketResearch/1.0") rather than disguising your scraper as a web browser.
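
For example (the URL and contact address are placeholders; including a contact lets site operators reach you about your crawler):

```python
import requests

headers = {"User-Agent": "PredictionMarketResearch/1.0 (contact@example.com)"}
response = requests.get("https://example.com/markets", headers=headers, timeout=10)
```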


Question 16

In the context of prediction market data, what does "data freshness" refer to?

A) Whether the data has been validated
B) How recently the data was collected relative to the current time
C) Whether the data is stored in a modern database
D) Whether the data has been backed up

Answer: B. Data freshness measures the age of the data: the time elapsed between when it was generated or collected and the current moment. Different use cases (real-time trading vs. backtesting research) have different freshness requirements.


Question 17

What is the GDPR's "right to erasure" and how does it apply to prediction market data collection?

A) It requires all data to be encrypted at rest
B) It gives EU residents the right to have their personal data deleted upon request
C) It requires data to be erased after 30 days
D) It only applies to medical data

Answer: B. The right to erasure (also called the "right to be forgotten") means that EU residents can request deletion of their personal data. If you collect user-level data (usernames, prediction histories) from platforms, you must be prepared to delete it upon request.


Question 18

When designing an ETL pipeline with scheduling, what does "misfire grace time" mean?

A) The maximum duration a single pipeline run is allowed to take
B) The amount of time after a scheduled run was missed during which it can still be executed
C) The minimum time between consecutive pipeline runs
D) The time allowed for the database to commit a transaction

Answer: B. If the scheduler was unable to run a job at its scheduled time (e.g., because the system was busy or down), the misfire grace time defines how late the job can still be executed. Beyond this window, the missed run is skipped.
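
Sketched here with APScheduler, assuming that is the scheduler in use; the 15-minute interval and 300-second grace period are illustrative:

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def run_pipeline():
    print("Running ETL pipeline...")

scheduler = BlockingScheduler()
# If a run is missed (system busy or down), it may still fire up to
# 300 seconds late; beyond that, the missed run is skipped.
scheduler.add_job(run_pipeline, "interval", minutes=15, misfire_grace_time=300)
scheduler.start()
```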


Question 19

What information does an order book provide that a simple price quote does not?

A) The current best price
B) The depth of supply and demand at multiple price levels
C) The historical price trend
D) The market's resolution status

Answer: B. An order book shows all outstanding buy (bid) and sell (ask) orders at various price levels. This reveals the depth of the market (how much volume is available at each price), which is critical for estimating execution costs and market liquidity.
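
A sketch of reading depth from a book, assuming a simple {"bids": [(price, size), ...], "asks": [(price, size), ...]} structure; real API responses vary:

```python
def depth_within(book: dict, side: str, limit_price: float) -> float:
    """Total volume on one side of the book at or better than a limit price."""
    if side == "bids":   # Buyers at or above the limit price
        levels = [(p, s) for p, s in book["bids"] if p >= limit_price]
    else:                # Sellers at or below the limit price
        levels = [(p, s) for p, s in book["asks"] if p <= limit_price]
    return sum(size for _, size in levels)

book = {"bids": [(0.61, 500), (0.60, 1200)], "asks": [(0.63, 400), (0.65, 900)]}
print(depth_within(book, "asks", 0.63))  # 400 contracts available up to 0.63
```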


Question 20

Why might prices for the same event differ between Polymarket and Kalshi?

A) The events are defined differently
B) One platform always has more accurate prices
C) Differences in user base, liquidity, fees, regulations, and the inability to arbitrage across platforms
D) Price differences are impossible in efficient markets

Answer: C. Price differences arise from market fragmentation: different user bases, different levels of liquidity, different fee structures, different regulatory environments (Kalshi is CFTC-regulated, Polymarket is not), and barriers to cross-platform arbitrage. These frictions prevent prices from perfectly converging.


Question 21

In the SQLAlchemy ORM model, what does the UniqueConstraint("platform_id", "external_id") on the markets table enforce?

A) That platform_id must be unique across all records
B) That external_id must be unique across all records
C) That the combination of platform_id and external_id must be unique (no duplicate market per platform)
D) That both fields must contain the same value

Answer: C. A composite unique constraint ensures that no two rows can have the same combination of platform_id and external_id. This prevents inserting the same market from the same platform twice while allowing different platforms to use the same external_id.
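
In SQLAlchemy, the constraint is declared in __table_args__; a minimal sketch of the model (column names beyond the two constrained fields are illustrative):

```python
from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Market(Base):
    __tablename__ = "markets"
    # No two rows may share the same (platform_id, external_id) pair.
    __table_args__ = (UniqueConstraint("platform_id", "external_id"),)

    id = Column(Integer, primary_key=True)
    platform_id = Column(String, nullable=False)
    external_id = Column(String, nullable=False)
    question = Column(String)
```

Attempting to insert a second row with the same (platform_id, external_id) pair raises an IntegrityError, which a loader can catch to deduplicate.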


Question 22

What is the primary advantage of using FRED (Federal Reserve Economic Data) as an alternative data source for prediction markets?

A) It provides real-time stock prices
B) It offers free, comprehensive, well-structured historical economic data with an API
C) It includes prediction market prices
D) It provides weather forecast data

Answer: B. FRED provides free access to over 800,000 economic data series from dozens of sources, with a well-documented API. Economic indicators like unemployment, inflation, and GDP are directly relevant to prediction markets on economic and policy events.
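
A minimal sketch against FRED's REST endpoint, assuming the standard JSON response shape with an observations list; you need a free API key, and YOUR_FRED_API_KEY is a placeholder:

```python
import requests

# Fetch the monthly U.S. unemployment rate (series UNRATE) from FRED.
params = {
    "series_id": "UNRATE",
    "api_key": "YOUR_FRED_API_KEY",
    "file_type": "json",
}
url = "https://api.stlouisfed.org/fred/series/observations"
data = requests.get(url, params=params, timeout=10).json()
for obs in data["observations"][-3:]:  # The three most recent observations
    print(obs["date"], obs["value"])
```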


Question 23

When implementing a data pipeline, why is it important to separate the extractor, transformer, and loader into distinct components?

A) It makes the code longer and more impressive
B) Each component can be tested, maintained, and replaced independently; different sources need different extractors but can share the same loader
C) Python requires classes to be small
D) It is required by SQLite

Answer: B. Separation of concerns makes each component independently testable and replaceable. You can add a new data source by writing only a new extractor and transformer while reusing the existing loader. If you switch databases, you only change the loader.
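
A stub-level sketch of the structure; the class names and record fields are illustrative:

```python
class ManifoldExtractor:
    """Source-specific: pulls raw records from one API."""
    def extract(self) -> list[dict]:
        return [{"id": "m1", "probability": 0.62}]  # Stubbed raw record

class ManifoldTransformer:
    """Source-specific: maps raw fields onto the shared normalized schema."""
    def transform(self, raw: list[dict]) -> list[dict]:
        return [{"external_id": r["id"], "yes_price": r["probability"]} for r in raw]

class SQLiteLoader:
    """Shared by every source: writes normalized records to storage."""
    def load(self, records: list[dict]) -> int:
        return len(records)  # Stub: a real loader would upsert into SQLite

def run_pipeline(extractor, transformer, loader) -> int:
    """Swapping any one component is a one-argument change."""
    return loader.load(transformer.transform(extractor.extract()))

print(run_pipeline(ManifoldExtractor(), ManifoldTransformer(), SQLiteLoader()))
```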


Question 24

What is the difference between offset-based and cursor-based pagination?

A) There is no practical difference
B) Offset-based uses a numeric position (skip N records); cursor-based uses an opaque token pointing to the last returned item, making it more reliable when data changes between requests
C) Cursor-based is always faster
D) Offset-based requires authentication while cursor-based does not

Answer: B. Offset-based pagination (skip=100, limit=50) can return duplicate or missing records if data is inserted or deleted between requests. Cursor-based pagination uses a token tied to a specific record, ensuring consistent traversal even as the dataset changes.
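
For contrast with the cursor loop in Question 11, a generic offset-based loop; the parameter names are illustrative and the endpoint is assumed to return a bare JSON array:

```python
import requests

def fetch_all_offset(base_url: str, page_size: int = 100) -> list[dict]:
    """Offset-based pagination: skip a growing number of records per page.

    If records are inserted or deleted between requests, later pages shift,
    which can duplicate or skip records (the weakness described above).
    """
    session = requests.Session()
    results: list[dict] = []
    offset = 0
    while True:
        params = {"limit": page_size, "offset": offset}
        page = session.get(base_url, params=params, timeout=10).json()
        if not page:  # An empty page means the data set is exhausted
            return results
        results.extend(page)
        offset += page_size
```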


Question 25

A data pipeline run extracts 1,000 markets but only loads 800. Which of the following is the LEAST likely explanation?

A) 200 markets failed the transformation step (e.g., missing required fields)
B) 200 markets were filtered out because they were not binary markets
C) The database already contained 200 of these markets and they were deduplicated
D) The API automatically removed 200 markets from the response

Answer: D. APIs return what they return; they do not remove records from their own responses mid-transmission. The gap between extracted and loaded counts is typically caused by filtering during transformation (non-binary markets, invalid data), validation failures, or deduplication in the loader.