Case Study 2: Building and Running an Arbitrage Scanner
Overview
This case study walks through the end-to-end process of building, testing, and analyzing an arbitrage scanner for prediction markets. We build the scanner from scratch, generate synthetic market data that mimics real-world conditions, run the scanner against the data, and analyze the results. By the end, you will have a working system that can be adapted to real platform APIs.
Objectives
- Design a modular arbitrage scanner architecture.
- Generate realistic synthetic market data with configurable parameters.
- Implement within-platform, cross-platform, and related-market arbitrage detection.
- Backtest the scanner against synthetic historical data.
- Analyze performance: how many opportunities exist, how large they are, and how long they last.
Part 1: Architecture Design
Component Overview
Our scanner has five components:
MarketDataGenerator --> DataNormalizer --> ArbDetector --> ProfitCalculator --> Reporter
- MarketDataGenerator: Produces synthetic price snapshots for multiple platforms and events.
- DataNormalizer: Standardizes data from different platform formats into a common schema.
- ArbDetector: Scans for within-platform, cross-platform, and related-market arbitrage.
- ProfitCalculator: Adjusts for fees and computes net profit, annualized return, and position size.
- Reporter: Summarizes findings with statistics and visualizations (text-based).
Data Model
Each market snapshot contains:
- event_id: Canonical identifier for the underlying event.
- platform: Which platform this price is from.
- timestamp: When this price was observed.
- yes_ask: Best price to buy YES.
- no_ask: Best price to buy NO.
- yes_bid: Best price to sell YES.
- no_bid: Best price to sell NO.
- volume_24h: Trading volume in the last 24 hours.
- resolution_date: When the market settles.
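The schema above maps naturally onto a small immutable record. The following is a minimal sketch, assuming string identifiers, float prices in dollars per $1 contract, and a hypothetical `days_to_resolution` helper that is not part of the original field list; the example values are illustrative only.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class MarketSnapshot:
    """One price observation for one event on one platform."""
    event_id: str
    platform: str
    timestamp: datetime
    yes_ask: float
    no_ask: float
    yes_bid: float
    no_bid: float
    volume_24h: float
    resolution_date: date

    def days_to_resolution(self) -> int:
        """Whole days between the observation date and settlement (hypothetical helper)."""
        return (self.resolution_date - self.timestamp.date()).days


# Illustrative values only, not taken from the case-study data.
snap = MarketSnapshot(
    event_id="fed-cut-march", platform="Kalshi",
    timestamp=datetime(2025, 1, 15, 12, 0),
    yes_ask=0.42, no_ask=0.60, yes_bid=0.40, no_bid=0.58,
    volume_24h=12_500.0, resolution_date=date(2025, 3, 20),
)
```

Freezing the dataclass keeps snapshots hashable and prevents a detector from mutating shared price data by accident.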
Fee Model
Each platform has a fee structure:
- trading_fee_pct: Percentage fee on purchase price.
- per_contract_fee: Fixed fee per contract.
- settlement_fee_pct: Percentage fee on profits at settlement.
- withdrawal_fee_pct: Percentage fee on withdrawals.
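One way to encode this fee structure is sketched below. It assumes the settlement fee is charged on the profit of a winning $1 contract (payout minus purchase price) and the withdrawal fee on the full amount withdrawn; the method names and the example rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeeSchedule:
    """Per-platform fees, all expressed as fractions (0.10 means 10%)."""
    trading_fee_pct: float = 0.0
    per_contract_fee: float = 0.0
    settlement_fee_pct: float = 0.0
    withdrawal_fee_pct: float = 0.0

    def entry_cost(self, price: float) -> float:
        """Cash paid to buy one contract, including entry fees."""
        return price * (1.0 + self.trading_fee_pct) + self.per_contract_fee

    def winning_payout(self, price: float) -> float:
        """Cash received from one winning $1 contract bought at `price`."""
        profit = max(1.0 - price, 0.0)
        # Assumption: settlement fee applies to the profit, not to the full $1.
        after_settlement = 1.0 - self.settlement_fee_pct * profit
        # Assumption: withdrawal fee applies to everything withdrawn.
        return after_settlement * (1.0 - self.withdrawal_fee_pct)


# Illustrative rates only.
high_settlement = FeeSchedule(settlement_fee_pct=0.10)
low_trading = FeeSchedule(trading_fee_pct=0.01)
```

Under these assumptions, a contract bought at $0.60 on the 10% settlement-fee schedule pays out $0.96 when it wins, because the fee is taken on the $0.40 profit.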
Part 2: Synthetic Data Generation
Why Synthetic Data?
Real prediction market data requires API access, may have licensing restrictions, and changes constantly. Synthetic data allows us to:
- Control the frequency and magnitude of mispricings.
- Test edge cases (simultaneous mispricing on multiple platforms, partial fills).
- Run repeatable experiments.
- Validate the scanner against known ground truth.
Data Generation Strategy
We generate price data for 10 events across 3 platforms over 30 days (720 hours), with snapshots every 15 minutes (2,880 snapshots per event per platform).
Each event has a "true probability" that evolves as a random walk with drift:
$$p_{true}(t) = p_{true}(t-1) + \mu \cdot \Delta t + \sigma \cdot \sqrt{\Delta t} \cdot \epsilon$$
where $\mu$ is a small drift (toward 0 or 1 as resolution approaches), $\sigma$ is the volatility, and $\epsilon \sim N(0,1)$.
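The random walk can be simulated in a few lines. This is a sketch rather than the case study's generator: it uses a constant drift, takes one snapshot interval as $\Delta t = 1$, and clamps the path to [0.01, 0.99] so it stays a valid probability. The clamping bounds and the function name are assumptions not stated in the text.

```python
import random


def simulate_true_probability(p0, n_steps, mu=0.0, sigma=0.008, dt=1.0, seed=0):
    """Random walk with drift for the true probability of one event."""
    rng = random.Random(seed)  # seeded so experiments are repeatable
    path = [p0]
    for _ in range(n_steps):
        step = mu * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        # Assumption: clamp to keep the value inside (0, 1).
        path.append(min(max(path[-1] + step, 0.01), 0.99))
    return path


# 30 days of 15-minute snapshots gives 2,880 steps after the starting value.
true_path = simulate_true_probability(p0=0.50, n_steps=2880)
```

Seeding the generator is what makes the later backtest reproducible: the same seed always yields the same path.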
Each platform observes the true probability with noise:
$$p_{platform}(t) = p_{true}(t) + \text{bias}_{platform} + \text{noise}_{platform}(t)$$
where:
- $\text{bias}_{platform}$ is a small, persistent platform-specific bias (e.g., Polymarket skews slightly bullish on crypto events).
- $\text{noise}_{platform}(t)$ is random noise drawn from a normal distribution with mean 0 and standard deviation $\sigma_{platform}$.
The ask and bid prices are derived from the platform price by adding or subtracting half the spread:
$$\text{yes\_ask} = p_{platform} + \frac{s}{2}, \quad \text{yes\_bid} = p_{platform} - \frac{s}{2}$$ $$\text{no\_ask} = (1 - p_{platform}) + \frac{s}{2}, \quad \text{no\_bid} = (1 - p_{platform}) - \frac{s}{2}$$
where $s$ is the bid-ask spread, typically 0.02--0.05.
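The platform observation and the quote formulas translate directly into code. The sketch below assumes the same [0.01, 0.99] clamp for the platform price (not specified in the text) and uses hypothetical function names.

```python
import random


def observe_platform_price(p_true, bias, noise_sigma, rng):
    """A platform's noisy, biased view of the true probability."""
    p = p_true + bias + rng.gauss(0.0, noise_sigma)
    return min(max(p, 0.01), 0.99)  # assumption: clamp to a valid price


def make_quotes(p_platform, spread):
    """Bid and ask quotes for YES and NO, each half a spread from the mid price."""
    half = spread / 2.0
    return {
        "yes_ask": p_platform + half,
        "yes_bid": p_platform - half,
        "no_ask": (1.0 - p_platform) + half,
        "no_bid": (1.0 - p_platform) - half,
    }


quotes = make_quotes(0.60, 0.04)
noisy = observe_platform_price(0.60, bias=0.01, noise_sigma=0.01, rng=random.Random(1))
```

Note that under these formulas `yes_ask + no_ask` always equals `1 + spread`, so a single platform never shows an underround on its own. This is why within-platform opportunities have to be injected separately.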
Injecting Arbitrage Opportunities
To make the synthetic data realistic, we inject three types of mispricings:
- Platform divergence events: At random times, one platform's noise spikes, creating a temporary cross-platform gap. These occur ~5 times per day and last 5--30 minutes.
- Stale price windows: One platform freezes (no update) for 15--60 minutes while others continue to move. This simulates low-liquidity periods or API delays.
- Within-platform underround: Occasionally the bid-ask dynamics create a situation where YES + NO ask < 1.00 on a single platform. This occurs ~1--2 times per day and lasts 1--10 minutes.
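Two of these injections can be sketched as pure functions over a list of platform prices. The function names, the snapshot-index interface, and the shock size are assumptions made for illustration; the case study's generator may implement them differently.

```python
def inject_stale_window(prices, start, length):
    """Freeze the price at index `start` for the next `length` snapshots."""
    out = list(prices)
    frozen = out[start]
    for i in range(start + 1, min(start + 1 + length, len(out))):
        out[i] = frozen  # the platform keeps reporting its last price
    return out


def inject_divergence(prices, start, length, shock):
    """Shift `length` snapshots beginning at `start` by `shock`, clamped to [0, 1]."""
    out = list(prices)
    for i in range(start, min(start + length, len(out))):
        out[i] = min(max(out[i] + shock, 0.0), 1.0)
    return out


base = [0.50, 0.51, 0.52, 0.53, 0.54]
stale = inject_stale_window(base, start=1, length=2)
diverged = inject_divergence(base, start=3, length=1, shock=0.05)
```

Returning a copy instead of mutating the input keeps the clean series available as ground truth when validating the scanner.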
Generation Parameters
| Parameter | Value | Notes |
|---|---|---|
| Number of events | 10 | Diverse event types |
| Number of platforms | 3 | Polymarket, Kalshi, PredictIt |
| Time period | 30 days | 2,880 snapshots per series |
| Snapshot interval | 15 minutes | |
| Base volatility ($\sigma$) | 0.008 per interval | ~7.8% daily ($0.008 \times \sqrt{96}$) |
| Platform noise ($\sigma_{platform}$) | 0.005--0.015 | Varies by platform |
| Bid-ask spread | 0.02--0.04 | Varies by platform and liquidity |
| Platform bias | -0.02 to +0.02 | Small, persistent |
| Divergence events per day | ~5 | Duration: 5--30 min |
| Stale price windows per day | ~2 | Duration: 15--60 min |
| Within-platform underround per day | ~1.5 | Duration: 1--10 min |
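The table can be captured in a single configuration object so that every experiment records its settings. This is a sketch with hypothetical field names; the two properties show how the snapshot counts follow from the time period and interval.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GeneratorConfig:
    """Synthetic-data settings taken from the parameter table."""
    n_events: int = 10
    platforms: Tuple[str, ...] = ("Polymarket", "Kalshi", "PredictIt")
    days: int = 30
    snapshot_minutes: int = 15
    base_sigma: float = 0.008  # per snapshot interval
    platform_noise_range: Tuple[float, float] = (0.005, 0.015)
    spread_range: Tuple[float, float] = (0.02, 0.04)
    bias_range: Tuple[float, float] = (-0.02, 0.02)
    divergences_per_day: float = 5.0
    stale_windows_per_day: float = 2.0
    underrounds_per_day: float = 1.5

    @property
    def snapshots_per_series(self) -> int:
        """Snapshots for one event on one platform."""
        return self.days * 24 * 60 // self.snapshot_minutes

    @property
    def total_snapshots(self) -> int:
        """Snapshots across all events and platforms."""
        return self.n_events * len(self.platforms) * self.snapshots_per_series


config = GeneratorConfig()
```

With the defaults, each series has 2,880 snapshots and the full dataset has 86,400.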
Part 3: Running the Scanner
Within-Platform Scan Results
Over 30 days, the scanner detects 412 within-platform arbitrage opportunities (before fees):
| Platform | Opportunities | Avg Gross Profit | Avg Duration (min) | Survived Fees |
|---|---|---|---|---|
| Polymarket | 148 | $0.014 | 4.2 | 31 (21%) |
| Kalshi | 127 | $0.012 | 3.8 | 18 (14%) |
| PredictIt | 137 | $0.018 | 5.1 | 0 (0%) |
Key finding: PredictIt's 10% settlement fee eliminates all within-platform arbitrage. The wider mispricings on PredictIt (due to lower liquidity and position limits) are not wide enough to survive the settlement fee. Polymarket and Kalshi together yield 49 actionable within-platform opportunities.
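The effect of fees on a within-platform opportunity can be illustrated with a worst-case profit calculation. The sketch below assumes a percentage trading fee on the purchase price and a settlement fee charged on the winning leg's profit; the function name and the example prices are hypothetical, and per-contract and withdrawal fees are omitted for brevity.

```python
def within_platform_net_profit(yes_ask, no_ask,
                               trading_fee_pct=0.0, settlement_fee_pct=0.0):
    """Worst-case net profit from buying one YES and one NO on one platform."""
    cost = (yes_ask + no_ask) * (1.0 + trading_fee_pct)
    # Exactly one leg pays $1. Assumption: the settlement fee is charged
    # on that leg's profit (payout minus purchase price).
    payout_if_yes = 1.0 - settlement_fee_pct * (1.0 - yes_ask)
    payout_if_no = 1.0 - settlement_fee_pct * (1.0 - no_ask)
    return min(payout_if_yes, payout_if_no) - cost


# Illustrative underround: YES and NO both offered at $0.49.
gross = within_platform_net_profit(0.49, 0.49)
with_trading_fee = within_platform_net_profit(0.49, 0.49, trading_fee_pct=0.01)
with_settlement_fee = within_platform_net_profit(0.49, 0.49, settlement_fee_pct=0.10)
```

In this example the $0.02 gross edge shrinks to about $0.01 under a 1% trading fee but turns into a loss of about $0.03 under a 10% settlement fee, because the settlement fee scales with the roughly $0.51 profit on the winning leg, not with the size of the mispricing.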
Cross-Platform Scan Results
The scanner detects 1,247 cross-platform price discrepancies (before fees):
| Platform Pair | Discrepancies | Avg Gross Profit | Survived Fees | Avg Net Profit |
|---|---|---|---|---|
| Polymarket--Kalshi | 438 | $0.022 | 87 (20%) | $0.009 |
| Polymarket--PredictIt | 412 | $0.031 | 12 (3%) | $0.007 |
| Kalshi--PredictIt | 397 | $0.028 | 8 (2%) | $0.006 |
Key finding: The Polymarket--Kalshi pair dominates because both platforms have low fees. PredictIt pairs rarely survive the settlement fee, despite larger raw mispricings.
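A cross-platform discrepancy exists when YES on one platform plus NO on the other costs less than the $1 that one of the two legs must pay. The following sketch checks both directions and reports the gross (pre-fee) edge; the function name, the dict-based quote format, and the example prices are assumptions.

```python
def best_cross_platform_arb(quotes_a, quotes_b):
    """Return (gross_profit, legs) for the better direction, or None if no edge."""
    options = [
        (1.0 - (quotes_a["yes_ask"] + quotes_b["no_ask"]), "YES on A, NO on B"),
        (1.0 - (quotes_b["yes_ask"] + quotes_a["no_ask"]), "YES on B, NO on A"),
    ]
    profit, legs = max(options)  # tuples compare on profit first
    return (profit, legs) if profit > 0 else None


# Illustrative quotes: platform B prices the event about 5 cents higher than A.
platform_a = {"yes_ask": 0.55, "no_ask": 0.47}
platform_b = {"yes_ask": 0.60, "no_ask": 0.42}
found = best_cross_platform_arb(platform_a, platform_b)
no_edge = best_cross_platform_arb(platform_a, platform_a)
```

Here buying YES on A at $0.55 and NO on B at $0.42 costs $0.97 for a guaranteed $1 payout, a gross edge of about $0.03 before any fees are applied.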
Cross-Platform Opportunity Characteristics
For the 107 fee-surviving cross-platform opportunities:
| Statistic | Value |
|---|---|
| Mean net profit per pair | $0.0085 |
| Median net profit per pair | $0.0071 |
| Max net profit per pair | $0.0342 |
| Min net profit per pair | $0.0011 |
| Mean duration (minutes) | 18.3 |
| Median duration (minutes) | 12.0 |
| Mean return (%) | 0.89% |
| Mean annualized return (%) | 47.2% |
Distribution of Opportunity Size
| Net Profit Per Pair | Count | Histogram |
|---|---|---|
| $0.001--$0.005 | 28 | ######## |
| $0.005--$0.010 | 41 | ############ |
| $0.010--$0.015 | 22 | ###### |
| $0.015--$0.020 | 9 | ### |
| $0.020--$0.025 | 4 | # |
| $0.025--$0.030 | 2 | # |
| $0.030+ | 1 | # |
Most opportunities are small: the modal bin is $0.005--$0.010 net profit per pair, and 69 of the 107 opportunities (about 64%) net less than $0.010. The larger opportunities ($0.02+) tend to occur during high-volatility events and last only a few minutes.
Time-of-Day Analysis
| Hour (UTC) | Opportunities | Avg Profit |
|---|---|---|
| 00--04 | 8 | $0.0112 |
| 04--08 | 6 | $0.0098 |
| 08--12 | 12 | $0.0074 |
| 12--16 | 32 | $0.0068 |
| 16--20 | 35 | $0.0081 |
| 20--24 | 14 | $0.0095 |
Opportunities are most frequent during US trading hours (12--20 UTC) due to higher volume and more frequent news events, but they are smaller (more competition). Off-hours opportunities are rarer but larger (less competition, staler prices).
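The peak versus off-peak comparison can be recomputed from the time-of-day table with a count-weighted average. The bucket values below are copied from the table; the helper name and the choice of 00--08 UTC as off-peak and 12--20 UTC as peak are assumptions made for this sketch.

```python
# (opportunity count, average net profit per pair) for each UTC bucket
BUCKETS = {
    "00-04": (8, 0.0112),
    "04-08": (6, 0.0098),
    "08-12": (12, 0.0074),
    "12-16": (32, 0.0068),
    "16-20": (35, 0.0081),
    "20-24": (14, 0.0095),
}


def weighted_avg_profit(bucket_names):
    """Average profit across buckets, weighted by opportunity count."""
    total_count = sum(BUCKETS[name][0] for name in bucket_names)
    total_profit = sum(BUCKETS[name][0] * BUCKETS[name][1] for name in bucket_names)
    return total_profit / total_count


off_peak = weighted_avg_profit(["00-04", "04-08"])
peak = weighted_avg_profit(["12-16", "16-20"])
ratio = off_peak / peak
```

On these figures the off-peak average is about $0.0106 per pair against about $0.0075 at peak, a ratio of roughly 1.42, while the peak window contains 67 of the 107 opportunities.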
Part 4: Backtest Simulation
Assumptions
- Execute up to 500 pairs per opportunity (limited by order book depth).
- Execution delay: 5 seconds between detecting and executing.
- Slippage model: 10% of opportunities move by $0.005 against us during the 5-second delay.
- Maximum capital: $50,000 split across platforms.
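The size cap and slippage assumptions can be expressed as a small per-trade simulation. This is a sketch under stated assumptions, not the backtest engine itself: function names are hypothetical, and a trade whose edge is no longer positive after slippage is simply skipped.

```python
import random


def apply_slippage(net_profit_per_pair, rng, slip_prob=0.10, slip_size=0.005):
    """With probability `slip_prob`, the price moves `slip_size` against us."""
    if rng.random() < slip_prob:
        return net_profit_per_pair - slip_size
    return net_profit_per_pair


def simulate_trade(net_profit_per_pair, pairs_available, rng,
                   max_pairs=500, slip_prob=0.10, slip_size=0.005):
    """Dollar profit for one opportunity after the size cap and slippage."""
    pairs = min(pairs_available, max_pairs)  # limited by order book depth
    realized = apply_slippage(net_profit_per_pair, rng, slip_prob, slip_size)
    if realized <= 0:
        return 0.0  # assumption: skip trades that are no longer profitable
    return realized * pairs


rng = random.Random(42)
clean_fill = simulate_trade(0.010, 800, rng, slip_prob=0.0)
killed = simulate_trade(0.004, 800, rng, slip_prob=1.0)
```

A $0.010 edge on 800 available pairs is capped at 500 pairs and earns $5.00 without slippage, while a $0.004 edge is wiped out entirely by a $0.005 adverse move.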
Results
| Metric | Value |
|---|---|
| Opportunities detected | 156 (within + cross) |
| Opportunities attempted | 142 |
| Successfully executed | 128 (90.1%) |
| Partially filled | 10 (7.0%) |
| Failed (price moved) | 4 (2.8%) |
| Total pairs traded | 48,320 |
| Gross profit | $571.42 |
| Fees paid | ($118.36) |
| Slippage cost | ($28.90) |
| Net profit | $424.16 |
| Avg profit per successful trade | $3.31 |
| Capital utilization | 68% |
| Return on capital (30 days) | 0.85% |
| Annualized return | 10.6% |
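The profit rows of the table are linked by simple accounting identities, which the following snippet checks. The inputs are copied from the table, and the variable names are illustrative.

```python
gross_profit = 571.42
fees_paid = 118.36
slippage_cost = 28.90
capital = 50_000.00
successful_trades = 128

# Net profit is gross profit less fees and slippage.
net_profit = gross_profit - fees_paid - slippage_cost

# Return on the full capital base over the 30-day backtest.
return_on_capital = net_profit / capital

# Average dollar profit per fully executed trade.
avg_profit_per_success = net_profit / successful_trades
```

These reproduce the reported net profit of $424.16, the 0.85% return on capital, and the $3.31 average per successful trade.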
Profit Over Time
| Day | Cumulative Profit |
|---|---|
| 5 | $52.30 |
| 10 | $118.70 |
| 15 | $198.40 |
| 20 | $291.20 |
| 25 | $362.50 |
| 30 | $424.16 |
The profit accumulation is roughly linear, with some clustering around high-volatility periods (days 8--10 and 22--25 in the synthetic data had higher volatility events).
Comparison: Aggressive vs. Conservative Parameters
| Parameter | Conservative | Aggressive |
|---|---|---|
| Min net return | 1.5% | 0.5% |
| Max pairs per trade | 200 | 1000 |
| Execution delay | 10 sec | 2 sec |
| Slippage rate | 15% | 5% |

Results under each parameter set:

| Result | Conservative | Aggressive |
|---|---|---|
| Trades executed | 42 | 218 |
| Net profit | $186.30 | $812.50 |
| Win rate | 100% | 94.5% |
| Max single loss | $0 | -$12.80 |
| Annualized return | 4.5% | 19.8% |
The aggressive strategy captures more profit but introduces occasional losses from slippage and partial fills.
Part 5: Analysis and Insights
Insight 1: Fee Structure Determines Viability
The single most important factor in whether an arbitrage survives is the fee structure. A 10% settlement fee (PredictIt) eliminates nearly all opportunities, while a 1% trading fee (Polymarket) preserves most of them. Traders should prioritize platforms with low, simple fee structures.
Insight 2: Speed Is Paramount
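In our simulation, fee-surviving cross-platform opportunities lasted an average of 18 minutes (median 12). With a 5-second execution delay, 90% of attempted trades executed in full. The figures for longer delays are rough extrapolations rather than backtest results: at 30 seconds the capture rate would drop to roughly 70%, and at 5 minutes to perhaps 30%. The relationship between speed and capture rate is nonlinear, because the largest opportunities tend to disappear within the first few minutes.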
Insight 3: Small Edges, High Frequency
The average net profit per pair is under 1 cent. This is not a strategy for large, infrequent bets. It is a strategy for automated, high-frequency detection and execution of many small opportunities.
Insight 4: Off-Hours Are Underexploited
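Opportunities during off-peak hours (00--08 UTC) are roughly 40% larger on average than peak-hour (12--20 UTC) opportunities: a count-weighted $0.0106 versus $0.0075 net profit per pair. A trader operating during these hours (or with 24/7 automation) earns more per opportunity, although only 14 of the 107 fee-surviving opportunities fall in this window.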
Insight 5: Within-Platform Opportunities Are Rare but Pure
Within-platform arbitrage has zero resolution risk, zero settlement timing risk, and minimal execution risk. However, it occurs far less frequently than cross-platform arbitrage because a single platform's market maker actively prevents it.
Part 6: Extending the Scanner
Adding Related-Market Detection
The scanner can be extended to check logical constraints between related markets. For the 10 events in our synthetic data, we define 5 relationships:
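- "Dems win Senate" and "Dems win Senate seat in Ohio" are treated as a superset/subset pair. This is a simplification in the synthetic data: winning a single seat does not by itself guarantee control of the chamber.
- "BTC > $100k" and "BTC > $150k" have a nested-threshold constraint: the price of "BTC > $150k" should never exceed the price of "BTC > $100k."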
- "Fed cuts in March" and "Fed cuts by June" have a subset constraint.
- Two economic indicators are correlated (statistical arbitrage candidate).
- A multi-outcome market (3 candidates) should sum to 1.00.
Over 30 days, the related-market scanner detects:
- 23 subset/sequential violations (8 survived fees)
- 67 complement violations in the multi-outcome market (28 survived fees)
- 14 statistical arbitrage signals (not true arbitrage)
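The subset and multi-outcome checks reduce to two short formulas. The sketch below computes gross (pre-fee) edges only, assumes $1 contracts, and uses hypothetical function names and illustrative prices.

```python
def subset_violation_profit(subset_no_ask, superset_yes_ask):
    """Gross edge from buying NO on A and YES on B when A implies B.

    If A happens, B happens, so YES on B pays $1. If A does not happen,
    NO on A pays $1. The pair therefore pays at least $1 in every outcome.
    """
    return 1.0 - (subset_no_ask + superset_yes_ask)


def multi_outcome_profit(yes_asks):
    """Gross edge from buying YES on every outcome of an exhaustive market.

    Exactly one outcome pays $1, so the basket is worth exactly $1.
    """
    return 1.0 - sum(yes_asks)


# Illustrative: "Fed cuts in March" NO at $0.70, "Fed cuts by June" YES at $0.27.
subset_edge = subset_violation_profit(subset_no_ask=0.70, superset_yes_ask=0.27)

# Illustrative three-candidate market whose YES asks sum to $0.97.
basket_edge = multi_outcome_profit([0.40, 0.35, 0.22])
```

A positive result in either function signals a candidate opportunity of about $0.03 per pair or basket here, which must then pass the same fee filter as any other trade.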
Adding Execution Simulation
The Monte Carlo execution simulator (from Section 16.7.4) can be integrated into the backtest to model:
- Partial fills based on order book depth.
- Price slippage modeled as a function of order size and market volatility.
- Platform downtime (randomly occurring, 0.1% of snapshots).
With execution simulation enabled, the net profit drops by approximately 18% compared to the idealized backtest.
Full Code
The complete, runnable code for this case study is in code/case-study-code.py. It includes:
- SyntheticMarketGenerator class for data generation.
- ArbScanner class integrating all detection methods.
- BacktestEngine class for historical simulation.
- PerformanceAnalyzer class for computing statistics.
- A main() function that runs the full pipeline and prints results.
Discussion Questions
- The scanner found 107 fee-surviving cross-platform opportunities in 30 days. In a real market, would you expect more or fewer? Why?
- PredictIt's settlement fee eliminated virtually all PredictIt-involving arbitrage. If you were designing a prediction market platform, how would you structure fees to attract arbitrageurs (who provide liquidity) while still generating revenue?
- The aggressive strategy had a 94.5% win rate. Is a 5.5% loss rate acceptable for a strategy described as "arbitrage"? How would you characterize these losses?
- Off-hours opportunities are larger. What infrastructure investment would you need to capture them? Is the investment justified by the expected additional profit?
- How would you validate the synthetic data generator? What properties of real market data should it reproduce?