> "Everyone has a plan until they get punched in the mouth." --- Mike Tyson
In This Chapter
- 26.1 Why Backtesting Is Essential
- 26.2 Common Backtesting Pitfalls
- 26.3 Designing a Backtesting Framework
- 26.4 Data Requirements and Preparation
- 26.5 Fill Simulation and Execution Modeling
- 26.6 Transaction Cost Modeling
- 26.7 Walk-Forward Backtesting
- 26.8 Performance Metrics for Prediction Market Strategies
- 26.9 Statistical Significance of Backtest Results
- 26.10 Backtest Report Generation
- 26.11 From Backtest to Paper Trading
- 26.12 Chapter Summary
- What's Next
- Key Equations Reference
Chapter 26: Backtesting Prediction Market Strategies
"Everyone has a plan until they get punched in the mouth." --- Mike Tyson
In prediction markets, the punch comes the moment you deploy a strategy with real capital and discover that the beautiful equity curve you constructed on historical data was nothing more than an artifact of flawed methodology. Backtesting --- the process of simulating a trading strategy on historical data --- is simultaneously the most powerful tool in a quantitative trader's arsenal and the most dangerous source of self-deception.
This chapter builds a rigorous backtesting framework from the ground up, specifically tailored to the unique characteristics of prediction markets. We will confront every bias that corrupts backtest results, simulate realistic execution conditions, model the true costs of trading, and construct statistical tests that separate genuine alpha from statistical noise. By the end, you will possess both the code infrastructure and the intellectual discipline needed to evaluate any prediction market strategy with confidence.
26.1 Why Backtesting Is Essential
26.1.1 Validating Strategies Before Risking Capital
Every prediction market strategy begins as a hypothesis: "Markets that have moved more than 10 cents in the last hour tend to revert within the next four hours," or "Markets whose prices diverge from my model's probability by more than 15 percentage points are profitable to trade." These hypotheses sound plausible, but plausibility is not evidence.
Backtesting transforms a hypothesis into a falsifiable proposition. By applying a strategy's rules mechanically to historical data, we obtain a simulated track record that lets us answer concrete questions:
- Would this strategy have been profitable? After accounting for all costs, does the strategy produce a positive expected return?
- How much risk does it entail? What is the worst drawdown? How volatile are the returns? Could I psychologically and financially survive the bad periods?
- Is the edge large enough? A strategy that produces a 0.5% annual return is not worth the operational complexity of deploying it. Does the edge justify the effort?
- Is the result statistically significant? Could the observed performance be explained by luck alone, or does the data provide genuine evidence of an edge?
Without backtesting, you are gambling on intuition. With rigorous backtesting, you are making an informed decision under uncertainty --- which is precisely what prediction markets themselves are designed to facilitate.
26.1.2 Historical Simulation as a Laboratory
Think of your backtesting framework as a laboratory. Just as a physicist does not build a nuclear reactor to test a theory about fission, you do not deploy capital to test a theory about market behavior. The laboratory lets you:
- Iterate rapidly. You can test dozens of strategy variations in the time it would take to paper-trade a single one.
- Control variables. You can isolate the effect of individual parameters by changing one at a time while holding everything else constant.
- Examine extreme conditions. You can see how your strategy performs during historical crises, elections, and other stress events without actually living through them.
- Build conviction. A strategy that has survived rigorous backtesting gives you the psychological resilience to stick with it during inevitable drawdowns.
26.1.3 The Backtesting Mindset
The correct mental model for backtesting is hypothesis testing, not profit demonstration. You are not trying to prove that a strategy works. You are trying to determine whether the evidence is consistent with the strategy working, while actively looking for reasons it might not.
This distinction matters because of a profound asymmetry: it is vastly easier to construct a strategy that appears profitable on historical data than to construct one that is genuinely profitable going forward. The space of strategies that fit historical noise is enormous. The space of strategies that capture genuine, persistent market inefficiencies is small.
The disciplined backtester asks:
- "What could be wrong with this result?" not "How good is this result?"
- "How would this fail?" not "How much would this make?"
- "Is this edge robust across different time periods, markets, and parameter choices?" not "What parameters maximize the backtest return?"
With this mindset firmly established, let us examine what goes wrong when discipline breaks down.
26.2 Common Backtesting Pitfalls
Backtesting pitfalls are not merely academic curiosities. They are traps that have destroyed real capital and ended real trading operations. Understanding them deeply is a prerequisite for building any credible backtesting framework.
26.2.1 Lookahead Bias
Lookahead bias occurs when a backtest uses information that would not have been available at the time the trading decision was made. This is the single most common and most destructive backtesting error.
Examples in prediction markets:
- Using the final resolution outcome to filter which markets to trade. ("I only backtested on markets that resolved YES, because those are the ones with clear outcomes.")
- Using end-of-day volume data to make trading decisions that would have been made intraday.
- Using a model trained on the entire dataset, including future data, to generate signals at each point in time.
import pandas as pd
import numpy as np
# === LOOKAHEAD BIAS DEMONSTRATION ===
# Simulated prediction market data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100, freq='D')
prices = 0.5 + np.cumsum(np.random.randn(100) * 0.02)
prices = np.clip(prices, 0.01, 0.99)
resolution = 1 # Market resolved YES
df = pd.DataFrame({
'date': dates,
'price': prices,
'volume': np.random.randint(100, 1000, 100)
})
# WRONG: Using future resolution to decide trading direction
# This is lookahead bias --- at trading time, you don't know
# the resolution outcome.
def strategy_with_lookahead(df, resolution):
"""Biased: uses resolution outcome to set direction."""
signals = []
for i, row in df.iterrows():
if resolution == 1:
# Buy when price is below 0.6 --- but we only know
# to buy because we know it resolves YES!
signals.append(1 if row['price'] < 0.6 else 0)
else:
signals.append(-1 if row['price'] > 0.4 else 0)
return signals
# CORRECT: Strategy uses only past information
def strategy_without_lookahead(df):
"""Unbiased: uses only information available at decision time."""
signals = []
for i in range(len(df)):
if i < 20:
signals.append(0) # Not enough history
continue
# Use only past prices to form a view
recent_avg = df['price'].iloc[i-20:i].mean()
current = df['price'].iloc[i]
# Mean reversion: buy if below recent average
if current < recent_avg - 0.05:
signals.append(1)
elif current > recent_avg + 0.05:
signals.append(-1)
else:
signals.append(0)
return signals
print("Strategy with lookahead (BIASED):")
biased_signals = strategy_with_lookahead(df, resolution)
print(f" Buy signals: {sum(1 for s in biased_signals if s == 1)}")
print("\nStrategy without lookahead (CORRECT):")
clean_signals = strategy_without_lookahead(df)
print(f" Buy signals: {sum(1 for s in clean_signals if s == 1)}")
print(f" Sell signals: {sum(1 for s in clean_signals if s == -1)}")
The insidious nature of lookahead bias is that it can be subtle. Using a moving average calculated on all data (including future values) instead of a rolling window introduces lookahead. Using a spread model calibrated to the full dataset introduces lookahead. Any time your code at time t can "see" data from time t+1 or later, you have lookahead bias.
Prevention: The best architectural defense is an event-driven backtesting framework (Section 26.3) where data is fed to the strategy one timestamp at a time, making it structurally impossible to access future data.
26.2.2 Survivorship Bias
Survivorship bias occurs when the backtest dataset includes only markets that survived to the end of the testing period, excluding those that were delisted, cancelled, or otherwise removed.
In prediction markets, survivorship bias takes specific forms:
- Cancelled markets. Some platforms cancel markets due to ambiguous resolution criteria. If your dataset excludes these, your universe of markets is biased toward those with cleaner outcomes.
- Low-liquidity markets that dried up. If you only include markets that maintained liquidity throughout their lifetime, you are excluding precisely the markets where your strategy would have been unable to exit positions.
- Platform changes. If a platform changed its fee structure or market mechanics partway through your testing period, and you only use post-change data, your backtest does not reflect the full historical experience.
# === SURVIVORSHIP BIAS DEMONSTRATION ===
# Suppose we have 1000 prediction markets
np.random.seed(42)
n_markets = 1000
# Each market has a "quality score" that determines if it survives
quality = np.random.uniform(0, 1, n_markets)
# Markets below 0.3 quality get cancelled (300 markets removed)
survived = quality >= 0.3
# Returns are correlated with quality (better markets have
# slightly better average returns for a given strategy)
true_returns = np.random.randn(n_markets) * 0.1 + (quality - 0.5) * 0.05
# Survivorship-biased analysis: only look at surviving markets
biased_mean = true_returns[survived].mean()
biased_sharpe = true_returns[survived].mean() / true_returns[survived].std()
# Correct analysis: include all markets
correct_mean = true_returns.mean()
correct_sharpe = true_returns.mean() / true_returns.std()
print("Survivorship Bias Impact:")
print(f" Biased mean return: {biased_mean:.4f} (Sharpe: {biased_sharpe:.3f})")
print(f" Correct mean return: {correct_mean:.4f} (Sharpe: {correct_sharpe:.3f})")
print(f" Bias magnitude: {biased_mean - correct_mean:.4f}")
print(f" Markets excluded: {n_markets - survived.sum()}")
26.2.3 Overfitting to Historical Data
Overfitting is the process of tuning a strategy's parameters so precisely to historical data that it captures noise rather than signal. An overfit strategy will perform brilliantly on the data used to develop it and poorly on new data.
The mathematical intuition is straightforward. Suppose a strategy has k free parameters. Each parameter adds a degree of freedom that allows the strategy to fit one more peculiarity of the historical data. With enough parameters, any strategy can be made to fit any historical dataset perfectly --- just as a polynomial of degree n-1 can pass through any n points.
Overfitting warning signs:
| Warning Sign | Description |
|---|---|
| Too many parameters | Strategy has more tunable knobs than economic rationale |
| Fragile performance | Small parameter changes cause large performance swings |
| In-sample/out-of-sample gap | Strategy performs much better on training data than test data |
| Implausible Sharpe ratio | Backtest Sharpe > 3.0 for a prediction market strategy is suspicious |
| Complex entry/exit rules | Multiple conditional clauses with specific numeric thresholds |
# === OVERFITTING DEMONSTRATION ===
# Generate random prediction market returns (no real signal)
np.random.seed(42)
n_days = 500
returns = np.random.randn(n_days) * 0.02 # Pure noise
# "Optimize" a strategy by trying many parameter combinations
best_in_sample_return = -np.inf
best_params = None
# Try 1000 random parameter combinations
for trial in range(1000):
# Random lookback and threshold parameters
lookback = np.random.randint(5, 50)
threshold = np.random.uniform(0.01, 0.10)
# Apply a meaningless strategy to in-sample data (first 250 days)
in_sample = returns[:250]
signals = []
for i in range(lookback, 250):
rolling_mean = in_sample[i-lookback:i].mean()
if rolling_mean > threshold:
signals.append(1)
elif rolling_mean < -threshold:
signals.append(-1)
else:
signals.append(0)
# Calculate in-sample return
strategy_returns = [s * in_sample[i+lookback]
for i, s in enumerate(signals)
if i + lookback < 250]
total_return = sum(strategy_returns)
if total_return > best_in_sample_return:
best_in_sample_return = total_return
best_params = (lookback, threshold)
# Now test "best" parameters on out-of-sample data
lookback, threshold = best_params
out_sample = returns[250:]
signals = []
for i in range(lookback, 250):
rolling_mean = out_sample[i-lookback:i].mean()
if rolling_mean > threshold:
signals.append(1)
elif rolling_mean < -threshold:
signals.append(-1)
else:
signals.append(0)
strategy_returns = [s * out_sample[i+lookback]
for i, s in enumerate(signals)
if i + lookback < 250]
out_of_sample_return = sum(strategy_returns)
print("Overfitting Demonstration:")
print(f" Best in-sample params: lookback={best_params[0]}, "
f"threshold={best_params[1]:.4f}")
print(f" In-sample return: {best_in_sample_return:.4f}")
print(f" Out-of-sample return: {out_of_sample_return:.4f}")
print(f" Performance decay: {best_in_sample_return - out_of_sample_return:.4f}")
print("\n Note: The data is pure noise. Any in-sample 'edge' is overfitting.")
26.2.4 Unrealistic Fill Assumptions
Perhaps the most prediction-market-specific pitfall: assuming that you can trade at the prices you see in historical data. Prediction markets have:
- Wide bid-ask spreads, often 2--10 cents on binary contracts.
- Thin order books, where a $100 order can move the price several cents.
- Stale quotes, where the displayed price reflects a trade from minutes or hours ago.
- No guaranteed fills, as limit orders may never execute.
A backtest that assumes you can buy at the last traded price ignores all of these realities. Section 26.5 addresses this in detail.
26.2.5 Ignoring Transaction Costs
Prediction market fees are significant:
- Trading fees: 0--10% of profits or 1--5 cents per contract on some platforms.
- Spread costs: Crossing a 4-cent spread costs 4% on a dollar contract.
- Withdrawal fees: Moving money off-platform has real costs.
- Opportunity cost: Capital locked in positions earning 0% while you wait for resolution.
A strategy that earns 8% gross but incurs 6% in costs earns only 2% net --- a dramatically different proposition.
26.2.6 Data Snooping and Selection Bias
Data snooping occurs when you test many strategies on the same dataset and select the best one. Even if each individual backtest is conducted correctly, the process of selection introduces bias.
If you test 100 strategies on the same data, the best one will appear profitable even if all 100 strategies have zero true edge. This is the multiple comparisons problem. At a 5% significance level, you would expect five strategies to appear significant by chance alone.
Selection bias is a related problem: choosing to develop strategies based on patterns you noticed in the data. If you noticed that a particular market exhibited mean reversion and then backtested a mean-reversion strategy on that market, your test is biased because the strategy was designed to fit the observation.
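The selection effect is easy to reproduce. The short sketch below (illustrative, with arbitrary seed and sizes) evaluates 100 random, zero-edge strategies on the same noise series and reports the best-looking Sharpe ratio:
import numpy as np
# === DATA SNOOPING DEMONSTRATION (illustrative sketch) ===
np.random.seed(0)
n_days = 250
market_returns = np.random.randn(n_days) * 0.02  # pure noise, no edge exists
best_sharpe = -np.inf
for _ in range(100):
    # Each "strategy" is just a random daily position: long, flat, or short
    signals = np.random.choice([-1, 0, 1], size=n_days)
    strat_returns = signals * market_returns
    if strat_returns.std() > 0:
        sharpe = strat_returns.mean() / strat_returns.std() * np.sqrt(252)
        best_sharpe = max(best_sharpe, sharpe)
print(f"Best Sharpe among 100 zero-edge strategies: {best_sharpe:.2f}")
# The selected "winner" routinely shows an annualized Sharpe well above 1,
# even though every candidate was noise; selection alone creates the illusion of edge.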
26.2.7 The Seven Deadly Sins of Backtesting
To summarize, here are the seven most dangerous backtesting errors, ranked by severity:
- Lookahead bias --- Using future information in past decisions.
- Overfitting --- Fitting noise instead of signal.
- Survivorship bias --- Testing only on markets/data that survived.
- Unrealistic execution --- Assuming instant fills at observed prices.
- Ignoring costs --- Omitting fees, spreads, and impact.
- Data snooping --- Testing many strategies and selecting the best.
- Lack of statistical testing --- Not checking if results are significant.
A single instance of any of these sins can render a backtest worthless. Our framework will address each one architecturally.
26.3 Designing a Backtesting Framework
26.3.1 Architecture Overview
A well-designed backtesting framework has five core components:
+------------------+
| Data Handler |
+--------+---------+
|
Market data events
|
v
+------------------+
| Strategy |
+--------+---------+
|
Signal/Order events
|
v
+------------------+
| Portfolio |
+--------+---------+
|
Order events
|
v
+------------------+
| Execution Sim |
+--------+---------+
|
Fill events
|
v
+------------------+
| Analyzer |
+------------------+
Data Handler: Feeds historical data to the strategy one event at a time, ensuring no lookahead. Responsible for data loading, cleaning, and time-alignment.
Strategy: Receives market data and produces trading signals. Contains all the logic that defines "when to buy" and "when to sell." Knows nothing about execution or portfolio management.
Portfolio: Tracks positions, cash, and overall portfolio value. Receives signals from the strategy and decides whether to convert them into actual orders (e.g., based on position limits or risk constraints).
Execution Simulator: Receives orders and simulates realistic fills, including slippage, partial fills, and transaction costs. Returns fill events that update the portfolio.
Analyzer: Collects all events and computes performance metrics, generates reports, and produces visualizations.
26.3.2 Event-Driven vs. Vectorized Backtesting
There are two fundamental approaches to backtesting:
Vectorized backtesting operates on entire arrays of data simultaneously using NumPy/pandas operations. It is fast but makes it easy to introduce lookahead bias and difficult to model realistic execution.
Event-driven backtesting processes data one event at a time, simulating the chronological flow of information. It is slower but naturally prevents lookahead bias and supports complex execution modeling.
| Aspect | Vectorized | Event-Driven |
|---|---|---|
| Speed | Very fast (vectorized ops) | Slower (Python loops) |
| Lookahead safety | Prone to bias | Structurally safe |
| Execution realism | Difficult to model | Natural fit |
| Complexity | Simple | More complex |
| Portfolio tracking | Approximate | Exact |
| Best for | Quick screening | Final validation |
Our recommendation: Use vectorized backtesting for initial strategy screening and event-driven backtesting for final validation. The framework we build supports both approaches.
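For comparison with the event-driven engine, the sketch below shows the vectorized style on a single market. It assumes a DataFrame with a 'price' column (like the df built in Section 26.2.1); the shift(1) on the signal is what keeps lookahead out, and it is exactly the step that is easy to forget in vectorized code.
import pandas as pd
def vectorized_mean_reversion(df: pd.DataFrame,
                              lookback: int = 20,
                              band: float = 0.05) -> pd.Series:
    """Per-period returns of a simple mean-reversion rule, vectorized."""
    rolling_avg = df['price'].rolling(lookback).mean()
    signal = pd.Series(0, index=df.index)
    signal[df['price'] < rolling_avg - band] = 1   # buy when well below average
    signal[df['price'] > rolling_avg + band] = -1  # sell when well above average
    # Shift so the position at t uses only data through t-1 (no lookahead)
    price_change = df['price'].diff()
    return (signal.shift(1) * price_change).fillna(0.0)
# Quick screening: total P&L in price points per contract
# pnl = vectorized_mean_reversion(df).sum()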
26.3.3 The Framework in Python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Dict, Optional, Tuple
import numpy as np
import pandas as pd
# === Event Types ===
class EventType(Enum):
MARKET_DATA = "MARKET_DATA"
SIGNAL = "SIGNAL"
ORDER = "ORDER"
FILL = "FILL"
class OrderSide(Enum):
BUY = "BUY"
SELL = "SELL"
class OrderType(Enum):
MARKET = "MARKET"
LIMIT = "LIMIT"
@dataclass
class MarketDataEvent:
timestamp: datetime
market_id: str
price: float # Last traded price
bid: float # Best bid
ask: float # Best ask
volume: float # Period volume
bid_size: float # Size at best bid
ask_size: float # Size at best ask
@dataclass
class SignalEvent:
timestamp: datetime
market_id: str
direction: int # +1 buy, -1 sell, 0 flat
strength: float # Signal strength [0, 1]
@dataclass
class OrderEvent:
timestamp: datetime
market_id: str
side: OrderSide
order_type: OrderType
quantity: float
limit_price: Optional[float] = None
@dataclass
class FillEvent:
timestamp: datetime
market_id: str
side: OrderSide
quantity: float # Filled quantity
fill_price: float # Actual execution price
commission: float # Transaction cost
slippage: float # Price impact
# === Abstract Base Classes ===
class DataHandler(ABC):
"""Feeds historical data one event at a time."""
@abstractmethod
def has_next(self) -> bool:
"""Returns True if more data is available."""
pass
@abstractmethod
def get_next(self) -> MarketDataEvent:
"""Returns the next market data event."""
pass
@abstractmethod
def get_latest(self, market_id: str,
n: int = 1) -> List[MarketDataEvent]:
"""Returns the n most recent events for a market.
Only returns events that have already been emitted
(no lookahead)."""
pass
class Strategy(ABC):
"""Generates trading signals from market data."""
@abstractmethod
def on_market_data(self, event: MarketDataEvent,
data_handler: DataHandler) -> Optional[SignalEvent]:
"""Process new market data and optionally emit a signal."""
pass
class Portfolio(ABC):
"""Manages positions and converts signals to orders."""
@abstractmethod
def on_signal(self, signal: SignalEvent) -> Optional[OrderEvent]:
"""Process a signal and optionally emit an order."""
pass
@abstractmethod
def on_fill(self, fill: FillEvent) -> None:
"""Update positions based on a fill."""
pass
@abstractmethod
def get_equity(self) -> float:
"""Return current portfolio equity."""
pass
class ExecutionSimulator(ABC):
"""Simulates realistic order execution."""
@abstractmethod
def execute(self, order: OrderEvent,
market_data: MarketDataEvent) -> Optional[FillEvent]:
"""Simulate executing an order given current market conditions."""
pass
class Analyzer(ABC):
"""Computes performance metrics and generates reports."""
@abstractmethod
def record_equity(self, timestamp: datetime, equity: float):
"""Record a portfolio equity snapshot."""
pass
@abstractmethod
def record_trade(self, fill: FillEvent):
"""Record a completed trade."""
pass
@abstractmethod
def compute_metrics(self) -> Dict:
"""Compute all performance metrics."""
pass
26.3.4 The Backtesting Engine
The engine ties all components together:
class BacktestEngine:
"""Main backtesting engine that coordinates all components."""
def __init__(self, data_handler: DataHandler,
strategy: Strategy,
portfolio: Portfolio,
execution_sim: ExecutionSimulator,
analyzer: Analyzer):
self.data_handler = data_handler
self.strategy = strategy
self.portfolio = portfolio
self.execution_sim = execution_sim
self.analyzer = analyzer
self.event_log: List = []
def run(self) -> Dict:
"""Run the backtest and return results."""
iteration = 0
while self.data_handler.has_next():
# Step 1: Get next market data event
market_event = self.data_handler.get_next()
self.event_log.append(market_event)
# Step 2: Strategy processes market data
signal = self.strategy.on_market_data(
market_event, self.data_handler
)
if signal is not None:
self.event_log.append(signal)
# Step 3: Portfolio converts signal to order
order = self.portfolio.on_signal(signal)
if order is not None:
self.event_log.append(order)
# Step 4: Execution simulator fills the order
fill = self.execution_sim.execute(
order, market_event
)
if fill is not None:
self.event_log.append(fill)
# Step 5: Update portfolio with fill
self.portfolio.on_fill(fill)
self.analyzer.record_trade(fill)
# Record equity at each timestep
equity = self.portfolio.get_equity()
self.analyzer.record_equity(
market_event.timestamp, equity
)
iteration += 1
return self.analyzer.compute_metrics()
This architecture enforces the critical invariant: the strategy only sees data that has already been emitted by the data handler. There is no way to introduce lookahead bias without deliberately breaking the framework's API.
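A wiring sketch is shown below. PointInTimeDataHandler (Section 26.4.3) and RealisticExecutionSimulator (Section 26.5.4) appear later in this chapter, while MeanReversionStrategy, SimplePortfolio, and BasicAnalyzer are hypothetical concrete subclasses you would implement against the abstract base classes above.
# Illustrative wiring only; MeanReversionStrategy, SimplePortfolio, and
# BasicAnalyzer are hypothetical subclasses of Strategy, Portfolio, Analyzer.
def run_backtest(price_df, meta_df):
    data_handler = PointInTimeDataHandler(df=price_df, market_metadata=meta_df)
    engine = BacktestEngine(
        data_handler=data_handler,
        strategy=MeanReversionStrategy(lookback=20, band=0.05),
        portfolio=SimplePortfolio(initial_cash=10_000),
        execution_sim=RealisticExecutionSimulator(fee_rate=0.02),
        analyzer=BasicAnalyzer(),
    )
    return engine.run()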
26.4 Data Requirements and Preparation
26.4.1 What Data You Need
Prediction market data differs from traditional financial data. Here is the complete data schema:
Market-Level Data (Static):
| Field | Type | Description |
|---|---|---|
| market_id | string | Unique identifier |
| question | string | The market question |
| category | string | Political, sports, crypto, etc. |
| creation_date | datetime | When the market was created |
| close_date | datetime | When trading ends |
| resolution_date | datetime | When outcome is determined |
| resolution | float | Final outcome (0 or 1 for binary) |
| platform | string | Which exchange |
| min_tick | float | Minimum price increment |
Time-Series Data (Dynamic):
| Field | Type | Description |
|---|---|---|
| timestamp | datetime | Observation time |
| market_id | string | Market identifier |
| last_price | float | Last traded price |
| bid | float | Best bid price |
| ask | float | Best ask price |
| bid_size | float | Quantity at best bid |
| ask_size | float | Quantity at best ask |
| volume | float | Contracts traded this period |
| open_interest | int | Total open positions |
26.4.2 Data Cleaning
Raw prediction market data is messy. Common issues and their solutions:
def clean_prediction_market_data(df: pd.DataFrame) -> pd.DataFrame:
"""Clean raw prediction market time-series data."""
df = df.copy()
# 1. Remove duplicates
df = df.drop_duplicates(subset=['timestamp', 'market_id'])
# 2. Sort chronologically
df = df.sort_values(['market_id', 'timestamp'])
# 3. Enforce price bounds [0, 1] for binary markets
for col in ['last_price', 'bid', 'ask']:
if col in df.columns:
df[col] = df[col].clip(0.0, 1.0)
# 4. Ensure bid <= ask (fix crossed quotes)
if 'bid' in df.columns and 'ask' in df.columns:
crossed = df['bid'] > df['ask']
if crossed.any():
# Swap crossed quotes
df.loc[crossed, ['bid', 'ask']] = (
df.loc[crossed, ['ask', 'bid']].values
)
# 5. Forward-fill missing prices within each market
df['last_price'] = df.groupby('market_id')['last_price'].ffill()
# 6. Remove markets with too little data
market_counts = df.groupby('market_id').size()
valid_markets = market_counts[market_counts >= 10].index
df = df[df['market_id'].isin(valid_markets)]
# 7. Handle zero or negative volume
if 'volume' in df.columns:
df['volume'] = df['volume'].clip(lower=0)
# 8. Flag stale data (no price change for extended period)
df['price_change'] = df.groupby('market_id')['last_price'].diff()
df['is_stale'] = (
df.groupby('market_id')['price_change']
.transform(lambda x: x.abs().rolling(24, min_periods=1).sum() == 0)
)
# 9. Drop the helper column
df = df.drop(columns=['price_change'])
return df
26.4.3 Point-in-Time Databases
A point-in-time database stores data exactly as it was known at each historical moment, including corrections and revisions. This is critical for avoiding lookahead bias in data that gets revised after the fact.
For prediction markets, the key point-in-time considerations are:
- Market metadata changes. Resolution criteria sometimes get clarified after market creation. Your backtest should use the criteria as they were known at each point in time.
- Price corrections. Some platforms adjust prices after erroneous trades. Your backtest should use the original prices, since those are what you would have traded on.
- Platform rule changes. Fee structures, position limits, and trading hours change over time.
class PointInTimeDataHandler(DataHandler):
"""Data handler that enforces point-in-time correctness."""
def __init__(self, df: pd.DataFrame, market_metadata: pd.DataFrame):
"""
df: Time-series data with columns:
timestamp, market_id, last_price, bid, ask, volume, ...
market_metadata: Market-level data with columns:
market_id, creation_date, close_date, resolution_date,
resolution, ...
"""
self.df = df.sort_values('timestamp').reset_index(drop=True)
self.metadata = market_metadata
self.current_index = 0
self.history: Dict[str, List[MarketDataEvent]] = {}
def has_next(self) -> bool:
return self.current_index < len(self.df)
def get_next(self) -> MarketDataEvent:
row = self.df.iloc[self.current_index]
self.current_index += 1
event = MarketDataEvent(
timestamp=row['timestamp'],
market_id=row['market_id'],
price=row['last_price'],
bid=row.get('bid', row['last_price'] - 0.02),
ask=row.get('ask', row['last_price'] + 0.02),
volume=row.get('volume', 0),
bid_size=row.get('bid_size', 0),
ask_size=row.get('ask_size', 0),
)
# Store in history (only past data accessible)
market_id = event.market_id
if market_id not in self.history:
self.history[market_id] = []
self.history[market_id].append(event)
return event
def get_latest(self, market_id: str,
n: int = 1) -> List[MarketDataEvent]:
"""Return only historical data --- no lookahead possible."""
if market_id not in self.history:
return []
return self.history[market_id][-n:]
def get_active_markets(self, as_of: datetime) -> List[str]:
"""Return markets that are active (created but not yet
resolved) as of the given timestamp."""
active = self.metadata[
(self.metadata['creation_date'] <= as_of) &
(self.metadata['resolution_date'] > as_of)
]
return active['market_id'].tolist()
26.4.4 Data Sources
Where to obtain prediction market data for backtesting:
| Source | Type | Coverage | Access |
|---|---|---|---|
| Polymarket API | REST/WebSocket | Crypto, politics, events | Free API |
| Kalshi API | REST | Economics, weather, events | Free API |
| Metaculus | REST | Science, geopolitics | Free API |
| PredictIt (historical) | CSV downloads | US politics | Public datasets |
| Manifold Markets | REST | Broad coverage | Free API |
| Academic datasets | Various | Historical elections | Research archives |
When building your own data pipeline, ensure you capture snapshots at regular intervals rather than relying solely on trade data. Many prediction markets have long periods with no trades but meaningful bid-ask changes.
26.5 Fill Simulation and Execution Modeling
26.5.1 Why Fill Simulation Matters
The gap between "the price was 0.45" and "I could have bought at 0.45" is where many promising strategies die. In prediction markets, this gap can be enormous because:
- Liquidity is thin. Many markets have only a few hundred dollars on each side of the book.
- Spreads are wide. A 3--5 cent spread on a binary contract is 3--5% of the contract's value.
- Order books are shallow. Your order may consume all available liquidity at the best price and "walk" the book.
- Latency exists. Between signal generation and order execution, the market may move.
26.5.2 Components of Execution Cost
The total cost of executing a trade is:
$$C_{total} = C_{spread} + C_{impact} + C_{slippage} + C_{fees}$$
Where:
- $C_{spread}$: The cost of crossing the bid-ask spread. For a buy order, you pay the ask price rather than the mid-price.
- $C_{impact}$: The permanent price impact of your order on the market. Larger orders move the price more.
- $C_{slippage}$: The difference between the expected execution price and the actual execution price due to price movement during execution.
- $C_{fees}$: Platform trading fees and settlement costs.
26.5.3 Slippage Models
Constant Slippage Model:
The simplest approach: assume a fixed number of cents of slippage per trade.
$$P_{execution} = P_{observed} + s \cdot \text{side}$$
Where $s$ is the constant slippage (e.g., 0.01 for 1 cent) and side is +1 for buys, -1 for sells.
Spread-Based Slippage Model:
More realistic: execute at the bid (for sells) or ask (for buys), plus additional impact.
$$P_{buy} = P_{ask} + \alpha \cdot \frac{Q}{Q_{ask}}$$
$$P_{sell} = P_{bid} - \alpha \cdot \frac{Q}{Q_{bid}}$$
Where $Q$ is the order quantity, $Q_{ask}$ and $Q_{bid}$ are the available sizes, and $\alpha$ is the market impact coefficient.
Square-Root Impact Model:
For larger orders, empirical research suggests impact scales with the square root of order size:
$$C_{impact} = \sigma \cdot \beta \cdot \sqrt{\frac{Q}{V}}$$
Where $\sigma$ is the price volatility, $\beta$ is a calibration constant, $Q$ is order size, and $V$ is daily volume.
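The sketch below translates these three models into code using the same symbols as the equations; the default parameter values are illustrative rather than calibrated, and the small denominators guard against division by zero.
import numpy as np
def constant_slippage_price(observed_price: float, side: int,
                            s: float = 0.01) -> float:
    """Constant slippage: fixed s per contract (side = +1 buy, -1 sell)."""
    return observed_price + s * side
def spread_based_price(bid: float, ask: float, side: int,
                       quantity: float, bid_size: float, ask_size: float,
                       alpha: float = 0.02) -> float:
    """Spread-based: cross the spread plus linear impact in Q / Q_available."""
    if side == +1:
        return ask + alpha * quantity / max(ask_size, 1e-9)
    return bid - alpha * quantity / max(bid_size, 1e-9)
def sqrt_impact_cost(sigma: float, beta: float,
                     quantity: float, daily_volume: float) -> float:
    """Square-root impact: cost scales with sqrt(Q / V)."""
    return sigma * beta * np.sqrt(quantity / max(daily_volume, 1e-9))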
26.5.4 Python Execution Simulator
class RealisticExecutionSimulator(ExecutionSimulator):
"""Simulates realistic execution with slippage, impact, and fees."""
def __init__(self,
fee_rate: float = 0.02,
impact_coeff: float = 0.1,
latency_ms: float = 100,
partial_fill_prob: float = 0.1,
use_sqrt_impact: bool = True):
"""
fee_rate: Transaction fee as fraction of notional
impact_coeff: Market impact coefficient (beta)
latency_ms: Simulated latency in milliseconds
partial_fill_prob: Probability of partial fill
use_sqrt_impact: Use square-root impact model
"""
self.fee_rate = fee_rate
self.impact_coeff = impact_coeff
self.latency_ms = latency_ms
self.partial_fill_prob = partial_fill_prob
self.use_sqrt_impact = use_sqrt_impact
def execute(self, order: OrderEvent,
market_data: MarketDataEvent) -> Optional[FillEvent]:
"""Simulate order execution with realistic costs."""
# Step 1: Determine base execution price
if order.side == OrderSide.BUY:
base_price = market_data.ask # Pay the ask
available_size = market_data.ask_size
else:
base_price = market_data.bid # Receive the bid
available_size = market_data.bid_size
# Step 2: Check if limit order would fill
if order.order_type == OrderType.LIMIT:
if order.side == OrderSide.BUY:
if order.limit_price < base_price:
return None # Would not fill
else:
if order.limit_price > base_price:
return None # Would not fill
# Step 3: Determine fill quantity
if available_size > 0:
fill_qty = min(order.quantity, available_size)
else:
# If no size data, assume we can fill but with impact
fill_qty = order.quantity
# Simulate partial fills
if np.random.random() < self.partial_fill_prob:
fill_qty = fill_qty * np.random.uniform(0.3, 0.9)
fill_qty = max(1, int(fill_qty))
# Step 4: Calculate market impact
if self.use_sqrt_impact and available_size > 0:
# Square-root impact model
participation_rate = fill_qty / max(available_size, 1)
impact = self.impact_coeff * np.sqrt(participation_rate)
else:
# Linear impact model
if available_size > 0:
impact = self.impact_coeff * (
fill_qty / available_size
)
else:
impact = self.impact_coeff * 0.5
# Step 5: Calculate execution price
if order.side == OrderSide.BUY:
fill_price = base_price + impact
fill_price = min(fill_price, 0.99) # Cap at 0.99
else:
fill_price = base_price - impact
fill_price = max(fill_price, 0.01) # Floor at 0.01
# Step 6: For limit orders, cap fill price
if order.order_type == OrderType.LIMIT:
if order.side == OrderSide.BUY:
fill_price = min(fill_price, order.limit_price)
else:
fill_price = max(fill_price, order.limit_price)
# Step 7: Calculate commission
notional = fill_qty * fill_price
commission = notional * self.fee_rate
# Step 8: Calculate slippage
mid_price = (market_data.bid + market_data.ask) / 2
if order.side == OrderSide.BUY:
slippage = fill_price - mid_price
else:
slippage = mid_price - fill_price
return FillEvent(
timestamp=order.timestamp,
market_id=order.market_id,
side=order.side,
quantity=fill_qty,
fill_price=fill_price,
commission=commission,
slippage=slippage,
)
26.5.5 Validating Your Fill Model
How do you know if your fill simulation is realistic? Compare simulated fills to actual fills:
def validate_fill_model(simulated_fills: List[FillEvent],
actual_fills: List[Dict]) -> Dict:
"""Compare simulated fills against actual execution data."""
sim_prices = [f.fill_price for f in simulated_fills]
actual_prices = [f['fill_price'] for f in actual_fills]
# Price deviation
deviations = [abs(s - a) for s, a in zip(sim_prices, actual_prices)]
metrics = {
'mean_deviation': np.mean(deviations),
'median_deviation': np.median(deviations),
'max_deviation': np.max(deviations),
'correlation': np.corrcoef(sim_prices, actual_prices)[0, 1],
'sim_avg_cost': np.mean([f.slippage + f.commission
for f in simulated_fills]),
}
return metrics
26.6 Transaction Cost Modeling
26.6.1 Fee Structures by Platform
Different prediction market platforms have different fee structures. Your backtest must model the correct fees for the platform you intend to trade on.
| Platform | Trading Fee | Settlement Fee | Other |
|---|---|---|---|
| Polymarket | 0% maker / ~2% taker | None | Gas fees for on-chain |
| Kalshi | 0% maker / variable taker | None | Withdrawal fees |
| PredictIt | 5% on profits | 5% on withdrawals | $850 position limit |
| Manifold | Play money | N/A | N/A |
26.6.2 Total Cost Components
@dataclass
class TransactionCostModel:
"""Models all components of transaction costs."""
# Platform fees
maker_fee_rate: float = 0.00 # Fee for providing liquidity
taker_fee_rate: float = 0.02 # Fee for taking liquidity
settlement_fee_rate: float = 0.0 # Fee on winning resolution
withdrawal_fee_rate: float = 0.0 # Fee on withdrawals
# Market impact
spread_cost_model: str = "empirical" # or "fixed"
fixed_half_spread: float = 0.02 # Fixed half-spread
# Opportunity cost
risk_free_rate: float = 0.05 # Annual risk-free rate
def calculate_total_cost(self,
order_side: OrderSide,
quantity: float,
fill_price: float,
bid: float,
ask: float,
is_maker: bool,
holding_days: float,
profit: float = 0.0) -> Dict[str, float]:
"""Calculate complete transaction costs for a trade."""
notional = quantity * fill_price
# 1. Trading fee
fee_rate = self.maker_fee_rate if is_maker else self.taker_fee_rate
trading_fee = notional * fee_rate
# 2. Spread cost (cost of crossing the spread)
mid = (bid + ask) / 2
if self.spread_cost_model == "empirical":
if order_side == OrderSide.BUY:
spread_cost = (fill_price - mid) * quantity
else:
spread_cost = (mid - fill_price) * quantity
else:
spread_cost = self.fixed_half_spread * quantity
spread_cost = max(spread_cost, 0)
# 3. Settlement fee (on winning trades)
settlement_fee = max(profit, 0) * self.settlement_fee_rate
# 4. Opportunity cost of capital lock-up
# Capital locked = max possible loss
if order_side == OrderSide.BUY:
capital_locked = fill_price * quantity
else:
capital_locked = (1.0 - fill_price) * quantity
daily_risk_free = (1 + self.risk_free_rate) ** (1/365) - 1
opportunity_cost = capital_locked * daily_risk_free * holding_days
total = trading_fee + spread_cost + settlement_fee + opportunity_cost
return {
'trading_fee': trading_fee,
'spread_cost': spread_cost,
'settlement_fee': settlement_fee,
'opportunity_cost': opportunity_cost,
'total_cost': total,
'cost_as_pct_notional': total / notional if notional > 0 else 0,
}
26.6.3 Impact of Costs on Strategy Viability
Let us quantify how transaction costs affect strategy performance:
def cost_sensitivity_analysis(gross_returns: np.ndarray,
cost_scenarios: List[float],
trades_per_year: int) -> pd.DataFrame:
"""Analyze strategy viability across different cost assumptions."""
results = []
for cost_per_trade in cost_scenarios:
total_annual_cost = cost_per_trade * trades_per_year
net_returns = gross_returns - cost_per_trade
annual_net = net_returns.mean() * trades_per_year
annual_vol = net_returns.std() * np.sqrt(trades_per_year)
sharpe = annual_net / annual_vol if annual_vol > 0 else 0
results.append({
'cost_per_trade': cost_per_trade,
'annual_cost': total_annual_cost,
'annual_net_return': annual_net,
'sharpe_ratio': sharpe,
'profitable': annual_net > 0,
})
return pd.DataFrame(results)
# Example: Strategy with 2% gross return per trade
gross_returns = np.random.normal(0.02, 0.05, 1000) # 2% mean, 5% vol
cost_scenarios = [0.001, 0.005, 0.01, 0.02, 0.03, 0.05]
results = cost_sensitivity_analysis(gross_returns, cost_scenarios,
trades_per_year=200)
print("Cost Sensitivity Analysis:")
print(results.to_string(index=False))
This analysis often reveals that strategies that appear highly profitable under zero-cost assumptions become marginal or unprofitable when realistic costs are included. This is especially true in prediction markets where spreads are wide and fees can be substantial.
26.6.4 The Opportunity Cost Problem
A unique challenge in prediction markets is the opportunity cost of capital lock-up. When you buy a YES contract at $0.30, you lock up $0.30 per contract until the market resolves. If the market does not resolve for six months, that capital is earning 0% while it could be earning the risk-free rate (or deployed in other opportunities).
For longer-duration markets, this opportunity cost can be substantial:
$$C_{opportunity} = P_{entry} \times Q \times r_f \times T$$
Where $P_{entry}$ is the entry price, $Q$ is the quantity, $r_f$ is the annual risk-free rate, and $T$ is the time to resolution in years.
A position bought at $0.50 held for 6 months with a 5% risk-free rate incurs an opportunity cost of $0.50 \times 0.05 \times 0.5 = \$0.0125$ per contract --- seemingly small, but this is 1.25% of the contract value. For strategies with thin edges, this cost matters.
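A one-line helper for this formula, reproducing the worked example:
def opportunity_cost(entry_price: float, quantity: float,
                     risk_free_rate: float, years_to_resolution: float) -> float:
    """Simple (non-compounded) opportunity cost of capital locked in a position."""
    return entry_price * quantity * risk_free_rate * years_to_resolution
# Worked example from the text: $0.50 entry, 1 contract, 5% rate, 6 months
print(opportunity_cost(0.50, 1, 0.05, 0.5))  # 0.0125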
26.7 Walk-Forward Backtesting
26.7.1 The Problem with Simple Backtesting
A simple backtest optimizes parameters on the entire dataset and reports performance on that same dataset. This conflates in-sample and out-of-sample performance and maximizes overfitting risk.
The solution is walk-forward backtesting, which systematically separates training and testing periods.
26.7.2 Walk-Forward Methodology
The walk-forward procedure divides the historical data into sequential segments:
Time -->
|---Train 1---|--Test 1--|
|---Train 2---|--Test 2--|
|---Train 3---|--Test 3--|
|---Train 4---|--Test 4--|
At each step:
- Train (in-sample): Optimize strategy parameters on the training window.
- Test (out-of-sample): Apply the optimized parameters to the next unseen period.
- Advance: Slide the window forward and repeat.
The final performance metric is the concatenation of all out-of-sample test periods. This ensures that every data point in the final performance track record was genuinely out-of-sample at the time of evaluation.
26.7.3 Anchored vs. Rolling Walk-Forward
Rolling walk-forward: The training window is a fixed size and slides forward. Older data drops out of the training set.
Anchored walk-forward: The training window always starts from the beginning of the data. As time progresses, the training set grows.
| Approach | Training Data | Advantage | Disadvantage |
|---|---|---|---|
| Rolling | Fixed-size window | Adapts to regime changes | Less data for training |
| Anchored | All data up to test | More training data | Slow to adapt to changes |
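The window arithmetic itself is simple and worth seeing in isolation before the full engine below. This helper (illustrative, not part of the engine's API) yields rolling or anchored train/test index bounds:
from typing import Iterator, Tuple
def walk_forward_windows(n: int, train_size: int, test_size: int,
                         anchored: bool = False
                         ) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (train_start, train_end, test_start, test_end) index bounds."""
    start = 0
    while start + train_size + test_size <= n:
        train_start = 0 if anchored else start
        train_end = start + train_size
        yield train_start, train_end, train_end, train_end + test_size
        start += test_size  # advance by one test window
# Example: 1000 observations, 250-period training, 50-period testing
windows = list(walk_forward_windows(1000, 250, 50))
print(windows[:2])  # [(0, 250, 250, 300), (50, 300, 300, 350)]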
26.7.4 Python Walk-Forward Engine
class WalkForwardEngine:
"""Walk-forward backtesting with parameter optimization."""
def __init__(self,
data: pd.DataFrame,
strategy_class: type,
param_grid: Dict[str, List],
train_size: int,
test_size: int,
anchored: bool = False,
optimization_metric: str = 'sharpe_ratio'):
"""
data: DataFrame with market data
strategy_class: Strategy class to instantiate
param_grid: Dict of parameter names to lists of values
train_size: Number of periods in training window
test_size: Number of periods in test window
anchored: If True, training starts from beginning
optimization_metric: Metric to maximize during training
"""
self.data = data
self.strategy_class = strategy_class
self.param_grid = param_grid
self.train_size = train_size
self.test_size = test_size
self.anchored = anchored
self.optimization_metric = optimization_metric
def _generate_param_combinations(self) -> List[Dict]:
"""Generate all combinations of parameters."""
from itertools import product
keys = list(self.param_grid.keys())
values = list(self.param_grid.values())
combinations = []
for combo in product(*values):
combinations.append(dict(zip(keys, combo)))
return combinations
def _evaluate_params(self, train_data: pd.DataFrame,
params: Dict) -> float:
"""Evaluate a parameter set on training data.
Returns the optimization metric value."""
# Create strategy with these parameters
strategy = self.strategy_class(**params)
# Run simple vectorized backtest on training data
signals = strategy.generate_signals(train_data)
returns = signals.shift(1) * train_data['return']
returns = returns.dropna()
if len(returns) == 0 or returns.std() == 0:
return -np.inf
if self.optimization_metric == 'sharpe_ratio':
return returns.mean() / returns.std() * np.sqrt(252)
elif self.optimization_metric == 'total_return':
return returns.sum()
elif self.optimization_metric == 'calmar_ratio':
cumulative = (1 + returns).cumprod()
max_dd = (cumulative / cumulative.cummax() - 1).min()
annual_return = returns.mean() * 252
return annual_return / abs(max_dd) if max_dd != 0 else 0
else:
return returns.mean() / returns.std() * np.sqrt(252)
def run(self) -> Dict:
"""Run walk-forward analysis."""
n = len(self.data)
param_combinations = self._generate_param_combinations()
results = {
'windows': [],
'best_params': [],
'in_sample_metrics': [],
'out_of_sample_returns': [],
}
step = 0
start_idx = 0
while start_idx + self.train_size + self.test_size <= n:
# Define train and test windows
if self.anchored:
train_start = 0
else:
train_start = start_idx
train_end = start_idx + self.train_size
test_start = train_end
test_end = min(test_start + self.test_size, n)
train_data = self.data.iloc[train_start:train_end]
test_data = self.data.iloc[test_start:test_end]
# Optimize parameters on training data
best_metric = -np.inf
best_params = None
for params in param_combinations:
metric = self._evaluate_params(train_data, params)
if metric > best_metric:
best_metric = metric
best_params = params
# Apply best parameters to test data
strategy = self.strategy_class(**best_params)
test_signals = strategy.generate_signals(test_data)
test_returns = test_signals.shift(1) * test_data['return']
test_returns = test_returns.dropna()
results['windows'].append({
'step': step,
'train_start': self.data.index[train_start],
'train_end': self.data.index[train_end - 1],
'test_start': self.data.index[test_start],
'test_end': self.data.index[test_end - 1],
})
results['best_params'].append(best_params)
results['in_sample_metrics'].append(best_metric)
results['out_of_sample_returns'].append(test_returns)
# Advance the window
start_idx += self.test_size
step += 1
# Concatenate all out-of-sample returns
if results['out_of_sample_returns']:
all_oos_returns = pd.concat(results['out_of_sample_returns'])
results['overall_metrics'] = {
'total_return': (1 + all_oos_returns).prod() - 1,
'annual_return': all_oos_returns.mean() * 252,
'sharpe_ratio': (all_oos_returns.mean() /
all_oos_returns.std() * np.sqrt(252)
if all_oos_returns.std() > 0 else 0),
'max_drawdown': self._max_drawdown(all_oos_returns),
'num_trades': (all_oos_returns != 0).sum(),
'num_windows': step,
}
# Check parameter stability
results['parameter_stability'] = self._check_param_stability(
results['best_params']
)
return results
def _max_drawdown(self, returns: pd.Series) -> float:
"""Calculate maximum drawdown from return series."""
cumulative = (1 + returns).cumprod()
rolling_max = cumulative.cummax()
drawdown = cumulative / rolling_max - 1
return drawdown.min()
def _check_param_stability(self,
param_history: List[Dict]) -> Dict:
"""Check if optimal parameters are stable across windows."""
stability = {}
if not param_history:
return stability
keys = param_history[0].keys()
for key in keys:
values = [p[key] for p in param_history]
stability[key] = {
'values': values,
'unique_count': len(set(values)),
'most_common': max(set(values), key=values.count),
'stability_ratio': values.count(
max(set(values), key=values.count)
) / len(values),
}
return stability
26.7.5 Interpreting Walk-Forward Results
Key metrics to examine from walk-forward analysis:
- In-sample vs. out-of-sample performance gap. A large gap suggests overfitting. If in-sample Sharpe is 3.0 but out-of-sample Sharpe is 0.3, the strategy is overfit.
- Parameter stability. If the optimal parameters change dramatically from one window to the next, the strategy is likely fitting noise. Stable parameters suggest a genuine, persistent edge.
- Performance consistency across windows. A strategy that makes all its money in one window and loses in the rest is not robust.
- Degradation over time. If each successive window shows worse performance, the market inefficiency may be closing.
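The sketch below shows one way these diagnostics might be computed from the results dictionary returned by WalkForwardEngine.run(); the in-sample/out-of-sample comparison assumes the engine was run with optimization_metric='sharpe_ratio' so the two numbers are on the same scale.
import numpy as np
def summarize_walk_forward(results: dict) -> dict:
    """Overfitting diagnostics from the WalkForwardEngine results dictionary."""
    # Out-of-sample Sharpe per test window (annualized with 252, as in the engine)
    oos_sharpes = [
        r.mean() / r.std() * np.sqrt(252)
        for r in results['out_of_sample_returns']
        if len(r) > 1 and r.std() > 0
    ]
    avg_in_sample = np.mean(results['in_sample_metrics'])
    avg_out_sample = np.mean(oos_sharpes) if oos_sharpes else float('nan')
    return {
        'avg_in_sample_metric': avg_in_sample,
        'avg_out_of_sample_sharpe': avg_out_sample,
        # Large positive gap suggests overfitting (assumes sharpe_ratio metric)
        'is_oos_gap': avg_in_sample - avg_out_sample,
        # Fraction of windows choosing the most common value for each parameter
        'parameter_stability': {
            k: v['stability_ratio']
            for k, v in results.get('parameter_stability', {}).items()
        },
    }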
26.8 Performance Metrics for Prediction Market Strategies
26.8.1 Return Metrics
Total Return:
$$R_{total} = \frac{V_{final} - V_{initial}}{V_{initial}}$$
Annualized Return (CAGR):
$$R_{annual} = \left(\frac{V_{final}}{V_{initial}}\right)^{252/n} - 1$$
Where $n$ is the number of trading days. Note: prediction markets trade 24/7, so you might use 365 instead of 252.
Log Return:
$$r = \ln\left(\frac{V_{final}}{V_{initial}}\right)$$
Log returns are additive across time, making them more convenient for statistical analysis.
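A compact helper tying the three return formulas together; the 365-period annualization and the example values are illustrative:
import numpy as np
def return_metrics(v_initial: float, v_final: float,
                   n_periods: int, periods_per_year: int = 365) -> dict:
    """Total return, annualized (CAGR) return, and log return."""
    total = v_final / v_initial - 1
    cagr = (v_final / v_initial) ** (periods_per_year / n_periods) - 1
    log_return = np.log(v_final / v_initial)
    return {'total_return': total, 'annualized_return': cagr,
            'log_return': log_return}
# Example: $10,000 grows to $11,500 over 180 days of 24/7 trading
print(return_metrics(10_000, 11_500, n_periods=180))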
26.8.2 Risk Metrics
Maximum Drawdown:
The maximum drawdown (MDD) is the largest peak-to-trough decline in portfolio value:
$$MDD = \min_{t} \left(\frac{V_t}{\max_{s \leq t} V_s} - 1\right)$$
This is arguably the most important risk metric for practitioners because it represents the worst psychological and financial pain the strategy inflicts.
Volatility (Standard Deviation of Returns):
$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (r_i - \bar{r})^2}$$
Annualized: $\sigma_{annual} = \sigma_{daily} \times \sqrt{252}$
Value at Risk (VaR):
The loss that is exceeded with probability $\alpha$ (typically 5%):
$$VaR_\alpha = -\text{quantile}(r, \alpha)$$
Conditional VaR (CVaR / Expected Shortfall):
The expected loss given that the loss exceeds VaR:
$$CVaR_\alpha = -E[r \mid r < -VaR_\alpha]$$
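A short NumPy sketch of these risk formulas, assuming an array of per-period returns r; the 365 in the annualization reflects 24/7 prediction market trading and is an assumption you may want to change:
import numpy as np
def risk_metrics(r: np.ndarray, alpha: float = 0.05) -> dict:
    """Maximum drawdown, annualized volatility, VaR, and CVaR from returns."""
    equity = np.cumprod(1 + r)
    drawdown = equity / np.maximum.accumulate(equity) - 1
    var = -np.percentile(r, alpha * 100)           # VaR_alpha
    tail = r[r <= -var]
    cvar = -tail.mean() if len(tail) > 0 else var  # expected shortfall
    return {
        'max_drawdown': drawdown.min(),
        'volatility_annual': r.std(ddof=1) * np.sqrt(365),
        'var': var,
        'cvar': cvar,
    }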
26.8.3 Risk-Adjusted Return Metrics
Sharpe Ratio:
$$S = \frac{R_p - R_f}{\sigma_p}$$
Where $R_p$ is the portfolio return, $R_f$ is the risk-free rate, and $\sigma_p$ is the portfolio standard deviation. Annualized from daily values (with $\bar{r}$ and $r_f$ expressed per day):
$$S_{annual} = \frac{(\bar{r} - r_f) \times 252}{\sigma_{daily} \times \sqrt{252}} = \frac{\bar{r} - r_f}{\sigma_{daily}} \times \sqrt{252}$$
Interpretation of Sharpe ratios in prediction markets:
| Sharpe | Interpretation |
|---|---|
| < 0 | Losing money |
| 0.0 -- 0.5 | Poor (not worth the effort) |
| 0.5 -- 1.0 | Acceptable for a simple strategy |
| 1.0 -- 2.0 | Good --- genuine edge likely present |
| 2.0 -- 3.0 | Excellent --- verify carefully for errors |
| > 3.0 | Suspicious --- likely a backtest error or overfitting |
Sortino Ratio:
Like Sharpe, but penalizes only downside volatility:
$$Sortino = \frac{R_p - R_f}{\sigma_{downside}}$$
Where $\sigma_{downside} = \sqrt{\frac{1}{n}\sum_{r_i < 0} r_i^2}$
Calmar Ratio:
$$Calmar = \frac{R_{annual}}{|MDD|}$$
Measures return per unit of maximum drawdown. Particularly relevant for strategies where drawdowns determine survival.
26.8.4 Trade-Level Metrics
Win Rate:
$$W = \frac{\text{Number of winning trades}}{\text{Total number of trades}}$$
Profit Factor:
$$PF = \frac{\sum \text{Winning trade profits}}{\sum |\text{Losing trade losses}|}$$
A profit factor above 1.0 means the strategy is profitable. Above 1.5 is good. Above 2.0 is excellent.
Expectancy (Average Profit per Trade):
$$E = W \times \bar{G} - (1 - W) \times \bar{L}$$
Where $\bar{G}$ is the average win and $\bar{L}$ is the average loss. This tells you how much you expect to make on each trade.
Average Win / Average Loss Ratio:
$$\text{Win/Loss Ratio} = \frac{\bar{G}}{\bar{L}}$$
Combined with win rate, this tells the full story:
- High win rate + low W/L ratio = Many small wins, few large losses (prone to blow-ups).
- Low win rate + high W/L ratio = Few large wins, many small losses (requires psychological resilience).
26.8.5 Python Metrics Suite
class PerformanceMetrics:
"""Comprehensive performance metrics for prediction market strategies."""
def __init__(self, equity_curve: pd.Series,
trades: List[Dict],
risk_free_rate: float = 0.05,
periods_per_year: int = 365):
"""
equity_curve: Series indexed by timestamp with portfolio values
trades: List of trade dicts with 'pnl', 'entry_time', 'exit_time'
risk_free_rate: Annual risk-free rate
periods_per_year: 252 for daily trading days, 365 for calendar
"""
self.equity = equity_curve
self.trades = trades
self.rf = risk_free_rate
self.ppy = periods_per_year
self.returns = equity_curve.pct_change().dropna()
def total_return(self) -> float:
return (self.equity.iloc[-1] / self.equity.iloc[0]) - 1
def annualized_return(self) -> float:
n_periods = len(self.returns)
total = 1 + self.total_return()
return total ** (self.ppy / n_periods) - 1
def volatility(self) -> float:
return self.returns.std() * np.sqrt(self.ppy)
def sharpe_ratio(self) -> float:
excess_return = self.annualized_return() - self.rf
vol = self.volatility()
return excess_return / vol if vol > 0 else 0
def sortino_ratio(self) -> float:
downside = self.returns[self.returns < 0]
if len(downside) == 0:
return np.inf
downside_vol = np.sqrt((downside ** 2).mean()) * np.sqrt(self.ppy)
excess_return = self.annualized_return() - self.rf
return excess_return / downside_vol if downside_vol > 0 else 0
def max_drawdown(self) -> float:
cumulative = (1 + self.returns).cumprod()
rolling_max = cumulative.cummax()
drawdown = cumulative / rolling_max - 1
return drawdown.min()
def max_drawdown_duration(self) -> int:
"""Maximum number of periods spent in drawdown."""
cumulative = (1 + self.returns).cumprod()
rolling_max = cumulative.cummax()
in_drawdown = cumulative < rolling_max
max_duration = 0
current_duration = 0
for is_dd in in_drawdown:
if is_dd:
current_duration += 1
max_duration = max(max_duration, current_duration)
else:
current_duration = 0
return max_duration
def calmar_ratio(self) -> float:
mdd = abs(self.max_drawdown())
if mdd == 0:
return np.inf
return self.annualized_return() / mdd
def var(self, alpha: float = 0.05) -> float:
return -np.percentile(self.returns, alpha * 100)
def cvar(self, alpha: float = 0.05) -> float:
var = self.var(alpha)
tail = self.returns[self.returns <= -var]
return -tail.mean() if len(tail) > 0 else var
def win_rate(self) -> float:
if not self.trades:
return 0
wins = sum(1 for t in self.trades if t['pnl'] > 0)
return wins / len(self.trades)
def profit_factor(self) -> float:
gross_profit = sum(t['pnl'] for t in self.trades if t['pnl'] > 0)
gross_loss = abs(sum(t['pnl'] for t in self.trades if t['pnl'] < 0))
return gross_profit / gross_loss if gross_loss > 0 else np.inf
def expectancy(self) -> float:
if not self.trades:
return 0
return np.mean([t['pnl'] for t in self.trades])
def avg_win_loss_ratio(self) -> float:
wins = [t['pnl'] for t in self.trades if t['pnl'] > 0]
losses = [abs(t['pnl']) for t in self.trades if t['pnl'] < 0]
avg_win = np.mean(wins) if wins else 0
avg_loss = np.mean(losses) if losses else 0
return avg_win / avg_loss if avg_loss > 0 else np.inf
def avg_holding_period(self) -> float:
"""Average time in position (in periods)."""
if not self.trades:
return 0
durations = []
for t in self.trades:
if 'entry_time' in t and 'exit_time' in t:
duration = (t['exit_time'] - t['entry_time']).total_seconds()
durations.append(duration / 86400) # Convert to days
return np.mean(durations) if durations else 0
def monthly_returns(self) -> pd.Series:
"""Compute monthly return series."""
monthly = self.equity.resample('M').last()
return monthly.pct_change().dropna()
def compute_all(self) -> Dict:
"""Compute all metrics and return as dictionary."""
return {
'total_return': self.total_return(),
'annualized_return': self.annualized_return(),
'volatility': self.volatility(),
'sharpe_ratio': self.sharpe_ratio(),
'sortino_ratio': self.sortino_ratio(),
'calmar_ratio': self.calmar_ratio(),
'max_drawdown': self.max_drawdown(),
'max_drawdown_duration': self.max_drawdown_duration(),
'var_5pct': self.var(0.05),
'cvar_5pct': self.cvar(0.05),
'win_rate': self.win_rate(),
'profit_factor': self.profit_factor(),
'expectancy': self.expectancy(),
'avg_win_loss_ratio': self.avg_win_loss_ratio(),
'avg_holding_period_days': self.avg_holding_period(),
'total_trades': len(self.trades),
}
26.9 Statistical Significance of Backtest Results
26.9.1 The Fundamental Question: Luck or Skill?
A backtest shows a 15% annual return and a Sharpe ratio of 1.2. Is this genuine alpha, or could a strategy with no edge have produced this result by chance?
This question is critical because the space of possible strategies is vast, and even random strategies will occasionally produce impressive-looking results. We need formal statistical tests to distinguish signal from noise.
26.9.2 The t-Test for Returns
The simplest test: is the mean return significantly different from zero?
$$t = \frac{\bar{r}}{\sigma_r / \sqrt{n}}$$
Where $\bar{r}$ is the mean return, $\sigma_r$ is the standard deviation of returns, and $n$ is the number of observations.
Under the null hypothesis of zero expected return, $t$ follows a Student's t-distribution with $n-1$ degrees of freedom.
Rule of thumb: You need at least 30 trades for this test to be meaningful, and 100+ trades for reliable results.
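A minimal sketch of this test using scipy.stats.ttest_1samp, converting to a one-sided p-value and encoding the 30-trade rule of thumb:
import numpy as np
from scipy import stats
def t_test_returns(trade_returns: np.ndarray, alpha: float = 0.05) -> dict:
    """One-sample t-test of the null hypothesis that the mean return is zero."""
    t_stat, p_two_sided = stats.ttest_1samp(trade_returns, popmean=0.0)
    # One-sided p-value (we only care about a positive edge)
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return {
        'n_trades': len(trade_returns),
        't_statistic': t_stat,
        'p_value_one_sided': p_one_sided,
        'significant': p_one_sided < alpha and len(trade_returns) >= 30,
    }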
26.9.3 Permutation Tests
A permutation test answers: "If the timing of my trades were random, how often would I see a result this good?"
The procedure:
- Calculate the actual strategy performance metric $M_{actual}$.
- Randomly shuffle the assignment of signals to returns (breaking the temporal relationship).
- Calculate the metric on the shuffled data: $M_{shuffled}$.
- Repeat steps 2--3 many times (e.g., 10,000 permutations).
- The p-value is the fraction of permutations where $M_{shuffled} \geq M_{actual}$.
def permutation_test(returns: np.ndarray,
signals: np.ndarray,
n_permutations: int = 10000,
metric: str = 'sharpe') -> Dict:
"""
Test if strategy performance is significantly better than random.
returns: Array of market returns
signals: Array of strategy signals (+1, -1, 0)
n_permutations: Number of random permutations
metric: 'sharpe', 'total_return', or 'profit_factor'
"""
# Calculate actual strategy returns
strategy_returns = signals * returns
# Calculate actual metric
if metric == 'sharpe':
actual_metric = (strategy_returns.mean() /
strategy_returns.std() * np.sqrt(252)
if strategy_returns.std() > 0 else 0)
elif metric == 'total_return':
actual_metric = strategy_returns.sum()
elif metric == 'profit_factor':
wins = strategy_returns[strategy_returns > 0].sum()
losses = abs(strategy_returns[strategy_returns < 0].sum())
actual_metric = wins / losses if losses > 0 else np.inf
# Generate permutation distribution
permuted_metrics = []
for _ in range(n_permutations):
# Shuffle signals (break temporal relationship)
perm_signals = np.random.permutation(signals)
perm_returns = perm_signals * returns
if metric == 'sharpe':
m = (perm_returns.mean() / perm_returns.std() * np.sqrt(252)
if perm_returns.std() > 0 else 0)
elif metric == 'total_return':
m = perm_returns.sum()
elif metric == 'profit_factor':
wins = perm_returns[perm_returns > 0].sum()
losses = abs(perm_returns[perm_returns < 0].sum())
m = wins / losses if losses > 0 else np.inf
permuted_metrics.append(m)
permuted_metrics = np.array(permuted_metrics)
# Calculate p-value
p_value = np.mean(permuted_metrics >= actual_metric)
return {
'actual_metric': actual_metric,
'permutation_mean': permuted_metrics.mean(),
'permutation_std': permuted_metrics.std(),
'p_value': p_value,
'significant_5pct': p_value < 0.05,
'significant_1pct': p_value < 0.01,
'percentile': np.mean(permuted_metrics < actual_metric) * 100,
}
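As a quick sanity check, the test can be run on synthetic data with no edge, where the reported p-value should usually be large (the array sizes and seed below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(0.0, 0.02, size=500)    # zero-drift market returns
signals = rng.choice([-1, 0, 1], size=500)   # a "strategy" with no real edge

result = permutation_test(returns, signals, n_permutations=2000)
print(f"metric: {result['actual_metric']:.2f}, p-value: {result['p_value']:.3f}")
# With no genuine edge, the p-value should typically sit well above 0.05
```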
26.9.4 Bootstrap Confidence Intervals
While permutation tests ask "is the result significantly positive?", bootstrap confidence intervals ask "what range of performance could I reasonably expect?"
def bootstrap_confidence_interval(returns: np.ndarray,
n_bootstrap: int = 10000,
confidence_level: float = 0.95,
metric_func=None) -> Dict:
"""
Compute bootstrap confidence interval for a performance metric.
returns: Array of strategy returns
n_bootstrap: Number of bootstrap samples
confidence_level: e.g. 0.95 for 95% CI
metric_func: Function that takes returns array and returns a scalar
"""
if metric_func is None:
# Default: Sharpe ratio
def metric_func(r):
return r.mean() / r.std() * np.sqrt(252) if r.std() > 0 else 0
# Generate bootstrap distribution
bootstrap_metrics = []
n = len(returns)
for _ in range(n_bootstrap):
# Sample with replacement
sample = np.random.choice(returns, size=n, replace=True)
bootstrap_metrics.append(metric_func(sample))
bootstrap_metrics = np.array(bootstrap_metrics)
alpha = 1 - confidence_level
lower = np.percentile(bootstrap_metrics, alpha / 2 * 100)
upper = np.percentile(bootstrap_metrics, (1 - alpha / 2) * 100)
return {
'point_estimate': metric_func(returns),
'bootstrap_mean': bootstrap_metrics.mean(),
'bootstrap_std': bootstrap_metrics.std(),
'ci_lower': lower,
'ci_upper': upper,
'confidence_level': confidence_level,
'ci_contains_zero': lower <= 0 <= upper,
}
26.9.5 Multiple Comparisons Correction
If you test $k$ strategies, the probability of at least one false positive at significance level $\alpha$ is:
$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^k$$
For $k = 20$ strategies at $\alpha = 0.05$: $P = 1 - 0.95^{20} = 0.64$. There is a 64% chance of finding a "significant" result even if no strategy has a genuine edge.
Bonferroni correction: Divide $\alpha$ by the number of tests: $\alpha_{adjusted} = \alpha / k$.
Benjamini-Hochberg (BH) procedure: Controls the False Discovery Rate (FDR) rather than the Family-Wise Error Rate. Less conservative than Bonferroni.
def multiple_comparison_correction(p_values: List[float],
method: str = 'bonferroni',
alpha: float = 0.05) -> Dict:
"""
Correct for multiple comparisons.
p_values: List of p-values from individual strategy tests
method: 'bonferroni' or 'bh' (Benjamini-Hochberg)
alpha: Significance level
"""
k = len(p_values)
p_array = np.array(p_values)
if method == 'bonferroni':
adjusted_alpha = alpha / k
significant = p_array < adjusted_alpha
adjusted_p = np.minimum(p_array * k, 1.0)
elif method == 'bh':
# Benjamini-Hochberg procedure
sorted_indices = np.argsort(p_array)
sorted_p = p_array[sorted_indices]
# Calculate BH critical values
bh_critical = [(i + 1) / k * alpha for i in range(k)]
# Find largest p-value that is below its critical value
significant_sorted = np.zeros(k, dtype=bool)
max_significant = -1
for i in range(k):
if sorted_p[i] <= bh_critical[i]:
max_significant = i
if max_significant >= 0:
significant_sorted[:max_significant + 1] = True
# Map back to original order
significant = np.zeros(k, dtype=bool)
significant[sorted_indices] = significant_sorted
        # Adjusted p-values: p_(i) * k / i, with monotonicity enforced
        # from the largest rank downward (BH step-up)
        raw_adjusted = sorted_p * k / (np.arange(k) + 1)
        adjusted_sorted = np.minimum.accumulate(raw_adjusted[::-1])[::-1]
        adjusted_p = np.zeros(k)
        adjusted_p[sorted_indices] = adjusted_sorted
        adjusted_p = np.minimum(adjusted_p, 1.0)
return {
'original_p_values': p_values,
'adjusted_p_values': adjusted_p.tolist(),
'significant': significant.tolist(),
'num_significant': significant.sum(),
'method': method,
'alpha': alpha,
'num_tests': k,
}
26.9.6 Minimum Number of Trades
How many trades do you need for a statistically meaningful backtest? The answer depends on the effect size (how large the edge is) and the desired statistical power.
For detecting a Sharpe ratio of $S$ with power $1 - \beta$ at significance level $\alpha$:
$$n \geq \left(\frac{z_\alpha + z_\beta}{S / \sqrt{252}}\right)^2$$
For a Sharpe of 1.0 with 80% power at 5% significance: $n \geq \left(\frac{1.645 + 0.842}{1/\sqrt{252}}\right)^2 \approx 1{,}559$ daily observations, or a little over 6 years of daily data.
For a Sharpe of 2.0, you need only about 390 observations. The higher the Sharpe, the fewer observations needed.
from scipy import stats as scipy_stats
def minimum_trades_required(target_sharpe: float,
alpha: float = 0.05,
power: float = 0.80,
periods_per_year: int = 252) -> int:
"""
Calculate minimum number of observations needed to detect
a given Sharpe ratio with specified statistical power.
"""
z_alpha = scipy_stats.norm.ppf(1 - alpha)
z_beta = scipy_stats.norm.ppf(power)
# Effect size per period
effect_per_period = target_sharpe / np.sqrt(periods_per_year)
n = ((z_alpha + z_beta) / effect_per_period) ** 2
return int(np.ceil(n))
# Examples
for sharpe in [0.5, 1.0, 1.5, 2.0, 3.0]:
n = minimum_trades_required(sharpe)
years = n / 252
print(f"Sharpe {sharpe:.1f}: need {n:,} observations "
f"({years:.1f} years of daily data)")
26.10 Backtest Report Generation
26.10.1 What a Good Report Contains
A comprehensive backtest report should include:
- Strategy description: What the strategy does, its parameters, and its rationale.
- Summary statistics: All metrics from Section 26.8 in a clean table.
- Equity curve: Portfolio value over time.
- Drawdown chart: Drawdown percentage over time.
- Monthly returns heatmap: Returns by month and year.
- Trade distribution: Histogram of individual trade P&Ls.
- Rolling Sharpe ratio: How the Sharpe changes over time.
- Statistical significance: Results of permutation tests and confidence intervals.
- Parameter sensitivity: How performance changes with different parameter values.
- Comparison to benchmarks: How does this perform vs. simple benchmarks?
26.10.2 Python Report Generator
import matplotlib
matplotlib.use('Agg') # Non-interactive backend
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
class BacktestReport:
"""Generate a comprehensive backtest report."""
def __init__(self, equity_curve: pd.Series,
trades: List[Dict],
strategy_name: str = "Strategy",
benchmark_equity: Optional[pd.Series] = None):
self.equity = equity_curve
self.trades = trades
self.strategy_name = strategy_name
self.benchmark = benchmark_equity
self.metrics = PerformanceMetrics(equity_curve, trades)
def generate_summary_table(self) -> str:
"""Generate text summary of all metrics."""
m = self.metrics.compute_all()
lines = [
f"{'='*60}",
f" BACKTEST REPORT: {self.strategy_name}",
f"{'='*60}",
f"",
f" Period: {self.equity.index[0].strftime('%Y-%m-%d')} to "
f"{self.equity.index[-1].strftime('%Y-%m-%d')}",
f" Total Trades: {m['total_trades']}",
f"",
f" --- Return Metrics ---",
f" Total Return: {m['total_return']:.2%}",
f" Annualized Return: {m['annualized_return']:.2%}",
f"",
f" --- Risk Metrics ---",
f" Volatility: {m['volatility']:.2%}",
f" Max Drawdown: {m['max_drawdown']:.2%}",
f" Max DD Duration: {m['max_drawdown_duration']} periods",
f" VaR (5%): {m['var_5pct']:.4f}",
f" CVaR (5%): {m['cvar_5pct']:.4f}",
f"",
f" --- Risk-Adjusted Metrics ---",
f" Sharpe Ratio: {m['sharpe_ratio']:.3f}",
f" Sortino Ratio: {m['sortino_ratio']:.3f}",
f" Calmar Ratio: {m['calmar_ratio']:.3f}",
f"",
f" --- Trade Metrics ---",
f" Win Rate: {m['win_rate']:.2%}",
f" Profit Factor: {m['profit_factor']:.3f}",
f" Expectancy: ${m['expectancy']:.4f}",
f" Avg Win/Loss: {m['avg_win_loss_ratio']:.3f}",
f" Avg Holding Period: {m['avg_holding_period_days']:.1f} days",
f"{'='*60}",
]
return "\n".join(lines)
def plot_equity_curve(self, ax=None):
"""Plot the equity curve."""
if ax is None:
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(self.equity.index, self.equity.values,
label=self.strategy_name, linewidth=1.5)
if self.benchmark is not None:
ax.plot(self.benchmark.index, self.benchmark.values,
label='Benchmark', linewidth=1, alpha=0.7,
linestyle='--')
ax.set_title('Equity Curve')
ax.set_xlabel('Date')
ax.set_ylabel('Portfolio Value ($)')
ax.legend()
ax.grid(True, alpha=0.3)
return ax
def plot_drawdown(self, ax=None):
"""Plot the drawdown chart."""
if ax is None:
fig, ax = plt.subplots(figsize=(12, 4))
returns = self.equity.pct_change().dropna()
cumulative = (1 + returns).cumprod()
drawdown = cumulative / cumulative.cummax() - 1
ax.fill_between(drawdown.index, drawdown.values, 0,
color='red', alpha=0.3)
ax.plot(drawdown.index, drawdown.values,
color='red', linewidth=0.5)
ax.set_title('Drawdown')
ax.set_xlabel('Date')
        ax.set_ylabel('Drawdown (fraction)')
ax.grid(True, alpha=0.3)
return ax
def plot_monthly_returns_heatmap(self, ax=None):
"""Plot monthly returns as a heatmap."""
if ax is None:
fig, ax = plt.subplots(figsize=(12, 6))
monthly = self.equity.resample('M').last().pct_change().dropna()
# Create pivot table: year x month
monthly_df = pd.DataFrame({
'year': monthly.index.year,
'month': monthly.index.month,
'return': monthly.values,
})
pivot = monthly_df.pivot_table(
index='year', columns='month', values='return'
)
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Ensure all months present
for m in range(1, 13):
if m not in pivot.columns:
pivot[m] = np.nan
pivot = pivot.reindex(columns=range(1, 13))
im = ax.imshow(pivot.values, cmap='RdYlGn', aspect='auto',
vmin=-0.1, vmax=0.1)
ax.set_xticks(range(12))
ax.set_xticklabels(month_names)
ax.set_yticks(range(len(pivot.index)))
ax.set_yticklabels(pivot.index)
ax.set_title('Monthly Returns Heatmap')
# Add text annotations
for i in range(pivot.shape[0]):
for j in range(pivot.shape[1]):
val = pivot.values[i, j]
if not np.isnan(val):
ax.text(j, i, f'{val:.1%}',
ha='center', va='center', fontsize=8)
plt.colorbar(im, ax=ax, label='Return')
return ax
def plot_trade_distribution(self, ax=None):
"""Plot histogram of trade P&Ls."""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 5))
pnls = [t['pnl'] for t in self.trades]
ax.hist(pnls, bins=50, color='steelblue', alpha=0.7,
edgecolor='black', linewidth=0.5)
ax.axvline(x=0, color='red', linestyle='--', linewidth=1)
ax.axvline(x=np.mean(pnls), color='green', linestyle='--',
linewidth=1, label=f'Mean: ${np.mean(pnls):.4f}')
ax.set_title('Trade P&L Distribution')
ax.set_xlabel('P&L ($)')
ax.set_ylabel('Frequency')
ax.legend()
ax.grid(True, alpha=0.3)
return ax
def plot_rolling_sharpe(self, window: int = 60, ax=None):
"""Plot rolling Sharpe ratio."""
if ax is None:
fig, ax = plt.subplots(figsize=(12, 4))
returns = self.equity.pct_change().dropna()
rolling_mean = returns.rolling(window).mean()
rolling_std = returns.rolling(window).std()
rolling_sharpe = (rolling_mean / rolling_std *
np.sqrt(252)).dropna()
ax.plot(rolling_sharpe.index, rolling_sharpe.values,
linewidth=1, color='steelblue')
ax.axhline(y=0, color='red', linestyle='--', linewidth=0.5)
ax.axhline(y=1, color='green', linestyle='--', linewidth=0.5,
alpha=0.5)
ax.set_title(f'Rolling {window}-Period Sharpe Ratio')
ax.set_xlabel('Date')
ax.set_ylabel('Sharpe Ratio')
ax.grid(True, alpha=0.3)
return ax
    def generate_full_report(self, save_path: Optional[str] = None):
"""Generate complete multi-panel report."""
fig = plt.figure(figsize=(16, 24))
# Layout: 6 rows
gs = fig.add_gridspec(6, 2, hspace=0.4, wspace=0.3)
# Row 1: Equity curve (full width)
ax1 = fig.add_subplot(gs[0, :])
self.plot_equity_curve(ax1)
# Row 2: Drawdown (full width)
ax2 = fig.add_subplot(gs[1, :])
self.plot_drawdown(ax2)
# Row 3: Monthly heatmap (full width)
ax3 = fig.add_subplot(gs[2, :])
self.plot_monthly_returns_heatmap(ax3)
# Row 4: Trade distribution + Rolling Sharpe
ax4 = fig.add_subplot(gs[3, 0])
self.plot_trade_distribution(ax4)
ax5 = fig.add_subplot(gs[3, 1])
self.plot_rolling_sharpe(ax=ax5)
# Row 5: Summary statistics as text
ax6 = fig.add_subplot(gs[4:, :])
ax6.axis('off')
summary = self.generate_summary_table()
ax6.text(0.05, 0.95, summary, transform=ax6.transAxes,
fontsize=9, verticalalignment='top',
fontfamily='monospace')
fig.suptitle(f'Backtest Report: {self.strategy_name}',
fontsize=16, fontweight='bold', y=0.98)
if save_path:
fig.savefig(save_path, dpi=150, bbox_inches='tight')
print(f"Report saved to {save_path}")
plt.close(fig)
return fig
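A minimal usage sketch, assuming `equity_curve` (a `pd.Series` of portfolio values indexed by timestamp) and `trades` (a list of dicts with `pnl`, `entry_time`, and `exit_time` keys) come from the backtest engine earlier in this chapter; the strategy name and output filename are illustrative:

```python
report = BacktestReport(equity_curve, trades,
                        strategy_name="Mean Reversion v1")
print(report.generate_summary_table())
report.generate_full_report(save_path="mean_reversion_v1_report.png")
```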
26.11 From Backtest to Paper Trading
26.11.1 The Gap Between Backtest and Reality
Even a perfectly conducted backtest operates under assumptions that may not hold in live trading:
- Data quality. Your historical data is cleaner than real-time data feeds.
- Execution timing. In backtesting, you react to each data event instantly. In reality, there is latency.
- Behavioral factors. In a backtest, every trade is executed mechanically. In practice, you may hesitate, second-guess, or deviate from the strategy.
- Market regime changes. The future may be structurally different from the past.
- Feedback effects. Your own trading may affect the market, which the backtest does not model (unless you explicitly include market impact).
26.11.2 The Paper Trading Protocol
Paper trading (forward testing) bridges the gap between backtesting and live trading. Here is a systematic protocol:
Phase 1: Shadow Mode (2--4 weeks)
Run the strategy in real time without placing any orders. Log every signal and compare it to what you would have expected based on the backtest; a minimal logging sketch follows the checklist below.
Check:
- Are signals being generated at the expected frequency?
- Are the signals consistent with what the historical data would predict?
- Is the data feed providing clean, timely data?
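In practice, the shadow-mode harness can be as simple as appending every signal to a log file and periodically comparing the realized signal rate against the backtest. A minimal sketch, assuming each signal is a plain dict; the file path, record format, and tolerance band are illustrative, not part of the framework above:

```python
import json
from datetime import datetime, timezone

def log_shadow_signal(signal: dict, path: str = "shadow_signals.jsonl") -> None:
    """Append one signal record to a JSON-lines log (no orders are placed)."""
    record = {'logged_at': datetime.now(timezone.utc).isoformat(), **signal}
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')

def signal_rate_ok(n_signals: int, days_running: float,
                   expected_per_day: float, tolerance: float = 0.5) -> bool:
    """True if the live signal rate is within +/- tolerance of the backtest rate."""
    observed = n_signals / days_running if days_running > 0 else 0.0
    return abs(observed - expected_per_day) <= tolerance * expected_per_day
```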
Phase 2: Simulated Execution (4--8 weeks)
Continue running without real orders, but simulate execution against live market data. Use the fill simulator from Section 26.5 with live bid/ask data.
Check:
- Are simulated fills realistic? Compare to actual market trades.
- Is the equity curve tracking the expected range from the backtest?
- Are transaction costs in line with assumptions?
Phase 3: Small Live Trading (4--12 weeks)
Deploy with minimal capital (e.g., 1% of intended allocation). Execute real trades.
Check:
- Does actual execution match simulated execution?
- Are there any systematic differences between backtest, paper, and live?
- Is the strategy surviving real-world friction?
Phase 4: Gradual Scale-Up
If Phase 3 results are consistent with backtest expectations, gradually increase position sizes. A common rule: double the allocation every 4 weeks, provided performance remains within expected bounds.
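One way to make this schedule explicit is a small function evaluated at each review point; a sketch under the 4-week doubling rule above (the function name, starting fraction, and cap are illustrative):

```python
def next_allocation(current_fraction: float,
                    performance_ok: bool,
                    target_fraction: float = 1.0) -> float:
    """At each review (e.g., every 4 weeks): double the deployed fraction of
    the intended allocation if performance remains within expected bounds,
    otherwise hold it constant."""
    if not performance_ok:
        return current_fraction
    return min(current_fraction * 2, target_fraction)

# Starting from 1% in Phase 3, four successful reviews reach 16% of the target.
```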
26.11.3 Go/No-Go Criteria
Define explicit criteria for proceeding to live trading:
def go_live_decision(backtest_metrics: Dict,
paper_metrics: Dict,
threshold_pct: float = 0.50) -> Dict:
"""
Decide whether to proceed from paper trading to live trading.
threshold_pct: Maximum allowable degradation from backtest
"""
decisions = {}
# 1. Sharpe ratio within acceptable range
    sharpe_retention = (paper_metrics['sharpe_ratio'] / backtest_metrics['sharpe_ratio']
                        if backtest_metrics['sharpe_ratio'] > 0 else 0.0)
    decisions['sharpe_degradation'] = {
        'backtest': backtest_metrics['sharpe_ratio'],
        'paper': paper_metrics['sharpe_ratio'],
        'ratio': sharpe_retention,
        'pass': sharpe_retention >= (1 - threshold_pct),
}
# 2. Win rate within acceptable range
win_diff = abs(paper_metrics['win_rate'] - backtest_metrics['win_rate'])
decisions['win_rate_consistency'] = {
'backtest': backtest_metrics['win_rate'],
'paper': paper_metrics['win_rate'],
'difference': win_diff,
'pass': win_diff < 0.15, # Within 15 percentage points
}
# 3. Max drawdown not significantly worse
dd_ratio = paper_metrics['max_drawdown'] / backtest_metrics['max_drawdown']
decisions['drawdown_check'] = {
'backtest': backtest_metrics['max_drawdown'],
'paper': paper_metrics['max_drawdown'],
'ratio': dd_ratio,
'pass': dd_ratio < 1.5, # DD no more than 50% worse
}
# 4. Sufficient number of trades
decisions['sufficient_trades'] = {
'paper_trades': paper_metrics['total_trades'],
'pass': paper_metrics['total_trades'] >= 30,
}
# 5. Positive expectancy
decisions['positive_expectancy'] = {
'expectancy': paper_metrics['expectancy'],
'pass': paper_metrics['expectancy'] > 0,
}
# Overall decision
all_pass = all(d['pass'] for d in decisions.values())
decisions['overall'] = {
'go_live': all_pass,
'checks_passed': sum(1 for d in decisions.values()
if isinstance(d, dict) and d.get('pass')),
        'total_checks': len(decisions),  # 'overall' is not yet in the dict here
}
return decisions
26.11.4 When to Stop
Equally important is knowing when to stop a live strategy. Define these rules before going live (a kill-switch sketch follows the list):
- Maximum drawdown stop. If the drawdown exceeds 1.5x the worst backtest drawdown, stop trading and investigate.
- Performance deviation stop. If the rolling 30-day Sharpe falls below -1.0, pause the strategy.
- Structural break stop. If the platform changes fees, rules, or market mechanics, pause and re-evaluate.
- Signal frequency stop. If the strategy generates significantly more or fewer signals than expected, investigate.
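These stop rules can be encoded as an explicit kill switch evaluated after every trading session. A minimal sketch; the input names and the 0.5x--2x signal-frequency band are illustrative assumptions, and drawdowns are expressed as negative fractions:

```python
def kill_switch(live_drawdown: float,
                worst_backtest_drawdown: float,
                rolling_30d_sharpe: float,
                platform_changed: bool,
                signals_last_week: float,
                expected_signals_per_week: float) -> dict:
    """Evaluate the stop rules above (drawdowns as negative fractions,
    e.g. -0.20 for a 20% drawdown)."""
    triggered = {
        # Drawdown worse than 1.5x the worst backtest drawdown
        'max_drawdown_stop': live_drawdown < 1.5 * worst_backtest_drawdown,
        # Rolling 30-day Sharpe below -1.0
        'performance_deviation_stop': rolling_30d_sharpe < -1.0,
        # Platform changed fees, rules, or market mechanics
        'structural_break_stop': platform_changed,
        # Signal rate far outside expectations (band is an illustrative choice)
        'signal_frequency_stop': (
            expected_signals_per_week > 0 and
            not 0.5 <= signals_last_week / expected_signals_per_week <= 2.0
        ),
    }
    return {'pause_and_investigate': any(triggered.values()),
            'triggered': triggered}
```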
26.12 Chapter Summary
This chapter has built a complete backtesting infrastructure for prediction market strategies. Let us recapitulate the essential lessons:
Mindset: Backtesting is hypothesis testing, not profit demonstration. The goal is to falsify strategies, not confirm them.
Pitfalls: The seven deadly sins --- lookahead bias, overfitting, survivorship bias, unrealistic execution, ignoring costs, data snooping, and lack of statistical testing --- can each independently invalidate a backtest. Our framework addresses all seven architecturally.
Framework: The event-driven architecture (Data Handler, Strategy, Portfolio, Execution Simulator, Analyzer) enforces correct information flow and prevents lookahead bias structurally.
Fill simulation: Realistic execution modeling is essential in thin prediction markets. Our execution simulator models spreads, market impact (including square-root impact for larger orders), partial fills, and platform-specific fees.
Transaction costs: Trading fees, spread costs, market impact, and opportunity cost of capital lock-up must all be modeled. The cost sensitivity analysis reveals whether a strategy's edge survives real-world friction.
Walk-forward testing: Simple backtesting conflates in-sample and out-of-sample performance. Walk-forward analysis with rolling or anchored windows provides genuine out-of-sample evaluation and reveals parameter stability.
Performance metrics: A complete evaluation requires return metrics (total, annualized), risk metrics (drawdown, VaR, volatility), risk-adjusted metrics (Sharpe, Sortino, Calmar), and trade-level metrics (win rate, profit factor, expectancy).
Statistical significance: Permutation tests, bootstrap confidence intervals, and multiple comparisons correction distinguish genuine alpha from noise. The minimum observation requirements remind us that statistical claims require statistical evidence.
From backtest to live: The disciplined progression through shadow mode, simulated execution, small live trading, and gradual scale-up protects capital while validating backtest assumptions in the real world.
What's Next
In Chapter 27, we will explore Real-Time Data Pipelines for Prediction Markets, building the infrastructure needed to collect, process, and analyze prediction market data in real time. This is the natural next step: once you have a backtested strategy that passes all the tests in this chapter, you need a live data infrastructure to deploy it. We will cover streaming data architectures, real-time feature computation, and the engineering challenges of running prediction market strategies in production.
Key Equations Reference
| Metric | Formula |
|---|---|
| Sharpe Ratio | $S = \frac{R_p - R_f}{\sigma_p}$ |
| Sortino Ratio | $Sortino = \frac{R_p - R_f}{\sigma_{downside}}$ |
| Calmar Ratio | $Calmar = \frac{R_{annual}}{\|MDD\|}$ |
| Max Drawdown | $MDD = \min_t\left(\frac{V_t}{\max_{s \leq t}V_s} - 1\right)$ |
| Profit Factor | $PF = \frac{\sum \text{wins}}{\sum \|\text{losses}\|}$ |
| Expectancy | $E = W \times \bar{G} - (1-W) \times \bar{L}$ |
| Market Impact | $C_{impact} = \sigma \cdot \beta \cdot \sqrt{Q/V}$ |
| Min Observations | $n \geq \left(\frac{z_\alpha + z_\beta}{S/\sqrt{252}}\right)^2$ |