Chapter 26: Backtesting Prediction Market Strategies

"Everyone has a plan until they get punched in the mouth." --- Mike Tyson

In prediction markets, the punch comes the moment you deploy a strategy with real capital and discover that the beautiful equity curve you constructed on historical data was nothing more than an artifact of flawed methodology. Backtesting --- the process of simulating a trading strategy on historical data --- is simultaneously the most powerful tool in a quantitative trader's arsenal and the most dangerous source of self-deception.

This chapter builds a rigorous backtesting framework from the ground up, specifically tailored to the unique characteristics of prediction markets. We will confront every bias that corrupts backtest results, simulate realistic execution conditions, model the true costs of trading, and construct statistical tests that separate genuine alpha from statistical noise. By the end, you will possess both the code infrastructure and the intellectual discipline needed to evaluate any prediction market strategy with confidence.


26.1 Why Backtesting Is Essential

26.1.1 Validating Strategies Before Risking Capital

Every prediction market strategy begins as a hypothesis: "Markets that have moved more than 10 cents in the last hour tend to revert within the next four hours," or "Markets whose prices diverge from my model's probability by more than 15 percentage points are profitable to trade." These hypotheses sound plausible, but plausibility is not evidence.

Backtesting transforms a hypothesis into a falsifiable proposition. By applying a strategy's rules mechanically to historical data, we obtain a simulated track record that lets us answer concrete questions:

  • Would this strategy have been profitable? After accounting for all costs, does the strategy produce a positive expected return?
  • How much risk does it entail? What is the worst drawdown? How volatile are the returns? Could I psychologically and financially survive the bad periods?
  • Is the edge large enough? A strategy that produces a 0.5% annual return is not worth the operational complexity of deploying it. Does the edge justify the effort?
  • Is the result statistically significant? Could the observed performance be explained by luck alone, or does the data provide genuine evidence of an edge?

Without backtesting, you are gambling on intuition. With rigorous backtesting, you are making an informed decision under uncertainty --- which is precisely what prediction markets themselves are designed to facilitate.

26.1.2 Historical Simulation as a Laboratory

Think of your backtesting framework as a laboratory. Just as a physicist does not build a nuclear reactor to test a theory about fission, you do not deploy capital to test a theory about market behavior. The laboratory lets you:

  1. Iterate rapidly. You can test dozens of strategy variations in the time it would take to paper-trade a single one.
  2. Control variables. You can isolate the effect of individual parameters by changing one at a time while holding everything else constant.
  3. Examine extreme conditions. You can see how your strategy performs during historical crises, elections, and other stress events without actually living through them.
  4. Build conviction. A strategy that has survived rigorous backtesting gives you the psychological resilience to stick with it during inevitable drawdowns.

26.1.3 The Backtesting Mindset

The correct mental model for backtesting is hypothesis testing, not profit demonstration. You are not trying to prove that a strategy works. You are trying to determine whether the evidence is consistent with the strategy working, while actively looking for reasons it might not.

This distinction matters because of a profound asymmetry: it is vastly easier to construct a strategy that appears profitable on historical data than to construct one that is genuinely profitable going forward. The space of strategies that fit historical noise is enormous. The space of strategies that capture genuine, persistent market inefficiencies is small.

The disciplined backtester asks:

  • "What could be wrong with this result?" not "How good is this result?"
  • "How would this fail?" not "How much would this make?"
  • "Is this edge robust across different time periods, markets, and parameter choices?" not "What parameters maximize the backtest return?"

With this mindset firmly established, let us examine what goes wrong when discipline breaks down.


26.2 Common Backtesting Pitfalls

Backtesting pitfalls are not merely academic curiosities. They are traps that have destroyed real capital and ended real trading operations. Understanding them deeply is a prerequisite for building any credible backtesting framework.

26.2.1 Lookahead Bias

Lookahead bias occurs when a backtest uses information that would not have been available at the time the trading decision was made. This is the single most common and most destructive backtesting error.

Examples in prediction markets:

  • Using the final resolution outcome to filter which markets to trade. ("I only backtested on markets that resolved YES, because those are the ones with clear outcomes.")
  • Using end-of-day volume data to make trading decisions that would have been made intraday.
  • Using a model trained on the entire dataset, including future data, to generate signals at each point in time.

import pandas as pd
import numpy as np

# === LOOKAHEAD BIAS DEMONSTRATION ===

# Simulated prediction market data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100, freq='D')
prices = 0.5 + np.cumsum(np.random.randn(100) * 0.02)
prices = np.clip(prices, 0.01, 0.99)
resolution = 1  # Market resolved YES

df = pd.DataFrame({
    'date': dates,
    'price': prices,
    'volume': np.random.randint(100, 1000, 100)
})

# WRONG: Using future resolution to decide trading direction
# This is lookahead bias --- at trading time, you don't know
# the resolution outcome.
def strategy_with_lookahead(df, resolution):
    """Biased: uses resolution outcome to set direction."""
    signals = []
    for i, row in df.iterrows():
        if resolution == 1:
            # Buy when price is below 0.6 --- but we only know
            # to buy because we know it resolves YES!
            signals.append(1 if row['price'] < 0.6 else 0)
        else:
            signals.append(-1 if row['price'] > 0.4 else 0)
    return signals

# CORRECT: Strategy uses only past information
def strategy_without_lookahead(df):
    """Unbiased: uses only information available at decision time."""
    signals = []
    for i in range(len(df)):
        if i < 20:
            signals.append(0)  # Not enough history
            continue
        # Use only past prices to form a view
        recent_avg = df['price'].iloc[i-20:i].mean()
        current = df['price'].iloc[i]
        # Mean reversion: buy if below recent average
        if current < recent_avg - 0.05:
            signals.append(1)
        elif current > recent_avg + 0.05:
            signals.append(-1)
        else:
            signals.append(0)
    return signals

print("Strategy with lookahead (BIASED):")
biased_signals = strategy_with_lookahead(df, resolution)
print(f"  Buy signals: {sum(1 for s in biased_signals if s == 1)}")

print("\nStrategy without lookahead (CORRECT):")
clean_signals = strategy_without_lookahead(df)
print(f"  Buy signals: {sum(1 for s in clean_signals if s == 1)}")
print(f"  Sell signals: {sum(1 for s in clean_signals if s == -1)}")

Lookahead bias is insidious because it is often subtle. Using a moving average calculated on all data (including future values) instead of a rolling window introduces lookahead. So does a spread model calibrated to the full dataset. Any time your code at time t can "see" data from time t+1 or later, you have lookahead bias.
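
To make the rolling-window point concrete, here is a minimal sketch (reusing the simulated df from above) that contrasts a signal built from the full-sample mean with one built from a rolling mean restricted to past data:

# === SUBTLE LOOKAHEAD: FULL-SAMPLE VS. ROLLING STATISTICS ===

# WRONG: the mean is computed over the entire series, so every
# "historical" signal secretly depends on future prices.
full_sample_mean = df['price'].mean()
biased_signal = (df['price'] < full_sample_mean - 0.05).astype(int)

# CORRECT: a rolling mean shifted by one period uses only data
# available before each decision point.
rolling_mean = df['price'].rolling(20).mean().shift(1)
clean_signal = (df['price'] < rolling_mean - 0.05).astype(int)

print(f"Buy signals using full-sample mean (biased): {biased_signal.sum()}")
print(f"Buy signals using rolling mean (clean):      {clean_signal.sum()}")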

Prevention: The best architectural defense is an event-driven backtesting framework (Section 26.3) where data is fed to the strategy one timestamp at a time, making it structurally impossible to access future data.

26.2.2 Survivorship Bias

Survivorship bias occurs when the backtest dataset includes only markets that survived to the end of the testing period, excluding those that were delisted, cancelled, or otherwise removed.

In prediction markets, survivorship bias takes specific forms:

  • Cancelled markets. Some platforms cancel markets due to ambiguous resolution criteria. If your dataset excludes these, your universe of markets is biased toward those with cleaner outcomes.
  • Low-liquidity markets that dried up. If you only include markets that maintained liquidity throughout their lifetime, you are excluding precisely the markets where your strategy would have been unable to exit positions.
  • Platform changes. If a platform changed its fee structure or market mechanics partway through your testing period, and you only use post-change data, your backtest does not reflect the full historical experience.

# === SURVIVORSHIP BIAS DEMONSTRATION ===

# Suppose we have 1000 prediction markets
np.random.seed(42)
n_markets = 1000

# Each market has a "quality score" that determines if it survives
quality = np.random.uniform(0, 1, n_markets)
# Markets below 0.3 quality get cancelled (300 markets removed)
survived = quality >= 0.3

# Returns are correlated with quality (better markets have
# slightly better average returns for a given strategy)
true_returns = np.random.randn(n_markets) * 0.1 + (quality - 0.5) * 0.05

# Survivorship-biased analysis: only look at surviving markets
biased_mean = true_returns[survived].mean()
biased_sharpe = true_returns[survived].mean() / true_returns[survived].std()

# Correct analysis: include all markets
correct_mean = true_returns.mean()
correct_sharpe = true_returns.mean() / true_returns.std()

print("Survivorship Bias Impact:")
print(f"  Biased mean return:  {biased_mean:.4f} (Sharpe: {biased_sharpe:.3f})")
print(f"  Correct mean return: {correct_mean:.4f} (Sharpe: {correct_sharpe:.3f})")
print(f"  Bias magnitude:      {biased_mean - correct_mean:.4f}")
print(f"  Markets excluded:    {n_markets - survived.sum()}")

26.2.3 Overfitting to Historical Data

Overfitting is the process of tuning a strategy's parameters so precisely to historical data that it captures noise rather than signal. An overfit strategy will perform brilliantly on the data used to develop it and poorly on new data.

The mathematical intuition is straightforward. Suppose a strategy has k free parameters. Each parameter adds a degree of freedom that allows the strategy to fit one more peculiarity of the historical data. With enough parameters, any strategy can be made to fit any historical dataset perfectly --- just as a polynomial of degree n-1 can pass through any n points.

Overfitting warning signs:

| Warning Sign | Description |
| --- | --- |
| Too many parameters | Strategy has more tunable knobs than economic rationale |
| Fragile performance | Small parameter changes cause large performance swings |
| In-sample/out-of-sample gap | Strategy performs much better on training data than test data |
| Implausible Sharpe ratio | Backtest Sharpe > 3.0 for a prediction market strategy is suspicious |
| Complex entry/exit rules | Multiple conditional clauses with specific numeric thresholds |
# === OVERFITTING DEMONSTRATION ===

# Generate random prediction market returns (no real signal)
np.random.seed(42)
n_days = 500
returns = np.random.randn(n_days) * 0.02  # Pure noise

# "Optimize" a strategy by trying many parameter combinations
best_in_sample_return = -np.inf
best_params = None

# Try 1000 random parameter combinations
for trial in range(1000):
    # Random lookback and threshold parameters
    lookback = np.random.randint(5, 50)
    threshold = np.random.uniform(0.01, 0.10)

    # Apply a meaningless strategy to in-sample data (first 250 days)
    in_sample = returns[:250]
    signals = []
    for i in range(lookback, 250):
        rolling_mean = in_sample[i-lookback:i].mean()
        if rolling_mean > threshold:
            signals.append(1)
        elif rolling_mean < -threshold:
            signals.append(-1)
        else:
            signals.append(0)

    # Calculate in-sample return
    strategy_returns = [s * in_sample[i+lookback]
                        for i, s in enumerate(signals)
                        if i + lookback < 250]
    total_return = sum(strategy_returns)

    if total_return > best_in_sample_return:
        best_in_sample_return = total_return
        best_params = (lookback, threshold)

# Now test "best" parameters on out-of-sample data
lookback, threshold = best_params
out_sample = returns[250:]
signals = []
for i in range(lookback, 250):
    rolling_mean = out_sample[i-lookback:i].mean()
    if rolling_mean > threshold:
        signals.append(1)
    elif rolling_mean < -threshold:
        signals.append(-1)
    else:
        signals.append(0)

strategy_returns = [s * out_sample[i+lookback]
                    for i, s in enumerate(signals)
                    if i + lookback < 250]
out_of_sample_return = sum(strategy_returns)

print("Overfitting Demonstration:")
print(f"  Best in-sample params: lookback={best_params[0]}, "
      f"threshold={best_params[1]:.4f}")
print(f"  In-sample return:      {best_in_sample_return:.4f}")
print(f"  Out-of-sample return:  {out_of_sample_return:.4f}")
print(f"  Performance decay:     {best_in_sample_return - out_of_sample_return:.4f}")
print("\n  Note: The data is pure noise. Any in-sample 'edge' is overfitting.")

26.2.4 Unrealistic Fill Assumptions

Perhaps the most prediction-market-specific pitfall: assuming that you can trade at the prices you see in historical data. Prediction markets have:

  • Wide bid-ask spreads, often 2--10 cents on binary contracts.
  • Thin order books, where a $100 order can move the price several cents.
  • Stale quotes, where the displayed price reflects a trade from minutes or hours ago.
  • No guaranteed fills, as limit orders may never execute.

A backtest that assumes you can buy at the last traded price ignores all of these realities. Section 26.5 addresses this in detail.

26.2.5 Ignoring Transaction Costs

Prediction market fees are significant:

  • Trading fees: 0--10% of profits or 1--5 cents per contract on some platforms.
  • Spread costs: Crossing a 4-cent spread costs 4% on a dollar contract.
  • Withdrawal fees: Moving money off-platform has real costs.
  • Opportunity cost: Capital locked in positions earning 0% while you wait for resolution.

A strategy that earns 8% gross but incurs 6% in costs earns only 2% net --- a dramatically different proposition.

26.2.6 Data Snooping and Selection Bias

Data snooping occurs when you test many strategies on the same dataset and select the best one. Even if each individual backtest is conducted correctly, the process of selection introduces bias.

If you test 100 strategies on the same data, the best one will appear profitable even if all 100 strategies have zero true edge. This is the multiple comparisons problem. At a 5% significance level, you would expect five strategies to appear significant by chance alone.
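
A quick simulation makes the multiple comparisons problem concrete: even when every candidate strategy is pure noise, the best of 100 can look impressive. A sketch, with all returns drawn randomly:

# === MULTIPLE COMPARISONS DEMONSTRATION ===

np.random.seed(0)
n_strategies, n_days = 100, 250

# Every "strategy" is pure noise with zero true edge
daily_returns = np.random.randn(n_strategies, n_days) * 0.02

# Annualized Sharpe ratio of each strategy
sharpes = (daily_returns.mean(axis=1) /
           daily_returns.std(axis=1) * np.sqrt(252))

print(f"Best Sharpe among {n_strategies} random strategies: {sharpes.max():.2f}")
print(f"Strategies with Sharpe > 1.0 by luck alone: {(sharpes > 1.0).sum()}")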

Selection bias is a related problem: choosing to develop strategies based on patterns you noticed in the data. If you noticed that a particular market exhibited mean reversion and then backtested a mean-reversion strategy on that market, your test is biased because the strategy was designed to fit the observation.

26.2.7 The Seven Deadly Sins of Backtesting

To summarize, here are the seven most dangerous backtesting errors, ranked by severity:

  1. Lookahead bias --- Using future information in past decisions.
  2. Overfitting --- Fitting noise instead of signal.
  3. Survivorship bias --- Testing only on markets/data that survived.
  4. Unrealistic execution --- Assuming instant fills at observed prices.
  5. Ignoring costs --- Omitting fees, spreads, and impact.
  6. Data snooping --- Testing many strategies and selecting the best.
  7. Lack of statistical testing --- Not checking if results are significant.

A single instance of any of these sins can render a backtest worthless. Our framework will address each one architecturally.


26.3 Designing a Backtesting Framework

26.3.1 Architecture Overview

A well-designed backtesting framework has five core components:

                    +------------------+
                    |   Data Handler   |
                    +--------+---------+
                             |
                     Market data events
                             |
                             v
                    +------------------+
                    |    Strategy      |
                    +--------+---------+
                             |
                      Signal/Order events
                             |
                             v
                    +------------------+
                    |    Portfolio     |
                    +--------+---------+
                             |
                       Order events
                             |
                             v
                    +------------------+
                    | Execution Sim    |
                    +--------+---------+
                             |
                       Fill events
                             |
                             v
                    +------------------+
                    |    Analyzer      |
                    +------------------+

Data Handler: Feeds historical data to the strategy one event at a time, ensuring no lookahead. Responsible for data loading, cleaning, and time-alignment.

Strategy: Receives market data and produces trading signals. Contains all the logic that defines "when to buy" and "when to sell." Knows nothing about execution or portfolio management.

Portfolio: Tracks positions, cash, and overall portfolio value. Receives signals from the strategy and decides whether to convert them into actual orders (e.g., based on position limits or risk constraints).

Execution Simulator: Receives orders and simulates realistic fills, including slippage, partial fills, and transaction costs. Returns fill events that update the portfolio.

Analyzer: Collects all events and computes performance metrics, generates reports, and produces visualizations.

26.3.2 Event-Driven vs. Vectorized Backtesting

There are two fundamental approaches to backtesting:

Vectorized backtesting operates on entire arrays of data simultaneously using NumPy/pandas operations. It is fast but makes it easy to introduce lookahead bias and difficult to model realistic execution.

Event-driven backtesting processes data one event at a time, simulating the chronological flow of information. It is slower but naturally prevents lookahead bias and supports complex execution modeling.

| Aspect | Vectorized | Event-Driven |
| --- | --- | --- |
| Speed | Very fast (vectorized ops) | Slower (Python loops) |
| Lookahead safety | Prone to bias | Structurally safe |
| Execution realism | Difficult to model | Natural fit |
| Complexity | Simple | More complex |
| Portfolio tracking | Approximate | Exact |
| Best for | Quick screening | Final validation |

Our recommendation: Use vectorized backtesting for initial strategy screening and event-driven backtesting for final validation. The framework we build supports both approaches.
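
For the screening stage, a vectorized pass can be only a few pandas operations. The sketch below scores a simple mean-reversion rule on a price series in one shot; it deliberately ignores spreads, fills, and fees, which is exactly why it is suitable only for screening:

def vectorized_screen(prices: pd.Series,
                      lookback: int = 20,
                      band: float = 0.05) -> pd.Series:
    """Vectorized screen of a simple mean-reversion rule.
    Returns per-period strategy P&L with no costs or slippage."""
    rolling_mean = prices.rolling(lookback).mean()
    signal = pd.Series(0, index=prices.index)
    signal[prices < rolling_mean - band] = 1    # buy when well below average
    signal[prices > rolling_mean + band] = -1   # sell when well above average
    # Shift so the signal formed at t-1 is applied to the change from t-1 to t
    return signal.shift(1) * prices.diff()

# Example usage on the simulated price series from Section 26.2.1
screen_pnl = vectorized_screen(df['price'])
print(f"Screened P&L per contract (gross): {screen_pnl.sum():.4f}")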

26.3.3 The Framework in Python

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Dict, Optional, Tuple
import numpy as np
import pandas as pd

# === Event Types ===

class EventType(Enum):
    MARKET_DATA = "MARKET_DATA"
    SIGNAL = "SIGNAL"
    ORDER = "ORDER"
    FILL = "FILL"

class OrderSide(Enum):
    BUY = "BUY"
    SELL = "SELL"

class OrderType(Enum):
    MARKET = "MARKET"
    LIMIT = "LIMIT"

@dataclass
class MarketDataEvent:
    timestamp: datetime
    market_id: str
    price: float          # Last traded price
    bid: float            # Best bid
    ask: float            # Best ask
    volume: float         # Period volume
    bid_size: float       # Size at best bid
    ask_size: float       # Size at best ask

@dataclass
class SignalEvent:
    timestamp: datetime
    market_id: str
    direction: int        # +1 buy, -1 sell, 0 flat
    strength: float       # Signal strength [0, 1]

@dataclass
class OrderEvent:
    timestamp: datetime
    market_id: str
    side: OrderSide
    order_type: OrderType
    quantity: float
    limit_price: Optional[float] = None

@dataclass
class FillEvent:
    timestamp: datetime
    market_id: str
    side: OrderSide
    quantity: float       # Filled quantity
    fill_price: float     # Actual execution price
    commission: float     # Transaction cost
    slippage: float       # Price impact

# === Abstract Base Classes ===

class DataHandler(ABC):
    """Feeds historical data one event at a time."""

    @abstractmethod
    def has_next(self) -> bool:
        """Returns True if more data is available."""
        pass

    @abstractmethod
    def get_next(self) -> MarketDataEvent:
        """Returns the next market data event."""
        pass

    @abstractmethod
    def get_latest(self, market_id: str,
                   n: int = 1) -> List[MarketDataEvent]:
        """Returns the n most recent events for a market.
        Only returns events that have already been emitted
        (no lookahead)."""
        pass

class Strategy(ABC):
    """Generates trading signals from market data."""

    @abstractmethod
    def on_market_data(self, event: MarketDataEvent,
                       data_handler: DataHandler) -> Optional[SignalEvent]:
        """Process new market data and optionally emit a signal."""
        pass

class Portfolio(ABC):
    """Manages positions and converts signals to orders."""

    @abstractmethod
    def on_signal(self, signal: SignalEvent) -> Optional[OrderEvent]:
        """Process a signal and optionally emit an order."""
        pass

    @abstractmethod
    def on_fill(self, fill: FillEvent) -> None:
        """Update positions based on a fill."""
        pass

    @abstractmethod
    def get_equity(self) -> float:
        """Return current portfolio equity."""
        pass

class ExecutionSimulator(ABC):
    """Simulates realistic order execution."""

    @abstractmethod
    def execute(self, order: OrderEvent,
                market_data: MarketDataEvent) -> Optional[FillEvent]:
        """Simulate executing an order given current market conditions."""
        pass

class Analyzer(ABC):
    """Computes performance metrics and generates reports."""

    @abstractmethod
    def record_equity(self, timestamp: datetime, equity: float):
        """Record a portfolio equity snapshot."""
        pass

    @abstractmethod
    def record_trade(self, fill: FillEvent):
        """Record a completed trade."""
        pass

    @abstractmethod
    def compute_metrics(self) -> Dict:
        """Compute all performance metrics."""
        pass

26.3.4 The Backtesting Engine

The engine ties all components together:

class BacktestEngine:
    """Main backtesting engine that coordinates all components."""

    def __init__(self, data_handler: DataHandler,
                 strategy: Strategy,
                 portfolio: Portfolio,
                 execution_sim: ExecutionSimulator,
                 analyzer: Analyzer):
        self.data_handler = data_handler
        self.strategy = strategy
        self.portfolio = portfolio
        self.execution_sim = execution_sim
        self.analyzer = analyzer
        self.event_log: List = []

    def run(self) -> Dict:
        """Run the backtest and return results."""
        iteration = 0

        while self.data_handler.has_next():
            # Step 1: Get next market data event
            market_event = self.data_handler.get_next()
            self.event_log.append(market_event)

            # Step 2: Strategy processes market data
            signal = self.strategy.on_market_data(
                market_event, self.data_handler
            )

            if signal is not None:
                self.event_log.append(signal)

                # Step 3: Portfolio converts signal to order
                order = self.portfolio.on_signal(signal)

                if order is not None:
                    self.event_log.append(order)

                    # Step 4: Execution simulator fills the order
                    fill = self.execution_sim.execute(
                        order, market_event
                    )

                    if fill is not None:
                        self.event_log.append(fill)

                        # Step 5: Update portfolio with fill
                        self.portfolio.on_fill(fill)
                        self.analyzer.record_trade(fill)

            # Record equity at each timestep
            equity = self.portfolio.get_equity()
            self.analyzer.record_equity(
                market_event.timestamp, equity
            )

            iteration += 1

        return self.analyzer.compute_metrics()

This architecture enforces the critical invariant: the strategy only sees data that has already been emitted by the data handler. There is no way to introduce lookahead bias without deliberately breaking the framework's API.


26.4 Data Requirements and Preparation

26.4.1 What Data You Need

Prediction market data differs from traditional financial data. Here is the complete data schema:

Market-Level Data (Static):

| Field | Type | Description |
| --- | --- | --- |
| market_id | string | Unique identifier |
| question | string | The market question |
| category | string | Political, sports, crypto, etc. |
| creation_date | datetime | When the market was created |
| close_date | datetime | When trading ends |
| resolution_date | datetime | When outcome is determined |
| resolution | float | Final outcome (0 or 1 for binary) |
| platform | string | Which exchange |
| min_tick | float | Minimum price increment |

Time-Series Data (Dynamic):

| Field | Type | Description |
| --- | --- | --- |
| timestamp | datetime | Observation time |
| market_id | string | Market identifier |
| last_price | float | Last traded price |
| bid | float | Best bid price |
| ask | float | Best ask price |
| bid_size | float | Quantity at best bid |
| ask_size | float | Quantity at best ask |
| volume | float | Contracts traded this period |
| open_interest | int | Total open positions |

26.4.2 Data Cleaning

Raw prediction market data is messy. Common issues and their solutions:

def clean_prediction_market_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean raw prediction market time-series data."""

    df = df.copy()

    # 1. Remove duplicates
    df = df.drop_duplicates(subset=['timestamp', 'market_id'])

    # 2. Sort chronologically
    df = df.sort_values(['market_id', 'timestamp'])

    # 3. Enforce price bounds [0, 1] for binary markets
    for col in ['last_price', 'bid', 'ask']:
        if col in df.columns:
            df[col] = df[col].clip(0.0, 1.0)

    # 4. Ensure bid <= ask (fix crossed quotes)
    if 'bid' in df.columns and 'ask' in df.columns:
        crossed = df['bid'] > df['ask']
        if crossed.any():
            # Swap crossed quotes
            df.loc[crossed, ['bid', 'ask']] = (
                df.loc[crossed, ['ask', 'bid']].values
            )

    # 5. Forward-fill missing prices within each market
    df['last_price'] = df.groupby('market_id')['last_price'].ffill()

    # 6. Remove markets with too little data
    market_counts = df.groupby('market_id').size()
    valid_markets = market_counts[market_counts >= 10].index
    df = df[df['market_id'].isin(valid_markets)]

    # 7. Handle zero or negative volume
    if 'volume' in df.columns:
        df['volume'] = df['volume'].clip(lower=0)

    # 8. Flag stale data (no price change for extended period)
    df['price_change'] = df.groupby('market_id')['last_price'].diff()
    df['is_stale'] = (
        df.groupby('market_id')['price_change']
        .transform(lambda x: x.abs().rolling(24, min_periods=1).sum() == 0)
    )

    # 9. Drop the helper column
    df = df.drop(columns=['price_change'])

    return df
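
A brief usage sketch on a small synthetic frame with deliberately dirty rows (one out-of-bounds price and one crossed quote) shows the cleaning steps in action:

raw = pd.DataFrame({
    'timestamp': pd.date_range('2024-06-01', periods=12, freq='h'),
    'market_id': ['MKT-1'] * 12,
    'last_price': [0.40, 0.42, 1.05, 0.44, 0.44, 0.44, 0.45, 0.43,
                   0.46, 0.47, 0.46, 0.48],   # one price above 1.0
    'bid': [0.39, 0.41, 0.42, 0.43, 0.43, 0.43, 0.50, 0.42,
            0.45, 0.46, 0.45, 0.47],          # one crossed quote
    'ask': [0.41, 0.43, 0.44, 0.45, 0.45, 0.45, 0.46, 0.44,
            0.47, 0.48, 0.47, 0.49],
    'volume': [100, 80, 0, 50, 0, 0, 120, 60, 90, 40, 70, 110],
})

cleaned = clean_prediction_market_data(raw)
print(cleaned[['timestamp', 'last_price', 'bid', 'ask', 'is_stale']].head())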

26.4.3 Point-in-Time Databases

A point-in-time database stores data exactly as it was known at each historical moment, including corrections and revisions. This is critical for avoiding lookahead bias in data that gets revised after the fact.

For prediction markets, the key point-in-time considerations are:

  • Market metadata changes. Resolution criteria sometimes get clarified after market creation. Your backtest should use the criteria as they were known at each point in time.
  • Price corrections. Some platforms adjust prices after erroneous trades. Your backtest should use the original prices, since those are what you would have traded on.
  • Platform rule changes. Fee structures, position limits, and trading hours change over time.

class PointInTimeDataHandler(DataHandler):
    """Data handler that enforces point-in-time correctness."""

    def __init__(self, df: pd.DataFrame, market_metadata: pd.DataFrame):
        """
        df: Time-series data with columns:
            timestamp, market_id, last_price, bid, ask, volume, ...
        market_metadata: Market-level data with columns:
            market_id, creation_date, close_date, resolution_date,
            resolution, ...
        """
        self.df = df.sort_values('timestamp').reset_index(drop=True)
        self.metadata = market_metadata
        self.current_index = 0
        self.history: Dict[str, List[MarketDataEvent]] = {}

    def has_next(self) -> bool:
        return self.current_index < len(self.df)

    def get_next(self) -> MarketDataEvent:
        row = self.df.iloc[self.current_index]
        self.current_index += 1

        event = MarketDataEvent(
            timestamp=row['timestamp'],
            market_id=row['market_id'],
            price=row['last_price'],
            bid=row.get('bid', row['last_price'] - 0.02),
            ask=row.get('ask', row['last_price'] + 0.02),
            volume=row.get('volume', 0),
            bid_size=row.get('bid_size', 0),
            ask_size=row.get('ask_size', 0),
        )

        # Store in history (only past data accessible)
        market_id = event.market_id
        if market_id not in self.history:
            self.history[market_id] = []
        self.history[market_id].append(event)

        return event

    def get_latest(self, market_id: str,
                   n: int = 1) -> List[MarketDataEvent]:
        """Return only historical data --- no lookahead possible."""
        if market_id not in self.history:
            return []
        return self.history[market_id][-n:]

    def get_active_markets(self, as_of: datetime) -> List[str]:
        """Return markets that are active (created but not yet
        resolved) as of the given timestamp."""
        active = self.metadata[
            (self.metadata['creation_date'] <= as_of) &
            (self.metadata['resolution_date'] > as_of)
        ]
        return active['market_id'].tolist()

26.4.4 Data Sources

Where to obtain prediction market data for backtesting:

| Source | Type | Coverage | Access |
| --- | --- | --- | --- |
| Polymarket API | REST/WebSocket | Crypto, politics, events | Free API |
| Kalshi API | REST | Economics, weather, events | Free API |
| Metaculus | REST | Science, geopolitics | Free API |
| PredictIt (historical) | CSV downloads | US politics | Public datasets |
| Manifold Markets | REST | Broad coverage | Free API |
| Academic datasets | Various | Historical elections | Research archives |

When building your own data pipeline, ensure you capture snapshots at regular intervals rather than relying solely on trade data. Many prediction markets have long periods with no trades but meaningful bid-ask changes.
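
A minimal snapshot-collection loop might look like the sketch below. The fetch_quote callable is a placeholder for whatever platform API client you use (it is not a real library function); the point is simply to persist bid/ask snapshots on a fixed schedule rather than only when trades print:

import time

def collect_snapshots(market_ids, fetch_quote, interval_seconds=300,
                      out_path='snapshots.csv'):
    """Poll best bid/ask for each market on a fixed schedule.
    fetch_quote(market_id) is assumed to return a dict with
    'bid', 'ask', 'last_price', and 'volume' keys."""
    while True:
        rows = []
        now = pd.Timestamp.now(tz='UTC')
        for market_id in market_ids:
            quote = fetch_quote(market_id)   # placeholder API call
            rows.append({'timestamp': now, 'market_id': market_id, **quote})
        # Append this round of snapshots to the on-disk store
        pd.DataFrame(rows).to_csv(out_path, mode='a', header=False, index=False)
        time.sleep(interval_seconds)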


26.5 Fill Simulation and Execution Modeling

26.5.1 Why Fill Simulation Matters

The gap between "the price was 0.45" and "I could have bought at 0.45" is where many promising strategies die. In prediction markets, this gap can be enormous because:

  1. Liquidity is thin. Many markets have only a few hundred dollars on each side of the book.
  2. Spreads are wide. A 3--5 cent spread on a binary contract is 3--5% of the contract's value.
  3. Order books are shallow. Your order may consume all available liquidity at the best price and "walk" the book.
  4. Latency exists. Between signal generation and order execution, the market may move.

26.5.2 Components of Execution Cost

The total cost of executing a trade is:

$$C_{total} = C_{spread} + C_{impact} + C_{slippage} + C_{fees}$$

Where:

  • $C_{spread}$: The cost of crossing the bid-ask spread. For a buy order, you pay the ask price rather than the mid-price.
  • $C_{impact}$: The permanent price impact of your order on the market. Larger orders move the price more.
  • $C_{slippage}$: The difference between the expected execution price and the actual execution price due to price movement during execution.
  • $C_{fees}$: Platform trading fees and settlement costs.

26.5.3 Slippage Models

Constant Slippage Model:

The simplest approach: assume a fixed number of cents of slippage per trade.

$$P_{execution} = P_{observed} + s \cdot \text{side}$$

Where $s$ is the constant slippage (e.g., 0.01 for 1 cent) and side is +1 for buys, -1 for sells.

Spread-Based Slippage Model:

More realistic: execute at the bid (for sells) or ask (for buys), plus additional impact.

$$P_{buy} = P_{ask} + \alpha \cdot \frac{Q}{Q_{ask}}$$

$$P_{sell} = P_{bid} - \alpha \cdot \frac{Q}{Q_{bid}}$$

Where $Q$ is the order quantity, $Q_{ask}$ and $Q_{bid}$ are the available sizes, and $\alpha$ is the market impact coefficient.

Square-Root Impact Model:

For larger orders, empirical research suggests impact scales with the square root of order size:

$$C_{impact} = \sigma \cdot \beta \cdot \sqrt{\frac{Q}{V}}$$

Where $\sigma$ is the price volatility, $\beta$ is a calibration constant, $Q$ is order size, and $V$ is daily volume.
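
The three models can produce meaningfully different estimates for the same order. A small comparison sketch, with all parameter values purely illustrative:

def compare_slippage_models(last_price=0.45, ask=0.46, ask_size=200,
                            order_qty=100, s=0.01, alpha=0.02,
                            sigma=0.03, beta=0.5, daily_volume=2000):
    """Estimated buy execution price under each slippage model.
    For comparability, the square-root impact is added on top of the ask."""
    constant = last_price + s
    spread_based = ask + alpha * (order_qty / ask_size)
    sqrt_impact = ask + sigma * beta * np.sqrt(order_qty / daily_volume)
    return {
        'constant': round(constant, 4),
        'spread_based': round(spread_based, 4),
        'sqrt_impact': round(sqrt_impact, 4),
    }

print("Estimated buy prices:", compare_slippage_models())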

26.5.4 Python Execution Simulator

class RealisticExecutionSimulator(ExecutionSimulator):
    """Simulates realistic execution with slippage, impact, and fees."""

    def __init__(self,
                 fee_rate: float = 0.02,
                 impact_coeff: float = 0.1,
                 latency_ms: float = 100,
                 partial_fill_prob: float = 0.1,
                 use_sqrt_impact: bool = True):
        """
        fee_rate: Transaction fee as fraction of notional
        impact_coeff: Market impact coefficient (beta)
        latency_ms: Simulated latency in milliseconds
        partial_fill_prob: Probability of partial fill
        use_sqrt_impact: Use square-root impact model
        """
        self.fee_rate = fee_rate
        self.impact_coeff = impact_coeff
        self.latency_ms = latency_ms
        self.partial_fill_prob = partial_fill_prob
        self.use_sqrt_impact = use_sqrt_impact

    def execute(self, order: OrderEvent,
                market_data: MarketDataEvent) -> Optional[FillEvent]:
        """Simulate order execution with realistic costs."""

        # Step 1: Determine base execution price
        if order.side == OrderSide.BUY:
            base_price = market_data.ask  # Pay the ask
            available_size = market_data.ask_size
        else:
            base_price = market_data.bid  # Receive the bid
            available_size = market_data.bid_size

        # Step 2: Check if limit order would fill
        if order.order_type == OrderType.LIMIT:
            if order.side == OrderSide.BUY:
                if order.limit_price < base_price:
                    return None  # Would not fill
            else:
                if order.limit_price > base_price:
                    return None  # Would not fill

        # Step 3: Determine fill quantity
        if available_size > 0:
            fill_qty = min(order.quantity, available_size)
        else:
            # If no size data, assume we can fill but with impact
            fill_qty = order.quantity

        # Simulate partial fills
        if np.random.random() < self.partial_fill_prob:
            fill_qty = fill_qty * np.random.uniform(0.3, 0.9)
            fill_qty = max(1, int(fill_qty))

        # Step 4: Calculate market impact
        if self.use_sqrt_impact and available_size > 0:
            # Square-root impact model
            participation_rate = fill_qty / max(available_size, 1)
            impact = self.impact_coeff * np.sqrt(participation_rate)
        else:
            # Linear impact model
            if available_size > 0:
                impact = self.impact_coeff * (
                    fill_qty / available_size
                )
            else:
                impact = self.impact_coeff * 0.5

        # Step 5: Calculate execution price
        if order.side == OrderSide.BUY:
            fill_price = base_price + impact
            fill_price = min(fill_price, 0.99)  # Cap at 0.99
        else:
            fill_price = base_price - impact
            fill_price = max(fill_price, 0.01)  # Floor at 0.01

        # Step 6: For limit orders, cap fill price
        if order.order_type == OrderType.LIMIT:
            if order.side == OrderSide.BUY:
                fill_price = min(fill_price, order.limit_price)
            else:
                fill_price = max(fill_price, order.limit_price)

        # Step 7: Calculate commission
        notional = fill_qty * fill_price
        commission = notional * self.fee_rate

        # Step 8: Calculate slippage
        mid_price = (market_data.bid + market_data.ask) / 2
        if order.side == OrderSide.BUY:
            slippage = fill_price - mid_price
        else:
            slippage = mid_price - fill_price

        return FillEvent(
            timestamp=order.timestamp,
            market_id=order.market_id,
            side=order.side,
            quantity=fill_qty,
            fill_price=fill_price,
            commission=commission,
            slippage=slippage,
        )

26.5.5 Validating Your Fill Model

How do you know if your fill simulation is realistic? Compare simulated fills to actual fills:

def validate_fill_model(simulated_fills: List[FillEvent],
                        actual_fills: List[Dict]) -> Dict:
    """Compare simulated fills against actual execution data."""

    sim_prices = [f.fill_price for f in simulated_fills]
    actual_prices = [f['fill_price'] for f in actual_fills]

    # Price deviation
    deviations = [abs(s - a) for s, a in zip(sim_prices, actual_prices)]

    metrics = {
        'mean_deviation': np.mean(deviations),
        'median_deviation': np.median(deviations),
        'max_deviation': np.max(deviations),
        'correlation': np.corrcoef(sim_prices, actual_prices)[0, 1],
        'sim_avg_cost': np.mean([f.slippage + f.commission
                                 for f in simulated_fills]),
    }

    return metrics

26.6 Transaction Cost Modeling

26.6.1 Fee Structures by Platform

Different prediction market platforms have different fee structures. Your backtest must model the correct fees for the platform you intend to trade on.

| Platform | Trading Fee | Settlement Fee | Other |
| --- | --- | --- | --- |
| Polymarket | 0% maker / ~2% taker | None | Gas fees for on-chain |
| Kalshi | 0% maker / variable taker | None | Withdrawal fees |
| PredictIt | 5% on profits | 5% on withdrawals | $850 position limit |
| Manifold | Play money | N/A | N/A |

26.6.2 Total Cost Components

@dataclass
class TransactionCostModel:
    """Models all components of transaction costs."""

    # Platform fees
    maker_fee_rate: float = 0.00   # Fee for providing liquidity
    taker_fee_rate: float = 0.02   # Fee for taking liquidity
    settlement_fee_rate: float = 0.0  # Fee on winning resolution
    withdrawal_fee_rate: float = 0.0  # Fee on withdrawals

    # Market impact
    spread_cost_model: str = "empirical"  # or "fixed"
    fixed_half_spread: float = 0.02       # Fixed half-spread

    # Opportunity cost
    risk_free_rate: float = 0.05  # Annual risk-free rate

    def calculate_total_cost(self,
                             order_side: OrderSide,
                             quantity: float,
                             fill_price: float,
                             bid: float,
                             ask: float,
                             is_maker: bool,
                             holding_days: float,
                             profit: float = 0.0) -> Dict[str, float]:
        """Calculate complete transaction costs for a trade."""

        notional = quantity * fill_price

        # 1. Trading fee
        fee_rate = self.maker_fee_rate if is_maker else self.taker_fee_rate
        trading_fee = notional * fee_rate

        # 2. Spread cost (cost of crossing the spread)
        mid = (bid + ask) / 2
        if self.spread_cost_model == "empirical":
            if order_side == OrderSide.BUY:
                spread_cost = (fill_price - mid) * quantity
            else:
                spread_cost = (mid - fill_price) * quantity
        else:
            spread_cost = self.fixed_half_spread * quantity

        spread_cost = max(spread_cost, 0)

        # 3. Settlement fee (on winning trades)
        settlement_fee = max(profit, 0) * self.settlement_fee_rate

        # 4. Opportunity cost of capital lock-up
        # Capital locked = max possible loss
        if order_side == OrderSide.BUY:
            capital_locked = fill_price * quantity
        else:
            capital_locked = (1.0 - fill_price) * quantity

        daily_risk_free = (1 + self.risk_free_rate) ** (1/365) - 1
        opportunity_cost = capital_locked * daily_risk_free * holding_days

        total = trading_fee + spread_cost + settlement_fee + opportunity_cost

        return {
            'trading_fee': trading_fee,
            'spread_cost': spread_cost,
            'settlement_fee': settlement_fee,
            'opportunity_cost': opportunity_cost,
            'total_cost': total,
            'cost_as_pct_notional': total / notional if notional > 0 else 0,
        }
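
A usage sketch for a single taker buy, with quote and holding-period values chosen purely for illustration:

cost_model = TransactionCostModel(taker_fee_rate=0.02,
                                  settlement_fee_rate=0.05)

costs = cost_model.calculate_total_cost(
    order_side=OrderSide.BUY,
    quantity=100,
    fill_price=0.47,
    bid=0.45,
    ask=0.47,
    is_maker=False,
    holding_days=30,
    profit=10.0,        # realized profit, used for the settlement fee
)

for name, value in costs.items():
    print(f"  {name}: {value:.4f}")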

26.6.3 Impact of Costs on Strategy Viability

Let us quantify how transaction costs affect strategy performance:

def cost_sensitivity_analysis(gross_returns: np.ndarray,
                              cost_scenarios: List[float],
                              trades_per_year: int) -> pd.DataFrame:
    """Analyze strategy viability across different cost assumptions."""

    results = []
    for cost_per_trade in cost_scenarios:
        total_annual_cost = cost_per_trade * trades_per_year
        net_returns = gross_returns - cost_per_trade

        annual_net = net_returns.mean() * trades_per_year
        annual_vol = net_returns.std() * np.sqrt(trades_per_year)
        sharpe = annual_net / annual_vol if annual_vol > 0 else 0

        results.append({
            'cost_per_trade': cost_per_trade,
            'annual_cost': total_annual_cost,
            'annual_net_return': annual_net,
            'sharpe_ratio': sharpe,
            'profitable': annual_net > 0,
        })

    return pd.DataFrame(results)

# Example: Strategy with 2% gross return per trade
gross_returns = np.random.normal(0.02, 0.05, 1000)  # 2% mean, 5% vol
cost_scenarios = [0.001, 0.005, 0.01, 0.02, 0.03, 0.05]

results = cost_sensitivity_analysis(gross_returns, cost_scenarios,
                                     trades_per_year=200)
print("Cost Sensitivity Analysis:")
print(results.to_string(index=False))

This analysis often reveals that strategies that appear highly profitable under zero-cost assumptions become marginal or unprofitable when realistic costs are included. This is especially true in prediction markets where spreads are wide and fees can be substantial.

26.6.4 The Opportunity Cost Problem

A unique challenge in prediction markets is the opportunity cost of capital lock-up. When you buy a YES contract at $0.30, you lock up $0.30 per contract until the market resolves. If the market does not resolve for six months, that capital is earning 0% while it could be earning the risk-free rate (or deployed in other opportunities).

For longer-duration markets, this opportunity cost can be substantial:

$$C_{opportunity} = P_{entry} \times Q \times r_f \times T$$

Where $P_{entry}$ is the entry price, $Q$ is the quantity, $r_f$ is the annual risk-free rate, and $T$ is the time to resolution in years.

A position bought at $0.50 and held for six months with a 5% risk-free rate incurs an opportunity cost of $0.50 \times 0.05 \times 0.5 = \$0.0125$ per contract --- seemingly small, but that is 1.25% of the $1 face value and 2.5% of the capital actually locked in the position. For strategies with thin edges, this cost matters.
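
Because the cost scales linearly with time to resolution, it is worth tabulating it across holding periods before entering long-dated markets. A quick sketch:

entry_price, quantity, risk_free = 0.50, 100, 0.05

for months in [1, 3, 6, 12]:
    opp_cost = entry_price * quantity * risk_free * (months / 12)
    pct_of_capital = opp_cost / (entry_price * quantity)
    print(f"  {months:>2} months: ${opp_cost:.2f} "
          f"({pct_of_capital:.2%} of capital locked)")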


26.7 Walk-Forward Backtesting

26.7.1 The Problem with Simple Backtesting

A simple backtest optimizes parameters on the entire dataset and reports performance on that same dataset. This conflates in-sample and out-of-sample performance and maximizes overfitting risk.

The solution is walk-forward backtesting, which systematically separates training and testing periods.

26.7.2 Walk-Forward Methodology

The walk-forward procedure divides the historical data into sequential segments:

Time -->
|---Train 1---|--Test 1--|
      |---Train 2---|--Test 2--|
            |---Train 3---|--Test 3--|
                  |---Train 4---|--Test 4--|

At each step:

  1. Train (in-sample): Optimize strategy parameters on the training window.
  2. Test (out-of-sample): Apply the optimized parameters to the next unseen period.
  3. Advance: Slide the window forward and repeat.

The final performance metric is the concatenation of all out-of-sample test periods. This ensures that every data point in the final performance track record was genuinely out-of-sample at the time of evaluation.

26.7.3 Anchored vs. Rolling Walk-Forward

Rolling walk-forward: The training window is a fixed size and slides forward. Older data drops out of the training set.

Anchored walk-forward: The training window always starts from the beginning of the data. As time progresses, the training set grows.

| Approach | Training Data | Advantage | Disadvantage |
| --- | --- | --- | --- |
| Rolling | Fixed-size window | Adapts to regime changes | Less data for training |
| Anchored | All data up to test | More training data | Slow to adapt to changes |
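
The only difference between the two schemes is how each training window's start index is chosen. A small helper that generates the (train, test) index ranges for both, mirroring the stepping logic used by the engine below, makes this explicit:

def walk_forward_windows(n_obs, train_size, test_size, anchored=False):
    """Yield (train_start, train_end, test_start, test_end) index tuples."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train_start = 0 if anchored else start
        train_end = start + train_size
        yield (train_start, train_end, train_end, train_end + test_size)
        start += test_size   # advance by one test window

print("Rolling: ", list(walk_forward_windows(100, 40, 20)))
print("Anchored:", list(walk_forward_windows(100, 40, 20, anchored=True)))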

26.7.4 Python Walk-Forward Engine

class WalkForwardEngine:
    """Walk-forward backtesting with parameter optimization."""

    def __init__(self,
                 data: pd.DataFrame,
                 strategy_class: type,
                 param_grid: Dict[str, List],
                 train_size: int,
                 test_size: int,
                 anchored: bool = False,
                 optimization_metric: str = 'sharpe_ratio'):
        """
        data: DataFrame with market data (must include a 'return' column)
        strategy_class: Strategy class to instantiate
        param_grid: Dict of parameter names to lists of values
        train_size: Number of periods in training window
        test_size: Number of periods in test window
        anchored: If True, training starts from beginning
        optimization_metric: Metric to maximize during training
        """
        self.data = data
        self.strategy_class = strategy_class
        self.param_grid = param_grid
        self.train_size = train_size
        self.test_size = test_size
        self.anchored = anchored
        self.optimization_metric = optimization_metric

    def _generate_param_combinations(self) -> List[Dict]:
        """Generate all combinations of parameters."""
        from itertools import product

        keys = list(self.param_grid.keys())
        values = list(self.param_grid.values())

        combinations = []
        for combo in product(*values):
            combinations.append(dict(zip(keys, combo)))

        return combinations

    def _evaluate_params(self, train_data: pd.DataFrame,
                         params: Dict) -> float:
        """Evaluate a parameter set on training data.
        Returns the optimization metric value."""

        # Create strategy with these parameters
        strategy = self.strategy_class(**params)

        # Run simple vectorized backtest on training data
        signals = strategy.generate_signals(train_data)
        returns = signals.shift(1) * train_data['return']
        returns = returns.dropna()

        if len(returns) == 0 or returns.std() == 0:
            return -np.inf

        if self.optimization_metric == 'sharpe_ratio':
            return returns.mean() / returns.std() * np.sqrt(252)
        elif self.optimization_metric == 'total_return':
            return returns.sum()
        elif self.optimization_metric == 'calmar_ratio':
            cumulative = (1 + returns).cumprod()
            max_dd = (cumulative / cumulative.cummax() - 1).min()
            annual_return = returns.mean() * 252
            return annual_return / abs(max_dd) if max_dd != 0 else 0
        else:
            return returns.mean() / returns.std() * np.sqrt(252)

    def run(self) -> Dict:
        """Run walk-forward analysis."""

        n = len(self.data)
        param_combinations = self._generate_param_combinations()

        results = {
            'windows': [],
            'best_params': [],
            'in_sample_metrics': [],
            'out_of_sample_returns': [],
        }

        step = 0
        start_idx = 0

        while start_idx + self.train_size + self.test_size <= n:
            # Define train and test windows
            if self.anchored:
                train_start = 0
            else:
                train_start = start_idx

            train_end = start_idx + self.train_size
            test_start = train_end
            test_end = min(test_start + self.test_size, n)

            train_data = self.data.iloc[train_start:train_end]
            test_data = self.data.iloc[test_start:test_end]

            # Optimize parameters on training data
            best_metric = -np.inf
            best_params = None

            for params in param_combinations:
                metric = self._evaluate_params(train_data, params)
                if metric > best_metric:
                    best_metric = metric
                    best_params = params

            # Apply best parameters to test data
            strategy = self.strategy_class(**best_params)
            test_signals = strategy.generate_signals(test_data)
            test_returns = test_signals.shift(1) * test_data['return']
            test_returns = test_returns.dropna()

            results['windows'].append({
                'step': step,
                'train_start': self.data.index[train_start],
                'train_end': self.data.index[train_end - 1],
                'test_start': self.data.index[test_start],
                'test_end': self.data.index[test_end - 1],
            })
            results['best_params'].append(best_params)
            results['in_sample_metrics'].append(best_metric)
            results['out_of_sample_returns'].append(test_returns)

            # Advance the window
            start_idx += self.test_size
            step += 1

        # Concatenate all out-of-sample returns
        if results['out_of_sample_returns']:
            all_oos_returns = pd.concat(results['out_of_sample_returns'])

            results['overall_metrics'] = {
                'total_return': (1 + all_oos_returns).prod() - 1,
                'annual_return': all_oos_returns.mean() * 252,
                'sharpe_ratio': (all_oos_returns.mean() /
                                all_oos_returns.std() * np.sqrt(252)
                                if all_oos_returns.std() > 0 else 0),
                'max_drawdown': self._max_drawdown(all_oos_returns),
                'num_trades': (all_oos_returns != 0).sum(),
                'num_windows': step,
            }

            # Check parameter stability
            results['parameter_stability'] = self._check_param_stability(
                results['best_params']
            )

        return results

    def _max_drawdown(self, returns: pd.Series) -> float:
        """Calculate maximum drawdown from return series."""
        cumulative = (1 + returns).cumprod()
        rolling_max = cumulative.cummax()
        drawdown = cumulative / rolling_max - 1
        return drawdown.min()

    def _check_param_stability(self,
                                param_history: List[Dict]) -> Dict:
        """Check if optimal parameters are stable across windows."""
        stability = {}

        if not param_history:
            return stability

        keys = param_history[0].keys()
        for key in keys:
            values = [p[key] for p in param_history]
            stability[key] = {
                'values': values,
                'unique_count': len(set(values)),
                'most_common': max(set(values), key=values.count),
                'stability_ratio': values.count(
                    max(set(values), key=values.count)
                ) / len(values),
            }

        return stability

26.7.5 Interpreting Walk-Forward Results

Key metrics to examine from walk-forward analysis:

  1. In-sample vs. out-of-sample performance gap. A large gap suggests overfitting. If in-sample Sharpe is 3.0 but out-of-sample Sharpe is 0.3, the strategy is overfit.

  2. Parameter stability. If the optimal parameters change dramatically from one window to the next, the strategy is likely fitting noise. Stable parameters suggest a genuine, persistent edge.

  3. Performance consistency across windows. A strategy that makes all its money in one window and loses in the rest is not robust.

  4. Degradation over time. If each successive window shows worse performance, the market inefficiency may be closing.


26.8 Performance Metrics for Prediction Market Strategies

26.8.1 Return Metrics

Total Return:

$$R_{total} = \frac{V_{final} - V_{initial}}{V_{initial}}$$

Annualized Return (CAGR):

$$R_{annual} = \left(\frac{V_{final}}{V_{initial}}\right)^{252/n} - 1$$

Where $n$ is the number of trading days. Note: prediction markets trade 24/7, so you might use 365 instead of 252.

Log Return:

$$r = \ln\left(\frac{V_{final}}{V_{initial}}\right)$$

Log returns are additive across time, making them more convenient for statistical analysis.

26.8.2 Risk Metrics

Maximum Drawdown:

The maximum drawdown (MDD) is the largest peak-to-trough decline in portfolio value:

$$MDD = \min_{t} \left(\frac{V_t}{\max_{s \leq t} V_s} - 1\right)$$

This is arguably the most important risk metric for practitioners because it represents the worst psychological and financial pain the strategy inflicts.

Volatility (Standard Deviation of Returns):

$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (r_i - \bar{r})^2}$$

Annualized: $\sigma_{annual} = \sigma_{daily} \times \sqrt{252}$

Value at Risk (VaR):

The loss that is exceeded with probability $\alpha$ (typically 5%):

$$VaR_\alpha = -\text{quantile}(r, \alpha)$$

Conditional VaR (CVaR / Expected Shortfall):

The expected loss given that the loss exceeds VaR:

$$CVaR_\alpha = -E[r \mid r < -VaR_\alpha]$$

26.8.3 Risk-Adjusted Return Metrics

Sharpe Ratio:

$$S = \frac{R_p - R_f}{\sigma_p}$$

Where $R_p$ is the portfolio return, $R_f$ is the risk-free rate, and $\sigma_p$ is the portfolio standard deviation. Annualized:

$$S_{annual} = \frac{(\bar{r} - r_f) \times 252}{\sigma_{daily} \times \sqrt{252}} = \frac{\bar{r} - r_f}{\sigma_{daily}} \times \sqrt{252}$$

Interpretation of Sharpe ratios in prediction markets:

| Sharpe | Interpretation |
| --- | --- |
| < 0 | Losing money |
| 0.0 -- 0.5 | Poor (not worth the effort) |
| 0.5 -- 1.0 | Acceptable for a simple strategy |
| 1.0 -- 2.0 | Good --- genuine edge likely present |
| 2.0 -- 3.0 | Excellent --- verify carefully for errors |
| > 3.0 | Suspicious --- likely a backtest error or overfitting |

Sortino Ratio:

Like Sharpe, but penalizes only downside volatility:

$$Sortino = \frac{R_p - R_f}{\sigma_{downside}}$$

Where $\sigma_{downside} = \sqrt{\frac{1}{n}\sum_{r_i < 0} r_i^2}$

Calmar Ratio:

$$Calmar = \frac{R_{annual}}{|MDD|}$$

Measures return per unit of maximum drawdown. Particularly relevant for strategies where drawdowns determine survival.

26.8.4 Trade-Level Metrics

Win Rate:

$$W = \frac{\text{Number of winning trades}}{\text{Total number of trades}}$$

Profit Factor:

$$PF = \frac{\sum \text{Winning trade profits}}{\sum |\text{Losing trade losses}|}$$

A profit factor above 1.0 means the strategy is profitable. Above 1.5 is good. Above 2.0 is excellent.

Expectancy (Average Profit per Trade):

$$E = W \times \bar{G} - (1 - W) \times \bar{L}$$

Where $\bar{G}$ is the average win and $\bar{L}$ is the average loss. This tells you how much you expect to make on each trade.

Average Win / Average Loss Ratio:

$$\text{Win/Loss Ratio} = \frac{\bar{G}}{\bar{L}}$$

Combined with win rate, this tells the full story:

  • High win rate + low W/L ratio = Many small wins, few large losses (prone to blow-ups).
  • Low win rate + high W/L ratio = Few large wins, many small losses (requires psychological resilience).
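
A short numeric sketch ties these trade-level metrics together; here a low win rate combined with a high win/loss ratio still yields positive expectancy (the figures are illustrative, not from a real backtest):

win_rate, avg_win, avg_loss = 0.35, 0.30, 0.08   # per-contract P&L figures

expectancy = win_rate * avg_win - (1 - win_rate) * avg_loss
win_loss_ratio = avg_win / avg_loss
# With per-trade averages, the profit factor reduces to this ratio
profit_factor = (win_rate * avg_win) / ((1 - win_rate) * avg_loss)

print(f"Expectancy per trade: {expectancy:.4f}")   # ~0.053
print(f"Win/Loss ratio:       {win_loss_ratio:.2f}")
print(f"Profit factor:        {profit_factor:.2f}")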

26.8.5 Python Metrics Suite

class PerformanceMetrics:
    """Comprehensive performance metrics for prediction market strategies."""

    def __init__(self, equity_curve: pd.Series,
                 trades: List[Dict],
                 risk_free_rate: float = 0.05,
                 periods_per_year: int = 365):
        """
        equity_curve: Series indexed by timestamp with portfolio values
        trades: List of trade dicts with 'pnl', 'entry_time', 'exit_time'
        risk_free_rate: Annual risk-free rate
        periods_per_year: 252 for daily trading days, 365 for calendar
        """
        self.equity = equity_curve
        self.trades = trades
        self.rf = risk_free_rate
        self.ppy = periods_per_year
        self.returns = equity_curve.pct_change().dropna()

    def total_return(self) -> float:
        return (self.equity.iloc[-1] / self.equity.iloc[0]) - 1

    def annualized_return(self) -> float:
        n_periods = len(self.returns)
        total = 1 + self.total_return()
        return total ** (self.ppy / n_periods) - 1

    def volatility(self) -> float:
        return self.returns.std() * np.sqrt(self.ppy)

    def sharpe_ratio(self) -> float:
        excess_return = self.annualized_return() - self.rf
        vol = self.volatility()
        return excess_return / vol if vol > 0 else 0

    def sortino_ratio(self) -> float:
        downside = self.returns[self.returns < 0]
        if len(downside) == 0:
            return np.inf
        # Downside deviation over all n observations, matching the formula above.
        downside_vol = (np.sqrt((downside ** 2).sum() / len(self.returns))
                        * np.sqrt(self.ppy))
        excess_return = self.annualized_return() - self.rf
        return excess_return / downside_vol if downside_vol > 0 else 0

    def max_drawdown(self) -> float:
        cumulative = (1 + self.returns).cumprod()
        rolling_max = cumulative.cummax()
        drawdown = cumulative / rolling_max - 1
        return drawdown.min()

    def max_drawdown_duration(self) -> int:
        """Maximum number of periods spent in drawdown."""
        cumulative = (1 + self.returns).cumprod()
        rolling_max = cumulative.cummax()
        in_drawdown = cumulative < rolling_max

        max_duration = 0
        current_duration = 0
        for is_dd in in_drawdown:
            if is_dd:
                current_duration += 1
                max_duration = max(max_duration, current_duration)
            else:
                current_duration = 0

        return max_duration

    def calmar_ratio(self) -> float:
        mdd = abs(self.max_drawdown())
        if mdd == 0:
            return np.inf
        return self.annualized_return() / mdd

    def var(self, alpha: float = 0.05) -> float:
        return -np.percentile(self.returns, alpha * 100)

    def cvar(self, alpha: float = 0.05) -> float:
        var = self.var(alpha)
        tail = self.returns[self.returns <= -var]
        return -tail.mean() if len(tail) > 0 else var

    def win_rate(self) -> float:
        if not self.trades:
            return 0
        wins = sum(1 for t in self.trades if t['pnl'] > 0)
        return wins / len(self.trades)

    def profit_factor(self) -> float:
        gross_profit = sum(t['pnl'] for t in self.trades if t['pnl'] > 0)
        gross_loss = abs(sum(t['pnl'] for t in self.trades if t['pnl'] < 0))
        return gross_profit / gross_loss if gross_loss > 0 else np.inf

    def expectancy(self) -> float:
        if not self.trades:
            return 0
        return np.mean([t['pnl'] for t in self.trades])

    def avg_win_loss_ratio(self) -> float:
        wins = [t['pnl'] for t in self.trades if t['pnl'] > 0]
        losses = [abs(t['pnl']) for t in self.trades if t['pnl'] < 0]

        avg_win = np.mean(wins) if wins else 0
        avg_loss = np.mean(losses) if losses else 0

        return avg_win / avg_loss if avg_loss > 0 else np.inf

    def avg_holding_period(self) -> float:
        """Average time in position (in periods)."""
        if not self.trades:
            return 0
        durations = []
        for t in self.trades:
            if 'entry_time' in t and 'exit_time' in t:
                duration = (t['exit_time'] - t['entry_time']).total_seconds()
                durations.append(duration / 86400)  # Convert to days
        return np.mean(durations) if durations else 0

    def monthly_returns(self) -> pd.Series:
        """Compute monthly return series."""
        monthly = self.equity.resample('M').last()
        return monthly.pct_change().dropna()

    def compute_all(self) -> Dict:
        """Compute all metrics and return as dictionary."""
        return {
            'total_return': self.total_return(),
            'annualized_return': self.annualized_return(),
            'volatility': self.volatility(),
            'sharpe_ratio': self.sharpe_ratio(),
            'sortino_ratio': self.sortino_ratio(),
            'calmar_ratio': self.calmar_ratio(),
            'max_drawdown': self.max_drawdown(),
            'max_drawdown_duration': self.max_drawdown_duration(),
            'var_5pct': self.var(0.05),
            'cvar_5pct': self.cvar(0.05),
            'win_rate': self.win_rate(),
            'profit_factor': self.profit_factor(),
            'expectancy': self.expectancy(),
            'avg_win_loss_ratio': self.avg_win_loss_ratio(),
            'avg_holding_period_days': self.avg_holding_period(),
            'total_trades': len(self.trades),
        }
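
A minimal usage sketch, assuming the equity curve and trade list come from the backtester built earlier in this chapter; the synthetic series and trade values below are purely illustrative:

import numpy as np
import pandas as pd

# Illustrative inputs: one year of daily portfolio values and three trades.
dates = pd.date_range('2024-01-01', periods=365, freq='D')
rng = np.random.default_rng(42)
equity = pd.Series(10_000 * np.cumprod(1 + rng.normal(0.0005, 0.01, 365)),
                   index=dates)
trades = [
    {'pnl': 12.50, 'entry_time': dates[10], 'exit_time': dates[13]},
    {'pnl': -8.00, 'entry_time': dates[40], 'exit_time': dates[42]},
    {'pnl': 20.00, 'entry_time': dates[90], 'exit_time': dates[97]},
]

metrics = PerformanceMetrics(equity, trades, periods_per_year=365)
for name, value in metrics.compute_all().items():
    print(f"{name:26s} {value}")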

26.9 Statistical Significance of Backtest Results

26.9.1 The Fundamental Question: Luck or Skill?

A backtest shows a 15% annual return and a Sharpe ratio of 1.2. Is this genuine alpha, or could a strategy with no edge have produced this result by chance?

This question is critical because the space of possible strategies is vast, and even random strategies will occasionally produce impressive-looking results. We need formal statistical tests to distinguish signal from noise.

26.9.2 The t-Test for Returns

The simplest test: is the mean return significantly different from zero?

$$t = \frac{\bar{r}}{\sigma_r / \sqrt{n}}$$

Where $\bar{r}$ is the mean return, $\sigma_r$ is the standard deviation of returns, and $n$ is the number of observations.

Under the null hypothesis of zero expected return, $t$ follows a Student's t-distribution with $n-1$ degrees of freedom.

Rule of thumb: You need at least 30 trades for this test to be meaningful, and 100+ trades for reliable results.
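
A minimal sketch of this test using SciPy, applied to per-trade returns; the return values below are placeholders for your own backtest output:

import numpy as np
from scipy import stats

# Placeholder per-trade returns from a backtest.
trade_returns = np.array([0.04, -0.02, 0.03, 0.01, -0.05, 0.06, 0.02,
                          -0.01, 0.03, -0.02, 0.05, 0.01, -0.03, 0.04])

# One-sample t-test of the null hypothesis that the mean return is zero.
t_stat, p_two_sided = stats.ttest_1samp(trade_returns, popmean=0.0)
# One-sided p-value (we only care about a positive mean return).
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.4f}")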

26.9.3 Permutation Tests

A permutation test answers: "If the timing of my trades were random, how often would I see a result this good?"

The procedure:

  1. Calculate the actual strategy performance metric $M_{actual}$.
  2. Randomly shuffle the assignment of signals to returns (breaking the temporal relationship).
  3. Calculate the metric on the shuffled data: $M_{shuffled}$.
  4. Repeat steps 2--3 many times (e.g., 10,000 permutations).
  5. The p-value is the fraction of permutations where $M_{shuffled} \geq M_{actual}$.

def permutation_test(returns: np.ndarray,
                     signals: np.ndarray,
                     n_permutations: int = 10000,
                     metric: str = 'sharpe') -> Dict:
    """
    Test if strategy performance is significantly better than random.

    returns: Array of market returns
    signals: Array of strategy signals (+1, -1, 0)
    n_permutations: Number of random permutations
    metric: 'sharpe', 'total_return', or 'profit_factor'
    """

    # Calculate actual strategy returns
    strategy_returns = signals * returns

    # Calculate actual metric
    if metric == 'sharpe':
        actual_metric = (strategy_returns.mean() /
                        strategy_returns.std() * np.sqrt(252)
                        if strategy_returns.std() > 0 else 0)
    elif metric == 'total_return':
        actual_metric = strategy_returns.sum()
    elif metric == 'profit_factor':
        wins = strategy_returns[strategy_returns > 0].sum()
        losses = abs(strategy_returns[strategy_returns < 0].sum())
        actual_metric = wins / losses if losses > 0 else np.inf

    # Generate permutation distribution
    permuted_metrics = []
    for _ in range(n_permutations):
        # Shuffle signals (break temporal relationship)
        perm_signals = np.random.permutation(signals)
        perm_returns = perm_signals * returns

        if metric == 'sharpe':
            m = (perm_returns.mean() / perm_returns.std() * np.sqrt(252)
                 if perm_returns.std() > 0 else 0)
        elif metric == 'total_return':
            m = perm_returns.sum()
        elif metric == 'profit_factor':
            wins = perm_returns[perm_returns > 0].sum()
            losses = abs(perm_returns[perm_returns < 0].sum())
            m = wins / losses if losses > 0 else np.inf

        permuted_metrics.append(m)

    permuted_metrics = np.array(permuted_metrics)

    # Calculate p-value
    p_value = np.mean(permuted_metrics >= actual_metric)

    return {
        'actual_metric': actual_metric,
        'permutation_mean': permuted_metrics.mean(),
        'permutation_std': permuted_metrics.std(),
        'p_value': p_value,
        'significant_5pct': p_value < 0.05,
        'significant_1pct': p_value < 0.01,
        'percentile': np.mean(permuted_metrics < actual_metric) * 100,
    }
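
A brief usage sketch; the signal and return arrays here are synthetic placeholders, not data from a real market:

rng = np.random.default_rng(0)
market_returns = rng.normal(0, 0.02, size=500)   # synthetic per-period returns
signals = rng.choice([-1, 0, 1], size=500)       # synthetic strategy signals

result = permutation_test(market_returns, signals,
                          n_permutations=5000, metric='sharpe')
print(f"Actual Sharpe: {result['actual_metric']:.2f}")
print(f"Permutation p-value: {result['p_value']:.3f} "
      f"(significant at 5%: {result['significant_5pct']})")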

26.9.4 Bootstrap Confidence Intervals

While permutation tests ask "is the result significantly positive?", bootstrap confidence intervals ask "what range of performance could I reasonably expect?"

def bootstrap_confidence_interval(returns: np.ndarray,
                                   n_bootstrap: int = 10000,
                                   confidence_level: float = 0.95,
                                   metric_func=None) -> Dict:
    """
    Compute bootstrap confidence interval for a performance metric.

    returns: Array of strategy returns
    n_bootstrap: Number of bootstrap samples
    confidence_level: e.g. 0.95 for 95% CI
    metric_func: Function that takes returns array and returns a scalar
    """
    if metric_func is None:
        # Default: Sharpe ratio
        def metric_func(r):
            return r.mean() / r.std() * np.sqrt(252) if r.std() > 0 else 0

    # Generate bootstrap distribution
    bootstrap_metrics = []
    n = len(returns)

    for _ in range(n_bootstrap):
        # Sample with replacement
        sample = np.random.choice(returns, size=n, replace=True)
        bootstrap_metrics.append(metric_func(sample))

    bootstrap_metrics = np.array(bootstrap_metrics)

    alpha = 1 - confidence_level
    lower = np.percentile(bootstrap_metrics, alpha / 2 * 100)
    upper = np.percentile(bootstrap_metrics, (1 - alpha / 2) * 100)

    return {
        'point_estimate': metric_func(returns),
        'bootstrap_mean': bootstrap_metrics.mean(),
        'bootstrap_std': bootstrap_metrics.std(),
        'ci_lower': lower,
        'ci_upper': upper,
        'confidence_level': confidence_level,
        'ci_contains_zero': lower <= 0 <= upper,
    }
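
In practice, the key field is `ci_contains_zero`: if the interval for the Sharpe ratio straddles zero, the backtest does not provide convincing evidence of an edge. A brief usage sketch on illustrative strategy returns:

strategy_returns = np.random.default_rng(1).normal(0.001, 0.015, size=400)

ci = bootstrap_confidence_interval(strategy_returns, n_bootstrap=5000)
print(f"Sharpe point estimate: {ci['point_estimate']:.2f}")
print(f"95% CI: [{ci['ci_lower']:.2f}, {ci['ci_upper']:.2f}]")
print(f"CI contains zero: {ci['ci_contains_zero']}")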

26.9.5 Multiple Comparisons Correction

If you test $k$ strategies, the probability of at least one false positive at significance level $\alpha$ is:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^k$$

For $k = 20$ strategies at $\alpha = 0.05$: $P = 1 - 0.95^{20} = 0.64$. There is a 64% chance of finding a "significant" result even if no strategy has a genuine edge.

Bonferroni correction: Divide $\alpha$ by the number of tests: $\alpha_{adjusted} = \alpha / k$.

Benjamini-Hochberg (BH) procedure: Controls the False Discovery Rate (FDR) rather than the Family-Wise Error Rate. Less conservative than Bonferroni.

def multiple_comparison_correction(p_values: List[float],
                                    method: str = 'bonferroni',
                                    alpha: float = 0.05) -> Dict:
    """
    Correct for multiple comparisons.

    p_values: List of p-values from individual strategy tests
    method: 'bonferroni' or 'bh' (Benjamini-Hochberg)
    alpha: Significance level
    """
    k = len(p_values)
    p_array = np.array(p_values)

    if method == 'bonferroni':
        adjusted_alpha = alpha / k
        significant = p_array < adjusted_alpha
        adjusted_p = np.minimum(p_array * k, 1.0)

    elif method == 'bh':
        # Benjamini-Hochberg procedure
        sorted_indices = np.argsort(p_array)
        sorted_p = p_array[sorted_indices]

        # Calculate BH critical values
        bh_critical = [(i + 1) / k * alpha for i in range(k)]

        # Find largest p-value that is below its critical value
        significant_sorted = np.zeros(k, dtype=bool)
        max_significant = -1
        for i in range(k):
            if sorted_p[i] <= bh_critical[i]:
                max_significant = i

        if max_significant >= 0:
            significant_sorted[:max_significant + 1] = True

        # Map back to original order
        significant = np.zeros(k, dtype=bool)
        significant[sorted_indices] = significant_sorted

        # Adjusted p-values
        adjusted_p = np.zeros(k)
        # Step-up adjustment: reversed cumulative minimum of p_(i) * k / i
        adjusted_p[sorted_indices] = np.minimum.accumulate(
            (sorted_p * k / (np.arange(k) + 1))[::-1]
        )[::-1]
        adjusted_p = np.minimum(adjusted_p, 1.0)

    return {
        'original_p_values': p_values,
        'adjusted_p_values': adjusted_p.tolist(),
        'significant': significant.tolist(),
        'num_significant': significant.sum(),
        'method': method,
        'alpha': alpha,
        'num_tests': k,
    }
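
For example, with the five illustrative p-values below, Bonferroni keeps only the smallest one significant while Benjamini-Hochberg keeps the two smallest:

candidate_p_values = [0.003, 0.020, 0.045, 0.210, 0.600]  # illustrative

for method in ('bonferroni', 'bh'):
    result = multiple_comparison_correction(candidate_p_values, method=method)
    print(f"{method:10s}: {result['num_significant']} of "
          f"{result['num_tests']} tests remain significant")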

26.9.6 Minimum Number of Trades

How many trades do you need for a statistically meaningful backtest? The answer depends on the effect size (how large the edge is) and the desired statistical power.

For detecting a Sharpe ratio of $S$ with power $1 - \beta$ at significance level $\alpha$:

$$n \geq \left(\frac{z_\alpha + z_\beta}{S / \sqrt{252}}\right)^2$$

For a Sharpe of 1.0 with 80% power at 5% significance: $n \geq \left(\frac{1.645 + 0.842}{1/\sqrt{252}}\right)^2 \approx 1{,}557$ daily observations, or about 6 years of daily data.

For a Sharpe of 2.0, you need about 389 observations. The higher the Sharpe, the fewer observations needed.

from scipy import stats as scipy_stats

def minimum_trades_required(target_sharpe: float,
                             alpha: float = 0.05,
                             power: float = 0.80,
                             periods_per_year: int = 252) -> int:
    """
    Calculate minimum number of observations needed to detect
    a given Sharpe ratio with specified statistical power.
    """
    z_alpha = scipy_stats.norm.ppf(1 - alpha)
    z_beta = scipy_stats.norm.ppf(power)

    # Effect size per period
    effect_per_period = target_sharpe / np.sqrt(periods_per_year)

    n = ((z_alpha + z_beta) / effect_per_period) ** 2

    return int(np.ceil(n))

# Examples
for sharpe in [0.5, 1.0, 1.5, 2.0, 3.0]:
    n = minimum_trades_required(sharpe)
    years = n / 252
    print(f"Sharpe {sharpe:.1f}: need {n:,} observations "
          f"({years:.1f} years of daily data)")

26.10 Backtest Report Generation

26.10.1 What a Good Report Contains

A comprehensive backtest report should include:

  1. Strategy description: What the strategy does, its parameters, and its rationale.
  2. Summary statistics: All metrics from Section 26.8 in a clean table.
  3. Equity curve: Portfolio value over time.
  4. Drawdown chart: Drawdown percentage over time.
  5. Monthly returns heatmap: Returns by month and year.
  6. Trade distribution: Histogram of individual trade P&Ls.
  7. Rolling Sharpe ratio: How the Sharpe changes over time.
  8. Statistical significance: Results of permutation tests and confidence intervals.
  9. Parameter sensitivity: How performance changes with different parameter values.
  10. Comparison to benchmarks: How does this perform vs. simple benchmarks?

26.10.2 Python Report Generator

import matplotlib
matplotlib.use('Agg')  # Non-interactive backend
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

class BacktestReport:
    """Generate a comprehensive backtest report."""

    def __init__(self, equity_curve: pd.Series,
                 trades: List[Dict],
                 strategy_name: str = "Strategy",
                 benchmark_equity: Optional[pd.Series] = None):
        self.equity = equity_curve
        self.trades = trades
        self.strategy_name = strategy_name
        self.benchmark = benchmark_equity
        self.metrics = PerformanceMetrics(equity_curve, trades)

    def generate_summary_table(self) -> str:
        """Generate text summary of all metrics."""
        m = self.metrics.compute_all()

        lines = [
            f"{'='*60}",
            f"  BACKTEST REPORT: {self.strategy_name}",
            f"{'='*60}",
            f"",
            f"  Period: {self.equity.index[0].strftime('%Y-%m-%d')} to "
            f"{self.equity.index[-1].strftime('%Y-%m-%d')}",
            f"  Total Trades: {m['total_trades']}",
            f"",
            f"  --- Return Metrics ---",
            f"  Total Return:        {m['total_return']:.2%}",
            f"  Annualized Return:   {m['annualized_return']:.2%}",
            f"",
            f"  --- Risk Metrics ---",
            f"  Volatility:          {m['volatility']:.2%}",
            f"  Max Drawdown:        {m['max_drawdown']:.2%}",
            f"  Max DD Duration:     {m['max_drawdown_duration']} periods",
            f"  VaR (5%):            {m['var_5pct']:.4f}",
            f"  CVaR (5%):           {m['cvar_5pct']:.4f}",
            f"",
            f"  --- Risk-Adjusted Metrics ---",
            f"  Sharpe Ratio:        {m['sharpe_ratio']:.3f}",
            f"  Sortino Ratio:       {m['sortino_ratio']:.3f}",
            f"  Calmar Ratio:        {m['calmar_ratio']:.3f}",
            f"",
            f"  --- Trade Metrics ---",
            f"  Win Rate:            {m['win_rate']:.2%}",
            f"  Profit Factor:       {m['profit_factor']:.3f}",
            f"  Expectancy:          ${m['expectancy']:.4f}",
            f"  Avg Win/Loss:        {m['avg_win_loss_ratio']:.3f}",
            f"  Avg Holding Period:  {m['avg_holding_period_days']:.1f} days",
            f"{'='*60}",
        ]
        return "\n".join(lines)

    def plot_equity_curve(self, ax=None):
        """Plot the equity curve."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(12, 6))

        ax.plot(self.equity.index, self.equity.values,
                label=self.strategy_name, linewidth=1.5)

        if self.benchmark is not None:
            ax.plot(self.benchmark.index, self.benchmark.values,
                    label='Benchmark', linewidth=1, alpha=0.7,
                    linestyle='--')

        ax.set_title('Equity Curve')
        ax.set_xlabel('Date')
        ax.set_ylabel('Portfolio Value ($)')
        ax.legend()
        ax.grid(True, alpha=0.3)

        return ax

    def plot_drawdown(self, ax=None):
        """Plot the drawdown chart."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(12, 4))

        returns = self.equity.pct_change().dropna()
        cumulative = (1 + returns).cumprod()
        drawdown = cumulative / cumulative.cummax() - 1

        ax.fill_between(drawdown.index, drawdown.values, 0,
                        color='red', alpha=0.3)
        ax.plot(drawdown.index, drawdown.values,
                color='red', linewidth=0.5)

        ax.set_title('Drawdown')
        ax.set_xlabel('Date')
        ax.set_ylabel('Drawdown (%)')
        ax.grid(True, alpha=0.3)

        return ax

    def plot_monthly_returns_heatmap(self, ax=None):
        """Plot monthly returns as a heatmap."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(12, 6))

        monthly = self.equity.resample('M').last().pct_change().dropna()

        # Create pivot table: year x month
        monthly_df = pd.DataFrame({
            'year': monthly.index.year,
            'month': monthly.index.month,
            'return': monthly.values,
        })

        pivot = monthly_df.pivot_table(
            index='year', columns='month', values='return'
        )

        month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                       'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

        # Ensure all months present
        for m in range(1, 13):
            if m not in pivot.columns:
                pivot[m] = np.nan
        pivot = pivot.reindex(columns=range(1, 13))

        im = ax.imshow(pivot.values, cmap='RdYlGn', aspect='auto',
                       vmin=-0.1, vmax=0.1)

        ax.set_xticks(range(12))
        ax.set_xticklabels(month_names)
        ax.set_yticks(range(len(pivot.index)))
        ax.set_yticklabels(pivot.index)
        ax.set_title('Monthly Returns Heatmap')

        # Add text annotations
        for i in range(pivot.shape[0]):
            for j in range(pivot.shape[1]):
                val = pivot.values[i, j]
                if not np.isnan(val):
                    ax.text(j, i, f'{val:.1%}',
                           ha='center', va='center', fontsize=8)

        plt.colorbar(im, ax=ax, label='Return')

        return ax

    def plot_trade_distribution(self, ax=None):
        """Plot histogram of trade P&Ls."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(10, 5))

        pnls = [t['pnl'] for t in self.trades]

        ax.hist(pnls, bins=50, color='steelblue', alpha=0.7,
                edgecolor='black', linewidth=0.5)
        ax.axvline(x=0, color='red', linestyle='--', linewidth=1)
        ax.axvline(x=np.mean(pnls), color='green', linestyle='--',
                  linewidth=1, label=f'Mean: ${np.mean(pnls):.4f}')

        ax.set_title('Trade P&L Distribution')
        ax.set_xlabel('P&L ($)')
        ax.set_ylabel('Frequency')
        ax.legend()
        ax.grid(True, alpha=0.3)

        return ax

    def plot_rolling_sharpe(self, window: int = 60, ax=None):
        """Plot rolling Sharpe ratio."""
        if ax is None:
            fig, ax = plt.subplots(figsize=(12, 4))

        returns = self.equity.pct_change().dropna()
        rolling_mean = returns.rolling(window).mean()
        rolling_std = returns.rolling(window).std()
        rolling_sharpe = (rolling_mean / rolling_std *
                         np.sqrt(252)).dropna()

        ax.plot(rolling_sharpe.index, rolling_sharpe.values,
                linewidth=1, color='steelblue')
        ax.axhline(y=0, color='red', linestyle='--', linewidth=0.5)
        ax.axhline(y=1, color='green', linestyle='--', linewidth=0.5,
                  alpha=0.5)

        ax.set_title(f'Rolling {window}-Period Sharpe Ratio')
        ax.set_xlabel('Date')
        ax.set_ylabel('Sharpe Ratio')
        ax.grid(True, alpha=0.3)

        return ax

    def generate_full_report(self, save_path: str = None):
        """Generate complete multi-panel report."""
        fig = plt.figure(figsize=(16, 24))

        # Layout: 6 rows
        gs = fig.add_gridspec(6, 2, hspace=0.4, wspace=0.3)

        # Row 1: Equity curve (full width)
        ax1 = fig.add_subplot(gs[0, :])
        self.plot_equity_curve(ax1)

        # Row 2: Drawdown (full width)
        ax2 = fig.add_subplot(gs[1, :])
        self.plot_drawdown(ax2)

        # Row 3: Monthly heatmap (full width)
        ax3 = fig.add_subplot(gs[2, :])
        self.plot_monthly_returns_heatmap(ax3)

        # Row 4: Trade distribution + Rolling Sharpe
        ax4 = fig.add_subplot(gs[3, 0])
        self.plot_trade_distribution(ax4)

        ax5 = fig.add_subplot(gs[3, 1])
        self.plot_rolling_sharpe(ax=ax5)

        # Row 5: Summary statistics as text
        ax6 = fig.add_subplot(gs[4:, :])
        ax6.axis('off')
        summary = self.generate_summary_table()
        ax6.text(0.05, 0.95, summary, transform=ax6.transAxes,
                fontsize=9, verticalalignment='top',
                fontfamily='monospace')

        fig.suptitle(f'Backtest Report: {self.strategy_name}',
                    fontsize=16, fontweight='bold', y=0.98)

        if save_path:
            fig.savefig(save_path, dpi=150, bbox_inches='tight')
            print(f"Report saved to {save_path}")

        plt.close(fig)
        return fig
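
A minimal usage sketch, assuming `equity` is the pandas Series of portfolio values and `trades` is the trade list produced by your backtester; the strategy name and output path are illustrative:

report = BacktestReport(equity, trades, strategy_name="Mean Reversion v1")
print(report.generate_summary_table())
report.generate_full_report(save_path="mean_reversion_v1_report.png")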

26.11 From Backtest to Paper Trading

26.11.1 The Gap Between Backtest and Reality

Even a perfectly conducted backtest operates under assumptions that may not hold in live trading:

  • Data quality. Your historical data is cleaner than real-time data feeds.
  • Execution timing. In backtesting, you react to each data event instantly. In reality, there is latency.
  • Behavioral factors. In a backtest, every trade is executed mechanically. In practice, you may hesitate, second-guess, or deviate from the strategy.
  • Market regime changes. The future may be structurally different from the past.
  • Feedback effects. Your own trading may affect the market, which the backtest does not model (unless you explicitly include market impact).

26.11.2 The Paper Trading Protocol

Paper trading (forward testing) bridges the gap between backtesting and live trading. Here is a systematic protocol:

Phase 1: Shadow Mode (2--4 weeks)

Run the strategy in real-time without placing any orders. Log every signal and compare it to what you would have expected based on the backtest.

Check:

  • Are signals being generated at the expected frequency?
  • Are the signals consistent with what the historical data would predict?
  • Is the data feed providing clean, timely data?

Phase 2: Simulated Execution (4--8 weeks)

Continue running without real orders, but simulate execution against live market data. Use the fill simulator from Section 26.5 with live bid/ask data.

Check:

  • Are simulated fills realistic? Compare to actual market trades.
  • Is the equity curve tracking the expected range from the backtest?
  • Are transaction costs in line with assumptions?

Phase 3: Small Live Trading (4--12 weeks)

Deploy with minimal capital (e.g., 1% of intended allocation). Execute real trades.

Check:

  • Does actual execution match simulated execution?
  • Are there any systematic differences between backtest, paper, and live?
  • Is the strategy surviving real-world friction?

Phase 4: Gradual Scale-Up

If Phase 3 results are consistent with backtest expectations, gradually increase position sizes. A common rule: double the allocation every 4 weeks, provided performance remains within expected bounds.
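
A sketch of the doubling rule; the starting fraction, target, and review interval are assumptions you would choose yourself, and each step is contingent on performance staying within expected bounds (not modeled here):

def scale_up_schedule(start_fraction: float = 0.01,
                      target_fraction: float = 1.0,
                      weeks_per_step: int = 4) -> List[tuple]:
    """Allocation fractions under a 'double every review period' rule."""
    schedule, fraction, week = [], start_fraction, 0
    while fraction < target_fraction:
        schedule.append((week, fraction))
        fraction = min(fraction * 2, target_fraction)
        week += weeks_per_step
    schedule.append((week, target_fraction))
    return schedule

for week, fraction in scale_up_schedule():
    print(f"Week {week:3d}: {fraction:.0%} of intended allocation")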

26.11.3 Go/No-Go Criteria

Define explicit criteria for proceeding to live trading:

def go_live_decision(backtest_metrics: Dict,
                     paper_metrics: Dict,
                     threshold_pct: float = 0.50) -> Dict:
    """
    Decide whether to proceed from paper trading to live trading.

    threshold_pct: Maximum allowable degradation from backtest
    """

    decisions = {}

    # 1. Sharpe ratio within acceptable range
    sharpe_ratio = paper_metrics['sharpe_ratio'] / backtest_metrics['sharpe_ratio']
    decisions['sharpe_degradation'] = {
        'backtest': backtest_metrics['sharpe_ratio'],
        'paper': paper_metrics['sharpe_ratio'],
        'ratio': sharpe_ratio,
        'pass': sharpe_ratio >= (1 - threshold_pct),
    }

    # 2. Win rate within acceptable range
    win_diff = abs(paper_metrics['win_rate'] - backtest_metrics['win_rate'])
    decisions['win_rate_consistency'] = {
        'backtest': backtest_metrics['win_rate'],
        'paper': paper_metrics['win_rate'],
        'difference': win_diff,
        'pass': win_diff < 0.15,  # Within 15 percentage points
    }

    # 3. Max drawdown not significantly worse
    dd_ratio = paper_metrics['max_drawdown'] / backtest_metrics['max_drawdown']
    decisions['drawdown_check'] = {
        'backtest': backtest_metrics['max_drawdown'],
        'paper': paper_metrics['max_drawdown'],
        'ratio': dd_ratio,
        'pass': dd_ratio < 1.5,  # DD no more than 50% worse
    }

    # 4. Sufficient number of trades
    decisions['sufficient_trades'] = {
        'paper_trades': paper_metrics['total_trades'],
        'pass': paper_metrics['total_trades'] >= 30,
    }

    # 5. Positive expectancy
    decisions['positive_expectancy'] = {
        'expectancy': paper_metrics['expectancy'],
        'pass': paper_metrics['expectancy'] > 0,
    }

    # Overall decision
    all_pass = all(d['pass'] for d in decisions.values())
    decisions['overall'] = {
        'go_live': all_pass,
        'checks_passed': sum(1 for d in decisions.values()
                            if isinstance(d, dict) and d.get('pass')),
        'total_checks': len(decisions),  # 'overall' is not yet in the dict here
    }

    return decisions

26.11.4 When to Stop

Equally important is knowing when to stop a live strategy. Define these rules before going live; a minimal monitoring sketch follows the list:

  1. Maximum drawdown stop. If the drawdown exceeds 1.5x the worst backtest drawdown, stop trading and investigate.
  2. Performance deviation stop. If the rolling 30-day Sharpe falls below -1.0, pause the strategy.
  3. Structural break stop. If the platform changes fees, rules, or market mechanics, pause and re-evaluate.
  4. Signal frequency stop. If the strategy generates significantly more or fewer signals than expected, investigate.
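
A minimal monitoring sketch of these stop rules; the input fields and the signal-frequency thresholds (half or double the expected rate) are assumptions, not fixed prescriptions:

def should_stop_trading(live_drawdown: float,
                        backtest_worst_drawdown: float,
                        rolling_30d_sharpe: float,
                        platform_changed: bool,
                        signals_last_week: int,
                        expected_signals_per_week: float) -> Dict:
    """Evaluate the pre-defined stop rules (drawdowns are negative numbers)."""
    checks = {
        # 1. Drawdown exceeds 1.5x the worst backtest drawdown.
        'drawdown_stop': live_drawdown < 1.5 * backtest_worst_drawdown,
        # 2. Rolling 30-day Sharpe has fallen below -1.0.
        'performance_stop': rolling_30d_sharpe < -1.0,
        # 3. Platform changed fees, rules, or market mechanics.
        'structural_break_stop': platform_changed,
        # 4. Signal frequency far from expectations (thresholds are assumptions).
        'signal_frequency_stop': (
            signals_last_week < 0.5 * expected_signals_per_week
            or signals_last_week > 2.0 * expected_signals_per_week
        ),
    }
    checks['stop_trading'] = any(checks.values())
    return checks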

26.12 Chapter Summary

This chapter has built a complete backtesting infrastructure for prediction market strategies. Let us recapitulate the essential lessons:

Mindset: Backtesting is hypothesis testing, not profit demonstration. The goal is to falsify strategies, not confirm them.

Pitfalls: The seven deadly sins --- lookahead bias, overfitting, survivorship bias, unrealistic execution, ignoring costs, data snooping, and lack of statistical testing --- can each independently invalidate a backtest. Our framework addresses all seven architecturally.

Framework: The event-driven architecture (Data Handler, Strategy, Portfolio, Execution Simulator, Analyzer) enforces correct information flow and prevents lookahead bias structurally.

Fill simulation: Realistic execution modeling is essential in thin prediction markets. Our execution simulator models spreads, market impact (including square-root impact for larger orders), partial fills, and platform-specific fees.

Transaction costs: Trading fees, spread costs, market impact, and opportunity cost of capital lock-up must all be modeled. The cost sensitivity analysis reveals whether a strategy's edge survives real-world friction.

Walk-forward testing: Simple backtesting conflates in-sample and out-of-sample performance. Walk-forward analysis with rolling or anchored windows provides genuine out-of-sample evaluation and reveals parameter stability.

Performance metrics: A complete evaluation requires return metrics (total, annualized), risk metrics (drawdown, VaR, volatility), risk-adjusted metrics (Sharpe, Sortino, Calmar), and trade-level metrics (win rate, profit factor, expectancy).

Statistical significance: Permutation tests, bootstrap confidence intervals, and multiple comparisons correction distinguish genuine alpha from noise. The minimum observation requirements remind us that statistical claims require statistical evidence.

From backtest to live: The disciplined progression through shadow mode, simulated execution, small live trading, and gradual scale-up protects capital while validating backtest assumptions in the real world.


What's Next

In Chapter 27, we will explore Real-Time Data Pipelines for Prediction Markets, building the infrastructure needed to collect, process, and analyze prediction market data in real time. This is the natural next step: once you have a backtested strategy that passes all the tests in this chapter, you need a live data infrastructure to deploy it. We will cover streaming data architectures, real-time feature computation, and the engineering challenges of running prediction market strategies in production.


Key Equations Reference

| Metric | Formula |
|--------|---------|
| Sharpe Ratio | $S = \frac{R_p - R_f}{\sigma_p}$ |
| Sortino Ratio | $Sortino = \frac{R_p - R_f}{\sigma_{downside}}$ |
| Calmar Ratio | $Calmar = \frac{R_{annual}}{\lvert MDD \rvert}$ |
| Max Drawdown | $MDD = \min_t\left(\frac{V_t}{\max_{s \leq t}V_s} - 1\right)$ |
| Profit Factor | $PF = \frac{\sum \text{wins}}{\sum \lvert\text{losses}\rvert}$ |
| Expectancy | $E = W \times \bar{G} - (1-W) \times \bar{L}$ |
| Market Impact | $C_{impact} = \sigma \cdot \beta \cdot \sqrt{Q/V}$ |
| Min Observations | $n \geq \left(\frac{z_\alpha + z_\beta}{S/\sqrt{252}}\right)^2$ |