Chapter 26 Key Takeaways

Core Principle

A backtest is a hypothesis test, not a profit demonstration. The disciplined backtester seeks to disprove a strategy, not to prove it. Any strategy that survives rigorous attempts at falsification deserves cautious confidence; any strategy whose only evidence is an impressive equity curve deserves suspicion.

The Big Ideas

1. Lookahead Bias Is the Most Dangerous Backtesting Error

Lookahead bias occurs when a backtest uses information that would not have been available at the time of the trading decision. It inflates performance by letting the strategy "see the future." Common forms in prediction markets:

  • Using resolution outcomes to filter which markets to trade.
  • Computing rolling statistics over the full dataset instead of a rolling window.
  • Training models on future data, then generating signals at past timestamps.

Prevention: Use an event-driven architecture where data is fed one timestamp at a time, making it structurally impossible to access future data.
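
For example, a rolling statistic is lookahead-free only if each value is computed from observations up to its own timestamp. A minimal sketch (hypothetical window length and data, not code from the chapter):

```python
import statistics

def causal_zscore(prices, window=20):
    """Rolling z-score where each value uses only data up to its own timestamp."""
    scores = []
    for t in range(len(prices)):
        hist = prices[max(0, t - window + 1): t + 1]  # history up to and including t
        if len(hist) < 2:
            scores.append(0.0)  # not enough history yet
            continue
        mu = statistics.mean(hist)
        sd = statistics.pstdev(hist)
        scores.append((prices[t] - mu) / sd if sd > 0 else 0.0)
    return scores
```

Computing `mu` and `sd` once over the full series and reusing them at every `t` would be the lookahead-biased version of the same statistic.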

2. Survivorship Bias Inflates Returns by Excluding Failures

When the dataset includes only markets that resolved cleanly (excluding cancelled, delisted, or illiquid markets), the backtest overstates performance. Markets that "survive" tend to be better-quality, more liquid, and more predictable.

Prevention: Include all markets in the universe, even those that were cancelled. If a market disappears, model the exit at the last available price.
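
One way to honor this rule is to value every terminated position, using the resolution payout for resolved markets and the last observed price otherwise. A sketch assuming a hypothetical record format with `status`, `resolution`, and `last_price` fields:

```python
def exit_value(market, position=1.0):
    """Value of a position when a market ends, for any termination reason."""
    if market["status"] == "resolved":
        return position * market["resolution"]  # binary payout: 0 or 1
    # Cancelled/delisted/illiquid: model the exit at the last available price
    # instead of silently dropping the market from the backtest universe.
    return position * market["last_price"]
```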

3. Overfitting Finds Patterns in Noise

Given enough parameters and enough data to search through, you will always find a combination that appears profitable, even on pure noise. The space of strategies that fit historical noise is enormous; the space of genuine alpha is tiny.

Prevention:

  • Minimize free parameters.
  • Use walk-forward analysis with separate training and test windows.
  • Apply the Bonferroni or Benjamini-Hochberg correction for multiple comparisons.
  • Demand statistical significance, not just positive returns.
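
The Benjamini-Hochberg step-up procedure can be sketched in a few lines; it returns which of several tested strategies survive the correction (the p-values below are illustrative, not from the chapter):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests that survive the BH false-discovery-rate procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha;
    # all tests at rank <= k are rejected (declared significant).
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])
```

Bonferroni is stricter: it would require each p-value to beat `alpha / m` individually.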

4. Event-Driven Architecture Provides Structural Integrity

An event-driven backtest processes data one timestamp at a time through a pipeline:

$$\text{DataHandler} \rightarrow \text{Strategy} \rightarrow \text{Portfolio} \rightarrow \text{ExecutionSimulator}$$

Each component sees only past data, matching the information flow in live trading. This architecture prevents lookahead bias by design rather than by discipline.

5. Realistic Fill Simulation Is Non-Negotiable

Backtests that assume instant fills at the mid-price dramatically overstate performance. Real execution involves:

| Cost Component | Source |
| --- | --- |
| Spread cost | Difference between bid/ask and mid-price |
| Market impact | Your order moves the price against you |
| Partial fills | Not all of your order gets filled |
| Slippage | Price moves between signal and execution |
| Fees | Platform trading fees |

The square-root market impact model:

$$\text{Impact} = \sigma \cdot \beta \cdot \sqrt{\frac{Q}{V}}$$

where $\sigma$ is volatility, $\beta$ is the impact coefficient, $Q$ is order size, and $V$ is daily volume.
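
The formula translates directly into code; the example numbers below are hypothetical:

```python
import math

def sqrt_impact(volatility, order_size, daily_volume, beta=1.0):
    """Square-root market impact: sigma * beta * sqrt(Q / V)."""
    return volatility * beta * math.sqrt(order_size / daily_volume)
```

With 2% volatility and an order equal to 1% of daily volume, the impact is 0.02 × √0.01 = 0.002, i.e. 20 basis points — small individually, but it compounds across every trade the backtest makes.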

6. Walk-Forward Analysis Prevents In-Sample Overfitting

Walk-forward analysis splits data into sequential train/test windows:

  1. Train on window 1, test on window 2.
  2. Slide forward: train on window 2, test on window 3.
  3. Repeat until data is exhausted.

Only out-of-sample performance counts. If the strategy works in-sample but fails out-of-sample across multiple windows, it is overfit.
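
The splitting scheme above can be sketched as a window generator (window sizes are parameters you choose, not values from the chapter):

```python
def walk_forward_windows(n, train_size, test_size):
    """Sequential (train_indices, test_indices) pairs sliding through n data points."""
    windows = []
    start = 0
    while start + train_size + test_size <= n:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        windows.append((train, test))
        start += test_size  # slide forward by one test window
    return windows
```

Each test window begins exactly where its training window ends, so no test observation ever influences the parameters used to trade it.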

7. Performance Metrics Must Be Comprehensive

No single metric captures strategy quality. The essential battery:

| Metric | Formula | What It Tells You |
| --- | --- | --- |
| Sharpe Ratio | $\frac{\bar{r}}{\sigma_r} \cdot \sqrt{252}$ | Risk-adjusted return |
| Maximum Drawdown | $\max_t (M_t - C_t)$ | Worst peak-to-trough loss |
| Win Rate | $\frac{N_{win}}{N_{total}}$ | Fraction of profitable trades |
| Profit Factor | $\frac{\sum \text{wins}}{\sum \lvert\text{losses}\rvert}$ | Gross profits / gross losses |
| Expectancy | $(\text{WR} \times \bar{W}) - ((1 - \text{WR}) \times \bar{L})$ | Expected profit per trade |
| Calmar Ratio | $\frac{\text{Annual Return}}{\text{Max Drawdown}}$ | Return per unit of drawdown |

Here $M_t$ is the running maximum of the cumulative equity curve $C_t$; WR is the win rate, $\bar{W}$ the average winning trade, and $\bar{L}$ the average losing trade (as a positive number).
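
Two of these metrics as minimal reference implementations (assuming daily returns and an equity curve in currency units; both are sketches, not the chapter's code):

```python
import statistics

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a list of periodic returns."""
    mu = statistics.mean(returns)
    sd = statistics.stdev(returns)
    return mu / sd * periods_per_year ** 0.5

def max_drawdown(equity):
    """Worst peak-to-trough decline of an equity curve, in the curve's units."""
    peak, worst = equity[0], 0.0
    for c in equity:
        peak = max(peak, c)            # running maximum M_t
        worst = max(worst, peak - c)   # deepest gap below the peak so far
    return worst
```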

8. Statistical Significance Separates Signal from Noise

A strategy's performance may be due to luck. Apply:

  • t-test on returns: Is mean return significantly different from zero?
  • Bailey-Lopez de Prado minimum backtest length: How long must the test run to detect a given Sharpe ratio?
  • Multiple comparison corrections: Bonferroni or BH when testing many strategies on the same data.

Rule of thumb: you need roughly $N \geq (z_\alpha / \text{SR})^2$ trades for a t-test to detect a Sharpe ratio of SR at significance level $\alpha$.
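
Worked through in code (the one-sided 5% critical value $z_\alpha \approx 1.645$ is an assumed choice):

```python
import math

def min_trades(sharpe_ratio, z_alpha=1.645):
    """Rule-of-thumb minimum sample size: N >= (z_alpha / SR)^2."""
    return math.ceil((z_alpha / sharpe_ratio) ** 2)
```

A per-trade Sharpe of 0.1 therefore demands on the order of 271 trades before a t-test can distinguish it from zero, which is why thin prediction-market histories are such a problem.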

9. Prediction Markets Have Unique Backtesting Challenges

  • Binary outcomes: Markets resolve to 0 or 1, creating discontinuous returns.
  • Thin liquidity: Many markets cannot absorb meaningful position sizes.
  • Non-stationarity: The data-generating process changes as events unfold.
  • Platform-specific costs: Each platform (Polymarket, Kalshi, PredictIt) has different fee structures.
  • Short histories: Many individual markets exist for weeks or months, not years.

10. The Backtest Is Only the Beginning

A passing backtest is necessary but not sufficient. Before deploying capital:

  1. Paper trade the strategy in real time.
  2. Start with minimal capital and scale up gradually.
  3. Monitor live performance against backtest expectations.
  4. Have explicit drawdown limits that trigger strategy review.

Key Code Patterns

```python
# Event-driven backtesting loop: each component sees only data
# available at the current timestamp.
for timestamp in data_handler.timestamps():
    bar = data_handler.get_latest(timestamp)        # past data only
    signal = strategy.generate_signal(bar)
    order = portfolio.process_signal(signal)
    fill = execution_simulator.execute(order, bar)  # realistic fill model
    portfolio.update(fill)
```

Decision Framework

| Question | Recommendation |
| --- | --- |
| Architecture? | Event-driven for rigor; vectorized for rapid screening |
| How to split data? | Walk-forward with a purged gap between train and test |
| Fill model? | Square-root impact + partial fill probability |
| How many parameters? | Fewer is better; each parameter adds overfitting risk |
| How to compare strategies? | Paired t-test on per-trade or per-period returns |
| Minimum sample? | 100+ trades for meaningful statistical inference |
| When is a backtest "good enough"? | Positive out-of-sample Sharpe across multiple windows |

The One-Sentence Summary

Build an event-driven backtest with realistic fills, walk-forward validation, and statistical significance testing, then treat any surviving strategy with cautious optimism rather than certainty.