Chapter 26 Key Takeaways
Core Principle
A backtest is a hypothesis test, not a profit demonstration. The disciplined backtester seeks to disprove a strategy, not to prove it. Any strategy that survives rigorous attempts at falsification deserves cautious confidence; any strategy whose only evidence is an impressive equity curve deserves suspicion.
The Big Ideas
1. Lookahead Bias Is the Most Dangerous Backtesting Error
Lookahead bias occurs when a backtest uses information that would not have been available at the time of the trading decision. It inflates performance by letting the strategy "see the future." Common forms in prediction markets:
- Using resolution outcomes to filter which markets to trade.
- Computing rolling statistics over the full dataset instead of a rolling window.
- Training models on future data then generating signals at past timestamps.
Prevention: Use an event-driven architecture where data is fed one timestamp at a time, making it structurally impossible to access future data.
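The leak is easiest to see in a statistic computed two ways. The sketch below (function names are illustrative, not from the chapter) contrasts a signal built from the full-sample mean, which silently includes future bars, with a causal version that uses only data available at time `t`:

```python
# Sketch: how a full-sample mean leaks future information into past signals.
# Names (z_score_leaky, z_score_causal) and prices are illustrative.

prices = [0.40, 0.42, 0.45, 0.44, 0.50, 0.55]

def z_score_leaky(prices, t):
    """WRONG: uses the mean of the *entire* series, including future bars."""
    mean = sum(prices) / len(prices)
    return prices[t] - mean

def z_score_causal(prices, t):
    """RIGHT: uses only data available up to and including time t."""
    window = prices[: t + 1]
    mean = sum(window) / len(window)
    return prices[t] - mean

# At t=1 the leaky version already "knows" prices rallied later: its signal
# is negative (price below the future-inflated mean), while the causal
# signal is positive (price above the mean observed so far).
```

An event-driven harness makes the leaky version impossible to write, because the full series is never handed to the strategy in the first place.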
2. Survivorship Bias Inflates Returns by Excluding Failures
When the dataset includes only markets that resolved cleanly (excluding cancelled, delisted, or illiquid markets), the backtest overstates performance. Markets that "survive" tend to be better-quality, more liquid, and more predictable.
Prevention: Include all markets in the universe, even those that were cancelled. If a market disappears, model the exit at the last available price.
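The prevention rule can be sketched as a P&L calculation over the full universe; the field names (`status`, `entry`, `last_price`, `resolution`) are illustrative assumptions, not a chapter-specified schema:

```python
# Sketch: keep cancelled/delisted markets in the universe and mark the exit
# at the last available price. Data values are made-up illustrations.

markets = [
    {"status": "resolved",  "entry": 0.60, "resolution": 1.0},   # won
    {"status": "resolved",  "entry": 0.55, "resolution": 0.0},   # lost
    {"status": "cancelled", "entry": 0.50, "last_price": 0.30},  # delisted
]

def trade_pnl(m):
    """P&L of a 1-share long, exiting at resolution or the last traded price."""
    exit_price = m["resolution"] if m["status"] == "resolved" else m["last_price"]
    return exit_price - m["entry"]

survivors_only = sum(trade_pnl(m) for m in markets if m["status"] == "resolved")
full_universe = sum(trade_pnl(m) for m in markets)
# The cancelled market drags performance down; excluding it overstates returns.
```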
3. Overfitting Finds Patterns in Noise
Given enough parameters and enough data to search through, you will always find a combination that appears profitable, even on pure noise. The space of strategies that fit historical noise is enormous; the space of genuine alpha is tiny.
Prevention:
- Minimize free parameters.
- Use walk-forward analysis with separate training and test windows.
- Apply the Bonferroni or Benjamini-Hochberg correction for multiple comparisons.
- Demand statistical significance, not just positive returns.
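The two corrections named above can be sketched in a few lines; the p-values are made-up illustrations of testing four strategies on the same data:

```python
# Sketch: multiple-comparison corrections for strategies tested on shared data.

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha / m (very conservative)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up procedure: controls the false discovery rate instead."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank  # largest rank whose p-value clears its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject

p = [0.001, 0.012, 0.030, 0.200]
# Bonferroni threshold is 0.05/4 = 0.0125, so only the first two survive;
# BH is less strict and also admits 0.030 (0.030 <= 3/4 * 0.05 = 0.0375).
```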
4. Event-Driven Architecture Provides Structural Integrity
An event-driven backtest processes data one timestamp at a time through a pipeline:
$$\text{DataHandler} \rightarrow \text{Strategy} \rightarrow \text{Portfolio} \rightarrow \text{ExecutionSimulator}$$
Each component sees only past data, matching the information flow in live trading. This architecture prevents lookahead bias by design rather than by discipline.
5. Realistic Fill Simulation Is Non-Negotiable
Backtests that assume instant fills at the mid-price dramatically overstate performance. Real execution involves:
| Cost Component | Source |
|---|---|
| Spread cost | Difference between bid/ask and mid-price |
| Market impact | Your order moves the price against you |
| Partial fills | Not all of your order gets filled |
| Slippage | Price moves between signal and execution |
| Fees | Platform trading fees |
The square-root market impact model:
$$\text{Impact} = \sigma \cdot \beta \cdot \sqrt{\frac{Q}{V}}$$
where $\sigma$ is volatility, $\beta$ is the impact coefficient, $Q$ is order size, and $V$ is daily volume.
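A direct transcription of the formula shows its key property: impact grows with the square root of order size, so doubling $Q$ raises impact by $\sqrt{2}$, not $2\times$. The parameter values below are illustrative; $\beta$ near 1 is a common empirical choice, assumed here rather than taken from the chapter:

```python
import math

def market_impact(sigma, beta, order_size, daily_volume):
    """Square-root impact model: Impact = sigma * beta * sqrt(Q / V)."""
    return sigma * beta * math.sqrt(order_size / daily_volume)

# Doubling order size raises impact by sqrt(2), not 2x:
small = market_impact(sigma=0.05, beta=1.0, order_size=100, daily_volume=10_000)
large = market_impact(sigma=0.05, beta=1.0, order_size=200, daily_volume=10_000)
```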
6. Walk-Forward Analysis Prevents In-Sample Overfitting
Walk-forward analysis splits data into sequential train/test windows:
- Train on window 1, test on window 2.
- Slide forward: train on window 2, test on window 3.
- Repeat until data is exhausted.
Only out-of-sample performance counts. If the strategy works in-sample but fails out-of-sample across multiple windows, it is overfit.
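The sliding split above can be sketched as an index generator; the window sizes are illustrative parameters, not chapter-prescribed values:

```python
# Sketch: sequential walk-forward train/test splits over n_obs observations.

def walk_forward_splits(n_obs, train_size, test_size):
    """Return (train_indices, test_indices) pairs, sliding forward by test_size."""
    splits = []
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        splits.append((train, test))
        start += test_size
    return splits

splits = walk_forward_splits(n_obs=100, train_size=40, test_size=20)
# Every test window begins exactly where its training window ends, so the
# strategy is always evaluated on data it has never seen.
```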
7. Performance Metrics Must Be Comprehensive
No single metric captures strategy quality. The essential battery:
| Metric | Formula | What It Tells You |
|---|---|---|
| Sharpe Ratio | $\frac{\bar{r}}{\sigma_r} \cdot \sqrt{252}$ (annualized from daily returns) | Risk-adjusted return |
| Maximum Drawdown | $\max_t (M_t - C_t)$, with running equity peak $M_t$ and current equity $C_t$ | Worst peak-to-trough loss |
| Win Rate | $\frac{N_{win}}{N_{total}}$ | Fraction of profitable trades |
| Profit Factor | $\frac{\sum \text{wins}}{\sum \lvert\text{losses}\rvert}$ | Gross profits / gross losses |
| Expectancy | $(\text{WR} \times \bar{W}) - ((1 - \text{WR}) \times \bar{L})$ | Expected profit per trade |
| Calmar Ratio | $\frac{\text{Annual Return}}{\text{Max Drawdown}}$ | Return per unit of drawdown |
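Three of the table's metrics can be sketched directly from their formulas; the sample P&Ls and equity curve are made-up illustrations:

```python
import math

def sharpe(returns, periods=252):
    """Annualized Sharpe ratio of per-period returns (mean over sample stdev)."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods)

def max_drawdown(equity_curve):
    """Largest peak-to-trough drop max_t(M_t - C_t) along the curve."""
    peak, worst = equity_curve[0], 0.0
    for c in equity_curve:
        peak = max(peak, c)
        worst = max(worst, peak - c)
    return worst

def profit_factor(trade_pnls):
    """Gross profits divided by gross (absolute) losses."""
    wins = sum(p for p in trade_pnls if p > 0)
    losses = sum(-p for p in trade_pnls if p < 0)
    return wins / losses

pnls = [10, -5, 7, -3, 4]
equity = [100, 110, 105, 112, 109, 113]
```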
8. Statistical Significance Separates Signal from Noise
A strategy's performance may be due to luck. Apply:
- t-test on returns: Is mean return significantly different from zero?
- Bailey and López de Prado's minimum backtest length: How long must the test run to detect a given Sharpe ratio?
- Multiple comparison corrections: Bonferroni or BH when testing many strategies on the same data.
Rule of thumb: you need roughly $N \geq (z_\alpha / \text{SR})^2$ trades for a t-test to detect a Sharpe ratio of SR at significance level $\alpha$.
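The rule of thumb translates to a one-liner. Note the Sharpe ratio here is per-trade, matching the units of $N$; the one-sided normal quantile for $\alpha = 0.05$ is $z_\alpha \approx 1.645$:

```python
import math

def min_trades(per_trade_sharpe, z_alpha=1.645):
    """Approximate trade count needed to detect the given per-trade Sharpe:
    N >= (z_alpha / SR)^2, rounded up."""
    return math.ceil((z_alpha / per_trade_sharpe) ** 2)

# A small per-trade edge needs many trades to distinguish from luck:
n = min_trades(per_trade_sharpe=0.1)  # (1.645 / 0.1)^2 ~= 270.6 -> 271 trades
```

Halving the per-trade Sharpe quadruples the required sample, which is why thin-edge strategies demand long histories.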
9. Prediction Markets Have Unique Backtesting Challenges
- Binary outcomes: Markets resolve to 0 or 1, creating discontinuous returns.
- Thin liquidity: Many markets cannot absorb meaningful position sizes.
- Non-stationarity: The data-generating process changes as events unfold.
- Platform-specific costs: Each platform (Polymarket, Kalshi, PredictIt) has different fee structures.
- Short histories: Many individual markets exist for weeks or months, not years.
10. The Backtest Is Only the Beginning
A passing backtest is necessary but not sufficient. Before deploying capital:
- Paper trade the strategy in real time.
- Start with minimal capital and scale up gradually.
- Monitor live performance against backtest expectations.
- Have explicit drawdown limits that trigger strategy review.
Key Code Patterns
```python
# Event-driven backtesting loop
for timestamp in data_handler.timestamps():
    bar = data_handler.get_latest(timestamp)
    signal = strategy.generate_signal(bar)
    order = portfolio.process_signal(signal)
    fill = execution_simulator.execute(order, bar)
    portfolio.update(fill)
```
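To make the loop's data flow concrete, here is a minimal self-contained version with stub components. Their internals (a toy buy-below-0.50 rule, a crude fixed-slippage fill) are illustrative placeholders, not the chapter's implementations:

```python
# Minimal stub components so the event-driven loop above actually runs.

class DataHandler:
    def __init__(self, bars):
        self._bars = bars  # {timestamp: price}
    def timestamps(self):
        return sorted(self._bars)
    def get_latest(self, ts):
        return {"ts": ts, "price": self._bars[ts]}

class Strategy:
    def generate_signal(self, bar):
        # Toy rule: buy below 0.50, otherwise stay flat.
        return "BUY" if bar["price"] < 0.50 else None

class Portfolio:
    def __init__(self):
        self.position = 0
    def process_signal(self, signal):
        return {"side": "BUY", "qty": 1} if signal == "BUY" else None
    def update(self, fill):
        if fill:
            self.position += fill["qty"]

class ExecutionSimulator:
    def execute(self, order, bar):
        if order is None:
            return None
        # Fill at a price slightly worse than the bar (crude slippage).
        return {"qty": order["qty"], "price": bar["price"] + 0.01}

data_handler = DataHandler({1: 0.45, 2: 0.48, 3: 0.55})
strategy, portfolio = Strategy(), Portfolio()
execution_simulator = ExecutionSimulator()

for timestamp in data_handler.timestamps():
    bar = data_handler.get_latest(timestamp)
    signal = strategy.generate_signal(bar)
    order = portfolio.process_signal(signal)
    fill = execution_simulator.execute(order, bar)
    portfolio.update(fill)
```

Because each component only ever receives the current bar, there is no object through which the strategy could reach future data.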
Decision Framework
| Question | Recommendation |
|---|---|
| Architecture? | Event-driven for rigor; vectorized for rapid screening |
| How to split data? | Walk-forward with purged gap between train and test |
| Fill model? | Square-root impact + partial fill probability |
| How many parameters? | Fewer is better; each parameter adds overfitting risk |
| How to compare strategies? | Paired t-test on per-trade or per-period returns |
| Minimum sample? | 100+ trades for meaningful statistical inference |
| When is a backtest "good enough"? | Positive out-of-sample Sharpe across multiple windows |
The One-Sentence Summary
Build an event-driven backtest with realistic fills, walk-forward validation, and statistical significance testing, then treat any surviving strategy with cautious optimism rather than certainty.