Chapter 21: Key Takeaways

Exploratory Data Analysis of Market Data


1. EDA is the Foundation, Not a Formality

Exploratory Data Analysis is not a box to check before modeling---it is the intellectual foundation for every analytical decision that follows. Skipping or rushing EDA means building models on unverified assumptions. Every feature you engineer, every model you select, and every evaluation metric you choose should be informed by what you discovered during EDA.

2. Prediction Markets Require Specialized EDA

Standard financial EDA techniques must be adapted for prediction markets. Prices are bounded between 0 and 1, markets have finite lifetimes with known resolution dates, and volume follows event-driven lifecycles rather than continuous patterns. These peculiarities demand purpose-built analytical approaches.

3. Summary Statistics Reveal Market Character

Begin with descriptive statistics: mean, median, standard deviation, skewness, kurtosis, VWAP, price range, and total volume. These simple numbers tell you whether a market is liquid or thin, stable or volatile, trending or range-bound. Always compute both standard and robust statistics to detect the influence of outliers.
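
A minimal sketch of this first pass, assuming trades arrive as a pandas DataFrame with `price` and `volume` columns (both column names are illustrative):

```python
import numpy as np
import pandas as pd
from scipy import stats

def summarize_market(trades: pd.DataFrame) -> pd.Series:
    """Standard and robust summary statistics for one market's trades.

    Assumes columns 'price' (in [0, 1]) and 'volume' (size per trade).
    """
    p, v = trades["price"], trades["volume"]
    return pd.Series({
        # Standard moments
        "mean": p.mean(),
        "std": p.std(),
        "skewness": stats.skew(p),
        "kurtosis": stats.kurtosis(p),            # excess kurtosis
        # Volume-weighted average price
        "vwap": np.average(p, weights=v),
        "price_range": p.max() - p.min(),
        "total_volume": v.sum(),
        # Robust counterparts: compare against mean/std to flag outliers
        "median": p.median(),
        "mad": stats.median_abs_deviation(p),
        "iqr": p.quantile(0.75) - p.quantile(0.25),
    })
```

A large gap between the mean and the median, or between the standard deviation and 1.4826 times the MAD, is a quick flag that outliers are distorting the standard statistics.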

4. Price Time Series Tell the Story

A single price timeline---properly formatted with a fixed 0-to-1 y-axis, event annotations, and volume subplot---communicates more about a market than any table of numbers. Moving averages reveal trends, while price change distributions expose the statistical character of the market's dynamics.
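
One way to assemble such a chart with matplotlib; the timestamp-indexed DataFrame, the `price` and `volume` column names, and the 7-day moving-average window are all illustrative assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_market(df: pd.DataFrame, title: str = "Market price history"):
    """Price timeline with fixed 0-1 axis, moving average, and volume subplot.

    Assumes df has a DatetimeIndex and 'price' and 'volume' columns.
    """
    fig, (ax_p, ax_v) = plt.subplots(
        2, 1, sharex=True, figsize=(10, 6),
        gridspec_kw={"height_ratios": [3, 1]},
    )
    ax_p.plot(df.index, df["price"], lw=1, label="price")
    ax_p.plot(df.index, df["price"].rolling("7D").mean(), lw=2, label="7-day MA")
    ax_p.set_ylim(0, 1)                        # fixed 0-to-1 probability axis
    ax_p.set_ylabel("implied probability")
    ax_p.set_title(title)
    ax_p.legend(loc="upper left")
    # Event annotations go here, e.g.:
    # ax_p.axvline(event_ts, ls="--", color="grey")
    ax_v.bar(df.index, df["volume"], color="grey")
    ax_v.set_ylabel("volume")
    fig.tight_layout()
    return fig
```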

5. Volume Patterns Reflect the Market Lifecycle

Prediction markets follow a characteristic volume lifecycle: low volume at creation, moderate volume during sustained trading, event-driven spikes, and often a final burst near resolution. Volume spikes frequently coincide with information events, and the volume-price relationship reveals whether price moves are information-driven or mere noise.
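
A sketch of one simple spike detector: flag periods whose volume far exceeds a trailing rolling median. The window and multiplier below are illustrative defaults, not calibrated values:

```python
import pandas as pd

def flag_volume_spikes(volume: pd.Series, window: int = 30,
                       k: float = 5.0) -> pd.Series:
    """Flag periods whose volume exceeds k times the trailing rolling median.

    Shifting the baseline by one period keeps each spike out of its own
    comparison window.
    """
    baseline = volume.rolling(window, min_periods=5).median().shift(1)
    return volume > k * baseline
```

Cross-referencing the flagged dates against a timeline of known events is the quickest check on whether spikes are information-driven.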

6. Volatility is Mechanically Linked to Price Level

In binary markets, the maximum possible volatility at price $p$ is $\sqrt{p(1-p)}$. This means volatility near the boundaries (0 or 1) is mechanically constrained. Normalize volatility by this factor for fair comparison across different price levels. Volatility clustering---the tendency for volatile periods to persist---is common and can be captured by GARCH models.
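
A sketch of the normalization, assuming a pandas Series of prices sampled at regular intervals:

```python
import numpy as np
import pandas as pd

def normalized_volatility(price: pd.Series, window: int = 30) -> pd.Series:
    """Rolling volatility of price changes, divided by sqrt(p * (1 - p)).

    Dividing out the mechanical ceiling makes volatility comparable
    between a market trading near 0.50 and one trading near 0.95.
    """
    raw_vol = price.diff().rolling(window).std()
    ceiling = np.sqrt(price * (1 - price)).clip(lower=1e-6)  # avoid /0 at bounds
    return raw_vol / ceiling
```

For volatility clustering itself, a GARCH(1,1) fit on the normalized changes (for example via the `arch` package) is the standard next step.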

7. Autocorrelation Tests Market Efficiency

If price changes are autocorrelated, the market is not fully efficient: past changes contain information about future changes. The ACF, runs test, and Ljung-Box test provide complementary evidence. Positive autocorrelation suggests momentum; negative autocorrelation suggests mean reversion; zero autocorrelation is consistent with efficiency, at least with respect to linear predictability (nonlinear dependence can remain).
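
A sketch using statsmodels (assuming a reasonably recent version, where `acorr_ljungbox` returns a DataFrame):

```python
import pandas as pd
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import acf

def efficiency_tests(price: pd.Series, lags: int = 10) -> pd.DataFrame:
    """ACF and Ljung-Box statistics on price changes.

    Small lb_pvalue values reject the no-autocorrelation null, i.e.
    evidence against (linear) efficiency at that horizon.
    """
    changes = price.diff().dropna()
    out = acorr_ljungbox(changes, lags=lags)   # columns: lb_stat, lb_pvalue
    out["acf"] = acf(changes, nlags=lags)[1:]  # drop lag 0, which is always 1
    return out
```

A runs test on the signs of the changes rounds out the picture; statsmodels ships one as `runstest_1samp`.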

8. The Beta Distribution is Natural for Prediction Market Prices

Because prices are bounded between 0 and 1, the Beta distribution is the natural choice for modeling how they are distributed. U-shaped Beta distributions (both parameters less than 1) indicate markets that reach strong conclusions, while bell-shaped distributions (both parameters greater than 1) indicate persistent uncertainty.
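
Fitting is a one-liner with SciPy once the support is pinned to [0, 1]; the boundary clipping below is a practical necessity, since the Beta likelihood is undefined at exactly 0 or 1:

```python
import numpy as np
from scipy import stats

def fit_beta(prices: np.ndarray) -> tuple[float, float]:
    """Fit Beta(a, b) to observed prices with the support fixed at [0, 1].

    a < 1 and b < 1 together give a U-shape (mass piling up near the
    boundaries); a > 1 and b > 1 give a bell shape around the interior.
    """
    eps = 1e-4
    clipped = np.clip(prices, eps, 1 - eps)
    a, b, _, _ = stats.beta.fit(clipped, floc=0, fscale=1)
    return a, b
```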

9. Cross-Market Relationships Contain Information

Related markets should be analyzed together. Correlations reveal co-movement, lead-lag analysis identifies which markets incorporate information faster, and contagion analysis shows how shocks propagate. Deviations from expected cross-market relationships may signal arbitrage opportunities.
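
A sketch of a simple lead-lag scan: correlate one market's price changes with the other's at a range of offsets. Alignment of the two Series on a shared time index is assumed:

```python
import pandas as pd

def lead_lag(a: pd.Series, b: pd.Series, max_lag: int = 10) -> pd.Series:
    """Correlation of a's price changes with b's changes at each offset.

    A peak at a positive lag means a's moves tend to precede b's,
    i.e. market a incorporates information first.
    """
    da, db = a.diff(), b.diff()
    return pd.Series(
        {lag: da.corr(db.shift(-lag)) for lag in range(-max_lag, max_lag + 1)}
    )
```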

10. Regime Detection Reveals Hidden Structure

Markets pass through qualitatively different phases (calm, volatile, trending, converging). Hidden Markov Models and change-point detection methods can identify these regimes from the data. Regime awareness improves trading strategies: different strategies work in different regimes.
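
A minimal HMM sketch using the hmmlearn package (one option among several; the two-feature design below is an assumption, chosen so the model can separate calm from volatile as well as drifting from flat):

```python
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

def detect_regimes(price: pd.Series, n_states: int = 3) -> pd.Series:
    """Label each observation with a fitted hidden regime.

    Features are (change, |change|): the first captures drift,
    the second captures volatility.
    """
    changes = price.diff().dropna()
    X = np.column_stack([changes, changes.abs()])
    model = GaussianHMM(n_components=n_states, covariance_type="full",
                        n_iter=200, random_state=0)
    model.fit(X)
    return pd.Series(model.predict(X), index=changes.index, name="regime")
```

State labels are arbitrary between runs; inspect each state's mean and variance to name the regimes before acting on them.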

11. Outliers Demand Careful Treatment

Not every extreme price is an error, and not every normal-looking price is correct. Use persistence, volume, cross-market confirmation, and external corroboration to distinguish genuine information shocks from data errors. Robust statistics (median, MAD, IQR) provide protection against outlier influence.
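
A sketch of a robust outlier score built from the rolling median and an approximate rolling MAD; the window and any |score| > 4 rule of thumb are illustrative choices:

```python
import pandas as pd

def robust_zscores(price: pd.Series, window: int = 50) -> pd.Series:
    """Outlier scores from rolling median and MAD instead of mean and std.

    The 1.4826 factor rescales the MAD so it is comparable to a standard
    deviation under normality. The rolling MAD here is an approximation:
    deviations are taken from the rolling median, then median-smoothed.
    """
    med = price.rolling(window, min_periods=10).median()
    mad = (price - med).abs().rolling(window, min_periods=10).median()
    return (price - med) / (1.4826 * mad)
```

High-scoring points should then face the checklist above: did the price persist, was there volume behind it, and do related markets agree?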

12. Separate Hypothesis Generation from Confirmation

The cardinal rule of EDA: never use the same data to both discover a pattern and confirm it. EDA generates hypotheses; formal testing on separate data confirms them. In prediction markets, where datasets are often small, the temptation to conflate these steps is particularly dangerous.
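
One blunt but effective way to enforce the discipline when you have many markets: partition them before any exploration, and leave the confirmation set untouched until your hypotheses are fully specified. (For a single time series, a temporal split serves the same purpose.)

```python
import numpy as np

def split_markets(market_ids: list[str], explore_frac: float = 0.5,
                  seed: int = 0) -> tuple[list[str], list[str]]:
    """Randomly partition markets into exploration and confirmation sets.

    Do all EDA on the first set; touch the second set exactly once,
    to test hypotheses written down in advance.
    """
    rng = np.random.default_rng(seed)
    ids = list(market_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * explore_frac)
    return ids[:cut], ids[cut:]
```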

13. Visualization is Central to Understanding

Effective charts are not decoration---they are analytical tools. Prediction markets demand specialized visualizations: probability timelines, fan charts, probability ribbons, candlestick analogs, and correlation heatmaps. Interactive plots (via Plotly) enable exploration that static charts cannot match.
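
As one example, a probability ribbon takes only a few lines of Plotly; the rolling interquartile band below is an illustrative choice of ribbon:

```python
import pandas as pd
import plotly.graph_objects as go

def probability_ribbon(price: pd.Series, window: int = 30) -> go.Figure:
    """Interactive price timeline with a rolling interquartile ribbon."""
    lo = price.rolling(window).quantile(0.25)
    hi = price.rolling(window).quantile(0.75)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=price.index, y=lo, line_width=0,
                             showlegend=False))
    fig.add_trace(go.Scatter(x=price.index, y=hi, line_width=0,
                             fill="tonexty", name="rolling IQR"))
    fig.add_trace(go.Scatter(x=price.index, y=price, name="price"))
    fig.update_yaxes(range=[0, 1], title="implied probability")
    return fig
```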

14. Build Reusable Templates

Create a standardized EDA template that you apply consistently to every market. This ensures thoroughness, enables comparison across markets, and prevents selective analysis. For large-scale analysis, automated report generation makes comprehensive EDA feasible across hundreds of markets.
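
A sketch of what such a template can look like; every field name below is illustrative, and each entry can be swapped for the fuller sketches earlier in this chapter:

```python
import pandas as pd

def eda_report(df: pd.DataFrame) -> dict:
    """One standardized pass over a single market.

    Assumes a timestamp-indexed DataFrame with 'price' and 'volume'
    columns. Apply identically to every market, then tabulate.
    """
    changes = df["price"].diff().dropna()
    return {
        "n_obs": len(df),
        "first_obs": df.index.min(),
        "last_obs": df.index.max(),
        "mean_price": df["price"].mean(),
        "median_price": df["price"].median(),
        "total_volume": df["volume"].sum(),
        "volatility": changes.std(),
        "lag1_autocorr": changes.autocorr(lag=1),
        "max_volume_ratio": df["volume"].max() / max(df["volume"].median(), 1e-9),
    }

# Batch mode over many markets (load_market is a hypothetical loader):
# reports = pd.DataFrame({m: eda_report(load_market(m)) for m in market_ids}).T
```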

15. EDA is Iterative, Not Linear

You will return to EDA repeatedly throughout the analysis lifecycle. Modeling failures send you back to understand your data better. New data demands fresh exploration. Feature engineering is informed by EDA findings, and model diagnostics often reveal patterns that were missed in the initial exploration. Treat EDA as an ongoing practice, not a completed step.