Chapter 21: Key Takeaways
Exploratory Data Analysis of Market Data
1. EDA is the Foundation, Not a Formality
Exploratory Data Analysis is not a box to check before modeling; it is the intellectual foundation for every analytical decision that follows. Skipping or rushing EDA means building models on unverified assumptions. Every feature you engineer, every model you select, and every evaluation metric you choose should be informed by what you discovered during EDA.
2. Prediction Markets Require Specialized EDA
Standard financial EDA techniques must be adapted for prediction markets. Prices are bounded between 0 and 1, markets have finite lifetimes with known resolution dates, and volume follows event-driven lifecycles rather than continuous patterns. These peculiarities demand purpose-built analytical approaches.
3. Summary Statistics Reveal Market Character
Begin with descriptive statistics: mean, median, standard deviation, skewness, kurtosis, volume-weighted average price (VWAP), price range, and total volume. These simple numbers tell you whether a market is liquid or thin, stable or volatile, trending or range-bound. Always compute both standard and robust statistics to detect the influence of outliers.
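As a minimal sketch of such a summary, assuming a pandas DataFrame with illustrative `price` and `volume` columns (not a fixed schema):

```python
import numpy as np
import pandas as pd
from scipy import stats

def summarize_market(df: pd.DataFrame) -> pd.Series:
    """Standard and robust summary statistics for a single market.

    Assumes columns 'price' (bounded in [0, 1]) and 'volume'
    (illustrative names).
    """
    p, v = df["price"], df["volume"]
    return pd.Series({
        "mean": p.mean(),
        "median": p.median(),                  # robust counterpart of the mean
        "std": p.std(),
        "mad": stats.median_abs_deviation(p),  # robust counterpart of the std
        "skewness": stats.skew(p),
        "kurtosis": stats.kurtosis(p),         # excess kurtosis; normal = 0
        "vwap": np.average(p, weights=v) if v.sum() > 0 else np.nan,
        "price_range": p.max() - p.min(),
        "total_volume": v.sum(),
    })
```

A large gap between `std` and `mad` is itself diagnostic: it usually means a handful of extreme observations dominate the standard statistics.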
4. Price Time Series Tell the Story
A single price timeline, properly formatted with a fixed 0-to-1 y-axis, event annotations, and a volume subplot, communicates more about a market than any table of numbers. Moving averages reveal trends, while price change distributions expose the statistical character of the market's dynamics.
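A minimal matplotlib sketch of this layout, again assuming a DataFrame with a DatetimeIndex and illustrative `price` and `volume` columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_market(df: pd.DataFrame, events=None, ma_window=24,
                title="Market price history"):
    """Price timeline with a fixed [0, 1] axis, event annotations,
    a moving average, and a volume subplot.

    `events` maps timestamps to short labels (optional).
    """
    fig, (ax_p, ax_v) = plt.subplots(
        2, 1, sharex=True, figsize=(10, 6),
        gridspec_kw={"height_ratios": [3, 1]},
    )
    ax_p.plot(df.index, df["price"], lw=1, label="price")
    ax_p.plot(df.index, df["price"].rolling(ma_window).mean(),
              lw=2, label=f"{ma_window}-period MA")
    for ts, label in (events or {}).items():      # event annotations
        ax_p.axvline(ts, color="gray", ls="--", lw=1)
        ax_p.annotate(label, (ts, 0.95), rotation=90, fontsize=8)
    ax_p.set_ylim(0, 1)                           # fixed probability axis
    ax_p.set_ylabel("implied probability")
    ax_p.set_title(title)
    ax_p.legend(loc="best")
    ax_v.bar(df.index, df["volume"], width=0.02)  # width in days
    ax_v.set_ylabel("volume")
    fig.tight_layout()
    return fig
```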
5. Volume Patterns Reflect the Market Lifecycle
Prediction markets follow a characteristic volume lifecycle: low volume at creation, moderate volume during sustained trading, event-driven spikes, and often a final burst near resolution. Volume spikes frequently coincide with information events, and the volume-price relationship reveals whether price moves are informationally driven or noise.
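One simple way to operationalize this, as a sketch: flag spikes against a trailing median baseline and check whether volume co-moves with the size of price changes. The window and threshold below are illustrative choices, not recommendations:

```python
import pandas as pd

def flag_volume_spikes(volume: pd.Series, window=48, k=5.0) -> pd.Series:
    """Mark observations whose volume exceeds k times the trailing median."""
    baseline = volume.rolling(window, min_periods=window // 2).median()
    return volume > k * baseline

def volume_price_link(df: pd.DataFrame) -> float:
    """Spearman correlation between volume and |price change|.

    A clearly positive value suggests moves are information-driven
    rather than noise.
    """
    return df["volume"].corr(df["price"].diff().abs(), method="spearman")
```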
6. Volatility is Mechanically Linked to Price Level
In binary markets, the maximum possible volatility at price $p$ is $\sqrt{p(1-p)}$. This means volatility near the boundaries (0 or 1) is mechanically constrained. Normalize volatility by this factor for fair comparison across different price levels. Volatility clustering, the tendency for volatile periods to persist, is common and can be captured by GARCH models.
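A sketch of the normalization, where the rolling window is an illustrative choice:

```python
import numpy as np
import pandas as pd

def normalized_volatility(price: pd.Series, window=24) -> pd.Series:
    """Rolling volatility of price changes, scaled by the sqrt(p(1-p)) bound.

    Near 0 or 1 raw volatility is mechanically small; dividing by the
    bound makes volatility comparable across price levels.
    """
    raw = price.diff().rolling(window).std()
    bound = np.sqrt(price * (1 - price)).clip(lower=1e-6)  # avoid divide-by-zero
    return raw / bound
```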
7. Autocorrelation Tests Market Efficiency
If price changes are autocorrelated, the market is not fully efficient: past changes contain information about future changes. The ACF, runs test, and Ljung-Box test provide complementary evidence. Positive autocorrelation suggests momentum; negative autocorrelation suggests mean reversion; zero autocorrelation is consistent with efficiency, at least with respect to linear predictability.
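All three checks are available in statsmodels; a sketch, assuming a reasonably recent statsmodels (the runs test lives in its sandbox module, and the Ljung-Box return format has varied across versions):

```python
import numpy as np
from statsmodels.sandbox.stats.runs import runstest_1samp
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import acf

def efficiency_tests(price_changes, lags=10):
    """Three complementary checks for linear predictability."""
    x = np.asarray(price_changes, dtype=float)
    x = x[~np.isnan(x)]
    autocorr = acf(x, nlags=lags)                   # sample ACF
    lb = acorr_ljungbox(x, lags=[lags])             # joint test on first `lags` lags
    z, p_runs = runstest_1samp(x, cutoff="median")  # sign-pattern randomness
    return {
        "acf_lag1": autocorr[1],
        "ljung_box_p": float(lb["lb_pvalue"].iloc[0]),
        "runs_test_p": p_runs,
    }
```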
8. The Beta Distribution is Natural for Prediction Market Prices
Because prices are bounded between 0 and 1, the Beta distribution is the natural choice for modeling their distribution. U-shaped Beta distributions (both parameters less than 1) indicate markets that reach strong conclusions, while bell-shaped distributions indicate persistent uncertainty.
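A sketch of fitting the Beta shape parameters with SciPy, pinning the support to [0, 1]; the boundary clipping value is an illustrative choice to keep exact 0s and 1s inside the support:

```python
import numpy as np
from scipy import stats

def fit_beta(prices):
    """Fit a Beta(a, b) distribution to observed prices.

    a < 1 and b < 1  -> U-shaped: the market reaches strong conclusions.
    a > 1 and b > 1  -> bell-shaped: persistent uncertainty.
    """
    eps = 1e-4                                   # keep data off the boundary
    x = np.clip(np.asarray(prices, dtype=float), eps, 1 - eps)
    a, b, loc, scale = stats.beta.fit(x, floc=0, fscale=1)
    return a, b
```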
9. Cross-Market Relationships Contain Information
Related markets should be analyzed together. Correlations reveal co-movement, lead-lag analysis identifies which markets incorporate information faster, and contagion analysis shows how shocks propagate. Deviations from expected cross-market relationships may signal arbitrage opportunities.
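Pairwise co-movement comes straight from `DataFrame.corr()`; lead-lag structure needs a small helper. A sketch, with the sign convention stated as an assumption in the docstring:

```python
import pandas as pd

def lead_lag(a: pd.Series, b: pd.Series, max_lag=12) -> pd.Series:
    """Correlation of a's changes with b's changes at each lag.

    Convention assumed here: a peak at a positive lag means a's moves
    tend to precede b's, i.e. market a incorporates information first.
    """
    da, db = a.diff(), b.diff()
    return pd.Series(
        {lag: da.corr(db.shift(-lag)) for lag in range(-max_lag, max_lag + 1)}
    )
```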
10. Regime Detection Reveals Hidden Structure
Markets pass through qualitatively different phases (calm, volatile, trending, converging). Hidden Markov Models and change-point detection methods can identify these regimes from the data. Regime awareness improves trading strategies: different strategies work in different regimes.
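A sketch using the third-party `hmmlearn` package (assumed installed); a two-state model separating calm from volatile periods is the usual starting point:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def detect_regimes(price_changes, n_regimes=2, seed=0):
    """Label each observation with a latent regime via a Gaussian HMM.

    Regimes typically separate by variance: a calm state and a
    volatile state.
    """
    x = np.asarray(price_changes, dtype=float)
    x = x[~np.isnan(x)].reshape(-1, 1)      # hmmlearn expects 2-D input
    model = GaussianHMM(n_components=n_regimes, covariance_type="full",
                        n_iter=200, random_state=seed)
    model.fit(x)
    return model.predict(x), model          # per-observation regime labels
```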
11. Outliers Demand Careful Treatment
Not every extreme price is an error, and not every normal-looking price is correct. Use persistence, volume, cross-market confirmation, and external corroboration to distinguish genuine information shocks from data errors. Robust statistics (median, MAD, IQR) provide protection against outlier influence.
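A sketch of a MAD-based screen; note that it only nominates candidates, which the checks above must then adjudicate. The threshold is an illustrative choice:

```python
import numpy as np
from scipy import stats

def robust_outlier_flags(prices, threshold=5.0):
    """Flag points far from the median, measured in MAD units.

    A flag is a candidate, not a verdict: check persistence, volume,
    and related markets before treating a point as an error.
    """
    x = np.asarray(prices, dtype=float)
    mad = stats.median_abs_deviation(x, scale="normal")  # ~std under normality
    if mad == 0:
        return np.zeros_like(x, dtype=bool)
    z = (x - np.median(x)) / mad
    return np.abs(z) > threshold
```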
12. Separate Hypothesis Generation from Confirmation
The cardinal rule of EDA: never use the same data to both discover a pattern and confirm it. EDA generates hypotheses; formal testing on separate data confirms them. In prediction markets, where datasets are often small, the temptation to conflate these steps is particularly dangerous.
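One concrete discipline, sketched below: partition markets by resolution date before exploring, so the confirmation set is never seen during EDA. The `resolution_date` column name is illustrative:

```python
import pandas as pd

def split_markets(markets: pd.DataFrame, frac_explore=0.5):
    """Split markets into exploration and confirmation sets by resolution date.

    Patterns found in the exploration set are hypotheses; only a test
    on the held-out confirmation set counts as evidence.
    """
    ordered = markets.sort_values("resolution_date")
    cut = int(len(ordered) * frac_explore)
    return ordered.iloc[:cut], ordered.iloc[cut:]
```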
13. Visualization is Central to Understanding
Effective charts are not decoration; they are analytical tools. Prediction markets demand specialized visualizations: probability timelines, fan charts, probability ribbons, candlestick analogs, and correlation heatmaps. Interactive plots (via Plotly) enable exploration that static charts cannot match.
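A minimal Plotly sketch of the most basic of these, the interactive probability timeline (column names are illustrative):

```python
import plotly.graph_objects as go

def probability_timeline(df, title="Probability timeline"):
    """Interactive probability timeline with unified hover inspection.

    Assumes a DatetimeIndex and a 'price' column.
    """
    fig = go.Figure(go.Scatter(x=df.index, y=df["price"], mode="lines",
                               name="implied probability"))
    fig.update_yaxes(range=[0, 1], title="probability")  # fixed 0-to-1 axis
    fig.update_layout(title=title, hovermode="x unified")
    return fig
```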
14. Build Reusable Templates
Create a standardized EDA template that you apply consistently to every market. This ensures thoroughness, enables comparison across markets, and prevents selective analysis. For large-scale analysis, automated report generation makes comprehensive EDA feasible across hundreds of markets.
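As a sketch of what such a template can look like, composing the helper functions from the earlier takeaways into one standardized pass (all names come from the sketches above, not from a published library):

```python
def run_eda_template(df, market_id):
    """One standardized EDA pass, applied identically to every market.

    Collecting results into a dict lets hundreds of markets be
    reduced to a single comparison table.
    """
    report = {"market_id": market_id}
    report.update(summarize_market(df).to_dict())                          # takeaway 3
    report["volume_spikes"] = int(flag_volume_spikes(df["volume"]).sum())  # takeaway 5
    report["norm_vol_mean"] = normalized_volatility(df["price"]).mean()    # takeaway 6
    report.update(efficiency_tests(df["price"].diff().dropna()))           # takeaway 7
    report["beta_a"], report["beta_b"] = fit_beta(df["price"])             # takeaway 8
    return report
```

Running this over every market and stacking the dicts into one DataFrame is what makes comparison across markets, and automated reporting at scale, straightforward.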
15. EDA is Iterative, Not Linear
You will return to EDA repeatedly throughout the analysis lifecycle. Modeling failures send you back to understand your data better. New data demands fresh exploration. Feature engineering is informed by EDA findings, and model diagnostics often reveal patterns that were missed in the initial exploration. Treat EDA as an ongoing practice, not a completed step.