Chapter 21: Exploratory Data Analysis of Market Data

"The greatest value of a picture is when it forces us to notice what we never expected to see." --- John W. Tukey, Exploratory Data Analysis (1977)

Before you build a predictive model, before you design a trading strategy, before you test a hypothesis about market efficiency, you must first look at your data. Exploratory Data Analysis (EDA) is the practice of systematically examining datasets to understand their structure, uncover patterns, detect anomalies, and generate hypotheses. In prediction markets, EDA is not merely a preliminary step---it is a discipline that separates informed participants from those trading blind.

Prediction market data is unlike traditional financial data in several important ways. Prices are bounded between 0 and 1 (or equivalently, 0 and 100 cents). Markets have finite lifetimes with known resolution dates. Volume patterns follow event-driven lifecycles rather than continuous trading patterns. These peculiarities demand specialized analytical approaches.

This chapter provides an exhaustive guide to conducting EDA on prediction market data. We will cover everything from basic summary statistics to advanced regime detection, with working Python code throughout. By the end, you will have a complete toolkit for understanding any prediction market dataset you encounter.


21.1 The EDA Mindset for Prediction Markets

21.1.1 What Questions Should You Ask?

EDA begins not with code but with questions. When you encounter a prediction market dataset, you should be asking:

About the market itself:
  • What is being predicted? What is the resolution criterion?
  • How long has the market been active? When does it resolve?
  • What is the current price, and how has it changed over time?
  • How liquid is the market? Is there enough volume to trust the price signal?

About price behavior:
  • Is the price trending or mean-reverting?
  • Are there sudden jumps or gradual drifts?
  • Does the price cluster near 0 or 1, or does it stay in the middle?
  • How volatile is the price? Has volatility changed over time?
  • Are price changes autocorrelated, or do they behave like a random walk?

About volume and participation:
  • When do people trade? Are there time-of-day or day-of-week patterns?
  • Do volume spikes correspond to information events?
  • Is the market getting more or less active over its lifetime?
  • What is the typical trade size?

About cross-market relationships:
  • Do related markets move together?
  • Does one market lead another?
  • Are there arbitrage relationships between markets?

About data quality:
  • Are there missing observations or gaps in the time series?
  • Are there obviously erroneous prices (e.g., a price of 0.99 immediately followed by 0.01)?
  • Does the data source provide consistent timestamps and formats?

21.1.2 Hypothesis Generation vs. Confirmation

EDA serves two distinct purposes that should not be confused:

Hypothesis generation is the primary purpose of EDA. You look at data with an open mind, notice patterns, and formulate testable hypotheses. For example, you might notice that prices tend to drift upward in the final week before resolution and hypothesize that this reflects late-arriving confirming information.

Hypothesis confirmation is what comes after EDA, using formal statistical tests and out-of-sample validation. The critical principle is: never use the same data to both generate and confirm a hypothesis. If you notice a pattern during EDA, you must test it on separate data. Violating this principle leads to overfitting and false discoveries.

In prediction markets, this distinction is especially important because the datasets are often small (hundreds or thousands of markets, not millions), making it easy to find spurious patterns.

21.1.3 EDA as the Foundation for Modeling

Every modeling decision you make downstream---choice of features, model family, evaluation metric---should be informed by your EDA findings:

| EDA Finding | Modeling Implication |
|---|---|
| Price distribution is bimodal | Consider models that handle multi-modal targets |
| Strong autocorrelation in price changes | Time-series models may outperform cross-sectional ones |
| Volume spikes precede price moves | Volume is a useful feature for prediction |
| Non-stationary volatility | Use GARCH or regime-switching models |
| Fat-tailed return distribution | Avoid models that assume normality |
| Lead-lag between markets | Include lagged cross-market features |

Without EDA, you are building models on assumptions you have never verified.


21.2 Summary Statistics for Market Data

21.2.1 Descriptive Statistics for Prices

The simplest starting point is computing standard descriptive statistics for prices. Given a vector of observed prices $p_1, p_2, \ldots, p_n$, we compute:

Mean price: $$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i$$

Standard deviation: $$s_p = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (p_i - \bar{p})^2}$$

Skewness (third standardized moment): $$\gamma_1 = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{p_i - \bar{p}}{s_p}\right)^3$$

Kurtosis (fourth standardized moment, excess): $$\gamma_2 = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{p_i - \bar{p}}{s_p}\right)^4 - 3$$

For prediction markets, some additional statistics are particularly informative (a short TWAP/VWAP computation sketch follows the list):

  • Time-weighted average price (TWAP): Weights each price by the duration it was the prevailing price.
  • Volume-weighted average price (VWAP): Weights each price by the volume traded at that price.
  • Price range: The difference between the maximum and minimum observed prices.
  • Final price: The last traded price before resolution, which represents the market's final probability estimate.
  • Resolution outcome: Whether the event occurred (1) or not (0).
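
The sketch below shows one way to compute TWAP and VWAP from aligned arrays; the function name and the approximation for the final observation's duration are illustrative assumptions, not a fixed convention.

import numpy as np
import pandas as pd

def twap_vwap(prices, volumes, timestamps):
    """Illustrative time-weighted and volume-weighted average prices."""
    prices = np.asarray(prices, dtype=float)
    volumes = np.asarray(volumes, dtype=float)
    ts = pd.to_datetime(pd.Series(timestamps))

    # Duration (in seconds) that each price prevailed; the last observation
    # has no successor, so it receives the median duration as an approximation.
    durations = ts.diff().shift(-1).dt.total_seconds()
    durations.iloc[-1] = durations.median()

    twap = np.average(prices, weights=durations.values)
    vwap = np.average(prices, weights=volumes) if volumes.sum() > 0 else np.nan
    return twap, vwap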

21.2.2 Descriptive Statistics for Volumes

Volume statistics reveal the market's liquidity and activity patterns:

  • Total volume: The sum of all traded volume over the market's lifetime.
  • Average daily volume: Total volume divided by the number of active trading days.
  • Volume distribution: The distribution of daily volumes (often highly right-skewed).
  • Volume concentration: What fraction of total volume occurs in the top 10% of trading days.

21.2.3 Spread Statistics

The bid-ask spread is a direct measure of transaction costs and market quality:

  • Average spread: The mean difference between best ask and best bid.
  • Median spread: More robust to outliers than the mean.
  • Spread as percentage of price: Particularly important near the boundaries (a 2-cent spread means very different things at a price of 50 cents vs. 5 cents).

21.2.4 Distribution of Final Outcomes

When analyzing a collection of resolved markets, the distribution of final outcomes is revealing:

  • Resolution rate: What fraction of markets resolved Yes vs. No.
  • Calibration: For markets that traded at price $p$, what fraction actually resolved Yes? Perfect calibration means a price of 0.70 corresponds to a 70% resolution rate (a binned calibration sketch follows this list).
  • Average Brier score: A platform-level measure of forecast accuracy.
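
A minimal sketch of a binned calibration check, assuming a DataFrame of resolved markets with hypothetical columns final_price and outcome:

import numpy as np
import pandas as pd

def calibration_table(resolved_df, n_bins=10):
    """Compare the mean price in each final-price bin to the observed
    resolution rate; perfect calibration makes the two roughly equal."""
    bins = np.linspace(0, 1, n_bins + 1)
    binned = pd.cut(resolved_df['final_price'], bins, include_lowest=True)

    table = resolved_df.groupby(binned, observed=True).agg(
        n_markets=('outcome', 'size'),
        mean_price=('final_price', 'mean'),
        resolution_rate=('outcome', 'mean'),
    )
    table['gap'] = table['resolution_rate'] - table['mean_price']
    return table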

21.2.5 Python Summary Generator

import numpy as np
import pandas as pd
from scipy import stats

def compute_market_summary(prices, volumes, timestamps, outcome=None):
    """
    Compute comprehensive summary statistics for a single prediction market.

    Parameters
    ----------
    prices : array-like
        Observed prices (between 0 and 1).
    volumes : array-like
        Traded volumes corresponding to each price observation.
    timestamps : array-like
        Timestamps for each observation.
    outcome : int or None
        Resolution outcome (1 = Yes, 0 = No, None = unresolved).

    Returns
    -------
    dict
        Dictionary of summary statistics.
    """
    prices = np.array(prices, dtype=float)
    volumes = np.array(volumes, dtype=float)

    # Basic price statistics
    summary = {
        'n_observations': len(prices),
        'price_mean': np.mean(prices),
        'price_median': np.median(prices),
        'price_std': np.std(prices, ddof=1),
        'price_min': np.min(prices),
        'price_max': np.max(prices),
        'price_range': np.max(prices) - np.min(prices),
        'price_skewness': stats.skew(prices),
        'price_kurtosis': stats.kurtosis(prices),
        'price_iqr': np.percentile(prices, 75) - np.percentile(prices, 25),
        'price_first': prices[0],
        'price_last': prices[-1],
        'price_change': prices[-1] - prices[0],
    }

    # Volume statistics
    summary.update({
        'total_volume': np.sum(volumes),
        'mean_volume': np.mean(volumes),
        'median_volume': np.median(volumes),
        'max_volume': np.max(volumes),
        'volume_std': np.std(volumes, ddof=1),
        'volume_skewness': stats.skew(volumes),
    })

    # VWAP
    if np.sum(volumes) > 0:
        summary['vwap'] = np.average(prices, weights=volumes)
    else:
        summary['vwap'] = np.nan

    # Price change statistics
    price_changes = np.diff(prices)
    if len(price_changes) > 0:
        summary.update({
            'mean_price_change': np.mean(price_changes),
            'std_price_change': np.std(price_changes, ddof=1),
            'max_price_increase': np.max(price_changes),
            'max_price_decrease': np.min(price_changes),
            'pct_positive_changes': np.mean(price_changes > 0),
        })

    # Outcome-related
    if outcome is not None:
        summary['outcome'] = outcome
        summary['final_price_error'] = abs(prices[-1] - outcome)
        summary['brier_score'] = (prices[-1] - outcome) ** 2

    return summary


def compute_platform_summary(market_summaries_df):
    """
    Compute platform-level summary statistics from a DataFrame
    of individual market summaries.
    """
    df = market_summaries_df

    platform = {
        'total_markets': len(df),
        'resolved_markets': df['outcome'].notna().sum() if 'outcome' in df.columns else 0,
        'mean_total_volume': df['total_volume'].mean(),
        'median_total_volume': df['total_volume'].median(),
        'mean_price_range': df['price_range'].mean(),
        'mean_observations': df['n_observations'].mean(),
    }

    if 'outcome' in df.columns and 'brier_score' in df.columns:
        resolved = df[df['outcome'].notna()]
        platform['mean_brier_score'] = resolved['brier_score'].mean()
        platform['resolution_rate_yes'] = resolved['outcome'].mean()

    return platform

21.3 Price Time-Series Analysis

21.3.1 Plotting Price Evolution

The single most informative visualization for any prediction market is a time-series plot of prices. Unlike stock prices, prediction market prices are naturally bounded between 0 and 1, which affects how we should display them.

Best practices for prediction market price plots:

  1. Fix the y-axis from 0 to 1 (or 0% to 100%). Do not let the axis auto-scale, as this can make small fluctuations appear dramatic.
  2. Mark key events on the timeline (debates, announcements, data releases).
  3. Show the resolution date and outcome if the market has resolved.
  4. Include volume as a secondary axis or as bar charts below the price chart.

21.3.2 Trends, Regimes, and Jumps

Price movements in prediction markets generally fall into three categories:

Trends are sustained directional movements reflecting gradually arriving information. In an election market, a candidate's price might trend upward over weeks as polls consistently show improvement. A linear trend can be estimated by ordinary least squares:

$$p_t = \alpha + \beta t + \epsilon_t$$

where $\beta$ represents the average price change per unit time.

Regimes are periods of qualitatively different behavior. A market might spend weeks trading in a narrow range around 0.50, then shift to a new range around 0.75 after a major event. We will cover regime detection in detail in Section 21.9.

Jumps are sudden, large price changes that occur within a short time window. These typically correspond to discrete information events. Formally, a jump at time $t$ can be identified when:

$$|p_t - p_{t-1}| > k \cdot \sigma$$

where $\sigma$ is the local standard deviation of price changes and $k$ is a threshold (typically 3 or more).
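
A short sketch combining the trend and jump definitions above, assuming prices is a pandas Series indexed by time; the function name and default thresholds are illustrative.

import numpy as np
import pandas as pd

def trend_and_jumps(prices, window=20, k=3.0):
    """Estimate a linear trend by OLS and flag jumps exceeding k local
    standard deviations of the price changes."""
    prices = pd.Series(prices, dtype=float)
    t = np.arange(len(prices))

    # OLS trend: the slope is the average price change per observation.
    slope, intercept = np.polyfit(t, prices.values, 1)

    changes = prices.diff()
    local_sigma = changes.rolling(window).std()
    jumps = changes.abs() > k * local_sigma

    return slope, prices.index[jumps.values]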

21.3.3 Moving Averages

Moving averages smooth out short-term fluctuations and reveal underlying trends. For prediction market data, we commonly use:

Simple Moving Average (SMA): $$\text{SMA}_t(w) = \frac{1}{w} \sum_{i=0}^{w-1} p_{t-i}$$

Exponential Moving Average (EMA): $$\text{EMA}_t = \alpha \cdot p_t + (1 - \alpha) \cdot \text{EMA}_{t-1}$$

where $\alpha = \frac{2}{w+1}$ and $w$ is the window size.

The crossover of short-term and long-term moving averages can signal trend changes. When the short-term MA crosses above the long-term MA, it suggests a shift to bullish sentiment. However, in prediction markets this signal is less reliable than in equity markets because the bounded price range limits the extent of trends.
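
In pandas, both averages are one-liners; a sketch assuming prices is a Series, with window lengths chosen purely for illustration:

import pandas as pd

def moving_averages(prices, short_window=5, long_window=20):
    """Simple and exponential moving averages plus a crossover flag."""
    prices = pd.Series(prices, dtype=float)
    sma_short = prices.rolling(window=short_window).mean()
    sma_long = prices.rolling(window=long_window).mean()
    # ewm(span=w) uses alpha = 2 / (w + 1), matching the EMA formula above.
    ema = prices.ewm(span=long_window, adjust=False).mean()
    crossover_up = (sma_short > sma_long) & (sma_short.shift(1) <= sma_long.shift(1))
    return pd.DataFrame({
        'sma_short': sma_short,
        'sma_long': sma_long,
        'ema': ema,
        'crossover_up': crossover_up,
    })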

21.3.4 Price Change Distributions

The distribution of price changes $\Delta p_t = p_t - p_{t-1}$ reveals important properties of the market:

  • Symmetric distribution centered at zero: Consistent with an efficient market where price changes are unpredictable.
  • Non-zero mean: Suggests a drift or trend in the price.
  • Fat tails (excess kurtosis): Indicates occasional large price jumps, more extreme than a normal distribution would predict.
  • Skewness: Positive skewness means large upward moves are more common than large downward moves (or vice versa for negative skewness).

For binary prediction markets, the distribution of price changes is inherently bounded. If the current price is 0.90, the maximum possible upward change is 0.10, but the maximum downward change is 0.90. This boundary effect creates asymmetry in the price change distribution near the extremes.

21.3.5 Time-of-Day and Day-of-Week Patterns

Prediction markets on platforms like Polymarket and Kalshi operate continuously, and trading activity often follows predictable temporal patterns:

  • US trading hours (9 AM - 5 PM ET) typically see the highest volume for US-focused markets.
  • Monday mornings often show price adjustments reflecting weekend information accumulation.
  • Event-driven spikes occur at known times (e.g., debate start times, data release schedules).

Detecting these patterns involves grouping observations by hour of day or day of week and computing statistics within each group:

# Group by hour and compute mean volume
hourly_volume = df.groupby(df['timestamp'].dt.hour)['volume'].mean()
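# Similarly, group by day of week (0 = Monday) to check weekly patterns
daily_volume = df.groupby(df['timestamp'].dt.dayofweek)['volume'].mean()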

21.3.6 Python Time-Series Explorer

See code/example-01-time-series-eda.py for a complete time-series exploration tool that includes:

  • Interactive price and volume plots
  • Moving average overlays
  • Price change distribution analysis
  • Temporal pattern detection
  • Jump detection with event annotation

21.4 Volume Analysis

21.4.1 Volume Patterns Over the Market Lifecycle

Prediction markets follow a distinctive lifecycle pattern that differs markedly from traditional financial instruments:

Early phase (market creation): Volume is typically low as the market is discovered by participants. Prices may be unreliable due to thin trading.

Middle phase (sustained trading): Volume stabilizes at a base level, with occasional spikes driven by information events. This is typically the longest phase.

Late phase (pre-resolution): Volume often increases as the resolution date approaches, particularly if uncertainty remains high. Markets that have already converged to 0 or 1 may see declining volume.

Resolution phase: A final burst of volume may occur as the outcome becomes clear and positions are closed.

Mathematically, the volume lifecycle can be modeled as a function of time-to-resolution $\tau = T - t$:

$$V(\tau) = V_0 + \alpha \cdot \tau^{-\beta} + \epsilon(\tau)$$

where $V_0$ is a baseline volume, $\alpha$ and $\beta$ control the increase in volume as resolution approaches, and $\epsilon(\tau)$ captures event-driven spikes.
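
A hedged sketch of fitting this lifecycle curve with scipy's curve_fit, assuming daily volume is indexed by (strictly positive) days-to-resolution; the starting values and bounds are illustrative, and the fit can be sensitive to them.

import numpy as np
from scipy.optimize import curve_fit

def volume_lifecycle(tau, v0, alpha, beta):
    """V(tau) = V0 + alpha * tau**(-beta); requires tau > 0."""
    return v0 + alpha * np.power(tau, -beta)

def fit_volume_lifecycle(tau, volume):
    tau = np.asarray(tau, dtype=float)
    volume = np.asarray(volume, dtype=float)
    mask = tau > 0
    params, _ = curve_fit(
        volume_lifecycle, tau[mask], volume[mask],
        p0=[volume.mean(), 1.0, 0.5],
        bounds=([0, 0, 0], [np.inf, np.inf, 5.0]),
    )
    return dict(zip(['v0', 'alpha', 'beta'], params))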

21.4.2 Volume Spikes and Events

Volume spikes often correspond to information events. Detecting these spikes is a form of anomaly detection:

Z-score method: Flag observations where volume exceeds $k$ standard deviations above the mean:

$$z_t = \frac{V_t - \bar{V}}{\sigma_V}$$

A volume spike is detected when $z_t > k$ (typically $k = 2$ or $k = 3$).

Rolling window method: Compare each day's volume to a rolling average of recent volume:

$$\text{ratio}_t = \frac{V_t}{\text{SMA}_t(w)}$$

A spike is detected when $\text{ratio}_t > k$ (e.g., $k = 3$ means volume is three times the recent average).
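
Both detectors fit in a few lines; the sketch below assumes volumes is a Series and uses illustrative thresholds rather than recommended values.

import pandas as pd

def detect_volume_spikes(volumes, z_threshold=3.0, ratio_threshold=3.0, window=20):
    """Flag volume spikes by global z-score and by ratio to a rolling average."""
    v = pd.Series(volumes, dtype=float)

    z_score = (v - v.mean()) / v.std()
    rolling_ratio = v / v.rolling(window=window, min_periods=1).mean()

    return pd.DataFrame({
        'volume': v,
        'z_score': z_score,
        'rolling_ratio': rolling_ratio,
        'spike_zscore': z_score > z_threshold,
        'spike_rolling': rolling_ratio > ratio_threshold,
    })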

21.4.3 Volume-Price Relationships

The relationship between volume and price changes is informative about market dynamics:

  • High volume + large price change: New information is being incorporated. The market is functioning efficiently.
  • High volume + small price change: Disagreement among participants. Buyers and sellers are roughly balanced.
  • Low volume + large price change: Potentially unreliable price movement. A single large order in a thin market can move the price substantially.
  • Low volume + small price change: Low interest or consensus among participants.

A useful metric is the price impact per unit volume:

$$\text{Impact}_t = \frac{|\Delta p_t|}{V_t}$$

High price impact indicates a thin market where even small trades move prices significantly.

21.4.4 Thin vs. Thick Markets

Market thickness---the amount of liquidity available---is a critical consideration for prediction market analysis:

Thin markets (low volume, wide spreads) pose several challenges:
  • Prices may not reflect true consensus probabilities.
  • Large trades can cause significant price impact.
  • Statistical analyses may be unreliable due to noise.

Thick markets (high volume, tight spreads) offer:
  • More reliable price signals.
  • Lower transaction costs.
  • Better statistical properties for analysis.

A practical rule of thumb: treat markets with fewer than 100 total trades or less than $1,000 in total volume as "thin" and interpret their prices with caution.

21.4.5 Python Volume Profiler

See code/example-02-volume-analysis.py for a complete volume analysis tool.


21.5 Volatility Analysis

21.5.1 Measuring Prediction Market Volatility

Volatility in prediction markets has unique characteristics. Since prices are bounded between 0 and 1, volatility is mechanically constrained by the price level. A price near 0.50 can exhibit much higher volatility than a price near 0.01 or 0.99, simply because there is more room for the price to move.

The simplest volatility measure is the standard deviation of price changes over a rolling window:

$$\sigma_t(w) = \sqrt{\frac{1}{w-1} \sum_{i=0}^{w-1} (\Delta p_{t-i} - \overline{\Delta p})^2}$$

For prediction markets, it is often more informative to compute volatility relative to the theoretical maximum at a given price level. For a binary market, the maximum standard deviation of a Bernoulli random variable with parameter $p$ is:

$$\sigma_{\max}(p) = \sqrt{p(1-p)}$$

The normalized volatility is then:

$$\sigma_{\text{norm}} = \frac{\sigma_{\text{observed}}}{\sigma_{\max}(p)}$$

This measures how volatile the market is relative to how volatile it could be given the current price level.

21.5.2 Realized Volatility

Realized volatility is computed from observed price changes over a specified window. For prediction markets, we typically compute it as:

$$RV_t(w) = \sqrt{\sum_{i=0}^{w-1} (\Delta p_{t-i})^2}$$

Note that this uses squared price changes directly, not demeaned changes. This is because under the efficient market hypothesis, the expected price change is zero, so demeaning adds noise.

Realized volatility can be computed at different frequencies (hourly, daily, weekly) to capture different aspects of market dynamics. Higher-frequency realized volatility captures intraday movements, while lower-frequency captures broader trends.

21.5.3 GARCH Basics for Binary Markets

The Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model captures the empirical phenomenon of volatility clustering: periods of high volatility tend to be followed by high volatility, and periods of low volatility by low volatility.

The basic GARCH(1,1) model for price changes $\Delta p_t$ is:

$$\Delta p_t = \mu + \epsilon_t, \quad \epsilon_t = \sigma_t z_t, \quad z_t \sim N(0,1)$$

$$\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2$$

where:
  • $\omega > 0$ is the long-run variance component.
  • $\alpha \geq 0$ is the ARCH coefficient (reaction to recent shocks).
  • $\beta \geq 0$ is the GARCH coefficient (persistence of volatility).
  • $\alpha + \beta < 1$ ensures stationarity.

For prediction markets, GARCH has some limitations:
  • The bounded nature of prices means the normal distribution assumption for $z_t$ may be inappropriate near the boundaries.
  • Market lifetimes are finite, so "long-run" volatility may not be well-defined.
  • The model does not account for the systematic decrease in uncertainty as resolution approaches.

Despite these limitations, GARCH remains a useful descriptive tool for detecting and quantifying volatility clustering in prediction market data.
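
As a hedged sketch (not part of the chapter's core code), a GARCH(1,1) can be fit with the third-party arch package. Rescaling price changes to cents helps the optimizer, since arch expects returns of roughly percentage magnitude.

import numpy as np
from arch import arch_model  # third-party package: pip install arch

def fit_garch(price_changes):
    """Fit a GARCH(1,1) to prediction market price changes (descriptive use)."""
    changes = np.asarray(price_changes, dtype=float)
    changes = changes[~np.isnan(changes)] * 100  # rescale from probability to cents

    model = arch_model(changes, mean='Constant', vol='GARCH', p=1, q=1)
    result = model.fit(disp='off')

    params = result.params
    return {
        'omega': params['omega'],
        'alpha': params['alpha[1]'],
        'beta': params['beta[1]'],
        'persistence': params['alpha[1]'] + params['beta[1]'],
    }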

21.5.4 Volatility Clustering

Volatility clustering is the tendency for large price changes to be followed by large price changes (of either sign), and small changes to be followed by small changes. This is a well-documented phenomenon in financial markets and also appears in prediction markets.

To test for volatility clustering, examine the autocorrelation of squared price changes:

$$\rho_k^{(2)} = \text{Corr}(\Delta p_t^2, \Delta p_{t-k}^2)$$

If $\rho_1^{(2)} > 0$ and statistically significant, volatility clustering is present.

Another test is to divide the time series into blocks and compute volatility within each block. If the block volatilities are positively autocorrelated, clustering is present.

21.5.5 Python Volatility Analyzer

import numpy as np
import pandas as pd
from scipy import stats

def compute_volatility_metrics(prices, window=20):
    """
    Compute various volatility metrics for prediction market prices.

    Parameters
    ----------
    prices : array-like
        Time series of prices (between 0 and 1).
    window : int
        Rolling window size for volatility computation.

    Returns
    -------
    pd.DataFrame
        DataFrame with volatility metrics for each time step.
    """
    prices = pd.Series(prices, dtype=float)
    changes = prices.diff()

    result = pd.DataFrame(index=prices.index)
    result['price'] = prices
    result['price_change'] = changes

    # Rolling standard deviation of price changes
    result['rolling_volatility'] = changes.rolling(window=window).std()

    # Realized volatility (sum of squared changes)
    result['realized_volatility'] = (
        (changes ** 2).rolling(window=window).sum().apply(np.sqrt)
    )

    # Theoretical maximum volatility at current price level
    result['max_volatility'] = np.sqrt(prices * (1 - prices))

    # Normalized volatility
    result['normalized_volatility'] = (
        result['rolling_volatility'] / result['max_volatility']
    )

    # Squared changes for clustering analysis
    result['squared_change'] = changes ** 2

    return result


def test_volatility_clustering(price_changes, max_lag=10):
    """
    Test for volatility clustering using autocorrelation of squared changes.
    """
    squared = pd.Series(price_changes, dtype=float) ** 2
    squared_clean = squared.dropna()

    results = []
    n = len(squared_clean)

    for lag in range(1, max_lag + 1):
        corr, pval = stats.pearsonr(
            squared_clean.iloc[lag:].values,
            squared_clean.iloc[:-lag].values
        )
        results.append({
            'lag': lag,
            'autocorrelation': corr,
            'p_value': pval,
            'significant': pval < 0.05
        })

    return pd.DataFrame(results)

21.6 Autocorrelation and Serial Dependence

21.6.1 Testing for Autocorrelation in Price Changes

If a prediction market is efficient, price changes should be unpredictable from past price changes. Formally, under the efficient market hypothesis:

$$E[\Delta p_t \mid \Delta p_{t-1}, \Delta p_{t-2}, \ldots] = 0$$

This implies zero autocorrelation at all lags:

$$\rho_k = \text{Corr}(\Delta p_t, \Delta p_{t-k}) = 0 \quad \forall k \geq 1$$

The sample autocorrelation function (ACF) estimates these autocorrelations from data:

$$\hat{\rho}_k = \frac{\sum_{t=k+1}^{n} (\Delta p_t - \overline{\Delta p})(\Delta p_{t-k} - \overline{\Delta p})}{\sum_{t=1}^{n} (\Delta p_t - \overline{\Delta p})^2}$$

Under the null hypothesis of no autocorrelation, $\hat{\rho}_k$ is approximately normally distributed with mean 0 and standard deviation $1/\sqrt{n}$. Values outside the $\pm 2/\sqrt{n}$ band are significant at the 5% level.

21.6.2 The Runs Test

The runs test is a non-parametric test for randomness. A "run" is a maximal sequence of consecutive price changes with the same sign. For example, in the sequence $(+, +, -, +, -, -, -)$, there are 4 runs: $(+, +)$, $(-)$, $(+)$, $(-, -, -)$.

If price changes are independent, the expected number of runs is:

$$E[R] = \frac{2 n_+ n_-}{n_+ + n_-} + 1$$

where $n_+$ is the number of positive changes and $n_-$ is the number of negative changes. The variance is:

$$\text{Var}[R] = \frac{2 n_+ n_- (2 n_+ n_- - n_+ - n_-)}{(n_+ + n_-)^2 (n_+ + n_- - 1)}$$

Under the null hypothesis, the test statistic $Z = (R - E[R]) / \sqrt{\text{Var}[R]}$ is approximately standard normal for large samples.

  • Too few runs (negative Z) suggests positive serial correlation (momentum/trending behavior).
  • Too many runs (positive Z) suggests negative serial correlation (mean-reverting behavior).

21.6.3 The Ljung-Box Test

The Ljung-Box test jointly tests whether a set of autocorrelations are all zero:

$$Q = n(n+2) \sum_{k=1}^{m} \frac{\hat{\rho}_k^2}{n-k}$$

Under the null hypothesis of no serial correlation, $Q$ follows a chi-squared distribution with $m$ degrees of freedom. A significant Q statistic (low p-value) indicates that at least one autocorrelation is significantly different from zero.

This is the most commonly used test for serial dependence in financial time series and is well-suited for prediction market analysis.

21.6.4 What Autocorrelation Implies About Market Efficiency

The interpretation of autocorrelation in prediction markets requires nuance:

Positive autocorrelation (momentum):
  • Prices tend to continue moving in the same direction.
  • Possible causes: gradual information incorporation, trend-following behavior, underreaction to news.
  • Trading implication: momentum strategies may be profitable.

Negative autocorrelation (mean reversion):
  • Prices tend to reverse direction.
  • Possible causes: overreaction followed by correction, bid-ask bounce, market-making activity.
  • Trading implication: contrarian strategies may be profitable.

No autocorrelation:
  • Consistent with an efficient market.
  • Price changes are unpredictable from past changes.
  • Trading implication: simple trend-following or mean-reversion strategies will not work.

Important caveat: autocorrelation analysis examines only linear dependence. Nonlinear dependencies (such as volatility clustering) can exist even when linear autocorrelation is zero.

21.6.5 Python Autocorrelation Analysis

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox

def autocorrelation_analysis(price_changes, max_lag=20):
    """
    Comprehensive autocorrelation analysis for price changes.

    Parameters
    ----------
    price_changes : array-like
        Series of price changes.
    max_lag : int
        Maximum lag to test.

    Returns
    -------
    dict
        Dictionary containing ACF values, runs test, and Ljung-Box results.
    """
    changes = pd.Series(price_changes).dropna()
    n = len(changes)

    # Sample autocorrelation function
    acf_values = []
    for k in range(1, max_lag + 1):
        if k < n:
            corr = changes.autocorr(lag=k)
            acf_values.append({
                'lag': k,
                'acf': corr,
                'significant': abs(corr) > 2 / np.sqrt(n)
            })

    # Runs test
    signs = np.sign(changes[changes != 0])
    n_pos = (signs > 0).sum()
    n_neg = (signs < 0).sum()
    n_total = n_pos + n_neg

    runs = 1 + np.sum(np.diff(signs) != 0)
    expected_runs = (2 * n_pos * n_neg) / n_total + 1

    if n_total > 1:
        var_runs = (
            (2 * n_pos * n_neg * (2 * n_pos * n_neg - n_total))
            / (n_total ** 2 * (n_total - 1))
        )
        z_runs = (runs - expected_runs) / np.sqrt(var_runs) if var_runs > 0 else 0
        p_runs = 2 * (1 - stats.norm.cdf(abs(z_runs)))
    else:
        z_runs = 0
        p_runs = 1.0

    runs_result = {
        'n_runs': runs,
        'expected_runs': expected_runs,
        'z_statistic': z_runs,
        'p_value': p_runs,
        'interpretation': (
            'momentum' if z_runs < -1.96 else
            'mean_reversion' if z_runs > 1.96 else
            'random'
        )
    }

    # Ljung-Box test
    lb_result = acorr_ljungbox(changes, lags=max_lag, return_df=True)

    return {
        'acf': pd.DataFrame(acf_values),
        'runs_test': runs_result,
        'ljung_box': lb_result,
        'n_observations': n
    }

21.7 Distribution Analysis

21.7.1 Price Distribution: Beta Distribution Fit

Prediction market prices are bounded between 0 and 1, making the Beta distribution a natural choice for modeling their distribution. The Beta distribution has probability density function:

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the Beta function.

The parameters $\alpha$ and $\beta$ control the shape:

  • $\alpha = \beta = 1$: Uniform distribution (maximum uncertainty).
  • $\alpha = \beta > 1$: Symmetric, bell-shaped, centered at 0.5.
  • $\alpha > \beta$: Right-skewed, concentrated near 1.
  • $\alpha < \beta$: Left-skewed, concentrated near 0.
  • $\alpha < 1$ and $\beta < 1$: U-shaped (bimodal, concentrated near both 0 and 1).

For a collection of prediction market final prices, the Beta distribution fit reveals the overall "confidence" of the market. A U-shaped distribution (small $\alpha, \beta$) indicates markets that tend to reach strong conclusions, while a bell-shaped distribution indicates markets that remain uncertain.

21.7.2 Bimodal Patterns

Many prediction markets exhibit bimodal price distributions, especially over their full lifetime. A market might trade near 0.30 for several weeks, then jump to 0.70 after new information, creating two distinct modes in the price distribution.

Testing for bimodality can be done with:

Hartigan's dip test: Tests the null hypothesis that the distribution is unimodal. A significant p-value indicates multimodality.

Silverman's test: Based on kernel density estimation with varying bandwidths. The minimum bandwidth for which the estimated density is unimodal provides evidence about the number of modes.

Visual inspection: A histogram or kernel density estimate (KDE) plot often reveals bimodality more clearly than formal tests.
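
Hartigan's dip test requires a third-party package (e.g., diptest), so the sketch below takes the simpler KDE route: estimate a density and count its local maxima. Treat it as a rough diagnostic rather than a formal test; the prominence threshold is an illustrative tuning knob.

import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def count_price_modes(prices, grid_points=200, min_prominence=0.05):
    """Estimate the number of modes in a price distribution via KDE."""
    prices = np.asarray(prices, dtype=float)
    kde = gaussian_kde(prices)

    grid = np.linspace(0, 1, grid_points)
    density = kde(grid)

    # A peak must stand out by a fraction of the maximum density to count.
    peaks, _ = find_peaks(density, prominence=min_prominence * density.max())
    return len(peaks), grid[peaks]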

21.7.3 Extreme Value Concentration

A striking property of prediction market prices is their tendency to concentrate near the extremes (0 and 1). As markets approach resolution, the true outcome becomes increasingly apparent, and prices migrate toward 0 or 1. This means the time-averaged price distribution across all markets tends to have excess mass at the boundaries.

The degree of extreme value concentration can be quantified by:

$$\text{EVC} = P(p < 0.10) + P(p > 0.90)$$

This measures the fraction of time prices spend within 10% of the boundaries.
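
The computation itself is a one-liner; a sketch assuming prices is an array of observed prices:

import numpy as np

def extreme_value_concentration(prices, band=0.10):
    """Fraction of observations within `band` of either boundary."""
    p = np.asarray(prices, dtype=float)
    return np.mean((p < band) | (p > 1 - band))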

21.7.4 Returns Distribution: Fat Tails

While equity returns are often modeled as approximately normal (with known deviations), prediction market price changes exhibit more complex distributional properties:

  • Bounded support: Price changes are bounded by $[-p_t, 1-p_t]$.
  • Fat tails: Large price changes occur more frequently than a normal distribution predicts. This is quantified by excess kurtosis.
  • Asymmetry: Near the price boundaries, the distribution of changes is necessarily asymmetric.

For comparing with a normal distribution, we can compute the proportion of observations beyond $k$ standard deviations:

| Threshold | Normal Distribution | Typical Prediction Market |
|---|---|---|
| $\pm 2\sigma$ | 4.6% | 6-10% |
| $\pm 3\sigma$ | 0.3% | 1-3% |
| $\pm 4\sigma$ | 0.006% | 0.2-1% |

These elevated tail probabilities have important implications for risk management: extreme price movements are more likely than a normal model would suggest.

21.7.5 Python Distribution Fitter

import numpy as np
from scipy import stats
from scipy.optimize import minimize

def fit_beta_distribution(prices):
    """
    Fit a Beta distribution to prediction market prices.

    Parameters
    ----------
    prices : array-like
        Prices bounded between 0 and 1. Values exactly at 0 or 1
        are adjusted slightly inward.

    Returns
    -------
    dict
        Fitted parameters and goodness-of-fit statistics.
    """
    prices = np.array(prices, dtype=float)

    # Adjust boundary values slightly inward for beta fitting
    eps = 1e-6
    prices = np.clip(prices, eps, 1 - eps)

    # Method of moments estimates
    mean_p = np.mean(prices)
    var_p = np.var(prices)

    common = mean_p * (1 - mean_p) / var_p - 1
    alpha_mom = mean_p * common
    beta_mom = (1 - mean_p) * common

    # Maximum likelihood estimation
    alpha_mle, beta_mle, loc, scale = stats.beta.fit(
        prices, floc=0, fscale=1
    )

    # Goodness of fit: KS test
    ks_stat, ks_pvalue = stats.kstest(
        prices, 'beta', args=(alpha_mle, beta_mle)
    )

    return {
        'alpha_mom': alpha_mom,
        'beta_mom': beta_mom,
        'alpha_mle': alpha_mle,
        'beta_mle': beta_mle,
        'ks_statistic': ks_stat,
        'ks_pvalue': ks_pvalue,
        'mean': alpha_mle / (alpha_mle + beta_mle),
        'mode': (alpha_mle - 1) / (alpha_mle + beta_mle - 2)
               if alpha_mle > 1 and beta_mle > 1 else None,
        'is_unimodal': alpha_mle >= 1 and beta_mle >= 1,
        'is_u_shaped': alpha_mle < 1 and beta_mle < 1,
    }


def analyze_price_change_distribution(price_changes):
    """
    Analyze the distribution of price changes, comparing
    to a normal distribution.
    """
    changes = np.array(price_changes, dtype=float)
    changes = changes[~np.isnan(changes)]

    mu = np.mean(changes)
    sigma = np.std(changes, ddof=1)

    # Tail analysis
    tail_analysis = {}
    for k in [2, 3, 4]:
        observed = np.mean(np.abs(changes - mu) > k * sigma)
        expected = 2 * (1 - stats.norm.cdf(k))
        tail_analysis[f'{k}_sigma_observed'] = observed
        tail_analysis[f'{k}_sigma_expected'] = expected
        tail_analysis[f'{k}_sigma_ratio'] = (
            observed / expected if expected > 0 else float('inf')
        )

    # Normality tests
    shapiro_stat, shapiro_p = stats.shapiro(
        changes[:5000] if len(changes) > 5000 else changes
    )
    jb_stat, jb_p = stats.jarque_bera(changes)

    return {
        'mean': mu,
        'std': sigma,
        'skewness': stats.skew(changes),
        'kurtosis': stats.kurtosis(changes),
        'shapiro_stat': shapiro_stat,
        'shapiro_p': shapiro_p,
        'jarque_bera_stat': jb_stat,
        'jarque_bera_p': jb_p,
        'tail_analysis': tail_analysis
    }

21.8 Cross-Market Analysis

21.8.1 Correlations Between Related Markets

Prediction markets often exist in groups where the outcomes are logically related:

  • Complementary markets: "Will candidate A win?" and "Will candidate B win?" in a two-candidate race. These should sum to approximately 1.
  • Conditional markets: "Will policy X pass Congress?" and "Will policy X be signed by the President?" The second is conditional on the first.
  • Thematically related markets: Multiple markets on different aspects of the same event (election outcome, margin of victory, Senate control).

Computing the correlation between price changes in related markets reveals the strength of their relationship:

$$\rho_{AB} = \text{Corr}(\Delta p_t^A, \Delta p_t^B)$$

For complementary markets, we expect $\rho_{AB} \approx -1$. For conditionally related markets, the correlation depends on the nature of the conditioning. Deviations from expected correlations may indicate arbitrage opportunities or market inefficiencies.
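
A sketch of a consistency check for complementary markets, assuming two aligned YES-price Series as hypothetical inputs; the tolerance should reflect fees and spreads on the platform in question.

import pandas as pd

def complementary_market_check(price_a, price_b, tolerance=0.02):
    """Check whether two complementary YES prices sum to roughly 1.

    Persistent deviations beyond fees may indicate mispricing or a
    potential arbitrage relationship."""
    combined = pd.concat([price_a, price_b], axis=1, keys=['a', 'b']).dropna()
    total = combined['a'] + combined['b']
    deviation = total - 1.0

    return pd.DataFrame({
        'sum': total,
        'deviation': deviation,
        'outside_tolerance': deviation.abs() > tolerance,
    })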

21.8.2 Lead-Lag Relationships

A lead-lag relationship exists when price changes in one market systematically precede price changes in another. This can be detected using the cross-correlation function (CCF):

$$\rho_{AB}(k) = \text{Corr}(\Delta p_{t-k}^A, \Delta p_t^B)$$

If $\rho_{AB}(k) > 0$ for $k > 0$, then market A leads market B (changes in A precede changes in B). This might occur because:

  • Market A is more liquid and incorporates information faster.
  • Market A is on a platform with a more sophisticated participant base.
  • Market A covers a more fundamental question, while Market B covers a derivative question.

Lead-lag relationships, if robust and persistent, may represent trading opportunities: observe a move in the leading market and trade in the lagging market before it adjusts.

21.8.3 Contagion Effects

Contagion occurs when a shock in one market spreads to other, seemingly unrelated markets. In prediction markets, this can happen when:

  • A surprise outcome in one domain changes beliefs about the reliability of forecasts in other domains.
  • A large trader exits positions across multiple markets simultaneously.
  • Platform-level events (regulatory concerns, liquidity crises) affect all markets.

Contagion can be detected by examining whether correlations increase during periods of stress. The DCC (Dynamic Conditional Correlation) model or a simpler rolling correlation analysis can capture this phenomenon.
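
A rolling-correlation sketch for spotting contagion-like episodes, assuming two aligned price Series; DCC estimation is left to specialized libraries.

import pandas as pd

def rolling_correlation(price_a, price_b, window=30):
    """Rolling correlation of price changes between two markets.

    A sharp rise in correlation during a stress period is consistent
    with contagion."""
    changes = pd.concat([price_a, price_b], axis=1, keys=['a', 'b']).diff().dropna()
    return changes['a'].rolling(window=window).corr(changes['b'])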

21.8.4 Python Cross-Market Analyzer with Heatmaps

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def compute_cross_market_correlations(price_dict, change=True):
    """
    Compute pairwise correlations between multiple markets.

    Parameters
    ----------
    price_dict : dict
        Dictionary mapping market names to price Series.
        All Series should share the same index (e.g., timestamps).
    change : bool
        If True, compute correlations of price changes.
        If False, compute correlations of price levels.

    Returns
    -------
    pd.DataFrame
        Correlation matrix.
    """
    df = pd.DataFrame(price_dict)

    if change:
        df = df.diff().dropna()

    return df.corr()


def compute_lead_lag(series_a, series_b, max_lag=10):
    """
    Compute the cross-correlation function between two series
    to detect lead-lag relationships.
    """
    changes_a = series_a.diff().dropna()
    changes_b = series_b.diff().dropna()

    # Align the series
    aligned = pd.concat([changes_a, changes_b], axis=1).dropna()
    a = aligned.iloc[:, 0].values
    b = aligned.iloc[:, 1].values

    results = []
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            # Positive lag: past changes in A vs. current changes in B,
            # so market A leads market B by `lag` periods.
            corr = np.corrcoef(a[:-lag], b[lag:])[0, 1]
        elif lag < 0:
            # Negative lag: past changes in B vs. current changes in A,
            # so market B leads market A by `-lag` periods.
            corr = np.corrcoef(a[-lag:], b[:lag])[0, 1]
        else:
            corr = np.corrcoef(a, b)[0, 1]

        results.append({'lag': lag, 'cross_correlation': corr})

    results_df = pd.DataFrame(results)

    # Identify the lag with maximum absolute correlation
    max_idx = results_df['cross_correlation'].abs().idxmax()
    dominant_lag = results_df.loc[max_idx, 'lag']

    return results_df, dominant_lag


def plot_correlation_heatmap(corr_matrix, title="Cross-Market Correlations"):
    """
    Plot a correlation heatmap for multiple markets.
    """
    fig, ax = plt.subplots(figsize=(10, 8))

    mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)

    sns.heatmap(
        corr_matrix,
        mask=mask,
        annot=True,
        fmt='.2f',
        cmap='RdBu_r',
        center=0,
        vmin=-1,
        vmax=1,
        square=True,
        ax=ax
    )

    ax.set_title(title, fontsize=14)
    plt.tight_layout()
    return fig

21.9 Regime Detection

21.9.1 Identifying Market Regimes

A regime is a period during which the statistical properties of a time series remain roughly constant. In prediction markets, common regime types include:

Calm regime: Low volatility, prices fluctuating in a narrow range. Typically occurs when no new information is arriving. Characterized by small price changes and moderate volume.

Volatile regime: High volatility, large price swings. Typically occurs during information-rich periods (debates, data releases, breaking news). Characterized by large price changes and high volume.

Trending regime: Sustained directional movement. Occurs when information consistently favors one outcome. Characterized by positive autocorrelation in price changes.

Mean-reverting regime: Prices oscillate around a stable level. Occurs when the market has reached a consensus that is not being challenged by new information. Characterized by negative autocorrelation.

21.9.2 Hidden Markov Model Basics

The Hidden Markov Model (HMM) is a powerful framework for regime detection. The key idea is that the observed price changes are generated by an underlying, unobserved (hidden) state that switches between a finite number of regimes.

Model specification:

Let $S_t \in \{1, 2, \ldots, K\}$ denote the hidden state (regime) at time $t$. The model consists of:

  1. Transition probabilities: $a_{ij} = P(S_t = j \mid S_{t-1} = i)$, the probability of transitioning from regime $i$ to regime $j$.

  2. Emission distributions: In each regime $k$, price changes follow a distribution with regime-specific parameters. For a Gaussian HMM:

$$\Delta p_t \mid S_t = k \sim N(\mu_k, \sigma_k^2)$$

  3. Initial state distribution: $\pi_k = P(S_1 = k)$.

Estimation:

The parameters $\{a_{ij}, \mu_k, \sigma_k^2, \pi_k\}$ are estimated using the Expectation-Maximization (EM) algorithm, also known as the Baum-Welch algorithm in the HMM context. The algorithm iterates between:

  • E-step: Compute the posterior probability of each regime at each time step, given the current parameters and observed data (using the forward-backward algorithm).
  • M-step: Update the parameters to maximize the expected log-likelihood.

Decoding:

Once the model is estimated, the most likely sequence of regimes can be found using the Viterbi algorithm, which efficiently computes:

$$\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_T = \arg\max_{S_1, \ldots, S_T} P(S_1, \ldots, S_T \mid \Delta p_1, \ldots, \Delta p_T)$$
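
A minimal fitting-and-decoding sketch, assuming the third-party hmmlearn package is available; the full implementation lives in code/example-03-regime-detection.py.

import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party: pip install hmmlearn

def detect_regimes_hmm(price_changes, n_regimes=2, seed=42):
    """Fit a Gaussian HMM to price changes and decode the regime sequence."""
    changes = np.asarray(price_changes, dtype=float)
    X = changes[~np.isnan(changes)].reshape(-1, 1)

    model = GaussianHMM(
        n_components=n_regimes,
        covariance_type='diag',
        n_iter=200,
        random_state=seed,
    )
    model.fit(X)

    states = model.predict(X)  # Viterbi decoding of the most likely regime path
    return states, model.means_.ravel(), np.sqrt(model.covars_).ravel()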

21.9.3 Change-Point Detection

An alternative to HMMs for regime detection is change-point detection, which identifies specific time points where the statistical properties of the series change abruptly.

CUSUM (Cumulative Sum) method:

The CUSUM statistic tracks the cumulative deviation of observations from a reference value:

$$C_t = \max(0, C_{t-1} + \Delta p_t - \mu_0 - k)$$

where $\mu_0$ is the expected price change (typically 0) and $k$ is a slack parameter. A change point is detected when $C_t$ exceeds a threshold $h$.
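
A plain-numpy sketch of the one-sided (upward) CUSUM detector described above; the slack and threshold defaults are illustrative and expressed in price units, and a mirrored statistic handles downward shifts.

import numpy as np

def cusum_change_points(price_changes, mu0=0.0, slack=0.005, threshold=0.05):
    """Flag times where the cumulative upward drift exceeds a threshold."""
    changes = np.asarray(price_changes, dtype=float)
    c = 0.0
    change_points = []

    for t, dp in enumerate(changes):
        if np.isnan(dp):
            continue
        c = max(0.0, c + dp - mu0 - slack)
        if c > threshold:
            change_points.append(t)
            c = 0.0  # reset the statistic after each detection

    return change_points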

PELT (Pruned Exact Linear Time) algorithm:

PELT is an efficient algorithm for finding the optimal set of change points by minimizing a penalized cost function:

$$\sum_{i=0}^{m} C(y_{(\tau_i+1):\tau_{i+1}}) + \beta \cdot m$$

where $C$ is a cost function (e.g., negative log-likelihood), $\tau_0, \tau_1, \ldots, \tau_m$ are the change points, and $\beta$ is a penalty for each additional change point.

Bayesian change-point detection:

Bayesian methods compute the posterior probability of a change point at each time step, providing a probabilistic framework for regime boundaries. The algorithm of Adams and MacKay (2007) provides an online Bayesian approach that is particularly well-suited for streaming prediction market data.

21.9.4 Choosing the Number of Regimes

A key question in regime detection is: how many regimes should we use? Several approaches can guide this choice:

  • Information criteria (AIC, BIC): Fit models with different numbers of regimes and select the one that minimizes BIC (which penalizes model complexity more heavily than AIC); a BIC sketch follows this list.
  • Domain knowledge: In prediction markets, 2-3 regimes are often sufficient (calm/volatile, or calm/trending/volatile).
  • Cross-validation: Split the data into training and validation sets and choose the number of regimes that minimizes prediction error on the validation set.
  • Interpretability: More regimes may fit the data better but be harder to interpret and act upon.
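
A sketch of BIC-based selection using the same hmmlearn setup as the regime-detection sketch above; the parameter count is approximate and assumes diagonal covariances.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def select_n_regimes_bic(price_changes, candidates=(1, 2, 3, 4), seed=42):
    """Fit HMMs with different regime counts and return the BIC for each."""
    X = np.asarray(price_changes, dtype=float)
    X = X[~np.isnan(X)].reshape(-1, 1)
    n = len(X)

    scores = {}
    for k in candidates:
        model = GaussianHMM(n_components=k, covariance_type='diag',
                            n_iter=200, random_state=seed)
        model.fit(X)
        log_likelihood = model.score(X)
        # Parameters: k means + k variances + k*(k-1) transition probs + (k-1) initial probs
        n_params = 2 * k + k * (k - 1) + (k - 1)
        scores[k] = -2 * log_likelihood + n_params * np.log(n)
    return scores  # smaller BIC is better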

21.9.5 Python Regime Detector

See code/example-03-regime-detection.py for a complete implementation using both HMM and change-point detection methods.


21.10 Visualization Best Practices

21.10.1 Effective Charts for Prediction Market Data

Prediction markets require specialized visualizations that account for their unique properties. Here are the most effective chart types:

Probability timeline: The foundational chart. A line chart of price (interpreted as probability) over time, with y-axis fixed from 0 to 1. Horizontal reference lines at 0.25, 0.50, and 0.75 provide context.

Candlestick analogs: While prediction markets do not have traditional OHLC (Open-High-Low-Close) bars, an analogous chart can show the opening price, closing price, high, and low for each period (e.g., each day). This reveals intraday volatility that a simple closing price chart would miss.

Fan charts: Show the range of possible future prices as a "fan" spreading outward from the current price. The fan can be based on historical volatility or on a model's uncertainty estimates. Wider fans indicate greater uncertainty.

Probability ribbons: Show the probability of different outcomes over time as stacked areas. Useful for multi-outcome markets (e.g., "Which candidate will win?" with more than two candidates).

Small multiples: Display many related markets side by side in a grid, each with the same axes. This enables rapid comparison across markets and identification of common patterns.

21.10.2 Color and Design Principles

For prediction market visualizations:

  • Use a consistent color scheme across all charts. A common choice: blue for prices, orange for volume, green for regime 1, red for regime 2.
  • Avoid chart junk: Remove gridlines, unnecessary borders, and 3D effects. Follow Tufte's principles of data-ink ratio.
  • Label key events: Major price movements should be annotated with the corresponding real-world event.
  • Use appropriate scales: Linear for prices (which are bounded 0-1), log for volumes (which can span orders of magnitude).

21.10.3 Interactive Plots with Plotly

Interactive charts are invaluable for EDA because they allow zooming, panning, and hovering to inspect individual data points. Plotly is the standard Python library for interactive visualization:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

def create_interactive_market_chart(timestamps, prices, volumes,
                                     title="Market Overview"):
    """
    Create an interactive chart with price and volume subplots.
    """
    fig = make_subplots(
        rows=2, cols=1,
        shared_xaxes=True,
        vertical_spacing=0.03,
        row_heights=[0.7, 0.3],
        subplot_titles=('Price (Probability)', 'Volume')
    )

    # Price line
    fig.add_trace(
        go.Scatter(
            x=timestamps,
            y=prices,
            mode='lines',
            name='Price',
            line=dict(color='#1f77b4', width=2),
            hovertemplate='%{x}<br>Price: %{y:.3f}<extra></extra>'
        ),
        row=1, col=1
    )

    # Reference lines
    for level in [0.25, 0.50, 0.75]:
        fig.add_hline(
            y=level, line_dash="dash",
            line_color="gray", opacity=0.5,
            row=1, col=1
        )

    # Volume bars
    fig.add_trace(
        go.Bar(
            x=timestamps,
            y=volumes,
            name='Volume',
            marker_color='#ff7f0e',
            opacity=0.7,
            hovertemplate='%{x}<br>Volume: %{y:,.0f}<extra></extra>'
        ),
        row=2, col=1
    )

    fig.update_yaxes(range=[0, 1], title_text="Probability", row=1, col=1)
    fig.update_yaxes(title_text="Volume", row=2, col=1)
    fig.update_xaxes(title_text="Date", row=2, col=1)

    fig.update_layout(
        title=title,
        height=600,
        showlegend=False,
        template='plotly_white'
    )

    return fig

21.10.4 A Standard EDA Chart Set

A well-organized EDA should include a standard set of visualizations:

  1. Price timeline with volume subplot and event annotations.
  2. Price change histogram with normal distribution overlay.
  3. Autocorrelation function (ACF) plot for price changes.
  4. Rolling volatility plot showing volatility evolution over time.
  5. Volume profile showing cumulative and daily volume.
  6. Distribution fit plot comparing observed price distribution to Beta fit.
  7. Cross-market correlation heatmap for groups of related markets.
  8. Regime plot showing price colored by detected regime.
  9. QQ plot comparing price change quantiles to normal distribution.
  10. Seasonality plot showing hour-of-day and day-of-week patterns.

21.11 Outlier Detection and Data Cleaning

21.11.1 Identifying Anomalous Prices

Not all extreme prices are errors---many represent genuine market reactions to surprising information. The challenge is distinguishing data errors from real events. Common types of anomalous prices include:

Stale prices: A market that has not traded in hours or days may show a price that no longer reflects current information. These are not errors but should be handled carefully in time-series analysis.

Fat-finger errors: A trader accidentally enters an order at the wrong price (e.g., buying at 0.90 instead of 0.09). These often appear as extreme price spikes that immediately reverse.

Liquidity-driven spikes: In thin markets, a moderately large order can move the price substantially, then it reverts as other participants provide liquidity.

Genuine jumps: Real information events (election results, court rulings, scientific discoveries) can cause large, permanent price changes. These should not be treated as outliers.

21.11.2 Statistical Outlier Detection Methods

Z-score method: Flag any observation where the price change exceeds $k$ standard deviations from the mean:

$$\text{outlier if } |\Delta p_t - \mu| > k\sigma$$

This method is sensitive to the presence of outliers in the data used to compute $\mu$ and $\sigma$. Using the median and MAD (median absolute deviation) instead provides a more robust alternative:

$$\text{outlier if } |\Delta p_t - \tilde{m}| > k \cdot 1.4826 \cdot \text{MAD}$$

where $\tilde{m}$ is the median and MAD = $\text{median}(|\Delta p_t - \tilde{m}|)$. The factor 1.4826 makes the MAD consistent with the standard deviation for normally distributed data.

Isolation Forest: A machine learning approach that identifies anomalies as points that are easily isolated by random partitioning. Outliers tend to require fewer partitions to isolate.

Local Outlier Factor (LOF): Compares the local density of a point to the local densities of its neighbors. Points with significantly lower density are flagged as outliers.

21.11.3 Data Errors vs. Real Events

To distinguish genuine events from data errors, apply these heuristics:

  1. Persistence: Genuine information shocks cause permanent price changes. Errors typically revert quickly (within minutes or hours).
  2. Volume: Genuine events are usually accompanied by high volume. Errors often involve a single trade.
  3. Cross-market confirmation: If the price jump is reflected in related markets, it is likely genuine.
  4. External corroboration: Check whether a real-world event occurred at the time of the price change.
  5. Price level: An error that would result in a price outside the [0, 1] range is obviously invalid. Less obvious errors might produce prices far from the prevailing level but still within bounds.

21.11.4 Robust Statistics

When outliers are present, standard statistics (mean, standard deviation) can be heavily influenced. Robust alternatives include:

| Standard Statistic | Robust Alternative | Description |
|---|---|---|
| Mean | Median | Middle value; unaffected by extreme values |
| Mean | Trimmed mean | Mean after removing the top and bottom $k$% |
| Standard deviation | MAD | Median absolute deviation from the median |
| Standard deviation | IQR | Interquartile range (75th - 25th percentile) |
| Correlation | Spearman rank correlation | Based on ranks, not values |

For prediction market EDA, it is good practice to compute both standard and robust statistics and compare them. Large discrepancies indicate the presence of influential outliers.

21.11.5 Python Outlier Detector

import numpy as np
import pandas as pd
from scipy import stats

def detect_outliers(price_changes, method='mad', threshold=3.0):
    """
    Detect outliers in price changes using various methods.

    Parameters
    ----------
    price_changes : array-like
        Series of price changes.
    method : str
        Detection method: 'zscore', 'mad', or 'iqr'.
    threshold : float
        Number of standard deviations (or MAD/IQR multiples)
        for the outlier threshold.

    Returns
    -------
    pd.DataFrame
        DataFrame with original values and outlier flags.
    """
    changes = pd.Series(price_changes).dropna()

    result = pd.DataFrame({'price_change': changes})

    if method == 'zscore':
        mu = changes.mean()
        sigma = changes.std()
        result['score'] = (changes - mu).abs() / sigma
        result['is_outlier'] = result['score'] > threshold

    elif method == 'mad':
        median = changes.median()
        mad = (changes - median).abs().median()
        mad_scaled = 1.4826 * mad  # consistency factor
        result['score'] = (changes - median).abs() / mad_scaled
        result['is_outlier'] = result['score'] > threshold

    elif method == 'iqr':
        q1 = changes.quantile(0.25)
        q3 = changes.quantile(0.75)
        iqr = q3 - q1
        lower = q1 - threshold * iqr
        upper = q3 + threshold * iqr
        result['is_outlier'] = (changes < lower) | (changes > upper)
        result['score'] = np.maximum(
            (q1 - changes) / iqr,
            (changes - q3) / iqr
        ).clip(lower=0)

    result['is_persistent'] = False
    if len(changes) > 1:
        # Check if the price change persisted (did not revert immediately)
        for idx in result[result['is_outlier']].index:
            pos = changes.index.get_loc(idx)
            if pos < len(changes) - 1:
                next_change = changes.iloc[pos + 1]
                # If the next change does not reverse >50% of the outlier,
                # consider it persistent (likely a real event)
                if abs(next_change + changes.iloc[pos]) > 0.5 * abs(changes.iloc[pos]):
                    result.loc[idx, 'is_persistent'] = True

    return result


def clean_prices(prices, max_change=0.30, revert_window=3,
                 revert_threshold=0.7):
    """
    Clean price series by identifying and optionally removing
    likely data errors (large changes that quickly revert).

    Parameters
    ----------
    prices : array-like
        Price series.
    max_change : float
        Maximum allowed single-period price change.
    revert_window : int
        Number of periods to check for reversion.
    revert_threshold : float
        Fraction of the initial change that must revert
        to flag as error.

    Returns
    -------
    pd.Series
        Cleaned price series with suspected errors replaced by NaN.
    """
    prices = pd.Series(prices, dtype=float)
    cleaned = prices.copy()
    changes = prices.diff()

    large_changes = changes.abs() > max_change

    for idx in changes[large_changes].index:
        pos = prices.index.get_loc(idx)
        initial_change = changes.iloc[pos]

        # Check if the change reverts within the window
        for future in range(1, min(revert_window + 1, len(prices) - pos)):
            # Amount of the initial change that has been undone so far.
            reverted_amount = prices.iloc[pos] - prices.iloc[pos + future]
            revert_ratio = reverted_amount / initial_change

            if revert_ratio > revert_threshold:
                # Likely a data error; mark as NaN
                cleaned.iloc[pos] = np.nan
                break

    # Optionally interpolate NaN values
    cleaned = cleaned.interpolate(method='linear')

    return cleaned

21.12 Building an EDA Template

21.12.1 Reusable EDA Notebook Template

After covering individual EDA techniques, we now bring them together into a reusable template. A well-designed EDA template ensures consistency across analyses and prevents you from forgetting important checks.

The template should follow this structure (a minimal driver sketch for the non-graphical steps appears after the list):

1. Data Loading and Inspection
   - Load the dataset
   - Check dimensions, dtypes, missing values
   - Display first and last few rows

2. Summary Statistics
   - Compute and display descriptive statistics
   - Identify obvious anomalies in the summary

3. Price Analysis
   - Plot price time series
   - Compute and plot moving averages
   - Analyze price change distribution
   - Test for trends (linear regression)

4. Volume Analysis
   - Plot volume over time
   - Detect volume spikes
   - Analyze volume-price relationship

5. Volatility Analysis
   - Compute and plot rolling volatility
   - Test for volatility clustering
   - Fit GARCH model (if appropriate)

6. Serial Dependence
   - Plot ACF and PACF
   - Runs test
   - Ljung-Box test

7. Distribution Analysis
   - Fit Beta distribution to prices
   - Analyze price change distribution
   - QQ plots

8. Regime Detection
   - Apply change-point detection
   - Fit HMM (if appropriate)
   - Characterize each regime

9. Cross-Market Analysis (if applicable)
   - Correlation heatmap
   - Lead-lag analysis

10. Outlier Detection
    - Apply outlier detection methods
    - Distinguish errors from events

11. Key Findings and Hypotheses
    - Summarize main patterns
    - List hypotheses for further testing
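
One way to operationalize the non-graphical steps is a small driver that runs each check and records failures without aborting the whole analysis. The sketch below is illustrative rather than a fixed API: the name run_eda_template and the step registry are assumptions, and it relies on the helper functions defined earlier in the chapter (compute_market_summary, compute_volatility_metrics, autocorrelation_analysis, detect_outliers) being in scope.

import pandas as pd

def run_eda_template(prices, volumes, timestamps, outcome=None):
    """Run the non-graphical template steps, collecting results and failures."""
    changes = pd.Series(prices).diff().dropna()
    steps = [
        ('data_summary',
         lambda: compute_market_summary(prices, volumes, timestamps, outcome)),
        ('volatility',
         lambda: compute_volatility_metrics(prices)),
        ('serial_dependence',
         lambda: autocorrelation_analysis(changes.values)),
        ('outliers',
         lambda: detect_outliers(changes)),
    ]

    results, failures = {}, {}
    for name, step in steps:
        try:
            results[name] = step()
        except Exception as exc:
            # Record the failure and keep going, so one bad market
            # does not halt a batch run
            failures[name] = repr(exc)
    return results, failures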

21.12.2 Automated Report Generation

For large-scale analysis covering hundreds or thousands of markets, manual EDA is infeasible. An automated report generator can produce a standardized EDA report for each market:

import json
from datetime import datetime

def generate_eda_report(prices, volumes, timestamps,
                         market_name="Unknown Market",
                         outcome=None):
    """
    Generate a comprehensive EDA report for a single market.

    Parameters
    ----------
    prices : array-like
        Time series of prices.
    volumes : array-like
        Corresponding volumes.
    timestamps : array-like
        Corresponding timestamps.
    market_name : str
        Name of the market.
    outcome : int or None
        Resolution outcome.

    Returns
    -------
    dict
        Comprehensive EDA report as a dictionary.
    """
    report = {
        'market_name': market_name,
        'report_generated': datetime.now().isoformat(),
        'data_summary': {},
        'price_analysis': {},
        'volume_analysis': {},
        'volatility_analysis': {},
        'serial_dependence': {},
        'distribution_analysis': {},
        'outlier_analysis': {},
    }

    # Data summary
    report['data_summary'] = compute_market_summary(
        prices, volumes, timestamps, outcome
    )

    # Price analysis
    prices_series = pd.Series(prices, index=pd.to_datetime(timestamps))
    changes = prices_series.diff().dropna()

    report['price_analysis'] = {
        'trend_slope': float(np.polyfit(
            range(len(prices)), prices, 1
        )[0]),  # cast so the report stays JSON-serializable
        'mean_change': float(changes.mean()),
        'std_change': float(changes.std()),
        'max_jump_up': float(changes.max()),
        'max_jump_down': float(changes.min()),
        'num_large_moves': int((changes.abs() > 0.05).sum()),
    }

    # Volume analysis
    volumes_series = pd.Series(volumes, index=pd.to_datetime(timestamps))
    vol_mean = volumes_series.mean()
    vol_std = volumes_series.std()

    report['volume_analysis'] = {
        'total_volume': float(volumes_series.sum()),
        'mean_daily_volume': float(vol_mean),
        'volume_spikes': int(
            (volumes_series > vol_mean + 2 * vol_std).sum()
        ),
        'volume_concentration_top10pct': float(
            volumes_series.nlargest(
                max(1, len(volumes_series) // 10)
            ).sum() / volumes_series.sum()
        ) if volumes_series.sum() > 0 else 0,
    }

    # Volatility
    vol_metrics = compute_volatility_metrics(prices)
    report['volatility_analysis'] = {
        'mean_rolling_vol': float(
            vol_metrics['rolling_volatility'].mean()
        ),
        'max_rolling_vol': float(
            vol_metrics['rolling_volatility'].max()
        ),
    }

    # Serial dependence
    ac_result = autocorrelation_analysis(changes.values)
    report['serial_dependence'] = {
        'runs_test': ac_result['runs_test'],
        'acf_lag1': float(
            ac_result['acf']['acf'].iloc[0]
        ) if len(ac_result['acf']) > 0 else None,
        'any_significant_acf': bool(
            ac_result['acf']['significant'].any()
        ) if len(ac_result['acf']) > 0 else False,
    }

    # Distribution analysis
    if len(prices) > 10:
        beta_fit = fit_beta_distribution(prices)
        report['distribution_analysis'] = {
            'alpha': beta_fit['alpha_mle'],
            'beta': beta_fit['beta_mle'],
            'is_unimodal': beta_fit['is_unimodal'],
            'ks_pvalue': beta_fit['ks_pvalue'],
        }

    # Outlier analysis
    outlier_result = detect_outliers(changes.values)
    report['outlier_analysis'] = {
        'num_outliers': int(outlier_result['is_outlier'].sum()),
        'num_persistent': int(outlier_result['is_persistent'].sum()),
        'outlier_fraction': float(outlier_result['is_outlier'].mean()),
    }

    return report
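
Because the report is a plain dictionary of (mostly) native Python scalars, it can be persisted with the json module imported above. The snippet below is a minimal sketch: prices, volumes, and timestamps are assumed to have been loaded elsewhere, and the market name and output file name are illustrative.

# Serialize one market's report to disk; default=str is a fallback for any
# values json cannot handle directly (e.g., stray NumPy scalars).
report = generate_eda_report(prices, volumes, timestamps,
                             market_name="example-market")
with open("eda_report_example_market.json", "w") as f:
    json.dump(report, f, indent=2, default=str)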

21.13 Chapter Summary

This chapter has provided a comprehensive toolkit for exploratory data analysis of prediction market data. The key themes and techniques are:

The EDA mindset is about asking questions, generating hypotheses, and understanding your data before modeling. In prediction markets, the bounded nature of prices, finite market lifetimes, and event-driven dynamics require specialized analytical approaches.

Summary statistics provide the first layer of understanding. For prediction markets, the standard descriptive statistics (mean, median, standard deviation, skewness, kurtosis) should be supplemented with market-specific metrics like VWAP, price range, and resolution-based statistics.

Price time-series analysis reveals trends, regimes, jumps, and temporal patterns. Moving averages help identify trends, while price change distributions reveal the statistical properties of market dynamics.

Volume analysis uncovers the market's liquidity structure and lifecycle patterns. Volume spikes often coincide with information events, and the volume-price relationship reveals the market's ability to incorporate information.

Volatility analysis captures the time-varying nature of market uncertainty. Prediction market volatility is mechanically linked to price level through the $\sqrt{p(1-p)}$ relationship. GARCH models can capture volatility clustering.

Autocorrelation analysis tests whether price changes are predictable from their own past, providing direct evidence about market efficiency. The runs test, ACF, and Ljung-Box test offer complementary perspectives.

Distribution analysis characterizes the statistical properties of prices and price changes. The Beta distribution is a natural choice for modeling the distribution of bounded prediction market prices.

Cross-market analysis reveals relationships between related markets, including correlations, lead-lag relationships, and contagion effects.

Regime detection identifies periods of qualitatively different market behavior using Hidden Markov Models and change-point detection methods.

Visualization is central to EDA, and prediction markets require specialized chart types including probability timelines, fan charts, and correlation heatmaps.

Outlier detection distinguishes genuine information shocks from data errors, using both statistical methods and domain-specific heuristics.

The EDA template brings all these techniques together into a reusable workflow for systematic analysis.


What's Next

With a thorough understanding of how to explore and characterize prediction market data, we are ready to move from description to prediction. In Chapter 22: Feature Engineering for Prediction Markets, we will build on the patterns and insights discovered through EDA to construct informative features for machine learning models. Many of the statistics computed in this chapter---rolling volatility, autocorrelation, volume profiles, regime indicators---will become features in predictive models.

The journey from EDA to modeling is not a one-way street. As you build and evaluate models, you will return to EDA to understand errors, discover new patterns, and refine your features. Treat this chapter's techniques as a toolkit you will use throughout your prediction market analysis career.