Chapter 21: Exercises
Exploratory Data Analysis of Market Data
Exercise 1: Basic Summary Statistics
Given the following price observations from a binary prediction market: [0.45, 0.48, 0.52, 0.50, 0.55, 0.60, 0.58, 0.62, 0.65, 0.70], compute by hand (then verify with Python):
(a) The mean, median, and standard deviation.
(b) The skewness and excess kurtosis.
(c) The price range and interquartile range.
(d) If the market resolved Yes (outcome = 1), what is the Brier score based on the final price?
Exercise 2: VWAP Calculation
A prediction market records the following trades:
| Trade | Price | Volume |
|---|---|---|
| 1 | 0.40 | 100 |
| 2 | 0.42 | 200 |
| 3 | 0.38 | 50 |
| 4 | 0.45 | 300 |
| 5 | 0.43 | 150 |
(a) Compute the Volume-Weighted Average Price (VWAP). (b) How does it compare to the simple mean price? (c) Why might VWAP be a better summary of the market's consensus probability than the simple mean?
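As a starting point for (a), VWAP is simply a volume-weighted mean. A minimal sketch, using hypothetical trades rather than the table above:

```python
import numpy as np

def vwap(prices, volumes):
    """Volume-weighted average price: sum(p_i * v_i) / sum(v_i)."""
    p = np.asarray(prices, dtype=float)
    v = np.asarray(volumes, dtype=float)
    return float(np.sum(p * v) / np.sum(v))

# Hypothetical trades, not the table above: the large trade dominates.
print(vwap([0.50, 0.60], [100, 300]))  # (50 + 180) / 400 = 0.575
print(np.mean([0.50, 0.60]))           # the simple mean ignores volume: 0.55
```

A quick sanity check: with equal volumes, VWAP and the simple mean coincide.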
Exercise 3: Price Change Distribution Analysis
Generate 500 simulated price changes from a prediction market using:
np.random.seed(42)
changes = np.random.standard_t(df=3, size=500) * 0.02
(a) Plot a histogram of the price changes with a normal distribution overlay. (b) Compute the skewness and excess kurtosis. (c) What fraction of observations fall beyond 2, 3, and 4 standard deviations? Compare to the normal distribution expectations. (d) Perform the Jarque-Bera test for normality. What do you conclude?
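For part (c), one way to tabulate observed tail fractions against the normal benchmark (a sketch, assuming SciPy is available):

```python
import numpy as np
from scipy import stats

np.random.seed(42)
changes = np.random.standard_t(df=3, size=500) * 0.02

# Standardize, then compare empirical tail mass to the normal expectation.
z = (changes - changes.mean()) / changes.std()
for k in (2, 3, 4):
    observed = np.mean(np.abs(z) > k)
    expected = 2 * stats.norm.sf(k)  # two-sided normal tail probability
    print(f"|z| > {k}: observed {observed:.4f}, normal expects {expected:.4f}")
```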
Exercise 4: Moving Average Crossover
Using the synthetic price series:
np.random.seed(0)
prices = 0.5 + np.cumsum(np.random.randn(200) * 0.01)
prices = np.clip(prices, 0.01, 0.99)
(a) Compute and plot the 10-period SMA and 30-period SMA. (b) Identify all crossover points where the short MA crosses the long MA. (c) For each crossover, determine whether it was a bullish (short crosses above long) or bearish signal. (d) If you traded based on these crossovers (buy on bullish, sell on bearish), what would your return be?
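For parts (a)-(c), a sketch of one way to compute the SMAs and locate sign changes; the crossover bookkeeping here is one simplified convention, not the only one:

```python
import numpy as np

def sma(x, window):
    """Simple moving average; entries before the window fills are NaN."""
    out = np.full(len(x), np.nan)
    c = np.cumsum(np.insert(x, 0, 0.0))
    out[window - 1:] = (c[window:] - c[:-window]) / window
    return out

def crossovers(short_ma, long_ma):
    """(index, direction) pairs where the short MA crosses the long MA."""
    s = np.sign(short_ma - long_ma)
    events, prev = [], 0.0
    for i, v in enumerate(s):
        if np.isnan(v) or v == 0:
            continue
        if prev != 0 and v != prev:
            events.append((i, 'bullish' if v > 0 else 'bearish'))
        prev = v
    return events
```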
Exercise 5: Time-of-Day Patterns
Simulate hourly prediction market data where volume follows a daily pattern:
hours = np.tile(np.arange(24), 30) # 30 days
base_volume = 100 + 80 * np.sin(np.pi * (hours - 6) / 12) * (hours >= 8) * (hours <= 22)
volume = base_volume + np.random.exponential(20, len(hours))
(a) Plot the average volume by hour of day. (b) Identify the peak trading hours. (c) Is there a statistically significant difference in volume between peak and off-peak hours? Use an appropriate statistical test.
Exercise 6: Volume Spike Detection
Given a daily volume series for a prediction market spanning 100 days:
np.random.seed(1)
volumes = np.random.lognormal(mean=5, sigma=1, size=100)
volumes[30] *= 10 # Information event
volumes[60] *= 8 # Another event
volumes[61] *= 6 # Continued elevated volume
(a) Detect volume spikes using the z-score method with threshold k=2. (b) Detect volume spikes using the rolling window method (window=10, threshold=3). (c) Compare the two methods. Which detects the injected events? Are there false positives? (d) For each detected spike, check whether the corresponding price change was also unusually large.
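A minimal z-score detector for part (a); this uses global z-scores, and log-transforming lognormal-ish volume first is a reasonable refinement:

```python
import numpy as np

def zscore_spikes(volumes, k=2.0):
    """Indices where volume exceeds the sample mean by more than k std devs."""
    v = np.asarray(volumes, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.where(z > k)[0]
```

The rolling-window variant in (b) replaces the global mean and standard deviation with statistics over a trailing window.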
Exercise 7: Volume-Price Relationship
Create a scatter plot of |price change| vs. volume for a simulated dataset. Generate data where:
- Baseline: small price changes with moderate volume.
- Information events (10% of observations): large price changes with high volume.
- Noise trades (5% of observations): small price changes with high volume.
(a) Create the scatter plot and visually identify the three groups. (b) Compute the Spearman rank correlation between |price change| and volume. (c) Compute the price impact metric (|price change| / volume) for each observation. What is its distribution?
Exercise 8: Rolling Volatility
Using daily prices from a 200-day prediction market simulation where volatility increases in the second half:
np.random.seed(2)
changes1 = np.random.randn(100) * 0.01 # Low vol regime
changes2 = np.random.randn(100) * 0.03 # High vol regime
changes = np.concatenate([changes1, changes2])
prices = 0.5 + np.cumsum(changes)
prices = np.clip(prices, 0.01, 0.99)
(a) Compute 20-day rolling volatility. (b) Plot the rolling volatility and visually identify the regime change. (c) Compute the normalized volatility (dividing by sqrt(p*(1-p))) at each point. (d) Does normalization change the apparent timing of the regime change?
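For part (a), a plain-NumPy rolling standard deviation (a sketch; a pandas `rolling(20).std()` call would do the same job):

```python
import numpy as np

def rolling_vol(changes, window=20):
    """Rolling sample standard deviation; NaN until the window fills."""
    x = np.asarray(changes, dtype=float)
    out = np.full(len(x), np.nan)
    for i in range(window - 1, len(x)):
        out[i] = x[i - window + 1:i + 1].std(ddof=1)
    return out
```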
Exercise 9: GARCH(1,1) Estimation
Fit a GARCH(1,1) model to the price changes from Exercise 8: (a) Estimate the parameters omega, alpha, and beta. (b) Plot the conditional volatility estimated by the GARCH model alongside the rolling volatility. (c) Compute the sum alpha + beta. What does this tell you about volatility persistence? (d) Use the fitted model to forecast volatility for the next 10 periods.
Exercise 10: Volatility Clustering Test
Using the same data from Exercise 8: (a) Compute the autocorrelation of squared price changes at lags 1 through 10. (b) Plot these autocorrelations with significance bands. (c) Is there evidence of volatility clustering? (d) Perform the Ljung-Box test on squared price changes. Report the test statistic and p-value for m=10 lags.
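For part (d), the Ljung-Box statistic can be computed directly from sample autocorrelations; a self-contained sketch (statsmodels' `acorr_ljungbox` is the usual library route):

```python
import numpy as np
from scipy import stats

def ljung_box(x, m=10):
    """Ljung-Box Q statistic over lags 1..m with its chi-squared p-value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc * xc)
    rho = np.array([np.sum(xc[:n - k] * xc[k:]) / denom for k in range(1, m + 1)])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, m + 1)))
    return q, float(stats.chi2.sf(q, df=m))
```

For the volatility-clustering question, apply it to the squared price changes rather than the changes themselves.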
Exercise 11: ACF Analysis
Given 500 price changes from a market with mild positive autocorrelation:
np.random.seed(3)
n = 500
eps = np.random.randn(n) * 0.02
changes = np.zeros(n)
changes[0] = eps[0]
for i in range(1, n):
    changes[i] = 0.15 * changes[i-1] + eps[i]
(a) Plot the sample ACF for lags 1 through 20. (b) Which lags are statistically significant? (c) The true AR(1) coefficient is 0.15. How close is the sample ACF at lag 1? (d) What does positive autocorrelation at lag 1 imply about trading opportunities in this market?
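A sample-ACF helper for part (a) (a sketch; `statsmodels.tsa.stattools.acf` offers the same with confidence intervals built in):

```python
import numpy as np

def sample_acf(x, max_lag=20):
    """Sample autocorrelations for lags 0..max_lag (lag 0 is always 1)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc * xc)
    n = len(xc)
    return np.array([np.sum(xc[:n - k] * xc[k:]) / denom
                     for k in range(max_lag + 1)])
```

The approximate ±1.96/√n band gives the significance check needed in (b).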
Exercise 12: Runs Test
For the following sequence of price changes (signs only):
+, +, +, -, -, +, -, +, +, +, +, -, -, -, +, -, +, +, -, -
(a) Count the number of runs. (b) Compute the expected number of runs under the null hypothesis of randomness. (c) Compute the test statistic and p-value. (d) What do you conclude about the serial dependence of this sequence?
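A sketch of the Wald-Wolfowitz runs test to check hand computations against (this uses the normal approximation, which is only rough for a sequence this short):

```python
import numpy as np
from scipy import stats

def runs_test(signs):
    """Runs count, its expectation under randomness, z statistic, and p-value."""
    s = list(signs)
    n1, n2 = s.count('+'), s.count('-')
    n = n1 + n2
    runs = 1 + sum(a != b for a, b in zip(s[:-1], s[1:]))
    mu = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mu) / np.sqrt(var)
    return runs, mu, z, 2 * float(stats.norm.sf(abs(z)))
```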
Exercise 13: Ljung-Box Test
Apply the Ljung-Box test to the price changes from Exercise 11: (a) Compute the test statistic Q for m=5, m=10, and m=20 lags. (b) Report the corresponding p-values. (c) At the 5% significance level, do you reject the null hypothesis of no serial correlation? (d) How does this result relate to the known AR(1) structure of the data?
Exercise 14: Beta Distribution Fitting
Consider the following final prices from 200 resolved prediction markets:
np.random.seed(4)
final_prices = np.concatenate([
    np.random.beta(0.5, 0.5, 100),  # U-shaped
    np.random.beta(2, 5, 100)       # Right-skewed (mass near 0.3, tail toward 1)
])
(a) Fit a Beta distribution to the entire sample using maximum likelihood. (b) Plot the histogram with the fitted Beta density overlay. (c) Does the single Beta distribution fit the data well? (Use the KS test.) (d) What would you conclude about the mixture of these two populations?
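For part (a), SciPy's MLE fit works once the support is pinned to [0, 1]. A sketch on a single-population sample, so as not to give away part (d):

```python
import numpy as np
from scipy import stats

np.random.seed(4)
data = np.random.beta(2, 5, 1000)  # illustrative single-population sample

# floc/fscale fix the support to [0, 1]; only the two shapes are estimated.
a_hat, b_hat, loc, scale = stats.beta.fit(data, floc=0, fscale=1)
ks_stat, p_value = stats.kstest(data, 'beta', args=(a_hat, b_hat))
print(a_hat, b_hat, ks_stat, p_value)
```

Note that using fitted parameters in the KS test makes the standard p-value conservative; treat it as a rough goodness-of-fit check.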
Exercise 15: Bimodality Detection
Generate a bimodal price distribution:
np.random.seed(5)
prices = np.concatenate([
    np.random.normal(0.3, 0.05, 150),
    np.random.normal(0.7, 0.05, 150)
])
prices = np.clip(prices, 0.01, 0.99)
(a) Plot the histogram and KDE.
(b) Use Hartigan's dip test (from the diptest library or implement a simple version) to test for unimodality.
(c) Fit a two-component Gaussian mixture model. What are the estimated means and standard deviations?
(d) In what real-world prediction market scenario might you see this bimodal pattern?
Exercise 16: Fat Tail Analysis
Compare the tail behavior of prediction market price changes to a normal distribution: (a) Generate 10,000 price changes from a t-distribution with 4 degrees of freedom (scaled to have the same standard deviation as a standard normal). (b) Compute the fraction of observations beyond 2, 3, 4, and 5 standard deviations for both the t-distribution sample and the corresponding normal distribution expectations. (c) Create a QQ plot comparing the sample quantiles to the normal distribution. (d) What are the practical implications of fat tails for risk management in prediction market trading?
Exercise 17: Cross-Market Correlation
Simulate three related prediction markets:
np.random.seed(6)
common_factor = np.random.randn(200) * 0.01
market_a = 0.5 + np.cumsum(common_factor + np.random.randn(200) * 0.005)
market_b = 0.5 + np.cumsum(common_factor + np.random.randn(200) * 0.008)
market_c = 0.5 + np.cumsum(np.random.randn(200) * 0.01) # Independent
(a) Compute the correlation matrix of price changes across the three markets. (b) Plot the correlation heatmap. (c) Which markets are related and which is independent? Does the analysis correctly identify this? (d) How would you test whether the observed correlations are statistically significant?
Exercise 18: Lead-Lag Detection
Simulate two markets where market A leads market B by 3 periods:
np.random.seed(7)
signal = np.random.randn(200) * 0.02
market_a = 0.5 + np.cumsum(signal + np.random.randn(200) * 0.005)
market_b = 0.5 + np.cumsum(np.roll(signal, 3) + np.random.randn(200) * 0.005)
(a) Compute the cross-correlation function for lags -10 to +10. (b) Plot the cross-correlation function. (c) At which lag is the cross-correlation maximized? Does this match the true lag of 3? (d) Discuss why the estimated lag might differ from the true lag.
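A sketch of the lag scan for parts (a) and (c), using a toy pair where the true lag is known (the `np.roll` wrap-around at the edges is the same one present in the exercise's own setup):

```python
import numpy as np

def xcorr_at_lag(a, b, lag):
    """Correlation of a[t] with b[t + lag]; positive lag means a leads b."""
    if lag < 0:
        return xcorr_at_lag(b, a, -lag)
    n = len(a)
    x = a[:n - lag] if lag > 0 else a
    return float(np.corrcoef(x, b[lag:])[0, 1])

rng = np.random.default_rng(7)
x = rng.standard_normal(200)
y = np.roll(x, 3)  # y trails x by exactly 3 periods
best = max(range(-10, 11), key=lambda L: xcorr_at_lag(x, y, L))
print(best)  # the scan recovers the true lag of 3
```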
Exercise 19: HMM for Regime Detection
Generate a two-regime price series:
np.random.seed(8)
n = 300
regimes = np.zeros(n, dtype=int)
regimes[100:200] = 1 # Regime 1 from t=100 to t=200
changes = np.where(regimes == 0,
                   np.random.randn(n) * 0.01,
                   np.random.randn(n) * 0.04 + 0.005)
prices = 0.5 + np.cumsum(changes)
prices = np.clip(prices, 0.01, 0.99)
(a) Fit a 2-state Gaussian HMM to the price changes using the hmmlearn library.
(b) Use the Viterbi algorithm to decode the most likely regime sequence.
(c) Plot the prices colored by the decoded regime.
(d) Compare the decoded regimes to the true regimes. What is the accuracy?
Exercise 20: Change-Point Detection
Using the same data from Exercise 19:
(a) Implement a simple CUSUM detector to find change points in the mean of price changes.
(b) Apply the ruptures library's PELT algorithm with the "rbf" cost function to detect change points.
(c) Compare the detected change points from both methods to the true change points at t=100 and t=200.
(d) How sensitive are the results to the choice of penalty parameter in PELT?
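For part (a), a one-sided CUSUM sketch on a toy mean-shift series; the slack k and threshold h here are illustrative choices, not recommendations:

```python
import numpy as np

def cusum_alarm(x, mu0=0.0, k=0.5, h=3.0):
    """First index where S_t = max(0, S_{t-1} + x_t - mu0 - k) exceeds h."""
    s = 0.0
    for t, v in enumerate(x):
        s = max(0.0, s + v - mu0 - k)
        if s > h:
            return t
    return None

x = np.concatenate([np.zeros(50), np.ones(50)])  # mean shift at t=50
print(cusum_alarm(x))  # alarms a few steps after the true change point
```

The detection delay trades off against false alarms: a smaller h reacts faster but triggers more easily on noise.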
Exercise 21: BIC for Model Selection
Fit HMMs with 1, 2, 3, and 4 states to the data from Exercise 19: (a) Compute the BIC for each model. (b) Which number of states is selected by BIC? (c) Plot the BIC values vs. number of states. (d) Does BIC correctly identify the true number of regimes (2)?
Exercise 22: Interactive Visualization
Using Plotly, create an interactive chart for a simulated prediction market that includes: (a) A price line chart with hover information showing date and price. (b) A volume bar chart below the price chart. (c) Horizontal reference lines at 0.25, 0.50, and 0.75. (d) Annotations marking the three largest price jumps.
Exercise 23: Candlestick Analog
Create a daily "candlestick" chart for a prediction market using Plotly:
(a) From intraday (hourly) data, compute the daily open, high, low, and close prices.
(b) Create a candlestick chart using plotly.graph_objects.Candlestick.
(c) Add a 5-day moving average overlay.
(d) Compare the information conveyed by the candlestick chart vs. a simple line chart of closing prices.
Exercise 24: Outlier Detection Comparison
Generate a price series with three injected outliers:
np.random.seed(9)
changes = np.random.randn(200) * 0.01
changes[50] = 0.15 # Large positive outlier
changes[51] = -0.14 # Immediate reversion (data error)
changes[120] = 0.12 # Large positive, no reversion (real event)
prices = 0.5 + np.cumsum(changes)
prices = np.clip(prices, 0.01, 0.99)
(a) Apply z-score, MAD, and IQR outlier detection methods to the price changes. (b) Which methods detect which outliers? (c) Use the persistence heuristic (does the change revert?) to distinguish the data error from the real event. (d) After removing the data error, recompute the mean and standard deviation of price changes. How much do they change?
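A MAD-based detector for part (a); the 1.4826 factor rescales MAD to match the standard deviation under normality, and the threshold of 3.5 is a common default:

```python
import numpy as np

def mad_outliers(x, k=3.5):
    """Indices whose robust z-score |x - median| / (1.4826 * MAD) exceeds k."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    robust_z = np.abs(x - med) / (1.4826 * mad)
    return np.where(robust_z > k)[0]
```

One caveat: if more than half the observations are identical, the MAD is zero and the score degenerates.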
Exercise 25: Robust Statistics Comparison
For a price series with 5% outlier contamination:
np.random.seed(10)
clean = np.random.randn(950) * 0.01
outliers = np.random.randn(50) * 0.10
changes = np.concatenate([clean, outliers])
np.random.shuffle(changes)
(a) Compute the mean, median, and 10% trimmed mean. (b) Compute the standard deviation, MAD, and IQR. (c) How much do the non-robust statistics (mean, std) differ from the robust alternatives? (d) At what level of outlier contamination does the standard deviation exceed twice the MAD-based scale estimate (1.4826 × MAD)?
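For parts (a)-(c), a toy contrast showing which statistics a single gross outlier moves (illustrative numbers, not the simulated series above):

```python
import numpy as np
from scipy import stats

x = np.append(np.arange(9, dtype=float), 100.0)  # 0..8 plus one gross outlier

mean, tmean, med = np.mean(x), stats.trim_mean(x, 0.1), np.median(x)
mad_sigma = 1.4826 * np.median(np.abs(x - med))
print(mean, tmean, med)              # only the plain mean is dragged upward
print(np.std(x, ddof=1), mad_sigma)  # std blows up; the MAD estimate does not
```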
Exercise 26: Calibration Analysis
Given a dataset of 500 resolved markets with final prices and outcomes:
np.random.seed(11)
true_probs = np.random.uniform(0, 1, 500)
outcomes = (np.random.rand(500) < true_probs).astype(int)
market_prices = true_probs + np.random.randn(500) * 0.05
market_prices = np.clip(market_prices, 0.01, 0.99)
(a) Bin the market prices into deciles (0-0.1, 0.1-0.2, ..., 0.9-1.0). (b) For each bin, compute the fraction of markets that resolved Yes. (c) Plot the calibration curve (predicted probability vs. observed frequency). (d) Compute the average Brier score. Is the market well-calibrated?
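A binning helper for parts (a)-(c) (a sketch; `np.digitize` handles the decile assignment, and empty bins are skipped):

```python
import numpy as np

def calibration_curve(prices, outcomes, n_bins=10):
    """Mean predicted price and observed Yes-frequency for each decile bin."""
    p = np.asarray(prices, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0, 1, n_bins + 1)
    bins = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    pred, obs = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            pred.append(p[mask].mean())
            obs.append(y[mask].mean())
    return np.array(pred), np.array(obs)
```

Plotting `pred` against `obs` with the 45-degree line gives the calibration curve in (c).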
Exercise 27: Seasonality Decomposition
Simulate a price series with weekly seasonality:
np.random.seed(12)
t = np.arange(365)
trend = 0.5 + 0.1 * t / 365
seasonality = 0.02 * np.sin(2 * np.pi * t / 7)
noise = np.random.randn(365) * 0.01
prices = np.clip(trend + seasonality + noise, 0.01, 0.99)
(a) Apply a 7-day moving average to estimate the trend. (b) Subtract the trend to isolate the seasonal + noise component. (c) Average the detrended series by day of week to estimate the seasonal pattern. (d) Compare the estimated seasonal pattern to the true pattern. How accurate is the estimate?
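Part (c)'s day-of-week averaging, shown noise-free so the recovery is essentially exact (a sketch; with noise and an imperfectly removed trend, the estimate only approximates the true pattern):

```python
import numpy as np

t = np.arange(364)  # 52 complete weeks keeps day-of-week counts balanced
true_seasonal = 0.02 * np.sin(2 * np.pi * t / 7)

# Group the (already detrended) series by day of week and average.
estimated = np.array([true_seasonal[t % 7 == d].mean() for d in range(7)])
print(np.round(estimated, 4))
```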
Exercise 28: Platform-Level EDA
Simulate summary statistics for 100 markets on a platform:
np.random.seed(13)
market_data = pd.DataFrame({
    'total_volume': np.random.lognormal(8, 2, 100),
    'n_observations': np.random.poisson(500, 100),
    'price_range': np.random.beta(2, 3, 100),
    'final_price': np.random.beta(0.5, 0.5, 100),
    'outcome': (np.random.rand(100) < np.random.beta(0.5, 0.5, 100)).astype(int)
})
(a) Compute the platform-level summary statistics (mean volume, median volume, mean Brier score, etc.). (b) Create a scatter plot of total volume vs. number of observations. (c) Create a histogram of final prices. Is it U-shaped? (d) Compute the calibration curve using the final prices and outcomes.
Exercise 29: Complete EDA Pipeline
Choose one of the following datasets (or generate synthetic data matching its characteristics):
- A 6-month daily election market.
- A 30-day daily sports outcome market.
- A 1-year daily economic indicator market.
Apply the full EDA template from Section 21.12: (a) Compute and display all summary statistics. (b) Create the full set of 10 standard visualizations listed in Section 21.10.4. (c) Run all statistical tests (Ljung-Box, runs test, normality tests, KS test for beta fit). (d) Summarize your findings in a brief report (500-1000 words).
Exercise 30: EDA Critique
Consider the following (deliberately flawed) EDA conclusions. For each, explain why the conclusion is problematic and what analysis would be more appropriate:
(a) "The ACF at lag 7 is significant, so there is a weekly pattern in price changes." (Hint: multiple comparisons.)
(b) "The price has been trending upward for the last 30 days, so we should buy because the trend will continue." (Hint: EDA vs. prediction.)
(c) "The Brier score for this market is 0.05, so the market was very well-calibrated." (Hint: single observation.)
(d) "We detected 3 regimes using HMM and the BIC was better than for 2 regimes, so we conclude there are exactly 3 regimes." (Hint: model selection uncertainty.)
(e) "The correlation between markets A and B is 0.95, so market A causes market B to move." (Hint: correlation vs. causation, and common factors.)