Chapter 21: Quiz

Exploratory Data Analysis of Market Data


Question 1

What is the primary purpose of Exploratory Data Analysis (EDA)?

A) To confirm pre-existing hypotheses about the data
B) To build the best predictive model as quickly as possible
C) To systematically examine datasets to understand structure, uncover patterns, and generate hypotheses
D) To compute p-values for all possible statistical tests

Answer: C EDA is fundamentally about understanding your data before modeling. It is a hypothesis-generating activity, not a hypothesis-confirming one. The goal is to understand structure, detect anomalies, and generate testable hypotheses.


Question 2

Why is it problematic to use the same data for both generating and confirming a hypothesis?

A) It uses too much computational power
B) It leads to overfitting and false discoveries
C) It violates copyright law
D) It makes the analysis take too long

Answer: B Using the same data to both generate and confirm a hypothesis is a form of data snooping that leads to overfitting. Patterns found during exploration may be spurious, and testing them on the same data gives falsely high confidence. Hypotheses should be tested on separate, held-out data.


Question 3

In a prediction market with prices bounded between 0 and 1, the Volume-Weighted Average Price (VWAP) is preferred over the simple mean because:

A) VWAP is always lower than the mean
B) VWAP gives more weight to prices at which more trading occurred, better reflecting consensus
C) VWAP is easier to compute
D) VWAP is required by regulators

Answer: B VWAP weights each price observation by its associated trading volume, so prices established during periods of high activity (presumably reflecting more informed or more participatory consensus) receive greater weight.
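A minimal Python sketch of the difference, using hypothetical prices and volumes (the `vwap` helper and the data are illustrative assumptions, not taken from any particular platform):

```python
def vwap(prices, volumes):
    """Volume-weighted average price: sum(p_i * v_i) / sum(v_i)."""
    total_volume = sum(volumes)
    if total_volume == 0:
        raise ValueError("total volume must be positive")
    return sum(p * v for p, v in zip(prices, volumes)) / total_volume


# Hypothetical data: heavy trading near 0.60 and one thin trade at 0.90.
prices = [0.60, 0.62, 0.90, 0.61]
volumes = [500, 400, 5, 600]

simple_mean = sum(prices) / len(prices)
print(f"simple mean: {simple_mean:.4f}")
print(f"VWAP:        {vwap(prices, volumes):.4f}")
```

In this example the thin trade at 0.90 pulls the simple mean up to about 0.68, while the VWAP stays near 0.61, where almost all of the volume traded.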


Question 4

For a binary prediction market priced at $p = 0.50$, the standard deviation of the underlying Bernoulli outcome is at its theoretical maximum, which equals:

A) 0.25
B) 0.50
C) 0.10
D) 1.00

Answer: B The standard deviation of a Bernoulli variable is $\sqrt{p(1-p)}$, which is largest at $p = 0.50$: $\sqrt{0.50 \times 0.50} = \sqrt{0.25} = 0.50$.
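As a short worked check, treating the market price $p$ as the Bernoulli success probability, the variance $p(1-p)$ can be maximized directly:

$$
\frac{d}{dp}\,p(1-p) = 1 - 2p = 0 \;\Rightarrow\; p = \tfrac{1}{2},
\qquad
\frac{d^2}{dp^2}\,p(1-p) = -2 < 0,
$$

$$
\max_{p \in [0,1]} \sqrt{p(1-p)} = \sqrt{\tfrac{1}{2} \cdot \tfrac{1}{2}} = \tfrac{1}{2}.
$$

The negative second derivative confirms that $p = 1/2$ is a maximum, so no price level can carry a Bernoulli standard deviation above 0.50.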


Question 5

A prediction market shows positive autocorrelation in price changes at lag 1. This is most consistent with:

A) An efficient market
B) Momentum or trend-following behavior
C) Mean-reverting behavior
D) A random walk

Answer: B Positive autocorrelation means that positive price changes tend to be followed by positive changes, and negative by negative. This is characteristic of momentum or trending behavior, suggesting that information is being incorporated gradually rather than instantaneously.
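The sign of the lag-1 autocorrelation can be illustrated with a small sketch. The `autocorr` helper and both series are hypothetical, constructed only to contrast trending and alternating price changes:

```python
def autocorr(x, lag=1):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Hypothetical price changes: a run of gains followed by a run of losses.
trending = [0.01, 0.02, 0.015, 0.01, -0.01, -0.02, -0.015, -0.01]
# Hypothetical price changes that flip sign every period.
alternating = [0.01, -0.01] * 4

print(f"trending:    {autocorr(trending):+.3f}")
print(f"alternating: {autocorr(alternating):+.3f}")
```

The trending series produces a clearly positive value (momentum), while the alternating series produces a clearly negative one (mean reversion).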


Question 6

The runs test applied to a sequence of price changes yields a z-statistic of -2.5. This indicates:

A) Too many runs, suggesting mean reversion
B) Too few runs, suggesting momentum/trending
C) The expected number of runs, suggesting randomness
D) The test is inconclusive

Answer: B A negative z-statistic means there are fewer runs than expected under randomness. Fewer runs mean longer streaks of consecutive positive or negative changes, which is consistent with momentum or trending behavior.
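A minimal sketch of the Wald-Wolfowitz runs test on the signs of price changes follows. The function name and the two example series are hypothetical, and zero changes are simply dropped, which is one of several possible conventions:

```python
import math


def runs_test_z(changes):
    """z-statistic of the runs test applied to the signs of the changes."""
    signs = [c > 0 for c in changes if c != 0]  # ignore unchanged periods
    n1 = sum(signs)            # number of positive changes
    n2 = len(signs) - n1       # number of negative changes
    n = n1 + n2
    if n1 == 0 or n2 == 0:
        raise ValueError("need both positive and negative changes")
    # A new run starts whenever the sign flips.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - expected) / math.sqrt(variance)


streaky = [0.01] * 10 + [-0.01] * 10   # two long streaks, only 2 runs
choppy = [0.01, -0.01] * 10            # sign flips every period, 20 runs

z_streaky = runs_test_z(streaky)
z_choppy = runs_test_z(choppy)
print(f"streaky: z = {z_streaky:.2f}")
print(f"choppy:  z = {z_choppy:.2f}")
```

The streaky series gives a strongly negative z (too few runs), and the choppy series gives an equally strong positive z (too many runs).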


Question 7

The Ljung-Box test with $m = 10$ lags produces a p-value of 0.003. What do you conclude?

A) The price changes are definitely independent
B) There is significant evidence of serial correlation in the price changes
C) The market is efficient
D) Exactly one of the first 10 autocorrelations is significant

Answer: B A small p-value (0.003 < 0.05) leads us to reject the null hypothesis that the first 10 autocorrelations are jointly zero. The Ljung-Box test is a joint (portmanteau) test, so rejection indicates serial dependence somewhere in the first 10 lags without identifying which lags, or how many, are responsible.
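The statistic behind this test is $Q = n(n+2)\sum_{k=1}^{m} \hat\rho_k^2 / (n-k)$, compared against a chi-square distribution with $m$ degrees of freedom. The sketch below computes $Q$ with the standard library only and compares it with 18.307, the 5% critical value for 10 degrees of freedom, instead of computing a p-value. The simulated AR(1) series and its coefficient of 0.8 are assumptions chosen purely for illustration:

```python
import random


def ljung_box_q(x, m):
    """Ljung-Box Q statistic using the first m sample autocorrelations."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    total = 0.0
    for k in range(1, m + 1):
        rho_k = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / denom
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total


CHI2_95_DF10 = 18.307  # 95th percentile of chi-square with 10 degrees of freedom

# Hypothetical serially dependent price changes: AR(1) with coefficient 0.8.
random.seed(0)
changes = [0.0]
for _ in range(499):
    changes.append(0.8 * changes[-1] + random.gauss(0, 0.01))

q_stat = ljung_box_q(changes, m=10)
print("reject 'no serial correlation':", q_stat > CHI2_95_DF10)
```

In practice a statistics library would normally be used to obtain the p-value directly; this version only exposes the arithmetic.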


Question 8

When fitting a Beta distribution to prediction market prices, parameters $\alpha < 1$ and $\beta < 1$ indicate:

A) A bell-shaped distribution centered at 0.5
B) A uniform distribution
C) A U-shaped distribution with mass concentrated near 0 and 1
D) A distribution skewed toward 1

Answer: C When both parameters are less than 1, the Beta distribution is U-shaped, with density highest at the boundaries (0 and 1). This is typical for prediction markets where prices tend to converge toward 0 or 1 as resolution approaches.
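The shape change can be checked numerically by evaluating the Beta density at a few points. This is a small sketch with a hand-written `beta_pdf` helper; the parameter values 0.5 and 5 are arbitrary illustrations:

```python
import math


def beta_pdf(x, a, b):
    """Beta(a, b) density at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


for a, b in [(0.5, 0.5), (5.0, 5.0)]:
    edge = beta_pdf(0.05, a, b)
    middle = beta_pdf(0.50, a, b)
    shape = "U-shaped" if edge > middle else "bell-shaped"
    print(f"alpha={a}, beta={b}: f(0.05)={edge:.3f}, f(0.50)={middle:.3f} -> {shape}")
```

With both parameters at 0.5 the density is higher near the boundary than at the center, and with both at 5 the ordering reverses.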


Question 9

Which of the following best describes "volatility clustering"?

A) Volatility is constant over time
B) Periods of high volatility tend to be followed by high volatility, and low by low
C) Volatility always increases as a market approaches resolution
D) Volatility only occurs during US business hours

Answer: B Volatility clustering is the empirical phenomenon where large price changes (of either sign) tend to be followed by large changes, and small changes by small changes. This creates temporal persistence in volatility levels.


Question 10

In a GARCH(1,1) model, $\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2$, the condition for stationarity is:

A) $\alpha + \beta > 1$
B) $\alpha + \beta < 1$
C) $\alpha = \beta$
D) $\omega > \alpha + \beta$

Answer: B Covariance (weak) stationarity of GARCH(1,1) requires $\alpha + \beta < 1$. When this holds, the unconditional variance exists and is finite, equal to $\omega / (1 - \alpha - \beta)$. When $\alpha + \beta \geq 1$, the unconditional variance is no longer finite and the process is not covariance stationary.
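A short sketch of why $\alpha + \beta < 1$ matters: multi-step variance forecasts follow $E[\sigma_{t+h}^2] = \omega + (\alpha + \beta)\,E[\sigma_{t+h-1}^2]$, which converges to $\omega / (1 - \alpha - \beta)$ only when the persistence $\alpha + \beta$ is below 1. The parameter values and function names below are hypothetical:

```python
def unconditional_variance(omega, alpha, beta):
    """Long-run variance of GARCH(1,1); finite only if alpha + beta < 1."""
    persistence = alpha + beta
    if persistence >= 1:
        raise ValueError("alpha + beta must be below 1 for a finite variance")
    return omega / (1 - persistence)


def variance_forecasts(omega, alpha, beta, sigma2_start, horizon):
    """Iterate E[sigma2_{t+h}] = omega + (alpha + beta) * E[sigma2_{t+h-1}]."""
    forecasts = [sigma2_start]
    for _ in range(horizon):
        forecasts.append(omega + (alpha + beta) * forecasts[-1])
    return forecasts


# Hypothetical parameters with persistence 0.95.
omega, alpha, beta = 1e-5, 0.10, 0.85
long_run = unconditional_variance(omega, alpha, beta)

# Start from a turbulent period (variance five times the long-run level).
path = variance_forecasts(omega, alpha, beta, sigma2_start=1e-3, horizon=500)
print(f"long-run variance:  {long_run:.6f}")
print(f"forecast at h=500:  {path[-1]:.6f}")
```

The forecasts decay geometrically toward the long-run level. With persistence at or above 1 the same recursion would never settle, which is why the helper refuses to return a value in that case.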


Question 11

You observe that a prediction market price jumped from 0.40 to 0.65 in one period but immediately reverted to 0.42 in the next period. This is most likely:

A) A genuine information event
B) A data error or fat-finger trade
C) Volatility clustering
D) A regime change

Answer: B Genuine information events typically cause persistent price changes. A large jump that immediately and almost completely reverts is characteristic of a data error or a fat-finger trade in a thin market. The quick reversion suggests no fundamental information was driving the price change.
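One way to screen for this pattern is to flag any large one-period move that is mostly undone in the next period. The sketch below is a simple heuristic, not a standard test; the function name and both thresholds (`min_jump`, `min_reversion`) are assumptions that would need tuning for a real market:

```python
def find_reverting_spikes(prices, min_jump=0.15, min_reversion=0.8):
    """Indices t where the move into t is large and mostly reversed at t + 1."""
    flagged = []
    for t in range(1, len(prices) - 1):
        move_in = prices[t] - prices[t - 1]
        move_out = prices[t + 1] - prices[t]
        is_large = abs(move_in) >= min_jump
        reverses = move_in * move_out < 0  # opposite directions
        mostly_undone = abs(move_out) >= min_reversion * abs(move_in)
        if is_large and reverses and mostly_undone:
            flagged.append(t)
    return flagged


spike = [0.40, 0.65, 0.42]   # the pattern from the question
news = [0.40, 0.65, 0.66]    # a jump that persists

print(find_reverting_spikes(spike))  # [1]
print(find_reverting_spikes(news))   # []
```

Flagged points are candidates for manual review, since a fast reversal can occasionally follow a genuine but quickly retracted news item.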


Question 12

The Median Absolute Deviation (MAD) is preferred over the standard deviation as a measure of spread when:

A) The data is perfectly normally distributed
B) The sample size is very large
C) The data contains outliers
D) You want to maximize statistical power

Answer: C The MAD is a robust measure of spread that is much less sensitive to outliers than the standard deviation. A single extreme outlier can dramatically inflate the standard deviation but has minimal effect on the MAD.
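A quick comparison on hypothetical price changes shows the difference in sensitivity. The `mad` helper below returns the raw median absolute deviation, without the 1.4826 scaling sometimes applied to make it comparable to a normal standard deviation:

```python
import statistics


def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


# Hypothetical price changes, then the same series with one extreme value.
clean = [0.01, -0.02, 0.015, -0.01, 0.005, -0.005, 0.02]
with_outlier = clean + [0.50]

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{label:>12}: stdev={statistics.stdev(data):.4f}  MAD={mad(data):.4f}")
```

Adding the single outlier multiplies the standard deviation by more than ten, while the MAD moves only from 0.0100 to 0.0125.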


Question 13

A cross-correlation analysis between markets A and B shows the maximum correlation at lag $k = +3$ (where positive lag means A leads). This means:

A) Market B leads market A by 3 periods
B) Market A leads market B by 3 periods
C) The two markets are independent
D) The markets are perfectly correlated

Answer: B A maximum cross-correlation at positive lag $k = 3$ means that changes in market A at time $t$ are most correlated with changes in market B at time $t + 3$, indicating that A leads B by 3 periods. Information appears to be incorporated first in market A and then propagates to market B.
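The lag convention can be made concrete with a small sketch in which market B is constructed as an exact copy of market A delayed by three periods. The helper names and the simulated series are hypothetical, and real markets would show a much weaker peak:

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def cross_corr(a, b, k):
    """Correlation between a[t] and b[t + k]; k > 0 means A leads B."""
    if k >= 0:
        return pearson(a[:len(a) - k], b[k:])
    return pearson(a[-k:], b[:len(b) + k])


random.seed(1)
a = [random.gauss(0, 0.01) for _ in range(200)]  # changes in market A
b = [0.0] * 3 + a[:-3]                           # B repeats A three periods later

best_lag = max(range(-5, 6), key=lambda k: cross_corr(a, b, k))
print("lag with maximum cross-correlation:", best_lag)
```

Because `b[t + 3]` equals `a[t]` by construction, the peak falls at $k = +3$, matching the reading that A leads B by three periods.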


Question 14

In the Hidden Markov Model (HMM) framework for regime detection, the "hidden" component refers to:

A) The observed prices, which are hidden from other traders
B) The underlying regime (state) that is not directly observed
C) The model parameters, which are secret
D) The missing data points in the time series

Answer: B In an HMM, the "hidden" part is the underlying state (regime) at each time step. We observe the prices or price changes (the "emissions"), but the regime generating those observations is not directly observable. The model infers the most likely regime sequence from the observed data.


Question 15

The PELT algorithm for change-point detection minimizes a penalized cost function. The penalty term serves to:

A) Make the algorithm run faster
B) Prevent overfitting by penalizing too many change points
C) Ensure change points are equally spaced
D) Guarantee at least one change point is detected

Answer: B The penalty term in PELT controls the trade-off between fitting the data well (low cost) and model parsimony (few change points). Without a penalty, the algorithm would place a change point at nearly every observation, overfitting the noise. The penalty encourages finding only the most significant structural changes.
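The effect of the penalty can be seen in a small sketch. Note that this is plain optimal partitioning, the $O(n^2)$ dynamic program that PELT accelerates with pruning, and not PELT itself; it minimizes the same kind of penalized cost. The squared-error cost, the penalty value of 0.05, and the synthetic series are all assumptions for illustration:

```python
def penalized_changepoints(x, penalty):
    """Change points minimizing segment squared error + penalty per segment."""
    n = len(x)
    # Prefix sums give each segment's squared-error cost in O(1).
    s1, s2 = [0.0], [0.0]
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):  # squared error of x[i:j] around its own mean
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = [0.0] * (n + 1)   # best[j]: minimal penalized cost of x[:j]
    best[0] = -penalty       # so the first segment is not charged twice
    last = [0] * (n + 1)     # last[j]: start index of the final segment of x[:j]
    for j in range(1, n + 1):
        best[j], last[j] = min(
            (best[i] + cost(i, j) + penalty, i) for i in range(j)
        )

    # Walk back through the segment starts to recover the change points.
    changepoints, j = [], n
    while last[j] > 0:
        changepoints.append(last[j])
        j = last[j]
    return sorted(changepoints)


# Hypothetical prices: level 0.30 for 20 periods, then 0.60, with small noise.
levels = [0.30] * 20 + [0.60] * 20
x = [base + (0.01 if t % 2 == 0 else -0.01) for t, base in enumerate(levels)]

cps_penalized = penalized_changepoints(x, penalty=0.05)
cps_free = penalized_changepoints(x, penalty=0.0)
print("penalty 0.05:", cps_penalized)
print("penalty 0.00:", len(cps_free), "change points")
```

With the penalty in place only the genuine level shift at index 20 survives. With no penalty, splitting never costs anything, so the procedure fragments the series to fit the noise.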


Question 16

When analyzing volume patterns in a prediction market, "volume concentration" (what fraction of total volume occurs in the top 10% of trading days) is useful because:

A) It determines the market's resolution date
B) It measures whether trading is evenly distributed or driven by a few high-activity periods
C) It is required for computing VWAP
D) It is the same as the Herfindahl index

Answer: B Volume concentration measures the degree to which trading activity is clustered in a few periods vs. spread evenly across the market's lifetime. High concentration (e.g., 60% of volume in the top 10% of days) suggests event-driven trading, while low concentration suggests steady, continuous activity.
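A minimal sketch of the calculation, assuming daily volumes are available as a list. The function name, the rounding rule for the number of top days, and the example volumes are illustrative choices:

```python
def volume_concentration(daily_volumes, top_fraction=0.10):
    """Share of total volume traded on the busiest top_fraction of days."""
    n_top = max(1, round(top_fraction * len(daily_volumes)))
    busiest = sorted(daily_volumes, reverse=True)[:n_top]
    return sum(busiest) / sum(daily_volumes)


# Hypothetical 20-day markets: one event-driven, one with steady activity.
event_driven = [100] * 18 + [1500] * 2
steady = [100] * 20

print(f"event-driven: {volume_concentration(event_driven):.1%}")
print(f"steady:       {volume_concentration(steady):.1%}")
```

The event-driven market has 62.5% of its volume in its two busiest days, while the steady market sits at the 10% floor implied by perfectly even trading.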


Question 17

A fan chart for prediction market data shows:

A) Volume distribution over time
B) The range of possible future prices expanding outward from the current price
C) The autocorrelation function
D) A comparison of multiple markets

Answer: B A fan chart visualizes forecast uncertainty by showing progressively wider bands (the "fan") extending into the future from the current price. Wider bands indicate greater uncertainty about future price movements. These charts are particularly useful for communicating the uncertainty inherent in probability estimates.


Question 18

If you find that the price change distribution has excess kurtosis of 5 (compared to 0 for a normal distribution), the practical implication is:

A) The market is perfectly efficient
B) Extreme price movements occur more frequently than a normal distribution predicts
C) The mean price change is 5
D) The market has exactly 5 regimes

Answer: B Excess kurtosis greater than zero (a leptokurtic distribution) means the distribution has heavier tails than the normal distribution. This translates to a higher probability of extreme price movements: large jumps or drops occur more often than a normal model would predict, which has important implications for risk management.
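A small sketch of the calculation, using the simple moment-based (population) estimator $m_4 / m_2^2 - 3$ without any small-sample correction. Both series are hypothetical:

```python
def excess_kurtosis(x):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2 ** 2 - 3


# Hypothetical price changes: uniformly small moves vs. the same with two jumps.
calm = [0.01, -0.01] * 21
jumpy = [0.01, -0.01] * 20 + [0.20, -0.20]

print(f"calm:  {excess_kurtosis(calm):+.2f}")
print(f"jumpy: {excess_kurtosis(jumpy):+.2f}")
```

Two large moves among 42 observations are enough to push excess kurtosis far above zero, while the series with uniformly sized moves has thinner tails than a normal distribution.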


Question 19

For complementary prediction markets (e.g., "Candidate A wins" and "Candidate B wins" in a two-person race), the expected correlation between their price changes is:

A) Approximately +1
B) Approximately 0
C) Approximately -1
D) Cannot be determined

Answer: C In a two-person race, the prices should sum to approximately 1 (assuming no other candidates and ignoring the overround). Therefore, when one price goes up, the other must go down by a similar amount, producing a correlation of approximately -1 between their price changes.


Question 20

When normalizing volatility by dividing by $\sqrt{p(1-p)}$, why is this useful for prediction markets?

A) It makes the computation faster
B) It accounts for the mechanical relationship between price level and volatility capacity
C) It converts prices to log-returns
D) It removes all trends from the data

Answer: B A market trading at $p = 0.50$ has the greatest potential for volatility because the outcome variance $p(1-p)$ is largest there, while markets near 0 or 1 are mechanically constrained. Normalizing by $\sqrt{p(1-p)}$ adjusts for this, allowing fair comparison of volatility across different price levels.
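As a brief illustration, consider two hypothetical markets with the same raw volatility of price changes (0.02) but different price levels. The function name and the numbers are assumptions for the example:

```python
import math


def normalized_volatility(sigma, p):
    """Raw volatility divided by the Bernoulli scale sqrt(p * (1 - p))."""
    if not 0 < p < 1:
        raise ValueError("price must be strictly between 0 and 1")
    return sigma / math.sqrt(p * (1 - p))


raw_sigma = 0.02
for price in (0.50, 0.95):
    print(f"p={price:.2f}: normalized volatility = "
          f"{normalized_volatility(raw_sigma, price):.4f}")
```

The same raw volatility is more than twice as large in normalized terms at $p = 0.95$ (about 0.092) as at $p = 0.50$ (0.04), because the market near the boundary has far less room to move.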


Question 21

The Hartigan dip test is used to:

A) Test for normality
B) Test for stationarity
C) Test for unimodality (vs. multimodality)
D) Test for serial correlation

Answer: C The Hartigan dip test evaluates the null hypothesis that a distribution is unimodal against the alternative that it is multimodal. In prediction market analysis, it can identify bimodal price distributions that might indicate the market has spent time in two distinct price ranges (e.g., before and after a major information event).


Question 22

A "thin" prediction market (low volume, wide spreads) poses challenges for EDA because:

A) It has too much data to process
B) Prices may not reflect true consensus, and statistical analyses may be unreliable due to noise
C) Thin markets always have zero autocorrelation
D) Thin markets cannot be visualized

Answer: B Thin markets have fewer participants and less trading activity, which means prices are more susceptible to noise, manipulation, and the impact of individual trades. Statistical analyses based on thin market data may produce unreliable results because the sample sizes are effectively smaller and the signal-to-noise ratio is lower.


Question 23

In the context of EDA for prediction markets, "contagion" refers to:

A) A computer virus affecting market data
B) A shock in one market spreading to other seemingly unrelated markets
C) The spread of information from insiders
D) Price changes that are exactly zero

Answer: B Contagion occurs when a shock or surprise in one market propagates to other markets that may not be directly related. This can happen through common participants, shared platforms, or broader shifts in risk sentiment. Detecting contagion involves examining whether cross-market correlations increase during periods of stress.


Question 24

An automated EDA report generator is valuable because:

A) It replaces the need for any human analysis
B) It ensures consistent analysis across many markets and prevents overlooking important checks
C) It always finds the best trading strategy
D) It is required by all prediction market platforms

Answer: B When analyzing hundreds or thousands of markets, manual EDA is infeasible. An automated template ensures that every market receives the same set of analyses, preventing the analyst from cherry-picking which tests to run or overlooking important patterns. However, the human analyst's judgment is still essential for interpreting the results.


Question 25

You discover during EDA that price changes in a prediction market have significant positive autocorrelation at lag 1, a fat-tailed distribution, and volatility clustering. Based on these findings, which modeling approach is LEAST appropriate?

A) A GARCH model for volatility
B) An AR(1) model for price changes
C) A simple random walk model with normally distributed increments
D) A regime-switching model

Answer: C A simple random walk with normal increments assumes zero autocorrelation (contradicted by the lag-1 finding), thin tails (contradicted by the fat-tailed distribution), and constant volatility (contradicted by the clustering finding). All three EDA findings directly violate the assumptions of this model. The other options each address at least one of the discovered features.