Chapter 25 Quiz
Question 1: What is the primary mechanism by which simple averaging reduces ensemble error?
A) It reduces bias by correcting individual model biases B) It reduces variance by canceling out uncorrelated errors C) It increases the complexity of the overall model D) It eliminates irreducible noise
Answer: B
Explanation: Simple averaging primarily works by reducing variance. When individual models have uncorrelated errors, averaging cancels out the noise, so the ensemble variance falls to $\sigma^2/K$, where $K$ is the number of models. Bias is generally unchanged (it equals the average of individual biases), and irreducible noise cannot be eliminated by any method.
Question 2: If you have 4 models with equal error variance $\sigma^2 = 0.10$ and pairwise error correlation $\rho = 0.4$, what is the ensemble variance of a simple average?
A) 0.025 B) 0.055 C) 0.070 D) 0.040
Answer: B
Explanation: Using the formula $\sigma_{\text{ensemble}}^2 = \frac{\sigma^2}{K}[1 + (K-1)\rho] = \frac{0.10}{4}[1 + (4-1)(0.4)] = 0.025 \times 2.2 = 0.055$.
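A quick numerical check of this formula (a minimal Python sketch; the function name is ours and the inputs are the values from the question):

```python
# A minimal sketch: variance of an equal-weight average of K models with
# common error variance sigma2 and common pairwise error correlation rho.
def ensemble_variance(sigma2: float, rho: float, k: int) -> float:
    """sigma^2 / K * [1 + (K - 1) * rho]"""
    return sigma2 / k * (1 + (k - 1) * rho)

print(ensemble_variance(sigma2=0.10, rho=0.4, k=4))  # 0.055 (answer B)
```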
Question 3: The "forecast combination puzzle" refers to which empirical finding?
A) Ensembles always outperform individual models B) Simple averages often perform as well as optimally weighted combinations C) More models always lead to better ensembles D) Logarithmic pooling always outperforms linear pooling
Answer: B
Explanation: The forecast combination puzzle, documented repeatedly since Bates and Granger's (1969) seminal work on forecast combination, is the finding that simple equal-weight averages often match or beat sophisticated optimization-based combination methods. This occurs because estimation error in the optimal weights and non-stationarity of relative model performance erode the theoretical advantage of optimal weighting.
Question 4: Which of the following is NOT a valid strategy for mitigating overfitting when optimizing ensemble weights?
A) Shrinkage toward equal weights B) Cross-validation to select weights C) Using all available data for both training base models and estimating weights D) Regularization penalizing deviation from equal weights
Answer: C
Explanation: Using the same data for training base models and estimating combination weights leads to information leakage, which causes overfitting. The correct approach is to use cross-validated (out-of-fold) predictions from the base models when estimating the combination weights.
Question 5: The linear opinion pool for combining probability forecasts:
A) Can produce probabilities more extreme than any individual forecast B) Preserves marginal calibration when component forecasts are calibrated C) Operates in log-odds space D) Requires conditional independence of forecasters
Answer: B
Explanation: The linear pool (weighted average of probabilities) preserves marginal calibration: if each component forecast is calibrated, the weighted average is also calibrated. It cannot produce probabilities outside the range of individual forecasts (as a convex combination, it lies within the range spanned by the inputs). It operates in probability space, not log-odds space. Conditional independence is an assumption of the logarithmic pool, not the linear pool.
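To make the boundedness point concrete, a minimal sketch (the weights and component forecasts are illustrative):

```python
import numpy as np

# A minimal sketch: the linear pool is a convex combination of the inputs,
# so it can never be more extreme than the most extreme individual forecast.
probs = np.array([0.6, 0.7, 0.8])      # illustrative component forecasts
weights = np.array([0.2, 0.3, 0.5])    # non-negative weights summing to 1

linear_pool = float(np.dot(weights, probs))
print(linear_pool)                                 # 0.73
print(probs.min() <= linear_pool <= probs.max())   # True: stays inside the input range
```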
Question 6: The logarithmic opinion pool combines forecasts in:
A) Probability space using arithmetic averaging B) Log-odds space using weighted summation C) Rank space using median D) Variance space using harmonic mean
Answer: B
Explanation: The logarithmic pool operates in log-odds space, where it computes a weighted sum of individual log-odds: $\text{logit}(p_{\text{LogP}}) = \sum_i w_i \cdot \text{logit}(p_i)$. Equivalently, the pooled odds are the weighted geometric mean of the individual odds.
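A minimal sketch of this combination rule (the equal weights and input probabilities are illustrative):

```python
import numpy as np

# A minimal sketch of a logarithmic opinion pool: weighted sum of log-odds,
# mapped back to probability space.
def logit(p):
    return np.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + np.exp(-z))

def log_pool(probs, weights):
    probs, weights = np.asarray(probs, float), np.asarray(weights, float)
    return float(inv_logit(np.sum(weights * logit(probs))))

print(log_pool([0.7, 0.8, 0.6], [1/3, 1/3, 1/3]))  # ~0.707
```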
Question 7: Extremizing a forecast aggregate with factor $d > 1$ does which of the following?
A) Pushes the forecast toward 0.5 B) Pushes the forecast away from 0.5 C) Leaves the forecast unchanged D) Randomly perturbs the forecast
Answer: B
Explanation: Extremizing with $d > 1$ pushes forecasts away from 0.5 (toward 0 or 1). In log-odds space, this is simply multiplication: $\text{logit}(p_{\text{ext}}) = d \cdot \text{logit}(\bar{p})$. Since $d > 1$, the absolute log-odds increase, making the probability more extreme.
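The whole operation fits in a few lines (a sketch; the probabilities and factors below are illustrative):

```python
import numpy as np

# A minimal sketch: extremize an aggregate probability by scaling its log-odds by d.
def extremize(p_bar, d):
    z = np.log(p_bar / (1 - p_bar))    # logit of the aggregate
    return 1 / (1 + np.exp(-d * z))    # inverse logit of d * logit(p_bar)

print(extremize(0.65, 1.0))  # 0.65   (d = 1 leaves the forecast unchanged)
print(extremize(0.65, 2.0))  # ~0.775 (pushed away from 0.5)
print(extremize(0.35, 2.0))  # ~0.225 (symmetric below 0.5)
```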
Question 8: The theoretical justification for extremizing is based on:
A) Markets always underreact to new information B) Forecasters share common background information that is over-counted in simple averaging C) Statistical models are always underconfident D) The law of large numbers
Answer: B
Explanation: When forecasters share common background information, simple averaging (or even logarithmic pooling) over-counts the shared component. Each forecaster's probability already reflects the shared information, so averaging gives that shared evidence full weight while diluting each forecaster's unique evidence by a factor of $1/K$, leaving the aggregate less extreme than the combined evidence warrants. Extremizing corrects for this by pushing the aggregate toward what the full unique evidence would imply.
Question 9: In stacking (stacked generalization), what is the role of the meta-learner?
A) It trains the base models B) It selects which base model to use for each prediction C) It learns how to optimally combine base model predictions D) It generates additional training data for the base models
Answer: C
Explanation: The meta-learner (Level 1 model) takes the predictions of the base models (Level 0) as inputs and learns how to combine them optimally. It can learn non-linear combination functions and context-dependent weighting, going beyond simple weighted averaging.
Question 10: Cross-validated stacking is necessary because:
A) It speeds up computation B) It prevents information leakage between base model training and meta-learner training C) It ensures all models use the same features D) It reduces the number of base models needed
Answer: B
Explanation: If the meta-learner were trained on the same data used to train the base models, the base model predictions would be overfit to the training data, and the meta-learner would learn to trust these overfit predictions. Cross-validated stacking generates out-of-fold predictions from the base models, ensuring the meta-learner sees predictions made on data the base models did not train on.
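A minimal scikit-learn sketch of this workflow (the dataset, base models, and logistic meta-learner are illustrative choices, not prescribed by the chapter):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=200, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-learner trains only on out-of-fold predictions, avoiding leakage.
meta = LogisticRegression().fit(oof, y)
print(meta.coef_)  # learned combination coefficients
```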
Question 11: In Bayesian Model Averaging, the marginal likelihood $P(D|M_k)$ automatically penalizes model complexity because:
A) It uses regularization B) It integrates the likelihood over the prior on parameters, spreading probability mass across the parameter space C) It always selects the simplest model D) It uses cross-validation internally
Answer: B
Explanation: The marginal likelihood integrates the likelihood over the prior distribution of parameters: $P(D|M_k) = \int P(D|\theta_k, M_k) P(\theta_k|M_k) d\theta_k$. Complex models with many parameters spread their prior mass across a large parameter space. Unless the data strongly supports the additional parameters, this spreading reduces the marginal likelihood. This is the Bayesian Occam's razor.
Question 12: The BIC approximation to the marginal likelihood penalizes complexity via which term?
A) $\sqrt{n}$ times the number of parameters B) $d_k \log n$ where $d_k$ is the number of parameters and $n$ is the sample size C) $d_k^2$ where $d_k$ is the number of parameters D) The AIC penalty of $2d_k$
Answer: B
Explanation: BIC = $-2 \log \hat{L}_k + d_k \log n$. The penalty term $d_k \log n$ penalizes each additional parameter by $\log n$, which grows with sample size. This is stronger than AIC's penalty of $2d_k$ for large samples and captures the spirit of the marginal likelihood's complexity penalty.
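A minimal sketch of turning BIC values into approximate posterior model weights (the log-likelihoods, parameter counts, and sample size are illustrative):

```python
import numpy as np

def bic(log_lik, n_params, n_obs):
    # BIC = -2 log L + d * log n
    return -2 * log_lik + n_params * np.log(n_obs)

# Model 2 fits better (higher log-likelihood) but carries 5 extra parameters.
bics = np.array([bic(-520.0, 3, 400), bic(-515.0, 8, 400)])

# exp(-BIC/2) approximates the marginal likelihood up to a common constant,
# so normalizing gives approximate BMA weights.
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()
print(bics.round(1), weights.round(4))  # the simpler model dominates here
```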
Question 13: When combining your model's forecast with a prediction market price, the optimal weight on your model is higher when:
A) The market has high liquidity B) Your model uses newer data not yet reflected in the price C) Many traders are participating in the market D) The market has been running for a long time
Answer: B
Explanation: You should weight your model more heavily when it has an informational advantage over the market. If your model incorporates data that has not yet been reflected in the market price (such as a recently released poll or economic indicator), it provides information beyond what the market already aggregates. High liquidity, many traders, and market maturity all tend to make the market price more informative, favoring higher weight on the market.
Question 14: The trimmed mean is preferred over the simple mean when:
A) All forecasters are equally skilled B) There are outlier forecasters who give extreme or unreliable predictions C) You want the most extreme possible aggregate D) You have fewer than 5 forecasters
Answer: B
Explanation: The trimmed mean removes the most extreme forecasts from each end before computing the average. This makes it robust to outlier forecasters who may be trolls, poorly informed, or using a broken model. With all equally skilled forecasters, the simple mean is typically adequate. The trimmed mean requires enough forecasters to make trimming meaningful.
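A minimal sketch using SciPy (the forecasts below are illustrative, with one outlier):

```python
from scipy.stats import trim_mean

forecasts = [0.62, 0.58, 0.65, 0.60, 0.59, 0.63, 0.61, 0.64, 0.57, 0.99]

print(sum(forecasts) / len(forecasts))            # ~0.648: pulled up by the 0.99 outlier
print(trim_mean(forecasts, proportiontocut=0.1))  # ~0.615: top and bottom 10% removed
```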
Question 15: The ambiguity decomposition states that:
A) Ensemble MSE = Average Individual MSE + Diversity B) Ensemble MSE = Average Individual MSE - Diversity C) Ensemble MSE = Best Individual MSE - Diversity D) Ensemble MSE = Worst Individual MSE / Diversity
Answer: B
Explanation: The ambiguity decomposition (Krogh and Vedelsby, 1995) states: $\text{MSE}_{\text{ensemble}} = \overline{\text{MSE}} - \overline{\text{Diversity}}$. This means the ensemble error is always less than or equal to the average individual error, and the gap is exactly the diversity (average disagreement among models). Higher diversity = greater ensemble improvement.
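The identity is easy to verify numerically (a minimal sketch with simulated targets and noisy models):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=500)                           # targets
preds = y + rng.normal(scale=0.5, size=(5, 500))   # 5 noisy models
ens = preds.mean(axis=0)                           # equal-weight ensemble

avg_mse = np.mean((preds - y) ** 2)      # average individual MSE
diversity = np.mean((preds - ens) ** 2)  # average squared disagreement with the ensemble
ens_mse = np.mean((ens - y) ** 2)

print(ens_mse, avg_mse - diversity)      # the two quantities coincide
```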
Question 16: Which strategy is LEAST effective for creating model diversity?
A) Using different learning algorithms B) Using different random seeds for the same algorithm with identical hyperparameters C) Using different feature subsets D) Using fundamentally different data sources
Answer: B
Explanation: Using different random seeds for the same algorithm with identical hyperparameters typically produces models with highly correlated errors. While there is some randomness (in initialization, bootstrap sampling, etc.), the models use the same algorithm, features, and hyperparameters, so they tend to make similar mistakes. Using different algorithms, features, or data sources creates much more meaningful diversity.
Question 17: When adding a new model to an existing ensemble, the marginal value of the new model is approximately proportional to:
A) The new model's accuracy alone B) $(1 - \bar{\rho})$ where $\bar{\rho}$ is the average error correlation with existing models C) The number of existing models D) The computational cost of the new model
Answer: B
Explanation: The marginal value of a new model depends critically on its error correlation with existing ensemble members. A new model that is highly correlated with existing models ($\bar{\rho}$ near 1) adds almost no value, while a new model with low correlation ($\bar{\rho}$ near 0) provides substantial variance reduction. The formula is approximately $\Delta \text{MSE} \approx \frac{(1 - \bar{\rho})\sigma^2}{(K+1)^2}$.
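The diminishing-with-correlation pattern follows directly from the ensemble variance formula (a minimal sketch; $\sigma^2$ and $K$ are illustrative, and the new model is assumed to share the same variance and correlation as the existing members):

```python
# A minimal sketch: variance reduction from growing an equal-weight ensemble
# from K to K+1 models, as a function of the common error correlation rho.
def ens_var(sigma2, rho, k):
    return sigma2 / k * (1 + (k - 1) * rho)

sigma2, k = 0.10, 5
for rho in (0.9, 0.5, 0.1):
    gain = ens_var(sigma2, rho, k) - ens_var(sigma2, rho, k + 1)
    print(rho, round(gain, 5))  # the gain grows as correlation falls
```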
Question 18: If the logistic recalibration of an averaged forecast gives $\beta_0 = 0.1$ and $\beta_1 = 1.8$, this suggests:
A) The average forecast is well calibrated and no extremizing is needed B) The average forecast is slightly biased downward and too moderate; extremizing with $d = 1.8$ and a small bias correction would help C) The average forecast is too extreme and should be moderated D) The individual models should be discarded
Answer: B
Explanation: $\beta_1 = 1.8 > 1$ indicates the average forecast is too moderate and should be extremized by a factor of 1.8. $\beta_0 = 0.1 > 0$ indicates a slight downward bias (the observed frequency runs a little higher than the aggregate predicts), which the intercept corrects. Together, the recalibration pushes the average away from 0.5 and applies a small upward correction.
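Applying the fitted coefficients is a one-liner in log-odds space (a sketch; the example inputs are illustrative):

```python
import numpy as np

beta0, beta1 = 0.1, 1.8  # fitted recalibration coefficients from the question

def recalibrate(p_bar):
    z = np.log(p_bar / (1 - p_bar))                  # logit of the averaged forecast
    return 1 / (1 + np.exp(-(beta0 + beta1 * z)))    # inverse logit of beta0 + beta1 * logit

for p in (0.3, 0.5, 0.7):
    print(p, round(recalibrate(p), 3))  # 0.3 -> ~0.194, 0.5 -> ~0.525, 0.7 -> ~0.836
```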
Question 19: In the supra-Bayesian framework for combining forecasts, each forecaster's probability is treated as:
A) A fixed, known quantity B) Data to be used for Bayesian updating C) An error to be minimized D) A vote to be counted
Answer: B
Explanation: The supra-Bayesian approach treats each forecaster's reported probability as a piece of data (an observation). The decision maker has a prior over the true state and updates it by computing the likelihood of observing each reported probability conditional on the true state. This provides a principled Bayesian framework for opinion aggregation.
Question 20: What is the typical range for the optimal extremizing factor $d$ when applied to a linear pool aggregate?
A) $d \in [0.1, 0.5]$ B) $d \in [0.5, 1.0]$ C) $d \in [1.5, 3.0]$ D) $d \in [5.0, 10.0]$
Answer: C
Explanation: For linear pool aggregates (simple averages of probabilities), the optimal extremizing factor typically falls in the range $d \in [1.5, 3.0]$. Values in this range push the average forecast away from 0.5 sufficiently to correct for the shared-information problem. Values below 1 would moderate the forecast (anti-extremizing), and values much above 3 risk overconfidence.
Question 21: Which of these is a property of the logarithmic pool but NOT the linear pool?
A) Output is always a valid probability B) Can produce forecasts more extreme than any individual input C) Preserves marginal calibration D) Is a weighted average of probabilities
Answer: B
Explanation: The logarithmic pool can produce forecasts more extreme than any individual input because it combines log-odds, and the weighted sum of log-odds can exceed any individual log-odds value. The linear pool is bounded by the range of individual forecasts (it is a convex combination). Both produce valid probabilities. Marginal calibration preservation is a property of the linear pool.
Question 22: You have an ensemble of 3 models. Removing Model A increases the ensemble Brier score by 0.008. Removing Model B increases it by 0.001. Removing Model C increases it by 0.012. Which model has the highest marginal value?
A) Model A B) Model B C) Model C D) Cannot be determined
Answer: C
Explanation: The marginal value of a model is measured by how much the ensemble performance degrades when that model is removed. Model C's removal causes the largest increase in Brier score (0.012), indicating it contributes the most unique information to the ensemble. Model B contributes the least (0.001) and might be a candidate for removal to simplify the ensemble.
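A minimal sketch of this leave-one-out check (forecasts and outcomes are simulated for illustration):

```python
import numpy as np

def brier(p, y):
    return np.mean((p - y) ** 2)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)                                      # binary outcomes
preds = np.clip(y + rng.normal(0.0, 0.4, size=(3, 200)), 0.01, 0.99)  # 3 probability forecasters

full = brier(preds.mean(axis=0), y)
for k in range(preds.shape[0]):
    reduced = brier(np.delete(preds, k, axis=0).mean(axis=0), y)
    print(f"model {k}: Brier increase when removed = {reduced - full:+.4f}")
```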
Question 23: When should you stop adding models to an ensemble?
A) After exactly 5 models B) When the marginal improvement falls below a meaningful threshold and new models are highly correlated with existing ones C) When you run out of algorithms to try D) Never; more models are always better
Answer: B
Explanation: The optimal ensemble size balances accuracy gains against diminishing returns and added complexity. You should stop when: (1) the marginal improvement in cross-validated performance is negligible, (2) new candidate models have high error correlation with existing members, or (3) the computational/maintenance cost outweighs the accuracy gain. More models are not always better, especially when they are correlated.
Question 24: A prediction market shows 0.55 for a contract, and your calibrated model gives 0.72. The market has thin liquidity (few trades). What is the BEST approach?
A) Ignore the market entirely and use your model B) Ignore your model and use the market price C) Combine them with higher weight on your model, reflecting the market's thin liquidity D) Average them equally
Answer: C
Explanation: Thin liquidity reduces the market's informational content because fewer traders have expressed their views and the price may not reflect all available information. Your calibrated model may contain information not yet in the market. The best approach combines both, but weights the model more heavily given the market's low liquidity. Ignoring either source entirely wastes information.
Question 25: The Q-statistic between two models equals 0. This indicates:
A) The models are perfectly positively dependent (always agree) B) The models are statistically independent (errors are uncorrelated) C) One model is strictly better than the other D) The models should not be combined
Answer: B
Explanation: Yule's Q-statistic measures the association between two models' correct/incorrect predictions. $Q = 0$ indicates statistical independence: knowing whether one model is correct provides no information about whether the other is correct. This is ideal for ensemble diversity; independent models provide maximum variance reduction when combined. $Q = 1$ would indicate perfect positive dependence.
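A minimal sketch of the computation from a 2x2 table of correct/incorrect outcomes (the counts are illustrative):

```python
def yule_q(n11, n00, n10, n01):
    """n11: both correct, n00: both wrong, n10/n01: exactly one correct."""
    return (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)

print(yule_q(60, 15, 10, 15))  # ~0.71: the two models tend to be right (and wrong) together
print(yule_q(40, 10, 20, 20))  # 0.0: one model's correctness says nothing about the other's
```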