Case Study 1: XGBoost vs Neural Net — Predicting Market Outcomes
Overview
In this case study, we build and rigorously compare two ML models — XGBoost and a PyTorch neural network — for predicting binary outcomes in a prediction market. We follow the full pipeline: data preparation, model training, calibration, evaluation, and SHAP interpretability analysis. At the end, we pick a winner based on multiple criteria relevant to live trading.
Scenario
A prediction market trader wants to develop a systematic model for predicting the outcomes of political event markets. The trader has assembled a dataset of 3,000 historical events with 12 features each, including polling data, economic indicators, sentiment scores, and market-derived features. Events span a 10-year period, and outcomes are binary (event occurred = 1, did not occur = 0).
The trader wants to determine: Should I use XGBoost or a neural network for this problem?
Dataset Description
The dataset contains:
| Feature | Description | Type |
|---|---|---|
| approval_rating | Incumbent approval rating (%) | Continuous |
| gdp_growth | Quarter-over-quarter GDP growth (%) | Continuous |
| unemployment | Unemployment rate (%) | Continuous |
| inflation | Year-over-year inflation rate (%) | Continuous |
| challenger_quality | Challenger quality score (0-10) | Continuous |
| months_to_event | Months until event resolution | Continuous |
| polling_margin | Polling margin (+ favors incumbent) | Continuous |
| fundraising_ratio | Incumbent/challenger fundraising ratio | Continuous |
| sentiment_score | NLP sentiment from news articles | Continuous |
| market_price | Current market contract price | Continuous |
| volatility_30d | 30-day rolling volatility of market price | Continuous |
| volume_rank | Trading volume percentile rank | Continuous |
The base rate of positive outcomes is approximately 45%, so the data is reasonably balanced.
Step 1: Data Preparation
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import brier_score_loss, log_loss
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
n = 3000
# Simulate realistic features (a synthetic stand-in for the trader's 3,000 historical events)
data = pd.DataFrame({
'approval_rating': np.random.normal(48, 10, n),
'gdp_growth': np.random.normal(2.3, 1.8, n),
'unemployment': np.random.normal(5.5, 1.5, n),
'inflation': np.random.normal(2.5, 1.2, n),
'challenger_quality': np.random.uniform(1, 10, n),
'months_to_event': np.random.randint(1, 24, n),
'polling_margin': np.random.normal(0, 6, n),
'fundraising_ratio': np.random.lognormal(0, 0.5, n),
'sentiment_score': np.random.normal(0, 1, n),
'market_price': np.random.beta(5, 5, n),
'volatility_30d': np.random.exponential(0.05, n),
'volume_rank': np.random.uniform(0, 1, n),
})
# Generate outcomes with nonlinear relationships
logit = (
0.04 * data['approval_rating']
+ 0.25 * data['gdp_growth']
- 0.15 * data['unemployment']
- 0.08 * data['inflation']
- 0.12 * data['challenger_quality']
+ 0.18 * data['polling_margin']
+ 0.15 * np.log(data['fundraising_ratio'] + 0.01)
+ 0.20 * data['sentiment_score']
+ 0.5 * data['market_price']
- 0.8 * data['volatility_30d']
# Nonlinear interaction
+ 0.003 * data['approval_rating'] * data['gdp_growth']
# Threshold effect
+ 0.5 * (data['polling_margin'] > 5).astype(float)
    - 1.7  # intercept roughly targeting the ~45% base rate described above
)
prob = 1 / (1 + np.exp(-logit))
data['outcome'] = np.random.binomial(1, prob)
print(f"Dataset shape: {data.shape}")
print(f"Outcome distribution: {data['outcome'].value_counts().to_dict()}")
print(f"Base rate: {data['outcome'].mean():.3f}")
# Temporal split: 70% train, 15% validation, 15% test
n_train = 2100
n_val = 450
feature_cols = [c for c in data.columns if c != 'outcome']
train = data.iloc[:n_train]
val = data.iloc[n_train:n_train + n_val]
test = data.iloc[n_train + n_val:]
X_train, y_train = train[feature_cols], train['outcome']
X_val, y_val = val[feature_cols], val['outcome']
X_test, y_test = test[feature_cols], test['outcome']
print(f"\nTrain: {len(train)}, Val: {len(val)}, Test: {len(test)}")
Step 2: Train XGBoost
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_cols)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=feature_cols)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_cols)
xgb_params = {
'objective': 'binary:logistic',
'eval_metric': 'logloss',
'max_depth': 5,
'learning_rate': 0.05,
'subsample': 0.8,
'colsample_bytree': 0.8,
'min_child_weight': 10,
'reg_alpha': 0.1,
'reg_lambda': 1.0,
'gamma': 0.1,
'seed': 42,
}
xgb_model = xgb.train(
xgb_params, dtrain,
num_boost_round=1000,
evals=[(dtrain, 'train'), (dval, 'val')],
early_stopping_rounds=50,
verbose_eval=100
)
# Predict using the best iteration found by early stopping
xgb_probs_val = xgb_model.predict(dval, iteration_range=(0, xgb_model.best_iteration + 1))
xgb_probs_test = xgb_model.predict(dtest, iteration_range=(0, xgb_model.best_iteration + 1))
print(f"\nXGBoost Results:")
print(f" Best iteration: {xgb_model.best_iteration}")
print(f" Val Brier Score: {brier_score_loss(y_val, xgb_probs_val):.4f}")
print(f" Val Log Loss: {log_loss(y_val, xgb_probs_val):.4f}")
print(f" Test Brier Score: {brier_score_loss(y_test, xgb_probs_test):.4f}")
print(f" Test Log Loss: {log_loss(y_test, xgb_probs_test):.4f}")
Step 3: Train Neural Network
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Scale features for neural network
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
# Convert to tensors
X_train_t = torch.FloatTensor(X_train_scaled)
y_train_t = torch.FloatTensor(y_train.values).unsqueeze(1)
X_val_t = torch.FloatTensor(X_val_scaled)
y_val_t = torch.FloatTensor(y_val.values).unsqueeze(1)
X_test_t = torch.FloatTensor(X_test_scaled)
y_test_t = torch.FloatTensor(y_test.values).unsqueeze(1)
train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
class PredictionNet(nn.Module):
def __init__(self, input_dim):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, 128),
nn.BatchNorm1d(128),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(128, 64),
nn.BatchNorm1d(64),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(64, 32),
nn.BatchNorm1d(32),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(32, 1),
nn.Sigmoid()
)
def forward(self, x):
return self.net(x)
nn_model = PredictionNet(len(feature_cols))
criterion = nn.BCELoss()
optimizer = optim.Adam(nn_model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.5)
# Training loop
best_val_loss = float('inf')
patience_counter = 0
train_losses = []
val_losses = []
for epoch in range(300):
nn_model.train()
epoch_loss = 0
for X_batch, y_batch in train_loader:
optimizer.zero_grad()
out = nn_model(X_batch)
loss = criterion(out, y_batch)
loss.backward()
optimizer.step()
epoch_loss += loss.item() * len(X_batch)
train_loss = epoch_loss / len(train_dataset)
train_losses.append(train_loss)
nn_model.eval()
with torch.no_grad():
val_out = nn_model(X_val_t)
val_loss = criterion(val_out, y_val_t).item()
val_losses.append(val_loss)
scheduler.step(val_loss)
if val_loss < best_val_loss:
best_val_loss = val_loss
torch.save(nn_model.state_dict(), 'best_nn_cs1.pt')
patience_counter = 0
else:
patience_counter += 1
if patience_counter >= 25:
print(f"Early stopping at epoch {epoch}")
break
if epoch % 50 == 0:
print(f"Epoch {epoch}: Train={train_loss:.4f}, Val={val_loss:.4f}")
# Load best model
nn_model.load_state_dict(torch.load('best_nn_cs1.pt'))
nn_model.eval()
with torch.no_grad():
nn_probs_val = nn_model(X_val_t).numpy().flatten()
nn_probs_test = nn_model(X_test_t).numpy().flatten()
print(f"\nNeural Network Results:")
print(f" Val Brier Score: {brier_score_loss(y_val, nn_probs_val):.4f}")
print(f" Val Log Loss: {log_loss(y_val, nn_probs_val):.4f}")
print(f" Test Brier Score: {brier_score_loss(y_test, nn_probs_test):.4f}")
print(f" Test Log Loss: {log_loss(y_test, nn_probs_test):.4f}")
Step 4: Plot Training Curves
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train_losses, label='Training Loss', alpha=0.7)
ax.plot(val_losses, label='Validation Loss', alpha=0.7)
ax.set_xlabel('Epoch')
ax.set_ylabel('BCE Loss')
ax.set_title('Neural Network Training Curves')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('nn_training_curves.png', dpi=150)
plt.show()
Step 5: Calibrate Both Models
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression
def calibrate_platt(probs_cal, y_cal, probs_apply):
    # Logistic recalibration fit on the predicted probabilities.
    # (Classical Platt scaling fits the sigmoid on the model's raw scores / log-odds;
    # fitting on probabilities is a simpler variant that works well in practice.)
    lr = LogisticRegression(C=1e10, solver='lbfgs')
    lr.fit(probs_cal.reshape(-1, 1), y_cal)
    return lr.predict_proba(probs_apply.reshape(-1, 1))[:, 1], lr
def calibrate_isotonic(probs_cal, y_cal, probs_apply):
    # Isotonic regression: a monotonic, piecewise-constant mapping from
    # predicted probability to observed frequency.
    ir = IsotonicRegression(out_of_bounds='clip')
    ir.fit(probs_cal, y_cal)
    return ir.predict(probs_apply), ir
# Calibrate XGBoost
xgb_platt_test, xgb_platt_model = calibrate_platt(
xgb_probs_val, y_val.values, xgb_probs_test
)
xgb_iso_test, xgb_iso_model = calibrate_isotonic(
xgb_probs_val, y_val.values, xgb_probs_test
)
# Calibrate Neural Network
nn_platt_test, nn_platt_model = calibrate_platt(
nn_probs_val, y_val.values, nn_probs_test
)
nn_iso_test, nn_iso_model = calibrate_isotonic(
nn_probs_val, y_val.values, nn_probs_test
)
# Print calibrated results
print("=" * 60)
print("CALIBRATION RESULTS (Test Set)")
print("=" * 60)
for name, probs in [
('XGBoost (raw)', xgb_probs_test),
('XGBoost (Platt)', xgb_platt_test),
('XGBoost (Isotonic)', xgb_iso_test),
('Neural Net (raw)', nn_probs_test),
('Neural Net (Platt)', nn_platt_test),
('Neural Net (Isotonic)', nn_iso_test),
]:
brier = brier_score_loss(y_test, probs)
ll = log_loss(y_test, np.clip(probs, 1e-7, 1-1e-7))
print(f" {name:25s} Brier={brier:.4f} LogLoss={ll:.4f}")
Step 6: Reliability Diagrams
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# XGBoost calibration
ax = axes[0]
ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
for name, probs in [
('Raw', xgb_probs_test),
('Platt', xgb_platt_test),
('Isotonic', xgb_iso_test),
]:
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
ax.plot(mean_pred, frac_pos, 's-', label=name)
ax.set_title('XGBoost Calibration')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Observed Frequency')
ax.legend()
ax.grid(True, alpha=0.3)
# Neural Net calibration
ax = axes[1]
ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
for name, probs in [
('Raw', nn_probs_test),
('Platt', nn_platt_test),
('Isotonic', nn_iso_test),
]:
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
ax.plot(mean_pred, frac_pos, 's-', label=name)
ax.set_title('Neural Network Calibration')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Observed Frequency')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('calibration_comparison.png', dpi=150)
plt.show()
Step 7: SHAP Analysis
import shap
# XGBoost SHAP
xgb_explainer = shap.TreeExplainer(xgb_model)
xgb_shap = xgb_explainer.shap_values(X_test)
print("\nXGBoost SHAP — Mean |SHAP| per feature:")
mean_abs_shap = np.abs(xgb_shap).mean(axis=0)
shap_ranking = pd.Series(mean_abs_shap, index=feature_cols).sort_values(ascending=False)
print(shap_ranking.to_string())
# SHAP Summary Plot
shap.summary_plot(xgb_shap, X_test, feature_names=feature_cols, show=False)
plt.title('XGBoost SHAP Summary')
plt.tight_layout()
plt.savefig('xgb_shap_summary.png', dpi=150)
plt.show()
# SHAP Dependence Plot for top feature
top_feature = shap_ranking.index[0]
shap.dependence_plot(top_feature, xgb_shap, X_test,
feature_names=feature_cols, show=False)
plt.title(f'XGBoost SHAP Dependence: {top_feature}')
plt.tight_layout()
plt.savefig(f'xgb_shap_dep_{top_feature}.png', dpi=150)
plt.show()
# Individual prediction explanations
high_prob_idx = np.argmax(xgb_probs_test)
low_prob_idx = np.argmin(xgb_probs_test)
print(f"\nHigh probability prediction (idx={high_prob_idx}):")
print(f" Predicted probability: {xgb_probs_test[high_prob_idx]:.4f}")
print(f" Actual outcome: {y_test.iloc[high_prob_idx]}")
print(f" SHAP contributions:")
for feat, val in zip(feature_cols, xgb_shap[high_prob_idx]):
if abs(val) > 0.01:
print(f" {feat}: {val:+.4f}")
print(f"\nLow probability prediction (idx={low_prob_idx}):")
print(f" Predicted probability: {xgb_probs_test[low_prob_idx]:.4f}")
print(f" Actual outcome: {y_test.iloc[low_prob_idx]}")
print(f" SHAP contributions:")
for feat, val in zip(feature_cols, xgb_shap[low_prob_idx]):
if abs(val) > 0.01:
print(f" {feat}: {val:+.4f}")
Step 8: Head-to-Head Comparison
from scipy import stats
print("\n" + "=" * 70)
print("HEAD-TO-HEAD COMPARISON: XGBoost vs Neural Network")
print("=" * 70)
# Choose the best calibrated version of each model. Platt is used for both here;
# in practice, select Platt vs. isotonic based on held-out (non-test) performance
# rather than by peeking at test-set scores.
xgb_best = xgb_platt_test
nn_best = nn_platt_test
# Metrics
metrics = {}
for name, probs in [('XGBoost+Platt', xgb_best), ('NeuralNet+Platt', nn_best)]:
probs_c = np.clip(probs, 1e-7, 1-1e-7)
brier = brier_score_loss(y_test, probs_c)
ll = log_loss(y_test, probs_c)
# ECE
n_bins = 10
bin_edges = np.linspace(0, 1, n_bins + 1)
ece = 0.0
for j in range(n_bins):
mask = (probs_c >= bin_edges[j]) & (probs_c < bin_edges[j+1])
if mask.sum() > 0:
ece += mask.sum() / len(y_test) * abs(
y_test.values[mask].mean() - probs_c[mask].mean()
)
metrics[name] = {'Brier': brier, 'LogLoss': ll, 'ECE': ece}
metrics_df = pd.DataFrame(metrics).T
print(metrics_df.to_string())
# Statistical significance test
eps = 1e-7
loss_xgb = -(y_test.values * np.log(np.clip(xgb_best, eps, 1-eps)) +
(1-y_test.values) * np.log(np.clip(1-xgb_best, eps, 1-eps)))
loss_nn = -(y_test.values * np.log(np.clip(nn_best, eps, 1-eps)) +
(1-y_test.values) * np.log(np.clip(1-nn_best, eps, 1-eps)))
t_stat, p_value = stats.ttest_rel(loss_xgb, loss_nn)
print(f"\nPaired t-test (per-instance log-loss):")
print(f" t-statistic: {t_stat:.3f}")
print(f" p-value: {p_value:.4f}")
mean_diff = (loss_xgb - loss_nn).mean()
if p_value < 0.05:
winner = 'Neural Network' if mean_diff > 0 else 'XGBoost'
print(f" Significant difference: {winner} is better (p={p_value:.4f})")
else:
print(f" No statistically significant difference (p={p_value:.4f})")
Step 9: Final Decision
print("\n" + "=" * 70)
print("DECISION MATRIX")
print("=" * 70)
criteria = {
'Predictive Performance (Brier)': 'See metrics above',
'Calibration (ECE)': 'See metrics above',
'Training Speed': 'XGBoost: ~5 sec, Neural Net: ~30 sec',
'Inference Speed': 'XGBoost: <1ms, Neural Net: ~2ms',
'Interpretability': 'XGBoost: Excellent (TreeSHAP), Neural Net: Limited',
'Ease of Tuning': 'XGBoost: Moderate, Neural Net: Harder',
'Robustness': 'XGBoost: No feature scaling needed, Neural Net: Requires scaling',
'Deployment Simplicity': 'XGBoost: Simple JSON, Neural Net: Requires PyTorch runtime',
}
for criterion, note in criteria.items():
print(f" {criterion}: {note}")
print("\nRECOMMENDATION:")
print("-" * 40)
print("For this dataset (3,000 samples, 12 tabular features, binary outcome):")
print(" XGBoost is the recommended choice.")
print()
print("Rationale:")
print(" 1. Comparable or better predictive performance")
print(" 2. Superior interpretability via TreeSHAP")
print(" 3. Faster training and inference")
print(" 4. Simpler deployment (no GPU/PyTorch dependency)")
print(" 5. More robust to feature scaling issues")
print()
print("When the neural network would win:")
print(" - Dataset size > 50,000 samples")
print(" - Multi-modal data (text + tabular)")
print(" - Need for transfer learning")
print(" - Multi-output prediction (multiple related markets)")
Key Takeaways
- For tabular prediction market data with moderate sample sizes, XGBoost typically matches or exceeds neural network performance. This is consistent with the broader ML literature on tabular data.
- Calibration is essential for both models. Neither raw XGBoost nor raw neural network outputs are well calibrated enough for direct use in trading. Platt scaling provides a simple, effective fix.
- SHAP interpretability is a major advantage of tree-based models. TreeSHAP is exact, fast, and provides deep insight into model behavior. Neural network SHAP (via DeepExplainer or KernelExplainer) is slower and approximate.
- Statistical testing matters. Even when one model appears better on point estimates, the difference may not be statistically significant. The paired t-test on per-instance losses provides a rigorous comparison.
- The decision should weigh multiple criteria beyond raw performance. Training speed, deployment complexity, interpretability, and robustness all matter for a production trading system.