Case Study 1: XGBoost vs Neural Net — Predicting Market Outcomes

Overview

In this case study, we build and rigorously compare two ML models — XGBoost and a PyTorch neural network — for predicting binary outcomes in a prediction market. We follow the full pipeline: data preparation, model training, calibration, evaluation, and SHAP interpretability analysis. At the end, we pick a winner based on multiple criteria relevant to live trading.

Scenario

A prediction market trader wants to develop a systematic model for predicting the outcomes of political event markets. The trader has assembled a dataset of 3,000 historical events with 12 features each, including polling data, economic indicators, sentiment scores, and market-derived features. Events span a 10-year period, and outcomes are binary (event occurred = 1, did not occur = 0).

The trader wants to determine: Should I use XGBoost or a neural network for this problem?

Dataset Description

The dataset contains:

Feature              Description                                   Type
-------              -----------                                   ----
approval_rating      Incumbent approval rating (%)                 Continuous
gdp_growth           Quarter-over-quarter GDP growth (%)           Continuous
unemployment         Unemployment rate (%)                         Continuous
inflation            Year-over-year inflation rate (%)             Continuous
challenger_quality   Challenger quality score (0-10)               Continuous
months_to_event      Months until event resolution                 Discrete
polling_margin       Polling margin (+ favors incumbent)           Continuous
fundraising_ratio    Incumbent/challenger fundraising ratio        Continuous
sentiment_score      NLP sentiment from news articles              Continuous
market_price         Current market contract price                 Continuous
volatility_30d       30-day rolling volatility of market price     Continuous
volume_rank          Trading volume percentile rank                Continuous

The base rate of positive outcomes is approximately 45%, so the data is reasonably balanced.

Step 1: Data Preparation

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import brier_score_loss, log_loss
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 3000

# Generate synthetic features with plausible ranges
data = pd.DataFrame({
    'approval_rating': np.random.normal(48, 10, n),
    'gdp_growth': np.random.normal(2.3, 1.8, n),
    'unemployment': np.random.normal(5.5, 1.5, n),
    'inflation': np.random.normal(2.5, 1.2, n),
    'challenger_quality': np.random.uniform(1, 10, n),
    'months_to_event': np.random.randint(1, 24, n),
    'polling_margin': np.random.normal(0, 6, n),
    'fundraising_ratio': np.random.lognormal(0, 0.5, n),
    'sentiment_score': np.random.normal(0, 1, n),
    'market_price': np.random.beta(5, 5, n),
    'volatility_30d': np.random.exponential(0.05, n),
    'volume_rank': np.random.uniform(0, 1, n),
})

# Generate outcomes with nonlinear relationships; months_to_event and
# volume_rank are deliberately excluded from the logit and act as noise features
logit = (
    0.04 * data['approval_rating']
    + 0.25 * data['gdp_growth']
    - 0.15 * data['unemployment']
    - 0.08 * data['inflation']
    - 0.12 * data['challenger_quality']
    + 0.18 * data['polling_margin']
    + 0.15 * np.log(data['fundraising_ratio'] + 0.01)
    + 0.20 * data['sentiment_score']
    + 0.5 * data['market_price']
    - 0.8 * data['volatility_30d']
    # Nonlinear interaction
    + 0.003 * data['approval_rating'] * data['gdp_growth']
    # Threshold effect
    + 0.5 * (data['polling_margin'] > 5).astype(float)
    - 3.5
)
prob = 1 / (1 + np.exp(-logit))
data['outcome'] = np.random.binomial(1, prob)

print(f"Dataset shape: {data.shape}")
print(f"Outcome distribution: {data['outcome'].value_counts().to_dict()}")
print(f"Base rate: {data['outcome'].mean():.3f}")

# Temporal split (rows are treated as chronologically ordered): 70% train, 15% validation, 15% test
n_train = 2100
n_val = 450
feature_cols = [c for c in data.columns if c != 'outcome']

train = data.iloc[:n_train]
val = data.iloc[n_train:n_train + n_val]
test = data.iloc[n_train + n_val:]

X_train, y_train = train[feature_cols], train['outcome']
X_val, y_val = val[feature_cols], val['outcome']
X_test, y_test = test[feature_cols], test['outcome']

print(f"\nTrain: {len(train)}, Val: {len(val)}, Test: {len(test)}")

Step 2: Train XGBoost

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_cols)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=feature_cols)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_cols)

xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 5,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 10,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'gamma': 0.1,
    'seed': 42,
}

xgb_model = xgb.train(
    xgb_params, dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=50,
    verbose_eval=100
)

# Predict using the best early-stopping round (explicit, for portability across XGBoost versions)
best_rounds = xgb_model.best_iteration + 1
xgb_probs_val = xgb_model.predict(dval, iteration_range=(0, best_rounds))
xgb_probs_test = xgb_model.predict(dtest, iteration_range=(0, best_rounds))

print(f"\nXGBoost Results:")
print(f"  Best iteration: {xgb_model.best_iteration}")
print(f"  Val Brier Score: {brier_score_loss(y_val, xgb_probs_val):.4f}")
print(f"  Val Log Loss: {log_loss(y_val, xgb_probs_val):.4f}")
print(f"  Test Brier Score: {brier_score_loss(y_test, xgb_probs_test):.4f}")
print(f"  Test Log Loss: {log_loss(y_test, xgb_probs_test):.4f}")

Step 3: Train Neural Network

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Scale features for neural network
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Convert to tensors
X_train_t = torch.FloatTensor(X_train_scaled)
y_train_t = torch.FloatTensor(y_train.values).unsqueeze(1)
X_val_t = torch.FloatTensor(X_val_scaled)
y_val_t = torch.FloatTensor(y_val.values).unsqueeze(1)
X_test_t = torch.FloatTensor(X_test_scaled)
y_test_t = torch.FloatTensor(y_test.values).unsqueeze(1)

train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

class PredictionNet(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.net(x)

nn_model = PredictionNet(len(feature_cols))
criterion = nn.BCELoss()  # pairs with the Sigmoid output layer; BCEWithLogitsLoss on raw logits is the more numerically stable alternative
optimizer = optim.Adam(nn_model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.5)

# Training loop
best_val_loss = float('inf')
patience_counter = 0
train_losses = []
val_losses = []

for epoch in range(300):
    nn_model.train()
    epoch_loss = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        out = nn_model(X_batch)
        loss = criterion(out, y_batch)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * len(X_batch)
    train_loss = epoch_loss / len(train_dataset)
    train_losses.append(train_loss)

    nn_model.eval()
    with torch.no_grad():
        val_out = nn_model(X_val_t)
        val_loss = criterion(val_out, y_val_t).item()
    val_losses.append(val_loss)
    scheduler.step(val_loss)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(nn_model.state_dict(), 'best_nn_cs1.pt')
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= 25:
            print(f"Early stopping at epoch {epoch}")
            break

    if epoch % 50 == 0:
        print(f"Epoch {epoch}: Train={train_loss:.4f}, Val={val_loss:.4f}")

# Load best model
nn_model.load_state_dict(torch.load('best_nn_cs1.pt'))
nn_model.eval()
with torch.no_grad():
    nn_probs_val = nn_model(X_val_t).numpy().flatten()
    nn_probs_test = nn_model(X_test_t).numpy().flatten()

print(f"\nNeural Network Results:")
print(f"  Val Brier Score: {brier_score_loss(y_val, nn_probs_val):.4f}")
print(f"  Val Log Loss: {log_loss(y_val, nn_probs_val):.4f}")
print(f"  Test Brier Score: {brier_score_loss(y_test, nn_probs_test):.4f}")
print(f"  Test Log Loss: {log_loss(y_test, nn_probs_test):.4f}")

Step 4: Plot Training Curves

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(train_losses, label='Training Loss', alpha=0.7)
ax.plot(val_losses, label='Validation Loss', alpha=0.7)
ax.set_xlabel('Epoch')
ax.set_ylabel('BCE Loss')
ax.set_title('Neural Network Training Curves')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('nn_training_curves.png', dpi=150)
plt.show()

Step 5: Calibrate Both Models

from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

def calibrate_platt(probs_cal, y_cal, probs_apply):
    # Unregularized logistic regression (C=1e10 effectively disables the penalty).
    # Canonical Platt scaling fits the sigmoid to scores/log-odds; fitting
    # directly on probabilities, as here, is a common shortcut.
    lr = LogisticRegression(C=1e10, solver='lbfgs')
    lr.fit(probs_cal.reshape(-1, 1), y_cal)
    return lr.predict_proba(probs_apply.reshape(-1, 1))[:, 1], lr

def calibrate_isotonic(probs_cal, y_cal, probs_apply):
    ir = IsotonicRegression(out_of_bounds='clip')
    ir.fit(probs_cal, y_cal)
    return ir.predict(probs_apply), ir

# Calibrate XGBoost
xgb_platt_test, xgb_platt_model = calibrate_platt(
    xgb_probs_val, y_val.values, xgb_probs_test
)
xgb_iso_test, xgb_iso_model = calibrate_isotonic(
    xgb_probs_val, y_val.values, xgb_probs_test
)

# Calibrate Neural Network
nn_platt_test, nn_platt_model = calibrate_platt(
    nn_probs_val, y_val.values, nn_probs_test
)
nn_iso_test, nn_iso_model = calibrate_isotonic(
    nn_probs_val, y_val.values, nn_probs_test
)

# Print calibrated results
print("=" * 60)
print("CALIBRATION RESULTS (Test Set)")
print("=" * 60)
for name, probs in [
    ('XGBoost (raw)', xgb_probs_test),
    ('XGBoost (Platt)', xgb_platt_test),
    ('XGBoost (Isotonic)', xgb_iso_test),
    ('Neural Net (raw)', nn_probs_test),
    ('Neural Net (Platt)', nn_platt_test),
    ('Neural Net (Isotonic)', nn_iso_test),
]:
    brier = brier_score_loss(y_test, probs)
    ll = log_loss(y_test, np.clip(probs, 1e-7, 1-1e-7))
    print(f"  {name:25s}  Brier={brier:.4f}  LogLoss={ll:.4f}")

Step 6: Reliability Diagrams

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# XGBoost calibration
ax = axes[0]
ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
for name, probs in [
    ('Raw', xgb_probs_test),
    ('Platt', xgb_platt_test),
    ('Isotonic', xgb_iso_test),
]:
    frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    ax.plot(mean_pred, frac_pos, 's-', label=name)
ax.set_title('XGBoost Calibration')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Observed Frequency')
ax.legend()
ax.grid(True, alpha=0.3)

# Neural Net calibration
ax = axes[1]
ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
for name, probs in [
    ('Raw', nn_probs_test),
    ('Platt', nn_platt_test),
    ('Isotonic', nn_iso_test),
]:
    frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    ax.plot(mean_pred, frac_pos, 's-', label=name)
ax.set_title('Neural Network Calibration')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Observed Frequency')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('calibration_comparison.png', dpi=150)
plt.show()

Step 7: SHAP Analysis

import shap

# XGBoost SHAP
xgb_explainer = shap.TreeExplainer(xgb_model)
xgb_shap = xgb_explainer.shap_values(X_test)

print("\nXGBoost SHAP — Mean |SHAP| per feature:")
mean_abs_shap = np.abs(xgb_shap).mean(axis=0)
shap_ranking = pd.Series(mean_abs_shap, index=feature_cols).sort_values(ascending=False)
print(shap_ranking.to_string())

# SHAP Summary Plot
shap.summary_plot(xgb_shap, X_test, feature_names=feature_cols, show=False)
plt.title('XGBoost SHAP Summary')
plt.tight_layout()
plt.savefig('xgb_shap_summary.png', dpi=150)
plt.show()

# SHAP Dependence Plot for top feature
top_feature = shap_ranking.index[0]
shap.dependence_plot(top_feature, xgb_shap, X_test,
                     feature_names=feature_cols, show=False)
plt.title(f'XGBoost SHAP Dependence: {top_feature}')
plt.tight_layout()
plt.savefig(f'xgb_shap_dep_{top_feature}.png', dpi=150)
plt.show()

# Individual prediction explanations
high_prob_idx = np.argmax(xgb_probs_test)
low_prob_idx = np.argmin(xgb_probs_test)

print(f"\nHigh probability prediction (idx={high_prob_idx}):")
print(f"  Predicted probability: {xgb_probs_test[high_prob_idx]:.4f}")
print(f"  Actual outcome: {y_test.iloc[high_prob_idx]}")
print(f"  SHAP contributions:")
for feat, val in zip(feature_cols, xgb_shap[high_prob_idx]):
    if abs(val) > 0.01:
        print(f"    {feat}: {val:+.4f}")

print(f"\nLow probability prediction (idx={low_prob_idx}):")
print(f"  Predicted probability: {xgb_probs_test[low_prob_idx]:.4f}")
print(f"  Actual outcome: {y_test.iloc[low_prob_idx]}")
print(f"  SHAP contributions:")
for feat, val in zip(feature_cols, xgb_shap[low_prob_idx]):
    if abs(val) > 0.01:
        print(f"    {feat}: {val:+.4f}")

Step 8: Head-to-Head Comparison

from scipy import stats

print("\n" + "=" * 70)
print("HEAD-TO-HEAD COMPARISON: XGBoost vs Neural Network")
print("=" * 70)

# Use the Platt-calibrated version of each model. In practice, choose Platt vs
# isotonic on held-out data (see the half-split check at the end of Step 5);
# the validation set itself cannot arbitrate fairly, since the calibrators were fit on it.
xgb_best = xgb_platt_test
nn_best = nn_platt_test

# Metrics
metrics = {}
for name, probs in [('XGBoost+Platt', xgb_best), ('NeuralNet+Platt', nn_best)]:
    probs_c = np.clip(probs, 1e-7, 1-1e-7)
    brier = brier_score_loss(y_test, probs_c)
    ll = log_loss(y_test, probs_c)

    # Expected Calibration Error (ECE): bin-weighted mean |observed rate - mean predicted probability|
    n_bins = 10
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for j in range(n_bins):
        mask = (probs_c >= bin_edges[j]) & (probs_c < bin_edges[j+1])
        if mask.sum() > 0:
            ece += mask.sum() / len(y_test) * abs(
                y_test.values[mask].mean() - probs_c[mask].mean()
            )

    metrics[name] = {'Brier': brier, 'LogLoss': ll, 'ECE': ece}

metrics_df = pd.DataFrame(metrics).T
print(metrics_df.to_string())

# Statistical significance test
eps = 1e-7
loss_xgb = -(y_test.values * np.log(np.clip(xgb_best, eps, 1-eps)) +
             (1-y_test.values) * np.log(np.clip(1-xgb_best, eps, 1-eps)))
loss_nn = -(y_test.values * np.log(np.clip(nn_best, eps, 1-eps)) +
            (1-y_test.values) * np.log(np.clip(1-nn_best, eps, 1-eps)))

t_stat, p_value = stats.ttest_rel(loss_xgb, loss_nn)
print(f"\nPaired t-test (per-instance log-loss):")
print(f"  t-statistic: {t_stat:.3f}")
print(f"  p-value: {p_value:.4f}")

mean_diff = (loss_xgb - loss_nn).mean()
if p_value < 0.05:
    winner = 'Neural Network' if mean_diff > 0 else 'XGBoost'
    print(f"  Significant difference: {winner} is better (p={p_value:.4f})")
else:
    print(f"  No statistically significant difference (p={p_value:.4f})")

Step 9: Final Decision

print("\n" + "=" * 70)
print("DECISION MATRIX")
print("=" * 70)

criteria = {
    'Predictive Performance (Brier)': 'See metrics above',
    'Calibration (ECE)': 'See metrics above',
    'Training Speed': 'XGBoost: ~5 sec, Neural Net: ~30 sec',
    'Inference Speed': 'XGBoost: <1ms, Neural Net: ~2ms',
    'Interpretability': 'XGBoost: Excellent (exact TreeSHAP), Neural Net: Approximate SHAP only',
    'Ease of Tuning': 'XGBoost: Moderate, Neural Net: Harder',
    'Robustness': 'XGBoost: No feature scaling needed, Neural Net: Requires scaling',
    'Deployment Simplicity': 'XGBoost: Simple JSON, Neural Net: Requires PyTorch runtime',
}

for criterion, note in criteria.items():
    print(f"  {criterion}: {note}")

print("\nRECOMMENDATION:")
print("-" * 40)
print("For this dataset (3,000 samples, 12 tabular features, binary outcome):")
print("  XGBoost is the recommended choice.")
print()
print("Rationale:")
print("  1. Comparable or better predictive performance")
print("  2. Superior interpretability via TreeSHAP")
print("  3. Faster training and inference")
print("  4. Simpler deployment (no GPU/PyTorch dependency)")
print("  5. More robust to feature scaling issues")
print()
print("When the neural network would win:")
print("  - Dataset size > 50,000 samples")
print("  - Multi-modal data (text + tabular)")
print("  - Need for transfer learning")
print("  - Multi-output prediction (multiple related markets)")

Key Takeaways

  1. For tabular prediction market data with moderate sample sizes, XGBoost typically matches or exceeds neural network performance. This is consistent with the broader ML literature on tabular data.

  2. Calibration is essential for both models. Neither raw XGBoost nor raw neural network outputs are well-calibrated enough for direct use in trading. Platt scaling provides a simple, effective fix.

  3. SHAP interpretability is a major advantage of tree-based models. TreeSHAP is exact, fast, and provides deep insight into model behavior. Neural network SHAP (via DeepExplainer or KernelExplainer) is slower and approximate; a sketch follows this list.

  4. Statistical testing matters. Even when one model appears better on point estimates, the difference may not be statistically significant. The paired t-test on per-instance losses provides a rigorous comparison; a nonparametric companion test is sketched after this list.

  5. The decision should weigh multiple criteria beyond raw performance. Training speed, deployment complexity, interpretability, and robustness all matter for a production trading system.
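
For completeness, here is a hedged sketch of what SHAP analysis of the PyTorch model above could look like, using shap.DeepExplainer with a small background sample. Unlike TreeSHAP it is approximate and slower, and supported operations and return shapes vary across shap versions:

import shap

# DeepExplainer needs a background dataset; a small training sample keeps it fast
nn_model.eval()  # BatchNorm/Dropout must be in inference mode
background = X_train_t[:100]
deep_explainer = shap.DeepExplainer(nn_model, background)

n_explain = 200  # explain a subset of the test set for speed
nn_shap = deep_explainer.shap_values(X_test_t[:n_explain])

# Depending on the shap version the result is a list, (n, f), or (n, f, 1);
# the reshape normalizes all three cases to (n, f)
nn_shap_arr = np.asarray(nn_shap).reshape(n_explain, len(feature_cols))
nn_ranking = pd.Series(np.abs(nn_shap_arr).mean(axis=0),
                       index=feature_cols).sort_values(ascending=False)
print("Neural Net mean |SHAP| per feature (DeepExplainer):")
print(nn_ranking.to_string())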
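
And as a companion to the paired t-test in Step 8, the Wilcoxon signed-rank test drops the normality assumption by ranking the per-event loss differences. A minimal sketch reusing loss_xgb and loss_nn from Step 8:

from scipy import stats

# Nonparametric paired comparison of per-event log losses
w_stat, w_p = stats.wilcoxon(loss_xgb, loss_nn)
print(f"Wilcoxon signed-rank: statistic={w_stat:.1f}, p-value={w_p:.4f}")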