Case Study: Evaluating Model Performance Against Market Efficiency
How Good is Your Model? Using Betting Markets as the Ultimate Benchmark
Introduction
You've built an NFL prediction model. It looks good in backtesting. Your accuracy metrics seem impressive. But how do you really know if it's any good?
The answer lies in comparing your model to the most efficient information aggregation system available: the betting market. This case study walks through a comprehensive evaluation of a prediction model against NFL betting lines, using Closing Line Value (CLV) as the primary performance metric.
The Scenario
The Model: An ensemble combining Elo ratings, efficiency metrics, and machine learning features (similar to what we built in Chapters 19-20).
The Test Period: 2021-2023 NFL seasons (816 regular season games)
The Question: Does our model provide any predictive value beyond what the betting market already knows?
Step 1: Gathering the Data
Model Predictions
Our model outputs a predicted point spread for each game:
```
Week 1, 2021: TB @ DAL
Model Prediction: TB -6.2
```
Market Data
We collected:

- Opening lines (Sunday/Monday of the prior week)
- Closing lines (at kickoff)
- Line movements throughout the week
```
Week 1, 2021: TB @ DAL
Opening Line: TB -6.5
Closing Line: TB -7.5
Final Score: TB 31, DAL 29 (TB wins by 2)
```
The Dataset
| Season | Games | Avg Model-Market Diff | Model MAE | Market MAE |
|---|---|---|---|---|
| 2021 | 272 | 2.1 pts | 10.8 pts | 10.2 pts |
| 2022 | 272 | 2.3 pts | 11.2 pts | 10.5 pts |
| 2023 | 272 | 1.9 pts | 10.6 pts | 10.1 pts |
| Total | 816 | 2.1 pts | 10.9 pts | 10.3 pts |
Initial Observation: The market's Mean Absolute Error (10.3) is lower than our model's (10.9), suggesting the market is more accurate overall.
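The MAE comparison can be reproduced in a few lines of NumPy. The arrays below are illustrative stand-ins for the real per-game data, with spreads expressed as projected home-team margins:

```python
import numpy as np

# Illustrative stand-in data: model spreads, closing lines, and actual
# margins (home score minus away score) for four hypothetical games.
model_pred = np.array([-6.2, 3.0, -1.5, 7.0])
market_line = np.array([-6.5, 2.5, -2.0, 6.5])
actual_margin = np.array([-2.0, 10.0, -7.0, 3.0])

model_mae = np.mean(np.abs(model_pred - actual_margin))
market_mae = np.mean(np.abs(market_line - actual_margin))
print(f"Model MAE: {model_mae:.2f}, Market MAE: {market_mae:.2f}")
```

Run over the full 816-game sample, this is all the table above requires.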
Step 2: Defining "Value" Bets
Rather than betting every game, we focused on games where our model meaningfully disagreed with the market.
Threshold Analysis
We tested different disagreement thresholds:
| Threshold | Games | Model Win Rate | vs 52.4% Break-Even |
|---|---|---|---|
| > 0 pts | 816 | 49.8% | -2.6% |
| > 1 pt | 512 | 50.4% | -2.0% |
| > 2 pts | 298 | 51.7% | -0.7% |
| > 3 pts | 156 | 53.2% | +0.8% |
| > 4 pts | 72 | 55.6% | +3.2% |
Key Finding: Only at significant disagreement (3+ points) did the model show potential edge. At smaller differences, the market was right more often than not.
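A sweep like the one in the table can be sketched as follows. Here `covers[i]` is assumed to record whether the model's side covered the spread in game `i` (an illustrative interface, not the chapter's actual pipeline):

```python
def threshold_sweep(disagreements, covers, thresholds=(0, 1, 2, 3, 4)):
    """For each minimum |model - market| disagreement, report the model's
    against-the-spread win rate and its margin over the 52.4% break-even
    required at standard -110 pricing."""
    rows = []
    for t in thresholds:
        mask = [d > t for d in disagreements]
        n = sum(mask)
        if n == 0:
            continue
        wins = sum(c for c, m in zip(covers, mask) if m)
        win_rate = wins / n
        rows.append((t, n, win_rate, win_rate - 0.524))
    return rows
```

Each returned row mirrors one line of the table above: threshold, sample size, win rate, and edge versus break-even.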
Step 3: Calculating Closing Line Value
Instead of tracking wins and losses (which are noisy), we measured whether our model identified value before the market.
The CLV Framework
For each game where our model disagreed with the opening line:
```
If Model < Opening Line (favoring the away team):
    CLV = Opening Line - Closing Line

If Model > Opening Line (favoring the home team):
    CLV = Closing Line - Opening Line
```
Positive CLV means the line moved in the direction we predicted.
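As a sketch, the branching rule above translates directly into a helper. Spreads are assumed here to be quoted as the projected home-team margin, so a lower number favors the away team; match this to your own data's convention before using it:

```python
def closing_line_value(model_line, opening_line, closing_line):
    """CLV in points for the side our model favors at the open.

    All lines are projected home-team margins: lower = away team
    favored (an assumed convention for this sketch)."""
    if model_line < opening_line:
        # Model favors the away team; a falling line is good for us
        return opening_line - closing_line
    else:
        # Model favors the home team; a rising line is good for us
        return closing_line - opening_line
```

In the Week 1, 2021 example (model -6.2, open -6.5, close -7.5), the model leaned toward DAL relative to the market and the line moved the other way, so `closing_line_value(-6.2, -6.5, -7.5)` returns `-1.0`.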
CLV Results by Disagreement Level
| Model Disagreement | N | Avg CLV | % Positive CLV |
|---|---|---|---|
| Model 1-2 pts off | 214 | -0.2 pts | 47% |
| Model 2-3 pts off | 142 | +0.3 pts | 52% |
| Model 3-4 pts off | 84 | +0.6 pts | 56% |
| Model 4+ pts off | 72 | +0.9 pts | 61% |
Key Finding: When our model disagreed significantly (3+ points), lines tended to move in our direction. This suggests our model captured information that wasn't fully priced in at open.
Step 4: Deep Dive - Where Did the Model Add Value?
By Game Context
| Situation | Games | Model CLV | Market Better? |
|---|---|---|---|
| Early season (Wks 1-4) | 68 | +0.8 pts | No |
| Post-bye | 64 | +0.4 pts | Slightly |
| Division games | 192 | -0.1 pts | Yes |
| Primetime | 140 | +0.1 pts | Neutral |
| Injury situations | 89 | +1.1 pts | No |
Insights:

1. Early season: Our model's regression-based projections provided value before the market fully adjusted to roster changes
2. Injury situations: Our injury adjustment framework added value the market may have under-weighted
3. Division games: The market's familiarity-based adjustments were superior to our model's
By Model Component
We decomposed which features contributed to positive CLV:
| Feature Source | Contribution to CLV |
|---|---|
| Elo difference | +0.1 pts |
| Efficiency metrics | +0.3 pts |
| Injury adjustments | +0.4 pts |
| Weather factors | +0.1 pts |
| Recent form | -0.2 pts |
Key Finding: Our efficiency metrics and injury adjustments provided genuine value. However, our "recent form" features were actually harmful: the market already incorporated this information better than we did.
Step 5: Statistical Significance Testing
The Core Question
Is our positive CLV result skill or luck?
Bootstrap Analysis
We ran 10,000 bootstrap samples of our CLV results:
```
Mean CLV (high-conviction bets): +0.72 points
95% CI: [+0.18, +1.24]
p-value for CLV > 0: 0.008
```
Conclusion: With 99.2% confidence, our model provided positive CLV on high-conviction bets. This is statistically significant.
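A bootstrap along these lines is straightforward with NumPy. The resampled p-value below is the usual percentile approximation (share of bootstrap means at or below zero), not an exact test:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_clv(clv_values, n_boot=10_000):
    """Bootstrap the mean CLV: point estimate, 95% percentile CI, and an
    approximate one-sided p-value for the null of mean CLV <= 0."""
    clv = np.asarray(clv_values)
    means = np.array([
        rng.choice(clv, size=len(clv), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [2.5, 97.5])
    p_value = np.mean(means <= 0)  # share of resamples at or below zero
    return means.mean(), (lo, hi), float(p_value)
```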
But Does CLV Translate to Profit?
The relationship between CLV and expected win rate:
```
CLV of +1 point ≈ 3% additional win probability
Our +0.72 average CLV ≈ 2.2% additional win probability
Expected win rate: 52.4% + 2.2% = 54.6%
```

At a 54.6% win rate betting at -110:

- Expected ROI: +4.2%
- On 100 bets of $100: +$420 expected profit
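These steps can be bundled into a helper that applies the chapter's two rules of thumb (3% win probability per point of CLV, added to the 52.4% baseline):

```python
def expected_roi(avg_clv, base_rate=0.524, pct_per_point=0.03):
    """Expected ROI per unit risked at -110, given average CLV in points.
    Uses the chapter's approximation: win rate = base + 3% per CLV point."""
    win_rate = base_rate + avg_clv * pct_per_point
    # At -110 you risk 110 to win 100, so the payoff odds are 100/110
    return win_rate * (100 / 110) - (1 - win_rate)
```

`expected_roi(0.72)` comes out to about +4.2%, matching the figure above.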
Actual vs Expected Results
| Metric | Expected | Actual | Difference |
|---|---|---|---|
| Win Rate | 54.6% | 55.6% | +1.0% |
| ROI | +4.2% | +5.8% | +1.6% |
| Profit (100 bets) | +$420 | +$580 | +$160 |
Note: Our actual results slightly exceeded expectations, but this is within normal variance. The CLV-predicted results are more reliable than the actual outcomes for such a small sample.
Step 6: Year-Over-Year Consistency
A key test of any edge: does it persist?
Annual CLV Performance
| Season | High-Conviction Bets | Avg CLV | Win Rate |
|---|---|---|---|
| 2021 | 56 | +0.82 pts | 57.1% |
| 2022 | 51 | +0.58 pts | 52.9% |
| 2023 | 49 | +0.71 pts | 55.1% |
Observation: CLV remained positive across all three seasons, though with year-to-year variance. This consistency suggests persistent (not random) edge.
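Given a per-bet log, the season-by-season summary is a simple groupby. The DataFrame below is hypothetical and only sketches the shape of the data:

```python
import pandas as pd

# Hypothetical per-bet log of high-conviction plays
bets = pd.DataFrame({
    'season': [2021, 2021, 2022, 2022, 2023, 2023],
    'clv':    [0.9, 0.7, 0.5, 0.6, 0.8, 0.6],
    'won':    [True, False, True, True, False, True],
})

by_season = bets.groupby('season').agg(
    n=('clv', 'size'),
    avg_clv=('clv', 'mean'),
    win_rate=('won', 'mean'),
)
print(by_season)
```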
Decay Analysis
Did our edge diminish over time (suggesting market adaptation)?
```
2021 H1: +0.91 CLV
2021 H2: +0.73 CLV
2022 H1: +0.62 CLV
2022 H2: +0.54 CLV
2023 H1: +0.78 CLV
2023 H2: +0.64 CLV
```
Observation: Some decay within seasons (market learns), but edge reset each new season. This suggests our model captures information about roster/team changes that takes time for the market to fully price.
Step 7: Actionable Insights
What We Learned About Our Model
- Overall, the market is better: don't bet every game
- High-conviction disagreements have value: focus on 3+ point differences
- Efficiency metrics add value: continue developing this component
- Recent form is already priced in: reduce the weight on this feature
- Early season is our edge: emphasize projections over recent results early in the year
Model Improvements Identified
Based on this analysis, we would:
- Remove over-weighted recent form features
- Enhance injury impact quantification
- Add market-derived features (opening line as input)
- Focus predictions on early season and injury situations
Using Market as Benchmark Going Forward
For ongoing model evaluation:
```python
def evaluate_model_vs_market(predictions, opening_lines, closing_lines):
    """Track model performance against the market."""
    results = {
        'total_predictions': len(predictions),
        'high_conviction': 0,
        'positive_clv_count': 0,
        'total_clv': 0.0,
    }

    for pred, open_line, close_line in zip(predictions, opening_lines, closing_lines):
        disagreement = abs(pred - open_line)

        if disagreement >= 3.0:  # High-conviction threshold
            results['high_conviction'] += 1

            # Calculate CLV based on prediction direction
            if pred < open_line:  # We favor the away team
                clv = open_line - close_line
            else:                 # We favor the home team
                clv = close_line - open_line

            results['total_clv'] += clv
            if clv > 0:
                results['positive_clv_count'] += 1

    if results['high_conviction'] > 0:
        results['avg_clv'] = results['total_clv'] / results['high_conviction']
        results['clv_hit_rate'] = results['positive_clv_count'] / results['high_conviction']

    return results
```
Key Takeaways
1. The Market is the Benchmark
Any model should be evaluated against market lines, not just historical outcomes. The market represents the collective wisdom of thousands of informed participants.
2. CLV > Win Rate
Closing Line Value is a better indicator of skill than win rate. Win rate is noisy; CLV measures whether you identified value before the market.
3. Be Selective
Models rarely beat markets across all games. Identify where your model has genuine edge and focus there.
4. Persistence Matters
A real edge should persist across seasons. If it disappears quickly, it was likely luck or the market adapted.
5. Learn from the Market
Use market feedback to improve your model. If the market consistently disagrees in certain situations and is right, adjust your model.
Conclusion
Our model evaluation reveals a common pattern in sports analytics: we can add value, but not everywhere. The betting market is remarkably efficient, but not perfectly so. By focusing on high-conviction disagreements and using CLV as our evaluation metric, we identified genuine (if modest) predictive value.
More importantly, this framework showed us how to improve. By understanding where we add value (efficiency metrics, injuries, early season) and where we don't (recent form, division games), we can build a better model.
The ultimate test of a prediction model isn't whether it makes good predictions in isolation; it's whether it makes better predictions than the best available alternative. In NFL analytics, that alternative is the betting market. Respecting its efficiency while seeking to understand and occasionally outperform it is the path to genuine analytical value.
Discussion Questions
1. Why might efficiency metrics provide value that the market hasn't fully incorporated?
2. How would you explain the decay in edge during a season, followed by a reset at the start of the next one?
3. If our model showed negative CLV in division games, what might we learn from the market's approach to these games?
4. What are the ethical considerations of using betting market data for non-betting analytical purposes?
5. How might this framework apply to player-level predictions (fantasy sports)?
Technical Appendix: CLV Calculation Details
Full CLV Formula
For a bet on Team A:
```
CLV_points = (Your_Line - Closing_Line) × Direction

Direction = +1 if betting the favorite, -1 if betting the underdog
```

Both lines are quoted from the favorite's perspective (as negative numbers), regardless of which side you bet.
Example:
- You bet Chiefs -6 (favoring Chiefs to cover)
- Line closes at Chiefs -7.5
- CLV = (-6) - (-7.5) = +1.5 points
- Your price was 1.5 points better than closing
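The direction-based formula can be coded as below. Both lines are assumed to be quoted from the favorite's side, which is what makes the underdog case come out with the opposite sign:

```python
def clv_points(your_line, closing_line, betting_favorite):
    """CLV in points; both lines quoted as the favorite's spread
    (e.g. -6 for a 6-point favorite)."""
    direction = 1 if betting_favorite else -1
    return (your_line - closing_line) * direction
```

`clv_points(-6.0, -7.5, betting_favorite=True)` reproduces the +1.5 from the Chiefs example; the same line move with an underdog bet, `clv_points(-6.0, -7.5, betting_favorite=False)`, gives -1.5, since the market moved against your side.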
Converting CLV to Expected Edge
```
Edge_percent ≈ CLV_points × 3%

1 point CLV   ≈ 3% edge
0.5 point CLV ≈ 1.5% edge

Break-even CLV at -110: approximately +0.8 points
(you need roughly 2.4% extra win probability to cover the vig)
```
Sample Size Requirements
To detect X% edge with 95% confidence:
```
N ≈ (1.96)² × p(1-p) / (edge)²      (with p ≈ 0.5)

For a 3% edge:  N ≈ 1,000 bets
For a 5% edge:  N ≈ 400 bets
For a 10% edge: N ≈ 100 bets
```
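The approximation plugs into a one-line helper (p = 0.5 is the worst case; the figures above are rounded):

```python
import math

def required_sample(edge, z=1.96, p=0.5):
    """Approximate number of bets needed to detect `edge` (as a fraction,
    e.g. 0.03 for 3%) at 95% confidence, via the normal approximation."""
    return math.ceil(z ** 2 * p * (1 - p) / edge ** 2)
```

`required_sample(0.03)` gives 1068, `required_sample(0.05)` gives 385, and `required_sample(0.10)` gives 97, consistent with the rounded table.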