Case Study: Evaluating Model Performance Against Market Efficiency

How Good is Your Model? Using Betting Markets as the Ultimate Benchmark


Introduction

You've built an NFL prediction model. It looks good in backtesting. Your accuracy metrics seem impressive. But how do you really know if it's any good?

The answer lies in comparing your model to the most efficient information aggregation system available: the betting market. This case study walks through a comprehensive evaluation of a prediction model against NFL betting lines, using Closing Line Value (CLV) as the primary performance metric.


The Scenario

The Model: An ensemble combining Elo ratings, efficiency metrics, and machine learning features (similar to what we built in Chapters 19-20).

The Test Period: 2021-2023 NFL seasons (816 regular season games)

The Question: Does our model provide any predictive value beyond what the betting market already knows?


Step 1: Gathering the Data

Model Predictions

Our model outputs a predicted point spread for each game:

Week 1, 2021: TB @ DAL
Model Prediction: TB -6.2

Market Data

We collected:

- Opening lines (released the Sunday/Monday of the prior week)
- Closing lines (at kickoff)
- Line movements throughout the week

Week 1, 2021: TB @ DAL
Opening Line: TB -6.5
Closing Line: TB -7.5
Final Score: TB 31, DAL 29 (TB wins by 2)

The Dataset

| Season | Games | Avg Model-Market Diff | Model MAE | Market MAE |
|--------|-------|-----------------------|-----------|------------|
| 2021   | 272   | 2.1 pts               | 10.8 pts  | 10.2 pts   |
| 2022   | 272   | 2.3 pts               | 11.2 pts  | 10.5 pts   |
| 2023   | 272   | 1.9 pts               | 10.6 pts  | 10.1 pts   |
| Total  | 816   | 2.1 pts               | 10.9 pts  | 10.3 pts   |

Initial Observation: The market's mean absolute error (10.3 points) is lower than our model's (10.9 points), suggesting the market is more accurate overall.
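As a minimal sketch of this comparison, MAE can be computed directly from predicted spreads and actual margins. The numbers below are toy values for illustration, not the real dataset:

```python
def mean_absolute_error(predicted, actual):
    """Average absolute miss, in points, between spreads and final margins."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Toy example: three games, all lines from the favorite's perspective
model_preds   = [-6.2, 3.0, -1.5]   # hypothetical model spreads
market_closes = [-7.5, 2.5, -3.0]   # hypothetical closing lines
margins       = [-2.0, 7.0, -10.0]  # hypothetical final margins

print(mean_absolute_error(model_preds, margins))    # model's MAE
print(mean_absolute_error(market_closes, margins))  # market's MAE
```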


Step 2: Defining "Value" Bets

Rather than betting every game, we focused on games where our model meaningfully disagreed with the market.

Threshold Analysis

We tested different disagreement thresholds:

| Threshold | Games | Model Win Rate | vs 52.4% Break-Even |
|-----------|-------|----------------|---------------------|
| > 0 pts   | 816   | 49.8%          | -2.6%               |
| > 1 pt    | 512   | 50.4%          | -2.0%               |
| > 2 pts   | 298   | 51.7%          | -0.7%               |
| > 3 pts   | 156   | 53.2%          | +0.8%               |
| > 4 pts   | 72    | 55.6%          | +3.2%               |

Key Finding: Only at significant disagreement (3+ points) did the model show potential edge. At smaller differences, the market was right more often than not.
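A threshold sweep like the one above can be sketched as follows. Here `gaps` (the absolute model-market difference per game) and `covered` (whether the model's side covered) are hypothetical inputs:

```python
def win_rate_by_threshold(disagreements, model_covered, thresholds=(0, 1, 2, 3, 4)):
    """Games remaining and model-side win rate at each disagreement threshold."""
    out = {}
    for t in thresholds:
        picks = [won for d, won in zip(disagreements, model_covered) if d > t]
        if picks:
            out[t] = (len(picks), sum(picks) / len(picks))
    return out

# Five hypothetical games: |model - market| gap, and whether the model's side covered
gaps    = [0.5, 1.5, 2.5, 3.5, 4.5]
covered = [False, False, True, True, True]
print(win_rate_by_threshold(gaps, covered))
```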


Step 3: Calculating Closing Line Value

Instead of tracking wins and losses (which are noisy), we measured whether our model identified value before the market.

The CLV Framework

For each game where our model disagreed with the opening line:

If Model < Opening Line (favoring away team):
    CLV = Opening Line - Closing Line

If Model > Opening Line (favoring home team):
    CLV = Closing Line - Opening Line

Positive CLV means the line moved in the direction we predicted.
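A minimal implementation of this rule, mirroring the same sign convention (a model line below the opening favors the away team, above it favors the home team):

```python
def game_clv(model_line, opening_line, closing_line):
    """Signed CLV in points: positive when the close moves toward the model's side."""
    if model_line < opening_line:          # model favors the away side
        return opening_line - closing_line
    return closing_line - opening_line     # model favors the home side

# TB @ DAL example: model TB -6.2, open TB -6.5, close TB -7.5
print(game_clv(-6.2, -6.5, -7.5))  # -1.0: the line moved against the model
```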

CLV Results by Disagreement Level

| Model Disagreement | N   | Avg CLV  | % Positive CLV |
|--------------------|-----|----------|----------------|
| 1-2 pts off        | 214 | -0.2 pts | 47%            |
| 2-3 pts off        | 142 | +0.3 pts | 52%            |
| 3-4 pts off        | 84  | +0.6 pts | 56%            |
| 4+ pts off         | 72  | +0.9 pts | 61%            |

Key Finding: When our model disagreed significantly (3+ points), lines tended to move in our direction. This suggests our model captured information that wasn't fully priced in at open.


Step 4: Deep Dive - Where Did the Model Add Value?

By Game Context

| Situation               | Games | Model CLV | Market Better? |
|-------------------------|-------|-----------|----------------|
| Early season (Wks 1-4)  | 68    | +0.8 pts  | No             |
| Post-bye                | 64    | +0.4 pts  | Slightly       |
| Division games          | 192   | -0.1 pts  | Yes            |
| Primetime               | 140   | +0.1 pts  | Neutral        |
| Injury situations       | 89    | +1.1 pts  | No             |

Insights:

1. Early season: Our model's regression-based projections provided value before the market fully adjusted to roster changes.
2. Injury situations: Our injury adjustment framework added value the market may have under-weighted.
3. Division games: The market's familiarity-based adjustments were superior to our model's.

By Model Component

We decomposed which features contributed to positive CLV:

| Feature Source      | Contribution to CLV |
|---------------------|---------------------|
| Elo difference      | +0.1 pts            |
| Efficiency metrics  | +0.3 pts            |
| Injury adjustments  | +0.4 pts            |
| Weather factors     | +0.1 pts            |
| Recent form         | -0.2 pts            |

Key Finding: Our efficiency metrics and injury adjustments provided genuine value. However, our "recent form" features were actually harmful - the market already incorporated this information better than we did.


Step 5: Statistical Significance Testing

The Core Question

Is our positive CLV result skill or luck?

Bootstrap Analysis

We ran 10,000 bootstrap samples of our CLV results:

Mean CLV (high-conviction bets): +0.72 points
95% CI: [+0.18, +1.24]
p-value for CLV > 0: 0.008

Conclusion: Our model's positive CLV on high-conviction bets is statistically significant (p = 0.008) - there is less than a 1% chance of observing an effect this large from luck alone.
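A bootstrap of this kind can be sketched with the standard library alone; the resample count, seed, and input values here are illustrative:

```python
import random

def bootstrap_clv_ci(clv_values, n_resamples=10_000, seed=42):
    """95% bootstrap CI for mean CLV, plus a one-sided p-value for mean <= 0."""
    random.seed(seed)
    n = len(clv_values)
    means = sorted(
        sum(random.choices(clv_values, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples)]
    p_value = sum(m <= 0 for m in means) / n_resamples
    return low, high, p_value

# Hypothetical per-bet CLV values (points)
low, high, p = bootstrap_clv_ci([1.2, -0.5, 0.9, 2.1, 0.3, 1.4, -0.2, 0.8])
print(f"95% CI: [{low:+.2f}, {high:+.2f}], p = {p:.4f}")
```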

But Does CLV Translate to Profit?

The relationship between CLV and expected win rate:

CLV of +1 point ≈ 3% additional win probability
Our +0.72 average CLV ≈ 2.2% additional win probability
Expected win rate: 52.4% + 2.2% = 54.6%

At a 54.6% win rate betting at -110:

- Expected ROI: +4.2%
- On 100 bets of $100: +$420 expected profit
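This conversion can be wrapped in a small helper. The 3%-per-point factor and 52.4% break-even rate are this chapter's rules of thumb, not exact constants:

```python
def expected_roi_from_clv(avg_clv, pts_to_prob=0.03, break_even=0.524, price=110):
    """Expected ROI per dollar risked at -110, from average CLV in points,
    using the rule of thumb that 1 point of CLV is worth ~3% win probability."""
    win_rate = break_even + avg_clv * pts_to_prob   # e.g. 0.524 + 0.72 * 0.03
    payout = 100 / price                            # win ~0.909 per 1 risked at -110
    return win_rate * payout - (1 - win_rate)

print(f"{expected_roi_from_clv(0.72):.1%}")  # prints 4.2%
```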

Actual vs Expected Results

| Metric            | Expected | Actual | Difference |
|-------------------|----------|--------|------------|
| Win Rate          | 54.6%    | 55.6%  | +1.0%      |
| ROI               | +4.2%    | +5.8%  | +1.6%      |
| Profit (100 bets) | +$420    | +$580  | +$160      |

Note: Our actual results slightly exceeded expectations, but this is within normal variance. The CLV-predicted results are more reliable than the actual outcomes for such a small sample.


Step 6: Year-Over-Year Consistency

A key test of any edge: does it persist?

Annual CLV Performance

| Season | High-Conviction Bets | Avg CLV   | Win Rate |
|--------|----------------------|-----------|----------|
| 2021   | 56                   | +0.82 pts | 57.1%    |
| 2022   | 51                   | +0.58 pts | 52.9%    |
| 2023   | 49                   | +0.71 pts | 55.1%    |

Observation: CLV remained positive across all three seasons, though with year-to-year variance. This consistency suggests persistent (not random) edge.

Decay Analysis

Did our edge diminish over time (suggesting market adaptation)?

2021 H1: +0.91 CLV
2021 H2: +0.73 CLV
2022 H1: +0.62 CLV
2022 H2: +0.54 CLV
2023 H1: +0.78 CLV
2023 H2: +0.64 CLV

Observation: Some decay within seasons (market learns), but edge reset each new season. This suggests our model captures information about roster/team changes that takes time for the market to fully price.


Step 7: Actionable Insights

What We Learned About Our Model

  1. Overall, the market is better - Don't bet every game
  2. High-conviction disagreements have value - Focus on 3+ point differences
  3. Efficiency metrics add value - Continue developing this component
  4. Recent form is already priced - Reduce weight on this feature
  5. Early season is our edge - Emphasize projections over recent results early

Model Improvements Identified

Based on this analysis, we would:

  1. Remove over-weighted recent form features
  2. Enhance injury impact quantification
  3. Add market-derived features (opening line as input)
  4. Focus predictions on early season and injury situations

Using Market as Benchmark Going Forward

For ongoing model evaluation:

```python
def evaluate_model_vs_market(predictions, opening_lines, closing_lines):
    """
    Track model performance against market.
    """
    results = {
        'total_predictions': len(predictions),
        'high_conviction': 0,
        'positive_clv_count': 0,
        'total_clv': 0
    }

    for pred, open_line, close_line in zip(predictions, opening_lines, closing_lines):
        disagreement = abs(pred - open_line)

        if disagreement >= 3.0:  # High conviction threshold
            results['high_conviction'] += 1

            # Calculate CLV based on prediction direction
            if pred < open_line:  # We favor away
                clv = open_line - close_line
            else:  # We favor home
                clv = close_line - open_line

            results['total_clv'] += clv
            if clv > 0:
                results['positive_clv_count'] += 1

    if results['high_conviction'] > 0:
        results['avg_clv'] = results['total_clv'] / results['high_conviction']
        results['clv_hit_rate'] = results['positive_clv_count'] / results['high_conviction']

    return results
```
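A quick usage check might look like the following. The function is redefined here in condensed form so the snippet runs standalone, and the game lines are hypothetical:

```python
def evaluate_model_vs_market(predictions, opening_lines, closing_lines, threshold=3.0):
    """Condensed version of the tracker above, for a standalone demonstration."""
    high, positive, total = 0, 0, 0.0
    for pred, open_line, close_line in zip(predictions, opening_lines, closing_lines):
        if abs(pred - open_line) >= threshold:
            clv = (open_line - close_line) if pred < open_line else (close_line - open_line)
            high += 1
            total += clv
            positive += clv > 0
    if high == 0:
        return {'high_conviction': 0}
    return {'high_conviction': high, 'avg_clv': total / high,
            'clv_hit_rate': positive / high}

# Three hypothetical games; only the first two clear the 3-point threshold
preds  = [-10.0, -3.0, -6.8]
opens  = [-6.5, -7.0, -6.5]
closes = [-7.5, -6.0, -7.0]
print(evaluate_model_vs_market(preds, opens, closes))
```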

Key Takeaways

1. The Market is the Benchmark

Any model should be evaluated against market lines, not just historical outcomes. The market represents collective wisdom of thousands of informed participants.

2. CLV > Win Rate

Closing Line Value is a better indicator of skill than win rate. Win rate is noisy; CLV measures whether you identified value before the market.

3. Be Selective

Models rarely beat markets across all games. Identify where your model has genuine edge and focus there.

4. Persistence Matters

A real edge should persist across seasons. If it disappears quickly, it was likely luck or the market adapted.

5. Learn from the Market

Use market feedback to improve your model. If the market consistently disagrees in certain situations and is right, adjust your model.


Conclusion

Our model evaluation reveals a common pattern in sports analytics: we can add value, but not everywhere. The betting market is remarkably efficient, but not perfectly so. By focusing on high-conviction disagreements and using CLV as our evaluation metric, we identified genuine (if modest) predictive value.

More importantly, this framework showed us how to improve. By understanding where we add value (efficiency metrics, injuries, early season) and where we don't (recent form, division games), we can build a better model.

The ultimate test of a prediction model isn't whether it makes good predictions in isolation - it's whether it makes better predictions than the best available alternative. In NFL analytics, that alternative is the betting market. Respecting its efficiency while seeking to understand and occasionally outperform it is the path to genuine analytical value.


Discussion Questions

  1. Why might efficiency metrics provide value that the market hasn't fully incorporated?

  2. How would you explain the decay in edge during a season followed by reset at season start?

  3. If our model showed negative CLV in division games, what might we learn from the market's approach to these games?

  4. What are the ethical considerations of using betting market data for non-betting analytical purposes?

  5. How might this framework apply to player-level predictions (fantasy sports)?


Technical Appendix: CLV Calculation Details

Full CLV Formula

For a bet on Team A:
CLV_points = (Your_Line - Closing_Line) × Direction

Where Direction = +1 if betting the favorite and -1 if betting the underdog, with both lines expressed as the favorite's spread

Example:
- You bet Chiefs -6 (favoring Chiefs to cover)
- Line closes at Chiefs -7.5
- CLV = (-6) - (-7.5) = +1.5 points
- Your price was 1.5 points better than closing

Converting CLV to Expected Edge

Edge_percent = CLV_points × 3%  (approximate)

1 point CLV ≈ 3% edge
0.5 point CLV ≈ 1.5% edge

Break-even CLV at -110: approximately +0.8 points
(because you must overcome ~2.4% of vig, at roughly 3% per point of CLV)

Sample Size Requirements

To detect X% edge with 95% confidence:
N ≈ (1.96)² × p(1-p) / (edge)²

For 3% edge: N ≈ 1,000 bets
For 5% edge: N ≈ 400 bets
For 10% edge: N ≈ 100 bets
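The formula can be checked in a few lines of Python; ceiling rounding gives slightly more precise counts than the rounded figures above:

```python
import math

def bets_needed(edge, z=1.96, p=0.5):
    """Approximate sample size to detect a given edge at 95% confidence."""
    return math.ceil(z**2 * p * (1 - p) / edge**2)

for e in (0.03, 0.05, 0.10):
    print(f"{e:.0%} edge: ~{bets_needed(e)} bets")
```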