Case Study: Evaluating Model Performance Against Market Efficiency

How Good is Your Model? Using Betting Markets as the Ultimate Benchmark


Introduction

You've built an NFL prediction model. It looks good in backtesting. Your accuracy metrics seem impressive. But how do you really know if it's any good?

The answer lies in comparing your model to the most efficient information aggregation system available: the betting market. This case study walks through a comprehensive evaluation of a prediction model against NFL betting lines, using Closing Line Value (CLV) as the primary performance metric.


The Scenario

The Model: An ensemble combining Elo ratings, efficiency metrics, and machine learning features (similar to what we built in Chapters 19-20).

The Test Period: 2021-2023 NFL seasons (816 regular season games)

The Question: Does our model provide any predictive value beyond what the betting market already knows?


Step 1: Gathering the Data

Model Predictions

Our model outputs a predicted point spread for each game:

Week 1, 2021: TB @ DAL
Model Prediction: TB -6.2

Market Data

We collected:

- Opening lines (released the Sunday/Monday of the prior week)
- Closing lines (at kickoff)
- Line movements throughout the week

Week 1, 2021: TB @ DAL
Opening Line: TB -6.5
Closing Line: TB -7.5
Final Score: TB 31, DAL 29 (TB wins by 2)

The Dataset

| Season | Games | Avg Model-Market Diff | Model MAE | Market MAE |
|--------|-------|-----------------------|-----------|------------|
| 2021   | 272   | 2.1 pts               | 10.8 pts  | 10.2 pts   |
| 2022   | 272   | 2.3 pts               | 11.2 pts  | 10.5 pts   |
| 2023   | 272   | 1.9 pts               | 10.6 pts  | 10.1 pts   |
| Total  | 816   | 2.1 pts               | 10.9 pts  | 10.3 pts   |

Initial Observation: The market's mean absolute error (10.3 points) is lower than our model's (10.9 points), suggesting the market is more accurate overall.
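As a minimal sketch of this comparison, MAE can be computed directly from predicted spreads and actual margins. The numbers below are toy values for illustration, not the real dataset:

```python
def mean_absolute_error(predicted, actual):
    """Average absolute miss, in points, between spreads and final margins."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Toy example: three games, all lines from the favorite's perspective
model_preds   = [-6.2, 3.0, -1.5]   # hypothetical model spreads
market_closes = [-7.5, 2.5, -3.0]   # hypothetical closing lines
margins       = [-2.0, 7.0, -10.0]  # hypothetical final margins

print(mean_absolute_error(model_preds, margins))    # model's MAE
print(mean_absolute_error(market_closes, margins))  # market's MAE
```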


Step 2: Defining "Value" Bets

Rather than betting every game, we focused on games where our model meaningfully disagreed with the market.

Threshold Analysis

We tested different disagreement thresholds:

| Threshold | Games | Model Win Rate | vs 52.4% Break-Even |
|-----------|-------|----------------|---------------------|
| > 0 pts   | 816   | 49.8%          | -2.6%               |
| > 1 pt    | 512   | 50.4%          | -2.0%               |
| > 2 pts   | 298   | 51.7%          | -0.7%               |
| > 3 pts   | 156   | 53.2%          | +0.8%               |
| > 4 pts   | 72    | 55.6%          | +3.2%               |

Key Finding: Only at significant disagreement (3+ points) did the model show potential edge. At smaller differences, the market was right more often than not.
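A threshold sweep like the one above can be sketched as follows. Here `gaps` (the absolute model-market difference per game) and `covered` (whether the model's side covered) are hypothetical inputs:

```python
def win_rate_by_threshold(disagreements, model_covered, thresholds=(0, 1, 2, 3, 4)):
    """Games remaining and model-side win rate at each disagreement threshold."""
    out = {}
    for t in thresholds:
        picks = [won for d, won in zip(disagreements, model_covered) if d > t]
        if picks:
            out[t] = (len(picks), sum(picks) / len(picks))
    return out

# Five hypothetical games: |model - market| gap, and whether the model's side covered
gaps    = [0.5, 1.5, 2.5, 3.5, 4.5]
covered = [False, False, True, True, True]
print(win_rate_by_threshold(gaps, covered))
```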


Step 3: Calculating Closing Line Value

Instead of tracking wins and losses (which are noisy), we measured whether our model identified value before the market.

The CLV Framework

For each game where our model disagreed with the opening line:

If Model < Opening Line (favoring away team):
    CLV = Opening Line - Closing Line

If Model > Opening Line (favoring home team):
    CLV = Closing Line - Opening Line

Positive CLV means the line moved in the direction we predicted.
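A minimal implementation of this rule, mirroring the same sign convention (a model line below the opening favors the away team, above it favors the home team):

```python
def game_clv(model_line, opening_line, closing_line):
    """Signed CLV in points: positive when the close moves toward the model's side."""
    if model_line < opening_line:          # model favors the away side
        return opening_line - closing_line
    return closing_line - opening_line     # model favors the home side

# TB @ DAL example: model TB -6.2, open TB -6.5, close TB -7.5
print(game_clv(-6.2, -6.5, -7.5))  # -1.0: the line moved against the model
```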

CLV Results by Disagreement Level

| Model Disagreement | N   | Avg CLV  | % Positive CLV |
|--------------------|-----|----------|----------------|
| 1-2 pts off        | 214 | -0.2 pts | 47%            |
| 2-3 pts off        | 142 | +0.3 pts | 52%            |
| 3-4 pts off        | 84  | +0.6 pts | 56%            |
| 4+ pts off         | 72  | +0.9 pts | 61%            |

Key Finding: When our model disagreed significantly (3+ points), lines tended to move in our direction. This suggests our model captured information that wasn't fully priced in at open.


Step 4: Deep Dive - Where Did the Model Add Value?

By Game Context

| Situation               | Games | Model CLV | Market Better? |
|-------------------------|-------|-----------|----------------|
| Early season (Wks 1-4)  | 68    | +0.8 pts  | No             |
| Post-bye                | 64    | +0.4 pts  | Slightly       |
| Division games          | 192   | -0.1 pts  | Yes            |
| Primetime               | 140   | +0.1 pts  | Neutral        |
| Injury situations       | 89    | +1.1 pts  | No             |

Insights:

1. Early season: Our model's regression-based projections provided value before the market fully adjusted to roster changes.
2. Injury situations: Our injury adjustment framework added value the market may have under-weighted.
3. Division games: The market's familiarity-based adjustments were superior to our model's.

By Model Component

We decomposed which features contributed to positive CLV:

| Feature Source      | Contribution to CLV |
|---------------------|---------------------|
| Elo difference      | +0.1 pts            |
| Efficiency metrics  | +0.3 pts            |
| Injury adjustments  | +0.4 pts            |
| Weather factors     | +0.1 pts            |
| Recent form         | -0.2 pts            |

Key Finding: Our efficiency metrics and injury adjustments provided genuine value. However, our "recent form" features were actually harmful - the market already incorporated this information better than we did.


Step 5: Statistical Significance Testing

The Core Question

Is our positive CLV result skill or luck?

Bootstrap Analysis

We ran 10,000 bootstrap samples of our CLV results:

Mean CLV (high-conviction bets): +0.72 points
95% CI: [+0.18, +1.24]
p-value for CLV > 0: 0.008

Conclusion: Our model's positive CLV on high-conviction bets is statistically significant (p = 0.008) - there is less than a 1% chance of observing an effect this large from luck alone.
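A bootstrap of this kind can be sketched with the standard library alone; the resample count, seed, and input values here are illustrative:

```python
import random

def bootstrap_clv_ci(clv_values, n_resamples=10_000, seed=42):
    """95% bootstrap CI for mean CLV, plus a one-sided p-value for mean <= 0."""
    random.seed(seed)
    n = len(clv_values)
    means = sorted(
        sum(random.choices(clv_values, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples)]
    p_value = sum(m <= 0 for m in means) / n_resamples
    return low, high, p_value

# Hypothetical per-bet CLV values (points)
low, high, p = bootstrap_clv_ci([1.2, -0.5, 0.9, 2.1, 0.3, 1.4, -0.2, 0.8])
print(f"95% CI: [{low:+.2f}, {high:+.2f}], p = {p:.4f}")
```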

But Does CLV Translate to Profit?

The relationship between CLV and expected win rate:

CLV of +1 point ≈ 3% additional win probability
Our +0.72 average CLV ≈ 2.2% additional win probability
Expected win rate: 52.4% + 2.2% = 54.6%

At a 54.6% win rate betting at -110:

- Expected ROI: +4.2%
- On 100 bets of $100: +$420 expected profit
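This conversion can be wrapped in a small helper. The 3%-per-point factor and 52.4% break-even rate are this chapter's rules of thumb, not exact constants:

```python
def expected_roi_from_clv(avg_clv, pts_to_prob=0.03, break_even=0.524, price=110):
    """Expected ROI per dollar risked at -110, from average CLV in points,
    using the rule of thumb that 1 point of CLV is worth ~3% win probability."""
    win_rate = break_even + avg_clv * pts_to_prob   # e.g. 0.524 + 0.72 * 0.03
    payout = 100 / price                            # win ~0.909 per 1 risked at -110
    return win_rate * payout - (1 - win_rate)

print(f"{expected_roi_from_clv(0.72):.1%}")  # prints 4.2%
```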

Actual vs Expected Results

| Metric            | Expected | Actual | Difference |
|-------------------|----------|--------|------------|
| Win Rate          | 54.6%    | 55.6%  | +1.0%      |
| ROI               | +4.2%    | +5.8%  | +1.6%      |
| Profit (100 bets) | +$420    | +$580  | +$160      |

Note: Our actual results slightly exceeded expectations, but this is within normal variance. The CLV-predicted results are more reliable than the actual outcomes for such a small sample.


Step 6: Year-Over-Year Consistency

A key test of any edge: does it persist?

Annual CLV Performance

| Season | High-Conviction Bets | Avg CLV   | Win Rate |
|--------|----------------------|-----------|----------|
| 2021   | 56                   | +0.82 pts | 57.1%    |
| 2022   | 51                   | +0.58 pts | 52.9%    |
| 2023   | 49                   | +0.71 pts | 55.1%    |

Observation: CLV remained positive across all three seasons, though with year-to-year variance. This consistency suggests persistent (not random) edge.

Decay Analysis

Did our edge diminish over time (suggesting market adaptation)?

2021 H1: +0.91 CLV
2021 H2: +0.73 CLV
2022 H1: +0.62 CLV
2022 H2: +0.54 CLV
2023 H1: +0.78 CLV
2023 H2: +0.64 CLV

Observation: Some decay within seasons (market learns), but edge reset each new season. This suggests our model captures information about roster/team changes that takes time for the market to fully price.


Step 7: Actionable Insights

What We Learned About Our Model

  1. Overall, the market is better - Don't bet every game
  2. High-conviction disagreements have value - Focus on 3+ point differences
  3. Efficiency metrics add value - Continue developing this component
  4. Recent form is already priced - Reduce weight on this feature
  5. Early season is our edge - Emphasize projections over recent results early

Model Improvements Identified

Based on this analysis, we would:

  1. Remove over-weighted recent form features
  2. Enhance injury impact quantification
  3. Add market-derived features (opening line as input)
  4. Focus predictions on early season and injury situations

Using Market as Benchmark Going Forward

For ongoing model evaluation:

```python
def evaluate_model_vs_market(predictions, opening_lines, closing_lines):
    """
    Track model performance against market.
    """
    results = {
        'total_predictions': len(predictions),
        'high_conviction': 0,
        'positive_clv_count': 0,
        'total_clv': 0
    }

    for pred, open_line, close_line in zip(predictions, opening_lines, closing_lines):
        disagreement = abs(pred - open_line)

        if disagreement >= 3.0:  # High conviction threshold
            results['high_conviction'] += 1

            # Calculate CLV based on prediction direction
            if pred < open_line:  # We favor away
                clv = open_line - close_line
            else:  # We favor home
                clv = close_line - open_line

            results['total_clv'] += clv
            if clv > 0:
                results['positive_clv_count'] += 1

    if results['high_conviction'] > 0:
        results['avg_clv'] = results['total_clv'] / results['high_conviction']
        results['clv_hit_rate'] = results['positive_clv_count'] / results['high_conviction']

    return results
```
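A quick usage check might look like the following. The function is redefined here in condensed form so the snippet runs standalone, and the game lines are hypothetical:

```python
def evaluate_model_vs_market(predictions, opening_lines, closing_lines, threshold=3.0):
    """Condensed version of the tracker above, for a standalone demonstration."""
    high, positive, total = 0, 0, 0.0
    for pred, open_line, close_line in zip(predictions, opening_lines, closing_lines):
        if abs(pred - open_line) >= threshold:
            clv = (open_line - close_line) if pred < open_line else (close_line - open_line)
            high += 1
            total += clv
            positive += clv > 0
    if high == 0:
        return {'high_conviction': 0}
    return {'high_conviction': high, 'avg_clv': total / high,
            'clv_hit_rate': positive / high}

# Three hypothetical games; only the first two clear the 3-point threshold
preds  = [-10.0, -3.0, -6.8]
opens  = [-6.5, -7.0, -6.5]
closes = [-7.5, -6.0, -7.0]
print(evaluate_model_vs_market(preds, opens, closes))
```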

Key Takeaways

1. The Market is the Benchmark

Any model should be evaluated against market lines, not just historical outcomes. The market represents collective wisdom of thousands of informed participants.

2. CLV > Win Rate

Closing Line Value is a better indicator of skill than win rate. Win rate is noisy; CLV measures whether you identified value before the market.

3. Be Selective

Models rarely beat markets across all games. Identify where your model has genuine edge and focus there.

4. Persistence Matters

A real edge should persist across seasons. If it disappears quickly, it was likely luck or the market adapted.

5. Learn from the Market

Use market feedback to improve your model. If the market consistently disagrees in certain situations and is right, adjust your model.


Conclusion

Our model evaluation reveals a common pattern in sports analytics: we can add value, but not everywhere. The betting market is remarkably efficient, but not perfectly so. By focusing on high-conviction disagreements and using CLV as our evaluation metric, we identified genuine (if modest) predictive value.

More importantly, this framework showed us how to improve. By understanding where we add value (efficiency metrics, injuries, early season) and where we don't (recent form, division games), we can build a better model.

The ultimate test of a prediction model isn't whether it makes good predictions in isolation - it's whether it makes better predictions than the best available alternative. In NFL analytics, that alternative is the betting market. Respecting its efficiency while seeking to understand and occasionally outperform it is the path to genuine analytical value.


Discussion Questions

  1. Why might efficiency metrics provide value that the market hasn't fully incorporated?

  2. How would you explain the decay in edge during a season followed by reset at season start?

  3. If our model showed negative CLV in division games, what might we learn from the market's approach to these games?

  4. What are the ethical considerations of using betting market data for non-betting analytical purposes?

  5. How might this framework apply to player-level predictions (fantasy sports)?


Technical Appendix: CLV Calculation Details

Full CLV Formula

For a bet on Team A:
CLV_points = (Your_Line - Closing_Line) × Direction

Where Direction = +1 if betting the favorite and -1 if betting the underdog, with both lines expressed as the favorite's spread

Example:
- You bet Chiefs -6 (favoring Chiefs to cover)
- Line closes at Chiefs -7.5
- CLV = (-6) - (-7.5) = +1.5 points
- Your price was 1.5 points better than closing

Converting CLV to Expected Edge

Edge_percent = CLV_points × 3%  (approximate)

1 point CLV ≈ 3% edge
0.5 point CLV ≈ 1.5% edge

Break-even CLV at -110: approximately +0.8 points
(because you must overcome ~2.4% of vig, at roughly 3% per point of CLV)

Sample Size Requirements

To detect X% edge with 95% confidence:
N ≈ (1.96)² × p(1-p) / (edge)²

For 3% edge: N ≈ 1,000 bets
For 5% edge: N ≈ 400 bets
For 10% edge: N ≈ 100 bets
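The formula can be checked in a few lines of Python; ceiling rounding gives slightly more precise counts than the rounded figures above:

```python
import math

def bets_needed(edge, z=1.96, p=0.5):
    """Approximate sample size to detect a given edge at 95% confidence."""
    return math.ceil(z**2 * p * (1 - p) / edge**2)

for e in (0.03, 0.05, 0.10):
    print(f"{e:.0%} edge: ~{bets_needed(e)} bets")
```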