Case Study 1: Bias Audit --- Analyzing 1,000 Betting Decisions


Executive Summary

Every bettor believes they are rational. The data usually tells a different story. This case study follows a quantitative sports bettor --- Marcus, a data scientist by profession who has been betting seriously for three years --- as he conducts a comprehensive audit of 1,000 consecutive betting decisions to detect cognitive biases in his decision-making. Using his own meticulous bet log, which captures not just outcomes but reasoning, confidence levels, and model outputs, Marcus applies statistical techniques to uncover five distinct biases operating beneath his conscious awareness. The audit reveals that his supposedly disciplined, model-driven approach was systematically distorted by anchoring to opening lines, confirmation bias in his research process, recency bias in stake sizing, overconfidence in his probability estimates, and narrative-driven departures from his model. The combined impact: an estimated 1.8 percentage points of lost edge per year --- roughly 40% of his theoretical advantage. The case provides a complete methodology for conducting a personal bias audit, including Python code for each analysis, and demonstrates that even experienced, quantitatively sophisticated bettors are not immune to the biases they study.


Background

The Bettor Profile

Marcus started sports betting in 2022 after completing a master's degree in statistics. His day job involves building machine learning models for a fintech company. He brought genuine technical skill to his betting operation: a Poisson-based model for NFL totals, an Elo-plus-regression model for NBA spreads, and a logistic regression model for MLB moneylines. His Python code was clean, his backtests were rigorous, and his bankroll management followed a fractional Kelly framework.

For three years, Marcus tracked every bet in a PostgreSQL database with unusual discipline. Each record contained the standard fields --- date, sport, bet type, odds, stake, result --- plus fields that most bettors neglect: his model's predicted probability, the opening line at the time he first analyzed the game, the line at the time he placed the bet, his written reasoning for the bet, a 1-to-10 confidence rating, and the closing line. He also recorded what he called "override flags" --- instances where he deviated from his model's recommendation.

At the end of Year 3, Marcus's cumulative record showed a 3.2% yield on 2,847 lifetime bets. Respectable, but below his model's backtested theoretical yield of 4.8%. He wondered where the 1.6 percentage points were going.

The Audit Decision

Marcus decided to audit a contiguous block of 1,000 bets from the middle of Year 3 --- a period when his model was stable (no major updates) and his betting covered all three sports. He chose this window specifically because it was recent enough to remember the context of many decisions but large enough for statistical power.

His audit methodology examined five hypotheses, each corresponding to a known cognitive bias.


The Analysis

Bias 1: Anchoring to Opening Lines

Hypothesis: Marcus's probability estimates are systematically pulled toward the implied probability of the opening line, even when his model and the closing line both suggest the opening line was inaccurate.

Method: For each bet, Marcus computed three implied probabilities: (1) the opening line implied probability, (2) his model's probability, and (3) the closing line implied probability. He then measured the correlation between the opening line and his stated confidence rating, controlling for the model's output.
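
In code, the partial-correlation check can be sketched roughly as follows. The column names (open_implied_prob for the opening line's implied probability on the side Marcus bet, model_prob for his model's estimate, confidence for the 1-to-10 rating) are illustrative, not his actual schema:

```python
import pandas as pd
from scipy import stats

def partial_corr(df: pd.DataFrame, x: str, y: str, control: str):
    """Correlation between x and y after removing the linear effect of the control column."""
    def residuals(col: str) -> pd.Series:
        # Residualize the column on the control variable via simple linear regression.
        slope, intercept, *_ = stats.linregress(df[control], df[col])
        return df[col] - (intercept + slope * df[control])
    return stats.pearsonr(residuals(x), residuals(y))

bets = pd.read_csv("bet_log.csv")   # one row per bet, in chronological order
r, p = partial_corr(bets, x="open_implied_prob", y="confidence", control="model_prob")
print(f"Anchoring check: partial r = {r:.2f} (p = {p:.4f})")
```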

Finding: Marcus discovered a correlation of r = 0.31 between the opening line's implied probability and his confidence rating, after partialing out the model's probability. This meant that even holding his model's output constant, he was more confident in bets where the opening line supported his position and less confident when it did not. More critically, he found that his "override" bets --- where he deviated from his model --- were 2.3 times more likely to deviate toward the opening line than away from it.

The magnitude of anchoring was not uniform. It was strongest in NFL games (where opening lines are released on Sunday evening and discussed extensively for a full week) and weakest in MLB (where lines open the morning of the game). This pattern made sense: the longer the interval between line release and bet placement, the more time the anchor has to embed itself.

Impact: Marcus estimated that anchoring cost him approximately 0.4 percentage points of yield. His model's recommendations were slightly better than his actual bets, and the discrepancy was statistically correlated with the direction of the opening line.

Bias 2: Confirmation Bias in Research

Hypothesis: After his model identifies a potential bet, Marcus's subsequent research selectively supports rather than challenges the model's recommendation.

Method: Marcus had recorded written reasoning for each bet, typically 40-60 words per entry. He performed a text analysis, categorizing each statement as "supporting" (a reason why the bet should win) or "challenging" (a reason why it might lose). He then computed the ratio of supporting to challenging statements and tested whether this ratio correlated with bet quality.
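
Marcus's categorization of statements was partly manual; a crude keyword-based approximation of the same idea, with illustrative cue words and assumed reasoning, profit, and stake columns, looks something like this:

```python
import pandas as pd

# Cue-word lists are illustrative, not Marcus's actual classifier.
SUPPORT_CUES = ("edge", "advantage", "trend", "should cover", "model likes", "value")
CHALLENGE_CUES = ("however", "risk", "concern", "worry", "could lose", "downside")

def count_cues(text: str, cues: tuple) -> int:
    """Count how many times any cue phrase appears in the reasoning text."""
    text = text.lower()
    return sum(text.count(cue) for cue in cues)

bets = pd.read_csv("bet_log.csv")
bets["supporting"] = bets["reasoning"].apply(lambda t: count_cues(t, SUPPORT_CUES))
bets["challenging"] = bets["reasoning"].apply(lambda t: count_cues(t, CHALLENGE_CUES))

ratio = bets["supporting"].sum() / max(bets["challenging"].sum(), 1)
print(f"Supporting-to-challenging ratio: {ratio:.1f}:1")

# Yield comparison: bets with at least two challenging points vs. the rest.
for has_two, group in bets.groupby(bets["challenging"] >= 2):
    label = ">=2 challenging points" if has_two else "0-1 challenging points"
    print(f"{label}: yield {group['profit'].sum() / group['stake'].sum() * 100:+.1f}%")
```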

Finding: Across all 1,000 bets, the average ratio of supporting to challenging statements was 3.7:1. Marcus was recording nearly four supporting reasons for every one challenging reason. When he looked at bets that ultimately lost, the pre-bet reasoning showed the same 3.7:1 ratio --- he was not identifying the risks that materialized.

More revealing was the pattern within his research process. His browser history (which he had the foresight to log) showed that after his model flagged a bet, he averaged 4.2 web searches. The first search was almost always a confirmation-seeking query ("Chiefs offense improvement 2024" rather than "Chiefs offensive weaknesses"). When the first few searches supported the bet, he often stopped searching entirely. When early results contradicted the bet, he searched more (averaging 6.1 searches) --- but these additional searches were almost entirely seeking alternative supporting evidence.

Impact: Bets where Marcus's written reasoning included at least two specific challenging points yielded 4.1%, compared to 2.5% for bets with zero or one challenging point. The act of engaging with counterarguments appeared to improve decision quality, likely because it caused him to pass on the weakest opportunities.

Bias 3: Recency Bias in Stake Sizing

Hypothesis: Marcus's stake sizes are influenced by the outcomes of recent bets, independent of the current bet's expected value.

Method: Marcus computed the correlation between his stake size (as a percentage of bankroll) and the outcomes of his previous 1, 2, 3, 5, and 10 bets. His Kelly-based staking system should produce stake sizes that are a function only of the current bet's edge and odds, not of recent outcomes.
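
A sketch of that lagged-correlation check, assuming a chronologically ordered log with stake_pct (stake as a percentage of bankroll) and won (1 for a win, 0 for a loss) columns:

```python
import pandas as pd
from scipy import stats

bets = pd.read_csv("bet_log.csv")   # must be sorted in bet order

for lag in (1, 2, 3, 5, 10):
    # Win rate over the previous `lag` bets, excluding the current bet.
    recent_form = bets["won"].shift(1).rolling(lag).mean()
    mask = recent_form.notna()
    r, p = stats.pearsonr(recent_form[mask], bets.loc[mask, "stake_pct"])
    print(f"lag {lag:>2}: corr(recent win rate, stake %) = {r:+.3f} (p = {p:.4f})")

# Under purely mechanical Kelly sizing this correlation should be near zero;
# a positive value means stakes shrink after losses and grow after wins.
```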

Finding: Marcus found a statistically significant relationship between recent outcomes and subsequent stake sizes: he staked less after losses and more after wins. After a sequence of three or more losses, his average stake dropped to 1.4% of bankroll, compared to his system's recommended average of 2.1%. After three or more wins, his average stake increased to 2.5%.

The pattern was not one of reckless loss-chasing. Instead, Marcus exhibited a subtler form of recency bias: loss-induced timidity. After losses, he found himself unconsciously rounding down when his model suggested a bet size near his comfort threshold. A model recommendation of 2.3% of bankroll became 2.0% after losses but remained 2.3% (or became 2.5%) after wins. He described this post-hoc as "being prudent" during rough patches, but the data showed it was systematic and costly.

Impact: The under-betting after losses reduced his realized edge by approximately 0.3 percentage points. Because losing streaks are when the bankroll naturally compresses (reducing absolute bet sizes), the additional fractional reduction compounded the effect. His bankroll growth curve showed a distinctive "staircase" pattern: gains during neutral or positive periods, then flattening during losing streaks when he should have been capturing the same percentage returns.

Bias 4: Overconfidence in Probability Estimates

Hypothesis: Marcus's probability estimates are systematically overconfident --- outcomes to which he assigns 60% probability occur less than 60% of the time.

Method: Marcus constructed a calibration analysis by binning his 1,000 probability estimates into deciles and comparing predicted probabilities to actual frequencies.
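
One way to build the decile calibration table and compute the ECE, with est_prob, won, and sport as assumed column names:

```python
import pandas as pd

def calibration_table(df: pd.DataFrame) -> pd.DataFrame:
    """Equal-count decile bins: mean predicted probability vs. observed win rate."""
    binned = df.assign(bin=pd.qcut(df["est_prob"], q=10, duplicates="drop"))
    return binned.groupby("bin", observed=True).agg(
        predicted=("est_prob", "mean"),
        actual=("won", "mean"),
        n=("won", "size"),
    )

def ece(df: pd.DataFrame) -> float:
    """Expected Calibration Error: size-weighted mean |predicted - actual| across bins."""
    table = calibration_table(df)
    weights = table["n"] / table["n"].sum()
    return float((weights * (table["predicted"] - table["actual"]).abs()).sum())

bets = pd.read_csv("bet_log.csv")
print(calibration_table(bets))
print(f"Overall ECE = {ece(bets):.3f}")
for sport, grp in bets.groupby("sport"):
    print(f"{sport}: ECE = {ece(grp):.3f}")
```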

Finding: Marcus's calibration curve sat consistently below the diagonal, the characteristic signature of overconfidence. When he estimated 55% probability, the actual frequency was approximately 53%. When he estimated 65%, the actual frequency was approximately 61%. When he estimated 75%, the actual frequency was approximately 69%. His Expected Calibration Error (ECE) was 0.028 --- not terrible, but meaningfully above zero.

The overconfidence was not uniform across sports. His NFL estimates were the most overconfident (ECE = 0.034), his NBA estimates were moderate (ECE = 0.026), and his MLB estimates were nearly well-calibrated (ECE = 0.015). Marcus hypothesized that this reflected sample size: MLB's 162-game season provided far more data for model calibration than the NFL's 17-game season, and his MLB model had been updated more frequently.

Impact: Overconfident probability estimates directly feed into the Kelly sizing formula, producing oversized bets. Marcus calculated that his systematic overconfidence caused him to over-bet by approximately 15% on average, reducing his geometric growth rate by approximately 0.5 percentage points per year.
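
The sensitivity is easy to see with a toy Kelly calculation; the numbers below are illustrative, not drawn from Marcus's log. Near break-even odds, even a half-point overestimate of the win probability inflates the recommended stake by roughly 15%:

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Full-Kelly fraction of bankroll for win probability p at the given decimal odds."""
    b = decimal_odds - 1.0                       # net payout per unit staked
    return max((b * p - (1.0 - p)) / b, 0.0)

# Illustrative inputs: true win probability 54.5%, overconfident estimate 55.0%,
# at decimal odds of 1.95 (roughly -105 American).
f_true = kelly_fraction(0.545, 1.95)
f_est = kelly_fraction(0.550, 1.95)
print(f"stake at true p:      {f_true:.2%} of bankroll")
print(f"stake at estimated p: {f_est:.2%} of bankroll")
print(f"over-bet factor:      {f_est / f_true:.2f}x")    # about 1.16x with these inputs
```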

Bias 5: Narrative-Driven Model Overrides

Hypothesis: When Marcus overrides his model, the overrides are systematically influenced by sports media narratives rather than genuine informational advantages.

Method: Marcus flagged 127 of the 1,000 bets as "overrides" --- bets where he deviated from his model's recommendation. He analyzed the performance of overrides versus model-following bets and examined the language in his override reasoning.
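
The headline comparison takes only a few lines, assuming the log stores an override flag (boolean), an override_type label ("information" or "judgment") for flagged bets, and profit and stake in consistent units:

```python
import pandas as pd

bets = pd.read_csv("bet_log.csv")

def yield_pct(group: pd.DataFrame) -> float:
    """Yield: total profit divided by total amount staked, as a percentage."""
    return group["profit"].sum() / group["stake"].sum() * 100

print(f"model-following bets: {yield_pct(bets[~bets['override']]):+.1f}%")
print(f"all overrides:        {yield_pct(bets[bets['override']]):+.1f}%")

for kind, grp in bets[bets["override"]].groupby("override_type"):
    print(f"  {kind}-based overrides: {yield_pct(grp):+.1f}%  (n = {len(grp)})")
```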

Finding: The override bets produced a yield of -1.2%, compared to +3.8% for model-following bets. This alone was damning, but the content analysis was more revealing. Of the 127 overrides, Marcus classified 43 as "information-based" (e.g., late injury news, weather changes not yet in the model) and 84 as "judgment-based" (e.g., "I think this team is better than the numbers show" or "revenge game narrative").

Information-based overrides performed well: +4.5% yield, suggesting that Marcus's live information processing added genuine value. But judgment-based overrides yielded -4.8%, destroying value systematically. The language in these entries was littered with narrative terms: "bounce-back," "statement game," "letdown spot," "playoff-tested," and "momentum." These were not data-driven assessments; they were stories masquerading as analysis.

Impact: The net impact of all overrides was approximately -0.6 percentage points of yield. Had Marcus limited his overrides to information-based adjustments and followed his model for all judgment calls, his overall yield would have improved by roughly that amount.


The Remediation Plan

Armed with these findings, Marcus designed a five-part remediation plan:

  1. Anti-anchoring protocol: Stop reviewing opening lines entirely. Configure his data pipeline to show only the current line and his model's output. The opening line has no informational value once the market has moved.

  2. Adversarial research requirement: For every bet, require a minimum of two written challenging points before proceeding. If he cannot articulate two specific reasons the bet might lose, the edge is probably not well understood.

  3. Mechanical stake sizing: Remove all manual input from the sizing calculation. The Kelly formula takes the model's probability and the current odds as inputs and produces a stake. Marcus committed to following the output within a 10% band --- no rounding down after losses, no rounding up after wins. (A sketch of this mechanical sizing follows the list.)

  4. Ongoing calibration correction: Implement a rolling calibration check that updates weekly. When the rolling ECE exceeds 0.025, shrink all probability estimates toward 50% by a calibrated amount until the ECE returns to baseline.

  5. Override policy: Limit overrides to verifiable information (injury confirmations, weather changes, confirmed lineup changes). All "judgment" overrides are banned. Record any urge to override in the journal without acting on it, and analyze the hypothetical performance quarterly.
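
A minimal sketch of items 3 and 4 combined; the function names, the quarter-Kelly fraction, and the example inputs are illustrative rather than Marcus's production code:

```python
def shrink_probability(p: float, shrink: float) -> float:
    """Pull an estimate toward 50% by factor `shrink` (0 = unchanged, 1 = fully 0.5)."""
    return 0.5 + (p - 0.5) * (1.0 - shrink)

def mechanical_stake(model_prob: float, decimal_odds: float, bankroll: float,
                     fraction: float = 0.25, shrink: float = 0.0) -> float:
    """Stake from the model's probability and current odds only; no manual adjustment."""
    p = shrink_probability(model_prob, shrink)
    b = decimal_odds - 1.0
    full_kelly = max((b * p - (1.0 - p)) / b, 0.0)
    return bankroll * full_kelly * fraction

# Example: 56% model probability at 1.95 odds, $20,000 bankroll, quarter Kelly,
# with a 10% shrink toward 50% triggered by an elevated rolling ECE.
print(f"${mechanical_stake(0.56, 1.95, 20_000, shrink=0.10):,.2f}")
```

The key property is that nothing in the sizing function accepts a manual input; the only discretionary knob is the shrink factor, and that is set by the weekly ECE check rather than by feel.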


Results After Implementation

Marcus tracked his performance for the next 500 bets under the new protocol. His yield improved from 3.2% to 4.4% --- recovering approximately 75% of the bias-driven leakage his audit had identified. The remaining gap between his theoretical yield (4.8%) and realized yield (4.4%) he attributed to execution costs (line movement between signal and execution) and irreducible model estimation error.

The most impactful change was the override ban. Simply following his model for all judgment calls eliminated the -4.8% drag from narrative-driven bets while preserving the +4.5% contribution from legitimate information-based adjustments.


Key Lessons

The bias audit methodology is generalizable to any bettor with adequate records. The critical requirements are:

  1. Rich data: You need more than outcomes. You need model outputs, confidence ratings, written reasoning, and ideally emotional state indicators.
  2. Sufficient sample size: A minimum of 500 bets provides reasonable statistical power for most bias detection tests. One thousand is better.
  3. Honest confrontation: The purpose of the audit is to find problems, not to confirm that you are doing well. Approach it like a financial auditor, not a defense attorney.
  4. Actionable remediation: Each identified bias should produce a specific, implementable process change. "I'll try to be less biased" is not a remediation plan. "I will remove opening lines from my data pipeline" is.
  5. Follow-up measurement: Track the impact of remediation changes with the same rigor used for the initial audit. This closes the loop and prevents backsliding.

The most sobering finding from Marcus's audit was not any individual bias but their cumulative impact. Each bias alone seemed modest: 0.3 here, 0.5 there. Together, they consumed roughly 40% of his theoretical edge. For a bettor operating on thinner margins, the same biases could easily turn a profitable model into a losing operation.

Your model is only as good as the human operating it. Audit the human.


Discussion Questions

  1. Marcus's override analysis showed that information-based overrides added value while judgment-based overrides destroyed it. Under what circumstances might judgment-based overrides be justified, and how would you design a system to distinguish valuable judgment from narrative bias in real time?

  2. The anchoring bias was strongest in NFL betting where opening lines are available a full week before the game. How might sportsbooks' line release timing strategy be designed to exploit bettors' anchoring tendencies?

  3. Marcus's confirmation bias manifested in his web search patterns. How might the rise of AI-powered research tools (which can be prompted to provide balanced arguments) change the dynamics of confirmation bias in sports betting?

  4. The recency bias finding showed loss-induced timidity rather than the more commonly discussed loss-chasing. Which pattern is more common among sophisticated bettors, and which is more costly in expectation?

  5. If Marcus's audit had revealed that his overrides outperformed his model, what would that imply about his model's limitations, and how should he have incorporated that finding into his process?