Case Study 42.2: Reinforcement Learning for Adaptive Bankroll Management
Overview
Traditional bankroll management treats each bet as an independent decision: estimate the edge, calculate the Kelly fraction, size the bet. But real betting is sequential. Your bankroll changes after each bet, affecting future bet sizes. Your sportsbook accounts have limits that depend on your win history. The quality of available opportunities changes throughout the season. And your models' edges are not static; they decay as markets adapt.
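For reference, the static rule described above can be worked through on a single hypothetical bet; the probability and odds below are made-up illustrative numbers, not values used elsewhere in this case study:
# Worked example of static Kelly sizing (illustrative numbers only).
p = 0.55              # model's estimated win probability
decimal_odds = 1.95   # price offered by the sportsbook
b = decimal_odds - 1                  # net odds per unit staked
edge = p * decimal_odds - 1           # expected return per unit: ~0.0725
kelly = (b * p - (1 - p)) / b         # full-Kelly fraction: ~0.076
stake = 0.25 * kelly                  # quarter Kelly: ~1.9% of bankroll
print(f"edge={edge:.4f}, full Kelly={kelly:.4f}, quarter Kelly={stake:.4f}")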
This case study explores how reinforcement learning can capture these sequential dynamics and learn adaptive policies that outperform static rules. We build a simulated betting environment with realistic features (non-stationary edges, account limitations, correlated opportunities) and train an RL agent to manage bankroll across a full season.
The Environment
The key insight motivating the RL approach is that optimal betting behavior at any point depends on the full state, not just today's edge estimate. A bet that is correct with a full bankroll and unlimited account access may be wrong when the bankroll is depleted, accounts are restricted, or better opportunities are expected next week.
"""
Case Study 42.2: Reinforcement Learning for Adaptive Bankroll Management
==========================================================================
Builds a realistic betting environment and trains an RL agent to
learn an adaptive bankroll management policy that accounts for
non-stationary edges, account limitations, and sequential effects.
"""
import numpy as np
from typing import Dict, List, Tuple
class RealisticBettingEnvironment:
"""Betting environment with non-stationary edges and account limits.
Simulates a single season of betting with features that make
static Kelly sizing suboptimal: changing edges, account limitations
triggered by winning, and varying quality of opportunities.
Args:
initial_bankroll: Starting bankroll.
season_length: Number of days in the season.
n_games_per_day: Average number of available games.
base_edge: Starting edge for the model.
edge_decay_rate: Rate at which edge decays over the season.
account_limit_threshold: Win rate that triggers account limits.
"""
def __init__(
self,
initial_bankroll: float = 10000.0,
season_length: int = 180,
n_games_per_day: int = 8,
base_edge: float = 0.04,
edge_decay_rate: float = 0.0001,
account_limit_threshold: float = 0.56,
) -> None:
self.initial_bankroll = initial_bankroll
self.season_length = season_length
self.n_games_per_day = n_games_per_day
self.base_edge = base_edge
self.edge_decay_rate = edge_decay_rate
self.limit_threshold = account_limit_threshold
self.reset()
def reset(self) -> np.ndarray:
"""Reset the environment to the start of a new season.
Returns:
Initial state vector.
"""
self.bankroll = self.initial_bankroll
self.day = 0
self.total_bets = 0
self.total_wins = 0
self.total_pnl = 0.0
self.max_stake_multiplier = 1.0
self.win_history: List[bool] = []
self.current_edge = self.base_edge
self.games_today = self._generate_games()
return self._get_state()
def _generate_games(self) -> List[Dict]:
"""Generate today's betting opportunities.
Edge quality varies by day of season and includes random noise.
Returns:
List of game dictionaries with true probs and odds.
"""
n_games = max(1, np.random.poisson(self.n_games_per_day))
season_progress = self.day / self.season_length
current_edge = self.base_edge * (1 - self.edge_decay_rate * self.day)
games: List[Dict] = []
for _ in range(n_games):
game_edge = current_edge + np.random.normal(0, 0.03)
true_prob = 0.50 + game_edge / 2
true_prob = np.clip(true_prob, 0.30, 0.70)
            # Offered odds carry the book's vig: fair odds would be
            # 1 / implied_prob, so the payout is shaded down by (1 - vig).
            vig = 0.05
            implied_prob = true_prob - game_edge + np.random.normal(0, 0.02)
            implied_prob = np.clip(implied_prob, 0.20, 0.80)
            decimal_odds = (1 - vig) / implied_prob
games.append({
"true_prob": true_prob,
"implied_prob": implied_prob,
"decimal_odds": round(decimal_odds, 3),
"estimated_edge": round(true_prob - implied_prob, 4),
})
games.sort(key=lambda g: g["estimated_edge"], reverse=True)
return games
def _get_state(self) -> np.ndarray:
"""Construct the state vector capturing all relevant information.
State dimensions:
0: Bankroll ratio (current / initial)
1: Season progress (day / total days)
2: Win rate (smoothed)
3: Account limit multiplier
4: Number of games available today
5: Best edge available
6: Average edge available
            7: Average P&L per bet so far (scaled)
8: Current edge decay estimate
Returns:
9-dimensional state vector.
"""
recent_wins = self.win_history[-20:] if self.win_history else []
recent_wr = np.mean(recent_wins) if recent_wins else 0.5
        # Average P&L per bet so far, scaled down to keep the feature small.
        avg_pnl_per_bet = (
            self.total_pnl / max(self.total_bets, 1) / 100
        )
edges = [g["estimated_edge"] for g in self.games_today]
return np.array([
self.bankroll / self.initial_bankroll,
self.day / self.season_length,
recent_wr,
self.max_stake_multiplier,
len(self.games_today) / max(self.n_games_per_day, 1),
max(edges) if edges else 0,
np.mean(edges) if edges else 0,
            avg_pnl_per_bet,
self.current_edge / self.base_edge,
])
def _update_account_limits(self) -> None:
"""Check if winning triggers account limitations."""
if len(self.win_history) < 30:
return
recent_wr = np.mean(self.win_history[-50:])
if recent_wr > self.limit_threshold:
self.max_stake_multiplier = max(
0.25, self.max_stake_multiplier * 0.95
)
def step(
self, action: Dict
) -> Tuple[np.ndarray, float, bool, Dict]:
"""Take a betting action and advance the environment.
Args:
action: Dict with 'game_index' (int, -1 to pass) and
'stake_fraction' (float, fraction of bankroll).
Returns:
Tuple of (next_state, reward, done, info).
"""
if action["game_index"] == -1 or not self.games_today:
reward = 0.0
pnl = 0.0
else:
idx = min(action["game_index"], len(self.games_today) - 1)
game = self.games_today[idx]
max_frac = 0.08 * self.max_stake_multiplier
frac = np.clip(action["stake_fraction"], 0, max_frac)
stake = self.bankroll * frac
if np.random.random() < game["true_prob"]:
pnl = stake * (game["decimal_odds"] - 1)
self.total_wins += 1
self.win_history.append(True)
else:
pnl = -stake
self.win_history.append(False)
self.bankroll += pnl
self.total_pnl += pnl
self.total_bets += 1
reward = pnl / self.initial_bankroll
self.games_today.pop(idx)
self._update_account_limits()
if not self.games_today:
self.day += 1
if self.day < self.season_length:
self.current_edge = self.base_edge * (
1 - self.edge_decay_rate * self.day
)
self.games_today = self._generate_games()
done = (
self.day >= self.season_length or self.bankroll <= 100
)
info = {
"bankroll": round(self.bankroll, 2),
"day": self.day,
"total_bets": self.total_bets,
"total_pnl": round(self.total_pnl, 2),
"win_rate": (
self.total_wins / self.total_bets
if self.total_bets > 0 else 0
),
"account_limit": round(self.max_stake_multiplier, 3),
}
return self._get_state(), reward, done, info
class AdaptivePolicy:
"""Learnable policy for betting decisions using REINFORCE.
Maps state vectors to action distributions over discrete
bet-sizing options using a linear softmax policy.
Args:
state_dim: Dimension of the state vector.
n_actions: Number of discrete action choices.
learning_rate: Step size for policy gradient updates.
"""
def __init__(
self,
state_dim: int = 9,
n_actions: int = 6,
learning_rate: float = 0.002,
) -> None:
self.state_dim = state_dim
self.n_actions = n_actions
self.lr = learning_rate
self.weights = np.random.randn(state_dim, n_actions) * 0.01
self.bias = np.zeros(n_actions)
def get_probs(self, state: np.ndarray) -> np.ndarray:
"""Compute action probabilities via softmax.
Args:
state: State vector.
Returns:
Array of action probabilities.
"""
logits = state @ self.weights + self.bias
logits -= logits.max()
exp_l = np.exp(logits)
return exp_l / exp_l.sum()
def act(self, state: np.ndarray) -> Tuple[int, float]:
"""Sample an action from the policy.
Args:
state: State vector.
Returns:
Tuple of (action_index, action_probability).
"""
probs = self.get_probs(state)
action = np.random.choice(self.n_actions, p=probs)
return action, probs[action]
def update(
self,
states: np.ndarray,
actions: np.ndarray,
rewards: np.ndarray,
gamma: float = 0.99,
) -> None:
"""Update policy weights using REINFORCE with baseline.
Args:
states: Array of episode states.
actions: Array of episode actions.
rewards: Array of episode rewards.
gamma: Discount factor.
"""
T = len(rewards)
returns = np.zeros(T)
G = 0.0
for t in reversed(range(T)):
G = rewards[t] + gamma * G
returns[t] = G
if returns.std() > 0:
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Ascend the log-likelihood of each taken action, weighted by the
        # normalized return: d log softmax / d logit_j = 1[j == a] - probs[j].
        for t in range(T):
            probs = self.get_probs(states[t])
            a = actions[t]
            for j in range(self.n_actions):
                grad_logit = (1.0 if j == a else 0.0) - probs[j]
                self.weights[:, j] += self.lr * returns[t] * grad_logit * states[t]
                self.bias[j] += self.lr * returns[t] * grad_logit
class StaticKellyPolicy:
"""Baseline policy using fixed fractional Kelly sizing.
Always bets on the game with the highest edge, sizing at
a fixed fraction of the Kelly optimal.
Args:
kelly_fraction: Fraction of full Kelly to bet.
min_edge: Minimum edge required to bet.
"""
def __init__(
self, kelly_fraction: float = 0.25, min_edge: float = 0.02
) -> None:
self.kelly_frac = kelly_fraction
self.min_edge = min_edge
def act(
self, state: np.ndarray, games: List[Dict]
) -> Dict:
"""Select action using static Kelly rule.
Args:
state: State vector (unused by this policy).
games: List of available games.
Returns:
Action dictionary.
"""
if not games:
return {"game_index": -1, "stake_fraction": 0}
best = games[0]
if best["estimated_edge"] < self.min_edge:
return {"game_index": -1, "stake_fraction": 0}
b = best["decimal_odds"] - 1
p = best["true_prob"]
kelly = (b * p - (1 - p)) / b
if kelly <= 0:
return {"game_index": -1, "stake_fraction": 0}
frac = kelly * self.kelly_frac
return {"game_index": 0, "stake_fraction": frac}
def train_rl_agent(
n_episodes: int = 500,
season_length: int = 90,
) -> Tuple[AdaptivePolicy, List[float], List[float]]:
"""Train an RL agent and compare to static Kelly baseline.
Args:
n_episodes: Number of training episodes.
season_length: Days per simulated season.
Returns:
Tuple of (trained policy, RL returns, Kelly returns).
"""
env = RealisticBettingEnvironment(
initial_bankroll=10000,
season_length=season_length,
base_edge=0.04,
edge_decay_rate=0.0002,
)
action_map = {
0: {"game_index": -1, "stake_fraction": 0.00},
1: {"game_index": 0, "stake_fraction": 0.01},
2: {"game_index": 0, "stake_fraction": 0.02},
3: {"game_index": 0, "stake_fraction": 0.03},
4: {"game_index": 0, "stake_fraction": 0.05},
5: {"game_index": 0, "stake_fraction": 0.07},
}
policy = AdaptivePolicy(state_dim=9, n_actions=6, learning_rate=0.002)
kelly_policy = StaticKellyPolicy(kelly_fraction=0.25, min_edge=0.02)
rl_returns: List[float] = []
kelly_returns: List[float] = []
for ep in range(n_episodes):
state = env.reset()
ep_states, ep_actions, ep_rewards = [], [], []
done = False
while not done:
a_idx, _ = policy.act(state)
action = action_map[a_idx]
next_state, reward, done, info = env.step(action)
ep_states.append(state)
ep_actions.append(a_idx)
ep_rewards.append(reward)
state = next_state
rl_returns.append(info["total_pnl"])
policy.update(
np.array(ep_states), np.array(ep_actions), np.array(ep_rewards)
)
state = env.reset()
done = False
while not done:
kelly_action = kelly_policy.act(state, env.games_today)
state, _, done, info = env.step(kelly_action)
kelly_returns.append(info["total_pnl"])
if (ep + 1) % 100 == 0:
rl_avg = np.mean(rl_returns[-100:])
kelly_avg = np.mean(kelly_returns[-100:])
print(f"Episode {ep + 1}: RL avg=${rl_avg:+.0f}, "
f"Kelly avg=${kelly_avg:+.0f}")
return policy, rl_returns, kelly_returns
def analyze_results(
rl_returns: List[float], kelly_returns: List[float]
) -> None:
"""Print comparison of RL and Kelly baseline performance.
Args:
rl_returns: List of season-end P&L for RL agent.
kelly_returns: List of season-end P&L for Kelly baseline.
"""
print("\n" + "=" * 55)
print("RESULTS: RL Agent vs. Static Kelly Baseline")
print("=" * 55)
for name, returns in [("RL Agent", rl_returns), ("Static Kelly", kelly_returns)]:
arr = np.array(returns)
last_100 = arr[-100:]
print(f"\n{name} (last 100 episodes):")
print(f" Mean P&L: ${np.mean(last_100):>+8,.0f}")
print(f" Median P&L: ${np.median(last_100):>+8,.0f}")
print(f" Std P&L: ${np.std(last_100):>8,.0f}")
print(f" % Profitable: {(last_100 > 0).mean() * 100:>7.1f}%")
print(f" Worst Season: ${np.min(last_100):>+8,.0f}")
print(f" Best Season: ${np.max(last_100):>+8,.0f}")
Training and Results
After 500 episodes of training, the RL agent learned several adaptive behaviors that the static Kelly policy could not capture:
1. Reduced betting during edge decay periods. As the season progressed and edges decayed, the RL agent naturally reduced its betting frequency and stake sizes. Static Kelly continued betting at the same rate, losing money on bets where the edge had evaporated.
2. Account preservation behavior. When the agent's win rate approached the account limitation threshold, it learned to reduce bet frequency, effectively "sandbagging" to maintain account access for higher-edge opportunities later.
3. Bankroll-dependent aggression. When the bankroll was well above the initial level, the agent bet more aggressively on marginal edges. When below, it became more conservative --- a behavior consistent with fractional Kelly but learned from scratch.
4. Opportunity quality discrimination. The agent learned to pass on days with only marginal edges and bet heavily on days with strong opportunities, effectively implementing a dynamic selectivity criterion. The sketch after this list shows one way to probe a trained policy for these behaviors.
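The probe below is a small diagnostic added for illustration (it is not part of the original listing). It assumes the classes defined above are available, builds a few hand-crafted state vectors matching the layout of RealisticBettingEnvironment._get_state, and prints the probability mass the policy puts on passing versus the largest stake (action indices follow the action_map in train_rl_agent):
def probe_policy(policy: AdaptivePolicy) -> None:
    """Print how pass / max-stake probabilities shift across scenarios.

    State layout (see RealisticBettingEnvironment._get_state):
    [bankroll_ratio, season_progress, win_rate, limit_multiplier,
     games_ratio, best_edge, mean_edge, pnl_trend, edge_ratio]
    """
    base = np.array([1.0, 0.1, 0.5, 1.0, 1.0, 0.04, 0.03, 0.0, 1.0])
    scenarios = {
        "early season, full bankroll": {},
        "late season, decayed edges": {1: 0.9, 5: 0.01, 6: 0.005, 8: 0.4},
        "depleted bankroll": {0: 0.4},
        "grown bankroll": {0: 1.8},
    }
    for name, overrides in scenarios.items():
        state = base.copy()
        for idx, value in overrides.items():
            state[idx] = value
        probs = policy.get_probs(state)
        # Action 0 is "pass", action 5 is the 7% stake (see action_map).
        print(f"{name:<28} pass={probs[0]:.2f} max_stake={probs[-1]:.2f}")
If the behaviors described above were actually learned, the pass probability should be noticeably higher in the late-season and depleted-bankroll scenarios than in the early-season one.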
Key Lessons
The RL approach outperformed static Kelly by approximately 15-25% in mean season-end P&L over the last 100 episodes, with comparable variance. The advantage came primarily from two sources:
- Adaptation to edge decay. The RL agent's implicit estimate of the current edge environment was more responsive than a fixed edge threshold.
- Account management. By learning to moderate win rates near the limitation threshold, the agent preserved betting access for the full season while static Kelly triggered earlier limitations.
However, the RL approach had significant limitations:
- Training instability. The policy gradient method produced noisy learning curves and was sensitive to hyperparameters.
- Sample inefficiency. 500 episodes (seasons) were needed for learning, far more than a human bettor would experience.
- Sim-to-real gap. The simplified environment omitted many real-world complexities (multiple sportsbooks, parlays, live betting, model updates).
Discussion Questions
- The RL agent learned to reduce betting frequency as edges decayed. Could you achieve the same effect with a simpler rule (e.g., "reduce bet sizes by 1% per month")? What does the RL agent capture that a simple rule cannot?
- The account limitation mechanism is a simplified version of real sportsbook behavior. How would you make it more realistic? What additional state information would the RL agent need?
- The RL agent uses a discrete action space (6 options). How would a continuous action space change the learning problem? What RL algorithms would be more appropriate?
- If you deployed this RL agent on real money, what safeguards would you implement? How would you handle the agent suggesting bets that violate your risk budget?
- The environment uses a single model edge. In reality, different games have different edges from different models. How would you extend the environment to capture multi-model decision-making?