Case Study 1: Building a Polymarket Trading Bot — From Research to Production

Overview

This case study walks through the complete journey of building, testing, and deploying a trading bot on Polymarket — the largest decentralized prediction market platform. We follow a fictional but realistic team of two developers, Mara and Jin, as they go from initial research to a production system generating consistent returns.

The case study is structured chronologically: research phase, development phase, testing phase, deployment phase, and operational lessons learned.


Phase 1: Research and Platform Analysis (Weeks 1-2)

Understanding Polymarket's Architecture

Mara and Jin began by thoroughly studying Polymarket's technical architecture. As discussed in Chapter 8 (Platform Survey) and Chapters 34-36 (Blockchain Integration), Polymarket operates as a hybrid system:

  • Settlement layer: Polygon (an Ethereum L2 chain) provides on-chain settlement
  • Order matching: A centralized Central Limit Order Book (CLOB) matches orders off-chain
  • Token standard: Conditional tokens (ERC-1155) represent YES and NO shares
  • Resolution: UMA's optimistic oracle resolves markets based on real-world outcomes

This hybrid design means the order book feels like a traditional exchange, but settlement and custody are trustless and on-chain. The team noted that this creates a unique latency profile: order matching is fast (like a centralized exchange), but settlement takes seconds to minutes (blockchain confirmation times).

Data Collection and Exploration

The team spent the first two weeks collecting data. They wrote a simple script to poll the Polymarket API every 5 minutes and store snapshots in a SQLite database:

"""Initial data collection script for Polymarket research.

Polls the Polymarket CLOB API at regular intervals and stores
market snapshots for later analysis. This is the foundation of
the training dataset for our predictive models.
"""

import asyncio
import httpx
import sqlite3
import json
from datetime import datetime, timezone

API_BASE = "https://clob.polymarket.com"
DB_PATH = "polymarket_research.db"
POLL_INTERVAL_SECONDS = 300  # 5 minutes


async def collect_data():
    """Main data collection loop."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            market_id TEXT,
            question TEXT,
            yes_price REAL,
            no_price REAL,
            volume_24h REAL,
            liquidity REAL,
            timestamp TEXT,
            raw_json TEXT
        )
    """)
    conn.commit()

    async with httpx.AsyncClient(base_url=API_BASE, timeout=30) as client:
        while True:
            try:
                response = await client.get(
                    "/markets", params={"active": "true", "limit": 500}
                )
                response.raise_for_status()
                data = response.json()

                timestamp = datetime.now(timezone.utc).isoformat()
                count = 0

                for market in data.get("data", []):
                    tokens = market.get("tokens", [])
                    yes_token = next(
                        (t for t in tokens if t.get("outcome") == "Yes"), None
                    )
                    no_token = next(
                        (t for t in tokens if t.get("outcome") == "No"), None
                    )
                    if yes_token and no_token:
                        conn.execute(
                            """INSERT INTO snapshots
                            (market_id, question, yes_price, no_price,
                             volume_24h, liquidity, timestamp, raw_json)
                            VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
                            (
                                market["condition_id"],
                                market.get("question", ""),
                                float(yes_token.get("price", 0.5)),
                                float(no_token.get("price", 0.5)),
                                float(market.get("volume_24hr", 0)),
                                float(market.get("liquidity", 0)),
                                timestamp,
                                json.dumps(market),
                            ),
                        )
                        count += 1

                conn.commit()
                print(f"[{timestamp}] Stored {count} market snapshots")

            except Exception as e:
                print(f"Error: {e}")

            await asyncio.sleep(POLL_INTERVAL_SECONDS)


if __name__ == "__main__":
    asyncio.run(collect_data())

After two weeks, they had over 40,000 snapshots across 200+ markets. They performed exploratory data analysis and discovered several patterns:

  1. Spread distribution: Most active markets had spreads under 3%, but less liquid markets could have spreads above 10%.
  2. Volume clustering: 70% of volume concentrated in the top 20 markets (politics, crypto, major events).
  3. Price autocorrelation: Prices exhibited strong short-term autocorrelation (prices 1 hour ago predict prices now), but weak long-term autocorrelation.
  4. Resolution patterns: Markets typically started resolving 24-48 hours before official resolution, with prices rapidly converging to 0 or 1.
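
As a sanity check on finding 3, a few lines of pandas over the collected snapshots suffice. The sketch below is illustrative: it assumes the snapshots table created by the collection script above, and the 1-hour lag (12 five-minute polls) is a choice, not a requirement.

import sqlite3

import pandas as pd

# Rough autocorrelation check over the research database built above.
conn = sqlite3.connect("polymarket_research.db")
df = pd.read_sql_query(
    "SELECT market_id, yes_price, timestamp FROM snapshots", conn
)
df["timestamp"] = pd.to_datetime(df["timestamp"])

LAG = 12  # 12 snapshots at 5-minute intervals is roughly 1 hour

correlations = []
for market_id, grp in df.groupby("market_id"):
    prices = grp.sort_values("timestamp")["yes_price"]
    if len(prices) > 2 * LAG:  # skip markets with too little history
        correlations.append(prices.autocorr(lag=LAG))

print(f"Median 1-hour autocorrelation: {pd.Series(correlations).median():.3f}")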

Identifying the Edge

The team hypothesized that their edge would come from two sources:

  1. Slow information incorporation: When major news breaks, Polymarket prices take 15-60 minutes to fully adjust. A system that detects news faster and translates it into probability estimates could front-run the market's adjustment.

  2. Cross-platform price discrepancies: The same event might be priced differently on Polymarket, Metaculus, and other platforms. Combining these signals could produce a more accurate estimate than any single market.

They validated the first hypothesis by manually checking price movements around 20 major news events and found that the median price adjustment time was 23 minutes.


Phase 2: Model Development (Weeks 3-6)

Feature Engineering

Based on their research, the team engineered 17 features organized into four groups:

Group 1: Market microstructure (from Chapter 7)
  • Current midpoint price
  • Bid-ask spread
  • Order book depth at 5% from mid
  • Trade imbalance (buy volume minus sell volume over the last hour)

Group 2: Temporal dynamics (from Chapters 13-14)
  • Price return over 1 hour, 4 hours, and 24 hours
  • Price volatility (rolling 24-hour standard deviation)
  • Days until resolution
  • Exponential time decay factor

Group 3: External signals (from Chapter 21)
  • News article count in the last 4 hours
  • News sentiment score (positive minus negative keywords)
  • Google Trends index for relevant keywords

Group 4: Cross-platform consensus (from Chapter 19)
  • Average probability across all platforms listing the same event
  • Standard deviation of probabilities across platforms
  • Metaculus community prediction (when available)
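
To make this concrete, here is a minimal sketch of a few Group 2 features, assuming a per-market history DataFrame like the one the collector produces. The function name, the 30-day decay constant, and the exact return windows are illustrative:

import numpy as np
import pandas as pd


def temporal_features(history: pd.DataFrame, days_to_resolution: float) -> dict:
    """Illustrative Group 2 features for one market.

    Assumes `history` is sorted by timestamp, sampled every 5 minutes
    (12 rows per hour), and has at least 24 hours (288 rows) of data.
    """
    prices = history["yes_price"]
    return {
        # Price returns over 1h / 4h / 24h, in probability points
        "ret_1h": prices.iloc[-1] - prices.iloc[-12],
        "ret_4h": prices.iloc[-1] - prices.iloc[-48],
        "ret_24h": prices.iloc[-1] - prices.iloc[-288],
        # Rolling 24-hour volatility of 5-minute price changes
        "vol_24h": prices.diff().tail(288).std(),
        "days_to_resolution": days_to_resolution,
        # Exponential time decay: signals matter more near resolution
        "time_decay": float(np.exp(-days_to_resolution / 30.0)),
    }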

Training Pipeline

The team built the model training pipeline following the ensemble approach from Chapter 23:

"""Model training pipeline for the Polymarket trading bot.

Trains a calibrated ensemble of logistic regression and XGBoost,
using time-series cross-validation to prevent look-ahead bias.
"""

import sqlite3

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import brier_score_loss
import xgboost as xgb


def prepare_training_data(db_path: str) -> tuple[np.ndarray, np.ndarray]:
    """Load and prepare training data from collected snapshots.

    Labels: 1 if the market resolved YES, 0 if NO.
    Features: The 17-feature vector computed at each snapshot time.

    We only include resolved markets to avoid survivorship bias (Ch 17).
    """
    # In production, load from database and compute features
    # For illustration, we show the structure:
    conn = sqlite3.connect(db_path)
    df = pd.read_sql_query(
        "SELECT * FROM snapshots WHERE market_id IN "
        "(SELECT DISTINCT market_id FROM resolved_markets)",
        conn,
    )

    # Compute features for each snapshot, keeping the timestamp so the
    # samples can be globally time-ordered for TimeSeriesSplit below.
    samples = []

    for market_id in df["market_id"].unique():
        market_data = df[df["market_id"] == market_id].sort_values("timestamp")
        resolution = get_resolution(market_id)  # 1 for YES, 0 for NO
        if resolution is None:
            continue

        for _, row in market_data.iterrows():
            feature_vec = compute_feature_vector(row, market_data, df)
            samples.append((row["timestamp"], feature_vec, resolution))

    # Sort across all markets by time: TimeSeriesSplit assumes chronological
    # order, and per-market grouping would leak future data into training.
    samples.sort(key=lambda s: s[0])
    features = [s[1] for s in samples]
    labels = [s[2] for s in samples]
    return np.array(features), np.array(labels)


def get_resolution(market_id: str) -> int | None:
    """Look up how a market resolved: 1 for YES, 0 for NO, None if unknown."""
    # In production, query the resolution from the database
    return 1  # Placeholder


def compute_feature_vector(
    row: pd.Series, market_history: pd.DataFrame, all_data: pd.DataFrame
) -> np.ndarray:
    """Compute the 17-element feature vector for a single snapshot."""
    # Simplified — in production, each feature is computed from the data
    return np.random.randn(17)  # Placeholder


def train_ensemble(
    X: np.ndarray, y: np.ndarray
) -> dict:
    """Train the logistic regression + XGBoost ensemble.

    Returns a dictionary containing both trained models and
    evaluation metrics from time-series cross-validation.
    """
    tscv = TimeSeriesSplit(n_splits=5)
    results = {"lr_brier": [], "xgb_brier": [], "ensemble_brier": []}

    lr = CalibratedClassifierCV(
        LogisticRegression(C=0.1, max_iter=1000), cv=3, method="isotonic"
    )
    xgb_model = xgb.XGBClassifier(
        n_estimators=200, max_depth=5, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8,
        eval_metric="logloss",  # use_label_encoder was removed in XGBoost 2.x
    )

    for train_idx, val_idx in tscv.split(X):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        lr.fit(X_train, y_train)
        xgb_model.fit(X_train, y_train, verbose=False)

        lr_probs = lr.predict_proba(X_val)[:, 1]
        xgb_probs = xgb_model.predict_proba(X_val)[:, 1]
        ensemble_probs = 0.4 * lr_probs + 0.6 * xgb_probs

        results["lr_brier"].append(brier_score_loss(y_val, lr_probs))
        results["xgb_brier"].append(brier_score_loss(y_val, xgb_probs))
        results["ensemble_brier"].append(brier_score_loss(y_val, ensemble_probs))

    # Final training on all data
    lr.fit(X, y)
    xgb_model.fit(X, y, verbose=False)

    print(f"LR Brier:       {np.mean(results['lr_brier']):.4f} "
          f"+/- {np.std(results['lr_brier']):.4f}")
    print(f"XGB Brier:      {np.mean(results['xgb_brier']):.4f} "
          f"+/- {np.std(results['xgb_brier']):.4f}")
    print(f"Ensemble Brier: {np.mean(results['ensemble_brier']):.4f} "
          f"+/- {np.std(results['ensemble_brier']):.4f}")

    return {"lr": lr, "xgb": xgb_model, "metrics": results}

Calibration Results

The team's initial model achieved a Brier score of 0.198 on the held-out validation set, compared to 0.212 for the market prices alone. In other words, the model's probability estimates were slightly more accurate than the market's: a small but potentially profitable edge.

They verified calibration using a reliability diagram (Chapter 12) and found the model was well-calibrated in the 0.2-0.8 range but overconfident at the extremes. They addressed this by clipping predictions to [0.05, 0.95].
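
Both the reliability check and the clipping fix fit in a few lines. This sketch uses scikit-learn's calibration_curve with ten bins; the bin count is a choice, and the clip bounds are the ones described above:

import numpy as np
from sklearn.calibration import calibration_curve


def check_and_clip(y_true: np.ndarray, probs: np.ndarray) -> np.ndarray:
    """Print a coarse reliability table, then clip extreme predictions."""
    prob_true, prob_pred = calibration_curve(y_true, probs, n_bins=10)
    for pred, actual in zip(prob_pred, prob_true):
        print(f"predicted {pred:.2f} -> observed {actual:.2f}")
    # Overconfident at the extremes, so pull predictions into [0.05, 0.95]
    return np.clip(probs, 0.05, 0.95)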


Phase 3: Backtesting (Weeks 7-8)

Backtest Configuration

The team ran a walk-forward backtest on 6 months of data with the following parameters:

  • Initial capital: $10,000
  • Kelly fraction: 0.25 (quarter Kelly)
  • Minimum edge threshold: 3%
  • Maximum position size: $500
  • Transaction costs: 2% round-trip (1% per side)
  • Slippage: 25 basis points per trade
  • Model retrained every 30 days
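
The sizing rule implied by these parameters can be sketched directly. For a binary contract, buying YES at price p with model probability q has full-Kelly fraction f* = (q - p) / (1 - p), and buying NO at price 1 - p has f* = (p - q) / p. The sketch below scales this by the 0.25 fraction and applies the edge gate and position cap from the list above:

def position_size(q: float, p: float, bankroll: float,
                  kelly_fraction: float = 0.25,
                  min_edge: float = 0.03,
                  max_position: float = 500.0) -> float:
    """Quarter-Kelly stake for a binary market, per the backtest config.

    q is the model probability of YES, p the market YES price. A
    positive result means buy YES, a negative result means buy NO,
    and 0.0 means the edge gate was not met.
    """
    if q - p >= min_edge:         # buy YES at p; contract pays 1 if YES
        f = (q - p) / (1 - p)     # full-Kelly fraction of bankroll
        return min(kelly_fraction * f * bankroll, max_position)
    if p - q >= min_edge:         # buy NO at 1 - p; pays 1 if NO
        f = (p - q) / p
        return -min(kelly_fraction * f * bankroll, max_position)
    return 0.0

With q = 0.60, p = 0.55, and the $10,000 starting capital, this gives 0.25 × (0.05 / 0.45) × $10,000, about $278, comfortably under the $500 cap.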

Results

Metric                   Value
----------------------   --------
Total Return             18.4%
Annualized Return        39.2%
Sharpe Ratio             1.82
Maximum Drawdown         8.3%
Win Rate                 58.2%
Profit Factor            1.67
Total Trades             342
Average Trade P&L        $5.38
Average Holding Period   4.2 days
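
For readers reproducing the table, the less obvious rows can be computed from a trade ledger in a few lines. This is a sketch, assuming an array of daily portfolio returns and an array of per-trade P&L; annualizing with 365 days reflects that these markets trade every day:

import numpy as np


def summary_metrics(daily_returns: np.ndarray, trade_pnl: np.ndarray) -> dict:
    """Sketch of the backtest summary statistics."""
    equity = np.cumprod(1 + daily_returns)   # equity curve from returns
    peak = np.maximum.accumulate(equity)     # running high-water mark
    gross_win = trade_pnl[trade_pnl > 0].sum()
    gross_loss = -trade_pnl[trade_pnl < 0].sum()
    return {
        "sharpe": np.sqrt(365) * daily_returns.mean() / daily_returns.std(),
        "max_drawdown": (1 - equity / peak).max(),
        "win_rate": (trade_pnl > 0).mean(),
        "profit_factor": gross_win / gross_loss,
    }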

The team was cautiously optimistic. The Sharpe ratio of 1.82 was strong, and the maximum drawdown of 8.3% was well within their risk tolerance. However, they noted several caveats:

  1. Survivorship bias risk: They only backtested on markets that had resolved. Markets that were delisted or abandoned were excluded.
  2. Liquidity optimism: The backtest assumed all orders filled at the limit price, which may not reflect reality for less liquid markets.
  3. Regime dependence: The backtest period included a high-activity political season, which may not be representative.

They decided to proceed to paper trading, while keeping these caveats in mind.


Phase 4: Paper Trading (Weeks 9-12)

Paper Trading Setup

The team deployed the bot in dry_run=True mode, simulating orders against live Polymarket prices without actually submitting them. This is the paper-to-live transition discussed in Chapter 18.
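
The dry-run switch itself can be a simple guard in the executor. The sketch below is illustrative; in particular, the client and its post_order method stand in for whatever live CLOB client the bot uses and are not Polymarket's actual API:

class OrderExecutor:
    """Order executor with a paper-trading switch (illustrative)."""

    def __init__(self, client, dry_run: bool = True):
        self.client = client     # hypothetical live CLOB client
        self.dry_run = dry_run
        self.paper_fills = []    # simulated fills, kept for analysis

    def submit(self, market_id: str, side: str, price: float, size: float):
        if self.dry_run:
            # Record the order as if it filled at the limit price.
            self.paper_fills.append((market_id, side, price, size))
            print(f"[paper] {side} {size} @ {price} in {market_id}")
            return None
        # Hypothetical call; the real client interface will differ.
        return self.client.post_order(market_id, side, price, size)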

During four weeks of paper trading, the bot:

  • Processed 672 trading cycles (every hour)
  • Generated 1,847 signals (of which 523 exceeded the edge threshold)
  • "Executed" 412 paper trades
  • Achieved a paper P&L of +$1,240 on a $10,000 paper portfolio (12.4%)

Issues Discovered During Paper Trading

  1. Stale data problem: The API sometimes returned cached data up to 3 minutes old. The bot would generate signals based on stale prices, and by the time it would have executed, the price had moved. Solution: added a staleness check — if data is more than 60 seconds old, skip the cycle.

  2. Market resolution timing: Some markets resolved earlier than their listed end date. The bot was still trying to trade them after resolution. Solution: added a check for markets that have stopped trading.

  3. Concentrated positions: Without adequate diversification limits, the bot accumulated large positions in a single political event cluster. Solution: implemented the category exposure caps from the portfolio constructor.

  4. Feature computation failures: Missing historical data for new markets caused NaN values in feature vectors. Solution: added default values and a minimum history requirement (at least 24 hours of data before trading a market).
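
Fixes 1 and 4 are small enough to show inline. A minimal sketch of both guards, using the thresholds described above (the helper name and its inputs are illustrative):

from datetime import datetime, timezone

MAX_STALENESS_SECONDS = 60   # fix 1: skip the cycle if data is older
MIN_HISTORY_HOURS = 24       # fix 4: require history before trading


def is_tradeable(snapshot_time: datetime, history_hours: float) -> bool:
    """Guards added after paper trading; snapshot_time must be tz-aware."""
    age = (datetime.now(timezone.utc) - snapshot_time).total_seconds()
    if age > MAX_STALENESS_SECONDS:
        return False
    if history_hours < MIN_HISTORY_HOURS:
        return False
    return True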


Phase 5: Live Deployment (Weeks 13+)

Going Live

After fixing all issues found in paper trading, the team deployed with real funds:

  • Initial capital: $2,000 (much smaller than backtest to limit downside during validation)
  • All risk limits set conservatively:
      - Maximum daily loss: $100
      - Maximum position: $200
      - Maximum portfolio exposure: $1,500
      - Circuit breaker on 10% drawdown
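
These limits map naturally onto a small configuration object checked before every order; a sketch with the values above (the class and its check are illustrative, not the team's actual code):

from dataclasses import dataclass


@dataclass
class RiskLimits:
    """Conservative live-deployment limits from the list above."""
    max_daily_loss: float = 100.0
    max_position: float = 200.0
    max_portfolio_exposure: float = 1500.0
    circuit_breaker_drawdown: float = 0.10

    def allows(self, order_size: float, daily_pnl: float,
               exposure: float, drawdown: float) -> bool:
        # Reject the order if any limit would be breached.
        return (order_size <= self.max_position
                and daily_pnl > -self.max_daily_loss
                and exposure + order_size <= self.max_portfolio_exposure
                and drawdown < self.circuit_breaker_drawdown)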

First Month Results

Metric         Paper Trading   Live Trading
------------   -------------   ------------
Total Return   12.4%           7.8%
Sharpe Ratio   1.95            1.34
Max Drawdown   5.1%            6.7%
Win Rate       59.1%           55.3%
Total Trades   412             287

The live results were weaker than paper trading, as expected. The team identified two primary causes:

  1. Execution slippage: Real orders experienced more slippage than the paper trading model assumed, especially in thinner markets.
  2. Latency cost: The time between signal generation and order submission (about 2 seconds) allowed some opportunities to disappear.

Despite the degradation, the system was profitable and the Sharpe ratio remained above 1.0, which the team considered acceptable for an initial deployment.

Ongoing Operations

The team established the following operational rhythm:

  • Daily: Review the dashboard, check for alerts, verify no positions exceeded limits
  • Weekly: Retrain the model with new resolved-market data, review feature importance
  • Monthly: Full performance review, compare live results to backtest, adjust parameters if warranted

Key Lessons Learned

1. Start Small, Scale Gradually

The team's decision to start with $2,000 instead of their full intended capital ($10,000) proved wise. Several issues emerged in the first week of live trading that would have been more costly with a larger portfolio. After three months of stable operation, they gradually increased capital.

2. Paper Trading Is Necessary but Insufficient

Paper trading caught many bugs, but it could not replicate the execution realities of live trading. The gap between paper and live performance (approximately 40% reduction in returns) was consistent with the warnings in Chapter 18.

3. The Edge Is Small and Fragile

Their model's Brier score advantage over the market was only 0.014 (0.198 vs. 0.212). This small edge translated into real profits only because of disciplined position sizing and risk management. Any relaxation of that discipline would have erased the edge.

4. Data Quality Is Everything

At least 30% of the team's development time was spent on data quality: handling missing values, detecting stale data, reconciling discrepancies between API endpoints. This is unglamorous work, but it is the foundation on which everything else rests.

5. Monitoring Prevented Catastrophes

On two occasions, the monitoring system detected anomalies that could have led to significant losses:

  • Once, the API returned prices of 0.0 for all markets (an API bug). The staleness check prevented the bot from interpreting this as a trading signal.
  • Once, a market was resolved in favor of YES, but the API briefly showed it as still active at a price of 0.02. The bot would have bought heavily if not for the minimum-history check.
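
A guard against the first anomaly is worth showing because of how little code it takes (a sketch; the all-zero test is the simplest possible sanity check, and a real system would layer several):

def snapshot_sane(prices: list[float]) -> bool:
    """Reject obviously broken API responses, e.g. all prices 0.0."""
    return bool(prices) and not all(p == 0.0 for p in prices)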

6. Regulatory Awareness Is Non-Negotiable

The team consulted a legal advisor before going live. While Polymarket operates on blockchain and is accessible globally, the legal landscape varies by jurisdiction. The team ensured their operations complied with local regulations, as emphasized in Chapters 38-39.


Architecture Summary

+-------------------------------------------------------------+
|                    POLYMARKET TRADING BOT                   |
+-------------------------------------------------------------+
|                                                             |
|  +----------------+     +-----------------+                 |
|  | Polymarket API |---->| Data Collector  |                 |
|  +----------------+     +--------+--------+                 |
|                                  |                          |
|  +----------------+     +--------v--------+                 |
|  | News API       |---->| Feature Engine  |                 |
|  +----------------+     +--------+--------+                 |
|                                  |                          |
|  +----------------+     +--------v--------+                 |
|  | Metaculus API  |---->| Ensemble Model  |                 |
|  +----------------+     +--------+--------+                 |
|                                  |                          |
|                         +--------v--------+                 |
|                         | Strategy Engine |                 |
|                         | (Kelly sizing)  |                 |
|                         +--------+--------+                 |
|                                  |                          |
|                         +--------v--------+                 |
|                         | Risk Manager    |                 |
|                         +--------+--------+                 |
|                                  |                          |
|                         +--------v--------+                 |
|                         | Order Executor  |----> Polymarket |
|                         +-----------------+      CLOB API   |
|                                                             |
+-------------------------------------------------------------+

Discussion Questions

  1. The team's backtest showed a 39.2% annualized return, but live trading achieved closer to 25% annualized. What additional factors might explain this gap beyond execution slippage and latency?

  2. The model's edge was only 0.014 Brier score improvement. Is this enough to be confident the edge is real and not just noise? What statistical test would you use to assess significance?

  3. The team started with $2,000 despite backtesting with $10,000. How would you decide when it is safe to scale up? What metrics would you monitor?

  4. Polymarket uses UMA's optimistic oracle for resolution. What risks does this introduce that a centrally resolved market (like Kalshi) would not have? How would you mitigate them?

  5. The team chose a 1-hour trading cycle. Under what circumstances would a shorter cycle (e.g., 5 minutes) or a longer cycle (e.g., 4 hours) be preferable?