Appendix D: Data Sources Directory
This appendix catalogs the principal data sources for sports betting research, organized by sport and data type. Availability, pricing, and API specifications change frequently; the information here was current at the time of publication. Always verify terms of service before scraping or redistributing data.
D.1 Multi-Sport Data Providers
D.1.1 Free Sources
Sports Reference Family (sports-reference.com)
- Pro-Football-Reference.com -- Comprehensive NFL statistics from 1920 to present. Game logs, player stats, advanced metrics (ANY/A, DVOA-like), draft data, and coaching records. CSV export available for most tables.
- Basketball-Reference.com -- NBA, ABA, WNBA, and international data. Box scores, advanced stats (PER, BPM, VORP, Win Shares), lineup data, shooting splits.
- Baseball-Reference.com -- MLB data back to 1871. Batting, pitching, fielding stats. WAR calculations (bWAR, with comparisons to fWAR). Play-by-play from 1921.
- Hockey-Reference.com -- NHL statistics from 1917. Skater and goalie stats, advanced metrics, game logs.
- FBref.com -- Global soccer data powered by StatsBomb. Extensive coverage of top European leagues, MLS, international tournaments. Expected goals (xG), passing networks, defensive actions.
- Limitations: Rate-limited; no official API. Terms of service prohibit automated scraping at high volume. Best for manual research and moderate-scale data collection.
ESPN API (site.api.espn.com)
- Undocumented but widely used public endpoints for scores, schedules, standings, and rosters across all major sports. JSON format. No authentication required for basic endpoints.
- Example: site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard
- Coverage: NFL, NBA, MLB, NHL, college sports, soccer, tennis, golf.
- Limitations: No historical odds, limited advanced statistics, endpoints may change without notice.
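A minimal sketch of reading the scoreboard endpoint above with only the standard library. Because the API is undocumented, the payload layout assumed here (`events`, then `competitions`, then `competitors` carrying `homeAway`, `team.abbreviation`, and `score`) is an observation rather than a contract, and `fetch_scoreboard` and `summarize_scoreboard` are hypothetical helper names.

```python
import json
import urllib.request

SCOREBOARD_URL = "https://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard"


def fetch_scoreboard(url=SCOREBOARD_URL, timeout=30):
    """Download the scoreboard JSON and return it as a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_scoreboard(payload):
    """Flatten the (assumed) payload layout into one dict per game."""
    games = []
    for event in payload.get("events", []):
        competitions = event.get("competitions") or [{}]
        sides = {}
        for competitor in competitions[0].get("competitors", []):
            sides[competitor.get("homeAway")] = (
                competitor.get("team", {}).get("abbreviation"),
                competitor.get("score"),
            )
        # Skip events that do not match the assumed home/away layout.
        if "home" in sides and "away" in sides:
            games.append({
                "id": event.get("id"),
                "home": sides["home"][0],
                "away": sides["away"][0],
                "home_score": sides["home"][1],
                "away_score": sides["away"][1],
            })
    return games
```

Calling `summarize_scoreboard(fetch_scoreboard())` would return one row per listed game. Since the endpoints may change without notice, the parser skips anything it does not recognize instead of raising.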
Retrosheet (retrosheet.org)
- Free play-by-play data for every MLB game from 1921 to present. Event files can be parsed with the Chadwick tools. Essential for baseball simulation and historical analysis.
nflverse (github.com/nflverse)
- Community-maintained R and Python packages for NFL data. Includes play-by-play data from nflfastR, roster information, next-gen stats, and draft picks. The nfl_data_py Python package provides easy access.
- Coverage: NFL play-by-play from 1999, with EPA and WPA calculations.
NBA API (nba.com/stats)
- Accessed via the nba_api Python package. Player tracking data, shot charts, lineup combinations, hustle stats, and advanced box scores.
- Limitations: Rate-limited. Headers must include a valid referer.
Kaggle Datasets (kaggle.com)
- Numerous sports datasets contributed by the community. Quality varies. Notable datasets include historical NFL game results with spreads, NBA shot logs, MLB Statcast data, and various soccer datasets.
D.1.2 Paid Sources
Sportradar (sportradar.com)
- Industry-standard data provider. Official data partner of the NFL, NBA, NHL, MLB, and NASCAR. Real-time feeds, play-by-play, player props data, and proprietary advanced metrics.
- Pricing: Enterprise-level; typically $10,000+ annually for research tiers. Developer trials available with limited call volumes.
Stats Perform (statsperform.com)
- Parent company of the Opta brand. Premier soccer data provider. Event-level data with 2,000+ events per match. Expected goals models, player ratings, and possession metrics.
- Also covers basketball, American football, baseball, cricket, tennis.
- Pricing: Enterprise. Academic partnerships available.
Genius Sports (geniussports.com)
- Official data rights holder for many NCAA sports and several international leagues. Live data feeds and trading tools.
StatsBomb (statsbomb.com)
- High-quality soccer event data. Free tier covers select competitions (La Liga, Champions League finals, NWSL). Full product includes 360-degree freeze-frame data showing all player positions at each event.
- Pricing: Tiered; academic access programs exist.
Second Spectrum / Hawk-Eye
- Optical tracking data for NBA and Premier League. Player and ball position at 25 fps. Powers advanced spatial analytics but is extremely expensive and typically limited to teams and media partners.
D.2 Odds and Betting Data
D.2.1 Historical Odds
Odds-Portal (oddsportal.com)
- Historical opening and closing odds from dozens of bookmakers. Covers all major sports globally. Free access for manual use; scraping is against terms of service.
- Sports: Soccer, basketball, hockey, tennis, American football, baseball, and more.
Football-Data.co.uk (football-data.co.uk)
- Free downloadable CSV files with historical match results and bookmaker odds for major European soccer leagues from the mid-1990s to present. Includes Bet365, Pinnacle, and market average odds.
- Essential resource for soccer betting research. Updated weekly during seasons.
Australian Sports Betting (aussportsbetting.com)
- Historical odds data for Australian sports (AFL, NRL, A-League) and international sports. Free CSV downloads. Includes closing Pinnacle lines.
Kaggle / GitHub Community Datasets
- Various scraped odds datasets appear periodically. Notable: the Pinnacle closing line dataset for NFL (covers 2007-present in some versions), and comprehensive soccer odds collections.
SportsBookReview (sportsbookreview.com)
- Historical line movement data for NFL, NBA, MLB, NHL, and college sports. Shows opening and closing lines at major sportsbooks along with public betting percentages.
- Free access with registration; premium tiers available.
D.2.2 Real-Time Odds Feeds
The Odds API (the-odds-api.com)
- Real-time and pre-match odds from 40+ bookmakers via REST API. Covers major US and international sports. Free tier: 500 requests/month. Paid tiers from $20/month.
- Endpoints: head-to-head, spreads, totals, player props.
- Python: requests.get('https://api.the-odds-api.com/v4/sports/americanfootball_nfl/odds/', params=params)
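The `params` dictionary in the call above is left undefined. A minimal sketch of one way to build the request follows; the query parameter names (`apiKey`, `regions`, `markets`, `oddsFormat`) reflect the v4 documentation as understood at the time of writing and should be checked against the current docs. `build_odds_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.the-odds-api.com/v4/sports"


def build_odds_url(sport_key, api_key, regions="us",
                   markets=("h2h", "spreads", "totals"), odds_format="american"):
    """Return the full request URL for current odds on one sport."""
    params = {
        "apiKey": api_key,
        "regions": regions,
        "markets": ",".join(markets),  # comma-separated list of markets
        "oddsFormat": odds_format,     # assumed values: "american" or "decimal"
    }
    return f"{BASE_URL}/{sport_key}/odds/?{urlencode(params)}"


url = build_odds_url("americanfootball_nfl", api_key="YOUR_API_KEY")
print(url)
```

The resulting URL can be passed directly to `requests.get(url)`. With only 500 requests per month on the free tier, it is worth saving each response to disk rather than re-requesting it.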
OddsJam (oddsjam.com)
- Real-time odds comparison across US sportsbooks. Positive EV finder and arbitrage scanner. Subscription required ($99+/month).
BetQL (betql.co)
- Model-driven odds comparison and line movement tracking. Subscription service with free tier.
Pinnacle API
- Pinnacle offers a free API for current odds on all markets. Requires a funded Pinnacle account. Widely regarded as the sharpest lines in the market.
- Documentation: pinnacle.com/en/api
D.2.3 Betting Exchange Data
Betfair Exchange (betfair.com)
- The world's largest betting exchange. Historical data available through the Betfair Historical Data portal (subscription required). Includes tick-by-tick price data, matched volumes, and full order book snapshots.
- API: Free with Betfair account. Supports placing bets, streaming prices, and historical data queries.
- Python: betfairlightweight package.
Betdaq / Smarkets / Matchbook
- Smaller exchanges with their own APIs. Useful for cross-exchange analysis and arbitrage research.
D.3 Sport-Specific Data Sources
D.3.1 NFL
| Source | Type | Cost | Key Data |
|---|---|---|---|
| nflfastR / nfl_data_py | Play-by-play | Free | EPA, WPA, CPOE, air yards |
| NFL Next Gen Stats | Tracking | Free (summary) | Speed, separation, rush lanes |
| Pro Football Focus | Grades + stats | $49.99+/yr | Player grades, snap counts |
| Warren Sharp / SharpFootball | Analytics | $99+/yr | Pace, tendency, situational |
| Football Outsiders | Advanced stats | Free / $49+/yr | DVOA, DAVE, adjusted stats |
D.3.2 NBA
| Source | Type | Cost | Key Data |
|---|---|---|---|
| nba_api (Python) | Official stats | Free | Tracking, shooting, lineups |
| Cleaning the Glass | Advanced stats | $100/yr | Lineup data, four-factors |
| PBPStats.com | Play-by-play | Free | Detailed possession analysis |
| NBA Tracking (Second Spectrum) | Optical | Restricted | Full spatial tracking |
| Basketball Index | Composite metrics | $50/yr | LEBRON, RAPTOR-like stats |
D.3.3 MLB
| Source | Type | Cost | Key Data |
|---|---|---|---|
| Baseball Savant (Statcast) | Pitch-level | Free | Exit velo, launch angle, spin |
| FanGraphs | Advanced stats | Free / $60/yr | WAR, FIP, wRC+, projections |
| Retrosheet | Play-by-play | Free | Historical event files |
| Brooks Baseball | Pitch tracking | Free | PitchFX and Statcast viz |
| Baseball Prospectus | Projections | $49.95/yr | PECOTA, DRC+, catcher framing |
D.3.4 NHL
| Source | Type | Cost | Key Data |
|---|---|---|---|
| NHL API (statsapi.web.nhl.com) | Official stats | Free | Game data, player stats |
| MoneyPuck.com | Advanced stats | Free | xG, WAR, line combinations |
| Natural Stat Trick | Advanced stats | Free | Shot metrics, Corsi, Fenwick |
| Evolving Hockey | Advanced stats | $25/yr | GAR, xG models, contracts |
| HockeyViz.com | Visualizations | Free | Shot maps, impact charts |
D.3.5 Soccer (Association Football)
| Source | Type | Cost | Key Data |
|---|---|---|---|
| FBref.com (StatsBomb) | Match stats | Free | xG, xA, progressive passes |
| Transfermarkt | Market values | Free | Transfer fees, squad values |
| Understat.com | Advanced stats | Free | xG by shot, match xG timeline |
| WhoScored.com | Ratings | Free | Player ratings, match stats |
| Football-Data.co.uk | Results + odds | Free | Historical results and odds |
| InfoGol | xG-based | Subscription | Pre-match xG projections |
| StatsBomb (full) | Event data | Paid | 360 freeze frames, detailed events |
D.3.6 College Sports
| Source | Type | Cost | Key Data |
|---|---|---|---|
| cfbfastR / hoopR | Play-by-play | Free | College football and basketball |
| Massey Ratings | Power ratings | Free | Composite and individual ratings |
| KenPom.com | Basketball stats | $24.95/yr | Adjusted efficiency, tempo |
| TeamRankings | Multi-sport | $49.95+/yr | Rankings, trends, public data |
| Haslametrics | Basketball | Free | Play-by-play derived metrics |
D.4 Web Scraping Best Practices
When official APIs are unavailable, web scraping may be necessary. Follow these guidelines:
- Respect robots.txt. Always check the robots.txt file before scraping. Obey crawl-delay directives.
- Rate limit requests. Use time.sleep() between requests (minimum 2-3 seconds for most sites). Slamming a server can get your IP banned and is inconsiderate.
- Use proper headers. Set a descriptive User-Agent string. Some sites block requests without proper headers.
- Cache aggressively. Store scraped data locally. Never re-scrape data you already have.
- Check terms of service. Many sports data sites explicitly prohibit scraping. Proceed with caution and respect intellectual property.
- Consider using Selenium or Playwright for JavaScript-rendered pages. Many modern sports sites use client-side rendering.
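```python
import time

import requests
from bs4 import BeautifulSoup


def scrape_with_respect(url, delay=3):
    """Fetch a page politely and return it parsed as BeautifulSoup."""
    # Identify the scraper so site operators can tell who is making requests.
    headers = {'User-Agent': 'Sports-Research-Bot/1.0 (academic research)'}
    time.sleep(delay)  # Pause before every request to limit server load.
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error page.
    return BeautifulSoup(response.text, 'html.parser')
```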
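The robots.txt guideline can be automated as well. The sketch below uses the standard library's `urllib.robotparser`; `load_rules` and `polite_delay` are hypothetical helper names, and the rules are passed in as text so the example runs offline. Against a live site you would instead call the parser's `set_url()` and `read()` methods.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "Sports-Research-Bot/1.0"  # illustrative agent name


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, default=3.0, user_agent=USER_AGENT):
    """Use the site's Crawl-delay when it is longer than our default."""
    site_delay = parser.crawl_delay(user_agent)
    if site_delay is None:
        return default
    return max(default, float(site_delay))


# Made-up robots.txt content for illustration.
EXAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rules = load_rules(EXAMPLE_ROBOTS)
print(rules.can_fetch(USER_AGENT, "https://example.com/stats/team.html"))    # True
print(rules.can_fetch(USER_AGENT, "https://example.com/private/odds.html"))  # False
print(polite_delay(rules))                                                   # 5.0
```

Checking `can_fetch` before each request and sleeping for `polite_delay(rules)` seconds covers the first two guidelines without hard-coding per-site rules.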
D.5 Data Quality Considerations
Working with sports betting data requires vigilance about several common issues:
Survivorship bias in odds data. Some historical odds databases only include markets that settled normally, excluding voided bets, postponed games, or walkover results.
Line movement timing. The time at which a line was recorded matters enormously. An "opening line" from one source may differ from another due to different snapshot times. Always document when lines were captured.
Score correction. Official scores can be adjusted after the fact (stat corrections in NFL, for example). Ensure your data source reflects final official results.
Missing data patterns. Data is rarely missing at random. Injured star players may have missing tracking data precisely when their absence most affects outcomes. Imputation strategies must account for this.
Odds format inconsistency. Different sources use different odds formats (American, decimal, fractional) and may or may not include the vig. Always normalize to a consistent format before analysis.
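As an illustration of that normalization, the sketch below converts American and fractional prices to decimal odds and then strips the vig by rescaling implied probabilities to sum to 1. This proportional rescaling is only one of several vig-removal methods, and the function names are hypothetical.

```python
from fractions import Fraction


def american_to_decimal(american):
    """Convert American (moneyline) odds to decimal odds."""
    if american > 0:
        return 1 + american / 100
    return 1 + 100 / abs(american)


def fractional_to_decimal(fractional):
    """Convert fractional odds given as a string such as '5/2'."""
    return 1 + float(Fraction(fractional))


def no_vig_probabilities(decimal_odds):
    """Rescale implied probabilities so they sum to 1 (proportional method)."""
    implied = [1 / d for d in decimal_odds]
    overround = sum(implied)  # exceeds 1 when the prices include vig
    return [p / overround for p in implied]


# A standard -110 / -110 point-spread market.
prices = [american_to_decimal(-110), american_to_decimal(-110)]
print([round(p, 4) for p in prices])                        # [1.9091, 1.9091]
print(round(sum(1 / d for d in prices), 4))                 # 1.0476
print([round(p, 4) for p in no_vig_probabilities(prices)])  # [0.5, 0.5]
```

The overround of about 1.0476 is the bookmaker's margin on this market; after rescaling, each side carries a fair probability of 0.5.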
Time zone issues. Game times may be reported in different time zones across sources. Standardize to UTC or a single local time zone during data cleaning.
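A minimal sketch of the UTC standardization, assuming Python 3.9+ (for the standard-library `zoneinfo` module) and a system time zone database. The helper name `to_utc` and the two example timestamps are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+


def to_utc(local_time, tz_name, fmt="%Y-%m-%d %H:%M"):
    """Interpret a naive local timestamp in tz_name and convert it to UTC."""
    naive = datetime.strptime(local_time, fmt)
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


# The same kickoff as reported by two hypothetical sources.
eastern = to_utc("2024-01-14 13:00", "America/New_York")
pacific = to_utc("2024-01-14 10:00", "America/Los_Angeles")
print(eastern.isoformat())  # 2024-01-14T18:00:00+00:00
print(eastern == pacific)   # True
```

Using named zones rather than fixed offsets means daylight saving transitions are handled by the time zone database instead of by hand.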
D.6 Building Your Data Pipeline
A recommended architecture for betting data management:
- Raw data lake. Store scraped and downloaded data in its original format (CSV, JSON). Never modify raw files.
- Staging layer. Clean, standardize, and validate data. Handle missing values, normalize team names, and convert odds formats.
- Feature store. Pre-compute features used by models: rolling averages, Elo ratings, rest days, travel distance, etc.
- Model inputs. Final merged dataset ready for training and prediction.
- Predictions and results. Store model outputs alongside actual outcomes for ongoing evaluation.
Use version control for code and consider DVC (Data Version Control) for large datasets. Document every transformation step so analyses are reproducible.
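To make the staging and feature-store layers concrete, here is a small sketch that normalizes team names and then computes pre-game Elo ratings and rest days. Everything in it is illustrative: the alias table, the K-factor of 20, the 1500 starting rating, the absence of a home-field adjustment, and the two made-up games are assumptions, not recommended settings.

```python
from datetime import date

# Hypothetical alias table; a real one would cover every source's spellings.
TEAM_ALIASES = {"Los Angeles Rams": "LAR", "LA Rams": "LAR", "SF 49ers": "SF"}


def normalize_team(name):
    """Staging step: map source-specific names to one canonical code."""
    name = name.strip()
    return TEAM_ALIASES.get(name, name)


def rest_days(last_played, team, day):
    """Days since the team's previous game, or None for its first game."""
    return (day - last_played[team]).days if team in last_played else None


def build_features(games, k=20.0, base=1500.0):
    """Feature step: one row of pre-game features per date-sorted game."""
    elo, last_played, rows = {}, {}, []
    for day, home, away, home_score, away_score in games:
        home, away = normalize_team(home), normalize_team(away)
        home_elo, away_elo = elo.get(home, base), elo.get(away, base)
        rows.append({
            "date": day, "home": home, "away": away,
            "home_elo": home_elo, "away_elo": away_elo,
            "home_rest": rest_days(last_played, home, day),
            "away_rest": rest_days(last_played, away, day),
        })
        # Standard Elo update, applied only after the pre-game row is stored.
        expected_home = 1 / (1 + 10 ** ((away_elo - home_elo) / 400))
        result = 1.0 if home_score > away_score else 0.0 if home_score < away_score else 0.5
        elo[home] = home_elo + k * (result - expected_home)
        elo[away] = away_elo - k * (result - expected_home)
        last_played[home] = last_played[away] = day
    return rows


# Made-up results for illustration only.
games = [
    (date(2024, 9, 8), "Los Angeles Rams", "SF 49ers", 24, 17),
    (date(2024, 9, 15), "SF 49ers", "LA Rams", 20, 20),
]
for row in build_features(games):
    print(row)
```

Recording each row before updating the ratings keeps a game's own result out of its features, which is the main leakage risk when pre-computing rolling statistics.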
URLs and availability verified at the time of publication. The companion repository includes scripts for accessing many of these data sources programmatically.