# Scraping FBref for Stats

Beginner · 10 min read · Nov 27, 2025
FBref provides comprehensive soccer statistics powered by StatsBomb and Opta data. While FBref doesn't offer an official API, respectful web scraping can extract valuable data for analysis.

## Setting Up Your Scraper

Use Python with requests and BeautifulSoup for basic scraping:

```python
import time
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_fbref_table(url):
    # Add a delay to be respectful to the server
    time.sleep(3)
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the first stats table and parse it into a DataFrame
    table = soup.find('table')
    df = pd.read_html(StringIO(str(table)))[0]
    return df
```

## Extracting Player Statistics

```python
# Example: Get Premier League player stats
league_url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
player_stats = scrape_fbref_table(league_url)

# Flatten column names if the table uses a multi-level header
if isinstance(player_stats.columns, pd.MultiIndex):
    player_stats.columns = ['_'.join(col).strip()
                            for col in player_stats.columns.values]
```

## Scraping Match Data

```python
def get_match_stats(match_id):
    url = f'https://fbref.com/en/matches/{match_id}/'
    time.sleep(3)
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    response.raise_for_status()
    tables = pd.read_html(StringIO(response.text))

    # Extract team stats, player stats, and shot data
    return {
        'team_stats': tables[0],
        'player_stats': tables[1],
        'shots': tables[-1],
    }
```

## Best Practices

When scraping FBref, follow these guidelines:

- Add delays between requests (3-5 seconds minimum)
- Use an appropriate User-Agent header
- Cache downloaded data to avoid repeated requests
- Respect robots.txt directives
- Consider rate limiting during peak hours
- Store data locally rather than querying repeatedly

## Data Processing

FBref data often requires cleaning:

```python
def clean_fbref_data(df):
    # Drop header rows that repeat inside long tables
    df = df[df['Player'] != 'Player'].copy()

    # Convert columns to numeric where possible, leaving
    # genuinely non-numeric columns (names, teams) untouched
    for col in df.select_dtypes(include=['object']).columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass
    return df
```

Always verify that your scraping practices comply with FBref's terms of service, and consider supporting the site through its subscription service if you use the data extensively.
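The caching advice above can be sketched as a thin wrapper around any fetch function. This is a minimal sketch, assuming a local `cache/` directory; the `fetch_cached` name, the hash-based filename scheme, and the 24-hour freshness window are all choices made here for illustration, not part of any FBref tooling:

```python
import hashlib
import time
from pathlib import Path

CACHE_DIR = Path('cache')  # hypothetical local cache directory
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(url, fetch, max_age=24 * 3600):
    """Return cached HTML for `url` if fresh enough, else call `fetch(url)` and store the result."""
    key = hashlib.sha256(url.encode('utf-8')).hexdigest()
    path = CACHE_DIR / f'{key}.html'

    # Serve from disk if the cached copy is younger than max_age seconds
    if path.exists() and time.time() - path.stat().st_mtime < max_age:
        return path.read_text(encoding='utf-8')

    # Otherwise hit the network once (fetch should include the polite delay)
    html = fetch(url)
    path.write_text(html, encoding='utf-8')
    return html
```

On the first call the page is downloaded and written to disk; repeat calls within `max_age` seconds read the local copy without touching FBref's servers, which is exactly the "store data locally" guideline above.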
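The robots.txt guideline can be checked programmatically with the standard library's `urllib.robotparser`. The helper below is a sketch (the `allowed` function name and the example rules are assumptions); it parses rules from text so it can be tested without a network call, but in practice you would point it at `https://fbref.com/robots.txt`:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, user_agent='*'):
    """Check whether `url` may be fetched under the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Calling `allowed` before each request keeps the scraper from wandering into paths the site has explicitly disallowed.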
