Appendix B: Python Setup and Libraries
This appendix provides setup instructions and library references for the code examples in this textbook.
Python Installation
Recommended: Anaconda Distribution
- Download from https://www.anaconda.com/download
- Install with default settings
- Includes Python, Jupyter, and common libraries
Alternative: Standard Python
- Download from https://www.python.org/downloads/
- Install Python 3.9+
- Use pip for library installation
Required Libraries
Core Data Analysis
pip install pandas numpy scipy
| Library | Version | Purpose |
|---|---|---|
| pandas | 1.5+ | Data manipulation |
| numpy | 1.24+ | Numerical computing |
| scipy | 1.10+ | Statistical functions |
Visualization
pip install matplotlib seaborn plotly
| Library | Version | Purpose |
|---|---|---|
| matplotlib | 3.7+ | Basic plotting |
| seaborn | 0.12+ | Statistical plots |
| plotly | 5.15+ | Interactive charts |
Machine Learning
pip install scikit-learn
| Library | Version | Purpose |
|---|---|---|
| scikit-learn | 1.2+ | ML algorithms |
NFL Data
pip install nfl-data-py sportsipy
| Library | Version | Purpose |
|---|---|---|
| nfl-data-py | 0.3+ | nflfastR Python port |
| sportsipy | 0.6+ | Sports reference data |
Web/API
pip install requests beautifulsoup4
| Library | Version | Purpose |
|---|---|---|
| requests | 2.28+ | HTTP requests |
| beautifulsoup4 | 4.12+ | HTML parsing |
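A minimal sketch of the requests + BeautifulSoup pattern. To stay self-contained it parses a hardcoded HTML snippet (the `stats` table id and its contents are made up); in practice the HTML would come from `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# In real use: html = requests.get(url).text
html = """
<table id="stats">
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>KC</td><td>14</td></tr>
  <tr><td>BUF</td><td>11</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
# Skip the header row, then pull text out of each data cell
for tr in soup.find("table", id="stats").find_all("tr")[1:]:
    team, wins = [td.get_text() for td in tr.find_all("td")]
    rows.append({"team": team, "wins": int(wins)})

print(rows)
```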
Complete Environment Setup
Using requirements.txt
Create requirements.txt:
pandas>=1.5.0
numpy>=1.24.0
scipy>=1.10.0
matplotlib>=3.7.0
seaborn>=0.12.0
plotly>=5.15.0
scikit-learn>=1.2.0
nfl-data-py>=0.3.0
sportsipy>=0.6.0
requests>=2.28.0
beautifulsoup4>=4.12.0
jupyter>=1.0.0
Install:
pip install -r requirements.txt
Using Conda Environment
conda create -n nfl-analytics python=3.10
conda activate nfl-analytics
conda install pandas numpy scipy matplotlib seaborn scikit-learn jupyter
pip install nfl-data-py
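After installing, it can help to verify that the packages above are present and meet the version floors. A small sketch using the standard library's `importlib.metadata` (the package list mirrors this appendix; adjust as needed):

```python
from importlib.metadata import version, PackageNotFoundError

def check_versions(packages):
    """Return a dict mapping each package name to its installed version, or None."""
    installed = {}
    for pkg in packages:
        try:
            installed[pkg] = version(pkg)
        except PackageNotFoundError:
            installed[pkg] = None
    return installed

print(check_versions(["pandas", "numpy", "scipy", "nfl-data-py"]))
```

Note that `importlib.metadata` takes the pip distribution name (`nfl-data-py`, `scikit-learn`), not the import name (`nfl_data_py`, `sklearn`).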
Jupyter Notebook Setup
Starting Jupyter
jupyter notebook
# or
jupyter lab
Useful Magic Commands
# Display plots inline
%matplotlib inline
# Auto-reload modules
%load_ext autoreload
%autoreload 2
# Show all output (not just last)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Library Quick Reference
Pandas Basics
import pandas as pd
# Read data
df = pd.read_csv('data.csv')
# Basic operations
df.head() # First 5 rows
df.describe() # Summary stats
df.info() # Column types
# Filtering
df[df['column'] > value]
df.query('column > @value')
# Grouping
df.groupby('team').mean(numeric_only=True)
df.groupby(['team', 'year']).agg({'pts': 'sum'})
# Merging
pd.merge(df1, df2, on='key')
df1.join(df2, on='key')  # matches df1['key'] against df2's index
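The operations above can be combined into a small end-to-end example. The dataset here is a tiny made-up stand-in for real NFL data:

```python
import pandas as pd

# Illustrative points-scored data (values are made up)
df = pd.DataFrame({
    "team": ["KC", "KC", "BUF", "BUF"],
    "year": [2022, 2023, 2022, 2023],
    "pts": [496, 371, 455, 451],
})

# Filtering: seasons with more than 450 points
high_scoring = df[df["pts"] > 450]

# Grouping: total points per team across seasons
pts_by_team = df.groupby("team")["pts"].sum()

print(high_scoring)
print(pts_by_team)
```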
NumPy Basics
import numpy as np
# Array creation
arr = np.array([1, 2, 3])
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
# Statistics
np.mean(arr)
np.std(arr)
np.percentile(arr, 75)
# Random
np.random.normal(mean, std, size)
np.random.choice(arr, size, replace=True)
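A quick worked example of the statistics and random functions, using NumPy's newer `default_rng` interface with a fixed seed so results are reproducible (the distribution parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

# Simulate 10,000 scores from a normal distribution
scores = rng.normal(loc=23, scale=10, size=10_000)

print(np.mean(scores))            # close to 23
print(np.std(scores))             # close to 10
print(np.percentile(scores, 75))  # upper quartile
```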
Matplotlib Basics
import matplotlib.pyplot as plt
# Basic plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, label='Series')
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.title('Title')
plt.legend()
plt.savefig('plot.png', dpi=150)
plt.show()
# Scatter plot
plt.scatter(x, y, c=colors, s=sizes)
# Histogram
plt.hist(data, bins=30, edgecolor='black')
# Bar chart
plt.bar(categories, values)
Seaborn Basics
import seaborn as sns
# Distribution
sns.histplot(df['column'], kde=True)
sns.kdeplot(df['column'])
# Relationships
sns.scatterplot(data=df, x='x', y='y', hue='group')
sns.regplot(data=df, x='x', y='y')
# Categorical
sns.boxplot(data=df, x='category', y='value')
sns.barplot(data=df, x='category', y='value')
# Heatmap
sns.heatmap(correlation_matrix, annot=True)
Scikit-Learn Basics
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Linear regression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
# Logistic regression
clf = LogisticRegression()
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
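Putting the pieces together, here is a minimal runnable sketch of the fit/predict/evaluate workflow on synthetic data (the true coefficients 3.0 and -1.5 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on two features plus noise
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {mse:.3f}")  # should be near the noise variance (0.25)
```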
NFL Data Access
Using nfl-data-py
import nfl_data_py as nfl
# Load play-by-play
pbp = nfl.import_pbp_data([2022, 2023])
# Load schedules
schedules = nfl.import_schedules([2022, 2023])
# Load rosters
rosters = nfl.import_rosters([2022, 2023])
# Load combine data
combine = nfl.import_combine_data([2020, 2021, 2022])
Common Columns (Play-by-Play)
| Column | Description |
|---|---|
| play_id | Unique play identifier |
| game_id | Game identifier |
| posteam | Team with possession |
| defteam | Defensive team |
| down | Current down |
| ydstogo | Yards to first down |
| yardline_100 | Yards from endzone |
| play_type | Type of play |
| yards_gained | Yards gained |
| epa | Expected Points Added |
| wp | Win probability |
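The columns above support a typical first analysis: average EPA per play by offense. A sketch on a few synthetic rows mimicking the play-by-play schema (real data would come from `nfl.import_pbp_data(...)`, and the EPA values here are invented):

```python
import pandas as pd

# Synthetic stand-in for play-by-play data
pbp = pd.DataFrame({
    "posteam":   ["KC", "KC", "BUF", "BUF", "KC"],
    "play_type": ["pass", "run", "pass", "pass", "pass"],
    "epa":       [0.8, -0.3, 1.2, -0.5, 0.4],
})

# Mean EPA per pass play, by offense
pass_epa = (
    pbp[pbp["play_type"] == "pass"]
    .groupby("posteam")["epa"]
    .mean()
    .sort_values(ascending=False)
)
print(pass_epa)
```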
Project Structure
Recommended directory structure:
nfl-analytics-project/
├── data/
│ ├── raw/
│ └── processed/
├── notebooks/
│ ├── 01-data-exploration.ipynb
│ └── 02-analysis.ipynb
├── src/
│ ├── __init__.py
│ ├── data_loading.py
│ ├── analysis.py
│ └── visualization.py
├── tests/
│ └── test_analysis.py
├── requirements.txt
└── README.md
Troubleshooting
Common Issues
Import Errors:
pip install --upgrade <library>
Version Conflicts:
pip install <library>==<version>
Memory Issues:
- Use chunksize parameter in pd.read_csv()
- Filter data early in pipeline
- Use appropriate dtypes
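The first and third points above can be sketched together: read in chunks and aggregate incrementally rather than holding everything in memory, with explicit dtypes. A small in-memory CSV stands in for a large file on disk:

```python
import io
import pandas as pd

# Fake CSV standing in for a large file; column names are illustrative
csv_data = "team,pts\n" + "\n".join(f"T{i % 4},{i}" for i in range(1000))

total = 0
n_rows = 0
# chunksize yields DataFrames of at most 250 rows at a time;
# explicit dtypes keep per-chunk memory usage predictable
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=250,
                         dtype={"team": "category", "pts": "int32"}):
    total += chunk["pts"].sum()
    n_rows += len(chunk)

print(n_rows, total)
```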
Plotting Not Showing:
%matplotlib inline
plt.show()
This appendix provides the technical foundation for running code examples. Refer back as needed when setting up your environment.