Case Study 1: Setting Up a Team Analytics Environment
Overview
Scenario: You've been hired as a junior analyst for the Milwaukee Bucks. Your first task is to set up a standardized analytics environment that the entire analytics team can use. The environment must be reproducible, well-documented, and support the team's diverse analytical workflows.
Duration: 2-3 hours
Difficulty: Beginner to Intermediate
Prerequisites: Basic command line familiarity
Background
The Bucks analytics department currently faces several challenges:
- Team members use different Python versions, causing compatibility issues
- Package versions vary between machines, leading to inconsistent results
- New analysts spend days setting up their environments
- There's no standardized project structure
- Documentation is scattered and outdated
Your task is to create a standardized environment setup that addresses these issues.
Part 1: Requirements Gathering
1.1 Stakeholder Interviews
After meeting with team members, you've identified the following analytical workflows:
Video Analyst (Sarah) - Needs: matplotlib, opencv-python for frame analysis - "I spend hours each week extracting shot clock data from video."
Statistical Analyst (Marcus) - Needs: pandas, scipy, statsmodels for hypothesis testing - "I'm running regression models on player performance data."
Machine Learning Engineer (Priya) - Needs: scikit-learn, xgboost, tensorflow for predictive models - "My models need to be reproducible for auditing."
Data Engineer (James) - Needs: nba_api, requests, sqlalchemy for data pipelines - "I pull data from multiple sources daily."
1.2 Common Requirements Matrix
| Package | Video | Stats | ML | Data | Required |
|---|---|---|---|---|---|
| pandas | Yes | Yes | Yes | Yes | Core |
| numpy | Yes | Yes | Yes | Yes | Core |
| matplotlib | Yes | Yes | Yes | No | Core |
| seaborn | No | Yes | Yes | No | Core |
| scipy | No | Yes | Yes | No | Standard |
| scikit-learn | No | Yes | Yes | No | Standard |
| statsmodels | No | Yes | No | No | Standard |
| nba_api | No | Yes | Yes | Yes | Standard |
| jupyter | Yes | Yes | Yes | Yes | Core |
| opencv-python | Yes | No | No | No | Optional |
| xgboost | No | No | Yes | No | Optional |
| tensorflow | No | No | Yes | No | Optional |
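To keep the tier assignments machine-checkable, the matrix above can also be expressed as data. A minimal sketch (the dict layout and the `packages_for` helper are illustrative, not part of the deliverables):

```python
# Requirements matrix as data: package -> (tier, roles that need it).
# Tiers and role assignments mirror the table above.
MATRIX = {
    "pandas":        ("Core",     {"video", "stats", "ml", "data"}),
    "numpy":         ("Core",     {"video", "stats", "ml", "data"}),
    "matplotlib":    ("Core",     {"video", "stats", "ml"}),
    "seaborn":       ("Core",     {"stats", "ml"}),
    "scipy":         ("Standard", {"stats", "ml"}),
    "scikit-learn":  ("Standard", {"stats", "ml"}),
    "statsmodels":   ("Standard", {"stats"}),
    "nba_api":       ("Standard", {"stats", "ml", "data"}),
    "jupyter":       ("Core",     {"video", "stats", "ml", "data"}),
    "opencv-python": ("Optional", {"video"}),
    "xgboost":       ("Optional", {"ml"}),
    "tensorflow":    ("Optional", {"ml"}),
}

def packages_for(role):
    """Return the sorted list of packages a given role needs."""
    return sorted(pkg for pkg, (_, roles) in MATRIX.items() if role in roles)

print(packages_for("video"))
# → ['jupyter', 'matplotlib', 'numpy', 'opencv-python', 'pandas']
```

Keeping the matrix in one place like this makes it easy to spot when a tier drifts out of sync with what the team actually uses.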
Part 2: Implementation
2.1 Creating the Base Environment
Step 1: Create the Project Directory
# Create the main project directory
mkdir bucks_analytics
cd bucks_analytics
# Create subdirectory structure (brace expansion requires bash or zsh;
# on Windows, use the setup script instead)
mkdir -p {data/{raw,processed,external},notebooks,src/{data,features,models,visualization},tests,output/{figures,reports},docs}
# Create placeholder files
touch src/__init__.py
touch src/data/__init__.py
touch src/features/__init__.py
touch src/models/__init__.py
touch src/visualization/__init__.py
Step 2: Create the Virtual Environment
# Create virtual environment
python -m venv venv
# Activate the environment (Windows)
venv\Scripts\activate
# Activate the environment (macOS/Linux)
source venv/bin/activate
# Upgrade pip ("python -m pip" lets pip replace itself, which matters on Windows)
python -m pip install --upgrade pip
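After activation, it's worth confirming that the interpreter you're running actually belongs to the venv. One way (a small sketch, not part of the official setup) is to compare `sys.prefix` against `sys.base_prefix`:

```python
import sys

def in_virtualenv():
    """True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If this prints `False` after you thought you activated, the shell is still picking up a system Python, usually because activation happened in a different terminal session.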
2.2 Creating Tiered Requirements Files
requirements-core.txt (Essential for all team members)
# Core Data Science Stack
pandas>=2.0.0,<3.0.0
numpy>=1.24.0,<2.0.0
scipy>=1.11.0,<2.0.0
# Visualization
matplotlib>=3.7.0,<4.0.0
seaborn>=0.12.0,<1.0.0
# Jupyter Environment
jupyter>=1.0.0
jupyterlab>=4.0.0
notebook>=7.0.0
ipywidgets>=8.0.0
# NBA Data Access
nba_api>=1.2.0
# Data Handling
openpyxl>=3.1.0
pyarrow>=12.0.0
requests>=2.31.0
# Utilities
python-dotenv>=1.0.0
tqdm>=4.65.0
requirements-stats.txt (For statistical analysis)
# Include core requirements
-r requirements-core.txt
# Statistical Modeling
statsmodels>=0.14.0
scikit-learn>=1.3.0
pingouin>=0.5.0
# Advanced Statistics
lifelines>=0.27.0 # Survival analysis
requirements-ml.txt (For machine learning workflows)
# Include stats requirements
-r requirements-stats.txt
# Machine Learning
xgboost>=1.7.0
lightgbm>=4.0.0
# Model Evaluation
shap>=0.42.0
# Optional: Deep Learning (commented out by default)
# tensorflow>=2.13.0
# torch>=2.0.0
requirements-dev.txt (For development and testing)
# Include core requirements
-r requirements-core.txt
# Testing
pytest>=7.4.0
pytest-cov>=4.1.0
# Code Quality
black>=23.7.0
flake8>=6.1.0
isort>=5.12.0
mypy>=1.5.0
# Documentation
sphinx>=7.0.0
sphinx-rtd-theme>=1.3.0
# Pre-commit hooks
pre-commit>=3.3.0
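Range specifiers keep installs flexible, but exact reproducibility calls for a lock file as well. Besides running `pip freeze`, the installed versions can be captured from the standard library alone; a sketch (the `requirements-lock.txt` filename is just a suggestion):

```python
from importlib.metadata import distributions

def lock_lines():
    """Return sorted 'name==version' lines for every installed distribution."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in distributions()
        if dist.metadata["Name"]  # skip entries with broken metadata
    )

# Write an exact snapshot that `pip install -r requirements-lock.txt` can replay
with open("requirements-lock.txt", "w") as f:
    f.write("\n".join(lock_lines()) + "\n")
print(f"Locked {len(lock_lines())} packages")
```

Committing a lock file alongside the tiered range files gives the team both: ranges for day-to-day flexibility, exact pins for auditing a specific analysis.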
2.3 Creating the Setup Script
setup_environment.py
#!/usr/bin/env python
"""
Milwaukee Bucks Analytics Environment Setup Script
This script automates the setup of the analytics environment,
including virtual environment creation and package installation.
Usage:
python setup_environment.py [--profile PROFILE]
Profiles:
core - Basic data science stack (default)
stats - Statistical analysis packages
ml - Machine learning packages
dev - Development and testing tools
all - Everything
"""
import subprocess
import sys
import argparse
from pathlib import Path
def run_command(command, description):
"""Execute a shell command and handle errors."""
print(f"\n{'='*60}")
print(f" {description}")
print(f"{'='*60}")
try:
result = subprocess.run(
command,
shell=True,
check=True,
capture_output=True,
text=True
)
if result.stdout:
print(result.stdout)
return True
except subprocess.CalledProcessError as e:
print(f"Error: {e.stderr}")
return False
def check_python_version():
"""Verify Python version meets requirements."""
version = sys.version_info
if version.major < 3 or (version.major == 3 and version.minor < 10):
print(f"Error: Python 3.10+ required. Found {version.major}.{version.minor}")
return False
print(f"Python version: {version.major}.{version.minor}.{version.micro} (OK)")
return True
def create_directories():
"""Create the project directory structure."""
directories = [
'data/raw',
'data/processed',
'data/external',
'notebooks',
'src/data',
'src/features',
'src/models',
'src/visualization',
'tests',
'output/figures',
'output/reports',
'docs'
]
    for dir_path in directories:
        Path(dir_path).mkdir(parents=True, exist_ok=True)
        # Only the src subdirectories need __init__.py to be importable packages
        if dir_path.startswith('src/'):
            (Path(dir_path) / '__init__.py').touch(exist_ok=True)
    Path('src/__init__.py').touch(exist_ok=True)
    print("Directory structure created successfully")
    return True
def create_virtual_environment():
"""Create a new virtual environment."""
venv_path = Path('venv')
if venv_path.exists():
print("Virtual environment already exists")
return True
return run_command(
f"{sys.executable} -m venv venv",
"Creating virtual environment"
)
def get_python_command():
    """Get the venv's Python interpreter path for the current OS."""
    if sys.platform == 'win32':
        return r'venv\Scripts\python'
    return 'venv/bin/python'
def install_requirements(profile):
    """Install requirements based on selected profile."""
    # Run pip as "python -m pip" so pip can upgrade itself, even on Windows
    pip = f"{get_python_command()} -m pip"
    run_command(f"{pip} install --upgrade pip", "Upgrading pip")
# Map profiles to requirements files
profile_map = {
'core': ['requirements-core.txt'],
'stats': ['requirements-stats.txt'],
'ml': ['requirements-ml.txt'],
'dev': ['requirements-dev.txt'],
'all': ['requirements-ml.txt', 'requirements-dev.txt']
}
requirements_files = profile_map.get(profile, ['requirements-core.txt'])
for req_file in requirements_files:
if Path(req_file).exists():
success = run_command(
f"{pip} install -r {req_file}",
f"Installing packages from {req_file}"
)
if not success:
return False
return True
def create_gitignore():
"""Create a comprehensive .gitignore file."""
gitignore_content = """# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual Environment
venv/
.venv/
ENV/
# Jupyter Notebooks
.ipynb_checkpoints/
*.ipynb_checkpoints/
# IDE
.idea/
.vscode/
*.swp
*.swo
*.sublime-*
# OS
.DS_Store
Thumbs.db
*.bak
# Project specific
data/raw/*
data/external/*
!data/raw/.gitkeep
!data/external/.gitkeep
output/*
!output/.gitkeep
*.log
*.csv
*.xlsx
*.parquet
# Credentials
.env
.env.local
secrets.json
credentials/
# Model files (often too large)
*.pkl
*.joblib
*.h5
*.pt
*.pth
"""
with open('.gitignore', 'w') as f:
f.write(gitignore_content)
print(".gitignore created successfully")
return True
def create_readme():
"""Create a README.md file."""
readme_content = """# Milwaukee Bucks Analytics Environment
## Quick Start
1. **Clone the repository**
```bash
git clone <repository-url>
cd bucks_analytics
```
2. **Run the setup script**
```bash
python setup_environment.py --profile stats
```
3. **Activate the environment**
```bash
# Windows
venv\\Scripts\\activate
# macOS/Linux
source venv/bin/activate
```
4. **Start Jupyter**
```bash
jupyter lab
```
## Installation Profiles
- `core`: Basic data science stack
- `stats`: Statistical analysis packages
- `ml`: Machine learning packages
- `dev`: Development and testing tools
- `all`: Everything
## Project Structure
```
bucks_analytics/
├── data/
│   ├── raw/            # Original data (not tracked)
│   ├── processed/      # Cleaned data
│   └── external/       # Third-party data
├── notebooks/          # Jupyter notebooks
├── src/
│   ├── data/           # Data loading utilities
│   ├── features/       # Feature engineering
│   ├── models/         # Model definitions
│   └── visualization/  # Plotting functions
├── tests/              # Unit tests
├── output/
│   ├── figures/        # Generated plots
│   └── reports/        # Analysis reports
└── docs/               # Documentation
```
## Data Sources
- NBA API (via nba_api library)
- Internal databases (credentials required)
- Basketball-Reference (rate-limited scraping)
## Contributing
1. Create a feature branch
2. Make your changes
3. Run tests: `pytest tests/`
4. Submit a pull request
## Questions?
Contact the analytics team at analytics@bucks.com
"""
with open('README.md', 'w') as f:
f.write(readme_content)
print("README.md created successfully")
return True
def verify_installation():
"""Verify that key packages are installed correctly."""
print("\n" + "="*60)
print(" Verifying Installation")
print("="*60 + "\n")
packages = [
('pandas', 'pandas'),
('numpy', 'numpy'),
('matplotlib', 'matplotlib'),
('scikit-learn', 'sklearn'),
('nba_api', 'nba_api'),
('jupyter', 'jupyter'),
]
if sys.platform == 'win32':
python = r'venv\Scripts\python'
else:
python = 'venv/bin/python'
all_ok = True
for package_name, import_name in packages:
try:
result = subprocess.run(
f'{python} -c "import {import_name}; print({import_name}.__version__)"',
shell=True,
capture_output=True,
text=True,
check=True
)
version = result.stdout.strip()
print(f" [OK] {package_name}: {version}")
except subprocess.CalledProcessError:
print(f" [FAILED] {package_name}")
all_ok = False
return all_ok
def main():
"""Main entry point for environment setup."""
parser = argparse.ArgumentParser(
description='Set up the Bucks Analytics environment'
)
parser.add_argument(
'--profile',
choices=['core', 'stats', 'ml', 'dev', 'all'],
default='core',
help='Installation profile (default: core)'
)
args = parser.parse_args()
print("\n" + "="*60)
print(" Milwaukee Bucks Analytics Environment Setup")
print("="*60)
print(f"\nProfile: {args.profile}")
# Run setup steps
steps = [
(check_python_version, "Checking Python version"),
(create_directories, "Creating directory structure"),
(create_virtual_environment, "Creating virtual environment"),
(lambda: install_requirements(args.profile), "Installing packages"),
(create_gitignore, "Creating .gitignore"),
(create_readme, "Creating README.md"),
(verify_installation, "Verifying installation"),
]
for step_func, step_name in steps:
print(f"\n{'='*60}")
print(f" {step_name}")
print(f"{'='*60}")
if not step_func():
print(f"\nSetup failed at: {step_name}")
return 1
print("\n" + "="*60)
print(" Setup Complete!")
print("="*60)
print("\nNext steps:")
print(" 1. Activate the environment:")
if sys.platform == 'win32':
print(" venv\\Scripts\\activate")
else:
print(" source venv/bin/activate")
print(" 2. Start Jupyter Lab:")
print(" jupyter lab")
print(" 3. Open notebooks/01_getting_started.ipynb")
return 0
if __name__ == '__main__':
sys.exit(main())
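The script's `run_command` wraps `subprocess.run` with `check=True` and surfaces stderr on failure rather than crashing. The same pattern in isolation (`run` here is a hypothetical standalone helper; `sys.executable` keeps the demo portable):

```python
import subprocess
import sys

def run(command, description):
    """Run a shell command, reporting stderr on failure instead of raising."""
    try:
        subprocess.run(command, shell=True, check=True,
                       capture_output=True, text=True)
        print(f"{description}: ok")
        return True
    except subprocess.CalledProcessError as exc:
        # check=True turns a nonzero exit code into this exception
        print(f"{description}: failed\n{exc.stderr}")
        return False

# One command that succeeds, one that fails
assert run(f'"{sys.executable}" -c "print(42)"', "sanity check")
assert not run(f'"{sys.executable}" -c "import nonexistent_module"', "expected failure")
```

Returning a boolean rather than raising lets the caller (like `main` above) decide whether a failed step should abort the whole setup.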
Part 3: Documentation and Onboarding
3.1 Creating an Onboarding Notebook
Create notebooks/01_getting_started.ipynb with the following cells:
Cell 1 (Markdown):
# Welcome to Bucks Analytics!
This notebook verifies your environment setup and introduces our analytics workflow.
## Running This Notebook
1. Make sure you've run the setup script
2. Activated your virtual environment
3. Started Jupyter Lab
Let's verify everything is working correctly.
Cell 2 (Code):
# Verify imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nba_api.stats.static import players, teams
print("All imports successful!")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
Cell 3 (Code):
# Test NBA API connection
all_players = players.get_players()
# Note: this filters to all active NBA players, not just the Bucks roster
active_players = [p for p in all_players if p['is_active']]
print(f"Found {len(active_players)} active NBA players")
print("\nSample players:")
for player in active_players[:5]:
    print(f" - {player['full_name']}")
Cell 4 (Code):
# Test visualization
fig, ax = plt.subplots(figsize=(10, 6))
# Sample data
positions = ['PG', 'SG', 'SF', 'PF', 'C']
avg_points = [18.5, 17.2, 15.8, 14.3, 12.1]
ax.bar(positions, avg_points, color='#00471B') # Bucks green
ax.set_xlabel('Position')
ax.set_ylabel('Average Points')
ax.set_title('NBA Average Points by Position (Sample)')
plt.tight_layout()
plt.show()
print("Visualization working correctly!")
Part 4: Testing the Setup
4.1 Creating a Test Script
tests/test_environment.py
"""
Environment verification tests.
Run with: pytest tests/test_environment.py -v
"""
import pytest
def version_tuple(version):
    """Convert a version string like '2.1.3' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split('.') if part.isdigit())
class TestCorePackages:
    """Test that core packages are installed and functional."""
    def test_pandas_import(self):
        import pandas as pd
        assert version_tuple(pd.__version__) >= (2, 0, 0)
    def test_numpy_import(self):
        import numpy as np
        assert version_tuple(np.__version__) >= (1, 24, 0)
    def test_matplotlib_import(self):
        import matplotlib
        assert version_tuple(matplotlib.__version__) >= (3, 7, 0)
    def test_seaborn_import(self):
        import seaborn as sns
        assert version_tuple(sns.__version__) >= (0, 12, 0)
class TestNBAAPI:
"""Test NBA API connectivity."""
def test_nba_api_import(self):
from nba_api.stats.static import players
all_players = players.get_players()
assert len(all_players) > 0
def test_find_player(self):
from nba_api.stats.static import players
giannis = players.find_players_by_full_name("Giannis Antetokounmpo")
assert len(giannis) == 1
assert giannis[0]['id'] == 203507
class TestProjectStructure:
"""Test that project directories exist."""
def test_data_directories(self):
from pathlib import Path
assert Path('data/raw').exists()
assert Path('data/processed').exists()
assert Path('data/external').exists()
def test_src_directories(self):
from pathlib import Path
assert Path('src/data').exists()
assert Path('src/models').exists()
assert Path('src/visualization').exists()
def test_output_directories(self):
from pathlib import Path
assert Path('output/figures').exists()
assert Path('output/reports').exists()
if __name__ == '__main__':
pytest.main([__file__, '-v'])
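A note on version checks: `__version__` strings must be compared as integer tuples, not lexicographically, because string comparison mis-orders multi-digit components ('1.24.0' sorts before '1.3.0' as text). A standalone demonstration:

```python
def version_tuple(version):
    """Parse '2.1.3' into (2, 1, 3), dropping non-numeric parts like 'rc1'."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())

# Lexicographic string comparison gets multi-digit components wrong...
assert "1.24.0" < "1.3.0"            # wrong order: 1.24 is the newer release
# ...while tuple comparison orders releases correctly
assert version_tuple("1.24.0") > version_tuple("1.3.0")
assert version_tuple("10.0.0") >= version_tuple("2.0.0")
print("tuple comparison orders releases correctly")
```

For production code, the `packaging.version.parse` helper handles pre-release and dev suffixes more rigorously than this sketch does.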
Part 5: Discussion Questions
Question 1: Version Pinning
Why do the requirements files use version ranges (e.g., >=2.0.0,<3.0.0) instead of exact versions (==2.0.3)? What are the tradeoffs?
Question 2: Profile System
The setup uses different requirement profiles (core, stats, ml, dev). What advantages does this provide over a single requirements.txt file?
Question 3: Virtual Environments
Why is it important that each team member uses a virtual environment rather than installing packages globally?
Question 4: Reproducibility
What additional steps could be taken to ensure that an analysis run today can be exactly reproduced in five years?
Question 5: Security
The .gitignore file excludes credentials and .env files. What other security considerations should an analytics team address?
Deliverables
By completing this case study, you should produce:
- Setup Script: Functional setup_environment.py
- Requirements Files: Tiered requirements files for different use cases
- Project Structure: Complete directory structure with placeholder files
- Documentation: README.md and onboarding notebook
- Tests: Environment verification test suite
Key Takeaways
- Standardization reduces friction - A consistent setup process helps new team members become productive quickly
- Tiered requirements support diverse workflows without bloating everyone's environment
- Virtual environments isolate dependencies and ensure reproducibility
- Documentation is crucial - Good README and onboarding materials save hours of confusion
- Automated setup reduces human error and ensures consistency
This case study demonstrates how proper environment management enables effective team collaboration in basketball analytics.