Case Study 2: Debugging Environment Issues in a Production Analytics Pipeline
Overview
Scenario: The Atlanta Hawks analytics team has been experiencing intermittent failures in their nightly data pipeline. Players' shooting statistics are sometimes calculated incorrectly, and the team suspects environment-related issues. You've been brought in to diagnose and fix the problems.
Duration: 2-3 hours
Difficulty: Intermediate to Advanced
Prerequisites: Chapter 3 concepts, basic debugging experience
Background
The Hawks' analytics pipeline runs nightly to:
1. Pull game data from the NBA API
2. Calculate advanced shooting metrics
3. Generate reports for coaching staff
4. Update the team's internal dashboard
Over the past month, the pipeline has failed 8 times with different errors. The team has saved error logs but hasn't been able to identify the root causes.
Part 1: The Investigation
1.1 Error Log Analysis
Error Log 1: November 15th
Traceback (most recent call last):
File "pipeline.py", line 45, in calculate_metrics
import pandas as pd
ModuleNotFoundError: No module named 'pandas'
Error Log 2: November 18th
Traceback (most recent call last):
File "pipeline.py", line 89, in fetch_data
response = nba_api.stats.endpoints.playergamelog.PlayerGameLog(...)
AttributeError: module 'nba_api.stats.endpoints' has no attribute 'playergamelog'
Error Log 3: November 22nd
Traceback (most recent call last):
File "pipeline.py", line 156, in generate_report
fig, ax = plt.subplots()
RuntimeError: Python is not installed as a framework. The Mac OS X backend...
Error Log 4: November 25th
Traceback (most recent call last):
File "pipeline.py", line 201, in save_data
df.to_parquet('output.parquet')
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support.
Error Log 5: December 1st
ValueError: numpy.ndarray size changed, may indicate binary incompatibility.
Expected 88 from C header, got 80 from PyObject
1.2 Current Environment Information
The team provided their requirements.txt:
pandas
numpy
matplotlib
nba_api
scipy
sklearn
And their deployment script:
#!/bin/bash
pip install -r requirements.txt
python pipeline.py
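Before diving into structured diagnostics, a quick way to test the "wrong environment" hypothesis suggested by Error Log 1 is to record which interpreter each context actually uses. A minimal sketch; run it both from an interactive shell and from whatever scheduler launches the pipeline, then compare the output:

import sys

# Print the interpreter path and environment prefix. If these differ
# between an interactive shell and the scheduled job, pip is installing
# into one environment while the pipeline runs in another.
print(sys.executable)
print(sys.prefix)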
Part 2: Diagnosis
2.1 Problem Identification Framework
"""
Environment Diagnostic Tool for Basketball Analytics Pipelines
This script helps identify common environment issues that cause
pipeline failures.
"""
import sys
import subprocess
import importlib
from pathlib import Path
from typing import Dict, List, Tuple, Optional
import json
class EnvironmentDiagnostic:
"""
Diagnostic tool for Python environment issues.
Checks for common problems including version mismatches,
missing dependencies, and configuration issues.
"""
def __init__(self):
self.issues_found = []
self.warnings = []
self.checks_passed = []
def check_python_version(self) -> bool:
"""Verify Python version is appropriate."""
version = sys.version_info
if version.major < 3:
self.issues_found.append({
'type': 'CRITICAL',
'component': 'Python Version',
'message': f'Python 2.x detected ({version.major}.{version.minor}). Python 3.10+ required.',
'fix': 'Install Python 3.10 or higher'
})
return False
if version.major == 3 and version.minor < 10:
self.warnings.append({
'type': 'WARNING',
'component': 'Python Version',
'message': f'Python {version.major}.{version.minor} detected. Python 3.10+ recommended.',
'fix': 'Consider upgrading to Python 3.10+'
})
self.checks_passed.append('Python version check')
return True
def check_virtual_environment(self) -> bool:
"""Verify running in a virtual environment."""
in_venv = (
hasattr(sys, 'real_prefix') or
(hasattr(sys, 'base_prefix') and sys.base_prefix != sys.prefix)
)
if not in_venv:
self.warnings.append({
'type': 'WARNING',
'component': 'Virtual Environment',
'message': 'Not running in a virtual environment',
'fix': 'Create and activate a virtual environment: python -m venv venv'
})
return False
self.checks_passed.append('Virtual environment check')
return True
def check_package_versions(self, requirements: Dict[str, str]) -> bool:
"""
Verify installed package versions match requirements.
Args:
requirements: Dict mapping package names to required versions
"""
all_ok = True
for package, required_version in requirements.items():
try:
module = importlib.import_module(package.replace('-', '_'))
installed_version = getattr(module, '__version__', 'unknown')
if required_version and installed_version != required_version:
if not self._version_compatible(installed_version, required_version):
self.issues_found.append({
'type': 'ERROR',
'component': f'Package: {package}',
'message': f'Version mismatch. Installed: {installed_version}, Required: {required_version}',
'fix': f'pip install {package}=={required_version}'
})
all_ok = False
except ImportError:
self.issues_found.append({
'type': 'ERROR',
'component': f'Package: {package}',
'message': f'Package not installed',
'fix': f'pip install {package}'
})
all_ok = False
if all_ok:
self.checks_passed.append('Package version check')
return all_ok
def _version_compatible(self, installed: str, required: str) -> bool:
"""Check if installed version is compatible with required."""
# Simple comparison - in production, use packaging.version
try:
installed_parts = [int(x) for x in installed.split('.')[:2]]
required_parts = [int(x) for x in required.split('.')[:2]]
return installed_parts >= required_parts
except (ValueError, AttributeError):
return False
def check_binary_compatibility(self) -> bool:
"""Check for numpy/scipy binary compatibility issues."""
try:
import numpy as np
import pandas as pd
# Try operations that would fail with binary incompatibility
arr = np.array([1, 2, 3])
df = pd.DataFrame({'a': arr})
_ = df.values
self.checks_passed.append('Binary compatibility check')
return True
except ValueError as e:
if 'binary incompatibility' in str(e).lower():
self.issues_found.append({
'type': 'CRITICAL',
'component': 'Binary Compatibility',
'message': 'NumPy/Pandas binary incompatibility detected',
'fix': 'Reinstall numpy and pandas: pip uninstall numpy pandas -y && pip install numpy pandas'
})
return False
raise
def check_matplotlib_backend(self) -> bool:
"""Check matplotlib backend configuration."""
try:
import matplotlib
backend = matplotlib.get_backend()
# Check for headless environment issues
if 'agg' not in backend.lower() and not self._has_display():
self.warnings.append({
'type': 'WARNING',
'component': 'Matplotlib Backend',
'message': f'Backend "{backend}" may not work in headless environment',
'fix': 'Set MPLBACKEND=Agg or add matplotlib.use("Agg") before importing pyplot'
})
return False
self.checks_passed.append('Matplotlib backend check')
return True
except Exception as e:
self.issues_found.append({
'type': 'ERROR',
'component': 'Matplotlib',
'message': str(e),
'fix': 'Check matplotlib installation and backend configuration'
})
return False
def _has_display(self) -> bool:
"""Check if a display is available."""
import os
return 'DISPLAY' in os.environ or sys.platform == 'win32'
def check_parquet_support(self) -> bool:
"""Verify parquet file support is available."""
try:
import pandas as pd
import io
# Try to use parquet
df = pd.DataFrame({'a': [1, 2, 3]})
buffer = io.BytesIO()
df.to_parquet(buffer)
self.checks_passed.append('Parquet support check')
return True
except ImportError as e:
self.issues_found.append({
'type': 'ERROR',
'component': 'Parquet Support',
'message': 'pyarrow or fastparquet not installed',
'fix': 'pip install pyarrow'
})
return False
def check_nba_api(self) -> bool:
"""Verify nba_api is correctly installed and accessible."""
try:
from nba_api.stats.endpoints import playergamelog
from nba_api.stats.static import players
# Verify static data is accessible
_ = players.get_players()
self.checks_passed.append('NBA API check')
return True
except ImportError as e:
self.issues_found.append({
'type': 'ERROR',
'component': 'NBA API',
'message': f'Import error: {e}',
'fix': 'pip install nba_api --upgrade'
})
return False
except Exception as e:
self.warnings.append({
'type': 'WARNING',
'component': 'NBA API',
'message': f'API accessible but error occurred: {e}',
'fix': 'Check network connectivity and API availability'
})
return True
def run_all_checks(self) -> Dict:
"""Run all diagnostic checks and return results."""
print("Running environment diagnostics...")
print("=" * 60)
# Core requirements for basketball analytics
requirements = {
'pandas': '2.0.0',
'numpy': '1.24.0',
'matplotlib': '3.7.0',
'scipy': '1.11.0',
}
checks = [
('Python Version', self.check_python_version),
('Virtual Environment', self.check_virtual_environment),
('Package Versions', lambda: self.check_package_versions(requirements)),
('Binary Compatibility', self.check_binary_compatibility),
('Matplotlib Backend', self.check_matplotlib_backend),
('Parquet Support', self.check_parquet_support),
('NBA API', self.check_nba_api),
]
for check_name, check_func in checks:
try:
print(f"\nChecking {check_name}...", end=" ")
result = check_func()
print("PASS" if result else "ISSUE DETECTED")
except Exception as e:
print(f"ERROR: {e}")
self.issues_found.append({
'type': 'ERROR',
'component': check_name,
'message': str(e),
'fix': 'Review error and check documentation'
})
return self.generate_report()
def generate_report(self) -> Dict:
"""Generate a diagnostic report."""
report = {
'summary': {
'checks_passed': len(self.checks_passed),
'warnings': len(self.warnings),
'errors': len(self.issues_found)
},
'passed': self.checks_passed,
'warnings': self.warnings,
'issues': self.issues_found
}
print("\n" + "=" * 60)
print("DIAGNOSTIC REPORT")
print("=" * 60)
print(f"\nPassed: {report['summary']['checks_passed']}")
print(f"Warnings: {report['summary']['warnings']}")
print(f"Errors: {report['summary']['errors']}")
if self.issues_found:
print("\n--- ISSUES REQUIRING ACTION ---")
for issue in self.issues_found:
print(f"\n[{issue['type']}] {issue['component']}")
print(f" Problem: {issue['message']}")
print(f" Fix: {issue['fix']}")
if self.warnings:
print("\n--- WARNINGS ---")
for warning in self.warnings:
print(f"\n[{warning['type']}] {warning['component']}")
print(f" Message: {warning['message']}")
print(f" Suggestion: {warning['fix']}")
return report
def main():
"""Run environment diagnostics."""
diagnostic = EnvironmentDiagnostic()
report = diagnostic.run_all_checks()
# Save report to file
with open('diagnostic_report.json', 'w') as f:
json.dump(report, f, indent=2)
print(f"\nFull report saved to diagnostic_report.json")
# Return appropriate exit code
return 1 if report['summary']['errors'] > 0 else 0
if __name__ == '__main__':
sys.exit(main())
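The _version_compatible helper above deliberately uses a crude major.minor comparison. For full version handling, the packaging library (already present transitively; note packaging==23.1 in the lock file in Section 5.2) offers a drop-in replacement. A sketch:

from packaging.version import InvalidVersion, Version

def version_compatible(installed: str, required: str) -> bool:
    """Return True if the installed version meets the minimum required."""
    try:
        return Version(installed) >= Version(required)
    except InvalidVersion:
        # Unparseable strings such as 'unknown' count as incompatible
        return False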
Part 3: Root Cause Analysis
3.1 Error-by-Error Analysis
Error 1: ModuleNotFoundError for pandas
Root Cause: The deployment script runs pip install in the global environment, but the pipeline runs in a different environment (likely a cron job with a different PATH).
Fix:
#!/bin/bash
# Activate virtual environment before running
source /path/to/project/venv/bin/activate
pip install -r requirements.txt
python pipeline.py
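An alternative that avoids activation entirely, and is harder to get wrong under cron, is to invoke the virtual environment's interpreter by absolute path; the venv's python resolves its own site-packages without any activate script:

#!/bin/bash
/path/to/project/venv/bin/pip install -r requirements.txt
/path/to/project/venv/bin/python pipeline.py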
Error 2: AttributeError for nba_api
Root Cause: The requirements.txt doesn't pin versions. An nba_api update changed the module structure.
Fix: Pin the nba_api version:
nba_api==1.2.1
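Pinning prevents the surprise upgrade. As an extra guard, a defensive import at the top of pipeline.py turns any future change in the API surface into an immediate, self-explanatory failure instead of a mid-run AttributeError. A sketch, using the same import path as the diagnostic tool above:

# Fail fast at startup if the nba_api module layout changes again
try:
    from nba_api.stats.endpoints import playergamelog
except ImportError as exc:
    raise SystemExit(
        f"Unexpected nba_api layout (pinned: nba_api==1.2.1): {exc}"
    )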
Error 3: Matplotlib backend error
Root Cause: The pipeline runs on a headless server (no display), but matplotlib defaults to a GUI backend on macOS.
Fix: Add to the pipeline script before any matplotlib imports:
import matplotlib
matplotlib.use('Agg') # Use non-GUI backend
import matplotlib.pyplot as plt
Or set environment variable:
export MPLBACKEND=Agg
Error 4: Parquet engine not found
Root Cause: pyarrow was not listed as a dependency.
Fix: Add to requirements.txt:
pyarrow>=12.0.0
Error 5: NumPy binary incompatibility
Root Cause: numpy was upgraded in place, but the installed pandas binaries were compiled against the older numpy C ABI, so the two packages no longer agree on internal data layouts.
Fix: Reinstall both packages:
pip uninstall numpy pandas -y
pip install numpy pandas
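If uninstalling first is inconvenient (for example, when other jobs share the environment), a forced, cache-free reinstall of pinned versions reaches the same clean state; the version numbers here assume the corrected requirements file in the next section:

pip install --force-reinstall --no-cache-dir numpy==1.24.3 pandas==2.0.3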
3.2 Corrected Requirements File
# requirements.txt - Pinned versions for Hawks Analytics Pipeline
# Last updated: December 2024
# Tested on: Python 3.11.5
# Core Data Science
pandas==2.0.3
numpy==1.24.3
scipy==1.11.2
# Visualization
matplotlib==3.7.2
seaborn==0.12.2
# NBA Data
nba_api==1.2.1
# Machine Learning
scikit-learn==1.3.0
# File Formats
pyarrow==12.0.1
openpyxl==3.1.2
# Utilities
requests==2.31.0
python-dotenv==1.0.0
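After installing from the pinned file, pip's built-in consistency checker is a cheap guard against the kind of broken dependency graph behind Error 5:

pip install -r requirements.txt
pip check  # reports packages with missing or incompatible dependencies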
Part 4: Implementing Robust Solutions
4.1 Environment Validation Script
validate_environment.py
#!/usr/bin/env python
"""
Pre-flight environment validation for Hawks Analytics Pipeline.
This script should run before the main pipeline to catch
environment issues early.
Exit codes:
    0 - All checks passed
    1 - One or more checks failed; pipeline should not run
"""
import sys
import importlib
from typing import List, Tuple
def validate_python_version() -> Tuple[bool, str]:
"""Ensure Python version is correct."""
required = (3, 10)
current = sys.version_info[:2]
if current < required:
return False, f"Python {required[0]}.{required[1]}+ required, got {current[0]}.{current[1]}"
return True, f"Python {current[0]}.{current[1]}"
def validate_package(name: str, min_version: str) -> Tuple[bool, str]:
    """Validate a package is installed with correct version."""
    version = 'unknown'
    try:
        module = importlib.import_module(name.replace('-', '_'))
        version = getattr(module, '__version__', 'unknown')
        # Simple major.minor comparison; use packaging.version for full handling
        installed = [int(x) for x in version.split('.')[:2]]
        required = [int(x) for x in min_version.split('.')[:2]]
        if installed < required:
            return False, f"{name} {version} installed, {min_version}+ required"
        return True, f"{name} {version}"
    except ImportError:
        return False, f"{name} not installed"
    except ValueError:
        # __version__ missing or unparseable (e.g. 'unknown')
        return False, f"{name} installed but version '{version}' is unparseable"
def validate_matplotlib_backend() -> Tuple[bool, str]:
"""Ensure matplotlib can work in headless mode."""
try:
import matplotlib
backend = matplotlib.get_backend().lower()
if 'agg' in backend or 'pdf' in backend or 'svg' in backend:
return True, f"Backend: {backend} (headless-compatible)"
else:
return False, f"Backend: {backend} (may require display)"
except Exception as e:
return False, str(e)
def validate_parquet() -> Tuple[bool, str]:
"""Ensure parquet support is available."""
try:
import pandas as pd
import io
df = pd.DataFrame({'test': [1, 2, 3]})
buffer = io.BytesIO()
df.to_parquet(buffer)
return True, "Parquet support available"
except ImportError:
return False, "pyarrow or fastparquet not installed"
def main() -> int:
"""Run all validations and report results."""
print("Hawks Analytics Pipeline - Environment Validation")
print("=" * 50)
validations = [
("Python Version", validate_python_version()),
("pandas", validate_package('pandas', '2.0.0')),
("numpy", validate_package('numpy', '1.24.0')),
("matplotlib", validate_package('matplotlib', '3.7.0')),
("nba_api", validate_package('nba_api', '1.2.0')),
("scikit-learn", validate_package('sklearn', '1.3.0')),
("Matplotlib Backend", validate_matplotlib_backend()),
("Parquet Support", validate_parquet()),
]
passed = 0
failed = 0
for name, (success, message) in validations:
status = "PASS" if success else "FAIL"
print(f" [{status}] {name}: {message}")
if success:
passed += 1
else:
failed += 1
print("=" * 50)
print(f"Results: {passed} passed, {failed} failed")
if failed > 0:
print("\nEnvironment validation FAILED. Please fix issues before running pipeline.")
return 1
print("\nEnvironment validation PASSED. Pipeline ready to run.")
return 0
if __name__ == '__main__':
sys.exit(main())
4.2 Improved Deployment Script
deploy.sh
#!/bin/bash
#
# Hawks Analytics Pipeline Deployment Script
#
# This script ensures the pipeline runs in a consistent environment
# with proper error handling and logging.
#
set -e # Exit on error
set -o pipefail # Catch errors in pipes
# Configuration
PROJECT_DIR="/opt/hawks_analytics"
VENV_DIR="$PROJECT_DIR/venv"
LOG_DIR="$PROJECT_DIR/logs"
LOG_FILE="$LOG_DIR/pipeline_$(date +%Y%m%d_%H%M%S).log"
# Ensure log directory exists
mkdir -p "$LOG_DIR"
# Function to log messages
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Function to handle errors
handle_error() {
log "ERROR: Pipeline failed at line $1"
log "Check log file: $LOG_FILE"
exit 1
}
trap 'handle_error $LINENO' ERR
log "Starting Hawks Analytics Pipeline"
log "=================================="
# Change to project directory
cd "$PROJECT_DIR"
log "Working directory: $(pwd)"
# Activate virtual environment
log "Activating virtual environment..."
source "$VENV_DIR/bin/activate"
# Verify Python
log "Python: $(which python)"
log "Version: $(python --version)"
# Set matplotlib backend for headless operation
export MPLBACKEND=Agg
# Run environment validation
# (suspend exit-on-error around the pipe so a validation failure reaches
# our own reporting below instead of tripping the ERR trap)
log "Running environment validation..."
set +e
python validate_environment.py 2>&1 | tee -a "$LOG_FILE"
VALIDATE_STATUS=${PIPESTATUS[0]}
set -e
if [ "$VALIDATE_STATUS" -ne 0 ]; then
    log "Environment validation failed. Aborting."
    exit 1
fi

# Run the main pipeline
log "Starting main pipeline..."
set +e
python pipeline.py 2>&1 | tee -a "$LOG_FILE"
PIPELINE_STATUS=${PIPESTATUS[0]}
set -e
if [ "$PIPELINE_STATUS" -eq 0 ]; then
    log "Pipeline completed successfully"
else
    log "Pipeline failed"
    exit 1
fi
log "=================================="
log "Pipeline execution complete"
Part 5: Prevention Strategies
5.1 Continuous Integration Checks
.github/workflows/validate.yml
name: Environment Validation
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
schedule:
- cron: '0 6 * * *' # Daily at 6 AM
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
- name: Run environment validation
run: python validate_environment.py
- name: Run tests
run: pytest tests/ -v
- name: Test pipeline dry run
run: python pipeline.py --dry-run
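As the dependency list grows, installation starts to dominate CI time. One option is setup-python's built-in pip cache, keyed on requirements.txt, which would replace the setup step shown above:

- name: Set up Python
  uses: actions/setup-python@v4
  with:
    python-version: '3.11'
    cache: 'pip'  # restores pip's download cache between runs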
5.2 Dependency Lock File
requirements.lock (generated with pip freeze)
# This file contains exact versions for reproducible builds
# Generated: 2024-12-01
# Python: 3.11.5
# Platform: linux-x86_64
certifi==2023.7.22
charset-normalizer==3.2.0
contourpy==1.1.0
cycler==0.11.0
fonttools==4.42.1
idna==3.4
joblib==1.3.2
kiwisolver==1.4.5
matplotlib==3.7.2
nba-api==1.2.1
numpy==1.24.3
packaging==23.1
pandas==2.0.3
pillow==10.0.0
pyarrow==12.0.1
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2023.3
requests==2.31.0
scikit-learn==1.3.0
scipy==1.11.2
seaborn==0.12.2
six==1.16.0
threadpoolctl==3.2.0
tzdata==2023.3
urllib3==2.0.4
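The lock-file workflow has two halves: freeze from a known-good environment, then install from the frozen file everywhere else. With plain pip:

# On the machine where everything works:
pip freeze > requirements.lock

# On every other machine, and in CI:
pip install -r requirements.lock

Keeping requirements.txt as the human-edited statement of intent and requirements.lock as the machine-generated snapshot mirrors what tools like pip-tools automate (see Question 3 below).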
Discussion Questions
Question 1: Version Pinning Tradeoffs
The original requirements.txt had no version pins. The new one has exact pins. What are the tradeoffs? When might you prefer one approach over the other?
Question 2: Production vs Development
Should development environments match production exactly? What problems might arise if they differ?
Question 3: Dependency Management Tools
Tools like Poetry, Pipenv, and pip-tools offer more sophisticated dependency management. What advantages might they provide over plain requirements.txt?
Question 4: Container Solutions
How might containerization (Docker) have prevented some of these issues? What are the tradeoffs of containerization?
Question 5: Monitoring
What monitoring or alerting should be in place to catch environment issues before they cause pipeline failures?
Deliverables
- Diagnostic Tool: Complete environment diagnostic script
- Fixed Requirements: Properly pinned requirements.txt
- Validation Script: Pre-flight environment validation
- Deployment Script: Robust deployment with error handling
- CI Configuration: GitHub Actions workflow for validation
Key Takeaways
- Pin your dependencies - Unpinned versions lead to reproducibility failures
- Validate before running - Catch environment issues before they cause pipeline failures
- Use virtual environments consistently - Isolation prevents dependency conflicts
- Handle headless environments - Configure backends for server deployment
- Log everything - Good logging makes debugging much easier
- Automate validation - CI/CD catches issues before they reach production
This case study demonstrates that environment management is not just a setup task but an ongoing operational concern that requires systematic approaches.