Case Study 2: Debugging Environment Issues in a Production Analytics Pipeline

Overview

Scenario: The Atlanta Hawks analytics team has been experiencing intermittent failures in their nightly data pipeline. Players' shooting statistics are sometimes calculated incorrectly, and the team suspects environment-related issues. You've been brought in to diagnose and fix the problems.

Duration: 2-3 hours
Difficulty: Intermediate to Advanced
Prerequisites: Chapter 3 concepts, basic debugging experience


Background

The Hawks' analytics pipeline runs nightly to:

  1. Pull game data from the NBA API
  2. Calculate advanced shooting metrics
  3. Generate reports for coaching staff
  4. Update the team's internal dashboard

Over the past month, the pipeline has failed 8 times with different errors. The team has saved error logs but hasn't been able to identify the root causes.


Part 1: The Investigation

1.1 Error Log Analysis

Error Log 1: November 15th

Traceback (most recent call last):
  File "pipeline.py", line 45, in calculate_metrics
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

Error Log 2: November 18th

Traceback (most recent call last):
  File "pipeline.py", line 89, in fetch_data
    response = nba_api.stats.endpoints.playergamelog.PlayerGameLog(...)
AttributeError: module 'nba_api.stats.endpoints' has no attribute 'playergamelog'

Error Log 3: November 22nd

Traceback (most recent call last):
  File "pipeline.py", line 156, in generate_report
    fig, ax = plt.subplots()
RuntimeError: Python is not installed as a framework. The Mac OS X backend...

Error Log 4: November 25th

Traceback (most recent call last):
  File "pipeline.py", line 201, in save_data
    df.to_parquet('output.parquet')
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support.

Error Log 5: December 1st

ValueError: numpy.ndarray size changed, may indicate binary incompatibility.
Expected 88 from C header, got 80 from PyObject

1.2 Current Environment Information

The team provided their requirements.txt:

pandas
numpy
matplotlib
nba_api
scipy
sklearn

And their deployment script:

#!/bin/bash
pip install -r requirements.txt
python pipeline.py
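
Before digging further into the logs, it helps to confirm which interpreter each context (interactive shell, cron job) actually resolves, since Error 1 suggests that pip and the pipeline disagree about the environment. A minimal stdlib sketch (the helper name is ours) that can be run from both contexts and compared:

```python
import os
import sys


def describe_environment() -> dict:
    """Report facts that distinguish one Python environment from another."""
    return {
        "executable": sys.executable,              # interpreter actually running
        "prefix": sys.prefix,                      # venv root if one is active
        "base_prefix": sys.base_prefix,            # the base install behind a venv
        "in_venv": sys.prefix != sys.base_prefix,  # True inside a virtual environment
        "path_head": os.environ.get("PATH", "").split(os.pathsep)[0],  # first PATH entry
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this once from the shell and once from the cron entry makes a PATH or interpreter mismatch immediately visible.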

Part 2: Diagnosis

2.1 Problem Identification Framework

"""
Environment Diagnostic Tool for Basketball Analytics Pipelines

This script helps identify common environment issues that cause
pipeline failures.
"""

import sys
import subprocess
import importlib
from pathlib import Path
from typing import Dict, List, Tuple, Optional
import json


class EnvironmentDiagnostic:
    """
    Diagnostic tool for Python environment issues.

    Checks for common problems including version mismatches,
    missing dependencies, and configuration issues.
    """

    def __init__(self):
        self.issues_found = []
        self.warnings = []
        self.checks_passed = []

    def check_python_version(self) -> bool:
        """Verify Python version is appropriate."""
        version = sys.version_info

        if version.major < 3:
            self.issues_found.append({
                'type': 'CRITICAL',
                'component': 'Python Version',
                'message': f'Python 2.x detected ({version.major}.{version.minor}). Python 3.10+ required.',
                'fix': 'Install Python 3.10 or higher'
            })
            return False

        if version.major == 3 and version.minor < 10:
            self.warnings.append({
                'type': 'WARNING',
                'component': 'Python Version',
                'message': f'Python {version.major}.{version.minor} detected. Python 3.10+ recommended.',
                'fix': 'Consider upgrading to Python 3.10+'
            })

        self.checks_passed.append('Python version check')
        return True

    def check_virtual_environment(self) -> bool:
        """Verify running in a virtual environment."""
        in_venv = (
            hasattr(sys, 'real_prefix') or
            (hasattr(sys, 'base_prefix') and sys.base_prefix != sys.prefix)
        )

        if not in_venv:
            self.warnings.append({
                'type': 'WARNING',
                'component': 'Virtual Environment',
                'message': 'Not running in a virtual environment',
                'fix': 'Create and activate a virtual environment: python -m venv venv'
            })
            return False

        self.checks_passed.append('Virtual environment check')
        return True

    def check_package_versions(self, requirements: Dict[str, str]) -> bool:
        """
        Verify installed package versions match requirements.

        Args:
            requirements: Dict mapping package names to required versions
        """
        all_ok = True

        for package, required_version in requirements.items():
            try:
                module = importlib.import_module(package.replace('-', '_'))
                installed_version = getattr(module, '__version__', 'unknown')

                if required_version and installed_version != required_version:
                    if not self._version_compatible(installed_version, required_version):
                        self.issues_found.append({
                            'type': 'ERROR',
                            'component': f'Package: {package}',
                            'message': f'Version mismatch. Installed: {installed_version}, Required: {required_version}',
                            'fix': f'pip install {package}=={required_version}'
                        })
                        all_ok = False

            except ImportError:
                self.issues_found.append({
                    'type': 'ERROR',
                    'component': f'Package: {package}',
                    'message': 'Package not installed',
                    'fix': f'pip install {package}'
                })
                all_ok = False

        if all_ok:
            self.checks_passed.append('Package version check')

        return all_ok

    def _version_compatible(self, installed: str, required: str) -> bool:
        """Check if installed version is compatible with required."""
        # Simple comparison - in production, use packaging.version
        try:
            installed_parts = [int(x) for x in installed.split('.')[:2]]
            required_parts = [int(x) for x in required.split('.')[:2]]
            return installed_parts >= required_parts
        except (ValueError, AttributeError):
            return False

    def check_binary_compatibility(self) -> bool:
        """Check for numpy/scipy binary compatibility issues."""
        try:
            import numpy as np
            import pandas as pd

            # Try operations that would fail with binary incompatibility
            arr = np.array([1, 2, 3])
            df = pd.DataFrame({'a': arr})
            _ = df.values

            self.checks_passed.append('Binary compatibility check')
            return True

        except ValueError as e:
            if 'binary incompatibility' in str(e).lower():
                self.issues_found.append({
                    'type': 'CRITICAL',
                    'component': 'Binary Compatibility',
                    'message': 'NumPy/Pandas binary incompatibility detected',
                    'fix': 'Reinstall numpy and pandas: pip uninstall numpy pandas -y && pip install numpy pandas'
                })
                return False
            raise

    def check_matplotlib_backend(self) -> bool:
        """Check matplotlib backend configuration."""
        try:
            import matplotlib
            backend = matplotlib.get_backend()

            # Check for headless environment issues
            headless_backends = ('agg', 'pdf', 'svg')
            if not any(b in backend.lower() for b in headless_backends) and not self._has_display():
                self.warnings.append({
                    'type': 'WARNING',
                    'component': 'Matplotlib Backend',
                    'message': f'Backend "{backend}" may not work in headless environment',
                    'fix': 'Set MPLBACKEND=Agg or add matplotlib.use("Agg") before importing pyplot'
                })
                return False

            self.checks_passed.append('Matplotlib backend check')
            return True

        except Exception as e:
            self.issues_found.append({
                'type': 'ERROR',
                'component': 'Matplotlib',
                'message': str(e),
                'fix': 'Check matplotlib installation and backend configuration'
            })
            return False

    def _has_display(self) -> bool:
        """Check if a display is available."""
        import os
        return 'DISPLAY' in os.environ or sys.platform == 'win32'

    def check_parquet_support(self) -> bool:
        """Verify parquet file support is available."""
        try:
            import pandas as pd
            import io

            # Try to use parquet
            df = pd.DataFrame({'a': [1, 2, 3]})
            buffer = io.BytesIO()
            df.to_parquet(buffer)

            self.checks_passed.append('Parquet support check')
            return True

        except ImportError as e:
            self.issues_found.append({
                'type': 'ERROR',
                'component': 'Parquet Support',
                'message': 'pyarrow or fastparquet not installed',
                'fix': 'pip install pyarrow'
            })
            return False

    def check_nba_api(self) -> bool:
        """Verify nba_api is correctly installed and accessible."""
        try:
            from nba_api.stats.endpoints import playergamelog
            from nba_api.stats.static import players

            # Verify static data is accessible
            _ = players.get_players()

            self.checks_passed.append('NBA API check')
            return True

        except ImportError as e:
            self.issues_found.append({
                'type': 'ERROR',
                'component': 'NBA API',
                'message': f'Import error: {e}',
                'fix': 'pip install nba_api --upgrade'
            })
            return False
        except Exception as e:
            self.warnings.append({
                'type': 'WARNING',
                'component': 'NBA API',
                'message': f'API accessible but error occurred: {e}',
                'fix': 'Check network connectivity and API availability'
            })
            return True

    def run_all_checks(self) -> Dict:
        """Run all diagnostic checks and return results."""
        print("Running environment diagnostics...")
        print("=" * 60)

        # Core requirements for basketball analytics
        requirements = {
            'pandas': '2.0.0',
            'numpy': '1.24.0',
            'matplotlib': '3.7.0',
            'scipy': '1.11.0',
        }

        checks = [
            ('Python Version', self.check_python_version),
            ('Virtual Environment', self.check_virtual_environment),
            ('Package Versions', lambda: self.check_package_versions(requirements)),
            ('Binary Compatibility', self.check_binary_compatibility),
            ('Matplotlib Backend', self.check_matplotlib_backend),
            ('Parquet Support', self.check_parquet_support),
            ('NBA API', self.check_nba_api),
        ]

        for check_name, check_func in checks:
            try:
                print(f"\nChecking {check_name}...", end=" ")
                result = check_func()
                print("PASS" if result else "ISSUE DETECTED")
            except Exception as e:
                print(f"ERROR: {e}")
                self.issues_found.append({
                    'type': 'ERROR',
                    'component': check_name,
                    'message': str(e),
                    'fix': 'Review error and check documentation'
                })

        return self.generate_report()

    def generate_report(self) -> Dict:
        """Generate a diagnostic report."""
        report = {
            'summary': {
                'checks_passed': len(self.checks_passed),
                'warnings': len(self.warnings),
                'errors': len(self.issues_found)
            },
            'passed': self.checks_passed,
            'warnings': self.warnings,
            'issues': self.issues_found
        }

        print("\n" + "=" * 60)
        print("DIAGNOSTIC REPORT")
        print("=" * 60)

        print(f"\nPassed: {report['summary']['checks_passed']}")
        print(f"Warnings: {report['summary']['warnings']}")
        print(f"Errors: {report['summary']['errors']}")

        if self.issues_found:
            print("\n--- ISSUES REQUIRING ACTION ---")
            for issue in self.issues_found:
                print(f"\n[{issue['type']}] {issue['component']}")
                print(f"  Problem: {issue['message']}")
                print(f"  Fix: {issue['fix']}")

        if self.warnings:
            print("\n--- WARNINGS ---")
            for warning in self.warnings:
                print(f"\n[{warning['type']}] {warning['component']}")
                print(f"  Message: {warning['message']}")
                print(f"  Suggestion: {warning['fix']}")

        return report


def main():
    """Run environment diagnostics."""
    diagnostic = EnvironmentDiagnostic()
    report = diagnostic.run_all_checks()

    # Save report to file
    with open('diagnostic_report.json', 'w') as f:
        json.dump(report, f, indent=2)

    print("\nFull report saved to diagnostic_report.json")

    # Return appropriate exit code
    return 1 if report['summary']['errors'] > 0 else 0


if __name__ == '__main__':
    sys.exit(main())
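
The comment in `_version_compatible` defers to `packaging.version` for production use. A sketch of that replacement, assuming the third-party `packaging` library is available (it ships alongside pip in most environments):

```python
from packaging.version import InvalidVersion, Version


def version_satisfies(installed: str, required: str) -> bool:
    """True if installed >= required under full PEP 440 version semantics."""
    try:
        return Version(installed) >= Version(required)
    except InvalidVersion:
        # Unparseable strings such as 'unknown' are treated as incompatible.
        return False
```

Unlike the two-component integer comparison above, this handles pre-releases, post-releases, and local version tags correctly.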

Part 3: Root Cause Analysis

3.1 Error-by-Error Analysis

Error 1: ModuleNotFoundError for pandas

Root Cause: The deployment script runs pip install in the global environment, but the pipeline runs in a different environment (likely a cron job with a different PATH).

Fix:

#!/bin/bash
# Activate virtual environment before running
source /path/to/project/venv/bin/activate
pip install -r requirements.txt
python pipeline.py

Error 2: AttributeError for nba_api

Root Cause: The requirements.txt doesn't pin versions. An nba_api update changed the module structure.

Fix: Pin the nba_api version:

nba_api==1.2.1
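
Pinning prevents silent drift, but the pipeline can also assert the pin at startup and fail with a clear message instead of an AttributeError deep inside fetch_data. A sketch using the standard library's importlib.metadata (the helper name is ours):

```python
from importlib import metadata


def require_version(package: str, expected: str):
    """Return an error message if `package` is missing or not at the pinned version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed; run: pip install {package}=={expected}"
    if installed != expected:
        return f"{package} {installed} installed, but {expected} is pinned"
    return None


# At pipeline startup (nba_api==1.2.1 is the pin chosen above):
# problem = require_version("nba_api", "1.2.1")
# if problem:
#     raise SystemExit(problem)
```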

Error 3: Matplotlib backend error

Root Cause: The pipeline runs on a headless server (no display), but matplotlib defaults to a GUI backend on macOS.

Fix: Add to the pipeline script before any matplotlib imports:

import matplotlib
matplotlib.use('Agg')  # Use non-GUI backend
import matplotlib.pyplot as plt

Or set environment variable:

export MPLBACKEND=Agg

Error 4: Parquet engine not found

Root Cause: pyarrow was not listed as a dependency.

Fix: Add to requirements.txt:

pyarrow>=12.0.0

Error 5: NumPy binary incompatibility

Root Cause: The installed numpy is older than the numpy pandas was compiled against. The message "Expected 88 from C header, got 80 from PyObject" means pandas' compiled extensions expect a newer numpy ABI than the runtime provides; upgrading pandas (or downgrading numpy) without keeping the pair in step produces this incompatibility.

Fix: Reinstall both packages:

pip uninstall numpy pandas -y
pip install numpy pandas

3.2 Corrected Requirements File

# requirements.txt - Pinned versions for Hawks Analytics Pipeline
# Last updated: December 2024
# Tested on: Python 3.11.5

# Core Data Science
pandas==2.0.3
numpy==1.24.3
scipy==1.11.2

# Visualization
matplotlib==3.7.2
seaborn==0.12.2

# NBA Data
nba_api==1.2.1

# Machine Learning
scikit-learn==1.3.0

# File Formats
pyarrow==12.0.1
openpyxl==3.1.2

# Utilities
requests==2.31.0
python-dotenv==1.0.0
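
A pinned file is only useful if the running environment actually matches it. As a building block for that check, exact pins can be parsed out of requirements.txt with the standard library (a sketch; the function name is ours):

```python
import re


def parse_pins(requirements_text: str) -> dict:
    """Extract exact `name==version` pins, ignoring comments, blanks, and range specifiers."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments, inline or full-line
        if not line:
            continue
        match = re.fullmatch(r"([A-Za-z0-9_.\-]+)==(\S+)", line)
        if match:                              # skip >=, ~=, and other non-exact specifiers
            pins[match.group(1)] = match.group(2)
    return pins
```

Feeding the result into a version check at startup turns an eventual ImportError into an immediate, named failure.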

Part 4: Implementing Robust Solutions

4.1 Environment Validation Script

validate_environment.py

#!/usr/bin/env python
"""
Pre-flight environment validation for Hawks Analytics Pipeline.

This script should run before the main pipeline to catch
environment issues early.

Exit codes:
    0 - All checks passed
    1 - Critical error, pipeline should not run
    2 - Warnings present, pipeline may run with caution
"""

import sys
import importlib
from typing import List, Tuple


def validate_python_version() -> Tuple[bool, str]:
    """Ensure Python version is correct."""
    required = (3, 10)
    current = sys.version_info[:2]

    if current < required:
        return False, f"Python {required[0]}.{required[1]}+ required, got {current[0]}.{current[1]}"
    return True, f"Python {current[0]}.{current[1]}"


def validate_package(name: str, min_version: str) -> Tuple[bool, str]:
    """Validate a package is installed with correct version."""
    try:
        module = importlib.import_module(name.replace('-', '_'))
        version = getattr(module, '__version__', 'unknown')

        # Simple comparison on the first two version components
        installed = [int(x) for x in version.split('.')[:2]]
        required = [int(x) for x in min_version.split('.')[:2]]

        if installed < required:
            return False, f"{name} {version} installed, {min_version}+ required"
        return True, f"{name} {version}"

    except ImportError:
        return False, f"{name} not installed"
    except ValueError:
        return False, f"{name} version string '{version}' could not be parsed"


def validate_matplotlib_backend() -> Tuple[bool, str]:
    """Ensure matplotlib can work in headless mode."""
    try:
        import matplotlib
        backend = matplotlib.get_backend().lower()

        if 'agg' in backend or 'pdf' in backend or 'svg' in backend:
            return True, f"Backend: {backend} (headless-compatible)"
        else:
            return False, f"Backend: {backend} (may require display)"

    except Exception as e:
        return False, str(e)


def validate_parquet() -> Tuple[bool, str]:
    """Ensure parquet support is available."""
    try:
        import pandas as pd
        import io

        df = pd.DataFrame({'test': [1, 2, 3]})
        buffer = io.BytesIO()
        df.to_parquet(buffer)

        return True, "Parquet support available"

    except ImportError:
        return False, "pyarrow or fastparquet not installed"


def main() -> int:
    """Run all validations and report results."""
    print("Hawks Analytics Pipeline - Environment Validation")
    print("=" * 50)

    validations = [
        ("Python Version", validate_python_version()),
        ("pandas", validate_package('pandas', '2.0.0')),
        ("numpy", validate_package('numpy', '1.24.0')),
        ("matplotlib", validate_package('matplotlib', '3.7.0')),
        ("nba_api", validate_package('nba_api', '1.2.0')),
        ("scikit-learn", validate_package('sklearn', '1.3.0')),
        ("Matplotlib Backend", validate_matplotlib_backend()),
        ("Parquet Support", validate_parquet()),
    ]

    passed = 0
    failed = 0

    for name, (success, message) in validations:
        status = "PASS" if success else "FAIL"
        print(f"  [{status}] {name}: {message}")

        if success:
            passed += 1
        else:
            failed += 1

    print("=" * 50)
    print(f"Results: {passed} passed, {failed} failed")

    if failed > 0:
        print("\nEnvironment validation FAILED. Please fix issues before running pipeline.")
        return 1

    print("\nEnvironment validation PASSED. Pipeline ready to run.")
    return 0


if __name__ == '__main__':
    sys.exit(main())

4.2 Improved Deployment Script

deploy.sh

#!/bin/bash
#
# Hawks Analytics Pipeline Deployment Script
#
# This script ensures the pipeline runs in a consistent environment
# with proper error handling and logging.
#

set -e  # Exit on error
set -o pipefail  # Catch errors in pipes

# Configuration
PROJECT_DIR="/opt/hawks_analytics"
VENV_DIR="$PROJECT_DIR/venv"
LOG_DIR="$PROJECT_DIR/logs"
LOG_FILE="$LOG_DIR/pipeline_$(date +%Y%m%d_%H%M%S).log"

# Ensure log directory exists
mkdir -p "$LOG_DIR"

# Function to log messages
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Function to handle errors
handle_error() {
    log "ERROR: Pipeline failed at line $1"
    log "Check log file: $LOG_FILE"
    exit 1
}

trap 'handle_error $LINENO' ERR

log "Starting Hawks Analytics Pipeline"
log "=================================="

# Change to project directory
cd "$PROJECT_DIR"
log "Working directory: $(pwd)"

# Activate virtual environment
log "Activating virtual environment..."
source "$VENV_DIR/bin/activate"

# Verify Python
log "Python: $(which python)"
log "Version: $(python --version)"

# Set matplotlib backend for headless operation
export MPLBACKEND=Agg

# Run environment validation
log "Running environment validation..."
python validate_environment.py 2>&1 | tee -a "$LOG_FILE"

if [ ${PIPESTATUS[0]} -ne 0 ]; then
    log "Environment validation failed. Aborting."
    exit 1
fi

# Run the main pipeline
log "Starting main pipeline..."
python pipeline.py 2>&1 | tee -a "$LOG_FILE"

# Check exit status
if [ ${PIPESTATUS[0]} -eq 0 ]; then
    log "Pipeline completed successfully"
else
    log "Pipeline failed"
    exit 1
fi

log "=================================="
log "Pipeline execution complete"

Part 5: Prevention Strategies

5.1 Continuous Integration Checks

.github/workflows/validate.yml

name: Environment Validation

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM

jobs:
  validate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run environment validation
        run: python validate_environment.py

      - name: Run tests
        run: pytest tests/ -v

      - name: Test pipeline dry run
        run: python pipeline.py --dry-run

5.2 Dependency Lock File

requirements.lock (generated with pip freeze)

# This file contains exact versions for reproducible builds
# Generated: 2024-12-01
# Python: 3.11.5
# Platform: linux-x86_64

certifi==2023.7.22
charset-normalizer==3.2.0
contourpy==1.1.0
cycler==0.11.0
fonttools==4.42.1
idna==3.4
joblib==1.3.2
kiwisolver==1.4.5
matplotlib==3.7.2
nba-api==1.2.1
numpy==1.24.3
packaging==23.1
pandas==2.0.3
pillow==10.0.0
pyarrow==12.0.1
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2023.3
requests==2.31.0
scikit-learn==1.3.0
scipy==1.11.2
seaborn==0.12.2
six==1.16.0
threadpoolctl==3.2.0
tzdata==2023.3
urllib3==2.0.4
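
The same listing that pip freeze prints can be reconstructed from installed distribution metadata using only the standard library, which makes it easy to compare the live environment against requirements.lock inside the validation step (a sketch; the helper name is ours):

```python
from importlib import metadata


def freeze_lines() -> list:
    """Produce sorted `name==version` lines, like `pip freeze`, from installed metadata."""
    lines = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:                       # skip broken installs that lack a Name field
            lines.append(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)
```

Diffing freeze_lines() against the committed lock file at startup catches drift between the lock file and the live environment before the pipeline runs.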

Discussion Questions

Question 1: Version Pinning Tradeoffs

The original requirements.txt had no version pins. The new one has exact pins. What are the tradeoffs? When might you prefer one approach over the other?

Question 2: Production vs Development

Should development environments match production exactly? What problems might arise if they differ?

Question 3: Dependency Management Tools

Tools like Poetry, Pipenv, and pip-tools offer more sophisticated dependency management. What advantages might they provide over plain requirements.txt?

Question 4: Container Solutions

How might containerization (Docker) have prevented some of these issues? What are the tradeoffs of containerization?

Question 5: Monitoring

What monitoring or alerting should be in place to catch environment issues before they cause pipeline failures?


Deliverables

  1. Diagnostic Tool: Complete environment diagnostic script
  2. Fixed Requirements: Properly pinned requirements.txt
  3. Validation Script: Pre-flight environment validation
  4. Deployment Script: Robust deployment with error handling
  5. CI Configuration: GitHub Actions workflow for validation

Key Takeaways

  1. Pin your dependencies - Unpinned versions lead to reproducibility failures
  2. Validate before running - Catch environment issues before they cause pipeline failures
  3. Use virtual environments consistently - Isolation prevents dependency conflicts
  4. Handle headless environments - Configure backends for server deployment
  5. Log everything - Good logging makes debugging much easier
  6. Automate validation - CI/CD catches issues before they reach production

This case study demonstrates that environment management is not just a setup task but an ongoing operational concern that requires systematic approaches.