Appendix D: Environment Setup Guide

This guide walks you through setting up a complete data science environment for the exercises in this book. You have three paths: local setup with conda (recommended), local setup with venv, or cloud notebooks. Pick one and go.


Option 1: Local Setup with Conda (Recommended)

Conda manages both Python and non-Python dependencies (such as the C libraries that numpy and scipy need), which makes it the most reliable option for data science work.

Step 1: Install Miniconda

Download Miniconda (not full Anaconda — it's bloated) from https://docs.conda.io/en/latest/miniconda.html. Choose the installer for your operating system.

macOS/Linux:

# Download and run the installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Follow the prompts. Say "yes" to initializing conda.
# Restart your terminal, then verify:
conda --version

Windows:

Download the .exe installer and run it. Check "Add to PATH" during installation. Open a new terminal and verify:

conda --version

Step 2: Create the Project Environment

# Create a new environment with Python 3.11
conda create -n ids python=3.11 -y

# Activate it
conda activate ids

# Verify
python --version
# Should show Python 3.11.x

Step 3: Install Core Packages

# Install the scientific stack from conda-forge (better dependency resolution)
conda install -c conda-forge \
    numpy pandas scipy matplotlib seaborn jupyter jupyterlab \
    scikit-learn statsmodels -y

# Install ML libraries via pip (not all are on conda-forge)
pip install \
    xgboost lightgbm catboost \
    shap optuna \
    mlflow \
    fastapi uvicorn pydantic \
    imbalanced-learn category_encoders \
    geopandas folium \
    nltk \
    "dask[complete]" polars \
    pyarrow \
    python-dotenv \
    joblib \
    pytest black ruff mypy \
    notebook

Step 4: Verify the Installation

Create a file called verify_setup.py and run it:

import sys
print(f"Python: {sys.version}")

libraries = [
    'numpy', 'pandas', 'scipy', 'matplotlib', 'seaborn',
    'sklearn', 'statsmodels', 'xgboost', 'lightgbm', 'catboost',
    'shap', 'optuna', 'mlflow', 'fastapi', 'pydantic',
    'imblearn', 'category_encoders',
    'geopandas', 'folium',
    'nltk',
    'dask', 'polars',
    'pyarrow', 'joblib', 'pytest'
]

for lib in libraries:
    try:
        mod = __import__(lib)
        version = getattr(mod, '__version__', 'installed')
        print(f"  {lib}: {version}")
    except ImportError:
        print(f"  {lib}: MISSING -- run pip install {lib}")

python verify_setup.py

All libraries should show a version number. If any show "MISSING," install them with pip. Note that a few import names differ from their package names: sklearn is installed as scikit-learn, and imblearn as imbalanced-learn.

Step 5: Download NLTK Data

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')

Step 6: Launch Jupyter

# JupyterLab (recommended)
jupyter lab

# Or classic notebook
jupyter notebook

Option 2: Local Setup with venv

If you prefer not to use conda, Python's built-in venv module works fine. You need Python 3.10 or 3.11 installed from https://www.python.org.

Create and Activate the Environment

# Create virtual environment
python -m venv ids-env

# Activate it
# macOS/Linux:
source ids-env/bin/activate
# Windows:
ids-env\Scripts\activate

# Upgrade pip
pip install --upgrade pip

Install All Packages

Save this as requirements.txt:

numpy>=1.24,<2.0
pandas>=2.0
scipy>=1.11
matplotlib>=3.7
seaborn>=0.12
jupyter>=1.0
jupyterlab>=4.0
scikit-learn>=1.3
statsmodels>=0.14

xgboost>=2.0
lightgbm>=4.0
catboost>=1.2
shap>=0.43
optuna>=3.4

mlflow>=2.9

fastapi>=0.104
uvicorn>=0.24
pydantic>=2.5

imbalanced-learn>=0.11
category_encoders>=2.6

geopandas>=0.14
folium>=0.15

nltk>=3.8

dask[complete]>=2023.12
polars>=0.19
pyarrow>=14.0

python-dotenv>=1.0
joblib>=1.3
pytest>=7.4
black>=23.12
ruff>=0.1
mypy>=1.7

Install:

pip install -r requirements.txt

Note on geopandas: On some systems, geopandas requires GDAL and other C libraries that can be difficult to install with pip alone. If you hit errors, install it with conda instead:

conda install -c conda-forge geopandas


Option 3: Cloud Notebooks

If you want to skip local setup entirely, cloud notebooks provide pre-configured environments.

Google Colab (Free)

Open https://colab.research.google.com and create a new notebook. Most core libraries are pre-installed. For missing ones:

!pip install -q catboost shap optuna mlflow category_encoders \
    imbalanced-learn polars geopandas folium fastapi

Colab limitations:
- Sessions time out after inactivity (free tier: ~90 minutes)
- No persistent file storage (use the Google Drive mount)
- Limited RAM on the free tier (12 GB)
- No local server (FastAPI exercises cannot run natively)

Mount Google Drive for persistent storage:

from google.colab import drive
drive.mount('/content/drive')

# Save work to Drive
import shutil
shutil.copy('model.joblib', '/content/drive/MyDrive/ids-project/model.joblib')

Amazon SageMaker Studio Lab (Free)

Sign up at https://studiolab.sagemaker.aws (free, no AWS account needed). Provides a JupyterLab environment with:
- 15 GB persistent storage
- 12 hours of CPU or 4 hours of GPU per session
- Full conda/pip access

# In a SageMaker Studio Lab terminal
conda create -n ids python=3.11 -y
conda activate ids
pip install -r requirements.txt

Advantage over Colab: Persistent storage and a real terminal for running FastAPI and MLflow locally.

Kaggle Notebooks (Free)

Open https://www.kaggle.com/code and create a new notebook. Most ML libraries are pre-installed, plus internet access and free GPU time.

!pip install -q catboost optuna mlflow category_encoders polars

Kaggle limitations:
- 30 GB RAM, 20 hours of GPU per week
- No persistent terminal (notebook-only)
- Internet access must be enabled per notebook


Docker for Data Science

Docker containerizes your entire environment: Python, libraries, system dependencies, and your code. This ensures that your project runs identically on any machine.

Install Docker

Download Docker Desktop from https://www.docker.com/products/docker-desktop/. Available for Windows, macOS, and Linux.

Verify:

docker --version
docker run hello-world

Dockerfile for This Book

Save as Dockerfile in your project root:

FROM python:3.11-slim

# System dependencies for geopandas and other C-backed libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libgdal-dev \
    libgeos-dev \
    libproj-dev \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Download NLTK data
RUN python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); \
    nltk.download('stopwords'); nltk.download('wordnet'); \
    nltk.download('vader_lexicon')"

# Copy project code
COPY . .

# Expose ports for Jupyter and FastAPI
EXPOSE 8888 8000

# Default command: start JupyterLab
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", \
     "--no-browser", "--allow-root", "--NotebookApp.token=''"]

Build and Run

# Build the image
docker build -t ids-textbook .

# Run JupyterLab
docker run -p 8888:8888 -v $(pwd):/app ids-textbook

# Run FastAPI (Chapter 31)
docker run -p 8000:8000 ids-textbook \
    uvicorn app:app --host 0.0.0.0 --port 8000

Open http://localhost:8888 for Jupyter or http://localhost:8000/docs for the FastAPI Swagger UI.

Docker Compose for Multi-Service Setup

For Chapter 30 (MLflow) and Chapter 31 (FastAPI), you may want multiple services running simultaneously. Save as docker-compose.yml:

version: '3.8'

services:
  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - .:/app
      - mlflow-data:/mlflow

  mlflow:
    build: .
    command: mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:///mlflow/mlflow.db --default-artifact-root /mlflow/artifacts
    ports:
      - "5000:5000"
    volumes:
      - mlflow-data:/mlflow

  api:
    build: .
    command: uvicorn app:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - .:/app

volumes:
  mlflow-data:

# Start all services
docker compose up

# Stop all services
docker compose down

Project Structure

After setup, organize your project following the cookiecutter-data-science convention from Chapter 29:

streamflow-churn/
    data/
        raw/                    # Original, immutable data
        processed/              # Cleaned, transformed data
        features/               # Final feature matrices
    notebooks/
        01-eda.ipynb
        02-feature-engineering.ipynb
        03-modeling.ipynb
    src/
        __init__.py
        data/
            __init__.py
            extract.py          # SQL extraction scripts
            clean.py
        features/
            __init__.py
            build_features.py   # Feature engineering pipeline
        models/
            __init__.py
            train.py
            predict.py
            evaluate.py
    model/
        churn_pipeline.joblib   # Serialized model
    app.py                      # FastAPI application
    tests/
        test_features.py
        test_model.py
        test_api.py
    Dockerfile
    docker-compose.yml
    requirements.txt
    .env                        # Environment variables (never commit)
    .gitignore
    README.md
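The model/ directory above holds the serialized pipeline. The save/load round trip with joblib follows this pattern (the tiny LogisticRegression here is a stand-in for the real churn pipeline):

```python
# Sketch of the round trip behind model/churn_pipeline.joblib
import joblib
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])  # toy stand-in model
joblib.dump(model, "churn_pipeline.joblib")               # serialize to disk

restored = joblib.load("churn_pipeline.joblib")           # load it back later
print(restored.predict([[0.9]]))
```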

Save this as .gitignore:

# Data (too large for git)
data/raw/
data/processed/
*.csv
*.parquet

# Models (too large for git)
model/*.joblib
model/*.pkl
mlruns/

# Environment
.env
ids-env/
__pycache__/
.ipynb_checkpoints/

# OS files
.DS_Store
Thumbs.db

Troubleshooting

"ModuleNotFoundError" in Jupyter but pip says it's installed: Your Jupyter kernel may be using a different Python than your terminal. Fix:

conda activate ids
python -m ipykernel install --user --name ids --display-name "IDS (Python 3.11)"

Then select the "IDS (Python 3.11)" kernel in Jupyter.
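To double-check which interpreter a notebook is actually running, this standard-library snippet prints the kernel's Python path; after the fix it should point inside your ids environment:

```python
import sys

print(sys.executable)  # interpreter backing the current kernel
print(sys.prefix)      # environment root; should point inside your ids env
```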

geopandas installation fails (GDAL errors):

# Use conda instead of pip for geopandas
conda install -c conda-forge geopandas -y

CatBoost is slow to install: On platforms without a prebuilt wheel, CatBoost compiles from source. Be patient (5-10 minutes), or retry without pip's cache in case a partially downloaded build is the problem:

pip install catboost --no-cache-dir

MLflow tracking server won't start: Ensure port 5000 is not already in use:

# Check what's using port 5000
lsof -i :5000   # macOS/Linux
netstat -ano | findstr :5000   # Windows

# Use a different port
mlflow server --host 0.0.0.0 --port 5001

SHAP is slow on large datasets: Use TreeSHAP for tree-based models (fast, exact) instead of KernelSHAP (slow, approximate):

import shap

# Fast: TreeExplainer computes exact SHAP values for tree-based models
explainer = shap.TreeExplainer(xgb_model)

# Slow: KernelExplainer works for any model; keep the background sample small
explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 100))

Docker "permission denied" on Linux:

sudo usermod -aG docker $USER
# Log out and back in for the change to take effect

Polars vs. pandas confusion: They have different APIs. Polars does not have an index, uses .select() instead of [] for column selection, and uses .filter() instead of boolean indexing. Refer to the Polars documentation at https://docs.pola.rs for the expression syntax.


Hardware Recommendations

Component    Minimum        Recommended
RAM          8 GB           16-32 GB
CPU          4 cores        8+ cores
Storage      20 GB free     50+ GB SSD
GPU          Not required   Nice for CatBoost/XGBoost GPU training

The exercises in this book run on any modern laptop. Chapters 28 (large datasets) and 14 (gradient boosting with large data) benefit from more RAM. If your machine has less than 8 GB, use a cloud notebook for those chapters.