Appendix D: Environment Setup

This appendix walks you through setting up a complete development environment for the code in this textbook. The stack includes Python, PyTorch with GPU support, Jupyter Lab, orchestration tools, and everything else you need to run every exercise from Chapter 1 through the capstone. Instructions cover Linux (the primary platform for production ML), macOS (common for development), and Windows via WSL2.


D.1 Platform Overview

Component           Version   Purpose
Python              3.11.x    Runtime
CUDA Toolkit        12.4      GPU compute for PyTorch
cuDNN               9.x       Deep learning GPU acceleration
PyTorch             2.3+      Deep learning framework
conda (miniforge)   latest    Environment management
Docker              24+       Containerized pipelines
Jupyter Lab         4.x       Interactive development
VS Code             latest    IDE
Git                 2.40+     Version control

Minimum hardware for GPU work: NVIDIA GPU with 8 GB VRAM (RTX 3070 or better), 32 GB RAM, 50 GB free disk. For CPU-only work: 16 GB RAM suffices, but training will be 10-50x slower.
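To see where these VRAM numbers come from, a rough rule of thumb: training with Adam in fp32 stores the weights, their gradients, and two optimizer moment buffers, roughly four times the parameter memory, before counting activations. A minimal sketch (the helper name is ours):

```python
def training_vram_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Lower-bound VRAM estimate for Adam training in fp32:
    weights + gradients + two optimizer moment buffers = ~4x
    parameter memory. Activations add more on top of this."""
    weights = n_params * bytes_per_param
    return 4 * weights / 1024**3
```

By this estimate, a 1-billion-parameter model needs roughly 15 GB before activations, which is why large models require mixed precision or bigger GPUs than the 8 GB minimum above.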


D.2 Linux Setup (Ubuntu 22.04 / 24.04)

Linux is the path of least resistance for ML development. These instructions target Ubuntu, but adapt straightforwardly to Fedora or Arch.

Step 1: System Dependencies

sudo apt update && sudo apt install -y \
    build-essential \
    git \
    curl \
    wget \
    unzip \
    libssl-dev \
    libffi-dev \
    libpq-dev \
    graphviz \
    graphviz-dev

Step 2: NVIDIA Driver and CUDA

Check your current driver:

nvidia-smi

If no driver is installed or the version is below 535:

sudo apt install -y nvidia-driver-550
sudo reboot

After reboot, confirm:

nvidia-smi
# Should show driver version >= 550 and CUDA version >= 12.4

Install the CUDA toolkit (for compiling custom CUDA kernels — not strictly required if you only use PyTorch's prebuilt binaries):

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-4

Add to your shell profile (~/.bashrc or ~/.zshrc):

export PATH="/usr/local/cuda-12.4/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH"

Verify:

nvcc --version
# Should show CUDA 12.4

Step 3: Miniforge (conda)

We use Miniforge instead of Anaconda to avoid licensing issues and get faster dependency resolution via mamba:

curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3
$HOME/miniforge3/bin/conda init bash
source ~/.bashrc

Step 4: Create the Course Environment

Save the following as environment.yml:

name: ads
channels:
  - conda-forge
  - pytorch
  - nvidia
dependencies:
  - python=3.11
  - pytorch=2.3
  - torchvision
  - torchaudio
  - pytorch-cuda=12.4
  - numpy>=1.26
  - pandas>=2.2
  - scipy>=1.13
  - scikit-learn>=1.5
  - matplotlib>=3.9
  - seaborn>=0.13
  - jupyterlab>=4.2
  - ipywidgets
  - notebook
  - sqlalchemy>=2.0
  - psycopg2
  - pyarrow>=16.0
  - polars>=1.0
  - pip
  - pip:
    - dowhy>=0.11
    - econml>=0.15
    - pymc>=5.16
    - arviz>=0.19
    - transformers>=4.42
    - datasets>=2.20
    - accelerate>=0.31
    - ray[tune]>=2.32
    - optuna>=3.6
    - mlflow>=2.14
    - great-expectations>=1.0
    - fairlearn>=0.10
    - opacus>=1.5
    - captum>=0.7
    - fastapi>=0.111
    - uvicorn[standard]>=0.30
    - pydantic>=2.7
    - shap>=0.45
    - plotly>=5.22
    - dagster>=1.7
    - dagster-webserver>=1.7
    - prefect>=2.19
    - pyspark>=3.5
    - docker>=7.1
    - black>=24.4
    - ruff>=0.4
    - pytest>=8.2
    - httpx>=0.27

Create the environment:

conda env create -f environment.yml
conda activate ads

Verify PyTorch GPU access:

python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"CPU\"}')"

Step 5: Jupyter Lab

conda activate ads
jupyter lab --no-browser --port=8888

For remote servers, add --ip=0.0.0.0 and use SSH tunneling:

# On your local machine:
ssh -N -L 8888:localhost:8888 user@remote-server

D.3 macOS Setup

Step 1: Xcode Command Line Tools and Homebrew

xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install git graphviz wget

Step 2: Miniforge

# Apple Silicon (M1/M2/M3/M4):
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh -b -p $HOME/miniforge3

# Intel Mac:
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-x86_64.sh
bash Miniforge3-MacOSX-x86_64.sh -b -p $HOME/miniforge3

$HOME/miniforge3/bin/conda init zsh
source ~/.zshrc

Step 3: Create the Environment

Use the same environment.yml as Linux, with one change: remove the pytorch-cuda=12.4 line. On Apple Silicon, the default PyTorch build uses MPS acceleration automatically. Note that PyTorch stopped publishing Intel macOS binaries after 2.2, so Intel Mac users should pin pytorch=2.2 and will run on CPU only.

# For Apple Silicon, the pytorch lines reduce to:
dependencies:
  - pytorch=2.3
  - torchvision
  - torchaudio
  # No pytorch-cuda line — PyTorch will use MPS automatically

conda env create -f environment.yml
conda activate ads

Verify MPS (Apple Silicon GPU) access:

python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"

macOS Notes

  • MPS (Metal Performance Shaders): Apple Silicon Macs have GPU acceleration via MPS. It is not as fast as CUDA but is significantly faster than CPU. Use device = torch.device('mps') in your code.
  • Memory: macOS shares memory between CPU and GPU. A 32 GB Mac can allocate most of that to the GPU, which is an advantage over discrete GPUs with fixed VRAM.
  • PySpark on macOS: Requires Java. Install with brew install openjdk@17 and set JAVA_HOME.
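The device notes above can be folded into a small helper so the same training code runs on CUDA, MPS, or CPU; a minimal sketch (the function name is ours):

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer CUDA, then Apple's MPS, then fall back to CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"

# Typical use with PyTorch:
# import torch
# device = torch.device(pick_device(torch.cuda.is_available(),
#                                   torch.backends.mps.is_available()))
# model = model.to(device)
```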

D.4 Windows Setup (WSL2)

Native Windows Python development works for basic tasks, but CUDA, Docker, and most production ML tools assume a Unix environment. WSL2 bridges this gap: you run a real Linux kernel inside Windows, with full GPU pass-through.

Step 1: Enable WSL2

Open PowerShell as Administrator:

wsl --install -d Ubuntu-24.04

Restart your machine when prompted. After restart, open the Ubuntu terminal and set your username and password.

Step 2: NVIDIA GPU Support in WSL2

Install the Windows NVIDIA driver (not the Linux driver) from nvidia.com/drivers. The Windows driver includes WSL2 GPU support automatically. Do not install a separate Linux NVIDIA driver inside WSL2.

Verify inside WSL2:

nvidia-smi

Step 3: Follow the Linux Instructions

Once inside WSL2, follow every step in Section D.2 (Linux Setup) exactly. WSL2 is a full Ubuntu environment.

Step 4: Docker Desktop Integration

Install Docker Desktop for Windows and enable the WSL2 backend in Settings > General > "Use the WSL 2 based engine." Then in Settings > Resources > WSL Integration, enable your Ubuntu distribution. Docker commands will work inside WSL2 without additional configuration.

Step 5: VS Code Integration

Install VS Code on Windows. Install the "WSL" extension. Then from inside WSL2:

code .

This opens VS Code on Windows but executes all terminal commands, linters, and debuggers inside WSL2. Your Python environment, CUDA toolkit, and all tools are in WSL2; VS Code is just the GUI layer.

Windows/WSL2 Notes

  • File system performance: Store all code and data inside the WSL2 filesystem (/home/username/), not on the Windows filesystem (/mnt/c/). Accessing Windows files from WSL2 is 5-10x slower due to the filesystem translation layer.
  • Memory allocation: By default, WSL2 takes up to 50% of system RAM. For ML workloads, create C:\Users\<username>\.wslconfig:

    [wsl2]
    memory=24GB
    swap=8GB
    processors=8

    Restart WSL2 with wsl --shutdown after changing this file.
  • Port forwarding: Jupyter and FastAPI servers running inside WSL2 are accessible from Windows at localhost automatically.

D.5 Docker for ML

Docker ensures reproducibility: the same container runs identically on your laptop, your colleague's machine, and a cloud GPU instance.
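Because the course Dockerfile copies the entire build context into the image (COPY . .), a .dockerignore keeps the context, and therefore the image, small and the build cache stable. A minimal sketch:

```
.git
**/__pycache__
**/.ipynb_checkpoints
data/
mlruns/
.venv
```

Without it, large data directories and git history get shipped into every image layer.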

Base Dockerfile for Course Work

FROM pytorch/pytorch:2.3.0-cuda12.4-cudnn9-runtime

WORKDIR /app

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    curl \
    build-essential \
    libpq-dev \
    graphviz \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project code
COPY . .

# Default command
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]

Build and run:

docker build -t ads-course .
docker run --gpus all -p 8888:8888 -v $(pwd)/notebooks:/app/notebooks ads-course

Multi-Stage Build for Production Serving

# Stage 1: Build
FROM python:3.11-slim AS builder
WORKDIR /build
COPY requirements-serve.txt .
RUN pip install --no-cache-dir --target=/install -r requirements-serve.txt

# Stage 2: Runtime
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local/lib/python3.11/site-packages
COPY serve/ .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

Docker Compose for Full Pipeline Stack

services:
  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/app/notebooks
      - ./data:/app/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.14.0
    ports:
      - "5000:5000"
    volumes:
      - mlflow_data:/mlflow
    command: >
      mlflow server
      --host 0.0.0.0
      --port 5000
      --backend-store-uri sqlite:////mlflow/mlflow.db
      --default-artifact-root /mlflow/artifacts

  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: ads
      POSTGRES_PASSWORD: ads_dev_password
      POSTGRES_DB: features
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  mlflow_data:
  pg_data:

Start the full stack:

docker compose up -d

D.6 Cloud GPU Instances

For chapters requiring extended GPU training (Chapters 8-12, 18-20, 31-33), a cloud GPU instance may be necessary if your local hardware is insufficient.

AWS (EC2)

# Launch a GPU instance via AWS CLI
aws ec2 run-instances \
    --image-id ami-0abcdef1234567890 \
    --instance-type g5.xlarge \
    --key-name my-key \
    --security-group-ids sg-0123456789abcdef0 \
    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200,"VolumeType":"gp3"}}]' \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ads-training}]'

Instance Type   GPU       VRAM     Hourly Cost (approx.)   Use Case
g5.xlarge       1x A10G   24 GB    $1.00                   Most chapters
g5.2xlarge      1x A10G   24 GB    $1.20                   Larger batch sizes
p4d.24xlarge    8x A100   320 GB   $32.77                  Distributed training (Ch. 31)

GCP (Compute Engine)

gcloud compute instances create ads-training \
    --zone=us-central1-a \
    --machine-type=g2-standard-8 \
    --accelerator=type=nvidia-l4,count=1 \
    --boot-disk-size=200GB \
    --image-family=pytorch-latest-gpu \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE

Cloud Cost Management

  • Use spot/preemptible instances: 60-90% cheaper, but they can be reclaimed with little warning (a two-minute notice on AWS, 30 seconds on GCP). Save checkpoints frequently (every epoch, or every N steps).
  • Auto-shutdown: Set up a cron job or cloud function to stop instances that have been idle for 30 minutes. Forgetting to stop an instance is the most common cloud cost mistake.

    # Add to crontab on the instance itself.
    # Shuts down after no GPU processes have run for 30 minutes.
    */5 * * * * if [ $(nvidia-smi --query-compute-apps=pid --format=csv,noheader | wc -l) -eq 0 ]; then echo "idle" >> /tmp/idle.log; else rm -f /tmp/idle.log; fi
    */30 * * * * if [ $(wc -l < /tmp/idle.log 2>/dev/null || echo 0) -ge 6 ]; then sudo shutdown -h now; fi
  • Storage: Detach and delete data volumes when you are done. A 200 GB volume costs roughly $20/month even when the instance is stopped.
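To make the spot-versus-on-demand trade-off concrete, a back-of-the-envelope cost comparison (rates and discount are illustrative, not quotes; the function name is ours):

```python
def training_cost(hours: float, hourly_rate: float,
                  spot_discount: float = 0.0) -> float:
    """Total instance cost in dollars; spot_discount is the
    fractional price reduction (0.7 = 70% off on-demand)."""
    return hours * hourly_rate * (1.0 - spot_discount)

# 40 hours of training on a g5.xlarge at ~$1.00/hr:
on_demand = training_cost(40, 1.00)                 # 40.0
spot = training_cost(40, 1.00, spot_discount=0.7)   # 12.0
```

At a typical 70% discount, the same 40-hour run drops from $40 to $12, which is why checkpoint-tolerant jobs should default to spot capacity.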

D.7 VS Code Configuration

VS Code with the Python and Jupyter extensions provides a strong development experience for this course.

Extension                                    Purpose
Python (ms-python.python)                    Linting, IntelliSense, debugging
Jupyter (ms-toolsai.jupyter)                 Notebook support in VS Code
Pylance (ms-python.vscode-pylance)           Type checking and autocompletion
Docker (ms-azuretools.vscode-docker)         Dockerfile and Compose support
Remote - SSH (ms-vscode-remote.remote-ssh)   Develop on cloud instances
WSL (ms-vscode-remote.remote-wsl)            Develop inside WSL2
GitLens (eamodio.gitlens)                    Git history and blame
Ruff (charliermarsh.ruff)                    Fast Python linter

Workspace Settings

Save as .vscode/settings.json in your project root:

{
    "python.defaultInterpreterPath": "~/miniforge3/envs/ads/bin/python",
    "python.analysis.typeCheckingMode": "basic",
    "python.testing.pytestEnabled": true,
    "python.testing.pytestArgs": ["tests/"],
    "[python]": {
        "editor.defaultFormatter": "charliermarsh.ruff",
        "editor.formatOnSave": true,
        "editor.codeActionsOnSave": {
            "source.fixAll.ruff": "explicit",
            "source.organizeImports.ruff": "explicit"
        }
    },
    "jupyter.notebookFileRoot": "${workspaceFolder}",
    "files.exclude": {
        "**/__pycache__": true,
        "**/.ipynb_checkpoints": true,
        "**/mlruns": true
    }
}

D.8 Troubleshooting Common Issues

CUDA / GPU Issues

Problem: torch.cuda.is_available() returns False even though nvidia-smi shows the GPU.

Diagnosis and fix:

# Check that PyTorch was built with CUDA support
python -c "import torch; print(torch.version.cuda)"
# If this prints None, you installed the CPU-only PyTorch

# Reinstall with CUDA:
conda install pytorch pytorch-cuda=12.4 -c pytorch -c nvidia

Problem: CUDA out of memory during training.

Fixes (in order of preference):

1. Reduce batch size.
2. Use mixed precision: torch.cuda.amp.autocast() and GradScaler.
3. Use gradient accumulation to simulate larger batches.
4. Use gradient checkpointing (for Hugging Face models: model.gradient_checkpointing_enable()).
5. Move to a larger GPU.

# Gradient accumulation: one optimizer step per `accumulation_steps` micro-batches
accumulation_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(train_loader):
    loss = model(batch) / accumulation_steps  # scale so accumulated gradients average
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
# Any leftover micro-batches at the end of the epoch are not stepped;
# either drop them or call optimizer.step() once more after the loop.

Dependency Conflicts

Problem: pip install fails with version conflicts.

Fix: Use conda for heavy numerical packages (PyTorch, numpy, scipy, scikit-learn) and pip only for packages not available in conda-forge. Never mix conda install and pip install for the same package.

# If your environment is corrupted, rebuild it:
conda deactivate
conda env remove -n ads
conda env create -f environment.yml

Problem: ImportError: libcudnn.so.9: cannot open shared object file.

Fix: cuDNN is missing or the wrong version. If using conda, it should be installed automatically with pytorch-cuda. If not:

conda install -c nvidia cudnn=9

Jupyter Issues

Problem: Jupyter kernel dies during large computations.

Fixes:

1. Check memory usage with htop or nvidia-smi. You may be running out of RAM or VRAM.
2. Increase Jupyter's I/O buffer limit: jupyter lab --NotebookApp.max_buffer_size=10737418240
3. For very large datasets, use Dask or PySpark instead of loading everything into a pandas DataFrame.

Problem: Jupyter cannot find the conda environment.

Fix:

conda activate ads
python -m ipykernel install --user --name=ads --display-name="Python (ads)"

Then select the "Python (ads)" kernel in Jupyter Lab.

Docker Issues

Problem: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Fix: Install the NVIDIA Container Toolkit:

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

PySpark Issues

Problem: JAVA_HOME is not set when starting PySpark.

Fix:

# Ubuntu/WSL2
sudo apt install -y openjdk-17-jdk
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

# macOS
brew install openjdk@17
export JAVA_HOME=$(/usr/libexec/java_home -v 17)

Add the export JAVA_HOME line to your shell profile.

Airflow Issues

Problem: Airflow database initialization errors.

Fix: Airflow requires a metadata database. For local development:

export AIRFLOW_HOME=~/airflow
airflow db migrate
airflow users create \
    --username admin \
    --password admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com
airflow webserver --port 8080 &
airflow scheduler &

D.9 Verifying Your Installation

Run this verification script after setup to confirm everything works:

"""verify_setup.py — Run this to confirm your environment is ready."""

import sys

def check(name, test_fn):
    try:
        result = test_fn()
        print(f"  [PASS] {name}: {result}")
        return True
    except Exception as e:
        print(f"  [FAIL] {name}: {e}")
        return False

print(f"Python {sys.version}\n")
print("Core libraries:")
results = []

results.append(check("NumPy", lambda: __import__("numpy").__version__))
results.append(check("pandas", lambda: __import__("pandas").__version__))
results.append(check("scikit-learn", lambda: __import__("sklearn").__version__))
results.append(check("PyTorch", lambda: __import__("torch").__version__))
results.append(check("CUDA available", lambda: __import__("torch").cuda.is_available()))

print("\nML ecosystem:")
results.append(check("DoWhy", lambda: __import__("dowhy").__version__))
results.append(check("EconML", lambda: __import__("econml").__version__))
results.append(check("PyMC", lambda: __import__("pymc").__version__))
results.append(check("ArviZ", lambda: __import__("arviz").__version__))
results.append(check("Transformers", lambda: __import__("transformers").__version__))
results.append(check("Ray", lambda: __import__("ray").__version__))
results.append(check("Optuna", lambda: __import__("optuna").__version__))
results.append(check("MLflow", lambda: __import__("mlflow").__version__))
results.append(check("Great Expectations", lambda: __import__("great_expectations").__version__))
results.append(check("Fairlearn", lambda: __import__("fairlearn").__version__))
results.append(check("Opacus", lambda: __import__("opacus").__version__))
results.append(check("Captum", lambda: __import__("captum").__version__))
results.append(check("FastAPI", lambda: __import__("fastapi").__version__))
results.append(check("SHAP", lambda: __import__("shap").__version__))

print("\nOrchestration and infrastructure:")
results.append(check("PySpark", lambda: __import__("pyspark").__version__))
results.append(check("Dagster", lambda: __import__("dagster").__version__))
results.append(check("Prefect", lambda: __import__("prefect").__version__))

passed = sum(results)
total = len(results)
print(f"\n{'='*50}")
print(f"Results: {passed}/{total} checks passed")
if passed == total:
    print("Your environment is ready.")
else:
    print("Some checks failed. Review the errors above.")

Run it:

conda activate ads
python verify_setup.py

CUDA availability returning False is acceptable if you are on macOS (use MPS instead) or do not have an NVIDIA GPU. All other checks should pass.


Cross-references: Appendix C covers the API patterns for each library installed here. Appendix E lists the datasets you will download into this environment.