Appendix D: Environment Setup Guide
This guide walks you through setting up a complete data science environment for the exercises in this book. You have three paths: local setup with conda (recommended), local setup with venv, or cloud notebooks. Pick one and go.
Option 1: Local Setup with Conda (Recommended)
Conda manages both Python and non-Python dependencies (like C libraries that numpy and scipy need), which makes it the most reliable option for data science work.
Step 1: Install Miniconda
Download Miniconda (not full Anaconda — it's bloated) from https://docs.conda.io/en/latest/miniconda.html. Choose the installer for your operating system.
macOS/Linux:
# Download and run the installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Follow the prompts. Say "yes" to initializing conda.
# Restart your terminal, then verify:
conda --version
Windows:
Download the .exe installer and run it. Check "Add to PATH" during installation. Open a new terminal and verify:
conda --version
Step 2: Create the Project Environment
# Create a new environment with Python 3.11
conda create -n ids python=3.11 -y
# Activate it
conda activate ids
# Verify
python --version
# Should show Python 3.11.x
Step 3: Install Core Packages
# Install the scientific stack from conda-forge (better dependency resolution)
conda install -c conda-forge \
numpy pandas scipy matplotlib seaborn jupyter jupyterlab \
scikit-learn statsmodels -y
# Install ML libraries via pip (not all are on conda-forge)
pip install \
xgboost lightgbm catboost \
shap optuna \
mlflow \
fastapi uvicorn pydantic \
imbalanced-learn category_encoders \
geopandas folium \
nltk \
dask[complete] polars \
pyarrow \
python-dotenv \
joblib \
pytest black ruff mypy \
notebook
Step 4: Verify the Installation
Create a file called verify_setup.py and run it:
import sys
print(f"Python: {sys.version}")
libraries = [
    'numpy', 'pandas', 'scipy', 'matplotlib', 'seaborn',
    'sklearn', 'statsmodels', 'xgboost', 'lightgbm', 'catboost',
    'shap', 'optuna', 'mlflow', 'fastapi', 'pydantic',
    'imblearn', 'category_encoders',
    'geopandas', 'folium',
    'nltk',
    'dask', 'polars',
    'pyarrow', 'joblib', 'pytest',
]

for lib in libraries:
    try:
        mod = __import__(lib)
        version = getattr(mod, '__version__', 'installed')
        print(f" {lib}: {version}")
    except ImportError:
        print(f" {lib}: MISSING -- run pip install {lib}")
python verify_setup.py
All libraries should show a version number. If any show "MISSING," install them with pip install <library>.
Step 5: Download NLTK Data
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')
Step 6: Launch Jupyter
# JupyterLab (recommended)
jupyter lab
# Or classic notebook
jupyter notebook
Option 2: Local Setup with venv
If you prefer not to use conda, Python's built-in venv module works fine. You need Python 3.10 or 3.11 installed from https://www.python.org.
Create and Activate the Environment
# Create virtual environment
python -m venv ids-env
# Activate it
# macOS/Linux:
source ids-env/bin/activate
# Windows:
ids-env\Scripts\activate
# Upgrade pip
pip install --upgrade pip
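Before installing anything, it is worth confirming that the virtual environment is actually active. A stdlib-only check: inside a venv, sys.prefix points at the environment while sys.base_prefix still points at the base interpreter.

```python
import sys

# Inside an active venv, sys.prefix diverges from sys.base_prefix.
in_venv = sys.prefix != sys.base_prefix
print(f"prefix: {sys.prefix}")
print(f"virtual environment active: {in_venv}")
```

If this prints False, activate the environment first; otherwise pip will install into your system Python.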
Install All Packages
Save this as requirements.txt:
numpy>=1.24,<2.0
pandas>=2.0
scipy>=1.11
matplotlib>=3.7
seaborn>=0.12
jupyter>=1.0
jupyterlab>=4.0
scikit-learn>=1.3
statsmodels>=0.14
xgboost>=2.0
lightgbm>=4.0
catboost>=1.2
shap>=0.43
optuna>=3.4
mlflow>=2.9
fastapi>=0.104
uvicorn>=0.24
pydantic>=2.5
imbalanced-learn>=0.11
category_encoders>=2.6
geopandas>=0.14
folium>=0.15
nltk>=3.8
dask[complete]>=2023.12
polars>=0.19
pyarrow>=14.0
python-dotenv>=1.0
joblib>=1.3
pytest>=7.4
black>=23.12
ruff>=0.1
mypy>=1.7
Install:
pip install -r requirements.txt
Note on geopandas: On some systems, geopandas requires GDAL and other C libraries that can be difficult to install with pip alone. If you hit errors, switch to conda for geopandas: conda install -c conda-forge geopandas.
Option 3: Cloud Notebooks
If you want to skip local setup entirely, cloud notebooks provide pre-configured environments.
Google Colab (Free)
Open https://colab.research.google.com and create a new notebook. Most core libraries are pre-installed. For missing ones:
!pip install -q catboost shap optuna mlflow category_encoders \
imbalanced-learn polars geopandas folium fastapi
Colab limitations:
- Sessions time out after inactivity (free tier: ~90 minutes)
- No persistent file storage (use a Google Drive mount)
- Limited RAM on the free tier (12 GB)
- No local server (cannot run the FastAPI exercises natively)
Mount Google Drive for persistent storage:
from google.colab import drive
drive.mount('/content/drive')
# Save work to Drive
import shutil
shutil.copy('model.joblib', '/content/drive/MyDrive/ids-project/model.joblib')
Amazon SageMaker Studio Lab (Free)
Sign up at https://studiolab.sagemaker.aws (free, no AWS account needed). Provides a JupyterLab environment with:
- 15 GB persistent storage
- 12 hours of CPU or 4 hours of GPU per session
- Full conda/pip access
# In a SageMaker Studio Lab terminal
conda create -n ids python=3.11 -y
conda activate ids
pip install -r requirements.txt
Advantage over Colab: Persistent storage and a real terminal for running FastAPI and MLflow locally.
Kaggle Notebooks (Free)
Open https://www.kaggle.com/code and create a new notebook. Most ML libraries are pre-installed, plus internet access and free GPU time.
!pip install -q catboost optuna mlflow category_encoders polars
Kaggle limitations:
- 30 GB RAM, 20 hours of GPU per week
- No persistent terminal (notebook-only)
- Internet access must be enabled per notebook
Docker for Data Science
Docker containerizes your entire environment: Python, libraries, system dependencies, and your code. This ensures that your project runs identically on any machine.
Install Docker
Download Docker Desktop from https://www.docker.com/products/docker-desktop/. Available for Windows, macOS, and Linux.
Verify:
docker --version
docker run hello-world
Dockerfile for This Book
Save as Dockerfile in your project root:
FROM python:3.11-slim
# System dependencies for geopandas and other C-backed libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
libgdal-dev \
libgeos-dev \
libproj-dev \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Download NLTK data
RUN python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); \
nltk.download('stopwords'); nltk.download('wordnet'); \
nltk.download('vader_lexicon')"
# Copy project code
COPY . .
# Expose ports for Jupyter and FastAPI
EXPOSE 8888 8000
# Default command: start JupyterLab
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888", \
"--no-browser", "--allow-root", "--ServerApp.token=''"]
Build and Run
# Build the image
docker build -t ids-textbook .
# Run JupyterLab
docker run -p 8888:8888 -v $(pwd):/app ids-textbook
# Run FastAPI (Chapter 31)
docker run -p 8000:8000 ids-textbook \
uvicorn app:app --host 0.0.0.0 --port 8000
Open http://localhost:8888 for Jupyter or http://localhost:8000/docs for the FastAPI Swagger UI.
Docker Compose for Multi-Service Setup
For Chapter 30 (MLflow) and Chapter 31 (FastAPI), you may want multiple services running simultaneously. Save as docker-compose.yml:
version: '3.8'

services:
  jupyter:
    build: .
    ports:
      - "8888:8888"
    volumes:
      - .:/app
      - mlflow-data:/mlflow

  mlflow:
    build: .
    command: mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:///mlflow/mlflow.db --default-artifact-root /mlflow/artifacts
    ports:
      - "5000:5000"
    volumes:
      - mlflow-data:/mlflow

  api:
    build: .
    command: uvicorn app:app --host 0.0.0.0 --port 8000
    ports:
      - "8000:8000"
    volumes:
      - .:/app

volumes:
  mlflow-data:
# Start all services
docker compose up
# Stop all services
docker compose down
Project Structure
After setup, organize your project following the cookiecutter-data-science convention from Chapter 29:
streamflow-churn/
  data/
    raw/                      # Original, immutable data
    processed/                # Cleaned, transformed data
    features/                 # Final feature matrices
  notebooks/
    01-eda.ipynb
    02-feature-engineering.ipynb
    03-modeling.ipynb
  src/
    __init__.py
    data/
      __init__.py
      extract.py              # SQL extraction scripts
      clean.py
    features/
      __init__.py
      build_features.py       # Feature engineering pipeline
    models/
      __init__.py
      train.py
      predict.py
      evaluate.py
  model/
    churn_pipeline.joblib     # Serialized model
  app.py                      # FastAPI application
  tests/
    test_features.py
    test_model.py
    test_api.py
  Dockerfile
  docker-compose.yml
  requirements.txt
  .env                        # Environment variables (never commit)
  .gitignore
  README.md
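Notebooks in notebooks/ cannot see the src/ package by default. One minimal sketch is to put the project root on sys.path at the top of each notebook (the build_features import shown in the comment is a hypothetical helper, assuming the layout above):

```python
import sys
from pathlib import Path

# Resolve the project root. A notebook in notebooks/ sits one level
# below the root; adjust the hop count if your notebook is nested deeper.
project_root = Path.cwd().resolve()
if project_root.name == "notebooks":
    project_root = project_root.parent

# Make `src` importable from this session.
sys.path.insert(0, str(project_root))

# Now imports against the src/ layout work, e.g.:
# from src.features.build_features import build_features  # hypothetical
print(project_root)
```

Installing the project as an editable package (pip install -e .) is the more durable alternative, but the sys.path approach needs no packaging metadata.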
Recommended .gitignore
# Data (too large for git)
data/raw/
data/processed/
*.csv
*.parquet
# Models (too large for git)
model/*.joblib
model/*.pkl
mlruns/
# Environment
.env
ids-env/
__pycache__/
.ipynb_checkpoints/
# OS files
.DS_Store
Thumbs.db
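You can verify the ignore rules before committing anything with git check-ignore. The sketch below exercises the idea in a throwaway repo (it assumes the git CLI is on PATH; in your real project you would just run git check-ignore data/raw/somefile.csv directly):

```python
import subprocess
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    subprocess.run(["git", "init", "-q", str(repo)], check=True)
    # A slice of the recommended .gitignore above.
    (repo / ".gitignore").write_text("data/raw/\n*.csv\n.env\n")

    # check-ignore echoes back only the paths that the rules match.
    result = subprocess.run(
        ["git", "-C", str(repo), "check-ignore",
         "data/raw/x.parquet", "report.csv", ".env", "README.md"],
        capture_output=True, text=True,
    )
    ignored = result.stdout.split()
    print(ignored)  # README.md is absent: it would be tracked
```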
Troubleshooting
"ModuleNotFoundError" in Jupyter but pip says it's installed: Your Jupyter kernel may be using a different Python than your terminal. Fix:
conda activate ids
python -m ipykernel install --user --name ids --display-name "IDS (Python 3.11)"
Then select the "IDS (Python 3.11)" kernel in Jupyter.
geopandas installation fails (GDAL errors):
# Use conda instead of pip for geopandas
conda install -c conda-forge geopandas -y
CatBoost is slow to install: On platforms without a prebuilt wheel, pip compiles CatBoost from source. Be patient (5-10 minutes). If the install fails partway, retry with a fresh download to rule out a corrupted cached wheel:
pip install catboost --no-cache-dir
MLflow tracking server won't start: Ensure port 5000 is not already in use:
# Check what's using port 5000
lsof -i :5000 # macOS/Linux
netstat -ano | findstr :5000 # Windows
# Use a different port
mlflow server --host 0.0.0.0 --port 5001
SHAP is slow on large datasets: Use TreeSHAP for tree-based models (fast, exact) instead of KernelSHAP (slow, approximate):
# Fast: TreeExplainer for tree models
explainer = shap.TreeExplainer(xgb_model)
# Slow: KernelExplainer for any model (use small background sample)
explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 100))
Docker "permission denied" on Linux:
sudo usermod -aG docker $USER
# Log out and back in for the change to take effect
Polars vs. pandas confusion:
They have different APIs. Polars does not have an index, uses .select() instead of [] for column selection, and uses .filter() instead of boolean indexing. Refer to the Polars documentation at https://docs.pola.rs for the expression syntax.
Hardware Recommendations
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16-32 GB |
| CPU | 4 cores | 8+ cores |
| Storage | 20 GB free | 50+ GB SSD |
| GPU | Not required | Nice for CatBoost/XGBoost GPU training |
The exercises in this book run on any modern laptop. Chapters 28 (large datasets) and 14 (gradient boosting with large data) benefit from more RAM. If your machine has less than 8 GB, use a cloud notebook for those chapters.