Appendix C: Environment Setup Guide
This guide walks you through setting up a Python environment for statistical analysis. You have three main options, listed from easiest to most flexible.
C.1 Option 1: Google Colab (Easiest — No Installation Required)
Best for: Students who want to start coding immediately with zero setup. Also ideal if you're working on a Chromebook, a shared computer, or any device where you can't install software.
Getting Started
- Open a web browser and navigate to colab.research.google.com.
- Sign in with a Google account (any Gmail account works).
- Click New Notebook to create a blank notebook.
- You're ready to write Python.
Key Features
- Pre-installed libraries: numpy, pandas, matplotlib, seaborn, scipy, statsmodels, and scikit-learn are all already installed. No `pip install` needed for the libraries used in this textbook.
- Free GPU access: Not needed for this course, but useful if you continue on to machine learning.
- Auto-save: Notebooks are saved automatically to your Google Drive.
- Sharing: Click Share in the upper right to collaborate with classmates (works like Google Docs).
Loading Data in Colab
# Option A: Upload from your computer
from google.colab import files
uploaded = files.upload() # Opens a file chooser dialog
# Option B: Load directly from a URL
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/example/data.csv")
# Option C: Mount Google Drive
from google.colab import drive
drive.mount("/content/drive")
df = pd.read_csv("/content/drive/MyDrive/stats_class/data.csv")
Limitations
- Requires internet access at all times.
- Sessions time out after ~90 minutes of inactivity (you won't lose saved work, but unsaved variables are lost).
- Runtime resets if you close the browser tab.
- File uploads don't persist between sessions unless saved to Drive.
Recommendation
Use Colab for the first half of the course (Chapters 1-14). It lets you focus on learning statistics without worrying about installation. If you find yourself wanting more control or offline access, transition to Anaconda (Option 2) at your own pace.
C.2 Option 2: Anaconda (Recommended for a Full Setup)
Best for: Students who want a reliable, all-in-one installation on their own computer. Anaconda includes Python, Jupyter, and all the major data science libraries in a single installer.
Installation
Windows
- Go to anaconda.com/download.
- Download the Windows installer (64-bit).
- Run the installer. Accept the license agreement.
- Choose "Install for: Just Me" (recommended).
- Accept the default installation location (usually `C:\Users\YourName\anaconda3`).
- On the "Advanced Options" screen:
  - Check "Add Anaconda3 to my PATH environment variable" (despite the warning, it makes later steps easier).
  - Check "Register Anaconda3 as my default Python".
- Click Install. This takes 5-10 minutes.
- When finished, open Anaconda Navigator from the Start Menu.
macOS
- Go to anaconda.com/download.
- Download the macOS installer (choose the version for your chip — Apple Silicon for M1/M2/M3 Macs, or Intel for older Macs).
- Run the `.pkg` installer and follow the prompts.
- When finished, open Anaconda Navigator from Launchpad or the Applications folder.
Linux
- Go to anaconda.com/download.
- Download the Linux installer (`.sh` file).
- Open a terminal and run:
  bash ~/Downloads/Anaconda3-2024.xx-x-Linux-x86_64.sh
- Press Enter to review the license, then type `yes` to accept.
- Accept the default install location or choose your own.
- When asked to initialize Anaconda, type `yes`.
- Close and reopen your terminal.
Launching JupyterLab
From Anaconda Navigator:
1. Open Anaconda Navigator.
2. Click Launch under JupyterLab (or Jupyter Notebook).
3. A browser tab opens automatically.
From the command line:
jupyter lab
This opens JupyterLab in your default browser.
What's Included
Anaconda comes with 250+ packages pre-installed, including everything used in this textbook:
- Python 3.11+
- numpy, pandas, matplotlib, seaborn
- scipy, statsmodels, scikit-learn
- JupyterLab, Jupyter Notebook
Creating a Course Environment (Optional but Recommended)
Creating a dedicated environment keeps your course packages separate from other projects:
conda create -n stats_course python=3.11 numpy pandas matplotlib seaborn scipy statsmodels scikit-learn jupyterlab
conda activate stats_course
jupyter lab
To return to this environment later:
conda activate stats_course
C.3 Option 3: pip + Virtual Environment (Lightweight)
Best for: Students who already have Python installed or who prefer minimal installations.
Prerequisites
You need Python 3.9 or later. Check your version:
python --version
If you don't have Python, download it from python.org.
Setup
# Create a virtual environment
python -m venv stats_env
# Activate it
# Windows:
stats_env\Scripts\activate
# macOS/Linux:
source stats_env/bin/activate
# Install required packages
pip install numpy pandas matplotlib seaborn scipy statsmodels scikit-learn jupyterlab
# Launch JupyterLab
jupyter lab
Deactivating
When you're done working:
deactivate
To reactivate later:
# Windows:
stats_env\Scripts\activate
# macOS/Linux:
source stats_env/bin/activate
C.4 JupyterLab Basics
Whether you're using Colab, Anaconda, or pip, you'll be working in Jupyter notebooks. Here's a quick orientation.
Notebook Structure
A Jupyter notebook consists of cells. Each cell is either:
- Code cell: contains Python code. Run it by pressing Shift + Enter.
- Markdown cell: contains formatted text, headings, and explanations.
Essential Keyboard Shortcuts
| Action | Shortcut |
|---|---|
| Run current cell and move to next | Shift + Enter |
| Run current cell and stay | Ctrl + Enter |
| Insert cell below | B (in command mode) |
| Insert cell above | A (in command mode) |
| Delete cell | DD (press D twice in command mode) |
| Convert to Markdown | M (in command mode) |
| Convert to Code | Y (in command mode) |
| Enter command mode | Esc |
| Enter edit mode | Enter |
| Save notebook | Ctrl + S |
| Undo within cell | Ctrl + Z |
Best Practices for This Course
- Start every notebook with imports (see Appendix B, Section B.1).
- Run cells in order from top to bottom. If something breaks, try Kernel > Restart and Run All.
- Use Markdown cells to explain your reasoning. Your notebook should read like a report, not just a script.
- Name your notebooks clearly: `ch05-exploring-data.ipynb`, not `Untitled3.ipynb`.
- Save frequently. Colab auto-saves; JupyterLab requires Ctrl + S.
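The practices above can be rolled into a standard first cell. A minimal sketch (your course's exact imports are listed in Appendix B; the seed value 42 here is just an illustration):

```python
# A typical first cell for a course notebook (sketch; add the plotting
# imports from Appendix B as needed).
import numpy as np
import pandas as pd

# Seeding the random generator makes "Restart and Run All" reproducible.
rng = np.random.default_rng(42)

# Quick sanity check: draw a few values and confirm the imports work.
sample = rng.normal(loc=100, scale=15, size=5)
print(pd.Series(sample).round(2))
```

Re-running the notebook from the top will now produce the same random draws every time.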
Working with Data Files
# Check your current working directory
import os
print(os.getcwd())
# List files in the current directory
os.listdir(".")
# Change directory (JupyterLab only — not needed in Colab)
os.chdir("/path/to/your/data")
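If you prefer not to hard-code path strings, the standard library's `pathlib` can do the same checks; a sketch (the `stats_class` folder name here is hypothetical):

```python
from pathlib import Path

# Current working directory as a Path object
cwd = Path.cwd()
print(cwd)

# List files in the current directory
for entry in cwd.iterdir():
    print(entry.name)

# Build a data path relative to the working directory
# ("stats_class" is a made-up folder name for illustration)
data_path = cwd / "stats_class" / "data.csv"
print(data_path.exists())  # False unless that file actually exists
```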
C.5 Installing Additional Packages
In Colab
!pip install package_name
With Anaconda
conda install package_name
# or if not available via conda:
pip install package_name
With pip
pip install package_name
Packages Used in This Textbook
All packages below should already be installed if you followed the setup above:
| Package | Version Used | Purpose |
|---|---|---|
| numpy | 1.24+ | Numerical computing, random sampling |
| pandas | 2.0+ | Data loading, cleaning, manipulation |
| matplotlib | 3.7+ | Base plotting library |
| seaborn | 0.12+ | Statistical visualizations |
| scipy | 1.10+ | Statistical tests, distributions |
| statsmodels | 0.14+ | Regression, proportion tests, power analysis |
| scikit-learn | 1.3+ | Logistic regression, evaluation metrics (Ch.24) |
| jupyterlab | 4.0+ | Notebook environment |
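To check in one pass which of these packages are installed (and their versions), a sketch using the standard library's `importlib.metadata`:

```python
from importlib import metadata

# The course packages from the table above
packages = ["numpy", "pandas", "matplotlib", "seaborn",
            "scipy", "statsmodels", "scikit-learn", "jupyterlab"]

versions = {}
for name in packages:
    try:
        versions[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        versions[name] = "NOT INSTALLED"

for name, version in versions.items():
    print(f"{name:>14}  {version}")
```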
C.6 Troubleshooting Common Issues
"ModuleNotFoundError: No module named 'pandas'"
Cause: The package isn't installed in your current Python environment.
Fix:
pip install pandas
In Colab: !pip install pandas
If you're using Anaconda, make sure you've activated the correct environment:
conda activate stats_course
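A common cause of this error is that `pip` installed the package into a different Python than the one running your notebook. You can check which interpreter the notebook is actually using:

```python
import sys

# The Python interpreter your notebook is running right now
print(sys.executable)

# Installing with "python -m pip" targets exactly this interpreter:
#     python -m pip install pandas
# (run in a terminal, using the interpreter printed above)
```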
"Kernel died" or "Kernel is dead"
Cause: Usually caused by running out of memory (loading a very large dataset) or a package conflict.
Fix:
1. Save your notebook.
2. Click Kernel > Restart.
3. Re-run your cells from the top.
4. If the problem persists with a large dataset, try loading only a subset: `df = pd.read_csv("data.csv", nrows=10000)`
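If even a subset is too large to hold at once, pandas can also read a file in chunks and let you process each piece separately. A sketch, using an in-memory CSV so the example is self-contained (in practice you'd pass a filename):

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk
csv_text = "x,y\n" + "\n".join(f"{i},{i * 2}" for i in range(100))

total_rows = 0
running_sum = 0
# chunksize makes read_csv return an iterator of small DataFrames
# instead of one big frame, keeping memory use bounded
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    total_rows += len(chunk)
    running_sum += chunk["y"].sum()

print(total_rows)   # 100
print(running_sum)  # 9900
```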
Plots Not Showing
Cause: Missing the plt.show() call, or the notebook isn't configured for inline plots.
Fix:
# Add this at the top of your notebook (JupyterLab usually doesn't need it)
%matplotlib inline
# Always call plt.show() after creating a figure
plt.show()
"SettingWithCopyWarning"
Cause: You're trying to modify a slice of a DataFrame.
Fix: Use .loc for assignment:
# Instead of this (warning):
df[df["age"] > 30]["category"] = "Senior"
# Do this:
df.loc[df["age"] > 30, "category"] = "Senior"
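A self-contained demonstration of the `.loc` pattern (the ages and categories here are made up):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 35, 45], "category": ["", "", ""]})

# .loc selects the rows and the target column in one operation, so
# pandas writes directly into df instead of into a temporary copy
df.loc[df["age"] > 30, "category"] = "Senior"
print(df)
```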
"DtypeWarning: Columns have mixed types"
Cause: A column has inconsistent data types (e.g., numbers mixed with text).
Fix:
df = pd.read_csv("data.csv", dtype={"column_name": str})
# Then clean the column:
df["column_name"] = pd.to_numeric(df["column_name"], errors="coerce")
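A small self-contained example of the coercion step (the values are invented):

```python
import pandas as pd

# A column where numbers arrived as strings, plus one stray text value
s = pd.Series(["10", "20", "n/a", "30"])

# errors="coerce" turns anything unparseable into NaN instead of raising
cleaned = pd.to_numeric(s, errors="coerce")
print(cleaned)

# Count how many values failed to parse
print(cleaned.isna().sum())  # 1
```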
Colab Session Disconnects
Cause: You've been inactive for too long, or the session exceeded its time limit.
Fix:
- Save your work to Google Drive frequently.
- Re-run your notebook from the top when you reconnect (Colab doesn't preserve variables between sessions).
Version Conflicts
Cause: Different packages require different versions of shared dependencies.
Fix: Create a fresh environment:
conda create -n stats_fresh python=3.11
conda activate stats_fresh
pip install numpy pandas matplotlib seaborn scipy statsmodels scikit-learn jupyterlab
"FutureWarning" or "DeprecationWarning"
Cause: You're using syntax that will change in a future version of a library.
Fix: These are warnings, not errors. Your code still runs correctly. To suppress them during class (not recommended for production work):
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
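To silence a warning only around one noisy call rather than for the whole session, the standard library also offers a context manager:

```python
import warnings

def noisy():
    # Stand-in for a library call that emits a FutureWarning
    warnings.warn("this API will change", FutureWarning)
    return 42

# The filter applies only inside the with-block; the previous warning
# settings are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    result = noisy()

print(result)  # 42
```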
Excel Can't Open a CSV Properly
Cause: Encoding issues or delimiter confusion.
Fix:
# Try specifying encoding
df = pd.read_csv("data.csv", encoding="utf-8")
# or
df = pd.read_csv("data.csv", encoding="latin-1")
# If the delimiter isn't a comma
df = pd.read_csv("data.tsv", sep="\t") # Tab-separated
df = pd.read_csv("data.csv", sep=";") # Semicolon-separated
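If you're not sure what delimiter a file uses, Python's standard `csv` module can guess from a sample of the text:

```python
import csv

# Stand-in for the first few lines of a file of unknown format
sample = "name;score\nAlice;90\nBob;85\n"

# Sniffer inspects the sample and infers the dialect, including the delimiter
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))

# The detected delimiter can then be passed straight to pandas:
#     df = pd.read_csv("data.csv", sep=dialect.delimiter)
```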
C.7 Quick Start Checklist
Use this checklist to verify your environment is ready for the course:
- [ ] Python 3.9+ is installed (`python --version`)
- [ ] You can open a Jupyter notebook (Colab, JupyterLab, or Notebook)
- [ ] The following code runs without errors:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
import scipy
print("SciPy:", scipy.__version__)
# Quick test
data = np.random.normal(100, 15, size=50)
df = pd.DataFrame({"scores": data})
print(df.describe())
sns.histplot(df["scores"])
plt.title("Setup Test: Random Scores")
plt.show()
print("\nAll systems go!")
If that code produces a histogram and prints "All systems go!" — you're ready for Chapter 3.