Appendix C: Environment Setup Guide
This guide walks you through setting up a Python environment for statistical analysis. You have three main options, listed from easiest to most flexible.
C.1 Option 1: Google Colab (Easiest — No Installation Required)
Best for: Students who want to start coding immediately with zero setup. Also ideal if you're working on a Chromebook, a shared computer, or any device where you can't install software.
Getting Started
- Open a web browser and navigate to colab.research.google.com.
- Sign in with a Google account (any Gmail account works).
- Click New Notebook to create a blank notebook.
- You're ready to write Python.
Key Features
- Pre-installed libraries: numpy, pandas, matplotlib, seaborn, scipy, statsmodels, and scikit-learn are all already installed. No `pip install` needed for the libraries used in this textbook.
- Free GPU access: Not needed for this course, but useful if you continue on to machine learning.
- Auto-save: Notebooks are saved automatically to your Google Drive.
- Sharing: Click Share in the upper right to collaborate with classmates (works like Google Docs).
Loading Data in Colab
# Option A: Upload from your computer
from google.colab import files
uploaded = files.upload() # Opens a file chooser dialog
# Option B: Load directly from a URL
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/example/data.csv")
# Option C: Mount Google Drive
from google.colab import drive
drive.mount("/content/drive")
df = pd.read_csv("/content/drive/MyDrive/stats_class/data.csv")
Limitations
- Requires internet access at all times.
- Sessions time out after ~90 minutes of inactivity (you won't lose saved work, but unsaved variables are lost).
- Runtime resets if you close the browser tab.
- File uploads don't persist between sessions unless saved to Drive.
Recommendation
Use Colab for the first half of the course (Chapters 1-14). It lets you focus on learning statistics without worrying about installation. If you find yourself wanting more control or offline access, transition to Anaconda (Option 2) at your own pace.
C.2 Option 2: Anaconda (Recommended for a Full Setup)
Best for: Students who want a reliable, all-in-one installation on their own computer. Anaconda includes Python, Jupyter, and all the major data science libraries in a single installer.
Installation
Windows
- Go to anaconda.com/download.
- Download the Windows installer (64-bit).
- Run the installer. Accept the license agreement.
- Choose "Install for: Just Me" (recommended).
- Accept the default installation location (usually `C:\Users\YourName\anaconda3`).
- On the "Advanced Options" screen:
  - Check "Add Anaconda3 to my PATH environment variable" (despite the warning, it makes later steps easier).
  - Check "Register Anaconda3 as my default Python".
- Click Install. This takes 5-10 minutes.
- When finished, open Anaconda Navigator from the Start Menu.
macOS
- Go to anaconda.com/download.
- Download the macOS installer (choose the version for your chip — Apple Silicon for M1/M2/M3 Macs, or Intel for older Macs).
- Run the `.pkg` installer and follow the prompts.
- When finished, open Anaconda Navigator from Launchpad or the Applications folder.
Linux
- Go to anaconda.com/download.
- Download the Linux installer (`.sh` file).
- Open a terminal and run:
  bash ~/Downloads/Anaconda3-2024.xx-x-Linux-x86_64.sh
- Press Enter to review the license, then type `yes` to accept.
- Accept the default install location or choose your own.
- When asked to initialize Anaconda, type `yes`.
- Close and reopen your terminal.
Launching JupyterLab
From Anaconda Navigator:
1. Open Anaconda Navigator.
2. Click Launch under JupyterLab (or Jupyter Notebook).
3. A browser tab opens automatically.
From the command line:
jupyter lab
This opens JupyterLab in your default browser.
What's Included
Anaconda comes with 250+ packages pre-installed, including everything used in this textbook:
- Python 3.11+
- numpy, pandas, matplotlib, seaborn
- scipy, statsmodels, scikit-learn
- JupyterLab, Jupyter Notebook
Creating a Course Environment (Optional but Recommended)
Creating a dedicated environment keeps your course packages separate from other projects:
conda create -n stats_course python=3.11 numpy pandas matplotlib seaborn scipy statsmodels scikit-learn jupyterlab
conda activate stats_course
jupyter lab
To return to this environment later:
conda activate stats_course
C.3 Option 3: pip + Virtual Environment (Lightweight)
Best for: Students who already have Python installed or who prefer minimal installations.
Prerequisites
You need Python 3.9 or later. Check your version:
python --version
If you don't have Python, download it from python.org.
Setup
# Create a virtual environment
python -m venv stats_env
# Activate it
# Windows:
stats_env\Scripts\activate
# macOS/Linux:
source stats_env/bin/activate
# Install required packages
pip install numpy pandas matplotlib seaborn scipy statsmodels scikit-learn jupyterlab
# Launch JupyterLab
jupyter lab
Deactivating
When you're done working:
deactivate
To reactivate later:
# Windows:
stats_env\Scripts\activate
# macOS/Linux:
source stats_env/bin/activate
C.4 JupyterLab Basics
Whether you're using Colab, Anaconda, or pip, you'll be working in Jupyter notebooks. Here's a quick orientation.
Notebook Structure
A Jupyter notebook consists of cells. Each cell is either:
- Code cell: contains Python code. Run it by pressing Shift + Enter.
- Markdown cell: contains formatted text, headings, and explanations.
Essential Keyboard Shortcuts
| Action | Shortcut |
|---|---|
| Run current cell and move to next | Shift + Enter |
| Run current cell and stay | Ctrl + Enter |
| Insert cell below | B (in command mode) |
| Insert cell above | A (in command mode) |
| Delete cell | DD (press D twice in command mode) |
| Convert to Markdown | M (in command mode) |
| Convert to Code | Y (in command mode) |
| Enter command mode | Esc |
| Enter edit mode | Enter |
| Save notebook | Ctrl + S |
| Undo within cell | Ctrl + Z |
Best Practices for This Course
- Start every notebook with imports (see Appendix B, Section B.1).
- Run cells in order from top to bottom. If something breaks, try Kernel > Restart and Run All.
- Use Markdown cells to explain your reasoning. Your notebook should read like a report, not just a script.
- Name your notebooks clearly: `ch05-exploring-data.ipynb`, not `Untitled3.ipynb`.
- Save frequently. Colab auto-saves; JupyterLab requires Ctrl + S.
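The practices above can be rolled into a standard first cell. A minimal sketch (your course's exact imports are listed in Appendix B; the seed value 42 here is just an illustration):

```python
# A typical first cell for a course notebook (sketch; add the plotting
# imports from Appendix B as needed).
import numpy as np
import pandas as pd

# Seeding the random generator makes "Restart and Run All" reproducible.
rng = np.random.default_rng(42)

# Quick sanity check: draw a few values and confirm the imports work.
sample = rng.normal(loc=100, scale=15, size=5)
print(pd.Series(sample).round(2))
```

Re-running the notebook from the top will now produce the same random draws every time.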
Working with Data Files
# Check your current working directory
import os
print(os.getcwd())
# List files in the current directory
os.listdir(".")
# Change directory (JupyterLab only — not needed in Colab)
os.chdir("/path/to/your/data")
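If you prefer not to hard-code path strings, the standard library's `pathlib` can do the same checks; a sketch (the `stats_class` folder name here is hypothetical):

```python
from pathlib import Path

# Current working directory as a Path object
cwd = Path.cwd()
print(cwd)

# List files in the current directory
for entry in cwd.iterdir():
    print(entry.name)

# Build a data path relative to the working directory
# ("stats_class" is a made-up folder name for illustration)
data_path = cwd / "stats_class" / "data.csv"
print(data_path.exists())  # False unless that file actually exists
```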
C.5 Installing Additional Packages
In Colab
!pip install package_name
With Anaconda
conda install package_name
# or if not available via conda:
pip install package_name
With pip
pip install package_name
Packages Used in This Textbook
All packages below should already be installed if you followed the setup above:
| Package | Version Used | Purpose |
|---|---|---|
| numpy | 1.24+ | Numerical computing, random sampling |
| pandas | 2.0+ | Data loading, cleaning, manipulation |
| matplotlib | 3.7+ | Base plotting library |
| seaborn | 0.12+ | Statistical visualizations |
| scipy | 1.10+ | Statistical tests, distributions |
| statsmodels | 0.14+ | Regression, proportion tests, power analysis |
| scikit-learn | 1.3+ | Logistic regression, evaluation metrics (Ch.24) |
| jupyterlab | 4.0+ | Notebook environment |
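To check in one pass which of these packages are installed (and their versions), a sketch using the standard library's `importlib.metadata`:

```python
from importlib import metadata

# The course packages from the table above
packages = ["numpy", "pandas", "matplotlib", "seaborn",
            "scipy", "statsmodels", "scikit-learn", "jupyterlab"]

versions = {}
for name in packages:
    try:
        versions[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        versions[name] = "NOT INSTALLED"

for name, version in versions.items():
    print(f"{name:>14}  {version}")
```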
C.6 Troubleshooting Common Issues
"ModuleNotFoundError: No module named 'pandas'"
Cause: The package isn't installed in your current Python environment.
Fix:
pip install pandas
In Colab: !pip install pandas
If you're using Anaconda, make sure you've activated the correct environment:
conda activate stats_course
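A common cause of this error is that `pip` installed the package into a different Python than the one running your notebook. You can check which interpreter the notebook is actually using:

```python
import sys

# The Python interpreter your notebook is running right now
print(sys.executable)

# Installing with "python -m pip" targets exactly this interpreter:
#     python -m pip install pandas
# (run in a terminal, using the interpreter printed above)
```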
"Kernel died" or "Kernel is dead"
Cause: Usually caused by running out of memory (loading a very large dataset) or a package conflict.
Fix:
1. Save your notebook.
2. Click Kernel > Restart.
3. Re-run your cells from the top.
4. If the problem persists with a large dataset, try loading only a subset: `df = pd.read_csv("data.csv", nrows=10000)`
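If even a subset is too large to hold at once, pandas can also read a file in chunks and let you process each piece separately. A sketch, using an in-memory CSV so the example is self-contained (in practice you'd pass a filename):

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk
csv_text = "x,y\n" + "\n".join(f"{i},{i * 2}" for i in range(100))

total_rows = 0
running_sum = 0
# chunksize makes read_csv return an iterator of small DataFrames
# instead of one big frame, keeping memory use bounded
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25):
    total_rows += len(chunk)
    running_sum += chunk["y"].sum()

print(total_rows)   # 100
print(running_sum)  # 9900
```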
Plots Not Showing
Cause: Missing the plt.show() call, or the notebook isn't configured for inline plots.
Fix:
# Add this at the top of your notebook (JupyterLab usually doesn't need it)
%matplotlib inline
# Always call plt.show() after creating a figure
plt.show()
"SettingWithCopyWarning"
Cause: You're trying to modify a slice of a DataFrame.
Fix: Use .loc for assignment:
# Instead of this (warning):
df[df["age"] > 30]["category"] = "Senior"
# Do this:
df.loc[df["age"] > 30, "category"] = "Senior"
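A self-contained demonstration of the `.loc` pattern (the ages and categories here are made up):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 35, 45], "category": ["", "", ""]})

# .loc selects the rows and the target column in one operation, so
# pandas writes directly into df instead of into a temporary copy
df.loc[df["age"] > 30, "category"] = "Senior"
print(df)
```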
"DtypeWarning: Columns have mixed types"
Cause: A column has inconsistent data types (e.g., numbers mixed with text).
Fix:
df = pd.read_csv("data.csv", dtype={"column_name": str})
# Then clean the column:
df["column_name"] = pd.to_numeric(df["column_name"], errors="coerce")
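A small self-contained example of the coercion step (the values are invented):

```python
import pandas as pd

# A column where numbers arrived as strings, plus one stray text value
s = pd.Series(["10", "20", "n/a", "30"])

# errors="coerce" turns anything unparseable into NaN instead of raising
cleaned = pd.to_numeric(s, errors="coerce")
print(cleaned)

# Count how many values failed to parse
print(cleaned.isna().sum())  # 1
```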
Colab Session Disconnects
Cause: You've been inactive for too long, or the session exceeded its time limit.
Fix:
- Save your work to Google Drive frequently.
- Re-run your notebook from the top when you reconnect (Colab doesn't preserve variables between sessions).
Version Conflicts
Cause: Different packages require different versions of shared dependencies.
Fix: Create a fresh environment:
conda create -n stats_fresh python=3.11
conda activate stats_fresh
pip install numpy pandas matplotlib seaborn scipy statsmodels scikit-learn jupyterlab
"FutureWarning" or "DeprecationWarning"
Cause: You're using syntax that will change in a future version of a library.
Fix: These are warnings, not errors. Your code still runs correctly. To suppress them during class (not recommended for production work):
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
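To silence a warning only around one noisy call rather than for the whole session, the standard library also offers a context manager:

```python
import warnings

def noisy():
    # Stand-in for a library call that emits a FutureWarning
    warnings.warn("this API will change", FutureWarning)
    return 42

# The filter applies only inside the with-block; the previous warning
# settings are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    result = noisy()

print(result)  # 42
```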
Excel Can't Open a CSV Properly
Cause: Encoding issues or delimiter confusion.
Fix:
# Try specifying encoding
df = pd.read_csv("data.csv", encoding="utf-8")
# or
df = pd.read_csv("data.csv", encoding="latin-1")
# If the delimiter isn't a comma
df = pd.read_csv("data.tsv", sep="\t") # Tab-separated
df = pd.read_csv("data.csv", sep=";") # Semicolon-separated
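If you're not sure what delimiter a file uses, Python's standard `csv` module can guess from a sample of the text:

```python
import csv

# Stand-in for the first few lines of a file of unknown format
sample = "name;score\nAlice;90\nBob;85\n"

# Sniffer inspects the sample and infers the dialect, including the delimiter
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))

# The detected delimiter can then be passed straight to pandas:
#     df = pd.read_csv("data.csv", sep=dialect.delimiter)
```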
C.7 Quick Start Checklist
Use this checklist to verify your environment is ready for the course:
- [ ] Python 3.9+ is installed (`python --version`)
- [ ] You can open a Jupyter notebook (Colab, JupyterLab, or Notebook)
- [ ] The following code runs without errors:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
import scipy
print("SciPy:", scipy.__version__)
# Quick test
data = np.random.normal(100, 15, size=50)
df = pd.DataFrame({"scores": data})
print(df.describe())
sns.histplot(df["scores"])
plt.title("Setup Test: Random Scores")
plt.show()
print("\nAll systems go!")
If that code produces a histogram and prints "All systems go!" — you're ready for Chapter 3.