Appendix B: Python Quick Reference for Statistics

This appendix collects the essential Python code used throughout the textbook into a single reference. All code assumes you have imported the standard libraries as shown in Section B.1. For setup instructions, see Appendix C.


B.1 Standard Imports

Every notebook in this textbook begins with these imports:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Optional: nicer defaults for plots
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

Additional imports used in specific chapters:

import statsmodels.api as sm                          # Ch.22-24
from statsmodels.stats.proportion import (            # Ch.14, Ch.16
    proportions_ztest, proportion_confint,
    confint_proportions_2indep
)
from statsmodels.stats.multicomp import pairwise_tukeyhsd  # Ch.20
from statsmodels.stats.power import (                 # Ch.17
    TTestIndPower, NormalIndPower
)
from sklearn.linear_model import LogisticRegression   # Ch.24
from sklearn.metrics import (                         # Ch.24
    confusion_matrix, classification_report,
    roc_curve, roc_auc_score
)
from sklearn.model_selection import train_test_split  # Ch.24

B.2 pandas Essentials (Ch.3, Ch.7)

Loading Data

# Load a CSV file
df = pd.read_csv("data.csv")

# Load from a URL
df = pd.read_csv("https://example.com/data.csv")

# Load an Excel file
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Load with specific options
df = pd.read_csv("data.csv",
                  na_values=["", "NA", "N/A", "Missing"],
                  parse_dates=["date_column"])

Viewing Data

df.head()              # First 5 rows
df.head(10)            # First 10 rows
df.tail()              # Last 5 rows
df.shape               # (rows, columns)
df.columns             # Column names
df.dtypes              # Data types of each column
df.info()              # Summary: types, non-null counts, memory
df.describe()          # Summary statistics for numerical columns
df.describe(include="all")  # Include categorical columns too
df["column"].value_counts()       # Frequency table for categorical
df["column"].value_counts(normalize=True)  # Proportions
df["column"].unique()             # Unique values
df["column"].nunique()            # Count of unique values

Filtering and Selecting

# Select a single column (returns a Series)
df["age"]

# Select multiple columns (returns a DataFrame)
df[["age", "income", "education"]]

# Filter rows by condition
df[df["age"] > 30]
df[df["state"] == "California"]
df[(df["age"] > 30) & (df["income"] > 50000)]  # AND
df[(df["age"] < 20) | (df["age"] > 65)]         # OR
df[df["state"].isin(["CA", "NY", "TX"])]         # In a list
df[~df["state"].isin(["CA", "NY"])]              # NOT in a list

Sorting

df.sort_values("age")                          # Ascending
df.sort_values("age", ascending=False)         # Descending
df.sort_values(["state", "age"])               # Multiple columns

Grouping and Aggregating

# Group by one variable
df.groupby("state")["income"].mean()
df.groupby("state")["income"].median()
df.groupby("state")["income"].describe()

# Group by multiple variables
df.groupby(["state", "gender"])["income"].mean()

# Multiple aggregations at once
df.groupby("state")["income"].agg(["mean", "median", "std", "count"])

# Custom aggregation
df.groupby("state").agg(
    avg_income=("income", "mean"),
    median_income=("income", "median"),
    count=("income", "size")
)

Data Cleaning (Ch.7)

# Check for missing values
df.isna().sum()                    # Count NAs per column
df.isna().mean()                   # Proportion of NAs per column
df.isna().sum().sum()              # Total NAs in the entire DataFrame

# Handle missing values (each method returns a new object; assign the result to keep it)
df.dropna()                        # Drop all rows with any NA
df.dropna(subset=["age", "income"])  # Drop only if these columns have NA
df["age"].fillna(df["age"].median())  # Fill with median
df["category"].fillna(df["category"].mode()[0])  # Fill with mode

# Check for and remove duplicates
df.duplicated().sum()              # Count duplicates
df.drop_duplicates()               # Remove exact duplicate rows
df.drop_duplicates(subset=["id"])  # Remove by specific columns

# Fix inconsistent text
df["state"] = df["state"].str.strip()    # Remove whitespace
df["state"] = df["state"].str.lower()    # Lowercase
df["state"] = df["state"].str.title()    # Title Case
df["state"] = df["state"].replace({
    "Calif.": "California",
    "CA": "California"
})

# Change data types
df["age"] = df["age"].astype(int)        # Fails if the column contains NAs
df["date"] = pd.to_datetime(df["date"])
df["category"] = df["category"].astype("category")

# Create new variables (feature engineering)
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 18, 35, 55, 100],
                         labels=["Youth", "Young Adult", "Middle", "Senior"])
df["income_quartile"] = pd.qcut(df["income"], q=4,
                                labels=["Q1", "Q2", "Q3", "Q4"])
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)

# Reshaping: wide to long
df_long = pd.melt(df,
                  id_vars=["student_id"],
                  value_vars=["exam1", "exam2", "exam3"],
                  var_name="exam",
                  value_name="score")

# Contingency table / crosstab
pd.crosstab(df["gender"], df["smoking_status"])
pd.crosstab(df["gender"], df["smoking_status"], normalize="index")  # Row %
pd.crosstab(df["gender"], df["smoking_status"], margins=True)       # With totals

B.3 Visualization Essentials (Ch.5, Ch.22, Ch.25)

Histograms

# Basic histogram
plt.hist(df["age"], bins=15, edgecolor="black", alpha=0.7)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Distribution of Age")
plt.show()

# Seaborn histogram with KDE
sns.histplot(df["age"], bins=15, kde=True)
plt.title("Distribution of Age")
plt.show()

# Side-by-side histograms for comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(df[df["group"] == "A"]["score"], ax=axes[0], bins=15)
axes[0].set_title("Group A")
sns.histplot(df[df["group"] == "B"]["score"], ax=axes[1], bins=15)
axes[1].set_title("Group B")
plt.tight_layout()
plt.show()

Bar Charts

# Frequency bar chart
df["category"].value_counts().plot(kind="bar", edgecolor="black")
plt.xlabel("Category")
plt.ylabel("Count")
plt.title("Frequency by Category")
plt.xticks(rotation=45)
plt.show()

# Seaborn bar chart (with error bars showing CI)
sns.barplot(data=df, x="group", y="score", errorbar=("ci", 95))  # ci=95 in seaborn < 0.12
plt.title("Mean Score by Group (with 95% CI)")
plt.show()

# Grouped bar chart
sns.barplot(data=df, x="department", y="salary", hue="gender")
plt.title("Mean Salary by Department and Gender")
plt.show()

Box Plots

# Basic box plot
sns.boxplot(data=df, y="income")
plt.title("Income Distribution")
plt.show()

# Grouped box plot (comparing distributions)
sns.boxplot(data=df, x="education", y="income")
plt.title("Income by Education Level")
plt.xticks(rotation=45)
plt.show()

# Horizontal box plot
sns.boxplot(data=df, x="income", y="state", orient="h")
plt.title("Income Distribution by State")
plt.show()

Scatterplots (Ch.22)

# Basic scatterplot
plt.scatter(df["hours_studied"], df["exam_score"], alpha=0.6)
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")
plt.title("Study Time vs. Exam Performance")
plt.show()

# Scatterplot with regression line
sns.regplot(data=df, x="hours_studied", y="exam_score", ci=95)
plt.title("Study Time vs. Exam Performance (with regression line)")
plt.show()

# Colored by group
sns.scatterplot(data=df, x="hours_studied", y="exam_score",
                hue="gender", style="gender")
plt.title("Study Time vs. Score by Gender")
plt.show()

# Correlation heatmap
corr_matrix = df[["age", "income", "education_years", "hours_worked"]].corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.show()

QQ-Plots (Ch.10)

# QQ-plot to assess normality
stats.probplot(df["variable"], dist="norm", plot=plt)
plt.title("QQ-Plot: Checking Normality")
plt.show()

Saving Figures

plt.savefig("my_figure.png", dpi=300, bbox_inches="tight")
plt.savefig("my_figure.pdf", bbox_inches="tight")  # Vector format

B.4 Descriptive Statistics (Ch.6)

# Individual measures
df["income"].mean()            # Mean
df["income"].median()          # Median
df["income"].mode()            # Mode (can have multiple)
df["income"].std()             # Standard deviation (sample, n-1)
df["income"].var()             # Variance (sample, n-1)
df["income"].min()             # Minimum
df["income"].max()             # Maximum
df["income"].quantile(0.25)    # First quartile (Q1)
df["income"].quantile(0.75)    # Third quartile (Q3)
df["income"].quantile(0.75) - df["income"].quantile(0.25)  # IQR
df["income"].sem()             # Standard error of the mean

# All at once
df["income"].describe()
# Returns: count, mean, std, min, 25%, 50%, 75%, max

# z-scores
from scipy.stats import zscore
df["income_z"] = zscore(df["income"], ddof=1)  # ddof=1 matches the sample SD

# Manual z-score (equivalent; pandas .std() uses ddof=1 by default)
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Five-number summary
df["income"].quantile([0, 0.25, 0.5, 0.75, 1.0])

# Skewness and kurtosis
df["income"].skew()       # Positive = right-skewed
df["income"].kurtosis()   # Excess kurtosis (0 = normal)

B.5 Probability (Ch.8-10)

Basic Probability from Data

# Proportion (empirical probability)
(df["outcome"] == "Success").mean()     # P(Success)

# Joint probability from crosstab
ct = pd.crosstab(df["A"], df["B"], normalize="all")  # Joint probabilities

# Conditional probability from crosstab
ct_row = pd.crosstab(df["A"], df["B"], normalize="index")  # P(B|A)
ct_col = pd.crosstab(df["A"], df["B"], normalize="columns")  # P(A|B)

Binomial Distribution (Ch.10)

from scipy.stats import binom

n, p = 20, 0.35
# P(X = k): probability mass function
binom.pmf(k=7, n=n, p=p)          # P(X = 7)

# P(X <= k): cumulative distribution function
binom.cdf(k=7, n=n, p=p)          # P(X <= 7)

# P(X > k)
1 - binom.cdf(k=7, n=n, p=p)     # P(X > 7)
binom.sf(k=7, n=n, p=p)           # Same thing (survival function)

# P(a <= X <= b)
binom.cdf(10, n, p) - binom.cdf(4, n, p)  # P(5 <= X <= 10)

# Mean and standard deviation
binom.mean(n=n, p=p)               # n * p
binom.std(n=n, p=p)                # sqrt(n * p * (1-p))

# Simulate binomial outcomes
np.random.binomial(n=20, p=0.35, size=1000)  # 1000 simulated counts
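The legacy np.random calls above draw from a hidden global state, so results differ on every run. NumPy's Generator API takes a seed, which makes simulations reproducible; a minimal sketch (the seed value 42 is arbitrary):

```python
import numpy as np

# A seeded Generator reproduces the same draws every run
rng = np.random.default_rng(42)
counts = rng.binomial(n=20, p=0.35, size=1000)   # 1000 simulated counts
print(counts[:5], counts.mean())
```

Creating a fresh default_rng(42) elsewhere replays exactly the same sequence, which is useful for grading and debugging.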

Normal Distribution (Ch.10)

from scipy.stats import norm

mu, sigma = 100, 15
# P(X <= x): CDF
norm.cdf(115, loc=mu, scale=sigma)           # P(X <= 115)

# P(X > x)
norm.sf(115, loc=mu, scale=sigma)            # P(X > 115)
1 - norm.cdf(115, loc=mu, scale=sigma)       # Same thing

# P(a < X < b)
norm.cdf(115, mu, sigma) - norm.cdf(85, mu, sigma)  # P(85 < X < 115)

# Inverse: find x for a given probability (percentile/quantile)
norm.ppf(0.90, loc=mu, scale=sigma)          # 90th percentile

# Standard normal (z-scores)
norm.cdf(1.96)                                # P(Z <= 1.96) = 0.975
norm.ppf(0.975)                               # z* = 1.96

# Normality test
stat, p_value = stats.shapiro(df["variable"])
# If p < 0.05, evidence against normality

B.6 Sampling and Simulation (Ch.11, Ch.18)

Simulating Sampling Distributions (Ch.11)

population = df["income"].values
sample_means = []

for _ in range(10000):
    sample = np.random.choice(population, size=30, replace=True)
    sample_means.append(sample.mean())

sample_means = np.array(sample_means)

# Plot the sampling distribution
plt.hist(sample_means, bins=50, edgecolor="black", density=True)
plt.xlabel("Sample Mean")
plt.ylabel("Density")
plt.title("Sampling Distribution of x-bar (n=30)")
plt.axvline(np.mean(sample_means), color="red", linestyle="--",
            label=f"Mean = {np.mean(sample_means):.2f}")
plt.legend()
plt.show()

# Standard error
print(f"SE (simulated): {np.std(sample_means):.4f}")
print(f"SE (formula):   {np.std(population) / np.sqrt(30):.4f}")

Bootstrap Confidence Intervals (Ch.18)

data = df["variable"].values
n_bootstrap = 10000
boot_means = []

for _ in range(n_bootstrap):
    boot_sample = np.random.choice(data, size=len(data), replace=True)
    boot_means.append(boot_sample.mean())

boot_means = np.array(boot_means)

# Percentile method confidence interval
ci_lower = np.percentile(boot_means, 2.5)
ci_upper = np.percentile(boot_means, 97.5)
print(f"95% Bootstrap CI: ({ci_lower:.3f}, {ci_upper:.3f})")

# Plot bootstrap distribution
plt.hist(boot_means, bins=50, edgecolor="black", density=True)
plt.axvline(ci_lower, color="red", linestyle="--", label=f"Lower: {ci_lower:.3f}")
plt.axvline(ci_upper, color="red", linestyle="--", label=f"Upper: {ci_upper:.3f}")
plt.title("Bootstrap Distribution of the Mean")
plt.legend()
plt.show()

Permutation Test (Ch.18)

group_a = df[df["group"] == "A"]["score"].values
group_b = df[df["group"] == "B"]["score"].values
observed_diff = group_a.mean() - group_b.mean()

combined = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_permutations = 10000
perm_diffs = []

for _ in range(n_permutations):
    np.random.shuffle(combined)
    perm_diff = combined[:n_a].mean() - combined[n_a:].mean()
    perm_diffs.append(perm_diff)

perm_diffs = np.array(perm_diffs)

# Two-tailed p-value
p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed_diff))
print(f"Observed difference: {observed_diff:.3f}")
print(f"Permutation p-value: {p_value:.4f}")

B.7 Confidence Intervals (Ch.12, Ch.14)

CI for a Mean (Ch.12)

data = df["variable"].values
n = len(data)
x_bar = data.mean()
se = data.std(ddof=1) / np.sqrt(n)    # ddof=1 for sample std
df_val = n - 1

# Using scipy
ci = stats.t.interval(confidence=0.95, df=df_val, loc=x_bar, scale=se)
print(f"95% CI for the mean: ({ci[0]:.3f}, {ci[1]:.3f})")

# Manual calculation
t_star = stats.t.ppf(0.975, df=df_val)
moe = t_star * se
print(f"95% CI: ({x_bar - moe:.3f}, {x_bar + moe:.3f})")

CI for a Proportion (Ch.14)

from statsmodels.stats.proportion import proportion_confint

x = 54      # Number of successes
n = 150     # Sample size
p_hat = x / n

# Wald interval (standard)
ci_wald = proportion_confint(x, n, alpha=0.05, method="normal")

# Wilson interval (recommended)
ci_wilson = proportion_confint(x, n, alpha=0.05, method="wilson")

# Plus-four method
ci_plus4 = proportion_confint(x + 2, n + 4, alpha=0.05, method="normal")

print(f"Wald:    ({ci_wald[0]:.4f}, {ci_wald[1]:.4f})")
print(f"Wilson:  ({ci_wilson[0]:.4f}, {ci_wilson[1]:.4f})")
print(f"Plus-4:  ({ci_plus4[0]:.4f}, {ci_plus4[1]:.4f})")

Sample Size Determination (Ch.12)

# For a mean: n = (z* * sigma / E)^2
z_star = stats.norm.ppf(0.975)   # For 95% CI
sigma_est = 15                    # Estimated population SD
E = 3                             # Desired margin of error
n_needed = (z_star * sigma_est / E) ** 2
print(f"Required sample size: {int(np.ceil(n_needed))}")

# For a proportion: n = (z* / E)^2 * p*(1-p)
p_est = 0.5     # Most conservative estimate
E = 0.03         # Desired margin of error (3 percentage points)
n_needed = (z_star / E) ** 2 * p_est * (1 - p_est)
print(f"Required sample size: {int(np.ceil(n_needed))}")

B.8 Hypothesis Tests (Ch.13-16)

One-Sample t-Test for a Mean (Ch.15)

data = df["variable"].values
mu_0 = 100  # Hypothesized population mean

t_stat, p_value = stats.ttest_1samp(data, popmean=mu_0)
print(f"t = {t_stat:.4f}, p = {p_value:.4f} (two-tailed)")

# One-tailed p-value (Ha: mu > mu_0)
p_one_tail = p_value / 2 if t_stat > 0 else 1 - p_value / 2
print(f"One-tailed p-value (greater): {p_one_tail:.4f}")
# Equivalent in scipy >= 1.6: stats.ttest_1samp(data, popmean=mu_0, alternative="greater")

One-Sample z-Test for a Proportion (Ch.14)

from statsmodels.stats.proportion import proportions_ztest

count = 54        # Number of successes
nobs = 150        # Total observations
p_0 = 0.30        # Null hypothesis proportion

# Two-tailed test
z_stat, p_value = proportions_ztest(count, nobs, value=p_0,
                                     alternative="two-sided")
print(f"z = {z_stat:.4f}, p = {p_value:.4f}")

# One-tailed (Ha: p > 0.30)
z_stat, p_value = proportions_ztest(count, nobs, value=p_0,
                                     alternative="larger")
print(f"z = {z_stat:.4f}, p = {p_value:.4f} (one-tailed)")

Two-Sample t-Test for Independent Means (Ch.16)

group_a = df[df["group"] == "A"]["score"]
group_b = df[df["group"] == "B"]["score"]

# Welch's t-test (default, does NOT assume equal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.4f}, p = {p_value:.4f} (two-tailed)")

# Pooled t-test (assumes equal variances — rarely recommended)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)

Paired t-Test (Ch.16)

before = df["score_before"]
after = df["score_after"]

t_stat, p_value = stats.ttest_rel(before, after)
print(f"Paired t = {t_stat:.4f}, p = {p_value:.4f}")

# Equivalent to one-sample t-test on the differences:
differences = after - before
t_stat2, p_value2 = stats.ttest_1samp(differences, popmean=0)

Two-Proportion z-Test (Ch.16)

from statsmodels.stats.proportion import proportions_ztest

count = np.array([45, 32])     # Successes in each group
nobs = np.array([150, 140])    # Total in each group

z_stat, p_value = proportions_ztest(count, nobs, alternative="two-sided")
print(f"z = {z_stat:.4f}, p = {p_value:.4f}")

# CI for the difference in proportions
from statsmodels.stats.proportion import confint_proportions_2indep
ci_low, ci_high = confint_proportions_2indep(
    count[0], nobs[0], count[1], nobs[1], method="wald"
)
print(f"95% CI for p1-p2: ({ci_low:.4f}, {ci_high:.4f})")

B.9 Chi-Square Tests (Ch.19)

Goodness-of-Fit Test

observed = np.array([45, 30, 25])             # Observed counts
expected = np.full(3, observed.sum() / 3)     # Equal expected counts under H0

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-square = {chi2_stat:.4f}, p = {p_value:.4f}")

Test of Independence

# From a contingency table
contingency = pd.crosstab(df["gender"], df["preference"])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
# (For 2x2 tables, Yates' continuity correction is applied by default)
print(f"Chi-square = {chi2:.4f}")
print(f"p-value = {p_value:.4f}")
print(f"df = {dof}")
print(f"\nExpected frequencies:\n{expected}")

# Cramer's V (effect size)
n = contingency.sum().sum()
k = min(contingency.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(f"Cramer's V = {cramers_v:.4f}")

B.10 ANOVA (Ch.20)

One-Way ANOVA

# Separate arrays by group
group_1 = df[df["treatment"] == "A"]["score"]
group_2 = df[df["treatment"] == "B"]["score"]
group_3 = df[df["treatment"] == "C"]["score"]

# F-test
f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")

# Eta-squared (effect size)
# SS_B = SST - SSW
grand_mean = df["score"].mean()
ss_total = ((df["score"] - grand_mean) ** 2).sum()
ss_within = sum(((g - g.mean()) ** 2).sum()
                for g in [group_1, group_2, group_3])
ss_between = ss_total - ss_within
eta_squared = ss_between / ss_total
print(f"Eta-squared = {eta_squared:.4f}")

Tukey's HSD Post-Hoc Test

from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(df["score"], df["treatment"], alpha=0.05)
print(tukey)
# Shows: group1, group2, mean_diff, p-adj, lower, upper, reject

Levene's Test (Equal Variances)

stat, p_value = stats.levene(group_1, group_2, group_3)
print(f"Levene's test: W = {stat:.4f}, p = {p_value:.4f}")
# If p < 0.05, variances are significantly different -> consider Welch's ANOVA
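Welch's ANOVA has no direct scipy function (recent statsmodels versions provide anova_oneway with use_var="unequal"), but it can be computed from the group summaries. A minimal sketch; the helper name welch_anova and the simulated groups are illustrative, not from the textbook:

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's F-test for equal means without assuming equal variances."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    w = n / variances                             # Precision weights
    grand_mean = np.sum(w * means) / np.sum(w)    # Weighted grand mean
    num = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    lam = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    f = num / (1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
    df2 = (k ** 2 - 1) / (3 * lam)                # Welch's adjusted df
    p = stats.f.sf(f, k - 1, df2)
    return f, p

rng = np.random.default_rng(0)
g1 = rng.normal(50, 5, 30)     # Groups with unequal spreads
g2 = rng.normal(53, 12, 25)
g3 = rng.normal(50, 20, 40)
f_stat, p_value = welch_anova(g1, g2, g3)
print(f"Welch F = {f_stat:.4f}, p = {p_value:.4f}")
```

With exactly two groups this reduces to Welch's t-test (F = t squared, same p-value).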

B.11 Nonparametric Tests (Ch.21)

Mann-Whitney U Test (Wilcoxon Rank-Sum)

# Alternative to two-sample t-test
u_stat, p_value = stats.mannwhitneyu(group_a, group_b,
                                      alternative="two-sided")
print(f"U = {u_stat:.4f}, p = {p_value:.4f}")

Wilcoxon Signed-Rank Test

# Alternative to paired t-test
stat, p_value = stats.wilcoxon(after - before, alternative="two-sided")
print(f"W = {stat:.4f}, p = {p_value:.4f}")

Kruskal-Wallis Test

# Alternative to one-way ANOVA
h_stat, p_value = stats.kruskal(group_1, group_2, group_3)
print(f"H = {h_stat:.4f}, p = {p_value:.4f}")

Sign Test

# Simplest paired nonparametric test
differences = after - before
n_positive = (differences > 0).sum()
n_nonzero = (differences != 0).sum()

result = stats.binomtest(n_positive, n_nonzero, p=0.5,
                          alternative="two-sided")
print(f"Sign test p-value = {result.pvalue:.4f}")

Shapiro-Wilk Test for Normality

stat, p_value = stats.shapiro(data)
print(f"Shapiro-Wilk: W = {stat:.4f}, p = {p_value:.4f}")
# Small p -> evidence against normality -> consider nonparametric

B.12 Correlation and Regression (Ch.22-23)

Pearson Correlation

r, p_value = stats.pearsonr(df["x"], df["y"])
print(f"r = {r:.4f}, p = {p_value:.4f}")

# Correlation matrix
corr_matrix = df[["x", "y", "z"]].corr()
print(corr_matrix)

Simple Linear Regression

# Quick method (scipy)
slope, intercept, r_value, p_value, std_err = stats.linregress(
    df["x"], df["y"]
)
print(f"y = {intercept:.3f} + {slope:.3f}x")
print(f"R-squared = {r_value**2:.4f}")
print(f"p-value (slope) = {p_value:.4f}")

# Full method (statsmodels)
X = sm.add_constant(df["x"])  # Adds intercept column
model = sm.OLS(df["y"], X).fit()
print(model.summary())

# Predictions
df["predicted"] = model.predict(X)
df["residual"] = df["y"] - df["predicted"]

Multiple Regression (Ch.23)

# Multiple predictors
X = df[["x1", "x2", "x3"]]
X = sm.add_constant(X)
y = df["y"]

model = sm.OLS(y, X).fit()
print(model.summary())

# Key outputs from the summary:
# model.params         -> coefficients
# model.pvalues        -> p-values for each coefficient
# model.rsquared       -> R-squared
# model.rsquared_adj   -> Adjusted R-squared
# model.f_pvalue       -> p-value for overall F-test
# model.conf_int()     -> 95% CIs for each coefficient

# Categorical predictors (dummy variables); dtype=int keeps the columns numeric
df_dummies = pd.get_dummies(df, columns=["category"], drop_first=True, dtype=int)
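With drop_first=True, get_dummies drops the first (alphabetical) level so it serves as the baseline category. A small sketch on a toy frame (the column name and values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"category": ["low", "mid", "high", "mid"]})
dummies = pd.get_dummies(toy, columns=["category"], drop_first=True, dtype=int)
print(dummies)
# "high" (first alphabetically) is the baseline; a row of all zeros means "high"
```

Coefficients on the remaining dummies are then interpreted relative to that baseline level.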

# VIF for multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns[1:]  # Exclude constant
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(1, X.shape[1])]
print(vif_data)
# VIF > 10 suggests problematic multicollinearity

# Residual diagnostics
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residuals vs. fitted
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.5)
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set_xlabel("Fitted Values")
axes[0].set_ylabel("Residuals")
axes[0].set_title("Residuals vs. Fitted")

# QQ-plot of residuals
stats.probplot(model.resid, plot=axes[1])
axes[1].set_title("QQ-Plot of Residuals")
plt.tight_layout()
plt.show()

B.13 Logistic Regression (Ch.24)

# Using statsmodels (for inference — p-values, CIs)
X = df[["x1", "x2", "x3"]]
X = sm.add_constant(X)
y = df["outcome"]  # Binary: 0 or 1

logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())

# Odds ratios
print("\nOdds Ratios:")
print(np.exp(logit_model.params))

# Odds ratio confidence intervals
print("\nOdds Ratio 95% CIs:")
print(np.exp(logit_model.conf_int()))

# Using sklearn (for prediction — confusion matrix, ROC)
X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2", "x3"]], df["outcome"],
    test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{cm}")

# Classification report (precision, recall, F1)
print(classification_report(y_test, y_pred))

# ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], "k--", label="Random (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

B.14 Power Analysis (Ch.17)

from statsmodels.stats.power import TTestIndPower, NormalIndPower

# Power for a two-sample t-test
power_analysis = TTestIndPower()

# Calculate required sample size
n = power_analysis.solve_power(
    effect_size=0.5,    # Cohen's d (medium)
    alpha=0.05,
    power=0.80,
    ratio=1.0,          # n2/n1
    alternative="two-sided"
)
print(f"Required n per group: {int(np.ceil(n))}")

# Calculate power for a given sample size
power = power_analysis.solve_power(
    effect_size=0.5,
    alpha=0.05,
    nobs1=30,
    ratio=1.0,
    alternative="two-sided"
)
print(f"Power with n=30 per group: {power:.3f}")

# Power curve
sample_sizes = np.arange(10, 200, 5)
powers = [power_analysis.solve_power(
    effect_size=0.5, alpha=0.05, nobs1=n, ratio=1.0
) for n in sample_sizes]

plt.plot(sample_sizes, powers)
plt.axhline(0.80, color="red", linestyle="--", label="80% power")
plt.xlabel("Sample Size per Group")
plt.ylabel("Power")
plt.title("Power Curve (Cohen's d = 0.5, alpha = 0.05)")
plt.legend()
plt.show()

# Effect sizes
# Cohen's d
d = (group_a.mean() - group_b.mean()) / np.sqrt(
    ((len(group_a)-1)*group_a.var() + (len(group_b)-1)*group_b.var()) /
    (len(group_a) + len(group_b) - 2)
)
print(f"Cohen's d = {d:.4f}")

# Cohen's h (for proportions)
p1, p2 = 0.38, 0.31
cohens_h = 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))
print(f"Cohen's h = {cohens_h:.4f}")

B.15 Quick Decision Guide: Which Function Do I Need?

I want to...                      Function                             Chapter
------------------------------------------------------------------------------
Load a CSV file                   pd.read_csv()                        3
View summary statistics           df.describe()                        3, 6
Count missing values              df.isna().sum()                      7
Make a histogram                  sns.histplot() or plt.hist()         5
Make a box plot                   sns.boxplot()                        6
Make a scatterplot                sns.scatterplot() or plt.scatter()   5, 22
Check normality (visual)          stats.probplot()                     10
Check normality (test)            stats.shapiro()                      10
Simulate sampling distribution    np.random.choice() in a loop         11
Compute a CI for a mean           stats.t.interval()                   12
Compute a CI for a proportion     proportion_confint()                 14
One-sample t-test                 stats.ttest_1samp()                  15
One-sample z-test for proportion  proportions_ztest()                  14
Two-sample t-test (independent)   stats.ttest_ind(equal_var=False)     16
Paired t-test                     stats.ttest_rel()                    16
Two-proportion z-test             proportions_ztest() (two counts)     16
Chi-square goodness-of-fit        stats.chisquare()                    19
Chi-square test of independence   stats.chi2_contingency()             19
One-way ANOVA                     stats.f_oneway()                     20
Post-hoc pairwise comparisons     pairwise_tukeyhsd()                  20
Mann-Whitney U test               stats.mannwhitneyu()                 21
Wilcoxon signed-rank test         stats.wilcoxon()                     21
Kruskal-Wallis test               stats.kruskal()                      21
Pearson correlation               stats.pearsonr()                     22
Simple linear regression          sm.OLS() or stats.linregress()       22
Multiple regression               sm.OLS() with multiple predictors    23
Logistic regression (inference)   sm.Logit()                           24
Logistic regression (prediction)  LogisticRegression() (sklearn)       24
Power analysis                    TTestIndPower()                      17
Bootstrap CI                      np.random.choice(replace=True) loop  18
Permutation test                  Shuffle + loop                       18