Appendix A: Statistical Foundations
This appendix provides a quick reference for the statistical concepts used throughout the textbook.
Descriptive Statistics
Measures of Central Tendency
Mean (Average):
μ = Σx / n
- Sum all values, divide by count
- Sensitive to outliers
Median:
- Middle value when sorted
- Robust to outliers
Mode:
- Most frequent value
- Useful for categorical data
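
As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The rushing-yard values are made up for illustration; the single outlier pulls the mean well above the median.

```python
import statistics

# Hypothetical single-game rushing yards; the last value is an outlier.
rushing_yards = [62, 71, 58, 71, 66, 204]

mean_yards = statistics.mean(rushing_yards)      # sum of values / count
median_yards = statistics.median(rushing_yards)  # middle of the sorted values
mode_yards = statistics.mode(rushing_yards)      # most frequent value

print(mean_yards, median_yards, mode_yards)
```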
Measures of Spread
Variance:
σ² = Σ(x - μ)² / n
- Population form; when estimating from a sample, divide by n - 1 instead of n
Standard Deviation:
σ = √(σ²)
Coefficient of Variation:
CV = σ / μ
- Standardized measure of spread
- Useful for comparing variability across different scales
Range:
Range = max - min
Interquartile Range (IQR):
IQR = Q3 - Q1
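
The spread measures above can be sketched with the standard library as follows. The points-per-game values are hypothetical, and the quartile method is an assumption: different tools use slightly different quartile conventions, so IQR values may not match across software.

```python
import statistics

# Hypothetical points scored per game by one team.
points = [17, 24, 31, 10, 27, 20, 35, 14]

mu = statistics.fmean(points)
variance = statistics.pvariance(points)  # population form: divides by n
std_dev = statistics.pstdev(points)      # square root of the population variance
cv = std_dev / mu                        # unitless, comparable across scales
value_range = max(points) - min(points)

# Assumption: the "inclusive" quartile convention; other conventions exist.
q1, q2, q3 = statistics.quantiles(points, n=4, method="inclusive")
iqr = q3 - q1

print(variance, std_dev, cv, value_range, iqr)
```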
Probability Distributions
Normal Distribution
f(x) = (1 / (σ√(2π))) × e^(-(x-μ)² / (2σ²))
Key Properties:
- 68% within 1 standard deviation
- 95% within 2 standard deviations
- 99.7% within 3 standard deviations
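
The density and the 68/95/99.7 rule can be checked numerically. This is a small sketch using only the `math` module; the function names `normal_pdf` and `prob_within` are illustrative, not part of any library.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density f(x) for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a normal variable, via the error function."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The printed probabilities are the exact values that the rounded 68%, 95%, and 99.7% figures approximate.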
Z-Score
z = (x - μ) / σ
- Standardized distance from mean
- Allows comparison across distributions
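
For example, z-scores make it possible to compare a quarterback's passing total with a running back's rushing total even though the raw scales differ. The league means and standard deviations below are hypothetical placeholders, not real figures.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

# Hypothetical league-wide means and standard deviations.
qb_z = z_score(x=4500, mu=3800, sigma=500)  # passing yards
rb_z = z_score(x=1300, mu=900, sigma=250)   # rushing yards

print(qb_z, rb_z)
```

Under these assumed numbers the running back is further above the positional average, despite the much smaller raw yardage.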
Percentiles
Percentile rank = (values below x / total values) × 100
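
A direct translation of this formula, as a sketch with hypothetical player grades. It counts only values strictly below x; some definitions also give half credit to ties, which this version does not.

```python
def percentile_rank(values, x):
    """Percentage of values strictly below x."""
    below = sum(1 for v in values if v < x)
    return below * 100 / len(values)

# Hypothetical player grades.
grades = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
rank = percentile_rank(grades, 85)
print(rank)
```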
Regression Analysis
Simple Linear Regression
y = β₀ + β₁x + ε
Coefficients:
β₁ = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
β₀ = ȳ - β₁x̄
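
The coefficient formulas can be applied directly without a statistics library. The sketch below uses a small made-up data set; `fit_simple_linear` is an illustrative helper name.

```python
def fit_simple_linear(xs, ys):
    """Return (beta0, beta1) for the least-squares line y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
beta0, beta1 = fit_simple_linear(xs, ys)
print(beta0, beta1)
```

For this data the sums are Σ(x - x̄)(y - ȳ) = 6 and Σ(x - x̄)² = 10, so β₁ = 0.6 and β₀ = 4 - 0.6 × 3 = 2.2.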
R-Squared (Coefficient of Determination)
R² = 1 - (SS_residual / SS_total)
- Proportion of variance explained
- 0 to 1 scale (higher = better fit)
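
A short sketch of the R² calculation. The observed values and predictions are hypothetical; the predictions are assumed to come from a fitted line y = 2.2 + 0.6x evaluated at x = 1 through 5.

```python
def r_squared(ys, predictions):
    """R^2 = 1 - SS_residual / SS_total."""
    y_bar = sum(ys) / len(ys)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Hypothetical observations and model predictions.
ys = [2, 4, 5, 4, 5]
predictions = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(ys, predictions)
print(r2)
```

Here SS_residual = 2.4 and SS_total = 6, so the line explains about 60% of the variance in y.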
Multiple Regression
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Correlation
Pearson Correlation
r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² × Σ(y - ȳ)²]
Interpretation (applies to the absolute value of r; the sign gives the direction):
| r Value | Interpretation |
|---------|----------------|
| 0.9-1.0 | Very strong |
| 0.7-0.9 | Strong |
| 0.5-0.7 | Moderate |
| 0.3-0.5 | Weak |
| 0.0-0.3 | Very weak |
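
The Pearson formula can be computed by hand as in the sketch below. The paired values are hypothetical, and `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations, e.g. pass attempts and completions.
attempts = [10, 20, 30, 40, 50]
completions = [12, 25, 29, 41, 48]
r = pearson_r(attempts, completions)
print(round(r, 3))
```

A series correlated with itself gives r = 1, and with its negation r = -1, which is a quick sanity check for any implementation.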
Correlation vs Causation
Correlation does not imply causation. Consider:
- Confounding variables
- Reverse causality
- Spurious correlation
Hypothesis Testing
Null and Alternative Hypotheses
- H₀: Null hypothesis (no effect)
- H₁: Alternative hypothesis (effect exists)
P-Value
- Probability of observing a result at least as extreme as the one obtained, assuming H₀ is true
- Common threshold: p < 0.05
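
As an illustration, a two-sided p-value for a z test statistic can be computed from the standard normal distribution. This sketch assumes the test statistic is standard normal under H₀; other tests (t, chi-squared) need their own reference distributions.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) when Z is standard normal under the null hypothesis."""
    return math.erfc(abs(z) / math.sqrt(2.0))

p = two_sided_p_value(1.96)
print(round(p, 4))
```

A z statistic of 1.96 sits right at the conventional 0.05 threshold, which is why 1.96 also appears in 95% confidence intervals.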
Type I and Type II Errors
| | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I (α) | Correct |
| Fail to Reject | Correct | Type II (β) |
Confidence Intervals
CI = x̄ ± z* × (σ / √n)
95% CI uses z* = 1.96
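
A minimal sketch of this interval, assuming σ is known (or the sample is large enough for the z approximation). The sample values are hypothetical.

```python
import math

def z_confidence_interval(sample_mean, sigma, n, z_star=1.96):
    """Return (low, high) for the interval sample_mean +/- z_star * sigma / sqrt(n)."""
    margin = z_star * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical: 16 games, mean 24.0 points, standard deviation 10.0.
low, high = z_confidence_interval(sample_mean=24.0, sigma=10.0, n=16)
print(low, high)
```

With these numbers the margin is 1.96 × 10 / 4 = 4.9, giving an interval of roughly 19.1 to 28.9 points.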
Sample Size and Power
Standard Error
SE = σ / √n
Sample Size Calculation
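For detecting a standardized effect size d with power (1-β) when comparing two group means (n is the sample size per group):
n = 2 × [(z_α + z_β) / d]²
- z_α: critical value for the significance level (use the α/2 value for a two-sided test)
- z_β: standard normal quantile corresponding to the desired power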
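
The formula can be evaluated with `statistics.NormalDist` from the standard library. This sketch makes two assumptions not stated in the formula itself: the test is two-sided (so the critical value uses α/2), and n is the size of each of two equal groups.

```python
import math
from statistics import NormalDist

def sample_size_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n to detect standardized effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # assumption: two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, sample_size_per_group(d))
```

Because n scales with 1/d², halving the effect size roughly quadruples the required sample, which is why small effects are hard to detect in short seasons.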
Power Analysis
Power = 1 - β = P(reject H₀ | H₁ true)
- Typically aim for 80% power minimum
Bayesian Basics
Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
Components:
- P(A|B): Posterior probability
- P(B|A): Likelihood
- P(A): Prior probability
- P(B): Evidence
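
A worked sketch of the theorem for a yes/no event. All probabilities are invented for illustration: A is "the rookie becomes a starter" and B is "the rookie posts a strong combine score". The evidence term is expanded as P(B) = P(B|A)P(A) + P(B|not A)P(not A).

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(A|B) given P(A), P(B|A), and P(B|not A)."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical inputs: P(A) = 0.10, P(B|A) = 0.80, P(B|not A) = 0.20.
p_starter = posterior(prior=0.10, likelihood=0.80, likelihood_if_not=0.20)
print(round(p_starter, 4))
```

Even with a fairly informative signal, the posterior stays well below 50% because the prior (base rate) is low.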
Prior and Posterior
- Prior: Belief before seeing data
- Posterior: Updated belief after data
- As data accumulates, the posterior typically becomes more concentrated and depends less on the prior
Time Series Concepts
Exponentially Weighted Moving Average
EWMA_t = α × x_t + (1-α) × EWMA_{t-1}
- α = smoothing factor (between 0 and 1)
- Higher α = more weight on recent observations
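
The recursion is short enough to write out directly. One detail the formula leaves open is the starting value; this sketch assumes the series is seeded with the first observation, which is a common but not universal choice.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average of a sequence."""
    smoothed = [values[0]]  # assumption: seed with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(ewma([10, 20, 30], alpha=0.5))  # [10, 15.0, 22.5]
```

With α = 1 the output simply reproduces the input; with α near 0 it barely moves from the seed value.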
Autocorrelation
- Correlation of series with lagged version of itself
- Important for detecting patterns such as trends, streaks, or seasonality
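
One common estimator of lag-k autocorrelation is sketched below. It normalizes by the full-series sum of squares, which is a conventional choice; other normalizations exist and give slightly different values for short series.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (lag >= 0)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom

# Hypothetical alternating series: strongly negative at lag 1, positive at lag 2.
alternating = [1, -1, 1, -1, 1, -1]
print(autocorrelation(alternating, 1), autocorrelation(alternating, 2))
```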
Regression to the Mean
Extreme observations tend to be followed by more moderate ones:
Regressed value = mean + r × (observed - mean)
Where r = reliability coefficient
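
A one-line sketch of this adjustment. The completion percentages and the reliability value are hypothetical.

```python
def regress_to_mean(observed, mean, r):
    """Shrink an observed value toward the mean using reliability r (0 to 1)."""
    return mean + r * (observed - mean)

# Hypothetical: 70% observed completion rate, 64% league mean, reliability 0.5.
expected_next = regress_to_mean(observed=70.0, mean=64.0, r=0.5)
print(expected_next)  # 67.0
```

With r = 1 the observed value is taken at face value; with r = 0 the best estimate is simply the mean.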
Sports-Specific Statistical Considerations
Small Sample Sizes
- The NFL regular season has 17 games per team
- Single-game samples are very small
- Use multi-season data when possible
Selection Bias
- We observe outcomes only for players who made it
- Beware survivorship bias
Variance vs Skill
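In small samples, separate signal from noise by shrinking the observed value toward the average:
Estimated true skill = Average + r × (Observed - Average)
Where r = reliability coefficient (between 0 and 1; lower in smaller samples)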
Pythagorean Expectation
Win% = Points^n / (Points^n + Points_Allowed^n)
- NFL: n ≈ 2.37
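
The formula translates directly into code. The season point totals below are hypothetical, and the default exponent is the NFL value quoted above.

```python
def pythagorean_win_pct(points_for, points_against, exponent=2.37):
    """Expected winning percentage from points scored and points allowed."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

# Hypothetical season: 450 points scored, 350 allowed, 17 games.
win_pct = pythagorean_win_pct(450, 350)
expected_wins = win_pct * 17
print(round(win_pct, 3), round(expected_wins, 1))
```

Comparing expected wins with actual wins is a common way to flag teams whose record may have been helped or hurt by close-game luck.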
Common Pitfalls
- Overfitting: Model fits training data too closely
- P-Hacking: Testing until significance found
- Multiple Comparisons: Inflated false positive rate
- Ignoring Base Rates: Priors matter
- Correlation ≠ Causation: Always consider alternatives
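
To make the multiple-comparisons pitfall concrete: if m independent tests are each run at significance level α and every null hypothesis is true, the chance of at least one false positive is 1 - (1 - α)^m. The sketch below assumes independence between tests, which real analyses often violate.

```python
def familywise_error_rate(alpha, num_tests):
    """P(at least one false positive) across independent tests at level alpha."""
    return 1 - (1 - alpha) ** num_tests

for m in (1, 5, 20):
    print(m, round(familywise_error_rate(0.05, m), 3))
```

At 20 tests the rate is already above 60%, so a single "significant" result from a large batch of comparisons deserves skepticism.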
Quick Reference Formulas
| Metric | Formula |
|---|---|
| Mean | Σx / n |
| Variance | Σ(x-μ)² / n |
| Std Dev | √Variance |
| Z-score | (x - μ) / σ |
| Correlation | Σ(x-x̄)(y-ȳ) / √[Σ(x-x̄)²Σ(y-ȳ)²] |
| SE of Mean | σ / √n |
| 95% CI | x̄ ± 1.96 × SE |
This appendix provides the statistical foundation for NFL analytics. Refer to it when encountering statistical concepts in the main chapters.