Appendix A: Statistical Foundations

This appendix provides a quick reference for the statistical concepts used throughout the textbook.


Descriptive Statistics

Measures of Central Tendency

Mean (Average):

μ = Σx / n
  • Sum all values, divide by count
  • Sensitive to outliers

Median:
  • Middle value when sorted
  • Robust to outliers

Mode:
  • Most frequent value
  • Useful for categorical data
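
As a minimal sketch of the three measures above, the snippet below uses Python's standard `statistics` module on a made-up list of game scores. The data are hypothetical and were chosen so that one outlier is present.

```python
import statistics

# Hypothetical points scored by one team over seven games (illustrative data only).
points = [17, 24, 24, 27, 20, 31, 52]

mean_points = statistics.mean(points)      # sum of values divided by count
median_points = statistics.median(points)  # middle value of the sorted list
mode_points = statistics.mode(points)      # most frequent value

print(f"mean={mean_points:.2f} median={median_points} mode={mode_points}")
```

The single 52-point game pulls the mean above the median, which illustrates why the mean is described as sensitive to outliers.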

Measures of Spread

Variance:

σ² = Σ(x - μ)² / n

Standard Deviation:

σ = √(σ²)

Coefficient of Variation:

CV = σ / μ
  • Standardized measure of spread
  • Useful for comparing variability across different scales

Range:

Range = max - min

Interquartile Range (IQR):

IQR = Q3 - Q1
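
The spread measures above can be sketched with the standard library as follows. The data are invented, the population forms (dividing by n) are used to match the formulas in this section, and the quartiles rely on the default method of `statistics.quantiles`, which is only one of several common quartile conventions.

```python
import statistics

# Illustrative data, not taken from the textbook.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.fmean(values)             # mean
variance = statistics.pvariance(values)   # population variance: Σ(x - μ)² / n
sigma = statistics.pstdev(values)         # population standard deviation
cv = sigma / mu                           # coefficient of variation
value_range = max(values) - min(values)   # max - min

# Quartile cut points; other software may use a different interpolation rule.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(mu, variance, sigma, cv, value_range, iqr)
```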

Probability Distributions

Normal Distribution

f(x) = (1 / (σ√(2π))) × e^(-(x-μ)² / (2σ²))

Key Properties:
  • About 68% of values fall within 1 standard deviation of the mean
  • About 95% fall within 2 standard deviations
  • About 99.7% fall within 3 standard deviations
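
The three percentages can be checked numerically. This sketch uses `statistics.NormalDist` from the standard library; the helper name `share_within` is an illustrative choice, not something defined elsewhere in the textbook.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```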

Z-Score

z = (x - μ) / σ
  • Standardized distance from mean
  • Allows comparison across distributions
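
A short sketch of how z-scores put values from different scales on a common footing. All numbers are hypothetical and serve only to show the arithmetic.

```python
def z_score(x, mu, sigma):
    """Standardized distance of x from the mean, in standard deviations."""
    return (x - mu) / sigma

# Hypothetical season totals with hypothetical position averages and spreads.
qb_z = z_score(4500, mu=3800, sigma=500)   # passing yards
rb_z = z_score(1300, mu=900, sigma=250)    # rushing yards

print(qb_z, rb_z)
```

Under these made-up numbers the running back's season is further above his position's average (in standard deviation units) than the quarterback's, even though the raw yardage is much lower.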

Percentiles

Percentile rank = (values below x / total values) × 100
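
The percentile rank formula translates directly into code. The function name and the sample data below are illustrative assumptions.

```python
def percentile_rank(values, x):
    """Percentage of values strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

# Hypothetical receiving yards for ten games.
yards = [55, 62, 70, 70, 81, 95, 110, 124, 130, 142]

print(percentile_rank(yards, 95))  # five of ten values are below 95
```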

Regression Analysis

Simple Linear Regression

y = β₀ + β₁x + ε

Coefficients:

β₁ = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
β₀ = ȳ - β₁x̄

R-Squared (Coefficient of Determination)

R² = 1 - (SS_residual / SS_total)
  • Proportion of variance explained
  • 0 to 1 scale (higher = better fit)
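
The coefficient and R² formulas above can be sketched in plain Python. The function names and the five data points are assumptions made for illustration. In practice a library routine would normally be used.

```python
def fit_line(xs, ys):
    """Least-squares intercept (beta0) and slope (beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def r_squared(xs, ys, beta0, beta1):
    """1 - SS_residual / SS_total for the fitted line."""
    y_bar = sum(ys) / len(ys)
    ss_residual = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

beta0, beta1 = fit_line(xs, ys)
r2 = r_squared(xs, ys, beta0, beta1)
print(beta0, beta1, r2)
```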

Multiple Regression

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Correlation

Pearson Correlation

r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² × Σ(y - ȳ)²]

Interpretation (apply to the absolute value of r; the sign gives the direction):

| r Value | Interpretation |
|---------|----------------|
| 0.9-1.0 | Very strong |
| 0.7-0.9 | Strong |
| 0.5-0.7 | Moderate |
| 0.3-0.5 | Weak |
| 0.0-0.3 | Very weak |
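
A direct transcription of the Pearson formula, as a sketch. The function name and data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² × Σ(y - ȳ)²]."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 4))
```

For simple linear regression with one predictor, squaring r gives the R² of the fitted line.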

Correlation vs Causation

Correlation does not imply causation. Consider:
  • Confounding variables
  • Reverse causality
  • Spurious correlation


Hypothesis Testing

Null and Alternative Hypotheses

  • H₀: Null hypothesis (no effect)
  • H₁: Alternative hypothesis (effect exists)

P-Value

  • Probability of observing a result at least as extreme as the one obtained, assuming H₀ is true
  • Common threshold: p < 0.05
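
As one concrete case, the sketch below computes a two-sided p-value for a test statistic that is assumed to follow a standard normal distribution under H₀ (a z-test). Other tests use other reference distributions; the function name is an illustrative choice.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. both tails combined."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(two_sided_p_value(1.96), 3))  # close to the 0.05 threshold
print(round(two_sided_p_value(0.5), 3))   # far from significant
```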

Type I and Type II Errors

| Decision | H₀ True | H₀ False |
|----------|---------|----------|
| Reject H₀ | Type I error (α) | Correct |
| Fail to reject H₀ | Correct | Type II error (β) |

Confidence Intervals

CI = x̄ ± z* × (σ / √n)

where z* is the critical value from the standard normal distribution. A 95% CI uses z* = 1.96.
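
A minimal sketch of the interval calculation, assuming σ is known (or well estimated) and the normal approximation is reasonable. The function name and the input numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, level=0.95):
    """Normal-based confidence interval for a mean: x̄ ± z* × σ / √n."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    margin = z_star * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical: 24 points per game on average, σ = 10, over a 17-game season.
low, high = z_confidence_interval(24.0, 10.0, 17)
print(round(low, 2), round(high, 2))
```

Even with a full season of games, the interval spans roughly ten points, which is a useful reminder of how noisy per-game averages are.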


Sample Size and Power

Standard Error

SE = σ / √n

Sample Size Calculation

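For detecting a standardized effect size d between two equal-sized groups with power (1-β):

n = 2 × [(z_α + z_β) / d]²

Here n is the sample size per group, and z_α and z_β are the standard normal critical values for the significance level and the desired power. For a two-sided test, use the critical value for α/2.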
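
The formula can be evaluated as follows. This sketch assumes a two-sided test (so the critical value is taken at α/2), treats n as the size of each of two equal groups, and rounds up to a whole observation. The function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_group(d, alpha=0.05, power=0.80):
    """n = 2 × [(z_α + z_β) / d]², rounded up, for a two-sided two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(sample_size_per_group(0.5))  # medium effect
print(sample_size_per_group(0.2))  # small effect needs far more data
```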

Power Analysis

Power = 1 - β = P(reject H₀ | H₁ true)
  • Typically aim for at least 80% power
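
Rearranging the sample-size formula n = 2 × [(z_α + z_β) / d]² for z_β gives an approximate power for a given n. The sketch below assumes a two-sided test with two equal groups and ignores the negligible probability of rejecting in the wrong direction. The function name is an illustrative choice.

```python
import math
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power: Φ(d × √(n / 2) - z_α), from inverting the sample-size formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * math.sqrt(n_per_group / 2) - z_alpha)

print(round(approx_power(0.5, 63), 2))  # roughly the 80% target
print(round(approx_power(0.5, 17), 2))  # one 17-game season per group is underpowered
```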

Bayesian Basics

Bayes' Theorem

P(A|B) = P(B|A) × P(A) / P(B)

Components:
  • P(A|B): Posterior probability
  • P(B|A): Likelihood
  • P(A): Prior probability
  • P(B): Evidence
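
A worked sketch of Bayes' theorem for a yes/no hypothesis, where the evidence term P(B) is expanded with the law of total probability. The scenario and all probabilities are invented for illustration.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """P(A|B) = P(B|A) × P(A) / P(B), with P(B) = P(B|A)P(A) + P(B|not A)P(not A)."""
    evidence = likelihood * prior + likelihood_if_false * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical: A = "rookie becomes a long-term starter", B = "strong first season".
p = posterior(prior=0.10, likelihood=0.70, likelihood_if_false=0.20)
print(round(p, 2))
```

With these made-up inputs the strong season raises the probability from 10% to 28%: a meaningful update, but the low base rate still dominates.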

Prior and Posterior

  • Prior: Belief before seeing data
  • Posterior: Updated belief after data
  • As data accumulate, the posterior typically becomes more concentrated and depends less on the prior

Time Series Concepts

Exponential Weighted Moving Average

EWMA_t = α × x_t + (1-α) × EWMA_{t-1}
  • α = smoothing factor (between 0 and 1)
  • Higher α = more weight on recent observations
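
The recursion can be sketched in a few lines. Seeding the average with the first observation is an assumption of this sketch; other initializations (such as a long-run mean) are also common.

```python
def ewma(values, alpha):
    """EWMA_t = α × x_t + (1 - α) × EWMA_{t-1}, seeded with the first value."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [values[0]]  # assumption: start from the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(ewma([10, 20, 30], alpha=0.5))  # [10, 15.0, 22.5]
```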

Autocorrelation

  • Correlation of a series with a lagged version of itself
  • Helps detect trends, seasonality, and other repeating patterns
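
One common sample estimate of lag-k autocorrelation divides the lagged cross-products by the total sum of squares. This is a sketch of that estimator; the function name and the test series are illustrative.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (lag 0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    total = sum((x - mean) ** 2 for x in series)
    lagged = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return lagged / total

# A strictly alternating series: negative at lag 1, positive at lag 2.
alternating = [1, -1, 1, -1, 1, -1]
print(autocorrelation(alternating, 1), autocorrelation(alternating, 2))
```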

Regression to the Mean

Extreme observations tend to be followed by more moderate ones:

Regressed value = mean + r × (observed - mean)

where r is the reliability coefficient (0 = pure noise, 1 = perfectly repeatable).
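
A minimal sketch of the shrinkage formula. The completion percentages and the reliability value are hypothetical.

```python
def regress_to_mean(observed, mean, r):
    """Regressed value = mean + r × (observed - mean)."""
    return mean + r * (observed - mean)

# Hypothetical: 70% completion rate observed, league mean 64%, reliability 0.5.
estimate = regress_to_mean(observed=0.70, mean=0.64, r=0.5)
print(round(estimate, 3))
```

With r = 0.5 the estimate lands halfway between the observation and the mean. With r = 1 the observation is kept as is, and with r = 0 it is replaced by the mean.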


Sports-Specific Statistical Considerations

Small Sample Sizes

  • Each NFL team plays 17 regular-season games
  • Single-game samples are very small
  • Use multi-season data when possible

Selection Bias

  • Outcomes are observed only for players who made it
  • Beware of survivorship bias

Variance vs Skill

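In small samples, separate signal from noise by shrinking the observed value toward the average:

Estimated true skill = Average + r × (Observed - Average)

where r is the reliability coefficient, as in Regression to the Mean above.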

Pythagorean Expectation

Win% = Points^n / (Points^n + Points_Allowed^n)
  • NFL: n ≈ 2.37
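
The formula is easy to evaluate directly. The sketch below uses the exponent 2.37 given above; the point totals are hypothetical.

```python
def pythagorean_win_pct(points_for, points_against, exponent=2.37):
    """Win% = Points^n / (Points^n + Points_Allowed^n)."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

# Hypothetical season: 450 points scored, 350 allowed.
win_pct = pythagorean_win_pct(450, 350)
print(round(win_pct, 3), round(win_pct * 17, 1))  # expected win rate and wins over 17 games
```

A team that scores exactly as many points as it allows comes out at 0.5, as expected.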

Common Pitfalls

  1. Overfitting: Model fits training data too closely
  2. P-Hacking: Testing until significance found
  3. Multiple Comparisons: Inflated false positive rate
  4. Ignoring Base Rates: Priors matter
  5. Correlation ≠ Causation: Always consider alternatives
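
To make the multiple-comparisons pitfall concrete, the sketch below computes the chance of at least one false positive across m tests, under the simplifying assumption that the tests are independent and every null hypothesis is true.

```python
def family_wise_error_rate(m, alpha=0.05):
    """P(at least one false positive) = 1 - (1 - α)^m for m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20):
    print(m, round(family_wise_error_rate(m), 3))
```

With 20 independent tests at α = 0.05, a "significant" result somewhere is more likely than not, even when no real effect exists.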

Quick Reference Formulas

| Metric | Formula |
|--------|---------|
| Mean | Σx / n |
| Variance | Σ(x-μ)² / n |
| Std Dev | √Variance |
| Z-score | (x - μ) / σ |
| Correlation | Σ(x-x̄)(y-ȳ) / √[Σ(x-x̄)²Σ(y-ȳ)²] |
| SE of Mean | σ / √n |
| 95% CI | x̄ ± 1.96 × SE |

This appendix provides the statistical foundation for NFL analytics. Refer to it when encountering statistical concepts in the main chapters.