Further Reading: Linear Regression — Your First Predictive Model
Linear regression is one of the oldest and most studied techniques in all of statistics. Whether you want to deepen the mathematical foundations, explore advanced regression techniques, or see how regression is used in real-world research, these resources will take you further.
Tier 1: Verified Sources
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning (Springer, 2nd edition, 2021). Chapter 3 provides an excellent treatment of linear regression, including simple and multiple regression, potential problems (nonlinearity, multicollinearity, outliers), and the comparison between regression for prediction and regression for inference. Free PDF available from the authors' website. This is the natural next step if you want more mathematical rigor than our chapter provides.
Jake VanderPlas, Python Data Science Handbook (O'Reilly, 2016). Chapter 5 covers linear regression with scikit-learn, including feature engineering, regularization, and polynomial regression. The code examples directly complement our approach and are available as Jupyter notebooks.
Wes McKinney, Python for Data Analysis (O'Reilly, 3rd edition, 2022). While McKinney's focus is on data manipulation with pandas, his chapters on statistical modeling show how to use statsmodels — an alternative Python library that provides more detailed regression output (p-values, confidence intervals, diagnostic tests) than scikit-learn. If you need the statistical inference side of regression, statsmodels complements scikit-learn well.
Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference (Springer, 2004). Chapter 13 provides a rigorous mathematical treatment of linear regression from a statistical perspective. More technical than our chapter, but excellent for understanding the probabilistic foundations — why least squares works, what the assumptions are, and when they break down.
Francis Galton, "Regression towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute 15 (1886): 246-263. The original paper that gave "regression" its name. Galton observed that children of tall parents tended to be shorter than their parents (and vice versa) — they "regressed" toward the mean. This paper is historically fascinating and surprisingly readable. It's a window into how one of the most important ideas in statistics was born from a study of human height.
scikit-learn documentation: Linear Models. The official scikit-learn documentation includes user guides on LinearRegression, Ridge, Lasso, and ElasticNet, with clear explanations of when to use each. The examples section includes practical demonstrations with real datasets.
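A quick sketch of how the three main estimators from that documentation compare in practice — the data below is synthetic, with one deliberately useless feature, and the `alpha` values are arbitrary illustration choices, not recommendations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: three features, the third has no effect on y
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.0]) + 4.0 + rng.normal(0, 0.5, 100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can set coefficients exactly to zero

print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)
```

The useless third feature is where the estimators differ: plain least squares assigns it a small nonzero coefficient, ridge shrinks it, and lasso tends to eliminate it outright — which is why the documentation recommends lasso when you suspect only some features matter.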
Tier 2: Attributed Resources
StatQuest with Josh Starmer, "Linear Regression" (YouTube series). Starmer's multi-video series on linear regression is among the clearest visual explanations available. He covers simple regression, multiple regression, R-squared, and residual analysis with characteristic enthusiasm and clarity. Search "StatQuest linear regression."
3Blue1Brown, "Linear Algebra" series (YouTube). If you want to understand why least squares works — the geometric interpretation of regression as projection in vector space — Grant Sanderson's linear algebra series provides the mathematical foundation. Start with "Vectors" and "Linear transformations" for the prerequisites.
Anscombe's Quartet. A famous set of four datasets with identical regression statistics (same mean, variance, correlation, and regression line) but completely different scatter plots. It's the ultimate argument for always visualizing data before trusting summary statistics. Search "Anscombe's quartet" — Wikipedia has a clear explanation with the original plots.
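You can verify the quartet's identical statistics yourself in a few lines. The numbers below are Anscombe's original published data; the well-known result is that all four fits agree to two decimal places:

```python
import numpy as np

# Anscombe's four datasets (1973). Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8]*7 + [19] + [8]*3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    slope, intercept = np.polyfit(x, y, 1)     # least-squares line
    r = np.corrcoef(x, y)[0, 1]                # Pearson correlation
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
# Each dataset yields slope ≈ 0.50, intercept ≈ 3.00, r ≈ 0.82 —
# yet scatter plots reveal a line, a curve, an outlier, and a vertical cluster.
```

Identical fitted lines, four entirely different stories — the summary statistics alone cannot distinguish them, which is exactly Anscombe's point.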
Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, 2006). For the reader who wants to use regression seriously for causal inference and social science research. Gelman and Hill are unusually clear about the assumptions underlying regression and when those assumptions are violated.
Kaggle's "House Prices: Advanced Regression Techniques" competition. A beginner-friendly competition that applies exactly the concepts from this chapter (and more) to a real housing dataset. If you want hands-on practice with regression beyond our exercises, this is an excellent starting point.
Recommended Next Steps
- If you want deeper mathematical understanding: Read James et al., Chapter 3. The treatment of coefficient confidence intervals, F-tests, and the assumptions of regression will deepen your understanding significantly.
- If you want more Python practice: Work through the Kaggle House Prices competition. It's a real dataset with messy features, missing values, and nonlinear relationships — everything we discussed in this chapter, applied in practice.
- If you're interested in the history: Galton's 1886 paper is short and fascinating. For a broader history, Stephen Stigler's The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press, 1986) traces the development of regression through Galton, Pearson, and Fisher.
- If you want to see regression used for causal inference: Read Angrist and Pischke's Mastering 'Metrics (recommended in Chapter 24's further reading) — they show how regression is used in economics to estimate causal effects, with careful attention to confounding and identification.
- If you want more statistical detail: Use Python's statsmodels library, which provides p-values, confidence intervals, and diagnostic tests that scikit-learn omits. The official statsmodels documentation includes a regression tutorial.
- If Anscombe's Quartet intrigued you: Look up "Datasaurus Dozen" — a modern extension that creates 12 datasets with identical summary statistics but wildly different scatter plot shapes, including one shaped like a dinosaur. It's a compelling argument for the importance of visualization.
- If you're ready to move on: Chapter 27 introduces logistic regression — adapting the linear regression framework to predict categories instead of numbers. The mathematics changes (from a line to a curve), but the workflow stays the same.
A Final Thought
Linear regression is over 200 years old. Gauss and Legendre developed the method of least squares in the early 1800s. And yet it remains one of the most widely used techniques in data science, medicine, economics, engineering, and social science.
Why does such an old, simple method endure? Because it's interpretable, it's fast, it works surprisingly well for many problems, and — perhaps most importantly — it's a foundation that more complex methods build upon. Regularized regression, polynomial regression, generalized linear models, and even neural networks can be understood as extensions of linear regression.
You've just learned one of the most powerful and durable tools in all of quantitative reasoning. It will serve you well.