Prerequisites
This book assumes you have completed an introductory data science course or the equivalent self-study. Here is what you should already know, organized by confidence level.
You Should Be Comfortable With
These are skills you use without looking anything up:
Python Programming
- Writing functions with parameters and return values
- Loops, list comprehensions, and dictionary comprehensions
- Basic object-oriented programming (classes, methods, __init__)
- Importing and using third-party libraries
- Reading error messages and debugging with print statements or a debugger
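As a concrete calibration point, here is a small sketch of the Python fluency described above (all names are illustrative, not from the book): a class with `__init__`, a function with a default parameter and return value, and a list comprehension.

```python
class Measurement:
    """Holds a labeled numeric reading."""

    def __init__(self, label, value):
        self.label = label
        self.value = value


def scale_values(measurements, factor=2.0):
    """Return each value multiplied by factor, via a list comprehension."""
    return [m.value * factor for m in measurements]


readings = [Measurement("a", 1.5), Measurement("b", 3.0)]
print(scale_values(readings))  # [3.0, 6.0]
```

If every line here reads as routine, you meet the bar for this list.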
pandas
- Creating and manipulating DataFrames
- Filtering rows and selecting columns
- groupby(), merge(), pivot_table()
- Reading CSV, Excel, and JSON files
- Basic data cleaning (renaming columns, changing dtypes, handling duplicates)
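The pandas operations above, sketched on a tiny in-memory DataFrame (column names are made up for this example, not taken from the book's datasets):

```python
import pandas as pd

# A small illustrative dataset
df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA"],
    "age": [25, 40, 35],
    "salary": [70000, 90000, 80000],
})

# Filter rows, then aggregate a selected column with groupby()
adults = df[df["age"] > 30]
avg_by_city = adults.groupby("city")["salary"].mean()
print(avg_by_city)
```

You should be able to write this kind of filter-then-aggregate chain without consulting the documentation.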
Visualization
- Creating line plots, bar charts, scatter plots, and histograms with matplotlib
- Using seaborn for statistical plots (heatmaps, pair plots, box plots)
- Customizing labels, titles, legends, and color palettes
SQL
- SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY
- JOIN (inner, left, right)
- Aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- Subqueries (basic)
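The SQL level assumed here, run against an in-memory SQLite database so the example is self-contained (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# Inner join plus an aggregate with GROUP BY and ORDER BY
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Ada', 25.0), ('Grace', 7.5)]
```

If the JOIN, GROUP BY, and aggregate in this query are all familiar, you meet the SQL bar.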
You Should Understand Conceptually
These are ideas you can explain, even if the formulas are fuzzy:
Statistics
- Mean, median, mode, standard deviation
- Correlation (positive, negative, none)
- Normal distribution (bell curve, 68-95-99.7 rule)
- Hypothesis testing (null hypothesis, p-value — the general idea)
- Confidence intervals (the general idea)
Machine Learning
- The difference between supervised and unsupervised learning
- Linear regression (fit a line, minimize error)
- Logistic regression (predict a probability, use a threshold)
- Train/test split (why you need one)
- Overfitting (what it is, why it is bad)
Probability
- Probability as a number between 0 and 1
- Independent vs. dependent events
- Conditional probability (the general idea)
You Do NOT Need to Know
This book teaches these from scratch:
- Regularization (Ridge, Lasso, Elastic Net)
- Decision trees, random forests, or gradient boosting
- Cross-validation (beyond basic train/test split)
- Feature engineering or feature selection
- Advanced SQL (window functions, CTEs)
- Any deployment or MLOps concepts
- SHAP or model interpretation techniques
- Bayesian statistics
- Matrix algebra or calculus (Chapter 4 covers what you need)
- Docker, FastAPI, or any web framework
Self-Assessment
If you are unsure whether you are ready, try this quick check:
- Can you write a Python function that takes a list of numbers and returns the mean and standard deviation, without looking anything up?
- Can you use pandas to load a CSV, filter rows where age > 30, group by city, and compute the average salary per city?
- Can you explain why you should not evaluate a model on the same data you trained it on?
- Can you write a SQL query that joins a customers table with an orders table and returns the total order amount per customer?
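For reference, one possible answer to the first check, written with only the standard library. This is a sketch, not the book's reference solution, and it uses the population standard deviation; a sample standard deviation would divide by n - 1 instead.

```python
import math

def mean_and_std(numbers):
    """Return (mean, population standard deviation) of a list of numbers."""
    n = len(numbers)
    mean = sum(numbers) / n
    variance = sum((x - mean) ** 2 for x in numbers) / n
    return mean, math.sqrt(variance)

m, s = mean_and_std([2, 4, 4, 4, 5, 5, 7, 9])
print(m, s)  # 5.0 2.0
```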
If you answered "yes" to all four, you are ready. If you answered "no" to one or two, you can probably start — but expect to reference introductory materials occasionally. If you answered "no" to three or more, consider completing the DataField.Dev Introduction to Data Science textbook first.
Recommended Setup
Before Chapter 1, have the following installed:
- Python 3.10 or later
- Jupyter Lab or VS Code with the Jupyter extension
- A package manager (conda recommended, pip works)
- Git
See Appendix D: Environment Setup for step-by-step instructions and a requirements.txt for all dependencies.