Prerequisites
This book assumes you have completed an introductory data science course or the equivalent self-study. Here is what you should already know, organized by confidence level.
You Should Be Comfortable With
These are skills you use without looking anything up:
Python Programming
- Writing functions with parameters and return values
- Loops, list comprehensions, and dictionary comprehensions
- Basic object-oriented programming (classes, methods, __init__)
- Importing and using third-party libraries
- Reading error messages and debugging with print statements or a debugger
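As a concrete calibration point, here is a small sketch of the Python fluency described above (all names are illustrative, not from the book): a class with `__init__`, a function with a default parameter and return value, and a list comprehension.

```python
class Measurement:
    """Holds a labeled numeric reading."""

    def __init__(self, label, value):
        self.label = label
        self.value = value


def scale_values(measurements, factor=2.0):
    """Return each value multiplied by factor, via a list comprehension."""
    return [m.value * factor for m in measurements]


readings = [Measurement("a", 1.5), Measurement("b", 3.0)]
print(scale_values(readings))  # [3.0, 6.0]
```

If every line here reads as routine, you meet the bar for this list.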
pandas
- Creating and manipulating DataFrames
- Filtering rows and selecting columns
- groupby(), merge(), pivot_table()
- Reading CSV, Excel, and JSON files
- Basic data cleaning (renaming columns, changing dtypes, handling duplicates)
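The pandas operations above, sketched on a tiny in-memory DataFrame (column names are made up for this example, not taken from the book's datasets):

```python
import pandas as pd

# A small illustrative dataset
df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA"],
    "age": [25, 40, 35],
    "salary": [70000, 90000, 80000],
})

# Filter rows, then aggregate a selected column with groupby()
adults = df[df["age"] > 30]
avg_by_city = adults.groupby("city")["salary"].mean()
print(avg_by_city)
```

You should be able to write this kind of filter-then-aggregate chain without consulting the documentation.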
Visualization
- Creating line plots, bar charts, scatter plots, and histograms with matplotlib
- Using seaborn for statistical plots (heatmaps, pair plots, box plots)
- Customizing labels, titles, legends, and color palettes
SQL
- SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY
- JOIN (inner, left, right)
- Aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- Subqueries (basic)
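The SQL level assumed here, run against an in-memory SQLite database so the example is self-contained (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# Inner join plus an aggregate with GROUP BY and ORDER BY
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Ada', 25.0), ('Grace', 7.5)]
```

If the JOIN, GROUP BY, and aggregate in this query are all familiar, you meet the SQL bar.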
You Should Understand Conceptually
These are ideas you can explain, even if the formulas are fuzzy:
Statistics
- Mean, median, mode, standard deviation
- Correlation (positive, negative, none)
- Normal distribution (bell curve, 68-95-99.7 rule)
- Hypothesis testing (null hypothesis, p-value — the general idea)
- Confidence intervals (the general idea)
Machine Learning
- The difference between supervised and unsupervised learning
- Linear regression (fit a line, minimize error)
- Logistic regression (predict a probability, use a threshold)
- Train/test split (why you need one)
- Overfitting (what it is, why it is bad)
Probability
- Probability as a number between 0 and 1
- Independent vs. dependent events
- Conditional probability (the general idea)
You Do NOT Need to Know
This book teaches these from scratch:
- Regularization (Ridge, Lasso, Elastic Net)
- Decision trees, random forests, or gradient boosting
- Cross-validation (beyond basic train/test split)
- Feature engineering or feature selection
- Advanced SQL (window functions, CTEs)
- Any deployment or MLOps concepts
- SHAP or model interpretation techniques
- Bayesian statistics
- Matrix algebra or calculus (Chapter 4 covers what you need)
- Docker, FastAPI, or any web framework
Self-Assessment
If you are unsure whether you are ready, try this quick check:
- Can you write a Python function that takes a list of numbers and returns the mean and standard deviation, without looking anything up?
- Can you use pandas to load a CSV, filter rows where age > 30, group by city, and compute the average salary per city?
- Can you explain why you should not evaluate a model on the same data you trained it on?
- Can you write a SQL query that joins a customers table with an orders table and returns the total order amount per customer?
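For reference, one possible answer to the first check, written with only the standard library. This is a sketch, not the book's reference solution, and it uses the population standard deviation; a sample standard deviation would divide by n - 1 instead.

```python
import math

def mean_and_std(numbers):
    """Return (mean, population standard deviation) of a list of numbers."""
    n = len(numbers)
    mean = sum(numbers) / n
    variance = sum((x - mean) ** 2 for x in numbers) / n
    return mean, math.sqrt(variance)

m, s = mean_and_std([2, 4, 4, 4, 5, 5, 7, 9])
print(m, s)  # 5.0 2.0
```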
If you answered "yes" to all four, you are ready. If you answered "no" to one or two, you can probably start — but expect to reference introductory materials occasionally. If you answered "no" to three or more, consider completing the DataField.Dev Introduction to Data Science textbook first.
Recommended Setup
Before Chapter 1, have the following installed:
- Python 3.10 or later
- Jupyter Lab or VS Code with the Jupyter extension
- A package manager (conda recommended, pip works)
- Git
See Appendix D: Environment Setup for step-by-step instructions and a requirements.txt for all dependencies.