The installer recommends against this, but for beginners, checking this box can make things easier. If you're unsure, leave it unchecked (the Anaconda Prompt will still work). - **"Register Anaconda3 as my default Python"** — Check this box. It means when programs look for Python on your computer, t → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
Data: Country-level dataset, aggregated to continent means - Aesthetics: x = continent, y = mean life expectancy, color = continent - Geom: Bar - Scale: y-axis linear starting at 0; categorical x-axis; distinct colors per continent - Coordinates: Cartesian - Faceting: None → Chapter 14 Exercises: The Grammar of Graphics
(a) Daily work:
A data analyst's typical day involves pulling data from databases using SQL, creating dashboards and reports, computing business metrics, and presenting findings to stakeholders. The work is primarily descriptive — answering "what happened?" and "how are we doing?" - A data scientist's typical day i → Chapter 36 Quiz: Reflection and Career Planning
(a) Ethical issues:
**Harm to users:** The algorithm may be promoting content that causes anxiety, outrage, political polarization, and decreased wellbeing. Optimizing for engagement is not the same as optimizing for user value — users can be "engaged" by content that makes them angry or upset. - **Societal harm:** Amp → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
(b)
Data: Country-level dataset, one row per country - Aesthetics: x = GDP per capita, y = CO2 per capita, size = population - Geom: Point (circle) - Scale: x and y linear (or log for GDP); size proportional to population - Coordinates: Cartesian - Faceting: None → Chapter 14 Exercises: The Grammar of Graphics
(b) Most relevant skills from this book:
For data analyst: pandas data wrangling (Chapters 7-12), visualization (Chapters 14-18), descriptive statistics (Chapter 19), and communication skills (Chapter 31). The biggest gap is SQL, which the book didn't cover in depth. - For data scientist: everything the analyst needs, plus hypothesis testi → Chapter 36 Quiz: Reflection and Career Planning
(c)
Data: Country-level vaccination rates - Aesthetics: x = vaccination rate (binned), y = count - Geom: Bar (histogram bars) - Scale: x-axis linear; y-axis linear (count) - Coordinates: Cartesian - Faceting: By WHO region (6 panels) → Chapter 14 Exercises: The Grammar of Graphics
(c) Alternative approaches:
Optimize for "time well spent" rather than "time spent" — measure user satisfaction, not just engagement - Include content diversity metrics in the optimization function to prevent filter bubbles - Down-weight content that is flagged as divisive or misleading - Add friction to sharing (e.g., "Did yo → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
1. Data team (Monday):
**Format:** Jupyter notebook shared via repository, discussed in meeting - **Level of detail:** Full — methodology, code, statistical tests, limitations - **Include:** Reproducible code, data sources, confidence intervals, alternative models tested - **Leave out:** Policy recommendations (that is fo → Chapter 31 Quiz: Communicating Results: Reports, Presentations, and the Art of the Data Story
Benefits: Students who are correctly identified as at-risk and receive helpful advising. The university (higher retention rates, better outcomes). - Potential harms: Students falsely identified as at-risk may feel stigmatized, treated as less capable, or resentful of mandatory requirements. The "at- → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
12 slices
far too many for a pie chart. The smallest slices are indistinguishable. 3. **Gradient fills** add visual complexity without encoding data. 4. **Overlapping labels** make several categories unreadable. 5. **Drop shadow and textured background** are chartjunk. 6. **Pie chart is wrong for this data.** → Case Study 1: Redesigning a Government Report for Accessibility
14
multiplication before addition. 2. `(2 + 3) * 4` = `5 * 4` = **20** --- parentheses override precedence. 3. `10 - 6 / 2` = `10 - 3.0` = **7.0** --- division before subtraction; note the result is a float because `/` always returns a float. 4. `2 ** 3 + 1` = `8 + 1` = **9** --- exponentiation before ad → Answers to Selected Exercises
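The worked answers above can be checked directly in the interpreter. This is a minimal sketch that uses only the three expressions fully visible in this entry; the variable names are illustrative.

```python
# Each expression mirrors one of the worked answers above.
a = (2 + 3) * 4   # parentheses are evaluated first
b = 10 - 6 / 2    # division runs before subtraction; `/` yields a float
c = 2 ** 3 + 1    # exponentiation runs before addition

print(a, b, c)
print(type(b).__name__)
```

Printing the type of `b` confirms that the result of `/` is a float even when the division is exact.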
Using family income and zip code as predictors means the model will disproportionately flag low-income and minority students as "at-risk." These students may face real barriers, but the model may be capturing socioeconomic disadvantage rather than individual academic risk. Students from wealthy fami → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
2. School principal (Wednesday):
**Format:** 8-10 slide presentation with one-page executive summary handout - **Level of detail:** Moderate — key findings with supporting charts, recommendations - **Include:** Specific, actionable recommendations (e.g., "redirect resources from X to Y") with estimated impact - **Leave out:** Code, → Chapter 31 Quiz: Communicating Results: Reports, Presentations, and the Art of the Data Story
[ ] If the data has natural groups (patients, users, companies), all observations from the same group are in the same split - [ ] No duplicate or near-duplicate rows exist across train and test → Case Study 2: The Data Leakage Disaster — A Cautionary Tale
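The two checklist items above can be turned into quick checks. The sketch below assumes pandas and two hypothetical DataFrames, `train` and `test`, that share a hypothetical grouping column named `patient_id`; it is one possible way to look for leakage, not the case study's own code.

```python
import pandas as pd

# Hypothetical splits; in practice these come from your own train/test split.
train = pd.DataFrame({"patient_id": [1, 1, 2], "x": [0.1, 0.2, 0.3]})
test = pd.DataFrame({"patient_id": [2, 3], "x": [0.3, 0.4]})

# Check 1: no group (here, patient) should appear in both splits.
shared_groups = set(train["patient_id"]) & set(test["patient_id"])

# Check 2: no identical row should appear in both splits.
# With no `on` argument, merge joins on all columns the frames have in common.
duplicates = train.merge(test, how="inner")

print("Groups in both splits:", shared_groups)
print("Rows in both splits:", len(duplicates))
```

In this toy data both checks fail on purpose: patient 2 is in both splits, and one row is duplicated. Near-duplicates need a fuzzier comparison than an exact merge.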
3. Parents (Thursday):
**Format:** 4-5 slides with large, simple charts, plus a one-page handout to take home - **Level of detail:** High-level — big-picture trends and what they mean for children - **Include:** What the school is doing in response and how parents can help - **Leave out:** All statistical terminology, com → Chapter 31 Quiz: Communicating Results: Reports, Presentations, and the Art of the Data Story
False positives: Students wrongly labeled as at-risk are subjected to mandatory advising they do not need, potentially experiencing it as patronizing or stigmatizing. - False negatives: Students who are actually at risk but do not match the model's profile (e.g., wealthy students with personal probl → Chapter 32 Quiz: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
[ ] The model's AUC or accuracy is in a plausible range for the problem domain - [ ] Performance doesn't drop significantly when moving from cross-validation to production - [ ] If performance seems "too good to be true," investigate → Case Study 2: The Data Leakage Disaster — A Cautionary Tale
4. Network and apply.
Attend local meetups or virtual data science communities - Contribute to open-source data analysis projects - Apply broadly --- "entry-level" job postings often list aspirational requirements, not strict minimums - Be prepared to discuss your portfolio projects in detail → Appendix E: Frequently Asked Questions
how to make your code take different paths based on data values (like categorizing vaccination rates as "low," "medium," or "high") - **`for` loops** — how to repeat an operation for every item in a collection (like computing a statistic for each country) - **Functions** — how to package reusable lo → Chapter 3: Python Fundamentals I — Variables, Data Types, and Expressions
centers the diverging colormap at zero and caps at 30 points. Counties won by more than 30 points all appear the same saturated color. - **`size="total_votes"`** — larger dots for more populous counties, so the visual weight reflects the number of votes, not just geographic area. - **Hidden axis lab → Case Study 2: Election Night Live — Building an Interactive Results Tracker
`read_csv`
One-line CSV loading with automatic type detection and `NaN` for missing values. Replaces `csv.DictReader` + loop + manual type conversion. → Key Takeaways: Introduction to pandas
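A small sketch of the behavior described above, assuming pandas is installed. The CSV text and its column names are hypothetical, and `io.StringIO` stands in for a file on disk so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; the empty field stands in for a missing value.
csv_text = """country,year,coverage_pct
A,2020,91.5
B,2021,
C,2022,78.0
"""

# A file path can be passed in place of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)  # inferred column types
print(df)         # the empty field shows up as NaN
```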
one or two sentences explaining what question the notebook addresses, what data it uses, or what project it belongs to. This helps readers (and future-you) decide whether this is the notebook they're looking for. → Chapter 2 Quiz: Setting Up Your Toolkit
A detailed project specification
what the finished product should contain 3. **A milestone checklist** — how to break the work into manageable pieces 4. **A rubric** — how the project will be evaluated (or how you can evaluate yourself) 5. **Examples of what "done" looks like** — concrete descriptions of successful capstone project → Chapter 35: Capstone Project: A Complete Data Science Investigation
A few more common patterns:
**Sum of squared values:** $\sum x_i^2$ means square each value, then add them up. - **Sum of squared differences from the mean:** $\sum (x_i - \bar{x})^2$. This is the numerator in the variance formula. It measures how spread out the data is. - **Double summation:** $\sum_{i=1}^{m} \sum_{j=1}^{n} a → Appendix A: Math Foundations Refresher
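The first two patterns above can be worked through numerically. This sketch uses a hypothetical three-value dataset and plain Python; the truncated double-summation pattern is left out.

```python
x = [2.0, 4.0, 6.0]          # hypothetical data values
x_bar = sum(x) / len(x)      # the mean, written as x-bar in the formulas

# Sum of squared values: square each value, then add them up.
sum_sq = sum(v ** 2 for v in x)

# Sum of squared differences from the mean (the variance numerator).
sum_sq_diff = sum((v - x_bar) ** 2 for v in x)

print(x_bar, sum_sq, sum_sq_diff)
```

Here the mean is 4.0, the squares are 4, 16, and 36, and the squared differences are 4, 0, and 4.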
A local meetup
find one on Meetup.com, Eventbrite, or your local tech community calendar 2. **An online community** — Reddit, Discord, Slack, or a forum relevant to your interests 3. **A conference or virtual event** — upcoming data science conferences, hackathons, or workshops → Chapter 36 Exercises: Planning Your Future in Data Science
Abstract
a brief summary (like the executive summary, but with more technical detail) 2. **Introduction** — background, research question, and significance 3. **Data and Methods** — data sources, cleaning steps, analytical methods, tools used 4. **Results** — findings presented with tables, charts, and stati → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
Use `.reset_index()` to flatten a multi-index into regular columns - Use `.sort_values()` to rank groups - Use `.unstack()` as an alternative to `pivot_table` for reshaping grouped results → Key Takeaways: Reshaping and Transforming Data
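The three methods above might be combined as follows. This is a sketch on a hypothetical DataFrame (the column names `region`, `year`, and `coverage_pct` are assumptions for illustration), assuming pandas is installed.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "year": [2020, 2021, 2020, 2021],
    "coverage_pct": [80.0, 84.0, 70.0, 75.0],
})

# Grouping by two keys produces a Series with a two-level index.
grouped = df.groupby(["region", "year"])["coverage_pct"].mean()

flat = grouped.reset_index()                                # index levels become columns
ranked = flat.sort_values("coverage_pct", ascending=False)  # highest mean first
wide = grouped.unstack("year")                              # regions as rows, years as columns

print(ranked)
print(wide)
```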
Use objective performance metrics (sales numbers, code commits, project completion rates) instead of subjective reviews. - Blind the training data by removing demographic information AND potential proxies (names, photos, university names). - Audit the model's predictions across demographic groups be → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
when no specific region is selected, all data is shown. This is the default that provides the big picture. - **Side-by-side charts** — the map and scatter share the same row, giving geographic and statistical views simultaneously. - **A vertical reference line** on the trend chart shows which year i → Chapter 17: Interactive Visualization — plotly, Dashboard Thinking
An overfit weather model
one that tries to predict based on dozens of local, short-lived atmospheric features — might have low bias (it captures real phenomena) but high variance (its predictions are unstable, sensitive to small measurement errors). On days when its inputs are accurate, it's brilliant. On days when a sensor → Case Study 1: The Weather Forecaster's Dilemma — Simple vs. Complex Models
Anaconda
a free distribution that bundles Python, Jupyter, and hundreds of data science libraries into one installer - **Python** — the programming language we'll use throughout this book - **Jupyter Notebook** — the interactive environment where we write and run code alongside explanatory text → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
comparing means across 3+ groups. 2. **Chi-square test** — both variables are categorical. 3. **One-sample t-test** — comparing a sample mean to a known value. 4. **Paired t-test** — the same subjects measured twice (before and after), so observations are not independent. 5. **Two-sample t-test** — → Chapter 23 Exercises: Hypothesis Testing
Anscombe's Quartet
that have nearly identical summary statistics. Each dataset has the same mean of x, the same mean of y, the same variance of x, the same variance of y, the same correlation between x and y, and the same linear regression line. If you only looked at the numbers, you would conclude these four datasets → Chapter 14: The Grammar of Graphics — Why Visualization Matters and How to Think About Charts
Answer the question they asked
they want *recommendations for reducing churn*, not just a model. End with actionable business insights, not just accuracy metrics. 2. **Communicate clearly** — write narrative Markdown throughout; include a summary at the top so the reviewer can get the gist in 60 seconds. 3. **Show judgment, not j → Chapter 34 Quiz: Building Your Portfolio
Ask
Section 6.1 (formulating questions about the WHO data) 2. **Acquire** — Section 6.2 (loading the CSV file) 3. **Clean** — Section 6.5 (identifying data quality issues — though we didn't fully clean the data yet) 4. **Explore** — Sections 6.3 and 6.4 (inspecting structure and computing statistics) 5. → Chapter 6 Exercises: Your First Data Analysis
`year` is an integer, `coverage_pct` is a float, text columns are objects 3. **Missing values become `NaN`** — not empty strings that crash your math. `NaN` (Not a Number) is pandas's sentinel for missing data. It participates safely in computations: `NaN + 5` is `NaN`, not an error. → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
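The point about `NaN` taking part in arithmetic can be seen on a small Series. The values below are hypothetical, and the sketch assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

s = pd.Series([80.0, np.nan, 95.0])  # hypothetical coverage values

shifted = s + 5   # the missing entry stays NaN; no exception is raised
print(shifted)

# Reductions such as mean() skip NaN by default.
print(s.mean())
```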
Avoid regex when:
A string method solves the problem just as well (simpler is better) - The pattern would be unreadable (more than ~30 characters — consider breaking it up) - You're trying to parse a structured format like HTML or JSON (use a proper parser) - You're trying to match natural language meaning, not struc → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
Population: All 12,000 batteries produced that day. - Sample: The 50 tested batteries. - Parameter: True average lifetime of all 12,000 batteries. - Statistic: Average lifetime of the 50 tested batteries. → Answers to Selected Exercises
Before deployment:
[ ] Can I explain why the model makes specific predictions? - [ ] Have I documented the model's limitations and failure modes? - [ ] Is there a process for people to challenge or appeal the model's decisions? - [ ] Have I considered how the system could be misused? → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Before you begin:
[ ] Have I clearly defined the problem I am solving, and is that problem worth solving? - [ ] Who will be affected by the results? Have I considered impacts on marginalized or vulnerable groups? - [ ] Does the data I am using represent the population I am making claims about? - [ ] Was the data coll → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Population: All patients (present and future) who could take this medication. - Sample: The 80 patients studied. - Parameter: True average blood pressure effect of the medication. - Statistic: Average effect observed in the 80 patients. → Answers to Selected Exercises
boolean expression
the test. Python evaluates it and gets either `True` or `False`. - The colon `:` at the end of the `if` line is required. Forget it, and Python will complain. - The next line is **indented** by four spaces. This indentation isn't decorative — it tells Python that this line *belongs to* the `if` bloc → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
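The pieces described above (the boolean expression, the colon, and the indented block) look like this in a minimal example. The variable names and the threshold of 70 are hypothetical.

```python
coverage_pct = 62.0   # hypothetical value
label = "ok"

if coverage_pct < 70:   # boolean expression, then the required colon
    label = "low"       # indented four spaces: this line belongs to the if block

print(label)            # not indented: runs whether or not the test was True
```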
Boolean indexing
Filtering rows using a True/False mask (`df[df["col"] > value]`). The pandas replacement for a loop with an `if` statement. → Key Takeaways: Introduction to pandas
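A short sketch of the mask-then-filter pattern, assuming pandas is installed. The DataFrame and the threshold of 75 are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "coverage_pct": [91.5, 62.0, 78.0],
})

mask = df["coverage_pct"] > 75   # one True/False value per row
high = df[mask]                  # keeps only the rows where the mask is True

print(mask.tolist())
print(high)
```

Building the mask as a separate variable is optional, but it makes the intermediate True/False values easy to inspect.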
previously done with error-prone SUM formulas across tabs - **Comparing products** — previously done by manual tallying - **Identifying trends** — previously done by eyeballing - **Producing charts** — previously done by fighting with Excel chart formatting - **Answering ad hoc questions** — "what w → Case Study 2: From Spreadsheet Chaos to Notebook Clarity — A Business Analyst's Migration Story
Calendar features:
`day_of_week`: Monday through Sunday (categorical) - `month`: January through December (categorical) - `is_holiday`: Whether tomorrow is a federal holiday (binary) - `is_weekend`: Whether tomorrow is Saturday or Sunday (binary) → Case Study 1: End-to-End — From Raw Data to Deployed Prediction
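Three of the four calendar features above can be derived from a date alone. The sketch below uses the standard library; the function name `calendar_features` is hypothetical, and `is_holiday` is left out because it needs an external holiday calendar that the entry does not specify.

```python
from datetime import date


def calendar_features(target_day):
    """Derive calendar features for the day being predicted."""
    return {
        "day_of_week": target_day.strftime("%A"),  # e.g. "Monday"
        "month": target_day.strftime("%B"),        # e.g. "January"
        "is_weekend": target_day.weekday() >= 5,   # Monday is 0, so 5 and 6 are the weekend
    }


features = calendar_features(date(2024, 7, 6))
print(features)
```

The day and month names depend on the active locale; with the default locale they are in English.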
Capstone work session.
**Lab:** Evaluate models with multiple metrics. Build a complete pipeline. Write executive summary. Conduct ethical audit. **Capstone workshop.** - **Assignment:** Capstone project due end of week 14. Chapters 31--33 quiz. → 15-Week University Semester Syllabus
Causal
it's asking whether sitting in the front row *causes* better grades. (It probably doesn't — motivated students choose to sit up front, and motivation, not location, drives the grades. This is a classic confounding variable situation.) > 2. **Predictive** — it's asking what we'd *expect* for an unobs → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
either **code cells** (for Python) or **Markdown cells** (for formatted text). The **kernel** is the engine that executes your code. The **notebook server** runs in the background. - **Code cells:** You write Python code and run it with Shift+Enter. Jupyter displays the output immediately below the → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
`.query()` for filtering (cleaner in chains than bracket notation) - `.assign()` for new columns (returns new DataFrame, doesn't modify in place) - `.pipe(func)` for custom functions that take a DataFrame and return a DataFrame → Key Takeaways: Reshaping and Transforming Data
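The three methods above chain together naturally. This sketch assumes pandas is installed; the DataFrame, the helper `add_rank`, and the new column names are hypothetical.

```python
import pandas as pd


def add_rank(frame, column):
    """Hypothetical helper: takes a DataFrame and returns a new one with a rank column."""
    return frame.assign(rank=frame[column].rank(ascending=False).astype(int))


df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "coverage_pct": [91.5, 62.0, 78.0],
})

result = (
    df.query("coverage_pct > 70")                              # filter rows
      .assign(coverage_prop=lambda d: d["coverage_pct"] / 100)  # add a column
      .pipe(add_rank, "coverage_pct")                           # apply a custom step
)

print(result)
print(list(df.columns))  # the original DataFrame is unchanged
```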
Chapter 7: Introduction to pandas
DataFrames, Series, and the grammar of data manipulation - **Chapter 8: Cleaning Messy Data** — Professional techniques for handling the problems you spotted manually - **Chapter 9: Reshaping and Transforming** — Merging datasets, pivoting tables, grouping and aggregating - **Chapters 10-13: Working → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
Chapter introduction
What you'll learn, why it matters, and what you need to have completed first. 2. **Core content** — The main teaching material, with worked examples, code walkthroughs, and visualizations. 3. **Project checkpoint** — A task that adds to your progressive public health analysis project. 4. **Key takea → How to Use This Book
Chart Plan:
Question: How have vaccination rates changed over time for three countries with different trajectories? - Chart type: Multi-panel line chart (3 panels) - Data: Time series for three countries - Audience: Explanatory — for a policy brief → Chapter 15: matplotlib Foundations — Building Charts from the Ground Up
it predicts a category (approve, review, or deny). Decision trees can also do **regression** — predicting a continuous number, like the loan amount to offer — but we'll focus on classification in this chapter because it connects directly to the logistic regression work you did in Chapter 27. → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
a gradient palette from yellow through orange to red. matplotlib has many colormaps: `"viridis"` (default, perceptually uniform), `"Blues"`, `"coolwarm"` (diverging), etc. - **`fig.colorbar(scatter, ...)`**: Adds a color legend showing what the colors mean. - **`edgecolors="gray"`**: Adds a thin gra → Chapter 15: matplotlib Foundations — Building Charts from the Ground Up
Confusing data science with programming. Students assume they need to become expert coders before they can "do" data science. Emphasize that code is a means, not the end. - Struggling to formulate specific, answerable questions. Students propose vague questions like "What is happening with vaccinati → Teaching Notes for All 36 Chapters
temperature, or more broadly, summer weather — causes *both*. Hot weather makes people buy more ice cream. Hot weather also makes people swim more, which increases the opportunity for drowning. Ice cream and drowning are correlated not because one causes the other, but because they share a common ca → Case Study 1: Ice Cream and Drowning — The Classic Confounding Story (And Its Modern Equivalents)
Proportion to percentage: multiply by 100. So $0.73 \rightarrow 73\%$. - Percentage to proportion: divide by 100. So $85\% \rightarrow 0.85$. → Appendix A: Math Foundations Refresher
Correct: (A)
**(A)** is correct. The `.str` accessor gracefully handles missing values by propagating `NaN` (displayed as `None` or `NaN`) without raising errors. This is one of its key advantages over writing a manual loop. - **(B)** would happen if you tried to call `.lower()` directly on `None` in regular Pyt → Chapter 10 Quiz: Working with Text Data
Correct: (B)
**(A)** is too narrow — machine learning is one tool within data science, not the whole field. A project that never builds an ML model can still be data science (e.g., a descriptive analysis or a controlled experiment). - **(B)** captures the interdisciplinary nature, the focus on answering question → Chapter 1 Quiz: What Is Data Science? (And What It Isn't)
Correct: (C)
**(A)** is structured — rows, columns, numeric values. - **(B)** is structured — a relational table with a defined schema. - **(C)** is unstructured — scanned images of handwritten text have no predefined schema, no rows or columns. Extracting information requires OCR and possibly handwriting recogn → Chapter 1 Quiz: What Is Data Science? (And What It Isn't)
Correct: (D)
**(A)** works technically — `pandas.DataFrame(...)` is valid Python — but virtually nobody writes it this way. You'd have to type `pandas` in full every time. - **(B)** is the universal convention used by the pandas documentation, tutorials, books, and the overwhelming majority of data scientists. T → Chapter 7 Quiz: Introduction to pandas
`str.contains(pat, case=False)` — case-insensitive search - `str.contains(pat, na=False)` — treat NaN as False (essential for filtering) - `str.replace(old, new, regex=False)` — literal replacement (no regex interpretation) → Key Takeaways: Working with Text Data
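A sketch of the three options above on a hypothetical Series of vaccine labels, assuming pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Measles (MCV1)", "polio", np.nan, "MEASLES booster"])

# case=False ignores capitalization; na=False turns missing values into False,
# so the result is a clean boolean mask that can be used for filtering.
mask = s.str.contains("measles", case=False, na=False)

# regex=False treats the parentheses as literal characters, not a regex group.
cleaned = s.str.replace("(MCV1)", "", regex=False)

print(s[mask])
print(cleaned)
```

Without `na=False`, the mask would contain a missing value in the third position, which cannot be used directly to index the Series.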
Focuses on answering specific business questions with existing data - Primary tools: SQL, Excel, Tableau, basic Python or R - Outputs: dashboards, reports, ad-hoc analyses - Typical question: "What were our sales by region last quarter?" → Appendix E: Frequently Asked Questions
Data Engineer:
Focuses on building and maintaining the infrastructure that makes data available - Primary tools: SQL, Python, cloud platforms (AWS, GCP), Apache Spark, Airflow - Outputs: data pipelines, warehouses, ETL systems - Typical question: "How do we move 50 million rows of transaction data from production → Appendix E: Frequently Asked Questions
Data is accessible
without the data file, the notebook can't run. 2. **Cells run in order** — out-of-order execution creates hidden state that can't be replicated. 3. **Dependencies are documented** — missing libraries cause import errors. 4. **Random seeds are set** — without seeds, random processes produce different → Chapter 6 Exercises: Your First Data Analysis
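The point about random seeds can be demonstrated with the standard library. The seed value 42 below is arbitrary.

```python
import random

random.seed(42)                      # fix the starting point of the generator
first = [random.random() for _ in range(3)]

random.seed(42)                      # same seed again
second = [random.random() for _ in range(3)]

print(first == second)               # the two runs produce identical numbers
```

Libraries such as NumPy keep their own random state, so they need to be seeded separately.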
data literacy
the ability to read, interpret, and reason with data — becomes essential. Data literacy is for data what reading comprehension is for text. It's not about technical skills; it's about understanding what numbers and charts are actually *saying*. → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
Data Scientist:
Focuses on building models, conducting statistical investigations, and discovering patterns - Primary tools: Python or R, SQL, machine learning libraries - Outputs: models, statistical analyses, notebooks, research findings - Typical question: "Can we predict which customers will churn, and what dri → Appendix E: Frequently Asked Questions
light gridlines help; heavy, numerous gridlines distract. - **3D effects** — adding depth to a 2D bar chart distorts bar lengths and adds no information. - **Gradient fills** — making bars fade from dark to light adds visual complexity without encoding data. - **Background images** — a photo behind → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
It asks "what happened?" using historical data. No prediction or causal claim involved. 2. **Predictive** — It asks about a future outcome for a specific patient. The goal is forecasting, not explaining *why*. 3. **Causal** — The word "cause" is a giveaway, but even rephrased ("Did the new flow incr → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Design for balance:
Use decentralized architecture (phones exchange anonymous tokens, not identities) - Implement automatic data deletion after 14-21 days - Make participation voluntary, not mandatory - Use differential privacy for aggregate analysis - Prohibit use of contact tracing data for law enforcement - Establis → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
dictionary
a mapping from measurement names to values. > - The countries in South America? That is a **set** — a collection where uniqueness matters and order does not. > - A row of CSV data? That is a **list** — an ordered sequence of fields. > - A GPS coordinate? That is a **tuple** — a fixed pair of values → Chapter 5: Working with Data Structures: Dictionaries, Files, and Thinking in Data
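The four structures named above, side by side. All values and variable names are hypothetical.

```python
# Dictionary: a mapping from measurement names to values.
measurements = {"height_cm": 170, "weight_kg": 65}

# Set: uniqueness matters, order does not; the repeated entry is stored once.
south_america = {"Brazil", "Peru", "Chile", "Peru"}

# List: an ordered sequence of fields from one CSV row.
csv_row = ["Brazil", "2021", "84.0"]

# Tuple: a fixed pair of values.
coordinate = (-15.79, -47.88)

print(measurements["height_cm"])
print(len(south_america))
print(csv_row[0])
print(coordinate[0], coordinate[1])
```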
seeing data as a shape, not just a single number — is the mindset shift that makes everything else in statistics click. Every time someone gives you an "average," your new reflex should be: "What's the shape? What's the spread? Is the average even a good summary?" → Key Takeaways: Descriptive Statistics
distributions
the mathematical shapes that describe how probabilities are spread across outcomes. We'll meet the normal curve (the bell curve), learn why it shows up everywhere, and discover the Central Limit Theorem — the reason that your sampling variability simulation at the end of this chapter produced bell-s → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
this would reduce the dataset from 183 to 147 countries and systematically exclude low-income countries, biasing the analysis toward wealthier nations where data infrastructure is stronger. > 2. **Impute with regional medians** — this assumes countries within a WHO region have similar healthcare wor → Case Study 1: A Model Capstone: Complete Vaccination Rate Analysis
[ ] Have I checked for representation gaps in the data? Which groups are underrepresented or absent? - [ ] Could any of the variables I am using serve as proxies for protected attributes? - [ ] Have I tested my model's performance across subgroups, not just overall? - [ ] Am I optimizing → Chapter 32: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
E
Elena
Public health analyst exploring COVID vaccination rates across demographics and regions, discovering disparities and communicating findings to policymakers 2. **Marcus** — Small business owner analyzing sales data to understand seasonal patterns, customer segments, and product promotion strategy 3. → Introduction to Data Science: From Curiosity to Code
**Who benefits?** The company (reduced churn). **Who is harmed?** Users who want to cancel but cannot easily do so (financial harm, frustration, erosion of trust). - **Was there consent?** No — users did not agree to be part of this test. - **Is it transparent?** No — the design intentionally obscur → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Ethical reflection should engage genuinely:
Who is represented in your data, and who is invisible? - How could your findings be misinterpreted or misused? - What assumptions have you embedded in your analysis choices? - What responsibility do you have as the person presenting these results? → Chapter 35 Exercises: Capstone Project Milestones
Rural vaccination rates declined by an average of 11 percentage points between 2019 and 2022, compared to 3 points in urban areas, creating a growing rural-urban gap. - Among rural counties, those with community health clinics maintained rates 8 points higher than demographically similar counties wi → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
you express intent, not mechanics - **Faster to run** — pandas operates on entire columns at once using optimized C code under the hood - **Easier to read** — even someone unfamiliar with pandas can guess what `groupby("region")["coverage_pct"].mean()` does - **Safer** — pandas handles type conversi → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
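To make the contrast concrete, the sketch below computes the same per-region mean twice: once with an explicit loop and once with the `groupby` expression quoted above. The data is hypothetical, and pandas is assumed to be installed.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West"],
    "coverage_pct": [80.0, 90.0, 70.0],
})

# Loop version: track totals and counts by hand.
totals, counts = {}, {}
for region, value in zip(df["region"], df["coverage_pct"]):
    totals[region] = totals.get(region, 0.0) + value
    counts[region] = counts.get(region, 0) + 1
loop_means = {region: totals[region] / counts[region] for region in totals}

# pandas version: state the intent in one line.
pandas_means = df.groupby("region")["coverage_pct"].mean()

print(loop_means)
print(pandas_means)
```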
`credit_score`: Applicant's credit score (300-850) - `annual_income`: Self-reported annual income - `debt_to_income`: Monthly debt payments divided by monthly income - `loan_amount`: Amount requested - `employment_years`: Years at current employer - `loan_purpose`: Reason for the loan (home improvem → Case Study 1: Should We Approve the Loan? A Decision Tree for Credit Risk
labels get cut off - [ ] **Bar chart y-axis not starting at zero** -- visually misleading - [ ] **Overlapping x-axis labels** -- fix with rotation or horizontal bars - [ ] **Rainbow colors on bars that represent the same variable** -- use one color - [ ] **Title says the topic, not the finding** -- → Key Takeaways: matplotlib Foundations
Formulas are best when:
The answer needs to be exact (not approximate) - You need to compute the answer quickly (simulation takes time) - You want to understand *why* the answer is what it is (formulas reveal structure) - You need to communicate the logic to others (formulas are more transparent than code) → Chapter 20: Probability Thinking — Uncertainty, Randomness, and Why Your Intuition Lies
a command that tells Python to do something. In this case, it tells Python to display whatever is inside the parentheses. - `"Hello, world!"` is a **string** — a piece of text. The quotation marks tell Python "this is text, not code." - When you ran the cell, the notebook sent `print("Hello, world!" → Chapter 2: Setting Up Your Toolkit: Python, Jupyter, and Your First Notebook
G
GDP per capita
wealthier countries can afford both higher health spending and better vaccination infrastructure. (2) **Government effectiveness** — well-functioning governments both allocate more to health and implement programs effectively. (3) **Education levels** — more educated populations both demand more hea → Chapter 24 Quiz: Correlation, Causation, and the Danger of Confusing the Two
GDPR requirements:
Users must explicitly consent to their social media data being used for credit decisions (consent must be specific, informed, and freely given) - The company must explain the logic of the decision-making process (Article 22) - Users have the right to contest automated decisions - The data must be ad → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
a version control system that tracks every change to every file in a project, lets you go back to any previous version, and enables multiple people to work simultaneously without conflicts. Alongside git, you will learn about **virtual environments** (which capture the exact software versions your c → Chapter 33: Reproducibility and Collaboration: Git, Environments, and Working with Teams
GitHub
Your portfolio home base. Learn GitHub Pages if you want a free personal website. - **Kaggle Datasets** — For finding interesting datasets (use the datasets section, not just competitions). - **Google Dataset Search** — A search engine specifically for datasets. - **data.gov / data.gov.uk / EU Open → Further Reading: Building Your Portfolio
Good scope for a portfolio project:
Can be completed in 15-25 hours of focused work - Uses one to three data sources - Requires meaningful cleaning but not months of it - Has a clear question that can be answered with the data available - Produces three to eight polished visualizations - Fits in a single notebook with clear narrative → Chapter 34: Building Your Portfolio: Projects That Get You Hired
are the structural transformations at the heart of data wrangling. They don't change the *values* in your data; they change its *shape*. And until you're comfortable with them, you'll be stuck: staring at data that has all the information you need but isn't arranged in a way that lets you use it. → Chapter 9: Reshaping and Transforming Data — Merge, Join, Pivot, Melt, and GroupBy
COVID-19 case surveillance (millions of rows --- good for large-data practice) - BRFSS (Behavioral Risk Factor Surveillance System) --- annual survey of 400,000+ adults - WONDER (Wide-ranging ONline Data for Epidemiologic Research) --- mortality and population data → Appendix D: Data Sources Guide
Historical demand:
`demand_yesterday`: Yesterday's actual demand (MWh) - `demand_last_week`: Demand 7 days ago - `demand_last_year`: Demand on the same date last year - `avg_demand_7day`: Rolling 7-day average demand → Case Study 1: End-to-End — From Raw Data to Deployed Prediction
Honesty
[ ] Bar chart y-axes start at zero. - [ ] The time range is representative, not cherry-picked. - [ ] Dual y-axes are avoided (or clearly labeled and justified). - [ ] Area encodings are proportional to values, not to radii or heights. - [ ] Missing context (sample size, uncertainty, baseline) is pro → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
How the web works
at least enough to understand HTTP requests and responses. - **APIs** — the structured, polite way to ask a server for data. - **Web scraping** — the sometimes-necessary, sometimes-controversial alternative when no API exists. - **Ethics and legality** — because just because you *can* access data do → Chapter 13: Getting Data from the Web — APIs, Web Scraping, and Building Your Own Datasets
How to write well:
Start with the question, not the code. Your reader should understand what you're investigating in the first paragraph. - Use visualizations as anchors for the narrative. A good blog post alternates between text and charts, with each chart accompanied by interpretation. - Show your code, but not all → Chapter 34: Building Your Portfolio: Projects That Get You Hired
What question should this chart answer? 2. **Audit the encodings** — Is each visual element the best choice for its variable? 3. **Check accessibility** — Colorblind-safe? High contrast? Alt text? 4. **Check honesty** — Zero-based bars? Full time range? Fair scales? 5. **Remove chartjunk** — Can any → Key Takeaways: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Is the "express" checkout lane at the grocery store actually faster? - Does the weather affect your mood? - How has the cost of your grocery basket changed over the past year? - Do you actually sleep better on weekends, or does it just feel that way? → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
In your interests:
Does home-field advantage matter more in some sports than others? - Are sequels generally rated lower than original movies? - Has the length of popular songs changed over the past 50 years? - Do books that win literary prizes actually sell more copies? → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
the title and context are original, not referencing a textbook assignment; (2) **Decision justification** — the choices are explained with reasoning, not attributed to instructions; (3) **Awareness of consequences** — the note about Sub-Saharan Africa shows the author understands the analytical impl → Chapter 34 Exercises: Building Your Portfolio
a loop whose condition never becomes `False`: > > ```python > count = 1 > while count <= 5: > print(f"Count is {count}") > # Oops! Forgot to update count! > ``` > > This prints "Count is 1" forever (or until you interrupt it). In Jupyter, you'll see the cell keep running with a `[*]` that never turn → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
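One way to repair the loop in this entry, as a minimal sketch: update `count` inside the body so the condition eventually becomes `False`. The `lines` list is a hypothetical addition, used here only so the output can be inspected afterwards.

```python
# Corrected loop: the counter changes on every pass,
# so `count <= 5` eventually becomes False and the loop stops.
count = 1
lines = []  # hypothetical helper list, not part of the original example
while count <= 5:
    line = f"Count is {count}"
    print(line)
    lines.append(line)
    count += 1  # the update that was missing
```

After the loop finishes, `count` is 6, which is the first value that fails the condition.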
Mean = Median: roughly symmetric distribution - Mean > Median: right-skewed (pulled up by high outliers) - Mean < Median: left-skewed (pulled down by low outliers) → Key Takeaways: Your First Data Analysis
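The mean-versus-median rule above can be sketched with the standard library. The function name `describe_skew`, the `tolerance` threshold, and the sample lists are all assumptions made for illustration; real data rarely has a mean exactly equal to its median, so some tolerance is needed.

```python
from statistics import mean, median

def describe_skew(values, tolerance=0.05):
    """Compare mean and median.

    `tolerance` is a hypothetical cutoff (a fraction of the data's range)
    below which the two are treated as roughly equal.
    """
    m, med = mean(values), median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "roughly symmetric"
    return "right-skewed" if m > med else "left-skewed"

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 3, 4, 50]    # one high outlier pulls the mean up
left_tail = [-50, 6, 7, 8, 9]    # one low outlier pulls the mean down

for data in (symmetric, right_tail, left_tail):
    print(data, "->", describe_skew(data))
```

This is a quick heuristic, not a formal skewness test; a histogram is still the better check.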
IQR Fence Method:
Lower fence = Q1 - 1.5 * IQR - Upper fence = Q3 + 1.5 * IQR - Values beyond the fences are flagged as outliers - Robust — based on the median and quartiles → Key Takeaways: Descriptive Statistics
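A minimal sketch of the fence calculation, using `statistics.quantiles` from the standard library. The function name `iqr_outliers` and the sample values are hypothetical. Quartile conventions differ slightly between tools, so fences computed elsewhere (for example in pandas or NumPy) may not match these to the decimal.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR fence rule."""
    q1, _q2, q3 = quantiles(values, n=4)   # three cut points: Q1, median, Q3
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

data = [10, 12, 13, 14, 15, 16, 18, 95]    # hypothetical sample with one extreme value
lower_fence, upper_fence, flagged = iqr_outliers(data)
print(f"fences: [{lower_fence}, {upper_fence}], flagged: {flagged}")
```

Because the fences depend only on the quartiles, the extreme value 95 does not shift them the way it would shift a mean-and-standard-deviation rule.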
"I analyzed 10 years of USDA crop yield data to investigate whether organic farming productivity has been catching up to conventional farming." - "I scraped 5,000 job postings for data science positions to identify the most in-demand skills by city and company size." - "I built a model to predict wh → Chapter 34: Building Your Portfolio: Projects That Get You Hired
K
Key activities:
Install Anaconda and create your project notebook - Complete all Chapter 3 and 4 coding exercises --- these fundamentals must be solid - Write helper functions for the progressive project - If you get stuck on installation, consult Appendix C (Setup Guide) and Appendix E (FAQ) → Self-Paced Learning Guide
Key concepts from this chapter:
**Exploratory data analysis (EDA)** is the process of systematically examining a dataset to discover patterns, spot anomalies, check assumptions, and generate hypotheses. It's a conversation with your data. - **Data loading** with Python's `csv.DictReader` gives you a list of dictionaries — one per → Chapter 6: Your First Data Analysis — Loading, Exploring, and Asking Questions of Real Data
GDP per capita explained 45% of variance alone, but healthcare worker density (physicians + nurses per capita) added 12 percentage points of explanatory power - Sub-Saharan Africa showed the widest within-region variation, suggesting country-level factors dominate over regional ones - The relationsh → Chapter 35: Capstone Project: A Complete Data Science Investigation
Key function vocabulary:
**Domain:** The set of valid inputs. For $f(x) = \sqrt{x}$, the domain is $x \geq 0$ (you cannot take the square root of a negative number in the real numbers). - **Range:** The set of possible outputs. - **Monotonic:** A function that only goes up (increasing) or only goes down (decreasing), never → Appendix A: Math Foundations Refresher
Key principles:
"Publicly accessible" does not mean "freely usable" - Legal and ethical are separate questions — something can be legal but unethical - Scale matters — what's fine for 50 data points may be problematic for 50,000 - Always prefer APIs over scraping - When in doubt, slow down and investigate → Key Takeaways: Getting Data from the Web
Key rules you probably remember:
**Addition and subtraction** are performed left to right: $10 - 3 + 2 = 9$. - **Multiplication and division** are performed before addition and subtraction: $2 + 3 \times 4 = 14$, not 20. - **Parentheses** override everything: $(2 + 3) \times 4 = 20$. - **Exponents** are performed before multiplicat → Appendix A: Math Foundations Refresher
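Python follows the same precedence rules, so they can be checked directly. This sketch reuses the numbers from the entry above; the exponent example (`2 * 3 ** 2`) is an added illustration, not taken from the text.

```python
left_to_right = 10 - 3 + 2      # (10 - 3) + 2
mult_first = 2 + 3 * 4          # multiplication before addition
parens = (2 + 3) * 4            # parentheses override the default order
exponent_first = 2 * 3 ** 2     # exponent before multiplication: 2 * 9

print(left_to_right, mult_first, parens, exponent_first)
```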
KeyError
column names are case-sensitive. Fix: `df["country"]`. 2. **ValueError** — use `&` instead of `and`, and wrap conditions in parentheses. Fix: `df[(df["coverage_pct"] > 90) & (df["year"] == 2022)]`. 3. **KeyError** — multiple columns need double brackets. Fix: `df[["country", "region"]]`. 4. **Settin → Chapter 7 Exercises: Introduction to pandas
the student list is your primary dataset. You want every student, with scores where available. (b) **Inner join** — you only want complete records that exist in both systems. (c) **Outer join** — you need the full picture from both sides to identify gaps. (d) **Left join** — your customer list is pr → Chapter 9 Exercises: Reshaping and Transforming Data
government data is public by design, the purpose is public benefit, and no personal harm results. Still, check robots.txt and ToS. 2. **It depends** — competitive intelligence is common, but hourly scraping could violate ToS and strain the competitor's servers. The legality varies by jurisdiction. 3 → Chapter 13 Exercises: Getting Data from the Web
Limitations should be specific:
Not "the data may have errors" but "the WHO vaccination data relies on country self-reporting, and countries with weaker health information systems may undercount doses administered" - Not "more data would be helpful" but "individual-level vaccination data (rather than country-level aggregates) woul → Chapter 35 Exercises: Capstone Project Milestones
List
ordered and allows duplicates. 2. Country-to-code mapping: **Dictionary** --- fast lookup by name. 3. Unique vaccine manufacturers: **Set** --- automatic deduplication, order irrelevant. 4. Latitude/longitude pair: **Tuple** --- fixed, immutable pair that can serve as a dictionary key. 5. Patient re → Answers to Selected Exercises
logistic regression
a model specifically designed for classification. Despite its name (it has "regression" in it), logistic regression is a classification algorithm. It predicts the *probability* that an observation belongs to a particular category, and then uses that probability to make a classification decision. → Chapter 27: Logistic Regression and Classification — Predicting Categories
Look at the caret
it points to where Python first noticed the problem. > 3. **Check for missing quotes, parentheses, or colons.** > 4. **Check the line *above*** — sometimes the error is on the previous line, but Python doesn't notice until the next line. → Chapter 3: Python Fundamentals I — Variables, Data Types, and Expressions
Looking back:
Chapter 19 gave us the tools to describe individual variables (means, standard deviations) - Chapters 22-23 gave us tools to estimate parameters and test hypotheses about one or two variables - This chapter extends the toolkit to *relationships between variables* and introduces the critical distinct → Chapter 24: Correlation, Causation, and the Danger of Confusing the Two
Looking forward:
Chapter 25 introduces formal modeling — using one variable to *predict* another - Chapters 26-28 build regression and classification models that quantify relationships while controlling for confounders - Chapter 32 revisits the ethical implications of causal claims → Chapter 24: Correlation, Causation, and the Danger of Confusing the Two
Focuses on deploying models into production systems - Primary tools: Python, Docker, Kubernetes, ML frameworks (TensorFlow, PyTorch) - Outputs: production-grade ML services, APIs - Typical question: "How do we serve this recommendation model to 10 million users with 50ms response time?" → Appendix E: Frequently Asked Questions
MAR (Missing at Random)
the missingness is related to equipment availability, not to the actual temperature or humidity values. The data is probably missing on days when a sensor malfunctioned or the station was offline, which is likely unrelated to the weather conditions themselves. → Chapter 6 Quiz: Your First Data Analysis
recall of 80% vs. 50%. It catches 80% of customers who will churn. 2. **Model A** — precision of 60% vs. 45%. Of the customers it flags, 60% actually churn (vs. 45% for Model B). 3. Suppose 100 customers, 20 will churn. Model A: catches 10 churners (recall 50%), flags 10/0.6 ≈ 17 total. Cost = 17 * → Chapter 27 Exercises: Logistic Regression and Classification — Predicting Categories
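The arithmetic in this answer (churners caught, then total customers flagged) can be sketched as follows. The function name `flagged_counts` is hypothetical, and the sketch stops before the cost calculation because the per-customer costs are not shown in this excerpt.

```python
def flagged_counts(n_churners, recall, precision):
    """Return (churners caught, total customers flagged)."""
    caught = recall * n_churners      # true positives = recall x actual positives
    flagged = caught / precision      # precision = caught / flagged
    return caught, flagged

n_churners = 20                        # 20 of 100 customers will churn

a_caught, a_flagged = flagged_counts(n_churners, recall=0.50, precision=0.60)
b_caught, b_flagged = flagged_counts(n_churners, recall=0.80, precision=0.45)

print(f"Model A: catches {a_caught:.0f}, flags about {a_flagged:.0f}")
print(f"Model B: catches {b_caught:.0f}, flags about {b_flagged:.0f}")
```

Model B catches more churners but flags roughly twice as many customers to do it, which is the trade-off the cost comparison then has to price.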
Start an intensive SQL course (e.g., chapters 1-10 of *Practical SQL*). Complete one chapter per day. - Begin learning Tableau (free trial + public gallery for practice). - Build a second portfolio project that demonstrates SQL and visualization skills — perhaps analyzing a dataset entirely in SQL, → Chapter 36 Quiz: Reflection and Career Planning
Complete SQL course. Take a practice SQL assessment to benchmark skills. - Build a third portfolio project in a domain relevant to companies you're targeting. - Apply to 15-20 positions per week, tailoring cover letters to each. - Practice behavioral interview answers using STAR-D framework with pro → Chapter 36 Quiz: Reflection and Career Planning
Based on interview feedback, address any consistent skill gaps. - Write a blog post about one of your portfolio projects to increase visibility. - Continue applications (15-20/week). Reach out to at least three people for informational interviews. - Practice SQL interview questions (window functions → Chapter 36 Quiz: Reflection and Career Planning
Continue applications and interviews. By now you should have had several phone screens and hopefully some technical interviews. - Refine your portfolio based on what generates the most interview interest. - Follow up with all networking contacts. - If no offers yet, assess: is the issue the resume/p → Chapter 36 Quiz: Reflection and Career Planning
Self-assessment: What did I learn? What gaps remain? - Portfolio update: What new projects can I add? - Career progress: What applications, interviews, or connections have I made? - Next six months: What comes after this roadmap? → Chapter 36 Exercises: Planning Your Future in Data Science
Monthly (1--2 hours):
Read one in-depth article or blog post about a technique you have not used. - Try a Kaggle competition or work on a personal project. → Appendix E: Frequently Asked Questions
tests recall and conceptual understanding - **True/False** — tests common misconceptions - **Short answer** — tests your ability to explain concepts in your own words - **Code reading** — shows you code and asks "what does this produce?" or "what's wrong with this?" → How to Use This Book
1,000 rapid requests will likely trigger a 429 (Too Many Requests) response, and after that, `response.json()` might return an error message without a `name` key, causing a `KeyError`. 2. **No error handling** — if any request fails (network error, timeout, server error), the script crashes. 3. **No → Chapter 13 Quiz: Getting Data from the Web
the bell curve. You've seen it mentioned a hundred times. Now you'll understand *why* it's everywhere. The answer involves one of the most beautiful results in mathematics: the **Central Limit Theorem**. And the way we'll discover it is by running a simulation that will make you say "wait, THAT happ → Chapter 21: Distributions and the Normal Curve — The Shape That Shows Up Everywhere
Not data science
This is **data engineering**. Building pipelines is essential infrastructure, but it is not itself asking or answering a question. (It supports stage 2, data collection, but it is not *doing* data science.) 2. **Not data science** — This is **mathematical statistics / statistical theory**. There is → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
the `Figure` and `Axes` approach — rather than the simpler but less flexible `pyplot` shortcut. The object-oriented interface gives you full control over every element of your chart, and it's what you'll need for professional-quality work. → Chapter 15: matplotlib Foundations — Building Charts from the Ground Up
the same countries are measured before and after. 2. Differences: [6, 3, 7, 2, 8, 1, 7, 2, 7, 6]. Mean = 4.9, SD = 2.60. 3. SE = 2.60/√10 ≈ 0.823. t = 4.9/0.823 ≈ 5.96 (from unrounded values). With df = 9, p < 0.001. 4. The increase is statistically significant. However, we can't conclude the campaign *caused* the increase → Chapter 23 Exercises: Hypothesis Testing
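As a check on the hand calculation, the paired-test statistics can be recomputed from the listed differences with the standard library. This is a sketch of the test statistic only; it assumes the sample standard deviation (n - 1 denominator) and does not compute the p-value, which would need a t-distribution table or a library such as SciPy.

```python
from math import sqrt
from statistics import mean, stdev

differences = [6, 3, 7, 2, 8, 1, 7, 2, 7, 6]   # after minus before, per country

n = len(differences)
mean_diff = mean(differences)
sd_diff = stdev(differences)       # sample SD, n - 1 in the denominator
se = sd_diff / sqrt(n)             # standard error of the mean difference
t_stat = mean_diff / se
df = n - 1

print(f"mean = {mean_diff:.2f}, SD = {sd_diff:.2f}, SE = {se:.3f}, t = {t_stat:.2f}, df = {df}")
```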
parameter
a variable that represents the input the function expects. When you call the function, you'll pass in actual data, and it will be assigned to this parameter name. - The colon `:` and indented block work just like `if` and `for` — everything indented is the function's body. - **`return total, average → Chapter 4: Python Fundamentals II: Control Flow, Functions, and Thinking Like a Programmer
Part I
You download the data and take your first look. What's in here? What are the columns? What questions could we ask? - **Part II** — You clean the data, handle missing values, reshape it, and merge in additional sources. - **Part III** — You create visualizations that reveal patterns — and learn to sp → Preface
Part I is fully linear
beginners need to build skills in a specific order. - **Later parts have more flexibility** — once you have the foundations, you can explore topics that interest you most. → How to Use This Book
Part I: Welcome to Data Science (Chapters 1--6)
Establishes Python fundamentals and the mindset of data analysis using pure Python, creating productive frustration that motivates the pandas library. - **Part II: Data Wrangling (Chapters 7--13)** --- Introduces pandas, data cleaning, reshaping, text/date handling, and data acquisition from files a → Instructor Guide: Overview
equal data differences produce unequal perceived color differences, so some data variations appear larger than they are. (2) **Colorblind inaccessibility** — the red-green transitions are invisible to deuteranopic viewers (~8% of men). (3) **False boundaries** — sharp hue transitions create perceive → Chapter 18 Quiz: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Perceptual principles
how the human visual system processes charts, and how to design for it. 2. **Accessibility** — how to ensure your visualizations work for people with color vision deficiency, low vision, and screen readers. 3. **Ethics** — how to avoid (and detect) misleading visualization techniques. → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
perfect classifier
it achieves 100% true positive rate with 0% false positive rate. AUC = 1.0. This rarely happens in practice. 2. A **random classifier** — no better than flipping a coin. AUC = 0.5. The model provides no useful information. 3. A **good classifier** — it achieves high true positive rates with relative → Chapter 29 Exercises: Evaluating Models
Structured. Each row is an event (foul, shot, turnover) with timestamps, player IDs, and score at the time of the event. 2. **Referee assignment records** — Structured. Which referees officiated each game. Available from official NBA data. 3. **Game video footage** — Unstructured. Priya might review → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Population: All registered voters in Ohio. - Sample: The 1,200 polled voters. - Parameter: True proportion of all Ohio registered voters who support the measure. - Statistic: Proportion in the sample who support it. → Answers to Selected Exercises
Possible explanations for the score gap:
**Bias in peer evaluations:** Research consistently shows that identical behaviors are perceived as "leadership" in men and "bossiness" in women. Peer evaluations may reflect gender stereotypes. - **Opportunity gap:** If women are less likely to be assigned to high-visibility projects, they may have → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Pre-merge checklist:
[ ] Check both key columns have the same dtype (`df['key'].dtype`) - [ ] Check for duplicate keys (`df['key'].duplicated().sum()`) - [ ] Check for key overlap (`left['key'].isin(right['key']).sum()`) - [ ] Check for NaN in key columns (`df['key'].isna().sum()`) - [ ] Strip whitespace and standardize → Key Takeaways: Reshaping and Transforming Data
She chose a line chart because this is temporal data with a continuous trend. - She set the y-axis from 15 to 50 rather than 0 to 100 because the full range would flatten the trend into near-invisibility. For a line chart (which encodes position, not length), this is appropriate. - She annotated two → Case Study 2: The Sports Page Goes Digital — Priya's NBA Shot Charts
Graded as a portfolio of incremental submissions. Early milestones should be graded generously to build confidence; later milestones demand higher polish and rigor. → Instructor Guide: Overview
Learn one new tool or library and apply it to a real problem. - Attend a meetup, webinar, or conference talk (many are free and virtual). → Appendix E: Frequently Asked Questions
Need to look up by name? Use a **dictionary**. - Need an ordered, changeable collection? Use a **list**. - Need unique values or fast membership checks? Use a **set**. - Need a fixed, unchangeable sequence? Use a **tuple**. → Key Takeaways: Working with Data Structures
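A short sketch of the four choices side by side. All names and values here are hypothetical examples.

```python
# Dictionary: look up a value by name.
country_codes = {"Kenya": "KE", "Brazil": "BR"}
code = country_codes["Kenya"]

# List: ordered, changeable, duplicates allowed.
readings = [3, 1, 3]
readings.append(2)

# Set: duplicates collapse automatically; membership checks are fast.
manufacturers = {"A", "B", "A"}
has_b = "B" in manufacturers

# Tuple: a fixed pair; immutable, so it can serve as a dictionary key.
location = (44.98, -93.27)
city_by_location = {location: "Minneapolis"}

print(code, readings, sorted(manufacturers), city_by_location[location])
```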
when we need to identify which countries need intervention > - Use the **Decision Tree for communication** — when we need to explain the key risk factors to policymakers > - The two models agree on the most important features (GDP per capita and physicians per 1,000), which gives us confidence in bo → Case Study 2: Comparing Three Models — Which Predicts Vaccination Best?
randomized controlled experiment
the same design used in medical trials. You randomly assign some people to get the treatment and others to get a placebo, and you compare the outcomes. But in data science, experiments aren't always possible. You can't randomly assign countries to have high GDP to see if it increases vaccination rat → Chapter 1: What Is Data Science? (And What It Isn't) — A Map of the Field
Rate limiting
the server needs to know who's making requests so it can enforce usage limits (e.g., "1,000 requests per hour per user"). 2. **Access control** — some data is only available to authorized users. 3. **Accountability** — if someone misuses the API, the provider needs to know who did it. → Chapter 13: Getting Data from the Web — APIs, Web Scraping, and Building Your Own Datasets
measures what fraction of actual positives the model catches, which is critical when positive cases are rare and important (fraud, disease). (2) **F1-score** — the harmonic mean of precision and recall, providing a balanced measure that is only high when both precision and recall are reasonable. Bot → Chapter 27 Quiz: Logistic Regression and Classification — Predicting Categories
Recommendations:
Audit the leadership score for gender bias specifically - Consider using structured evaluation criteria rather than subjective peer ratings - Test the promotion algorithm with and without the leadership score to measure its contribution to the gender gap - If the feature cannot be debiased, consider → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Recommended datasets for beginners:
Titanic survival data (classification practice) - House prices (regression practice) - Iris flower dataset (clustering and classification) - New York City Airbnb listings (exploratory analysis) → Appendix D: Data Sources Guide
Red-green color pair
inaccessible to deuteranopic viewers. Red and green are the first two colors assigned to what are presumably the two most important regions. 2. **Six different colors for a single variable** — the bars represent different regions but the same metric (a value). Using different colors implies the colo → Chapter 18 Quiz: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
the legend labels are sufficient. 4. **Make sure the y-axis starts at zero** — for bar charts, this is non-negotiable. 5. **Add a descriptive title** that states the finding, not just the topic. Not "Vaccination Rates by Region" but "Sub-Saharan Africa Lags 30 Points Behind the Global Average in Vac → Chapter 14: The Grammar of Graphics — Why Visualization Matters and How to Think About Charts
reproducibility
when he runs the same analysis next month, the code is already written, not a series of forgotten clicks; (2) **automation** — he can write a script that processes his sales data every week without manual work; and (3) **scale** — as his business grows, Excel will struggle with larger datasets, but → Chapter 2 Exercises: Setting Up Your Toolkit
Respect the citation honesty system:
Tier 1: Only for sources you can verify exist - Tier 2: Attributed but unverified claims - Tier 3: Clearly labeled illustrative examples 4. **Maintain voice consistency** with the existing chapters 5. **Test code examples** if modifying any code (Python 3.12+) 6. **Submit a pull request** with a cle → Contributing to Introduction to Data Science
return a new string
they don't modify the original. Strings in Python are **immutable**: once created, they cannot be changed. If you want the uppercase version, you need to save it: > > ```python > city = "Minneapolis" > city_upper = city.upper() > print(city_upper) # "MINNEAPOLIS" > ``` > > Or reassign: > > ```python → Chapter 3: Python Fundamentals I — Variables, Data Types, and Expressions
(a) 2 points: correctly identifies the causal vs. descriptive gap. - (b) 4 points: 2 points per plausible, well-explained alternative. - (c) 4 points: describes a study with a comparison group and random assignment (or a strong quasi-experimental design). Loses points for vague designs ("just study → Chapter 1 Quiz: What Is Data Science? (And What It Isn't)
Run the analysis both ways
with and without the missing rows — and report whether the conclusions differ. 2. **Create a "Not Reported" category** for race/ethnicity, treating it as its own group rather than deleting it. 3. **Use ZIP code as a supplementary indicator** — while not a substitute for individual race/ethnicity dat → Case Study 1: The Messy Reality of Hospital Records
S
Safeguards:
Use victim-reported crime data (calls for service) rather than arrest data - Exclude minor offenses (drug possession, loitering) that are enforcement-dependent - Cap the amount of additional policing the model can direct to any single area - Regular audits of the model's racial and geographic impact → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Sample size
larger samples give more power. (2) **Effect size** — larger effects are easier to detect. (3) **Significance level** — higher α (e.g., 0.10 vs. 0.01) gives more power but at the cost of more false positives. → Chapter 23 Quiz: Hypothesis Testing
Sample the data
use `df.sample(5000)` to plot a random subset. Fast, but loses some information. (2) **Aggregate** — use `groupby` to compute means or medians by group, reducing 500K rows to hundreds. Changes the chart type from scatter to bar or grouped scatter. (3) **Use `px.density_heatmap()`** — bin the data in → Chapter 17 Quiz: Interactive Visualization — plotly, Dashboard Thinking
each random sample selects different countries, producing different results. This is the fundamental randomness of sampling. (b) The distribution would be approximately **bell-shaped** (normal), centered around the true population mean. This is a preview of the Central Limit Theorem (Chapter 21). (c → Chapter 20 Quiz: Probability Thinking
A descriptive title (not "Capstone Project" but something specific and interesting) - A brief abstract summarizing the question, data, methods, and key findings - This should be readable by someone with no data science background → Chapter 35: Capstone Project: A Complete Data Science Investigation
Section 9: Ethical Reflection (300-500 words)
Who is represented in your data, and who might be missing? - Could your findings be misused? By whom, and how? - What responsibilities do you have as the analyst? - Were there ethical tensions in the analysis itself? (e.g., privacy, consent, representation) → Chapter 35: Capstone Project: A Complete Data Science Investigation
the enrollments table is joined with itself, using different aliases (`e1` and `e2`). The `WHERE` clause filters so that `e1` only has CS101 records and `e2` only has CS201 records, while the `ON` clause ensures they're the same student. → Case Study 2: Querying a University Database — Jordan Discovers SQL
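The self-join described here can be sketched with Python's built-in `sqlite3` module. Only the table name `enrollments`, the aliases `e1` and `e2`, and the course codes come from the entry above; the column names `student_id` and `course` and the sample rows are assumptions, and the case study's actual schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (student, course) enrollment.
conn.execute("CREATE TABLE enrollments (student_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [(1, "CS101"), (1, "CS201"), (2, "CS101"), (3, "CS201")],
)

# Join the table to itself: e1 holds CS101 rows, e2 holds CS201 rows,
# and the ON clause requires both rows to belong to the same student.
rows = conn.execute(
    """
    SELECT e1.student_id
    FROM enrollments AS e1
    JOIN enrollments AS e2
      ON e1.student_id = e2.student_id
    WHERE e1.course = 'CS101'
      AND e2.course = 'CS201'
    ORDER BY e1.student_id
    """
).fetchall()
conn.close()

print(rows)
```

Only student 1 appears in both courses, so that is the only row returned.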
different results each run without them. 2. **Record library versions** — different scikit-learn versions may produce different results. 3. **Document the data source and date** — data may change over time. 4. **Save the trained pipeline** — so you can reload and verify without retraining. 5. **Incl → Chapter 30 Exercises: The Machine Learning Workflow
Shape arithmetic:
Melting a table with R rows and C value columns produces R x C rows - Pivoting divides the row count by the number of unique values in the `columns` parameter (when every combination is present) → Key Takeaways: Reshaping and Transforming Data
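To make the shape arithmetic concrete, here is a plain-Python stand-in for the reshape. It uses lists of dictionaries instead of pandas so it runs without extra libraries; it imitates what melting and pivoting do to row counts but is not the pandas implementation. The table contents are hypothetical.

```python
# Wide table: R = 2 rows, C = 3 value columns.
wide = [
    {"country": "A", "2020": 80, "2021": 82, "2022": 85},
    {"country": "B", "2020": 60, "2021": 65, "2022": 70},
]
id_col = "country"
value_cols = ["2020", "2021", "2022"]

# "Melt": one output row per (input row, value column) pair, so R x C rows.
long_rows = [
    {id_col: row[id_col], "year": col, "value": row[col]}
    for row in wide
    for col in value_cols
]

# "Pivot": spread the unique `year` values back into columns,
# which collapses the long rows into one row per country.
pivoted = {}
for rec in long_rows:
    target = pivoted.setdefault(rec[id_col], {id_col: rec[id_col]})
    target[rec["year"]] = rec["value"]
wide_again = list(pivoted.values())

print(len(wide), "->", len(long_rows), "->", len(wide_again))
```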
establishes the baseline (things were good in 2019). 2. **Complication** — introduces the problem (pandemic caused a decline; outbreaks occurred). 3. **Resolution** — presents the analysis finding (the decline was geographically concentrated; clinic closures explain it). 4. **Call to Action** — reco → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Slide 1: Title Slide
Title: "Protecting Our Children: Data-Driven Recommendations for Improving Vaccination Rates in [County]" - Visual: Clean title with county logo, presenter name, date - Speaker notes: "Good morning. I'm here today to share what our analysis of vaccination data tells us about where we are, where we'r → Chapter 31 Exercises: Communicating Results: Reports, Presentations, and the Art of the Data Story
Making equivalent values look the same ("NYC" and "New York City" should become a single value) 2. **Extraction** — Pulling structured information out of unstructured text (getting the number "250" out of "250mL bottle") 3. **Searching** — Finding rows that match certain patterns (all entries contai → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
Start with a title cell
Markdown with heading, author, date, purpose 2. **Use section headings** — `##` for major sections, `###` for subsections 3. **Explain before you compute** — Markdown cell before code, interpretation after 4. **Name files descriptively** — `sales-analysis-jan-2024.ipynb`, not `Untitled3.ipynb` 5. ** → Key Takeaways: Setting Up Your Toolkit
Step 1: Frame the problem
**Target:** Vaccination rate (a continuous number from 0 to 100) - **Features:** GDP per capita, healthcare spending per capita, education index, urbanization rate - **Type:** Supervised learning, regression (because the target is continuous) - **Success metric:** How close are our predictions to ac → Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
stratify by urban/suburban/rural and sample within each stratum. This ensures representation of each school type and produces more precise estimates when strata differ. 2. **Systematic** — select every 100th item (or some regular interval). This is efficient on an assembly line and gives good covera → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Strip whitespace
remove leading/trailing spaces with `.str.strip()` 2. **Standardize case** — convert to lowercase (or uppercase) with `.str.lower()` 3. **Remove or standardize punctuation** — remove unnecessary dots, commas, etc. 4. **Collapse whitespace** — replace multiple spaces with a single space 5. **Map know → Chapter 10 Quiz: Working with Text Data
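The first four steps can be sketched on a single string with plain Python; on a pandas column the same steps would typically go through the `.str` accessor, as the entry notes for `.str.strip()` and `.str.lower()`. The function name `clean_text`, the choice to drop only dots and commas, and the sample input are assumptions. The remaining steps in the list are not shown because the excerpt is cut off.

```python
import re

def clean_text(value):
    """Apply strip, lowercase, punctuation removal, and whitespace collapsing."""
    value = value.strip()                 # 1. remove leading/trailing spaces
    value = value.lower()                 # 2. standardize case
    value = re.sub(r"[.,]", "", value)    # 3. drop dots and commas (assumed punctuation set)
    value = re.sub(r"\s+", " ", value)    # 4. collapse runs of whitespace
    return value

raw = "  New   York,  N.Y. "
cleaned = clean_text(raw)
print(repr(cleaned))
```

Order matters here: collapsing whitespace last also cleans up any gaps left behind by removed punctuation.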
Structured
Rows and columns with defined types. Challenge: missing values, inconsistent coding across departments (e.g., "M" vs. "Male"), and privacy restrictions. 2. **Unstructured** — Free-form text with no fixed schema. Challenge: extracting meaning from natural language — sarcasm, misspellings, varying len → Chapter 1 Exercises: What Is Data Science? (And What It Isn't)
Net rating difference and winning percentage difference are the two most important predictors — together accounting for 50% of the model's decisions. - Rest days have a measurable but small effect — about 6% of predictive weight. - The model is 68-69% accurate across seasons, matching professional p → Case Study 2: Predicting Game Outcomes — Priya's Random Forest for NBA
Test edge cases
earliest year, latest year, each region individually. (2) **Check for empty results** — some region-year combinations may have no data; the callback should handle this gracefully (e.g., show an empty chart with a message rather than crashing). (3) **Add validation in the callback** — check if the fi → Chapter 17 Quiz: Interactive Visualization — plotly, Dashboard Thinking
Test on a small string first
use `re.findall()` on a single example 2. **Build incrementally** — start with the simplest pattern that matches *something*, then add complexity 3. **Check for unescaped special characters** — `.`, `$`, `(`, `)`, `*`, `+`, `?` all need `\` for literal matching 4. **Check greedy vs. lazy** — if matc → Key Takeaways: Working with Text Data
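A small sketch of the first checks in this list, run on a single hypothetical test string: escaping `$` and `.` for literal matching, then comparing a greedy pattern with its lazy counterpart.

```python
import re

sample = "Price: $4.50 (sale) and $12.00 (regular)"   # hypothetical test string

# `$` and `.` are special characters, so both are escaped to match literally.
prices = re.findall(r"\$\d+\.\d+", sample)

# Greedy: `.*` runs to the LAST closing parenthesis in the string.
greedy = re.findall(r"\(.*\)", sample)

# Lazy: `.*?` stops at the FIRST closing parenthesis it can.
lazy = re.findall(r"\(.*?\)", sample)

print(prices)
print(greedy)
print(lazy)
```

If a pattern returns one long match where several short ones were expected, switching `.*` to `.*?` is usually the first thing to try.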
Test tooltips exhaustively
hover over edge cases (small countries, missing data, outliers) to ensure they display correctly. - **Choose export format based on audience** — HTML for exploration, static images for reports, Dash for recurring dashboards. - **Start simple** — one excellent interactive chart is better than a medio → Case Study 1: An Interactive Global Health Dashboard
so readers know who created the analysis and when. This matters for accountability and for understanding whether the analysis might be outdated. → Chapter 2 Quiz: Setting Up Your Toolkit
WHO COVID-19 Vaccination Data (194 countries, 2021-2023) - World Bank Development Indicators (GDP per capita, population, education levels) - WHO Global Health Expenditure Database (healthcare spending, workforce density) - Optional: additional sources you've identified during the course → Chapter 35: Capstone Project: A Complete Data Science Investigation
three to five bullet points, each stating an insight (not a finding). Use plain language. Avoid jargon. Include numbers but make them meaningful. - "Rural vaccination rates dropped 12% between 2019 and 2022, compared to 3% in urban areas." - "Counties with mobile clinic programs maintained rates 8 p → Chapter 31: Communicating Results: Reports, Presentations, and the Art of the Data Story
**Data inventory and access control:** Document what data the company collects, where it is stored, who can access it, and what it is used for. Restrict access to need-to-know basis. - **Privacy-by-design:** All new products and features must include a privacy review before launch. Default to collec → Chapter 32 Exercises: Ethics in Data Science: Bias, Privacy, Consent, and Responsible Practice
Three things the project does well
be specific. "Good charts" is less helpful than "The chart comparing vaccination rates by income group is clear, well-labeled, and immediately communicates the main finding." 3. **Three things that could be improved** — be constructive. "Your limitations section is weak" is less helpful than "Your l → Chapter 35: Capstone Project: A Complete Data Science Investigation
Three types of missing data:
**MCAR (Missing Completely at Random):** No pattern to the missingness. Least problematic. - **MAR (Missing at Random):** Missingness is related to an observed variable. Manageable with care. - **MNAR (Missing Not at Random):** Missingness is related to the missing value itself. Most problematic — c → Key Takeaways: Your First Data Analysis
Three types of questions:
**Descriptive:** What happened? What does the data look like? (Start here.) - **Predictive:** What is likely to happen? (Requires modeling — Part V.) - **Causal:** What would happen if we changed something? (Requires experimental design — Chapter 24.) → Key Takeaways: Your First Data Analysis
Tips for self-study:
**Set a schedule and stick to it.** Consistency beats intensity. One hour a day, five days a week, is better than a single 10-hour marathon on Saturday. - **Type every code example.** Don't copy-paste. The act of typing forces your brain to engage with every character, and you'll catch details you'd → How to Use This Book
Title and axis labels
mandatory for every chart. 2. **One key finding** in the title or subtitle — the reader should know the main message without studying the chart. 3. **Reference lines or annotations** for the most important comparison. 4. **Data labels** on the most important data points (not all of them). 5. **Sourc → Chapter 18: Visualization Design — Principles, Accessibility, Ethics, and Common Mistakes
Too big:
"I'm going to build a real-time stock price prediction system with a streaming data pipeline." - "I want to analyze every tweet ever posted about climate change." - "I'm building a recommendation engine for every movie on Netflix." → Chapter 34: Building Your Portfolio: Projects That Get You Hired
"interesting" isn't specific enough to guide analysis. 2. **Good for EDA** — specific column, specific filter, specific statistic. 3. **Unanswerable with this data** — "why" requires causal information (conflict, infrastructure) not in the dataset. 4. **Good for EDA** — you could compute the standar → Chapter 6 Exercises: Your First Data Analysis
any nonzero number is truthy. 2. `bool(0)` = **False** --- zero is falsy. 3. `bool(-1)` = **True** --- negative numbers are nonzero, therefore truthy. 4. `bool("")` = **False** --- empty string is falsy. 5. `bool(" ")` = **True** --- a space character makes the string non-empty. 6. `bool("0")` = **T → Answers to Selected Exercises
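These rules are easy to confirm in a notebook cell. A minimal sketch, with the values chosen to match the cases listed above:

```python
# Values taken from the cases above; "0" is a non-empty string, not the number 0.
values = [0, -1, "", " ", "0"]

# Map each value's printed form to its truthiness.
truthiness = {repr(v): bool(v) for v in values}

for text, result in truthiness.items():
    print(f"bool({text}) -> {result}")
```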
it examines the values and converts them to appropriate Python/NumPy types (int64, float64, object). The Python concept is **type conversion** (Ch.3) — pandas just does it for you automatically. > > > **From Chapter 4:** The `apply()` method takes a function as an argument. What is this p → Chapter 7: Introduction to pandas — DataFrames, Series, and the Grammar of Data Manipulation
TypeError
you can't use `+` to combine a string (`"The average is: "`) with a number (`average`). Fix options: `print("The average is:", average)` (using a comma) or `print("The average is: " + str(average))` (converting the number to a string). → Chapter 2 Exercises: Setting Up Your Toolkit
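A short sketch that reproduces the error and both fixes. The value assigned to `average` is an assumption made for illustration.

```python
average = 72.5  # hypothetical value; the exercise's actual number is not shown here

caught = None
try:
    # str + float is not defined, so Python raises TypeError.
    message = "The average is: " + average
except TypeError as error:
    caught = error

# Fix 1: pass both items to print(), separated by a comma.
print("The average is:", average)

# Fix 2: convert the number to a string before concatenating.
fixed = "The average is: " + str(average)
print(fixed)
```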
U
University mental health survey:
Population: All students at the university. - Sample: The 400 randomly selected students. - Parameter: True proportion of all students who use the mental health center. - Statistic: Proportion in the sample who use the center. → Answers to Selected Exercises
Use a decision tree when:
Interpretability is the top priority — your audience needs to understand *why* the model makes each prediction - You're building a preliminary model to explore which features matter - The dataset is small and a complex model would overfit - You need to explain the model to non-technical stakeholders → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
Use a histogram when:
You want to see exact counts per bin - Your audience is not statistically sophisticated (histograms are universally understood) - You are presenting a single distribution → Chapter 16: Statistical Visualization with seaborn
Use a random forest when:
Accuracy is the top priority and you can sacrifice some interpretability - You have a medium-to-large dataset - You want a model that's robust to small changes in the data - You want reliable feature importance scores - You're in a competitive setting (random forests are strong default models for ma → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
You want to compare distributions without the visual confusion of overlapping curves - You need to read off percentiles directly (the y-axis gives cumulative probability) - You want a representation that does not depend on bin width or bandwidth choices → Chapter 16: Statistical Visualization with seaborn
Use KDE when:
You want to compare multiple distributions on the same axes (overlapping KDEs are clearer than overlapping histograms) - You want to emphasize the smooth shape of the distribution - You have enough data points (at least 50-100) for a reliable density estimate → Chapter 16: Statistical Visualization with seaborn
Use regex when:
You need to match a *pattern* rather than a fixed string ("any sequence of digits") - You need to *extract* part of a string (capture groups with `.str.extract()`) - You need to match with *flexibility* (one word OR another, optional characters) - You need *anchoring* (must start with, must end with → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
Use rug when:
You want to show individual observations alongside another distribution plot - Your dataset is small to moderate (under a few hundred points) - You want to verify that the KDE or histogram is not masking gaps or clusters → Chapter 16: Statistical Visualization with seaborn
Use simple string methods when:
You're doing case conversion (`.str.lower()`, `.str.upper()`) - You're stripping whitespace (`.str.strip()`) - You're replacing a known, fixed substring (`.str.replace("old", "new", regex=False)`) - You're splitting on a simple delimiter (`.str.split(",")`) - You're checking for a known, fixed subst → Chapter 10: Working with Text Data — String Methods, Regular Expressions, and Extracting Meaning
Use something else when:
You have a very large dataset (>100K samples) and need fast training — consider gradient boosting (XGBoost, LightGBM) - You need a linear model for inference or theoretical reasons — stick with logistic regression - Your data has a strong linear structure — tree-based models can struggle with simple → Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
V
Vectorized operations
Applying an operation to an entire column at once (`df["col"] * 2`) rather than looping through values one by one. Faster, safer, more readable. → Key Takeaways: Introduction to pandas
only people with strong feelings (very happy or very unhappy) bother filling out cards. The 85% likely overestimates satisfaction because moderately satisfied people don't fill out cards, but the few angry ones often do, creating an odd mix. 2. **Voluntary response / self-selection bias** — people w → Chapter 22 Exercises: Sampling, Estimation, and Confidence Intervals
Skim one or two data science newsletters. Recommended: *Data Science Weekly*, *The Batch* (by Andrew Ng), or *Towards Data Science* (on Medium). - Browse the front page of [/r/datascience](https://www.reddit.com/r/datascience/) on Reddit. → Appendix E: Frequently Asked Questions
Spotify Web API audio features (tempo, energy, danceability, valence, acousticness, instrumentalness, speechiness, loudness) for tracks appearing on the Billboard Hot 100, 2004-2024 - Billboard Hot 100 chart data (song titles, artists, peak position, weeks on chart) - He collected data for approxima → Case Study 2: An Alternative Capstone: Analyzing Your Own Dataset
Do not try to learn every new framework that appears on Hacker News. Most will be irrelevant to your work. - Do not feel inadequate because someone on Twitter is discussing techniques you have not learned. Everyone's knowledge has gaps. - Do not confuse reading about data science with doing data sci → Appendix E: Frequently Asked Questions
What she found:
Census tract-level demographic data from the American Community Survey (ACS), including median household income, racial composition, educational attainment, and housing tenure (rent vs. own) - Zillow Home Value Index data at the ZIP code level - Building permit data from the city's open data portal, → Case Study 2: An Alternative Capstone: Analyzing Your Own Dataset
What she would change:
The county map as a scatter plot was functional but looked amateurish compared to a real choropleth with county boundaries. For the next election, she would use GeoJSON county boundaries with `px.choropleth_mapbox()`. - The 15-minute refresh cycle felt slow on election night. A true real-time dashbo → Case Study 2: Election Night Live — Building an Interactive Results Tracker
**For prediction:** Often nothing. Multicollinearity doesn't affect prediction accuracy much — it affects coefficient interpretation. If you only care about making good predictions, you can often ignore it. → Chapter 26: Linear Regression — Your First Predictive Model
Take one of your portfolio projects and write it up as a narrative blog post. Not a tutorial ("how to build a random forest") but an investigation story ("What I learned about global vaccination disparities by building three different models"). - Write about a concept you struggled to understand. "A → Chapter 34: Building Your Portfolio: Projects That Get You Hired
What worked:
The tooltip design was critical. Board members and casual readers both praised the ability to hover over a county and see exact numbers without navigating to a separate table. - The demographic scatter was the most shared chart on social media. People found the education-voting correlation striking → Case Study 2: Election Night Live — Building an Interactive Results Tracker
What you'll do:
Combine and refine all the work you've done across chapters 1-34 - Fill any analytical gaps (sections you skipped, analyses you started but didn't finish) - Add new analysis where needed to tell a complete story - Polish everything into a single, cohesive narrative notebook - Write an executive summ → Chapter 35: Capstone Project: A Complete Data Science Investigation
When deletion is appropriate:
The missing values are a small percentage of your data (often cited as less than 5%) - The data is **missing completely at random** (MCAR) — the reason for missingness has nothing to do with the value itself or any other variable - You have enough data that losing some rows won't affect your analysi → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job
When deletion is dangerous:
The missingness is *not* random. If low-income patients are more likely to have missing insurance information, dropping those rows silently removes low-income patients from your analysis. Your results now describe only the people with complete records — which may not represent the population you car → Chapter 8: Cleaning Messy Data: Missing Values, Duplicates, Type Errors, and the 80% of the Job
**CSV** when simplicity and compatibility matter, and the data is a single flat table. - **Excel** when sharing with non-technical stakeholders or when the data naturally has multiple related sheets. - **JSON** when the data is hierarchical or comes from a web API. - **Database** when the data is la → Chapter 12: Getting Data from Files — CSVs, Excel, JSON, and Databases
When to use each:
**Two-tailed:** Default choice. Use when you don't have a strong directional prediction, or when an effect in either direction would be interesting. - **One-tailed:** Use only when you have a clear directional hypothesis *stated before looking at the data*, and an effect in the other direction would → Chapter 23: Hypothesis Testing — Making Decisions with Data (and What P-Values Actually Mean)
When to use KDE vs. histograms:
Use histograms when you want to see the exact count in each bin and when your audience is less technical. - Use KDE when you want to compare distributions across groups (overlapping KDEs are easier to read than overlapping histograms). - Use both together when exploring data for yourself. → Chapter 16: Statistical Visualization with seaborn
**Medium** (and specifically its data science publications like *Towards Data Science*) has the largest built-in audience for data science content. - **dev.to** is popular among developers and data practitioners. - **A personal website** using GitHub Pages, Jekyll, Hugo, or a simple site builder giv → Chapter 34: Building Your Portfolio: Projects That Get You Hired
the position between a word character (`\w`) and a non-word character. It matches a *position*, not a character. This is why `r"\bcat\b"` matches "cat" as a whole word but not "catfish" or "concatenate." Note: outside of regex, `\b` does mean backspace in regular Python strings, which is another rea → Chapter 10 Quiz: Working with Text Data
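The whole-word behavior can be checked with a quick sketch using the standard-library `re` module. The sample sentence is invented for illustration.

```python
import re

# Hypothetical sample text containing "cat" alone and inside longer words.
text = "cat catfish concatenate cat."

# With \b on both sides, only the standalone word matches.
whole_word = re.findall(r"\bcat\b", text)

# Without boundaries, every occurrence of the three letters matches.
anywhere = re.findall(r"cat", text)

# In a raw string, \b is two characters (backslash + b) that regex reads as
# a word boundary; in a regular string it is a single backspace character.
raw_length = len(r"\b")
plain_length = len("\b")

print(whole_word, anywhere, raw_length, plain_length)
```

The two standalone matches come from the first word and the final "cat" before the period, since a period is a non-word character.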
Works for ANY shape
uniform, skewed, bimodal, anything 2. **n >= 30 is usually enough** (more for heavily skewed data) 3. **Standard error = sigma / sqrt(n)** — larger samples give more precise means → Key Takeaways: Distributions and the Normal Curve
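To make the standard error formula concrete, here is a minimal sketch. The value of `sigma` and the sample sizes are assumptions chosen so the arithmetic is easy to follow.

```python
import math

sigma = 10.0  # hypothetical population standard deviation

# Standard error of the mean for a few sample sizes: sigma / sqrt(n).
standard_errors = {n: sigma / math.sqrt(n) for n in (25, 100, 400)}

for n, se in standard_errors.items():
    print(f"n = {n:>3}: standard error = {se}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.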
Write and run code cells
use `print()`, do arithmetic, understand cell output - [ ] **Write and run Markdown cells** — create headings, bold, italic, lists, links - [ ] **Switch between cell types** — use the dropdown or M/Y shortcuts - [ ] **Use keyboard shortcuts** — at minimum: Shift+Enter, Esc, Enter, A, B, D-D, M, Y - → Key Takeaways: Setting Up Your Toolkit
Writing insights:
The decision tree was more useful for storytelling than the random forest. The tree's simple rules ("if the home team has a better record AND a better net rating, they'll probably win") are easy to explain in an article. The random forest's improved accuracy came at the cost of narrative clarity. - → Case Study 2: Predicting Game Outcomes — Priya's Random Forest for NBA
Y
Yes
the variable is still in memory from when you previously ran cell 3. Python doesn't "know" that the cell was deleted. 2. **NameError** — after a kernel restart, all variables are cleared. Cell 3 no longer exists to recreate `patient_count`. 3. Best practices: (a) periodically do Kernel → Restart & R → Chapter 3 Exercises: Python Fundamentals I — Variables, Data Types, and Expressions
`z = (x - mean) / standard deviation` - `|z| > 2`: unusual - `|z| > 3`: very unusual - Less robust: based on the mean, which outliers themselves distort → Key Takeaways: Descriptive Statistics
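A small sketch of the z-score rule, using the standard-library `statistics` module. The data are invented, and using the sample standard deviation (`statistics.stdev`) is an assumption, since the entry does not say which version is meant.

```python
import statistics

# Hypothetical data: nine typical values and one obvious outlier (40).
values = [10, 12, 11, 13, 12, 11, 10, 12, 11, 40]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (an assumption)

# z = (x - mean) / standard deviation
z_scores = [(x - mean) / sd for x in values]

# Flag anything with |z| > 2 as unusual.
flagged = [x for x, z in zip(values, z_scores) if abs(z) > 2]

print(f"mean = {mean}, sd = {sd:.2f}")
print("flagged:", flagged)
print(f"z-score of the outlier: {z_scores[-1]:.2f}")
```

The outlier is flagged as unusual but stays below the "very unusual" threshold of 3, because it pulls the mean toward itself and inflates the standard deviation. That is the lack of robustness the entry describes.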