Answers to Selected Exercises

How to use this appendix: Solutions are provided for odd-numbered exercises from Parts A and B of selected chapters. The goal is to show reasoning, not just final answers. If your answer differs from ours, that does not necessarily mean it is wrong --- many of these problems have multiple valid approaches. Check whether your reasoning is sound, not just whether your answer matches.

Chapters covered: 1, 3, 5, 7, 9, 11, 14, 19, 22, 25.


Chapter 1: What Is Data Science?

Exercise 1.1 --- Defining the field

Sample answer: Data science is the practice of extracting knowledge and insight from data using a combination of statistical reasoning, computational tools, and domain expertise, then communicating those findings to inform decisions. A family-friendly version: "Data science means using computers and careful thinking to find answers hidden in information."

The two versions differ because the technical version names the methods (statistics, computation, domain expertise), while the family version focuses on the outcome (finding answers). This difference illustrates that data science is ultimately about answering questions --- the tools are means, not ends.

Exercise 1.3 --- The lifecycle, from memory

The six stages are:

  1. Question formulation --- Define a clear, answerable question connected to a real problem.
  2. Data collection --- Identify, locate, and gather the data needed to address the question.
  3. Data cleaning --- Inspect the data for errors, inconsistencies, and missing values; repair and transform it into a usable form.
  4. Exploratory analysis --- Summarize and visualize the data to discover patterns, spot anomalies, and refine the question.
  5. Modeling --- Apply statistical or machine learning techniques to formally answer the question.
  6. Communication --- Translate findings into a form the audience can understand and act upon.

Key insight: these stages are iterative, not strictly linear. Real projects frequently loop back from later stages to earlier ones.

Exercise 1.5 --- Structured vs. unstructured

  1. Hospital EHR in relational database: Structured. Challenge: missing values, inconsistent coding across departments, privacy restrictions.
  2. 50,000 customer reviews: Unstructured. Challenge: extracting meaning from natural language (sarcasm, misspellings, multiple languages).
  3. Server log files: Semi-structured. Challenge: parsing irregular formats, handling entries that break the expected pattern.
  4. 10,000 wildlife photographs: Unstructured. Challenge: images vary in lighting and angle; species identification requires labeling or computer vision.
  5. Monthly sales spreadsheet: Structured. Challenge: potential inconsistencies across stores (different calendars, missing months).

Exercise 1.7 --- Data literacy for everyone

This is an open-ended prompt with no single correct answer. A strong response either argues for universal data literacy (citing concrete examples like misinterpreting vaccine efficacy statistics or being misled by cherry-picked charts) or against it (arguing that better data communication by experts is more realistic than expecting statistical fluency from everyone). Either position can be well-defended; the key is engaging with specific examples rather than vague generalities.


Chapter 3: Python Fundamentals I

Exercise 3.1 --- Variables as labels

Variables in Python are labels (names) pointing to values in memory, not boxes containing values. When you execute:

a = 100
b = a

One copy of the integer 100 exists in memory. Two labels (a and b) both point to that same object. This distinction becomes critical with mutable objects (lists, dictionaries) in Chapter 5, where modifying a list through one label is visible through all labels pointing to it.
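A minimal sketch of the label model; the list example previews the mutable-object behavior described above (variable names are illustrative):

```python
# Two labels bound to one immutable object
a = 100
b = a
print(a is b)        # True: both names refer to the same object

# With a mutable object, the shared reference becomes visible
scores = [88, 92]
alias = scores       # a second label for the same list
alias.append(75)     # mutate through one label...
print(scores)        # [88, 92, 75] -- visible through the other label too
```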

Exercise 3.3 --- Operator precedence

  1. 2 + 3 * 4 = 2 + 12 = 14 --- multiplication before addition.
  2. (2 + 3) * 4 = 5 * 4 = 20 --- parentheses override precedence.
  3. 10 - 6 / 2 = 10 - 3.0 = 7.0 --- division before subtraction; note the result is a float because / always returns float.
  4. 2 ** 3 + 1 = 8 + 1 = 9 --- exponentiation before addition.
  5. 15 // 4 + 15 % 4 = 3 + 3 = 6 --- floor division gives 3 (15 divided by 4 is 3 with remainder 3), and modulo gives that remainder, 3.
  6. 10 > 5 and 3 + 2 == 5 = True and True = True --- arithmetic first (3+2=5), then comparisons (10>5 is True, 5==5 is True), then logical and.
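All six lines can be verified by the interpreter itself; each assert passes silently only when the stated value is correct:

```python
assert 2 + 3 * 4 == 14                 # multiplication before addition
assert (2 + 3) * 4 == 20               # parentheses override precedence
assert 10 - 6 / 2 == 7.0               # / always produces a float
assert isinstance(10 - 6 / 2, float)
assert 2 ** 3 + 1 == 9                 # exponentiation before addition
assert 15 // 4 + 15 % 4 == 6           # floor division and remainder
assert (10 > 5 and 3 + 2 == 5) is True # arithmetic, comparisons, then and
print("all precedence checks pass")
```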

Exercise 3.5 --- Truthiness

  1. bool(1) = True --- any nonzero number is truthy.
  2. bool(0) = False --- zero is falsy.
  3. bool(-1) = True --- negative numbers are nonzero, therefore truthy.
  4. bool("") = False --- empty string is falsy.
  5. bool(" ") = True --- a space character makes the string non-empty.
  6. bool("0") = True --- the string "0" contains one character and is non-empty.
  7. bool(0.0) = False --- zero as a float is still falsy.
  8. bool(None) = False --- None is always falsy.

The tricky cases: bool("0") is True because Python checks whether the string is empty, not whether it represents zero. Similarly, bool(" ") is True because the string contains a space character.
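The whole list can be checked in one pass:

```python
values = [1, 0, -1, "", " ", "0", 0.0, None]
results = [bool(v) for v in values]
print(results)
# [True, False, True, False, True, True, False, False]
```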

Exercise 3.7 --- Variable creation

dataset_name = "Global Health Observatory"
last_updated = 2023
row_count = 1284
country_count = 195
avg_life_expectancy = 73.4

print(f"Dataset: {dataset_name} (updated {last_updated})")
print(f"Rows: {row_count:,} | Countries: {country_count}")
print(f"Average life expectancy: {avg_life_expectancy} years")

Key detail: the :, format specifier inside the f-string adds the comma thousand-separator, producing "1,284" instead of "1284".

Exercise 3.9 --- String methods practice

country_clean = "   United States   ".strip()          # "United States"
temp_number = "98.6 degrees".split(" ")[0]             # "98.6"
code_upper = "us".upper()                               # "US"

Important: temp_number is still a string ("98.6"). To do arithmetic with it, you would need float(temp_number). This is a common gotcha when reading data from files --- everything starts as text.

Exercise 3.11 --- Type conversion chain

Starting with "3.14159":

  Step  Operation   Value       Type
  0     (original)  "3.14159"   str
  1     float()     3.14159     float
  2     int()       3           int
  3     str()       "3"         str
  4     bool()      True        bool

Key insights: int() truncates toward zero rather than rounding (3.14159 becomes 3; 3.9 would also become 3, not 4). And bool("3") is True because it is a non-empty string. Even bool("0") and bool("False") would be True --- Python only checks emptiness for strings.
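The chain, executed step by step, plus a check that int() truncates rather than rounds:

```python
s = "3.14159"
f = float(s)       # 3.14159
i = int(f)         # 3 -- truncated, not rounded
t = str(i)         # "3"
b = bool(t)        # True -- "3" is a non-empty string

print(f, i, t, b)  # 3.14159 3 3 True
print(int(3.9))    # 3  -- truncation, not rounding
print(int(-3.9))   # -3 -- truncation is toward zero
```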

Exercise 3.13 --- f-string formatting

population = 8045311
growth_rate = 0.02847
city = "new york"
pi = 3.14159265358979

print(f"Population: {population:,}")                 # 8,045,311
print(f"Growth rate: {growth_rate * 100:.2f}%")     # 2.85%
print(f"City: {city.title()}")                       # New York
print(f"Pi to 4 decimals: {pi:.4f}")                # 3.1416

Notes: :.2f means two decimal places in float format. .title() capitalizes the first letter of each word. :.4f rounds the last displayed digit: 3.14159265... becomes 3.1416, because the fifth decimal (9) rounds the fourth (5) up to 6.


Chapter 5: Working with Data Structures

Exercise 5.1 --- Choosing the right structure

  1. Country names in order: List --- ordered and allows duplicates.
  2. Country-to-code mapping: Dictionary --- fast lookup by name.
  3. Unique vaccine manufacturers: Set --- automatic deduplication, order irrelevant.
  4. Latitude/longitude pair: Tuple --- fixed, immutable pair that can serve as a dictionary key.
  5. Patient record with named fields: Dictionary --- named fields (keys) make the data self-documenting.

Exercise 5.3 --- Dictionary access patterns

Given the student dictionary with keys "name", "gpa", "major", and "courses":

  • Jordan's name: student["name"]
  • Jordan's GPA: student["gpa"]
  • The first course: student["courses"][0]
  • The number of courses: len(student["courses"])
  • Whether "major" exists as a key: "major" in student

Exercise 5.5 (assuming typical exercise pattern) --- List vs. dictionary performance

Lists are searched sequentially: checking "Brazil" in country_list requires examining each element until a match is found (O(n) time). Dictionaries use hash tables: checking "Brazil" in country_dict is nearly instantaneous regardless of size (O(1) time). For 195 countries the difference is negligible, but for millions of records, dictionaries are dramatically faster for lookups.
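A rough timing sketch using only the standard library; the exact numbers vary by machine, but the ordering does not:

```python
import timeit

n = 100_000
big_list = list(range(n))
big_dict = dict.fromkeys(big_list)   # same keys, hash-based lookup

# Worst case for the list: the element is absent, so every item is checked
t_list = timeit.timeit(lambda: n in big_list, number=100)
t_dict = timeit.timeit(lambda: n in big_dict, number=100)

print(f"list membership: {t_list:.5f}s, dict membership: {t_dict:.5f}s")
```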


Chapter 7: Introduction to pandas

Exercise 7.1 --- The threshold concept

"Thinking in vectors rather than loops" means applying transformations to entire columns at once instead of processing rows individually. Example:

# Loop approach (slow, verbose)
decimals = []
for pct in df["coverage_pct"]:
    decimals.append(pct / 100)
df["decimal"] = decimals

# Vectorized approach (fast, clear)
df["decimal"] = df["coverage_pct"] / 100

The vectorized version is preferred because: (1) it is faster (pandas uses optimized C code internally), (2) it is more readable (the intent is immediately clear), and (3) it handles NaN values automatically.

Exercise 7.3 --- loc vs. iloc

Given a DataFrame with index [10, 20, 30, 40]:

  1. df.iloc[0] --- Returns the row at position 0 (Alice, 88) as a Series.
  2. df.loc[10] --- Returns the row with label 10 (Alice, 88). Same data, different access method.
  3. df.iloc[1:3] --- Returns rows at positions 1 and 2 (Bob and Carol). iloc uses exclusive end.
  4. df.loc[10:30] --- Returns rows with labels 10, 20, and 30 (Alice, Bob, Carol). loc uses inclusive end --- this is the key difference.
  5. df.iloc[4] --- Raises IndexError because there is no position 4 (only 0--3).
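A sketch reconstructing the exercise's DataFrame; only the first three names and Alice's score (88) appear in the text, so the remaining values are placeholders:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol", "Dan"],   # "Dan" and the last
     "score": [88, 91, 79, 85]},                 # three scores are made up
    index=[10, 20, 30, 40],
)

print(df.iloc[0]["name"])   # Alice -- position 0
print(df.loc[10]["name"])   # Alice -- label 10
print(len(df.iloc[1:3]))    # 2 -- positions 1 and 2, end exclusive
print(len(df.loc[10:30]))   # 3 -- labels 10, 20, 30, end inclusive

try:
    df.iloc[4]
except IndexError:
    print("position 4 is out of bounds")
```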

Exercise 7.5 --- Boolean indexing mechanics

When pandas evaluates df[df["score"] > 80]:

  1. Inner expression: df["score"] > 80 is evaluated first. This compares every value in the "score" column to 80, producing a boolean Series of the same length as the DataFrame: [True, True, False, True, ...].
  2. Outer expression: df[boolean_series] uses the boolean Series as a mask (filter). Rows where the mask is True are kept; rows where it is False are dropped.
  3. Result: A new DataFrame containing only the rows where the score exceeds 80. The original DataFrame is not modified.
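The two steps, made visible on a toy DataFrame (names and scores are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben", "Cal", "Dee"],
                   "score": [85, 92, 71, 88]})

mask = df["score"] > 80        # step 1: boolean Series
print(mask.tolist())           # [True, True, False, True]

high = df[mask]                # step 2: mask filters the rows
print(len(high), len(df))      # 3 4 -- original DataFrame unchanged
```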


Chapter 9: Reshaping and Transforming Data

Exercise 9.1 --- Join type selection

(a) Student list + optional exam scores: Left join. The student list is primary; you want every student even if they have not taken the exam.

(b) HR records + Payroll records, only employees in both: Inner join. You only want complete records that exist in both systems.

(c) Auditing two inventory systems for completeness: Outer join. You need to see every item from either system to find gaps.

(d) Customer list enriched with optional third-party demographics: Left join. Your customer list is primary; the external data supplements it.

Exercise 9.3 --- Split-apply-combine

When you run df.groupby('department')['salary'].mean():

  1. Split: Pandas divides the DataFrame into separate groups based on unique values in the department column. If there are 5 departments, there are 5 groups.
  2. Apply: For each group, pandas computes the mean of the salary column.
  3. Combine: Pandas collects the per-department means (one value per group) into a single Series, with department names as the index and mean salaries as the values.
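The three stages on a toy payroll table (departments and salaries are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT", "HR"],
    "salary":     [50000,   60000,   70000, 90000, 55000],
})

means = df.groupby("department")["salary"].mean()
print(means["Sales"])   # 55000.0 -- mean of 50000 and 60000
print(means["IT"])      # 80000.0 -- mean of 70000 and 90000
```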

Exercise 9.5 (typical melt/pivot exercise)

To convert wide-format quarterly revenue data to long format:

long_df = pd.melt(
    wide_df,
    id_vars=["Company"],
    value_vars=["Q1_Revenue", "Q2_Revenue", "Q3_Revenue", "Q4_Revenue"],
    var_name="Quarter",
    value_name="Revenue"
)

This produces a DataFrame where each row is one company-quarter combination, with columns "Company", "Quarter", and "Revenue". The long format is better for time-series plotting because plotting libraries expect a single x-axis variable (Quarter) and a single y-axis variable (Revenue).


Chapter 11: Working with Dates, Times, and Time Series

Exercise 11.1 --- Why dates need parsing

String sorting is lexicographic (character by character). The string "12/01/2023" (December 1) sorts before "2/15/2023" (February 15) because the character "1" comes before "2". But chronologically, February comes before December.

After parsing to datetime objects, pandas sorts by actual chronological order. As strings, the problem appears whenever a later month's leading character compares as smaller: "10", "11", and "12" all start with "1", so they sort before months "2" through "9".
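The contrast is easy to demonstrate:

```python
import pandas as pd

dates = ["12/01/2023", "2/15/2023", "7/04/2023"]

# Lexicographic: "1" < "2" < "7", so December lands first
print(sorted(dates))
# ['12/01/2023', '2/15/2023', '7/04/2023']

# Chronological, after parsing with an explicit format
parsed = pd.to_datetime(pd.Series(dates), format="%m/%d/%Y")
print(parsed.sort_values().dt.strftime("%m/%d/%Y").tolist())
# ['02/15/2023', '07/04/2023', '12/01/2023']
```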

Exercise 11.3 --- Resampling vs. groupby

.resample() is a time-aware version of .groupby(). Both split data into groups and apply aggregate functions. Use resample when grouping by time periods (weeks, months, quarters) because: (1) it requires a DatetimeIndex, enforcing proper date handling; (2) it handles incomplete periods correctly; (3) it supports upsampling (inserting missing time periods); and (4) it uses standardized frequency aliases ("M", "W", "Q").

You can approximate resample with groupby(df["date"].dt.month), but resample is more robust and idiomatic for time-series work.
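A minimal resample sketch; the daily values are made up, and "MS" is the month-start frequency alias:

```python
import pandas as pd

# 59 daily readings: Jan 1 through Feb 28, 2023
idx = pd.date_range("2023-01-01", periods=59, freq="D")
ts = pd.Series(range(59), index=idx)

monthly = ts.resample("MS").sum()
print(len(monthly))      # 2 -- one row per calendar month
print(monthly.iloc[0])   # 465 -- sum of January's values 0..30
```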


Chapter 14: The Grammar of Graphics

Exercise 14.1 --- Identifying components

(a) Bar chart of average life expectancy by continent, bars colored by continent:
  • Data: Country-level dataset, aggregated to continent means.
  • Aesthetics: x = continent, y = mean life expectancy, color = continent.
  • Geom: Bar.
  • Scale: y-axis linear starting at 0; categorical x-axis; distinct color per continent.
  • Coordinates: Cartesian.
  • Faceting: None.

(b) Scatter plot of GDP per capita vs. CO2 emissions, point size = population:
  • Data: Country-level dataset, one row per country.
  • Aesthetics: x = GDP per capita, y = CO2 per capita, size = population.
  • Geom: Point (circle).
  • Scale: x and y linear (or log for GDP); size proportional to population.
  • Coordinates: Cartesian.
  • Faceting: None.

(c) Six histograms of vaccination rates by WHO region:
  • Data: Country-level vaccination rates.
  • Aesthetics: x = vaccination rate (binned), y = count.
  • Geom: Bar (histogram bars).
  • Scale: x linear, y linear (count).
  • Coordinates: Cartesian.
  • Faceting: By WHO region (6 panels).

Exercise 14.3 (typical chart-type selection exercise)

  • Change over time: line chart (emphasizes trends and continuity).
  • Comparing categories: bar chart (emphasizes magnitude differences).
  • Relationship between two continuous variables: scatter plot (reveals correlation, clusters, outliers).
  • Distribution of a single variable: histogram (shows shape, center, spread).
  • Composition of a whole: stacked bar chart or pie chart (though pie charts are often discouraged because humans compare angles poorly).


Chapter 19: Descriptive Statistics

Exercise 19.1 --- Mean vs. median

The newspaper's $425,000 is the mean; the agent's $310,000 is the median. Home prices are right-skewed --- most homes sell for moderate prices, but a few luxury homes sell for much more. Those expensive outliers pull the mean upward, above the median.

For a first-time homebuyer, the median is more informative because it represents what a "typical" home costs. The mean is inflated by homes the buyer likely cannot afford. This is a textbook example of why the mean can be misleading for skewed distributions.

Exercise 19.3 --- Choosing the right measure

(a) Bell-shaped test scores: Mean and standard deviation. Symmetric distributions are well-summarized by mean and SD.

(b) CEO compensation: Median and IQR. Executive pay is extremely right-skewed, so mean and SD would be inflated by a few enormous compensation packages.

(c) ER visits per day: Mean and standard deviation (if approximately symmetric) or median and IQR (if notably skewed). Inspect the distribution first.

(d) Web server response times: Median and IQR. Response times are typically right-skewed with a long tail of slow requests. The median captures "typical" performance better than the mean, which is pulled up by outlier slow responses.

Exercise 19.5 (typical computation exercise)

Given values: 12, 15, 18, 22, 25, 28, 30, 35, 40, 95.

Mean: (12 + 15 + 18 + 22 + 25 + 28 + 30 + 35 + 40 + 95) / 10 = 320 / 10 = 32.0.

Median: With 10 values, the median is the average of the 5th and 6th values when sorted: (25 + 28) / 2 = 26.5.

Observation: The mean (32.0) is notably higher than the median (26.5), pulled up by the outlier value of 95. This suggests the distribution is right-skewed. The median is the more representative measure of center here.
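The standard library agrees with the hand computation:

```python
import statistics

values = [12, 15, 18, 22, 25, 28, 30, 35, 40, 95]
print(float(statistics.mean(values)))   # 32.0
print(statistics.median(values))        # 26.5
```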


Chapter 22: Sampling, Estimation, and Confidence Intervals

Exercise 22.1 --- Population vs. sample identification

  1. University mental health survey:
       • Population: All students at the university.
       • Sample: The 400 randomly selected students.
       • Parameter: True proportion of all students who use the mental health center.
       • Statistic: Proportion in the sample who use the center.

  2. Battery factory:
       • Population: All 12,000 batteries produced that day.
       • Sample: The 50 tested batteries.
       • Parameter: True average lifetime of all 12,000 batteries.
       • Statistic: Average lifetime of the 50 tested batteries.

  3. Political poll:
       • Population: All registered voters in Ohio.
       • Sample: The 1,200 polled voters.
       • Parameter: True proportion of all Ohio registered voters who support the measure.
       • Statistic: Proportion in the sample who support it.

  4. Blood pressure study:
       • Population: All patients (present and future) who could take this medication.
       • Sample: The 80 patients studied.
       • Parameter: True average blood pressure effect of the medication.
       • Statistic: Average effect observed in the 80 patients.

Exercise 22.3 --- Standard error reasoning

  1. Doubling the sample size does not halve the standard error. Because $SE = \sigma / \sqrt{n}$, doubling $n$ only reduces SE by a factor of $\sqrt{2} \approx 1.41$. To actually halve the SE, you need to quadruple the sample size.

  2. A sample of 100 gives a smaller SE than a sample of 25. With $\sigma = 10$: $SE_{25} = 10/\sqrt{25} = 2.0$ and $SE_{100} = 10/\sqrt{100} = 1.0$.

  3. Researcher B (n=200) will have the narrower confidence interval. By approximately how much? $SE_A / SE_B = \sqrt{200} / \sqrt{50} = \sqrt{4} = 2$. So Researcher B's CI will be about half the width of Researcher A's.

  4. Increasing sample size has diminishing returns because SE decreases with $\sqrt{n}$, not $n$. Going from n=25 to n=100 (a 4x increase) only halves the SE. Going from n=100 to n=400 (another 4x increase) halves it again. Each halving requires quadrupling the sample.
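The arithmetic above in a few lines, using the exercise's sigma = 10:

```python
import math

def se(n, sigma=10):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

print(se(25))                      # 2.0
print(se(100))                     # 1.0 -- quadrupling n halves the SE
print(round(se(50) / se(200), 6))  # 2.0 -- Researcher A's SE is twice B's
```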


Chapter 25: What Is a Model?

Exercise 25.1 --- Models as simplifications

  1. Pizza delivery time: Include: distance to customer, time of day (traffic), order backlog. Ignore: driver's shoe size, what music they are listening to, the pizza flavor. The simplification lets the app give a useful estimate without modeling everything about the universe.

  2. Student graduation: Include: GPA, credits completed, major, enrollment status (full/part-time). Ignore: favorite color, number of library visits, preferred study snack. The simplification helps advisors identify at-risk students early enough to intervene.

  3. Tomorrow's temperature: Include: today's temperature, season, historical averages, approaching weather fronts. Ignore: the exact state of every air molecule, butterfly wing flaps. The simplification makes prediction computationally feasible while remaining useful for planning.

Exercise 25.3 --- Supervised vs. unsupervised

  1. Predicting house prices from square footage and location: Supervised (regression). Features: square footage, location. Target: price.

  2. Grouping customers by purchasing behavior: Unsupervised (clustering). The model seeks natural groupings --- there is no predefined "correct" cluster label.

  3. Classifying emails as spam or not spam: Supervised (classification). Features: email content, sender, metadata. Target: spam/not-spam label.

  4. Reducing 50 variables to 5 components: Unsupervised (dimensionality reduction, e.g., PCA). The model seeks the most informative low-dimensional representation of the data.

  5. Predicting diabetes from blood tests: Supervised (classification). Features: blood test results. Target: diabetes/no-diabetes diagnosis.

Exercise 25.5 (typical train-test split exercise)

If you evaluate a model on the same data used to train it, you will get an overly optimistic estimate of performance. The model has already "seen" the answers and may have memorized noise specific to that data. This is like a student who memorizes the answer key --- they will score perfectly on that test but poorly on a new one. The train-test split estimates how the model will perform on data it has never seen, which is what actually matters in practice.

The standard approach: split data into ~70-80% training and ~20-30% test. Train the model on the training set only. Evaluate on the test set. Report the test set performance as your estimate of real-world accuracy.
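A minimal hand-rolled 80/20 split using only the standard library; real projects typically reach for scikit-learn's train_test_split instead:

```python
import random

data = list(range(100))     # stand-in for 100 labeled records
random.seed(0)              # reproducible shuffle
random.shuffle(data)

cut = int(len(data) * 0.8)  # 80/20 split point
train, test = data[:cut], data[cut:]

print(len(train), len(test))        # 80 20
print(set(train).isdisjoint(test))  # True -- no overlap, no leakage
```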


Solutions are intended for self-study and self-assessment. If you are using this book in a course, your instructor may assign different exercises or require different formats. The reasoning matters more than the specific answers --- if your reasoning is sound and your answer differs, discuss the discrepancy with an instructor or study partner.