Chapter 19 Exercises: Descriptive Statistics
How to use these exercises: Work through the sections in order. Each section builds on the previous one, moving from recall toward original analysis. For coding exercises, work in a Jupyter notebook and keep your code — you'll extend it in future chapters.
Difficulty key: Foundational | Intermediate | Advanced | Extension
Part A: Conceptual Understanding (Foundational)
Exercise 19.1 — Mean vs. median
A local newspaper reports that "the average price of homes sold in our county last month was $425,000." A real estate agent tells you "the typical home in our county sold for about $310,000." Both are using the same dataset. Explain how both statements can be true simultaneously, and which statistic each person is likely using. Which is more informative for a first-time homebuyer, and why?
Guidance
The newspaper is reporting the mean, and the agent is reporting the median. Home prices are right-skewed — most homes sell for modest prices, but a few luxury homes sell for much more. Those expensive homes pull the mean up above the median. For a first-time homebuyer, the median is more useful because it better represents what a "typical" home costs. The mean is inflated by houses the buyer can't afford anyway.
Exercise 19.2 — Shape recognition
For each of the following datasets, predict whether the distribution would be symmetric, right-skewed, or left-skewed. Explain your reasoning.
(a) Ages of students in a university introductory course (b) Annual household income in the United States (c) Scores on a very difficult exam (most students struggle) (d) Heights of adult women in centimeters (e) Number of Instagram followers per account (f) Wait times at a doctor's office (in minutes)
Guidance
(a) **Right-skewed.** Most students are 18-22, but some non-traditional students are 30+, 40+, or older. (b) **Right-skewed.** Most households earn moderate incomes, but a long tail of very high earners extends rightward. (c) **Right-skewed.** Most students score low, with a few doing well — creating a tail to the right. (d) **Approximately symmetric.** Heights tend to follow a bell-shaped distribution. (e) **Extremely right-skewed.** Most accounts have few followers; a tiny fraction have millions. (f) **Right-skewed.** Most waits are short to moderate, but some patients wait a very long time.
Exercise 19.3 — Choosing the right measure
For each scenario below, state whether you'd report the mean or median as the measure of center, and whether you'd report the standard deviation or IQR as the measure of spread. Explain each choice in one sentence.
(a) Test scores in a large class where the distribution is approximately bell-shaped (b) Annual CEO compensation at Fortune 500 companies (c) Number of emergency room visits per day at a busy hospital (d) Response times (in milliseconds) for a web server — most are fast but some are very slow
Guidance
(a) **Mean and standard deviation.** The distribution is symmetric, so mean and SD work well. (b) **Median and IQR.** CEO pay is extremely right-skewed, so mean and SD would be distorted. (c) **Mean and standard deviation** (likely, unless there are extreme days — check the histogram first). (d) **Median and IQR.** Response times are typically right-skewed with occasional very slow outliers.
Exercise 19.4 — Five-number summary interpretation
A dataset of exam scores has the following five-number summary: Min = 32, Q1 = 65, Median = 78, Q3 = 88, Max = 100.
(a) What percentage of students scored above 65? (b) What is the IQR? (c) Is the distribution symmetric, right-skewed, or left-skewed? How can you tell? (d) Using the IQR fence method, what is the lower fence? Would a score of 32 be classified as an outlier? (e) What would this look like as a box plot? Sketch one (or describe it in words).
Guidance
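Part (d)'s fence arithmetic is easy to verify with a few lines of code (the five numbers come straight from the exercise statement):

```python
# Five-number summary from the exercise statement
minimum, q1, median, q3, maximum = 32, 65, 78, 88, 100

iqr = q3 - q1                  # 88 - 65 = 23
lower_fence = q1 - 1.5 * iqr   # 65 - 34.5 = 30.5
upper_fence = q3 + 1.5 * iqr   # 88 + 34.5 = 122.5

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("Is 32 an outlier?", minimum < lower_fence)  # False -- just inside the fence
```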
(a) 75% scored above 65 (Q1 marks the 25th percentile, so 75% of scores are above it). (b) IQR = Q3 - Q1 = 88 - 65 = 23. (c) Left-skewed. The distance from Min to Median (46 points) is much larger than from Median to Max (22 points). The left tail is longer. (d) Lower fence = Q1 - 1.5 * IQR = 65 - 34.5 = 30.5. A score of 32 is above 30.5, so it is NOT an outlier by this method (barely). (e) The box would span from 65 to 88, with a line at 78 (the median). The left whisker extends down toward 32 and the right whisker up to 100. The lower half of the box (Q1 to median, 13 points) is wider than the upper half (median to Q3, 10 points), and the left whisker is far longer than the right — both signs of left skew.
Exercise 19.5 — Outlier impact
Consider this dataset: {10, 12, 11, 13, 12, 14, 11, 12, 13, 100}.
(a) Compute the mean and median. (b) Remove the value 100 and recompute. How much did each measure change? (c) Which measure was more affected by the outlier? Why? (d) Compute the standard deviation and IQR both with and without the outlier. Which spread measure was more affected?
Guidance
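A sketch of the computations (quartiles below use NumPy's default linear interpolation, so other textbook conventions may give slightly different IQRs):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 12, 13, 100])
clean = data[data != 100]  # same dataset with the outlier removed

for label, d in [("with 100", data), ("without 100", clean)]:
    q1, q3 = np.percentile(d, [25, 75])
    print(f"{label}: mean={d.mean():.1f}, median={np.median(d):.1f}, "
          f"SD={d.std(ddof=1):.1f}, IQR={q3 - q1:.2f}")
```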
(a) With 100: Mean = 20.8, Median = 12.0. (b) Without 100: Mean = 12.0, Median = 12.0. (c) The mean changed by 8.8; the median didn't change at all. The mean is more sensitive to extreme values because it uses every value in its calculation. (d) With 100: SD ≈ 27.9, IQR = 1.75. Without 100: SD ≈ 1.2, IQR = 2.0 (sample SD and linearly interpolated quartiles, the pandas/NumPy defaults; other quartile conventions give slightly different IQRs). The standard deviation grew by a factor of more than 20, while the IQR barely moved.
Exercise 19.6 — Z-score interpretation
Jordan is comparing their scores in two different classes. In Statistics, they scored 82, where the class mean was 75 and the standard deviation was 8. In Sociology, they scored 88, where the class mean was 83 and the standard deviation was 3.
(a) Compute Jordan's z-score in each class. (b) In which class did Jordan perform relatively better, compared to their classmates? (c) If a z-score of 2.0 or above is "exceptional," did Jordan achieve exceptional performance in either class?
Guidance
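The two z-scores in code:

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above the mean."""
    return (x - mean) / sd

stats_z = z_score(82, 75, 8)  # 0.875
soc_z = z_score(88, 83, 3)    # ~1.67
print(f"Statistics: z = {stats_z:.3f}, Sociology: z = {soc_z:.2f}")
```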
(a) Statistics: z = (82 - 75) / 8 = 0.875. Sociology: z = (88 - 83) / 3 ≈ 1.67. (b) Jordan performed relatively better in Sociology (z ≈ 1.67 vs. z = 0.875), even though the raw score difference from the mean was larger in Statistics. The Sociology class had less variability, so being 5 points above the mean was more unusual. (c) Neither score crosses the z = 2.0 threshold, so no — but the Sociology score is close.
Exercise 19.7 — Bimodality
You're analyzing commute times for employees at a company. The mean commute is 35 minutes and the standard deviation is 18 minutes. But when you look at the histogram, you see two distinct peaks — one around 15 minutes and another around 55 minutes.
(a) Why is the mean of 35 minutes a poor summary of this data? (b) What might explain the bimodal pattern? (c) What would you recommend doing instead of reporting a single mean?
Guidance
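A quick simulation makes the "mean in the valley" problem concrete (the group sizes and spreads here are illustrative assumptions, not from the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)
near = rng.normal(15, 4, 500)  # employees living close to the office
far = rng.normal(55, 6, 500)   # employees with long commutes
commutes = np.concatenate([near, far])

print(f"overall mean: {commutes.mean():.1f} min")  # close to 35
# Hardly anyone actually commutes near the overall mean:
share_near_mean = np.mean(np.abs(commutes - commutes.mean()) < 3)
print(f"share within 3 min of the mean: {share_near_mean:.1%}")
```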
(a) The mean of 35 falls in the valley between the two peaks — it's a "typical" value that doesn't describe anyone. Very few employees actually commute around 35 minutes. (b) The two peaks likely represent two groups: employees who live near the office (short commute) and employees who live in a different city or suburb (long commute). Other explanations: different office locations, different transportation modes. (c) Report statistics separately for each group. "Employees fall into two groups: those with short commutes (mean ≈ 15 min) and those with long commutes (mean ≈ 55 min)." Also show the histogram.
Exercise 19.8 — Anscombe's lesson
In your own words, explain the lesson of Anscombe's Quartet. Give a real-world example where relying solely on summary statistics (without plotting) could lead to a wrong conclusion.
Guidance
Anscombe's Quartet shows four datasets with identical means, standard deviations, correlations, and regression lines — but completely different patterns when plotted. The lesson: summary statistics can hide crucial structure in data. Always visualize. Real-world example: A company tracking customer satisfaction scores over time might see a stable mean of 4.0/5.0 every quarter. But plotting the data reveals that scores are becoming bimodal — more 1s and 5s, fewer 3s and 4s — meaning satisfaction is becoming polarized even as the average stays the same.
Part B: Applied Coding (Intermediate)
Exercise 19.9 — Complete descriptive profile
Using pandas and NumPy, write a function descriptive_profile(data) that takes a pandas Series and returns a dictionary with the following: mean, median, mode, standard deviation, variance, range, IQR, Q1, Q3, skewness, and kurtosis. Test it on the following data: [23, 45, 12, 67, 89, 34, 56, 78, 23, 45, 67, 90, 12, 34, 56].
Guidance
```python
import pandas as pd

def descriptive_profile(data):
    s = pd.Series(data) if not isinstance(data, pd.Series) else data
    return {
        'mean': s.mean(),
        'median': s.median(),
        'mode': s.mode().tolist(),  # a list, since there may be ties
        'std_dev': s.std(),
        'variance': s.var(),
        'range': s.max() - s.min(),
        'IQR': s.quantile(0.75) - s.quantile(0.25),
        'Q1': s.quantile(0.25),
        'Q3': s.quantile(0.75),
        'skewness': s.skew(),
        'kurtosis': s.kurt()
    }
```
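As a sanity check, a few of the expected values for the test data can be computed directly with pandas. Note that this dataset is multimodal: six values each appear twice, so the mode is a six-element list.

```python
import pandas as pd

s = pd.Series([23, 45, 12, 67, 89, 34, 56, 78, 23, 45, 67, 90, 12, 34, 56])

print(s.mean())                             # ~48.73
print(s.median())                           # 45.0
print(s.quantile(0.75) - s.quantile(0.25))  # IQR = 38.5
print(s.mode().tolist())                    # six tied modes
```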
Exercise 19.10 — Visualizing center and spread
Create a figure with three subplots showing three different distributions (generate them using np.random): (a) a symmetric distribution, (b) a right-skewed distribution, and (c) a bimodal distribution. On each histogram, draw vertical lines for the mean and median. Below each plot, add a text annotation showing the mean, median, and standard deviation. Use this to visually confirm the relationship between skewness and the mean-median gap.
Guidance
Use `np.random.normal()` for symmetric, `np.random.exponential()` for right-skewed, and a mixture of two `np.random.normal()` calls for bimodal. Use `ax.axvline()` for the vertical lines and `ax.text()` or `ax.set_xlabel()` for annotations.
Exercise 19.11 — Outlier detection function
Write a function detect_outliers(data, method='iqr') that returns outliers using either the IQR fence method (default) or the z-score method (when method='zscore'). The function should return a DataFrame with columns: value, is_outlier, and detail (explaining why it's an outlier — e.g., "z-score = 3.2, exceeds threshold of 2.0" or "below lower fence of 15.5"). Test it on: [12, 14, 13, 15, 14, 200, 13, 14, 15, 16, 14, -50].
Guidance
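One possible shape for the function — a sketch, not the only reasonable design; the column names and the z threshold of 2 come from the exercise statement:

```python
import pandas as pd

def detect_outliers(data, method='iqr'):
    s = pd.Series(data, dtype=float)
    if method == 'iqr':
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        lo = q1 - 1.5 * (q3 - q1)
        hi = q3 + 1.5 * (q3 - q1)
        is_outlier = (s < lo) | (s > hi)
        detail = [f"below lower fence of {lo:.1f}" if v < lo
                  else f"above upper fence of {hi:.1f}" if v > hi else ""
                  for v in s]
    else:  # method == 'zscore'
        z = (s - s.mean()) / s.std()
        is_outlier = z.abs() > 2
        detail = [f"z-score = {v:.2f}, exceeds threshold of 2.0" if abs(v) > 2 else ""
                  for v in z]
    return pd.DataFrame({'value': s, 'is_outlier': is_outlier, 'detail': detail})

result = detect_outliers([12, 14, 13, 15, 14, 200, 13, 14, 15, 16, 14, -50])
print(result[result.is_outlier])
```

Try both methods on the test data: the IQR fences catch both 200 and -50, while the z-score method can miss -50 because the outliers themselves inflate the standard deviation.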
For the IQR method, compute Q1, Q3, IQR, and fences. For z-score, compute z = (x - mean) / std and flag values with |z| > 2. Return a DataFrame so results are easy to read. Both 200 and -50 are flagged by the IQR method. The z-score method is less reliable here: the two outliers inflate the standard deviation so much that -50 may not exceed |z| = 2 (a phenomenon known as masking), which is itself a useful lesson about robustness.
Exercise 19.12 — Grouped descriptive statistics
Load (or simulate) a dataset with at least 200 rows, a numerical column (e.g., scores, prices, or rates), and a categorical grouping column (e.g., region, category, or group). Compute grouped descriptive statistics using .groupby() and .agg(). Include mean, median, std, skewness, count, min, and max for each group. Then create side-by-side box plots. Write a paragraph interpreting your findings.
Guidance
```python
# Example using simulated data
grouped = df.groupby('group')['value'].agg(
    ['count', 'mean', 'median', 'std', 'min', 'max']
)
# Add skewness separately -- pandas groupby has a built-in .skew()
grouped['skewness'] = df.groupby('group')['value'].skew()
```
For box plots, use `df.boxplot(column='value', by='group')` or seaborn's `sns.boxplot()`.
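If you need data to practice on, a simulated frame like this works (the group names and distribution parameters are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    'group': np.repeat(['A', 'B', 'C'], 100),
    'value': np.concatenate([
        rng.normal(50, 5, 100),    # A: symmetric, tight
        rng.normal(70, 15, 100),   # B: symmetric, wide
        rng.exponential(20, 100),  # C: right-skewed
    ]),
})

grouped = df.groupby('group')['value'].agg(['count', 'mean', 'median', 'std'])
grouped['skewness'] = df.groupby('group')['value'].skew()
print(grouped)
```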
Exercise 19.13 — Standard deviation intuition builder
Generate three datasets, each with 1000 values and a mean of 50, but with standard deviations of 2, 10, and 25. Plot all three as overlapping histograms (use alpha=0.5 for transparency). What do you observe? Then for each dataset, compute what percentage of values fall within 1, 2, and 3 standard deviations of the mean. How do these percentages compare to the 68-95-99.7 rule you may have heard of?
Guidance
```python
import numpy as np

np.random.seed(42)
d1 = np.random.normal(50, 2, 1000)
d2 = np.random.normal(50, 10, 1000)
d3 = np.random.normal(50, 25, 1000)

for name, d, sd in [('SD=2', d1, 2), ('SD=10', d2, 10), ('SD=25', d3, 25)]:
    within_1 = np.mean(np.abs(d - 50) <= sd) * 100
    within_2 = np.mean(np.abs(d - 50) <= 2 * sd) * 100
    within_3 = np.mean(np.abs(d - 50) <= 3 * sd) * 100
    print(f"{name}: {within_1:.1f}% within 1 SD, {within_2:.1f}% within 2 SD, "
          f"{within_3:.1f}% within 3 SD")
```
For normal distributions, you should see approximately 68%, 95%, and 99.7% — the empirical rule.
Exercise 19.14 — Comparing describe() across subgroups
Using the Gapminder dataset (or a similar dataset with country-level health and economic indicators), compute .describe() for life expectancy grouped by continent. Identify which continent has the highest median, the most variation (highest IQR), and the most outliers. Explain whether the mean or median is more appropriate for each continent based on skewness.
Guidance
Use pandas `.groupby('continent')['lifeExp'].describe()` and supplement with skewness calculations. Africa typically shows the most variation and right skew in life expectancy. Europe tends to be left-skewed (most countries cluster high with a few lower). Recommendations should match: median + IQR for skewed distributions, mean + SD for symmetric ones.
Exercise 19.15 — The effect of sample size on statistics
Generate a right-skewed distribution (e.g., exponential) with 100,000 values. Then take random samples of size 10, 50, 100, 500, and 5000. For each sample size, compute the mean, median, and standard deviation. Repeat this 1000 times and plot how the estimates converge as sample size increases. What do you observe? Does the mean or median stabilize faster?
Guidance
```python
import numpy as np

population = np.random.exponential(10, 100000)
sample_sizes = [10, 50, 100, 500, 5000]

for n in sample_sizes:
    means = [np.mean(np.random.choice(population, n)) for _ in range(1000)]
    medians = [np.median(np.random.choice(population, n)) for _ in range(1000)]
    print(f"n={n}: mean of means={np.mean(means):.2f}, std of means={np.std(means):.2f}")
    print(f"      mean of medians={np.mean(medians):.2f}, std of medians={np.std(medians):.2f}")
```
Both estimates stabilize as the sample size grows: the spread of the sampled means and medians shrinks roughly in proportion to 1/sqrt(n). Which one stabilizes faster depends on the distribution: for light-tailed data the mean is usually the more efficient estimator, while for heavy-tailed data the median can be the steadier of the two.
Exercise 19.16 — Robust vs. non-robust statistics
Create a "clean" dataset of 100 values drawn from a normal distribution with mean 50 and SD 10. Then create a "contaminated" version by replacing 5 values with extreme outliers (e.g., 500). Compare the following statistics before and after contamination: mean, median, standard deviation, IQR, range. Create a table showing the percentage change for each statistic. Which statistics are robust?
Guidance
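A minimal version of the experiment (exact percentages will vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(50, 10, 100)
contaminated = clean.copy()
contaminated[:5] = 500  # plant five extreme outliers

def summary(d):
    q1, q3 = np.percentile(d, [25, 75])
    return {'mean': d.mean(), 'median': np.median(d),
            'sd': d.std(ddof=1), 'iqr': q3 - q1,
            'range': d.max() - d.min()}

before, after = summary(clean), summary(contaminated)
for stat in before:
    change = 100 * (after[stat] - before[stat]) / before[stat]
    print(f"{stat:>7}: {before[stat]:8.1f} -> {after[stat]:8.1f}  ({change:+6.1f}%)")
```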
The median and IQR should show minimal change (< 5% typically). The mean, standard deviation, and range should all change dramatically. This demonstrates the concept of *robustness* — a robust statistic is one that isn't unduly influenced by a small number of extreme values.
Part C: Synthesis and Real-World Application (Advanced)
Exercise 19.17 — Misleading averages in the news
Find a real news article, blog post, or social media post that reports an "average" (mean) for data that is likely skewed — income, home prices, medical costs, response times, etc. Write a short critique (200-300 words) explaining: (a) Why the mean is likely misleading for this data (b) What measure of center would be more appropriate (c) What additional information (spread, shape, visualization) would make the statistic more informative
Exercise 19.18 — Simpson's Paradox and descriptive statistics
A university reports that the average starting salary for graduates is $65,000, which is higher than a competing university's $60,000. But when you break down by major, the competing university pays more for every single major. How is this possible? (Hint: think about the composition of majors at each university.) Create a simulated dataset that demonstrates this paradox, and compute the overall and by-major means.
Guidance
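A sketch of the simulation (the salary levels and major proportions are illustrative assumptions, chosen so that the paradox appears):

```python
import numpy as np

rng = np.random.default_rng(1)

# University A: mostly engineering majors; University B: mostly humanities.
# B pays MORE within every major, yet A's overall mean comes out higher.
a_eng = rng.normal(80_000, 5_000, 800)
a_hum = rng.normal(40_000, 5_000, 200)
b_eng = rng.normal(85_000, 5_000, 200)
b_hum = rng.normal(45_000, 5_000, 800)

a_all = np.concatenate([a_eng, a_hum])
b_all = np.concatenate([b_eng, b_hum])

print(f"Overall:     A = ${a_all.mean():,.0f}, B = ${b_all.mean():,.0f}")
print(f"Engineering: A = ${a_eng.mean():,.0f}, B = ${b_eng.mean():,.0f}")
print(f"Humanities:  A = ${a_hum.mean():,.0f}, B = ${b_hum.mean():,.0f}")
```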
Simpson's Paradox occurs when a trend that appears in aggregated data reverses when the data is split into groups. In this case, if the first university has a higher proportion of students in high-paying majors (like engineering), its overall average can be higher even if the second university pays more within each major. Simulate with two groups: high-paying major (mean $80k) and low-paying major (mean $40k), with different group proportions at each university.
Exercise 19.19 — Elena's full vaccination analysis
Using the progressive project data (or simulated data), write a complete analysis report that includes: (a) Descriptive statistics for vaccination rates overall and by income group (b) A determination of which summary statistics are most appropriate for each group (based on shape) (c) An outlier analysis identifying any countries with unusually low or high rates given their income group (d) At least three visualizations (histograms, box plots, and one of your choice) (e) A 200-word narrative summary suitable for a non-technical audience
Exercise 19.20 — Comparing two groups rigorously
Marcus wants to compare Saturday sales to weekday sales at his bakery. Generate simulated data for 52 Saturdays and 260 weekdays. Compute descriptive statistics for each group. Create overlapping histograms and side-by-side box plots. Based on the descriptive statistics alone (we haven't learned hypothesis testing yet), what can Marcus conclude? What can he not conclude? Write up your findings in a format Marcus — a non-technical bakery owner — could understand.
Guidance
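A starting point for the simulation (the sales figures are made-up assumptions; pick any numbers that feel plausible for a bakery):

```python
import numpy as np

rng = np.random.default_rng(7)
saturdays = rng.normal(1200, 300, 52)  # 52 Saturdays: higher and more variable
weekdays = rng.normal(800, 150, 260)   # 260 weekdays

for name, d in [("Saturday", saturdays), ("Weekday", weekdays)]:
    q1, q3 = np.percentile(d, [25, 75])
    print(f"{name}: mean={d.mean():.0f}, median={np.median(d):.0f}, "
          f"SD={d.std(ddof=1):.0f}, IQR={q3 - q1:.0f}")
```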
Marcus can describe the *observed* differences — e.g., "Saturday sales were higher on average and more variable." He CANNOT conclude that the difference is statistically significant or that it's not due to random chance — that requires the inferential statistics tools from Chapters 22-23. This exercise builds the bridge to hypothesis testing by showing where descriptive statistics end and inferential statistics begin.
Exercise 19.21 — Building a descriptive statistics dashboard
Create a function stats_dashboard(df, numeric_col, group_col=None) that produces a complete descriptive statistics report. The function should:
(a) Print a formatted table of statistics (mean, median, mode, std, IQR, skewness, kurtosis)
(b) Create a figure with: a histogram with mean/median lines, a box plot, and (if group_col is provided) grouped box plots
(c) Print an automated interpretation (e.g., "The distribution is right-skewed. Consider using the median rather than the mean.")
(d) Flag any detected outliers
Test it on at least two different datasets.
Guidance
This is a larger project that synthesizes everything from the chapter. Focus on making the output readable and the function reusable. Consider using f-strings for formatted output and `matplotlib.gridspec` for complex figure layouts.
Exercise 19.22 — Historical data investigation
The Gapminder dataset contains GDP per capita for countries over time. Pick a single year and compute descriptive statistics for GDP per capita across all countries. Then pick a single country and compute descriptive statistics for its GDP per capita over time. How do the two analyses differ in what they're measuring? What different questions do they answer?
Exercise 19.23 — Percentile-based analysis
In education, test scores are often reported as percentiles rather than raw scores. Explain the difference between "scoring 85 points" and "scoring at the 85th percentile." A student scores at the 90th percentile on a math test and the 75th percentile on a reading test. Can you conclude they are better at math than reading? Why or why not?
Guidance
Scoring 85 points is a raw score; scoring at the 85th percentile means 85% of test-takers scored lower. A percentile depends on the whole distribution of scores, not just the raw number. You cannot directly compare percentiles across different tests because the distributions may differ — the 90th percentile in math might represent a smaller absolute gap from the average than the 75th percentile in reading if the math scores have less variation.
Part D: Extension Problems (Challenge)
Exercise 19.24 — Weighted means
Marcus's bakery sells three products: croissants (profit margin 40%, volume 500/month), sourdough loaves (profit margin 25%, volume 200/month), and custom cakes (profit margin 60%, volume 30/month). Compute the simple average profit margin and the volume-weighted average profit margin. Which gives a more accurate picture of the bakery's overall profitability? Implement a Python function for weighted mean.
Guidance
Simple average: (40 + 25 + 60) / 3 = 41.7%. Weighted average: (40*500 + 25*200 + 60*30) / (500 + 200 + 30) = (20000 + 5000 + 1800) / 730 ≈ 36.7%. The weighted average better reflects reality because most items sold are croissants.

```python
import numpy as np

def weighted_mean(values, weights):
    return np.average(values, weights=weights)
```
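Applying NumPy directly to Marcus's numbers confirms the hand calculation:

```python
import numpy as np

margins = np.array([40.0, 25.0, 60.0])
volumes = np.array([500, 200, 30])

print(f"simple mean:   {margins.mean():.1f}%")                       # 41.7%
print(f"weighted mean: {np.average(margins, weights=volumes):.1f}%") # 36.7%
```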
Exercise 19.25 — Geometric mean
The annual returns for a stock over 5 years are: +20%, -10%, +15%, +5%, -5%. Compute the arithmetic mean return and the geometric mean return. Which gives a more accurate picture of the investment's actual performance? When should you use the geometric mean instead of the arithmetic mean?
Guidance
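The calculation in code, with a compounding sanity check:

```python
import numpy as np

returns = np.array([0.20, -0.10, 0.15, 0.05, -0.05])
arithmetic = returns.mean()
geometric = np.prod(1 + returns) ** (1 / len(returns)) - 1

print(f"arithmetic mean: {arithmetic:.1%}")  # 5.0%
print(f"geometric mean:  {geometric:.2%}")   # ~4.4%
# $1 compounded at the geometric rate reproduces the actual 5-year growth:
print((1 + geometric) ** 5, np.prod(1 + returns))
```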
Arithmetic mean: (20 + (-10) + 15 + 5 + (-5)) / 5 = 5%. Geometric mean: [(1.20)(0.90)(1.15)(1.05)(0.95)]^(1/5) - 1 ≈ 4.4%. The geometric mean is more accurate for growth rates because it accounts for compounding. Use the geometric mean for rates of change, growth rates, and financial returns.
Exercise 19.26 — Trimmed mean
A trimmed mean removes a percentage of the most extreme values before computing the mean. It's a compromise between the mean (uses everything) and the median (ignores everything except the middle). Using scipy.stats.trim_mean(), compute the 10% trimmed mean and 20% trimmed mean for a right-skewed dataset. Compare these to the regular mean and the median. When might a trimmed mean be preferable to either?
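Guidance
A sketch of the comparison on simulated right-skewed data (note that `scipy.stats.trim_mean(data, 0.1)` cuts 10% from *each* tail before averaging):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.exponential(10, 1000)  # right-skewed

print(f"mean:             {data.mean():.2f}")
print(f"10% trimmed mean: {stats.trim_mean(data, 0.1):.2f}")
print(f"20% trimmed mean: {stats.trim_mean(data, 0.2):.2f}")
print(f"median:           {np.median(data):.2f}")
```

For right-skewed data the trimmed means fall between the median and the mean: more trimming pulls the estimate closer to the median.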
Exercise 19.27 — Coefficient of variation
The coefficient of variation (CV) is the standard deviation divided by the mean, expressed as a percentage. It lets you compare variability across datasets with different scales. Elena's vaccination data has a mean of 72% with SD of 15%, while GDP per capita has a mean of $15,000 with SD of $20,000. Compute the CV for each. Which variable is relatively more variable?
Guidance
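The computation as a tiny helper:

```python
def cv(mean, sd):
    """Coefficient of variation, as a percentage of the mean."""
    return sd / mean * 100

vaccination_cv = cv(72, 15)      # ~20.8
gdp_cv = cv(15_000, 20_000)      # ~133.3
print(f"vaccination: {vaccination_cv:.1f}%, GDP per capita: {gdp_cv:.1f}%")
```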
CV for vaccination: (15/72) * 100 = 20.8%. CV for GDP: (20000/15000) * 100 = 133.3%. GDP per capita is far more variable relative to its mean. The CV is useful when comparing the variability of measurements on different scales.
Exercise 19.28 — Recreating Anscombe's Quartet
Look up the exact values of Anscombe's Quartet (all four datasets) and recreate the classic four-panel figure. For each dataset, compute and display: mean of x, mean of y, standard deviation of x, standard deviation of y, and correlation between x and y. Verify that all four datasets produce nearly identical statistics despite their visual differences.
Exercise 19.29 — The Datasaurus Dozen
Research the "Datasaurus Dozen" — a modern extension of Anscombe's Quartet created by Justin Matejka and George Fitzmaurice, building on Alberto Cairo's "Datasaurus" drawing. What point does it make beyond what Anscombe already showed? If you can find the dataset (it's freely available), recreate the visualization.
Exercise 19.30 — Design your own misleading statistic
Create a dataset (at least 50 values) where the mean, median, and mode are all very different from each other. Plot the distribution and explain what real-world scenario could produce data like this. Then write two "headlines" using this data — one that uses the mean and another that uses the median — showing how the choice of statistic can tell opposite stories.
Guidance
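One way to build such a dataset (all numbers are illustrative; tune them until the three measures separate clearly):

```python
import numpy as np

rng = np.random.default_rng(3)
non_donors = np.zeros(40)                                  # mode: $0
typical = rng.normal(20, 5, 55)                            # cluster near $20 sets the median
majors = np.array([5_000, 8_000, 10_000, 12_000, 20_000])  # a few huge gifts inflate the mean
donations = np.concatenate([non_donors, typical, majors])

print(f"mode ~ 0, median = {np.median(donations):.0f}, mean = {np.mean(donations):.0f}")
```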
One approach: Create a dataset with many values at 0 (making the mode 0), a cluster around 20 (pulling the median toward 20), and a few extreme values at 500+ (pulling the mean much higher). Example real-world scenario: donation amounts to a charity — most people give $0 (don't donate), the typical donor gives $20, but a few major donors give $10,000+.
Reflection
After completing these exercises, you should be comfortable computing and interpreting all the major descriptive statistics. Before moving on, make sure you can:
- [ ] Explain the difference between mean and median, and when to use each
- [ ] Compute and interpret standard deviation and IQR
- [ ] Recognize symmetric, skewed, and bimodal distributions
- [ ] Detect outliers using both the IQR method and z-scores
- [ ] Write a clear, non-technical interpretation of a set of descriptive statistics
- [ ] Use pandas methods (`.describe()`, `.groupby().agg()`, etc.) fluently
If any of these feel shaky, revisit the relevant section before moving to Chapter 20.