Chapter 6 Quiz: Your First Data Analysis
Instructions: This quiz tests your understanding of Chapter 6. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For short answer questions, aim for 2-4 clear sentences. Total points: 100.
Section 1: Multiple Choice (8 questions, 5 points each)
Question 1. What is the primary purpose of exploratory data analysis (EDA)?
- (A) To build a machine learning model that makes predictions
- (B) To discover patterns, spot anomalies, check assumptions, and generate hypotheses through systematic examination of a dataset
- (C) To prove that a specific hypothesis is correct
- (D) To clean the data and remove all missing values
Answer
**Correct: (B)**

- **(A)** describes modeling, which comes *after* EDA in the data science lifecycle. EDA doesn't build models — it informs whether and how to build them.
- **(B)** captures EDA's exploratory, hypothesis-generating nature. As John Tukey emphasized, EDA is about discovering what the data has to tell you, not confirming what you already believe.
- **(C)** is backwards. EDA generates hypotheses; it doesn't prove them. Hypothesis testing (Chapter 23) is a separate step.
- **(D)** is one activity within EDA, but it's not the primary purpose. Data quality assessment is one component of a broader exploration.

Question 2. You load a CSV file using `csv.DictReader` and examine the first row. The value for a column called "population" is '1250000'. What is the data type of this value in Python?
- (A) `int`
- (B) `float`
- (C) `str`
- (D) `NoneType`
Answer
**Correct: (C)** `csv.DictReader` reads *all* values as strings, regardless of what they look like. The value `'1250000'` is the string `"1250000"`, not the integer `1250000`. To use it in arithmetic, you must explicitly convert it with `int()` or `float()`. This is one of the most common gotchas for beginners working with CSV data.

Question 3. Which of the following is the best example of a specific, answerable EDA question?
- (A) "What's going on with this dataset?"
- (B) "Why do some countries have lower vaccination rates?"
- (C) "What is the median vaccination coverage for MCV1 across all countries in 2022?"
- (D) "Is this data any good?"
Answer
**Correct: (C)**

- **(A)** is too vague — "going on" doesn't specify what to compute or examine.
- **(B)** asks "why," which requires causal reasoning beyond what's typically in a dataset.
- **(C)** specifies the exact statistic (median), the exact column (vaccination coverage), the exact filter criteria (MCV1, 2022), and the scope (all countries). You could write code to answer this immediately.
- **(D)** is subjective — "good" isn't a measurable property without defining criteria.

Question 4. When computing the mean of a column that contains some empty strings, what should you do?
- (A) Replace all empty strings with zero and include them in the calculation
- (B) Skip empty strings and compute the mean using only the non-empty numeric values
- (C) Replace all empty strings with the mean of the other values
- (D) Stop the analysis because the data is invalid
Answer
**Correct: (B)**

- **(A)** would bias the mean downward, because you'd be treating "unknown" as "zero" — which is a very different thing.
- **(B)** is the safest default. By excluding missing values, you compute the mean of the data you actually have. You should also report how many values were skipped.
- **(C)** is a valid imputation strategy in some contexts (Chapter 8 will cover this), but it's not appropriate as a default during initial EDA, and it creates a circular dependency (the mean depends on the imputed values, which depend on the mean).
- **(D)** is too extreme. Most real datasets have some missing values. Stopping the analysis isn't practical.

Question 5. You compute the mean and median of vaccination coverage and find: mean = 82.7%, median = 89.0%. What does this tell you about the distribution?
- (A) The data is normally distributed (symmetric)
- (B) The data is right-skewed (pulled up by a few very high values)
- (C) The data is left-skewed (pulled down by a tail of low values)
- (D) There must be an error in the calculation
Answer
**Correct: (C)** When the median is higher than the mean, the distribution is pulled toward lower values — meaning there's a tail of low outliers dragging the mean down. This is called left skew (or negative skew). In the WHO dataset, most countries have fairly high vaccination coverage (clustering near the median of 89%), but a smaller number of countries have much lower rates (some as low as 6%), which pulls the mean below the median.

Question 6. A data dictionary is:
- (A) A Python dictionary that stores your dataset
- (B) A document describing each column in a dataset, including its meaning, type, and any quality notes
- (C) A function that converts data types
- (D) A database that stores metadata about other databases
Answer
**Correct: (B)** A data dictionary is a reference document — not Python code — that describes each column's name, meaning, expected data type, range of valid values, and any known issues. Creating a data dictionary is one of the first steps in any professional analysis. It ensures that everyone working with the data interprets the columns consistently. Option (A) confuses the term with Python's `dict` data structure — they share the word "dictionary" but are completely different concepts.

Question 7. You find that `target_population` and `doses_administered` are both missing for exactly the same 47.8% of records. This most likely suggests:
- (A) A random bug in the CSV file
- (B) The two fields come from the same reporting process, and some countries don't report either
- (C) The values were intentionally removed for privacy
- (D) The columns are duplicates of each other
Answer
**Correct: (B)** When two columns are missing at exactly the same rate and for the same rows, it strongly suggests they come from the same data collection process. Countries either report both population targets and administered doses (because they have the tracking infrastructure to do so) or report neither. This is an example of MAR (missing at random) — the missingness is related to the country's reporting capacity, not to the values themselves.

Question 8. Which of the following is the most important reason for checking the first and last rows of a dataset?
- (A) To make sure the data is sorted alphabetically
- (B) To catch file corruption, truncation, or unexpected formatting at the beginning or end of the file
- (C) To find the minimum and maximum values
- (D) To determine the data types of each column
Answer
**Correct: (B)** Checking the first few rows confirms that the file loaded correctly and the column structure is as expected. Checking the last few rows catches problems that only appear at the end of a file — truncation (the file was cut off), corruption (garbled characters), or unexpected trailing rows (like summary rows or footnotes that shouldn't be treated as data). Sorting (A), min/max (C), and data types (D) are separate checks that don't require examining the last rows specifically.

Section 2: True/False (3 questions, 5 points each)
Question 9. True or False: An outlier should always be removed from the dataset before computing summary statistics.
Answer
**False.** Outliers should be *investigated*, not automatically removed. Some outliers are genuine extreme values that represent real phenomena — like a country with 6% vaccination coverage due to ongoing conflict. Removing them would misrepresent reality. Other outliers are genuine errors — like a typo that turned 85 into 850. The correct response depends on the context: investigate the outlier, determine whether it's real or erroneous, and then make a documented decision about how to handle it. Automatic removal is almost never the right approach.

Question 10. True or False: Reproducibility means that someone else can run your notebook and get the same results.
Answer
**True.** Reproducibility is the principle that an independent person, using the same data and code, should be able to replicate your analysis and arrive at the same results. This requires that the data is accessible, the code runs in order from top to bottom, dependencies are documented, and any random processes use fixed seeds. Reproducibility is a cornerstone of scientific integrity and professional data science.

Question 11. True or False: When the `csv` module encounters an empty cell in a CSV file, it stores it as Python's `None` value.
Answer
**False.** The `csv` module stores empty cells as empty strings (`""`), not `None`. This distinction matters because `None` and `""` behave differently in Python. For example, `"" == None` evaluates to `False`. When checking for missing values in CSV data, you need to check for empty strings (and possibly strings that are just whitespace), not for `None`.

Section 3: Short Answer (4 questions, 5 points each)
Question 12. Explain the difference between data provenance and a data dictionary. Why are both important?
Answer
**Data provenance** documents *where the data came from*: the source, collection method, date, purpose, and known limitations. It answers "can I trust this data?" and "what might be missing or biased?" A **data dictionary** documents *what the data contains*: column names, meanings, expected types, valid ranges, and quality notes. It answers "what does each column mean?" and "how should I interpret these values?" Both are important because data without context is uninterpretable. Provenance tells you whether the data is trustworthy for your purposes. The data dictionary tells you how to correctly use the data. Together, they form the documentation that makes an analysis reproducible and verifiable.

Question 13. A colleague says, "I always compute the mean of my data. That tells me everything I need to know about the distribution." Explain, with a concrete example, why this is insufficient.
Answer
The mean alone can be highly misleading because it doesn't capture the *shape* of the distribution. For example, consider two countries: Country A has vaccination rates of [90, 91, 89, 90, 90] (mean = 90), and Country B has rates of [50, 50, 100, 100, 100] (mean = 80). The means suggest Country A is doing well and Country B is somewhat lower. But Country B has a split pattern — some vaccines are at 100% while others are at 50% — which the mean hides completely. At minimum, you also need the median (to detect skew), the min and max (to understand the range), and ideally a measure of spread like standard deviation. These together give a much more complete picture than the mean alone.

Question 14. What is the difference between MCAR, MAR, and MNAR missing data? Which type is most problematic for analysis, and why?
Answer
- **MCAR (Missing Completely at Random):** The missingness has no relationship to any variable. A random glitch deleted some values.
- **MAR (Missing at Random):** The missingness is related to an *observed* variable. Smaller countries may be less likely to report.
- **MNAR (Missing Not at Random):** The missingness is related to the *unobserved value itself*. Countries with low coverage may choose not to report.

**MNAR is the most problematic** because it systematically biases your analysis. If countries with low vaccination rates don't report, your computed average will be higher than reality — you're only averaging the countries that reported, which are disproportionately the higher-coverage ones. MCAR is the least problematic because the missing values don't bias the remaining data in any direction.

Question 15. Describe the "notebook narrative" concept in 3-4 sentences. How does it differ from a code-only notebook?
Answer
A notebook narrative treats a Jupyter notebook as a communication document, not just a code container. It interleaves code cells with Markdown cells that explain the analyst's reasoning: what question each section addresses, what the code does, what the output means, and what conclusions to draw. Unlike a code-only notebook, which requires the reader to reverse-engineer the analyst's intent from raw code and output, a narrative notebook guides the reader through a logical story — from question to evidence to conclusion. This makes the analysis accessible to non-programmers, easier to review for errors, and more valuable as a reference for future work.

Section 4: Applied Scenarios (3 questions, 5 points each)
Question 16. You load a dataset of employee salaries and compute these statistics:
```
Count: 500
Min: $28,000
Max: $4,200,000
Mean: $87,500
Median: $62,000
```
Describe three observations you can make from these statistics alone. What would you investigate next?
Answer
Three observations:

1. **There's likely an outlier or extreme value.** The max of $4.2M is 48 times the mean and 68 times the median. This could be a legitimate executive salary or a data entry error (e.g., $42,000 entered as $4,200,000).
2. **The distribution is right-skewed.** The mean ($87,500) is much higher than the median ($62,000), indicating a tail of high-salary employees pulling the mean upward. Most employees earn closer to $62,000.
3. **The range is enormous.** The $4.17M spread between min and max suggests very different types of employees in the same dataset — possibly from entry-level workers to C-suite executives.

**Next steps:** Investigate the max value to determine if it's real or an error. Compute the mean and median *excluding* the top 1% of salaries to see how sensitive the statistics are to extreme values. Break down salary statistics by department or job level to understand the within-group variation.
```
date: 0 missing (0.0%)
high_temp: 12 missing (3.3%)
low_temp: 12 missing (3.3%)
humidity: 45 missing (12.3%)
wind_speed: 45 missing (12.3%)
```
What pattern do you notice? What does it suggest about how the data was collected? What type of missing data is this most likely?
Answer
Two patterns are visible:

1. `high_temp` and `low_temp` are missing at exactly the same rate (3.3%, 12 records each), suggesting they come from the same sensor or recording process. When the temperature sensor was down, both values were lost.
2. `humidity` and `wind_speed` are missing at exactly the same rate (12.3%, 45 records each), suggesting they share a different sensor or process — one that was down more frequently than the temperature sensor.

This is most likely **MAR (Missing at Random)** — the missingness is related to equipment availability, not to the actual temperature or humidity values. The data is probably missing on days when a sensor malfunctioned or the station was offline, which is likely unrelated to the weather conditions themselves. However, if equipment tends to fail during extreme weather (very high winds, extreme temperatures), it could be **MNAR** — the very conditions you want to measure are the ones most likely to be missing.

Question 18. A data journalist is writing a story with the headline: "Average Vaccination Rate Drops to 82% — Global Health in Crisis." You know from your analysis that the median is 89%. Write a 3-4 sentence response explaining why the headline might be misleading and what additional context would make it more accurate.
Answer
The headline uses the mean (82%) rather than the median (89%), which may give a misleading impression of the "typical" country's vaccination status. Because the distribution is left-skewed — a smaller number of countries have very low rates that pull the mean down — the mean understates what most countries are actually achieving. A more accurate headline might say "Most Countries Maintain High Vaccination Rates, but a Handful Fall Dramatically Behind," because the median of 89% shows that at least half of all countries have coverage of 89% or higher. The journalist should also specify which vaccine(s) and time period the number refers to, since "vaccination rate" without context could mean many different things.

Section 5: Code Analysis (2 questions, 5 points each)
Question 19. Consider this code:
```python
total = 0
count = 0
for row in data:
    total += int(row["coverage_pct"])
    count += 1
average = total / count
```
This code has a bug that will likely cause it to crash on the WHO dataset. Identify the bug, explain what error it will produce, and write a corrected version.
Answer
**Bug:** The code calls `int(row["coverage_pct"])` on every row, but some rows have empty strings (`""`) for `coverage_pct`. Calling `int("")` raises `ValueError: invalid literal for int() with base 10: ''`. And even for rows that do have a value, `int()` fails on decimal strings — `int("85.5")` also raises a `ValueError` — so `float()` is the right conversion for percentage values.

**Corrected version:**

```python
total = 0
count = 0
for row in data:
    value = row["coverage_pct"].strip()
    if value != "":
        try:
            total += float(value)
            count += 1
        except ValueError:
            pass  # Skip non-numeric values
average = total / count if count > 0 else None
```
The fix: (1) check for empty strings before conversion, (2) use `float()` instead of `int()`, (3) wrap in try/except for safety, and (4) guard against division by zero.
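Both gotchas behind this bug — `csv.DictReader` returning every value as a string, and empty cells coming back as `""` — can be verified with a small self-contained sketch. The sample rows here are invented for illustration, not taken from the WHO file:

```python
import csv
import io

# Invented sample standing in for the WHO CSV (note Chad's empty cell)
sample = "country,coverage_pct\nNorway,97\nChad,\nBrazil,85.5\n"
rows = list(csv.DictReader(io.StringIO(sample)))

print(type(rows[0]["coverage_pct"]))   # <class 'str'> — never int, even for "97"
print(repr(rows[1]["coverage_pct"]))   # '' — an empty cell is "", not None

# int() raises ValueError on both the empty string and the decimal string;
# float() handles the decimal but still fails on "".
for text in ("", "85.5"):
    try:
        print(int(text))
    except ValueError:
        print(f"int({text!r}) raised ValueError")
```

Running this makes the failure mode concrete before you write the guarded conversion above.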
Question 20. What does this code compute? Describe its output in plain English.
```python
result = {}
for row in data:
    region = row["region"]
    val = row["coverage_pct"].strip()
    if val == "":
        if region not in result:
            result[region] = 0
        result[region] += 1
for region, count in sorted(result.items()):
    print(f"{region}: {count}")
```
Answer
This code counts the number of **missing coverage values** (empty strings) in each WHO region and prints the results sorted alphabetically by region code. In plain English: "For each WHO region, how many records are missing a vaccination coverage percentage?" The output would look something like:

```
AFRO: 35
AMRO: 15
EMRO: 9
EURO: 15
SEARO: 6
WPRO: 3
```

(Exact numbers depend on the dataset.)
Note: There's a subtle issue in the code — the `if region not in result` initialization only happens inside the `if val == ""` block. This means regions where *no* values are missing won't appear in the output at all. A more robust version would initialize all regions to 0 before the loop, or use `result[region] = result.get(region, 0) + 1` instead.
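The more robust version described in the note might look like the following sketch. The `data` rows below are invented placeholders for the loaded WHO records, and `setdefault` is one way to seed every region so that regions with zero missing values still appear:

```python
# Hypothetical sample rows standing in for the loaded WHO data
data = [
    {"region": "AFRO", "coverage_pct": ""},
    {"region": "AFRO", "coverage_pct": "67"},
    {"region": "EURO", "coverage_pct": "95"},
]

# Count missing coverage values per region; setdefault registers the
# region the first time it is seen, so a region with no missing values
# still shows up with a count of 0.
result = {}
for row in data:
    result.setdefault(row["region"], 0)
    if row["coverage_pct"].strip() == "":
        result[row["region"]] += 1

for region, count in sorted(result.items()):
    print(f"{region}: {count}")
# Prints:
# AFRO: 1
# EURO: 0
```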