Chapter 7 Quiz: Introduction to pandas
Instructions: This quiz tests your understanding of Chapter 7. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For code analysis questions, predict the output without running the code. Total points: 100.
Section 1: Multiple Choice (8 questions, 5 points each)
Question 1. What is the correct way to import pandas?
- (A) `import pandas`
- (B) `import pandas as pd`
- (C) `from pandas import *`
- (D) Both (A) and (B) work, but (B) is the universal convention
Answer
**Correct: (D)**
- **(A)** works technically — `pandas.DataFrame(...)` is valid Python — but virtually nobody writes it this way. You'd have to type `pandas` in full every time.
- **(B)** is the universal convention used by the pandas documentation, tutorials, books, and the overwhelming majority of data scientists. The alias `pd` is so standard that writing `import pandas` without it would confuse experienced readers.
- **(C)** is bad practice for any library. Wildcard imports pollute the namespace and make it unclear where functions come from.
- **(D)** is correct: both work, but (B) is the convention you should follow.
Question 2. What type does `df["coverage_pct"]` return?
- (A) A Python list
- (B) A pandas DataFrame
- (C) A pandas Series
- (D) A NumPy array
Answer
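One way to check this in a live session (a tiny hypothetical frame; the column names are invented):

```python
import pandas as pd

# Tiny hypothetical frame for illustration
df = pd.DataFrame({"country": ["Kenya", "Peru"], "coverage_pct": [92.0, 85.0]})

single = df["coverage_pct"]     # single brackets -> Series
several = df[["coverage_pct"]]  # double brackets -> DataFrame, even for one column

print(type(single).__name__)    # Series
print(type(several).__name__)   # DataFrame
```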
**Correct: (C)** Selecting a single column from a DataFrame using bracket notation returns a **Series** — a one-dimensional labeled array. Selecting *multiple* columns (with double brackets like `df[["col1", "col2"]]`) returns a DataFrame. This distinction is important because Series and DataFrame have different methods and behaviors.
Question 3. Which of the following correctly filters a DataFrame to show only rows where coverage is above 90%?
- (A) `df.filter(coverage_pct > 90)`
- (B) `df[df["coverage_pct"] > 90]`
- (C) `df.where("coverage_pct" > 90)`
- (D) `df.select(df.coverage_pct > 90)`
Answer
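A quick sketch of the boolean-indexing pattern (sample data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"country": ["Kenya", "Peru", "Laos"],
                   "coverage_pct": [92, 85, 97]})

mask = df["coverage_pct"] > 90  # boolean Series: [True, False, True]
high = df[mask]                 # keeps only the rows where the mask is True

print(list(high["country"]))    # ['Kenya', 'Laos']
```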
**Correct: (B)**
- **(A)** is not valid pandas syntax. `df.filter()` exists but is used for filtering by column/index labels, not by values.
- **(B)** uses boolean indexing: `df["coverage_pct"] > 90` creates a boolean mask, and passing it into `df[...]` selects only the `True` rows.
- **(C)** is not valid — `"coverage_pct" > 90` compares a string to an integer, which is a Python comparison, not a pandas operation.
- **(D)** `df.select()` does not exist as a standard pandas method.
Question 4. What is the key difference between `df.iloc[1:4]` and `df.loc[1:4]`?
- (A) `iloc` is faster than `loc`
- (B) `iloc` uses exclusive end (returns rows at positions 1, 2, 3), while `loc` uses inclusive end (returns rows with labels 1, 2, 3, 4)
- (C) `iloc` works with numbers and `loc` works with strings
- (D) There is no difference when the index is the default integer index
Answer
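A minimal sketch of the difference, using a default 0-based integer index (values invented):

```python
import pandas as pd

df = pd.DataFrame({"val": [10, 20, 30, 40, 50]})  # default index 0..4

by_position = df.iloc[1:4]  # positions 1, 2, 3 -> three rows (exclusive end)
by_label = df.loc[1:4]      # labels 1 through 4 -> four rows (inclusive end)

print(len(by_position), len(by_label))  # 3 4
```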
**Correct: (B)**
- **(A)** is not the key difference; performance is similar for most operations.
- **(B)** is correct. `iloc` follows Python's standard slicing convention (exclusive end), so `iloc[1:4]` returns rows at positions 1, 2, and 3. `loc` uses label-based slicing with an inclusive end, so `loc[1:4]` returns rows with index labels 1, 2, 3, and 4 — that's four rows, not three.
- **(C)** is an oversimplification. `loc` works with any index type (including integers); it uses labels, which can be integers.
- **(D)** is wrong precisely because of the inclusive vs. exclusive end behavior.
Question 5. What does `df["coverage_pct"].apply(some_function)` do?
- (A) Applies the function to the entire column at once
- (B) Calls the function once for each value in the column, returning a new Series of results
- (C) Modifies the column in place
- (D) Filters the column to keep only values where the function returns True
Answer
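A small illustration (the function and values are invented):

```python
import pandas as pd

s = pd.Series([80, 95, 70])

def label(pct):
    # Called once per value in the Series
    return "high" if pct >= 90 else "low"

result = s.apply(label)  # new Series of return values; s itself is unchanged
print(list(result))      # ['low', 'high', 'low']
print(list(s))           # [80, 95, 70]
```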
**Correct: (B)** `apply()` calls the given function on *each individual value* in the Series, one at a time, and returns a new Series containing all the return values. It does not modify the original Series in place (C is wrong). It is not the same as a true vectorized operation that processes the whole column at once (A is wrong) — under the hood, `apply` still iterates. And it doesn't filter (D is wrong) — it transforms.
Question 6. You write `df[df["region"] == "AFRO" and df["year"] == 2022]`. What happens?
- (A) It correctly filters for AFRO region in 2022
- (B) It raises a `SyntaxError`
- (C) It raises a `ValueError` about ambiguous truth values
- (D) It returns an empty DataFrame
Answer
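A runnable sketch of both the failure and the fix (sample data invented):

```python
import pandas as pd

df = pd.DataFrame({"region": ["AFRO", "AFRO", "EURO"],
                   "year": [2022, 2021, 2022]})

try:
    df[(df["region"] == "AFRO") and (df["year"] == 2022)]  # `and` -> ValueError
    err = None
except ValueError as exc:
    err = exc

ok = df[(df["region"] == "AFRO") & (df["year"] == 2022)]   # `&` works element-wise
print(err is not None, len(ok))  # True 1
```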
**Correct: (C)** Python's `and` operator tries to evaluate each operand as a single boolean value. But `df["region"] == "AFRO"` is a Series with thousands of True/False values — it can't be reduced to a single True or False. pandas raises `ValueError: The truth value of a Series is ambiguous`. The fix is to use `&` (element-wise and) with parentheses: `df[(df["region"] == "AFRO") & (df["year"] == 2022)]`.
Question 7. What does `pd.read_csv()` do with empty cells in a CSV file?
- (A) Stores them as empty strings `""`
- (B) Stores them as the integer `0`
- (C) Stores them as `NaN` (Not a Number)
- (D) Raises an error and refuses to load the file
Answer
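You can demonstrate this without a file on disk by reading CSV text from a `StringIO` buffer (contents invented):

```python
import pandas as pd
from io import StringIO

csv_text = "country,coverage_pct\nKenya,92\nPeru,\nLaos,97\n"  # Peru's cell is empty
df = pd.read_csv(StringIO(csv_text))

print(df["coverage_pct"].isna().sum())  # 1 -- the empty cell became NaN
print(df["coverage_pct"].mean())        # 94.5 -- NaN is skipped automatically
```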
**Correct: (C)** This is one of the most important behavioral differences between pandas and Python's `csv` module. The `csv` module stores empty cells as empty strings (`""`), which crash arithmetic operations. pandas stores them as `NaN`, which propagates safely through computations and is automatically excluded from statistical methods like `.mean()` and `.count()`.
Question 8. What is the purpose of `.reset_index(drop=True)` after sorting or filtering?
- (A) It removes the index column entirely
- (B) It creates a new clean 0-based index and discards the old one
- (C) It sorts the DataFrame by its index values
- (D) It converts the index to a regular column
Answer
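A short demonstration with invented scores:

```python
import pandas as pd

df = pd.DataFrame({"score": [88, 95, 72, 91]})
top = df[df["score"] > 80].sort_values("score")

print(list(top.index))     # [0, 3, 1] -- old labels survive filtering and sorting
clean = top.reset_index(drop=True)
print(list(clean.index))   # [0, 1, 2]

kept = top.reset_index()   # without drop=True, the old index becomes a column
print(list(kept.columns))  # ['index', 'score']
```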
**Correct: (B)** After sorting or filtering, the index values may be out of order (e.g., 3, 1, 4, 0, 2). `.reset_index(drop=True)` replaces these with a clean 0, 1, 2, 3, ... sequence. The `drop=True` parameter prevents the old index from being added as a new column. Without `drop=True`, the old index values would be saved as an additional column named "index."
Section 2: True/False (3 questions, 5 points each)
Question 9. True or False: `df.country` and `df["country"]` always produce the same result.
Answer
**False.** They produce the same result *most of the time*, but dot notation fails in several cases: (1) when the column name contains spaces or special characters, (2) when the column name conflicts with a DataFrame method or attribute (e.g., a column named "count" or "shape"), and (3) when the column name is stored in a variable. For these reasons, bracket notation is generally preferred.
Question 10. True or False: A vectorized operation like `df["col"] * 2` is generally faster than using a Python `for` loop to multiply each value by 2.
Answer
**True.** Vectorized operations in pandas use optimized C/Cython code under the hood (via NumPy). For large datasets (thousands to millions of rows), vectorized operations can be 10x to 100x faster than equivalent Python `for` loops because they avoid Python's per-element overhead and leverage CPU-level optimizations.
Question 11. True or False: `df.describe()` includes all columns in its output by default.
Answer
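A quick check (columns invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben"], "score": [90, 75]})

default = df.describe()                  # numeric columns only
everything = df.describe(include="all")  # text columns too

print(list(default.columns))     # ['score']
print(list(everything.columns))  # ['name', 'score']
```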
**False.** By default, `df.describe()` only includes *numeric* columns (int64, float64). Text/object columns are excluded. To include all columns, use `df.describe(include="all")`. This is actually a useful design choice — summary statistics like mean and standard deviation don't make sense for text columns.
Section 3: Short Answer (4 questions, 5 points each)
Question 12. Explain in 2-3 sentences why the `SettingWithCopyWarning` exists and how to avoid it.
Answer
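Both safe patterns in one sketch (data invented):

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 75, 88]})

# Pattern 1: take an explicit copy before modifying a subset
subset = df[df["score"] > 80].copy()
subset["score"] = 0                      # modifies only the copy

# Pattern 2: modify the original directly through .loc
df.loc[df["score"] > 80, "score"] = 100

print(list(df["score"]))      # [100, 75, 100]
print(list(subset["score"]))  # [0, 0]
```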
The `SettingWithCopyWarning` exists because when you filter a DataFrame, pandas may return either a *view* (linked to the original data) or a *copy* (independent). If you modify a view, you might accidentally change the original DataFrame. To avoid it, use `.copy()` when creating a subset you plan to modify (`subset = df[condition].copy()`), or use `.loc` for direct modifications to the original DataFrame.
Question 13. What is the difference between a Series and a DataFrame? Give one example of an operation that returns each.
Answer
A Series is a one-dimensional labeled array (one column of data), while a DataFrame is a two-dimensional labeled table (rows and columns). Selecting a single column returns a Series: `df["country"]`. Selecting multiple columns returns a DataFrame: `df[["country", "region"]]`. A DataFrame can be thought of as a collection of Series sharing the same index.
Question 14. Describe two advantages that NaN has over empty strings for representing missing data.
Answer
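Both advantages in a short sketch (values invented; `np.nan` is how you write NaN by hand):

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0])

shifted = s + 5             # NaN propagates instead of raising: [15.0, NaN, 35.0]
print(s.mean())             # 20.0 -- NaN is excluded from the average
print(s.count())            # 2   -- only non-missing values are counted
print(pd.isna(shifted[1]))  # True
```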
(1) `NaN` propagates safely through arithmetic (`NaN + 5` is `NaN`, not an error), while empty strings cause `TypeError` when used in math. (2) pandas statistical methods like `.mean()`, `.sum()`, and `.count()` automatically exclude `NaN` values, while empty strings require manual filtering before any computation. This eliminates the need for try/except blocks and explicit "skip if empty" logic.
Question 15. What does the phrase "thinking in vectors rather than loops" mean? Why is it considered a threshold concept?
Answer
"Thinking in vectors" means operating on entire columns at once (e.g., `df["col"] * 2`) rather than writing loops to process values one at a time. It's a threshold concept because it represents a fundamental shift in how you approach data problems — once you internalize it, you can't go back to loop-based thinking, and it unlocks dramatically more concise and efficient code. It changes not just what you write, but how you *think* about data operations.Section 4: Code Analysis (2 questions, 5 points each)
Question 16. Predict the exact output of the following code:
```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cal"],
    "score": [90, 75, 88]
})
result = df[df["score"] > 80].sort_values("score")
print(result)
```
Answer
```
  name  score
2  Cal     88
0  Ana     90
```
Step by step: (1) `df["score"] > 80` produces the mask `[True, False, True]`. (2) Filtering keeps rows 0 (Ana, 90) and 2 (Cal, 88). (3) Sorting by score ascending puts Cal (88) before Ana (90). Note the preserved index values: Cal is still row 2, Ana is still row 0.
Question 17. Predict the output of this code:
```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
print(s[s > 25].sum())
```
Answer
```
120
```
Step by step: (1) `s > 25` produces `[False, False, True, True, True]`. (2) `s[s > 25]` selects the values 30, 40, and 50. (3) `.sum()` adds them: 30 + 40 + 50 = 120.
Section 5: Applied Scenarios (3 questions, 5 points each)
Question 18. Marcus, the coffee shop owner from Chapter 1, has loaded his sales data into a pandas DataFrame called sales with columns: date, product, quantity, unit_price, category. He wants to know which product category generates the most revenue. Write the pandas code he should use. (Assume revenue = quantity * unit_price.)
Answer
```python
sales["revenue"] = sales["quantity"] * sales["unit_price"]
print(sales.groupby("category")["revenue"].sum().sort_values(ascending=False))
```
First, create a computed column for revenue using vectorized multiplication. Then group by category, sum the revenue within each group, and sort to see the highest first. This combines three Chapter 7 skills: creating computed columns, groupby (previewed in the chapter), and sorting.
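A sketch with a few invented sales rows to show the shape of the output:

```python
import pandas as pd

# Invented sample of Marcus's sales data
sales = pd.DataFrame({
    "product": ["latte", "muffin", "espresso", "scone"],
    "category": ["drinks", "food", "drinks", "food"],
    "quantity": [3, 2, 5, 1],
    "unit_price": [4.0, 3.0, 2.5, 3.5],
})

sales["revenue"] = sales["quantity"] * sales["unit_price"]
by_category = sales.groupby("category")["revenue"].sum().sort_values(ascending=False)
print(by_category)  # drinks 24.5, food 9.5
```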
Question 19. You load a CSV file and see this output from df.dtypes:
student_id int64
name object
grade object
gpa float64
The `grade` column contains values like "A", "B", "C", "D", "F". A colleague suggests converting it to a numeric type. Explain why this would be a mistake and what `object` means in this context.
Answer
The `grade` column contains letter grades (categorical text data), not numbers. Converting it to a numeric type would either fail (because "A" can't become a float) or lose the meaning of the data. The `object` dtype in pandas typically represents text/string data. It's the correct type for categorical text values like letter grades. If you wanted to do ordered comparisons, you could later convert it to a pandas `Categorical` type, but `object` is the appropriate default.
Question 20. Your teammate writes this code and asks why it's slow on a DataFrame with 2 million rows:
```python
results = []
for i in range(len(df)):
    row = df.iloc[i]
    if row["score"] > 80:
        results.append(row["name"])
```
- Explain why this is slow.
- Rewrite it using pandas best practices (one or two lines).
Answer
1. **Why it's slow:** This code iterates through every row using a Python `for` loop with `iloc`, which is extremely slow for large DataFrames. Each `iloc[i]` call has overhead, and Python loops are inherently slower than vectorized operations because they can't leverage the optimized C code that pandas uses internally.
2. **Rewritten:**
```python
results = df[df["score"] > 80]["name"]
```
Or if you need a plain Python list: `results = df.loc[df["score"] > 80, "name"].tolist()`
This uses boolean indexing (vectorized), which processes the entire column at once using optimized code and is typically 50-100x faster than the loop version for 2 million rows.
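Side by side, both versions produce the same result on a small invented frame (timing the 2-million-row case is left as an exercise):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben", "Cal"], "score": [90, 75, 88]})

# Loop version: slow at scale
loop_results = []
for i in range(len(df)):
    row = df.iloc[i]
    if row["score"] > 80:
        loop_results.append(row["name"])

# Vectorized version: one boolean-indexing expression
vec_results = df.loc[df["score"] > 80, "name"].tolist()

print(loop_results == vec_results)  # True
```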