Chapter 7 Quiz: Introduction to pandas
Instructions: This quiz tests your understanding of Chapter 7. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For code analysis questions, predict the output without running the code. Total points: 100.
Section 1: Multiple Choice (8 questions, 5 points each)
Question 1. What is the correct way to import pandas?
- (A) `import pandas`
- (B) `import pandas as pd`
- (C) `from pandas import *`
- (D) Both (A) and (B) work, but (B) is the universal convention
Answer
**Correct: (D)**
- **(A)** works technically — `pandas.DataFrame(...)` is valid Python — but virtually nobody writes it this way. You'd have to type `pandas` in full every time.
- **(B)** is the universal convention used by the pandas documentation, tutorials, books, and the overwhelming majority of data scientists. The alias `pd` is so standard that writing `import pandas` without it would confuse experienced readers.
- **(C)** is bad practice for any library. Wildcard imports pollute the namespace and make it unclear where functions come from.
- **(D)** is correct: both work, but (B) is the convention you should follow.
Question 2. What type does `df["coverage_pct"]` return?
- (A) A Python list
- (B) A pandas DataFrame
- (C) A pandas Series
- (D) A NumPy array
Answer
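One way to check this in a live session (a tiny hypothetical frame; the column names are invented):

```python
import pandas as pd

# Tiny hypothetical frame for illustration
df = pd.DataFrame({"country": ["Kenya", "Peru"], "coverage_pct": [92.0, 85.0]})

single = df["coverage_pct"]     # single brackets -> Series
several = df[["coverage_pct"]]  # double brackets -> DataFrame, even for one column

print(type(single).__name__)    # Series
print(type(several).__name__)   # DataFrame
```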
**Correct: (C)** Selecting a single column from a DataFrame using bracket notation returns a **Series** — a one-dimensional labeled array. Selecting *multiple* columns (with double brackets like `df[["col1", "col2"]]`) returns a DataFrame. This distinction is important because Series and DataFrame have different methods and behaviors.
Question 3. Which of the following correctly filters a DataFrame to show only rows where coverage is above 90%?
- (A) `df.filter(coverage_pct > 90)`
- (B) `df[df["coverage_pct"] > 90]`
- (C) `df.where("coverage_pct" > 90)`
- (D) `df.select(df.coverage_pct > 90)`
Answer
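A quick sketch of the boolean-indexing pattern (sample data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"country": ["Kenya", "Peru", "Laos"],
                   "coverage_pct": [92, 85, 97]})

mask = df["coverage_pct"] > 90  # boolean Series: [True, False, True]
high = df[mask]                 # keeps only the rows where the mask is True

print(list(high["country"]))    # ['Kenya', 'Laos']
```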
**Correct: (B)**
- **(A)** is not valid pandas syntax. `df.filter()` exists but is used for filtering by column/index labels, not by values.
- **(B)** uses boolean indexing: `df["coverage_pct"] > 90` creates a boolean mask, and passing it into `df[...]` selects only the `True` rows.
- **(C)** is not valid — `"coverage_pct" > 90` compares a string to an integer, which is a Python comparison, not a pandas operation.
- **(D)** `df.select()` does not exist as a standard pandas method.
Question 4. What is the key difference between `df.iloc[1:4]` and `df.loc[1:4]`?
- (A) `iloc` is faster than `loc`
- (B) `iloc` uses exclusive end (returns rows at positions 1, 2, 3), while `loc` uses inclusive end (returns rows with labels 1, 2, 3, 4)
- (C) `iloc` works with numbers and `loc` works with strings
- (D) There is no difference when the index is the default integer index
Answer
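A minimal sketch of the difference, using a default 0-based integer index (values invented):

```python
import pandas as pd

df = pd.DataFrame({"val": [10, 20, 30, 40, 50]})  # default index 0..4

by_position = df.iloc[1:4]  # positions 1, 2, 3 -> three rows (exclusive end)
by_label = df.loc[1:4]      # labels 1 through 4 -> four rows (inclusive end)

print(len(by_position), len(by_label))  # 3 4
```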
**Correct: (B)**
- **(A)** is not the key difference; performance is similar for most operations.
- **(B)** is correct. `iloc` follows Python's standard slicing convention (exclusive end), so `iloc[1:4]` returns rows at positions 1, 2, and 3. `loc` uses label-based slicing with an inclusive end, so `loc[1:4]` returns rows with index labels 1, 2, 3, and 4 — that's four rows, not three.
- **(C)** is an oversimplification. `loc` works with any index type (including integers); it uses labels, which can be integers.
- **(D)** is wrong precisely because of the inclusive vs. exclusive end behavior.
Question 5. What does `df["coverage_pct"].apply(some_function)` do?
- (A) Applies the function to the entire column at once
- (B) Calls the function once for each value in the column, returning a new Series of results
- (C) Modifies the column in place
- (D) Filters the column to keep only values where the function returns True
Answer
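A small illustration (the function and values are invented):

```python
import pandas as pd

s = pd.Series([80, 95, 70])

def label(pct):
    # Called once per value in the Series
    return "high" if pct >= 90 else "low"

result = s.apply(label)  # new Series of return values; s itself is unchanged
print(list(result))      # ['low', 'high', 'low']
print(list(s))           # [80, 95, 70]
```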
**Correct: (B)** `apply()` calls the given function on *each individual value* in the Series, one at a time, and returns a new Series containing all the return values. It does not modify the original Series in place (C is wrong). It is not the same as a true vectorized operation that processes the whole column at once (A is wrong) — under the hood, `apply` still iterates. And it doesn't filter (D is wrong) — it transforms.
Question 6. You write `df[df["region"] == "AFRO" and df["year"] == 2022]`. What happens?
- (A) It correctly filters for AFRO region in 2022
- (B) It raises a `SyntaxError`
- (C) It raises a `ValueError` about ambiguous truth values
- (D) It returns an empty DataFrame
Answer
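A runnable sketch of both the failure and the fix (sample data invented):

```python
import pandas as pd

df = pd.DataFrame({"region": ["AFRO", "AFRO", "EURO"],
                   "year": [2022, 2021, 2022]})

try:
    df[(df["region"] == "AFRO") and (df["year"] == 2022)]  # `and` -> ValueError
    err = None
except ValueError as exc:
    err = exc

ok = df[(df["region"] == "AFRO") & (df["year"] == 2022)]   # `&` works element-wise
print(err is not None, len(ok))  # True 1
```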
**Correct: (C)** Python's `and` operator tries to evaluate each operand as a single boolean value. But `df["region"] == "AFRO"` is a Series with thousands of True/False values — it can't be reduced to a single True or False. pandas raises `ValueError: The truth value of a Series is ambiguous`. The fix is to use `&` (element-wise and) with parentheses: `df[(df["region"] == "AFRO") & (df["year"] == 2022)]`.
Question 7. What does `pd.read_csv()` do with empty cells in a CSV file?
- (A) Stores them as empty strings `""`
- (B) Stores them as the integer `0`
- (C) Stores them as `NaN` (Not a Number)
- (D) Raises an error and refuses to load the file
Answer
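You can demonstrate this without a file on disk by reading CSV text from a `StringIO` buffer (contents invented):

```python
import pandas as pd
from io import StringIO

csv_text = "country,coverage_pct\nKenya,92\nPeru,\nLaos,97\n"  # Peru's cell is empty
df = pd.read_csv(StringIO(csv_text))

print(df["coverage_pct"].isna().sum())  # 1 -- the empty cell became NaN
print(df["coverage_pct"].mean())        # 94.5 -- NaN is skipped automatically
```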
**Correct: (C)** This is one of the most important behavioral differences between pandas and Python's `csv` module. The `csv` module stores empty cells as empty strings (`""`), which crash arithmetic operations. pandas stores them as `NaN`, which propagates safely through computations and is automatically excluded from statistical methods like `.mean()` and `.count()`.
Question 8. What is the purpose of `.reset_index(drop=True)` after sorting or filtering?
- (A) It removes the index column entirely
- (B) It creates a new clean 0-based index and discards the old one
- (C) It sorts the DataFrame by its index values
- (D) It converts the index to a regular column
Answer
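A short demonstration with invented scores:

```python
import pandas as pd

df = pd.DataFrame({"score": [88, 95, 72, 91]})
top = df[df["score"] > 80].sort_values("score")

print(list(top.index))     # [0, 3, 1] -- old labels survive filtering and sorting
clean = top.reset_index(drop=True)
print(list(clean.index))   # [0, 1, 2]

kept = top.reset_index()   # without drop=True, the old index becomes a column
print(list(kept.columns))  # ['index', 'score']
```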
**Correct: (B)** After sorting or filtering, the index values may be out of order (e.g., 3, 1, 4, 0, 2). `.reset_index(drop=True)` replaces these with a clean 0, 1, 2, 3, ... sequence. The `drop=True` parameter prevents the old index from being added as a new column. Without `drop=True`, the old index values would be saved as an additional column named "index."
Section 2: True/False (3 questions, 5 points each)
Question 9. True or False: `df.country` and `df["country"]` always produce the same result.
Answer
**False.** They produce the same result *most of the time*, but dot notation fails in several cases: (1) when the column name contains spaces or special characters, (2) when the column name conflicts with a DataFrame method or attribute (e.g., a column named "count" or "shape"), and (3) when the column name is stored in a variable. For these reasons, bracket notation is generally preferred.
Question 10. True or False: A vectorized operation like `df["col"] * 2` is generally faster than using a Python `for` loop to multiply each value by 2.
Answer
**True.** Vectorized operations in pandas use optimized C/Cython code under the hood (via NumPy). For large datasets (thousands to millions of rows), vectorized operations can be 10x to 100x faster than equivalent Python `for` loops because they avoid Python's per-element overhead and leverage CPU-level optimizations.
Question 11. True or False: `df.describe()` includes all columns in its output by default.
Answer
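A quick check (columns invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben"], "score": [90, 75]})

default = df.describe()                  # numeric columns only
everything = df.describe(include="all")  # text columns too

print(list(default.columns))     # ['score']
print(list(everything.columns))  # ['name', 'score']
```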
**False.** By default, `df.describe()` only includes *numeric* columns (int64, float64). Text/object columns are excluded. To include all columns, use `df.describe(include="all")`. This is actually a useful design choice — summary statistics like mean and standard deviation don't make sense for text columns.
Section 3: Short Answer (4 questions, 5 points each)
Question 12. Explain in 2-3 sentences why the `SettingWithCopyWarning` exists and how to avoid it.
Answer
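Both safe patterns in one sketch (data invented):

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 75, 88]})

# Pattern 1: take an explicit copy before modifying a subset
subset = df[df["score"] > 80].copy()
subset["score"] = 0                      # modifies only the copy

# Pattern 2: modify the original directly through .loc
df.loc[df["score"] > 80, "score"] = 100

print(list(df["score"]))      # [100, 75, 100]
print(list(subset["score"]))  # [0, 0]
```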
The `SettingWithCopyWarning` exists because when you filter a DataFrame, pandas may return either a *view* (linked to the original data) or a *copy* (independent). If you modify a view, you might accidentally change the original DataFrame. To avoid it, use `.copy()` when creating a subset you plan to modify (`subset = df[condition].copy()`), or use `.loc` for direct modifications to the original DataFrame.
Question 13. What is the difference between a Series and a DataFrame? Give one example of an operation that returns each.
Answer
A Series is a one-dimensional labeled array (one column of data), while a DataFrame is a two-dimensional labeled table (rows and columns). Selecting a single column returns a Series: `df["country"]`. Selecting multiple columns returns a DataFrame: `df[["country", "region"]]`. A DataFrame can be thought of as a collection of Series sharing the same index.
Question 14. Describe two advantages that NaN has over empty strings for representing missing data.
Answer
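Both advantages in a short sketch (values invented; `np.nan` is how you write NaN by hand):

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0])

shifted = s + 5             # NaN propagates instead of raising: [15.0, NaN, 35.0]
print(s.mean())             # 20.0 -- NaN is excluded from the average
print(s.count())            # 2   -- only non-missing values are counted
print(pd.isna(shifted[1]))  # True
```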
(1) `NaN` propagates safely through arithmetic (`NaN + 5` is `NaN`, not an error), while empty strings cause `TypeError` when used in math. (2) pandas statistical methods like `.mean()`, `.sum()`, and `.count()` automatically exclude `NaN` values, while empty strings require manual filtering before any computation. This eliminates the need for try/except blocks and explicit "skip if empty" logic.
Question 15. What does the phrase "thinking in vectors rather than loops" mean? Why is it considered a threshold concept?
Answer
"Thinking in vectors" means operating on entire columns at once (e.g., `df["col"] * 2`) rather than writing loops to process values one at a time. It's a threshold concept because it represents a fundamental shift in how you approach data problems — once you internalize it, you can't go back to loop-based thinking, and it unlocks dramatically more concise and efficient code. It changes not just what you write, but how you *think* about data operations.Section 4: Code Analysis (2 questions, 5 points each)
Question 16. Predict the exact output of the following code:
```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cal"],
    "score": [90, 75, 88]
})
result = df[df["score"] > 80].sort_values("score")
print(result)
```
Answer
```
  name  score
2  Cal     88
0  Ana     90
```
Step by step: (1) `df["score"] > 80` produces the mask `[True, False, True]`. (2) Filtering keeps rows 0 (Ana, 90) and 2 (Cal, 88). (3) Sorting by score ascending puts Cal (88) before Ana (90). Note the preserved index values: Cal is still row 2, Ana is still row 0.
Question 17. Predict the output of this code:
```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
print(s[s > 25].sum())
```
Answer
```
120
```
Step by step: (1) `s > 25` produces `[False, False, True, True, True]`. (2) `s[s > 25]` selects the values 30, 40, and 50. (3) `.sum()` adds them: 30 + 40 + 50 = 120.
Section 5: Applied Scenarios (3 questions, 5 points each)
Question 18. Marcus, the coffee shop owner from Chapter 1, has loaded his sales data into a pandas DataFrame called sales with columns: date, product, quantity, unit_price, category. He wants to know which product category generates the most revenue. Write the pandas code he should use. (Assume revenue = quantity * unit_price.)
Answer
```python
sales["revenue"] = sales["quantity"] * sales["unit_price"]
print(sales.groupby("category")["revenue"].sum().sort_values(ascending=False))
```
First, create a computed column for revenue using vectorized multiplication. Then group by category, sum the revenue within each group, and sort to see the highest first. This combines three Chapter 7 skills: creating computed columns, groupby (previewed in the chapter), and sorting.
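A sketch with a few invented sales rows to show the shape of the output:

```python
import pandas as pd

# Invented sample of Marcus's sales data
sales = pd.DataFrame({
    "product": ["latte", "muffin", "espresso", "scone"],
    "category": ["drinks", "food", "drinks", "food"],
    "quantity": [3, 2, 5, 1],
    "unit_price": [4.0, 3.0, 2.5, 3.5],
})

sales["revenue"] = sales["quantity"] * sales["unit_price"]
by_category = sales.groupby("category")["revenue"].sum().sort_values(ascending=False)
print(by_category)  # drinks 24.5, food 9.5
```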
Question 19. You load a CSV file and see this output from df.dtypes:
student_id int64
name object
grade object
gpa float64
The `grade` column contains values like "A", "B", "C", "D", "F". A colleague suggests converting it to a numeric type. Explain why this would be a mistake and what `object` means in this context.
Answer
The `grade` column contains letter grades (categorical text data), not numbers. Converting it to a numeric type would either fail (because "A" can't become a float) or lose the meaning of the data. The `object` dtype in pandas typically represents text/string data. It's the correct type for categorical text values like letter grades. If you wanted to do ordered comparisons, you could later convert it to a pandas `Categorical` type, but `object` is the appropriate default.
Question 20. Your teammate writes this code and asks why it's slow on a DataFrame with 2 million rows:
```python
results = []
for i in range(len(df)):
    row = df.iloc[i]
    if row["score"] > 80:
        results.append(row["name"])
```
- Explain why this is slow.
- Rewrite it using pandas best practices (one or two lines).
Answer
1. **Why it's slow:** This code iterates through every row using a Python `for` loop with `iloc`, which is extremely slow for large DataFrames. Each `iloc[i]` call has overhead, and Python loops are inherently slower than vectorized operations because they can't leverage the optimized C code that pandas uses internally.
2. **Rewritten:**
```python
results = df[df["score"] > 80]["name"]
```
Or if you need a plain Python list: `results = df.loc[df["score"] > 80, "name"].tolist()`
This uses boolean indexing (vectorized), which processes the entire column at once using optimized code and is typically 50-100x faster than the loop version for 2 million rows.
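Side by side, both versions produce the same result on a small invented frame (timing the 2-million-row case is left as an exercise):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben", "Cal"], "score": [90, 75, 88]})

# Loop version: slow at scale
loop_results = []
for i in range(len(df)):
    row = df.iloc[i]
    if row["score"] > 80:
        loop_results.append(row["name"])

# Vectorized version: one boolean-indexing expression
vec_results = df.loc[df["score"] > 80, "name"].tolist()

print(loop_results == vec_results)  # True
```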