Chapter 11 Quiz: Loading and Exploring Real Business Datasets

Instructions: Choose the best answer for each multiple-choice question. For short-answer questions, write a complete response. The answer key appears at the end.

Time suggested: 25–35 minutes. Total questions: 20.


Part A: Multiple Choice (Questions 1–13)

Question 1

You receive a CSV file from a European colleague. When you try to open it with pd.read_csv("data.csv"), the load either fails or the first column shows garbled text where "é" should appear. What is the most likely cause and fix?

A) The file uses semicolons as separators; add sep=";"
B) The file has metadata rows at the top; add header=2
C) The file uses Latin-1 encoding; add encoding="latin-1"
D) The date column was not parsed; add parse_dates=["date"]


Question 2

A financial analyst exports a file where the revenue column contains values like "$1,234.56". After loading with pd.read_csv(), what will df["revenue"].dtype show?

A) float64
B) int64
C) object
D) datetime64[ns]


Question 3

You have a DataFrame with 500 rows. After running df.info(), the customer_id column shows "432 non-null". How many rows have a missing customer_id?

A) 432
B) 68
C) 500
D) Cannot determine without more information


Question 4

Which of the following pd.read_csv() calls will correctly load only the date, region, and revenue columns from a file that also contains salesperson, product, and quantity columns?

A) pd.read_csv("data.csv", columns=["date", "region", "revenue"])
B) pd.read_csv("data.csv", select=["date", "region", "revenue"])
C) pd.read_csv("data.csv", usecols=["date", "region", "revenue"])
D) pd.read_csv("data.csv", keep=["date", "region", "revenue"])


Question 5

You call df["region"].value_counts() and get:

Northeast    612
Southwest    521
Midwest      489
Southeast    468
West         397

The total is 2,487. Your DataFrame has 2,500 rows. What conclusion can you draw?

A) There are 13 duplicate rows in the dataset
B) There are 13 rows where region is missing (NaN)
C) There are 13 rows with an unknown region
D) The value_counts() method has a bug


Question 6

Which method gives you the most complete single-call overview of a DataFrame, including column names, non-null counts, and data types?

A) df.describe()
B) df.summary()
C) df.dtypes
D) df.info()


Question 7

A product catalog CSV contains a column product_code with values like "00234", "01456", and "09900". If you load the file without any special parameters, what will happen?

A) The values will be loaded as strings: ["00234", "01456", "09900"]
B) The values will be loaded as integers: [234, 1456, 9900]
C) pandas will raise a ValueError because leading zeros are invalid
D) The values will be loaded as floats: [234.0, 1456.0, 9900.0]


Question 8

You want to load an Excel file and read the sheet named "Q2 Sales". Which call is correct?

A) pd.read_excel("report.xlsx", sheet="Q2 Sales")
B) pd.read_excel("report.xlsx", tab="Q2 Sales")
C) pd.read_excel("report.xlsx", sheet_name="Q2 Sales")
D) pd.read_excel("report.xlsx", worksheet="Q2 Sales")


Question 9

You run df.nunique() on your sales DataFrame and the category column shows 6 unique values. You know there should only be 4 product categories. What is the most likely explanation?

A) There are 2 extra columns accidentally included in the file
B) The category column contains the same value spelled multiple ways (e.g., different capitalization)
C) nunique() counts missing values as unique values
D) pandas is counting header rows as unique values


Question 10

The memory_usage(deep=True) call shows that a region column with 100,000 rows uses 7.8 MB, while a units_sold column with the same number of rows uses only 800 KB. What explains this difference?

A) The region column has more missing values
B) String (object) columns store Python objects with significant overhead; numeric columns use fixed-size binary
C) The units_sold column is sorted, which makes it more compact
D) deep=True only measures string columns accurately


Question 11

Which of the following correctly reads only the first 1,000 rows of a large file?

A) pd.read_csv("data.csv", rows=1000)
B) pd.read_csv("data.csv", max_rows=1000)
C) pd.read_csv("data.csv", nrows=1000)
D) pd.read_csv("data.csv", limit=1000)


Question 12

After loading a DataFrame, you run df.describe() and notice that the units_sold column shows count: 980 while the DataFrame has 1,000 rows. What does this indicate?

A) 20 rows have values of 0 in units_sold
B) 20 rows have missing (NaN) values in units_sold
C) describe() only samples 980 rows by default
D) There are 20 duplicate rows in the DataFrame


Question 13

You want to load a CSV where missing values are represented as "TBD" and "-". Which parameter and value should you use?

A) missing=["TBD", "-"]
B) null_values=["TBD", "-"]
C) na_values=["TBD", "-"]
D) empty=["TBD", "-"]


Part B: Short Answer (Questions 14–17)

Question 14

You load a CSV file and check the data types. The order_date column shows dtype: object instead of datetime64[ns].

a) What does this tell you about how the file was loaded?
b) Write the corrected pd.read_csv() call that fixes this.
c) Why does the correct date type matter for downstream analysis?


Question 15

Explain what .value_counts(dropna=False) does and why the dropna=False parameter is important when exploring business data. Give a specific example of when omitting it could lead to a wrong conclusion.


Question 16

You are given a new dataset and asked to "explore it." Write out the 8-step exploration workflow from this chapter as a numbered list. For each step, specify the exact Python command(s) you would run.


Question 17

What is the difference between df.shape, df.info(), and df.describe()? When would you use each one?


Part C: Code Analysis (Questions 18–20)

Question 18

Examine the following code. Identify any issues and explain what each parameter does.

df = pd.read_csv(
    "sales_data.csv",
    sep="\t",
    header=2,
    usecols=[0, 1, 4, 7],
    dtype={"sku": str, "units": int},
    parse_dates=["transaction_date"],
    na_values=["N/A", "--", ""],
    nrows=5000,
    encoding="utf-8"
)

a) Is there any issue with this code? Explain.
b) What does header=2 do specifically?
c) Why might dtype={"sku": str} be critical for the sku column?
d) What would happen if you removed na_values=["N/A", "--", ""]?


Question 19

The following code produces unexpected output. Identify the bug and explain how to fix it.

import pandas as pd

df = pd.read_csv("acme_sales.csv")

# Check category distribution
print(df["category"].value_counts())

Output:

Office Chairs    847
Desks            623
Technology       589
Storage          431
office chairs      7
OFFICE CHAIRS      3

a) What is wrong with this output?
b) Does this code have a bug, or is it working as intended?
c) What should the analyst do next, and in which chapter will that be covered?
d) What would the output look like after the correct fix is applied?


Question 20

Read the following .info() output and answer the diagnostic questions.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   transaction_id  3000 non-null   int64
 1   sale_date       3000 non-null   object
 2   customer        2941 non-null   object
 3   product_sku     3000 non-null   object
 4   quantity        3000 non-null   int64
 5   unit_price      3000 non-null   object
 6   discount        2876 non-null   float64
 7   total           3000 non-null   float64
dtypes: float64(2), int64(2), object(4)
memory usage: 390.7 KB

a) How many rows are missing a customer value?
b) What type issue exists with sale_date? What parameter fixes this at load time?
c) What type issue likely exists with unit_price? What does this suggest about the raw data?
d) How many rows are missing a discount value?
e) transaction_id is int64. Is there any risk in using integer IDs? Explain.
f) Write the corrected pd.read_csv() call that addresses the fixable issues.


Answer Key


Part A: Multiple Choice

Q1: C. The garbled characters indicate a character encoding mismatch; Latin-1/CP1252 is common for files from Windows systems in Western Europe.
Q2: C. The $ and , symbols prevent pandas from interpreting the values as numbers, so the entire column becomes object (string) dtype.
Q3: B. 500 total rows − 432 non-null = 68 missing.
Q4: C. usecols is the correct parameter for selecting specific columns.
Q5: B. value_counts() excludes NaN by default; 2,500 − 2,487 = 13 missing values.
Q6: D. .info() provides the most comprehensive single-call overview, including non-null counts, types, and memory usage.
Q7: B. Without dtype={"product_code": str}, pandas infers the type as integer and strips the leading zeros.
Q8: C. The correct parameter for specifying a sheet in read_excel() is sheet_name.
Q9: B. Case-sensitive string comparison treats "Office Chairs", "office chairs", and "OFFICE CHAIRS" as three different values.
Q10: B. Python string objects carry substantial memory overhead (roughly 50+ bytes per string); numeric arrays use exactly 8 bytes per int64/float64 value.
Q11: C. nrows is the correct parameter for limiting the number of rows loaded.
Q12: B. In .describe(), count shows the number of non-null values; 980 < 1,000 means 20 missing values.
Q13: C. na_values is the correct parameter for specifying additional missing-value representations.
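
The leading-zero behavior behind Q7 and the dtype fix it calls for can be verified in a few lines. This sketch uses io.StringIO as a stand-in for a real file; the product codes are the ones from the question.

```python
import io
import pandas as pd

csv = io.StringIO("product_code\n00234\n01456\n09900")

default = pd.read_csv(csv)   # type inference turns the codes into int64
print(default["product_code"].tolist())    # [234, 1456, 9900]

csv.seek(0)                  # rewind the in-memory "file" and reload
preserved = pd.read_csv(csv, dtype={"product_code": str})
print(preserved["product_code"].tolist())  # ['00234', '01456', '09900']
```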

Part B: Short Answer

Question 14

a) parse_dates=["order_date"] was not used when loading the file. pandas read the date values as plain strings.

b) Corrected call:

df = pd.read_csv("orders.csv", parse_dates=["order_date"])

c) With string dates, you cannot: filter rows by date range using df[df["order_date"] > "2023-06-01"], extract month or year with .dt.month, calculate the number of days between dates, or group by week/month/quarter. All time-based analysis requires actual datetime objects.
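
A minimal sketch of those datetime operations, using an in-memory two-row file (the dates and amounts are invented for illustration):

```python
import io
import pandas as pd

csv = io.StringIO("order_date,amount\n2023-06-15,100\n2023-07-01,250")
df = pd.read_csv(csv, parse_dates=["order_date"])

# Operations that only work once order_date is a real datetime column:
months = df["order_date"].dt.month                 # extract the month
span = (df["order_date"].max() - df["order_date"].min()).days
recent = df[df["order_date"] > "2023-06-20"]       # date-range filter
print(months.tolist(), span, len(recent))          # [6, 7] 16 1
```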


Question 15

value_counts(dropna=False) counts all values in a column, including NaN (missing) values, and includes the NaN count in the output.

Without dropna=False, the count of non-null values may add up to less than the total number of rows, with no indication that missing values exist.

Example: If a region column in a 2,500-row DataFrame has 13 missing values and you run df["region"].value_counts(), you see counts for 5 regions totaling 2,487. You might conclude "there are 2,487 transactions" without realizing 13 are missing a region. Running df["region"].value_counts(dropna=False) shows the NaN row explicitly, making the gap visible.
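
The same effect in miniature (a four-row Series invented for illustration):

```python
import numpy as np
import pandas as pd

region = pd.Series(["Northeast", "Southwest", np.nan, "Northeast"])

print(region.value_counts())              # NaN silently excluded; counts total 3
print(region.value_counts(dropna=False))  # NaN row shown explicitly; counts total 4
```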


Question 16

  1. Load with intent: df = pd.read_csv("file.csv", parse_dates=[...], dtype={...}, na_values=[...])
  2. Check dimensions: print(df.shape)
  3. Visual inspection: print(df.head()) and print(df.tail())
  4. Check data types: print(df.dtypes)
  5. Run .info(): df.info()
  6. Statistical summary: print(df.describe())
  7. Unique value analysis: print(df.nunique()) and df["col"].value_counts(dropna=False) for categorical columns
  8. Missing value analysis: print(df.isna().sum()) and print(df[df.isna().any(axis=1)])
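
The workflow above can be run end to end on a small stand-in dataset (io.StringIO replaces a real file path; column names and values are invented):

```python
import io
import pandas as pd

# In practice, pass a file path instead of this in-memory stand-in
csv = io.StringIO("date,region,revenue\n2023-01-05,West,100\n2023-01-06,,250")

df = pd.read_csv(csv, parse_dates=["date"])       # 1. load with intent
print(df.shape)                                   # 2. check dimensions
print(df.head())                                  # 3. visual inspection
print(df.tail())
print(df.dtypes)                                  # 4. check data types
df.info()                                         # 5. non-null counts, memory
print(df.describe())                              # 6. statistical summary
print(df.nunique())                               # 7. unique value analysis
print(df["region"].value_counts(dropna=False))
print(df.isna().sum())                            # 8. missing value analysis
print(df[df.isna().any(axis=1)])
```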

Question 17

df.shape — Returns a tuple (rows, cols). Use it first to confirm the dataset has the expected dimensions.

df.info() — Shows column names, non-null counts, data types, and total memory usage. Use it to identify missing values and type problems. This is the most comprehensive single-call diagnostic.

df.describe() — Computes statistical summaries (count, mean, std, min, quartiles, max) for numeric columns. Use it to check value ranges and spot outliers. Can include non-numeric columns with include="all".

Use all three routinely; they answer different questions and complement each other.


Part C: Code Analysis

Question 18

a) No bugs. Every parameter is valid pd.read_csv() syntax and serves a clear purpose; one point worth noting is that despite the .csv extension the file is tab-separated, hence sep="\t".

b) header=2 means: skip rows 0 and 1 (which contain metadata) and use row 2 (the third row) as the column header row.

c) dtype={"sku": str} prevents pandas from converting product SKU codes to integers. Many SKUs have leading zeros (e.g., "00123") that would be silently dropped if interpreted as numbers, making the SKU incorrect and unmatchable against a product catalog.

d) Cells containing "--" would no longer be recognized as missing and would be loaded as literal strings (pandas treats "N/A" and empty fields as missing by default, but not "--"). In numeric columns, the surviving "--" strings would force the entire column to object dtype, and calculations on those columns would fail.
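
A quick demonstration of the na_values effect. Because "N/A" and empty cells are already in pandas' default NA list, this sketch uses "--" (the column values are invented):

```python
import io
import pandas as pd

csv = io.StringIO("sku,units\nA1,5\nA2,--\nA3,7")

raw = pd.read_csv(csv)                      # "--" survives as a literal string
print(raw["units"].dtype)                   # object

csv.seek(0)
fixed = pd.read_csv(csv, na_values=["--"])  # "--" now becomes NaN
print(fixed["units"].dtype)                 # float64
```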


Question 19

a) The category column has three representations of the same category: "Office Chairs", "office chairs", and "OFFICE CHAIRS". Any group-by or pivot table will incorrectly split "Office Chairs" into three separate rows.

b) The code is working as intended — pandas is correctly reporting what is in the data. The problem is with the data, not the code.

c) The analyst should standardize the category column — converting all values to a consistent case (e.g., title case) so that all "Office Chairs" variations become "Office Chairs". This will be covered in Chapter 12: Cleaning and Preparing Data for Analysis.

d) After fixing, the output would be:

Office Chairs    857
Desks            623
Technology       589
Storage          431
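
One way the standardization might look, using str.title() as the case-normalization step (a sketch on a small invented Series, not the full dataset):

```python
import pandas as pd

cats = pd.Series(["Office Chairs", "office chairs", "OFFICE CHAIRS", "Desks"])

# Title-case every value so the three spellings collapse into one
normalized = cats.str.title()
print(normalized.value_counts())   # Office Chairs: 3, Desks: 1
```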

Question 20

a) 3000 − 2941 = 59 rows are missing a customer value.

b) sale_date is object dtype — it was loaded as a string, not a datetime. Fix at load time: parse_dates=["sale_date"].

c) unit_price is object dtype when it should be float64. The most likely cause is that the raw data contains currency symbols or thousands separators (e.g., "$1,299.99"). These must be stripped before the column can be converted to numeric.

d) 3000 − 2876 = 124 rows are missing a discount value.

e) Yes. If transaction_id is used as a join key or unique identifier, and at some point an ID like 00001 existed, it would have been stored as integer 1, losing the leading zeros. More importantly, integer IDs cannot represent alphanumeric codes (like "TXN-10001"). If the system might ever add letter prefixes, the column should be stored as object (string).

f) Corrected load call:

df = pd.read_csv(
    "transactions.csv",
    parse_dates=["sale_date"],
    dtype={"transaction_id": str},  # preserve IDs as strings for safety
    # unit_price still needs cleaning after loading: a dtype alone
    # cannot parse "$"-prefixed values with thousands separators
)