Chapter 12 Quiz: Getting Data from Files
Instructions: This quiz tests your understanding of Chapter 12. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For short answer questions, aim for 2-4 clear sentences. Total points: 100.
Section 1: Multiple Choice (9 questions, 5 points each)
Question 1. You load a CSV file created in France and see garbled characters like cafÃ© instead of café. What is the most likely cause?
- (A) The file is corrupted and needs to be re-downloaded
- (B) The file uses Latin-1 or Windows-1252 encoding, but pandas is reading it as UTF-8
- (C) The CSV delimiter is wrong
- (D) The file has no header row
Answer
**Correct: (B)**
- **(A)** is unlikely. Garbled but consistent character patterns (mojibake) almost always indicate an encoding mismatch, not corruption.
- **(B)** is correct. When a file's bytes are decoded with the wrong encoding, accented characters turn into multi-character junk; `cafÃ©` is the classic signature of this kind of mismatch. The fix is to pass the file's actual encoding, e.g. `pd.read_csv('file.csv', encoding='latin-1')`.
- **(C)** would produce a DataFrame with one column, not garbled characters.
- **(D)** would make the first data row appear as column headers, not produce garbled text.

Question 2. What does the `sep` parameter in `pd.read_csv()` control?
- (A) The character encoding of the file
- (B) Which rows to skip at the beginning of the file
- (C) The character used to separate values within each row
- (D) Whether to include the index column in the output
Answer
**Correct: (C)** The `sep` parameter specifies the delimiter — the character that separates individual values on each line. Common values: `','` (default), `';'` (European CSVs), `'\t'` (tab-separated files), `'|'` (pipe-delimited). Don't confuse `sep` with `encoding` (which controls character encoding) or `skiprows` (which controls row skipping).

Question 3. You want to load only columns "name", "age", and "salary" from a CSV with 50 columns. Which parameter should you use?
- (A) `columns=['name', 'age', 'salary']`
- (B) `usecols=['name', 'age', 'salary']`
- (C) `select=['name', 'age', 'salary']`
- (D) `fields=['name', 'age', 'salary']`
Answer
**Correct: (B)** `usecols` is the parameter that specifies which columns to load. This is especially important for large files — loading only the columns you need saves memory and speeds up loading. The other parameter names don't exist in `pd.read_csv()`.

Question 4. What does `pd.read_excel('file.xlsx', sheet_name=None)` return?
- (A) A single DataFrame from the last sheet
- (B) A list of DataFrames, one per sheet
- (C) A dictionary where keys are sheet names and values are DataFrames
- (D) An error, because `None` is not a valid sheet name
Answer
**Correct: (C)** When `sheet_name=None`, pandas loads *all* sheets and returns a dictionary. The keys are the sheet names (strings) and the values are DataFrames. This is useful when you need to combine data from multiple sheets: `pd.concat(result.values(), ignore_index=True)`. Option (B) is close but wrong — it's a dictionary, not a list.

Question 5. Which library must be installed to read .xlsx files with pandas?
- (A) xlsxwriter
- (B) openpyxl
- (C) xlrd
- (D) pyexcel
Answer
**Correct: (B)** pandas uses `openpyxl` as the default engine for reading `.xlsx` (modern Excel) files. `xlrd` was the old default but now only supports the legacy `.xls` format. `xlsxwriter` is for *writing* Excel files, not reading them. `pyexcel` is a separate library not used by pandas.

Question 6. What is `pd.json_normalize()` primarily used for?
- (A) Converting a DataFrame to JSON format
- (B) Validating that a JSON file has correct syntax
- (C) Flattening nested JSON structures into a tabular DataFrame
- (D) Compressing JSON files to reduce disk space
Answer
**Correct: (C)** `json_normalize()` takes nested JSON (dictionaries within dictionaries) and flattens the hierarchy into a flat table. Nested keys become dot-separated column names (e.g., `address.city`). This is essential for working with API responses and other hierarchical data. It doesn't convert DataFrames to JSON (that's `to_json()`), validate syntax, or handle compression.

Question 7. In the SQL query `SELECT name, gpa FROM students WHERE gpa > 3.5 ORDER BY gpa DESC`, what does `DESC` mean?
- (A) Description — show a description of the GPA column
- (B) Descending — sort from highest to lowest
- (C) Destination — send results to a specific location
- (D) Decrease — subtract from the GPA values
Answer
**Correct: (B)** `DESC` means descending order — from highest to lowest. The opposite is `ASC` (ascending, lowest to highest), which is the default if neither is specified. So `ORDER BY gpa DESC` sorts students with the highest GPA first.

Question 8. What is the correct way to count the number of students in each major using SQL?
- (A) `SELECT major, COUNT(*) FROM students`
- (B) `SELECT major, COUNT(*) FROM students GROUP BY major`
- (C) `SELECT major, SUM(students) FROM students GROUP BY major`
- (D) `SELECT COUNT(major) FROM students`
Answer
**Correct: (B)** `GROUP BY major` groups rows with the same major together, and `COUNT(*)` counts the rows in each group. Option (A) is missing `GROUP BY`, which would cause an error in most SQL databases. Option (C) tries to `SUM(students)`, which doesn't make sense — `students` is the table name, not a numeric column. Option (D) counts total non-null major values across all rows without grouping.

Question 9. Which pandas function is conceptually equivalent to a SQL JOIN?
- (A) `pd.concat()`
- (B) `pd.merge()`
- (C) `pd.pivot_table()`
- (D) `pd.melt()`
Answer
**Correct: (B)** `pd.merge()` combines two DataFrames based on shared column values, just like a SQL JOIN combines two tables on a shared key. `pd.concat()` stacks DataFrames vertically or horizontally (more like SQL's UNION). `pivot_table()` and `melt()` are reshaping operations with no direct SQL JOIN equivalent.

Section 2: True/False (3 questions, 5 points each)
Question 10. True or False: CSV files preserve the data types of columns (e.g., a column saved as integers will be loaded as integers automatically).
Answer
**False.** CSV files store everything as plain text. When pandas loads a CSV, it *infers* data types by examining the values — but this inference can be wrong (e.g., ZIP codes like "02134" become the integer 2134). JSON and databases preserve data types natively; CSV does not.

Question 11. True or False: SQLite databases require a separate database server to be installed and running.
Answer
**False.** SQLite is a "serverless" database — it stores the entire database in a single file and requires no separate server process. Python includes SQLite support through the built-in `sqlite3` module. This is what makes SQLite ideal for learning, prototyping, and small-to-medium applications. Server-based databases like PostgreSQL and MySQL *do* require a running server.

Question 12. True or False: The `json_normalize()` function can only flatten one level of nesting.
Answer
**False.** `json_normalize()` recursively flattens all levels of nesting. A structure like `{"a": {"b": {"c": 1}}}` becomes a column named `a.b.c`. It also supports the `record_path` parameter for expanding nested arrays and `meta` for preserving parent-level fields. The depth of flattening is unlimited.

Section 3: Short Answer (4 questions, 5 points each)
Question 13. You receive a CSV file where the first three rows contain metadata (report title, date generated, author name) and the actual data starts on row 4 with headers. Write the exact pd.read_csv() call to load this correctly.
Answer
```python
df = pd.read_csv('file.csv', skiprows=3)
```
`skiprows=3` skips the first three lines (the metadata), and pandas reads line 4 (0-indexed line 3) as the header row. Alternatively, `header=3` would also work — it tells pandas that the header is on row 3 (0-indexed).
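Both variants can be verified without touching disk by feeding a string through `io.StringIO` in place of a real file (the metadata lines and data rows below are made up for illustration):

```python
import io

import pandas as pd

# Three metadata lines, then the header row, then the data
raw = """Report Title
Generated 2024-10-01
Author Name
name,age,salary
Alice,34,70000
Bob,29,62000
"""

# skiprows=3 discards the metadata; the next line becomes the header
df1 = pd.read_csv(io.StringIO(raw), skiprows=3)

# header=3 points pandas at row index 3 as the header row
df2 = pd.read_csv(io.StringIO(raw), header=3)

print(df1.columns.tolist())  # ['name', 'age', 'salary']
print(df1.equals(df2))       # True
```

Either call yields the same two-row DataFrame; `skiprows` is usually clearer when the skipped lines are junk, while `header` reads better when you think of the header as *located at* a row.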
Question 14. Explain in 2-3 sentences why you would use dtype={'id': str} when loading a CSV, even if the id column contains only numbers.
Answer
Without `dtype={'id': str}`, pandas will infer the column as integer type, which drops leading zeros (e.g., "007" becomes 7) and changes the representation. ID numbers, ZIP codes, phone numbers, and product codes are identifiers, not quantities — you'll never add or average them. Forcing them to remain strings preserves their exact representation and prevents silent data loss.

Question 15. What is a SQL JOIN, and when would you use one? Give an example using two hypothetical tables.
Answer
A JOIN combines rows from two tables based on a related column (a shared key). You use it when related data is spread across multiple tables. For example, if an `orders` table has `customer_id` and `amount`, and a `customers` table has `customer_id` and `name`, you would JOIN them on `customer_id` to see each customer's name alongside their orders: `SELECT c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id`.

Question 16. Describe the "load, explore, extract, normalize" pattern for working with JSON data. When would you use this pattern instead of simply calling pd.read_json()?
Answer
The pattern is: (1) Load raw JSON with `json.load()`, (2) Explore the structure by checking types and keys, (3) Extract the specific nested element containing your data, (4) Normalize it into a DataFrame with `pd.json_normalize()`. You use this pattern when JSON is nested, when the data you want is buried inside a wrapper object (e.g., `raw['data']['results']`), or when you need to understand the structure before deciding how to flatten it. Simple `pd.read_json()` only works for flat, table-like JSON.

Section 4: Applied Scenarios (2 questions, 5 points each)
Question 17. You're a data analyst at a hospital. The finance department sends you an Excel workbook with sheets named "January", "February", ..., "December". Each sheet has the same structure: Department, Category, Amount. You need to create a single dataset with all 12 months. Write the code and explain each step.
Answer
```python
import pandas as pd

# Load all sheets into a dictionary
all_months = pd.read_excel('finance_2024.xlsx', sheet_name=None)

# Add month identifier and collect frames
frames = []
for month_name, df in all_months.items():
    df['month'] = month_name
    frames.append(df)

# Combine into a single DataFrame
annual = pd.concat(frames, ignore_index=True)
```
Step 1: `sheet_name=None` loads all 12 sheets as a dictionary (keys = sheet names, values = DataFrames). Step 2: Loop through the dictionary, adding a `month` column to each DataFrame so we know which month each row came from. Step 3: `pd.concat()` stacks all 12 DataFrames vertically into one.
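The loop-and-concat steps can be dry-run without a real workbook by hand-building the dictionary shape that `sheet_name=None` returns (the department rows below are invented):

```python
import pandas as pd

# Stand-in for pd.read_excel(..., sheet_name=None):
# a dict mapping sheet names to same-structured DataFrames
all_months = {
    'January': pd.DataFrame(
        {'Department': ['ER'], 'Category': ['Supplies'], 'Amount': [1200]}),
    'February': pd.DataFrame(
        {'Department': ['ICU'], 'Category': ['Staffing'], 'Amount': [3400]}),
}

frames = []
for month_name, df in all_months.items():
    df['month'] = month_name  # tag each row with its source sheet
    frames.append(df)

annual = pd.concat(frames, ignore_index=True)
print(annual.shape)  # (2, 4): one row per source row, plus a 'month' column
```

The `month` column is the step beginners skip; without it, once the frames are stacked there is no way to tell which sheet a row came from.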
Question 18. Your company stores customer data in a SQLite database with two tables: customers (columns: id, name, region) and orders (columns: order_id, customer_id, product, amount, order_date). Write a SQL query to find the total order amount per region, sorted from highest to lowest. Then explain why doing this filtering in SQL (rather than loading all data into pandas first) is more efficient.
Answer
```sql
SELECT c.region, SUM(o.amount) AS total_amount
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.region
ORDER BY total_amount DESC
```
Doing this in SQL is more efficient because: (1) The database only returns the aggregated result (one row per region) rather than every individual order — this means less data transferred and less memory used. (2) Databases are optimized for JOIN and GROUP BY operations on large tables, using indexes and query planning. (3) If the orders table has millions of rows, loading everything into pandas first would be slow and memory-intensive, while the SQL query returns just a handful of rows.
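One way to watch this query run end to end is against a throwaway in-memory SQLite database; the customers and orders below are invented sample rows:

```python
import sqlite3

import pandas as pd

# Build a disposable database matching the schema in the question
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         product TEXT, amount REAL, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ana', 'West'), (2, 'Ben', 'East');
    INSERT INTO orders VALUES
        (10, 1, 'Widget', 100.0, '2024-01-05'),
        (11, 1, 'Gadget', 50.0, '2024-02-10'),
        (12, 2, 'Widget', 75.0, '2024-03-01');
""")

query = """
    SELECT c.region, SUM(o.amount) AS total_amount
    FROM customers c
    JOIN orders o ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY total_amount DESC
"""
# The database does the JOIN and aggregation; pandas receives two rows
df = pd.read_sql_query(query, conn)
print(df)  # West 150.0, then East 75.0
conn.close()
```

Note that only the two aggregated rows cross the database boundary, which is exactly the efficiency argument in the answer above.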
Section 5: Code Analysis (2 questions, 5 points each)
Question 19. What will this code produce, and why might it cause problems?
```python
df = pd.read_csv('addresses.csv')
print(df['zipcode'].head())
# Output:
# 0     2134
# 1    90210
# 2     8501
# 3    10001
# 4    33139
```
Answer
The ZIP codes have lost their leading zeros. Row 0 should be "02134" (Boston area) and row 2 should be "08501" (New Jersey). pandas inferred the zipcode column as an integer type, which automatically drops leading zeros. This is a data integrity problem — if you use these ZIP codes to merge with another dataset or look up geographic information, the matches will fail. Fix: `df = pd.read_csv('addresses.csv', dtype={'zipcode': str})`

Question 20. This code attempts to load nested JSON but doesn't produce the expected result. What's wrong, and how would you fix it?
```python
import pandas as pd

df = pd.read_json('api_response.json')
print(df['location'].head())
# Output:
# 0    {'city': 'Portland', 'state': 'OR'}
# 1      {'city': 'Denver', 'state': 'CO'}
# 2    {'city': 'Seattle', 'state': 'WA'}
```
Answer
The `location` column contains dictionaries instead of flat values because `pd.read_json()` doesn't automatically flatten nested structures. It loads the outer level into columns, but nested objects remain as dictionaries. Fix: use `json_normalize()` instead:

```python
import json

with open('api_response.json', 'r') as f:
    data = json.load(f)

df = pd.json_normalize(data)
# Now you get: location.city, location.state as separate columns
```
This flattens the nested `location` object into individual columns with dot-separated names.
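The same flattening can be reproduced without a file by normalizing an inline structure (the records below are fabricated to match the output shown in the question):

```python
import pandas as pd

# Nested records of the shape a typical API response might contain
records = [
    {'name': 'Ana', 'location': {'city': 'Portland', 'state': 'OR'}},
    {'name': 'Ben', 'location': {'city': 'Denver', 'state': 'CO'}},
]

df = pd.json_normalize(records)
print(df.columns.tolist())  # ['name', 'location.city', 'location.state']
print(df.loc[0, 'location.city'])  # Portland
```

Each nested dictionary becomes a set of ordinary scalar columns, so downstream filtering, grouping, and merging all work normally.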