Chapter 12 Exercises: Getting Data from Files
How to use these exercises: Work through the sections in order. Parts A-D focus on Chapter 12 material, building from recall to original analysis. Part E applies your skills to new scenarios. Part M mixes in concepts from earlier chapters to reinforce retention. You'll need Python, pandas, and access to sample data files for most problems.
Difficulty key: 1-star: Foundational | 2-star: Intermediate | 3-star: Advanced | 4-star: Extension
Part A: Conceptual Understanding (1-star)
These questions check whether you absorbed the core ideas from the chapter. Write clear, concise answers.
Exercise 12.1 — Encoding basics
In your own words, explain what "encoding" means in the context of loading a CSV file. Why does using the wrong encoding produce garbled characters instead of an error message?
Guidance
Encoding is the mapping between characters (like "a", "e", or accented characters like "é") and the bytes stored in the file. When you use the wrong encoding, the bytes still map to *some* characters — just the wrong ones. The computer doesn't know the characters look wrong to a human eye, so it happily displays the garbled result (mojibake) instead of raising an error. It's like reading French with a Spanish dictionary — you'll get words, just not the right ones.
Exercise 12.2 — Format comparison
Without looking at the comparison table, list three advantages of CSV over Excel and three advantages of Excel over CSV for storing data.
Guidance
**CSV advantages:** (1) Plain text — human-readable and version-control friendly; (2) Universal compatibility — every tool can read CSV; (3) Handles large files better — no row limit like Excel's ~1 million. **Excel advantages:** (1) Preserves data types and formatting; (2) Supports multiple sheets in one file; (3) More accessible for non-technical users who can open and browse it directly.
Exercise 12.3 — JSON vs. CSV
A colleague says, "JSON is better than CSV because it preserves data types." Is this statement accurate? Explain when you'd choose JSON over CSV and when you'd choose CSV over JSON.
Guidance
The statement is partially accurate — JSON does preserve types (numbers, strings, booleans, null) while CSV stores everything as text. However, "better" depends on the use case. Choose JSON when data is hierarchical/nested or when exchanging data with web APIs. Choose CSV when data is a simple flat table, when file size matters (CSV is typically smaller), or when maximum compatibility is needed. For a flat table of 10,000 rows, CSV is simpler and more efficient. For nested API response data, JSON is the natural fit.
Exercise 12.4 — SQL conceptual
Explain in plain English what each of these SQL keywords does: SELECT, FROM, WHERE, GROUP BY, ORDER BY, JOIN. Use a library analogy: imagine a database of books with tables for books and authors.
Guidance
- **SELECT**: Choose which columns to display (like choosing which fields on a book card to read — title, year, genre).
- **FROM**: Specify which table to look in (like saying "look in the books catalog").
- **WHERE**: Filter to only the rows matching a condition (like saying "only show me books published after 2020").
- **GROUP BY**: Group rows that share a value and compute summaries (like saying "count how many books each genre has").
- **ORDER BY**: Sort the results (like saying "sort alphabetically by title").
- **JOIN**: Combine two tables using a shared column (like connecting a book record to its author record using `author_id`).
Exercise 12.5 — When to use a database
Your team has a dataset of 50 million customer transactions spanning 5 years. Currently it's stored in 60 monthly CSV files. A colleague suggests migrating to a SQLite database. List three reasons this migration would be beneficial and one potential drawback.
Guidance
**Benefits:** (1) You can query across all 60 months with a single SQL statement instead of loading and concatenating 60 files; (2) SQL's WHERE clause lets you extract only the rows you need, saving memory; (3) JOINs let you easily combine transaction data with other tables (customer profiles, product catalogs). **Drawback:** The initial migration takes effort, and SQLite may struggle with truly concurrent access if multiple team members write to the database simultaneously (a full database server like PostgreSQL would be better for that).
Part B: Applied Practice (2-star)
These problems require writing code. Test your solutions with real or simulated data.
Exercise 12.6 — CSV with encoding issues
You receive a file called cities_brazil.csv. Viewed in a text editor, its contents look like this:
nome,estado,população
São Paulo,SP,12325000
Write the read_csv call that will correctly display: São Paulo, SP, 12325000. Include a comment explaining why the default didn't work.
Guidance
# The file uses Latin-1 encoding, but pandas defaults to UTF-8.
# Latin-1 bytes aren't valid UTF-8, so the default read either
# raises a UnicodeDecodeError or garbles the accented characters.
df = pd.read_csv('cities_brazil.csv', encoding='latin-1')
After loading, verify with `df.head()` that accented characters display correctly.
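The mechanism behind mojibake can be reproduced in a few lines of plain Python. This sketch shows the reverse direction (UTF-8 bytes decoded as Latin-1), which garbles silently because every byte is a valid Latin-1 character:

```python
# 'ã' is stored as two bytes (0xC3 0xA3) in UTF-8. Latin-1 maps
# every single byte to some character, so decoding never fails --
# it just produces the wrong text.
text = "São Paulo"
raw = text.encode("utf-8")       # bytes as they'd sit on disk
garbled = raw.decode("latin-1")  # read with the wrong encoding
print(garbled)                   # SÃ£o Paulo
```

Notice that no exception is raised anywhere: the computer has no way to know the result looks wrong.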
Exercise 12.7 — Semicolon-delimited CSV
A European colleague sends you sales_germany.csv. When you load it, the DataFrame has one column and values like "Berlin;2500;2024-01-15". Write the corrected read_csv call. Then explain why Europeans often use semicolons instead of commas as delimiters.
Guidance
df = pd.read_csv('sales_germany.csv', sep=';')
Europeans use semicolons because many European countries use commas as decimal separators (e.g., `3,14` instead of `3.14`). Since commas are already used within numbers, they can't also serve as field delimiters — so semicolons take over that role.
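pandas can handle both conventions in one call: `sep=';'` for the field delimiter and `decimal=','` for the decimal separator. A minimal sketch using an in-memory buffer (the column names are illustrative):

```python
import io
import pandas as pd

# German-style CSV: semicolons between fields, commas as decimals.
raw = io.StringIO("city;price;date\nBerlin;3,14;2024-01-15\n")
df = pd.read_csv(raw, sep=";", decimal=",")
print(df["price"].iloc[0])  # 3.14, parsed as a real float
```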
Exercise 12.8 — Messy CSV headers
You receive a file with this structure:
Annual Sales Report
Generated: 2024-04-01
Division: Northeast

Product,Units,Revenue
Widget A,1500,45000
Widget B,2300,69000
Write the read_csv call to load this correctly, skipping the metadata rows. Your resulting DataFrame should have columns Product, Units, and Revenue.
Guidance
df = pd.read_csv('sales_report.csv', skiprows=4)
The metadata occupies lines 0-2 (title, date, division) and line 3 is blank, so the header sits on line 4. `skiprows=4` skips lines 0-3, and pandas reads the next line — the real header — as the column names.
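You can verify this behavior without a file on disk by pointing `read_csv` at an in-memory buffer; a quick sketch reproducing the structure above:

```python
import io
import pandas as pd

raw = io.StringIO(
    "Annual Sales Report\n"
    "Generated: 2024-04-01\n"
    "Division: Northeast\n"
    "\n"
    "Product,Units,Revenue\n"
    "Widget A,1500,45000\n"
    "Widget B,2300,69000\n"
)
df = pd.read_csv(raw, skiprows=4)
print(df.columns.tolist())  # ['Product', 'Units', 'Revenue']
print(len(df))              # 2
```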
Exercise 12.9 — Excel multi-sheet loading
An Excel file budget_2024.xlsx has three sheets: "Q1", "Q2", and "Q3". Each sheet has the same columns: Department, Category, Amount. Write code that:
1. Loads all three sheets
2. Adds a column called quarter to each (values: "Q1", "Q2", "Q3")
3. Combines them into a single DataFrame
Guidance
all_sheets = pd.read_excel('budget_2024.xlsx', sheet_name=None)
frames = []
for quarter_name, df in all_sheets.items():
    df['quarter'] = quarter_name
    frames.append(df)
budget = pd.concat(frames, ignore_index=True)
print(budget.shape)
print(budget.head())
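As an alternative worth knowing, `pd.concat` accepts the dict returned by `sheet_name=None` directly and turns the sheet names into an index level. A sketch with a simulated dict standing in for the Excel file:

```python
import pandas as pd

# Simulated result of pd.read_excel(..., sheet_name=None):
# a dict mapping sheet name -> DataFrame.
all_sheets = {
    "Q1": pd.DataFrame({"Department": ["IT"], "Amount": [100]}),
    "Q2": pd.DataFrame({"Department": ["IT"], "Amount": [120]}),
}
budget = (pd.concat(all_sheets, names=["quarter"])  # keys become an index level
            .reset_index(level="quarter")            # turn that level into a column
            .reset_index(drop=True))
print(budget["quarter"].tolist())  # ['Q1', 'Q2']
```

Both versions produce the same DataFrame; the loop version is easier to read, the concat-on-dict version is more compact.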
Exercise 12.10 — JSON normalization
Given this JSON structure, write code to produce a flat DataFrame with columns for city, temperature value, temperature unit, and wind speed:
[
  {
    "city": "Portland",
    "weather": {
      "temp": {"value": 72, "unit": "F"},
      "wind_speed": 12
    }
  },
  {
    "city": "Denver",
    "weather": {
      "temp": {"value": 85, "unit": "F"},
      "wind_speed": 8
    }
  }
]
Guidance
data = [
    {"city": "Portland",
     "weather": {"temp": {"value": 72, "unit": "F"},
                 "wind_speed": 12}},
    {"city": "Denver",
     "weather": {"temp": {"value": 85, "unit": "F"},
                 "wind_speed": 8}}
]
df = pd.json_normalize(data, sep='_')
print(df.columns.tolist())
# ['city', 'weather_temp_value', 'weather_temp_unit',
# 'weather_wind_speed']
Exercise 12.11 — JSON with record_path
Given this JSON of departments and their employees:
[
  {
    "department": "Engineering",
    "budget": 500000,
    "employees": [
      {"name": "Alice", "role": "Lead"},
      {"name": "Bob", "role": "Developer"}
    ]
  },
  {
    "department": "Marketing",
    "budget": 300000,
    "employees": [
      {"name": "Carol", "role": "Manager"}
    ]
  }
]
Write code using json_normalize with record_path and meta to produce a DataFrame with one row per employee, including their department name and budget.
Guidance
df = pd.json_normalize(
    data,
    record_path='employees',
    meta=['department', 'budget']
)
# Result:
#     name       role   department  budget
# 0  Alice       Lead  Engineering  500000
# 1    Bob  Developer  Engineering  500000
# 2  Carol    Manager    Marketing  300000
Exercise 12.12 — SQL SELECT and WHERE
You have a SQLite database bookstore.db with a table called books that has columns: title, author, genre, price, year_published. Write SQL queries (inside pd.read_sql) to:
- Select all books published after 2020
- Select the title and price of books in the "Science Fiction" genre, sorted by price descending
- Find the average price of books by genre
Guidance
import sqlite3
conn = sqlite3.connect('bookstore.db')
# 1. Books published after 2020
recent = pd.read_sql(
    'SELECT * FROM books WHERE year_published > 2020',
    conn)
# 2. Sci-fi books sorted by price
scifi = pd.read_sql("""
    SELECT title, price FROM books
    WHERE genre = 'Science Fiction'
    ORDER BY price DESC
""", conn)
# 3. Average price by genre
avg_price = pd.read_sql("""
    SELECT genre, AVG(price) AS avg_price
    FROM books
    GROUP BY genre
""", conn)
conn.close()
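One habit worth forming early: pass values into queries through the `params` argument rather than string formatting, so quoting is handled by the database driver. A sketch using an in-memory database (the table and rows are made up for the demo):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, genre TEXT, price REAL)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?)",
                 [("Dune", "Science Fiction", 9.99),
                  ("Emma", "Classic", 7.50)])
# '?' placeholders are filled in safely by the driver -- no manual quoting.
scifi = pd.read_sql("SELECT title, price FROM books WHERE genre = ?",
                    conn, params=("Science Fiction",))
print(scifi["title"].tolist())  # ['Dune']
conn.close()
```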
Part C: Real-World Scenarios (2-star to 3-star)
These problems present realistic data challenges that require judgment as well as code.
Exercise 12.13 — The mystery file ⭐⭐
You receive a file called data_export.txt with no documentation. When you open it in a text editor, you see:
1001|Johnson|Marketing|75000|2019-03-15
1002|Williams|Engineering|92000|2017-08-22
1003|Chen|Marketing|68000|2021-01-10
Write the complete read_csv call to load this file correctly. What parameters do you need to set, and why?
Guidance
df = pd.read_csv(
    'data_export.txt',
    sep='|',           # Pipe-delimited
    header=None,       # No header row
    names=['emp_id', 'last_name', 'department',
           'salary', 'hire_date'],
    parse_dates=['hire_date']
)
You need `sep='|'` because the delimiter is a pipe, `header=None` because there's no header row, `names` to provide meaningful column names, and `parse_dates` to convert the date column from string to datetime.
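The call can be sanity-checked against an in-memory sample before pointing it at the real file:

```python
import io
import pandas as pd

raw = io.StringIO("1001|Johnson|Marketing|75000|2019-03-15\n"
                  "1002|Williams|Engineering|92000|2017-08-22\n")
df = pd.read_csv(raw, sep="|", header=None,
                 names=["emp_id", "last_name", "department",
                        "salary", "hire_date"],
                 parse_dates=["hire_date"])
print(df["hire_date"].dtype)  # datetime64[ns]
```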
Exercise 12.14 — Choosing the right format ⭐⭐
For each scenario, recommend the best file format (CSV, Excel, JSON, or database) and justify your choice in 1-2 sentences:
- A data pipeline that runs nightly, extracting 2 million rows of transaction data
- A quarterly report that your non-technical CEO needs to review and annotate
- Configuration data for a web application that includes nested settings
- Survey responses collected by 15 research assistants who each maintain their own spreadsheet
- A master customer database accessed by 50 employees simultaneously
Guidance
1. **Database** — 2 million rows updated nightly requires efficient querying, not reloading entire files. SQL lets the pipeline extract exactly what's needed.
2. **Excel** — CEOs expect Excel. They can open, sort, filter, and add comments without any technical tools.
3. **JSON** — Nested configuration (settings within settings) maps naturally to JSON's hierarchical structure. CSV can't represent nesting.
4. **Excel** — Each assistant uses a familiar tool (Excel), and you can load all 15 files with `pd.read_excel` and concatenate them. If standardized, CSV would also work.
5. **Database** (PostgreSQL/MySQL, not SQLite) — Simultaneous access by 50 users requires a real database server with concurrency control.
Exercise 12.15 — The encoding detective ⭐⭐⭐
You have three CSV files from different sources. Each one fails to load correctly with default settings. Describe the likely problem and the fix for each:
- File A: Accented characters are garbled — values like `café` don't display correctly
- File B: The DataFrame has one column, and values look like `"Paris;25;France"`
- File C: The column "zipcode" shows values like `2134` and `8501` (but you know the real ZIP codes are `02134` and `08501`)
Guidance
1. **Encoding mismatch.** The file uses Latin-1 or CP1252 encoding but pandas defaults to UTF-8. Fix: `encoding='latin-1'`
2. **Wrong delimiter.** The file uses semicolons, not commas. Fix: `sep=';'`
3. **Type inference error.** Pandas interpreted ZIP codes as integers, dropping leading zeros. Fix: `dtype={'zipcode': str}`
Exercise 12.16 — Multi-format integration ⭐⭐⭐
You're analyzing city air quality. You have three data sources:
- `air_quality.csv`: Daily readings with columns `station_id`, `date`, `pm25`, `ozone`
- `stations.xlsx`: Station info with columns `station_id`, `name`, `latitude`, `longitude` (header on row 3 because of a title)
- `weather.json`: Nested JSON with `station_id`, `date`, and weather data (`temperature`, `humidity`, `wind.speed`, `wind.direction`)
Write the complete code to load all three files and merge them into a single DataFrame with one row per station per day.
Guidance
import pandas as pd
import json
# Load CSV
air = pd.read_csv('air_quality.csv', parse_dates=['date'])
# Load Excel (header on row 3, 0-indexed)
stations = pd.read_excel('stations.xlsx', header=3)
# Load nested JSON
with open('weather.json', 'r') as f:
    weather_raw = json.load(f)
weather = pd.json_normalize(weather_raw, sep='_')
# Merge air quality with station info
merged = air.merge(stations, on='station_id', how='left')
# Merge with weather
merged = merged.merge(weather,
                      on=['station_id', 'date'], how='left')
print(merged.shape)
print(merged.columns.tolist())
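One caveat to check in practice: `parse_dates` turns the CSV's date column into datetimes, while dates coming out of `json_normalize` are still strings, and pandas will refuse to merge a datetime key against a string key (older versions silently matched nothing instead). A minimal sketch of the fix, with made-up values:

```python
import pandas as pd

air = pd.DataFrame({"station_id": [7],
                    "date": pd.to_datetime(["2024-01-15"]),
                    "pm25": [12.0]})
weather = pd.DataFrame({"station_id": [7],
                        "date": ["2024-01-15"],  # still a string
                        "temperature": [41]})
# Align the key dtypes before merging.
weather["date"] = pd.to_datetime(weather["date"])
merged = air.merge(weather, on=["station_id", "date"], how="left")
print(merged["temperature"].iloc[0])  # 41
```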
Exercise 12.17 — SQL joins ⭐⭐⭐
A SQLite database school.db has two tables:
- `students`: `student_id`, `name`, `grade_level`
- `test_scores`: `student_id`, `subject`, `score`, `test_date`
Write SQL queries to:
1. List each student's name alongside their test scores (all subjects)
2. Find the average test score per grade level
3. Find students who scored above 90 in Math
Guidance
import sqlite3
conn = sqlite3.connect('school.db')
# 1. Students with their scores
q1 = """
SELECT s.name, t.subject, t.score
FROM students s
JOIN test_scores t ON s.student_id = t.student_id
"""
# 2. Average score per grade level
q2 = """
SELECT s.grade_level, AVG(t.score) as avg_score
FROM students s
JOIN test_scores t ON s.student_id = t.student_id
GROUP BY s.grade_level
"""
# 3. High math scorers
q3 = """
SELECT s.name, t.score
FROM students s
JOIN test_scores t ON s.student_id = t.student_id
WHERE t.subject = 'Math' AND t.score > 90
"""
df1 = pd.read_sql(q1, conn)
df2 = pd.read_sql(q2, conn)
df3 = pd.read_sql(q3, conn)
conn.close()
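For comparison, the average-per-grade query has a direct pandas equivalent once both tables are loaded as DataFrames; a sketch with made-up rows:

```python
import pandas as pd

students = pd.DataFrame({"student_id": [1, 2],
                         "name": ["Ana", "Ben"],
                         "grade_level": [9, 10]})
scores = pd.DataFrame({"student_id": [1, 1, 2],
                       "subject": ["Math", "Art", "Math"],
                       "score": [95, 85, 88]})
# JOIN -> merge, GROUP BY -> groupby, AVG -> mean
avg = (students.merge(scores, on="student_id")
               .groupby("grade_level")["score"].mean()
               .reset_index(name="avg_score"))
print(avg["avg_score"].tolist())  # [90.0, 88.0]
```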
Part D: Synthesis and Extension (3-star to 4-star)
These problems require combining concepts from multiple sections.
Exercise 12.18 — Build a data loading report ⭐⭐⭐
Write a function called loading_report(filepath) that loads a file (detecting format from the extension), and prints a summary report including:
- File format detected
- Number of rows and columns
- Column names and inferred dtypes
- Number of missing values per column
- First 3 rows
Test it with at least two different file formats.
Guidance
Combine the `load_data()` function from Section 12.6 with `df.info()`, `df.isnull().sum()`, and `df.head(3)`. The function should detect CSV, Excel, and JSON at minimum. Print each section with a clear heading. This is a practical utility function you'll reuse in real projects.
Exercise 12.19 — SQL vs. pandas comparison ⭐⭐⭐
Using any dataset, perform the same analysis two ways: once using SQL queries via pd.read_sql(), and once using pure pandas operations. Specifically:
1. Filter rows by a condition
2. Group by a column and compute an aggregate
3. Sort the results
Compare the two approaches in terms of readability and which felt more natural. When might you prefer SQL over pandas, or vice versa?
Guidance
Create a simple SQLite database from a DataFrame (`df.to_sql('tablename', conn, index=False)`), then query it. The SQL approach is often more concise for simple queries but less flexible for complex transformations. Pandas shines for chaining operations, creating new columns, and visualization — tasks that are awkward in SQL alone. The professional approach: use SQL to extract and filter at the database level, then use pandas for further analysis.
Exercise 12.20 — The format converter ⭐⭐⭐⭐
Write a function convert_format(input_path, output_path) that reads data from any supported format and writes it to any other supported format. It should handle CSV, Excel, and JSON. Include appropriate error handling for unsupported formats and missing data. Test by converting a CSV to Excel and back to CSV — are the results identical? What might be lost in translation?
Guidance
Use file extension detection for both input and output. Writers: `df.to_csv()`, `df.to_excel()`, `df.to_json()`. Round-tripping (CSV -> Excel -> CSV) may lose: (1) leading zeros in numeric-looking strings, (2) date formatting precision, (3) column order (sometimes). Test with edge cases like NaN values and mixed-type columns.
Exercise 12.21 — Chunked loading for large files ⭐⭐⭐⭐
Research the chunksize parameter of pd.read_csv(). Then write code that:
1. Loads a large CSV in chunks of 10,000 rows
2. Computes the mean of a numeric column across all chunks (without loading the entire file into memory)
3. Explain why this approach uses less memory than loading the whole file
Guidance
total = 0
count = 0
for chunk in pd.read_csv('large_file.csv',
                         chunksize=10000):
    total += chunk['value'].sum()
    count += chunk['value'].count()
overall_mean = total / count
This uses less memory because only 10,000 rows are in memory at any time. Once a chunk is processed, it's discarded before the next one is loaded. For a file with 10 million rows, this uses roughly 1/1000th the memory of loading the whole file.
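You can check that the chunked computation really matches the all-at-once mean by simulating a "large" file in memory:

```python
import io
import pandas as pd

# 100 rows with values 0..99; the true mean is 49.5.
raw = "value\n" + "\n".join(str(i) for i in range(100))
total = count = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=10):
    total += chunk["value"].sum()
    count += chunk["value"].count()
print(total / count)  # 49.5
```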
Part E: New Dataset Challenge (2-star to 3-star)
Apply your skills to an unfamiliar scenario.
Exercise 12.22 — The multi-source weather project ⭐⭐
You're building a weather analysis for three cities. Your data arrives in three formats:
- portland_weather.csv (comma-separated, UTF-8)
- berlin_weather.csv (semicolon-separated, Latin-1 encoding)
- tokyo_weather.json (nested JSON with temperature and precipitation objects)
Write the loading code for all three files, standardize the column names, and combine them into a single DataFrame. Add a city column to identify the source.
Guidance
Load each with the appropriate parameters. After loading, rename columns to match (all three should have the same column names). Add the city column with `df['city'] = 'Portland'`, etc. Concatenate with `pd.concat([portland, berlin, tokyo], ignore_index=True)`.
Part M: Mixed Review (Chapters 1-11)
These problems revisit concepts from earlier chapters to strengthen long-term retention.
Exercise 12.23 — Review: Boolean indexing (Chapter 7) ⭐⭐
After loading a CSV of employee salaries, write pandas code to find all employees in the "Engineering" department with salaries above $80,000. Then compute the mean salary of this filtered group.
Guidance
eng_high = df[(df['department'] == 'Engineering') &
              (df['salary'] > 80000)]
print(eng_high['salary'].mean())
Remember: use `&` (not `and`) for combining conditions in pandas, and wrap each condition in parentheses.
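An equivalent filter uses `DataFrame.query`, which some readers find easier to scan; a self-contained sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"department": ["Engineering", "Sales", "Engineering"],
                   "salary": [95000, 70000, 60000]})
# Inside a query string, plain 'and' is allowed (unlike boolean indexing,
# which needs '&' and parentheses).
eng_high = df.query("department == 'Engineering' and salary > 80000")
print(eng_high["salary"].mean())  # 95000.0
```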
Exercise 12.24 — Review: String methods (Chapter 10) ⭐⭐
You load a CSV where the country column has inconsistent formatting: " United States ", "united states", "UNITED STATES". Write a chain of string operations to standardize all values to title case with no extra whitespace.
Guidance
df['country'] = (df['country']
                 .str.strip()
                 .str.title())
`strip()` removes leading/trailing whitespace, and `title()` converts to title case ("United States").
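The chain collapses all three inconsistent spellings into one value, which you can confirm on a small sample:

```python
import pandas as pd

s = pd.Series([" United States ", "united states", "UNITED STATES"])
cleaned = s.str.strip().str.title()
print(cleaned.nunique())  # 1 -- all three rows now match
print(cleaned.iloc[0])    # United States
```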
Exercise 12.25 — Review: Merging (Chapter 9) meets new loading ⭐⭐⭐
Load two files — a CSV of product sales and an Excel file of product details — and merge them on product_id. Identify which products appear in the sales data but are missing from the product details file, and which products have details but no sales. (Hint: use different how parameters in merge() to investigate.)
Guidance
sales = pd.read_csv('product_sales.csv')
details = pd.read_excel('product_details.xlsx')
# Full outer merge to see everything
full = sales.merge(details, on='product_id', how='outer',
                   indicator=True)
# Products with sales but no details
no_details = full[full['_merge'] == 'left_only']
# Products with details but no sales
no_sales = full[full['_merge'] == 'right_only']
The `indicator=True` parameter adds a `_merge` column showing whether each row came from the left, right, or both DataFrames.
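A tiny self-contained demo of the `_merge` column, with made-up product IDs:

```python
import pandas as pd

sales = pd.DataFrame({"product_id": [1, 2], "units": [10, 5]})
details = pd.DataFrame({"product_id": [2, 3], "name": ["Gear", "Gizmo"]})
full = sales.merge(details, on="product_id", how="outer", indicator=True)
# Product 1 has sales but no details; product 3 has details but no sales.
print(full["_merge"].astype(str).tolist())
# ['left_only', 'both', 'right_only']
```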