Case Study 1: The Data Format Zoo — Consolidating Reports from Three Departments
Tier 3 — Illustrative/Composite Example: This case study uses a fictional but realistic corporate scenario modeled on the kinds of data integration challenges that analysts routinely face when departments within an organization use different tools and formats. The company name, department structures, and all data values are invented for pedagogical purposes.
The Setting
Marcus — our small business analyst from Chapter 1 — has landed his first corporate gig. He's a junior data analyst at Greenfield Industries, a mid-sized manufacturing company with about 800 employees across three divisions. His manager, Dana, has given him what she calls a "simple" first assignment.
"We need a consolidated headcount and compensation report," Dana says, sliding a USB drive across the table. "HR, Finance, and Operations each track their own employee data. I need it all in one place by Friday."
Marcus plugs in the USB drive and finds three files:
- hr_employees.csv from the Human Resources department
- finance_compensation.xlsx from the Finance department
- operations_roster.json from the Operations department
Three files. Three formats. Three departments that apparently never agreed on a standard. Welcome to corporate data work.
First Contact: The CSV from HR
Marcus starts with the CSV, figuring it'll be the easiest. He opens his Jupyter notebook:
import pandas as pd
hr = pd.read_csv('hr_employees.csv')
print(hr.shape)
print(hr.head())
(812, 6)
emp_id name department hire_date status location
0 1001 GarcÃa, MarÃa HR 2019-03-15 Active Chicago
1 1002 Johnson, Tyler Marketing 2020-07-22 Active Denver
2 1003 Chen, Wei-Lin Engineering 2018-11-08 Active Chicago
Marcus spots the problem immediately: García, María has turned into GarcÃa, MarÃa. He recognizes this from Chapter 12 — it's an encoding issue. The file was probably exported from an HR system that uses Latin-1 encoding.
hr = pd.read_csv('hr_employees.csv', encoding='latin-1')
print(hr.head(1))
emp_id name department hire_date status location
0 1001 García, María HR 2019-03-15 Active Chicago
The accented characters display correctly now. Marcus adds a comment to his notebook: "HR system exports Latin-1 encoded CSVs. Always use encoding='latin-1' when loading."
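Rather than guessing, the encoding can be sniffed directly from the file's raw bytes. A minimal sketch, assuming a hypothetical helper named sniff_encoding (not part of pandas):

```python
# Hypothetical helper: try likely encodings in order and return the
# first one that decodes the raw bytes without error. latin-1 maps
# every possible byte, so it never fails and must stay last as the
# catch-all.
def sniff_encoding(path, candidates=('utf-8', 'latin-1')):
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

If this returns 'latin-1', the file failed to decode as UTF-8, which is exactly the situation Marcus hit.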
He checks the basics:
print(f"Rows: {hr.shape[0]}")
print(f"Columns: {hr.columns.tolist()}")
print(f"Missing values:\n{hr.isnull().sum()}")
Rows: 812
Columns: ['emp_id', 'name', 'department', 'hire_date',
'status', 'location']
Missing values:
emp_id 0
name 0
department 2
hire_date 0
status 0
location 14
Two missing departments and 14 missing locations. Noted. He'll deal with those after combining everything.
The Excel File from Finance
Next up: the Finance department's Excel file.
finance = pd.read_excel('finance_compensation.xlsx')
The result looks strange — the first few rows contain the text "Greenfield Industries — Q4 Compensation Report" and "Prepared by: Finance Dept — Confidential". The real header row doesn't appear until row 4 (counting from zero), so the first four rows need to be skipped.
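Before hard-coding a skip count, the header row can also be located programmatically. A toy sketch, where the invented rows list stands in for a preview of the sheet read with header=None:

```python
# Invented preview rows mimicking the Finance workbook's layout:
# title lines, blank rows, then the real header.
rows = [
    ['Greenfield Industries - Q4 Compensation Report'],
    ['Prepared by: Finance Dept - Confidential'],
    [],
    [],
    ['Employee ID', 'Annual Salary', 'Bonus Pct'],
    [1001, 85000, 0.05],
]

# Find the first row containing a known header field; that index is
# exactly the value to pass as skiprows.
header_idx = next(i for i, row in enumerate(rows) if 'Employee ID' in row)
print(header_idx)  # 4
```

This makes the load robust if Finance adds or removes a title line next quarter.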
finance = pd.read_excel(
'finance_compensation.xlsx',
skiprows=4
)
print(finance.shape)
print(finance.columns.tolist())
(815, 5)
['Employee ID', 'Annual Salary', 'Bonus Pct',
'Benefits Tier', 'Pay Grade']
Good — 815 rows and clean column headers. But Marcus notices the column name is Employee ID, not emp_id like in the HR data. He'll need to rename it before merging. He also notices 815 rows versus HR's 812 — a discrepancy that could mean terminated employees are still in Finance's records, or new hires haven't been added to HR yet. He makes a note.
But there's another problem. The workbook might have multiple sheets:
all_sheets = pd.read_excel(
'finance_compensation.xlsx',
sheet_name=None
)
print(f"Sheets: {list(all_sheets.keys())}")
Sheets: ['Current', 'Terminated', 'Notes']
Three sheets. He loaded only "Current" by default (the first sheet), which is what he wants. But he notes the "Terminated" sheet — that explains the row count difference.
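The row-count discrepancy is worth verifying rather than assuming. A toy sketch with invented IDs, comparing the two departments' ID sets directly:

```python
import pandas as pd

# Invented IDs standing in for the real HR and Finance data.
hr_ids = pd.Series([1001, 1002, 1003, 1004])
finance_ids = pd.Series([1001, 1002, 1003, 1004, 1005, 1006, 1007])

# Set differences show exactly who appears in one source but not
# the other -- the extras could then be checked against the
# "Terminated" sheet.
only_in_finance = set(finance_ids) - set(hr_ids)
only_in_hr = set(hr_ids) - set(finance_ids)
print(only_in_finance)  # the IDs Finance has that HR does not
print(only_in_hr)
```

In Marcus's case, cross-checking the extra Finance IDs against the "Terminated" sheet would confirm the explanation for 815 versus 812.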
finance = finance.rename(
columns={'Employee ID': 'emp_id'}
)
The JSON from Operations
Finally, the Operations department's JSON file. Marcus knows from Chapter 12 that JSON can be nested, so he opens it cautiously:
import json
with open('operations_roster.json', 'r') as f:
raw = json.load(f)
print(type(raw))
print(len(raw))
<class 'list'>
450
It's a list of 450 records. Marcus looks at one:
print(json.dumps(raw[0], indent=2))
{
"id": 1003,
"shift": "Day",
"certifications": ["forklift", "hazmat"],
"supervisor": {
"name": "Rivera, Carlos",
"id": 2001
},
"performance": {
"last_review": "2024-01-15",
"rating": 4.2
}
}
Nested data. The supervisor and performance fields contain sub-objects, and certifications is a list. A simple pd.read_json() would leave dictionaries in the cells. Marcus reaches for json_normalize:
ops = pd.json_normalize(raw, sep='_')
print(ops.columns.tolist())
['id', 'shift', 'certifications',
'supervisor_name', 'supervisor_id',
'performance_last_review', 'performance_rating']
The nested fields have been flattened. supervisor.name became supervisor_name, and performance.rating became performance_rating. The certifications column still contains lists (e.g., ['forklift', 'hazmat']), but that's okay for now — Marcus can handle that later if needed.
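When the certifications do need analyzing, DataFrame.explode unpacks list-valued cells into one row per list element. A toy sketch with invented rows:

```python
import pandas as pd

# Invented rows mimicking the flattened Operations data.
ops_toy = pd.DataFrame({
    'emp_id': [1003, 1008],
    'certifications': [['forklift', 'hazmat'], ['forklift']],
})

# explode repeats each row once per list element, giving the
# one-row-per-(employee, certification) shape most analyses want.
certs = ops_toy.explode('certifications')
print(certs)
```

From there, a simple groupby('certifications') would count how many employees hold each certification.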
He renames the ID column to match:
ops = ops.rename(columns={'id': 'emp_id'})
The Merge
Now Marcus has three DataFrames with a shared emp_id column. Time to combine them:
# Start with HR as the base (most complete employee info)
combined = hr.merge(
finance[['emp_id', 'Annual Salary', 'Bonus Pct',
'Benefits Tier', 'Pay Grade']],
on='emp_id',
how='left'
)
combined = combined.merge(
ops[['emp_id', 'shift', 'performance_rating']],
on='emp_id',
how='left'
)
print(f"Combined shape: {combined.shape}")
print(f"Missing values:\n{combined.isnull().sum()}")
Combined shape: (812, 11)
Missing values:
emp_id 0
name 0
department 2
hire_date 0
status 0
location 14
Annual Salary 8
Bonus Pct 23
Benefits Tier 8
Pay Grade 8
shift 362
performance_rating 362
The 362 missing values in shift and performance_rating make sense — Operations tracks only its own 450 employees, so the 812 − 450 = 362 employees in other departments (Marketing, HR, Finance, etc.) have no Operations-specific data.
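Instead of inferring match rates from null counts, pandas can report them directly: passing indicator=True to merge adds a _merge column labeling each row 'both', 'left_only', or 'right_only'. A toy sketch with invented frames:

```python
import pandas as pd

# Invented frames standing in for the HR base and the Operations data.
hr_toy = pd.DataFrame({'emp_id': [1, 2, 3, 4]})
ops_toy = pd.DataFrame({'emp_id': [3, 4, 5],
                        'shift': ['Day', 'Night', 'Day']})

# indicator=True adds a _merge column so unmatched rows can be
# counted explicitly rather than guessed from NaN totals.
audit = hr_toy.merge(ops_toy, on='emp_id', how='left', indicator=True)
print(audit['_merge'].value_counts())
```

On the real data, the 'left_only' count should come out to exactly 362.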
The Report
Marcus creates a summary for Dana:
summary = combined.groupby('department').agg(
headcount=('emp_id', 'count'),
avg_salary=('Annual Salary', 'mean'),
avg_rating=('performance_rating', 'mean')
).round(2)
print(summary)
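One subtlety worth checking here: by default, groupby silently drops rows whose key is missing, so the two employees with a missing department would vanish from the summary. A toy sketch showing the dropna=False option that keeps them visible:

```python
import pandas as pd

# Invented rows: one employee has no department recorded.
df = pd.DataFrame({
    'department': ['HR', 'HR', None],
    'salary': [50000, 60000, 70000],
})

# Default behavior drops the NaN key; dropna=False keeps it as its
# own group, so no one disappears from the headcount.
with_nan = df.groupby('department', dropna=False)['salary'].count()
without_nan = df.groupby('department')['salary'].count()
print(with_nan.sum(), without_nan.sum())  # 3 2
```

For a headcount report, surfacing the "department unknown" group is usually better than losing those employees.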
He documents everything in his notebook: which files came in which format, what parameters he needed for each, what data quality issues he found, and how he resolved them.
What Marcus Learned
This "simple" assignment taught Marcus several things that the textbook examples didn't:
- Real data comes in mixed formats. You can't control what format other departments use. You need to be comfortable with all of them.
- The first load almost never works. The HR file needed encoding, the Finance file needed skiprows, and the Operations file needed json_normalize. Expect to iterate.
- Column names are never consistent. emp_id vs. Employee ID vs. id — the same concept had three different names across three files. Renaming and standardizing is part of the job.
- Row counts tell a story. HR had 812 rows, Finance had 815, and Operations had 450. Understanding why these numbers differ is as important as the merge itself.
- Document your loading decisions. Six months from now, when Marcus needs to refresh this report, he'll be glad his notebook explains that the HR file needs encoding='latin-1' and the Finance file needs skiprows=4.
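Those loading decisions can be captured as one function, so a monthly rerun is a single call rather than a sequence of notebook cells. A sketch, assuming the parameters documented above stay stable (the function name and signature are illustrative, not from the original assignment):

```python
import json
import pandas as pd

def load_sources(hr_path, finance_path, ops_path):
    """Load the three department files with their known quirks."""
    # HR exports Latin-1 encoded CSVs.
    hr = pd.read_csv(hr_path, encoding='latin-1')
    # Finance prepends four title rows and uses 'Employee ID'.
    finance = (pd.read_excel(finance_path, skiprows=4)
                 .rename(columns={'Employee ID': 'emp_id'}))
    # Operations ships nested JSON; flatten and align the ID column.
    with open(ops_path) as f:
        ops = pd.json_normalize(json.load(f), sep='_')
    ops = ops.rename(columns={'id': 'emp_id'})
    return hr, finance, ops
```

If a department changes its export format, the fix lives in exactly one place.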
When Marcus presents the consolidated report on Friday, Dana is impressed — not just by the output, but by the documentation. "Most analysts would have just emailed me a spreadsheet," she says. "You gave me a reproducible process." Marcus makes a mental note: the notebook isn't just for him. It's for the next person who needs to do this work.
Discussion Questions
- Why did Marcus use how='left' in both merges instead of how='inner'? What would have been different with an inner join?
- The Operations JSON had a certifications field containing lists like ['forklift', 'hazmat']. How would you transform this into a format suitable for analysis? (Hint: think about what you learned in Chapter 9 about reshaping.)
- If Marcus needs to run this consolidation every month, what would you suggest to make the process more efficient and less error-prone?