Case Study 1: The Data Format Zoo — Consolidating Reports from Three Departments


Tier 3 — Illustrative/Composite Example: This case study uses a fictional but realistic corporate scenario modeled on the kinds of data integration challenges that analysts routinely face when departments within an organization use different tools and formats. The company name, department structures, and all data values are invented for pedagogical purposes.


The Setting

Marcus — our small business analyst from Chapter 1 — has landed his first corporate gig. He's a junior data analyst at Greenfield Industries, a mid-sized manufacturing company with about 800 employees across three divisions. His manager, Dana, has given him what she calls a "simple" first assignment.

"We need a consolidated headcount and compensation report," Dana says, sliding a USB drive across the table. "HR, Finance, and Operations each track their own employee data. I need it all in one place by Friday."

Marcus plugs in the USB drive and finds three files:

  • hr_employees.csv from the Human Resources department
  • finance_compensation.xlsx from the Finance department
  • operations_roster.json from the Operations department

Three files. Three formats. Three departments that apparently never agreed on a standard. Welcome to corporate data work.

First Contact: The CSV from HR

Marcus starts with the CSV, figuring it'll be the easiest. He opens his Jupyter notebook:

import pandas as pd

hr = pd.read_csv('hr_employees.csv')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in
position ...: invalid continuation byte

The load fails before pandas even builds the DataFrame. Marcus recognizes this from Chapter 12: it's an encoding issue. Byte 0xed is "í" in Latin-1 but isn't valid UTF-8 at that position, so the file was probably exported from an HR system that uses Latin-1 rather than pandas's default of UTF-8.

hr = pd.read_csv('hr_employees.csv', encoding='latin-1')
print(hr.shape)
print(hr.head(3))
(812, 6)
   emp_id            name   department  hire_date  status location
0    1001   García, María           HR 2019-03-15  Active  Chicago
1    1002  Johnson, Tyler    Marketing 2020-07-22  Active   Denver
2    1003   Chen, Wei-Lin  Engineering 2018-11-08  Active  Chicago

The accented characters display correctly now. Marcus adds a comment to his notebook: "HR system exports Latin-1 encoded CSVs. Always use encoding='latin-1' when loading."
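
When the encoding isn't documented anywhere, a small fallback loop beats guesswork. This is an illustrative sketch (the helper name and the throwaway-file demo are invented for this example): it tries UTF-8 first and falls back to Latin-1 only when decoding actually fails.

```python
import os
import tempfile

import pandas as pd

def read_csv_any(path, encodings=('utf-8', 'latin-1')):
    """Try each candidate encoding; return (DataFrame, encoding) for the first that works."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path!r}")

# Demo: write a tiny Latin-1 file standing in for hr_employees.csv
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as f:
    f.write('emp_id,name\n1001,"García, María"\n'.encode('latin-1'))
    path = f.name

df, enc = read_csv_any(path)
print(enc)                  # 'latin-1' (UTF-8 fails on the 0xed byte for 'í')
print(df.loc[0, 'name'])    # García, María
os.unlink(path)
```

The order of the candidate list matters: UTF-8 goes first because it rejects invalid byte sequences, while Latin-1 accepts any byte and so would "succeed" (wrongly) on a UTF-8 file.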

He checks the basics:

print(f"Rows: {hr.shape[0]}")
print(f"Columns: {hr.columns.tolist()}")
print(f"Missing values:\n{hr.isnull().sum()}")
Rows: 812
Columns: ['emp_id', 'name', 'department', 'hire_date',
          'status', 'location']
Missing values:
emp_id        0
name          0
department    2
hire_date     0
status        0
location     14

Two missing departments and 14 missing locations. Noted. He'll deal with those after combining everything.
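
Before setting those gaps aside, it's worth a quick look at which rows they are (recent transfers? a new office not yet in the system?). A sketch with invented values; the real check would run on hr itself:

```python
import pandas as pd

# Toy stand-in for the HR frame (values invented for illustration)
sample = pd.DataFrame({
    'emp_id': [1001, 1002, 1003],
    'department': ['HR', None, 'Engineering'],
    'location': ['Chicago', 'Denver', None],
})

# Pull out every row with a missing department or location
gaps = sample[sample[['department', 'location']].isnull().any(axis=1)]
print(gaps['emp_id'].tolist())   # [1002, 1003]
```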

The Excel File from Finance

Next up: the Finance department's Excel file.

finance = pd.read_excel('finance_compensation.xlsx')

The result looks strange — the first few rows contain the text "Greenfield Industries — Q4 Compensation Report" and "Prepared by: Finance Dept — Confidential". The actual column headers sit on the fifth row of the sheet, so the four banner rows above them need to be skipped.

finance = pd.read_excel(
    'finance_compensation.xlsx',
    skiprows=4
)
print(finance.shape)
print(finance.columns.tolist())
(815, 5)
['Employee ID', 'Annual Salary', 'Bonus Pct',
 'Benefits Tier', 'Pay Grade']

Good — 815 rows and clean column headers. But Marcus notices the column name is Employee ID, not emp_id like in the HR data. He'll need to rename it before merging. He also notices 815 rows versus HR's 812 — a discrepancy that could mean terminated employees are still in Finance's records, or new hires haven't been added to HR yet. He makes a note.

But there's another problem. The workbook might have multiple sheets:

all_sheets = pd.read_excel(
    'finance_compensation.xlsx',
    sheet_name=None
)
print(f"Sheets: {list(all_sheets.keys())}")
Sheets: ['Current', 'Terminated', 'Notes']

Three sheets. He loaded only "Current" by default (pandas reads the first sheet when sheet_name isn't specified), which is what he wants. He also notes the "Terminated" sheet: if some of Finance's 815 "Current" rows belong to employees HR has already moved off its roster, or if the two systems are simply updated on different schedules, that would account for the 815-versus-812 gap. Worth checking before Friday.
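
One quick way to chase that discrepancy is a set difference on the id columns. A sketch with invented ids; on the real data the two Series would be finance['emp_id'] and hr['emp_id']:

```python
import pandas as pd

# Invented ids standing in for the real columns
hr_ids = pd.Series([1001, 1002, 1003])
finance_ids = pd.Series([1001, 1002, 1003, 1004])

# Who does Finance pay that HR doesn't know about?
only_in_finance = sorted(set(finance_ids) - set(hr_ids))
print(only_in_finance)   # [1004]
```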

finance = finance.rename(
    columns={'Employee ID': 'emp_id'}
)

The JSON from Operations

Finally, the Operations department's JSON file. Marcus knows from Chapter 12 that JSON can be nested, so he opens it cautiously:

import json

with open('operations_roster.json', 'r') as f:
    raw = json.load(f)

print(type(raw))
print(len(raw))
<class 'list'>
450

It's a list of 450 records. Marcus inspects the first one:

print(json.dumps(raw[0], indent=2))
{
  "id": 1003,
  "shift": "Day",
  "certifications": ["forklift", "hazmat"],
  "supervisor": {
    "name": "Rivera, Carlos",
    "id": 2001
  },
  "performance": {
    "last_review": "2024-01-15",
    "rating": 4.2
  }
}

Nested data. The supervisor and performance fields contain sub-objects, and certifications is a list. A simple pd.read_json() would leave dictionaries in the cells. Marcus reaches for json_normalize:

ops = pd.json_normalize(raw, sep='_')
print(ops.columns.tolist())
['id', 'shift', 'certifications',
 'supervisor_name', 'supervisor_id',
 'performance_last_review', 'performance_rating']

The nested fields have been flattened. supervisor.name became supervisor_name, and performance.rating became performance_rating. The certifications column still contains lists (e.g., ['forklift', 'hazmat']), but that's okay for now — Marcus can handle that later if needed.
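
If those certification lists do need analysis later, pandas's DataFrame.explode turns each list element into its own row. A sketch on invented rows:

```python
import pandas as pd

# Toy stand-in for the flattened Operations frame
ops_toy = pd.DataFrame({
    'emp_id': [1003, 1004],
    'certifications': [['forklift', 'hazmat'], ['forklift']],
})

# One row per (employee, certification) pair
tidy = ops_toy.explode('certifications')
print(tidy['certifications'].tolist())   # ['forklift', 'hazmat', 'forklift']
```

The emp_id values repeat once per certification, which is exactly the "long" shape that counting and grouping operations expect.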

He renames the ID column to match:

ops = ops.rename(columns={'id': 'emp_id'})

The Merge

Now Marcus has three DataFrames with a shared emp_id column. Time to combine them:

# Start with HR as the base (most complete employee info)
combined = hr.merge(
    finance[['emp_id', 'Annual Salary', 'Bonus Pct',
             'Benefits Tier', 'Pay Grade']],
    on='emp_id',
    how='left'
)

combined = combined.merge(
    ops[['emp_id', 'shift', 'performance_rating']],
    on='emp_id',
    how='left'
)

print(f"Combined shape: {combined.shape}")
print(f"Missing values:\n{combined.isnull().sum()}")
Combined shape: (812, 12)
Missing values:
emp_id                0
name                  0
department            2
hire_date             0
status                0
location             14
Annual Salary         8
Bonus Pct            23
Benefits Tier         8
Pay Grade             8
shift               362
performance_rating  362

The 362 missing values in shift and performance_rating make sense: Operations' roster covers only 450 of the 812 employees, and since 812 - 450 = 362, every Operations record matched an HR row, leaving exactly 362 employees (in Marketing, HR, Finance, and so on) with no Operations-specific data.
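
Rather than inferring match quality from null counts, merge can report it directly through its indicator parameter. A sketch with invented miniature frames:

```python
import pandas as pd

# Invented miniature versions of the real frames
hr_toy = pd.DataFrame({'emp_id': [1, 2, 3]})
fin_toy = pd.DataFrame({'emp_id': [1, 2, 4], 'salary': [50, 60, 70]})

# The added _merge column labels each row: 'both', 'left_only', or 'right_only'
audit = hr_toy.merge(fin_toy, on='emp_id', how='left', indicator=True)
print(audit['_merge'].value_counts().to_dict())
# {'both': 2, 'left_only': 1, 'right_only': 0}
```

On the real data, filtering for left_only rows would list exactly which employees are missing compensation records.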

The Report

Marcus creates a summary for Dana:

summary = combined.groupby('department').agg(
    headcount=('emp_id', 'count'),
    avg_salary=('Annual Salary', 'mean'),
    avg_rating=('performance_rating', 'mean')
).round(2)

print(summary)
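
One subtlety worth knowing here: by default, groupby silently drops rows whose grouping key is missing, so the two employees with no recorded department would vanish from the summary. The dropna=False parameter keeps them visible. A sketch on invented rows:

```python
import pandas as pd

# Invented rows, including one with a missing department
staff = pd.DataFrame({
    'emp_id': [1, 2, 3],
    'department': ['HR', None, 'HR'],
})

# Default: the row with a missing key disappears from the groups
print(staff.groupby('department').size().to_dict())            # {'HR': 2}

# dropna=False keeps the missing-department row as its own group
print(staff.groupby('department', dropna=False).size().sum())  # 3
```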

He documents everything in his notebook: which files came in which format, what parameters he needed for each, what data quality issues he found, and how he resolved them.

What Marcus Learned

This "simple" assignment taught Marcus several things that the textbook examples didn't:

  1. Real data comes in mixed formats. You can't control what format other departments use. You need to be comfortable with all of them.

  2. The first load almost never works. The HR file needed encoding, the Finance file needed skiprows, and the Operations file needed json_normalize. Expect to iterate.

  3. Column names are never consistent. emp_id vs. Employee ID vs. id — the same concept had three different names across three files. Renaming and standardizing is part of the job.

  4. Row counts tell a story. HR had 812 rows, Finance had 815, and Operations had 450. Understanding why these numbers differ is as important as the merge itself.

  5. Document your loading decisions. Six months from now, when Marcus needs to refresh this report, he'll be glad his notebook explains that the HR file needs encoding='latin-1' and the Finance file needs skiprows=4.
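
Those loading decisions can go one step beyond comments: wrapped in small functions, the notebook becomes a process anyone can rerun. A hypothetical sketch (the function names are invented for this example), with a quick smoke test on a throwaway JSON file:

```python
import json
import os
import tempfile

import pandas as pd

def load_hr(path):
    # HR's system exports Latin-1 encoded CSVs
    return pd.read_csv(path, encoding='latin-1')

def load_finance(path):
    # Finance's workbook has a four-row title banner above the header
    return pd.read_excel(path, skiprows=4).rename(
        columns={'Employee ID': 'emp_id'})

def load_ops(path):
    # Operations ships nested JSON; flatten it and align the id column
    with open(path) as f:
        records = json.load(f)
    return pd.json_normalize(records, sep='_').rename(
        columns={'id': 'emp_id'})

# Smoke test with a tiny invented roster
sample = [{'id': 1003, 'shift': 'Day',
           'supervisor': {'name': 'Rivera, Carlos'}}]
with tempfile.NamedTemporaryFile('w', suffix='.json',
                                 delete=False) as f:
    json.dump(sample, f)
    tmp = f.name
ops_df = load_ops(tmp)
os.unlink(tmp)
print(ops_df.columns.tolist())   # ['emp_id', 'shift', 'supervisor_name']
```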

When Marcus presents the consolidated report on Friday, Dana is impressed — not just by the output, but by the documentation. "Most analysts would have just emailed me a spreadsheet," she says. "You gave me a reproducible process." Marcus makes a mental note: the notebook isn't just for him. It's for the next person who needs to do this work.


Discussion Questions

  1. Why did Marcus use how='left' in both merges instead of how='inner'? What would have been different with an inner join?

  2. The Operations JSON had a certifications field containing lists like ['forklift', 'hazmat']. How would you transform this into a format suitable for analysis? (Hint: think about what you learned in Chapter 9 about reshaping.)

  3. If Marcus needs to run this consolidation every month, what would you suggest to make the process more efficient and less error-prone?