> "The most important asset a company has walks out the door every evening. HR analytics helps you understand why some of them don't come back."
In This Chapter
- What You Will Learn
- 30.1 The Ethics of People Data
- 30.2 Building the Dataset
- 30.3 HR Metrics That Matter
- 30.4 Turnover Analysis
- 30.5 Compensation Analysis
- 30.6 Absenteeism Tracking
- 30.7 Headcount Planning and FTE Calculations
- 30.8 Visualizing HR Data
- 30.9 Pandas Pivot Tables for HR Summaries
- 30.10 Building the HR Dashboard
- 30.11 The South Region Warehouse Problem
- 30.12 Summary
- Chapter Review Questions
- Key Terms
Chapter 30: HR Analytics and People Data
"The most important asset a company has walks out the door every evening. HR analytics helps you understand why some of them don't come back."
People are the most complex data problem most businesses ever face. Unlike inventory items or revenue transactions, employees have motivations, life circumstances, and reactions to management that no spreadsheet column can fully capture. And yet the data that surrounds the employment relationship — hire dates, compensation, tenure, absence patterns, separations — contains real signal that can meaningfully improve how organizations support and retain their people.
This chapter builds a Python toolkit for HR analytics. You will learn how to measure turnover, analyze compensation fairly, project headcount needs, and visualize workforce data responsibly. The emphasis on "responsibly" is not an afterthought. HR data is among the most sensitive data any organization holds. We will treat it with the care it deserves: aggregated, anonymized, and interpreted with appropriate humility about what numbers can and cannot tell us about people.
By the end of the chapter, you will have helped Priya Okonkwo analyze a troubling turnover spike in Acme Corp's South region warehouse — and you will have the tools to do the same kind of analysis for your own organization.
What You Will Learn
- Why HR analytics requires special ethical and privacy considerations
- Building a synthetic HR dataset that mirrors real workforce complexity
- Calculating and interpreting turnover (attrition) rates
- Analyzing turnover by department, tenure bucket, and season
- Compensation analysis: median vs. mean, pay bands, compa-ratio
- Absenteeism tracking and rate calculation
- Headcount planning: FTE calculations and growth projections
- Using pandas pivot tables for HR summaries
- Visualizing HR data: tenure histogram, turnover bar chart, salary distribution
- Understanding what HR benchmarks mean and where to find legitimate ones
30.1 The Ethics of People Data
Before writing a single line of Python, we need to talk about what makes HR analytics different from sales analytics or financial analytics.
HR data is sensitive. An employee's salary, performance rating, medical absence pattern, or demographic group is information that can affect their livelihood, their relationships at work, and their sense of dignity and safety. When you query an HR database, you are not just processing numbers — you are handling information that real people entrusted to their employer under the implicit assumption it would be used responsibly.
The fundamental principle is this: always work with aggregated data, never individual-level data, in any report or dashboard that will be shared outside of authorized HR systems.
This has concrete implications:
- Never output a table that shows a specific employee's salary alongside their name or ID
- Never create a chart that indirectly identifies an individual (e.g., a chart showing salaries by job title in a department with only two people in that title)
- Apply a minimum group size rule: if a group has fewer than five people, suppress the data or combine it with an adjacent group
- Store HR data with appropriate access controls — not in shared folders where any analyst can read it
A second principle: aggregate, then analyze. Build your analysis functions to operate on groups from the start, rather than operating on individual records and then aggregating. This is both better privacy practice and better analysis practice.
# The privacy principle in code
MIN_GROUP_SIZE_FOR_REPORTING = 5
def safe_group_report(df, group_col, metric_col):
"""Return group statistics only when group size exceeds minimum."""
grouped = (
df.groupby(group_col)[metric_col]
.agg(count="count", median="median", p25=lambda x: x.quantile(0.25),
p75=lambda x: x.quantile(0.75))
.reset_index()
)
# Suppress small groups rather than expose them
grouped.loc[grouped["count"] < MIN_GROUP_SIZE_FOR_REPORTING,
["median", "p25", "p75"]] = None
return grouped
This is not bureaucracy — it is responsible professional practice. If you are uncertain whether a particular analysis output could identify individuals, err on the side of suppression or ask your legal or HR team.
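To see the suppression rule in action, here is a minimal self-contained sketch; the toy departments and salaries are invented for illustration:

```python
import pandas as pd

MIN_GROUP_SIZE = 5  # same threshold as MIN_GROUP_SIZE_FOR_REPORTING

# Toy data: "Ops" has 6 people, "Legal" only 2
df = pd.DataFrame({
    "dept": ["Ops"] * 6 + ["Legal"] * 2,
    "salary": [50, 52, 54, 56, 58, 60, 120, 140],
})
out = (
    df.groupby("dept")["salary"]
    .agg(count="count", median="median")
    .reset_index()
)
# Suppress any group below the minimum size instead of reporting it
out.loc[out["count"] < MIN_GROUP_SIZE, "median"] = None
print(out)  # Legal's median is suppressed (NaN); Ops reports 55.0
```

The pattern generalizes: compute counts alongside every statistic, then null out the statistics wherever the count is too small.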
A Note on Benchmarks
Throughout this chapter, when we compare Acme's metrics to industry benchmarks, we will note where those benchmarks come from. Turnover rates, compensation ranges, and absence rates vary significantly by industry, company size, region, and role. When you use benchmarks in your own work, cite authoritative sources (Society for Human Resource Management, Bureau of Labor Statistics, Mercer, Willis Towers Watson) rather than making up plausible-sounding numbers.
30.2 Building the Dataset
Acme Corp does not store employee data in CSV format — it lives in an HRIS (Human Resources Information System), typically Workday, ADP, or BambooHR, with strict access controls. For learning purposes, we will generate a synthetic dataset that mirrors real workforce characteristics.
import pandas as pd
import numpy as np
from datetime import date, timedelta
def generate_employee_dataset(n_employees=250, seed=42) -> pd.DataFrame:
"""Generate a realistic synthetic employee dataset.
Creates plausible but entirely fictional employee records with
realistic patterns: tenure distributions, salary correlations,
and department-specific characteristics.
Args:
n_employees: Approximate headcount to generate.
seed: Random seed for reproducibility.
Returns:
DataFrame with one row per current employee.
"""
rng = np.random.default_rng(seed)
departments = {
"Sales": {"headcount": 45, "salary_range": (52_000, 105_000)},
"Warehouse & Fulfillment": {"headcount": 80, "salary_range": (38_000, 62_000)},
"Customer Service": {"headcount": 35, "salary_range": (40_000, 65_000)},
"IT & Systems": {"headcount": 22, "salary_range": (70_000, 130_000)},
"Finance & Accounting": {"headcount": 18, "salary_range": (60_000, 110_000)},
"HR & Administration": {"headcount": 14, "salary_range": (48_000, 85_000)},
"Marketing": {"headcount": 12, "salary_range": (55_000, 95_000)},
"Operations": {"headcount": 24, "salary_range": (55_000, 90_000)},
}
reference_date = date(2023, 12, 31)
rows = []
emp_id = 1001
for dept, info in departments.items():
sal_low, sal_high = info["salary_range"]
for _ in range(info["headcount"]):
# Tenure: exponential distribution (most employees newer, some long-tenured)
years_back = min(float(rng.exponential(4.5)), 22)
hire_date = reference_date - timedelta(days=int(years_back * 365.25))
tenure_years = (reference_date - hire_date).days / 365.25
# Salary correlates with tenure, with noise
tenure_pct = min(tenure_years / 15, 1.0)
salary = sal_low + tenure_pct * (sal_high - sal_low) * 0.4 + rng.uniform(
-(sal_high - sal_low) * 0.1, (sal_high - sal_low) * 0.2
)
annual_salary = max(sal_low, min(sal_high, round(salary / 1000) * 1000))
rows.append({
"employee_id": f"EMP-{emp_id:04d}",
"department": dept,
"location": rng.choice(["North", "South", "East", "West"]),
"job_level": rng.choice(
["Individual Contributor", "Senior IC", "Lead", "Manager"],
p=[0.50, 0.30, 0.12, 0.08],
),
"hire_date": pd.Timestamp(hire_date),
"tenure_years": round(tenure_years, 2),
"annual_salary": annual_salary,
"employment_type": rng.choice(
["Full-time", "Part-time"], p=[0.88, 0.12]
),
"absence_days_ytd": int(rng.poisson(
6.5 if dept == "Warehouse & Fulfillment" else 4.5
)),
})
emp_id += 1
return pd.DataFrame(rows)
employee_df = generate_employee_dataset()
print(f"Generated {len(employee_df)} employee records")
print(employee_df.dtypes)
Notice the absence_days_ytd generation: warehouse employees have a higher absence rate than other departments. This is intentional — it reflects a realistic pattern where physically demanding roles tend to have higher absence rates. When we analyze this later, the data will show something meaningful.
The Separations History Table
Current employee data alone only tells you who stayed. You also need a separations history — a record of every employee who left the company during the analysis period.
def generate_separations_history(employee_df, separation_rate=0.18, seed=42):
    """Generate a history of employee separations.

    Produces realistic patterns: higher turnover in Warehouse,
    in the first year of tenure, and in Q3 (summer).

    Args:
        employee_df: Current employee DataFrame.
        separation_rate: Annual turnover rate to simulate.
        seed: Random seed.

    Returns:
        DataFrame with departure records for 2022 and 2023.
    """
    rng = np.random.default_rng(seed + 100)
    reasons = ["Voluntary Resignation", "Involuntary Termination",
               "Retirement", "End of Contract", "Mutual Agreement"]
    reason_probs = [0.55, 0.20, 0.08, 0.10, 0.07]
    # Oversample Warehouse so its turnover *rate* runs above the company average
    sample_weights = np.where(
        employee_df["department"] == "Warehouse & Fulfillment", 2.0, 1.0
    )
    # Weight separation months toward Q3 (Jul-Sep) to create a summer spike
    month_weights = np.array(
        [1.0, 0.8, 0.9, 0.9, 1.0, 1.1, 1.6, 1.4, 1.2, 0.9, 0.7, 0.6]
    )
    month_probs = month_weights / month_weights.sum()
    rows = []
    for year in [2022, 2023]:
        n_separations = int(len(employee_df) * separation_rate)
        sampled = employee_df.sample(
            n=n_separations,
            weights=sample_weights,
            random_state=int(rng.integers(0, 99999)),
        )
        for _, emp in sampled.iterrows():
            # Tenure at separation: exponential, skewed toward the first year
            years_tenure = max(0.05, min(float(rng.exponential(2.5)), 15))
            month = int(rng.choice(np.arange(1, 13), p=month_probs))
            day = int(rng.integers(1, 29))  # valid day in every month
            sep_date = date(year, month, day)
            rows.append({
                "department": emp["department"],
                "location": emp["location"],
                "separation_date": pd.Timestamp(sep_date),
                "separation_reason": rng.choice(reasons, p=reason_probs),
                "tenure_at_separation_years": round(years_tenure, 2),
                "year": year,
                "quarter": (month - 1) // 3 + 1,
                "month": month,
            })
    return pd.DataFrame(rows)
separations_df = generate_separations_history(employee_df)
print(f"Generated {len(separations_df)} separation records")
30.3 HR Metrics That Matter
There are dozens of HR metrics, but a handful matter consistently across most organizations. Here is what each measures and why it is useful:
Turnover Rate (Attrition Rate)
The most fundamental HR metric. Turnover rate = separations / average headcount × 100.
def calculate_turnover_rate(separations, avg_headcount, period="annual"):
"""Calculate employee turnover rate for a given period.
Turnover rate formula:
(Number of separations / Average headcount) x 100
    The denominator should be average headcount ((start + end) / 2), not a
    single snapshot. For simplicity in this chapter, we use current
    headcount as a proxy.
Args:
separations: DataFrame of separations.
avg_headcount: Average headcount during the period.
period: Label for the period.
Returns:
Dictionary with overall, voluntary, and involuntary rates.
"""
total = len(separations)
voluntary = len(separations[
separations["separation_reason"] == "Voluntary Resignation"
])
return {
"period": period,
"total_separations": total,
"voluntary_separations": voluntary,
"total_turnover_rate": round(total / avg_headcount * 100, 1),
"voluntary_turnover_rate": round(voluntary / avg_headcount * 100, 1),
}
The distinction between voluntary and involuntary turnover matters enormously. Voluntary resignations indicate employees choosing to leave — a signal about culture, management, compensation, or competing opportunities. Involuntary terminations reflect performance management and restructuring decisions. You respond to them very differently.
Benchmark context (from SHRM and BLS data): U.S. overall voluntary turnover rates hover around 12–15% annually in normal economic conditions, with significant variation by industry. Warehouse and logistics roles typically run 25–35% due to physically demanding work and competitive local labor markets. IT roles can range from 8–20% depending on skill scarcity.
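As a quick worked example (start and end headcounts invented), the average-headcount denominator behaves like this:

```python
import pandas as pd

# Toy year: 9 leavers, 6 of them voluntary
seps = pd.DataFrame({
    "separation_reason": ["Voluntary Resignation"] * 6
                         + ["Involuntary Termination"] * 3
})
avg_headcount = (48 + 52) / 2  # (start + end) / 2 = 50

total = len(seps)
voluntary = (seps["separation_reason"] == "Voluntary Resignation").sum()
print(f"Total turnover:     {total / avg_headcount:.1%}")      # 18.0%
print(f"Voluntary turnover: {voluntary / avg_headcount:.1%}")  # 12.0%
```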
Time-to-Hire
def time_to_hire(job_openings_df):
"""Calculate average and median days from job opening to accepted offer.
Args:
job_openings_df: DataFrame with open_date and hire_date columns.
Returns:
Dictionary with time-to-hire statistics.
"""
df = job_openings_df.copy()
df["days_to_hire"] = (
pd.to_datetime(df["hire_date"]) - pd.to_datetime(df["open_date"])
).dt.days
filled = df.dropna(subset=["hire_date"])
return {
"avg_days_to_hire": round(filled["days_to_hire"].mean(), 1),
"median_days_to_hire": round(filled["days_to_hire"].median(), 1),
"min_days": int(filled["days_to_hire"].min()),
"max_days": int(filled["days_to_hire"].max()),
"positions_filled": len(filled),
"open_positions": len(df) - len(filled),
}
Time-to-hire measures how quickly the organization can bring in talent. Slow time-to-hire means projects stall, remaining employees absorb extra work (which increases their own turnover risk), and revenue-generating positions stay unfilled. SHRM benchmarks suggest average time-to-hire ranges from 24 to 45 days depending on role complexity.
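The same calculation on a tiny hypothetical requisition log (dates invented for illustration):

```python
import pandas as pd

# Hypothetical requisition log; one role is still open (no hire_date)
reqs = pd.DataFrame({
    "open_date": ["2023-01-10", "2023-02-01", "2023-03-15"],
    "hire_date": ["2023-02-14", "2023-03-20", None],
})
reqs["days_to_hire"] = (
    pd.to_datetime(reqs["hire_date"]) - pd.to_datetime(reqs["open_date"])
).dt.days
filled = reqs.dropna(subset=["hire_date"])
print(f"Avg days to hire: {filled['days_to_hire'].mean():.1f}")  # 41.0
print(f"Still open: {len(reqs) - len(filled)}")                  # 1
```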
Tenure Distribution
Understanding how long employees have been with the company reveals organizational knowledge risk — what percentage of institutional knowledge walks out the door if your longest-tenured employees leave.
def tenure_distribution(df, tenure_col="tenure_years"):
"""Compute tenure distribution and summary statistics.
Args:
df: Employee DataFrame.
tenure_col: Column containing tenure in years.
Returns:
Tuple of (bucket_df, summary_dict).
"""
bins = [0, 1, 2, 3, 5, 10, 100]
labels = ["< 1 year", "1–2 years", "2–3 years", "3–5 years",
"5–10 years", "10+ years"]
df = df.copy()
df["tenure_bucket"] = pd.cut(
df[tenure_col], bins=bins, labels=labels, right=False
)
bucket_df = (
df.groupby("tenure_bucket", observed=True)
.size()
.reset_index(name="count")
)
bucket_df["pct_of_workforce"] = (
bucket_df["count"] / bucket_df["count"].sum() * 100
).round(1)
summary = {
"median_tenure_years": round(df[tenure_col].median(), 1),
"mean_tenure_years": round(df[tenure_col].mean(), 1),
"pct_under_1_year": round((df[tenure_col] < 1).sum() / len(df) * 100, 1),
"pct_over_5_years": round((df[tenure_col] >= 5).sum() / len(df) * 100, 1),
}
return bucket_df, summary
bucket_df, summary = tenure_distribution(employee_df)
print(f"Median tenure: {summary['median_tenure_years']} years")
print(f"Employees under 1 year: {summary['pct_under_1_year']}%")
print(f"Employees over 5 years: {summary['pct_over_5_years']}%")
For Acme Corp, median tenure of 3.8 years tells a story. The company has some experienced core, but a significant proportion of newer employees. The tenure histogram will show this visually.
30.4 Turnover Analysis
Breaking down turnover by department, tenure, and season reveals patterns that company-wide averages hide.
Turnover by Department
def turnover_by_department(separations, headcount_by_dept):
"""Calculate turnover rate by department.
Args:
separations: Separations DataFrame with 'department' column.
headcount_by_dept: Series indexed by department with counts.
Returns:
DataFrame with turnover rates per department.
"""
sep_counts = (
separations.groupby("department")
.size()
.reset_index(name="separations")
)
    result = sep_counts.merge(
        headcount_by_dept.reset_index(name="headcount"),
        on="department",
        how="left",
    )
result["turnover_rate"] = (
result["separations"] / result["headcount"] * 100
).round(1)
return result.sort_values("turnover_rate", ascending=False)
# Calculate 2023 turnover
sep_2023 = separations_df[separations_df["year"] == 2023]
dept_headcount = employee_df.groupby("department").size()
dept_turnover = turnover_by_department(sep_2023, dept_headcount)
print(dept_turnover.to_string(index=False))
For Acme, the typical output:
| Department | Separations | Headcount | Turnover Rate |
|---|---|---|---|
| Warehouse & Fulfillment | 21 | 80 | 26.3% |
| Customer Service | 7 | 35 | 20.0% |
| Sales | 7 | 45 | 15.6% |
| Operations | 3 | 24 | 12.5% |
| Finance & Accounting | 2 | 18 | 11.1% |
| IT & Systems | 2 | 22 | 9.1% |
| Marketing | 1 | 12 | 8.3% |
| HR & Administration | 1 | 14 | 7.1% |
Warehouse at 26.3% is significantly higher than the company average of 17.6% (44 separations against 250 headcount). This is not unusual for warehouse operations nationally, but it still represents real cost and risk.
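To translate that risk into budget terms, a back-of-the-envelope sizing; the salary figure is illustrative, and the 50% replacement-cost multiplier is the conservative end of the commonly cited 50-200%-of-salary range:

```python
# Assumptions: 21 warehouse separations (per the table above), an
# illustrative $48,000 median warehouse salary, and a conservative
# 50%-of-salary replacement cost per leaver
warehouse_separations = 21
median_warehouse_salary = 48_000
low_estimate = warehouse_separations * median_warehouse_salary * 0.5
print(f"Conservative annual turnover cost: ${low_estimate:,.0f}")  # $504,000
```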
Turnover by Tenure Bucket
When do employees leave? Early-tenure turnover (within the first year) often signals onboarding problems, unmet expectations, or management issues. Late-tenure turnover more often reflects retirement or career advancement.
def turnover_by_tenure_bucket(separations):
"""Analyze separations by employee tenure at time of leaving.
Args:
separations: Separations DataFrame with tenure_at_separation_years.
Returns:
DataFrame with counts by tenure bucket.
"""
bins = [0, 0.5, 1.0, 2.0, 5.0, 100]
labels = ["0–6 months", "6–12 months", "1–2 years", "2–5 years", "5+ years"]
seps = separations.copy()
seps["tenure_bucket"] = pd.cut(
seps["tenure_at_separation_years"],
bins=bins,
labels=labels,
right=True,
)
result = (
seps.groupby("tenure_bucket", observed=True)
.size()
.reset_index(name="separations")
)
result["pct_of_total"] = (
result["separations"] / result["separations"].sum() * 100
).round(1)
return result
tenure_turnover = turnover_by_tenure_bucket(sep_2023)
print(tenure_turnover.to_string(index=False))
If 40% of separations are happening in the first 6 months, that is a critical finding. Early separations are expensive (training cost is lost before the employee becomes productive), disruptive, and often preventable with better onboarding and manager support.
Seasonal Turnover Patterns
Some businesses see predictable seasonal turnover. Summer is commonly high for retail and warehouse operations. January sees "new year, new job" voluntary resignations. Tax season drives finance turnover. Understanding the pattern helps HR plan ahead.
def seasonal_turnover(separations):
"""Identify monthly and quarterly turnover patterns.
Args:
separations: Separations DataFrame with month column.
Returns:
DataFrame with seasonal index (100 = average month).
"""
monthly = (
separations.groupby("month")
.size()
.reset_index(name="separations")
)
avg_monthly = monthly["separations"].mean()
monthly["seasonal_index"] = (
monthly["separations"] / avg_monthly * 100
).round(1)
month_names = {1:"Jan",2:"Feb",3:"Mar",4:"Apr",5:"May",6:"Jun",
7:"Jul",8:"Aug",9:"Sep",10:"Oct",11:"Nov",12:"Dec"}
monthly["month_name"] = monthly["month"].map(month_names)
return monthly
seasonal = seasonal_turnover(sep_2023)
high_months = seasonal[seasonal["seasonal_index"] > 115]
print("Above-average months:")
print(high_months[["month_name", "separations", "seasonal_index"]].to_string(index=False))
A seasonal index of 135 for July means that month saw 35% more separations than the average month. That is useful for planning: HR can schedule extra recruiting in May to have candidates ready for the July spike.
30.5 Compensation Analysis
Compensation analysis answers three questions: Are we paying competitively? Is pay equitable across groups? Where does each employee sit relative to their pay band?
Median vs. Mean
Always use median rather than mean as your primary compensation measure. Compensation distributions are right-skewed — a few very high-paid executives will pull the mean upward, making the average look unrepresentative of what most employees earn.
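A quick illustration of why: one outlier is enough to drag the mean far from what a typical employee earns.

```python
import pandas as pd

# Nine staff near $60k plus one executive at $400k
salaries = pd.Series([55, 58, 60, 60, 62, 63, 65, 66, 68, 400]) * 1000
print(f"Mean:   ${salaries.mean():,.0f}")    # $95,700, pulled up by the outlier
print(f"Median: ${salaries.median():,.0f}")  # $62,500, the typical salary
```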
def compensation_analysis(df, group_col="department"):
"""Compute group-level compensation statistics using median as primary measure.
Args:
df: Employee DataFrame.
group_col: Column to group by.
Returns:
DataFrame with compensation stats per group.
"""
return (
df.groupby(group_col)["annual_salary"]
.agg(
count="count",
median="median",
mean="mean",
p25=lambda x: x.quantile(0.25),
p75=lambda x: x.quantile(0.75),
)
.reset_index()
.sort_values("median", ascending=False)
.round({"median": 0, "mean": 0, "p25": 0, "p75": 0})
)
comp_by_dept = compensation_analysis(employee_df, "department")
print(comp_by_dept.to_string(index=False))
The P25–P75 spread (interquartile range) tells you about pay dispersion within a group. A wide spread in a single department might indicate pay equity issues or a mix of very junior and very senior employees. A narrow spread suggests more consistent compensation.
Compa-Ratio
The compa-ratio compares an individual's (or group's) salary to the midpoint of the established pay band for their role. A compa-ratio of 1.0 means exactly at midpoint; below 0.85 suggests underpayment; above 1.15 suggests above-band.
def compa_ratio(salary, band_midpoint):
"""Calculate the compa-ratio for a salary vs. its pay band midpoint.
Compa-ratio = Actual Salary / Pay Band Midpoint
Values:
Below 0.85: Below midpoint — underpayment risk
0.85–1.15: Competitive range
Above 1.15: Above midpoint — potential pay creep
This function is intended for use with aggregated (group-level)
averages or medians, not to identify individual employees.
Args:
salary: Actual salary (individual or group median).
band_midpoint: Midpoint of the established pay band.
Returns:
Compa-ratio as a decimal.
"""
return round(salary / band_midpoint, 3)
# Example: Apply to department medians
acme_pay_bands = {
"Individual Contributor": (45_000, 65_000, 85_000), # min, mid, max
"Senior IC": (65_000, 85_000, 105_000),
"Lead": (75_000, 95_000, 115_000),
"Manager": (85_000, 108_000, 130_000),
}
job_comp = compensation_analysis(employee_df, "job_level")
for _, row in job_comp.iterrows():
level = row["job_level"]
if level in acme_pay_bands:
_, mid, _ = acme_pay_bands[level]
cr = compa_ratio(row["median"], mid)
status = "OK" if 0.85 <= cr <= 1.15 else ("HIGH" if cr > 1.15 else "LOW")
print(f" {level:<25} Median: ${row['median']:>8,.0f} "
f"Band Mid: ${mid:>7,} Compa-ratio: {cr:.3f} [{status}]")
Pay Equity Analysis
Pay equity analysis — comparing compensation across demographic groups — is one of the most sensitive applications of HR analytics. The code for doing this analysis is straightforward; the interpretation is not. A raw pay gap between any two groups rarely has a single cause, and the appropriate response depends heavily on context that the data alone cannot provide. Always involve HR leadership and, ideally, legal counsel before publishing pay equity findings.
# Pay equity analysis — aggregated only, never individual-level
# Use anonymized group labels rather than identifiable demographics
# NOTE: gender_anon is not part of our synthetic dataset; in practice it
# would come from a restricted HRIS extract with anonymized group labels
gender_comp = (
    employee_df.groupby(["gender_anon", "job_level"])["annual_salary"]
    .agg(count="count", median_salary="median")
    .reset_index()
)
# Only show groups meeting minimum size threshold
gender_comp = gender_comp[
gender_comp["count"] >= MIN_GROUP_SIZE_FOR_REPORTING
]
print(gender_comp.to_string(index=False))
This analysis is shown at the job level, not company-wide, because comparing raw average salaries across demographic groups without controlling for job level, tenure, and geography produces misleading results. A proper pay equity analysis requires job-level comparison with appropriate controls. This is a starting point, not a conclusion.
30.6 Absenteeism Tracking
Absence rate measures the percentage of scheduled working time lost to unplanned absence. It is a leading indicator of employee disengagement and workload issues — employees who are burning out or disengaged tend to take more unplanned absence before they eventually quit.
def absence_rate(df, absence_col="absence_days_ytd", working_days=250):
"""Calculate workforce absence rate.
Absence Rate = (Total Absence Days / (Headcount x Working Days)) x 100
Args:
df: Employee DataFrame.
absence_col: Column containing absence days in the period.
working_days: Total working days in the measurement period.
Returns:
Dictionary with company-wide rate and department breakdown.
"""
total_employees = len(df)
total_absence = df[absence_col].sum()
total_possible = total_employees * working_days
overall_rate = total_absence / total_possible * 100
dept_absence = (
df.groupby("department")
.agg(
employees=("employee_id", "count"),
total_days=(absence_col, "sum"),
median_days=(absence_col, "median"),
)
.reset_index()
)
dept_absence["absence_rate_pct"] = (
dept_absence["total_days"] / (dept_absence["employees"] * working_days) * 100
).round(2)
return {
"overall_absence_rate_pct": round(overall_rate, 2),
"avg_days_per_employee": round(total_absence / total_employees, 1),
"by_department": dept_absence.sort_values("absence_rate_pct", ascending=False),
}
absence_data = absence_rate(employee_df)
print(f"Overall absence rate: {absence_data['overall_absence_rate_pct']:.2f}%")
print(f"Average days per employee: {absence_data['avg_days_per_employee']}")
A 2.5% absence rate means, on average, 2.5% of your workforce is absent on any given day. CIPD (Chartered Institute of Personnel and Development) benchmarks suggest UK averages around 5–7 days per employee annually; US figures vary widely by industry. Warehouse and manufacturing operations typically run higher.
Note: We deliberately do not compute the Bradford Factor (a formula that weights frequent short absences more heavily than occasional long absences) at an individual level. The Bradford Factor can be useful for pattern identification in aggregate, but tracking it per individual can create a surveillance culture that damages trust. Use absence pattern analysis to identify systemic issues, not to monitor individuals.
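For reference only, the Bradford Factor formula is B = S^2 x D, where S is the number of separate absence spells and D is the total days absent. A two-line example shows why it weights frequent short absences so heavily:

```python
def bradford(spells: int, days: int) -> int:
    """Bradford Factor: B = S^2 * D."""
    return spells ** 2 * days

# Same ten days absent, very different scores
print(bradford(spells=10, days=10))  # 1000: ten separate one-day absences
print(bradford(spells=1, days=10))   # 10: a single ten-day absence
```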
30.7 Headcount Planning and FTE Calculations
Headcount planning projects future hiring needs based on expected growth and historical attrition. The output feeds directly into budget planning and recruiting pipelines.
FTE Calculation
Full-time equivalent (FTE) converts a mixed workforce into a single comparable number:
def fte_calculation(df, full_time_hours=2080, part_time_hours=1040):
"""Calculate full-time equivalent headcount.
FTE = Full-time count + (Part-time count × Part-time hours / FT hours)
Args:
df: Employee DataFrame with employment_type column.
full_time_hours: Annual hours for FT employee (default 2080 = 52×40).
part_time_hours: Annual hours for PT employee (default 1040 = 52×20).
Returns:
Dictionary with headcount breakdown and total FTE.
"""
full_time = (df["employment_type"] == "Full-time").sum()
part_time = (df["employment_type"] == "Part-time").sum()
part_time_fte = part_time * (part_time_hours / full_time_hours)
total_fte = full_time + part_time_fte
return {
"total_headcount": len(df),
"full_time_count": int(full_time),
"part_time_count": int(part_time),
"total_fte": round(total_fte, 1),
}
FTE matters for budget planning (part-time employees cost less in total compensation and benefits) and for workforce capacity calculations (projects are staffed by FTE, not by headcount).
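A quick check of the arithmetic, assuming a 220/30 full-time/part-time split (the synthetic dataset's exact split will differ):

```python
# FTE = FT count + PT count x (PT hours / FT hours)
full_time, part_time = 220, 30
part_time_fte = part_time * (1040 / 2080)  # part-timers at half of FT hours
total_fte = full_time + part_time_fte
print(total_fte)  # 235.0
```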
Headcount Projection
def headcount_projection(
current_headcount,
annual_growth_rate,
annual_attrition_rate,
periods=4,
):
"""Project quarterly headcount and hiring requirements.
Args:
current_headcount: Starting headcount.
annual_growth_rate: Expected annual headcount growth (0.05 = 5%).
annual_attrition_rate: Expected annual turnover (0.18 = 18%).
periods: Number of quarters to project.
Returns:
DataFrame with projected separations and hires per quarter.
"""
quarterly_growth = annual_growth_rate / 4
quarterly_attrition = annual_attrition_rate / 4
rows = []
headcount = current_headcount
for q in range(1, periods + 1):
expected_separations = round(headcount * quarterly_attrition)
growth_hires = round(headcount * quarterly_growth)
total_hires = expected_separations + growth_hires
end_headcount = headcount + total_hires - expected_separations
rows.append({
"quarter": f"Q{q}",
"start_headcount": headcount,
"projected_separations": expected_separations,
"growth_hires_needed": growth_hires,
"total_hires_needed": total_hires,
"end_headcount": end_headcount,
})
headcount = end_headcount
return pd.DataFrame(rows)
projection = headcount_projection(
current_headcount=250,
annual_growth_rate=0.05,
annual_attrition_rate=0.18,
periods=4,
)
print(projection.to_string(index=False))
The projection shows that even with no growth, Acme needs to hire roughly 45 people per year just to maintain headcount at 18% attrition. Adding 5% growth means approximately 58 hires in the next twelve months — about 5 per month. That is what the recruiting team needs to plan for.
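Those narrative numbers can be sanity-checked directly:

```python
# 18% attrition on 250 heads, plus 5% growth hires
replacement_hires = 250 * 0.18  # 45.0 hires just to stand still
growth_hires = 250 * 0.05       # 12.5 additional hires for growth
print(round(replacement_hires + growth_hires))  # 58 hires over the year
```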
30.8 Visualizing HR Data
The same principle applies in HR visualization as in sales visualization: the goal is clarity, not complexity. HR charts are often reviewed by non-technical executives and board members, so they must be instantly interpretable.
Tenure Histogram
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(
employee_df["tenure_years"],
bins=20,
color="#4472C4",
alpha=0.85,
edgecolor="white",
)
ax.axvline(
employee_df["tenure_years"].median(),
color="#FF6B35",
linewidth=2.5,
linestyle="--",
label=f"Median tenure: {employee_df['tenure_years'].median():.1f} years",
)
ax.set_title(
"Employee Tenure Distribution — Acme Corp 2023",
fontweight="bold",
)
ax.set_xlabel("Tenure (Years)")
ax.set_ylabel("Number of Employees")
ax.legend()
plt.tight_layout()
plt.savefig("acme_tenure_histogram.png", dpi=150)
plt.show()
A right-skewed distribution (most employees clustered at the low end) suggests a relatively young workforce or a recent hiring surge. A bimodal distribution (two peaks) often indicates a distinct cohort of long-tenured employees alongside a newer cohort — which could signal risk if the senior cohort retires simultaneously.
Turnover Rate by Department
fig, ax = plt.subplots(figsize=(10, 6))
dept_turnover_sorted = dept_turnover.sort_values("turnover_rate")
colors = [
"#E74C3C" if rate > 25 else "#F39C12" if rate > 15 else "#27AE60"
for rate in dept_turnover_sorted["turnover_rate"]
]
bars = ax.barh(
dept_turnover_sorted["department"],
dept_turnover_sorted["turnover_rate"],
color=colors,
)
ax.axvline(
dept_turnover["turnover_rate"].mean(),
color="#2C3E50",
linewidth=2,
linestyle="--",
label=f"Company avg: {dept_turnover['turnover_rate'].mean():.1f}%",
)
for bar, val in zip(bars, dept_turnover_sorted["turnover_rate"]):
ax.text(val + 0.3, bar.get_y() + bar.get_height() / 2,
f"{val:.1f}%", va="center", fontsize=9)
ax.set_title("2023 Turnover Rate by Department", fontweight="bold")
ax.set_xlabel("Annual Turnover Rate (%)")
ax.legend()
plt.tight_layout()
plt.savefig("acme_turnover_by_dept.png", dpi=150)
plt.show()
Color-coding by severity (green/yellow/red) makes the chart immediately actionable. Executives can see which departments need attention without reading any numbers.
Salary Distribution by Department
Displaying salary data visually requires care. Use group-level statistics (median, IQR) and avoid chart types that could reveal individual salaries.
# Use a horizontal bar chart with error bars showing IQR — not box plots
# (box plots can reveal individual outlier values)
salary_stats = (
employee_df.groupby("department")["annual_salary"]
.agg(
median="median",
p25=lambda x: x.quantile(0.25),
p75=lambda x: x.quantile(0.75),
)
.reset_index()
.sort_values("median")
)
fig, ax = plt.subplots(figsize=(10, 6))
y_pos = range(len(salary_stats))
ax.barh(
y_pos,
salary_stats["median"],
xerr=[
salary_stats["median"] - salary_stats["p25"],
salary_stats["p75"] - salary_stats["median"],
],
color="#70AD47",
alpha=0.85,
capsize=5,
error_kw={"elinewidth": 1.5, "ecolor": "#333"},
)
ax.set_yticks(list(y_pos))
ax.set_yticklabels(salary_stats["department"])
ax.set_title(
"Median Salary by Department\nError bars show 25th–75th percentile range",
fontweight="bold",
)
ax.set_xlabel("Annual Salary ($)")
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"${x:,.0f}"))
plt.tight_layout()
plt.savefig("acme_salary_by_dept.png", dpi=150)
plt.show()
30.9 Pandas Pivot Tables for HR Summaries
Pivot tables are particularly useful in HR reporting because they create the kind of cross-tabulation summaries that HR leaders use in board reporting.
# Headcount by department and job level
headcount_pivot = employee_df.pivot_table(
index="department",
columns="job_level",
values="employee_id",
aggfunc="count",
fill_value=0,
)
headcount_pivot["Total"] = headcount_pivot.sum(axis=1)
headcount_pivot.loc["Total"] = headcount_pivot.sum()
print(headcount_pivot)
Output (simplified):
| Department | Ind. Contrib. | Senior IC | Lead | Manager | Total |
|---|---|---|---|---|---|
| Warehouse | 48 | 24 | 6 | 2 | 80 |
| Sales | 22 | 15 | 5 | 3 | 45 |
| ... | ... | ... | ... | ... | ... |
| Total | 125 | 76 | 30 | 19 | 250 |
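As an aside, the manual `Total` row and column built above can also come from `pivot_table` itself via the `margins` parameter. A minimal sketch on a small illustrative frame (not Acme's data):

```python
import pandas as pd

# Illustrative employee frame, not Acme's real data
employee_df = pd.DataFrame({
    "employee_id": range(1, 7),
    "department": ["Warehouse", "Warehouse", "Sales", "Sales", "Sales", "Finance"],
    "job_level": ["IC", "Senior IC", "IC", "IC", "Lead", "Manager"],
})

headcount_pivot = employee_df.pivot_table(
    index="department",
    columns="job_level",
    values="employee_id",
    aggfunc="count",
    fill_value=0,
    margins=True,          # adds the Total row and column automatically
    margins_name="Total",
)
print(headcount_pivot)
```

The built-in margins avoid a subtle bug in the manual approach: if you add the `Total` column first and then sum rows, you must remember that the row sums now include the `Total` column itself.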
You can also build salary pivot tables:
# Median salary by department and employment type
salary_pivot = employee_df.pivot_table(
index="department",
columns="employment_type",
values="annual_salary",
aggfunc="median",
fill_value=None,
).round(0)
print(salary_pivot)
A critical note on salary pivots: if any cell in the pivot contains fewer than five employees, the median in that cell could reveal an individual's salary range. Always check cell sizes before sharing any salary pivot table. The pivot_table function does not automatically suppress small cells — you must add that check yourself.
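A sketch of what that check might look like. The `suppress_small_cells` helper is hypothetical (not a function from this chapter), the sample frame is illustrative, and the threshold of five follows the minimum group size principle:

```python
import pandas as pd

def suppress_small_cells(df, index, columns, values, aggfunc="median", min_n=5):
    """Pivot `values` by index/columns, masking any cell built from
    fewer than min_n employees (shown as NaN) to protect individuals."""
    stat = df.pivot_table(index=index, columns=columns,
                          values=values, aggfunc=aggfunc)
    counts = df.pivot_table(index=index, columns=columns,
                            values=values, aggfunc="count", fill_value=0)
    return stat.where(counts >= min_n)  # NaN wherever the cell is too small

# Illustrative data: the two Finance part-timers would be identifiable
emp = pd.DataFrame({
    "department": ["Finance"] * 2 + ["Sales"] * 6,
    "employment_type": ["Part-time"] * 2 + ["Full-time"] * 6,
    "annual_salary": [52000, 55000, 60000, 62000, 64000, 66000, 68000, 70000],
})
safe = suppress_small_cells(emp, "department", "employment_type", "annual_salary")
print(safe)  # Finance/Part-time median is suppressed; Sales/Full-time survives
```

Running the suppressed pivot instead of the raw one makes the privacy check part of the pipeline rather than a manual step someone can forget.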
30.10 Building the HR Dashboard
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
def build_hr_dashboard(
employee_df,
separations_df=None,
title="HR Analytics Dashboard",
save_path=None,
):
"""Generate a 4-panel HR analytics dashboard.
Panels:
Top-left: Tenure distribution histogram
Top-right: Headcount by department (horizontal bar)
Bottom-left: Median salary by department (bar with IQR)
Bottom-right: Monthly separations trend (if data provided)
Args:
employee_df: Current employee DataFrame.
separations_df: Optional separations history.
title: Dashboard title.
save_path: Optional PNG save path.
Returns:
matplotlib Figure object.
"""
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle(title, fontsize=16, fontweight="bold")
plt.subplots_adjust(hspace=0.45, wspace=0.35)
# Panel 1: Tenure histogram
axes[0, 0].hist(employee_df["tenure_years"], bins=20,
color="#4472C4", alpha=0.85, edgecolor="white")
axes[0, 0].axvline(employee_df["tenure_years"].median(),
color="#FF6B35", linewidth=2, linestyle="--",
label=f"Median: {employee_df['tenure_years'].median():.1f} yrs")
axes[0, 0].set_title("Tenure Distribution", fontweight="bold")
axes[0, 0].set_xlabel("Tenure (Years)")
axes[0, 0].set_ylabel("Employees")
axes[0, 0].legend(fontsize=9)
# Panel 2: Headcount by department
dept_counts = (
employee_df.groupby("department").size().sort_values(ascending=True)
)
axes[0, 1].barh(dept_counts.index, dept_counts.values, color="#5B9BD5")
axes[0, 1].set_title("Headcount by Department", fontweight="bold")
axes[0, 1].set_xlabel("Employees")
# Panel 3: Median salary by department
salary_stats = (
employee_df.groupby("department")["annual_salary"]
.agg(median="median", p25=lambda x: x.quantile(0.25),
p75=lambda x: x.quantile(0.75))
.reset_index()
.sort_values("median")
)
y_pos = range(len(salary_stats))
axes[1, 0].barh(
y_pos, salary_stats["median"],
xerr=[salary_stats["median"] - salary_stats["p25"],
salary_stats["p75"] - salary_stats["median"]],
color="#70AD47", alpha=0.85, capsize=4,
error_kw={"elinewidth": 1.5}
)
axes[1, 0].set_yticks(list(y_pos))
axes[1, 0].set_yticklabels(salary_stats["department"], fontsize=8)
axes[1, 0].set_title("Median Salary by Department\n(25th-75th pct range shown)",
fontweight="bold")
axes[1, 0].xaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f"${x:,.0f}")
)
# Panel 4: Monthly separations (if data available)
if separations_df is not None:
monthly_seps = (
separations_df[separations_df["year"] == separations_df["year"].max()]
.groupby("month")
.size()
.reindex(range(1, 13), fill_value=0)
)
axes[1, 1].bar(range(1, 13), monthly_seps.values, color="#ED7D31", alpha=0.85)
axes[1, 1].set_title(
f"Monthly Separations {separations_df['year'].max()}", fontweight="bold"
)
axes[1, 1].set_xlabel("Month")
axes[1, 1].set_ylabel("Separations")
axes[1, 1].set_xticks(range(1, 13))
axes[1, 1].set_xticklabels(
["J","F","M","A","M","J","J","A","S","O","N","D"]
)
if save_path:
fig.savefig(save_path, dpi=150, bbox_inches="tight")
print(f"HR dashboard saved to: {save_path}")
return fig
fig = build_hr_dashboard(
employee_df,
separations_df,
title="Acme Corp — HR Analytics Dashboard 2023",
save_path="acme_hr_dashboard.png",
)
plt.show()
30.11 The South Region Warehouse Problem
Priya has been asked to investigate elevated turnover in Acme's South region warehouse.
Marcus Webb sends Priya an email on a Tuesday morning: "South warehouse headcount is dropping. We budgeted for 22 people and currently have 16. I think turnover is the issue but I don't have hard numbers. Can you quantify this and figure out if there's a pattern?"
Priya runs a structured diagnosis:
# Step 1: Isolate South warehouse
south_warehouse_emp = employee_df[
(employee_df["location"] == "South") &
(employee_df["department"] == "Warehouse & Fulfillment")
].copy()
south_warehouse_sep = separations_df[
(separations_df["location"] == "South") &
(separations_df["department"] == "Warehouse & Fulfillment") &
(separations_df["year"] == 2023)
].copy()
# Step 2: Calculate and compare turnover rates
south_turnover = calculate_turnover_rate(
south_warehouse_sep, len(south_warehouse_emp), "South Warehouse 2023"
)
all_warehouse_sep = separations_df[
(separations_df["department"] == "Warehouse & Fulfillment") &
(separations_df["year"] == 2023)
]
all_warehouse_emp = employee_df[
employee_df["department"] == "Warehouse & Fulfillment"
]
company_warehouse_turnover = calculate_turnover_rate(
all_warehouse_sep, len(all_warehouse_emp), "All Warehouse 2023"
)
print(f"South Warehouse 2023 turnover: {south_turnover['total_turnover_rate']}%")
print(f"All Warehouse 2023 turnover: {company_warehouse_turnover['total_turnover_rate']}%")
# Step 3: When are people leaving?
seasonal = seasonal_turnover(south_warehouse_sep)
print("\nSouth Warehouse seasonal pattern:")
print(seasonal[["month_name", "separations", "seasonal_index"]].to_string(index=False))
# Step 4: How soon after hiring?
tenure_at_sep = turnover_by_tenure_bucket(south_warehouse_sep)
print("\nTenure at separation:")
print(tenure_at_sep.to_string(index=False))
# Step 5: Is this all voluntary?
vol_pct = (
(south_warehouse_sep["separation_reason"] == "Voluntary Resignation").sum()
/ len(south_warehouse_sep) * 100
)
print(f"\nVoluntary resignation rate: {vol_pct:.1f}% of South Warehouse separations")
What Priya finds:
- South warehouse turnover is 34% vs. 26% for warehouse roles company-wide
- 48% of separations happen in the 0–6 month tenure bucket (vs. a 31% company average)
- July–September (Q3) shows a seasonal index of 140+
- 71% of separations are voluntary resignations
The Q3 spike and the heavy early-tenure concentration point toward a specific pattern: summer hires who take the job as a bridge and leave when better opportunities arise, or who find the working conditions challenging and quit before completing their first year. This is a different problem from a poorly managed team, and it calls for different solutions.
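One way to verify that the Q3 spike and the early-tenure concentration describe the same people is to cross-tabulate tenure bucket against quarter of separation. This is a sketch on illustrative data; the `month` and `tenure_years` column names are assumptions about the separations schema, so adapt them to your own.

```python
import pandas as pd

# Illustrative separations records, not Acme's real data
sep = pd.DataFrame({
    "month": [7, 8, 8, 9, 2, 11, 7, 3],
    "tenure_years": [0.3, 0.2, 1.5, 0.4, 3.0, 0.5, 2.2, 4.1],
})

# Bucket the month into quarters and the tenure into coarse bands
sep["quarter"] = pd.cut(sep["month"], bins=[0, 3, 6, 9, 12],
                        labels=["Q1", "Q2", "Q3", "Q4"])
sep["tenure_bucket"] = pd.cut(sep["tenure_years"], bins=[0, 0.5, 2, 50],
                              labels=["0-6 mo", "6 mo-2 yr", "2+ yr"])

# If the Q3 spike is driven by early-tenure leavers, the Q3 x 0-6 mo
# cell should dominate this table.
pattern = pd.crosstab(sep["tenure_bucket"], sep["quarter"])
print(pattern)
```

If the counts spread evenly across tenure buckets within Q3 instead, the seasonal spike and the early exits would be two separate problems needing two separate responses.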
Priya's recommendations to Marcus:
1. Implement a structured 30-60-90 day onboarding program for all warehouse hires
2. Introduce a summer retention bonus (paid at 90 days) for new hires
3. Review Q3 hiring timing — bringing new hires on in May gives them 2+ months before the peak summer stress period
See Case Study 30-1 for the full version of this analysis.
30.12 Summary
HR analytics is not about replacing human judgment in people decisions. It is about giving that judgment better information. The metrics in this chapter — turnover rates, tenure distributions, compa-ratios, absence rates — do not tell you why people leave or whether your pay is fair. They tell you where to look and what questions to ask.
| Function | Question It Answers |
|---|---|
| `calculate_turnover_rate()` | How many people are leaving, and is it voluntary? |
| `turnover_by_department()` | Which departments are struggling with retention? |
| `turnover_by_tenure_bucket()` | Are people leaving early, or after long tenure? |
| `seasonal_turnover()` | Are there predictable months for higher attrition? |
| `tenure_distribution()` | Do we have a seasoned workforce or mostly new hires? |
| `compensation_analysis()` | How does pay compare across groups? |
| `compa_ratio()` | How are employees paid relative to their pay band? |
| `absence_rate()` | How much scheduled time is being lost to absence? |
| `headcount_projection()` | How many people do we need to hire next quarter? |
| `fte_calculation()` | What is our workforce in comparable FTE terms? |
| `hr_pivot_summary()` | What does headcount look like across dimensions? |
| `build_hr_dashboard()` | How do I show all of this visually at once? |
The ethical principles that opened this chapter are not optional — they are the foundation on which all of this work rests. When you handle HR data with care (aggregating before reporting, suppressing small groups, never exposing individual salaries), you earn the trust that allows HR analytics to be done at all. That trust is hard to build and easy to destroy.
Chapter Review Questions
- Explain why `df["salary"].mean()` is often a misleading metric for compensation analysis, and what to use instead.
- What is the difference between voluntary and involuntary turnover? Why does the distinction matter for the organizational response?
- A company's top-line turnover rate is 16%, which is below industry benchmarks. Is this necessarily a sign of good retention? What additional breakdowns would you want to see?
- What is a compa-ratio? If a role has a median compa-ratio of 0.78, what does that suggest, and what action might follow?
- Describe the minimum group size principle for HR reporting. What problem does it solve, and what should you do with groups that fall below the threshold?
- In Priya's South warehouse analysis, the turnover was 34% vs. 26% company-wide. What three additional data points made the diagnosis actionable?
- What does an FTE calculation tell you that a simple headcount does not?
- Why should headcount projections always be labeled as projections with explicit assumptions, rather than presented as forecasts?
Key Terms
Turnover rate: The percentage of employees who leave during a period, calculated as separations divided by average headcount.
Voluntary turnover: Employees who choose to resign. A leading indicator of culture, compensation, or management issues.
Involuntary turnover: Employees separated by the company (terminations, layoffs, end of contract). Reflects performance management and restructuring decisions.
Compa-ratio: An employee's (or group's) salary expressed as a fraction of the pay band midpoint. A ratio below 0.85 suggests underpayment; above 1.15 suggests above-band pay.
FTE (Full-Time Equivalent): A standardized measure of workforce size that converts part-time employees to a comparable full-time unit.
Tenure distribution: The statistical distribution of employee lengths of service. Used to assess institutional knowledge risk and onboarding effectiveness.
Absence rate: The percentage of scheduled working time lost to unplanned absence. Calculated as total absence days divided by total possible working days.
Pay band: A defined salary range for a job level, typically with a minimum, midpoint, and maximum. Used to ensure consistent and equitable compensation decisions.
Seasonal turnover index: A ratio comparing separations in a given month to the average month. An index above 100 indicates above-average separations; below 100 indicates below-average.
Minimum group size: The smallest group for which HR statistics are published in reports. Groups below this size (commonly 5) are suppressed or combined to prevent identification of individuals.