Case Study 1: Are Test Scores Really Normally Distributed? Jordan Investigates
Tier 3 — Illustrative/Composite Example: This case study uses a fictional university ("Lakewood State University") and simulated data to explore the question of whether exam scores follow a normal distribution. The statistical patterns described — grade inflation, bimodal distributions, ceiling effects — are well-documented phenomena in educational research. All specific figures, course names, and character details are invented for pedagogical purposes.
The Setting
Jordan has been carrying a nagging question since Chapter 19: are the grading patterns at their university fair? They've been thinking about it informally — comparing notes with friends, noticing that some professors seem to give mostly As while others rarely give anything above a B+. But now Jordan has the tools to investigate properly.
Their university, Lakewood State, publishes anonymized grade distributions by course and section. Jordan downloaded the data back in Chapter 12 and has been cleaning and exploring it throughout the course. Now, armed with distribution theory from Chapter 21, they want to test a common assumption:
Are exam scores normally distributed?
This isn't just an academic exercise. If scores are normally distributed, then z-scores and percentiles are meaningful — a z-score of 1.5 really does mean "better than 93% of students." If scores are NOT normally distributed, those percentile interpretations could be wrong, and any grading policies based on "the curve" could be unfair.
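That 93% figure comes straight from the standard normal CDF; a quick one-line check with scipy:

```python
from scipy import stats

# Percentile implied by a z-score of 1.5 under a standard normal curve
percentile = stats.norm.cdf(1.5)
print(f"z = 1.5 means better than {percentile:.1%} of students")  # about 93.3%
```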
The Investigation
Step 1: What Should Normal Test Scores Look Like?
Jordan starts by establishing what normal test scores WOULD look like, so they have a baseline for comparison.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# What perfectly normal test scores would look like
np.random.seed(42)
ideal_normal = np.random.normal(75, 12, 500)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Histogram with normal overlay
axes[0].hist(ideal_normal, bins=30, density=True, color='steelblue',
             edgecolor='white', alpha=0.8, label='Simulated scores')
x = np.linspace(30, 120, 100)
axes[0].plot(x, stats.norm.pdf(x, 75, 12), 'r-', linewidth=2,
             label='Normal(75, 12)')
axes[0].set_title('Perfectly Normal Test Scores\n(Simulated)')
axes[0].set_xlabel('Score')
axes[0].legend()
# Q-Q plot
stats.probplot(ideal_normal, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot\n(Points should follow the line)')
plt.tight_layout()
plt.savefig('ideal_normal_scores.png', dpi=150, bbox_inches='tight')
plt.show()
Good — Jordan can see what "normal" looks like. The histogram is bell-shaped, and the Q-Q plot shows points right on the diagonal line. This is the benchmark.
Step 2: Analyzing Real Course Data
Jordan pulls data from several courses and examines each one.
# Simulated grade distributions for different courses
# Based on patterns commonly observed in educational research
np.random.seed(21)
courses = {}
# Course 1: Intro Chemistry — classic bell curve (well-designed exam)
courses['Intro Chemistry'] = {
    'scores': np.clip(np.random.normal(72, 14, 180), 0, 100),
    'expected': 'Should be roughly normal — large class, well-calibrated exam'
}
# Course 2: Advanced Writing — left-skewed (most students do well)
courses['Advanced Writing'] = {
    'scores': np.clip(100 - np.random.exponential(8, 45), 0, 100),
    'expected': 'Likely left-skewed — most students write competently'
}
# Course 3: Organic Chemistry — bimodal (some get it, some don't)
scores_pass = np.random.normal(78, 8, 60)
scores_fail = np.random.normal(45, 10, 40)
courses['Organic Chemistry'] = {
    'scores': np.clip(np.concatenate([scores_pass, scores_fail]), 0, 100),
    'expected': 'May be bimodal — threshold concept divides students'
}
# Course 4: Intro Psychology — large class, near-normal with a mild ceiling effect
courses['Intro Psychology'] = {
    'scores': np.clip(np.random.normal(68, 18, 250), 0, 100),
    'expected': 'Roughly normal — but clipping at 100 creates a slight ceiling effect'
}
# Analyze each course
fig, axes = plt.subplots(4, 3, figsize=(16, 16))
for idx, (course_name, course_data) in enumerate(courses.items()):
    scores = course_data['scores']
    # Histogram with fitted normal curve
    ax_hist = axes[idx, 0]
    ax_hist.hist(scores, bins=20, density=True, color='steelblue',
                 edgecolor='white', alpha=0.8)
    x = np.linspace(scores.min() - 5, scores.max() + 5, 100)
    ax_hist.plot(x, stats.norm.pdf(x, scores.mean(), scores.std()),
                 'r-', linewidth=2)
    ax_hist.set_title(f'{course_name}\nn={len(scores)}, mean={scores.mean():.1f}')
    ax_hist.set_xlabel('Score')
    # Q-Q plot
    ax_qq = axes[idx, 1]
    stats.probplot(scores, dist="norm", plot=ax_qq)
    ax_qq.set_title('Q-Q Plot')
    # Summary statistics
    ax_text = axes[idx, 2]
    ax_text.axis('off')
    skew = stats.skew(scores)
    kurt = stats.kurtosis(scores)
    # Subsample at random to keep Shapiro-Wilk from over-flagging tiny
    # deviations at large n; taking the *first* 50 values would bias the
    # Organic Chemistry test, since its first 60 values all come from the
    # passing cluster
    sample = np.random.choice(scores, 50, replace=False) if len(scores) > 50 else scores
    _, p_shapiro = stats.shapiro(sample)
    summary = (
        f"Mean: {scores.mean():.1f}\n"
        f"Median: {np.median(scores):.1f}\n"
        f"Std Dev: {scores.std():.1f}\n"
        f"Skewness: {skew:.2f}\n"
        f"Kurtosis: {kurt:.2f}\n"
        f"Shapiro p: {p_shapiro:.4f}\n"
        f"Normal? {'Yes' if p_shapiro > 0.05 else 'No'}\n\n"
        f"Expected:\n{course_data['expected']}"
    )
    ax_text.text(0.1, 0.9, summary, transform=ax_text.transAxes,
                 verticalalignment='top', fontsize=9, fontfamily='monospace',
                 bbox=dict(boxstyle='round', facecolor='lightyellow'))
plt.suptitle("Jordan's Grade Distribution Analysis: Four Courses", fontsize=14)
plt.tight_layout()
plt.savefig('course_distributions.png', dpi=150, bbox_inches='tight')
plt.show()
Step 3: What Jordan Finds
Jordan documents their findings for each course:
Intro Chemistry (n=180): This one is closest to normal. The histogram is roughly bell-shaped, the Q-Q plot points follow the line, and the Shapiro-Wilk test doesn't reject normality. With 180 students and a well-calibrated exam, the normal assumption is reasonable.
Advanced Writing (n=45): Left-skewed. Most students score in the 85-100 range, with a tail extending downward. The Q-Q plot curves away from the line at the left end. This makes sense — in a writing course, most students who are still enrolled by the advanced level can write competently. The exam has a ceiling effect.
Organic Chemistry (n=100): Bimodal! There are two clear clusters — one around 78 and another around 45. The Q-Q plot shows a distinctive S-shape. Jordan hypothesizes that organic chemistry has a threshold concept (understanding reaction mechanisms) that divides students into "got it" and "didn't get it" groups. The mean of 65 describes neither group.
Intro Psychology (n=250): Roughly normal but with a slight left skew (the exam is a bit easy, bunching students near the top). The Q-Q plot is mostly linear but curves at the upper end. With 250 students, even small deviations from normality are detectable by the Shapiro-Wilk test.
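That last point, that large samples make the Shapiro-Wilk test flag even tiny departures, is easy to demonstrate with simulated data (a sketch; the population, seed, and amount of skew are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# A mildly non-normal population: normal scores plus a small skewed component
population = rng.normal(70, 10, 5000) + 0.8 * rng.exponential(10, 5000)

# Same population, two sample sizes
_, p_small = stats.shapiro(population[:25])
_, p_large = stats.shapiro(population[:250])
print(f"n=25:  Shapiro p = {p_small:.4f}")
print(f"n=250: Shapiro p = {p_large:.4f}")
# The larger sample typically rejects the same mild skew that the
# smaller sample cannot detect
```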
Step 4: What About "Grading on a Curve"?
Jordan now asks a pointed question: some professors "grade on a curve," assigning letter grades based on the assumption that scores are normally distributed. How fair is this?
# Simulating "curve grading" on bimodal data
scores = courses['Organic Chemistry']['scores']
# Traditional curve: assign grades based on z-scores
mean, std = scores.mean(), scores.std()
z_scores = (scores - mean) / std
def assign_curved_grade(z):
    if z >= 1.5: return 'A'
    elif z >= 0.5: return 'B'
    elif z >= -0.5: return 'C'
    elif z >= -1.5: return 'D'
    else: return 'F'

curved_grades = [assign_curved_grade(z) for z in z_scores]
# Count grades
from collections import Counter
grade_counts = Counter(curved_grades)
print("Curved grades for Organic Chemistry:")
for grade in ['A', 'B', 'C', 'D', 'F']:
    count = grade_counts.get(grade, 0)
    print(f"  {grade}: {count} students ({count/len(scores)*100:.1f}%)")
# Visualize the problem
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Show the bimodal distribution with curve boundaries
axes[0].hist(scores, bins=25, color='steelblue', edgecolor='white', alpha=0.8)
boundaries = [mean + z * std for z in [-1.5, -0.5, 0.5, 1.5]]
colors_bg = ['#ff6b6b', '#ffd93d', '#95e1d3', '#a8d8ea', '#6c5ce7']  # F, D, C, B, A
for i, (lo, hi) in enumerate(zip([-np.inf] + boundaries, boundaries + [np.inf])):
    axes[0].axvspan(max(lo, 0), min(hi, 100), alpha=0.1, color=colors_bg[i])
axes[0].set_title('Curve Grading on Bimodal Data\n(Grade boundaries shown)')
axes[0].set_xlabel('Score')
# Show what fair grading might look like
# Based on mastery: > 70 = pass with differentiation
fixed_grades = []
for s in scores:
    if s >= 90: fixed_grades.append('A')
    elif s >= 80: fixed_grades.append('B')
    elif s >= 70: fixed_grades.append('C')
    elif s >= 60: fixed_grades.append('D')
    else: fixed_grades.append('F')
fixed_counts = Counter(fixed_grades)
x = range(5)
axes[1].bar([i - 0.2 for i in x],
            [grade_counts.get(g, 0) for g in ['A', 'B', 'C', 'D', 'F']],
            width=0.35, color='steelblue', label='Curved', alpha=0.8)
axes[1].bar([i + 0.2 for i in x],
            [fixed_counts.get(g, 0) for g in ['A', 'B', 'C', 'D', 'F']],
            width=0.35, color='coral', label='Fixed scale', alpha=0.8)
axes[1].set_xticks(x)
axes[1].set_xticklabels(['A', 'B', 'C', 'D', 'F'])
axes[1].set_ylabel('Number of Students')
axes[1].set_title('Curved vs. Fixed-Scale Grading')
axes[1].legend()
plt.tight_layout()
plt.savefig('curve_grading_problem.png', dpi=150, bbox_inches='tight')
plt.show()
Jordan's key finding: when scores are bimodal, curve grading forces students into grade categories that don't reflect their actual performance. Students who scored 75 (in the "got it" group) and students who scored 55 (in the "didn't get it" group) might both receive Cs, because the curve sees only each student's distance from the overall mean of about 65, not which cluster they belong to. The curve assumes normality, and when that assumption is violated, the resulting grades are distorted.
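To see the arithmetic behind this, one can re-simulate the same mixture shape (cluster centers of 78 and 45, as above; the seed is arbitrary) and compute where the C band falls:

```python
import numpy as np

rng = np.random.default_rng(21)
# Same bimodal shape as the Organic Chemistry simulation
scores = np.clip(np.concatenate([rng.normal(78, 8, 60),
                                 rng.normal(45, 10, 40)]), 0, 100)
mean, std = scores.mean(), scores.std()

# C band under a +/-0.5 z-score curve: [mean - 0.5*std, mean + 0.5*std]
c_low, c_high = mean - 0.5 * std, mean + 0.5 * std
print(f"C band: {c_low:.1f} to {c_high:.1f}")
# The band typically runs from the mid-50s to the low-70s, so it catches
# the bottom of the "got it" cluster (center 78) and the top of the
# "didn't get it" cluster (center 45) alike
```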
Step 5: The CLT Angle
But wait — even though individual exam scores aren't always normal, what about averages? Jordan's university reports average GPAs by department. Are those approximately normal?
# Simulate department-level GPA averages
np.random.seed(42)
n_departments = 50
dept_sizes = np.random.randint(20, 200, n_departments)
# Each department's average GPA is the mean of individual GPAs
# Individual GPAs are NOT normal (bounded 0-4, often left-skewed)
dept_means = []
for size in dept_sizes:
    # Simulate individual GPAs (left-skewed, bounded at 0 and 4)
    individual_gpas = np.clip(4 - np.random.exponential(0.5, size), 0, 4)
    dept_means.append(np.mean(individual_gpas))
dept_means = np.array(dept_means)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(dept_means, bins=15, density=True, color='steelblue',
             edgecolor='white', alpha=0.8)
x = np.linspace(dept_means.min(), dept_means.max(), 100)
axes[0].plot(x, stats.norm.pdf(x, dept_means.mean(), dept_means.std()),
             'r-', linewidth=2, label='Normal fit')
axes[0].set_title(f'Department Average GPAs\n(n={n_departments} departments)')
axes[0].set_xlabel('Average GPA')
axes[0].legend()
stats.probplot(dept_means, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot of Department Averages')
plt.tight_layout()
plt.savefig('dept_gpa_clt.png', dpi=150, bbox_inches='tight')
plt.show()
print("Individual GPAs are left-skewed (not normal)")
print("But department averages are approximately normal — CLT in action!")
_, p_val = stats.shapiro(dept_means)
print(f"Shapiro-Wilk test for department averages: p = {p_val:.4f}")
The Central Limit Theorem rescues Jordan's analysis at the department level. Even though individual GPAs are not normal (they're bounded between 0 and 4 and tend to be left-skewed), the averages of departments are approximately normal. This means that z-score comparisons between departments are valid, even if individual student grade comparisons using z-scores would be questionable.
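The kind of aggregate comparison Jordan has in mind is then straightforward (a sketch re-simulating department averages of left-skewed GPAs, as above; the seed and department sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Department averages of left-skewed, bounded individual GPAs
sizes = rng.integers(20, 200, 50)
dept_means = np.array([np.clip(4 - rng.exponential(0.5, n), 0, 4).mean()
                       for n in sizes])

# Because the averages are approximately normal, a z-score for one
# department converts to a meaningful percentile
z = (dept_means[0] - dept_means.mean()) / dept_means.std()
print(f"Department 0: z = {z:.2f}, approx. percentile = {stats.norm.cdf(z):.0%}")
```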
Jordan's Conclusions
Jordan writes up their findings:
Individual exam scores are often NOT normally distributed. Of the four courses examined, only one (Intro Chemistry) had a convincingly normal distribution. The others showed left skew (Advanced Writing), bimodality (Organic Chemistry), or slight ceiling effects (Intro Psychology).
"Grading on a curve" can be unfair when the normal assumption is violated. Bimodal distributions are particularly problematic — the curve forces students from distinct performance groups into the same grade categories.
Department-level averages ARE approximately normal thanks to the Central Limit Theorem. Cross-department comparisons using z-scores are valid at the aggregate level.
Recommendation: Professors should plot their score distributions before applying any curve. If the distribution is bimodal or strongly skewed, a fixed-scale grading system may be more appropriate. The Q-Q plot is a quick and effective diagnostic.
The Lessons
Lesson 1: Always Check Your Assumptions
The normal distribution is a powerful tool, but it's a tool with assumptions. If you apply normal-based methods to non-normal data, your conclusions can be wrong. Always plot and check before computing.
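In practice, "plot and check" can be backed by a small pre-flight routine. A sketch (the helper name and its return format are invented for illustration, not from any library):

```python
import numpy as np
from scipy import stats

def normality_check(data, alpha=0.05):
    """Pre-flight check before applying normal-based methods.

    Returns skewness, excess kurtosis, and the Shapiro-Wilk p-value
    so the caller can inspect all three, not just one number.
    """
    data = np.asarray(data)
    _, p = stats.shapiro(data)
    return {
        'skewness': float(stats.skew(data)),
        'excess_kurtosis': float(stats.kurtosis(data)),
        'shapiro_p': float(p),
        'reject_normality': bool(p < alpha),
    }

rng = np.random.default_rng(7)
print(normality_check(rng.normal(75, 12, 200)))   # typically not rejected
print(normality_check(rng.exponential(10, 200)))  # typically rejected
```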
Lesson 2: The CLT Is Your Safety Net (for Means)
Even when individual data isn't normal, averages of groups tend to be normal. This is why many statistical methods work even when the data isn't perfectly bell-shaped — they're working with means, and the CLT ensures those means are approximately normal.
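A minimal demonstration of this safety net, using a heavily skewed exponential population (the seed and group size are chosen arbitrarily):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Individuals: heavily right-skewed (exponential), clearly not normal
population = rng.exponential(10, 100_000)

# Means of repeated groups of 50 individuals
group_means = np.array([rng.choice(population, 50).mean() for _ in range(500)])

print(f"Skewness of individuals: {stats.skew(population):.2f}")
print(f"Skewness of group means: {stats.skew(group_means):.2f}")
# The individuals' skewness sits near 2; the group means' is far smaller
```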
Lesson 3: Distribution Shape Tells a Story
Bimodal test scores aren't just a statistical curiosity — they reveal pedagogical information (a threshold concept that divides students). Left-skewed scores suggest an easy exam or a highly competent class. Right-skewed scores suggest a challenging exam. The shape carries meaning.
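The skewness statistic encodes these stories numerically: negative for left skew (easy exam or strong class), positive for right skew (hard exam), near zero when scores are symmetric. A sketch with three simulated archetypes (all parameters invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Three archetypal exam-score shapes and the skewness sign each produces
easy_exam = np.clip(100 - rng.exponential(8, 200), 0, 100)  # left-skewed
hard_exam = np.clip(rng.exponential(8, 200) + 40, 0, 100)   # right-skewed
balanced = np.clip(rng.normal(72, 10, 200), 0, 100)         # roughly symmetric

for name, data in [('easy', easy_exam), ('hard', hard_exam), ('balanced', balanced)]:
    print(f"{name:>8}: skewness = {stats.skew(data):+.2f}")
```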
Discussion Questions
- If you were a professor and saw bimodal exam scores, what would you conclude about the course design? What might you change?
- Is it ever appropriate to "grade on a curve"? Under what conditions would curve grading be fair?
- Jordan found that department-level averages are approximately normal. What is the minimum department size needed for this to be reliable? (Think about the CLT's sample size requirements.)
- How might grade inflation affect the normality of grade distributions? If almost everyone gets an A or B, what shape would the distribution have?
Connection to the Chapter
This case study applies the normal distribution (Section 21.5), Q-Q plots (Section 21.8), the Shapiro-Wilk test (Section 21.8), and the Central Limit Theorem (Section 21.7). It demonstrates that "assuming normality" isn't always safe for raw data but is often rescued by the CLT when working with averages — a nuance that matters throughout the rest of the book.