Case Study 2: Predicting Student Success — When Models Help and When They Harm


Tier 1 — Verified Concepts: This case study explores well-documented themes in educational data mining and learning analytics. The tension between using predictive models to help students and the risks of algorithmic bias in education is widely discussed in the education technology and ethics literature. The specific data and scenarios are constructed for pedagogical purposes, but the patterns and ethical concerns they illustrate are grounded in published research and documented real-world cases.


Elena's New Assignment

Elena — the data analyst we've followed throughout this book — has been asked to do something that makes her both excited and uncomfortable. Her university's provost wants a model that predicts which first-year students are at risk of failing or dropping out. The idea is simple: identify struggling students early, then offer them tutoring, counseling, or other support before it's too late.

The data is available. The university has years of records: high school GPA, SAT scores, financial aid status, first-generation college student status, declared major, dormitory assignment, and — most importantly — whether each student graduated within six years. Elena has features. She has targets. She has a supervised learning problem.

But as she starts framing the problem, questions pile up that no algorithm can answer.

Framing the Problem

Elena begins with the basics:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Simulated student records (5 years of data)
np.random.seed(42)
n = 3000

students = pd.DataFrame({
    'hs_gpa': np.random.normal(3.2, 0.5, n).clip(1.5, 4.0),
    'sat_score': np.random.normal(1100, 150, n).clip(600, 1600),
    'financial_aid': np.random.binomial(1, 0.45, n),
    'first_gen': np.random.binomial(1, 0.30, n),
    'distance_from_home': np.random.exponential(200, n),
    'stem_major': np.random.binomial(1, 0.35, n),
})

# Generate graduation outcome (influenced by features)
logit = (
    0.8 * (students['hs_gpa'] - 3.0) +
    0.003 * (students['sat_score'] - 1100) +
    -0.3 * students['first_gen'] +
    -0.2 * students['financial_aid'] +
    np.random.normal(0, 0.8, n)
)
# Threshold shifted so that roughly 72% of simulated students graduate,
# matching the graduation rate discussed below
students['graduated'] = (logit > -0.58).astype(int)

print(f"Total students: {n}")
print(f"Graduation rate: {students['graduated'].mean():.1%}")
print("\nFeature summary:")
print(students.describe().round(2))

Target: Whether the student graduates (yes/no) — this is a classification problem.

Features: High school GPA, SAT score, financial aid status, first-generation status, distance from home, STEM major declaration.

Baseline: If 72% of students graduate, always predicting "will graduate" gives 72% accuracy. The model needs to beat that, specifically by identifying the 28% who won't.

The Baseline Problem

Elena's first realization is that the baseline is deceptively strong:

# Baseline: always predict "will graduate"
baseline_accuracy = students['graduated'].mean()
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
print("(By always predicting 'will graduate')")

If 72% of students graduate, the "always predict graduation" baseline is already 72% accurate. But this baseline is useless for the provost's goal — it identifies zero at-risk students. Every student who drops out is missed.

This is Elena's first lesson about modeling: accuracy alone can be misleading when classes are imbalanced. The provost doesn't want a model that's accurate overall — they want a model that correctly identifies students who need help.
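The problem with the baseline can be made concrete with per-class metrics. A minimal sketch on simulated labels (the 72% rate mirrors the case study; the data here is a stand-in):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# Toy outcomes: 1 = graduated, 0 = dropped out (~72% graduate)
y_true = rng.binomial(1, 0.72, 1000)
y_baseline = np.ones_like(y_true)  # baseline: predict "graduate" for everyone

print(f"Baseline accuracy: {accuracy_score(y_true, y_baseline):.1%}")
# Recall on the dropout class (pos_label=0): the fraction of actual
# dropouts the baseline catches -- exactly zero, by construction
print(f"Dropout recall: {recall_score(y_true, y_baseline, pos_label=0):.1%}")
```

The baseline scores about 72% on accuracy and exactly 0% on dropout recall, which is the number the provost actually cares about.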

Building the First Model

Elena splits the data and fits a simple model (we'll learn the details of logistic regression in Chapter 27 — for now, focus on the process):

from sklearn.linear_model import LogisticRegression

X = students[['hs_gpa', 'sat_score', 'financial_aid',
              'first_gen', 'distance_from_home', 'stem_major']]
y = students['graduated']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"Training accuracy: {train_acc:.1%}")
print(f"Test accuracy: {test_acc:.1%}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")

The model achieves about 75% accuracy — only slightly better than the baseline. But the number that matters isn't overall accuracy. It's how well the model identifies the students who will struggle.
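One way to see past overall accuracy is a per-class breakdown. A sketch using synthetic data (`make_classification` stands in for the student records; the class balance mimics the ~28% dropout rate):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic stand-in: class 0 ("dropped out") is ~28% of students
X, y = make_classification(n_samples=3000, n_features=6,
                           weights=[0.28], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# The row for "dropped out" is the one that matters: its recall is
# the share of actual dropouts the model manages to flag
print(classification_report(y_te, model.predict(X_te),
                            target_names=['dropped out', 'graduated']))
```

The recall on the dropout class, not the overall accuracy, tells Elena how many at-risk students the model would actually reach.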

The First Success: Early Intervention Works

Elena presents her model to the provost. For students the model flags as "at risk," the university assigns a peer mentor and weekly check-ins. At the end of the first semester, the results are encouraging:

  • Students flagged by the model and given support: 15% higher retention rate than similar students in previous years
  • Students not flagged: no change in outcomes
  • Total cost of the intervention: about $500 per student per semester

The model isn't perfect, but it's helping. Some students who would have dropped out are staying because someone reached out. The provost is delighted.
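The pilot figures also allow a back-of-envelope cost calculation. The cohort size below is hypothetical, and the 15% lift is read as 15 percentage points of retention for illustration:

```python
flagged = 400           # hypothetical number of flagged students
cost_per_student = 500  # dollars per semester, from the pilot
retention_lift = 0.15   # assumed: extra retained students per flagged student

extra_retained = flagged * retention_lift
total_cost = flagged * cost_per_student
print(f"Extra students retained: {extra_retained:.0f}")
print(f"Cost per additional retained student: "
      f"${total_cost / extra_retained:,.0f}")
```

Against the lifetime tuition revenue of a retained student, a few thousand dollars per retention is an easy case for the provost to make.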

The First Problem: Who Gets Left Out

But then Elena looks more carefully at the model's errors. She examines which students the model misses — students who drop out but weren't flagged as at-risk.

# Analyze model errors by subgroup
predictions = model.predict(X_test)

test_data = X_test.copy()
test_data['actual'] = y_test.values
test_data['predicted'] = predictions
test_data['correct'] = (predictions == y_test.values)

# Misses: students who dropped out (actual == 0) but the model
# predicted they'd graduate -- false negatives for at-risk detection
dropped = test_data[test_data['actual'] == 0]
missed = dropped[dropped['predicted'] == 1]

print(f"Students who dropped out: {len(dropped)}")
print(f"Model missed (false negatives): {len(missed)}")
print(f"Miss rate: {len(missed)/len(dropped):.1%}")

The model misses a significant number of students who drop out. And when Elena breaks down the misses by group, a pattern emerges:

# Who does the model miss most often?
for group, label in [(0, 'Not first-gen'), (1, 'First-gen')]:
    subset = dropped[dropped['first_gen'] == group]
    missed_subset = subset[subset['predicted'] == 1]
    if len(subset) > 0:
        rate = len(missed_subset) / len(subset)
        print(f"{label}: {rate:.1%} miss rate "
              f"({len(missed_subset)}/{len(subset)})")

The model is more likely to miss at-risk first-generation students. Why? Because the features that predict dropout for first-generation students might be different from the features that predict dropout overall. A first-generation student with a decent GPA and SAT score might still struggle with the cultural transition to college — a factor that isn't in the model. The model, trained primarily on patterns from the majority group, doesn't capture these nuances.

The Second Problem: Self-Fulfilling Prophecies

There's a deeper concern. When you label a student as "at risk," what happens psychologically?

Research in education has documented the Pygmalion effect (also called the Rosenthal effect): when teachers expect students to succeed, those students tend to perform better. The reverse is also true — when students are labeled as likely to fail, they sometimes internalize that expectation.

Elena worries: if a student finds out they've been flagged as "at risk" by an algorithm, could that label itself contribute to failure? Could the model's prediction become a self-fulfilling prophecy?

This is not a theoretical concern. It's a feedback loop:

Model predicts "at risk"
    → Student is labeled as "at risk"
    → Student (or their advisor) internalizes lower expectations
    → Student's motivation or treatment changes
    → Student actually performs worse
    → Model's prediction appears correct
    → Model is reinforced

The model might achieve high accuracy not because it identified true risk factors, but because its predictions caused the outcomes it predicted.
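This dynamic can be made concrete with a toy simulation. All numbers here are hypothetical, including the assumed 0.15 boost in dropout probability from being labeled:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_risk = rng.uniform(0, 1, n)   # latent probability of dropping out
flagged = true_risk > 0.7          # model flags the riskiest ~30%

# World A: labeling has no effect; dropout probability = true risk
drop_no_label = rng.uniform(0, 1, n) < true_risk
# World B: being flagged adds 0.15 to a student's dropout probability
drop_with_label = rng.uniform(0, 1, n) < np.clip(
    true_risk + 0.15 * flagged, 0, 1)

# Precision of the flag = fraction of flagged students who drop out
print(f"Flag precision without label effect: "
      f"{drop_no_label[flagged].mean():.1%}")
print(f"Flag precision with label effect:    "
      f"{drop_with_label[flagged].mean():.1%}")
```

In the second world the model looks more precise, but only because the label itself pushed students toward the predicted outcome. The data alone cannot distinguish the two worlds.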

The Third Problem: What the Model Actually Learned

Elena examines the model's coefficients to understand what it learned:

# What did the model learn?
coefficients = pd.DataFrame({
    'feature': X.columns,
    'coefficient': model.coef_[0]
}).sort_values('coefficient')

print("Model coefficients (what it learned):")
print(coefficients.to_string(index=False))

The model learned that first-generation students and students on financial aid are more likely to drop out. These are real patterns in the data. But are they useful patterns?

Consider: being a first-generation college student doesn't cause dropout. It's a proxy for a bundle of socioeconomic factors — less familiarity with college norms, possibly less family support for college, sometimes financial stress. The model can't distinguish between these underlying causes. It just sees the correlation and uses it.

If the university uses this model to deny admission to first-generation students (rather than offering them support), the model would become a tool for reinforcing existing inequality. The same model can be used to help or to harm — the model itself doesn't know the difference.
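One empirical question Elena can answer is how much predictive power the demographic proxies actually contribute. A sketch, regenerating the simulated records from earlier and comparing cross-validated accuracy with and without them:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 3000
df = pd.DataFrame({
    'hs_gpa': rng.normal(3.2, 0.5, n).clip(1.5, 4.0),
    'sat_score': rng.normal(1100, 150, n).clip(600, 1600),
    'first_gen': rng.binomial(1, 0.30, n),
    'financial_aid': rng.binomial(1, 0.45, n),
})
logit = (0.8 * (df['hs_gpa'] - 3.0)
         + 0.003 * (df['sat_score'] - 1100)
         - 0.3 * df['first_gen'] - 0.2 * df['financial_aid']
         + rng.normal(0, 0.8, n))
y = (logit > -0.58).astype(int)

for name, feats in [('all features', list(df.columns)),
                    ('academic only', ['hs_gpa', 'sat_score'])]:
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          df[feats], y, cv=5).mean()
    print(f"{name}: {acc:.1%} cross-validated accuracy")
```

If the accuracy gap is small, dropping the proxies may cost little; if it is large, the harder question from the discussion section remains: the proxies help identify exactly the students who most need support.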

Elena's Framework: Questions Before Algorithms

Based on her experience, Elena develops a framework that she now applies before building any model that affects people:

1. Who benefits and who bears the risk?

In this case, correctly identified at-risk students benefit (they get support). But falsely labeled students might be stigmatized, and missed students lose out on help. The benefits and risks are not distributed equally across groups.

2. What would happen without the model?

Without the model, the university has no systematic early-warning system. Some advisors would notice struggling students; many wouldn't. The model, even imperfect, provides something that wasn't there before.

3. Is the model replacing human judgment or augmenting it?

Elena insists that the model's predictions go to academic advisors as one input among many, not as automated decisions. An advisor who knows a student personally might override the model — and should be able to.

4. Can we audit the model for bias?

Elena builds in regular audits: checking the model's accuracy across demographic groups, monitoring for differential miss rates, and tracking whether flagged students' outcomes improve.

5. Is the goal prediction or intervention?

The goal isn't to predict dropout — it's to prevent it. Prediction is a means to an end. If the university focused on prediction accuracy alone, it might optimize for a different model than if it focused on intervention effectiveness.
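The audit in point 4 can be sketched as a small helper: per-group recall on the dropout class. The column names and the toy data here are illustrative placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

def audit_by_group(df, group_col, actual_col='actual',
                   predicted_col='predicted'):
    """Recall on the dropout class (label 0) within each group."""
    rows = []
    for value, sub in df.groupby(group_col):
        rows.append({
            group_col: value,
            'n': len(sub),
            'dropout_recall': recall_score(
                sub[actual_col], sub[predicted_col],
                pos_label=0, zero_division=0),
        })
    return pd.DataFrame(rows)

# Toy example with random labels and predictions
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'first_gen': rng.binomial(1, 0.3, 500),
    'actual': rng.binomial(1, 0.72, 500),
    'predicted': rng.binomial(1, 0.8, 500),
})
print(audit_by_group(toy, 'first_gen'))
```

Run each semester over each demographic column, this turns "audit for bias" from a slogan into a routine report.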

The Right Level of Complexity

Elena's experience also illustrates the bias-variance tradeoff in a human context:

A very simple model (just use high school GPA) has high bias — it misses many at-risk students whose GPAs are fine but who face other challenges. But it's easy to explain and audit.

A very complex model (use 50 features including social media activity, dining hall visits, and library card swipes) might have lower bias but raises serious privacy concerns and is hard to explain or audit. It might also be more susceptible to overfitting — finding patterns in the training data that are specific to past cohorts and don't generalize.

The model Elena chose (6 features, logistic regression) balances predictive power, interpretability, and ethical tractability. It's not the most accurate model possible, but it's one the university can understand, explain to students, audit for bias, and improve over time.
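The tradeoff can be sketched on synthetic data by comparing a one-feature model, a six-feature model, and a deliberately over-flexible model padded with irrelevant features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=(n, 6))   # six genuinely informative features
noise = rng.normal(size=(n, 50))   # fifty irrelevant features
y = (signal.sum(axis=1) + rng.normal(0, 2, n) > 0).astype(int)

for name, X in [('1 feature', signal[:, :1]),
                ('6 features', signal),
                ('56 features', np.hstack([signal, noise]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    m = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"{name}: train {m.score(X_tr, y_tr):.1%}, "
          f"test {m.score(X_te, y_te):.1%}")
```

The one-feature model underfits, and the padded model shows a widening gap between training and test accuracy, the signature of overfitting to noise.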

What the University Decided

After Elena's analysis, the university adopted the following policy:

  1. The model flags students for additional support — never for reduced opportunity
  2. Flagged students are offered resources, not labeled as failures
  3. The model is audited each semester for differential performance across groups
  4. Advisors can override the model based on personal knowledge
  5. Students are never told their "risk score" — the intervention appears as standard support
  6. The model is rebuilt annually with new data

This is not a perfect solution. But it illustrates a mature approach to predictive modeling — one that recognizes the model as a tool that must be wielded carefully, not a truth machine that makes decisions automatically.

Connecting to the Chapter

This case study illustrates every major concept from Chapter 25:

  • Model as simplification: the model reduces a student's complex life to 6 numbers
  • Prediction vs. explanation: the goal is intervention (explanation matters), not just prediction
  • Features and targets: 6 features predicting graduation (yes/no)
  • Train-test split: 80/20 split to evaluate generalization
  • Baseline model: "always predict graduation" gives 72% accuracy
  • Overfitting awareness: a 50-feature model might overfit to past cohorts
  • Bias-variance tradeoff: a simple model is biased but stable; a complex model is flexible but risky
  • Ethical considerations: self-fulfilling prophecies, differential accuracy across groups

Discussion Questions

  1. Is it ethical to use demographic features (first-generation status, financial aid) in a model designed to help students? What if removing these features makes the model less accurate at identifying who needs help?

  2. How would you design the intervention so that being flagged by the model helps students without stigmatizing them?

  3. The model is more accurate for some demographic groups than others. Is it better to use an imperfect model that helps some students, or to use no model at all? How would you make that decision?

  4. Elena chose logistic regression over a more complex model partly because it's interpretable and auditable. Is there a level of accuracy improvement that would justify switching to a less interpretable model? How would you decide?

  5. How does this case study illustrate the prediction vs. explanation distinction? Which questions can the model answer, and which can it not?


Key Takeaways from This Case Study

  • Models that affect people carry ethical obligations that go beyond accuracy
  • Baseline accuracy can be deceptively high when classes are imbalanced
  • A model's errors are not distributed equally — check performance across subgroups
  • Prediction is a means to an end, not the end itself — the real goal is effective intervention
  • Simple, interpretable models have advantages beyond accuracy: they can be explained, audited, and improved
  • The same model can be used to help or to harm — the difference is in how it's deployed, not in the algorithm