Case Study 1: The Calibration Experiment

What Happens When You Ask 67 Students to Predict Their Own Exam Scores

Professor Elena Vasquez teaches introductory statistics at a regional university. She's been teaching for eleven years. She knows every excuse, every study pattern, every type of student. But something has always bothered her: she can predict, with uncomfortable accuracy, which students will do well on each exam and which won't — and her predictions are based not on their intelligence but on how they study.

The students who do well tend to be quiet in class but active in office hours. They come with specific questions. They describe their confusion precisely. When she asks them "what have you tried?" they describe actual attempts — not "I reviewed my notes" but "I tried to work through this problem and got stuck at this specific step."

The students who do poorly tend to be confident in class. They nod along during lecture. They rarely come to office hours. When they do come, they ask vague questions: "Can you go over the hypothesis testing stuff?" When she asks "what specifically are you confused about?" they say, after a pause, "all of it, kind of."

She designed an experiment to examine this pattern rigorously.

The Experiment

In the third week of the semester, Professor Vasquez surveyed all 67 students in her two sections of introductory statistics. The survey was short:

On a scale of 0-100, what score do you predict you will receive on the upcoming Unit 2 exam?
How many hours have you spent studying for this exam?
What is your primary study method? (Choose all that apply: rereading notes, rereading textbook, flashcards, practice problems, self-quizzing without notes, study group, other)

She collected the surveys the day before the exam. She didn't share the results with students until after the exam was graded.

The Predictions vs. The Reality

The average predicted score: 79.4%
The average actual score: 68.2%
The average prediction error (predicted minus actual): +11.2 points

Only 19 of 67 students (28%) predicted within 5 points of their actual score. They were "well-calibrated" by her definition — close enough that their prediction represented genuinely useful information about their readiness.

Most of the remaining 48 students (72%) were overconfident, many by meaningful amounts. Fourteen students (21%) were overconfident by more than 20 points: they predicted 80+ and scored under 65.

Three students (4.5%) were significantly underconfident — they predicted lower than they scored by 10+ points. These students tended to be high-performers who were also highly self-critical.
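The arithmetic behind these categories is simple enough to sketch in a few lines of Python. This is an illustration only, not Professor Vasquez's actual analysis: the `Prediction` class, the `classify` function, and the two "Student A/B" records are hypothetical, and the exact boundary handling (for example, treating an error of exactly 5 points as well-calibrated) is an assumption. The Marcus and Yena figures come from the individual stories later in this case study.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    student: str
    predicted: float  # self-predicted exam score, 0-100
    actual: float     # graded exam score, 0-100

    @property
    def error(self) -> float:
        # Positive error means overconfident (predicted higher than actual).
        return self.predicted - self.actual


def classify(p: Prediction) -> str:
    """Label a prediction using the thresholds described in the case study.

    Boundary handling (<= 5, > 20, <= -10) is an assumption.
    """
    if abs(p.error) <= 5:
        return "well-calibrated"
    if p.error > 20:
        return "severely overconfident"
    if p.error <= -10:
        return "significantly underconfident"
    return "overconfident" if p.error > 0 else "underconfident"


records = [
    Prediction("Marcus", 91, 57),     # figures from the case study
    Prediction("Yena", 71, 74),       # figures from the case study
    Prediction("Student A", 85, 73),  # hypothetical record
    Prediction("Student B", 60, 78),  # hypothetical record
]

mean_error = sum(p.error for p in records) / len(records)
labels = {p.student: classify(p) for p in records}

for p in records:
    print(f"{p.student}: error {p.error:+.0f} -> {labels[p.student]}")
print(f"Mean prediction error: {mean_error:+.1f} points")
```

Any student can run the same calculation on their own predictions: write down a predicted score before each exam, subtract the actual score afterward, and watch whether the gap shrinks over the semester.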

What Were the Overconfident Students Doing?

When Professor Vasquez analyzed the study methods reported by the overconfident vs. well-calibrated students, the pattern was stark:

Students who predicted within 5 points of their actual score (well-calibrated):
- 84% reported practice problems as a primary study method
- 68% reported self-quizzing without notes
- Average study hours: 7.2

Students who over-predicted by 20+ points (severely overconfident):
- 7% reported practice problems as a primary study method (most of the time was passive review)
- 14% reported self-quizzing without notes
- 71% reported rereading notes as a primary study method
- 64% reported rereading the textbook as a primary study method
- Average study hours: 5.8

Two findings stand out:

First: The most overconfident students were spending their time on passive review, rereading their notes and rereading the textbook. This is precisely the method that creates the illusion of fluency without actual retrieval strength. They felt ready because the material felt familiar. But familiarity is not recall.

Second: They were not studying dramatically fewer hours. The severely overconfident students averaged 5.8 hours, compared with 7.2 hours for the well-calibrated students. The more important difference was not the quantity of study but the quality of the study method.

The Individual Stories

Professor Vasquez looked more closely at a few specific students who represented the extremes.

The Student 34 Points Over

Marcus (not the Marcus from earlier chapters — a different student) predicted 91%. He scored 57%. He spent approximately 6 hours studying, primarily by rereading his notes "until I felt like I had it."

When Professor Vasquez met with him after the exam, she asked him to explain hypothesis testing out loud, without notes. He could define the terms — null hypothesis, alternative hypothesis, p-value — but could not demonstrate the procedure on a problem, could not explain what a p-value actually represents, and could not describe when to reject vs. fail to reject a null hypothesis.

He knew the vocabulary. He did not know the concept.

"I thought I understood it because I could recognize all the words when I read my notes," he said. "I thought the words and the understanding were the same thing."

They aren't.

The Student 3 Points Under

Yena predicted 71%. She scored 74%. She spent 9 hours studying, primarily by working through practice problems and using self-quizzing.

When Professor Vasquez asked Yena how she knew she was ready, Yena said: "I took a practice exam from the course website and scored 72%. Then I found the questions I got wrong, went back and figured out exactly what I'd done wrong, and practiced those types of problems until I was getting them right. I figured I was roughly ready but not perfectly ready."

She was right. Remarkably so.

"What made Yena different," Professor Vasquez notes, "was that her assessment of her own knowledge was based on evidence — her performance on practice problems — rather than feelings. She felt uncertain, but she had tested herself and knew what the test showed. That's calibration."

The Pattern Across Study Methods

Professor Vasquez has now run versions of this survey for four semesters. The pattern replicates consistently:

Students who use practice problems and self-quizzing as primary study methods are consistently better calibrated
Students who rely primarily on rereading are consistently overconfident
The overconfidence gap (predicted minus actual) is significantly larger for passive reviewers than for active retrievers

Her conclusion: the study method that produces the most accurate self-assessment is the same method that produces the most effective learning — retrieval practice. This isn't a coincidence. When you test yourself, you find out what you know. When you reread, you find out what feels familiar. Familiar and known are not the same thing.

What She Changed in Her Teaching

After the first calibration experiment, Professor Vasquez made one significant change to her course: she required all students to take a practice exam (from a bank of past exam questions she provided) under realistic conditions — timed, no notes — one week before each major exam. Students submitted a one-paragraph self-analysis: where did they do well? Where did they miss questions? What specifically did they plan to study in the final week?

Effect on calibration: In the first semester after she required the practice exam, the average prediction error dropped from +11.2 points to +6.8 points. Students who completed the practice exam (tracked by submission) predicted their scores more accurately than students who didn't.

Effect on scores: Average exam scores increased by approximately 5-7 percentage points in the semester following the change. Professor Vasquez attributes this partly to the learning effect of the practice exam itself (retrieval practice is effective) and partly to the calibration effect (students who knew their gaps actually addressed them in the final week).

The Lesson

"The single most important thing I try to teach my students about learning," Professor Vasquez says, "is the difference between familiarity and knowledge. Familiarity is what you get from rereading. Knowledge is what you demonstrate when you can solve a problem without looking anything up. Feeling familiar with statistics is not the same as being able to do statistics.

"If I could make every student do one thing differently, it would be this: do a practice exam, under real conditions, at least one week before the actual exam. Not to get a grade. To find out what you don't know. The information you get from that practice exam is more valuable than any amount of rereading.

"The students who know they're not ready and study their gaps perform dramatically better than the students who think they're ready and find out on exam day that they aren't."