Case Study 2: Querying a University Database — Jordan Discovers SQL
Tier 3 — Illustrative/Composite Example: This case study uses a fictional university database modeled on the kinds of academic information systems that universities maintain. All student names, course information, and numerical values are invented for pedagogical purposes. No specific institution is represented.
The Setting
Jordan — our university student from Chapter 1 — is working as a peer tutor in the campus tutoring center. Over the past semester, Jordan has noticed a pattern: students from certain introductory courses seem to struggle more in their follow-up classes. But is this a real pattern, or is Jordan just remembering the difficult cases?
Jordan's supervisor, Professor Okafor, has access to the university's academic database and agrees to create a SQLite extract for Jordan's analysis. "I can't give you the full database for privacy reasons," Professor Okafor explains, "but I can give you anonymized data for the Computer Science department."
The next day, Jordan finds cs_department.db in their shared drive. It's a SQLite database with three tables:
Table: students | Column | Type | Description | |--------|------|-------------| | student_id | INTEGER | Unique identifier | | major | TEXT | Declared major | | admission_year | INTEGER | Year admitted | | gpa | REAL | Cumulative GPA |
Table: courses | Column | Type | Description | |--------|------|-------------| | course_id | TEXT | Course code (e.g., "CS101") | | course_name | TEXT | Full course name | | credits | INTEGER | Credit hours | | level | TEXT | "intro", "intermediate", "advanced" |
Table: enrollments | Column | Type | Description | |--------|------|-------------| | student_id | INTEGER | References students table | | course_id | TEXT | References courses table | | semester | TEXT | e.g., "Fall 2023" | | grade | TEXT | Letter grade (A, A-, B+, etc.) |
Jordan has never written SQL before, but the Chapter 12 material on databases is fresh in their mind. Time to learn by doing.
Connecting and Exploring
Jordan opens a Jupyter notebook and starts with the basics:
import sqlite3
import pandas as pd
conn = sqlite3.connect('cs_department.db')
First question: what tables are actually in the database?
tables = pd.read_sql("""
SELECT name FROM sqlite_master
WHERE type='table'
""", conn)
print(tables)
name
0 students
1 courses
2 enrollments
Good — three tables, as expected. Jordan checks the size of each:
for table in ['students', 'courses', 'enrollments']:
count = pd.read_sql(
f"SELECT COUNT(*) as n FROM {table}", conn
)
print(f"{table}: {count['n'][0]} rows")
students: 1247 rows
courses: 42 rows
enrollments: 8934 rows
1,247 students, 42 courses, and 8,934 enrollment records. Let's peek at each table:
students = pd.read_sql(
"SELECT * FROM students LIMIT 5", conn
)
print(students)
student_id major admission_year gpa
0 10001 Computer Science 2021 3.42
1 10002 Computer Science 2022 2.87
2 10003 Data Science 2023 3.65
3 10004 Computer Science 2020 3.11
4 10005 Information Systems 2021 2.95
Jordan is already thinking in questions, just like Professor Okafor's research methods class taught. They write three questions in a Markdown cell:
Research Questions: 1. What is the grade distribution in introductory vs. intermediate CS courses? 2. Do students who get lower grades in CS101 tend to get lower grades in CS201? 3. Is there a difference in course performance between CS majors and non-CS majors?
Question 1: Grade Distribution by Course Level
Jordan's first real SQL query. They want to count how many of each grade appears in intro courses vs. intermediate courses:
grade_dist = pd.read_sql("""
SELECT c.level, e.grade, COUNT(*) as count
FROM enrollments e
JOIN courses c ON e.course_id = c.course_id
GROUP BY c.level, e.grade
ORDER BY c.level, e.grade
""", conn)
print(grade_dist.head(10))
level grade count
0 advanced A 52
1 advanced A- 67
2 advanced B+ 78
3 advanced B 45
4 advanced B- 38
5 intro A 312
6 intro A- 287
7 intro B+ 256
8 intro B 398
9 intro B- 342
This works, but it's hard to compare levels when the data is in long format. Jordan remembers Chapter 9 — pivot tables:
grade_pivot = grade_dist.pivot_table(
index='grade', columns='level',
values='count', fill_value=0
)
print(grade_pivot)
Now the grade distributions sit side by side. Jordan notices that intro courses have a higher proportion of C and D grades compared to advanced courses. "That makes sense," Jordan thinks. "Students who struggle in intro classes might not make it to advanced ones." This is an example of survivorship bias — a concept they'll learn formally later, but one they can already recognize.
Question 2: Do CS101 Grades Predict CS201 Grades?
This is the question that motivated the whole analysis. Jordan needs to join the enrollments table with itself — finding students who took both CS101 and CS201:
progression = pd.read_sql("""
SELECT
e1.student_id,
e1.grade as cs101_grade,
e2.grade as cs201_grade
FROM enrollments e1
JOIN enrollments e2
ON e1.student_id = e2.student_id
WHERE e1.course_id = 'CS101'
AND e2.course_id = 'CS201'
""", conn)
print(f"Students who took both: {len(progression)}")
print(progression.head())
Students who took both: 423
student_id cs101_grade cs201_grade
0 10001 B B+
1 10004 B- C+
2 10007 A A-
3 10012 C+ C
4 10015 A A
Jordan stares at this result. This is a self-join — the enrollments table is joined with itself, using different aliases (e1 and e2). The WHERE clause filters so that e1 only has CS101 records and e2 only has CS201 records, while the ON clause ensures they're the same student.
To see the pattern more clearly, Jordan converts letter grades to a numeric GPA scale and computes averages:
grade_to_points = {
'A': 4.0, 'A-': 3.7, 'B+': 3.3, 'B': 3.0,
'B-': 2.7, 'C+': 2.3, 'C': 2.0, 'C-': 1.7,
'D+': 1.3, 'D': 1.0, 'F': 0.0
}
progression['cs101_pts'] = (
progression['cs101_grade'].map(grade_to_points)
)
progression['cs201_pts'] = (
progression['cs201_grade'].map(grade_to_points)
)
# Average CS201 grade by CS101 grade
summary = progression.groupby('cs101_grade').agg(
avg_cs201=('cs201_pts', 'mean'),
count=('student_id', 'count')
).round(2)
print(summary)
avg_cs201 count
cs101_grade
A 3.52 87
A- 3.28 72
B+ 3.01 68
B 2.74 85
B- 2.45 52
C+ 2.12 34
C 1.78 18
C- 1.45 5
D+ 1.15 2
The pattern is clear: students who earned an A in CS101 averaged 3.52 in CS201, while students who earned a C averaged only 1.78. Jordan's tutoring center intuition was right — there is a strong correlation between CS101 and CS201 performance.
Jordan writes a careful note in the notebook:
Important caveat: This is a correlation, not a causal claim. We can't say that doing poorly in CS101 causes poor CS201 performance. Both could be influenced by other factors (study habits, work schedule, math preparation). What we can say is that CS101 grades are a strong predictor of CS201 outcomes, which suggests that targeted tutoring for students earning B- or below in CS101 could be valuable.
Question 3: CS Majors vs. Non-CS Majors
Jordan's final question requires joining three tables — students, enrollments, and courses:
major_perf = pd.read_sql("""
SELECT
CASE
WHEN s.major = 'Computer Science'
THEN 'CS Major'
ELSE 'Non-CS Major'
END as major_group,
c.level,
AVG(CASE
WHEN e.grade = 'A' THEN 4.0
WHEN e.grade = 'A-' THEN 3.7
WHEN e.grade = 'B+' THEN 3.3
WHEN e.grade = 'B' THEN 3.0
WHEN e.grade = 'B-' THEN 2.7
WHEN e.grade = 'C+' THEN 2.3
WHEN e.grade = 'C' THEN 2.0
WHEN e.grade = 'C-' THEN 1.7
WHEN e.grade = 'D+' THEN 1.3
WHEN e.grade = 'D' THEN 1.0
WHEN e.grade = 'F' THEN 0.0
END) as avg_gpa,
COUNT(*) as enrollments
FROM enrollments e
JOIN students s ON e.student_id = s.student_id
JOIN courses c ON e.course_id = c.course_id
GROUP BY major_group, c.level
ORDER BY c.level, major_group
""", conn)
print(major_perf)
major_group level avg_gpa enrollments
0 CS Major advanced 3.21 280
1 Non-CS Major advanced 2.87 102
2 CS Major intermediate 2.95 1245
3 Non-CS Major intermediate 2.62 534
4 CS Major intro 2.82 2890
5 Non-CS Major intro 2.58 1883
CS majors outperform non-CS majors at every level, with the gap widening slightly in advanced courses. This makes intuitive sense — CS majors are likely more invested in the material and may have stronger foundational preparation.
Jordan is impressed by how much they could learn with just a few SQL queries. The database approach let them ask precise questions across multiple related tables without loading everything into memory.
Closing the Connection
Jordan wraps up responsibly:
conn.close()
What Jordan Learned
-
SQL reads like English. Once you understand the basic keywords (SELECT, FROM, WHERE, JOIN, GROUP BY), writing queries feels more like describing what you want than programming how to get it.
-
Joins connect the dots. The students, courses, and enrollments tables are individually useful, but the real insights came from combining them. The self-join for the CS101-to-CS201 analysis was the most surprising technique — Jordan hadn't thought of joining a table with itself.
-
SQL and pandas work together. SQL was great for extracting and aggregating the data, but pandas was essential for the follow-up analysis — creating pivot tables, mapping grades to numeric values, and computing grouped statistics.
-
Caveats matter. Jordan resisted the temptation to say "bad CS101 grades cause bad CS201 grades." The data shows correlation, and that's valuable — but the notebook documents the limitation honestly.
-
Databases are organized. Compared to working with three separate CSV files, the database structure felt cleaner. Related data lives in connected tables, and SQL lets you pull exactly what you need.
The Outcome
Jordan presents these findings to Professor Okafor, who is impressed enough to forward the notebook to the department chair. The tutoring center implements a new policy: any student earning below a B- in CS101 receives an email offering free tutoring before they enroll in CS201. Jordan's data work led to a real intervention — and that's what data science is about.
Discussion Questions
-
Why was a self-join necessary for Question 2? Could Jordan have answered this question without SQL, using only pandas operations on separate DataFrames?
-
Jordan found that CS majors outperform non-CS majors. What are three possible explanations for this difference other than "CS majors are smarter"?
-
The CASE statement in Question 3's SQL query converts letter grades to numeric values. Jordan also did this conversion in pandas (using
.map()) for Question 2. What are the trade-offs of doing this conversion in SQL vs. pandas? -
Professor Okafor gave Jordan anonymized data — student names were replaced with numeric IDs. Why is this important? What could go wrong if the data contained real names?