Case Study 2: Querying a University Database — Jordan Discovers SQL

Contributors to Introduction to Data Science

Case Study 2: Querying a University Database — Jordan Discovers SQL

Tier 3 — Illustrative/Composite Example: This case study uses a fictional university database modeled on the kinds of academic information systems that universities maintain. All student names, course information, and numerical values are invented for pedagogical purposes. No specific institution is represented.

The Setting

Jordan — our university student from Chapter 1 — is working as a peer tutor in the campus tutoring center. Over the past semester, Jordan has noticed a pattern: students from certain introductory courses seem to struggle more in their follow-up classes. But is this a real pattern, or is Jordan just remembering the difficult cases?

Jordan's supervisor, Professor Okafor, has access to the university's academic database and agrees to create a SQLite extract for Jordan's analysis. "I can't give you the full database for privacy reasons," Professor Okafor explains, "but I can give you anonymized data for the Computer Science department."

The next day, Jordan finds cs_department.db in their shared drive. It's a SQLite database with three tables:

Table: students | Column | Type | Description | |--------|------|-------------| | student_id | INTEGER | Unique identifier | | major | TEXT | Declared major | | admission_year | INTEGER | Year admitted | | gpa | REAL | Cumulative GPA |

Table: courses | Column | Type | Description | |--------|------|-------------| | course_id | TEXT | Course code (e.g., "CS101") | | course_name | TEXT | Full course name | | credits | INTEGER | Credit hours | | level | TEXT | "intro", "intermediate", "advanced" |

Table: enrollments | Column | Type | Description | |--------|------|-------------| | student_id | INTEGER | References students table | | course_id | TEXT | References courses table | | semester | TEXT | e.g., "Fall 2023" | | grade | TEXT | Letter grade (A, A-, B+, etc.) |

Jordan has never written SQL before, but the Chapter 12 material on databases is fresh in their mind. Time to learn by doing.

Connecting and Exploring

Jordan opens a Jupyter notebook and starts with the basics:

import sqlite3
import pandas as pd

conn = sqlite3.connect('cs_department.db')

First question: what tables are actually in the database?

tables = pd.read_sql("""
    SELECT name FROM sqlite_master
    WHERE type='table'
""", conn)
print(tables)

          name
0     students
1      courses
2  enrollments

Good — three tables, as expected. Jordan checks the size of each:

for table in ['students', 'courses', 'enrollments']:
    count = pd.read_sql(
        f"SELECT COUNT(*) as n FROM {table}", conn
    )
    print(f"{table}: {count['n'][0]} rows")

students: 1247 rows
courses: 42 rows
enrollments: 8934 rows

1,247 students, 42 courses, and 8,934 enrollment records. Let's peek at each table:

students = pd.read_sql(
    "SELECT * FROM students LIMIT 5", conn
)
print(students)

   student_id           major  admission_year   gpa
0       10001  Computer Science           2021  3.42
1       10002  Computer Science           2022  2.87
2       10003      Data Science           2023  3.65
3       10004  Computer Science           2020  3.11
4       10005  Information Systems        2021  2.95

Jordan is already thinking in questions, just like Professor Okafor's research methods class taught. They write three questions in a Markdown cell:

Research Questions: 1. What is the grade distribution in introductory vs. intermediate CS courses? 2. Do students who get lower grades in CS101 tend to get lower grades in CS201? 3. Is there a difference in course performance between CS majors and non-CS majors?

Question 1: Grade Distribution by Course Level

Jordan's first real SQL query. They want to count how many of each grade appears in intro courses vs. intermediate courses:

grade_dist = pd.read_sql("""
    SELECT c.level, e.grade, COUNT(*) as count
    FROM enrollments e
    JOIN courses c ON e.course_id = c.course_id
    GROUP BY c.level, e.grade
    ORDER BY c.level, e.grade
""", conn)
print(grade_dist.head(10))

        level grade  count
0    advanced     A     52
1    advanced    A-     67
2    advanced    B+     78
3    advanced     B     45
4    advanced    B-     38
5       intro     A    312
6       intro    A-    287
7       intro    B+    256
8       intro     B    398
9       intro    B-    342

This works, but it's hard to compare levels when the data is in long format. Jordan remembers Chapter 9 — pivot tables:

grade_pivot = grade_dist.pivot_table(
    index='grade', columns='level',
    values='count', fill_value=0
)
print(grade_pivot)

Now the grade distributions sit side by side. Jordan notices that intro courses have a higher proportion of C and D grades compared to advanced courses. "That makes sense," Jordan thinks. "Students who struggle in intro classes might not make it to advanced ones." This is an example of survivorship bias — a concept they'll learn formally later, but one they can already recognize.

Question 2: Do CS101 Grades Predict CS201 Grades?

This is the question that motivated the whole analysis. Jordan needs to join the enrollments table with itself — finding students who took both CS101 and CS201:

progression = pd.read_sql("""
    SELECT
        e1.student_id,
        e1.grade as cs101_grade,
        e2.grade as cs201_grade
    FROM enrollments e1
    JOIN enrollments e2
        ON e1.student_id = e2.student_id
    WHERE e1.course_id = 'CS101'
        AND e2.course_id = 'CS201'
""", conn)

print(f"Students who took both: {len(progression)}")
print(progression.head())

Students who took both: 423
   student_id cs101_grade cs201_grade
0       10001           B          B+
1       10004          B-          C+
2       10007           A          A-
3       10012          C+           C
4       10015           A           A

Jordan stares at this result. This is a self-join — the enrollments table is joined with itself, using different aliases (e1 and e2). The WHERE clause filters so that e1 only has CS101 records and e2 only has CS201 records, while the ON clause ensures they're the same student.

To see the pattern more clearly, Jordan converts letter grades to a numeric GPA scale and computes averages:

grade_to_points = {
    'A': 4.0, 'A-': 3.7, 'B+': 3.3, 'B': 3.0,
    'B-': 2.7, 'C+': 2.3, 'C': 2.0, 'C-': 1.7,
    'D+': 1.3, 'D': 1.0, 'F': 0.0
}

progression['cs101_pts'] = (
    progression['cs101_grade'].map(grade_to_points)
)
progression['cs201_pts'] = (
    progression['cs201_grade'].map(grade_to_points)
)

# Average CS201 grade by CS101 grade
summary = progression.groupby('cs101_grade').agg(
    avg_cs201=('cs201_pts', 'mean'),
    count=('student_id', 'count')
).round(2)
print(summary)

             avg_cs201  count
cs101_grade
A                 3.52     87
A-                3.28     72
B+                3.01     68
B                 2.74     85
B-                2.45     52
C+                2.12     34
C                 1.78     18
C-                1.45      5
D+                1.15      2

The pattern is clear: students who earned an A in CS101 averaged 3.52 in CS201, while students who earned a C averaged only 1.78. Jordan's tutoring center intuition was right — there is a strong correlation between CS101 and CS201 performance.

Jordan writes a careful note in the notebook:

Important caveat: This is a correlation, not a causal claim. We can't say that doing poorly in CS101 causes poor CS201 performance. Both could be influenced by other factors (study habits, work schedule, math preparation). What we can say is that CS101 grades are a strong predictor of CS201 outcomes, which suggests that targeted tutoring for students earning B- or below in CS101 could be valuable.

Question 3: CS Majors vs. Non-CS Majors

Jordan's final question requires joining three tables — students, enrollments, and courses:

major_perf = pd.read_sql("""
    SELECT
        CASE
            WHEN s.major = 'Computer Science'
                THEN 'CS Major'
            ELSE 'Non-CS Major'
        END as major_group,
        c.level,
        AVG(CASE
            WHEN e.grade = 'A' THEN 4.0
            WHEN e.grade = 'A-' THEN 3.7
            WHEN e.grade = 'B+' THEN 3.3
            WHEN e.grade = 'B' THEN 3.0
            WHEN e.grade = 'B-' THEN 2.7
            WHEN e.grade = 'C+' THEN 2.3
            WHEN e.grade = 'C' THEN 2.0
            WHEN e.grade = 'C-' THEN 1.7
            WHEN e.grade = 'D+' THEN 1.3
            WHEN e.grade = 'D' THEN 1.0
            WHEN e.grade = 'F' THEN 0.0
        END) as avg_gpa,
        COUNT(*) as enrollments
    FROM enrollments e
    JOIN students s ON e.student_id = s.student_id
    JOIN courses c ON e.course_id = c.course_id
    GROUP BY major_group, c.level
    ORDER BY c.level, major_group
""", conn)

print(major_perf)

    major_group        level  avg_gpa  enrollments
0      CS Major     advanced     3.21          280
1  Non-CS Major     advanced     2.87          102
2      CS Major  intermediate     2.95         1245
3  Non-CS Major  intermediate     2.62          534
4      CS Major        intro     2.82         2890
5  Non-CS Major        intro     2.58         1883

CS majors outperform non-CS majors at every level, with the gap widening slightly in advanced courses. This makes intuitive sense — CS majors are likely more invested in the material and may have stronger foundational preparation.

Jordan is impressed by how much they could learn with just a few SQL queries. The database approach let them ask precise questions across multiple related tables without loading everything into memory.

Closing the Connection

Jordan wraps up responsibly:

conn.close()

What Jordan Learned

SQL reads like English. Once you understand the basic keywords (SELECT, FROM, WHERE, JOIN, GROUP BY), writing queries feels more like describing what you want than programming how to get it.
Joins connect the dots. The students, courses, and enrollments tables are individually useful, but the real insights came from combining them. The self-join for the CS101-to-CS201 analysis was the most surprising technique — Jordan hadn't thought of joining a table with itself.
SQL and pandas work together. SQL was great for extracting and aggregating the data, but pandas was essential for the follow-up analysis — creating pivot tables, mapping grades to numeric values, and computing grouped statistics.
Caveats matter. Jordan resisted the temptation to say "bad CS101 grades cause bad CS201 grades." The data shows correlation, and that's valuable — but the notebook documents the limitation honestly.
Databases are organized. Compared to working with three separate CSV files, the database structure felt cleaner. Related data lives in connected tables, and SQL lets you pull exactly what you need.

The Outcome

Jordan presents these findings to Professor Okafor, who is impressed enough to forward the notebook to the department chair. The tutoring center implements a new policy: any student earning below a B- in CS101 receives an email offering free tutoring before they enroll in CS201. Jordan's data work led to a real intervention — and that's what data science is about.

Discussion Questions

Why was a self-join necessary for Question 2? Could Jordan have answered this question without SQL, using only pandas operations on separate DataFrames?
Jordan found that CS majors outperform non-CS majors. What are three possible explanations for this difference other than "CS majors are smarter"?
The CASE statement in Question 3's SQL query converts letter grades to numeric values. Jordan also did this conversion in pandas (using .map()) for Question 2. What are the trade-offs of doing this conversion in SQL vs. pandas?
Professor Okafor gave Jordan anonymized data — student names were replaced with numeric IDs. Why is this important? What could go wrong if the data contained real names?