Case Study 2: An Alternative Capstone: Analyzing Your Own Dataset


Tier 3 — Illustrative/Composite Example: This case study presents two fictional students who chose to pursue their own capstone topics rather than the vaccination rate analysis. Their projects, analytical approaches, and findings are constructed for pedagogical purposes to illustrate what a self-directed capstone looks like and what challenges it introduces. The datasets and domains described are realistic, but specific results are illustrative.


Introduction

Not every capstone student wants to analyze vaccination data. And that's perfectly fine.

Throughout this book, we've used the vaccination rate analysis as the progressive project because public health data is rich, important, and accessible. But data science is a way of thinking that applies to every domain — and some of the strongest capstone projects come from students who found a question they were personally passionate about and pursued it with the tools they learned here.

This case study follows two students who did exactly that: one who analyzed local housing data to understand gentrification, and another who investigated whether streaming algorithms have homogenized popular music. Their projects illustrate both the rewards and the unique challenges of choosing your own dataset.


Project 1: "Priced Out — Mapping Gentrification Through Housing Data"

The Student

Khadija grew up in a neighborhood that changed dramatically during her teenage years. Longtime residents were displaced as rents rose, new restaurants replaced family-owned shops, and the demographic makeup shifted. When she reached the capstone, she didn't want to analyze vaccination data — she wanted to use data science to understand the neighborhood transformation she'd witnessed firsthand.

The Question

"Can publicly available housing and demographic data identify neighborhoods undergoing gentrification, and if so, what indicators appear earliest — giving community organizations warning before displacement accelerates?"

The Data Journey

This is where Khadija's capstone got interesting — and difficult. Unlike the vaccination project, where the data was well-documented and provided in the progressive milestones, Khadija had to find and assemble her own data from scratch.

What she found:

  - Census tract-level demographic data from the American Community Survey (ACS), including median household income, racial composition, educational attainment, and housing tenure (rent vs. own)
  - Zillow Home Value Index data at the ZIP code level
  - Building permit data from the city's open data portal, including permits for renovations, demolitions, and new construction
  - Business license data, also from the city's open data portal

Challenges she encountered:

  1. Geographic mismatch. Census tracts, ZIP codes, and city council districts don't align. Zillow data was at the ZIP level, Census data at the tract level, and permit data at the address level. She had to geocode addresses and assign them to census tracts — a process that took two full days and required learning a new library (geopandas) that wasn't covered in this course.

  2. Temporal mismatch. Census ACS data is published as a five-year rolling average (here, the 2018-2022 ACS), while Zillow data is monthly. She had to decide whether to use point-in-time estimates or aggregate to match the Census temporal resolution. She chose to aggregate Zillow data to a five-year average but also created a separate time-series analysis at the monthly level.

  3. Defining "gentrification." This turned out to be the hardest analytical decision. There's no universal definition of gentrification. Some researchers define it purely by income changes, others by racial composition shifts, others by housing cost increases, and some use composite indices. Khadija reviewed three published methodologies and chose a composite index combining: (a) change in median household income, (b) change in percentage of residents with a bachelor's degree, (c) change in median home value, and (d) change in percentage of non-Hispanic white residents. She documented this choice thoroughly and noted that the definition itself embeds assumptions about what gentrification is.

  4. Small sample size. Her city had about 280 census tracts. After removing tracts with incomplete data, she had 241. This limited the complexity of models she could build — a random forest with 20 features on 241 observations is prone to overfitting.
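Her composite index can be sketched in a few lines of pandas. The data and column names below are hypothetical, and the combination rule shown (standardize each indicator to a z-score, then average) is one common way to build such an index; the exact weighting Khadija used isn't specified here.

```python
import pandas as pd

# Hypothetical tract-level changes between two ACS periods; the values and
# column names are illustrative, not Khadija's actual schema.
tracts = pd.DataFrame({
    "tract_id": ["A", "B", "C", "D"],
    "income_change":     [0.25, 0.02, 0.40, -0.05],  # change in median household income
    "ba_share_change":   [0.10, 0.01, 0.15,  0.00],  # change in % with bachelor's degree
    "home_value_change": [0.30, 0.05, 0.55,  0.02],  # change in median home value
    "white_share_change":[0.08, -0.01, 0.12, 0.01],  # change in % non-Hispanic white
})

indicators = ["income_change", "ba_share_change",
              "home_value_change", "white_share_change"]

# Standardize each indicator (z-score) so no single unit dominates,
# then average the z-scores into one composite index per tract.
z = (tracts[indicators] - tracts[indicators].mean()) / tracts[indicators].std(ddof=0)
tracts["gentrification_index"] = z.mean(axis=1)

print(tracts[["tract_id", "gentrification_index"]]
      .sort_values("gentrification_index", ascending=False))
```

Because each z-score column has mean zero, the composite index is centered at zero by construction; tracts well above zero are the candidates for "undergoing gentrification" under this definition.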

What Worked

Domain passion showed through. Khadija's introduction wasn't abstract — it was personal. She described her neighborhood, named the specific streets where changes happened, and explained why data analysis of gentrification matters for community organizing. A reader could tell she cared deeply about the topic.

The analytical decisions were excellent. Because she had to define "gentrification" herself, she couldn't just follow instructions. She reviewed literature, compared definitions, chose one, and tested whether her conclusions were sensitive to the definition. This demonstrated independent thinking at a level most capstone projects don't reach.

The ethical reflection was profound. Khadija wrestled with a real tension: building a "gentrification prediction model" could be useful for community organizations seeking early warning — but the same model could be used by real estate investors to identify neighborhoods ripe for speculation. She discussed this dual-use problem thoughtfully and didn't pretend to resolve it.

What Could Be Improved

The technical execution was uneven. The geocoding and geographic merging took so long that Khadija ran short on time for modeling. Her regression analysis was solid but basic — she didn't have time to explore non-linear models or perform the thorough cross-validation she'd planned.

The scope was ambitious. Four data sources, geographic merging, a literature review on definitions, time-series analysis, and cross-sectional modeling — this was closer to a master's thesis than a capstone. A more focused question ("Do building permits predict home value increases at the census tract level?") would have allowed deeper analysis.

Rubric Assessment

| Dimension | Score | Notes |
| --- | --- | --- |
| Question and Motivation | 4 | Specific, personal, compelling |
| Data Handling | 3 | Thorough but some cleaning steps underdocumented due to time pressure |
| Exploration and Visualization | 4 | Strong choropleth maps; effective before/after comparisons |
| Statistical Analysis / Modeling | 3 | Solid regression but limited model comparison |
| Communication | 4 | Exceptional narrative driven by personal connection |
| Critical Reflection | 4 | Outstanding ethical reflection on dual-use concerns |
| Total | 22/24 | Excellent |

Lessons for Your Capstone

  1. Choosing your own dataset is rewarding but risky. The data acquisition and cleaning challenges Khadija faced consumed time that the vaccination project students spent on analysis and polish. Budget extra time for data preparation if you go this route.

  2. Scope carefully. Khadija's project was borderline too ambitious. If you choose your own topic, define a narrow, achievable question first — you can always expand if time permits, but you can't un-expand if you run short.

  3. Domain expertise is a real advantage. Khadija's personal experience with gentrification made her analysis richer, her questions more specific, and her interpretation more nuanced than any outsider's could be.


Project 2: "Does Everything Sound the Same? Measuring Homogenization in Popular Music"

The Student

Andre was a music production hobbyist who'd been casually analyzing music data on his own before taking this course. His friends teased him that "everything on Spotify sounds the same now," and he wanted to test whether that intuition had any basis in data.

The Question

"Have the audio characteristics of popular music become more homogeneous over the past 20 years, and if so, which characteristics have converged the most?"

The Data Journey

What he used:

  - Spotify Web API audio features (tempo, energy, danceability, valence, acousticness, instrumentalness, speechiness, loudness) for tracks appearing on the Billboard Hot 100, 2004-2024
  - Billboard Hot 100 chart data (song titles, artists, peak position, weeks on chart)

In total, he collected data for approximately 12,000 unique tracks using the Spotify API.

Challenges he encountered:

  1. API rate limits. The Spotify API limits requests, and querying audio features for 12,000 tracks required careful rate management with sleep intervals between requests. Andre wrote a robust data collection script with error handling, retries, and progress logging — a genuine technical accomplishment.

  2. Matching songs across sources. Song titles on Billboard didn't always match Spotify's catalog exactly. "featuring" vs. "feat." vs. "ft.," remixes, live versions, and re-releases created matching ambiguities. Andre used fuzzy string matching (from the fuzzywuzzy library) and estimated a 94% match rate, manually checking a random sample of 100 matches for accuracy.

  3. Defining "homogeneity." How do you measure whether music is becoming "more similar"? Andre chose to measure the standard deviation of each audio feature per year — if music is becoming more homogeneous, the standard deviation should decrease over time. He also calculated the coefficient of variation (standard deviation divided by mean) to account for potential shifts in the mean.

  4. Survivorship bias. The Billboard Hot 100 only includes songs that were popular. Music that didn't chart — which might be more diverse — is invisible in this analysis. Andre acknowledged this as a fundamental limitation.
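Andre's homogeneity measure is straightforward to reproduce on toy data. The loudness values below are simulated (with the spread deliberately shrinking over time), not his actual Spotify features; the point is the per-year standard deviation and coefficient of variation he describes. Since loudness values are negative, the sketch divides by the absolute value of the mean.

```python
import numpy as np
import pandas as pd

# Simulated chart data: one row per track, with year and a loudness value.
# The spread shrinks over time to mimic a converging feature.
rng = np.random.default_rng(0)
spread = {2004: 6.0, 2014: 4.0, 2024: 2.0}
years = np.repeat([2004, 2014, 2024], 100)
loudness = np.concatenate(
    [rng.normal(-8.0, spread[y], 100) for y in (2004, 2014, 2024)]
)
tracks = pd.DataFrame({"year": years, "loudness": loudness})

# Homogeneity per year: standard deviation, plus the coefficient of
# variation (std / |mean|) to account for shifts in the mean itself.
per_year = tracks.groupby("year")["loudness"].agg(["mean", "std"])
per_year["cv"] = per_year["std"] / per_year["mean"].abs()

print(per_year)
```

A falling standard deviation (or coefficient of variation) across years is exactly the "convergence" signal Andre looked for in each audio feature.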

Key Findings

Andre's analysis revealed a more nuanced picture than "everything sounds the same":

  1. Loudness converged dramatically. The standard deviation of loudness decreased by 42% from 2004 to 2024, confirming the well-documented "loudness war" in music production. Popular songs are compressed to a narrower loudness range than ever before.

  2. Tempo converged moderately. Songs clustered increasingly around 90-130 BPM, with fewer very slow or very fast songs making the charts. The standard deviation of tempo decreased by 18%.

  3. Danceability and energy increased but didn't necessarily converge. The mean danceability and energy of Hot 100 songs both increased, but the variation around those means stayed roughly constant. Music got more "danceable" on average, but the range of danceability remained similar.

  4. Acousticness plummeted, then partially recovered. The percentage of acoustic (non-electronic) songs on the Hot 100 dropped steadily from 2004 to 2018, then partially recovered from 2019 to 2024 — possibly reflecting the rise of folk and indie-pop in the streaming era.

  5. The overall picture is "partial homogenization." Some features converged, others didn't. Andre's conclusion: "My friends were half right. Popular music has become more homogeneous in loudness and tempo, but not in all dimensions. The perception of sameness may be driven by the very noticeable convergence in production loudness and rhythmic template, even though melodic and emotional characteristics remain diverse."

What Worked

The question was genuinely interesting. This is the kind of analysis that gets shared on social media, discussed in podcasts, and noticed by hiring managers in media and entertainment. Andre's blog post about the project received significant attention in music production forums.

The data acquisition was technically impressive. Writing a robust API scraping pipeline with rate limiting, error handling, and fuzzy matching demonstrated real engineering skills beyond typical course material.

The visualizations were outstanding. Andre created a "heatmap of convergence" showing the coefficient of variation for each audio feature over time, with a clear color gradient from "more diverse" (cool colors) to "more similar" (warm colors). It was the kind of chart that communicates the entire story in a single image.

What Could Be Improved

The statistical rigor was sometimes thin. Andre reported trends in standard deviation over time but didn't always test whether those trends were statistically significant. A formal trend test (like Mann-Kendall) would have strengthened the claims.
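The Mann-Kendall test mentioned here is simple enough to implement directly. The sketch below uses the classic no-ties normal approximation on hypothetical yearly values; it omits the tie correction and seasonal variants a full implementation would include.

```python
import numpy as np
from itertools import combinations
from math import erf, sqrt

def mann_kendall(series):
    """Mann-Kendall trend test (no-ties normal approximation).

    Returns (S, p_value). Negative S suggests a decreasing trend.
    Minimal sketch: no tie correction, no seasonal variant.
    """
    x = np.asarray(series, dtype=float)
    n = len(x)
    # S = sum of signs of all pairwise "later minus earlier" differences
    s = sum(np.sign(x[j] - x[i]) for i, j in combinations(range(n), 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected z statistic
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return s, p

# Hypothetical yearly standard deviations of loudness, steadily shrinking
yearly_sd = [6.1, 5.8, 5.9, 5.3, 5.0, 4.7, 4.8, 4.2, 3.9, 3.6]
s, p = mann_kendall(yearly_sd)
print(s, p)
```

A small p-value here would let Andre state "the spread of loudness decreased over time" as a tested claim rather than a visual impression.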

The modeling section was minimal. Andre's project was primarily exploratory and descriptive, with limited formal modeling. While this was appropriate for the question (he wasn't trying to predict anything), the capstone specification requires a modeling section. He could have built a model predicting a song's chart year from its audio features — this would test the homogenization hypothesis from a different angle.

The ethical reflection was brief. Andre discussed algorithm-driven homogenization (how recommendation systems might reward similarity) but didn't deeply engage with questions about whose music gets amplified and whose gets suppressed by streaming platforms.

Rubric Assessment

| Dimension | Score | Notes |
| --- | --- | --- |
| Question and Motivation | 4 | Specific, culturally relevant, personally motivated |
| Data Handling | 4 | Impressive API pipeline; fuzzy matching; good documentation |
| Exploration and Visualization | 4 | Outstanding visualizations; clear pattern identification |
| Statistical Analysis / Modeling | 2 | Trends described but not formally tested; minimal modeling |
| Communication | 4 | Engaging narrative; blog-worthy writing |
| Critical Reflection | 3 | Good on algorithmic bias; thin on representation issues |
| Total | 21/24 | Strong |

Lessons for Your Capstone

  1. Interesting questions make everything easier. Andre's question was so naturally engaging that the narrative wrote itself. If your question excites you, it'll likely excite your reader.

  2. Technical challenges can become strengths. The API scraping, rate limiting, and fuzzy matching weren't required by the course — they emerged from the project's needs. But they became impressive portfolio demonstrations of skills beyond the curriculum.

  3. Don't neglect the modeling requirement. Even if your question is primarily descriptive, find a way to include a formal model. It doesn't have to be complex — a simple regression can satisfy the requirement while adding analytical value.


Comparing the Two Projects

| Aspect | Khadija (Gentrification) | Andre (Music) |
| --- | --- | --- |
| Strength | Depth of analysis, ethical reflection | Technical data acquisition, visualization |
| Risk taken | Ambitious scope with geographic data | Non-traditional domain for data science portfolio |
| Biggest challenge | Geographic data merging, defining gentrification | API rate limits, matching across sources |
| Career signal | Strong for urban planning, policy, social science roles | Strong for media, entertainment, creative industry roles |
| Key lesson | Scope carefully; domain expertise is a real asset | Ensure formal analysis matches the capstone specification |

Guidance for Choosing Your Own Topic

If you're considering an alternative capstone topic, ask yourself these questions:

  1. Is the data accessible? Can you actually get the data you need, in the time you have? If data acquisition will take more than 20% of your total project time, the scope may be too ambitious.

  2. Is the question specific enough? "Is social media bad?" is not a capstone question. "Has the average sentiment of tweets mentioning a specific public figure changed over the past three years?" is.

  3. Can you satisfy all rubric dimensions? Even if your question is primarily descriptive, you need exploration, formal analysis/modeling, communication, and critical reflection. Make sure your topic allows all of these.

  4. Do you have domain knowledge? Khadija's personal experience with gentrification and Andre's music production background gave them analytical advantages. If you're choosing a domain you know nothing about, you'll spend time learning the domain that could be spent on analysis.

  5. Would you be proud to show this to a hiring manager? The capstone is a portfolio piece. Choose a topic that demonstrates the kind of thinking you want to be known for.


Discussion Questions

  1. Khadija's gentrification project raised a dual-use concern: her model could help community organizations or real estate speculators. Have you encountered similar dual-use tensions in data science? How should analysts navigate them?

  2. Andre's project was primarily descriptive and exploratory. Do you think a data science project needs a predictive model to be considered "real" data science? Why or why not?

  3. Both students faced data challenges not covered in this course (geographic data for Khadija, API rate limiting for Andre). How would you handle encountering a technical challenge that's beyond what you've been taught?

  4. Khadija's personal connection to gentrification made her analysis richer but also introduced potential bias — she had strong prior beliefs about what she expected to find. How should analysts balance domain passion with analytical objectivity?

  5. If you were choosing your own capstone topic, what would it be? Can you articulate the specific question, identify the data source, and describe the analytical approach in three sentences?