Chapter 34 Exercises: Building Your Portfolio

Contributors to Introduction to Data Science

How to use these exercises: Unlike the exercises in most chapters of this book, these are actions, not calculations. Many of them ask you to build, write, or create something tangible. By the end, you should have concrete portfolio artifacts: not just understanding, but visible output. Treat this as a working session, not a problem set.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension

Part A: Portfolio Assessment ⭐

These exercises build your ability to evaluate portfolio projects — a skill you need before you can build great ones.

Exercise 34.1 — Spotting the generic portfolio

Look at the following five project titles and brief descriptions. For each, explain whether it would be a strong or weak portfolio piece, and why. Apply the CRISP criteria from the chapter.

1. "Titanic Survival Prediction": Loads the Kaggle dataset, one-hot encodes features, trains a random forest, reports accuracy.
2. "Does Weather Affect NYC Subway Delays? A Five-Year Investigation": Merges MTA performance data with NOAA weather records to test whether precipitation correlates with service disruptions, including an interactive map of the most delay-prone stations.
3. "MNIST Digit Classification": Builds a convolutional neural network to classify handwritten digits, achieving 99.1% accuracy.
4. "The Geography of Noise Complaints: Mapping 311 Data to Identify New York's Loudest Neighborhoods": Scrapes 311 complaint data, geocodes addresses, builds choropleth maps, and investigates whether noise complaints correlate with gentrification indicators.
5. "Sentiment Analysis": Uses VADER sentiment analyzer on 10,000 movie reviews, reports positive/negative classification accuracy.

Guidance

Projects 2 and 4 are strong. They have clear questions (CRISP: C), use real data that required acquisition effort (R), show independent thinking in the analytical choices (I), tell a story through the project title and description (S), and suggest polished output with maps and merged datasets (P). Projects 1, 3, and 5 are weak portfolio pieces — not because they're bad learning exercises, but because they use pre-packaged datasets, follow well-worn paths, and don't demonstrate original thinking. The titles are generic ("Sentiment Analysis"), and the descriptions suggest code-dumps rather than narratives. To strengthen the weak ones: Project 1 could become "Did 'Women and Children First' Really Apply? Intersectional Analysis of Titanic Survival by Class, Gender, and Age." Project 3 could become "Can Handwriting Recognition Cross Languages? Testing MNIST-Trained Models on Japanese Hiragana Characters." Project 5 could become "Is Movie Review Sentiment Shifting? A Decade of IMDB Ratings vs. Textual Tone."

Exercise 34.2 — The CRISP audit

Take a data science project you've seen online — a Kaggle notebook, a blog post, or a GitHub repository. (If you can't find one, use one of your own earlier notebooks from this course.) Evaluate it against each of the five CRISP criteria, giving it a score of 1-5 for each. Write a brief paragraph explaining what the project does well and what could be improved.

Guidance

Score each criterion independently:

- **C (Clear question):** Is there a specific question stated? (1 = no question, 5 = precise, interesting question)
- **R (Real data):** Is the data sourced from a real context, or is it a pre-cleaned tutorial dataset? (1 = Iris/Titanic with no twist, 5 = self-collected or merged from real sources)
- **I (Independent thinking):** Are analytical decisions documented with reasoning? (1 = follows a tutorial step-by-step, 5 = multiple documented judgment calls)
- **S (Story and structure):** Does it read as a narrative? (1 = code dump, 5 = clear introduction, logical flow, written conclusion)
- **P (Polished presentation):** Are charts labeled, is code clean, are there no leftover debugging cells? (1 = messy, 5 = publication-ready)

A strong portfolio piece should score at least 3 on every dimension and at least 4 on two of them.
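The scoring rule at the end of this guidance can be written down as a small check. The sketch below is one possible reading, not something taken from the chapter: it assumes "at least 4 on two of them" means two or more criteria, and the function name `is_strong_piece` and the example scores are hypothetical.

```python
CRITERIA = ("C", "R", "I", "S", "P")


def is_strong_piece(scores):
    """Apply the rule of thumb: every score >= 3 and at least two scores >= 4.

    `scores` maps each CRISP letter to an integer from 1 to 5.
    """
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    values = [scores[c] for c in CRITERIA]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("each score must be between 1 and 5")
    # Assumption: "at least 4 on two of them" is read as "two or more".
    return min(values) >= 3 and sum(v >= 4 for v in values) >= 2


# Hypothetical audits, invented for illustration.
balanced = {"C": 4, "R": 5, "I": 3, "S": 3, "P": 3}
lopsided = {"C": 5, "R": 5, "I": 2, "S": 4, "P": 4}

print(is_strong_piece(balanced))  # True
print(is_strong_piece(lopsided))  # False: I is below 3
```

The second audit shows why the floor matters: four high scores do not compensate for one dimension that falls below 3.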

Exercise 34.3 — Reading like a hiring manager

Imagine you have 60 seconds to evaluate a portfolio. Open three different data science GitHub profiles (search GitHub for "data science portfolio" or look at profiles of people who blog on Medium/Towards Data Science). For each profile, write down:

Your first impression in one sentence (within 10 seconds of arriving)
Whether you would click on any of the pinned repositories (and why or why not)
What the profile tells you about the person's interests and skill level

Guidance

First impressions are driven by: profile photo (real vs. default), profile README (present vs. absent), and pinned repositories (descriptive titles vs. "project1," "assignment2"). The profiles you're most drawn to will likely have a clear README, interesting project titles, and evidence of completed work. Blank profiles, forked-but-unmodified repositories, and generic project names all reduce interest.

Exercise 34.4 — The README test

Without opening the notebook, read the README of any data science project on GitHub. Based on the README alone, answer:

1. What question does this project investigate?
2. What data sources does it use?
3. What are the key findings?
4. Could you reproduce this analysis? (Is there enough information?)

If any of these questions can't be answered from the README, that's a gap.

Guidance

A well-written README should answer all four questions clearly. Many projects fail on questions 3 and 4 — they describe *what they did* but not *what they found*, and they provide no instructions for reproduction. These are exactly the gaps you should avoid in your own projects.

Exercise 34.5 — Distinguishing portfolio from homework

Consider two versions of the same analysis:

Version A: A notebook titled "Chapter 8 Exercises" that cleans the vaccination dataset by following the textbook instructions step by step, with comments like "As per Exercise 8.3, we drop rows with missing values."

Version B: A notebook titled "Addressing Data Quality Challenges in WHO Vaccination Records" that cleans the same dataset but frames each decision as an analytical choice: "47 countries had missing GDP data. Rather than dropping them (which would eliminate most of Sub-Saharan Africa), I imputed using the nearest-year World Bank data and flagged these values for sensitivity analysis."

Write three sentences explaining why Version B is a stronger portfolio piece, even though both contain the same cleaning operations.

Guidance

Version B demonstrates three things Version A doesn't: (1) **Independent framing** — the title and context are original, not referencing a textbook assignment; (2) **Decision justification** — the choices are explained with reasoning, not attributed to instructions; (3) **Awareness of consequences** — the note about Sub-Saharan Africa shows the author understands the analytical implications of their cleaning choices. A hiring manager reading Version B sees a person who *thinks*; reading Version A, they see a person who *follows instructions*.
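To make the kind of decision described in Version B concrete, here is a minimal pandas sketch of "impute from the nearest year and flag the value." It is an illustration under stated assumptions, not the chapter's method: the column names (`gdp_per_capita`, `gdp_source_year`, `gdp_imputed`), the target year, and every number in the toy table are invented.

```python
import pandas as pd

# Toy data, invented for illustration; country "X" has no GDP value for 2021.
gdp = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y", "Y"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "gdp_per_capita": [1000.0, 1100.0, None, 500.0, 520.0, 540.0],
})

target_year = 2021  # assumed analysis year
records = []
for country, rows in gdp.groupby("country"):
    observed = rows.dropna(subset=["gdp_per_capita"])
    if observed.empty:
        # Nothing to borrow from: leave the value missing rather than guess.
        records.append({"country": country, "gdp_per_capita": None,
                        "gdp_source_year": None, "gdp_imputed": False})
        continue
    # Pick the observed row whose year is closest to the target year.
    distance = (observed["year"] - target_year).abs()
    nearest = observed.loc[distance.idxmin()]
    records.append({
        "country": country,
        "gdp_per_capita": float(nearest["gdp_per_capita"]),
        "gdp_source_year": int(nearest["year"]),
        # The flag lets a later sensitivity analysis rerun without these rows.
        "gdp_imputed": bool(nearest["year"] != target_year),
    })

cleaned = pd.DataFrame(records)
print(cleaned)
```

Keeping the source year and the flag alongside the value is what turns a silent fix into a documented decision a reader can audit.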

Part B: Building Your Portfolio ⭐⭐

These exercises produce tangible portfolio artifacts. Treat them as action items.

Exercise 34.6 — Write your project introduction

Open a new Jupyter notebook (or a text file). Write the introduction for your vaccination rate analysis portfolio piece. Include:

The research question (one to two sentences, specific and interesting)
Why it matters (two to three sentences connecting to real-world significance)
What data you used (one to two sentences with source names)
A brief preview of your approach and findings (two to three sentences)

Target length: 150-250 words. This should be compelling enough that a hiring manager would keep reading.

Guidance

A strong introduction might look like: "COVID-19 vaccination rates varied enormously across countries — from over 90% in some high-income nations to under 10% in parts of Sub-Saharan Africa. What explains this divide? Is it purely an economic issue, or do factors like healthcare infrastructure, geographic isolation, or political stability play a role? This analysis investigates the relationship between national-level indicators and vaccination coverage for 194 WHO member states, using data from the WHO, World Bank, and WHO Global Health Expenditure Database. Through exploratory analysis, statistical testing, and predictive modeling, I find that while GDP per capita is strongly associated with vaccination rates, healthcare worker density and health spending as a percentage of GDP are even stronger predictors — suggesting that *how* a country allocates resources matters more than *how much* it has." Notice: specific numbers, a clear question, real data sources, and a finding that's genuinely interesting.

Exercise 34.7 — Curate your visualizations

Review all the charts you've created throughout this book (or in your data science work). Select the five to eight best visualizations — the ones that tell the most compelling story about vaccination rate disparities. For each selected chart:

Write a descriptive title (not "Figure 1" but something a reader could learn from)
Write a one to two sentence caption explaining what the chart reveals
Note any improvements you'd make (labels, colors, layout)

Guidance

Good visualization titles lead with the finding: "Low-Income Countries Lag Behind: Vaccination Rates by World Bank Income Group" is better than "Bar Chart of Vaccination Rates." Captions should tell the reader what to notice: "The median vaccination rate in low-income countries (14.2%) was roughly one-fifth of the rate in high-income countries (72.8%), with Sub-Saharan Africa showing the widest within-region variation."

Exercise 34.8 — Write your project README

Write a complete README.md for your vaccination rate analysis repository, following the template in Section 34.4 of the chapter. Include all sections: Title, Overview, Motivation, Data Sources, Key Findings, Methods, Repository Structure, How to Reproduce, and Limitations/Future Work.

Guidance

The README should be 300-500 words. Lead with your most interesting finding in the Overview. The Motivation section should explain *your* reason for doing this — not "this was a course project" but "vaccination equity is one of the defining challenges of global public health, and understanding what drives disparities is essential for addressing them." The Limitations section should be honest: what data didn't you have? What assumptions did you make? What questions remain?

Exercise 34.9 — Set up your GitHub profile

If you don't already have a polished GitHub profile, create or update one now:

Create a profile README repository (same name as your username)
Write a brief, professional profile description (3-5 lines)
Upload your vaccination analysis project as a repository
Pin it to your profile
Write at least three descriptive commit messages as you work

Screenshot your profile before and after. The difference should be visible.

Guidance

Keep the profile README short and genuine. Include: who you are (one sentence), what you're interested in (one sentence), what skills you have (one line), and how to find you elsewhere (LinkedIn, blog). Avoid: motivational quotes, walls of badge icons, and multi-paragraph autobiographies. The profile should take less than 10 seconds to scan and give someone a clear picture of who you are and what you do.

Exercise 34.10 — The before/after notebook transformation

Take one section of your vaccination analysis — the data cleaning section is a good candidate — and create "before" and "after" versions:

Before: The raw exercise version from when you first completed it (probably sparse on explanation, focused on making the code work).

After: A polished portfolio version with narrative Markdown, documented decisions, and clean formatting.

Write a brief reflection on what changed and why those changes matter for a reader.

Guidance

Key transformations include: adding Markdown context before and after code cells; replacing generic variable names with descriptive ones; removing debugging output; adding comments that explain *why*, not *what*; and writing interpretive text after results. The "after" version should read like a report section, not a coding exercise.
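As a small illustration of these transformations, the sketch below shows the same two cleaning steps written twice. The data, the variable names, and the helper functions are all hypothetical; the point is only the contrast in naming and commenting.

```python
import pandas as pd

# Toy records, invented for illustration.
raw_records = pd.DataFrame({
    "country": ["Country A", "country a ", "Country B"],
    "vaccination_rate": [61.0, 61.0, 48.5],
})

# Before: works, but the names and the lack of reasoning tell a reader nothing.
df = raw_records.copy()
df["country"] = df["country"].str.strip().str.title()
df2 = df.drop_duplicates()


# After: descriptive names, and comments that explain why each step exists.
def standardize_country_names(records):
    """Return a copy with trimmed, consistently capitalized country names."""
    standardized = records.copy()
    # Why: the same country spelled two ways would be split into separate
    # groups in every later aggregation.
    standardized["country"] = standardized["country"].str.strip().str.title()
    return standardized


def drop_duplicate_reports(records):
    """Return a copy with exact duplicate rows removed."""
    # Why: a duplicated row would count a country twice in regional averages.
    return records.drop_duplicates().reset_index(drop=True)


vaccination_records = drop_duplicate_reports(standardize_country_names(raw_records))
print(vaccination_records)
```

Both versions produce the same table; only the second one would let a reviewer follow the reasoning without running the code.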

Exercise 34.11 — Write a project description for your resume

Write a two to three sentence project description for your vaccination analysis that you could include on a resume. Follow the pattern: "Built [what] using [how] to answer [question], finding [result]."

Then write a second version that's slightly longer (three to four sentences) for LinkedIn or a cover letter.

Guidance

Resume version: "Analyzed COVID-19 vaccination data for 194 countries using Python (pandas, scikit-learn, matplotlib) to identify predictors of vaccination coverage disparities. Found that healthcare worker density was a stronger predictor than GDP per capita, with a random forest model achieving R² = 0.78 on held-out test data." LinkedIn version adds a sentence of context and motivation: "Motivated by the stark global inequalities in COVID-19 vaccine access, I conducted an end-to-end data science investigation of vaccination rate disparities across 194 countries..."

Exercise 34.12 — Blog post outline

Write an outline for a blog post about your vaccination analysis. The outline should include:

A compelling title (something a data-interested reader would click on)
An opening hook (the first one to two sentences that draw the reader in)
Five to seven section headings that tell the story of your analysis
The three key charts you would include
A closing takeaway (the one thing you want the reader to remember)

Guidance

A good title is specific and intriguing: "What Predicts Whether a Country Vaccinates Its People? It's Not What You Think." The opening hook should start with a surprising fact or statistic. Sections should follow a narrative arc: setup (the question), data (what you worked with), exploration (what you found), analysis (what the models showed), and conclusion (what it means). The closing takeaway should be the most interesting single finding, stated simply.

Exercise 34.13 — LinkedIn profile upgrade

Update (or create) your LinkedIn profile with:

A headline that mentions data science and your domain interest
An "About" section of three to four sentences
At least one featured item (link to your GitHub, blog post, or project summary)
Three relevant skills listed (e.g., Python, Data Analysis, Machine Learning)

Guidance

Headline examples: "Aspiring Data Scientist | Public Health + Analytics | Python, SQL, Machine Learning" or "Data Science Student | Turning Messy Data into Clear Stories | Seeking Analyst Roles." The "About" section should be conversational but professional: who you are, what interests you about data science, what kind of work you're looking for. Avoid buzzwords ("synergy," "leverage," "passionate about disruption").

Part C: Project Planning ⭐⭐

These exercises help you identify and scope your next portfolio projects.

Exercise 34.14 — The interest inventory

List five topics you're genuinely curious about — not things you think would look good on a resume, but things you actually wonder about. For each topic, brainstorm at least one data science question and identify a potential data source.

Example:

- Topic: Music discovery
- Question: "Has Spotify's recommendation algorithm made people's listening habits more homogeneous over time?"
- Data source: Spotify API (personal listening history)

Guidance

The best portfolio projects come from genuine curiosity. If you're interested in sports, don't force yourself to do a finance project because you think it'll look better. An enthusiastic sports analytics project with genuine insight will always outperform a lukewarm finance project that follows a tutorial. Your passion shows through in the depth of your questions, the care in your analysis, and the quality of your writing.

Exercise 34.15 — Project scoping

Choose one of the ideas from Exercise 34.14 and develop it into a project scope document:

Question: What specific question will you investigate?
Data: What data sources will you use? Are they accessible? How large?
Approach: What techniques will you apply? (EDA, visualization, statistical testing, modeling?)
Scope control: What will you explicitly not do? (Setting boundaries prevents scope creep)
Timeline: How many hours will this take? Set a deadline.
Definition of done: What does the finished project look like? (A notebook? A blog post? A dashboard?)

Guidance

The "scope control" and "definition of done" items are the most important. Without them, projects expand indefinitely. A good scope control statement might be: "I will analyze data from 2015-2023. I will not attempt real-time prediction. I will limit the analysis to the top 30 countries by data completeness." A good definition of done might be: "A polished Jupyter notebook with 5-7 visualizations, a statistical analysis section, and a written conclusion, plus a README, published on GitHub."

Exercise 34.16 — The range assessment

Map your current (or planned) portfolio projects against the three-project framework from the chapter:

| Slot | Project | Skills Demonstrated | Domain | Status |
|---|---|---|---|---|
| Deep Dive | | | | |
| Domain Project | | | | |
| Technical Demo | | | | |

Identify any gaps. If all three projects use the same technique (e.g., all are classification tasks), that's a range problem. If all three are in the same domain, consider adding variety.

Guidance

A strong portfolio shows range in both techniques and domains. Example portfolio: (1) Deep Dive: vaccination rate analysis (regression, visualization, statistics, public health); (2) Domain Project: NBA shot analysis (web scraping, exploratory analysis, sports); (3) Technical Demo: real-time weather dashboard (API integration, Plotly Dash, deployment). This shows a person who can go deep, follow their curiosity, and handle technical challenges.

Exercise 34.17 — Data source scavenger hunt

Find three datasets you didn't know about that could fuel interesting portfolio projects. For each, record:

The source (URL and organization)
What data is included (variables, time period, geographic coverage)
A potential research question
File format and approximate size

Good places to look: data.gov, data.gov.uk, the EU Open Data Portal, Kaggle Datasets, Google Dataset Search, the Bureau of Labor Statistics, your city or state's open data portal.

Guidance

Lesser-known datasets make for more original projects. Instead of the frequently used datasets, look for things like: municipal 311 complaint data, agricultural crop yields, public transit performance records, building permit applications, campaign finance filings, or professional sports performance data. The more specific the domain, the more likely you'll produce an analysis nobody has seen before.

Part D: Interview Preparation ⭐⭐⭐

Exercise 34.18 — The STAR-D practice

Choose your vaccination rate analysis project and practice describing it using the STAR-D framework:

Situation: One to two sentences of context
Task: What were you trying to accomplish?
Action: What did you specifically do?
Result: What did you find or achieve?
Decisions: What judgment calls did you make and why?

Write this out, then practice saying it aloud in under two minutes.

Guidance

The time constraint matters — interview answers that go beyond two to three minutes lose the interviewer's attention. Practice trimming. The D (Decisions) component is what separates a good answer from a great one: "I chose a random forest over linear regression because my exploratory plots showed non-linear relationships between GDP and vaccination rates, and I valued feature importance rankings for communicating results to a non-technical audience."

Exercise 34.19 — Take-home simulation

Simulate a take-home assessment. Choose a dataset you haven't worked with before (Kaggle's "Datasets" section is a good source). Set a timer for four hours. In that time:

Formulate a question
Clean the data
Create three to five visualizations
Run at least one statistical test or model
Write a summary of findings

Produce a polished notebook as if you were submitting it to a hiring manager. Then review it against the CRISP criteria.

Guidance

The time constraint is the point. A real take-home usually comes with a fixed window (typically 48-72 hours), but you shouldn't spend all of it working. The key skill is prioritization: what analysis will answer the question most directly? Start with EDA and a clear summary before attempting anything complex. A clean, well-communicated simple analysis beats a messy, uncommunicated complex one every time.

Exercise 34.20 — Behavioral question bank

Write brief (three to four sentence) answers to each of these common behavioral interview questions, drawing on your actual experience from this course:

1. "Tell me about a time you worked with messy data."
2. "Describe a time when your analysis produced a surprising result."
3. "How do you explain technical findings to a non-technical audience?"
4. "Tell me about a project where you had to make a difficult analytical decision."
5. "What's the most interesting thing you've learned about data science?"

Guidance

Draw on specific moments from your work in this course. For question 1, talk about the vaccination dataset's missing values, inconsistent country names, or duplicate records. For question 2, mention a finding from your analysis that challenged your assumptions. For question 3, describe writing the executive summary in Chapter 31. The key is specificity — "I worked with messy data" is weak; "I found that 47 countries had missing GDP data, and dropping them would have eliminated most of Sub-Saharan Africa from my analysis, so I..." is strong.

Exercise 34.21 — SQL practice for interviews

Many data science interviews include SQL questions. Write SQL queries (or pandas equivalents) for these common interview-style problems using your vaccination dataset:

1. Find the top 10 countries by vaccination rate
2. Calculate the average vaccination rate by WHO region
3. Find countries where the vaccination rate is above the average for their income group
4. Calculate the year-over-year change in vaccination rate for each country
5. Find regions where no country has a vaccination rate below 50%

Guidance

These map to common SQL patterns: problem 1 uses ORDER BY with LIMIT; problem 2 uses GROUP BY with AVG; problem 3 uses a subquery or a window function; problem 4 uses a self-join or the LAG window function; and problem 5 uses NOT EXISTS or HAVING MIN(...) >= 50. If you write them in both SQL and pandas, you're demonstrating versatility that many interviewers appreciate.
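One way to practice these patterns without setting up a database server is Python's built-in `sqlite3` module. The sketch below runs one query per problem against a tiny in-memory table. The table name `vaccination`, its columns, and all the rows are assumptions made up for illustration; your own dataset will have a different schema. It uses a correlated subquery for problem 3 and a self-join for problem 4, which are only one of the two options named for each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vaccination ("
    "country TEXT, region TEXT, income_group TEXT, year INTEGER, rate REAL)"
)
# Toy rows, invented for illustration.
conn.executemany(
    "INSERT INTO vaccination VALUES (?, ?, ?, ?, ?)",
    [
        ("A", "North", "High", 2021, 60.0), ("A", "North", "High", 2022, 80.0),
        ("B", "North", "High", 2021, 50.0), ("B", "North", "High", 2022, 70.0),
        ("C", "South", "Low", 2021, 10.0), ("C", "South", "Low", 2022, 30.0),
        ("D", "South", "Low", 2021, 5.0), ("D", "South", "Low", 2022, 10.0),
    ],
)


def run(sql):
    return conn.execute(sql).fetchall()


# 1. Top 10 by rate: ORDER BY + LIMIT.
top_countries = run(
    "SELECT country, rate FROM vaccination WHERE year = 2022 "
    "ORDER BY rate DESC LIMIT 10"
)
# 2. Average by region: GROUP BY + AVG.
region_avg = run(
    "SELECT region, AVG(rate) FROM vaccination WHERE year = 2022 "
    "GROUP BY region ORDER BY region"
)
# 3. Above the income-group average: correlated subquery.
above_group_avg = run(
    "SELECT v.country FROM vaccination AS v WHERE v.year = 2022 AND v.rate > ("
    "  SELECT AVG(g.rate) FROM vaccination AS g"
    "  WHERE g.income_group = v.income_group AND g.year = v.year"
    ") ORDER BY v.country"
)
# 4. Year-over-year change: self-join on the previous year.
yoy_change = run(
    "SELECT cur.country, cur.year, cur.rate - prev.rate AS change "
    "FROM vaccination AS cur JOIN vaccination AS prev "
    "  ON prev.country = cur.country AND prev.year = cur.year - 1 "
    "ORDER BY cur.country"
)
# 5. Regions with no country below 50%: HAVING MIN(rate) >= 50.
regions_all_above_50 = run(
    "SELECT region FROM vaccination WHERE year = 2022 "
    "GROUP BY region HAVING MIN(rate) >= 50"
)

print(top_countries, region_avg, above_group_avg, yoy_change, regions_all_above_50)
```

In an interview, it helps to say aloud which pattern you are reaching for before writing the query, since the reasoning is usually what is being assessed.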

Part E: Synthesis and Reflection ⭐⭐⭐

Exercise 34.22 — The six-month portfolio plan

Create a concrete plan for building your portfolio over the next six months. Include:

Month 1: What will you polish and publish? (Start with the vaccination analysis)
Months 2-3: What domain project will you build? (Choose from Exercise 34.14)
Months 4-5: What technical demonstration project will you create?
Month 6: What blog posts will you write? What will your final portfolio look like?

For each month, set specific deliverables with deadlines.

Guidance

The key to this plan is specificity and realism. "Build projects" is not a plan. "By March 31, publish the polished vaccination analysis on GitHub with a complete README and pin it to my profile" is a plan. Build in buffer time — projects always take longer than you expect. And remember: completing three projects in six months is better than starting six and finishing none.

Exercise 34.23 — Peer review exchange

Find a classmate, study partner, or online community member and exchange portfolio pieces for review. For their project, write feedback addressing:

First impression (would you keep reading?)
CRISP criteria scores (1-5 for each)
Three specific things they did well
Three specific things they could improve
One question their analysis raised that they could investigate further

Guidance

Peer review is one of the most valuable things you can do for your portfolio. You'll see patterns in your own work that you can't see from the inside, and reviewing someone else's work sharpens your ability to evaluate your own. Be constructive — specific praise and specific suggestions are both more useful than vague comments. "Your chart in section 3 would be clearer with a legend" is better than "nice charts" or "needs work."

Exercise 34.24 — The elevator pitch

Write a 30-second elevator pitch for yourself as a data science candidate. It should cover: who you are, what you can do, what you're interested in, and what you're looking for. Practice saying it aloud until it feels natural.

Guidance

Template: "I'm [name], and I've been training as a data scientist with a focus on [domain]. I recently completed a project analyzing [topic] using [tools], where I found [key finding]. I'm looking for [role type] where I can [what you want to do]. My portfolio is at [URL] — I'd love for you to take a look." This should be conversational, not rehearsed-sounding. The key is to be specific about what you've done and genuine about what interests you.

Exercise 34.25 — The honest self-assessment

Complete this self-assessment table honestly. For each skill area, rate yourself 1-5 and identify specific evidence from your portfolio:

| Skill Area | Self-Rating (1-5) | Evidence (specific project/analysis) | Next Step to Improve |
|---|---|---|---|
| Data cleaning and wrangling | | | |
| Exploratory visualization | | | |
| Statistical analysis | | | |
| Machine learning | | | |
| Communication and writing | | | |
| Code quality and organization | | | |
| Domain knowledge | | | |
| SQL | | | |

Use this assessment to guide which portfolio projects to prioritize next. Build projects that strengthen your weakest areas or highlight your strongest ones — ideally both.

Guidance

This exercise requires genuine honesty. If your SQL skills are a 2, acknowledge it and plan to build a project that uses SQL. If your communication is a 4, make sure your portfolio showcases that strength prominently. The "evidence" column is the most important — if you can't point to a specific project or analysis that demonstrates a skill, that skill isn't yet part of your portfolio, regardless of how well you think you know it.