Exercises: Your Data Toolkit — Python, Excel, and Jupyter Notebooks

These exercises progress from concept checks to hands-on coding challenges. Estimated completion time: 2 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Setup: For exercises that require code, open a new Jupyter notebook (Google Colab or local) and work through each problem in a separate cell. Save your notebook when you're done — it becomes part of your learning portfolio.


Part A: Conceptual Understanding ⭐

A.1. In your own words, explain the difference between a code cell and a text cell in a Jupyter notebook. When would you use each?

A.2. What is the kernel in a Jupyter notebook? What happens to your variables when you restart the kernel? Why is this important to know?

A.3. Explain why we write import pandas as pd at the top of our notebooks. What would happen if you tried to use pd.read_csv() without this import statement?

A.4. A classmate says, "I'll just use Excel for everything — I don't need Python." Give two scenarios where Excel would be the better choice and two scenarios where Python would be clearly superior.

A.5. What does CSV stand for? Why is this file format so widely used for sharing data across different tools and platforms?

A.6. Explain the difference between = and == in Python. Give an example of each.

A.7. A friend writes this code and gets a NameError:

import pandas as pd
df = pd.read_csv("data.csv")

# ... restarts the kernel ...

print(df.head())

Explain why the error occurs and how to fix it.


Part B: Code Reading and Interpretation ⭐⭐

For each code snippet, predict the output before running the code. Then verify by running it in a notebook.

B.1. What does each line of the following code do? What will the output look like?

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Chen', 'Diana'],
        'score': [88, 92, 76, 95],
        'grade': ['B+', 'A-', 'C+', 'A']}
df = pd.DataFrame(data)
print(df.shape)
df.head()

B.2. Predict the output of this code:

scores = [85, 90, 78, 92, 88]
total = sum(scores)
average = total / len(scores)
print(f"Average: {average}")

B.3. This code loads a dataset and runs several exploration commands. For each command, explain what information it provides:

import pandas as pd

df = pd.read_csv("some_data.csv")
print(df.shape)       # What does this tell you?
print(df.dtypes)      # What does this tell you?
df.info()             # What does this tell you? (info() prints its own report and returns None)
df.describe()         # What does this tell you?

B.4. A student writes the following filtering code. It runs without error but gives the wrong result. Find the bug and explain the fix:

# Goal: Find students with scores ABOVE 90
high_scorers = df[df['score'] > 90]

Wait — that code is actually correct. But what if the student had written df[df['score'] = 90] instead? What error would they get, and why?

B.5. What's the difference between these two lines of code?

df.sort_values('score')
df.sort_values('score', ascending=False)

If the score column contains [88, 92, 76, 95], what order would each produce?


Part C: Hands-On Coding ⭐⭐-⭐⭐⭐

For these exercises, load the health dataset used in the chapter:

import pandas as pd

url = "https://raw.githubusercontent.com/intro-stats-data/datasets/main/brfss_sample_500.csv"
health = pd.read_csv(url)

C.1. Write code to answer each of the following questions about the health dataset:

a) How many rows and columns does the dataset have?
b) What are the names of all the columns?
c) What data type does pandas assign to each column?
d) How many missing values are in the bmi column?

C.2. Use .describe() on the health dataset and answer these questions (using both the output and your Chapter 2 knowledge):

a) What is the mean BMI?
b) What is the median number of sleep hours?
c) What is the range of ages (max minus min)?
d) What percentage of respondents are smokers? (Hint: think about what the mean of a 0/1 variable tells you.)
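The hint in part (d) can be illustrated with a minimal sketch on made-up values. The series below is hypothetical and is not taken from the health dataset; it only shows why the mean of a 0/1 column can be read as a proportion.

```python
import pandas as pd

# Hypothetical 0/1 indicator (1 = yes, 0 = no); NOT from the health dataset
flags = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# The mean adds up the 1s and divides by the number of values,
# which is exactly the proportion of "yes" answers
proportion = flags.mean()
percentage = proportion * 100

print(f"Proportion of 1s: {proportion:.2f}")
print(f"Percentage of 1s: {percentage:.1f}%")
```

The same reasoning should carry over to any column coded as 0 and 1, provided missing values are handled the way you expect (pandas skips them by default when computing a mean).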

C.3. Write the code to filter the dataset and find:

a) All respondents who sleep fewer than 5 hours per night. How many are there?
b) All respondents who are smokers AND over age 50. How many are there?
c) All respondents from Texas ('TX'). What is their average BMI?
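If the filtering syntax feels unfamiliar, here is a small sketch of the general pattern on toy data. The pets table and its column names are invented for illustration, so you will need to adapt the idea to the health dataset's actual columns yourself.

```python
import pandas as pd

# Toy data with hypothetical column names (NOT the health dataset)
pets = pd.DataFrame({
    "species": ["dog", "cat", "dog", "cat", "dog"],
    "age": [3, 12, 9, 2, 11],
    "weight_kg": [20.0, 4.5, 30.0, 3.5, 25.0],
})

# One condition: a True/False mask inside the brackets keeps the matching rows
old_pets = pets[pets["age"] > 8]
print(len(old_pets))          # number of rows that matched

# Two conditions: wrap each one in parentheses and join them with & ("and")
old_dogs = pets[(pets["species"] == "dog") & (pets["age"] > 8)]
print(len(old_dogs))

# Filter first, then summarize one column of the filtered rows
print(old_dogs["weight_kg"].mean())
```

The parentheses around each condition matter: without them, Python evaluates & before the comparisons and the expression usually fails.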

C.4. Using the health dataset, use .value_counts() on the gen_health column. Then answer:

a) Which general health rating is most common?
b) Which is least common?
c) What percentage of respondents rate their health as "Excellent" (1) or "Very Good" (2)? (Hint: add those two counts and divide by the total.)
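The arithmetic in the hint can be sketched on a toy column. The values below are hypothetical survey codes, not the gen_health column, and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical survey answers coded 1-3 (NOT the gen_health column)
answers = pd.Series([1, 2, 2, 3, 2, 1, 3, 2])

# value_counts() returns one count per distinct value, most common first
counts = answers.value_counts()
print(counts)

# Share of answers that are 1 or 2: add the two counts, divide by the total
share = (counts.loc[1] + counts.loc[2]) / len(answers) * 100
print(f"{share:.1f}%")
```

Here .loc[1] looks up the count by its label (the answer code), not by its position in the output.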

C.5. Write code to find the 10 respondents with the highest BMI. Then write code to find the 10 youngest respondents. What do you notice about the health ratings of the youngest respondents compared to the highest-BMI respondents?


Part D: Integration and Analysis ⭐⭐⭐

D.1. Load the StreamVibe dataset from the chapter:

url2 = "https://raw.githubusercontent.com/intro-stats-data/datasets/main/streamvibe_users_300.csv"
stream = pd.read_csv(url2)

Write code to answer:
a) How many users have each subscription type? (Use .value_counts())
b) What is the average satisfaction score for each subscription type? (Use .groupby())
c) Filter for users who joined in 2020 or earlier. How many are there? What is their average watch time compared to the overall average?
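As a reminder of how .groupby() is typically combined with a summary function, here is a minimal sketch on toy data. The table and its column names are assumptions made up for this example and do not match the StreamVibe dataset.

```python
import pandas as pd

# Toy data with hypothetical column names (NOT the StreamVibe dataset)
orders = pd.DataFrame({
    "plan": ["basic", "plus", "basic", "plus", "basic"],
    "rating": [3, 5, 4, 4, 2],
})

# Split the rows by plan, then average the rating column within each group
mean_by_plan = orders.groupby("plan")["rating"].mean()
print(mean_by_plan)

# The overall mean, for comparison with the per-group means
print(orders["rating"].mean())
```

The result of the groupby is a Series indexed by group name, so you can look up a single group with something like mean_by_plan["plus"].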

D.2. For the StreamVibe dataset, create a data exploration that Alex Rivera might present to her manager. In a notebook with text cells for context:

a) State one question you want to investigate about user behavior.
b) Write the code to answer it (filtering, grouping, or sorting).
c) Write a one-paragraph interpretation of your findings: what do the numbers mean in business terms?
d) Note one limitation of your analysis (think about what you learned in Chapters 1 and 2 about descriptive vs. inferential statistics, variable types, or observational units).

D.3. Load the basketball dataset from the chapter:

url3 = "https://raw.githubusercontent.com/intro-stats-data/datasets/main/basketball_stats_200.csv"
ball = pd.read_csv(url3)

Sam Okafor is asked: "Which teams have the best three-point shooters?" Write the code to:
a) Find the average three-point percentage by team
b) Sort the result from highest to lowest
c) Show only the top 5 teams

Then write a text cell explaining your findings and one thing Sam should be cautious about when interpreting the results.
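The three steps (group, sort, keep the top few) can be chained into one expression. The sketch below uses invented data and hypothetical column names, so treat it as a pattern to adapt rather than the code for the basketball dataset.

```python
import pandas as pd

# Toy data with hypothetical column names (NOT the basketball dataset)
shots = pd.DataFrame({
    "club": ["A", "A", "B", "B", "C", "C"],
    "accuracy": [0.30, 0.40, 0.50, 0.46, 0.20, 0.30],
})

top_two = (
    shots.groupby("club")["accuracy"].mean()   # one average per club
         .sort_values(ascending=False)         # highest average first
         .head(2)                              # keep only the first two rows
)
print(top_two)
```

Wrapping the chain in parentheses lets each step sit on its own line, which makes it easier to comment and to debug one step at a time.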

D.4. Python vs. Spreadsheet Challenge: Perform the following tasks in both Google Sheets and Python:
- Load the health dataset (paste the CSV into Sheets; use pd.read_csv() in Python)
- Calculate the average age of smokers vs. non-smokers
- Find the state with the most respondents

Write a short paragraph comparing the two experiences. Which felt easier? Which would you rather use if you had to repeat this analysis every month with new data?


Part E: Deeper Thinking ⭐⭐⭐⭐

E.1. The .describe() function only summarizes numerical columns by default. Why doesn't it try to calculate the mean or standard deviation of categorical columns like state or sex? What would be useful summary information for categorical columns? Write the code to generate that information.

E.2. In the health dataset, gen_health is stored as an integer (1-5) but is actually an ordinal categorical variable. Write code that creates a new column called gen_health_label that replaces the numbers with their text equivalents:
- 1 = "Excellent"
- 2 = "Very Good"
- 3 = "Good"
- 4 = "Fair"
- 5 = "Poor"

(Hint: Look up df['col'].map() with a dictionary.)

Then use .value_counts() on your new column. Does it give you the same counts as the original column?
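If you have not used .map() with a dictionary before, this sketch shows the pattern on an unrelated toy column. The shirts table, the size codes, and the labels are all hypothetical.

```python
import pandas as pd

# Hypothetical coded column (NOT gen_health): 1 = small, 2 = medium, 3 = large
shirts = pd.DataFrame({"size_code": [2, 1, 3, 2, 2]})

# Dictionary from each code to the label it should become
size_names = {1: "small", 2: "medium", 3: "large"}

# .map() looks up each value in the dictionary and returns the matching label;
# any value missing from the dictionary would become NaN
shirts["size_label"] = shirts["size_code"].map(size_names)
print(shirts)
```

Because the original column is left untouched, you can compare the coded and labeled versions side by side to check the mapping.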

E.3. Research Exercise: Find a real CSV dataset online (not one from this textbook). Good sources include:
- data.gov (U.S. government open data)
- Kaggle Datasets
- Our World in Data
- FiveThirtyEight data

Load it into a notebook and perform a complete initial exploration:
a) Run .head(), .shape, .info(), and .describe()
b) Identify each variable as categorical or numerical (use your Chapter 2 skills)
c) Note any missing values
d) Ask and answer one question using filtering, sorting, or .groupby()
e) Write a brief data dictionary for the dataset (at least 5 variables)


Part M: Mixed Review (Interleaved with Chapters 1 and 2) ⭐-⭐⭐

These problems deliberately mix concepts from Chapters 1, 2, and 3 to strengthen your ability to connect ideas across topics.

M.1. When we loaded the health dataset and computed health['smoker'].mean() to find that 17.8% of respondents smoke, we were doing descriptive statistics. Write a sentence that would turn this into inferential statistics. What additional information would you need to feel confident about the inference? (Chapter 1 connection)

M.2. In the basketball dataset, player_id is stored as an object type in pandas. Is it a categorical or numerical variable? Is it nominal or ordinal? Could you meaningfully calculate ball['player_id'].mean()? Why or why not? (Chapter 2 connection)

M.3. Dr. Maya Chen is analyzing the health dataset. She wants to know whether exercise is associated with better general health. She writes:

exercisers = health[health['exercise'] == 1]['gen_health'].mean()
non_exercisers = health[health['exercise'] == 0]['gen_health'].mean()
print(f"Exercisers: {exercisers:.2f}")
print(f"Non-exercisers: {non_exercisers:.2f}")

a) What do you expect to find: will exercisers have a lower or higher mean gen_health score? (Remember: 1 = Excellent, 5 = Poor.)
b) Even if exercisers have a better average health rating, can Maya conclude that exercise causes better health? Why or why not? (Chapter 1 connection: correlation vs. causation)
c) What kind of study design would you need to establish causation? (Preview of Chapter 4)

M.4. Consider this scenario: A student loads a dataset with a column called zip_code that pandas recognizes as int64. The student runs .describe() and reports that the "average zip code" is 55102.

a) What's wrong with this analysis? (Chapter 2 connection)
b) What Python command would be more appropriate for summarizing zip codes?
c) What would you change about how the data is stored to prevent this mistake?