Chapter 24 Exercises: Audience Analytics with Python

Prerequisites: Python 3.8+, pandas, matplotlib, seaborn, scikit-learn installed. See Section 24.2 for installation instructions.


Exercise 24.1: Running and Interpreting the Growth Analysis Script

Objective: Execute growth_analysis.py with sample data, interpret the outputs, and adapt the script to your own data.

Part A: Run the Script as-is (30 minutes)

  1. Open your command line or terminal
  2. Navigate to the code/ directory of this chapter
  3. Run: python growth_analysis.py
  4. The script will generate a chart (growth_analysis_chart.png) and print a console report

Answer these questions based on the output:

  1. How many inflection points were detected in the sample data? On what approximate dates did they occur?
  2. Looking at the chart's top panel: at the inflection point dates, does the 4-week moving average curve show a steeper slope than the 12-week moving average? What does it mean when the short-term MA is above the long-term MA?
  3. The console report shows "Current growth momentum." Is the recent trend above or below the historical average in the sample data?
  4. Looking at the bottom panel (weekly growth rate bars): approximately what percentage growth rate was needed to qualify as an inflection point?

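If you want to check your reading of the chart against the underlying arithmetic, the core calculations can be sketched in a few lines of pandas. This is a simplified stand-in, not growth_analysis.py itself: the follower series is hypothetical, and the window sizes and 1.5x threshold mirror the defaults described in these exercises.

```python
import pandas as pd

# Hypothetical weekly follower counts with a deliberate growth spurt;
# growth_analysis.py reads its own series from a CSV instead.
followers = pd.Series([100, 103, 106, 109, 112, 115, 150, 190,
                       196, 202, 208, 214, 220, 226, 232])

ma_4wk = followers.rolling(window=4).mean()    # short-term trend
ma_12wk = followers.rolling(window=12).mean()  # long-term trend

# Weekly growth rate as a percentage
growth_rate = followers.pct_change() * 100

# Flag weeks whose growth rate exceeds the historical mean by a
# multiplier (1.5 mirrors INFLECTION_THRESHOLD_MULTIPLIER in Part C)
threshold = growth_rate.mean() * 1.5
inflections = growth_rate[growth_rate > threshold]
print(inflections)
```

When the 4-week MA sits above the 12-week MA, recent growth is outpacing the longer-run trend, which is what question 2 is probing.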
Part B: Adapt for Your Own Data (60 minutes)

If you have any of the following, use it for this exercise:

  - A YouTube Studio subscriber count CSV export
  - A manually maintained follower tracking spreadsheet exported as CSV
  - Any CSV with a date column and a count column

  1. Update DATA_FILE, DATE_COLUMN, and FOLLOWER_COLUMN in the configuration section
  2. Run the script with your real data
  3. Compare the detected inflection points to your memory of your growth history. Which inflection points does the algorithm identify that you would have identified manually? Which, if any, does it miss or add incorrectly?

Part C: Code Modification (45 minutes)

Make one of these modifications to the script and re-run it:

Option A: Change INFLECTION_THRESHOLD_MULTIPLIER from 1.5 to 1.0, then to 2.0. How does this change the number of detected inflection points? Which threshold level produces the most useful results for your data?

Option B: Add a third moving average line to the chart. In the calculate_growth_metrics function, add a ma_26wk (26-week moving average) column. In the generate_growth_chart function, plot it as a purple line with linewidth=1.5. What additional insight does the 26-week MA provide compared to the 4-week and 12-week MAs?

Option C: Add a "projected growth" line to the top panel. Calculate the average weekly growth over the last 12 weeks and project it forward 12 weeks. Plot this as a dashed gray line extending beyond the current end date. Does the projection look consistent with what you'd expect from your channel's trajectory?
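
For Option C, the projection is just arithmetic on the tail of the series. A minimal sketch, assuming hypothetical column names (date, followers) that may differ from the script's:

```python
import pandas as pd

# Hypothetical weekly data; the real script's column names may differ.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-07", periods=24, freq="W"),
    "followers": [1000 + 50 * i + i * i for i in range(24)],
})

# Average absolute weekly gain over the last 12 weeks
recent = df["followers"].diff().tail(12).mean()

# Project 12 weeks past the last observed date
last_date = df["date"].iloc[-1]
last_value = df["followers"].iloc[-1]
projection = pd.DataFrame({
    "date": pd.date_range(last_date, periods=13, freq="W")[1:],
    "followers": [last_value + recent * (i + 1) for i in range(12)],
})

# Plot as a dashed gray line extending beyond the observed data, e.g.:
# ax.plot(projection["date"], projection["followers"],
#         linestyle="--", color="gray", label="12-week projection")
```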


Exercise 24.2: Hands-On with Pandas — Data Cleaning Challenges

Objective: Develop practical pandas data cleaning skills using messy real-world-style data.

Part A: Create a Messy Data File

Create a CSV file called messy_data.csv with the following exact content (the formatting problems are intentional):

Date,Followers,Notes
"January 7, 2024","1,204",New year post
"January 14, 2024","1,287",Weekly update
"January 21, 2024","1,345",Tutorial video
"January 28, 2024",,No post this week
"February 4, 2024","2,891",Viral video!
"February 11, 2024","3,012",Follow-up content
"February 18, 2024","N/A",Platform outage
"February 25, 2024","3,156",Normal week

Part B: Write a Cleaning Script

Write a Python script called data_cleaning_exercise.py that:

  1. Loads messy_data.csv using pd.read_csv()
  2. Identifies and prints all data quality issues found (hint: check .dtypes, .isna().sum(), and look for non-numeric values in the Followers column)
  3. Converts the Date column to proper datetime format using pd.to_datetime()
  4. Removes commas from the Followers column and converts to integer
  5. Handles the missing value (empty row) by filling with the average of the surrounding rows
  6. Handles the "N/A" string in Followers by replacing it with NaN and then interpolating (use df['Followers'].interpolate())
  7. Drops the Notes column (not needed for analysis)
  8. Prints the clean DataFrame and confirms all dtypes are correct
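
If you get stuck, here is one way the cleaning steps can fit together. It is a starting-point sketch, with the CSV inlined so it runs standalone; in your script, read messy_data.csv from disk instead. Note that pandas treats the literal string "N/A" as missing by default, and linear interpolation fills a lone gap with the average of its neighbors, so steps 5 and 6 collapse into a single interpolate() call.

```python
import io
import pandas as pd

# Stand-in for messy_data.csv so the sketch is self-contained;
# in your script, pass the filename to pd.read_csv instead.
CSV = '''Date,Followers,Notes
"January 7, 2024","1,204",New year post
"January 14, 2024","1,287",Weekly update
"January 21, 2024","1,345",Tutorial video
"January 28, 2024",,No post this week
"February 4, 2024","2,891",Viral video!
"February 11, 2024","3,012",Follow-up content
"February 18, 2024","N/A",Platform outage
"February 25, 2024","3,156",Normal week
'''

df = pd.read_csv(io.StringIO(CSV))

# Report data-quality issues before touching anything
print(df.dtypes)
print(df.isna().sum())

# "N/A" is in pandas' default na_values, so it already loaded as NaN;
# the remaining problems are string dates and comma-grouped numbers.
df["Date"] = pd.to_datetime(df["Date"], format="%B %d, %Y")
df["Followers"] = pd.to_numeric(df["Followers"].str.replace(",", ""))

# Linear interpolation fills each NaN with the average of its neighbors
df["Followers"] = df["Followers"].interpolate()

df = df.drop(columns=["Notes"]).astype({"Followers": int})
print(df)
print(df.dtypes)
```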

Part C: Answer these questions in comments at the top of your script:

  1. What would happen if you ran df['Followers'].pct_change() on the original (uncleaned) data?
  2. The "February 18" row had a platform outage — the data is missing for a genuine operational reason. Is it appropriate to interpolate this value, or should it be excluded from growth rate calculations? What are the arguments for each approach?


Exercise 24.3: K-Means Clustering Exploration

Objective: Develop intuition for how K-means clustering behaves through hands-on experimentation.

Part A: Run audience_segmentation.py as-is

Run the script and record:

  1. The silhouette score (a measure of cluster quality)
  2. The count and percentage for each segment (Lurkers, Engagers, Superfans)
  3. The mean values for each behavioral metric per segment
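
For reference, the silhouette score can be computed directly with scikit-learn. This sketch uses synthetic Poisson-distributed features as a stand-in for the script's data; the real feature columns may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical behavioral features for three synthetic groups
# (columns: sessions, comments, purchases)
lurkers = rng.poisson([2, 0.1, 0.05], size=(300, 3))
engagers = rng.poisson([8, 2, 0.3], size=(120, 3))
superfans = rng.poisson([20, 8, 2], size=(30, 3))
X = np.vstack([lurkers, engagers, superfans])

# Scale features so K-means distances aren't dominated by one column
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
score = silhouette_score(X_scaled, km.labels_)
print(f"silhouette score for k=3: {score:.3f}")  # closer to 1 = better separated
```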

Part B: Vary K (Number of Clusters)

Modify N_CLUSTERS in the configuration section and re-run the script with k=2, k=4, and k=5.

For each value of k:

  1. Record the silhouette score
  2. Describe in one sentence what the clusters look like at that k value
  3. Does the segmentation still produce interpretable, actionable segments?

Create a summary table:

| k | Silhouette Score | Interpretability (1-5) | Notes |
|---|------------------|------------------------|-------|
| 2 |                  |                        |       |
| 3 |                  |                        |       |
| 4 |                  |                        |       |
| 5 |                  |                        |       |

Question: Based on your experimentation, what k value would you recommend for a creator community? Justify your answer using both the silhouette scores and the interpretability of results.

Part C: Change the Feature Weights

The assign_segment_labels function uses an engagement score that weights purchases most heavily. Rewrite the scoring formula so that comments_made is weighted most heavily (weight=4) and purchases_made is weighted second (weight=3).

  1. Does this change which cluster is labeled "Superfan"?
  2. Under what creator business model would comments be a better primary differentiator than purchases? (Think about which creator types rely on community engagement over direct sales.)
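
One way to express the reweighted score, assuming a hypothetical cluster-profile DataFrame (the real assign_segment_labels works on the script's own profile table, whose columns may differ):

```python
import pandas as pd

# Hypothetical per-cluster mean behavior, one row per cluster
profiles = pd.DataFrame({
    "comments_made": [0.1, 2.0, 8.0],
    "purchases_made": [0.05, 0.3, 2.0],
    "sessions": [2, 8, 20],
}, index=[0, 1, 2])

# Reweighted score: comments heaviest (4), purchases second (3)
score = (profiles["comments_made"] * 4
         + profiles["purchases_made"] * 3
         + profiles["sessions"] * 1)

superfan_cluster = score.idxmax()  # label the highest-scoring cluster "Superfan"
```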

Part D: Add a New Feature

Modify the sample data generator to add a fifth column, shares_made (the number of times a user shared your content). Add it to the three segment distributions as follows:

  - Lurkers: Poisson(lam=0.1) — rarely share
  - Engagers: Poisson(lam=1.5) — occasionally share
  - Superfans: Poisson(lam=5.0) — frequently share

Add shares_made to the feature columns in run_clustering(). Re-run and compare the silhouette score and segment profiles to the original. Did adding this feature improve cluster separation?
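
The new column can be generated with NumPy's Poisson sampler. The segment sizes and variable names here are illustrative; match them to the script's actual sample data generator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical segment sizes; lambdas match the exercise description
n_lurkers, n_engagers, n_superfans = 300, 120, 30
shares = np.concatenate([
    rng.poisson(lam=0.1, size=n_lurkers),    # Lurkers: rarely share
    rng.poisson(lam=1.5, size=n_engagers),   # Engagers: occasionally share
    rng.poisson(lam=5.0, size=n_superfans),  # Superfans: frequently share
])
df = pd.DataFrame({"shares_made": shares})

# Then append "shares_made" to the feature-column list passed to
# run_clustering() before re-fitting K-means.
```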


Exercise 24.4: UTM Parameter Setup and Attribution Simulation

Objective: Set up real UTM tracking for your own creator links and understand the revenue attribution data model.

Part A: Create UTM-Tagged Links for Your Content

Go to Google's Campaign URL Builder (https://ga.dev/campaignurlbuilder or search "Google Campaign URL Builder").

Create UTM-tagged versions of your current primary product or landing page link for each of your active platforms. Use this naming convention:

  - utm_source: platform name in lowercase (youtube, tiktok, email, instagram)
  - utm_medium: content type (video, newsletter, reel, story)
  - utm_campaign: a short descriptive identifier that you can use as a content_id in your tracking

Generate at least 4 tagged links (one per active platform or content type). Record them in a table:

| Platform | Medium | Campaign | Full UTM URL |
|----------|--------|----------|--------------|
|          |        |          |              |
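
If you would rather build the links programmatically than through the web form, the standard library handles the encoding. The base URL and campaign name below are placeholders.

```python
from urllib.parse import urlencode

def utm_link(base_url, source, medium, campaign):
    """Append UTM parameters to a landing-page URL."""
    params = urlencode({
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
    })
    sep = "&" if "?" in base_url else "?"
    return f"{base_url}{sep}{params}"

# Hypothetical product page and campaign id
link = utm_link("https://example.com/course", "youtube", "video", "launch-2024")
print(link)
# https://example.com/course?utm_source=youtube&utm_medium=video&utm_campaign=launch-2024
```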

Part B: Plan Your Attribution Data Structure

Design the CSV structure you would actually use to track revenue attribution for your creator business. Answer:

  1. What would you use as your content_id — how would you format it to be both unique and descriptive?
  2. How would your payment processor (Gumroad, Stripe, Kajabi, Teachable) capture the UTM parameter at the time of purchase? Research your specific platform's UTM or tracking capabilities.
  3. Some sales will always be unattributed (word of mouth, bookmarked links without UTM tags). What is a realistic target attribution rate for your business, and what would you do to improve it?

Part C: Run revenue_attribution.py and Interpret

Run the script with sample data and answer:

  1. What percentage of sales were unattributed in the sample data?
  2. The script calculates "revenue per 1K views" for each content piece. Looking at the top 10 content pieces, is there a pattern in which platform or content type has the highest revenue efficiency?
  3. The sample data was designed with an 80/20 distribution (top 20% of content drives 80% of revenue). Does the actual output reflect this pattern? What percentage of content pieces account for 80% of the attributed revenue?
  4. What would you do differently in your content strategy if you had this attribution data for your own content?
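
Question 3 is a cumulative-share calculation. A sketch on synthetic Pareto-distributed revenue (a stand-in for the script's per-content totals) shows the mechanics:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical attributed revenue per content piece; the real numbers
# come from revenue_attribution.py's per-content totals.
revenue = pd.Series(rng.pareto(a=1.16, size=100) * 50)

# Sort descending, then find how many pieces cover 80% of revenue
cumshare = revenue.sort_values(ascending=False).cumsum() / revenue.sum()
pieces_for_80pct = int((cumshare < 0.8).sum()) + 1
print(f"{pieces_for_80pct} of {len(revenue)} pieces drive 80% of revenue")
```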

Exercise 24.5: Building a Mini Creator Dashboard

Objective: Combine the three analysis scripts into a single, integrated creator analytics report.

Instructions:

Create a new Python file called creator_dashboard.py that:

  1. Imports and calls the main analysis functions from all three scripts (you'll need to refactor the scripts slightly to import them as modules — or simply copy the functions into the dashboard file)

  2. Generates a combined 3-panel matplotlib figure with:
     - Panel 1: Growth trend chart (simplified version — just follower count and the 4-week MA)
     - Panel 2: Segmentation pie chart showing Lurker/Engager/Superfan percentages
     - Panel 3: Top 5 content pieces by attributed revenue as a horizontal bar chart

  3. Adds a "Dashboard Header" with:
     - Current date
     - Total follower count (from growth data)
     - Total attributed revenue (from attribution data)
     - Superfan count and percentage (from segmentation data)

  4. Saves the dashboard as creator_dashboard.png at 200 DPI

Hint for the pie chart:

```python
segment_counts = df_labeled['segment'].value_counts()
ax.pie(segment_counts.values,
       labels=segment_counts.index,
       autopct='%1.1f%%',
       colors=[SEGMENT_COLORS[s] for s in segment_counts.index])
```
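
And a minimal skeleton for the overall layout; the titles, filename, and figure size are placeholders to be filled from the three scripts' outputs.

```python
from datetime import date

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Layout skeleton only: fill each axis from the corresponding script
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))

ax1.set_title("Growth Trend")        # follower count + 4-week MA
ax2.set_title("Audience Segments")   # pie chart from the hint above
ax3.set_title("Top 5 by Revenue")    # horizontal bars via ax3.barh(...)

fig.suptitle(f"Creator Dashboard | {date.today():%Y-%m-%d}", fontsize=16)
fig.savefig("creator_dashboard.png", dpi=200)
```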

Reflection question (write in a comment in your code): This dashboard combines three separate analyses into a unified view. What additional metric or analysis would you add as a fourth panel if you were building this for your own creator business? What data source would you need, and what question would it answer?