Case Study 24-2: Substack's Creator Analytics Infrastructure

DataField.Dev

Background

Substack launched in 2017 with a premise that became influential across the creator economy: writers should own their audience relationship directly, and the business model should be subscription revenue shared with the platform (90% to creator, 10% to Substack) rather than advertising.

By 2024, Substack hosted several thousand paid publications that together generated over $100 million annually, with the top 10 publications collectively earning tens of millions of dollars. Publications like The Atlantic's "The Weekly Planet," Heather Cox Richardson's "Letters from an American," and dozens of independent newsletters on topics from finance to culture to technology were generating six-figure and seven-figure annual revenues.

One of Substack's differentiating factors for creators — less discussed than the revenue model — was its analytics infrastructure. Substack provided creators with subscriber-level engagement data that most social platforms didn't offer, making it a more transparent business tool than typical social media.

What Substack's Analytics Offered

Substack's analytics dashboard (as of 2024) provided creators with:

Post-level analytics:
- Unique opens: How many distinct subscribers opened each post
- Open rate: Percentage of subscribers who opened
- Click-through rate: Percentage who clicked at least one link
- Email vs. web reads: What proportion read via email vs. on the Substack website
- Top clicked links: Which specific URLs within the post got the most clicks

Subscriber-level data:
- Individual subscriber open history (for their own subscribers)
- Subscription date
- Whether a subscriber is paid or free
- Geographic distribution
- Referral source: How the subscriber found the publication (social, direct, Substack's internal recommendation network)

Growth analytics:
- New subscriber graph over time
- Paid conversion rate (free to paid subscriber conversion)
- Churn rate (paid subscribers who canceled)
- Revenue metrics: MRR, total revenue, average revenue per subscriber

Referral analytics:
- How new subscribers found the publication (Substack Network vs. direct vs. external)
- Top referral sources

This level of data was notably more detailed than most email service providers offered in their standard tiers, and far more granular than any social media platform's free analytics.
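The growth metrics listed above reduce to simple ratios. The following is a minimal sketch, assuming hypothetical monthly counts; the case study does not document Substack's exact definitions (for example, which denominator it uses for churn), so the formulas below are common conventions rather than the platform's own.

```python
from dataclasses import dataclass


@dataclass
class MonthSnapshot:
    """Hypothetical monthly figures a creator might read off a dashboard."""
    free_subscribers: int
    paid_at_start: int
    new_paid: int
    cancelled_paid: int
    monthly_price: float


def growth_metrics(m: MonthSnapshot) -> dict:
    """Compute conversion, churn, MRR, and revenue per subscriber (assumed definitions)."""
    paid_at_end = m.paid_at_start + m.new_paid - m.cancelled_paid
    total = m.free_subscribers + paid_at_end
    mrr = paid_at_end * m.monthly_price  # gross, before any platform fee
    return {
        # Share of all subscribers who are currently paying
        "paid_conversion_rate": paid_at_end / total if total else 0.0,
        # Cancellations during the month relative to paid base at month start
        "churn_rate": m.cancelled_paid / m.paid_at_start if m.paid_at_start else 0.0,
        "mrr": mrr,
        "avg_revenue_per_subscriber": mrr / total if total else 0.0,
    }


snapshot = MonthSnapshot(
    free_subscribers=4500,
    paid_at_start=480,
    new_paid=40,
    cancelled_paid=20,
    monthly_price=8.0,
)
metrics = growth_metrics(snapshot)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Keeping the definitions in one function makes it easy to swap in a different convention, such as measuring churn against the average paid base instead of the starting base.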

The Subscriber-Level Data Advantage

The most analytically significant aspect of Substack's infrastructure: per-subscriber engagement data.

Unlike most creator platforms where analytics are aggregate (you see total open rates, not which specific subscribers opened), Substack allowed creators to see, for each individual subscriber, which posts they had opened. This enabled a rough segmentation of the audience based purely on reading behavior — without needing Python or any external tool.

A creator could identify:
- Their most engaged subscribers (opened most posts), the equivalent of the "Superfan" segment
- Subscribers who had drifted (used to open, now don't), candidates for a re-engagement sequence
- Brand-new subscribers whose engagement pattern was still forming

For advanced creators who exported this data and ran their own analysis (as audience_segmentation.py is designed to enable), the per-subscriber data supported the full K-means segmentation workflow described in Chapter 24.
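Before reaching for K-means, the rough three-way split described above (engaged, drifted, new) can be expressed as a few rules over an exported open history. This is only a sketch: the function name, the input shape (a list of booleans per subscriber, oldest post first), and every threshold are illustrative assumptions, not part of Substack's export or of audience_segmentation.py.

```python
def classify_subscriber(opens, days_subscribed,
                        new_window_days=30,
                        engaged_threshold=0.7,
                        drift_drop=0.4):
    """Label one subscriber from their open history.

    opens: list of bools, one per post sent, oldest first (assumed format).
    All thresholds are hypothetical defaults to tune per publication.
    """
    # Too little history to judge: treat as still forming a pattern
    if days_subscribed < new_window_days or len(opens) < 4:
        return "new"

    half = len(opens) // 2
    earlier_rate = sum(opens[:half]) / half
    recent_rate = sum(opens[half:]) / (len(opens) - half)
    overall_rate = sum(opens) / len(opens)

    # A large fall from the earlier half to the recent half signals drift
    if earlier_rate - recent_rate >= drift_drop:
        return "drifted"
    if overall_rate >= engaged_threshold:
        return "engaged"
    return "casual"


examples = {
    "steady_reader": ([True] * 8, 200),
    "lapsed_reader": ([True] * 4 + [False] * 4, 200),
    "just_joined": ([True, False], 10),
    "occasional": ([False, True, False, False, True, False, False, False], 200),
}
labels = {name: classify_subscriber(o, d) for name, (o, d) in examples.items()}
print(labels)
```

The "casual" label is an extra catch-all for subscribers who fit none of the three groups named in the text.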

The Revenue Attribution Model Built into Substack

One of Substack's most practically useful analytics features was its referral tracking — which post or external source drove each new subscription.

When a subscriber signed up, Substack recorded:
- The URL they came from if it was an external link
- Which "Substack Network" recommendation drove the signup if it was internal
- Direct traffic if they came without a referral

For creators who linked to their Substack from social media with UTM parameters, this created a basic revenue attribution model out of the box: which pieces of content on social or email drove new paid subscribers?
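The attribution question above amounts to grouping paid signups by where they came from. The sketch below assumes a hypothetical record format (the field names `landing_url`, `referrer_domain`, `network_source`, and `is_paid` are invented for illustration and are not Substack's export schema) and tallies paid signups per source using the standard library.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs


def attribution_label(signup: dict) -> str:
    """Map one signup record (hypothetical fields) to a source label."""
    if signup.get("network_source"):
        return "substack_network"
    # UTM tags, if the creator added them, live in the landing URL's query string
    params = parse_qs(urlparse(signup.get("landing_url") or "").query)
    if "utm_source" in params:
        campaign = params.get("utm_campaign", ["untagged"])[0]
        return f"{params['utm_source'][0]}/{campaign}"
    if signup.get("referrer_domain"):
        return f"{signup['referrer_domain']}/untagged"
    return "direct"


def paid_signups_by_source(signups) -> Counter:
    """Count only paid signups, grouped by attribution label."""
    return Counter(attribution_label(s) for s in signups if s.get("is_paid"))


base = "https://example.substack.com/subscribe"
signups = [
    {"landing_url": base + "?utm_source=twitter&utm_campaign=ai_thread", "is_paid": True},
    {"landing_url": base + "?utm_source=twitter&utm_campaign=ai_thread", "is_paid": True},
    {"landing_url": base, "network_source": "another-newsletter", "is_paid": True},
    {"landing_url": base, "referrer_domain": "news.ycombinator.com", "is_paid": True},
    {"landing_url": base, "is_paid": True},
    {"landing_url": base + "?utm_source=youtube&utm_campaign=launch", "is_paid": False},
]
counts = paid_signups_by_source(signups)
for source, n in counts.most_common():
    print(source, n)
```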

Casey Newton, who runs the technology newsletter "Platformer" on Substack, has discussed publicly how this referral data shaped his distribution strategy. Learning that a significant portion of his paid subscriber growth came from specific Twitter/X threads (when that platform was still an active sharing channel for newsletters) helped him decide where to invest his time outside the newsletter itself.

For Marcus Webb's hypothetical Substack counterpart, this data would reveal whether YouTube videos or email sequences drove more paid newsletter subscriptions, which is the same question revenue_attribution.py is designed to answer with more granular data.

What Substack's Analytics Couldn't Do

Despite its relative analytics richness, Substack's native analytics had limitations that illustrate why Python-based custom analysis matters:

No visualization beyond simple charts. Substack showed time-series data in basic line charts and pie charts. Creators who wanted to see growth trend analysis with moving averages, or scatter plot visualizations of subscriber clusters, had to export data and build those visualizations themselves.
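As an illustration of the kind of trend smoothing a creator would have to build themselves, here is a minimal moving-average sketch over exported daily signup counts. The sample numbers are invented, and the function is a plain-Python stand-in for whatever smoothing the Chapter 24 scripts actually use.

```python
def moving_average(values, window):
    """Trailing moving average; returns one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window
            running -= values[i - window]
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Hypothetical daily new-subscriber counts with one viral spike on day 4
daily_new = [10, 12, 8, 30, 9, 11, 10]
smoothed = moving_average(daily_new, window=3)
print([round(x, 2) for x in smoothed])
```

The smoothed series spreads the one-day spike across three points, which makes the underlying trend easier to read than the raw daily counts.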

No algorithmic clustering. While per-subscriber engagement data was available, Substack didn't automatically segment subscribers into behavioral groups. That analysis required exporting data and running K-means or similar analysis externally.
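To make the clustering step concrete, the following is a small plain-Python K-means sketch over two assumed features per subscriber (open rate and click rate). It is not the implementation in audience_segmentation.py, which the case study does not show; a real analysis would more likely use a library, and the sample points are invented.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Basic K-means on tuples of floats; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared distance)
        new_assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical (open_rate, click_rate) pairs: three heavy readers, three light readers
subscribers = [
    (0.90, 0.40), (0.85, 0.35), (0.95, 0.50),
    (0.10, 0.00), (0.15, 0.05), (0.05, 0.00),
]
assignments, centroids = kmeans(subscribers, k=2)
print(assignments)
print(centroids)
```

Cluster numbers are arbitrary, so interpret a cluster by its centroid (high versus low open rate) rather than by its index.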

Limited cross-platform attribution. Substack could tell you a subscriber came from Twitter, but not which specific tweet. For creators who wanted to attribute subscriber growth to specific pieces of content (not just platforms), manual UTM parameter setup was still required.
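Per-post attribution of the kind described here usually comes down to giving every shared link its own UTM tags. A minimal sketch using only the standard library follows; the `tag_link` helper, the example URL, and the campaign and content values are hypothetical naming choices, while the `utm_*` parameter names follow the widely used convention.

```python
from urllib.parse import urlencode, urlparse, urlunparse, parse_qsl


def tag_link(base_url, source, medium, campaign, content=None):
    """Return base_url with UTM parameters added, preserving any existing query."""
    parts = urlparse(base_url)
    query = dict(parse_qsl(parts.query))
    query.update({
        "utm_source": source,      # platform, e.g. twitter
        "utm_medium": medium,      # channel type, e.g. social
        "utm_campaign": campaign,  # the thread or series being promoted
    })
    if content:
        query["utm_content"] = content  # the individual post within the campaign
    return urlunparse(parts._replace(query=urlencode(query)))


link = tag_link(
    "https://example.substack.com/subscribe",
    source="twitter",
    medium="social",
    campaign="pricing_thread",
    content="tweet_3",
)
print(link)
```

Using a distinct `utm_content` value per tweet is what lets a later export distinguish one tweet from another within the same thread.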

No engagement scoring or health metric. Substack showed raw open rates but didn't provide a calculated "subscriber health score" or flag which subscribers were likely to churn. Identifying at-risk subscribers for re-engagement campaigns required manual analysis of the per-subscriber data.
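One way such a score could be computed from exported open counts is sketched below. The weighting, the two time windows, and the at-risk rule are all illustrative assumptions; the case study does not specify a scoring formula.

```python
def health_score(recent_opens, recent_posts, prior_opens, prior_posts,
                 recent_weight=0.7):
    """Return (score 0-100, at_risk flag) from two windows of open counts.

    The windows (for example, last 30 days vs. the 60 days before) and the
    weights are hypothetical choices, not Substack definitions.
    """
    recent_rate = recent_opens / recent_posts if recent_posts else 0.0
    prior_rate = prior_opens / prior_posts if prior_posts else 0.0
    # Weight recent behavior more heavily than older behavior
    score = 100 * (recent_weight * recent_rate + (1 - recent_weight) * prior_rate)
    # Flag readers who used to open at least half the posts and have since
    # dropped to under half of their own earlier rate
    at_risk = prior_rate >= 0.5 and recent_rate < 0.5 * prior_rate
    return round(score, 1), at_risk


loyal_score, loyal_risk = health_score(4, 4, 8, 8)
fading_score, fading_risk = health_score(1, 4, 7, 8)
print(loyal_score, loyal_risk)
print(fading_score, fading_risk)
```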

These gaps — visualization, segmentation, granular attribution, health scoring — are exactly what the three Python scripts in Chapter 24 address.

The Platform Analytics Insight

Substack's approach illustrates a broader principle about the relationship between platform analytics and creator business strategy: better analytics lead to better business decisions, which improve creator retention, which in turn benefits the platform.

Substack's decision to provide per-subscriber data (a privacy-respecting version — creators see their own subscriber behavior, not others') wasn't purely altruistic. Creators who understand their audience analytics build more sustainable businesses on the platform, generate more revenue, and stay on Substack rather than migrating to alternatives. Analytics quality is a competitive moat for creator platforms.

This dynamic helps explain why some platforms have invested heavily in creator analytics infrastructure (YouTube Studio, Substack, Spotify for Podcasters) while others have kept analytics limited. Platforms with subscription revenue models have strong incentives to help creators succeed. Platforms with advertising revenue models have more complex incentives — they want creators to keep posting, but they don't necessarily benefit from creators understanding their business health deeply enough to optimize away from advertising revenue toward owned products.

Applying Custom Python Analytics to Substack Data

For creators using Substack who want to go beyond native analytics:

Substack provides data export (Settings → Exports → Download subscriber list). The export includes per-subscriber data that can be loaded directly into pandas and run through audience_segmentation.py after appropriate column mapping.

A practical workflow:
1. Export subscriber list from Substack (includes open counts per subscriber for the last 90 days)
2. Clean column names in pandas: df.columns = df.columns.str.lower().str.replace(' ', '_')
3. Map Substack's column names to the script's expected names:
   - emails_opened → posts_viewed
   - clicks → likes_given (proxy)
   - paid (boolean) → derive purchases_made (1 if paid, 0 if free)
4. Run audience_segmentation.py on this data
5. The resulting segment profiles reveal: what percentage of your free subscribers behave like potential paid subscribers (Engager/Superfan engagement patterns with 0 purchases)?

That last question — free subscribers who behave like paid subscribers — is the most valuable output. These are the people most likely to convert if approached with the right offer.
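The cleaning and mapping steps of the workflow can be sketched in pandas as follows. The sample export is invented, and real Substack column names may differ; the fixed `min_opens` cutoff is a crude stand-in for the Engager/Superfan segments that audience_segmentation.py would assign, since that script is not reproduced here.

```python
import io

import pandas as pd

# Hypothetical export contents; the real file's columns may be named differently
raw_csv = io.StringIO(
    "Email,Emails Opened,Clicks,Paid\n"
    "a@example.com,24,9,True\n"
    "b@example.com,22,8,False\n"
    "c@example.com,3,0,False\n"
    "d@example.com,20,7,False\n"
    "e@example.com,1,0,True\n"
)
df = pd.read_csv(raw_csv)

# Step 2: normalize column names to lowercase with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Step 3: map to the names the script expects
df = df.rename(columns={"emails_opened": "posts_viewed", "clicks": "likes_given"})
df["purchases_made"] = df["paid"].astype(int)  # 1 if paid, 0 if free

# Stand-in for steps 4 and 5: free subscribers with high open counts
min_opens = 20  # assumed cutoff, not a value from the chapter
candidates = df[(df["purchases_made"] == 0) & (df["posts_viewed"] >= min_opens)]
print(candidates[["email", "posts_viewed", "likes_given"]])
```

In this invented sample, two free subscribers clear the cutoff; with real data the segment labels from the clustering step would replace the fixed threshold.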

Analysis Questions

1. Substack provides per-subscriber open history, which most social platforms don't offer. What are the privacy implications of this level of data transparency, and how should creators think about the ethical use of per-subscriber behavioral data in their marketing and segmentation strategies?
2. The case study argues that Substack's analytics investment serves the platform's own business interests (creator retention and revenue). Can you apply this same logic to explain why YouTube Studio is more comprehensive than TikTok's free analytics? What does each platform's incentive structure predict about their analytics investment levels?
3. A creator runs audience_segmentation.py on their Substack subscriber export and finds that 8% of their free subscribers have behavioral profiles matching the Superfan segment, but they're not yet paid. They have 5,000 free subscribers, which means roughly 400 superfan-pattern free subscribers. Design a specific marketing campaign to convert these 400 subscribers. What offer, messaging, and timing would you use?
4. The case study describes a gap: Substack's analytics can identify subscribers who came from Twitter, but not which specific tweet. How would you design a UTM parameter system for a Substack creator who posts regularly on Twitter, writes a weekly newsletter, and occasionally appears on other creators' podcasts? What specific campaign values would you use to track each of these three sources distinctly?
5. Substack's per-subscriber data enables engagement scoring: identifying which subscribers are likely to churn based on declining open rates. If you were building a simple churn prediction model using the exported subscriber data, what behavioral signals would you use as input features, and what threshold would you set to trigger a re-engagement email?