Appendix C: The Swipe Right Dataset — Codebook and Python Toolkit

IMPORTANT NOTICE — SYNTHETIC DATA: The Swipe Right Dataset is entirely synthetic. It was generated algorithmically for educational purposes to model realistic patterns from the peer-reviewed literature on online dating. It contains no real user data, no personally identifiable information, and no data sourced from any actual dating platform. Any resemblance to real individuals is coincidental. When you see patterns in this dataset — such as racial disparities in match rates — these patterns were deliberately constructed to reflect findings from published empirical research (see Hitsch et al., 2010; Robnett & Feliciano, 2011). They are pedagogical illustrations, not raw market data.

Part 1: Dataset Overview

The Swipe Right Dataset consists of 50,000 synthetic user profiles modeled on a hypothetical North American dating application. It was created to give students hands-on experience with the kinds of data that computational social scientists work with, without compromising anyone's privacy or misrepresenting real user behavior.

The dataset is structured as a flat CSV file with one row per user profile and 22 variables. It is used in three chapters:

Chapter 20 — Initial dataset exploration: descriptive statistics, response rates by demographics
Chapter 25 — Racial preference patterns: match rates by race/ethnicity, controlling for other variables
Chapter 36 — Integration with hookup culture prevalence data from survey sources

The Python utility library for this dataset is attraction_toolkit.py, located in this appendix directory. Instructions for loading the dataset and using the toolkit are in Part 3.

Design Principles

The dataset was generated to satisfy three goals:

Realistic marginal distributions. Each variable's distribution was calibrated to approximate published empirical findings (e.g., age distributions from app user demographics research, match rate baselines from Tyson et al., 2016).
Realistic joint distributions. Relationships between variables reflect patterns in the literature — for example, profile completeness and photo count positively predict match rates; racial disparities in match rates are consistent with Robnett & Feliciano (2011) and Hitsch et al. (2010).
Sufficient complexity for meaningful analysis. With 50,000 rows and 22 variables spanning demographic, behavioral, and outcome dimensions, there is enough data to run subgroup analyses without empty cells, and enough noise to produce realistic uncertainty.

Part 2: Complete Variable Codebook

Each entry below includes: variable name, data type, range or categories, description, how it was generated, and what it represents empirically.

2.1 Identity and Demographics

user_id - Type: String (anonymized) - Format: USR_XXXXXX where X is a 6-digit zero-padded integer - Description: Anonymous unique identifier for each profile. Contains no personally identifying information. - How generated: Sequential integers zero-padded to 6 digits, prefixed with "USR_". - What it represents: A synthetic stand-in for user account identifiers. In real research, user IDs are typically hashed or otherwise anonymized before researchers access them.
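
The generation rule for user_id is simple enough to sketch in a few lines of Python. This is an illustration only: the function name make_user_ids and the choice to start numbering at 1 are assumptions, not details taken from attraction_toolkit.py.

```python
def make_user_ids(n):
    """Return n identifiers of the form USR_XXXXXX.

    Starting the sequence at 1 is an assumption; the codebook only says
    the integers are sequential and zero-padded to 6 digits.
    """
    return [f"USR_{i:06d}" for i in range(1, n + 1)]


ids = make_user_ids(50_000)
print(ids[0], ids[-1])  # USR_000001 USR_050000
```

Six digits leave ample room for 50,000 profiles, and the fixed width means the IDs sort correctly as plain strings.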

age - Type: Integer - Range: 18–65 - Description: User's self-reported age at time of profile creation. - How generated: Drawn from a right-skewed distribution peaking in the mid-to-late 20s (mean ≈ 29, SD ≈ 8), reflecting the age distribution of dating app users reported in Pew Research surveys. Ages truncated to [18, 65]. - What it represents: Age is among the most studied variables in mate preference research. In the dataset, it is used to examine age homophily (the tendency to prefer same-age partners), age gaps in matching, and age-stratified differences in app behavior.
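
One way to produce a right-skewed, truncated distribution like the one described for age is a shifted gamma draw that rejects values above the cap. The sketch below is written under assumed parameters (shape 1.9 and scale 5.8, chosen so the moments land near mean 29 and SD 8); the function name sample_ages is hypothetical, and the toolkit's generator may well use a different distribution.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_ages(n, low=18, high=65):
    """Draw n integer ages from a shifted gamma, truncated to [low, high]."""
    shape, scale = 1.9, 5.8  # illustrative values, not taken from the toolkit
    ages = np.empty(0, dtype=int)
    while ages.size < n:
        # The gamma part is non-negative, so every draw is at least `low`.
        draw = np.floor(low + rng.gamma(shape, scale, size=n)).astype(int)
        # Truncate by rejection: discard draws above the cap and redraw.
        ages = np.concatenate([ages, draw[draw <= high]])
    return ages[:n]


ages = sample_ages(50_000)
print(ages.min(), ages.max(), round(ages.mean(), 1), round(ages.std(), 1))
```

Rejecting out-of-range draws, rather than clipping them, avoids an artificial pile-up of users at exactly age 65.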

gender - Type: Categorical (string) - Categories: M (man), F (woman), NB (nonbinary), Other - Distribution (approximate): M ≈ 52%, F ≈ 44%, NB ≈ 3%, Other ≈ 1% - Description: User's self-reported gender identity. - How generated: Probabilities calibrated to app demographic surveys, with nonbinary/other representation modeled at approximately double general population estimates to reflect the demographic skew of progressive dating apps. - What it represents: Gender shapes nearly every variable in the dataset — selectivity patterns, match rates, message response rates, safety concerns, and relationship goals.

sexuality - Type: Categorical (string) - Categories: Het (heterosexual), Gay (gay/lesbian), Bi (bisexual), Pan (pansexual), Other - Distribution (approximate): Het ≈ 68%, Gay ≈ 10%, Bi ≈ 14%, Pan ≈ 6%, Other ≈ 2% - Description: User's self-reported sexual orientation. - How generated: Distribution reflects elevated proportion of LGBTQ+ users relative to general population, consistent with the research finding that LGBTQ+ individuals disproportionately use dating apps (Pew Research Center, 2020). Sexuality and gender are generated with correlated joint distributions (e.g., lesbian users skew toward F gender category). - What it represents: Sexual orientation shapes who sees whom on a platform, what matching pool is available, what behaviors are normative, and the degree to which platform design assumptions (often heteronormative) fit the user's experience.

race_ethnicity - Type: Categorical (string) - Categories: White, Black, Latino, Asian, MENA (Middle Eastern / North African), Mixed, Other - Distribution (approximate): White ≈ 42%, Latino ≈ 18%, Black ≈ 16%, Asian ≈ 13%, Mixed ≈ 6%, MENA ≈ 3%, Other ≈ 2% - Description: User's self-reported racial/ethnic identity. - How generated: Approximates an urban US demographic distribution, with minority groups overrepresented relative to the national census to reflect the urban concentration of app users. Racial disparities in match rates are modeled on the published literature (Hitsch et al., 2010; Robnett & Feliciano, 2011). - What it represents: Race/ethnicity is a key variable in Chapter 25's analysis of racialized matching patterns. The dataset models disparities documented in the literature: Black women and Asian men experience lower match rates, and White profiles receive higher baseline match rates. These inequities are built into the synthetic data to enable honest analysis of structural bias.

education - Type: Categorical (ordinal) - Categories: HS (high school diploma or less), Some_College, BA (bachelor's degree), Grad_Plus (graduate or professional degree) - Distribution (approximate): HS ≈ 14%, Some_College ≈ 22%, BA ≈ 41%, Grad_Plus ≈ 23% - Description: User's highest completed education level. - How generated: Distribution skewed toward BA and Grad_Plus relative to national averages, consistent with dating app user demographics being more educated than the general population. Education level correlates positively with income bracket in the dataset. - What it represents: Education is used in socioeconomic status analyses (Chapter 26 themes) and as a control variable in match rate regressions.

income_bracket - Type: Categorical (ordinal) - Categories: Under_30K, 30K_50K, 50K_75K, 75K_100K, Over_100K - Distribution (approximate): Under_30K ≈ 18%, 30K_50K ≈ 22%, 50K_75K ≈ 28%, 75K_100K ≈ 18%, Over_100K ≈ 14% - Description: User's self-reported approximate annual income. - How generated: Drawn with correlations to education and age (income tends to rise with age and education). Income is not collected on all real dating apps but is available on some (e.g., OkCupid historically). - What it represents: Enables analysis of socioeconomic status and mate selection patterns. In the published literature, a man's income is a stronger predictor of women's response rates to his profile than a woman's income is of men's response rates (Hitsch et al., 2010).

location_type - Type: Categorical (string) - Categories: Urban, Suburban, Rural - Distribution (approximate): Urban ≈ 54%, Suburban ≈ 37%, Rural ≈ 9% - Description: Urban/suburban/rural classification of user's primary location. - How generated: Distribution reflects dating app usage patterns: rural users are underrepresented on most platforms relative to their population share. - What it represents: Location type affects pool size, match rates (rural users have fewer potential matches), and potentially relationship goals.

2.2 Profile Characteristics

profile_completeness - Type: Float - Range: 0–100 (percent) - Description: Percentage of optional profile fields that the user has completed (bio, height, job, education, relationship goals, prompts, etc.). - How generated: Drawn from a bimodal distribution: many users have either very sparse profiles (< 30%) or fairly complete profiles (> 70%), with fewer in the middle. Correlated positively with subscription tier. - What it represents: Profile completeness is a significant predictor of match rates across platforms. More complete profiles signal investment and effort, which may make the profile appear more trustworthy.

photos_count - Type: Integer - Range: 0–10 - Description: Number of photos in the user's profile at the time of data snapshot. - How generated: Drawn from a Poisson-like distribution with mean ≈ 4. Zero photos is possible (≈ 5% of profiles). Correlated with profile completeness. - What it represents: Photo count is a strong predictor of swipe behavior. Profiles with no photos receive almost no matches. Research suggests that 3 to 6 photos is optimal for most platforms.

bio_word_count - Type: Integer - Range: 0–500 - Description: Number of words in the user's written biography/about-me section. - How generated: Many users have 0 (no bio written, ≈ 20%), with the remainder drawn from a log-normal distribution (median ≈ 65 words, long right tail to 500). - What it represents: Bio length is a self-presentation variable. Research on linguistic content of bios (e.g., positivity, humor, self-disclosure level) predicts contact rates. This variable captures quantity rather than quality.
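
The "many zeros plus a log-normal tail" recipe for bio_word_count can be sketched as a two-step draw. The code below is a minimal illustration under assumptions: sigma = 0.8 is an invented spread, and forcing at least one word for users who have a bio is a choice made here, not something the codebook states. It relies on the fact that a log-normal's median equals exp(mu), so setting mu = log(65) targets the stated median.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Step 1: roughly 20% of users have no bio at all.
has_bio = rng.random(n) >= 0.20

# Step 2: everyone else gets a log-normal word count with median exp(mu) = 65.
words = rng.lognormal(mean=np.log(65), sigma=0.8, size=n)  # sigma is illustrative
words = np.clip(np.round(words), 1, 500)                   # enforce the 500-word cap

bio_word_count = np.where(has_bio, words, 0).astype(int)

zero_share = (bio_word_count == 0).mean()
median_nonzero = np.median(bio_word_count[bio_word_count > 0])
print(round(zero_share, 3), median_nonzero)
```

Clipping at 500 creates a small spike at the cap; a rejection step would avoid that at the cost of a few extra lines.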

subscription_tier - Type: Categorical (ordinal) - Categories: Free, Basic, Premium, Gold - Distribution (approximate): Free ≈ 65%, Basic ≈ 19%, Premium ≈ 12%, Gold ≈ 4% - Description: User's subscription level on the platform. - How generated: Most users are free-tier; paid subscriptions increase with age (older users more willing to pay) and income. Subscription tier correlates with profile completeness and behavioral variables. - What it represents: Subscription status is a proxy for investment and behavioral capacity — paid users can see who liked them, have more super-likes, etc. — which affects match rates.

2.3 Behavioral Variables

daily_swipes - Type: Float - Range: 0–200 (right-swipes per day) - Description: Average number of right-swipes (expressions of interest) the user makes per day. - How generated: Strongly differentiated by gender: men's distribution has mean ≈ 45 (SD ≈ 20); women's distribution has mean ≈ 12 (SD ≈ 8), consistent with findings from Tyson et al. (2016) on gender asymmetry in swiping behavior. NB and Other profiles fall in between. - What it represents: Daily swipes reflect selectivity. Fewer right-swipes are associated with higher selectivity; the gender difference in this variable is one of the most robust findings in dating app research.

match_rate - Type: Float - Range: 0–100 (percentage of right-swipes that result in a match) - Description: Percentage of the user's right-swipes that result in a mutual match. - How generated: Complex function of gender, race/ethnicity, photos_count, profile_completeness, subscription_tier, and random noise. Men's match rates are substantially lower than women's on average (men ≈ 7%, women ≈ 31%), consistent with the asymmetry in selectivity documented in the literature. Racial disparities are built in, following the patterns reported by Hitsch et al. (2010) and Robnett & Feliciano (2011). - What it represents: Match rate is the primary "success" outcome in the dataset. It is shaped by both who the user swipes on and how desirable the user's profile is to others; these factors cannot be cleanly separated in observational data.

message_response_rate - Type: Float - Range: 0–100 (percentage of received messages that the user responds to) - Description: Percentage of incoming messages to which the user sends a reply. - How generated: Correlated with gender (women respond to a lower proportion of messages than men, as their inboxes are fuller), relationship goal, and age. Mean for women ≈ 22%; for men ≈ 58%. - What it represents: Response rate measures filtering behavior after a match, since not all matches lead to conversation. A low response rate may indicate inbox saturation, selectivity, or disengagement.

avg_message_length - Type: Float - Range: 0–300 (characters) - Description: Average length of messages sent by the user, in characters. - How generated: Log-normally distributed. Users who never send messages have avg_message_length = 0 (≈ 15% of users). Among senders, mean ≈ 85 characters. - What it represents: Message length is a proxy for communicative investment. Linguistic style matching research (discussed in Chapter 17) suggests longer, more substantive messages predict higher response rates.

months_on_platform - Type: Integer - Range: 0–60 - Description: Number of months the user has been active on the platform. - How generated: Right-skewed distribution; most users are relatively new (median ≈ 8 months). Long tenures (> 24 months) may reflect churning users who cycle on and off the platform. - What it represents: Tenure on platform correlates inversely with relationship formation success (people who found relationships tend to leave). High tenure may signal the "perpetual user" phenomenon — difficulty translating matches to real relationships.

dates_per_month - Type: Float - Range: 0–20 - Description: Self-reported average number of in-person dates per month originating from this platform. - How generated: Drawn from a zero-inflated distribution: approximately 38% of users report 0 dates per month. Among those with dates, drawn from Poisson distribution with mean ≈ 2.1. Correlated with match_rate and message_response_rate. - What it represents: Bridges the gap between digital matching and real-world behavior. High match rates do not necessarily translate to high date rates (the "matching but not meeting" phenomenon).
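
A zero-inflated Poisson draw like the one described for dates_per_month involves a small piece of arithmetic that is easy to miss: a Poisson(2.1) draw is itself zero with probability exp(-2.1) ≈ 0.12. The sketch below shows one possible reading, in which the 38% figure is the overall share of zeros, so the "structural zero" probability has to be set lower than 0.38. The codebook does not say whether the toolkit handles it this way, so treat the calibration as an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
target_zero_share = 0.38  # overall share of users reporting 0 dates
lam = 2.1                 # Poisson mean for the non-structural group

# Solve  pi + (1 - pi) * exp(-lam) = target_zero_share  for pi.
p_poisson_zero = np.exp(-lam)
pi = (target_zero_share - p_poisson_zero) / (1 - p_poisson_zero)  # about 0.293

structural_zero = rng.random(n) < pi
dates = np.where(structural_zero, 0, rng.poisson(lam, size=n))
dates = np.clip(dates, 0, 20)  # respect the codebook's 0-20 range

observed_zero_share = (dates == 0).mean()
print(round(pi, 3), round(observed_zero_share, 3))
```

Under the other reading (38% structural zeros plus ordinary Poisson draws for everyone else), the overall zero share would come out closer to 46%, which is one way to tell the two apart when inspecting the generated data.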

2.4 Outcome Variables

relationship_goal - Type: Categorical (string) - Categories: Casual (casual sex / no commitment), Relationship (long-term partner), Unsure (open to both / undecided) - Distribution (approximate): Casual ≈ 27%, Relationship ≈ 45%, Unsure ≈ 28% - Description: User's self-stated relationship goal. - How generated: Distributed with gender and sexuality correlations — men report Casual somewhat more often than women (consistent with SOI-R research); LGBTQ+ users show more diverse distributions. - What it represents: Relationship goal is a key moderating variable. The match between partners' relationship goals predicts satisfaction; mismatch is a major source of dissatisfaction in app-mediated dating.

reported_satisfaction - Type: Float - Range: 1–10 (self-report Likert-type scale) - Description: User's overall self-reported satisfaction with their experience on the platform (1 = very dissatisfied, 10 = very satisfied). - How generated: Mean ≈ 5.2 (SD ≈ 2.0), with a slight positive (right) skew: more users report dissatisfaction than satisfaction, consistent with findings from nationally representative surveys of online dating users (e.g., Vogels, 2020). Correlated positively with dates_per_month and match_rate; correlated negatively with months_on_platform (time without success reduces satisfaction). - What it represents: The primary subjective outcome variable. Used in the regression exercises in Chapter 20 and in the satisfaction_predictor() function in attraction_toolkit.py.

Part 3: Loading and Using the Dataset

Requirements

Install the required packages if you haven't already. From the repo root or this appendix directory:

pip install -r requirements.txt

The requirements file includes: numpy, pandas, matplotlib, seaborn, scipy.

Basic Loading

The attraction_toolkit.py file is in the same directory as this codebook. The simplest way to get started:

from attraction_toolkit import load_dataset, summary_stats

# Load the dataset (generates it if not already on disk)
df = load_dataset()

# Print descriptive statistics
summary_stats(df)

The first time you call load_dataset(), the function will generate the synthetic dataset and save it as swipe_right_data.csv in the same directory. On subsequent calls it loads from the CSV file. If you want to regenerate with a different random seed, call load_dataset(regenerate=True, seed=42).
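
The generate-once, then-load-from-disk behavior described above follows a common caching pattern. The sketch below illustrates that pattern with a stand-in generator; load_or_generate and generate are hypothetical names, the two-column table with uniform ages is a placeholder, and none of this is the actual code in attraction_toolkit.py.

```python
from pathlib import Path
import tempfile

import numpy as np
import pandas as pd


def generate(n, seed):
    """Placeholder generator: IDs plus uniform ages (not the codebook's distribution)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "user_id": [f"USR_{i:06d}" for i in range(1, n + 1)],
        "age": rng.integers(18, 66, size=n),
    })


def load_or_generate(path, n=1000, regenerate=False, seed=0):
    """Read the CSV if it exists; otherwise (or on request) generate and save it."""
    path = Path(path)
    if path.exists() and not regenerate:
        return pd.read_csv(path)
    df = generate(n, seed)
    df.to_csv(path, index=False)
    return df


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "swipe_right_data.csv"
    first = load_or_generate(csv_path)                            # generates and writes
    second = load_or_generate(csv_path)                           # reads the cached file
    third = load_or_generate(csv_path, regenerate=True, seed=42)  # overwrites the cache
    same_as_cache = first["age"].tolist() == second["age"].tolist()
    changed = first["age"].tolist() != third["age"].tolist()

print(same_as_cache, changed)
```

One practical consequence of this pattern: regenerating with a new seed overwrites the file on disk, so results computed from the earlier file can no longer be reproduced unless that seed is recorded.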

Exploring the Data

import pandas as pd
from attraction_toolkit import load_dataset, filter_profiles, match_rate_by_group

df = load_dataset()

# Basic shape and columns
print(df.shape)          # (50000, 22)
print(df.dtypes)         # column types

# First look
print(df.head())

# Filter to women-identifying users aged 25-35 seeking relationships
women_25_35 = filter_profiles(df, gender='F', age_min=25, age_max=35, relationship_goal='Relationship')
print(f"Filtered: {len(women_25_35)} profiles")

# Match rates by race
match_by_race = match_rate_by_group(df, 'race_ethnicity')
print(match_by_race)
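
If you want to see what helpers like filter_profiles and match_rate_by_group might be doing, the sketch below writes rough equivalents in plain pandas and runs them on a five-row toy table. The function names ending in _sketch, the keyword handling, and the choice of summary columns are all assumptions made for illustration; the toolkit's real implementations and return formats may differ.

```python
import pandas as pd


def filter_profiles_sketch(df, age_min=None, age_max=None, **equals):
    """Keep rows inside the age bounds whose columns equal the given values."""
    mask = pd.Series(True, index=df.index)
    if age_min is not None:
        mask &= df["age"] >= age_min
    if age_max is not None:
        mask &= df["age"] <= age_max
    for column, value in equals.items():
        mask &= df[column] == value
    return df[mask]


def match_rate_by_group_sketch(df, group_col):
    """Summarize match_rate per group, highest mean first."""
    return (
        df.groupby(group_col)["match_rate"]
        .agg(["count", "mean", "median"])
        .sort_values("mean", ascending=False)
    )


toy = pd.DataFrame({
    "age": [24, 27, 31, 36, 29],
    "gender": ["F", "F", "M", "F", "F"],
    "relationship_goal": ["Relationship", "Relationship", "Casual",
                          "Relationship", "Casual"],
    "match_rate": [30.0, 34.0, 6.0, 28.0, 32.0],
})

subset = filter_profiles_sketch(toy, gender="F", age_min=25, age_max=35,
                                relationship_goal="Relationship")
by_gender = match_rate_by_group_sketch(toy, "gender")
print(len(subset))  # 1 (only the 27-year-old meets all three conditions)
print(by_gender)
```

Reporting the group count next to the mean is a deliberate choice: a mean match rate for a small subgroup deserves more caution than one based on thousands of profiles.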

Part 4: Pandas Primer for New Users

If you are new to Python or the pandas library, this brief section covers the most common operations you will need for working with the Swipe Right Dataset.

What is a DataFrame?

A pandas DataFrame is a two-dimensional table of data — like a spreadsheet — where each column is a variable and each row is an observation. The dataset loads as a DataFrame with 50,000 rows (profiles) and 22 columns (variables).

import pandas as pd

# Read the CSV directly (assumes load_dataset() has already written it to the working directory)
df = pd.read_csv("swipe_right_data.csv")

# The shape: (rows, columns)
print(df.shape)

# Column names
print(df.columns.tolist())

# First 5 rows
print(df.head())

# A single column
print(df['age'])

# Basic statistics for numeric columns
print(df.describe())

Filtering Rows

Use Boolean indexing to select rows where conditions are met:

# Men only
men = df[df['gender'] == 'M']

# Users aged 22–28
young_users = df[(df['age'] >= 22) & (df['age'] <= 28)]

# Combine conditions (wrap each one in parentheses; use & for AND, | for OR)
urban_bi = df[(df['location_type'] == 'Urban') & (df['sexuality'] == 'Bi')]

Grouping and Aggregating

The .groupby() method splits data by a categorical variable, applies a function to each group, and combines results:

# Mean match rate by gender
df.groupby('gender')['match_rate'].mean()

# Multiple statistics
df.groupby('race_ethnicity')['match_rate'].agg(['mean', 'median', 'std'])

# Crosstab: gender × relationship goal counts
pd.crosstab(df['gender'], df['relationship_goal'])
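
To make the split-apply-combine idea concrete, the toy example below computes group means twice: once by looping over the groups by hand, and once with the one-line .groupby() call. The five-row table is invented for illustration and is not drawn from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "NB"],
    "match_rate": [5.0, 9.0, 28.0, 34.0, 20.0],
})

# Split: one sub-table per gender. Apply: take the mean. Combine: store in a dict.
by_hand = {}
for gender, rows in toy.groupby("gender"):
    by_hand[gender] = rows["match_rate"].mean()

# The one-line equivalent returns the same numbers as a Series indexed by gender.
means = toy.groupby("gender")["match_rate"].mean()

print(by_hand)
print(means)
```

Both approaches give F = 31.0, M = 7.0, and NB = 20.0; the groupby version is simply shorter and much faster on 50,000 rows.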

Sorting and Selecting

# Sort by satisfaction (descending)
df.sort_values('reported_satisfaction', ascending=False).head(10)

# Select specific columns
df[['age', 'gender', 'match_rate', 'reported_satisfaction']]

Missing Data

The Swipe Right Dataset was generated to have no missing values, but you should always check:

print(df.isnull().sum())  # Count missing per column
df_clean = df.dropna()    # Drop rows with any missing value
df_filled = df.fillna(0)  # Replace missing with 0 (use cautiously)
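
Checking for missing values is only part of a sanity check; it can also help to confirm that values fall inside the ranges and categories listed in Part 2. The sketch below shows one way to do that for a handful of columns. The function name codebook_violations and the dictionary layout are assumptions made here, and the three-row table contains deliberate errors for demonstration.

```python
import pandas as pd

# Ranges and categories copied from the codebook entries in Part 2.
NUMERIC_RANGES = {
    "age": (18, 65),
    "photos_count": (0, 10),
    "match_rate": (0, 100),
    "reported_satisfaction": (1, 10),
}
ALLOWED_CATEGORIES = {
    "gender": {"M", "F", "NB", "Other"},
    "location_type": {"Urban", "Suburban", "Rural"},
    "subscription_tier": {"Free", "Basic", "Premium", "Gold"},
}


def codebook_violations(df):
    """Return {column: number of rows outside the codebook's range or categories}."""
    problems = {}
    for col, (low, high) in NUMERIC_RANGES.items():
        if col in df.columns:
            bad = ~df[col].between(low, high)  # missing values also count as violations
            if bad.any():
                problems[col] = int(bad.sum())
    for col, allowed in ALLOWED_CATEGORIES.items():
        if col in df.columns:
            bad = ~df[col].isin(allowed)
            if bad.any():
                problems[col] = int(bad.sum())
    return problems


toy = pd.DataFrame({
    "age": [17, 25, 40],          # 17 is below the minimum age of 18
    "gender": ["F", "X", "M"],    # "X" is not a codebook category
    "match_rate": [31.0, 7.5, 12.0],
})
problems = codebook_violations(toy)
print(problems)  # {'age': 1, 'gender': 1}
```

On the real dataset an empty dictionary is the expected result; anything else suggests the file was edited or loaded incorrectly.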

Part 5: Ethical Notes on Working with Dating Data

Even though the Swipe Right Dataset is synthetic, the patterns it encodes are real — they reflect documented inequities in how dating platforms serve different users. When you analyze these patterns, you are engaging with content that touches real people's experiences of rejection, racialization, and commodification.

A few principles for responsible analysis:

Describe, then explain. Finding that Black women have lower match rates in the dataset is a descriptive finding. The next step is not to conclude that Black women are "less attractive" — it is to examine what structural, algorithmic, and social factors produce this pattern.
Distinguish the platform from the person. Match rates are a joint product of user behavior, algorithmic ranking, platform design choices, and cultural bias. Low match rates are not individual failures.
Be cautious with causal language. The dataset is cross-sectional. You can identify correlations, not causes. "Profile completeness predicts match rate" does not prove that completing your profile causes more matches.
Avoid stigmatizing language. Variables like subscription_tier and daily_swipes describe behaviors, not character. Resist the framing that power users or highly selective users are doing something wrong.
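
The caution about causal language can be demonstrated with a small simulation. In the invented data below, paid subscription raises both profile completeness and match rate, while completeness has no direct effect on match rate at all. Every number here is an assumption chosen for illustration (only the 35% paid share echoes the codebook's Free ≈ 65%), and this is not how the Swipe Right Dataset itself was generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 20_000

# Paid status drives BOTH variables; completeness never enters the match_rate formula.
paid = rng.random(n) < 0.35
completeness = np.clip(40 + 30 * paid + rng.normal(0, 15, n), 0, 100)
match_rate = np.clip(10 + 8 * paid + rng.normal(0, 5, n), 0, 100)

toy = pd.DataFrame({
    "paid": paid,
    "profile_completeness": completeness,
    "match_rate": match_rate,
})

# Pooled over everyone, completeness and match rate look clearly related.
overall_r = toy["profile_completeness"].corr(toy["match_rate"])

# Within each subscription group, the relationship essentially disappears.
within_r = {
    flag: group["profile_completeness"].corr(group["match_rate"])
    for flag, group in toy.groupby("paid")
}

print(round(overall_r, 2))
print({flag: round(r, 2) for flag, r in within_r.items()})
```

The pooled correlation is sizeable even though completing a profile changes nothing in this simulation, which is why comparisons within groups, or regressions with controls, are a useful first check before reaching for causal wording.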

These principles apply equally to real data analysis. The Swipe Right Dataset is practice for working with real-world data in a field where the stakes — people's sense of desirability, loneliness, and belonging — are genuinely high.