Appendix B: Python and Data Toolkit Reference
This appendix is a working reference document. It is not meant to teach Python from scratch — if you have no programming experience, work through a brief Python tutorial (see suggestions at the end of this appendix) before the first code chapter. This appendix assumes you know what a variable, a loop, and a function are. What it provides is everything you need to set up your environment correctly and execute the code examples in Chapters 4, 6, 11, 14, 19, 22, 27, 33, and 37, plus the capstone projects.
Keep this appendix open alongside any chapter with a Code/ subdirectory.
B.1 Environment Setup
Installing Anaconda
Anaconda is the recommended Python distribution for this textbook. It bundles Python with the scientific computing stack (NumPy, pandas, matplotlib, scikit-learn) and the conda package manager, which handles dependency conflicts better than pip alone for scientific packages.
Step 1: Download Anaconda
Visit anaconda.com/download. Download the installer for your operating system (Windows, macOS, or Linux). Choose the Python 3.10 or later distribution. The full Anaconda installer is approximately 800 MB; Miniconda (a minimal installer) is about 70 MB and is sufficient if disk space is limited.
Step 2: Install
- Windows: Run the `.exe` installer. When prompted, check "Add Anaconda to my PATH environment variable" only if you know what you are doing; otherwise, use the Anaconda Prompt application that the installer creates.
- macOS/Linux: Run `bash Anaconda3-<version>-MacOSX-x86_64.sh` from your terminal. Accept the license, choose an installation location, and allow the installer to initialize conda.
Step 3: Verify installation
Open your terminal (or Anaconda Prompt on Windows) and run:
conda --version
python --version
Both commands should return version numbers. If they produce "command not found," the installation did not complete correctly — restart your terminal or re-run the installer.
Creating a Virtual Environment
A virtual environment is an isolated Python installation for a specific project. Using one prevents package conflicts between projects and ensures that updating a package for one course does not break another.
# Create an environment named "polanalytics" with Python 3.11
conda create -n polanalytics python=3.11
# Activate the environment
conda activate polanalytics
# Your prompt should now show (polanalytics) at the left
Always activate your environment before working on this textbook's code.
Installing Required Packages
The textbook's code directory includes a requirements.txt file listing all required packages. With your environment activated:
pip install -r requirements.txt
If you do not have the requirements file, install packages manually:
pip install pandas numpy matplotlib seaborn scipy statsmodels scikit-learn nltk textblob vaderSentiment jupyter notebook openpyxl xlrd requests beautifulsoup4 geopandas plotly
For packages that are difficult to install via pip (particularly geopandas on Windows), use conda instead:
conda install -c conda-forge geopandas
B.2 Jupyter Notebook Basics
Jupyter Notebook is an interactive environment where code, output, and explanatory text coexist in a single document. All code examples in this textbook are designed to run in Jupyter.
Starting Jupyter
# With your environment activated:
jupyter notebook
This opens a browser tab showing the Jupyter file browser. Navigate to your textbook directory and open or create a .ipynb file.
Cell Types
A notebook consists of cells. Each cell has a type:
- Code cells: contain Python code. Run them with `Shift+Enter` or click the Run button. Output appears directly below the cell.
- Markdown cells: contain formatted text using Markdown syntax. Run them to render the formatting.
- Raw cells: pass content through to output unmodified; rarely needed for this textbook.
Change cell type using the dropdown menu in the toolbar or the keyboard shortcut Esc then M (for Markdown) or Y (for code).
Key Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| `Shift+Enter` | Run cell and move to next |
| `Ctrl+Enter` | Run cell and stay |
| `Esc` then `A` | Insert cell above |
| `Esc` then `B` | Insert cell below |
| `Esc` then `DD` | Delete current cell |
| `Esc` then `Z` | Undo cell deletion |
| `Tab` | Autocomplete |
| `Shift+Tab` | Show function documentation |
Running Code Out of Order
Jupyter cells can be run in any order. This flexibility is also a hazard: if you run cell 5 before cell 3, cell 5 may fail because cell 3 defines a variable cell 5 needs. Always use Kernel > Restart & Run All to verify that your notebook runs correctly from top to bottom before submitting assignments.
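The hazard can be simulated in plain Python. In this sketch the cell markers are just comments (Jupyter has no such syntax), and the variable name is a made-up example:

```python
# --- Cell 3 ---
turnout_rate = 0.62

# --- Cell 5 ---  (works only because Cell 3 ran first)
message = f"Turnout was {turnout_rate:.0%}"

# Running Cell 5 before Cell 3 raises NameError; simulate a fresh kernel:
del turnout_rate
try:
    f"Turnout was {turnout_rate:.0%}"
except NameError as e:
    error_seen = type(e).__name__  # 'NameError'
```

Restart & Run All catches exactly this class of bug, because it replays the notebook in top-to-bottom order against an empty kernel.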
B.3 Quick Reference: Core pandas Operations
pandas is the foundational data manipulation library. Every data chapter in this textbook uses it.
import pandas as pd
import numpy as np
Loading Data
# Load a CSV file
df = pd.read_csv('data/elections_2024.csv')
# Load a CSV with specific encoding (common with older political datasets)
df = pd.read_csv('data/legacy_data.csv', encoding='latin-1')
# Load an Excel file (requires openpyxl)
df = pd.read_excel('data/polling_data.xlsx', sheet_name='Sheet1')
# Load from a URL
df = pd.read_csv('https://example.com/data.csv')
# Load and parse date columns
df = pd.read_csv('data/approval_ratings.csv', parse_dates=['date'])
Inspecting Data
# First 5 rows (default); pass n for different number
df.head()
df.head(10)
# Last 5 rows
df.tail()
# Dimensions: (rows, columns)
df.shape
# Column names, data types, non-null counts, memory usage
df.info()
# Summary statistics for numeric columns
df.describe()
# Summary statistics including non-numeric columns
df.describe(include='all')
# Unique values in a column
df['party'].unique()
df['party'].nunique() # Count of unique values
# Value counts (frequency table)
df['party'].value_counts()
df['party'].value_counts(normalize=True) # As proportions
Selecting Columns and Rows
# Select a single column (returns Series)
df['vote_share']
# Select multiple columns (returns DataFrame)
df[['state', 'candidate', 'vote_share']]
# Select rows by integer position
df.iloc[0] # First row
df.iloc[0:5] # Rows 0 through 4
df.iloc[:, 2] # All rows, third column
# Select rows and columns by label
df.loc[0, 'state'] # Row 0, column 'state'
df.loc[0:4, ['state', 'vote_share']] # Rows 0-4, named columns
Filtering Data
# Boolean mask
df[df['party'] == 'Democrat']
df[df['vote_share'] > 0.5]
# Multiple conditions (use & for AND, | for OR, parentheses required)
df[(df['party'] == 'Democrat') & (df['vote_share'] > 0.5)]
df[(df['state'] == 'Ohio') | (df['state'] == 'Pennsylvania')]
# Using .loc for filtering (preferred for clarity)
df.loc[df['vote_share'] > 0.5]
# Negation
df[~(df['party'] == 'Republican')] # All non-Republican rows
# Filter with .isin()
swing_states = ['Michigan', 'Wisconsin', 'Pennsylvania', 'Arizona', 'Nevada']
df[df['state'].isin(swing_states)]
# Filter out rows based on column values
df[df['incumbency_status'].notna()] # Remove rows where column is NaN
Groupby and Aggregation
# Mean vote share by party
df.groupby('party')['vote_share'].mean()
# Multiple aggregations
df.groupby('party')['vote_share'].agg(['mean', 'median', 'std', 'count'])
# Group by multiple columns
df.groupby(['party', 'region'])['vote_share'].mean()
# Custom aggregation
df.groupby('state').agg({
'vote_share': 'mean',
'turnout_rate': 'median',
'campaign_spending': 'sum'
})
# Reset index after groupby (flatten result into regular DataFrame)
summary = df.groupby('party')['vote_share'].mean().reset_index()
Merging and Joining DataFrames
# Inner join: keep only rows with matches in both DataFrames
merged = pd.merge(elections_df, demographics_df, on='fips_code', how='inner')
# Left join: keep all rows from left, match from right where possible
merged = pd.merge(elections_df, demographics_df, on='fips_code', how='left')
# Merge on differently named columns
merged = pd.merge(elections_df, census_df,
left_on='county_fips', right_on='GEOID', how='left')
# Merge on multiple columns
merged = pd.merge(df1, df2, on=['state', 'year'], how='inner')
# Concatenate DataFrames vertically (stack rows)
all_years = pd.concat([df_2016, df_2020, df_2024], ignore_index=True)
Handling Missing Data
# Check for missing values
df.isna().sum() # Count NaN per column
df.isna().any() # Boolean: any NaN per column
df.isna().sum().sum() # Total NaN count in entire DataFrame
# Drop rows with any NaN values
df_clean = df.dropna()
# Drop rows with NaN in specific columns
df_clean = df.dropna(subset=['vote_share', 'turnout_rate'])
# Fill NaN with a constant
df['poll_result'].fillna(0)
# Fill NaN with column mean
df['approval_rating'].fillna(df['approval_rating'].mean())
# Forward-fill NaN (carry the last observation forward; useful for time series)
# Note: fillna(method='ffill') is deprecated in pandas 2.x; use .ffill()
df['approval_rating'].ffill()
# Replace specific values with NaN
df.replace(-99, np.nan) # Many datasets use -99 as a missing code
df.replace({'DK': np.nan, 'RF': np.nan}) # Common survey codes
Sorting and Ranking
# Sort by one column
df.sort_values('vote_share', ascending=False)
# Sort by multiple columns
df.sort_values(['state', 'year'], ascending=[True, False])
# Rank within a column
df['spending_rank'] = df['campaign_spending'].rank(ascending=False)
# Rank within groups (e.g., rank candidates within each state)
df['state_rank'] = df.groupby('state')['vote_share'].rank(ascending=False)
B.4 Quick Reference: matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Set a clean style (run once at the start of your notebook)
sns.set_theme(style='whitegrid', palette='colorblind')
plt.rcParams['figure.dpi'] = 120
Figure and Axes Setup
# Single figure
fig, ax = plt.subplots(figsize=(10, 6))
# Multiple subplots in a grid
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
ax = axes[0, 0] # Access individual axes
# Tight layout prevents overlap
plt.tight_layout()
plt.show()
Common Chart Types
# Bar chart — comparing categories
ax.bar(df['party'], df['vote_share'], color=['#2166AC', '#D73027'])
# seaborn equivalent (easier formatting):
sns.barplot(data=df, x='party', y='vote_share', ax=ax)
# Horizontal bar chart — for many categories (e.g., all 50 states)
df.sort_values('vote_share').plot(kind='barh', x='state', y='vote_share', ax=ax)
# Line chart — time series (approval ratings, polling trends)
ax.plot(df['date'], df['approval'], color='steelblue', linewidth=2)
sns.lineplot(data=df, x='date', y='approval', ax=ax)
# Scatter plot — relationship between two numeric variables
ax.scatter(df['unemployment'], df['approval'], alpha=0.6)
sns.scatterplot(data=df, x='unemployment', y='approval',
hue='party', size='election_year', ax=ax)
# Histogram — distribution of one variable
ax.hist(df['vote_share'], bins=20, edgecolor='white')
sns.histplot(data=df, x='vote_share', bins=20, kde=True, ax=ax)
# Box plot — distribution across categories
sns.boxplot(data=df, x='region', y='turnout_rate', ax=ax)
# Heatmap — correlation matrix or crosstab
corr_matrix = df[['turnout', 'income', 'education', 'age_median']].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='RdBu_r',
center=0, ax=ax)
# Choropleth map — see geopandas section in Chapter 4
Labels, Titles, and Legends
# Axis labels and title
ax.set_xlabel('Unemployment Rate (%)', fontsize=12)
ax.set_ylabel('Presidential Approval (%)', fontsize=12)
ax.set_title('Unemployment and Approval, 1953–2024', fontsize=14, fontweight='bold')
# Add a figure-level source note (fig.suptitle with y near 0 sits at the bottom edge)
fig.suptitle('Source: Gallup / BLS', y=0.01, fontsize=9, color='gray')
# Legend
ax.legend(labels=['Democrat', 'Republican'], title='Party', loc='upper right')
# Annotations — add text at a specific data point
ax.annotate('2008 Financial Crisis',
xy=(10.0, 27), # Point to annotate
xytext=(8.5, 22), # Where to put the text
arrowprops=dict(arrowstyle='->', color='black'),
fontsize=10)
# Rotate x-axis labels to prevent overlap
plt.xticks(rotation=45, ha='right')
# Set axis limits
ax.set_xlim(0, 15)
ax.set_ylim(20, 80)
Saving Figures
# Save to file (before plt.show())
fig.savefig('figures/approval_vs_unemployment.png', dpi=150, bbox_inches='tight')
# Save as PDF for high-quality publication
fig.savefig('figures/approval_vs_unemployment.pdf', bbox_inches='tight')
# Save as SVG for vector editing
fig.savefig('figures/approval_vs_unemployment.svg', bbox_inches='tight')
B.5 Quick Reference: The ODA Dataset
The OpenDemocracy Analytics (ODA) Dataset is the primary teaching dataset for this textbook. It covers U.S. elections from 1980 to 2024 at the county level, combined with demographic, economic, and media environment data. The dataset is fictional in the sense that it was constructed for pedagogical purposes, but its structure, variable names, and approximate statistical properties mirror real datasets like the MIT Election Data and Science Lab County Presidential Election Returns dataset.
The Six Tables
oda_elections.csv — Core election results
| Column | Description | Type |
|---|---|---|
| `fips` | 5-digit county FIPS code | string |
| `state` | State name | string |
| `state_abbr` | State abbreviation | string |
| `county` | County name | string |
| `year` | Election year (1980–2024) | integer |
| `dem_votes` | Democratic candidate raw votes | integer |
| `rep_votes` | Republican candidate raw votes | integer |
| `total_votes` | Total ballots cast | integer |
| `dem_share` | Democratic two-party vote share | float |
| `rep_share` | Republican two-party vote share | float |
| `winner` | 'D' or 'R' | string |
| `margin` | `dem_share - rep_share` | float |
| `dem_candidate` | Democratic candidate name | string |
| `rep_candidate` | Republican candidate name | string |
oda_demographics.csv — County demographics by year (Census-aligned)
Key columns: fips, year, pop_total, pop_white_pct, pop_black_pct, pop_hispanic_pct, pop_asian_pct, pop_65plus_pct, pop_under30_pct, median_age, urban_rural_code (1–6, USDA classification), pop_density.
oda_economics.csv — County economic indicators by year
Key columns: fips, year, median_household_income, unemployment_rate, poverty_rate, manufacturing_employment_pct, college_attainment_pct, gini_coefficient, home_ownership_rate.
oda_polling.csv — State-level polling averages, 2000–2024
Key columns: state, year, pollster, poll_date, dem_support, rep_support, sample_size, methodology (live phone/IVR/online), likely_voter_screen, days_to_election.
oda_media.csv — Media environment proxies by market and year
Key columns: dma_code, dma_name, year, fox_news_viewership_index, cable_news_penetration, local_news_stations, newspaper_circulation_per_capita, political_ad_spending_per_voter.
oda_congress.csv — Congressional district results, 2000–2024
Key columns: district_id, state, district_num, year, dem_votes, rep_votes, incumbent_party, incumbent_running, dem_share, margin, open_seat.
Standard Loading Pattern
import pandas as pd
# Set the data directory path once
DATA_DIR = 'data/oda/'
# Load all tables
elections = pd.read_csv(f'{DATA_DIR}oda_elections.csv',
dtype={'fips': str}) # Keep FIPS as string to preserve leading zeros
demographics = pd.read_csv(f'{DATA_DIR}oda_demographics.csv',
dtype={'fips': str})
economics = pd.read_csv(f'{DATA_DIR}oda_economics.csv',
dtype={'fips': str})
polling = pd.read_csv(f'{DATA_DIR}oda_polling.csv',
parse_dates=['poll_date'])
media = pd.read_csv(f'{DATA_DIR}oda_media.csv')
congress = pd.read_csv(f'{DATA_DIR}oda_congress.csv')
# Create a merged county-level dataset for a single election year
def get_county_panel(year):
"""Merge elections, demographics, and economics for a given year."""
elec = elections[elections['year'] == year]
demo = demographics[demographics['year'] == year]
econ = economics[economics['year'] == year]
merged = elec.merge(demo, on='fips', how='left', suffixes=('', '_demo'))
merged = merged.merge(econ, on='fips', how='left', suffixes=('', '_econ'))
return merged
county_2020 = get_county_panel(2020)
Standard Data Quality Checks
Run these before any analysis to confirm the data loaded correctly:
def oda_quality_check(df, name):
print(f"\n=== {name} ===")
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isna().sum()[df.isna().sum() > 0]}")
print(f"Duplicate rows: {df.duplicated().sum()}")
if 'year' in df.columns:
print(f"Year range: {df['year'].min()} – {df['year'].max()}")
if 'fips' in df.columns:
print(f"Unique FIPS codes: {df['fips'].nunique()}")
oda_quality_check(elections, 'Elections')
oda_quality_check(demographics, 'Demographics')
oda_quality_check(economics, 'Economics')
Known data notes:
- Alaska reports results at the borough/census area level in some years; FIPS codes for Alaska may differ across years.
- A small number of counties were created or dissolved during 1980–2024 (notably in Virginia, where independent cities appear as counties). The ODA dataset standardizes these to 2020 FIPS boundaries where possible.
- Economic data for 1980–1989 is sparser than for later decades; use it with caution and document the limitation in your analysis.
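One quick way to surface the Alaska and Virginia boundary issues is to compare the set of FIPS codes across election years. This is a minimal sketch using a toy stand-in for the elections table (real column names, fabricated rows):

```python
import pandas as pd

# Toy stand-in for oda_elections.csv; the rows are fabricated for illustration.
elections = pd.DataFrame({
    'fips': ['01001', '01001', '02013', '02063'],
    'year': [2016, 2020, 2016, 2020],
})

# FIPS codes present in one year but not another often signal boundary or
# reporting changes (e.g., Alaska census areas renumbered between cycles).
fips_2016 = set(elections.loc[elections['year'] == 2016, 'fips'])
fips_2020 = set(elections.loc[elections['year'] == 2020, 'fips'])
only_2016 = sorted(fips_2016 - fips_2020)
only_2020 = sorted(fips_2020 - fips_2016)
```

Run the same comparison on the real table before merging across years, and document any codes that appear on only one side.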
B.6 Quick Reference: Text Analysis
Chapters 27 and 37 use natural language processing (NLP) to analyze political speeches, social media content, and news coverage.
Setup: NLTK Downloads
import nltk
# Run once to download required data
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
Tokenization and Preprocessing
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
stop_words = set(stopwords.words('english'))
# Add political domain stopwords
stop_words.update(['said', 'would', 'could', 'also', 'one', 'us', 'like'])
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
"""Tokenize, lowercase, remove stopwords and punctuation, lemmatize."""
# Tokenize
tokens = word_tokenize(text.lower())
# Keep only alphabetic tokens, remove stopwords
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
# Lemmatize (reduces words to base form: 'running' -> 'run')
tokens = [lemmatizer.lemmatize(t) for t in tokens]
return tokens
# Apply to a DataFrame column
df['tokens'] = df['speech_text'].apply(preprocess_text)
df['clean_text'] = df['tokens'].apply(lambda x: ' '.join(x))
VADER Sentiment Analysis
VADER (Valence Aware Dictionary and sEntiment Reasoner) is calibrated for short social media texts but works reasonably well for political messaging and news headlines. Its compound score ranges from -1 (most negative) to +1 (most positive); the neg, neu, and pos scores are proportions of the text falling in each category.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
def get_sentiment(text):
"""Returns a dict with neg, neu, pos, compound scores."""
return analyzer.polarity_scores(text)
# Apply to DataFrame
sentiments = df['tweet_text'].apply(get_sentiment)
df['sentiment_compound'] = sentiments.apply(lambda x: x['compound'])
df['sentiment_pos'] = sentiments.apply(lambda x: x['pos'])
df['sentiment_neg'] = sentiments.apply(lambda x: x['neg'])
# Classify sentiment direction
df['sentiment_label'] = df['sentiment_compound'].apply(
lambda x: 'positive' if x >= 0.05 else ('negative' if x <= -0.05 else 'neutral')
)
# Summary
print(df['sentiment_label'].value_counts(normalize=True))
CountVectorizer and TF-IDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Count vectorizer: raw word counts
cv = CountVectorizer(max_features=5000, min_df=5, max_df=0.95)
count_matrix = cv.fit_transform(df['clean_text'])
feature_names = cv.get_feature_names_out()
# TF-IDF: down-weights words that appear everywhere, up-weights distinctive words
tfidf = TfidfVectorizer(max_features=5000, min_df=5, max_df=0.95,
ngram_range=(1, 2)) # Include bigrams
tfidf_matrix = tfidf.fit_transform(df['clean_text'])
# Get most important terms for a specific document
doc_idx = 0
scores = tfidf_matrix[doc_idx].toarray().flatten()
top_indices = scores.argsort()[::-1][:20]
top_terms = [(tfidf.get_feature_names_out()[i], scores[i]) for i in top_indices]
print("Top terms:", top_terms)
# Get most distinctive terms by party (group average TF-IDF)
import numpy as np
for party in ['Democrat', 'Republican']:
    mask = (df['party'] == party).to_numpy()  # plain boolean array for sparse indexing
    party_tfidf = tfidf_matrix[mask].mean(axis=0).A1
    top_idx = party_tfidf.argsort()[::-1][:10]
    terms = [tfidf.get_feature_names_out()[i] for i in top_idx]
    print(f"\n{party} distinctive terms: {terms}")
B.7 Troubleshooting Guide
The following are the ten most common errors students encounter working through this textbook's code.
Error 1: ModuleNotFoundError: No module named 'pandas'
Cause: You are running Python in a different environment than where pandas is installed.
Fix: Confirm your environment is activated (conda activate polanalytics). In Jupyter, confirm the kernel matches your environment: Kernel > Change Kernel > select the polanalytics environment. If it does not appear, run conda install -n polanalytics ipykernel then python -m ipykernel install --user --name polanalytics.
Error 2: KeyError: 'fips'
Cause: The column name in your DataFrame does not match what the code expects. Common causes: the CSV uses 'FIPS' or 'Fips' instead of 'fips'; the column name has trailing whitespace.
Fix: Run df.columns.tolist() to see exact column names. Use df.columns = df.columns.str.strip().str.lower() to standardize all column names to lowercase with no whitespace.
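A minimal reproduction of the error and the fix, using a made-up two-column frame:

```python
import pandas as pd

# Headers with the wrong case and trailing whitespace, as read from a messy CSV.
df = pd.DataFrame({'FIPS ': ['01001'], 'State': ['Alabama']})

# df['fips'] would raise KeyError here; standardize the headers first.
df.columns = df.columns.str.strip().str.lower()
cols = df.columns.tolist()  # ['fips', 'state']
```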
Error 3: FIPS codes like 1001 instead of 01001
Cause: pandas read the FIPS column as integers, dropping the leading zero.
Fix: Always load FIPS as string: pd.read_csv('file.csv', dtype={'fips': str}). Or fix after loading: df['fips'] = df['fips'].astype(str).str.zfill(5).
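The after-the-fact repair, sketched with two fabricated codes:

```python
import pandas as pd

# FIPS codes read as integers: the leading zeros are already gone.
df = pd.DataFrame({'fips': [1001, 2013]})

# Cast to string, then left-pad to five digits.
df['fips'] = df['fips'].astype(str).str.zfill(5)
fips_fixed = df['fips'].tolist()  # ['01001', '02013']
```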
Error 4: ValueError: You are trying to merge on int64 and object columns
Cause: You are trying to merge two DataFrames on a column that has different data types in each (one is integer, one is string).
Fix: Standardize before merging: df1['fips'] = df1['fips'].astype(str) and df2['fips'] = df2['fips'].astype(str).
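A minimal sketch of the mismatch and the fix, with fabricated one-row frames:

```python
import pandas as pd

left = pd.DataFrame({'fips': [1001], 'dem_share': [0.48]})        # integer key
right = pd.DataFrame({'fips': ['01001'], 'median_age': [38.2]})   # string key

# pd.merge(left, right, on='fips') would fail on the dtype mismatch.
# Standardize both keys to zero-padded strings, then merge:
left['fips'] = left['fips'].astype(str).str.zfill(5)
merged = pd.merge(left, right, on='fips', how='inner')
```

Zero-padding while converting also guards against Error 3 above, so the two fixes are usually applied together.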
Error 5: Figures appear blurry in Jupyter
Cause: Default DPI is low.
Fix: Add plt.rcParams['figure.dpi'] = 120 at the top of your notebook. For exports, use fig.savefig('name.png', dpi=150, bbox_inches='tight').
Error 6: SettingWithCopyWarning
Cause: You are trying to modify a DataFrame that is a slice of another DataFrame. pandas is warning you the modification may not apply to the original.
Fix: Use .copy() when creating subsets you intend to modify: subset = df[df['year'] == 2020].copy(). Then subset['new_col'] = ... will work without warnings.
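The pattern in miniature, with a fabricated two-row frame (`two_party_margin` is an illustrative column name, not part of the ODA schema):

```python
import pandas as pd

df = pd.DataFrame({'year': [2016, 2020], 'dem_share': [0.48, 0.51]})

# Without .copy(), assigning to this slice triggers SettingWithCopyWarning
# and the assignment may silently fail to stick. With .copy(), it is unambiguous:
subset = df[df['year'] == 2020].copy()
subset['two_party_margin'] = 2 * subset['dem_share'] - 1

# The original DataFrame is untouched; only the explicit copy gains the column.
```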
Error 7: LookupError: Resource vader_lexicon not found
Cause: NLTK data has not been downloaded.
Fix: Run nltk.download('vader_lexicon') in a cell and re-execute.
Error 8: Memory error on large datasets
Cause: Loading the full ODA dataset or a large text corpus into memory.
Fix: Load only needed columns: pd.read_csv('file.csv', usecols=['fips', 'year', 'dem_share']). For very large files, pd.read_csv('file.csv', chunksize=10000) returns an iterator of DataFrames; loop over the chunks and aggregate as you go instead of holding everything in memory.
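A small sketch of chunked processing. io.StringIO stands in for a large on-disk CSV so the example is self-contained; the rows are fabricated:

```python
import io
import pandas as pd

# Stand-in for a large CSV file.
csv = io.StringIO("fips,dem_share\n01001,0.48\n02013,0.55\n02063,0.61\n")

# Stream the file two rows at a time, keeping only running totals in memory.
total, n = 0.0, 0
for chunk in pd.read_csv(csv, chunksize=2, dtype={'fips': str}):
    total += chunk['dem_share'].sum()
    n += len(chunk)
mean_share = total / n
```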
Error 9: TypeError: '<' not supported between instances of 'str' and 'float'
Cause: A column you expect to be numeric contains string values (often because of missing-value codes like "N/A" or "-" that were not converted to NaN on load).
Fix: Force conversion: df['column'] = pd.to_numeric(df['column'], errors='coerce'). The errors='coerce' argument converts unparseable values to NaN instead of raising an error.
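The coercion in miniature, with fabricated survey-style values:

```python
import pandas as pd

# A numeric column polluted with missing-value codes.
df = pd.DataFrame({'approval': ['54', '47', 'N/A', '-']})

# errors='coerce' turns the unparseable strings into NaN instead of raising.
df['approval'] = pd.to_numeric(df['approval'], errors='coerce')
n_missing = int(df['approval'].isna().sum())  # 2
```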
Error 10: Regression coefficients look implausibly small or hard to interpret
Cause: Mismatched variable scales. Regressing vote share expressed as a proportion (0–1) on unemployment expressed as a percent (0–100) produces a coefficient roughly 100 times smaller than the same regression with both variables in percentage points. The fit (R²) is unchanged by rescaling; only the coefficients become hard to read.
Fix: Use df.describe() to check the range of every variable before running regression. Standardize variables if necessary: df['var_std'] = (df['var'] - df['var'].mean()) / df['var'].std().
B.8 Additional Learning Resources
If you need to build foundational Python skills before engaging with the code chapters:
- Python for Everybody (py4e.com) — Free, beginner-oriented, no prerequisites
- Kaggle Learn (kaggle.com/learn) — Free short courses on pandas, data visualization, and machine learning
- Imai, Quantitative Social Science (Princeton University Press, 2022) — The companion textbook for political scientists learning R; the Python translation of all datasets and code is available at qss-data.princeton.edu
- Python Data Science Handbook (Jake VanderPlas, free at jakevdp.github.io/PythonDataScienceHandbook/) — Reference-quality coverage of NumPy, pandas, matplotlib, and scikit-learn
For text analysis specifically:
- Natural Language Processing with Python (Bird, Klein, Loper — free at nltk.org/book) — The official NLTK tutorial
- Text as Data (Grimmer, Roberts, Stewart, 2022, Princeton University Press) — Methods-focused, with political science examples