Exercises: Chapter 23
Association Rules and Market Basket Analysis
Exercise 1: Core Metrics by Hand (Conceptual)
Consider the following 8 transactions:
| Transaction | Items |
|---|---|
| T1 | bread, butter, milk |
| T2 | bread, butter, eggs |
| T3 | butter, milk, eggs |
| T4 | bread, milk |
| T5 | bread, butter, milk, eggs |
| T6 | bread, eggs |
| T7 | butter, milk |
| T8 | bread, butter, milk, eggs, cheese |
a) Compute the support of each individual item (bread, butter, milk, eggs, cheese).
b) Compute the support, confidence, and lift for the rule {bread, butter} -> {milk}. Show your work.
c) Compute the support, confidence, and lift for the reverse rule {milk} -> {bread, butter}. Is the lift the same as in (b)? Explain why or why not.
d) Compute the conviction for both {bread, butter} -> {milk} and {milk} -> {bread, butter}. Are they the same? What does this tell you about the directionality of these metrics?
e) A colleague sees that {bread} -> {eggs} has confidence of about 67% and declares it a strong rule. Bread has support 75% and eggs have support 62.5%. Compute the lift for {bread} -> {eggs}. Is this rule genuinely interesting, or is the high confidence an artifact of the base rate of eggs?
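Your hand calculations for (a)-(e) can be checked with a few lines of plain Python. A minimal sketch; the helper names (`support`, `confidence`, `lift`, `conviction`) are illustrative, not from the chapter:

```python
# The 8 transactions from the table above, as Python sets
transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter', 'eggs'},
    {'butter', 'milk', 'eggs'},
    {'bread', 'milk'},
    {'bread', 'butter', 'milk', 'eggs'},
    {'bread', 'eggs'},
    {'butter', 'milk'},
    {'bread', 'butter', 'milk', 'eggs', 'cheese'},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

def conviction(antecedent, consequent):
    c = confidence(antecedent, consequent)
    return float('inf') if c == 1 else (1 - support(consequent)) / (1 - c)

print(support({'bread', 'butter'}))               # 0.5
print(confidence({'bread', 'butter'}, {'milk'}))  # 0.75
print(lift({'bread', 'butter'}, {'milk'}))        # 1.0
```

Note that `support(itemset)` relies on Python's subset operator (`<=`) between sets, which mirrors the definition of itemset support directly.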
Exercise 2: Apriori by Hand (Conceptual)
Using the same 8 transactions from Exercise 1 and min_support = 0.375 (i.e., an itemset must appear in at least 3 of 8 transactions):
a) List all frequent 1-itemsets. For each, state its support.
b) Generate all candidate 2-itemsets from the frequent 1-itemsets. Which candidates survive the support threshold?
c) Generate all candidate 3-itemsets from the frequent 2-itemsets. Which candidates survive?
d) At what point does the algorithm stop? How many database scans did it require?
e) Suppose you lower min_support to 0.125 (1 of 8 transactions). How does this change the number of frequent itemsets? What is the tradeoff?
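A brute-force, level-by-level count over the same 8 transactions lets you confirm your candidate lists for (a)-(c). This sketch enumerates every k-item combination rather than performing Apriori's join-and-prune step, which is fine at this tiny scale:

```python
from itertools import combinations

transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter', 'eggs'},
    {'butter', 'milk', 'eggs'},
    {'bread', 'milk'},
    {'bread', 'butter', 'milk', 'eggs'},
    {'bread', 'eggs'},
    {'butter', 'milk'},
    {'bread', 'butter', 'milk', 'eggs', 'cheese'},
]
items = sorted(set().union(*transactions))
min_count = 3  # min_support = 0.375 on 8 transactions

def frequent_itemsets(k):
    """All k-itemsets appearing in at least min_count transactions."""
    return {
        frozenset(cand)
        for cand in combinations(items, k)
        if sum(set(cand) <= t for t in transactions) >= min_count
    }

for k in range(1, 5):
    level = frequent_itemsets(k)
    print(k, len(level), sorted(map(sorted, level)))
# Levels have 4, 6, and 3 frequent itemsets; level 4 is empty, so Apriori stops there.
```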
Exercise 3: Apriori vs. FP-Growth Comparison (Code)
Using mlxtend, run both Apriori and FP-Growth on a simulated grocery dataset and compare their outputs and performance.
import time

import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth
from mlxtend.preprocessing import TransactionEncoder
np.random.seed(42)
# Simulate 20,000 grocery transactions
n_transactions = 20_000
products = [
'milk', 'bread', 'butter', 'eggs', 'cheese', 'yogurt',
'apples', 'bananas', 'chicken', 'rice', 'pasta', 'tomatoes',
'onions', 'garlic', 'olive_oil', 'coffee', 'tea', 'sugar',
'cereal', 'juice', 'water', 'chips', 'cookies', 'ice_cream',
'beer', 'wine', 'soda', 'detergent', 'soap', 'shampoo'
]
# Simulate with realistic co-purchase patterns
# Your code here: generate transactions with embedded affinities
# (e.g., pasta + tomatoes + garlic co-occur at 25% probability)
# a) Convert to one-hot encoded DataFrame
# b) Run Apriori with min_support=0.02 and time it
# c) Run FP-Growth with min_support=0.02 and time it
# d) Verify that both produce the same set of frequent itemsets
# e) Generate association rules from the FP-Growth result (lift > 1.2)
# f) Report: How many rules? What is the highest-lift rule?
# What is the most supported 3-item set?
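One way to embed a co-purchase affinity is to inject a fixed item group into a fraction of otherwise-random baskets. A sketch for the generation step only; the function name `simulate_transactions` and its parameters are hypothetical, and the one-hot step below uses plain pandas (in the chapter's pipeline you would use mlxtend's `TransactionEncoder` instead):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
products = [
    'milk', 'bread', 'butter', 'eggs', 'cheese', 'yogurt',
    'apples', 'bananas', 'chicken', 'rice', 'pasta', 'tomatoes',
    'onions', 'garlic', 'olive_oil', 'coffee', 'tea', 'sugar',
    'cereal', 'juice', 'water', 'chips', 'cookies', 'ice_cream',
    'beer', 'wine', 'soda', 'detergent', 'soap', 'shampoo',
]

def simulate_transactions(n, affinity=('pasta', 'tomatoes', 'garlic'),
                          p_affinity=0.25):
    """Random baskets with one embedded co-purchase pattern."""
    baskets = []
    for _ in range(n):
        size = rng.integers(2, 8)  # 2-7 random items
        basket = set(rng.choice(products, size=size, replace=False))
        if rng.random() < p_affinity:  # inject the affinity group
            basket |= set(affinity)
        baskets.append(sorted(basket))
    return baskets

baskets = simulate_transactions(20_000)
# One-hot encode: one boolean column per product
onehot = pd.DataFrame([{p: (p in b) for p in products} for b in baskets])
```

By construction the pasta/tomatoes/garlic trio co-occurs in roughly 25% of baskets, far above what independence would predict, so it should surface as a high-lift itemset in parts (b)-(f).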
Exercise 4: Filtering Rules for Business Use (Code)
Starting from the rules generated in Exercise 3:
# a) Create a filtering function that accepts:
# - min_support, min_confidence, min_lift
# - max_antecedent_length (to limit complexity)
# - max_consequent_length (typically 1 for recommendations)
# Apply it with min_lift=1.5, min_confidence=0.3, max_antecedent_length=2
# b) For the filtered rules, add a column 'rule_text' that formats the
# rule as a human-readable string: "{bread, butter} -> {milk}"
# Sort by lift descending and print the top 20.
# c) Compute Zhang's metric for all filtered rules.
# Rank by Zhang's metric instead of lift. Do the rankings differ?
# Which metric would you trust more for production recommendations?
# d) Group the filtered rules by consequent item.
# For each consequent, count how many rules recommend it.
# Which items are recommended most often? What does this tell you
# about the difference between "popular items" and "items with
# strong cross-sell signals"?
Exercise 5: min_support Sensitivity Analysis (Code)
The min_support threshold is the most impactful parameter in association rule mining. This exercise explores its effect.
# Using the grocery dataset from Exercise 3:
# a) Run FP-Growth with min_support values of:
# [0.001, 0.005, 0.01, 0.02, 0.05, 0.10, 0.20]
# For each, record:
# - Number of frequent itemsets
# - Number of rules (with min_lift=1.2)
# - Runtime (seconds)
# Plot all three as a function of min_support.
# b) At min_support=0.001, how many rules do you get?
# Sample 10 random rules from this set. Are they actionable?
# What is the average lift?
# c) At min_support=0.10, how many rules do you get?
# Are any of them surprising (i.e., not obvious to a grocery
# store manager)?
# d) Based on your analysis, what min_support would you recommend
# for a grocery chain with 500,000 monthly transactions?
# Justify your answer in 2-3 sentences.
Exercise 6: Association Rules on Non-Retail Data (Code)
Association rules work on any data that can be framed as transactions. This exercise applies them to university course enrollment data.
np.random.seed(42)
# Simulate 5,000 students, each taking 4-6 courses per semester
courses = [
'calc_1', 'calc_2', 'linear_algebra', 'stats_intro', 'stats_advanced',
'cs_intro', 'data_structures', 'algorithms', 'databases', 'ml_intro',
'econ_intro', 'micro', 'macro', 'econometrics',
'writing_101', 'ethics', 'philosophy', 'psychology',
'physics_1', 'physics_2', 'chemistry', 'biology',
'accounting', 'finance', 'marketing', 'management'
]
# Define course sequences (students who take A often take B later)
course_affinities = {
'calc_1': ['calc_2', 'physics_1'],
'calc_2': ['linear_algebra', 'physics_2'],
'stats_intro': ['stats_advanced', 'econometrics', 'ml_intro'],
'cs_intro': ['data_structures'],
'data_structures': ['algorithms', 'databases'],
'econ_intro': ['micro', 'macro'],
}
# Your code here:
# a) Generate student enrollment "transactions" using the affinities above
# (each student's course list is a transaction)
# b) Run FP-Growth with min_support=0.05
# c) Generate rules with min_lift=1.5
# d) Find the top 10 rules by lift. Do they correspond to known
# prerequisite sequences?
# e) Find rules that are NOT obvious prerequisites (cross-department
# patterns). What do they suggest about student behavior?
# f) How could a university registrar use these rules for course
# recommendation or scheduling?
Exercise 7: Temporal Stability of Rules (Code + Analysis)
Association rules can change over time. This exercise tests whether rules are stable across different time periods.
# Simulate 12 months of transaction data with:
# - Stable rules (present in all months)
# - Seasonal rules (present only in Nov-Dec)
# - Trending rules (appearing in later months, not earlier)
# a) Generate 12 monthly transaction datasets (5,000 transactions each)
# Include:
# - 5 product pairs that always co-occur (stable, lift > 2)
# - 3 product pairs that co-occur only in months 11-12 (seasonal)
# - 2 product pairs that start co-occurring in month 7 (trending)
# b) Mine rules independently for each month (min_support=0.02, min_lift=1.5)
# c) For each rule, count the number of months it appears.
# Plot a histogram of "rule persistence" (months present out of 12).
# d) Classify rules as:
# - Stable: present in 10+ months
# - Seasonal: present in 1-3 months
# - Trending: present in later months but not earlier
# Print examples of each category.
# e) In production, a merchandising team refreshes rules monthly.
# Write a function that compares this month's rules to last month's
# and returns: new_rules, dropped_rules, persistent_rules.
# Why is this comparison important for business operations?
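The month-over-month comparison in (e) reduces to set operations once each rule is keyed by its (antecedent, consequent) pair. A minimal sketch; the function name `compare_rules` is hypothetical, and in practice you would build the keys from mlxtend's `antecedents`/`consequents` columns, which already hold frozensets:

```python
def compare_rules(this_month, last_month):
    """Diff two rule sets; rules are (antecedent, consequent) frozenset pairs."""
    this, last = set(this_month), set(last_month)
    return {
        'new_rules': this - last,
        'dropped_rules': last - this,
        'persistent_rules': this & last,
    }

# Toy example with made-up rules
rule = lambda a, c: (frozenset(a), frozenset(c))
january = {rule(['bread'], ['butter']), rule(['chips'], ['soda'])}
february = {rule(['bread'], ['butter']), rule(['wine'], ['cheese'])}
diff = compare_rules(february, january)
```

Here `diff['persistent_rules']` contains only the bread -> butter rule; chips -> soda was dropped and wine -> cheese is new.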
Exercise 8: Scalability and Sparse Data (Code)
This exercise explores the practical limits of association rule mining with mlxtend.
# a) Generate transaction datasets of increasing size:
# n_transactions = [1000, 5000, 10000, 50000, 100000]
# n_items = 200, avg_basket_size = 6
# For each, time FP-Growth with min_support=0.01.
# Plot runtime vs. n_transactions. Is the scaling linear?
# b) Now fix n_transactions = 10000 and vary n_items:
# n_items = [50, 100, 200, 500, 1000]
# avg_basket_size = 6 (so sparsity increases with n_items)
# Time FP-Growth with min_support=0.01.
# Plot runtime vs. n_items. What happens to runtime as items increase?
# c) For the n_items=1000 case, compute the memory usage of the
# dense boolean DataFrame. Then create a sparse representation
# using scipy.sparse.csr_matrix and compare memory usage.
# Note: mlxtend requires a dense DataFrame, so the sparse
# representation is for storage/transfer, not for direct mining.
# d) At what dataset size does mlxtend become impractical on your
# machine? What alternatives exist for larger datasets?
# (Hint: Spark MLlib FPGrowth, efficient-apriori package)
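The memory comparison in (c) can be sketched before any mining happens, using a synthetic boolean matrix of the target shape (the 6/1000 fill probability stands in for an average basket of 6 items over 1,000 products):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_transactions, n_items, avg_basket = 10_000, 1_000, 6

# Dense boolean basket matrix: each cell True with prob. avg_basket / n_items
dense = rng.random((n_transactions, n_items)) < avg_basket / n_items
dense_mb = dense.nbytes / 1e6

# CSR stores only the nonzero entries plus index arrays
csr = sparse.csr_matrix(dense)
sparse_mb = (csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6

print(f"dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.2f} MB")
```

The dense matrix costs one byte per cell (10 MB here) regardless of sparsity, while CSR cost scales with the number of nonzeros, which is why the gap widens as `n_items` grows with basket size held fixed.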
Exercise 9: Negative Association Rules (Conceptual + Code)
Most analyses focus on positive associations (lift > 1). Negative associations (lift < 1) can also be valuable.
# a) Generate rules with NO lift filter (min_threshold=0.0 on some
# other metric like confidence). Filter to rules where lift < 0.8.
# These are product pairs that co-occur LESS than expected.
# b) In a grocery context, what might a rule like
# {whole_milk} -> {oat_milk} with lift=0.3 mean?
# Why is this commercially interesting?
# c) Compute Zhang's metric for the negative-association rules.
# Which metric (lift or Zhang's) better captures the strength
# of negative association? Explain in 2-3 sentences.
# d) Give three real-world examples where negative association rules
# would be useful for business decisions. For each, describe
# the actionable insight.
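Before filtering real rules in (a), it helps to see what lift below 1 looks like concretely. A toy sketch with made-up substitute products, where the two milks each sell well but almost never together:

```python
# 10 toy baskets: whole_milk and oat_milk are popular but rarely co-occur
transactions = [
    {'whole_milk', 'bread'}, {'whole_milk', 'eggs'}, {'whole_milk'},
    {'whole_milk', 'cereal'}, {'whole_milk', 'oat_milk'}, {'whole_milk'},
    {'oat_milk', 'bread'}, {'oat_milk'}, {'oat_milk', 'cereal'}, {'bread'},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# lift = support(A ∪ B) / (support(A) * support(B))
lift = support({'whole_milk', 'oat_milk'}) / (
    support({'whole_milk'}) * support({'oat_milk'})
)
print(round(lift, 3))  # 0.417
```

Independence would predict the pair in 0.6 x 0.4 = 24% of baskets; it actually appears in 10%, hence lift well below 1, the signature of substitute products.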
Exercise 10: End-to-End Market Basket Pipeline (Project)
Build a complete market basket analysis pipeline from raw transaction data to business recommendations.
# Use this simulated dataset or substitute your own:
# - An online bookstore with 30,000 transactions
# - 150 book titles across 10 genres
# - Known affinities between genres (mystery readers buy thrillers, etc.)
# Your pipeline should include:
# 1. Data loading and one-hot encoding
# 2. Exploratory analysis (basket size distribution, item frequencies)
# 3. FP-Growth with justified min_support choice
# 4. Rule generation and multi-criteria filtering
# 5. Rule visualization: scatter plot of support vs. confidence,
# colored by lift
# 6. A recommendation function: given a cart, return top-3 suggestions
# 7. A brief written summary (3-5 sentences) of the top findings
# and recommended business actions
# Deliverable: a single Python script or Jupyter notebook that runs
# end-to-end and produces the summary at the end.
These exercises accompany Chapter 23: Association Rules and Market Basket Analysis. Return to the chapter for full context.