Exercises: Chapter 23

Association Rules and Market Basket Analysis


Exercise 1: Core Metrics by Hand (Conceptual)

Consider the following 8 transactions:

Transaction Items
T1 bread, butter, milk
T2 bread, butter, eggs
T3 butter, milk, eggs
T4 bread, milk
T5 bread, butter, milk, eggs
T6 bread, eggs
T7 butter, milk
T8 bread, butter, milk, eggs, cheese

a) Compute the support of each individual item (bread, butter, milk, eggs, cheese).

b) Compute the support, confidence, and lift for the rule {bread, butter} -> {milk}. Show your work.

c) Compute the support, confidence, and lift for the reverse rule {milk} -> {bread, butter}. Is the lift the same as in (b)? Explain why or why not.

d) Compute the conviction for both {bread, butter} -> {milk} and {milk} -> {bread, butter}. Are they the same? What does this tell you about the directionality of these metrics?

e) A colleague computes that {bread} -> {eggs} has confidence 4/6 ≈ 66.7% and declares it a strong rule. Eggs have support 62.5%. Compute the lift for {bread} -> {eggs}. Is this rule genuinely interesting, or is the high confidence largely an artifact of the base rate of eggs?
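After working these by hand, a few lines of plain Python make a handy self-check. The transaction list below mirrors the table above; the helper functions are ours, not a library API.

```python
transactions = [
    {'bread', 'butter', 'milk'},
    {'bread', 'butter', 'eggs'},
    {'butter', 'milk', 'eggs'},
    {'bread', 'milk'},
    {'bread', 'butter', 'milk', 'eggs'},
    {'bread', 'eggs'},
    {'butter', 'milk'},
    {'bread', 'butter', 'milk', 'eggs', 'cheese'},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

def conviction(antecedent, consequent):
    c = confidence(antecedent, consequent)
    return float('inf') if c == 1 else (1 - support(consequent)) / (1 - c)

print(support({'bread'}))                         # 0.75
print(confidence({'bread', 'butter'}, {'milk'}))  # 0.75
print(lift({'bread', 'butter'}, {'milk'}))        # 1.0
```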


Exercise 2: Apriori by Hand (Conceptual)

Using the same 8 transactions from Exercise 1 and min_support = 0.375 (i.e., an itemset must appear in at least 3 of 8 transactions):

a) List all frequent 1-itemsets. For each, state its support.

b) Generate all candidate 2-itemsets from the frequent 1-itemsets. Which candidates survive the support threshold?

c) Generate all candidate 3-itemsets from the frequent 2-itemsets. Which candidates survive?

d) At what point does the algorithm stop? How many database scans did it require?

e) Suppose you lower min_support to 0.125 (1 of 8 transactions). How does this change the number of frequent itemsets? What is the tradeoff?


Exercise 3: Apriori vs. FP-Growth Comparison (Code)

Using mlxtend, run both Apriori and FP-Growth on a simulated grocery dataset and compare their outputs and performance.

import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder
import time

np.random.seed(42)

# Simulate 20,000 grocery transactions
n_transactions = 20_000
products = [
    'milk', 'bread', 'butter', 'eggs', 'cheese', 'yogurt',
    'apples', 'bananas', 'chicken', 'rice', 'pasta', 'tomatoes',
    'onions', 'garlic', 'olive_oil', 'coffee', 'tea', 'sugar',
    'cereal', 'juice', 'water', 'chips', 'cookies', 'ice_cream',
    'beer', 'wine', 'soda', 'detergent', 'soap', 'shampoo'
]

# Simulate with realistic co-purchase patterns
# Your code here: generate transactions with embedded affinities
# (e.g., pasta + tomatoes + garlic co-occur at 25% probability)

# a) Convert to one-hot encoded DataFrame
# b) Run Apriori with min_support=0.02 and time it
# c) Run FP-Growth with min_support=0.02 and time it
# d) Verify that both produce the same set of frequent itemsets
# e) Generate association rules from the FP-Growth result (lift > 1.2)
# f) Report: How many rules? What is the highest-lift rule?
#    What is the most supported 3-item set?


Exercise 4: Filtering Rules for Business Use (Code)

Starting from the rules generated in Exercise 3:

# a) Create a filtering function that accepts:
#    - min_support, min_confidence, min_lift
#    - max_antecedent_length (to limit complexity)
#    - max_consequent_length (typically 1 for recommendations)
#    Apply it with min_lift=1.5, min_confidence=0.3, max_antecedent_length=2

# b) For the filtered rules, add a column 'rule_text' that formats the
#    rule as a human-readable string: "{bread, butter} -> {milk}"
#    Sort by lift descending and print the top 20.

# c) Compute Zhang's metric for all filtered rules.
#    Rank by Zhang's metric instead of lift. Do the rankings differ?
#    Which metric would you trust more for production recommendations?

# d) Group the filtered rules by consequent item.
#    For each consequent, count how many rules recommend it.
#    Which items are recommended most often? What does this tell you
#    about the difference between "popular items" and "items with
#    strong cross-sell signals"?
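For part (c), one common formulation of Zhang's metric can be computed directly from the columns mlxtend's association_rules output provides ('support', 'antecedent support', 'consequent support'). This is a sketch of that formulation, not the only definition in circulation:

```python
import numpy as np
import pandas as pd

def zhangs_metric(rules: pd.DataFrame) -> pd.Series:
    """Zhang's metric: leverage normalized to [-1, 1].
    +1 = perfect positive association, -1 = perfect negative association."""
    s_ab = rules['support']
    s_a = rules['antecedent support']
    s_b = rules['consequent support']
    leverage = s_ab - s_a * s_b
    denom = np.maximum(s_ab * (1 - s_a), s_a * (s_b - s_ab))
    return leverage / denom

# Sanity check on hand-built numbers from Exercise 1:
# {bread, butter} -> {milk} has lift 1.0, so Zhang's metric should be 0.
demo = pd.DataFrame({'support': [0.375],
                     'antecedent support': [0.5],
                     'consequent support': [0.75]})
print(zhangs_metric(demo))
```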

Exercise 5: min_support Sensitivity Analysis (Code)

The min_support threshold is the most impactful parameter in association rule mining. This exercise explores its effect.

# Using the grocery dataset from Exercise 3:

# a) Run FP-Growth with min_support values of:
#    [0.001, 0.005, 0.01, 0.02, 0.05, 0.10, 0.20]
#    For each, record:
#    - Number of frequent itemsets
#    - Number of rules (with min_lift=1.2)
#    - Runtime (seconds)
#    Plot all three as a function of min_support.

# b) At min_support=0.001, how many rules do you get?
#    Sample 10 random rules from this set. Are they actionable?
#    What is the average lift?

# c) At min_support=0.10, how many rules do you get?
#    Are any of them surprising (i.e., not obvious to a grocery
#    store manager)?

# d) Based on your analysis, what min_support would you recommend
#    for a grocery chain with 500,000 monthly transactions?
#    Justify your answer in 2-3 sentences.

Exercise 6: Association Rules on Non-Retail Data (Code)

Association rules work on any data that can be framed as transactions. This exercise applies them to university course enrollment data.

np.random.seed(42)

# Simulate 5,000 students, each taking 4-6 courses per semester
courses = [
    'calc_1', 'calc_2', 'linear_algebra', 'stats_intro', 'stats_advanced',
    'cs_intro', 'data_structures', 'algorithms', 'databases', 'ml_intro',
    'econ_intro', 'micro', 'macro', 'econometrics',
    'writing_101', 'ethics', 'philosophy', 'psychology',
    'physics_1', 'physics_2', 'chemistry', 'biology',
    'accounting', 'finance', 'marketing', 'management'
]

# Define course sequences (students who take A often take B later)
course_affinities = {
    'calc_1': ['calc_2', 'physics_1'],
    'calc_2': ['linear_algebra', 'physics_2'],
    'stats_intro': ['stats_advanced', 'econometrics', 'ml_intro'],
    'cs_intro': ['data_structures'],
    'data_structures': ['algorithms', 'databases'],
    'econ_intro': ['micro', 'macro'],
}

# Your code here:
# a) Generate student enrollment "transactions" using the affinities above
#    (each student's course list is a transaction)
# b) Run FP-Growth with min_support=0.05
# c) Generate rules with min_lift=1.5
# d) Find the top 10 rules by lift. Do they correspond to known
#    prerequisite sequences?
# e) Find rules that are NOT obvious prerequisites (cross-department
#    patterns). What do they suggest about student behavior?
# f) How could a university registrar use these rules for course
#    recommendation or scheduling?

Exercise 7: Temporal Stability of Rules (Code + Analysis)

Association rules can change over time. This exercise tests whether rules are stable across different time periods.

# Simulate 12 months of transaction data with:
# - Stable rules (present in all months)
# - Seasonal rules (present only in Nov-Dec)
# - Trending rules (appearing in later months, not earlier)

# a) Generate 12 monthly transaction datasets (5,000 transactions each)
#    Include:
#    - 5 product pairs that always co-occur (stable, lift > 2)
#    - 3 product pairs that co-occur only in months 11-12 (seasonal)
#    - 2 product pairs that start co-occurring in month 7 (trending)

# b) Mine rules independently for each month (min_support=0.02, min_lift=1.5)

# c) For each rule, count the number of months it appears.
#    Plot a histogram of "rule persistence" (months present out of 12).

# d) Classify rules as:
#    - Stable: present in 10+ months
#    - Seasonal: present in 1-3 months
#    - Trending: present in later months but not earlier
#    Print examples of each category.

# e) In production, a merchandising team refreshes rules monthly.
#    Write a function that compares this month's rules to last month's
#    and returns: new_rules, dropped_rules, persistent_rules.
#    Why is this comparison important for business operations?
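For part (e), the month-over-month diff is easiest once each rule is reduced to a hashable key; mlxtend stores antecedents and consequents as frozensets, so an (antecedent, consequent) pair works directly. The two tiny DataFrames below are illustrative stand-ins for real monthly rule outputs:

```python
import pandas as pd

def rule_keys(rules_df):
    """Reduce a rules DataFrame to a set of (antecedent, consequent) pairs."""
    return set(zip(rules_df['antecedents'], rules_df['consequents']))

this_month = pd.DataFrame({
    'antecedents': [frozenset({'beer'}), frozenset({'bread'})],
    'consequents': [frozenset({'chips'}), frozenset({'milk'})],
})
last_month = this_month.iloc[:1]  # pretend only the beer rule existed last month

new_rules = rule_keys(this_month) - rule_keys(last_month)
dropped_rules = rule_keys(last_month) - rule_keys(this_month)
persistent_rules = rule_keys(this_month) & rule_keys(last_month)
print(len(new_rules), len(dropped_rules), len(persistent_rules))  # 1 0 1
```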

Exercise 8: Scalability and Sparse Data (Code)

This exercise explores the practical limits of association rule mining with mlxtend.

# a) Generate transaction datasets of increasing size:
#    n_transactions = [1000, 5000, 10000, 50000, 100000]
#    n_items = 200, avg_basket_size = 6
#    For each, time FP-Growth with min_support=0.01.
#    Plot runtime vs. n_transactions. Is the scaling linear?

# b) Now fix n_transactions = 10000 and vary n_items:
#    n_items = [50, 100, 200, 500, 1000]
#    avg_basket_size = 6 (so sparsity increases with n_items)
#    Time FP-Growth with min_support=0.01.
#    Plot runtime vs. n_items. What happens to runtime as items increase?

# c) For the n_items=1000 case, compute the memory usage of the
#    dense boolean DataFrame. Then create a sparse representation
#    using scipy.sparse.csr_matrix and compare memory usage.
#    Note: mlxtend's miners take a pandas DataFrame, not a scipy matrix,
#    so this sparse representation is for storage/transfer, not for
#    direct mining.

# d) At what dataset size does mlxtend become impractical on your
#    machine? What alternatives exist for larger datasets?
#    (Hint: Spark MLlib FPGrowth, efficient-apriori package)
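A sketch for the memory comparison in (c), using the same shape as the n_items=1000 case (the 10,000 x 1,000 dimensions and 6-item baskets below are the exercise's assumptions):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n_rows, n_items, basket = 10_000, 1_000, 6

# Dense boolean matrix: exactly 6 True entries per row
dense = np.zeros((n_rows, n_items), dtype=bool)
for i in range(n_rows):
    dense[i, rng.choice(n_items, size=basket, replace=False)] = True

df = pd.DataFrame(dense)
sparse = csr_matrix(dense)

dense_mb = df.memory_usage(deep=True).sum() / 1e6
sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes
             + sparse.indptr.nbytes) / 1e6
# Dense comes to roughly 10 MB; CSR stores only the 60,000 nonzeros
print(f"dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.2f} MB")
```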

Exercise 9: Negative Association Rules (Conceptual + Code)

Most analyses focus on positive associations (lift > 1). Negative associations (lift < 1) can also be valuable.

# a) Generate rules with NO lift filter (min_threshold=0.0 on some
#    other metric like confidence). Filter to rules where lift < 0.8.
#    These are product pairs that co-occur LESS than expected.

# b) In a grocery context, what might a rule like
#    {whole_milk} -> {oat_milk} with lift=0.3 mean?
#    Why is this commercially interesting?

# c) Compute Zhang's metric for the negative-association rules.
#    Which metric (lift or Zhang's) better captures the strength
#    of negative association? Explain in 2-3 sentences.

# d) Give three real-world examples where negative association rules
#    would be useful for business decisions. For each, describe
#    the actionable insight.

Exercise 10: End-to-End Market Basket Pipeline (Project)

Build a complete market basket analysis pipeline from raw transaction data to business recommendations.

# Use this simulated dataset or substitute your own:
# - An online bookstore with 30,000 transactions
# - 150 book titles across 10 genres
# - Known affinities between genres (mystery readers buy thrillers, etc.)

# Your pipeline should include:
# 1. Data loading and one-hot encoding
# 2. Exploratory analysis (basket size distribution, item frequencies)
# 3. FP-Growth with justified min_support choice
# 4. Rule generation and multi-criteria filtering
# 5. Rule visualization: scatter plot of support vs. confidence,
#    colored by lift
# 6. A recommendation function: given a cart, return top-3 suggestions
# 7. A brief written summary (3-5 sentences) of the top findings
#    and recommended business actions

# Deliverable: a single Python script or Jupyter notebook that runs
# end-to-end and produces the summary at the end.
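Plot boilerplate for step 5 of the pipeline. The `rules` DataFrame here is a random stand-in with mlxtend-style columns; in your pipeline, pass the real mined rules instead.

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Illustrative stand-in for the mined rules (columns named as in mlxtend)
rules = pd.DataFrame({'support': rng.uniform(0.01, 0.2, 100),
                      'confidence': rng.uniform(0.1, 0.9, 100),
                      'lift': rng.uniform(0.5, 3.0, 100)})

fig, ax = plt.subplots(figsize=(7, 5))
sc = ax.scatter(rules['support'], rules['confidence'],
                c=rules['lift'], cmap='viridis', s=20)
fig.colorbar(sc, label='lift')
ax.set_xlabel('support')
ax.set_ylabel('confidence')
ax.set_title('Association rules: support vs. confidence, colored by lift')
fig.savefig('rules_scatter.png', dpi=100)
```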

These exercises accompany Chapter 23: Association Rules and Market Basket Analysis. Return to the chapter for full context.