Quiz: Chapter 23

DataField.Dev

Association Rules and Market Basket Analysis

Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.

Question 1 (Multiple Choice)

A dataset has 10,000 transactions. Item A appears in 3,000 transactions, item B appears in 2,000 transactions, and both A and B appear together in 900 transactions. What is the lift of the rule {A} -> {B}?

A) 0.30
B) 0.45
C) 1.50
D) 3.00

Answer: C) 1.50. Lift = support(A, B) / (support(A) * support(B)) = (900/10000) / ((3000/10000) * (2000/10000)) = 0.09 / (0.30 * 0.20) = 0.09 / 0.06 = 1.50. This means A and B co-occur 50% more often than expected if they were independent. Lift > 1 indicates a positive association.
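
The arithmetic above can be checked with a short Python sketch. The function name lift_from_counts is illustrative and does not come from the chapter; the counts are the ones given in the question.

```python
def lift_from_counts(n_total, n_a, n_b, n_ab):
    """Lift of {A} -> {B} computed from raw transaction counts."""
    support_a = n_a / n_total
    support_b = n_b / n_total
    support_ab = n_ab / n_total
    # Observed co-occurrence divided by the co-occurrence expected
    # if A and B were independent.
    return support_ab / (support_a * support_b)


lift = lift_from_counts(n_total=10_000, n_a=3_000, n_b=2_000, n_ab=900)
print(round(lift, 2))
```

Working from counts rather than pre-computed supports keeps the three ratios visibly tied to the same denominator.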

Question 2 (Multiple Choice)

Which statement about confidence and lift is correct?

A) A rule with confidence 90% is always actionable
B) Lift corrects for the base rate of the consequent; confidence does not
C) Lift and confidence always rank rules in the same order
D) Confidence is symmetric: confidence(A -> B) = confidence(B -> A)

Answer: B) Lift corrects for the base rate of the consequent; confidence does not. Confidence measures P(B|A) but does not account for P(B). If B appears in 95% of all transactions, confidence(A -> B) will tend to be high even when A and B are unrelated. Lift divides confidence by support(B), correcting for this base rate effect. Choice A is wrong because a rule with high confidence but lift near 1 says nothing beyond the consequent's base rate, so it is not actionable. Choice C is wrong because confidence and lift can rank rules differently. Choice D is wrong because confidence(A -> B) = support(A,B)/support(A) and confidence(B -> A) = support(A,B)/support(B), which are equal only when support(A) = support(B).
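
A small Python sketch of the base-rate effect described above. The counts are hypothetical, chosen so that B appears in 95% of transactions and is independent of A; the helper names are illustrative.

```python
def confidence(n_ab, n_antecedent):
    """confidence(antecedent -> consequent) from counts."""
    return n_ab / n_antecedent


def lift(n_ab, n_a, n_b, n_total):
    """lift(A -> B) from counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))


# Hypothetical counts: B is in 95% of baskets, A in 10%, and the overlap
# is exactly what independence would predict (0.10 * 0.95 * 1000 = 95).
n_total, n_a, n_b, n_ab = 1_000, 100, 950, 95

conf_a_to_b = confidence(n_ab, n_a)   # high, driven by B's base rate
conf_b_to_a = confidence(n_ab, n_b)   # low: confidence is not symmetric
lift_ab = lift(n_ab, n_a, n_b, n_total)

print(conf_a_to_b, conf_b_to_a, lift_ab)
```

Under these assumed counts the rule A -> B has 95% confidence even though its lift is 1, which is the situation choice A overlooks.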

Question 3 (Multiple Choice)

The Apriori principle states:

A) If an itemset is frequent, all of its subsets are also frequent
B) If an itemset is infrequent, all of its supersets are also infrequent
C) Both A and B
D) Neither A nor B

Answer: C) Both A and B. The Apriori principle (also called downward closure or anti-monotonicity of support) has two equivalent formulations, each the contrapositive of the other. If {bread, wine} is infrequent, then {bread, wine, cheese} and all other supersets must also be infrequent. Conversely, if {bread, wine, cheese} is frequent, then all of its subsets ({bread, wine}, {bread, cheese}, {wine, cheese}, {bread}, {wine}, {cheese}) must also be frequent. This property allows the Apriori algorithm to prune a candidate itemset without counting its support as soon as one of its subsets is known to be infrequent.
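
The pruning step can be sketched in a few lines of Python. This is a minimal illustration of the subset check only, not a full Apriori implementation; the function name and the sets of frequent pairs are hypothetical.

```python
from itertools import combinations


def prune_candidates(candidates, frequent_smaller):
    """Keep only candidates whose every (k-1)-item subset is known to be frequent."""
    kept = []
    for cand in candidates:
        k = len(cand)
        subsets = (frozenset(s) for s in combinations(cand, k - 1))
        if all(s in frequent_smaller for s in subsets):
            kept.append(cand)
    return kept


candidate = frozenset({"bread", "wine", "cheese"})

# Hypothetical case 1: {bread, wine} was found infrequent, so it is absent here.
pairs_missing_bread_wine = {
    frozenset({"bread", "cheese"}),
    frozenset({"wine", "cheese"}),
}
print(prune_candidates([candidate], pairs_missing_bread_wine))  # []

# Hypothetical case 2: all three pairs are frequent, so the candidate survives
# and its support would then be counted against the database.
all_pairs = pairs_missing_bread_wine | {frozenset({"bread", "wine"})}
print(prune_candidates([candidate], all_pairs) == [candidate])  # True
```

Note that surviving the check does not make a candidate frequent; it only means the candidate cannot be ruled out without a support count.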

Question 4 (Multiple Choice)

What is the primary advantage of FP-Growth over Apriori?

A) FP-Growth produces more accurate rules
B) FP-Growth eliminates candidate generation by compressing the database into an FP-tree
C) FP-Growth does not require a min_support threshold
D) FP-Growth can find rules that Apriori cannot

Answer: B) FP-Growth eliminates candidate generation by compressing the database into an FP-tree. Apriori generates candidate itemsets at each level and scans the full database for each level to count support. FP-Growth builds a compact tree structure in two passes and mines frequent itemsets directly from the tree, avoiding the costly candidate generation and repeated database scans. Both algorithms find the same set of frequent itemsets (A and D are wrong). FP-Growth still requires min_support (C is wrong).

Question 5 (Short Answer)

Explain why lift is symmetric --- that is, lift(A -> B) = lift(B -> A) --- and why this can be a limitation in practice.

Answer: Lift = support(A, B) / (support(A) * support(B)). Since both the numerator and denominator are symmetric with respect to A and B, swapping them produces the same value. This is a limitation because business actions are directional: recommending wine to cheese buyers is different from recommending cheese to wine buyers. The two directions may have different confidence levels, different conviction values, and different commercial value. Lift alone cannot distinguish between these directional business decisions, which is why conviction (which is asymmetric) is useful as a supplementary metric.
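
To make the symmetry concrete, the following Python sketch computes the metrics in both directions. The support values for cheese and wine are hypothetical, and rule_metrics is an illustrative helper rather than a library function.

```python
def rule_metrics(support_x, support_y, support_xy):
    """Confidence, lift, and conviction for the rule X -> Y."""
    confidence = support_xy / support_x
    lift = support_xy / (support_x * support_y)
    conviction = (1 - support_y) / (1 - confidence)  # assumes confidence < 1
    return {"confidence": confidence, "lift": lift, "conviction": conviction}


# Hypothetical supports, for illustration only.
s_cheese, s_wine, s_both = 0.20, 0.05, 0.03

cheese_to_wine = rule_metrics(s_cheese, s_wine, s_both)
wine_to_cheese = rule_metrics(s_wine, s_cheese, s_both)

for name, m in [("cheese -> wine", cheese_to_wine), ("wine -> cheese", wine_to_cheese)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these assumed numbers the lift is identical in both directions, while confidence (0.15 versus 0.60) and conviction differ, so only the asymmetric metrics separate the two recommendations.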

Question 6 (Multiple Choice)

A rule has confidence 0.70, and the consequent has support 0.70. What is the conviction of this rule?

A) 0.70
B) 1.00
C) 1.43
D) Infinity

Answer: B) 1.00. Conviction = (1 - support(Y)) / (1 - confidence(X -> Y)) = (1 - 0.70) / (1 - 0.70) = 0.30 / 0.30 = 1.00. Conviction of 1.0 indicates that the antecedent and consequent are independent --- the rule has no predictive value beyond the base rate of the consequent. Note that this rule also has lift = 0.70 / 0.70 = 1.0, confirming independence.
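
A short Python sketch of the conviction formula, including the edge case behind choice D. Returning infinity when confidence equals 1 is a common convention and is treated here as an assumption; the function name is illustrative.

```python
import math


def conviction(consequent_support, confidence):
    """Conviction of X -> Y from support(Y) and confidence(X -> Y)."""
    if confidence >= 1.0:
        # The rule is never wrong, so the denominator is zero.
        return math.inf
    return (1 - consequent_support) / (1 - confidence)


print(conviction(0.70, 0.70))  # 1.0
print(conviction(0.70, 1.00))  # inf
```

Conviction would reach infinity only if the rule held in every transaction containing the antecedent, which is not the case for a rule with confidence 0.70.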

Question 7 (Multiple Choice)

You run FP-Growth with min_support=0.001 on a retail dataset with 5 million transactions and 40,000 unique products. The algorithm returns 2.3 million frequent itemsets and 18 million rules. What is the best next step?

A) Lower min_support to find even more patterns
B) Report all 18 million rules to the merchandising team
C) Raise min_support and apply lift, confidence, and support filters to reduce the rule set to a manageable size
D) Switch to Apriori, which produces fewer rules

Answer: C) Raise min_support and apply lift, confidence, and support filters to reduce the rule set to a manageable size. No business team can act on 18 million rules. The output needs to be filtered to tens or hundreds of actionable rules using layered criteria: min_support to ensure frequency, min_confidence to ensure reliability, and min_lift to ensure the pattern is genuinely above chance. Choice A worsens the problem. Choice B is impractical. Choice D is wrong because Apriori and FP-Growth find the same frequent itemsets; the rule count depends on thresholds, not the algorithm.
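
The layered filtering described above might look like the following Python sketch. The rules, the threshold values, and the dictionary layout are all hypothetical; in practice the thresholds would be tuned to the business context.

```python
def filter_rules(rules, min_support, min_confidence, min_lift):
    """Keep rules that clear all three thresholds, strongest lift first."""
    kept = [
        r for r in rules
        if r["support"] >= min_support
        and r["confidence"] >= min_confidence
        and r["lift"] >= min_lift
    ]
    return sorted(kept, key=lambda r: r["lift"], reverse=True)


# Hypothetical rules, for illustration only.
rules = [
    {"rule": "A -> B", "support": 0.020, "confidence": 0.60, "lift": 3.1},
    {"rule": "C -> D", "support": 0.0012, "confidence": 0.90, "lift": 9.0},  # too rare
    {"rule": "E -> F", "support": 0.050, "confidence": 0.80, "lift": 1.02},  # near chance
    {"rule": "G -> H", "support": 0.015, "confidence": 0.20, "lift": 4.0},   # unreliable
]

shortlist = filter_rules(rules, min_support=0.01, min_confidence=0.5, min_lift=1.5)
print([r["rule"] for r in shortlist])  # ['A -> B']
```

Each threshold removes a different failure mode, which is why a single filter on any one metric is usually not enough.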

Question 8 (Short Answer)

The beer-and-diapers story claims that a retailer discovered men buying diapers on Friday evenings also bought beer. Regardless of whether this specific example is real, explain what additional information (beyond the association rule itself) you would need to make a merchandising decision based on such a rule.

Answer: You would need: (1) the lift value, to confirm the co-occurrence is genuinely above chance rather than just two popular items appearing together; (2) the support, to confirm enough transactions contain both items to justify action; (3) the profit margin on each item, because a high-lift rule on low-margin items may not justify the cost of a display change; and (4) a causal or at least plausible behavioral explanation, because acting on a spurious correlation (e.g., both items happen to be near the store entrance) could backfire. You would also want to test the rule's stability across multiple time periods to ensure it is not a seasonal or one-time artifact.

Question 9 (Multiple Choice)

Zhang's metric ranges from -1 to +1. A rule with Zhang's metric = -0.6 indicates:

A) The antecedent and consequent are strongly positively associated
B) The antecedent and consequent are independent
C) The antecedent and consequent are negatively associated (buying A makes buying B less likely than chance)
D) The rule has low support

Answer: C) The antecedent and consequent are negatively associated (buying A makes buying B less likely than chance). Zhang's metric of -0.6 indicates a strong negative association: the antecedent and consequent co-occur substantially less than expected under independence. This pattern is typical of a substitute relationship, in which customers who buy A tend not to buy B in the same transaction. Examples include brand substitutes (Coca-Cola vs. Pepsi) or product alternatives (whole milk vs. oat milk). Zhang's metric near 0 would indicate independence (B), and positive values indicate positive association (A).
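
For reference, Zhang's metric can be computed directly from supports. The sketch below uses the commonly cited form (support(A,B) - support(A) * support(B)) / max(support(A,B) * (1 - support(A)), support(A) * (support(B) - support(A,B))). The support values are hypothetical, and returning 0 when the denominator is 0 is an assumption made here for convenience.

```python
def zhangs_metric(support_a, support_b, support_ab):
    """Zhang's metric for A -> B; ranges from -1 to +1."""
    numerator = support_ab - support_a * support_b
    denominator = max(
        support_ab * (1 - support_a),
        support_a * (support_b - support_ab),
    )
    if denominator == 0:
        return 0.0  # assumed convention for the degenerate case
    return numerator / denominator


# Hypothetical substitutes: each bought in 30% of baskets, together in only 4.5%.
z_substitutes = zhangs_metric(0.30, 0.30, 0.045)
# Hypothetical independent pair: joint support equals the product of supports.
z_independent = zhangs_metric(0.30, 0.30, 0.09)

print(round(z_substitutes, 2), round(z_independent, 2))
```

With these assumed supports the first pair scores roughly -0.59, close to the value in the question, while the independent pair scores approximately 0.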

Question 10 (Multiple Choice)

In the StreamFlow case study, association rules were applied to genre viewing patterns rather than product purchases. What was the key difference in how the results were used?

A) The rules were used to predict churn directly
B) The rules identified genre co-occurrences, which were then linked to churn rates to find "sticky" combinations
C) The rules replaced the churn prediction model from Chapter 17
D) The rules were used to determine pricing for each genre

Answer: B) The rules identified genre co-occurrences, which were then linked to churn rates to find "sticky" combinations. Standard market basket analysis treats co-occurrence as the end goal. In the StreamFlow case, co-occurrence was an intermediate step: the team first mined genre combinations using association rules (support, lift), then computed churn rates for subscribers who watched each combination, and finally identified combinations where co-occurrence predicted lower churn. This two-step approach --- mine patterns, then link to outcomes --- extends association rules beyond simple cross-selling to retention strategy.

Question 11 (Short Answer)

Why is pre-filtering items by minimum support before running Apriori or FP-Growth a useful optimization? What is the risk of aggressive pre-filtering?

Answer: Pre-filtering removes items whose support falls below min_support, reducing the number of columns in the one-hot matrix and the number of candidate itemsets the algorithm must evaluate. Since any itemset containing an infrequent item is guaranteed to be infrequent (by the Apriori principle), removing such items does not change the result. The risk of aggressive pre-filtering (setting the pre-filter threshold higher than the mining min_support) is discarding items that, while individually uncommon, participate in high-lift rules with other items. A niche item with 0.5% support might have lift > 5 with another niche item; pre-filtering at 1% would destroy that rule.
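
A minimal Python sketch of item pre-filtering on a list-of-lists transaction format. The function name, the transactions, and the threshold are hypothetical.

```python
from collections import Counter


def prefilter_items(transactions, min_support):
    """Drop items whose support is below min_support; keep every transaction."""
    n = len(transactions)
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    keep = {item for item, c in counts.items() if c / n >= min_support}
    return [[item for item in t if item in keep] for t in transactions]


# Hypothetical transactions, for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "saffron"],
    ["bread", "milk", "eggs"],
    ["milk"],
]

filtered = prefilter_items(transactions, min_support=0.5)
print(filtered)
```

The sketch keeps transactions even when they lose items, so the total transaction count, and therefore every remaining support value, is unchanged.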

Question 12 (Multiple Choice)

You discover the rule {premium_coffee_beans} -> {french_press} with support 0.008, confidence 0.42, and lift 7.3. A colleague argues this rule is not useful because the support is below 1%. How would you respond?

A) Agree --- rules with support below 1% are never actionable
B) Disagree --- the high lift (7.3) indicates a genuine, strong association that could drive targeted recommendations for premium coffee buyers
C) Disagree --- the high confidence (42%) alone makes this rule actionable
D) Agree --- lift above 5 usually indicates a data artifact

Answer: B) Disagree --- the high lift (7.3) indicates a genuine, strong association that could drive targeted recommendations for premium coffee buyers. Low support means the pattern is infrequent in absolute terms, but it may still be highly actionable for the niche segment of customers who buy premium coffee beans. A lift of 7.3 means a french press purchase is 7.3x more likely among premium coffee buyers than in the general population. This is exactly the kind of rule that provides value beyond what category managers already know. The question is whether the absolute number of affected transactions (0.8% of the dataset) justifies the business action, not whether the pattern is real.
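
The three reported metrics also pin down the individual supports, since support(X) = support(X,Y) / confidence and support(Y) = confidence / lift. A short Python sketch, with an illustrative function name, recovers them from the numbers in the question.

```python
def implied_supports(support_xy, confidence, lift):
    """Recover support(X) and support(Y) from the metrics of the rule X -> Y."""
    antecedent_support = support_xy / confidence   # support(X)
    consequent_support = confidence / lift         # support(Y)
    return antecedent_support, consequent_support


beans_support, press_support = implied_supports(
    support_xy=0.008, confidence=0.42, lift=7.3
)
print(round(beans_support, 4), round(press_support, 4))
```

This gives roughly 1.9% of transactions containing premium coffee beans and a baseline french press rate of about 5.8%, compared with 42% among premium coffee buyers.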

Question 13 (Short Answer)

Explain the difference between within-category association rules (e.g., {phone} -> {phone_case}) and cross-category rules (e.g., {yoga_mat} -> {water_bottle}). Which type is more valuable for discovery, and why?

Answer: Within-category rules connect items in the same product category (electronics accessories, baking ingredients). These are often already known to category managers and reflected in existing store layouts or recommendation engines. Cross-category rules connect items from different departments (fitness and kitchen, electronics and books). These are more valuable for discovery because they represent patterns that organizational silos prevent humans from seeing: a category manager for fitness equipment does not typically coordinate with the kitchen department. Cross-category rules are where association rule mining adds the most value beyond human intuition.

Question 14 (Multiple Choice)

Which of the following is a valid use of negative association rules (lift < 1)?

A) Identifying substitute products that customers choose between (e.g., Coke vs. Pepsi)
B) Finding products that should be bundled together
C) Increasing the support of infrequent items
D) Replacing confidence as the primary metric

Answer: A) Identifying substitute products that customers choose between (e.g., Coke vs. Pepsi). Lift < 1 means two items co-occur less often than expected under independence --- customers who buy one tend not to buy the other. This is the signature of substitute products: customers choose one or the other, not both. This insight is useful for shelf placement (place substitutes near each other to facilitate comparison), competitive analysis (understanding which products compete for the same purchase occasion), and promotional strategy (discounting one substitute may cannibalize the other).

Question 15 (Short Answer)

A retail analytics team mines association rules monthly and notices that the rule {sunscreen} -> {aloe_vera_gel} appears only in May through August. What type of rule is this, and how should the merchandising team handle it differently from a rule that appears in all 12 months?

Answer: This is a seasonal rule --- it reflects summer purchase behavior (sunburn prevention and treatment). The merchandising team should handle it differently from a year-round rule in two ways. First, the cross-sell recommendation should be active only during the relevant months; showing aloe vera recommendations to sunscreen buyers in December wastes recommendation slots. Second, the team should use the rule for seasonal inventory planning and temporary end-cap displays rather than permanent shelf adjacency. Seasonal rules should be tagged with their active months and managed on a calendar rather than treated as static recommendations.
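
One way to manage rules on a calendar is to tag each rule with its active months and check the tag at recommendation time. The record layout, the active_months field, and the function name below are assumptions for illustration, as is treating {phone} -> {phone_case} as a year-round rule.

```python
from datetime import date

# Hypothetical rule records; "active_months" is an assumed tagging convention.
RULES = [
    {"antecedent": "sunscreen", "consequent": "aloe_vera_gel",
     "active_months": {5, 6, 7, 8}},
    {"antecedent": "phone", "consequent": "phone_case",
     "active_months": set(range(1, 13))},  # assumed year-round
]


def recommendations_for(item, on_date, rules=RULES):
    """Consequents of rules for `item` that are active in the month of `on_date`."""
    return [
        r["consequent"]
        for r in rules
        if r["antecedent"] == item and on_date.month in r["active_months"]
    ]


print(recommendations_for("sunscreen", date(2024, 7, 15)))   # ['aloe_vera_gel']
print(recommendations_for("sunscreen", date(2024, 12, 15)))  # []
```

Keeping the months on the rule record lets the same lookup serve both seasonal and year-round rules without separate code paths.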

This quiz accompanies Chapter 23: Association Rules and Market Basket Analysis. Return to the chapter for full context.