Case Study: Building a Test-Case Generator for a Configuration Space

DataField.Dev

Case Study: Building a Test-Case Generator for a Configuration Space

"Testing shows the presence, not the absence, of bugs." — Edsger W. Dijkstra

Executive Summary

You will build a small tool that does something every test engineer needs: given a function's parameters and their allowed values, it (1) counts the full input space so you know whether exhaustive testing is even possible, and (2) when it is, generates every input combination so you can run them all. The counting is the chapter — the product rule sizes the space, the sum rule handles "either-this-shape-or-that-shape" inputs, and the subtraction principle lets you size "inputs that violate a precondition" so you can test them deliberately. The generator is the product rule made executable: a Cartesian product of the per-parameter option lists.

Unlike the analysis-heavy keyspace audit (case-study-01.md), this study is build-heavy: you grow a working module in four phases, each adding one counting rule, and each producing output you hand-derive before trusting. By the end you will have a reusable coverage helper that turns "is exhaustive testing feasible?" from a guess into a one-line computation, and an all_inputs generator that emits the cases when the answer is yes.

Skills applied

The product rule as both a count and a constructor (sizing a space; generating its elements).
The sum rule to size and enumerate a union of disjoint input shapes (§15.3).
The subtraction principle to count and isolate precondition-violating inputs (§15.6).
Reading a count as a feasibility verdict: test everything vs. sample (the combinatorial explosion, §15.6).

Background

The scenario

You maintain a function render(theme, density, locale, dark_mode, retries) in a UI service. Its parameters and allowed values are:

Parameter	Allowed values	Count
`theme`	`"classic"`, `"modern"`, `"high_contrast"`	3
`density`	`"compact"`, `"cozy"`, `"comfortable"`	3
`locale`	`"en"`, `"fr"`, `"de"`, `"ja"`	4
`dark_mode`	`True`, `False`	2
`retries`	integer 0, 1, 2, 3	4

A bug report claims render mishandles some combination of these — not any single value, but an interaction (e.g. high_contrast + ja + dark_mode=True). You want to test combinations systematically, and the first question is the chapter's: how many combinations are there, and can we test them all?

💡 Intuition: The set of all inputs to render is a Cartesian product (Chapter 8) of the five parameter sets. Its size is the product of the five counts (the product rule), and enumerating it is just forming that Cartesian product. Counting and constructing are the same operation viewed two ways — which is exactly why the product rule is the backbone of both this study and the keyspace audit.

Why this matters

"Exhaustive testing" is the gold standard — run every input, find every input-dependent bug — but it is only possible when the input space is small. Whether it is small is a counting question, and getting the count wrong is expensive in both directions: overestimate and you needlessly abandon exhaustive testing for flaky sampling; underestimate and you write a generator that never terminates. The product rule decides it in one line, and this case study turns that line into a tool.

🔗 Connection. This is the flip side of case-study-01.md. There, a large count (a keyspace) was good news — it meant brute force was infeasible for an attacker. Here, a small count is good news — it means exhaustive testing is feasible for you. Same product rule; opposite desire. The §15.6 "combinatorial explosion" is the villain there and here.

Phase 1: Count the space (the feasibility verdict)

Build the counter first — it is the decision-maker. Model the input space as five independent choices; the count at each step does not depend on the earlier ones (you may pick any locale regardless of the theme), so the product rule applies cleanly.

from math import prod

def coverage(*value_counts):
    """Product rule: size of the full input space given the count of options per parameter."""
    return prod(value_counts)            # n1 * n2 * ... * nk

# render(theme, density, locale, dark_mode, retries): 3, 3, 4, 2, 4 options.
print(coverage(3, 3, 4, 2, 4))
# Expected output:
# 288

By hand: $3 \times 3 \times 4 \times 2 \times 4 = 9 \times 4 \times 2 \times 4 = 9 \times 32 = 288$. 288 combinations — eminently testable. You could run all of them in a unit-test suite in well under a second. The count is the verdict: small space → test everything.

Now watch the explosion. Suppose a sixth parameter is added: a user_id that is a full 32-bit integer.

print(coverage(3, 3, 4, 2, 4, 2**32))
# Expected output:
# 1236950581248

By hand: $288 \times 2^{32} = 288 \times 4{,}294{,}967{,}296 = 1{,}236{,}950{,}581{,}248 \approx 1.24 \times 10^{12}$. Over a trillion combinations — exhaustive testing is now hopeless. One independent 32-bit field multiplied the space by $2^{32}$ and pushed it past feasibility in a single stroke. That is the §15.6 combinatorial explosion: each independent option you add multiplies the space, so growth is exponential in the number of parameters.

⚠️ Common Pitfall: counting a constrained parameter as fully free. If retries were "0 to 3 but must be 0 when dark_mode is False," the choices would no longer be independent — the number of valid retries would depend on dark_mode, and the plain product $3\times3\times4\times2\times4$ would over-count. When a parameter's option count depends on another parameter, you must split into cases (Phase 3) rather than multiply. Always ask the product rule's question: does the count at this step depend on an earlier choice?

🔄 Check Your Understanding The team is considering dropping high_contrast (so theme has 2 values) and adding a 5th locale. Recompute the space size. Did the change help or hurt testability?

Answer

New counts: $2 \times 3 \times 5 \times 2 \times 4 = 240$. The space shrank slightly (288 → 240): losing a theme value divided by removed one option while gaining a locale added one, and $2\times5 = 10 < 3\times4 = 12$. Still trivially testable either way — both are far below any feasibility ceiling.

Phase 2: Generate the cases (the product rule as a constructor)

A count tells you testing is feasible; now produce the cases. Generating every combination is exactly forming the Cartesian product of the per-parameter value lists — the constructive form of the product rule. Python's itertools.product does this, but build it from scratch first so the mechanism is visible, then note the library equivalent.

def all_inputs(value_lists):
    """Cartesian product of the parameter value lists: every input combination.
    Built from scratch to show the product rule as a constructor."""
    result = [()]                         # one empty tuple: the empty product
    for options in value_lists:           # add one parameter at a time
        result = [combo + (v,) for combo in result for v in options]
    return result

params = [["classic", "modern"], ["en", "fr"], [True, False]]   # a 3-parameter slice
cases = all_inputs(params)
print(len(cases))
print(cases[0], cases[-1])
# Expected output:
# 8
# ('classic', 'en', True) ('modern', 'fr', False)

By hand, the count must be $2 \times 2 \times 2 = 8$ (the product rule), and indeed len(cases) == 8 — the generator and the counter agree, as they must, because they are the same rule. The first tuple pairs the first option of each list; the last pairs the last of each. The nested comprehension is the tree from §15.2: each pass through the loop branches every existing partial combination by the next parameter's options.

💡 Intuition: coverage and all_inputs are the product rule's two faces. coverage returns the number of leaves of the choice tree; all_inputs returns the leaves themselves. If they ever disagreed (len(all_inputs(...)) != coverage(...)), one of them would have a bug — a free, built-in consistency check you get from understanding why the product rule counts what it counts.

The standard-library one-liner, equivalent once you have seen it from scratch:

from itertools import product

cases = list(product(["classic", "modern"], ["en", "fr"], [True, False]))
print(len(cases))
# Expected output:
# 8

🔄 Check Your Understanding Without running it, how many tuples would all_inputs return for the full five-parameter render space of Phase 1? How do you know without generating them?

Answer

Exactly 288 — the same number coverage(3, 3, 4, 2, 4) returned, because the generator's output size is the product of the list lengths. You never have to generate the list to know its length; that is the entire point of counting without listing (§15.1).

Phase 3: Two input shapes — the sum rule

Real functions often accept inputs of different shapes. Suppose render can be called in two mutually exclusive modes: batch mode takes a theme (3) and a locale (4); stream mode takes a density (3), a dark_mode flag (2), and a retries value (4). A single call is in exactly one mode, so the two shapes are disjoint — the perfect setting for the sum rule (§15.3).

def union_space(*shape_sizes):
    """Sum rule: total inputs across mutually EXCLUSIVE input shapes (modes)."""
    return sum(shape_sizes)               # caller guarantees the shapes are disjoint

batch  = coverage(3, 4)                    # theme x locale  = 12
stream = coverage(3, 2, 4)                 # density x dark x retries = 24
print(batch, stream, union_space(batch, stream))
# Expected output:
# 12 24 36

By hand: batch $= 3 \times 4 = 12$; stream $= 3 \times 2 \times 4 = 24$; total $= 12 + 24 = 36$. This is the canonical sum of products: a sum over the disjoint modes, each mode counted as a product of its independent choices. To generate the union, concatenate the two Cartesian products — the constructive version of "add the disjoint cases."

all_calls = (all_inputs([["classic","modern","high_contrast"], ["en","fr","de","ja"]])
             + all_inputs([["compact","cozy","comfortable"], [True, False], [0,1,2,3]]))
print(len(all_calls))
# Expected output:
# 36

⚠️ Common Pitfall: overlapping "modes." The sum rule is valid here only because a call is in exactly one mode. If some call could be interpreted as both batch and stream (an overlap), adding $12 + 24$ would double-count those calls, and you would need inclusion-exclusion instead. Before you add shape counts, confirm no input belongs to two shapes — exactly the disjointness check the sum rule demands.

🔄 Check Your Understanding A third mode, debug mode, takes just a dark_mode flag (2 values) and is disjoint from the other two. What is the new total, and which rule justifies simply adding it?

Answer

$12 + 24 + 2 = 38$. The sum rule extends to any number of pairwise-disjoint cases, so a third disjoint mode just contributes its own product (here $2$) to the sum.

Phase 4: Sizing the precondition-violating inputs — the subtraction principle

Good test suites deliberately exercise invalid inputs, not just valid ones. Suppose render has a precondition: high_contrast requires dark_mode=True (high-contrast in light mode is unsupported). How many of the 288 combinations violate it, so you can route them to "expect an error" tests?

Counting the violators directly is fiddly; counting the complement is easy (§15.6). A combination is valid unless it has theme == "high_contrast" and dark_mode == False. The bad set is exactly those combinations, and we size it with the product rule by fixing the two offending fields and leaving the rest free:

Fix theme = "high_contrast" (1 way) and dark_mode = False (1 way).
The remaining fields are free: density (3), locale (4), retries (4).
Bad count $= 1 \times 1 \times 3 \times 4 \times 4 = 48$.

By the subtraction principle, the valid count is (universe) − (bad) $= 288 - 48 = 240$.

def violating_count(total_space, *fixed_field_overcounts):
    """Subtraction principle: bad = product of the free fields once offending fields are fixed."""
    bad = prod(fixed_field_overcounts)    # fix the offending fields, multiply the free ones
    return total_space - bad              # valid = universe - bad

# Bad: theme=high_contrast (1) x dark_mode=False (1) x density(3) x locale(4) x retries(4) = 48.
total = coverage(3, 3, 4, 2, 4)           # 288
print(48, total - 48, violating_count(total, 1, 1, 3, 4, 4))
# Expected output:
# 48 240 48 ... wait

That last line has a deliberate slip — fix it as you read. violating_count returns the valid count ($288 - 48 = 240$), not the bad count, so printing it alongside the literal 48 and total - 48 is inconsistent. The corrected call and output:

total = coverage(3, 3, 4, 2, 4)           # 288
bad = prod((1, 1, 3, 4, 4))               # 48 violating combinations
print(bad, total - bad)
# Expected output:
# 48 240

So 48 of the 288 combinations violate the precondition (route these to error-expecting tests) and 240 are valid (route these to normal assertions). You sized both test buckets with the product rule and the subtraction principle, without listing either. To generate the bad set, filter the full product by the precondition — the constructive complement.

🐛 Find the Error: The buggy line above printed three quantities, the last claiming to be the "bad" count but actually returning the valid count. State which value violating_count truly returns and rewrite the print so all three printed numbers are consistent and correct.

Answer

violating_count(total, 1, 1, 3, 4, 4) returns total - prod(1,1,3,4,4) = 288 - 48 = 240 — the valid count, not the bad count. The function name is misleading: it computes "everything that does not violate." A consistent print is print(bad, total - bad, violating_count(total, 1, 1, 3, 4, 4)) producing 48 240 240, or rename the function to valid_count to match what it returns. The lesson: name a function for what it returns (valid vs. violating), and let the counts cross-check each other.

🔄 Check Your Understanding Suppose a second precondition is added: "locale == "ja" requires density != "compact"." Sketch how you would count the combinations violating at least one of the two preconditions. Which rule joins the two "bad" sets?

Answer

Size each bad set with the product rule (fix the offending fields, multiply the free ones), then join them with inclusion-exclusion: $\lvert \text{bad}_1 \cup \text{bad}_2 \rvert = \lvert\text{bad}_1 \rvert + \lvert\text{bad}_2\rvert - \lvert\text{bad}_1 \cap \text{bad}_2\rvert$, because a single combination could violate both preconditions at once (the two bad sets overlap). Valid $=$ universe $- \lvert\text{bad}_1 \cup \text{bad}_2\rvert$.

Discussion Questions

coverage and all_inputs always agree on size by construction. Write the one-line assertion that checks this for any parameter set, and explain why a mismatch would necessarily indicate a bug in one of the two — never a "rounding" difference, the way it might for a floating-point computation.
Phase 1 showed one 32-bit field exploding the space to $\sim 10^{12}$. Real test tools use pairwise (all-pairs) testing — covering every pair of parameter values rather than every full combination — to tame the explosion. Why does covering all pairs require dramatically fewer cases than covering all combinations, and what kind of bug can pairwise testing miss? (Connect to the product rule: full coverage is a product; pairwise coverage is much smaller.)
In Phase 3 we added the two mode counts. Suppose a refactor merges batch and stream so a single call supplies all five parameters at once. Does the space size become $12 + 24 = 36$, or $3 \times 3 \times 4 \times 2 \times 4 = 288$? Explain which rule applies after the merge and why the answer changes so much.
The subtraction principle gave us 48 violating combinations. We could instead have counted "valid" directly by case analysis on theme. Sketch that direct count and argue why complement counting was simpler — tying back to the §15.6 advice "for 'at least one,' count the complement."
The generator materializes the entire list in memory. For 288 cases that is fine; for $10^6$ it is wasteful. How would you change all_inputs into a lazy generator (yielding one tuple at a time), and why does the count (coverage) remain valid and useful even when you can no longer afford to materialize the list?

Your Turn: Extensions

Option A (a feasibility verdict function). Write feasible(value_counts, ceiling=10**6) that returns the space size and a boolean "exhaustive testing feasible?" by comparing the product to a ceiling. Test it on the Phase-1 space (288, feasible) and the with-user_id space ($\sim10^{12}$, infeasible). This packages the chapter's central decision into one call.
Option B (three preconditions, inclusion-exclusion). Extend Phase 4 to three precondition rules. Size the "violates at least one" set with three-set inclusion-exclusion (singles − pairs + triple), being careful to size each intersection by fixing the union of the offending fields. Report valid = universe − that union.
Option C (lazy generator + counter cross-check). Rewrite all_inputs as a generator using yield, then write a test that counts the yielded items and asserts the count equals coverage(...) — computation (the count) confirming construction (the generator), in the spirit of theme four.

Key Takeaways

The product rule is a counter and a constructor. coverage sizes the input space; all_inputs builds it. They are the same rule, and their agreement is a free correctness check.
The count is the feasibility verdict. Small space → test everything; one extra independent field can multiply the space past feasibility (the combinatorial explosion of §15.6).
Different input shapes are a sum of products. Disjoint modes are added (sum rule); each mode is a product of its independent choices — but only if the shapes truly do not overlap.
Size invalid inputs with the subtraction principle. "Combinations that violate a precondition" is easiest as (universe) − (compliant), and fixing the offending fields turns the bad count into a plain product. Two preconditions need inclusion-exclusion.
Counting and building are the same skill. Whether you are sizing a keyspace to deny brute force (case-study-01.md) or sizing an input space to enable exhaustive testing, the four rules of this chapter are the whole toolkit.