Case Study: Sizing the Search Space of a Combinatorial Test Suite
"Beware of bugs in the above code; I have only proved it correct, not tried it." — Donald E. Knuth
Executive Summary
A QA lead hands you a feature with eight independent configuration options and asks the obvious question: "How many tests do we need to be sure?" The honest first answer is a counting problem, and it is a brutal one — exhaustive testing of all combinations is astronomically expensive, while testing "one setting at a time" misses interactions that cause real outages. In this analysis-heavy case study you will count three different test strategies — exhaustive, one-factor-at-a-time, and pairwise (all-pairs) coverage — using permutations, combinations, and the product/sum rules, and turn each count into a feasibility verdict. You will not build a clever test generator; you will do the thing that comes before building one: estimate the size of what you are about to enumerate, and let the numbers decide the strategy. This is §16.6's "size the search space first" discipline applied to a decision a working engineer makes constantly.
Skills applied
- Classifying a real scenario by order? and repetition? and picking the right model (§16.3).
- Counting with the product rule, $\binom{n}{r}$, and $P(n, r)$, and composing them (§16.1–16.2).
- Estimating a brute-force search space and reading the size as feasible / infeasible (§16.6).
- Using a combinatorial count ($\binom{n}{2}$, then a refinement) to justify an engineering choice.
- Distinguishing "number of pairs to cover" from "number of tests needed," and seeing why they differ.
Background
The scenario
(Illustrative — the product and numbers are constructed for teaching.) You work on Orchestrate, a
deployment tool. Its deploy command takes eight independent options, each with a small set of allowed
values:
| # | Option | Allowed values | Count |
|---|---|---|---|
| 1 | --region |
us, eu, ap, sa | 4 |
| 2 | --tier |
free, pro, enterprise | 3 |
| 3 | --cache |
on, off | 2 |
| 4 | --tls |
on, off | 2 |
| 5 | --compression |
none, gzip, brotli | 3 |
| 6 | --replicas |
1, 2, 3 | 3 |
| 7 | --rollback |
auto, manual | 2 |
| 8 | --telemetry |
on, off | 2 |
A bug was just traced to a specific interaction: --compression brotli together with
--region ap corrupted a response. Single-option testing never caught it, because each option works
fine alone. The QA lead wants a principled answer to three questions:
- How many tests would exhaustive coverage (every combination of all eight options) require?
- How many would one-factor-at-a-time (OFAT) require, and what does it miss?
- How many pairs of option-values must a test suite cover to guarantee every two-way interaction is exercised at least once — and roughly how few tests can cover them all?
Each question is a count. Let us do them in order, with the reasoning made explicit.
💡 Intuition: "How many tests?" feels like a project-management question, but underneath it is pure combinatorics. The product rule sizes exhaustive testing; the sum rule sizes OFAT; and a combination $\binom{8}{2}$ sizes the interactions a smarter strategy must cover. Counting first is what turns a vague "we need more tests" into a defensible plan.
Why this matters
Configuration spaces explode. A handful of innocent-looking options multiply into a space no team can
test exhaustively, yet interaction bugs (like the brotli + ap one) live precisely in the
combinations OFAT skips. The empirical finding that motivates combinatorial testing is that a large
fraction of interaction faults are triggered by just two interacting options — so covering all pairs
catches most of them at a tiny fraction of the exhaustive cost. To argue for that strategy, you must
count: the exhaustive size (to show it is hopeless), and the pair count (to show the alternative is
cheap). That is this chapter's machinery doing real work.
There is a second reason this analysis is worth doing carefully rather than by gut feel. Engineers routinely underestimate configuration spaces, because the product rule's growth is not intuitive: eight options that each look "small" combine into a space nearly seven times larger than the naive "$2^8$ because there are 8 flags" guess (Phase 1 makes this precise). And engineers overestimate pairwise cost, imagining one test per pair, when in fact each test covers all $\binom{8}{2}$ option-pairs at once (Phase 4's central subtlety). Both errors are corrected by the same discipline — count, do not guess — and both corrections point the same way: exhaustive is more expensive than it looks, pairwise is cheaper than it looks. The combinatorics is not decoration on the engineering decision; it is the decision, and getting the counts right is what separates a defensible test plan from folklore.
The three counts we will compute correspond exactly to the three counting rules of Chapters 15–16. The exhaustive count is a pure product (independent simultaneous choices). The OFAT count is a pure sum (mutually exclusive single-option variations, added to a baseline). And the interaction count is a combination $\binom{8}{2}$ composed with products $v_i v_j$. If you can see which rule each strategy invokes, you can re-derive every number below from scratch — which is the real, transferable skill this case study is teaching.
Phase 1: Exhaustive coverage — the product rule
Exhaustive testing runs every combination of all eight options. The choices are independent (any region with any tier with any cache setting…), and for each option the number of choices is fixed regardless of the others — exactly the proviso the product rule requires (§16.1, building on Chapter 15). So multiply the per-option counts:
$$N_{\text{exhaustive}} = 4 \times 3 \times 2 \times 2 \times 3 \times 3 \times 2 \times 2.$$
Group the factors to keep the arithmetic honest. The four 2's give $2^4 = 16$. The factors $3, 3, 3$ give $27$. Then $4$ remains. So
$$N_{\text{exhaustive}} = 16 \times 27 \times 4 = 16 \times 108 = 1{,}728.$$
🔄 Check Your Understanding Recompute the product by grouping differently — $(4 \times 3) \times (2 \times 2) \times (3 \times 3) \times (2 \times 2)$ — and confirm you still get 1,728.
Answer
$(4 \times 3) = 12$, $(2 \times 2) = 4$, $(3 \times 3) = 9$, $(2 \times 2) = 4$. Then $12 \times 4 = 48$, $48 \times 9 = 432$, $432 \times 4 = 1{,}728$. The product rule does not care about grouping order — multiplication is associative and commutative — so the count is the same.
1,728 tests is borderline. If each test takes 30 seconds to spin up, run, and tear down, that is $1{,}728 \times 30 = 51{,}840$ seconds $\approx 14.4$ hours — too long for a per-commit gate, perhaps acceptable as a nightly job. But this is the small version of the problem. The next phase shows why "just add more options" makes exhaustive testing collapse.
⚠️ Common Pitfall: It is tempting to "estimate" the exhaustive count as "about $2^8 = 256$ because there are 8 options." That treats every option as binary. Five of the eight options have more than two values, and the product rule multiplies the actual per-option counts. The true figure, 1,728, is nearly 7× the binary guess. Always multiply the real domain sizes.
What happens as the feature grows
Suppose product growth adds four more options, each with 4 values. The exhaustive count is multiplied by $4^4 = 256$:
$$1{,}728 \times 256 = 442{,}368.$$
At 30 seconds each, that is $\approx 153$ days of continuous testing. Add a single option with a 32-bit integer value (e.g., a timeout in milliseconds) and the count is multiplied by $2^{32} \approx 4.3 \times 10^9$ — astronomically infeasible. This is the §16.6 lesson in miniature: exhaustive enumeration of a product space is fine for tiny spaces and hopeless the moment the space grows, no matter how fast your test runners are.
Phase 2: One-factor-at-a-time (OFAT) — the sum rule, and what it misses
OFAT picks one baseline configuration, then varies each option individually while holding the rest at baseline. Count it with the sum rule (Chapter 15): one test for the baseline, plus, for each option, one test per non-baseline value of that option.
For an option with $v_i$ values, there are $v_i - 1$ non-baseline values, hence $v_i - 1$ OFAT tests. Summing over the eight options and adding the single baseline test:
$$N_{\text{OFAT}} = 1 + \sum_{i=1}^{8}(v_i - 1).$$
The option value-counts are $4, 3, 2, 2, 3, 3, 2, 2$, so the $(v_i - 1)$ terms are $3, 2, 1, 1, 2, 2, 1, 1$, which sum to $13$. Therefore
$$N_{\text{OFAT}} = 1 + 13 = 14.$$
Fourteen tests versus 1,728 — OFAT is 123× cheaper. So why not stop here?
🐛 Find the Error. A teammate argues: "OFAT touches every option's every value at least once (check: each value appears in some test). So if every option value is individually exercised, we have full coverage." Where is the gap? Use the
brotli+apbug to make it concrete.
Answer
OFAT covers every single value but almost no pair of non-baseline values. With baseline
--region usand--compression none, the test that sets--compression brotlistill has--region us, and the test that sets--region apstill has--compression none. The combinationbrotliandapnever co-occurs in any OFAT test — so the interaction bug slips through. OFAT exercises each option alone; interaction bugs live in combinations of options, which OFAT structurally avoids. "Every value appears somewhere" is not "every pair of values appears together."
OFAT is cheap but blind to interactions. The exhaustive suite catches every interaction but is unaffordable as the feature grows. The interesting strategy lives between them — and counting tells us exactly how much interaction coverage costs.
Phase 3: Pairwise coverage — counting the interactions with $\binom{8}{2}$
The empirical case for pairwise (all-pairs) testing is that most interaction faults involve just two options. So the goal becomes: every pair of option-values must appear together in at least one test. Two counts are needed, and confusing them is the central subtlety of this case study.
Count A — how many pairs of options are there? Choosing 2 of the 8 options to interact is an unordered, no-repetition selection (the pair {region, compression} is the same as {compression, region}; an option does not pair with itself). That is a combination (§16.2):
$$\binom{8}{2} = \frac{8 \times 7}{2} = 28.$$
There are 28 pairs of options.
Count B — how many pairs of option-values must be covered? For each pair of options, every combination of their values is an interaction that must appear in some test. For options $i$ and $j$ with $v_i$ and $v_j$ values, that is $v_i \cdot v_j$ value-pairs (product rule). The total number of value-pairs to cover is the sum, over all 28 option-pairs, of $v_i \cdot v_j$:
$$N_{\text{value-pairs}} = \sum_{1 \le i < j \le 8} v_i \, v_j.$$
That double sum looks heavy, but a classic identity makes it a two-line computation. For any numbers,
$$\sum_{i < j} v_i v_j = \frac{\left(\sum_i v_i\right)^2 - \sum_i v_i^2}{2}.$$
💡 Intuition: why that identity holds. Square the sum: $\left(\sum_i v_i\right)^2$ expands into every ordered product $v_i v_j$, including the diagonal terms $v_i^2$ (where $i = j$) and each off-diagonal pair $v_i v_j$ twice (once as $(i,j)$, once as $(j,i)$). Subtract the diagonal $\sum_i v_i^2$ to remove the $i = j$ terms, then divide by 2 to collapse each unordered pair's two appearances into one. What remains is exactly $\sum_{i
simplify rather than to prove an identity.
Now plug in the value-counts $4, 3, 2, 2, 3, 3, 2, 2$:
- $\sum_i v_i = 4 + 3 + 2 + 2 + 3 + 3 + 2 + 2 = 21$, so $\left(\sum_i v_i\right)^2 = 441$.
- $\sum_i v_i^2 = 16 + 9 + 4 + 4 + 9 + 9 + 4 + 4 = 59$.
- Therefore $N_{\text{value-pairs}} = \dfrac{441 - 59}{2} = \dfrac{382}{2} = 191.$
So there are 191 value-pairs that a pairwise-complete suite must cover.
🔄 Check Your Understanding Sanity-check the identity on a tiny case: three options with value-counts $2, 2, 2$. Compute $\sum_{i
Answer
Directly: the option-pairs are (1,2), (1,3), (2,3), each contributing $2 \times 2 = 4$, total $12$. Formula: $\sum v_i = 6$, so $(\sum v_i)^2 = 36$; $\sum v_i^2 = 4 + 4 + 4 = 12$; then $\frac{36 - 12}{2} = \frac{24}{2} = 12$. They match.
Phase 4: From "pairs to cover" to "tests needed" — a counting lower bound
Here is the subtlety the QA lead must understand: 191 is the number of value-pairs to cover, not the number of tests. A single test fixes one value for each of the 8 options, so it covers one value-pair for every pair of options at once — that is, $\binom{8}{2} = 28$ value-pairs per test (one per option-pair). This is why pairwise testing is so efficient: each test is enormously reused across the 28 option-pairs.
That gives an immediate lower bound on the number of tests by the pigeonhole-style division argument (a preview of Chapter 17's flavor): if each test covers at most 28 of the 191 required value-pairs, then the number of tests is at least
$$\left\lceil \frac{191}{28} \right\rceil = \lceil 6.82\ldots \rceil = 7.$$
So no pairwise-complete suite can have fewer than 7 tests. A sharper, standard lower bound notes that
the two largest options must have every combination of their values appear, and no single test can
cover two different value-pairs of the same option-pair; with --region (4 values) and any 3-value
option, you need at least $4 \times 3 = 12$ tests to cover that one option-pair's value-pairs. So the
true minimum is at least 12. In practice, greedy all-pairs generators produce a suite of roughly
15–20 tests for this configuration — astonishingly close to the OFAT cost (14), yet covering every
two-way interaction, including brotli + ap.
Tabulating the three strategies makes the decision obvious:
| Strategy | Count | What it guarantees | Verdict |
|---|---|---|---|
| Exhaustive | 1,728 tests | every combination of all 8 options | catches everything; too slow per-commit, explodes as feature grows |
| OFAT | 14 tests | every single value exercised | cheap; misses all interactions (would not catch the bug) |
| Pairwise | ~15–20 tests | every pair of values co-occurs | cheap and catches all 2-way interaction bugs |
The counts settle the argument. Pairwise testing costs about what OFAT costs (because each test is reused across all $\binom{8}{2} = 28$ option-pairs) while delivering the interaction coverage OFAT cannot. The QA lead now has a defensible, quantitative reason to adopt all-pairs.
🔗 Connection. The leap from "191 pairs to cover" to "but each test covers 28 of them, so at least 7 tests" is a counting lower bound — you bound the answer by dividing the work by the maximum any one step can do. That is precisely the reasoning behind the pigeonhole principle (Chapter 17) and behind the algorithmic lower bounds first met in Chapter 14: a count of the cases an algorithm must distinguish forces a floor on how much work it can take. Counting does not just size brute force; it proves that no method can do better than a certain bound.
Phase 5: How each strategy scales — the analytical punchline
The counts above are for one feature. The deeper insight is how each strategy grows as the feature gains options, because that is what decides whether a strategy survives the product's lifetime. Suppose, to make the growth visible, that all $m$ options have roughly the same number of values $v$. Then:
- Exhaustive is $v^m$ — the product of all option sizes. It grows exponentially in the number of options. Each new option multiplies the count by $v$. This is the §16.6 combinatorial explosion: no hardware outruns it.
- OFAT is $1 + m(v-1) \approx mv$ — linear in the number of options. Cheapest of all, but it covers no interactions.
- Pairwise value-pairs to cover is $\sum_{i
quadratic in the number of options (from the $\binom{m}{2}$ option-pairs). And because each test covers $\binom{m}{2}$ value-pairs at once, the number of tests a good generator needs grows only about as $v^2 \log m$ — far, far below exhaustive.
That contrast — exhaustive exponential in $m$, pairwise roughly quadratic in $m$ — is the entire quantitative case for combinatorial testing, and it is a direct reading of the combination counts. The binomial coefficient $\binom{m}{2}$ is doing the heavy lifting: it is the number of two-way interactions, it grows polynomially (degree 2) in the number of options (§16.6's "fixed $k$ → polynomial"), and it is also the per-test coverage. Counting tells you not just "how big is this feature's test space" but "which strategy will still be affordable when the feature doubles."
💡 Intuition: why pairwise is the sweet spot. OFAT ($\approx mv$) is cheapest but blind; exhaustive ($v^m$) is complete but explodes. Pairwise sits in between by a deliberate compromise: cover all two-way interactions (the ones that empirically cause most bugs) and ignore the higher-order ones. The cost of that compromise is governed by $\binom{m}{2}$ — quadratic, not exponential — which is why it stays affordable as $m$ grows. If three-way bugs mattered, you would pay $\binom{m}{3}$ (cubic), and so on; the order of the interaction is the degree of the polynomial you pay.
A second instance, to see the rule of thumb fire
Apply the same analysis to a different feature — a message queue with options of value-counts
$5, 5, 4, 3, 2$ (five options). Exhaustive: $5 \times 5 \times 4 \times 3 \times 2 = 600$. OFAT:
$1 + (4 + 4 + 3 + 2 + 1) = 1 + 14 = 15$. Value-pairs to cover: $\sum v_i = 19$, $(\sum v_i)^2 = 361$,
$\sum v_i^2 = 25 + 25 + 16 + 9 + 4 = 79$, so $\frac{361 - 79}{2} = \frac{282}{2} = 141$. Option-pairs:
$\binom{5}{2} = 10$, so a crude lower bound on pairwise tests is $\lceil 141/10 \rceil = 15$, and the
two largest options force at least $5 \times 5 = 25$ tests. Even at 600, this feature's exhaustive suite
is feasible (small $m$) — the rule of thumb correctly predicts that exhaustive only dies once $m$ is
large or some option's $v$ is huge. The discipline is the same every time: compute $v^m$ (exhaustive),
$\sum_{i Here is a tiny program that computes the three strategy sizes from the value-count list, so the analysis
above is reproducible for any feature, not just this one: Every number matches the hand derivations: exhaustive 1,728 (Phase 1), OFAT 14 (Phase 2), value-pairs
191 and option-pairs $\binom{8}{2} = 28$ (Phase 3). The function is general — feed it any feature's
value-counts and it sizes all three strategies, which is exactly the reusable artifact a counting
analysis should leave behind.from math import comb
def strategy_sizes(value_counts):
"""Return (exhaustive, ofat, value_pairs_to_cover) for a list of per-option value counts."""
n = len(value_counts)
exhaustive = 1
for v in value_counts:
exhaustive *= v # product rule
ofat = 1 + sum(v - 1 for v in value_counts) # baseline + (v-1) per option (sum rule)
s1 = sum(value_counts) # sum of v_i
s2 = sum(v * v for v in value_counts) # sum of v_i^2
value_pairs = (s1 * s1 - s2) // 2 # sum_{i<j} v_i v_j
return exhaustive, ofat, value_pairs, comb(n, 2)
print(strategy_sizes([4, 3, 2, 2, 3, 3, 2, 2]))
# Expected output:
# (1728, 14, 191, 28)
Discussion Questions
Your Turn: Extensions
strategy_sizes (by hand or in your head for the small parts). Report the
exhaustive count and the value-pairs-to-cover, and decide which strategy you would mandate.strategy_sizes to also return the number of
value-triples to cover, $\sum_{iKey Takeaways