Case Study 2: Quality Control and the CLT — How Factories Trust Sample Averages
The Scenario
Alex Rivera is in a meeting at StreamVibe when the engineering team presents a problem that has nothing to do with recommendation algorithms.
"Our streaming servers process video into different quality levels," explains Priya, the infrastructure lead. "Each transcode operation should take about 200 milliseconds on average. But lately, some users are reporting buffering. We need to figure out if the average processing time has actually increased, or if we're just seeing normal variation."
Priya pulls up a dashboard. It shows the processing times for the last 10,000 transcode operations: the distribution is right-skewed (most operations are fast, but some are slow), with a mean of 206 ms and a standard deviation of 85 ms.
"Is 206 ms significantly higher than our target of 200 ms?" Priya asks.
Alex recognizes this immediately. It's the same type of question Sam faces with Daria's shooting percentage — has something actually changed, or is this just random variation? And the CLT is the tool that makes the answer possible.
The Factory Analogy
Before we tackle StreamVibe's problem, let's look at the place where this kind of analysis was invented: the factory floor.
Walter Shewhart and the Birth of Statistical Process Control
In the 1920s, a physicist named Walter Shewhart was working at Bell Telephone Laboratories. The company manufactured telephone equipment, and they had a problem: how do you know if a manufacturing process is working properly?
Individual items vary — that's a fact of life. No two light bulbs are identical. No two phone handsets weigh exactly the same. Shewhart's insight was that this variation comes from two sources:
- Common cause variation: The natural, inherent randomness of the process. Even a well-tuned machine produces items that vary slightly. This variation is stable and predictable.
- Special cause variation: Something has gone wrong — a tool has worn down, a raw material batch is defective, a setting has drifted. This variation is unusual and signals a problem that needs fixing.
Shewhart realized that the CLT could distinguish between these two. Here's how.
Control Charts: The CLT in Action
Suppose a factory produces ball bearings with a target diameter of 10.00 mm. Under normal operating conditions, the population mean is $\mu = 10.00$ mm and the standard deviation is $\sigma = 0.05$ mm. The distribution of individual bearing diameters is slightly right-skewed (occasionally a bearing comes out a bit too large).
The quality engineer periodically takes a sample of $n = 25$ bearings and computes the sample mean diameter.
By the CLT: The sampling distribution of $\bar{x}$ is approximately normal with:

- Mean: $\mu_{\bar{x}} = 10.00$ mm
- Standard error: $\text{SE} = \sigma / \sqrt{n} = 0.05 / \sqrt{25} = 0.01$ mm

This means:

- 99.7% of sample means should fall within $\mu \pm 3\,\text{SE} = 10.00 \pm 0.03$ mm
- If a sample mean falls outside this range, something unusual has happened
Shewhart formalized this into control charts: graphs that plot sample means over time with horizontal lines at $\mu$ (the center line) and $\mu \pm 3\text{SE}$ (the upper and lower control limits).
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
mu = 10.00    # Target diameter (mm)
sigma = 0.05  # Population SD
n = 25        # Sample size for each batch
se = sigma / np.sqrt(n)  # Standard error

# Simulate 50 batches of 25 bearings each.
# First 35 batches: process is in control.
# Last 15 batches: process has shifted (tool wear).
n_batches = 50
shift_point = 35
sample_means = []
for i in range(n_batches):
    if i < shift_point:
        batch = np.random.normal(loc=mu, scale=sigma, size=n)
    else:
        # Process mean has shifted up by 0.02 mm (tool wearing down)
        batch = np.random.normal(loc=mu + 0.02, scale=sigma, size=n)
    sample_means.append(batch.mean())
sample_means = np.array(sample_means)

# Control chart
fig, ax = plt.subplots(figsize=(14, 6))

# Plot sample means
colors = ['steelblue' if i < shift_point else 'coral'
          for i in range(n_batches)]
ax.scatter(range(1, n_batches + 1), sample_means, c=colors,
           s=50, zorder=5, edgecolors='white')
ax.plot(range(1, n_batches + 1), sample_means, 'gray',
        alpha=0.3, linewidth=1)

# Center line and control limits
ax.axhline(mu, color='green', linewidth=2, label=f'Center line (μ = {mu})')
ax.axhline(mu + 3 * se, color='red', linewidth=1.5, linestyle='--',
           label=f'UCL = {mu + 3*se:.3f}')
ax.axhline(mu - 3 * se, color='red', linewidth=1.5, linestyle='--',
           label=f'LCL = {mu - 3*se:.3f}')
ax.axhline(mu + 2 * se, color='orange', linewidth=1, linestyle=':',
           alpha=0.7, label=f'±2 SE = {mu + 2*se:.3f}')
ax.axhline(mu - 2 * se, color='orange', linewidth=1, linestyle=':',
           alpha=0.7)

# Mark the shift
ax.axvline(shift_point + 0.5, color='purple', linewidth=1.5,
           linestyle='-.', alpha=0.6, label='Process shift occurs')

ax.set_xlabel('Batch Number', fontsize=12)
ax.set_ylabel('Sample Mean Diameter (mm)', fontsize=12)
ax.set_title('Control Chart for Ball Bearing Diameter\n'
             '(n = 25 per batch, CLT-based control limits)',
             fontsize=13, fontweight='bold')
ax.legend(loc='upper left', fontsize=9)
ax.set_ylim(mu - 5 * se, mu + 5 * se)
plt.tight_layout()
plt.show()

# Count out-of-control points
ooc_before = sum(1 for i, m in enumerate(sample_means)
                 if i < shift_point and abs(m - mu) > 3 * se)
ooc_after = sum(1 for i, m in enumerate(sample_means)
                if i >= shift_point and abs(m - mu) > 3 * se)
print(f"Out-of-control points before shift: {ooc_before}/{shift_point}")
print(f"Out-of-control points after shift: {ooc_after}/{n_batches - shift_point}")
```
Visual description (control chart): The chart shows 50 data points (sample means) plotted over time. A green horizontal line marks the target mean (10.000 mm). Two red dashed lines mark the upper and lower control limits (10.030 and 9.970 mm). For batches 1-35 (shown in blue), the sample means scatter randomly around the center line, all within the control limits — the process is in control. After batch 35 (marked by a purple dashed line), the points (shown in coral) shift upward. Several of the later points exceed the upper control limit, signaling that the process mean has changed. The shift is only 0.02 mm — invisible in individual measurements, but detectable in sample means thanks to the CLT reducing the standard error to 0.01 mm.
Why the CLT Matters Here
Without the CLT, Shewhart couldn't have built control charts. Here's what the CLT provides:
- Normality: Even though individual bearing diameters aren't perfectly normal, the sampling distribution of the mean IS approximately normal (for $n = 25$). This means the 68-95-99.7 rule applies to sample means.
- Known spread: The standard error $\sigma / \sqrt{n}$ tells us exactly how much variation to expect from batch to batch. When the process is in control, only about 0.3% of sample means fall more than 3 SE from the target, so a point outside that range is a strong signal of special cause variation.
- Detection power: By using samples of $n = 25$ rather than individual measurements, the relevant yardstick shrinks from $\sigma = 0.05$ mm to $\text{SE} = 0.01$ mm. This makes it possible to detect a shift of just 0.02 mm — a shift that would be invisible in individual measurements but stands out clearly in the control chart.
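Both numbers in this list can be verified directly. The sketch below (using the bearing example's parameters; the seed and simulation sizes are arbitrary choices) computes the 0.3% false alarm rate and the per-batch probability of catching the 0.02 mm shift, then checks each against a quick Monte Carlo simulation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n = 10.00, 0.05, 25
se = sigma / np.sqrt(n)  # 0.01 mm

# False alarm rate: chance an in-control sample mean lands beyond mu ± 3 SE
false_alarm = 2 * stats.norm.sf(3)  # ≈ 0.0027

# Detection probability per batch after a +0.02 mm shift (i.e., +2 SE):
# the sample mean is now centered at mu + 2 SE, so it exceeds the UCL
# whenever a standard normal exceeds 3 - 2 = 1.
detect = stats.norm.sf(1)  # ≈ 0.159

# Check both against simulated batches of n = 25 bearings
in_control = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
shifted = rng.normal(mu + 0.02, sigma, size=(100_000, n)).mean(axis=1)
print(f"False alarm rate: theory {false_alarm:.4f}, "
      f"simulated {np.mean(np.abs(in_control - mu) > 3 * se):.4f}")
print(f"Detection rate:   theory {detect:.4f}, "
      f"simulated {np.mean(shifted > mu + 3 * se):.4f}")
```

Note that any single shifted batch is caught only about 16% of the time, but across 15 shifted batches the chance of at least one out-of-control signal is roughly $1 - 0.84^{15} \approx 93\%$, which is why the shift shows up so clearly in the control chart.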
Back to StreamVibe
Now let's help Alex and Priya answer their question.
The Setup
- Target mean processing time: $\mu_0 = 200$ ms
- Observed sample: $n = 10{,}000$ transcode operations
- Observed mean: $\bar{x} = 206$ ms
- Observed standard deviation: $s = 85$ ms
The Analysis
Step 1: Calculate the standard error.
$$\widehat{\text{SE}} = \frac{s}{\sqrt{n}} = \frac{85}{\sqrt{10{,}000}} = \frac{85}{100} = 0.85 \text{ ms}$$
Step 2: How many standard errors from the target?
$$z = \frac{\bar{x} - \mu_0}{\widehat{\text{SE}}} = \frac{206 - 200}{0.85} = 7.06$$
Step 3: Interpret.
The observed mean is more than 7 standard errors above the target. Under the CLT, the probability of being this far from the target by pure chance is essentially zero — far less than one in a billion.
"This isn't random variation," Alex tells Priya. "The processing time has genuinely increased. The six-millisecond increase is small in absolute terms, but with ten thousand observations, the standard error is tiny — less than one millisecond. We're extremely confident the true average has shifted."
The Practical Question
But here's where statistical significance meets practical significance — a distinction we'll explore in Chapter 17.
"Okay, so it's real," Priya says. "But is 6 milliseconds actually a problem? Our buffering threshold is 500 ms. Most users won't notice 6 ms."
Alex nods. "You're right that 6 ms itself isn't causing buffering. But it might be a symptom of something bigger — maybe a subset of operations are taking much longer than usual, and they're pulling the mean up slightly while causing buffering for the users they affect."
They investigate further and discover that the distribution's right tail has gotten heavier — the 99th percentile of processing time has jumped from 420 ms to 510 ms. Those operations are causing the buffering. The small shift in the mean was a signal pointing to a much more specific problem.
```python
import numpy as np
from scipy import stats

# StreamVibe processing time analysis
n = 10_000
x_bar = 206
s = 85
mu_target = 200

se = s / np.sqrt(n)
z = (x_bar - mu_target) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

print(f"Sample mean: {x_bar} ms")
print(f"Target mean: {mu_target} ms")
print(f"Sample SD: {s} ms")
print(f"Standard error: {se:.2f} ms")
print(f"Z-score: {z:.2f}")
print(f"P-value: {p_value:.2e}")
print(f"\nInterpretation: The {x_bar - mu_target} ms increase is")
print(f"statistically significant (z = {z:.1f}, p ≈ 0),")
print(f"but investigate whether it's practically meaningful.")
```
Output:

```
Sample mean: 206 ms
Target mean: 200 ms
Sample SD: 85 ms
Standard error: 0.85 ms
Z-score: 7.06
P-value: 1.66e-12

Interpretation: The 6 ms increase is
statistically significant (z = 7.1, p ≈ 0),
but investigate whether it's practically meaningful.
```
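The tail-heaviness pattern that Alex and Priya uncovered can also be illustrated with synthetic data. In this sketch the lognormal parameters and the "slow operation" contamination are purely illustrative (they only roughly mimic a 200 ms workload, not StreamVibe's real traffic), but they show how a small fraction of much slower operations can nudge the mean by a few milliseconds while the 99th percentile jumps dramatically:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical right-skewed processing times (lognormal, mean ≈ 200 ms).
baseline = rng.lognormal(mean=5.24, sigma=0.35, size=10_000)

# Degraded workload: 2% of operations pick up an extra 200-400 ms delay.
degraded = baseline.copy()
slow_idx = rng.choice(degraded.size, int(0.02 * degraded.size),
                      replace=False)
degraded[slow_idx] += rng.uniform(200, 400, size=slow_idx.size)

for name, sample in [("baseline", baseline), ("degraded", degraded)]:
    print(f"{name:9s} mean = {sample.mean():6.1f} ms, "
          f"p99 = {np.percentile(sample, 99):6.1f} ms")
```

In a typical run the mean rises by only about 6 ms while the 99th percentile jumps by an order of magnitude more — the same signature the StreamVibe team found.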
Lessons from Manufacturing for Data Science
The parallel between Shewhart's factory and Alex's streaming platform reveals something profound: the same statistical framework — built on the CLT — applies everywhere.
The Universal Pattern
- Establish baseline: What does the process look like when it's working normally? (Factory: target diameter. StreamVibe: target latency.)
- Quantify normal variation: Use the CLT to determine how much sample statistics should vary. (Factory: $\text{SE} = \sigma / \sqrt{n}$. StreamVibe: same formula.)
- Set control limits: Define the boundaries of "normal variation." (Factory: $\mu \pm 3\text{SE}$. StreamVibe: same approach.)
- Monitor and detect: When a statistic falls outside the limits, investigate. (Factory: check tools, materials. StreamVibe: check servers, code changes.)
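The four steps above can be sketched as a single reusable check. This is a minimal illustration, not a production monitoring system; the function name and interface are invented for this example:

```python
import numpy as np

def clt_monitor(batch_means, mu_0, sigma, n, k=3.0):
    """Flag batches whose sample mean falls outside mu_0 ± k·SE.

    Encodes the four-step pattern: baseline (mu_0), normal variation
    (SE = sigma / sqrt(n)), control limits (± k SE), and detection
    (indices of batches that breach the limits).
    """
    se = sigma / np.sqrt(n)
    ucl, lcl = mu_0 + k * se, mu_0 - k * se
    means = np.asarray(batch_means, dtype=float)
    flags = (means > ucl) | (means < lcl)
    return ucl, lcl, np.flatnonzero(flags)

# The same call works for bearing diameters (mm) or latencies (ms):
ucl, lcl, alerts = clt_monitor([10.004, 9.991, 10.035, 10.002],
                               mu_0=10.00, sigma=0.05, n=25)
print(f"Limits: [{lcl:.3f}, {ucl:.3f}], flagged batches: {alerts}")
```

Here the third batch mean (10.035 mm) exceeds the upper control limit of 10.030 mm and is flagged; the rest fall inside the limits.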
This pattern shows up in:
- Healthcare: Monitoring hospital infection rates to detect outbreaks (Dr. Maya Chen's world)
- Finance: Detecting unusual trading patterns that might indicate fraud
- Tech: A/B testing, performance monitoring, anomaly detection
- Sports: Detecting when a player's performance has genuinely changed (Sam's world)
- Criminal justice: Monitoring whether an algorithm's error rates have shifted (Professor Washington's world)
AI and Automated Monitoring
Modern tech companies run automated systems that apply this CLT-based framework continuously. At StreamVibe, algorithms monitor thousands of metrics — latency, error rates, user engagement — and trigger alerts when sample means drift beyond control limits.
These systems are, fundamentally, automated control charts. They work because the CLT guarantees that sample means of these metrics follow predictable, normal patterns. When the pattern breaks, something has changed in the underlying process.
Professor Washington notes a parallel in criminal justice: some jurisdictions use similar monitoring to check whether algorithmic recommendations are drifting — for example, whether a risk assessment algorithm starts assigning systematically higher scores to a particular demographic group. The CLT provides the mathematical framework for deciding whether an observed shift is "real" or just noise.
Discussion Questions
- Shewhart used 3 standard errors (the "3-sigma rule") for control limits, accepting a 0.3% false alarm rate. Would you choose a wider or narrower limit in these contexts: (a) nuclear reactor monitoring, (b) social media engagement tracking, (c) pharmaceutical manufacturing? Justify your choices.
- The StreamVibe example showed that a statistically significant shift of 6 ms wasn't practically meaningful by itself but pointed to a real problem. How does this challenge the common belief that "statistically significant = important"?
- Why does the factory use samples of $n = 25$ bearings rather than measuring every single bearing? Consider both the cost argument and the statistical argument (what would happen to control limits if $n$ were much larger?).
- In Professor Washington's monitoring scenario, what are the ethical stakes of setting control limits too wide (missing a real shift in algorithmic bias) versus too narrow (constantly investigating false alarms)? How should those stakes influence the choice of 2-sigma vs. 3-sigma limits?
- Control charts assume the process is stable during the "in control" phase. What could go wrong if the baseline itself was established during a period of poor performance?