Case Study: Auditing a Telemetry Pipeline Against Its Theoretical Limits

DataField.Dev

Case Study: Auditing a Telemetry Pipeline Against Its Theoretical Limits

"You cannot manage what you cannot measure — and information, it turns out, is measurable." — paraphrasing a maxim often attributed to management folklore, made literal by Shannon

Executive Summary

A fleet-telemetry team is paying too much for bandwidth and wants to know whether the fault is their code or the laws of physics. You will play the role of the engineer brought in to audit two stages of their pipeline: a compressor that shrinks event logs before sending, and a noisy radio link that sometimes corrupts the bytes in flight. Using nothing past this chapter, you will compute the information-theoretic limits — Shannon entropy (§40.2) for the compressor and channel capacity (§40.2) for the link — and compare them to what the team actually achieves. The verdict is diagnostic: where the system sits far from its limit, there is engineering work to do; where it sits at the limit, no amount of cleverness will help and the team should stop trying. Along the way you will see why "recognize the problem's structure first" is the whole skill this book trained, and you will name two of the team's other headaches as a combinatorial-optimization problem and a functor law in disguise.

This is an analysis-heavy study: you will compute, compare, and decide, not build a new system.

Skills applied

Computing Shannon entropy and reading it as a hard floor on compression (§40.2).
Computing binary-symmetric-channel capacity $C = 1 - H(p)$ and reading it as a hard wall on throughput (§40.2).
Using the gap between achieved performance and a theoretical limit to localize an engineering problem.
Recognizing two unrelated-looking subproblems as combinatorial optimization (§40.1) and a functor law (§40.4).

Background

The scenario

Tier 3 — this is a constructed teaching scenario. "Fleetwise" runs a few thousand delivery vehicles, each emitting a steady stream of status events over a cellular/radio uplink. Bandwidth is metered, and the bill is climbing. Two complaints landed on your desk:

"Our compressor barely helps." Each event is one of four status codes — OK, IDLE, WARN, FAULT — and the team currently sends them as raw 8-bit ASCII-ish tags, but even after their homegrown compressor the stream averages about 1.9 bits per event. Is that good? Could a better compressor do meaningfully better, or are they near the floor?
"The link keeps corrupting data." The radio link flips bits. Measured over a long window, each transmitted bit is independently flipped with probability $p = 0.11$. The team is sending at a code rate of $0.80$ information bits per channel bit and still seeing occasional uncorrectable frames. Are they asking the impossible, or is there a code that would fix it?

Your job is not to write a compressor or a code. It is to compute the limits and tell the team, with proof-grade certainty, which complaint is a bug (fixable) and which is a law (not).

💡 Intuition: Information theory gives you two walls. One is below you — the entropy floor: you cannot compress a source below its entropy, on average. One is to the side — the capacity wall: you cannot push information through a noisy channel faster than its capacity, reliably. An audit is just measuring how close the system stands to each wall.

Why this matters

Every storage and networking decision you will ever make lives between these two walls. Knowing where the walls are turns "let's try a fancier algorithm and hope" into "the floor is $1.75$ bits; we're at $1.9$; there's $0.15$ to win and not a bit more." That is the difference between an engineer who guesses and one who knows the budget. It is theme four of this book — computation and proof are complementary — applied to a bandwidth bill: code measures what you achieve, theory proves what is achievable.

Phase 1: The compressor — find the entropy floor

First we need the source distribution. The team's logs over a representative day show the four status codes occur with these frequencies:

Status	Probability
`OK`	$1/2$
`IDLE`	$1/4$
`WARN`	$1/8$
`FAULT`	$1/8$

This is exactly the four-symbol source of §40.2, which is convenient — we can check our arithmetic against the chapter. The entropy is the floor on average bits per event for any lossless code:

$$H = -\left(\tfrac12 \log_2 \tfrac12 + \tfrac14 \log_2 \tfrac14 + \tfrac18 \log_2 \tfrac18 + \tfrac18 \log_2 \tfrac18\right).$$

Using $\log_2 \tfrac12 = -1$, $\log_2 \tfrac14 = -2$, $\log_2 \tfrac18 = -3$:

$$H = \tfrac12(1) + \tfrac14(2) + \tfrac18(3) + \tfrac18(3) = \tfrac12 + \tfrac12 + \tfrac38 + \tfrac38 = \tfrac{14}{8} = 1.75 \text{ bits/event.}$$

from math import log2

def entropy(probs):
    """Shannon entropy in bits. Assumes each p > 0 and the probs sum to 1."""
    return -sum(p * log2(p) for p in probs)

status_dist = [1/2, 1/4, 1/8, 1/8]   # OK, IDLE, WARN, FAULT
print(round(entropy(status_dist), 3))
# Expected output:
# 1.75

The floor is 1.75 bits per event. The team achieves 1.9. So there is room — but only $0.15$ bits per event, about an 8% reduction. We can even hand the team the optimal code: a Huffman code (Chapter 31) for this distribution is OK → 0, IDLE → 10, WARN → 110, FAULT → 111, whose average length is

$$\tfrac12(1) + \tfrac14(2) + \tfrac18(3) + \tfrac18(3) = 1.75 \text{ bits/event,}$$

hitting the entropy floor exactly (because every probability is a power of $1/2$ — see §40.2's Connection callout). The team's 1.9 is not a law; it is a code that hasn't reached the floor.

🔄 Check Your Understanding Suppose a new firmware made all four statuses equally likely ($1/4$ each). What is the new entropy, and would the team's homegrown compressor look better or worse against that floor?

Answer

Uniform over four outcomes gives $H = \log_2 4 = 2$ bits/event — the maximum for four symbols. The floor rises to 2, so a fixed 2-bit code would now be optimal and the team's 1.9 would be impossible (you cannot beat 2 here losslessly). Against the original skewed source, though, the floor is 1.75 and 1.9 is merely suboptimal. The lesson: the floor depends on the distribution, not just the alphabet size.

Verdict on complaint 1: a bug (or at least leftover slack). The floor is 1.75; the team is at 1.9; an optimal prefix code closes the gap. Recommend replacing the homegrown compressor with a Huffman (or arithmetic) coder. Expected win: ~8%.

Phase 2: The link — find the capacity wall

Now the harder, more interesting verdict. The radio link is a binary symmetric channel that flips each bit independently with probability $p = 0.11$. Its capacity — the maximum information bits per channel bit that can ever be sent with vanishing error — is $C = 1 - H(p)$, where $H(p)$ is the binary entropy function.

First compute the binary entropy at $p = 0.11$:

$$H(0.11) = -0.11\log_2 0.11 - 0.89 \log_2 0.89.$$

We have $\log_2 0.11 \approx -3.184$ and $\log_2 0.89 \approx -0.168$, so

$$H(0.11) \approx -0.11(-3.184) - 0.89(-0.168) \approx 0.350 + 0.150 = 0.500.$$

Therefore

$$C = 1 - H(0.11) \approx 1 - 0.500 = 0.500 \text{ information bits per channel bit.}$$

from math import log2

def binary_entropy(p):
    """Binary entropy H(p) in bits; H(0)=H(1)=0 by the 0*log0=0 convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def capacity_bsc(p):
    """Capacity of a binary symmetric channel: C = 1 - H(p)."""
    return 1 - binary_entropy(p)

print(round(binary_entropy(0.11), 3))
print(round(capacity_bsc(0.11), 3))
# Expected output:
# 0.5
# 0.5

Now the punchline. The capacity is about 0.50, but the team is transmitting at a code rate of 0.80. Shannon's noisy-channel coding theorem (§40.2) is two-sided: below capacity, some code achieves arbitrarily small error; above capacity, no code can, however clever. The team is running at $0.80 > 0.50 = C$. They are not fighting a weak code — they are fighting a theorem.

💡 Intuition: Sending at rate $0.80$ over a channel of capacity $0.50$ is like trying to pour $0.80$ liters per second through a pipe that physically passes $0.50$. The overflow is not a plumbing defect you can engineer away; the pipe's cross-section is the capacity, and you set it the moment you chose a channel with $p = 0.11$.

Verdict on complaint 2: a law, not a bug. At $p = 0.11$ the wall is at rate $0.50$, and $0.80$ is on the wrong side of it. No error-correcting code will rescue a rate above capacity. The honest options are structural, not algorithmic:

Lower the rate below $0.50$ (send more redundancy per information bit) — then a good code (LDPC, Reed–Solomon, §40.2) can drive errors down.
Lower $p$ — improve the physical link (more power, better antenna, lower bit rate per symbol) so capacity rises. Halving $p$ to $\approx 0.055$ pushes capacity up toward $\sim 0.69$, which might make even rate $0.80$... still infeasible (check: $H(0.055)\approx 0.31$, $C\approx 0.69 < 0.80$), so they would need both a lower rate and a better link.

This is the most valuable thing an audit can deliver: stop optimizing the code. The team was about to spend a quarter tuning error-correction parameters that cannot possibly work above capacity.

🔗 Connection: Notice the asymmetry in our two verdicts. The entropy floor was slack the team left on the table (fixable). The capacity wall was a limit the team had walked through (not fixable by code). Both came from the same field — information theory — and both required only the two formulas $H = -\sum p_i\log_2 p_i$ and $C = 1 - H(p)$. Two formulas, two go/no-go engineering decisions.

Phase 3: Two more headaches, recognized

While auditing, you overhear two other complaints. Neither is "information theory," but both are something you now recognize — which is the entire point of the chapter.

Headache A: "Assigning trucks to depots is exploding."

The dispatch team assigns each of $n$ trucks to one of $n$ depots to minimize total drive-in cost, one truck per depot. They "tried all assignments" and it was fine at $n = 5$ but hung at $n = 30$.

You recognize this instantly: it is combinatorial optimization (§40.1) — specifically an assignment problem, the search for a minimum-cost perfect matching. "Try all assignments" enumerates $n!$ permutations: $5! = 120$ (trivial) but $30! \approx 2.65 \times 10^{32}$ (hopeless — more than the counting bounds of Chapters 15–17 allow anyone to enumerate). The standard attack is to model it as an integer linear program (or use the polynomial-time Hungarian algorithm, a matching method from the Chapter 34 family). With variables $x_{ij} \in \{0,1\}$ for "truck $i$ at depot $j$":

$$\text{minimize} \sum_{i,j} c_{ij}\,x_{ij} \quad \text{s.t.} \quad \sum_j x_{ij} = 1\ (\forall i), \quad \sum_i x_{ij} = 1\ (\forall j), \quad x_{ij} \in \{0,1\}.$$

The recommendation writes itself: don't enumerate; model it as an LP/matching problem. That is §40.1's "recognize the genre, apply the standard attack" in one move.

Headache B: "Our data transform pipeline runs the same loop twice."

A data engineer maps a parse function f over a million records, then maps a normalize function g over the result — two full passes — and asks whether it is safe to fuse them into one pass that applies g(f(x)).

You recognize this too: it is the functor composition law (§40.4). Mapping g after mapping f equals mapping the single function "do f, then g" — map(g, map(f, xs)) == map(lambda x: g(f(x)), xs) — for every f, g, and list, because list/map is a functor and $F(g \circ f) = F(g) \circ F(f)$. The fusion is map fusion, a sound optimization justified by category theory, not a lucky coincidence of this pipeline.

f = lambda x: x + 1      # stand-in for "parse"
g = lambda y: y * 10     # stand-in for "normalize"
xs = [1, 2, 3]

two_passes = list(map(g, map(f, xs)))            # current pipeline
one_pass   = list(map(lambda x: g(f(x)), xs))    # fused
print(two_passes, one_pass, two_passes == one_pass)
# Expected output:
# [20, 30, 40] [20, 30, 40] True

Recommendation: fuse the passes — the functor law guarantees identical output. You just saved a full pass over a million records with a one-line argument from §40.4.

🔄 Check Your Understanding The data engineer worries: "What if f and g have side effects, like logging?" Does map fusion still preserve the output list? Does it preserve the number of log lines?

Answer

It preserves the output list (the functor law is about returned values). It does not necessarily preserve observable side effects: two passes might log in a different interleaving or count than one fused pass. The functor laws are stated for pure functions; side effects live outside that guarantee — which is exactly why functional programming pushes effects into monads (§40.4) where their sequencing is controlled. A fair caution to give the engineer.

Discussion Questions

The entropy verdict said "8% of slack remains." Suppose squeezing out that 8% requires switching from a simple byte-aligned code to a bit-packed arithmetic coder that is harder to debug. Using the numbers, argue for or against doing it — and contrast that with the capacity verdict, where the numbers forbid the work outright. What does this say about how a limit should change a roadmap?
We computed capacity for a binary symmetric channel. Real radio links are not exactly symmetric. In one or two sentences, why is $C = 1 - H(p)$ still a useful first audit even when the model is imperfect? (Hint: which direction does a more realistic, noisier model push the wall?)
Headache A was "minimize cost," an assignment problem. If the dispatch team instead wanted to know the value of adding one more depot (a "what is one more unit of capacity worth?" question), which §40.1 concept answers it, and what is that quantity called?
All three of your verdicts — entropy floor, capacity wall, and the assignment-problem genre — are forms of the same meta-move. State that meta-move in one sentence, and connect it to the chapter's closing advice ("ask what is this, structurally?").

Your Turn: Extensions

Option A (re-audit a new fleet). A second fleet's statuses are $\{0.4, 0.3, 0.2, 0.1\}$ and its compressor achieves $2.0$ bits/event. Compute the entropy floor by hand and with the entropy function above, and report how much slack remains. (You'll find the floor is below 2, so there is real room — quantify it.)
Option B (capacity sweep). Tabulate $C = 1 - H(p)$ for $p \in \{0.01, 0.05, 0.11, 0.25, 0.5\}$ using capacity_bsc, and find the largest $p$ for which a rate-$0.5$ system is still below capacity (and thus rescuable by a good code). Hand-derive each value's expected output.
Option C (the certificate angle). The assignment problem in Headache A is an LP. Read §40.1's weak duality argument and explain, in a short paragraph, how a dual solution to the assignment LP would let dispatch prove their chosen assignment is optimal — not just believe it. (This is the optimality-certificate pattern; you don't need to compute the dual, just explain its role.)

Key Takeaways

Two formulas localize two engineering decisions. Entropy $H = -\sum p_i\log_2 p_i$ sets the compression floor; capacity $C = 1 - H(p)$ sets the throughput wall. Each turns "try harder" into a precise go/no-go.
A gap to the floor is fixable; a breach of the wall is not. The team's compressor had 8% of slack (fix it); its link ran above capacity (no code can save that — change the rate or the channel).
The highest-value audit output is "stop." Telling the team to stop tuning a code that cannot work above capacity saved more than any tuning would have.
Recognition is the skill. An assignment headache was combinatorial optimization; a double-pass pipeline was a functor law. Naming the structure — §40.1, §40.4 — delivered the fix in one move each.