Case Study: Auditing a Social Graph Before You Trust It
"All models are wrong, but some are useful." — George E. P. Box
Executive Summary
A data engineer hands you a CSV export of a small social network — a list of friendships — and asks you to "do some analysis." Before you compute a single fancy metric, you will audit the graph: turn the raw edge list into a precise $G = (V, E)$, sanity-check it with the handshaking lemma, find the most-connected people by degree, decide whether the network is in one piece or several components, measure degrees of separation between specific people along shortest paths, and use a cheap invariant to detect two "different" exported snapshots that are actually the same graph relabeled. Every step is pure Chapter 27 vocabulary doing real work. The lesson — Theme 2 of this book — is that a metric computed on a graph you have not validated is a confident-looking lie.
This is the anchor example for all of Part V (see §27.6): the graph you audit here is exactly the kind of object Chapter 28 will run breadth-first search on, Chapter 29 will weight, and Chapter 33 will color into communities.
Skills applied
- Translating a raw edge list into a formal graph $G = (V, E)$ and an adjacency dictionary.
- Using the handshaking lemma ($\sum \deg = 2\lvert E\rvert$) and its parity corollary as a data integrity check.
- Computing degree, neighborhoods, and the most-connected vertex.
- Finding connected components and reasoning about disconnection (cut vertices).
- Measuring shortest-path length ("degrees of separation") by hand on a small graph.
- Using an isomorphism invariant (degree sequence + component structure) to deduplicate snapshots.
Background
The scenario
You join a small startup that runs a study-buddy app. The data engineer exports the friendship table as a list of unordered pairs (a friendship is mutual, so each line is one undirected edge):
ana,ben
ana,cam
ben,cam
cam,dev
dev,eve
eve,fin
fin,dev
gus,hal
Eight friendships among the users ana, ben, cam, dev, eve, fin, gus, hal. The product manager wants
three things: "who are our most-connected users (for a beta invite)?", "is everyone reachable from
everyone — is our network one community or several?", and "how many degrees of separation are ana and
hal?" You will answer all three — but first you validate the export, because a metric on bad data is
worse than no metric.
💡 Intuition: An edge list is the rawest graph format — just pairs. Your first job is always to lift it into a structure you can ask questions of: the vertex set $V$, the edge set $E$, and an adjacency dictionary $N(\cdot)$. Until you do, "the graph" is just text.
Why this matters
Real social-graph pipelines ingest edge lists by the billion, and they routinely contain duplicate rows, self-loops (a user "friending" themselves through a bug), and dangling references. The cheap invariants of Chapter 27 — is the degree sum even? does the edge count match? are there self-loops? — are the first line of defense, and they cost nothing. Skipping them is how a "viral growth" dashboard ends up double-counting every friendship.
Phase 1: From edge list to a formal graph
First, lift the text into sets, exactly as §27.1 prescribes. The vertices are the distinct names; each
edge is an unordered pair, which we model as a frozenset so that {"ana","ben"} and {"ben","ana"}
are the same edge.
edges_raw = [("ana","ben"), ("ana","cam"), ("ben","cam"), ("cam","dev"),
("dev","eve"), ("eve","fin"), ("fin","dev"), ("gus","hal")]
V = set()
E = set()
for u, v in edges_raw:
V.add(u); V.add(v)
E.add(frozenset({u, v}))
print(len(V), "vertices,", len(E), "edges")
# Expected output:
# 8 vertices, 8 edges
We hand-trace: the names that appear are ana, ben, cam, dev, eve, fin, gus, hal — eight distinct
vertices. Eight raw rows, all distinct unordered pairs, give eight edges. So
$$V = \{\text{ana, ben, cam, dev, eve, fin, gus, hal}\}, \qquad \lvert V\rvert = 8, \quad \lvert E\rvert = 8.$$
Using frozenset would also silently absorb a duplicate row — if the export listed ben,ana and
ana,ben, the set E would still hold one edge. That is the deduplication property from §27.1 (Check
Your Understanding #2) doing data hygiene for free.
🔄 Check Your Understanding If the export had accidentally included the row
cam,cam, what wouldfrozenset({"cam","cam"})evaluate to, and why is that a red flag for a simple graph?
Answer
frozenset({"cam","cam"})isfrozenset({"cam"})— a one-element set, not a two-element edge. A simple graph forbids loops $\{v,v\}$ (§27.1), so any one-element "edge" signals a self-loop bug in the data; you would flag and drop it (or escalate to a multigraph deliberately).
Phase 2: The handshaking lemma as an integrity check
Now the cheapest, most powerful validation in graph theory. Build the adjacency dictionary and check that $\sum_v \deg(v) = 2\lvert E\rvert$ (§27.4). If the two sides disagree, the export — or your ingestion code — is internally inconsistent, and you stop before computing anything else.
adj = {v: set() for v in V}
for e in E:
u, w = tuple(e) # an edge has exactly two endpoints
adj[u].add(w)
adj[w].add(u)
degree_sum = sum(len(nbrs) for nbrs in adj.values())
print("sum of degrees =", degree_sum)
print("2 * |E| =", 2 * len(E))
print("consistent? ", degree_sum == 2 * len(E))
# Expected output:
# sum of degrees = 16
# 2 * |E| = 16
# consistent? True
Hand-derive each side. First the degrees, straight from the edge list:
| Vertex | Neighbors | Degree |
|---|---|---|
| ana | ben, cam | 2 |
| ben | ana, cam | 2 |
| cam | ana, ben, dev | 3 |
| dev | cam, eve, fin | 3 |
| eve | dev, fin | 2 |
| fin | eve, dev | 2 |
| gus | hal | 1 |
| hal | gus | 1 |
Sum of degrees $= 2+2+3+3+2+2+1+1 = 16$. And $2\lvert E\rvert = 2 \times 8 = 16$. They match, so the
export is internally consistent and we may proceed. We also confirm the parity corollary: the only
odd-degree vertices are gus (1) and hal (1) — exactly two, an even count, as §27.4 guarantees.
Two free integrity checks, both green.
Now imagine the engineer had accidentally exported cam,dev twice. The set E would still hold the
edge once (a frozenset deduplicates), so len(E) stays 8 — but a buggy ingestion loop that added
to adj once per raw row would record an extra cam↔dev link, pushing the degree sum to 18 while
$2\lvert E\rvert$ stayed 16. The check degree_sum == 2 * len(E) would print False, and you would
catch the double-count before it inflated a "viral growth" dashboard. That is the lemma earning its
keep as a one-line data test.
⚠️ Common Pitfall: A failed handshaking check almost never means "the lemma is wrong" — the lemma is a theorem, true of every graph. It means your data or your counting code is wrong: a duplicated row counted once in $E$ but twice in the degree loop, a self-loop, or an edge inserted in only one direction in
adj. Treat a handshaking mismatch as a stack trace pointing straight at your ingestion code.
We also check the parity corollary: the odd-degree vertices are gus (1) and hal (1) — exactly
two of them, an even count, as §27.4 guarantees. Another quiet confirmation that the export is sane.
Phase 3: Most-connected users (degree ranking)
The product manager's first question — "who should get the beta invite?" — is, in its crudest form, "who has the highest degree?" Degree is the first proxy for influence (§27.2).
degree = {v: len(nbrs) for v, nbrs in adj.items()}
ranked = sorted(degree, key=degree.get, reverse=True)
print("degrees:", dict(sorted(degree.items())))
print("top 2 by degree:", ranked[:2])
# Expected output:
# degrees: {'ana': 2, 'ben': 2, 'cam': 3, 'dev': 3, 'eve': 2, 'fin': 2, 'gus': 1, 'hal': 1}
# top 2 by degree: ['cam', 'dev']
Hand-check: from the table in Phase 2, cam and dev are the only degree-3 vertices, so they top the
ranking (ties broken arbitrarily by the sort). They are the natural beta candidates — each sits at a
junction of the network.
🔗 Connection. Degree is the simplest centrality measure. Smarter notions — betweenness (who lies on the most shortest paths?), and PageRank-style influence — build on the path and neighbor language of this chapter. Degree is where every centrality story starts (§27.6).
Phase 4: Is the network one community? (connected components)
The manager's second question — "is everyone reachable from everyone?" — is precisely "is the graph
connected?" (§27.2). We trace reachability by hand. Start from ana and see whom we can reach by
walking edges:
ana → {ben, cam} → cam reaches dev → dev reaches {eve, fin} → eve, fin reach each other.
So from ana we reach $\{$ana, ben, cam, dev, eve, fin$\}$ — six users. But gus and hal appear in
no edge touching that group; their only friendship is to each other. The graph therefore splits into
two connected components:
$$C_1 = \{\text{ana, ben, cam, dev, eve, fin}\}, \qquad C_2 = \{\text{gus, hal}\}.$$
def component_of(adj, start):
seen, stack = set(), [start]
while stack:
v = stack.pop()
if v in seen:
continue
seen.add(v)
stack.extend(adj[v] - seen) # push unvisited neighbors
return seen
c1 = component_of(adj, "ana")
print("component of ana:", sorted(c1), "size", len(c1))
print("ana reaches hal?", "hal" in c1)
# Expected output:
# component of ana: ['ana', 'ben', 'cam', 'dev', 'eve', 'fin'] size 6
# ana reaches hal? False
The answer to the manager: no, the network is not one community. It is two: a main cluster of six,
and an isolated pair (gus–hal) who joined together and never connected to anyone else. That is a
product insight — those two need a "suggested friends" nudge — and it is a direct reading of the
component structure.
💡 Intuition: "Reachable by a path" is an equivalence relation (§27.2, Chapter 12): reflexive, symmetric, transitive. Its equivalence classes are exactly the connected components, which is why they partition the users with no overlaps and no one left out. The little
component_ofroutine is a first, informal traversal — Chapter 28 formalizes it as breadth-first / depth-first search.🔄 Check Your Understanding Within the main component $C_1$, which single vertex, if deleted, would split it into two pieces? (Such a vertex is a cut vertex — a single point of failure.)
Answer
cam. Deletingcamremoves edges ana–cam, ben–cam, cam–dev. Then $\{$ana, ben$\}$ are cut off from $\{$dev, eve, fin$\}$ — two pieces. Socamis a cut vertex linking the ana–ben side to the dev–eve–fin side. (devis also worth checking; deleting it isolates the eve–fin pair from the rest only if eve–fin had no other route — here eve and fin still connect to each other, but lose the path to ana/ben, sodevis a cut vertex too.)
Phase 5: Degrees of separation (shortest path by hand)
The third question — "how many degrees of separation are ana and hal?" — is "what is the length
of the shortest path from ana to hal?" (§27.2). But Phase 4 already answered the hard part: ana
and hal are in different components, so no path exists at all. Their degrees of separation are
undefined (often reported as $\infty$). This is a trap the chapter prepares you for: you cannot ask
for a shortest path before confirming the two vertices are even in the same component.
So instead, measure a pair that is connected: ana to fin. Trace shortest routes:
ana–cam–dev–fin: length 3.ana–cam–dev–eve–fin: length 4 (longer).
No route from ana to fin uses fewer than 3 edges (every path must leave the ana–ben–cam triangle
through cam, then cross to dev, then reach fin). So the shortest-path length — the degrees of
separation — is 3.
def shortest_len(adj, src, dst):
# informal breadth-first layering; formalized in Chapter 28
frontier, dist = {src}, {src: 0}
while frontier:
nxt = set()
for v in frontier:
for w in adj[v]:
if w not in dist:
dist[w] = dist[v] + 1
nxt.add(w)
frontier = nxt
return dist.get(dst, None) # None == unreachable
print("ana–fin:", shortest_len(adj, "ana", "fin"))
print("ana–hal:", shortest_len(adj, "ana", "hal"))
# Expected output:
# ana–fin: 3
# ana–hal: None
Hand-trace the layering from ana: layer 0 = {ana}; layer 1 = {ben, cam}; layer 2 = {dev} (cam's new
neighbor); layer 3 = {eve, fin} (dev's new neighbors). So fin first appears in layer 3 → distance 3,
matching our by-hand path. hal never appears → None → unreachable, exactly as Phase 4 predicted.
🔗 Connection. That layer-by-layer expansion is breadth-first search, and it computes the literal "six degrees of separation." Chapter 28 turns this sketch into the real
bfs(g, s)ingraphs.pyand proves it returns the minimum path length.
Phase 6: Deduplicating snapshots with an invariant
A month later the engineer hands you a second export, taken from a different server, with the users renamed to numeric IDs:
1,2
1,3
2,3
3,4
4,5
5,6
6,4
7,8
Is this a genuinely new network, or the same graph as before with relabeled vertices (an isomorphism, §27.5)? You must not pay to re-ingest a duplicate. Trying all $8! = 40{,}320$ relabelings is silly; instead compute a cheap invariant and compare.
def signature(adj):
"""A cheap isomorphism invariant: (#vertices, #edges, sorted degree seq,
sorted component sizes)."""
n = len(adj)
m = sum(len(s) for s in adj.values()) // 2
degs = tuple(sorted(len(s) for s in adj.values()))
# component sizes via the traversal from Phase 4
seen, sizes = set(), []
for v in adj:
if v not in seen:
comp = component_of(adj, v)
seen |= comp
sizes.append(len(comp))
return (n, m, degs, tuple(sorted(sizes)))
print(signature(adj))
# Expected output:
# (8, 8, (1, 1, 2, 2, 2, 2, 3, 3), (2, 6))
The second export, built into its own adj2, has the identical structure: vertices 1..8, eight
edges, the same triangle 1-2-3 (matching ana-ben-cam), the same 3–4–5–6 cluster with the 4–6 edge
(matching cam-dev-eve-fin... let us check carefully), and the isolated pair 7-8 (matching
gus-hal). Its signature is also (8, 8, (1,1,2,2,2,2,3,3), (2,6)).
print("signatures match?", signature(adj) == signature(adj2))
# Expected output:
# signatures match? True
⚠️ Common Pitfall: Matching signatures do not prove the two graphs are isomorphic — the degree sequence and component sizes are necessary but not sufficient (§27.5). They earn you the right to keep looking: now you would attempt an explicit adjacency-preserving bijection (e.g.
ana↦?by matching the unique degree-3 vertices and their neighborhoods). A mismatch, by contrast, would have settled it instantly: not the same graph, no re-ingest. Here the cheap test flags a likely duplicate worth a closer check — turning a $40{,}320$-way search into a one-line comparison plus a targeted verification.
To finish the deduplication you would exhibit the bijection. The two degree-3 vertices are {cam, dev}
in snapshot 1 and {3, 4} in snapshot 2. cam's neighbors are {ana, ben, dev} (two degree-2, one
degree-3); 3's neighbors are {1, 2, 4} (two degree-2, one degree-3) — the same local shape. Mapping
cam↦3, dev↦4, ana↦1, ben↦2, eve↦5, fin↦6, gus↦7, hal↦8 and checking all eight edges confirms the
isomorphism. Same network, relabeled — do not re-ingest.
Discussion Questions
- The handshaking check in Phase 2 caught a reporting bug (a wrong
print), not a data bug. Describe a concrete data bug — a malformed edge list — that the check $\sum\deg = 2\lvert E\rvert$ would catch, and one it would not catch. (Hint: the lemma sees totals, not structure.) - In Phase 5 you reported
ana–halseparation asNone(undefined). Some systems report it as a large sentinel like999999. What bug could that sentinel cause downstream if a later average treats it as a real distance? - Degree ranked
camanddevas most-connected. Give a small modification to the friendship data that would make a low-degree user arguably more important than a high-degree one. (Hint: think about who connects the two components.) - The signature in Phase 6 used (vertices, edges, degree sequence, component sizes). Propose one more cheap invariant you could add to make false "matches" rarer, and argue it is genuinely preserved by relabeling.
- Everything in this case study was undirected. If the export were a "follows" graph instead of a "friends" graph, which phases change, and which stay the same? (Reference in-degree vs. out-degree.)
Your Turn: Extensions
- Option A (real ingest). Write
load_graph(rows)that builds aGraph(from the Project Checkpoint) from an edge list, rejecting self-loops and reporting the count of duplicate rows it skipped. Then assert the handshaking lemma on the result. - Option B (component report). Extend
component_ofinto acomponents(adj)that returns a list of all component vertex-sets. Run it on the eight-user graph and confirm it returns the two sets from Phase 4. (You are previewing Chapter 28's connected-components routine.) - Option C (mutual-friend suggester). For the isolated pair
gus–hal, the "people you may know" idea (§27.6) fails — they have no friends-of-friends. Writesuggest(adj, person)returning distance-2 vertices ranked by number of common neighbors, run it forana, and explain why it returnsdev(and with what count). - Option D (toward isomorphism). Implement
same_degree_multiset(adj1, adj2)and a partial bijection checker that maps the unique-degree vertices first. Apply it to the two snapshots and report how far you get before needing to guess.
Key Takeaways
- Validate before you analyze. Lift the edge list into $V$, $E$, and an adjacency dict; then run the handshaking lemma and the parity corollary as free integrity checks. A metric on unvalidated data is a confident lie (Theme 2).
- A handshaking mismatch indicts your code, not the theorem. $\sum\deg = 2\lvert E\rvert$ always holds; a failure points at duplicate rows, self-loops, or a one-directional insert.
- Components come before paths. "Degrees of separation" is undefined across components, so check reachability first; an unreachable pair has no shortest path, not a huge one.
- Invariants replace brute force. A degree-sequence-plus-component signature turns a $40{,}320$-way isomorphism search into a one-line filter — but a match only licenses a closer look, never a conclusion (§27.5). Abstraction, not exhaustion, is the tool (Theme 3).
- This is the Part V anchor. The same graph and the same traversals reappear as BFS (Chapter 28), weighted shortest paths (Chapter 29), and community coloring (Chapter 33).