Case Study: Auditing a Social Graph Before You Trust It

DataField.Dev

"All models are wrong, but some are useful." — George E. P. Box

Executive Summary

A data engineer hands you a CSV export of a small social network — a list of friendships — and asks you to "do some analysis." Before you compute a single fancy metric, you will audit the graph: turn the raw edge list into a precise $G = (V, E)$, sanity-check it with the handshaking lemma, find the most-connected people by degree, decide whether the network is in one piece or several components, measure degrees of separation between specific people along shortest paths, and use a cheap invariant to detect two "different" exported snapshots that are actually the same graph relabeled. Every step is pure Chapter 27 vocabulary doing real work. The lesson — Theme 2 of this book — is that a metric computed on a graph you have not validated is a confident-looking lie.

This is the anchor example for all of Part V (see §27.6): the graph you audit here is exactly the kind of object Chapter 28 will run breadth-first search on, Chapter 29 will weight, and Chapter 33 will color into communities.

Skills applied

Translating a raw edge list into a formal graph $G = (V, E)$ and an adjacency dictionary.
Using the handshaking lemma ($\sum \deg = 2\lvert E\rvert$) and its parity corollary as a data integrity check.
Computing degree, neighborhoods, and the most-connected vertex.
Finding connected components and reasoning about disconnection (cut vertices).
Measuring shortest-path length ("degrees of separation") by hand on a small graph.
Using an isomorphism invariant (degree sequence + component structure) to deduplicate snapshots.

Background

The scenario

You join a small startup that runs a study-buddy app. The data engineer exports the friendship table as a list of unordered pairs (a friendship is mutual, so each line is one undirected edge):

ana,ben
ana,cam
ben,cam
cam,dev
dev,eve
eve,fin
fin,dev
gus,hal

Eight friendships among the users ana, ben, cam, dev, eve, fin, gus, hal. The product manager wants three things: "who are our most-connected users (for a beta invite)?", "is everyone reachable from everyone — is our network one community or several?", and "how many degrees of separation are ana and hal?" You will answer all three — but first you validate the export, because a metric on bad data is worse than no metric.

💡 Intuition: An edge list is the rawest graph format — just pairs. Your first job is always to lift it into a structure you can ask questions of: the vertex set $V$, the edge set $E$, and an adjacency dictionary $N(\cdot)$. Until you do, "the graph" is just text.

Why this matters

Real social-graph pipelines ingest edge lists by the billion, and they routinely contain duplicate rows, self-loops (a user "friending" themselves through a bug), and dangling references. The cheap invariants of Chapter 27 — is the degree sum even? does the edge count match? are there self-loops? — are the first line of defense, and they cost nothing. Skipping them is how a "viral growth" dashboard ends up double-counting every friendship.

Phase 1: From edge list to a formal graph

First, lift the text into sets, exactly as §27.1 prescribes. The vertices are the distinct names; each edge is an unordered pair, which we model as a frozenset so that {"ana","ben"} and {"ben","ana"} are the same edge.

edges_raw = [("ana","ben"), ("ana","cam"), ("ben","cam"), ("cam","dev"),
             ("dev","eve"), ("eve","fin"), ("fin","dev"), ("gus","hal")]

V = set()
E = set()
for u, v in edges_raw:
    V.add(u); V.add(v)
    E.add(frozenset({u, v}))

print(len(V), "vertices,", len(E), "edges")
# Expected output:
# 8 vertices, 8 edges

We hand-trace: the names that appear are ana, ben, cam, dev, eve, fin, gus, hal — eight distinct vertices. Eight raw rows, all distinct unordered pairs, give eight edges. So $$V = \{\text{ana, ben, cam, dev, eve, fin, gus, hal}\}, \qquad \lvert V\rvert = 8, \quad \lvert E\rvert = 8.$$

Using frozenset would also silently absorb a duplicate row — if the export listed ben,ana and ana,ben, the set E would still hold one edge. That is the deduplication property from §27.1 (Check Your Understanding #2) doing data hygiene for free.

🔄 Check Your Understanding If the export had accidentally included the row cam,cam, what would frozenset({"cam","cam"}) evaluate to, and why is that a red flag for a simple graph?

Answer

frozenset({"cam","cam"}) is frozenset({"cam"}) — a one-element set, not a two-element edge. A simple graph forbids loops $\{v,v\}$ (§27.1), so any one-element "edge" signals a self-loop bug in the data; you would flag and drop it (or escalate to a multigraph deliberately).

Phase 2: The handshaking lemma as an integrity check

Now the cheapest, most powerful validation in graph theory. Build the adjacency dictionary and check that $\sum_v \deg(v) = 2\lvert E\rvert$ (§27.4). If the two sides disagree, the export — or your ingestion code — is internally inconsistent, and you stop before computing anything else.

adj = {v: set() for v in V}
for e in E:
    u, w = tuple(e)            # an edge has exactly two endpoints
    adj[u].add(w)
    adj[w].add(u)

degree_sum = sum(len(nbrs) for nbrs in adj.values())
print("sum of degrees =", degree_sum)
print("2 * |E|        =", 2 * len(E))
print("consistent?    ", degree_sum == 2 * len(E))
# Expected output:
# sum of degrees = 16
# 2 * |E|        = 16
# consistent?     True

Hand-derive each side. First the degrees, straight from the edge list:

Vertex	Neighbors	Degree
ana	ben, cam	2
ben	ana, cam	2
cam	ana, ben, dev	3
dev	cam, eve, fin	3
eve	dev, fin	2
fin	eve, dev	2
gus	hal	1
hal	gus	1

Sum of degrees $= 2+2+3+3+2+2+1+1 = 16$. And $2\lvert E\rvert = 2 \times 8 = 16$. They match, so the export is internally consistent and we may proceed. We also confirm the parity corollary: the only odd-degree vertices are gus (1) and hal (1) — exactly two, an even count, as §27.4 guarantees. Two free integrity checks, both green.

Now imagine the engineer had accidentally exported cam,dev twice. The set E would still hold the edge once (a frozenset deduplicates), so len(E) stays 8 — but a buggy ingestion loop that added to adj once per raw row would record an extra cam↔dev link, pushing the degree sum to 18 while $2\lvert E\rvert$ stayed 16. The check degree_sum == 2 * len(E) would print False, and you would catch the double-count before it inflated a "viral growth" dashboard. That is the lemma earning its keep as a one-line data test.

⚠️ Common Pitfall: A failed handshaking check almost never means "the lemma is wrong" — the lemma is a theorem, true of every graph. It means your data or your counting code is wrong: a duplicated row counted once in $E$ but twice in the degree loop, a self-loop, or an edge inserted in only one direction in adj. Treat a handshaking mismatch as a stack trace pointing straight at your ingestion code.

We also check the parity corollary: the odd-degree vertices are gus (1) and hal (1) — exactly two of them, an even count, as §27.4 guarantees. Another quiet confirmation that the export is sane.

Phase 3: Most-connected users (degree ranking)

The product manager's first question — "who should get the beta invite?" — is, in its crudest form, "who has the highest degree?" Degree is the first proxy for influence (§27.2).

degree = {v: len(nbrs) for v, nbrs in adj.items()}
ranked = sorted(degree, key=degree.get, reverse=True)
print("degrees:", dict(sorted(degree.items())))
print("top 2 by degree:", ranked[:2])
# Expected output:
# degrees: {'ana': 2, 'ben': 2, 'cam': 3, 'dev': 3, 'eve': 2, 'fin': 2, 'gus': 1, 'hal': 1}
# top 2 by degree: ['cam', 'dev']

Hand-check: from the table in Phase 2, cam and dev are the only degree-3 vertices, so they top the ranking (ties broken arbitrarily by the sort). They are the natural beta candidates — each sits at a junction of the network.

🔗 Connection. Degree is the simplest centrality measure. Smarter notions — betweenness (who lies on the most shortest paths?), and PageRank-style influence — build on the path and neighbor language of this chapter. Degree is where every centrality story starts (§27.6).

Phase 4: Is the network one community? (connected components)

The manager's second question — "is everyone reachable from everyone?" — is precisely "is the graph connected?" (§27.2). We trace reachability by hand. Start from ana and see whom we can reach by walking edges:

ana → {ben, cam} → cam reaches dev → dev reaches {eve, fin} → eve, fin reach each other.

So from ana we reach $\{$ana, ben, cam, dev, eve, fin$\}$ — six users. But gus and hal appear in no edge touching that group; their only friendship is to each other. The graph therefore splits into two connected components: $$C_1 = \{\text{ana, ben, cam, dev, eve, fin}\}, \qquad C_2 = \{\text{gus, hal}\}.$$

def component_of(adj, start):
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(adj[v] - seen)   # push unvisited neighbors
    return seen

c1 = component_of(adj, "ana")
print("component of ana:", sorted(c1), "size", len(c1))
print("ana reaches hal?", "hal" in c1)
# Expected output:
# component of ana: ['ana', 'ben', 'cam', 'dev', 'eve', 'fin'] size 6
# ana reaches hal? False

The answer to the manager: no, the network is not one community. It is two: a main cluster of six, and an isolated pair (gus–hal) who joined together and never connected to anyone else. That is a product insight — those two need a "suggested friends" nudge — and it is a direct reading of the component structure.

💡 Intuition: "Reachable by a path" is an equivalence relation (§27.2, Chapter 12): reflexive, symmetric, transitive. Its equivalence classes are exactly the connected components, which is why they partition the users with no overlaps and no one left out. The little component_of routine is a first, informal traversal — Chapter 28 formalizes it as breadth-first / depth-first search.

🔄 Check Your Understanding Within the main component $C_1$, which single vertex, if deleted, would split it into two pieces? (Such a vertex is a cut vertex — a single point of failure.)

Answer

cam. Deleting cam removes edges ana–cam, ben–cam, cam–dev. Then $\{$ana, ben$\}$ are cut off from $\{$dev, eve, fin$\}$ — two pieces. So cam is a cut vertex linking the ana–ben side to the dev–eve–fin side. (dev is also worth checking; deleting it isolates the eve–fin pair from the rest only if eve–fin had no other route — here eve and fin still connect to each other, but lose the path to ana/ben, so dev is a cut vertex too.)

Phase 5: Degrees of separation (shortest path by hand)

The third question — "how many degrees of separation are ana and hal?" — is "what is the length of the shortest path from ana to hal?" (§27.2). But Phase 4 already answered the hard part: ana and hal are in different components, so no path exists at all. Their degrees of separation are undefined (often reported as $\infty$). This is a trap the chapter prepares you for: you cannot ask for a shortest path before confirming the two vertices are even in the same component.

So instead, measure a pair that is connected: ana to fin. Trace shortest routes:

ana–cam–dev–fin: length 3.
ana–cam–dev–eve–fin: length 4 (longer).

No route from ana to fin uses fewer than 3 edges (every path must leave the ana–ben–cam triangle through cam, then cross to dev, then reach fin). So the shortest-path length — the degrees of separation — is 3.

def shortest_len(adj, src, dst):
    # informal breadth-first layering; formalized in Chapter 28
    frontier, dist = {src}, {src: 0}
    while frontier:
        nxt = set()
        for v in frontier:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    nxt.add(w)
        frontier = nxt
    return dist.get(dst, None)   # None == unreachable

print("ana–fin:", shortest_len(adj, "ana", "fin"))
print("ana–hal:", shortest_len(adj, "ana", "hal"))
# Expected output:
# ana–fin: 3
# ana–hal: None

Hand-trace the layering from ana: layer 0 = {ana}; layer 1 = {ben, cam}; layer 2 = {dev} (cam's new neighbor); layer 3 = {eve, fin} (dev's new neighbors). So fin first appears in layer 3 → distance 3, matching our by-hand path. hal never appears → None → unreachable, exactly as Phase 4 predicted.

🔗 Connection. That layer-by-layer expansion is breadth-first search, and it computes the literal "six degrees of separation." Chapter 28 turns this sketch into the real bfs(g, s) in graphs.py and proves it returns the minimum path length.

Phase 6: Deduplicating snapshots with an invariant

A month later the engineer hands you a second export, taken from a different server, with the users renamed to numeric IDs:

1,2
1,3
2,3
3,4
4,5
5,6
6,4
7,8

Is this a genuinely new network, or the same graph as before with relabeled vertices (an isomorphism, §27.5)? You must not pay to re-ingest a duplicate. Trying all $8! = 40{,}320$ relabelings is silly; instead compute a cheap invariant and compare.

def signature(adj):
    """A cheap isomorphism invariant: (#vertices, #edges, sorted degree seq,
    sorted component sizes)."""
    n = len(adj)
    m = sum(len(s) for s in adj.values()) // 2
    degs = tuple(sorted(len(s) for s in adj.values()))
    # component sizes via the traversal from Phase 4
    seen, sizes = set(), []
    for v in adj:
        if v not in seen:
            comp = component_of(adj, v)
            seen |= comp
            sizes.append(len(comp))
    return (n, m, degs, tuple(sorted(sizes)))

print(signature(adj))
# Expected output:
# (8, 8, (1, 1, 2, 2, 2, 2, 3, 3), (2, 6))

The second export, built into its own adj2, has the identical structure: vertices 1..8, eight edges, the same triangle 1-2-3 (matching ana-ben-cam), the same 3–4–5–6 cluster with the 4–6 edge (matching cam-dev-eve-fin... let us check carefully), and the isolated pair 7-8 (matching gus-hal). Its signature is also (8, 8, (1,1,2,2,2,2,3,3), (2,6)).

print("signatures match?", signature(adj) == signature(adj2))
# Expected output:
# signatures match? True

⚠️ Common Pitfall: Matching signatures do not prove the two graphs are isomorphic — the degree sequence and component sizes are necessary but not sufficient (§27.5). They earn you the right to keep looking: now you would attempt an explicit adjacency-preserving bijection (e.g. ana↦? by matching the unique degree-3 vertices and their neighborhoods). A mismatch, by contrast, would have settled it instantly: not the same graph, no re-ingest. Here the cheap test flags a likely duplicate worth a closer check — turning a $40{,}320$-way search into a one-line comparison plus a targeted verification.

To finish the deduplication you would exhibit the bijection. The two degree-3 vertices are {cam, dev} in snapshot 1 and {3, 4} in snapshot 2. cam's neighbors are {ana, ben, dev} (two degree-2, one degree-3); 3's neighbors are {1, 2, 4} (two degree-2, one degree-3) — the same local shape. Mapping cam↦3, dev↦4, ana↦1, ben↦2, eve↦5, fin↦6, gus↦7, hal↦8 and checking all eight edges confirms the isomorphism. Same network, relabeled — do not re-ingest.

Discussion Questions

The handshaking check in Phase 2 caught a reporting bug (a wrong print), not a data bug. Describe a concrete data bug — a malformed edge list — that the check $\sum\deg = 2\lvert E\rvert$ would catch, and one it would not catch. (Hint: the lemma sees totals, not structure.)
In Phase 5 you reported ana–hal separation as None (undefined). Some systems report it as a large sentinel like 999999. What bug could that sentinel cause downstream if a later average treats it as a real distance?
Degree ranked cam and dev as most-connected. Give a small modification to the friendship data that would make a low-degree user arguably more important than a high-degree one. (Hint: think about who connects the two components.)
The signature in Phase 6 used (vertices, edges, degree sequence, component sizes). Propose one more cheap invariant you could add to make false "matches" rarer, and argue it is genuinely preserved by relabeling.
Everything in this case study was undirected. If the export were a "follows" graph instead of a "friends" graph, which phases change, and which stay the same? (Reference in-degree vs. out-degree.)

Your Turn: Extensions

Option A (real ingest). Write load_graph(rows) that builds a Graph (from the Project Checkpoint) from an edge list, rejecting self-loops and reporting the count of duplicate rows it skipped. Then assert the handshaking lemma on the result.
Option B (component report). Extend component_of into a components(adj) that returns a list of all component vertex-sets. Run it on the eight-user graph and confirm it returns the two sets from Phase 4. (You are previewing Chapter 28's connected-components routine.)
Option C (mutual-friend suggester). For the isolated pair gus–hal, the "people you may know" idea (§27.6) fails — they have no friends-of-friends. Write suggest(adj, person) returning distance-2 vertices ranked by number of common neighbors, run it for ana, and explain why it returns dev (and with what count).
Option D (toward isomorphism). Implement same_degree_multiset(adj1, adj2) and a partial bijection checker that maps the unique-degree vertices first. Apply it to the two snapshots and report how far you get before needing to guess.

Key Takeaways

Validate before you analyze. Lift the edge list into $V$, $E$, and an adjacency dict; then run the handshaking lemma and the parity corollary as free integrity checks. A metric on unvalidated data is a confident lie (Theme 2).
A handshaking mismatch indicts your code, not the theorem. $\sum\deg = 2\lvert E\rvert$ always holds; a failure points at duplicate rows, self-loops, or a one-directional insert.
Components come before paths. "Degrees of separation" is undefined across components, so check reachability first; an unreachable pair has no shortest path, not a huge one.
Invariants replace brute force. A degree-sequence-plus-component signature turns a $40{,}320$-way isomorphism search into a one-line filter — but a match only licenses a closer look, never a conclusion (§27.5). Abstraction, not exhaustion, is the tool (Theme 3).
This is the Part V anchor. The same graph and the same traversals reappear as BFS (Chapter 28), weighted shortest paths (Chapter 29), and community coloring (Chapter 33).