Case Study: Measuring Degrees of Separation in a Social Network
"How far apart are two people in a social network? Knock on every door one ring at a time, and count the rings."
Executive Summary
You will take a small but realistic friendship graph and measure it — not change it, not optimize it, but interrogate it the way a network analyst would on the first day with a new dataset. Using only the breadth-first search you built in §28.2, you will compute one person's degrees of separation to everyone else, find the network's longest shortest path (its diameter), discover that the graph splits into separate friend groups (its connected components), and confront a subtle reporting trap: "average separation" is meaningless across disconnected pieces. This is the analytical heart of the social-network thread the book has been building toward since Chapter 27 — and it is the exact computation behind the famous "six degrees of separation" claim.
The lesson is that BFS is not just an algorithm; it is a measuring instrument. One sweep from a single vertex yields an entire column of a distance table. Reading that table correctly — knowing what an absent key means, why distances satisfy a triangle-like bound, and where an "average" silently lies — is the analyst's real skill.
Skills applied
- Modeling a real relationship dataset as an undirected graph (Chapter 27) and storing it as an adjacency list (§28.1).
- Running BFS to compute single-source shortest unweighted distances and reconstruct paths (§28.2).
- Deriving the eccentricity, diameter, and component structure from repeated BFS sweeps (§28.2, §28.4).
- Interpreting traversal output: absent keys = unreachable; distances obey a triangle inequality; averages must be taken within a component.
- Reasoning about cost: why $V$ BFS sweeps cost $\Theta(V(V+E))$, not $\Theta(V^2 \cdot E)$ (Chapter 14, §28.6).
Background
The scenario
A small online community — call it a study-group app — has fourteen users. The product team wants to answer questions a growth analyst would ask:
- For a given user, how socially "central" are they — what is the largest number of friend-hops to reach anyone they can reach?
- What is the worst-case separation in the whole network — the two users who are farthest apart?
- Is the community one connected whole, or has it fractured into cliques that never touch?
- Is the "six degrees" folklore even true here?
The friendship relation is symmetric (if Ana is friends with Bo, Bo is friends with Ana), so we model it as an undirected graph $G = (V, E)$: one vertex per user, one edge per friendship. Because the network is sparse — fourteen people are not friends with everyone — we store it as an adjacency list (the §28.1 default for traversal-heavy work).
🔗 Connection. This case study advances the social-graph anchor from Chapter 27 (where you first modeled a network) to its first real measurement. "Degrees of separation" between two people is precisely their BFS distance. In Chapter 29 we will attach weights (how strong a friendship is, or travel time in a road network) and BFS becomes Dijkstra; here every friendship counts the same, so BFS is exactly the right instrument.
The data
Here is the friendship graph as an adjacency list. Users are numbered $0$–$13$. Notice up front that users $12$ and $13$ are friends with each other and no one else — a foreshadowing we will confirm with code, not eyeballing.
friends = {
0: [1, 2],
1: [0, 3, 4],
2: [0, 4],
3: [1, 5],
4: [1, 2, 5, 6],
5: [3, 4, 7],
6: [4, 7, 8],
7: [5, 6, 9],
8: [6, 9],
9: [7, 8, 10, 11],
10: [9],
11: [9],
12: [13],
13: [12],
}
💡 Intuition: Picture the first twelve users as a loose chain of overlapping triangles winding from user $0$ down to users $10$ and $11$, with users $12$ and $13$ off in a corner holding hands and ignoring everyone. BFS will prove that mental picture — or correct it.
Phase 1: One sweep — degrees of separation from user 0
Start with the simplest question: how far is everyone from user $0$? This is one BFS, and it produces a
whole column of the eventual distance matrix in a single pass. We use the chapter's bfs_distances
verbatim.
from collections import deque
def bfs_distances(adj, s):
"""Shortest-path distances (in edges) from s, plus a predecessor map."""
dist = {s: 0}
parent = {s: None}
queue = deque([s])
while queue:
v = queue.popleft()
for w in adj[v]:
if w not in dist: # 'in dist' doubles as 'visited'
dist[w] = dist[v] + 1
parent[w] = v
queue.append(w)
return dist, parent
dist0, parent0 = bfs_distances(friends, 0)
print(dict(sorted(dist0.items())))
print("12 reachable from 0?", 12 in dist0)
# Expected output:
# {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 5, 10: 6, 11: 6}
# 12 reachable from 0? False
Let us hand-derive that distance map ring by ring, exactly as the §28.2 trace table does, so you trust the numbers rather than the machine.
| Ring (distance) | Vertices first reached at this distance | How they were reached |
|---|---|---|
| 0 | $0$ | the source |
| 1 | $1, 2$ | neighbors of $0$ |
| 2 | $3, 4$ | $3$ from $1$; $4$ from $1$ (then $2$ also sees $4$, already claimed) |
| 3 | $5, 6$ | $5$ from $3$ (and from $4$); $6$ from $4$ |
| 4 | $7, 8$ | $7$ from $5$ (and $6$); $8$ from $6$ |
| 5 | $9$ | from $7$ (and $8$) |
| 6 | $10, 11$ | both from $9$ |
Two facts jump out, and both are analysis, not mechanics:
- User $0$'s eccentricity is 6. The farthest reachable users ($10$ and $11$) sit six hops away. The eccentricity of a vertex is the greatest shortest-path distance from it to any vertex it can reach: $\mathrm{ecc}(0) = \max_v \mathrm{dist}_0(v) = 6$.
- Users $12$ and $13$ are absent from the map. Per the §28.2 rule, an absent key means unreachable: $\delta(0, 12) = \infty$. The network is not connected. We will quantify the fracture in Phase 3.
⚠️ Common Pitfall: A tempting one-liner for "average separation from user 0" is
sum(dist0.values()) / len(dist0). Butdist0only contains the reachable users, so this averages over $12$ people and silently omits users $12$ and $13$ — for whom the distance is infinite, not zero and not "skip." Any average distance is only meaningful within a single component. We return to this in Phase 4.🔄 Check Your Understanding User $4$ appears at distance $2$, reached from user $1$. User $4$ is also a neighbor of user $2$, who is also at distance $1$. Could user $4$'s distance have come out as anything other than $2$? Could its
parenthave differed?
Answer
No to the distance: user $4$ is adjacent to two distance-$1$ vertices ($1$ and $2$), so it is reached in the second ring no matter which claims it — $\mathrm{dist}[4] = 2$ is forced. Yes to the parent: had user $2$ been processed before user $1$,
parent[4]would be $2$ instead of $1$. Shortest paths need not be unique (§28.2); only the distance is.
Phase 2: Reconstructing a path, and the triangle inequality
A distance is a number; a path is the evidence. The parent map from one BFS lets us reconstruct an
actual shortest chain of friends — the "who introduced whom" trail — using the §28.2 reconstruction.
def shortest_path(adj, s, t):
"""Reconstruct one shortest s-t path (list of vertices), or None."""
dist, parent = bfs_distances(adj, s)
if t not in parent:
return None
path = []
v = t
while v is not None:
path.append(v)
v = parent[v]
path.reverse()
return path
print(shortest_path(friends, 0, 11))
print(shortest_path(friends, 0, 12))
# Expected output:
# [0, 1, 3, 5, 7, 9, 11]
# None
Walk the parent0 pointers back from $11$: $\mathrm{parent}[11] = 9$, $\mathrm{parent}[9] = 7$,
$\mathrm{parent}[7] = 5$, $\mathrm{parent}[5] = 3$, $\mathrm{parent}[3] = 1$, $\mathrm{parent}[1] = 0$.
Reversed, that is $0 \to 1 \to 3 \to 5 \to 7 \to 9 \to 11$ — six edges, matching
$\mathrm{dist}_0(11) = 6$. (A different but equally short chain, $0 \to 2 \to 4 \to 6 \to 7 \to 9 \to 11$,
also has length 6; which one BFS reports depends only on neighbor order.) And $\delta(0,12)$ has no
path, so shortest_path returns None.
A property worth checking, because analysts lean on it constantly: BFS distances satisfy a triangle inequality. For any three vertices $s$, $u$, $w$ in the same component, $$\delta(s, w) \le \delta(s, u) + \delta(u, w).$$ The reason is immediate from the definition of shortest path: concatenating a shortest $s$–$u$ path with a shortest $u$–$w$ path yields some $s$–$w$ walk of length $\delta(s,u) + \delta(u,w)$, and the shortest $s$–$w$ path can only be shorter or equal. We can spot-check it on our data: $\delta(0, 9) = 5$, and $\delta(0, 7) + \delta(7, 9) = 4 + 1 = 5$, so equality holds here (because $7$ happens to lie on a shortest $0$–$9$ path). Route through a vertex past the target and the inequality goes strict.
🧩 Productive Struggle. Before reading on: find a vertex $u$ for which the triangle inequality from $s = 0$ to $w = 9$ is strict — i.e. $\delta(0, 9) < \delta(0, u) + \delta(u, 9)$. (Hint: pick a $u$ that lies beyond $9$, like $u = 10$, so reaching $u$ overshoots and coming back wastes hops.)
Phase 3: Component structure — is the network one piece?
Phase 1 already hinted the graph is disconnected (users $12$, $13$ never appeared). Let us settle it with the §28.4 component finder, which relaunches a traversal from each not-yet-seen vertex; one launch = one component.
def bfs_order(adj, s):
visited = {s}
queue = deque([s])
order = []
while queue:
v = queue.popleft()
order.append(v)
for w in adj[v]:
if w not in visited:
visited.add(w)
queue.append(w)
return order
def connected_components(adj):
seen = set()
components = []
for start in adj:
if start not in seen:
comp = bfs_order(adj, start)
seen.update(comp)
components.append(sorted(comp))
return components
comps = connected_components(friends)
print("number of components:", len(comps))
for c in comps:
print("size", len(c), "->", c)
# Expected output:
# number of components: 2
# size 12 -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
# size 2 -> [12, 13]
The outer loop first lands on vertex $0$, whose BFS sweeps up the entire chain $\{0, \dots, 11\}$ — a
twelve-person component. It then skips $1$–$11$ (all now in seen), reaches $12$, and a sweep from $12$
finds only $\{12, 13\}$. Two components: a giant component of 12 and an isolated pair. This is the
shape of almost every real social network — one dominant connected mass plus a scattering of tiny
detached groups.
🔗 Connection. The recursive structure here is exactly flood fill (the paint-bucket tool floods one connected region of like-colored pixels) and the problem union-find solves more cleverly — both previewed in §28.4. In Chapter 32 you will meet union-find as a faster engine for the same "which-things-are-connected" question when building minimum spanning trees.
🔄 Check Your Understanding The whole
connected_componentscall touches fourteen vertices and starts BFS twice. Why is its total cost still $\Theta(V + E)$ and not (number of components) $\times \Theta(V+E)$?
Answer
The
seenguard ensures every vertex is traversed exactly once across all launches: a launch from an already-seen vertex never happens, and inside a launch the visited check stops re-exploration. Total work is proportional to vertices plus edges regardless of how many launches occur — $\Theta(V+E)$ overall (§28.4).
Phase 4: Diameter — the network's worst-case separation
Now the headline metric. The diameter of a graph is the greatest shortest-path distance between any two vertices — the eccentricity, maximized over all vertices. But diameter is only defined within a connected piece (across components the distance is infinite). So we compute the diameter of the giant component: run BFS from every vertex of that component, take each vertex's eccentricity, and take the max.
def component_diameter(adj, component):
"""Diameter of one connected component (list of its vertices)."""
best = 0
for s in component:
dist, _ = bfs_distances(adj, s)
ecc = max(dist.values()) # farthest reachable vertex from s
best = max(best, ecc)
return best
giant = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print("diameter of giant component:", component_diameter(friends, giant))
# Expected output:
# diameter of giant component: 6
The diameter is $6$, realized by user $0$ and the bottom leaves $10$ and $11$ — $\mathrm{dist}_0(10) = \mathrm{dist}_0(11) = 6$ from Phase 1. So here the diameter equals $\mathrm{ecc}(0)$, but that is luck, not law: the farthest-apart pair need not include the vertex you happened to start from, and a single BFS from a badly chosen source will silently understate the diameter. To see the trap, run BFS from user $2$ — who sits on a shortcut down to the bottom:
dist2, _ = bfs_distances(friends, 2)
print(dict(sorted(dist2.items())))
# Expected output:
# {2: 0, 0: 1, 4: 1, 1: 2, 5: 2, 6: 2, 3: 3, 7: 3, 8: 3, 9: 4, 10: 5, 11: 5}
Hand-check: from $2$, one hop to $\{0, 4\}$, two to $\{1, 5, 6\}$, three to $\{3, 7, 8\}$, four to $9$, five to $\{10, 11\}$. So $\mathrm{ecc}(2) = 5$ — strictly less than the diameter $6$. Why is $2$ closer to the bottom than $0$ is? Because $2 \to 4 \to 6 \to 7 \to 9 \to 10$ is a five-hop shortcut, whereas $0$'s shortest route to $10$ takes six hops. An analyst who computed "the diameter" from a single BFS rooted at $2$ would report $5$ and be wrong. Confirm the other end with BFS from a bottom leaf:
dist10, _ = bfs_distances(friends, 10)
print("ecc(10) =", max(dist10.values()), " dist(10,0) =", dist10[0])
# Expected output:
# ecc(10) = 6 dist(10,0) = 6
From $10$ the farthest vertex is $0$, at distance $6$ — matching $\mathrm{dist}_0(10) = 6$ (distance is
symmetric in an undirected graph). The point of the code is that you do not guess the pair:
component_diameter runs BFS from all $12$ sources, takes each eccentricity, and reports the true maximum
$6$ — no inspection, no luck. The diameter is the rigorous version of "how spread out is this community?"
⚠️ Common Pitfall — the lying average. Suppose a dashboard reports "average degrees of separation = 2.1." If that average was taken over
bfs_distances(friends, 0)(the 12 reachable users) it excludes the unreachable pair entirely, overstating cohesion. If someone "fixed" it by treating missing distances as $0$, it would understate separation wildly. The honest statistic is the average pairwise distance within each component, reported per component (or the diameter, which has no such averaging hazard). Disconnected graphs have no finite average separation, full stop.🐛 Find the Error. An intern computes the diameter as
max(bfs_distances(friends, 2)[0].values())— one BFS from vertex $2$ — and reports $5$. Why is this wrong, and what is the cheapest correct fix?
Answer
One BFS gives the eccentricity of the chosen source, not the diameter. Here $\mathrm{ecc}(2) = 5$ but the true diameter is $6$ (realized by the pair $\{0, 10\}$, which does not include $2$). The farthest-apart pair need not pass through your start vertex, so a single BFS can understate the diameter. The correct (if not the asymptotically cheapest) fix is to run BFS from every vertex of the component and take the maximum eccentricity, as
component_diameterdoes. (Cheaper diameter algorithms exist for trees and via "double BFS," but for a general graph the all-sources sweep is the simple correct method.)
Phase 5: How much did all this cost?
We ran BFS many times. Was that affordable? One BFS on an adjacency list is $\Theta(V + E)$ (§28.6). Computing every vertex's eccentricity runs BFS once per vertex — $V$ sweeps — so the diameter computation is $\Theta(V \cdot (V + E))$. For our toy graph ($V = 14$, $E = 17$) that is trivial. For a real social network with $V = 10^6$ and $E = 10^7$, a single BFS (~$1.1 \times 10^7$ operations) is instant, but $V$ of them is ~$10^{13}$ operations — minutes to hours, the reason production systems estimate diameter by sampling sources rather than sweeping all of them.
🔄 Check Your Understanding A teammate worries the diameter computation is $\Theta(V^2 \cdot E)$ "because it nests BFS inside a loop over vertices." Correct them.
Answer
Each BFS is $\Theta(V + E)$, and we run it $V$ times in sequence, so the costs add: $V \cdot \Theta(V + E) = \Theta(V(V+E)) = \Theta(V^2 + VE)$. There is no extra factor of $E$ or $V$ — the loop multiplies one BFS's cost by $V$, it does not square it. (This is the Chapter 14 distinction between nesting that multiplies and sequencing that adds.)
Discussion Questions
- We modeled friendship as undirected. A "follows" relation (as on a microblog) is directed — Ana can follow Bo without Bo following Ana. How would the analysis change? Which metrics (reachable set, eccentricity, components) become direction-dependent, and why does "reachable from $s$" stop being symmetric?
- The "six degrees of separation" claim is that random pairs in a real social graph are typically within six hops. Our giant component has diameter $6$ — yet that is the worst-case pair, not the typical one. Is a diameter of $6$ consistent with "six degrees"? Distinguish diameter (worst case) from average distance (typical case) in your answer, and explain why the average is the quantity the folklore really refers to.
- BFS gave correct distances because every friendship counts as one hop. Suppose instead each edge carried a "closeness score" and you wanted the strongest chain of friendships. Why does BFS no longer apply, and which Chapter 29 algorithm would you reach for?
- We computed the diameter by $V$ separate BFS sweeps. The component is almost a tree (few extra edges). Does that structure suggest a faster diameter method? (Sketch the idea behind "two BFS passes find a tree's diameter" — you need not prove it.)
- The component finder reported sizes $12$ and $2$. As a product analyst, what action would each
finding suggest — and how might you use the
parenttrails (the introduction chains) to suggest new friendships that would merge the isolated pair into the giant component?
Your Turn: Extensions
- Option A (closeness centrality). For each vertex $v$ in the giant component, compute its average
distance to all other reachable vertices, $\frac{1}{|C|-1}\sum_{u} \delta(v,u)$. The vertex with the
smallest average is the most "central." Implement it on top of
bfs_distances, hand-derive the averages for vertices $4$, $7$, and $0$, and rank them. (This is closeness centrality, a standard social-network metric.) - Option B (radius and center). The radius is the minimum eccentricity, and a center is any
vertex achieving it. Modify
component_diameterto also return the radius and one center of the giant component. Which user is the most central by this measure, and does it match Option A's winner? - Option C (bridge the components). Find the single edge you could add between the giant component and the $\{12, 13\}$ pair that would minimize the new graph's diameter. Argue (with a couple of BFS sweeps) which giant-component vertex is the best attachment point, and report the resulting diameter.
- Option D (degrees-of-separation histogram). Run BFS from vertex $0$ and tabulate how many users sit at each distance $0, 1, 2, \dots$. This histogram is the empirical "rings" of §28.2. Confirm the ring sizes match Phase 1's trace table.
Key Takeaways
- BFS is a measuring instrument, not just a traversal. One sweep from a source yields that source's entire distance column; the analyst's job is to read the table — eccentricity, diameter, and reachability all fall out of repeated BFS (§28.2, §28.4).
- An absent key means infinity. A vertex missing from
bfs_distancesoutput is unreachable; that is precisely how disconnection is detected (§28.2). Never treat a missing distance as zero. - Averages live inside components. Diameter, radius, and average separation are only meaningful within a connected component; reporting them across a disconnected graph silently lies (§28.4).
- Sequencing adds, nesting multiplies. $V$ BFS sweeps cost $\Theta(V(V+E))$ — the loop multiplies one $\Theta(V+E)$ BFS by $V$, it does not square it (§28.6, Chapter 14). On large graphs this is why diameter is estimated by sampling sources.
- This is the social-graph anchor's first real payoff. "Degrees of separation" is BFS distance; the six-degrees folklore is an empirical statement about these distances. Chapter 29 generalizes one hop to weighted edges, turning BFS into Dijkstra.