Case Study: Measuring Degrees of Separation in a Social Network

DataField.Dev

"How far apart are two people in a social network? Knock on every door one ring at a time, and count the rings."

Executive Summary

You will take a small but realistic friendship graph and measure it — not change it, not optimize it, but interrogate it the way a network analyst would on the first day with a new dataset. Using only the breadth-first search you built in §28.2, you will compute one person's degrees of separation to everyone else, find the network's longest shortest path (its diameter), discover that the graph splits into separate friend groups (its connected components), and confront a subtle reporting trap: "average separation" is meaningless across disconnected pieces. This is the analytical heart of the social-network thread the book has been building toward since Chapter 27 — and it is the exact computation behind the famous "six degrees of separation" claim.

The lesson is that BFS is not just an algorithm; it is a measuring instrument. One sweep from a single vertex yields an entire column of a distance table. Reading that table correctly — knowing what an absent key means, why distances satisfy a triangle-like bound, and where an "average" silently lies — is the analyst's real skill.

Skills applied

Modeling a real relationship dataset as an undirected graph (Chapter 27) and storing it as an adjacency list (§28.1).
Running BFS to compute single-source shortest unweighted distances and reconstruct paths (§28.2).
Deriving the eccentricity, diameter, and component structure from repeated BFS sweeps (§28.2, §28.4).
Interpreting traversal output: absent keys = unreachable; distances obey a triangle inequality; averages must be taken within a component.
Reasoning about cost: why $V$ BFS sweeps cost $\Theta(V(V+E))$, not $\Theta(V^2 \cdot E)$ (Chapter 14, §28.6).

Background

The scenario

A small online community — call it a study-group app — has fourteen users. The product team wants to answer questions a growth analyst would ask:

For a given user, how socially "central" are they — what is the largest number of friend-hops to reach anyone they can reach?
What is the worst-case separation in the whole network — the two users who are farthest apart?
Is the community one connected whole, or has it fractured into cliques that never touch?
Is the "six degrees" folklore even true here?

The friendship relation is symmetric (if Ana is friends with Bo, Bo is friends with Ana), so we model it as an undirected graph $G = (V, E)$: one vertex per user, one edge per friendship. Because the network is sparse — fourteen people are not friends with everyone — we store it as an adjacency list (the §28.1 default for traversal-heavy work).

🔗 Connection. This case study advances the social-graph anchor from Chapter 27 (where you first modeled a network) to its first real measurement. "Degrees of separation" between two people is precisely their BFS distance. In Chapter 29 we will attach weights (how strong a friendship is, or travel time in a road network) and BFS becomes Dijkstra; here every friendship counts the same, so BFS is exactly the right instrument.

The data

Here is the friendship graph as an adjacency list. Users are numbered $0$–$13$. Notice up front that users $12$ and $13$ are friends with each other and no one else — a foreshadowing we will confirm with code, not eyeballing.

friends = {
    0:  [1, 2],
    1:  [0, 3, 4],
    2:  [0, 4],
    3:  [1, 5],
    4:  [1, 2, 5, 6],
    5:  [3, 4, 7],
    6:  [4, 7, 8],
    7:  [5, 6, 9],
    8:  [6, 9],
    9:  [7, 8, 10, 11],
    10: [9],
    11: [9],
    12: [13],
    13: [12],
}

💡 Intuition: Picture the first twelve users as a loose chain of overlapping triangles winding from user $0$ down to users $10$ and $11$, with users $12$ and $13$ off in a corner holding hands and ignoring everyone. BFS will prove that mental picture — or correct it.

Phase 1: One sweep — degrees of separation from user 0

Start with the simplest question: how far is everyone from user $0$? This is one BFS, and it produces a whole column of the eventual distance matrix in a single pass. We use the chapter's bfs_distances verbatim.

from collections import deque

def bfs_distances(adj, s):
    """Shortest-path distances (in edges) from s, plus a predecessor map."""
    dist = {s: 0}
    parent = {s: None}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:          # 'in dist' doubles as 'visited'
                dist[w] = dist[v] + 1
                parent[w] = v
                queue.append(w)
    return dist, parent

dist0, parent0 = bfs_distances(friends, 0)
print(dict(sorted(dist0.items())))
print("12 reachable from 0?", 12 in dist0)
# Expected output:
# {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 5, 10: 6, 11: 6}
# 12 reachable from 0? False

Let us hand-derive that distance map ring by ring, exactly as the §28.2 trace table does, so you trust the numbers rather than the machine.

Ring (distance)	Vertices first reached at this distance	How they were reached
0	$0$	the source
1	$1, 2$	neighbors of $0$
2	$3, 4$	$3$ from $1$; $4$ from $1$ (then $2$ also sees $4$, already claimed)
3	$5, 6$	$5$ from $3$ (and from $4$); $6$ from $4$
4	$7, 8$	$7$ from $5$ (and $6$); $8$ from $6$
5	$9$	from $7$ (and $8$)
6	$10, 11$	both from $9$

Two facts jump out, and both are analysis, not mechanics:

User $0$'s eccentricity is 6. The farthest reachable users ($10$ and $11$) sit six hops away. The eccentricity of a vertex is the greatest shortest-path distance from it to any vertex it can reach: $\mathrm{ecc}(0) = \max_v \mathrm{dist}_0(v) = 6$.
Users $12$ and $13$ are absent from the map. Per the §28.2 rule, an absent key means unreachable: $\delta(0, 12) = \infty$. The network is not connected. We will quantify the fracture in Phase 3.

⚠️ Common Pitfall: A tempting one-liner for "average separation from user 0" is sum(dist0.values()) / len(dist0). But dist0 only contains the reachable users, so this averages over $12$ people and silently omits users $12$ and $13$ — for whom the distance is infinite, not zero and not "skip." Any average distance is only meaningful within a single component. We return to this in Phase 4.

🔄 Check Your Understanding User $4$ appears at distance $2$, reached from user $1$. User $4$ is also a neighbor of user $2$, who is also at distance $1$. Could user $4$'s distance have come out as anything other than $2$? Could its parent have differed?

Answer

No to the distance: user $4$ is adjacent to two distance-$1$ vertices ($1$ and $2$), so it is reached in the second ring no matter which claims it — $\mathrm{dist}[4] = 2$ is forced. Yes to the parent: had user $2$ been processed before user $1$, parent[4] would be $2$ instead of $1$. Shortest paths need not be unique (§28.2); only the distance is.

Phase 2: Reconstructing a path, and the triangle inequality

A distance is a number; a path is the evidence. The parent map from one BFS lets us reconstruct an actual shortest chain of friends — the "who introduced whom" trail — using the §28.2 reconstruction.

def shortest_path(adj, s, t):
    """Reconstruct one shortest s-t path (list of vertices), or None."""
    dist, parent = bfs_distances(adj, s)
    if t not in parent:
        return None
    path = []
    v = t
    while v is not None:
        path.append(v)
        v = parent[v]
    path.reverse()
    return path

print(shortest_path(friends, 0, 11))
print(shortest_path(friends, 0, 12))
# Expected output:
# [0, 1, 3, 5, 7, 9, 11]
# None

Walk the parent0 pointers back from $11$: $\mathrm{parent}[11] = 9$, $\mathrm{parent}[9] = 7$, $\mathrm{parent}[7] = 5$, $\mathrm{parent}[5] = 3$, $\mathrm{parent}[3] = 1$, $\mathrm{parent}[1] = 0$. Reversed, that is $0 \to 1 \to 3 \to 5 \to 7 \to 9 \to 11$ — six edges, matching $\mathrm{dist}_0(11) = 6$. (A different but equally short chain, $0 \to 2 \to 4 \to 6 \to 7 \to 9 \to 11$, also has length 6; which one BFS reports depends only on neighbor order.) And $\delta(0,12)$ has no path, so shortest_path returns None.

A property worth checking, because analysts lean on it constantly: BFS distances satisfy a triangle inequality. For any three vertices $s$, $u$, $w$ in the same component, $$\delta(s, w) \le \delta(s, u) + \delta(u, w).$$ The reason is immediate from the definition of shortest path: concatenating a shortest $s$–$u$ path with a shortest $u$–$w$ path yields some $s$–$w$ walk of length $\delta(s,u) + \delta(u,w)$, and the shortest $s$–$w$ path can only be shorter or equal. We can spot-check it on our data: $\delta(0, 9) = 5$, and $\delta(0, 7) + \delta(7, 9) = 4 + 1 = 5$, so equality holds here (because $7$ happens to lie on a shortest $0$–$9$ path). Route through a vertex past the target and the inequality goes strict.

🧩 Productive Struggle. Before reading on: find a vertex $u$ for which the triangle inequality from $s = 0$ to $w = 9$ is strict — i.e. $\delta(0, 9) < \delta(0, u) + \delta(u, 9)$. (Hint: pick a $u$ that lies beyond $9$, like $u = 10$, so reaching $u$ overshoots and coming back wastes hops.)

Phase 3: Component structure — is the network one piece?

Phase 1 already hinted the graph is disconnected (users $12$, $13$ never appeared). Let us settle it with the §28.4 component finder, which relaunches a traversal from each not-yet-seen vertex; one launch = one component.

def bfs_order(adj, s):
    visited = {s}
    queue = deque([s])
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

def connected_components(adj):
    seen = set()
    components = []
    for start in adj:
        if start not in seen:
            comp = bfs_order(adj, start)
            seen.update(comp)
            components.append(sorted(comp))
    return components

comps = connected_components(friends)
print("number of components:", len(comps))
for c in comps:
    print("size", len(c), "->", c)
# Expected output:
# number of components: 2
# size 12 -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
# size 2 -> [12, 13]

The outer loop first lands on vertex $0$, whose BFS sweeps up the entire chain $\{0, \dots, 11\}$ — a twelve-person component. It then skips $1$–$11$ (all now in seen), reaches $12$, and a sweep from $12$ finds only $\{12, 13\}$. Two components: a giant component of 12 and an isolated pair. This is the shape of almost every real social network — one dominant connected mass plus a scattering of tiny detached groups.

🔗 Connection. The recursive structure here is exactly flood fill (the paint-bucket tool floods one connected region of like-colored pixels) and the problem union-find solves more cleverly — both previewed in §28.4. In Chapter 32 you will meet union-find as a faster engine for the same "which-things-are-connected" question when building minimum spanning trees.

🔄 Check Your Understanding The whole connected_components call touches fourteen vertices and starts BFS twice. Why is its total cost still $\Theta(V + E)$ and not (number of components) $\times \Theta(V+E)$?

Answer

The seen guard ensures every vertex is traversed exactly once across all launches: a launch from an already-seen vertex never happens, and inside a launch the visited check stops re-exploration. Total work is proportional to vertices plus edges regardless of how many launches occur — $\Theta(V+E)$ overall (§28.4).

Phase 4: Diameter — the network's worst-case separation

Now the headline metric. The diameter of a graph is the greatest shortest-path distance between any two vertices — the eccentricity, maximized over all vertices. But diameter is only defined within a connected piece (across components the distance is infinite). So we compute the diameter of the giant component: run BFS from every vertex of that component, take each vertex's eccentricity, and take the max.

def component_diameter(adj, component):
    """Diameter of one connected component (list of its vertices)."""
    best = 0
    for s in component:
        dist, _ = bfs_distances(adj, s)
        ecc = max(dist.values())          # farthest reachable vertex from s
        best = max(best, ecc)
    return best

giant = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print("diameter of giant component:", component_diameter(friends, giant))
# Expected output:
# diameter of giant component: 6

The diameter is $6$, realized by user $0$ and the bottom leaves $10$ and $11$ — $\mathrm{dist}_0(10) = \mathrm{dist}_0(11) = 6$ from Phase 1. So here the diameter equals $\mathrm{ecc}(0)$, but that is luck, not law: the farthest-apart pair need not include the vertex you happened to start from, and a single BFS from a badly chosen source will silently understate the diameter. To see the trap, run BFS from user $2$ — who sits on a shortcut down to the bottom:

dist2, _ = bfs_distances(friends, 2)
print(dict(sorted(dist2.items())))
# Expected output:
# {2: 0, 0: 1, 4: 1, 1: 2, 5: 2, 6: 2, 3: 3, 7: 3, 8: 3, 9: 4, 10: 5, 11: 5}

Hand-check: from $2$, one hop to $\{0, 4\}$, two to $\{1, 5, 6\}$, three to $\{3, 7, 8\}$, four to $9$, five to $\{10, 11\}$. So $\mathrm{ecc}(2) = 5$ — strictly less than the diameter $6$. Why is $2$ closer to the bottom than $0$ is? Because $2 \to 4 \to 6 \to 7 \to 9 \to 10$ is a five-hop shortcut, whereas $0$'s shortest route to $10$ takes six hops. An analyst who computed "the diameter" from a single BFS rooted at $2$ would report $5$ and be wrong. Confirm the other end with BFS from a bottom leaf:

dist10, _ = bfs_distances(friends, 10)
print("ecc(10) =", max(dist10.values()), " dist(10,0) =", dist10[0])
# Expected output:
# ecc(10) = 6  dist(10,0) = 6

From $10$ the farthest vertex is $0$, at distance $6$ — matching $\mathrm{dist}_0(10) = 6$ (distance is symmetric in an undirected graph). The point of the code is that you do not guess the pair: component_diameter runs BFS from all $12$ sources, takes each eccentricity, and reports the true maximum $6$ — no inspection, no luck. The diameter is the rigorous version of "how spread out is this community?"

⚠️ Common Pitfall — the lying average. Suppose a dashboard reports "average degrees of separation = 2.1." If that average was taken over bfs_distances(friends, 0) (the 12 reachable users) it excludes the unreachable pair entirely, overstating cohesion. If someone "fixed" it by treating missing distances as $0$, it would understate separation wildly. The honest statistic is the average pairwise distance within each component, reported per component (or the diameter, which has no such averaging hazard). Disconnected graphs have no finite average separation, full stop.

🐛 Find the Error. An intern computes the diameter as max(bfs_distances(friends, 2)[0].values()) — one BFS from vertex $2$ — and reports $5$. Why is this wrong, and what is the cheapest correct fix?

Answer

One BFS gives the eccentricity of the chosen source, not the diameter. Here $\mathrm{ecc}(2) = 5$ but the true diameter is $6$ (realized by the pair $\{0, 10\}$, which does not include $2$). The farthest-apart pair need not pass through your start vertex, so a single BFS can understate the diameter. The correct (if not the asymptotically cheapest) fix is to run BFS from every vertex of the component and take the maximum eccentricity, as component_diameter does. (Cheaper diameter algorithms exist for trees and via "double BFS," but for a general graph the all-sources sweep is the simple correct method.)

Phase 5: How much did all this cost?

We ran BFS many times. Was that affordable? One BFS on an adjacency list is $\Theta(V + E)$ (§28.6). Computing every vertex's eccentricity runs BFS once per vertex — $V$ sweeps — so the diameter computation is $\Theta(V \cdot (V + E))$. For our toy graph ($V = 14$, $E = 17$) that is trivial. For a real social network with $V = 10^6$ and $E = 10^7$, a single BFS (~$1.1 \times 10^7$ operations) is instant, but $V$ of them is ~$10^{13}$ operations — minutes to hours, the reason production systems estimate diameter by sampling sources rather than sweeping all of them.

🔄 Check Your Understanding A teammate worries the diameter computation is $\Theta(V^2 \cdot E)$ "because it nests BFS inside a loop over vertices." Correct them.

Answer

Each BFS is $\Theta(V + E)$, and we run it $V$ times in sequence, so the costs add: $V \cdot \Theta(V + E) = \Theta(V(V+E)) = \Theta(V^2 + VE)$. There is no extra factor of $E$ or $V$ — the loop multiplies one BFS's cost by $V$, it does not square it. (This is the Chapter 14 distinction between nesting that multiplies and sequencing that adds.)

Discussion Questions

We modeled friendship as undirected. A "follows" relation (as on a microblog) is directed — Ana can follow Bo without Bo following Ana. How would the analysis change? Which metrics (reachable set, eccentricity, components) become direction-dependent, and why does "reachable from $s$" stop being symmetric?
The "six degrees of separation" claim is that random pairs in a real social graph are typically within six hops. Our giant component has diameter $6$ — yet that is the worst-case pair, not the typical one. Is a diameter of $6$ consistent with "six degrees"? Distinguish diameter (worst case) from average distance (typical case) in your answer, and explain why the average is the quantity the folklore really refers to.
BFS gave correct distances because every friendship counts as one hop. Suppose instead each edge carried a "closeness score" and you wanted the strongest chain of friendships. Why does BFS no longer apply, and which Chapter 29 algorithm would you reach for?
We computed the diameter by $V$ separate BFS sweeps. The component is almost a tree (few extra edges). Does that structure suggest a faster diameter method? (Sketch the idea behind "two BFS passes find a tree's diameter" — you need not prove it.)
The component finder reported sizes $12$ and $2$. As a product analyst, what action would each finding suggest — and how might you use the parent trails (the introduction chains) to suggest new friendships that would merge the isolated pair into the giant component?

Your Turn: Extensions

Option A (closeness centrality). For each vertex $v$ in the giant component, compute its average distance to all other reachable vertices, $\frac{1}{|C|-1}\sum_{u} \delta(v,u)$. The vertex with the smallest average is the most "central." Implement it on top of bfs_distances, hand-derive the averages for vertices $4$, $7$, and $0$, and rank them. (This is closeness centrality, a standard social-network metric.)
Option B (radius and center). The radius is the minimum eccentricity, and a center is any vertex achieving it. Modify component_diameter to also return the radius and one center of the giant component. Which user is the most central by this measure, and does it match Option A's winner?
Option C (bridge the components). Find the single edge you could add between the giant component and the $\{12, 13\}$ pair that would minimize the new graph's diameter. Argue (with a couple of BFS sweeps) which giant-component vertex is the best attachment point, and report the resulting diameter.
Option D (degrees-of-separation histogram). Run BFS from vertex $0$ and tabulate how many users sit at each distance $0, 1, 2, \dots$. This histogram is the empirical "rings" of §28.2. Confirm the ring sizes match Phase 1's trace table.

Key Takeaways

BFS is a measuring instrument, not just a traversal. One sweep from a source yields that source's entire distance column; the analyst's job is to read the table — eccentricity, diameter, and reachability all fall out of repeated BFS (§28.2, §28.4).
An absent key means infinity. A vertex missing from bfs_distances output is unreachable; that is precisely how disconnection is detected (§28.2). Never treat a missing distance as zero.
Averages live inside components. Diameter, radius, and average separation are only meaningful within a connected component; reporting them across a disconnected graph silently lies (§28.4).
Sequencing adds, nesting multiplies. $V$ BFS sweeps cost $\Theta(V(V+E))$ — the loop multiplies one $\Theta(V+E)$ BFS by $V$, it does not square it (§28.6, Chapter 14). On large graphs this is why diameter is estimated by sampling sources.
This is the social-graph anchor's first real payoff. "Degrees of separation" is BFS distance; the six-degrees folklore is an empirical statement about these distances. Chapter 29 generalizes one hop to weighted edges, turning BFS into Dijkstra.