Chapter 14: Key Takeaways

  1. Graphs are the most general structured data type, and graph neural networks are the mathematically natural architecture for relational learning. MLPs assume fixed-size unstructured inputs. CNNs assume regular grids with translation symmetry. RNNs assume sequential ordering. Graphs impose none of these constraints — they handle variable-size inputs, irregular connectivity, and arbitrary topology. Any data with pairwise relationships (user-item interactions, molecular bonds, citations, social connections, knowledge triples) is naturally a graph. The question for any relational dataset is not whether graph structure exists, but whether ignoring it costs you performance. On StreamRec, the GNN's ability to propagate collaborative signal through multi-hop paths yields a 40-60% improvement in Recall@20 over matrix factorization. On molecular property prediction, operating on the molecule's native topology rather than a lossy fingerprint yields a 5-8 point gain in accuracy.
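
The multi-hop collaborative signal can be sketched in a few lines of plain Python — a matrix factorization model scores each (user, item) pair independently, while a graph view exposes user → item → co-user → item paths. The users, items, and interactions below are hypothetical toy data:

```python
# Toy bipartite interaction graph (all names/data are hypothetical).
interactions = {            # user -> set of items they interacted with
    "u1": {"i1", "i2"},
    "u2": {"i2", "i3"},
    "u3": {"i3"},
}

def two_hop_candidates(user):
    """Items reachable via user -> item -> co-user -> item paths:
    the multi-hop signal a factorization model cannot see directly."""
    seen = interactions[user]
    co_users = {u for u, items in interactions.items()
                if u != user and items & seen}
    candidates = set()
    for u in co_users:
        candidates |= interactions[u] - seen
    return candidates

print(two_hop_candidates("u1"))  # items liked by users who share items with u1
```

A GNN layer performs exactly this kind of traversal, but with learned, differentiable aggregation instead of set operations.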

  2. The spectral and spatial perspectives on graph convolution are complementary, and understanding both is essential. The spectral view — graph Fourier transform, Laplacian eigendecomposition, frequency-domain filtering — explains why graph convolution works (it is a principled signal processing operation on graph-structured data) and why it fails at depth (iterated low-pass filtering destroys high-frequency signal, causing over-smoothing). The spatial view — message passing, neighborhood aggregation — explains how to implement it efficiently ($O(E)$ sparse matrix multiplication instead of $O(N^3)$ eigendecomposition) and how to design new architectures (by choosing message, aggregation, and update functions). The Chebyshev polynomial approximation bridges the two: a degree-$K$ spectral filter is equivalent to $K$-hop spatial aggregation, no eigendecomposition required.
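
The locality claim behind the Chebyshev bridge — a degree-$K$ polynomial filter only mixes information within $K$ hops — can be checked directly with a toy propagation sketch. The path graph and `propagate` helper below are illustrative assumptions, not any library's API:

```python
# Path graph 0-1-2-3-4; signal is a delta at node 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def propagate(signal):
    """One application of (I + A): keep the self signal, add neighbors."""
    return [signal[v] + sum(signal[u] for u in adj[v]) for v in adj]

signal = [1.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(2):            # a degree-2 polynomial filter = two hops
    signal = propagate(signal)

support = [v for v, x in enumerate(signal) if x != 0.0]
print(support)  # only nodes within 2 hops of node 0 are nonzero
```

No eigendecomposition appears anywhere: the filter is applied as repeated sparse neighbor sums, which is the entire point of the spatial reading.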

  3. The message passing framework unifies GNN architectures as different choices of message, aggregation, and update functions. GCN uses degree-normalized messages with sum aggregation. GraphSAGE uses sampled neighborhoods with mean/max aggregation and concatenation. GAT uses learned attention weights. GIN uses sum aggregation with MLP updates for maximum expressiveness. Understanding the framework lets you design new architectures by asking: what message function captures the right pairwise interactions? What aggregation preserves the right information? What update function combines self and neighborhood signals appropriately? Every new GNN paper is a variation on this theme.
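
The template can be made concrete with a minimal, weight-free sketch — a layer is literally three pluggable functions. The toy graph, scalar features, and the particular message/aggregate/update choices below are illustrative assumptions (real layers learn weight matrices at each step):

```python
# Hypothetical toy graph with scalar node features for brevity.
adj = {0: [1, 2], 1: [0], 2: [0]}
feats = {0: 1.0, 1: 2.0, 2: 4.0}

def mp_layer(feats, message, aggregate, update):
    """Generic message passing: every GNN layer is one choice of these three."""
    return {v: update(feats[v], aggregate([message(feats[u]) for u in adj[v]]))
            for v in adj}

# GraphSAGE-flavored choices: identity messages, mean aggregation,
# update that blends self and neighborhood signals.
mean = lambda xs: sum(xs) / len(xs)
out = mp_layer(feats,
               message=lambda x: x,
               aggregate=mean,
               update=lambda h, m: 0.5 * (h + m))
print(out)
```

Swapping `mean` for `sum` plus an MLP update gives a GIN-style layer; swapping in learned per-edge weights gives attention — the skeleton never changes.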

  4. GNN expressiveness is bounded by the Weisfeiler-Leman hierarchy, and this bound has practical consequences. No message passing GNN can distinguish graphs that the 1-WL isomorphism test cannot distinguish. GIN achieves this upper bound through sum aggregation (multiset-injective) and MLP updates (universal approximators). GCN and GraphSAGE with mean or max aggregation are strictly less powerful. This is not merely a theoretical curiosity — on molecular benchmarks, the expressiveness hierarchy translates directly to accuracy gaps. Moving beyond 1-WL requires fundamentally different architectures (higher-order GNNs, graph transformers) at significantly higher computational cost.
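
The 1-WL ceiling is easy to witness: two disjoint triangles and a 6-cycle are both 2-regular, so color refinement — and therefore any message passing GNN — assigns them identical color multisets. A minimal sketch (the `wl_colors` helper is a hypothetical illustration):

```python
from collections import Counter

def wl_colors(adj, rounds=3):
    """1-WL color refinement; returns the graph-level color multiset."""
    colors = {v: 0 for v in adj}                     # uniform initial colors
    for _ in range(rounds):
        # Each node's new color hashes (own color, sorted neighbor colors).
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
               for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: relabel[sig[v]] for v in adj}
    return Counter(colors.values())

two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(wl_colors(two_triangles) == wl_colors(six_cycle))  # True: 1-WL ties
```

The two graphs are not isomorphic (one is connected, one is not), yet no choice of message passing layers can separate them — that is the bound in action.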

  5. Over-smoothing is the fundamental depth limitation of GNNs, and it has a clean spectral explanation. Each message passing layer acts as a low-pass filter on graph signals, attenuating high-frequency components. As the number of layers $K$ grows, all node representations converge toward the graph-level mean and become indistinguishable. This is why GNNs are shallow (2-3 layers) compared to CNNs (50-150 layers) and transformers (12-96 layers). Residual connections, JKNet, and DropEdge delay but do not eliminate the effect. Graph transformers avoid it entirely by replacing local aggregation with global attention, at the cost of $O(N^2)$ complexity and the loss of the graph's local inductive bias. GNN depth is not a free hyperparameter — it must be balanced against information loss from smoothing.
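
A toy sketch makes the collapse visible: repeated self-plus-neighbor averaging on a path graph shrinks the spread of node features toward zero. The graph and feature values below are hypothetical:

```python
# Path graph 0-1-2-3 with deliberately varied node features.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
feats = [0.0, 1.0, 2.0, 10.0]

def smooth(x):
    """One layer of self+neighbor mean aggregation (a low-pass filter)."""
    return [(x[v] + sum(x[u] for u in adj[v])) / (1 + len(adj[v]))
            for v in adj]

def spread(x):
    return max(x) - min(x)

spreads = [spread(feats)]
for _ in range(10):
    feats = smooth(feats)
    spreads.append(spread(feats))
print(spreads[0], "->", round(spreads[-1], 3))  # spread decays toward zero
```

Because each new feature is a convex combination of old ones, the maximum can never rise and the minimum can never fall: depth buys reach at the price of distinguishability.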

  6. For production recommendation systems, the GNN architecture must match the operational requirements: scalability, inductiveness, and heterogeneity. StreamRec's 5M-user graph requires mini-batch training (GraphSAGE-style sampling, not full-batch GCN). New users and items require inductive capability (learned aggregation functions, not fixed embedding tables). Multiple node and edge types require heterogeneous message passing (R-GCN, heterogeneous transformers). LightGCN — which strips GCN down to pure neighborhood averaging of learnable embeddings — is the strongest collaborative filtering model precisely because it removes unnecessary parameters from a setting where the GNN's role is purely structural.
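
LightGCN's core operation — parameter-free neighborhood averaging of learned embeddings — fits in a few lines. The bipartite toy graph and embedding values below are hypothetical; in a real system the embeddings come from training, and the final representation averages the outputs of several such layers:

```python
# Hypothetical user-item interactions and 2-d embeddings.
edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i2")]
adj = {}
for u, i in edges:
    adj.setdefault(u, []).append(i)
    adj.setdefault(i, []).append(u)

emb = {"u1": [1.0, 0.0], "u2": [0.0, 1.0],
       "i1": [1.0, 1.0], "i2": [2.0, 0.0]}

def lightgcn_layer(emb):
    """h_v <- mean of neighbor embeddings. No weights, no nonlinearity."""
    return {v: [sum(emb[u][d] for u in adj[v]) / len(adj[v])
                for d in range(2)]
            for v in adj}

layer1 = lightgcn_layer(emb)
# LightGCN-style readout: average embeddings across layers (here: 0 and 1).
final = {v: [(emb[v][d] + layer1[v][d]) / 2 for d in range(2)] for v in emb}
print(final["u1"])
```

Everything learnable lives in the base embedding table; the graph contributes structure only — which is exactly the claim about why stripping GCN down works for collaborative filtering.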

  7. The geometric deep learning perspective reveals that CNNs, RNNs, transformers, and GNNs are all instances of the same principle: equivariant function approximation under different symmetry groups. CNNs are equivariant to grid translations. GNNs are equivariant to node permutations. Equivariant networks for 3D point clouds respect rotational symmetry. This unification is not merely aesthetic — it provides a principled design methodology: identify the data's symmetry group, then build the architecture to respect it. The fundamentals of graph signal processing, message passing, and WL expressiveness do not change with each new architecture paper. Master them once and every new development is a variation, not a revolution.
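
Permutation equivariance — the GNN entry in the symmetry table above — can be verified mechanically: relabel the nodes of a toy graph, rerun a sum-aggregation layer, and the outputs come back relabeled the same way. All names below are illustrative:

```python
# Toy graph, scalar features, and a weight-free sum-aggregation layer.
adj = {0: [1, 2], 1: [0], 2: [0]}
feats = [3.0, 1.0, 2.0]

def layer(adj, feats):
    """h_v <- h_v + sum of neighbor features."""
    return [feats[v] + sum(feats[u] for u in adj[v]) for v in sorted(adj)]

perm = {0: 2, 1: 0, 2: 1}                        # an arbitrary relabeling
perm_adj = {perm[v]: [perm[u] for u in nbrs] for v, nbrs in adj.items()}
perm_feats = [0.0] * 3
for v, x in enumerate(feats):
    perm_feats[perm[v]] = x

out = layer(adj, feats)
perm_out = layer(perm_adj, perm_feats)
equivariant = all(perm_out[perm[v]] == out[v] for v in adj)
print(equivariant)  # True: permuting the input permutes the output identically
```

The same check, with translation in place of permutation, holds for a convolution on a grid — one test, different symmetry group.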