Chapter 37: Key Takeaways

Core Concepts

  1. Graphs represent data with arbitrary relational structure. Unlike grids (images) and sequences (text), graphs have no fixed ordering, no regular spatial structure, and varying connectivity patterns. A graph $G = (V, E)$ consists of nodes with features and edges connecting them, represented algebraically by an adjacency matrix $\mathbf{A}$ and a node feature matrix $\mathbf{X}$.

  2. Message passing is the fundamental GNN operation. Each node aggregates information from its neighbors, transforms it, and updates its own representation. After $L$ layers, each node's representation incorporates information from all nodes within $L$ hops. The aggregation must be permutation-invariant (sum, mean, max) so that the model respects the unordered nature of node neighborhoods; a minimal one-step sketch appears after this list.

  3. Over-smoothing limits GNN depth. Repeated neighborhood aggregation causes node representations to converge, making nodes indistinguishable. Most practical GNNs use only 2--4 layers. Mitigation strategies include residual connections, Jumping Knowledge Networks, DropEdge, and explicit normalization; a residual-connection sketch also appears below.
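
The sketch below makes takeaways 1 and 2 concrete: it builds a small four-node graph as an adjacency matrix $\mathbf{A}$ and feature matrix $\mathbf{X}$, then performs a single mean-aggregation message-passing step in plain PyTorch. The graph, feature sizes, and the two linear transforms are arbitrary illustrative choices, not a particular published architecture.

```python
import torch
import torch.nn as nn

# A toy undirected graph with 4 nodes: edges 0-1, 0-2, 1-2, 2-3.
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 1., 0.],
                  [1., 1., 0., 1.],
                  [0., 0., 1., 0.]])   # adjacency matrix A (symmetric)
X = torch.randn(4, 8)                  # node feature matrix X: 4 nodes, 8 features each

# One message-passing step with mean aggregation (permutation-invariant):
# each node averages its neighbors' features, transforms them, and combines
# the result with a transform of its own features.
deg = A.sum(dim=1, keepdim=True)            # node degrees, shape (4, 1)
neighbor_mean = (A @ X) / deg.clamp(min=1)  # mean over each node's neighborhood

W_self = nn.Linear(8, 16, bias=False)       # transform for the node's own features
W_neigh = nn.Linear(8, 16, bias=False)      # transform for the aggregated neighbors
H = torch.relu(W_self(X) + W_neigh(neighbor_mean))  # updated representations, shape (4, 16)
```

Stacking this update $L$ times is exactly what lets information travel $L$ hops.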
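
For takeaway 3, one of the simplest mitigations is a residual connection that adds each node's previous representation back after aggregation. The layer below is a minimal sketch of that idea using the same kind of mean aggregation; the depth, dimensions, and random graph are placeholders.

```python
import torch
import torch.nn as nn

class ResidualMeanLayer(nn.Module):
    """One mean-aggregation step followed by a residual (skip) connection."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (A @ H) / deg                    # average neighbor features
        return H + torch.relu(self.lin(agg))   # skip connection keeps each node's own signal

# Even with several stacked layers, the identity path slows the collapse of
# node representations toward a common value.
layers = nn.ModuleList([ResidualMeanLayer(16) for _ in range(4)])
A = (torch.rand(10, 10) < 0.3).float()
A = ((A + A.t()) > 0).float().fill_diagonal_(0)   # random symmetric adjacency, no self-loops
H = torch.randn(10, 16)
for layer in layers:
    H = layer(A, H)
```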

Key Architectures

  1. GCN uses symmetric-normalized aggregation. The propagation rule $\mathbf{H}^{(\ell)} = \sigma(\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}\mathbf{H}^{(\ell-1)}\mathbf{W}^{(\ell)})$ adds self-loops and normalizes by degree. GCN is simple, fast, and effective, but the original formulation is transductive and weights each neighbor only by degree rather than learning per-neighbor weights. A from-scratch sketch of the propagation rule follows this list.

  2. GraphSAGE enables inductive learning and scalability. By sampling a fixed number of neighbors and concatenating self-features with aggregated neighbor features, GraphSAGE can generalize to unseen nodes and train on massive graphs via mini-batching. It is widely used in industry.

  3. GAT uses learned attention weights. Rather than treating all neighbors equally, GAT computes attention coefficients that determine each neighbor's influence. Multi-head attention stabilizes learning. GAT provides interpretable attention patterns but is more computationally expensive than GCN or GraphSAGE. A layer-level sketch of both GraphSAGE and GAT follows this list.
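
As a from-scratch illustration of the GCN rule in takeaway 1, the function below adds self-loops, applies the symmetric normalization $\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$, and then the weight matrix and nonlinearity. It works on a dense adjacency matrix for clarity; real implementations (e.g., PyTorch Geometric's GCNConv) operate on sparse edge lists.

```python
import torch
import torch.nn as nn

def gcn_layer(A, H, W):
    """One GCN propagation step: sigma(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_tilde = A + torch.eye(A.size(0))          # add self-loops
    d = A_tilde.sum(dim=1)                      # degrees of A~ (all >= 1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))        # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return torch.relu(A_hat @ H @ W)

# Toy usage: 4 nodes, 8-dimensional features, 16-dimensional hidden layer.
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 1., 0.],
                  [1., 1., 0., 1.],
                  [0., 0., 1., 0.]])
H0 = torch.randn(4, 8)
W1 = nn.init.xavier_uniform_(torch.empty(8, 16))
H1 = gcn_layer(A, H0, W1)   # shape (4, 16)
```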
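
At the library level, GraphSAGE and GAT layers (takeaways 2 and 3) are available off the shelf. The sketch below assumes PyTorch Geometric is installed; the feature sizes, head count, and toy edge list are arbitrary.

```python
import torch
from torch_geometric.nn import SAGEConv, GATConv

# edge_index lists edges as a 2 x num_edges tensor (both directions for an undirected graph).
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
x = torch.randn(4, 8)                                     # 4 nodes, 8 features each

sage = SAGEConv(in_channels=8, out_channels=16)           # mean neighbor aggregation by default
gat = GATConv(in_channels=16, out_channels=8, heads=4)    # 4 attention heads

h = torch.relu(sage(x, edge_index))   # (4, 16)
h = gat(h, edge_index)                # (4, 32): the 4 heads are concatenated by default
```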

Tasks and Applications

  1. Node classification is typically semi-supervised. A small fraction of nodes are labeled, and the GNN propagates label information through the graph structure. On citation networks like Cora, GNNs achieve 80%+ accuracy with only about 5% of nodes labeled, far outperforming MLPs that ignore graph structure. A full training-loop sketch follows this list.

  2. Graph classification requires readout functions. To make graph-level predictions (e.g., molecular property prediction), node representations must be aggregated into a single graph-level vector. Sum readout is the most expressive, preserving multiset and size information. Hierarchical pooling (DiffPool, TopKPooling) can capture structural information at multiple scales. A sum-readout sketch follows this list.

  3. Molecular property prediction is a flagship GNN application. Molecules are natural graphs (atoms as nodes, bonds as edges). GNNs outperform traditional fingerprint methods by learning task-specific representations. Edge features (bond types) significantly improve performance through MPNN-style architectures.

  4. Knowledge graphs use multi-relational GNNs. Knowledge graphs store facts as (head, relation, tail) triples. R-GCN extends GCN with relation-specific weight matrices, using basis decomposition to manage the parameter explosion when there are many relation types; a short basis-decomposition sketch follows this list.
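
The following sketch shows the standard semi-supervised node-classification loop from takeaway 1, using PyTorch Geometric's Planetoid loader for Cora and a two-layer GCN; the hyperparameters and the `data/Cora` download path are illustrative choices.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root='data/Cora', name='Cora')   # downloads Cora on first use
data = dataset[0]                                     # one graph with train/val/test masks

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    # The loss uses only the small labeled subset (the train mask); label
    # information reaches unlabeled nodes through message passing.
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

model.eval()
pred = model(data.x, data.edge_index).argmax(dim=1)
test_acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean()
```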
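
For takeaway 2, the readout step is a single pooling call. The sketch below batches two toy graphs of different sizes with PyTorch Geometric's DataLoader and applies a sum readout (global_add_pool) to get one vector per graph; the graphs, layer sizes, and two-class head are made up for illustration. It also shows the Data object and batch-tensor mini-batching mentioned under Practical Guidance.

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv, global_add_pool

# Two toy graphs of different sizes; DataLoader merges them into one disconnected batch graph.
g1 = Data(x=torch.randn(3, 8),
          edge_index=torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]]),
          y=torch.tensor([0]))
g2 = Data(x=torch.randn(5, 8),
          edge_index=torch.tensor([[0, 1, 2, 3, 3, 4], [1, 0, 3, 2, 4, 3]]),
          y=torch.tensor([1]))
loader = DataLoader([g1, g2], batch_size=2)

conv = GCNConv(8, 16)
classifier = torch.nn.Linear(16, 2)

for batch in loader:
    h = F.relu(conv(batch.x, batch.edge_index))   # node-level representations
    hg = global_add_pool(h, batch.batch)          # sum readout: one vector per graph, shape (2, 16)
    logits = classifier(hg)                       # graph-level predictions
```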
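
Takeaway 4's basis decomposition fits in a few lines: each relation's weight matrix is a learned mixture of a small set of shared basis matrices, so parameters scale with the number of bases rather than the number of relations. The tensor shapes below are arbitrary, and this is only the weight-construction step, not a full R-GCN layer.

```python
import torch
import torch.nn as nn

num_relations, num_bases, d_in, d_out = 50, 4, 64, 64

# Shared basis matrices V_b and per-relation mixing coefficients a_{rb}.
bases = nn.Parameter(torch.randn(num_bases, d_in, d_out) * 0.01)
coeffs = nn.Parameter(torch.randn(num_relations, num_bases))

# Each relation's weight matrix is W_r = sum_b a_{rb} V_b.
W = torch.einsum('rb,bio->rio', coeffs, bases)   # shape (num_relations, d_in, d_out)

h = torch.randn(10, d_in)                        # 10 source-node representations
rel = torch.randint(0, num_relations, (10,))     # relation type of each message
messages = torch.bmm(h.unsqueeze(1), W[rel]).squeeze(1)   # relation-specific transforms
```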

Practical Guidance

  1. Use PyTorch Geometric for GNN development. PyG provides efficient message passing, 40+ built-in GNN layers, standard datasets, and elegant mini-batching of variable-size graphs through the batch tensor. The Data object is the fundamental data structure.

  2. Feature engineering matters more than architecture choice. In molecular graphs, careful atom and bond featurization often provides larger gains than switching between GCN, GAT, and MPNN. Similarly, for citation networks, the quality of node features significantly impacts performance.

  3. Always compare against simple baselines. An MLP on node features (ignoring graph structure) and Morgan fingerprints (for molecules) are essential baselines. If these perform competitively, the graph structure may not be informative for your task. A fingerprint-baseline sketch follows this list.

  4. GIN maximizes expressiveness among message-passing GNNs. By using sum aggregation and an injective MLP update, GIN matches the power of the 1-WL graph isomorphism test, the theoretical upper bound for standard message-passing architectures. A GIN layer sketch follows this list.

  5. Use scaffold splits for molecular datasets. Random train/test splits overestimate generalization by allowing structurally similar molecules to appear in both sets. Scaffold splits group molecules by core (Bemis-Murcko) structure, providing realistic estimates of performance on novel chemistry. A scaffold-split sketch follows this list.
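
As a concrete version of takeaway 3 for molecules, the sketch below computes Morgan fingerprints with RDKit and fits a scikit-learn random forest; the SMILES strings, labels, fingerprint radius, and bit length are placeholder choices. If a baseline like this is close to your GNN, the learned graph representation is adding little.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

smiles = ['CCO', 'c1ccccc1', 'CC(=O)O', 'CCN']   # placeholder molecules
labels = [0, 1, 0, 1]                            # placeholder binary property

def morgan_fp(smi, radius=2, n_bits=2048):
    """Morgan (circular) fingerprint as a numpy bit vector."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X = np.stack([morgan_fp(s) for s in smiles])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```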
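
For takeaway 4, PyTorch Geometric's GINConv wraps the sum-aggregate-then-MLP update directly; the MLP widths and the toy graph below are arbitrary.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv

# GIN update: h_v = MLP((1 + eps) * h_v + sum of neighbor features).
mlp = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16))
gin = GINConv(mlp, train_eps=True)   # learnable epsilon; sum aggregation is the default

x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
h = gin(x, edge_index)   # (4, 16)
```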
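
A minimal scaffold split along the lines of takeaway 5 can be built from RDKit's Bemis-Murcko scaffold utility; the grouping and fill order below follow the common largest-groups-to-train heuristic and are one reasonable choice, not a canonical implementation.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.2):
    """Group molecules by Bemis-Murcko scaffold and assign whole groups to train or test."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(i)

    # Fill the training set with the largest scaffold groups first; the remaining,
    # rarer scaffolds form the test set, so train and test share no core structures.
    train_idx, test_idx = [], []
    train_cutoff = int((1.0 - test_frac) * len(smiles_list))
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train_idx) + len(group) <= train_cutoff:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx
```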