Chapter 14: Further Reading
Essential Sources
1. Thomas N. Kipf and Max Welling, "Semi-Supervised Classification with Graph Convolutional Networks" (ICLR, 2017)
The paper that made graph neural networks practical and accessible. Kipf and Welling derive the GCN layer as a first-order Chebyshev approximation of spectral graph convolution, introduce the renormalization trick (self-loops + symmetric normalization), and demonstrate strong semi-supervised node classification on citation networks with remarkably few labeled examples (20 per class on Cora). The paper is concise (10 pages) and the derivation — from spectral theory through polynomial approximation to the final spatial formula — is a model of mathematical exposition.
Reading guidance: Section 2 contains the spectral-to-spatial derivation. Follow it step by step: start with the spectral convolution $g_\theta \star x = \mathbf{U} g_\theta(\mathbf{\Lambda}) \mathbf{U}^\top x$, apply the Chebyshev approximation, set $K = 1$, apply the renormalization trick, and arrive at $\hat{\mathbf{A}} \mathbf{X} \mathbf{W}$. This derivation is the intellectual core of the entire GCN literature. Section 3 presents the semi-supervised learning setup, which is unusual: the model is trained on a tiny fraction of labels but has access to the full graph structure (transductive). The results table compares GCN to the spectral method it subsumes (ChebNet), to random-walk embedding baselines such as DeepWalk, and to the MLP baseline it dominates. For the spectral foundations, read the companion paper: Defferrard, Bresson, and Vandergheynst, "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering" (NeurIPS, 2016), which introduces the Chebyshev polynomial framework. For a comprehensive survey connecting spectral and spatial perspectives, see Wu et al., "A Comprehensive Survey on Graph Neural Networks" (IEEE Transactions on Neural Networks and Learning Systems, 2021).
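The end point of the derivation is compact enough to sketch in code. The following is a minimal NumPy version of one GCN layer with the renormalization trick, using dense matrices for clarity (production implementations use sparse operations); the function name and toy graph are my own:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer, relu(A_hat @ X @ W), where
    A_hat = D~^{-1/2} (A + I) D~^{-1/2} is the renormalized adjacency."""
    A_tilde = A + np.eye(A.shape[0])                 # renormalization trick: add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # degrees of A + I
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # symmetric normalization
    return np.maximum(A_hat @ X @ W, 0.0)            # ReLU

# toy example: a 3-node path graph, 2 features per node, identity weights
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
W = np.eye(2)
H = gcn_layer(A, X, W)   # each row of H mixes a node's features with its neighbors'
```

Note that the entire spectral machinery has vanished: what remains is one sparse matrix product per layer, which is why the spatial view dominated subsequent work.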
2. William L. Hamilton, Rex Ying, and Jure Leskovec, "Inductive Representation Learning on Large Graphs" (NeurIPS, 2017)
The GraphSAGE paper addresses two critical limitations of GCN: its poor scalability and its purely transductive nature. Hamilton et al. introduce neighborhood sampling (enabling mini-batch training on graphs with millions of nodes) and an inductive learning framework (enabling generalization to unseen nodes). The paper is important not just for the architecture but for the evaluation methodology: it demonstrates that inductive graph learning is viable and practical, opening the door to production GNN systems.
Reading guidance: Section 2 defines the GraphSAGE algorithm: the forward propagation, the three aggregation functions (mean, LSTM, pooling), and the sampling procedure. The key insight is in Section 2.2: instead of aggregating all neighbors, sample a fixed-size subset at each layer, creating a computation tree with bounded size. Algorithm 2 (mini-batch forward propagation) is the practical workhorse — study it carefully, as it is the foundation of all large-scale GNN training. Section 4's experiments on the Reddit and PPI datasets demonstrate that GraphSAGE scales to graphs with 200K+ nodes while maintaining competitive accuracy. For the production implementation of this idea at scale, see Ying et al., "Graph Convolutional Neural Networks for Web-Scale Recommender Systems" (KDD, 2018), which describes PinSage — Pinterest's billion-node GNN deployed in production. For heterogeneous extensions, see Schlichtkrull et al., "Modeling Relational Data with Graph Convolutional Networks" (ESWC, 2018), which introduces Relational GCN (R-GCN) for multi-relational graphs.
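The sampling idea can be sketched in a few lines. This is a simplified mean-aggregator variant with fixed-size neighbor sampling (the paper's full algorithm adds concatenation and L2 normalization); the function names and toy graph are my own:

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_neighbors(adj, v, k):
    """Fixed-size neighbor sample (with replacement when deg(v) < k).
    Bounding the sample size bounds the per-node computation tree."""
    nbrs = adj[v]
    return rng.choice(nbrs, size=k, replace=len(nbrs) < k)

def sage_mean_layer(adj, H, W_self, W_nbr, k):
    """Simplified GraphSAGE-mean step:
    h_v <- relu(W_self h_v + W_nbr * mean of sampled neighbor features)."""
    out = np.empty((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        nbr_mean = H[sample_neighbors(adj, v, k)].mean(axis=0)
        out[v] = np.maximum(H[v] @ W_self + nbr_mean @ W_nbr, 0.0)
    return out

# toy 4-node graph, sampling k = 2 neighbors per node
adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
H = np.random.default_rng(0).standard_normal((4, 3))
H1 = sage_mean_layer(adj, H, np.eye(3), np.eye(3), k=2)
```

With $L$ layers and $k$ samples per layer, each target node touches at most $k^L$ nodes regardless of graph size, which is what makes mini-batch training on million-node graphs feasible.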
3. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio, "Graph Attention Networks" (ICLR, 2018)
The GAT paper introduces learned, data-dependent attention weights to graph neural networks, replacing GCN's fixed degree-based normalization with adaptive neighbor importance. The architecture draws directly from the attention mechanism (Chapter 10 of this book) and adapts it to graph neighborhoods. GAT achieves state-of-the-art results on citation benchmarks while providing interpretable attention distributions that reveal which neighbors are most informative.
Reading guidance: Section 2.1 defines the attention mechanism: the linear transformation, the shared attention vector $\mathbf{a}$, the LeakyReLU activation, and the softmax normalization. Compare this carefully to the transformer's scaled dot-product attention — the similarities (multi-head, learned weights) and differences (additive vs. multiplicative, local vs. global) are instructive. Section 2.2 introduces multi-head attention with concatenation (intermediate layers) and averaging (output layer). The experiments in Section 3 show modest improvements over GCN on citation benchmarks but more significant gains on the inductive PPI dataset, suggesting that attention matters most when neighbor quality is variable. For the follow-up work that significantly improves on the original GAT, see Brody, Alon, and Yahav, "How Attentive are Graph Attention Networks?" (ICLR, 2022), which shows that the original GAT's additive attention computes a static ranking of neighbors (independent of the query node) and proposes GATv2 with a truly dynamic attention mechanism. For the connection between graph attention and transformers, see Dwivedi and Bresson, "A Generalization of Transformer Networks to Graphs" (AAAI Workshop, 2021).
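The attention computation of Section 2.1 can be sketched for a single head; variable names and the toy inputs are my own (the paper also includes node $i$ itself in the softmax via a self-loop, omitted here for brevity):

```python
import numpy as np

def gat_attention(h_i, neighbor_feats, W, a, slope=0.2):
    """Single-head GAT attention (sketch):
    e_ij = LeakyReLU(a^T [W h_i || W h_j]), normalized by softmax over
    the neighborhood, then a weighted sum of transformed neighbors."""
    Wh_i = W @ h_i
    Wh_js = [W @ h_j for h_j in neighbor_feats]
    e = np.array([a @ np.concatenate([Wh_i, Wh_j]) for Wh_j in Wh_js])
    e = np.where(e > 0, e, slope * e)                   # LeakyReLU, slope 0.2
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                # softmax over neighbors
    h_out = sum(al * Wh_j for al, Wh_j in zip(alpha, Wh_js))
    return h_out, alpha

rng = np.random.default_rng(1)
h_i = rng.standard_normal(4)
neighbors = [rng.standard_normal(4) for _ in range(3)]
W = rng.standard_normal((4, 4))
a = rng.standard_normal(8)            # attends over the concatenation [Wh_i || Wh_j]
h_out, alpha = gat_attention(h_i, neighbors, W, a)
```

GATv2's fix is visible at this level of detail: it moves the nonlinearity inside the score, $a^\top \text{LeakyReLU}(\mathbf{W}[h_i \| h_j])$, so the neighbor ranking can depend on the query node $i$ rather than being static.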
4. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka, "How Powerful are Graph Neural Networks?" (ICLR, 2019)
The most important theoretical paper in the GNN literature. Xu et al. establish the connection between message passing GNNs and the Weisfeiler-Leman graph isomorphism test, proving that (1) no message passing GNN can exceed the 1-WL test in distinguishing non-isomorphic graphs, and (2) a GNN with sum aggregation and an injective update function (MLP) achieves this upper bound. The paper then introduces the Graph Isomorphism Network (GIN) as the maximally powerful message passing architecture.
Reading guidance: Lemma 2 is the key negative result: it establishes the 1-WL upper bound on MPNN expressiveness. Lemma 5 and Theorem 3 together give the positive result: sum aggregation is multiset-injective, and composing it with an MLP produces an injective update, achieving 1-WL power. The proof strategy — reducing multiset distinction to the injectivity of the hash function in WL — is elegant and worth following in detail. The experiments demonstrate that GIN outperforms GCN, GraphSAGE, and other architectures on graph classification benchmarks, validating the theory. For extensions beyond 1-WL, see Morris et al., "Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks" (AAAI, 2019), which introduces $k$-WL GNNs. For a comprehensive survey of GNN expressiveness results, see Morris et al., "Weisfeiler and Leman Go Machine Learning: The Story So Far" (JMLR, 2023). For practical implications of expressiveness in molecular property prediction, see Corso et al., "Principal Neighbourhood Aggregation for Graph Nets" (NeurIPS, 2020).
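To make the 1-WL connection concrete, here is a minimal color-refinement sketch (my own implementation, not from the paper); the final color histogram is the graph invariant that a maximally powerful MPNN can match:

```python
def wl_histogram(adj, iters=3):
    """1-dimensional Weisfeiler-Leman color refinement: repeatedly relabel
    each node by (its color, the sorted multiset of its neighbors' colors),
    then return the sorted color histogram as a graph invariant."""
    colors = {v: 0 for v in adj}                       # uniform initial coloring
    for _ in range(iters):
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
               for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: relabel[sig[v]] for v in adj}     # injective "hash" of signatures
    return sorted(colors.values())

# 1-WL distinguishes a 4-node path from a 4-node star:
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

# ...but fails on two triangles vs. a 6-cycle (both 2-regular, so every
# node keeps the same color forever) — and hence so does every MPNN:
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

The two-triangles/hexagon pair is the standard illustration of the upper bound: no choice of aggregator or update function can separate graphs that 1-WL colors identically.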
5. Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković, "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges" (arXiv, 2021; based on ICLR 2021 keynote)
The definitive survey of geometric deep learning, unifying CNNs, RNNs, GNNs, and equivariant networks under the principle of symmetry. Bronstein et al. argue that all successful neural network architectures can be understood as equivariant function approximators for specific symmetry groups: translation equivariance (CNNs), permutation equivariance (GNNs), rotation equivariance (SchNet, EGNN). This perspective provides both a retrospective explanation for why these architectures work and a constructive methodology for designing new ones.
Reading guidance: This is a 150-page monograph, not a short paper. The "5G" framework (Grids, Groups, Graphs, Gauges, Geodesics) provides the organizational structure. For this chapter's purposes, focus on: Section 3 (group theory and equivariance — the mathematical foundation), Section 5 (GNNs as permutation-equivariant architectures), and Section 5.5 (the WL hierarchy and expressiveness). The paper's greatest contribution is conceptual: it shows that the progression from MLPs to CNNs to GNNs is not a series of unrelated inventions but a systematic exploration of equivariance under different symmetry groups. For a more accessible introduction to the same ideas, see the accompanying lecture series (available online). For equivariant networks on 3D molecular data, see Satorras, Hoogeboom, and Welling, "E(n) Equivariant Graph Neural Networks" (ICML, 2021), which introduces EGNN — a simple yet powerful architecture that respects rotational and translational symmetry. For graph transformers that go beyond local message passing, see Ying et al., "Do Transformers Really Perform Bad for Graph Representation?" (NeurIPS, 2021), which introduces Graphormer, and Rampasek et al., "Recipe for a General, Powerful, Scalable Graph Transformer" (NeurIPS, 2022), which provides a systematic framework for combining GNNs with transformers.
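The permutation-equivariance claim at the heart of the monograph's treatment of GNNs can be checked numerically. A sketch with a generic message-passing layer $f(\mathbf{A}, \mathbf{X}) = \mathrm{relu}((\mathbf{A} + \mathbf{I})\mathbf{X}\mathbf{W})$ (my own toy setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def mp_layer(A, X, W):
    """A minimal message-passing layer: f(A, X) = relu((A + I) X W)."""
    return np.maximum((A + np.eye(len(A))) @ X @ W, 0.0)

# Permutation equivariance: relabeling nodes and then applying the layer
# equals applying the layer and then relabeling: f(P A P^T, P X) = P f(A, X).
n, d = 5, 3
A = rng.integers(0, 2, (n, n))
A = np.triu(A, 1); A = A + A.T                 # random undirected graph
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, d))
P = np.eye(n)[rng.permutation(n)]              # random permutation matrix

lhs = mp_layer(P @ A @ P.T, P @ X, W)
rhs = P @ mp_layer(A, X, W)                    # equal up to floating point
```

The identity holds because $(P A P^\top + I) P X = P (A + I) X$ and the elementwise ReLU commutes with row permutation; the same argument applies to any layer built from neighborhood aggregation and shared weights.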