Chapter 14: Quiz

Test your understanding of graph neural networks. Answers follow each question.


Question 1

What three properties of graphs prevent the direct application of MLPs, CNNs, and RNNs?

Answer

1. **Variable size.** Graphs have different numbers of nodes and edges, but MLPs require fixed-size inputs. There is no natural way to pad a graph to a fixed dimension.
2. **No canonical ordering.** Graph nodes have no natural order (unlike sequence positions or pixel coordinates), so any representation must be **permutation invariant** (for graph-level tasks) or **permutation equivariant** (for node-level tasks).
3. **No regular connectivity.** CNNs assume a fixed, regular neighborhood structure (e.g., a $3 \times 3$ grid). Graph nodes have variable-degree neighborhoods with no spatial layout.

These are not engineering problems — they are fundamental mathematical constraints that require architectures designed for graph-structured data.
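The ordering constraint can be checked numerically. A minimal numpy sketch (the graph, features, and permutation below are invented for illustration): sum aggregation commutes with any node relabeling (equivariance), and a sum readout ignores the relabeling entirely (invariance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 nodes, feature dim 3.
X = rng.standard_normal((4, 3))
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)

P = np.eye(4)[[2, 0, 3, 1]]           # a permutation matrix (relabeling)

# Sum aggregation is permutation equivariant: relabeling the nodes
# permutes the output rows the same way.
agg = lambda A, X: A @ X
out1 = P @ agg(A, X)                  # aggregate, then permute
out2 = agg(P @ A @ P.T, P @ X)        # permute the graph, then aggregate
assert np.allclose(out1, out2)

# A graph-level readout by summation is permutation invariant.
assert np.allclose(X.sum(axis=0), (P @ X).sum(axis=0))
```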

Question 2

What is the graph Laplacian $\mathbf{L}$, and what does its quadratic form $\mathbf{f}^\top \mathbf{L} \mathbf{f}$ measure?

Answer The graph Laplacian is $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is the diagonal degree matrix and $\mathbf{A}$ is the adjacency matrix. Its quadratic form for a signal $\mathbf{f}$ on the nodes is:

$$\mathbf{f}^\top \mathbf{L} \mathbf{f} = \sum_{(v_i, v_j) \in \mathcal{E}} (f_i - f_j)^2$$

This measures the **total variation** of the signal on the graph — how much the signal changes across edges. A constant signal has zero total variation (it is "smooth" on the graph); a signal that oscillates sharply across edges has high total variation. The Laplacian eigenvectors ordered by eigenvalue provide a notion of graph frequency: small eigenvalues correspond to smooth (low-frequency) signals, and large eigenvalues correspond to oscillatory (high-frequency) signals.
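The identity is easy to verify by hand. A short numpy check on a made-up path graph $0\text{-}1\text{-}2\text{-}3$:

```python
import numpy as np

# Path graph on 4 nodes: 0-1-2-3 (toy example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A

f = np.array([1.0, 2.0, 4.0, 4.0])

# Quadratic form equals the sum of squared differences across edges.
quad = f @ L @ f
edges = [(0, 1), (1, 2), (2, 3)]
tv = sum((f[i] - f[j]) ** 2 for i, j in edges)
assert np.isclose(quad, tv)          # (1-2)^2 + (2-4)^2 + (4-4)^2 = 5

# A constant signal is maximally smooth: zero total variation.
assert np.isclose(np.ones(4) @ L @ np.ones(4), 0.0)
```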

Question 3

Write the GCN layer update equation and explain each component.

Answer

$$\mathbf{H}^{(\ell+1)} = \sigma\!\left(\hat{\mathbf{A}} \mathbf{H}^{(\ell)} \mathbf{W}^{(\ell)}\right)$$

- $\mathbf{H}^{(\ell)} \in \mathbb{R}^{N \times d_\ell}$: node representation matrix at layer $\ell$ (each row is a node's feature vector; $\mathbf{H}^{(0)} = \mathbf{X}$, the input features).
- $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$: symmetrically normalized adjacency matrix with self-loops ($\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$). This performs neighborhood aggregation with degree-based normalization.
- $\mathbf{W}^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell+1}}$: learnable weight matrix that transforms the aggregated features.
- $\sigma$: nonlinear activation function (typically ReLU).

For a single node $v_i$:

$$\mathbf{h}_i^{(\ell+1)} = \sigma\!\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{\tilde{d}_i \tilde{d}_j}} \mathbf{h}_j^{(\ell)} \mathbf{W}^{(\ell)}\right)$$

Each node's new representation is a normalized sum of its neighbors' (and its own) transformed features.
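The update above fits in a few lines of numpy. A from-scratch sketch (the graph, feature dimensions, and random weights are made up for illustration):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^{-1/2} (A+I) D^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.standard_normal((3, 4))     # N=3 nodes, d=4 input features
W = rng.standard_normal((4, 2))     # project to d'=2
H_next = gcn_layer(A, H, W)
assert H_next.shape == (3, 2)
assert (H_next >= 0).all()           # ReLU output is nonnegative
```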

Question 4

How does the GCN layer relate to spectral graph convolution?

Answer The GCN layer is a **first-order Chebyshev approximation** of spectral graph convolution. Spectral graph convolution applies a learnable filter $g_\theta(\mathbf{\Lambda})$ to a signal in the graph Fourier domain: $\mathbf{y} = \mathbf{U} g_\theta(\mathbf{\Lambda}) \mathbf{U}^\top \mathbf{f}$. Computing this requires the $O(N^3)$ eigendecomposition of the Laplacian. The Chebyshev approximation replaces $g_\theta$ with a polynomial of degree $K$, which is $K$-localized in the spatial domain. Setting $K = 1$ and further simplifying (constraining two parameters to one, applying the renormalization trick with self-loops) yields the GCN update: $\hat{\mathbf{A}} \mathbf{H} \mathbf{W}$. This bypasses the eigendecomposition entirely — the computation is a sparse matrix multiplication in $O(E)$ time — while retaining the spectral interpretation as a first-order low-pass filter on the graph.
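The equivalence between the spectral filter and its degree-1 polynomial can be verified directly: for $g_\theta(\lambda) = \theta_0 + \theta_1 \lambda$, applying $\mathbf{U} g_\theta(\mathbf{\Lambda}) \mathbf{U}^\top$ is identical to applying $\theta_0 \mathbf{I} + \theta_1 \mathbf{L}$ in the node domain. A numpy check on a random made-up graph:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random symmetric adjacency of a small undirected graph.
A = (rng.random((5, 5)) < 0.5).astype(float)
A = np.triu(A, 1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)           # graph Fourier basis
f = rng.standard_normal(5)
theta0, theta1 = 0.7, -0.3

# Spectral route: transform, filter each frequency, transform back.
y_spectral = U @ ((theta0 + theta1 * lam) * (U.T @ f))

# Spatial route: the same degree-1 polynomial applied directly --
# no eigendecomposition needed, just a sparse multiply.
y_spatial = theta0 * f + theta1 * (L @ f)

assert np.allclose(y_spectral, y_spatial)
```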

Question 5

Describe the three operations of the message passing framework.

Answer

1. **Message construction.** For each edge $(v_j, v_i)$, compute a message $\mathbf{m}_{j \to i} = \phi_{\text{msg}}(\mathbf{h}_i, \mathbf{h}_j, \mathbf{e}_{ij})$ using the source node features, target node features, and edge features. The message function $\phi_{\text{msg}}$ is learnable (e.g., a linear transformation, an MLP, or an attention-weighted projection).
2. **Aggregation.** Aggregate all incoming messages for each node: $\mathbf{m}_i = \bigoplus_{j \in \mathcal{N}(i)} \mathbf{m}_{j \to i}$. The aggregation $\bigoplus$ must be **permutation invariant** (sum, mean, max) because neighbors have no canonical ordering.
3. **Update.** Compute the new node representation: $\mathbf{h}_i^{(\ell+1)} = \phi_{\text{upd}}(\mathbf{h}_i^{(\ell)}, \mathbf{m}_i)$. The update function $\phi_{\text{upd}}$ combines the node's own representation with the aggregated neighborhood information.

Different GNN architectures (GCN, GraphSAGE, GAT, GIN) are different choices of $\phi_{\text{msg}}$, $\bigoplus$, and $\phi_{\text{upd}}$.
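The three operations can be sketched concretely. A minimal numpy version with a linear $\phi_{\text{msg}}$, sum aggregation, and a linear-plus-tanh $\phi_{\text{upd}}$ (all shapes and weights are invented for illustration):

```python
import numpy as np

def message_passing(edges, H, W_msg, W_upd):
    """One round of message passing with linear phi_msg / phi_upd (a sketch)."""
    N = H.shape[0]
    M = np.zeros((N, W_msg.shape[1]))
    for j, i in edges:                      # 1. message on each edge j -> i
        M[i] += H[j] @ W_msg                # 2. sum aggregation (perm. invariant)
    # 3. update: combine own state with aggregated messages
    return np.tanh(np.concatenate([H, M], axis=1) @ W_upd)

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 4))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]    # undirected: both directions listed
W_msg = rng.standard_normal((4, 4))
W_upd = rng.standard_normal((8, 4))
H_new = message_passing(edges, H, W_msg, W_upd)
assert H_new.shape == (3, 4)
```

Swapping the message function, the aggregator, or the update recovers GCN, GraphSAGE, GAT, or GIN from this same skeleton.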

Question 6

What are the two key innovations of GraphSAGE over GCN?

Answer

1. **Neighborhood sampling.** Instead of aggregating all neighbors (which can require materializing the entire graph for high-degree nodes), GraphSAGE samples a fixed-size subset of neighbors at each layer. For $L$ layers with sample size $k$, the computation graph per target node has at most $k^L$ nodes — a controllable, fixed budget. This enables mini-batch training on graphs with millions of nodes.
2. **Inductive learning.** GCN in its original form is transductive: it learns fixed embeddings for the specific nodes in the training graph. GraphSAGE learns an aggregation *function* that can be applied to any node — including new nodes unseen during training. This is critical for dynamic graphs where new nodes (users, items, molecules) continuously appear.

Additionally, GraphSAGE explicitly **concatenates** the node's own representation with the aggregated neighborhood (rather than combining them symmetrically as GCN does), preserving the node's identity more strongly.
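Sampling and concatenation can be sketched together. A toy numpy version of a GraphSAGE-style mean aggregator (the adjacency lists, sample size, and weights are made up; a real implementation would batch this):

```python
import numpy as np

def sage_layer(neighbors, H, W, k=2, rng=None):
    """GraphSAGE-style mean aggregator with neighbor sampling (a sketch)."""
    rng = rng or np.random.default_rng(0)
    out = np.zeros((H.shape[0], W.shape[1]))
    for i, nbrs in neighbors.items():
        # Fixed-budget sampling: at most k neighbors per node.
        sampled = rng.choice(nbrs, size=min(k, len(nbrs)), replace=False)
        agg = H[sampled].mean(axis=0)
        # Concatenate self and neighborhood: identity is preserved explicitly.
        out[i] = np.maximum(np.concatenate([H[i], agg]) @ W, 0.0)
    return out

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))
neighbors = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
W = rng.standard_normal((6, 5))               # (2 * d_in) x d_out
H_new = sage_layer(neighbors, H, W)
assert H_new.shape == (4, 5)
```

Because the layer only needs a node's sampled neighborhood and the learned `W`, it can embed a node that was never seen during training — the inductive property.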

Question 7

How does GAT compute attention weights, and how does this compare to transformer attention?

Answer GAT uses **additive attention**: for each edge $(v_j, v_i)$, it computes a score $e_{ij} = \text{LeakyReLU}(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \| \mathbf{W}\mathbf{h}_j])$, normalizes via softmax over neighbors $\alpha_{ij} = \text{softmax}_j(e_{ij})$, and aggregates: $\mathbf{h}_i' = \sigma(\sum_j \alpha_{ij} \mathbf{W}\mathbf{h}_j)$. **Transformer attention** (Chapter 10) uses **scaled dot-product**: $\alpha_{ij} = \text{softmax}_j(\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d_k})$, where $\mathbf{q}_i = \mathbf{W}_Q \mathbf{h}_i$ and $\mathbf{k}_j = \mathbf{W}_K \mathbf{h}_j$. Key differences: (1) **Scope** — transformer attention is global (all positions attend to all), while GAT attention is local (only neighbors attend). (2) **Score function** — GAT uses an additive scoring function with a single attention vector $\mathbf{a}$, while transformers use separate query/key projections with dot-product scoring. (3) **Inductive bias** — the graph provides a sparsity mask that the transformer must learn or have engineered. A transformer with a full attention mask on graph nodes is equivalent to a GAT on a complete graph.
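The additive scoring and neighborhood softmax look like this in a single-head numpy sketch (graph, weights, and attention vector are invented for illustration):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(neighbors, H, W, a):
    """Single-head GAT layer with additive attention (a minimal sketch)."""
    Z = H @ W                                          # shared projection
    out = np.zeros_like(Z)
    for i, nbrs in neighbors.items():
        js = nbrs + [i]                                # attend to self too
        scores = np.array([leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
                           for j in js])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                           # softmax over neighbors
        out[i] = np.tanh(sum(w * Z[j] for w, j in zip(alpha, js)))
    return out

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 2))
a = rng.standard_normal(4)                             # attention vector, 2*d'
neighbors = {0: [1, 2], 1: [0], 2: [0]}
H_new = gat_layer(neighbors, H, W, a)
assert H_new.shape == (3, 2)
```

Replacing the `for i, nbrs in neighbors.items()` loop with attention over *all* pairs would turn this into transformer-style attention on a complete graph, as the answer notes.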

Question 8

What is the Weisfeiler-Leman (WL) test, and what does it tell us about GNN expressiveness?

Answer The **1-WL test** is a classical graph isomorphism algorithm that iteratively refines node labels: at each round, each node's label is updated to a hash of its current label and the **multiset** of its neighbors' labels. After convergence, the multiset of all node labels forms a graph fingerprint. If two graphs produce different fingerprints, they are provably non-isomorphic. Xu et al. (2019) proved that **message passing GNNs are at most as powerful as the 1-WL test**: no MPNN can distinguish graphs that 1-WL cannot. This is a fundamental **upper bound** on the expressiveness of the message passing framework. The bound is **tight**: GIN (Graph Isomorphism Network) achieves the 1-WL expressiveness through sum aggregation (which is injective on multisets) and MLP updates (which are universal function approximators). This means GNNs with mean or max aggregation are strictly less powerful than GIN, and no message passing architecture can exceed 1-WL without fundamentally departing from the MPNN framework.
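The refinement loop fits in a few lines. A sketch of 1-WL color refinement, including the classic failure case (two triangles vs. a 6-cycle, both 2-regular, which 1-WL — and hence any MPNN — cannot tell apart):

```python
def wl_fingerprint(neighbors, rounds=3):
    """1-WL color refinement: returns a multiset fingerprint of the graph."""
    colors = {v: 0 for v in neighbors}                 # uniform initial labels
    for _ in range(rounds):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in nbrs))))
                  for v, nbrs in neighbors.items()}
    return sorted(colors.values())

# A triangle and a 3-node path are distinguished immediately (degrees differ)...
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path     = {0: [1], 1: [0, 2], 2: [1]}
assert wl_fingerprint(triangle) != wl_fingerprint(path)

# ...but two triangles vs. a 6-cycle fool 1-WL: every node always sees
# two neighbors with its own color, so the fingerprints coincide.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
assert wl_fingerprint(two_triangles) == wl_fingerprint(six_cycle)
```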

Question 9

Why is sum aggregation more expressive than mean or max aggregation?

Answer Expressiveness here means the ability to distinguish different **multisets** of neighbor features. Consider two multisets: $S_1 = \{a, a, b\}$ and $S_2 = \{a, b, b\}$.

- **Sum:** $\text{sum}(S_1) = 2a + b \neq a + 2b = \text{sum}(S_2)$ — sum can distinguish them.
- **Mean:** If $S_1' = \{a, a, b, b\}$ and $S_2' = \{a, b\}$, then $\text{mean}(S_1') = \text{mean}(S_2') = (a + b)/2$ — mean discards multiplicity, so it cannot distinguish multisets whose elements appear in the same proportions.
- **Max:** $\text{max}(\{a, a, a\}) = a = \text{max}(\{a\})$ — max discards all information about multiplicity and can only see the "most extreme" elements.

Sum preserves the full multiset structure: given $\text{sum}(S)$ and a universal approximator (MLP) downstream, you can in principle recover any function of the multiset. Mean and max lose information that cannot be recovered. This is why GIN (sum + MLP) achieves 1-WL expressiveness while GCN (normalized mean) and max-pool GNNs do not.
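A two-line numpy demonstration with concrete vectors (the multisets $\{a,a,b,b\}$ vs. $\{a,b\}$ are chosen so their proportions coincide):

```python
import numpy as np

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

S1 = [a, a, b, b]          # multiset {a, a, b, b}
S2 = [a, b]                # multiset {a, b}

# Mean and max collapse the two multisets to identical summaries...
assert np.allclose(np.mean(S1, axis=0), np.mean(S2, axis=0))
assert np.allclose(np.max(S1, axis=0), np.max(S2, axis=0))

# ...while sum preserves multiplicity and tells them apart.
assert not np.allclose(np.sum(S1, axis=0), np.sum(S2, axis=0))
```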

Question 10

What is over-smoothing, and why does it occur?

Answer **Over-smoothing** is the phenomenon where all node representations converge to the same vector as the number of GNN layers increases. After $K$ layers, node $i$'s representation depends on its $K$-hop neighborhood. For a connected graph, a sufficiently deep GNN's receptive field covers the entire graph for every node, causing all representations to collapse together. The mathematical explanation is that the powers $\hat{\mathbf{A}}^K$ of the symmetrically normalized adjacency matrix converge, as $K \to \infty$, to the rank-one projection onto its leading eigenvector (which is proportional to $\sqrt{\tilde{\mathbf{d}}}$); for a regular graph this limit is exactly the global average $\frac{1}{N}\mathbf{1}\mathbf{1}^\top$. From the spectral perspective, each multiplication by $\hat{\mathbf{A}}$ acts as a **low-pass filter**, attenuating high-frequency components (Laplacian eigenvectors with large eigenvalues). After $K$ applications, essentially only the smoothest component survives. Unlike CNNs and transformers, where depth generally improves performance, GNN depth is fundamentally limited by over-smoothing — typically to 2-4 layers.
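The rank collapse is observable directly. A numpy sketch on a small made-up connected graph: high powers of $\hat{\mathbf{A}}$ become rank-one, with every column proportional to $\sqrt{\tilde{d}}$.

```python
import numpy as np

# A small connected graph (toy example).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)

# Repeated propagation drives the operator toward rank one: all rows of
# A_hat^K become proportional, so every node's K-hop aggregate collapses
# onto the same direction.
P = np.linalg.matrix_power(A_hat, 1000)

# Each row of P, rescaled by sqrt(d_i), is (nearly) the same vector.
ratios = P / np.sqrt(d)[:, None]
assert np.allclose(ratios, ratios[0], atol=1e-8)
```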

Question 11

Name three strategies for mitigating over-smoothing. Which addresses the root cause?

Answer

1. **Residual connections:** $\mathbf{h}_i^{(\ell+1)} = \text{GNN}^{(\ell)}\!\left(\mathbf{H}^{(\ell)}\right)_i + \mathbf{h}_i^{(\ell)}$. Preserves the original signal alongside the smoothed aggregation.
2. **Jumping Knowledge Networks (JKNet):** Concatenate or attention-pool representations from all layers, allowing the model to use whichever layer's receptive field is most informative for each node.
3. **DropEdge:** Randomly remove edges during training, reducing the effective propagation range.

None of these addresses the **root cause** — that message passing is inherently a low-pass filtering operation. They delay over-smoothing but do not eliminate it. **Graph transformers**, which replace local message passing with global attention, fundamentally avoid the depth-smoothing tradeoff but sacrifice the sparsity advantage and local inductive bias of graph structure.
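The contrast between deep propagation and a JKNet-style readout can be seen numerically. A sketch on a made-up path graph: pure smoothing makes nodes nearly indistinguishable, while concatenating all layers keeps the early, still-distinct representations available.

```python
import numpy as np

# Path graph 0-1-2-3 (toy example), symmetric normalization with self-loops.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 2))

layers = [H]
for _ in range(20):
    layers.append(A_hat @ layers[-1])      # 20 rounds of pure smoothing

spread = lambda X: X.std(axis=0).sum()     # how distinguishable the nodes are

# Deep propagation alone smooths the nodes together...
assert spread(layers[-1]) < 0.1 * spread(layers[0])

# ...but a Jumping-Knowledge-style concatenation of all layers keeps the
# early, still-distinct representations available to the readout.
jk = np.concatenate(layers, axis=1)
assert spread(jk) > spread(layers[-1])
```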

Question 12

What is the difference between transductive and inductive graph learning?

Answer

**Transductive learning:** The test nodes (and their edges) are present in the graph during training — only their labels are hidden. The model can see the test nodes' features and connectivity and use them to learn better representations. The standard semi-supervised benchmark setup (Cora, CiteSeer) is transductive. GCN in its original form is transductive.

**Inductive learning:** The test nodes are entirely absent during training — they are not in the graph, and their edges are not visible. At test time, the model must generalize to unseen nodes. This is the realistic setting for applications like recommendation (new users), drug discovery (new molecules), and dynamic social networks (new members). GraphSAGE is designed for inductive learning because it learns an aggregation *function* rather than fixed per-node embeddings. Inductive evaluation is strictly harder and more realistic than transductive evaluation.

Question 13

How does link prediction with GNNs work? Describe the encoder-decoder paradigm.

Answer Link prediction uses an **encoder-decoder** framework:

1. **Encoder (GNN):** A GNN processes the graph (with some edges held out for evaluation) to produce node embeddings $\mathbf{h}_i$ for all nodes. The GNN propagates information through the graph structure, enriching each node's representation with neighborhood context.
2. **Decoder (scoring function):** For a candidate edge $(v_i, v_j)$, a scoring function computes $\text{score}(v_i, v_j) = f(\mathbf{h}_i, \mathbf{h}_j)$. Common choices: dot product ($\mathbf{h}_i^\top \mathbf{h}_j$), bilinear ($\mathbf{h}_i^\top \mathbf{W} \mathbf{h}_j$), or MLP ($\text{MLP}([\mathbf{h}_i \| \mathbf{h}_j])$).
3. **Training:** The model is trained with binary cross-entropy loss, using observed edges as positive examples and randomly sampled non-edges as negative examples. The ratio and strategy of negative sampling significantly affect performance.
4. **Evaluation:** Held-out edges are scored and ranked; metrics include AUC-ROC and average precision.
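The decoder and training loss can be sketched in a few lines. Here the encoder output is faked with random embeddings, and the edge lists are invented — the point is the dot-product decoder plus the binary cross-entropy over positives and sampled negatives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Stand-in for the encoder: pretend a GNN already produced node embeddings.
H = rng.standard_normal((5, 8))

def score(i, j, H):
    """Dot-product decoder for a candidate edge (i, j)."""
    return H[i] @ H[j]

pos_edges = [(0, 1), (2, 3)]            # observed edges (positives)
neg_edges = [(0, 4), (1, 3)]            # sampled non-edges (negatives)

# Binary cross-entropy over positive and sampled negative pairs.
eps = 1e-9
loss = -np.mean(
    [np.log(sigmoid(score(i, j, H)) + eps) for i, j in pos_edges] +
    [np.log(1 - sigmoid(score(i, j, H)) + eps) for i, j in neg_edges])
assert loss > 0
```

Swapping `score` for a bilinear form or an MLP over concatenated embeddings changes only the decoder; the encoder and loss are unchanged.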

Question 14

What is a heterogeneous graph, and why does it matter for recommendation systems?

Answer A **heterogeneous graph** has multiple node types and/or multiple edge types. Formally, $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \tau, \phi)$ where $\tau$ maps nodes to types and $\phi$ maps edges to relation types. For recommendation systems like StreamRec, the interaction graph is inherently heterogeneous:

- **Node types:** users and items (different feature spaces, different semantics).
- **Edge types:** views, likes, purchases, shares (different interaction signals with different strengths and meanings).
- **Additional types:** content creators, categories, tags.

A homogeneous GNN treats all nodes and edges identically, losing this type information. **Relational GCN (R-GCN)** and **heterogeneous graph transformers** use separate weight matrices per relation type, allowing the model to learn type-specific aggregation patterns — e.g., the influence of a "purchase" edge on item representation should differ from that of a "view" edge.
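The per-relation weights are the essential idea. An illustrative numpy sketch of an R-GCN-style layer (the relation names, edge lists, and mean normalization here are made-up simplifications of the full R-GCN formulation):

```python
import numpy as np

def rgcn_layer(edges_by_rel, H, W_rel, W_self):
    """R-GCN-style layer sketch: a separate weight matrix per relation type."""
    out = H @ W_self                               # self-connection
    for rel, edges in edges_by_rel.items():
        W = W_rel[rel]                             # relation-specific weights
        counts = {}                                # in-degree per relation,
        for j, i in edges:                         # for mean normalization
            counts[i] = counts.get(i, 0) + 1
        for j, i in edges:
            out[i] += (H[j] @ W) / counts[i]
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))                    # e.g. 2 users + 2 items
W_rel = {"view": rng.standard_normal((3, 3)),
         "purchase": rng.standard_normal((3, 3))}
W_self = rng.standard_normal((3, 3))
edges_by_rel = {"view": [(2, 0), (3, 0), (2, 1)],  # item -> user "view" edges
                "purchase": [(3, 1)]}
H_new = rgcn_layer(edges_by_rel, H, W_rel, W_self)
assert H_new.shape == (4, 3)
```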

Question 15

Explain the readout (pooling) operation for graph-level tasks. Why must it be permutation invariant?

Answer For graph-level tasks (e.g., molecular property prediction, graph classification), we need a single representation $\mathbf{h}_\mathcal{G}$ for the entire graph. The **readout** function aggregates all node embeddings:

$$\mathbf{h}_\mathcal{G} = \text{READOUT}(\{\mathbf{h}_i : v_i \in \mathcal{V}\})$$

The readout **must be permutation invariant** because graph-level properties do not depend on how we label or order the nodes. A molecule's toxicity is the same regardless of whether we label the first carbon atom as node 0 or node 5. Common choices: sum (preserves multiset structure, sensitive to graph size), mean (size-invariant, loses cardinality), max (captures extremes, ignores frequency), and Set2Set (learnable attention-based, most expressive). Multi-scale readout — concatenating sum, mean, and max — often outperforms any single choice because different properties depend on different aggregation semantics.
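A multi-scale readout and its invariance are a one-liner to check in numpy (the node embeddings are random stand-ins for a GNN's output):

```python
import numpy as np

def multiscale_readout(H):
    """Concatenate sum, mean, and max pooling over the node axis."""
    return np.concatenate([H.sum(axis=0), H.mean(axis=0), H.max(axis=0)])

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 3))                # 5 nodes, 3 features

# Any node relabeling leaves the graph-level representation unchanged.
perm = rng.permutation(5)
assert np.allclose(multiscale_readout(H), multiscale_readout(H[perm]))
```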

Question 16

In what way is a Graph Attention Network related to a transformer? In what way is it different?

Answer **Similarities:** Both use attention mechanisms to compute data-dependent weights for aggregating information. Both support multi-head attention for representational diversity. Both are permutation equivariant in their core computation. **Differences:** (1) **Scope:** A transformer computes attention over all positions (global, $O(n^2)$), while a GAT computes attention only over graph neighbors (local, $O(E)$). A full-attention transformer on a graph is equivalent to a GAT on a complete graph. (2) **Positional information:** Transformers require explicit positional encodings because self-attention is permutation equivariant; GATs derive positional information implicitly from the graph structure (which nodes are neighbors). (3) **Scoring function:** Original GAT uses additive attention ($\mathbf{a}^\top[\mathbf{z}_i \| \mathbf{z}_j]$) while transformers use scaled dot-product attention ($\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d_k}$). (4) **Inductive bias:** The graph adjacency provides a structured sparsity pattern that encodes relational information; the transformer has no built-in structural bias.

Question 17

What is the computational complexity of a single GCN layer, and how does it compare to a transformer layer?

Answer A single GCN layer computes $\hat{\mathbf{A}} \mathbf{H} \mathbf{W}$:

- The sparse matrix multiplication $\hat{\mathbf{A}} \mathbf{H}$ takes $O(E \cdot d)$ time (each of the $E$ edges contributes a $d$-dimensional message in each direction), where $E$ is the number of edges and $d$ is the feature dimension.
- The dense matrix multiplication $\mathbf{H} \mathbf{W}$ takes $O(N \cdot d \cdot d')$ time.
- **Total: $O(E \cdot d + N \cdot d \cdot d')$.**

A transformer layer's self-attention computes attention over all pairs of positions:

- Query-key dot products: $O(N^2 \cdot d)$.
- Attention-weighted value aggregation: $O(N^2 \cdot d)$.
- **Total: $O(N^2 \cdot d)$** (assuming $d = d'$).

For sparse graphs where $E \ll N^2$ (most real-world graphs), the GCN is significantly cheaper. For a social network with $N = 10^6$ and average degree $\bar{d} = 100$ ($E = 5 \times 10^7$ undirected edges), the GCN is $N / \bar{d} = 10{,}000$ times cheaper than a full-attention transformer (counting both directions of each edge in the sparse product).
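The arithmetic behind the 10,000x figure, spelled out (the feature dimension is a hypothetical choice; it cancels in the ratio):

```python
# Back-of-the-envelope cost comparison for the social-network example.
N = 10**6            # nodes
deg = 100            # average degree
E = N * deg // 2     # undirected edges, each counted once
d = 128              # feature dimension (hypothetical; cancels in the ratio)

gcn_msgs = 2 * E * d          # sparse A_hat @ H touches both edge directions
attn_ops = N**2 * d           # dense pairwise attention

assert attn_ops // gcn_msgs == N // deg   # = 10,000x cheaper
```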

Question 18

Why do standard GNNs assume homophily, and what happens on heterophilous graphs?

Answer Standard GNNs (GCN, GraphSAGE, GAT) aggregate neighbor features and apply them to update each node's representation. This works well under **homophily** — the assumption that connected nodes tend to have similar labels or features (e.g., friends share political views, papers in the same field cite each other). Aggregation effectively "borrows" useful information from similar nodes. On **heterophilous graphs** — where connected nodes tend to have *different* labels (e.g., in dating networks, bipartite buyer-seller networks, or amino acid interaction networks) — aggregation mixes features from dissimilar nodes, which can destroy the discriminative signal. In extreme cases, a GCN on a heterophilous graph performs *worse* than an MLP that ignores graph structure entirely, because the aggregation actively hurts by mixing class-distinct features toward a common mean. Mitigation strategies include separating ego and neighbor representations, using high-pass filters (negative attention weights), incorporating multi-hop neighborhoods, or using architectures specifically designed for heterophily (e.g., H2GCN, GPRGNN).

Question 19

What is the geometric deep learning perspective, and how does it unify different neural network architectures?

Answer **Geometric deep learning** (Bronstein et al., 2021) unifies neural network architectures through the principle of **symmetry**. The key insight: the architecture should be equivariant (or invariant) to the symmetries of the data domain.

- **Sets** have permutation symmetry ($S_n$) $\to$ DeepSets (permutation-equivariant)
- **Graphs** have node permutation symmetry ($S_n$) $\to$ GNNs / MPNNs
- **Grids** have translation symmetry ($\mathbb{Z}^2$) $\to$ CNNs (translation-equivariant)
- **Sequences** have translation symmetry ($\mathbb{Z}$) $\to$ RNNs / 1D CNNs
- **3D point clouds** have rotation + translation symmetry ($SE(3)$) $\to$ EGNN, SchNet
- **Spherical data** has rotation symmetry ($SO(3)$) $\to$ Spherical CNNs

This perspective reveals that MLPs, CNNs, RNNs, and GNNs are all instances of the same principle — **equivariant function approximation** — applied to different symmetry groups. Weight sharing in CNNs (translation equivariance) and permutation-invariant aggregation in GNNs (permutation equivariance) are both consequences of imposing the data's symmetry on the architecture. The transformer's raw self-attention is itself permutation equivariant; positional encodings deliberately break that symmetry, making the transformer the flexible limiting case in which structural information is supplied as input features rather than imposed by the architecture.

Question 20

For the StreamRec content platform (millions of users, hundreds of thousands of items, continuously growing), which GNN architecture would you choose and why?

Answer **GraphSAGE** (or a heterogeneous variant like HeteroSAGE) is the best choice for StreamRec, for four reasons:

1. **Scalability.** The graph has millions of nodes — full-batch GCN is impossible due to memory constraints. GraphSAGE's mini-batch training with neighborhood sampling reduces memory to $O(B \cdot k^L)$, making it feasible on standard GPU hardware.
2. **Inductive capability.** StreamRec continuously adds new users and new items. GraphSAGE learns an aggregation *function* that can embed new nodes at inference time without retraining, while transductive GCN would require retraining on the full graph for every new node.
3. **Heterogeneity support.** The user-item graph is bipartite with multiple edge types (view, like, purchase). A heterogeneous GraphSAGE variant can use separate aggregation weights per edge type.
4. **Production maturity.** GraphSAGE's sampling-based training is well-suited to distributed training pipelines, and its fixed-budget computation graph makes latency predictable — a requirement for online recommendation serving.

GAT could provide better accuracy through learned attention, but its full-neighborhood attention computation does not scale as well. GIN is more expressive but designed for graph classification (small graphs), not node/link prediction on large graphs. A LightGCN variant (simplified GCN for collaborative filtering) is also a strong practical choice if the focus is purely on interaction-based recommendation.