Chapter 37 Quiz
Test your understanding of graph neural networks. Each question has one best answer unless stated otherwise. Try to answer each question before revealing the solution.
Question 1
In the COO (coordinate) format used by PyTorch Geometric, an edge list for an undirected graph with 3 undirected edges would have shape:
- A) [3, 2]
- B) [2, 3]
- C) [2, 6]
- D) [6, 2]
Show Answer
**C) [2, 6].** In COO format, the edge index has shape [2, num_edges]. For an undirected graph, each undirected edge is stored as two directed edges (one in each direction), so 3 undirected edges become 6 directed entries, giving shape [2, 6].
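As a quick illustration, here is what that tensor might look like in PyTorch for three undirected edges (the specific node indices are made up for the example):

```python
import torch

# Three undirected edges: (0, 1), (1, 2), (0, 2) -- illustrative only.
# Each undirected edge is stored as two directed edges, one per direction.
edge_index = torch.tensor([
    [0, 1, 1, 2, 0, 2],  # source nodes
    [1, 0, 2, 1, 2, 0],  # target nodes
], dtype=torch.long)

print(edge_index.shape)  # torch.Size([2, 6])
```

Question 2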
What is the primary purpose of adding self-loops ($\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$) in GCN?
- A) To make the adjacency matrix invertible
- B) To ensure each node includes its own features during aggregation
- C) To prevent over-smoothing
- D) To increase the number of edges for better message passing
Show Answer
**B) To ensure each node includes its own features during aggregation.** Without self-loops, the aggregation step only considers neighbor features, discarding the node's own representation. Adding self-loops makes each node its own neighbor, naturally incorporating self-information into the update.
Question 3
Which of the following is the GCN propagation rule?
- A) $\mathbf{H}^{(\ell)} = \sigma(\mathbf{A} \mathbf{H}^{(\ell-1)} \mathbf{W}^{(\ell)})$
- B) $\mathbf{H}^{(\ell)} = \sigma(\tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H}^{(\ell-1)} \mathbf{W}^{(\ell)})$
- C) $\mathbf{H}^{(\ell)} = \sigma(\mathbf{D}^{-1} \mathbf{A} \mathbf{H}^{(\ell-1)} \mathbf{W}^{(\ell)})$
- D) $\mathbf{H}^{(\ell)} = \sigma(\mathbf{H}^{(\ell-1)} \mathbf{W}^{(\ell)} \tilde{\mathbf{A}})$
Show Answer
**B) $\mathbf{H}^{(\ell)} = \sigma(\tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H}^{(\ell-1)} \mathbf{W}^{(\ell)})$.** This is the symmetric normalization form introduced by Kipf and Welling (2017). It uses the self-loop-augmented adjacency matrix $\tilde{\mathbf{A}}$ with symmetric degree normalization.
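As a rough sketch of that rule (dense tensors for clarity; real implementations such as PyG use sparse message passing instead), one layer might look like:

```python
import torch

def gcn_layer(A, H, W):
    """One GCN layer: sigma(D~^-1/2 A~ D~^-1/2 H W), dense toy version."""
    A_tilde = A + torch.eye(A.size(0))         # add self-loops
    deg = A_tilde.sum(dim=1)                   # degrees of A~
    D_inv_sqrt = torch.diag(deg.pow(-0.5))     # D~^-1/2 (degrees >= 1 due to self-loops)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return torch.relu(A_hat @ H @ W)           # aggregate, transform, activate
```

Question 4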
What problem does over-smoothing cause in deep GNNs?
- A) Gradient explosion during backpropagation
- B) Node representations become indistinguishable as layers increase
- C) The model cannot learn edge features
- D) Training becomes too slow for practical use
Show Answer
**B) Node representations become indistinguishable as layers increase.** Over-smoothing occurs because repeated neighborhood aggregation causes all node features to converge toward a common value. After many layers, every node has effectively seen the entire graph, and node representations become nearly identical, making node-level discrimination impossible.
Question 5
How does GraphSAGE differ from the original GCN?
- A) GraphSAGE uses attention weights; GCN does not
- B) GraphSAGE samples neighbors and concatenates self-features; GCN uses the full neighborhood with additive self-loops
- C) GraphSAGE is spectral; GCN is spatial
- D) GraphSAGE requires edge features; GCN does not
Show Answer
**B) GraphSAGE samples neighbors and concatenates self-features; GCN uses the full neighborhood with additive self-loops.** GraphSAGE's two key innovations are: (1) sampling a fixed number of neighbors for scalability, and (2) concatenating self-features with aggregated neighbor features rather than mixing them through self-loops. This also makes GraphSAGE inductive.
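A minimal sketch of the SAGE-style update for a single node, assuming mean aggregation over an already-sampled neighbor set:

```python
import torch

def sage_update(h_self, h_neighbors, W):
    """GraphSAGE-style update: concatenate self features with the mean of
    sampled neighbor features, then apply a shared linear transform."""
    h_agg = h_neighbors.mean(dim=0)             # aggregate sampled neighbors
    h_cat = torch.cat([h_self, h_agg], dim=-1)  # concatenation, not a self-loop sum
    return torch.relu(W @ h_cat)
```

Question 6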
In Graph Attention Networks (GAT), what determines how much influence a neighbor node has?
- A) Its degree in the graph
- B) Its Euclidean distance from the target node
- C) Learned attention coefficients based on both source and target node features
- D) The edge weight in the adjacency matrix
Show Answer
**C) Learned attention coefficients based on both source and target node features.** GAT computes attention coefficients using a learnable attention mechanism that considers both the source and target node features after linear transformation. These coefficients are normalized via softmax across each node's neighborhood.
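A single-head sketch of the coefficient computation for one node's neighborhood (shapes are illustrative); note the LeakyReLU, which Question 7 revisits:

```python
import torch
import torch.nn.functional as F

def gat_attention(h_target, h_neighbors, W, a):
    """Attention over one neighborhood: h_target [d_in], h_neighbors [k, d_in],
    W [d_out, d_in], a [2 * d_out]. Returns k normalized coefficients."""
    z_t = W @ h_target                  # transform target node
    z_n = h_neighbors @ W.T             # transform neighbors, [k, d_out]
    e = F.leaky_relu(                   # unnormalized scores e_ij
        torch.cat([z_t.expand_as(z_n), z_n], dim=-1) @ a,
        negative_slope=0.2,
    )
    return F.softmax(e, dim=0)          # normalize over the neighborhood
```

Question 7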
Why does GAT use LeakyReLU (instead of ReLU) in the attention computation?
- A) To prevent dead neurons
- B) To allow negative attention scores before softmax, enabling differentiation between neighbors
- C) To reduce computational cost
- D) For numerical stability in the softmax
Show Answer
**B) To allow negative attention scores before softmax, enabling differentiation between neighbors.** With standard ReLU, attention scores would be clipped to zero for many neighbor pairs, reducing the model's ability to distinguish between neighbors. LeakyReLU allows small negative values to pass through, preserving the relative ordering of attention scores.
Question 8
What is the purpose of multi-head attention in GAT?
- A) To reduce the number of parameters
- B) To enable parallel processing on GPU
- C) To stabilize learning and capture different types of relationships
- D) To increase the receptive field
Show Answer
**C) To stabilize learning and capture different types of relationships.** Multiple attention heads independently compute attention patterns, each potentially capturing different relational structures. Concatenating or averaging the heads stabilizes training (reducing variance) and provides a richer representation.
Question 9
For graph-level classification, which readout function is the most expressive according to Xu et al. (2019)?
- A) Mean pooling
- B) Max pooling
- C) Sum pooling
- D) Attention-weighted pooling
Show Answer
**C) Sum pooling.** Xu et al. showed that sum aggregation can distinguish multisets that mean and max collapse: for example, mean cannot tell {1, 2} apart from {1, 1, 2, 2}, and max cannot tell {1, 2} apart from {1, 2, 2}, while sum separates both pairs. Sum pooling also captures graph size information, making it strictly more expressive.
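A quick numerical check of that distinction, using scalar node "features" for simplicity:

```python
import torch

a = torch.tensor([1., 2.])           # multiset {1, 2}
b = torch.tensor([1., 1., 2., 2.])   # multiset {1, 1, 2, 2}

print(a.mean().item(), b.mean().item())  # 1.5 vs 1.5 -> mean cannot separate them
print(a.max().item(),  b.max().item())   # 2.0 vs 2.0 -> max cannot either
print(a.sum().item(),  b.sum().item())   # 3.0 vs 6.0 -> sum can
```

Question 10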
In semi-supervised node classification on a citation network like Cora, what happens during the forward pass?
- A) Only labeled nodes are processed through the GNN
- B) All nodes are processed, but loss is computed only on labeled training nodes
- C) The graph is split into subgraphs around labeled nodes
- D) Edges between train and test nodes are removed
Show Answer
**B) All nodes are processed, but loss is computed only on labeled training nodes.** The forward pass runs message passing over the entire graph, because unlabeled nodes contribute to the representations of labeled nodes (and vice versa). The loss function, however, only considers the small subset of labeled training nodes.
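In PyTorch Geometric this is typically expressed with a boolean mask; a sketch, assuming `model` and a `Planetoid`-style `data` object already exist:

```python
import torch.nn.functional as F

out = model(data.x, data.edge_index)   # forward pass over ALL nodes
loss = F.cross_entropy(                # loss restricted to labeled training nodes
    out[data.train_mask],
    data.y[data.train_mask],
)
loss.backward()
```

Question 11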
Which of the following is NOT a valid strategy for addressing over-smoothing in GNNs?
- A) Adding residual connections between layers
- B) Using DropEdge during training
- C) Increasing the hidden dimension size
- D) Using Jumping Knowledge Networks
Show Answer
**C) Increasing the hidden dimension size.** Larger hidden dimensions do not address over-smoothing; the problem is that repeated aggregation causes feature convergence regardless of dimensionality. Residual connections, DropEdge (randomly removing edges), and Jumping Knowledge (combining representations from all layers) are all established methods for mitigating over-smoothing.
Question 12
In a molecular graph, atoms are nodes and bonds are edges. What information is typically encoded as an edge feature?
- A) Atom type and atomic number
- B) Bond type (single, double, triple), stereochemistry, ring membership
- C) Molecular weight and LogP
- D) SMILES string representation
Show Answer
**B) Bond type (single, double, triple), stereochemistry, ring membership.** Edge features in molecular graphs encode bond-level properties. Atom-level properties (type, atomic number) are node features, and molecular-level properties (weight, LogP) are targets or global features, not edge features.
Question 13
What is the key advantage of the Message Passing Neural Network (MPNN) framework for molecular property prediction?
- A) It avoids the need for 3D coordinates
- B) It incorporates edge features into the message function
- C) It uses fewer parameters than GCN
- D) It guarantees equivariance to 3D rotations
Show Answer
**B) It incorporates edge features into the message function.** The MPNN framework explicitly includes edge features (like bond type) in the message computation: $m_{u \to v} = \text{MLP}(\mathbf{h}_u, \mathbf{e}_{uv})$. This is crucial for molecules where the type of bond fundamentally changes atom interactions.
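A minimal sketch of such an edge-conditioned message function (the MLP and dimensions here are illustrative, not a specific published architecture):

```python
import torch
import torch.nn as nn

class EdgeMessage(nn.Module):
    """Message m_{u->v} computed from the neighbor state h_u and the edge feature e_uv."""
    def __init__(self, node_dim, edge_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(node_dim + edge_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, h_u, e_uv):
        return self.mlp(torch.cat([h_u, e_uv], dim=-1))
```

Question 14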
In a knowledge graph, what does the triple (Einstein, born_in, Ulm) represent?
- A) A node with three features
- B) A directed edge from Einstein to Ulm with relation type "born_in"
- C) An undirected edge between Einstein and Ulm
- D) A subgraph with three nodes
Show Answer
**B) A directed edge from Einstein to Ulm with relation type "born_in".** Knowledge graph triples are (head, relation, tail), representing a directed, typed relationship from the head entity to the tail entity.
Question 15
How does TransE model the relationship between entities?
- A) As a rotation in complex space
- B) As a bilinear interaction
- C) As a translation: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$
- D) As a neural network mapping
Show Answer
**C) As a translation: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$.** TransE represents relations as translations in embedding space. A true triple should satisfy $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$, and the score is based on the distance $\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$.
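The scoring function in a few lines (the embeddings are random here, purely for illustration):

```python
import torch

def transe_score(h, r, t, p=1):
    """Distance ||h + r - t||_p; smaller means the triple is more plausible."""
    return torch.norm(h + r - t, p=p, dim=-1)

h = torch.randn(64)                   # head entity embedding
r = torch.randn(64)                   # relation embedding
t = h + r + 0.01 * torch.randn(64)    # a tail near h + r scores well (low distance)
print(transe_score(h, r, t))
```

Question 16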
How does R-GCN handle the large number of parameters when there are many relation types?
- A) Weight sharing across all relation types
- B) Basis decomposition: $\mathbf{W}_r = \sum_b a_{rb} \mathbf{V}_b$
- C) Pruning infrequent relation types
- D) Using a single weight matrix with relation-specific biases
Show Answer
**B) Basis decomposition: $\mathbf{W}_r = \sum_b a_{rb} \mathbf{V}_b$.** With many relation types, having a separate weight matrix per relation leads to parameter explosion. Basis decomposition expresses each relation-specific matrix as a linear combination of a small number of shared basis matrices $\mathbf{V}_b$, drastically reducing the parameter count.
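A sketch of the construction (the dimensions are chosen only to make the parameter saving concrete):

```python
import torch

num_relations, num_bases, d_in, d_out = 50, 4, 16, 16

V = torch.randn(num_bases, d_in, d_out)    # shared basis matrices V_b
a = torch.randn(num_relations, num_bases)  # per-relation coefficients a_rb

# W_r = sum_b a_rb V_b for all relations at once -> [num_relations, d_in, d_out]
W = torch.einsum('rb,bio->rio', a, V)

# 50 separate 16x16 matrices would need 12,800 weights;
# 4 bases plus 50x4 coefficients need only 1,224.
print(W.shape)
```

Question 17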
In PyTorch Geometric, what does the batch tensor in a mini-batch of graphs represent?
- A) The batch size
- B) A mapping from each node to the graph it belongs to
- C) The edge indices for each graph
- D) The order in which graphs were loaded
Show Answer
**B) A mapping from each node to the graph it belongs to.** When multiple graphs are batched together into a single disconnected graph, the `batch` vector assigns each node to its source graph. This is essential for graph-level readout operations like `global_mean_pool(x, batch)`.
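A small PyG example with two graphs of 3 and 2 nodes:

```python
import torch
from torch_geometric.data import Data, Batch
from torch_geometric.nn import global_mean_pool

g1 = Data(x=torch.randn(3, 8), edge_index=torch.tensor([[0, 1], [1, 2]]))
g2 = Data(x=torch.randn(2, 8), edge_index=torch.tensor([[0], [1]]))

batch = Batch.from_data_list([g1, g2])
print(batch.batch)                            # tensor([0, 0, 0, 1, 1])
out = global_mean_pool(batch.x, batch.batch)  # one embedding per graph: shape [2, 8]
```

Question 18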
What makes GIN (Graph Isomorphism Network) more expressive than standard GCN?
- A) It uses attention instead of fixed weights
- B) It uses sum aggregation with an injective MLP update, matching the power of the WL test
- C) It uses edge features in message passing
- D) It applies deeper networks with more layers
Show Answer
**B) It uses sum aggregation with an injective MLP update, matching the power of the WL test.** Xu et al. proved that GNNs with mean or max aggregation are strictly less powerful than the 1-WL test. GIN uses sum aggregation (which preserves multiset information) with an MLP update (which provides injectivity), achieving maximal expressiveness among message-passing GNNs.
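The GIN update for a single node in a dense sketch (the MLP is illustrative):

```python
import torch
import torch.nn as nn

def gin_update(h_self, h_neighbors, mlp, eps=0.0):
    """h_v' = MLP((1 + eps) * h_v + sum_u h_u) -- the sum preserves multiset information."""
    return mlp((1 + eps) * h_self + h_neighbors.sum(dim=0))

mlp = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
print(gin_update(torch.randn(16), torch.randn(5, 16), mlp).shape)  # torch.Size([16])
```

Question 19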
For molecular datasets, why is scaffold splitting preferred over random splitting?
- A) It is faster to compute
- B) It produces balanced class distributions
- C) It better estimates generalization to structurally novel molecules
- D) It preserves the temporal order of molecule discovery
Show Answer
**C) It better estimates generalization to structurally novel molecules.** Random splits allow structurally similar molecules to appear in both train and test sets, leading to overoptimistic performance estimates. Scaffold splits group molecules by their core structure, ensuring the model is evaluated on its ability to generalize to genuinely new scaffolds.
Question 20
What is the computational complexity of a single GCN layer?
- A) $O(N^2 d)$ where $N$ is the number of nodes and $d$ is the feature dimension
- B) $O(|E| d + N d d')$ where $|E|$ is the number of edges
- C) $O(N d^2)$
- D) $O(N^3)$
Show Answer
**B) $O(|E| d + N d d')$.** The GCN layer involves two operations: sparse matrix multiplication with the adjacency matrix ($O(|E|d)$, iterating over edges) and the linear transformation ($O(Ndd')$, applied to each node). Since most real graphs are sparse ($|E| \ll N^2$), this is much cheaper than the dense $O(N^2d)$ alternative.
Question 21
Which of the following approaches is designed for 3D molecular geometry?
- A) GraphSAGE
- B) GIN
- C) SchNet
- D) R-GCN
Show Answer
**C) SchNet.** SchNet uses continuous-filter convolutions based on interatomic distances, making it suitable for 3D molecular structures. GraphSAGE and GIN operate on graph topology without 3D information, and R-GCN is designed for multi-relational graphs like knowledge graphs.
Question 22
Permutation equivariance in GNNs means that:
- A) The model produces the same output regardless of input order
- B) Reordering the nodes produces correspondingly reordered outputs
- C) The model can handle graphs of any size
- D) The message passing is commutative
Show Answer
**B) Reordering the nodes produces correspondingly reordered outputs.** Permutation equivariance means that if you permute the node ordering by a permutation $\pi$, the output node representations are permuted in the same way. This is distinct from permutation *invariance* (same output regardless of order), which applies to graph-level readout functions.
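One way to check the property empirically, assuming `model` is any GNN taking `(x, edge_index)`:

```python
import torch

def check_equivariance(model, x, edge_index):
    """Permuting the nodes should permute the outputs in the same way."""
    perm = torch.randperm(x.size(0))   # new node i corresponds to old node perm[i]
    inv = torch.argsort(perm)          # maps old node labels to new labels
    x_perm = x[perm]
    edge_index_perm = inv[edge_index]  # relabel edge endpoints consistently
    out = model(x, edge_index)
    out_perm = model(x_perm, edge_index_perm)
    return torch.allclose(out[perm], out_perm, atol=1e-5)
```

Question 23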
What is the key limitation of the 1-WL (Weisfeiler-Lehman) test that also limits standard message-passing GNNs?
- A) It cannot handle weighted graphs
- B) It cannot distinguish certain non-isomorphic graphs, such as regular graphs with the same degree sequence
- C) It is computationally intractable for large graphs
- D) It requires node features to work
Show Answer
**B) It cannot distinguish certain non-isomorphic graphs, such as regular graphs with the same degree sequence.** The 1-WL test (and hence all standard message-passing GNNs) cannot distinguish graphs that have the same multiset of local neighborhoods at all levels. Classic examples include pairs of non-isomorphic regular graphs or the "decalin vs. bicyclopentyl" molecular pair.
Question 24
In Graph Transformers like Graphormer, how is graph structure incorporated?
- A) By replacing attention with message passing
- B) Through positional encodings based on centrality, spatial distances, and edge information
- C) By using the adjacency matrix as the attention mask
- D) By converting the graph to a sequence
Show Answer
**B) Through positional encodings based on centrality, spatial distances, and edge information.** Graphormer augments the standard Transformer with graph-specific encodings: centrality encoding (based on node degree), spatial encoding (based on shortest-path distance between nodes), and edge encoding (incorporating edge features along shortest paths). This allows global attention while being aware of graph structure.
Question 25
When should you NOT use a GNN?
- A) When your data has a natural graph structure
- B) When your graph is essentially complete (all nodes connected to all others)
- C) When you need inductive learning on new graphs
- D) When relationships between entities matter
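Show Answer
**B) When your graph is essentially complete (all nodes connected to all others).** If almost every node is connected to almost every other node, the adjacency structure carries no information: neighborhood aggregation degenerates into pooling over the whole graph, and the sparsity that gives GNNs their inductive bias and efficiency is lost. A set-based model or a standard Transformer is usually a better fit in that case. Natural graph structure, inductive settings, and tasks where relationships between entities matter are exactly where GNNs are appropriate.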