Chapter 37 Exercises: Graph Neural Networks and Structured Data

Conceptual Exercises

Exercise 1: Graph Representations

For each of the following real-world systems, describe the graph representation: what are the nodes, what are the edges (directed or undirected?), what node features would you use, and what edge features would you use?

a) A protein-protein interaction network
b) A citation network of academic papers
c) A road network for navigation
d) An abstract syntax tree of a Python program
e) A supply chain network

Exercise 2: Adjacency Matrix Properties

Given the following adjacency matrix:

$$\mathbf{A} = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$

a) Draw the corresponding graph. Is it directed or undirected?
b) Compute the degree matrix $\mathbf{D}$.
c) Compute the graph Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{A}$.
d) Compute $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ and the corresponding $\tilde{\mathbf{D}}$.
e) What does the entry $(i,j)$ of $\mathbf{A}^2$ represent?
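Once you have worked parts (b)–(e) by hand, a short NumPy sketch (not part of the exercise itself) can be used to check the arithmetic:

```python
import numpy as np

# Adjacency matrix from the exercise; it is symmetric, so the graph is undirected
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])

D = np.diag(A.sum(axis=1))           # degree matrix: diag(2, 3, 2, 1)
L = D - A                            # unnormalized graph Laplacian
A_tilde = A + np.eye(4, dtype=int)   # adjacency with self-loops
D_tilde = np.diag(A_tilde.sum(axis=1))

# Entry (i, j) of A^2 counts walks of length 2 from node i to node j
A2 = A @ A
print(A2[0, 0])  # 2: node 0 returns to itself via node 1 or node 2
```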

Exercise 3: Message Passing by Hand

Consider a graph with 3 nodes in a triangle (all connected), with initial features $\mathbf{x}_0 = [1, 0]$, $\mathbf{x}_1 = [0, 1]$, $\mathbf{x}_2 = [1, 1]$.

a) Compute one round of mean aggregation (without self-loops) for each node.
b) Now compute one round with self-loops added. How do the results differ?
c) Apply symmetric normalization (as in GCN). Show the normalized features.
d) After how many rounds of mean aggregation would all node features converge? Explain intuitively why.
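After computing parts (a) and (b) by hand, the following NumPy sketch can verify your numbers (it is a checking aid under the exercise's setup, not a model implementation):

```python
import numpy as np

# Initial features for the 3-node triangle (every node adjacent to the other two)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
A = np.ones((3, 3)) - np.eye(3)   # triangle adjacency, no self-loops

# (a) one round of mean aggregation over neighbours only
deg = A.sum(axis=1, keepdims=True)
X_mean = (A @ X) / deg

# (b) one round with self-loops added; note the triangle is a complete graph,
# so every node now averages over all three nodes
A_sl = A + np.eye(3)
X_mean_sl = (A_sl @ X) / A_sl.sum(axis=1, keepdims=True)

print(X_mean)     # rows: [0.5, 1.0], [1.0, 0.5], [0.5, 0.5]
print(X_mean_sl)  # every row equals the global mean [2/3, 2/3]
```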

Exercise 4: Over-Smoothing Analysis

Consider a path graph: $0 - 1 - 2 - 3 - 4$ with node features $\mathbf{x}_i = i$ (scalars).

a) Compute the node features after 1 layer of simple mean aggregation with self-loops.
b) Compute after 2 layers.
c) Compute after 5 layers.
d) What is the limiting behavior as the number of layers approaches infinity?
e) Propose two concrete strategies to mitigate this over-smoothing.
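A numerical check for parts (a)–(d): mean aggregation with self-loops is multiplication by the row-normalized matrix $\tilde{\mathbf{D}}^{-1}\tilde{\mathbf{A}}$, so repeated layers are repeated applications of a random-walk matrix. The sketch below (a checking aid, not part of the exercise) iterates it:

```python
import numpy as np

# Path graph 0-1-2-3-4 with scalar features x_i = i
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)  # path adjacency
A_sl = A + np.eye(5)                                  # add self-loops
P = A_sl / A_sl.sum(axis=1, keepdims=True)            # row-normalized: mean aggregation

x = np.arange(5, dtype=float)
for _ in range(1000):   # many layers to expose the limiting behaviour
    x = P @ x

# The limit is the degree-weighted average of the initial features:
# (2*0 + 3*1 + 3*2 + 3*3 + 2*4) / (2+3+3+3+2) = 26/13 = 2
print(x)
```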

Exercise 5: GCN vs. GAT

Explain the key differences between GCN and GAT along the following dimensions:

a) How neighbor importance is determined
b) Computational complexity per layer
c) Interpretability of learned representations
d) Performance on heterophilous graphs (where connected nodes tend to have different labels)
e) When would you choose one over the other?

Exercise 6: Expressiveness and the WL Test

a) Explain the Weisfeiler-Lehman (WL) graph isomorphism test in 3--4 sentences.
b) Why is GIN (Graph Isomorphism Network) more expressive than GCN with mean aggregation?
c) Give an example of two non-isomorphic graphs that a 1-WL test (and thus any message-passing GNN) cannot distinguish.
d) What approaches go beyond the 1-WL bound?
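The 1-WL color refinement loop is short enough to run by hand or in a few lines of Python. The sketch below (the helper name wl_colors is ours, not from the text) applies it to the classic pair for part (c): a 6-cycle versus two disjoint triangles, both 2-regular:

```python
from collections import Counter

def wl_colors(adj, rounds=3):
    """Run 1-WL colour refinement; return the final multiset of colours."""
    colors = {v: 0 for v in adj}  # uniform initial colouring
    for _ in range(rounds):
        # signature = own colour plus the multiset of neighbour colours
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in adj}
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: relabel[signatures[v]] for v in adj}
    return Counter(colors.values())

# 6-cycle vs. two disjoint triangles: non-isomorphic, yet both 2-regular
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_colors(cycle6) == wl_colors(two_triangles))  # True: 1-WL cannot tell them apart
```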

Exercise 7: Readout Functions

For graph-level prediction, we need to aggregate node features into a graph-level representation.

a) Compare sum, mean, and max readout. For which types of tasks is each most appropriate?
b) Prove that sum readout can distinguish graphs with different numbers of nodes while mean readout cannot (given identical node features).
c) Why might hierarchical pooling (e.g., DiffPool) be preferable to flat readout for molecular graphs?
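The intuition behind part (b) can be seen in a two-line NumPy demonstration (an illustration of the claim, not the requested proof):

```python
import numpy as np

# Two graphs whose nodes all carry the identical feature vector [1.0, 2.0]
g_small = np.tile([1.0, 2.0], (3, 1))  # 3 nodes
g_large = np.tile([1.0, 2.0], (5, 1))  # 5 nodes

# Mean readout collapses the size difference; sum readout preserves it
assert np.allclose(g_small.mean(axis=0), g_large.mean(axis=0))
assert not np.allclose(g_small.sum(axis=0), g_large.sum(axis=0))
```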

Exercise 8: Knowledge Graph Reasoning

Consider a small knowledge graph with entities {Alice, Bob, Carol, MIT, Google} and relations {works_at, friends_with, graduated_from}.

a) Write out at least 5 plausible triples.
b) Explain how TransE would encode the triple (Alice, works_at, Google). What is the scoring function? c) Why does TransE struggle with symmetric relations? How does DistMult address this?
d) How does R-GCN extend the standard GCN for multi-relational graphs?
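For parts (b) and (c), the two scoring functions fit in a few lines; the embeddings below are made-up 2-d vectors for illustration only:

```python
import numpy as np

def transe_score(h, r, t):
    # TransE: plausible triples satisfy h + r ≈ t, scored by -||h + r - t||
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    # DistMult: trilinear product sum_k h_k r_k t_k; symmetric in h and t
    return np.sum(h * r * t)

h, r, t = np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([1.5, 0.5])
assert transe_score(h, r, t) == 0.0  # h + r lands exactly on t: a "perfect" triple

# DistMult scores (h, r, t) and (t, r, h) identically, so symmetric relations
# come for free; TransE can only achieve this when r ≈ 0
assert distmult_score(h, r, t) == distmult_score(t, r, h)
```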


Programming Exercises

Exercise 9: Implement a GNN from Scratch

Using only PyTorch (no PyTorch Geometric), implement the following:

a) A GCNLayer class that performs the GCN update rule with symmetric normalization.
b) A GATLayer class with single-head attention.
c) A two-layer GCN model for node classification.
d) Test your implementation on a synthetic graph where connected nodes share the same label.

Hint: Use torch.scatter_add_ for the aggregation step.
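Before writing the PyTorch version, it can help to validate the propagation rule in NumPy, where np.add.at plays the role the hint assigns to torch.scatter_add_. The helper name gcn_propagate below is ours; this is a reference for checking your tensor code, not the requested implementation:

```python
import numpy as np

def gcn_propagate(edge_index, X):
    """Symmetrically normalized GCN propagation D~^{-1/2} (A + I) D~^{-1/2} X,
    implemented with scatter-style accumulation (cf. torch.scatter_add_)."""
    n = X.shape[0]
    src, dst = edge_index
    # add self-loops
    src = np.concatenate([src, np.arange(n)])
    dst = np.concatenate([dst, np.arange(n)])
    deg = np.zeros(n)
    np.add.at(deg, dst, 1.0)                      # degree of each node
    norm = 1.0 / np.sqrt(deg[src] * deg[dst])     # per-edge normalization
    out = np.zeros_like(X, dtype=float)
    np.add.at(out, dst, norm[:, None] * X[src])   # scatter-add the messages
    return out

# Undirected triangle, edges stored in both directions
edge_index = np.array([[0, 1, 1, 2, 2, 0],
                       [1, 0, 2, 1, 0, 2]])
X = np.eye(3)
print(gcn_propagate(edge_index, X))  # every entry 1/3: each node has degree 3 with self-loop
```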

Exercise 10: Comparing Aggregation Functions

Using PyTorch Geometric, build three graph classification models that are identical except for the aggregation function: mean, sum, and max. Train each on the PROTEINS dataset and compare performance.

a) Report test accuracy for each aggregation function.
b) Does the ranking of aggregation functions depend on the dataset? Try MUTAG as well.
c) Implement a "multi-aggregation" readout that concatenates mean, sum, and max. Does it improve performance?
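The readout in part (c) is just a concatenation; a NumPy sketch of the idea is below (in PyTorch Geometric you would combine global_mean_pool, global_add_pool, and global_max_pool instead; the function name here is ours):

```python
import numpy as np

def multi_aggregate_readout(X):
    """Concatenate mean, sum, and max readouts over the node dimension."""
    return np.concatenate([X.mean(axis=0), X.sum(axis=0), X.max(axis=0)])

# A toy "graph" with 2 nodes and 2-d features
X = np.array([[1.0, 2.0],
              [3.0, 0.0]])
print(multi_aggregate_readout(X))  # [2. 1. 4. 2. 3. 2.]
```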

Exercise 11: Neighbor Sampling

Implement a simple version of GraphSAGE's neighbor sampling:

a) Write a function sample_neighbors(edge_index, node_ids, num_samples) that, for each node in node_ids, randomly samples num_samples neighbors.
b) Use this to construct a 2-hop computation graph for a batch of target nodes.
c) Compare the runtime and memory usage of full-batch GCN vs. your sampled mini-batch approach on a graph with 10,000 nodes.
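One possible shape for part (a), using the signature the exercise specifies (sampling with replacement when a node has fewer neighbors than requested is one design choice among several):

```python
import numpy as np

def sample_neighbors(edge_index, node_ids, num_samples, rng=None):
    """For each node id, sample num_samples neighbours (with replacement
    when the node has fewer neighbours than requested)."""
    rng = rng or np.random.default_rng(0)
    src, dst = edge_index
    sampled = {}
    for v in node_ids:
        nbrs = dst[src == v]                     # out-neighbours of v
        if len(nbrs) == 0:
            sampled[v] = np.array([], dtype=int)  # isolated node: nothing to sample
        else:
            sampled[v] = rng.choice(nbrs, size=num_samples, replace=True)
    return sampled

# Tiny directed edge list: node 0 has neighbours {1, 2}
edge_index = np.array([[0, 0, 1, 2],
                       [1, 2, 0, 0]])
out = sample_neighbors(edge_index, [0], num_samples=3)
print(out[0])  # three draws from {1, 2}
```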

Exercise 12: Attention Visualization

Using PyTorch Geometric's GATConv with return_attention_weights=True:

a) Train a 2-layer GAT on the Cora dataset for node classification.
b) Extract attention weights for 5 correctly classified nodes and 5 incorrectly classified nodes.
c) Visualize the attention patterns. Do correctly classified nodes show more focused attention?
d) Compute the entropy of attention distributions. Is lower entropy correlated with correct predictions?
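The entropy measure in part (d) can be computed directly from each node's attention distribution over its neighbors. A minimal sketch (the helper name attention_entropy is ours; the distributions shown are made-up examples):

```python
import numpy as np

def attention_entropy(alpha, eps=1e-12):
    """Shannon entropy -sum_j alpha_j log(alpha_j) of an attention distribution."""
    alpha = np.asarray(alpha, dtype=float)
    return float(-(alpha * np.log(alpha + eps)).sum())

focused = [0.97, 0.01, 0.01, 0.01]   # nearly all mass on one neighbour
diffuse = [0.25, 0.25, 0.25, 0.25]   # uniform over four neighbours

assert attention_entropy(focused) < attention_entropy(diffuse)
print(attention_entropy(diffuse))    # log(4) ≈ 1.386, the maximum for 4 neighbours
```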

Exercise 13: Molecular Fingerprints vs. GNN

Compare traditional molecular fingerprints with GNN-learned representations:

a) Using RDKit, compute Morgan fingerprints (radius 2, 1024 bits) for molecules in the ESOL dataset.
b) Train a simple MLP on these fingerprints.
c) Train a GCN on the molecular graphs.
d) Compare RMSE on a held-out test set. When does the GNN outperform fingerprints?

Exercise 14: Link Prediction

Implement a link prediction model on the Cora dataset:

a) Randomly remove 10% of edges to form a test set. Add the same number of non-edges as negative samples.
b) Train a 2-layer GCN on the remaining graph.
c) For link prediction, compute scores as $\text{score}(u,v) = \sigma(\mathbf{h}_u^T \mathbf{h}_v)$.
d) Report AUC-ROC on the test set. Compare against a baseline using only node features (cosine similarity).
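The scoring function in part (c) is a sigmoid over the dot product of the two node embeddings. A minimal sketch with made-up 2-d embeddings (in practice h_u and h_v come from your trained GCN):

```python
import numpy as np

def link_score(h_u, h_v):
    """Edge score sigma(h_u^T h_v): probability-like score for an edge (u, v)."""
    return 1.0 / (1.0 + np.exp(-np.dot(h_u, h_v)))

h_u = np.array([1.0, -0.5])
h_v = np.array([0.8, 0.2])
print(link_score(h_u, h_v))  # dot product 0.7 -> sigmoid(0.7) ≈ 0.668

# Orthogonal embeddings give a score of exactly 0.5: maximally uncertain
assert link_score(np.array([1.0, 0.0]), np.array([0.0, 1.0])) == 0.5
```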

Exercise 15: Graph Data Augmentation

Implement and compare three graph augmentation strategies:

a) DropEdge: Randomly remove a fraction of edges during training.
b) DropNode: Randomly mask a fraction of node features.
c) Subgraph sampling: Train on random subgraphs instead of the full graph.

For each, train a GCN on Cora and report how augmentation affects accuracy, training stability, and over-smoothing (measured by node feature similarity across layers).
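A possible starting point for the DropEdge variant, sketched on a plain edge-index array (the function name is ours; for an undirected graph stored with both edge directions, you would want to drop the two directed copies of an edge together):

```python
import numpy as np

def drop_edge(edge_index, drop_prob, rng=None):
    """DropEdge: keep each edge independently with probability 1 - drop_prob."""
    rng = rng or np.random.default_rng(0)
    keep = rng.random(edge_index.shape[1]) >= drop_prob
    return edge_index[:, keep]

# A 6-edge toy graph (directed edge list in the 2 x E convention)
edge_index = np.array([[0, 1, 2, 3, 4, 5],
                       [1, 2, 3, 4, 5, 0]])
reduced = drop_edge(edge_index, drop_prob=0.5)
print(reduced.shape[1])  # number of surviving edges: about half in expectation
```

Applied fresh at every training step, this gives each epoch a slightly different graph, which is what makes it act as a regularizer.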