Chapter 14: Exercises
Exercises are graded by difficulty:
- One star (*): Apply the technique from the chapter to a new dataset or scenario.
- Two stars (**): Extend the technique or combine it with a previous chapter's methods.
- Three stars (***): Derive a result, implement from scratch, or design a system component.
- Four stars (****): Research-level problems that connect to open questions in the field.
Graph Fundamentals
Exercise 14.1 (*)
Consider the following undirected graph with 5 nodes and 6 edges:
  0---1
 /|   |\
3 |   | 4
 \|   |
  2---+
Edges: (0,1), (0,2), (0,3), (1,2), (1,4), (2,3).
(a) Write the adjacency matrix $\mathbf{A}$, the degree matrix $\mathbf{D}$, and the graph Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{A}$.
(b) Verify that $\mathbf{L} \mathbf{1} = \mathbf{0}$ (the all-ones vector is an eigenvector with eigenvalue 0).
(c) Compute $\mathbf{f}^\top \mathbf{L} \mathbf{f}$ for $\mathbf{f} = [1, 1, 1, 1, 1]^\top$ and $\mathbf{f} = [1, -1, 0, 1, -1]^\top$. Verify the result using the edge-sum formula $\sum_{(i,j) \in \mathcal{E}} (f_i - f_j)^2$.
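Parts (b) and (c) can be verified in a few lines of numpy using the edge list above:

```python
import numpy as np

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4), (2, 3)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A

print(L @ np.ones(n))  # [0. 0. 0. 0. 0.] — the constant vector is in the null space

f = np.array([1.0, -1.0, 0.0, 1.0, -1.0])
quad = f @ L @ f
edge_sum = sum((f[i] - f[j]) ** 2 for i, j in edges)
print(quad, edge_sum)  # both 7.0
```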
Exercise 14.2 (*)
Using PyTorch Geometric, load the Cora, CiteSeer, and PubMed datasets from Planetoid. For each dataset, report:
(a) Number of nodes, edges, features, and classes.
(b) Average node degree, graph density (number of edges divided by number of possible edges), and number of connected components.
(c) Class distribution (number of nodes per class). Is the labeling balanced?
(d) Feature sparsity (fraction of zero entries in the feature matrix).
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import degree, to_networkx
import networkx as nx


def analyze_planetoid_dataset(name: str) -> dict:
    """Analyze a Planetoid dataset and return key statistics.

    Args:
        name: Dataset name ('Cora', 'CiteSeer', or 'PubMed').

    Returns:
        Dictionary with graph statistics.
    """
    dataset = Planetoid(root=f"/tmp/{name}", name=name)
    data = dataset[0]
    stats = {}
    stats["num_nodes"] = data.num_nodes
    stats["num_edges"] = data.num_edges // 2  # Undirected: each edge stored twice
    stats["num_features"] = data.num_features
    stats["num_classes"] = dataset.num_classes
    deg = degree(data.edge_index[0], data.num_nodes)
    stats["avg_degree"] = deg.mean().item()
    # data.num_edges double-counts undirected edges, so this equals
    # (num_edges / 2) / (n * (n - 1) / 2)
    stats["density"] = data.num_edges / (data.num_nodes * (data.num_nodes - 1))
    # Connected components via NetworkX
    g = to_networkx(data, to_undirected=True)
    stats["num_components"] = nx.number_connected_components(g)
    # Class distribution: number of nodes per class
    stats["class_counts"] = data.y.bincount().tolist()
    # Feature sparsity: fraction of zero entries
    stats["feature_sparsity"] = (data.x == 0).float().mean().item()
    return stats


for name in ["Cora", "CiteSeer", "PubMed"]:
    stats = analyze_planetoid_dataset(name)
    print(f"\n{name}: {stats}")
Exercise 14.3 (**)
The normalized Laplacian $\tilde{\mathbf{L}} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$ has eigenvalues in $[0, 2]$.
(a) Prove that $\lambda_{\max} \leq 2$ by showing that $\mathbf{f}^\top \tilde{\mathbf{L}} \mathbf{f} \leq 2 \mathbf{f}^\top \mathbf{f}$ for all $\mathbf{f}$. (Hint: use the identity $\mathbf{f}^\top \tilde{\mathbf{L}} \mathbf{f} = \sum_{(i,j)} (f_i/\sqrt{d_i} - f_j/\sqrt{d_j})^2$.)
(b) Compute the eigenvalues of $\tilde{\mathbf{L}}$ for the 5-node graph from Exercise 14.1. Use numpy.linalg.eigh.
(c) The second-smallest eigenvalue $\lambda_2$ (the Fiedler value) measures algebraic connectivity. Compare $\lambda_2$ for a path graph (0-1-2-3-4), a cycle graph (0-1-2-3-4-0), and a complete graph ($K_5$). What does $\lambda_2$ tell you about the "connectedness" of each graph?
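Part (c) in sketch form (the helper name is ours), using networkx generators and the normalized Laplacian as in this exercise:

```python
import numpy as np
import networkx as nx


def fiedler_value(G: nx.Graph) -> float:
    """Second-smallest eigenvalue of the normalized Laplacian."""
    L = nx.normalized_laplacian_matrix(G).toarray()
    return np.linalg.eigvalsh(L)[1]  # eigvalsh returns eigenvalues in ascending order


for name, G in [
    ("path P5", nx.path_graph(5)),
    ("cycle C5", nx.cycle_graph(5)),
    ("complete K5", nx.complete_graph(5)),
]:
    print(name, round(fiedler_value(G), 4))
```

The values increase from path to cycle to complete graph: better-connected graphs have a larger spectral gap.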
GCN
Exercise 14.4 (*)
(a) Write out the GCN update equation for node 0 in the 5-node graph from Exercise 14.1, including self-loops. What are the normalization coefficients $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ for each neighbor?
(b) Train the 2-layer GCN from Section 14.5 on Cora. Report the test accuracy. Then train an MLP with the same hidden dimension and number of parameters but no graph structure (ignore edge_index). Report the test accuracy. How much does the graph contribute?
(c) Vary the number of GCN layers from 1 to 16. Plot test accuracy vs. number of layers. At what depth does over-smoothing become visible?
Exercise 14.5 (**)
(a) Implement a GCN layer from scratch using only PyTorch (no PyG). Your implementation should:
- Add self-loops to the adjacency matrix
- Compute the symmetric normalization $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$
- Apply the linear transformation and nonlinearity

import torch


def gcn_layer_from_scratch(
    x: torch.Tensor,
    adj: torch.Tensor,
    weight: torch.Tensor,
) -> torch.Tensor:
    """GCN layer using a dense adjacency matrix.

    Args:
        x: Node features, shape (N, d_in).
        adj: Adjacency matrix (without self-loops), shape (N, N).
        weight: Weight matrix, shape (d_in, d_out).

    Returns:
        Updated features, shape (N, d_out).
    """
    n = adj.size(0)
    adj_tilde = adj + torch.eye(n, device=adj.device)  # Add self-loops
    deg_tilde = adj_tilde.sum(dim=1)
    deg_inv_sqrt = deg_tilde.pow(-0.5)
    deg_inv_sqrt[deg_inv_sqrt == float("inf")] = 0.0  # Guard against isolated nodes
    d_mat = torch.diag(deg_inv_sqrt)
    adj_hat = d_mat @ adj_tilde @ d_mat  # Symmetric normalization
    return torch.relu(adj_hat @ x @ weight)
(b) Verify that your implementation produces the same output as torch_geometric.nn.GCNConv on the Cora dataset (up to numerical precision).
Exercise 14.6 (**)
The GCN normalization $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ is the symmetric normalization. Two alternatives are:
- Row normalization (random walk): $\hat{\mathbf{A}}_{\text{rw}} = \tilde{\mathbf{D}}^{-1} \tilde{\mathbf{A}}$ (each row sums to 1).
- Column normalization: $\hat{\mathbf{A}}_{\text{col}} = \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1}$ (each column sums to 1).
(a) Implement a GCN variant using row normalization. Train it on Cora and compare accuracy to the standard symmetric normalization.
(b) The symmetric normalization makes the operation $\hat{\mathbf{A}} \mathbf{X}$ equivalent to averaging over neighbors with degree-weighted coefficients. What is the interpretation of the row-normalized version? Why might it perform differently on graphs with high degree variance?
(c) Prove that the symmetric normalization preserves the symmetry of $\hat{\mathbf{A}}$ (i.e., $\hat{\mathbf{A}}$ is symmetric if $\mathbf{A}$ is symmetric), while row normalization does not.
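A three-node numeric check of (c) before the proof: symmetric normalization stays symmetric, row normalization does not.

```python
import torch

# Star graph: node 0 connected to nodes 1 and 2
A = torch.tensor([[0.0, 1.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
A_tilde = A + torch.eye(3)          # add self-loops
deg = A_tilde.sum(1)                # degrees with self-loops: [3, 2, 2]

sym = torch.diag(deg.pow(-0.5)) @ A_tilde @ torch.diag(deg.pow(-0.5))
rw = torch.diag(deg.reciprocal()) @ A_tilde  # row-normalized: rows sum to 1

print(torch.allclose(sym, sym.t()))  # True
print(torch.allclose(rw, rw.t()))    # False: rw[0,1] = 1/3 but rw[1,0] = 1/2
```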
GraphSAGE
Exercise 14.7 (*)
(a) Train a GraphSAGE model on Cora using full-batch training (no sampling) and compare accuracy to GCN. Is there a significant difference?
(b) Now train GraphSAGE with mini-batch sampling, using num_neighbors=[25, 10]. Vary the batch size from 32 to 1024. How does batch size affect convergence speed and final accuracy?
(c) Change the neighbor sample sizes to [5, 5], [15, 10], and [50, 25]. How does the neighbor budget affect accuracy and training time?
Exercise 14.8 (**)
GraphSAGE supports multiple aggregation functions: mean, max, and LSTM.
(a) Implement the max-pooling aggregator from the GraphSAGE paper:
$$\text{AGG}_{\text{max}} = \max\!\left(\left\{\sigma(\mathbf{W}_{\text{pool}} \mathbf{h}_j + \mathbf{b}_{\text{pool}}), \forall j \in \mathcal{N}(i)\right\}\right)$$
where the max is element-wise. (Hint: use torch_geometric.nn.MessagePassing with aggr='max'.)
(b) Compare mean, max, and the default SAGEConv aggregation on Cora. Report accuracy and training time.
(c) On which types of graphs would you expect max aggregation to outperform mean aggregation? Construct a synthetic graph where this difference is measurable.
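For (c), one useful intuition: when the signal is "some neighbor carries a rare marker", mean aggregation dilutes it with neighborhood size while max does not. A quick numeric illustration:

```python
import torch

# Compare two neighborhoods of size k: one contains a single marker neighbor
# (feature value 1), the other contains none.
for k in [5, 50, 500]:
    with_marker = torch.zeros(k)
    with_marker[0] = 1.0
    without = torch.zeros(k)
    mean_gap = (with_marker.mean() - without.mean()).item()  # shrinks as 1/k
    max_gap = (with_marker.max() - without.max()).item()     # stays at 1.0
    print(k, mean_gap, max_gap)
```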
GAT
Exercise 14.9 (*)
(a) Train a 2-layer GAT with 8 attention heads on Cora. Use hidden_channels=8 (so the output of the first layer is $8 \times 8 = 64$). Report accuracy.
(b) Extract the attention weights from the first GAT layer. For 5 randomly selected nodes, visualize the attention distribution over their neighbors. Do certain neighbors consistently receive higher attention?
(c) Compare GAT, GCN, and GraphSAGE on Cora, CiteSeer, and PubMed. Create a table of test accuracies. Which architecture performs best on each dataset? Is the difference statistically significant (run 10 seeds and report mean $\pm$ std)?
Exercise 14.10 (***)
The original GAT uses an additive attention mechanism: $e_{ij} = \text{LeakyReLU}(\mathbf{a}^\top [\mathbf{z}_i \| \mathbf{z}_j])$. An alternative is dot-product attention (as in the transformer): $e_{ij} = (\mathbf{W}_Q \mathbf{h}_i)^\top (\mathbf{W}_K \mathbf{h}_j) / \sqrt{d}$.
(a) Implement a GAT layer with dot-product attention using torch_geometric.nn.MessagePassing.
from typing import Optional

import torch
from torch import nn
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, softmax


class DotProductGATConv(MessagePassing):
    """GAT layer with dot-product attention instead of additive.

    Args:
        in_channels: Input feature dimensionality.
        out_channels: Output feature dimensionality.
        heads: Number of attention heads.
    """

    def __init__(
        self, in_channels: int, out_channels: int, heads: int = 1
    ) -> None:
        super().__init__(aggr="add", node_dim=0)
        self.heads = heads
        self.out_channels = out_channels
        self.W_Q = nn.Linear(in_channels, heads * out_channels, bias=False)
        self.W_K = nn.Linear(in_channels, heads * out_channels, bias=False)
        self.W_V = nn.Linear(in_channels, heads * out_channels, bias=False)
        self.scale = out_channels ** -0.5

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        Q = self.W_Q(x).view(-1, self.heads, self.out_channels)
        K = self.W_K(x).view(-1, self.heads, self.out_channels)
        V = self.W_V(x).view(-1, self.heads, self.out_channels)
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        out = self.propagate(edge_index, Q=Q, K=K, V=V)
        return out.view(-1, self.heads * self.out_channels)

    def message(
        self, Q_i: torch.Tensor, K_j: torch.Tensor, V_j: torch.Tensor,
        index: torch.Tensor, ptr: Optional[torch.Tensor],
        size_i: Optional[int],
    ) -> torch.Tensor:
        # Scaled dot-product scores, softmax-normalized per destination node
        alpha = (Q_i * K_j).sum(dim=-1) * self.scale
        alpha = softmax(alpha, index, ptr, size_i)
        return alpha.unsqueeze(-1) * V_j
(b) Compare additive GAT and dot-product GAT on Cora. Which achieves higher accuracy? Which is faster?
(c) The original GAT paper argues that additive attention is more expressive than dot-product attention for graph data. Construct a theoretical argument or empirical counterexample.
GIN and Expressiveness
Exercise 14.11 (**)
(a) Consider two graphs: a 6-cycle ($C_6$) and two disjoint 3-cycles ($C_3 + C_3$). Both have 6 nodes and 6 edges. Can the 1-WL test distinguish them? Run the WL iteration manually for 3 rounds (assuming all nodes start with the same label).
(b) Now give the nodes degree-based initial labels. Can 1-WL distinguish the graphs in this case? Why or why not?
(c) Construct two non-isomorphic graphs with the same degree sequence that 1-WL cannot distinguish, even with degree-based initialization. (Hint: consider regular graphs.)
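The manual WL rounds in (a) can be cross-checked with a small color-refinement implementation (function name ours). Both graphs are 2-regular, so with uniform initial labels every node keeps the same color forever and the multisets never diverge:

```python
import networkx as nx


def wl_colors(G: nx.Graph, rounds: int = 3):
    """1-WL color refinement; returns the sorted multiset of final colors."""
    colors = {v: 0 for v in G}  # uniform initial labels
    for _ in range(rounds):
        colors = {
            v: hash((colors[v], tuple(sorted(colors[u] for u in G[v]))))
            for v in G
        }
    return sorted(colors.values())


C6 = nx.cycle_graph(6)
C3C3 = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))
print(wl_colors(C6) == wl_colors(C3C3))  # True: 1-WL cannot tell them apart
```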
Exercise 14.12 (***)
(a) Prove that mean aggregation is strictly less expressive than sum aggregation. That is, construct two different multisets $S_1$ and $S_2$ of vectors such that $\text{mean}(S_1) = \text{mean}(S_2)$ but $\text{sum}(S_1) \neq \text{sum}(S_2)$.
(b) Prove that max aggregation is strictly less expressive than sum aggregation. Construct two multisets where max gives the same result but sum does not.
(c) Train GCN (mean-like aggregation), GIN (sum aggregation), and a GNN with max aggregation on the MUTAG molecular benchmark. Report accuracy with 10-fold cross-validation. Does the expressiveness hierarchy translate to empirical performance?
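Parts (a) and (b) can be checked numerically before writing the proof; a minimal demo:

```python
import torch

# (a) mean cannot separate a multiset from its duplicate: S1 = {v}, S2 = {v, v}
S1 = torch.tensor([[1.0, 2.0]])
S2 = torch.tensor([[1.0, 2.0], [1.0, 2.0]])
print(torch.equal(S1.mean(0), S2.mean(0)))  # True — mean is blind to multiplicity
print(torch.equal(S1.sum(0), S2.sum(0)))    # False — sum is not

# (b) max cannot count either: T1 = {0.5, 1}, T2 = {0.5, 0.5, 1}
T1 = torch.tensor([[0.5], [1.0]])
T2 = torch.tensor([[0.5], [0.5], [1.0]])
print(torch.equal(T1.max(0).values, T2.max(0).values))  # True — same maximum
print(torch.equal(T1.sum(0), T2.sum(0)))                # False — sums differ
```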
Exercise 14.13 (**)
Train a GIN model on three molecular datasets from TUDataset: MUTAG, PTC_MR, and PROTEINS.
(a) Use the pipeline from Section 14.12. Report mean $\pm$ std accuracy with 10-fold cross-validation.
(b) Ablate the number of GIN layers from 1 to 7. Plot accuracy vs. depth for each dataset. Does over-smoothing affect graph classification differently than node classification?
(c) Replace sum readout with mean readout and max readout. How does the readout function affect accuracy on each dataset?
Over-Smoothing
Exercise 14.14 (**)
Quantify over-smoothing by measuring the mean average distance (MAD) between node representations:
$$\text{MAD}(\mathbf{H}) = \frac{1}{N^2} \sum_{i,j} \|\mathbf{h}_i - \mathbf{h}_j\|_2$$
(a) Train GCN models with depths 2, 4, 8, 16, 32 on Cora. After training, compute MAD at each layer's output. Plot MAD vs. layer depth. At what depth does MAD collapse?
(b) Implement residual connections, adding each layer's input to its output: $\mathbf{H}^{(\ell+1)} = \sigma(\hat{\mathbf{A}} \mathbf{H}^{(\ell)} \mathbf{W}^{(\ell)}) + \mathbf{H}^{(\ell)}$. Repeat the MAD analysis. Does the residual connection delay the collapse?
(c) Implement DropEdge: randomly remove 20% of edges at each training step. Repeat the analysis. Does DropEdge help more or less than residual connections?
Exercise 14.15 (***)
(a) Implement a Jumping Knowledge Network (JKNet): after $L$ GCN layers, concatenate the representations from all layers and apply a linear classifier:
$$\mathbf{h}_i^{\text{final}} = [\mathbf{h}_i^{(1)} \| \mathbf{h}_i^{(2)} \| \cdots \| \mathbf{h}_i^{(L)}]$$
(b) Train JKNet with $L = 8$ layers on Cora. Compare accuracy to a standard 8-layer GCN and a 2-layer GCN. Does JKNet recover the accuracy lost to over-smoothing?
(c) Replace concatenation with an attention-based readout: $\mathbf{h}_i^{\text{final}} = \sum_\ell \alpha_{i\ell} \mathbf{h}_i^{(\ell)}$ where $\alpha_{i\ell}$ is a learned attention weight. Does attention-based JKNet outperform concatenation-based JKNet?
Link Prediction
Exercise 14.16 (**)
(a) Train a GCN-based link predictor on the Cora dataset using the framework from Section 14.9. Report AUC and average precision.
(b) Compare three scoring functions: dot product, bilinear ($\mathbf{h}_i^\top \mathbf{W} \mathbf{h}_j$), and MLP ($\text{MLP}([\mathbf{h}_i \| \mathbf{h}_j])$). Which achieves the highest AUC?
(c) Implement a hard negative sampling strategy: for each positive edge $(i, j)$, sample negative nodes from the 2-hop neighborhood of $i$ (nodes at distance exactly 2) rather than uniformly. Compare AUC to uniform negative sampling.
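The three scoring functions in (b), sketched side by side on dummy embeddings (all variable names ours):

```python
import torch
import torch.nn as nn

d = 16
h = torch.randn(10, d)  # node embeddings from the GCN encoder
src, dst = torch.tensor([0, 1, 2]), torch.tensor([3, 4, 5])  # candidate edges

# Dot product: h_i^T h_j
dot_scores = (h[src] * h[dst]).sum(-1)

# Bilinear: h_i^T W h_j
W = nn.Parameter(torch.randn(d, d))
bilinear_scores = ((h[src] @ W) * h[dst]).sum(-1)

# MLP on the concatenated pair [h_i || h_j]
mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))
mlp_scores = mlp(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)

print(dot_scores.shape, bilinear_scores.shape, mlp_scores.shape)  # all (3,)
```

Note the asymmetry trade-off: dot product is symmetric in (i, j), while bilinear and MLP scorers can model directed preferences.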
Exercise 14.17 (***)
(a) Implement link prediction using matrix factorization (Chapter 1) on the Cora adjacency matrix. Factorize $\mathbf{A} \approx \mathbf{U} \mathbf{V}^\top$ and score edges by $\text{score}(i, j) = \mathbf{u}_i^\top \mathbf{v}_j$. Report AUC.
(b) Compare MF to GCN-based link prediction. The GCN uses node features; the MF does not. How much does the graph structure contribute beyond what features provide?
(c) Combine MF and GCN: initialize node embeddings with MF embeddings, then refine with GCN layers. Does the combination outperform either alone?
Heterogeneous Graphs
Exercise 14.18 (**)
(a) Construct a heterogeneous bipartite graph from the StreamRec data (Chapter 1). Use 1,000 users and 500 items with random interaction edges. Create a PyG HeteroData object.
import torch
from torch_geometric.data import HeteroData


def create_bipartite_streamrec(
    n_users: int = 1000,
    n_items: int = 500,
    n_interactions: int = 5000,
    user_feat_dim: int = 16,
    item_feat_dim: int = 32,
    seed: int = 42,
) -> HeteroData:
    """Create a bipartite user-item interaction graph.

    Args:
        n_users: Number of user nodes.
        n_items: Number of item nodes.
        n_interactions: Number of interaction edges.
        user_feat_dim: User feature dimensionality.
        item_feat_dim: Item feature dimensionality.
        seed: Random seed.

    Returns:
        HeteroData object with user and item nodes and interaction edges.
    """
    torch.manual_seed(seed)
    data = HeteroData()
    data["user"].x = torch.randn(n_users, user_feat_dim)
    data["item"].x = torch.randn(n_items, item_feat_dim)
    # Random interactions, stored in both directions for message passing
    user_idx = torch.randint(0, n_users, (n_interactions,))
    item_idx = torch.randint(0, n_items, (n_interactions,))
    data["user", "interacts", "item"].edge_index = torch.stack([user_idx, item_idx])
    data["item", "interacted_by", "user"].edge_index = torch.stack([item_idx, user_idx])
    return data
(b) Use torch_geometric.nn.HeteroConv to build a 2-layer heterogeneous GNN with separate SAGEConv layers for each edge type. Verify that it runs on your bipartite graph.
(c) Add a second edge type: "item-similar_to-item" edges between items that share many users. How does adding this edge type affect item embedding quality (measured by nearest-neighbor retrieval accuracy)?
Applied Problems
Exercise 14.19 (**)
Social network influence prediction. Generate a synthetic social network using the Barabási-Albert model (preferential attachment):
import networkx as nx
G = nx.barabasi_albert_graph(1000, 5, seed=42)
(a) Convert to a PyG Data object. Assign node features as the concatenation of [degree, clustering coefficient, PageRank].
(b) Assign binary labels based on whether a node's influence (PageRank) is above the median. Train a GCN for node classification. Report accuracy.
(c) Compare GCN with an MLP that uses the same features but ignores graph structure. How much does the graph add on this task? Why might the answer be different from Cora?
Exercise 14.20 (***)
Molecular fingerprint learning. Molecular fingerprints (e.g., Morgan fingerprints) are fixed-length binary vectors computed from molecular graphs using a rule-based algorithm. GNNs can be viewed as learning a molecular fingerprint.
(a) Load the MUTAG dataset. Compute Morgan fingerprints (radius 2, 1024 bits) for each molecule using RDKit. Train a random forest on the fingerprints and report accuracy.
(b) Train a GIN model on the same dataset. Compare accuracy to the random forest. Which performs better?
(c) After training the GIN, extract the final graph-level embeddings (before the classifier). Compute cosine similarity between all pairs of molecules. Visualize the similarity matrix and color by class label. Does the GIN learn a useful representation?
Exercise 14.21 (**)
Graph generation for data augmentation. One limitation of GNN evaluation is the small size of benchmark datasets (MUTAG has 188 graphs).
(a) Train a GIN on 80% of MUTAG and evaluate on 20%. Report accuracy.
(b) Augment the training set by generating 100 synthetic graphs: for each real graph, create a variant by randomly adding or removing one edge. Retrain and evaluate. Does augmentation help?
(c) A more principled augmentation: for each graph, drop nodes with probability 0.1, drop edges with probability 0.1, and mask features with probability 0.1. Create three augmented copies per training graph. Does this help more than the naive approach?
Progressive Project (StreamRec)
Exercise 14.22 (***) — Milestone M6
Build a GNN-based collaborative filtering model for the StreamRec platform.
(a) Construct the user-item bipartite interaction graph from the StreamRec data (use the load_streamrec_sample function from Chapter 1). Create a PyG Data object with user and item nodes and interaction edges.
(b) Implement a 2-layer LightGCN model: a simplified GCN that removes feature transformation and nonlinearity, propagating only embeddings:
$$\mathbf{e}_u^{(\ell+1)} = \sum_{i \in \mathcal{N}(u)} \frac{1}{\sqrt{|\mathcal{N}(u)|} \sqrt{|\mathcal{N}(i)|}} \mathbf{e}_i^{(\ell)}, \quad \mathbf{e}_i^{(\ell+1)} = \sum_{u \in \mathcal{N}(i)} \frac{1}{\sqrt{|\mathcal{N}(i)|} \sqrt{|\mathcal{N}(u)|}} \mathbf{e}_u^{(\ell)}$$
The final embedding is the mean of all layers: $\mathbf{e}_u = \frac{1}{L+1} \sum_{\ell=0}^{L} \mathbf{e}_u^{(\ell)}$.
(c) Train with BPR loss (Bayesian Personalized Ranking): for each user $u$ with positive item $i$ and sampled negative item $j$:
$$\mathcal{L}_{\text{BPR}} = -\sum_{(u,i,j)} \log \sigma\!\left(\mathbf{e}_u^\top \mathbf{e}_i - \mathbf{e}_u^\top \mathbf{e}_j\right) + \lambda \|\mathbf{E}\|_F^2$$
(d) Evaluate with Recall@20 and NDCG@20 on a held-out 20% of interactions. Compare with the SVD-based matrix factorization from Chapter 1 and the two-tower model from Chapter 13. Create a comparison table.
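The propagation rule in (b) and the loss in (c) can be sketched in a few lines (function names ours), here on a toy graph where users and items share one stacked embedding table:

```python
import torch
import torch.nn.functional as F


def lightgcn_embeddings(e0: torch.Tensor, adj_norm: torch.Tensor, L: int = 2):
    """LightGCN: parameter-free propagation, then average over layers 0..L."""
    embs = [e0]
    for _ in range(L):
        embs.append(adj_norm @ embs[-1])
    return torch.stack(embs).mean(dim=0)


def bpr_loss(e_u, e_i, e_j, lam: float = 1e-4):
    """BPR: positive item i should outscore sampled negative item j for user u."""
    diff = (e_u * e_i).sum(-1) - (e_u * e_j).sum(-1)
    reg = lam * (e_u.pow(2).sum() + e_i.pow(2).sum() + e_j.pow(2).sum())
    return -F.logsigmoid(diff).mean() + reg


n = 5  # 3 users (ids 0-2) + 2 items (ids 3-4) in one stacked table
A = torch.zeros(n, n)
for u, i in [(0, 3), (1, 3), (1, 4), (2, 4)]:
    A[u, i] = A[i, u] = 1.0
d = A.sum(1).clamp(min=1).pow(-0.5)
adj_norm = d.unsqueeze(1) * A * d.unsqueeze(0)  # symmetric degree normalization

e = lightgcn_embeddings(torch.randn(n, 8), adj_norm, L=2)
loss = bpr_loss(e[[0]], e[[3]], e[[4]])  # user 0: positive item 3, negative item 4
print(e.shape, loss.item() > 0)
```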
Exercise 14.23 (****)
Extend the StreamRec GNN to handle the cold-start problem: new users with no interaction history.
(a) Implement an inductive GNN (e.g., GraphSAGE) that can embed new users based on their profile features (demographics, stated preferences) by aggregating similar known users' embeddings.
(b) Simulate cold start: hold out 10% of users from training entirely. At test time, provide only their profile features (no interactions). Compare recommendation quality for cold-start users vs. warm-start users.
(c) Implement a hybrid approach: for cold-start users, use the profile-based GNN embedding; for warm-start users, use the LightGCN interaction-based embedding. How does the hybrid compare to either approach alone?
Theoretical and Research-Level Problems
Exercise 14.24 (***)
Spectral graph convolution from scratch.
(a) Compute the eigendecomposition of the normalized Laplacian for the Cora graph. Plot the first 20 eigenvalues. What does the spectrum tell you about the graph structure?
(b) Implement a spectral graph convolution that learns a filter in the eigenvalue domain: $g_\theta(\lambda_k) = \theta_k$ for $k = 1, \ldots, K$ (truncated to the first $K$ eigenvectors). Set $K = 50$ and train on Cora. Compare accuracy to GCN.
(c) The spectral approach requires the eigendecomposition of the Laplacian, which is $O(N^3)$. For Cora ($N = 2{,}708$), this is feasible. Estimate the wall-clock time for the eigendecomposition. At what graph size does the spectral approach become impractical?
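A sketch of the truncated spectral filter in (b) (function name ours); the sanity check uses a random symmetric matrix in place of the Cora Laplacian:

```python
import numpy as np


def truncated_spectral_conv(L_norm: np.ndarray, x: np.ndarray,
                            theta: np.ndarray) -> np.ndarray:
    """Filter in the eigenbasis: U_K diag(theta) U_K^T x, K = len(theta)."""
    _, eigvecs = np.linalg.eigh(L_norm)     # eigenvalues in ascending order
    U_K = eigvecs[:, : theta.shape[0]]      # first K eigenvectors
    return U_K @ (theta[:, None] * (U_K.T @ x))


# Sanity check: with K = N and an all-ones filter, the conv is the identity
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
L_norm = (M + M.T) / 2  # any symmetric matrix stands in for the Laplacian
x = rng.normal(size=(6, 3))
out = truncated_spectral_conv(L_norm, x, np.ones(6))
print(np.allclose(out, x))  # True
```

In training, theta would be a learnable parameter vector of length K and the eigendecomposition is computed once up front.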
Exercise 14.25 (****)
Provably more powerful GNNs. The 1-WL test cannot distinguish $k$-regular graphs with the same number of nodes. Higher-order WL tests ($k$-WL for $k \geq 3$) can distinguish more graphs but at higher computational cost.
(a) Implement a simple 2-dimensional WL test: for each pair of nodes $(i, j)$, the "2-WL label" is updated based on the multiset of labels of all pairs involving $i$ or $j$. Run this on the $C_6$ vs. $C_3 + C_3$ example from Exercise 14.11. Can 2-WL distinguish them?
(b) Implement a PPGN (Provably Powerful Graph Network) that operates on $N \times N$ feature matrices. For each pair of nodes $(i, j)$, compute features by aggregating over the third index $k$. This is $O(N^3)$ per layer. Verify that it can distinguish graphs that 1-WL cannot.
(c) Discuss the practical implications: when is the added expressiveness of higher-order GNNs worth the computational cost? Give a specific application where 1-WL expressiveness is insufficient and a specific application where it is sufficient.
Exercise 14.26 (****)
Over-smoothing: theoretical analysis.
(a) For a connected $d$-regular graph, show that the eigenvalues of $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}$ (with self-loops) are $\hat{\lambda}_k = (\lambda_k + 1) / (d + 1)$, where $\lambda_k$ are the eigenvalues of $\mathbf{A}$.
(b) After $K$ layers of linear GCN (no nonlinearity), the node features are $\mathbf{H}^{(K)} = \hat{\mathbf{A}}^K \mathbf{X} \prod_\ell \mathbf{W}^{(\ell)}$. Show that the contribution of the $k$-th eigenvector to $\mathbf{H}^{(K)}$ decays as $\hat{\lambda}_k^K$. Since $|\hat{\lambda}_k| < 1$ for $k > 1$, all components except the DC component decay exponentially.
(c) Compute the rate of over-smoothing for a $d$-regular graph: how many layers until the second eigenvector's contribution is less than 1% of the first eigenvector's? Express the answer in terms of the spectral gap $1 - |\hat{\lambda}_2|$.
Exercise 14.27 (****)
GNNs and the heterophily challenge.
(a) Generate a synthetic heterophilous graph: 500 nodes with 2 classes, where nodes are connected preferentially to nodes of the opposite class (heterophily ratio $h > 0.7$, where $h$ is the fraction of edges connecting nodes of different classes).
(b) Train GCN, GraphSAGE, and GAT on this graph. Also train an MLP (no graph structure). Which performs best? Why do standard GNNs struggle with heterophily?
(c) Implement a simple fix: the H2GCN approach of separating ego and neighbor representations and including 2-hop neighbors. Does this improve accuracy on the heterophilous graph?
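A generator for part (a), sketched with our own function name; each edge crosses classes with probability h, so the empirical heterophily ratio concentrates around h:

```python
import torch


def heterophilous_graph(n: int = 500, h: float = 0.8, m: int = 2000, seed: int = 0):
    """Synthetic 2-class graph where a fraction h of edges cross classes."""
    g = torch.Generator().manual_seed(seed)
    y = torch.randint(0, 2, (n,), generator=g)
    idx = [torch.where(y == c)[0] for c in (0, 1)]
    src = torch.randint(0, n, (m,), generator=g)
    cross = torch.rand(m, generator=g) < h
    dst = torch.empty(m, dtype=torch.long)
    for e in range(m):
        # Sample the destination from the opposite class for cross edges,
        # from the same class otherwise
        pool = idx[1 - y[src[e]].item()] if cross[e] else idx[y[src[e]].item()]
        dst[e] = pool[torch.randint(0, len(pool), (1,), generator=g)]
    return torch.stack([src, dst]), y


edge_index, y = heterophilous_graph()
h_emp = (y[edge_index[0]] != y[edge_index[1]]).float().mean()
print(round(h_emp.item(), 2))  # close to 0.8
```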
Exercise 14.28 (***)
Scaling GNNs to large graphs. The ogbn-arxiv dataset (Open Graph Benchmark) has 169,343 nodes and 1,166,243 edges.
(a) Load ogbn-arxiv from ogb.nodeproppred. Attempt to train a full-batch GCN. Report memory usage and training time per epoch.
(b) Train with GraphSAGE and NeighborLoader (batch_size=1024, num_neighbors=[15, 10, 5]). Report memory usage and training time. How does the accuracy compare to full-batch?
(c) Implement Cluster-GCN: partition the graph into clusters using METIS, and train on one cluster at a time. Compare accuracy and speed to neighborhood sampling.