Chapter 23: Exercises — Network Analysis of Information Spread

Conceptual Exercises

Exercise 23.1 — Small-World Properties

A social media retweet network has 50,000 nodes. The average shortest path length between any two nodes is 4.2, and the average clustering coefficient is 0.38. A random graph of the same size and density would be expected to have an average path length of 3.9 and a clustering coefficient of 0.0002.

a) Is this network a small-world network? Justify your answer using the Watts-Strogatz criteria. b) What does the high clustering coefficient imply about the local structure of information sharing? c) What does the short average path length imply about how quickly misinformation could theoretically reach the entire network? d) Why might a researcher prefer to focus on network structure rather than simply counting total retweets when studying misinformation spread?


Exercise 23.2 — Centrality Measures

Consider the following directed graph representing a retweet network (edges point from retweeter to original poster, i.e., A→B means A retweeted B):

Nodes: {A, B, C, D, E, F} Edges: A→B, A→C, D→B, E→B, F→B, C→B, E→C, F→C, D→A

a) Compute the in-degree and out-degree of each node. b) Which node has the highest in-degree centrality? What does this mean in terms of information spread? c) Identify which node is most likely to have high betweenness centrality. Explain your reasoning without computing it formally. d) If you wanted to suppress the spread of content, which node would you prioritize deactivating? Would your answer differ depending on whether your goal is to reduce immediate reach or to disrupt cross-community flow? e) Explain why eigenvector centrality might rank nodes differently from degree centrality in this network.
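For parts (a) and (b), the degrees can be checked directly in NetworkX; a minimal sketch using the edge list above:

```python
import networkx as nx

# Edge A->B means A retweeted B, so in-degree counts how often a node was retweeted.
edges = [("A", "B"), ("A", "C"), ("D", "B"), ("E", "B"), ("F", "B"),
         ("C", "B"), ("E", "C"), ("F", "C"), ("D", "A")]
G = nx.DiGraph(edges)

print(dict(G.in_degree()))   # B is retweeted most often (in-degree 5)
print(dict(G.out_degree()))
```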


Exercise 23.3 — Network Metrics Interpretation

A researcher constructs two Twitter networks — one for a verified political disinformation campaign and one for organic political discussion about the same topic. The metrics are:

Metric                    Campaign Network   Organic Network
Modularity                0.71               0.34
Avg. Clustering Coeff.    0.62               0.41
Avg. Path Length          3.1                5.8
Cross-Community Edges     4%                 28%
Network Diameter          8                  14

a) Which network shows stronger echo chamber characteristics? Explain using at least three metrics. b) Why might the campaign network have a shorter average path length than the organic network? c) What does the 4% cross-community edge fraction in the campaign network suggest about how information flows within that campaign? d) A policy maker argues that the high modularity in the campaign network is evidence of coordinated behavior. Do you agree? What additional evidence would strengthen or weaken this claim?


Exercise 23.4 — SIR Model Dynamics

A piece of health misinformation begins spreading in a population of 10,000 people. The transmission rate β = 0.3 and the recovery rate γ = 0.1.

a) Compute the basic reproduction number R₀. Will the misinformation spread or die out? b) What would β need to be reduced to in order to bring R₀ below 1 (assuming γ is fixed)? c) If a correction campaign increases γ (the rate at which people stop sharing the misinformation) to 0.35, what happens to R₀? d) The SIR model assumes homogeneous mixing. Explain why this assumption is problematic for modeling misinformation on social media. What model would be more appropriate? e) In what way does the SEIR model's "Exposed" state better capture real information diffusion dynamics?
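As a sanity check, the arithmetic in parts (a)-(c) takes only a few lines (using R₀ = β/γ for the basic SIR model):

```python
beta, gamma = 0.3, 0.1

R0 = beta / gamma              # ~3: R0 > 1, so the misinformation spreads
beta_threshold = gamma * 1.0   # beta must fall below 0.1 to bring R0 under 1
R0_corrected = beta / 0.35     # ~0.86: raising gamma to 0.35 pushes R0 below 1

print(R0, beta_threshold, R0_corrected)
```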


Exercise 23.5 — Independent Cascade Model

A network has 8 nodes. The seed set is {Node 1}. The transmission probability on all edges is p = 0.4. Each edge can only be traversed once (standard IC model assumption).

a) Explain the mechanics of one time step in the IC model. b) If Node 1 has three neighbors (Nodes 2, 3, and 4), what is the probability that none of them become activated in the first time step? c) What is the probability that exactly two of the three neighbors are activated? d) Compare how the IC model and the Linear Threshold model would represent the difference between casual sharing of a meme and the adoption of a political belief. Which model is more appropriate for each? e) Explain the concept of submodularity in the context of influence maximization. Why does submodularity guarantee that the greedy algorithm achieves a (1 - 1/e) approximation?
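Parts (b) and (c) reduce to binomial calculations, since each of the three edges is an independent Bernoulli trial with p = 0.4; a quick check:

```python
from math import comb

p, n = 0.4, 3

p_none = (1 - p) ** n                        # 0.6^3 ~ 0.216
p_exactly_two = comb(n, 2) * p**2 * (1 - p)  # 3 * 0.16 * 0.6 ~ 0.288

print(p_none, p_exactly_two)
```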


Exercise 23.6 — Vosoughi et al. Methodology

Read the summary of Vosoughi et al. (2018) in Section 23.5 and answer the following:

a) Why did the researchers use multiple fact-checking organizations rather than developing their own fact-checking system? What are the advantages and disadvantages of this approach? b) The study found that bots spread true and false news at the same rate. Why is this finding important for policy debates about bot regulation? c) Design a replication study of Vosoughi et al. using data from a platform other than Twitter. What methodological challenges would you face? How would you address them? d) A critic argues that Vosoughi et al.'s finding that "novelty drives sharing" is confounded by emotional arousal — false stories may simply be more emotionally arousing, and emotion (not novelty per se) drives sharing. How would you design an analysis to disentangle novelty from emotional arousal? e) What are the ethical implications of Vosoughi et al.'s finding that humans (not bots) are primarily responsible for false news spread?


Exercise 23.7 — Community Detection

Explain the following:

a) Why does the Louvain algorithm produce a hierarchical community structure? What is the significance of this hierarchy for understanding information networks? b) The Girvan-Newman algorithm is based on removing high-betweenness edges. Why would edges with high betweenness tend to be between communities rather than within them? c) Label propagation is non-deterministic — it can produce different results on the same network. Why is this, and how should a researcher handle this instability? d) A network has a modularity score Q = 0.05. What does this tell you about its community structure? Is this consistent or inconsistent with echo chamber dynamics? e) Describe how you would validate that the communities detected by the Louvain algorithm on a political Twitter network correspond to actual partisan communities. What external data sources would you use?


Exercise 23.8 — Influence Maximization

a) State the influence maximization problem formally. Why is it NP-hard? b) Explain the intuition behind the greedy algorithm for influence maximization. What does it mean for a function to be submodular, and why does this property matter? c) Compare the following seed selection heuristics: top-k by degree, top-k by betweenness, greedy IM. Under what conditions would each outperform the others? d) You are designing a counter-messaging campaign to correct a piece of health misinformation that has already spread to 15% of a network. Would influence maximization be useful at this stage? Justify your answer with reference to diffusion model dynamics. e) Discuss the dual-use problem of influence maximization tools. Should they be regulated? Who should decide?


Exercise 23.9 — Cross-Platform Tracking

A research team wants to track a piece of electoral misinformation as it moves from Telegram channels to Twitter to mainstream news outlets.

a) Describe at least three methodological challenges they would face. b) How could URL tracking be used to link posts across platforms? What are the limitations of this approach? c) Propose a research design for tracking this misinformation across platforms using public data only. Include your data collection strategy, network construction approach, and analytical methods. d) Why do data access asymmetries across platforms create selection biases in misinformation research? Give a specific example.


Exercise 23.10 — Scale-Free Networks

a) What is preferential attachment, and how does it generate scale-free networks? b) If a social media platform's degree distribution follows P(k) ∝ k^{-2.5}, what fraction of nodes have degree greater than 1,000 relative to the fraction with degree greater than 100? (Hint: use the power law scaling relationship.) c) Scale-free networks are described as "robust yet fragile." Explain what this means for misinformation spread, specifically in terms of network resilience to deplatforming random versus targeted accounts. d) A social media platform implements a "follower cap" — no account can have more than 100,000 followers. How would this policy affect the degree distribution of the network? What would the implications be for information spread?
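For part (b), recall that if P(k) ∝ k^(−α), the complementary CDF scales as k^(−(α−1)), so the ratio of the two tail fractions is (1000/100)^(−1.5); a quick check:

```python
alpha = 2.5

# P(K > k) ~ k^-(alpha - 1), so compare the tails at k = 1000 and k = 100.
ratio = (1000 / 100) ** -(alpha - 1)
print(ratio)  # ~0.0316: only about 3% as many nodes exceed degree 1000 as exceed 100
```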


Coding Exercises

Exercise 23.11 — Basic Network Construction (Python/NetworkX)

Write a Python script that: a) Generates a 20-node graph using the Barabási-Albert preferential attachment model (use nx.barabasi_albert_graph; note that this generator returns an undirected graph). b) Converts it to a directed graph with G.to_directed() to mimic a retweet network. c) Computes degree centrality, betweenness centrality, and eigenvector centrality for all nodes. d) Identifies the top 5 nodes by each centrality measure. e) Prints a summary table comparing the rankings across measures. f) Saves the graph as a JSON file using nx.node_link_data and json.dump.
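A starter sketch for the graph construction and centrality steps (a recent NetworkX is assumed; the seed value and m=2 are illustrative choices):

```python
import json
import networkx as nx

# BA preferential-attachment graph; the generator is undirected, so convert it.
G = nx.barabasi_albert_graph(20, 2, seed=42).to_directed()

# Three of the centrality measures on all nodes
centrality = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}

# Top-5 nodes per measure, printed side by side
for name, scores in centrality.items():
    top5 = sorted(scores, key=scores.get, reverse=True)[:5]
    print(f"{name:12s} top-5: {top5}")

# Node-link JSON serialization
with open("graph.json", "w") as fh:
    json.dump(nx.node_link_data(G), fh)
```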


Exercise 23.12 — Small-World Comparison

Write a Python script that: a) Generates three graphs with n=500 nodes: a regular ring lattice, a Watts-Strogatz small-world graph (rewiring probability p=0.1), and an Erdős-Rényi random graph with the same average degree. b) For each graph, computes average clustering coefficient and average shortest path length. c) Creates a comparison table and a bar chart showing these metrics side by side. d) Annotates which graph exhibits small-world properties and explains why.


Exercise 23.13 — Cascade Reconstruction

Given the following retweet event log:

events = [
    {"time": 0, "tweeter": "original", "retweeter": None},
    {"time": 1, "tweeter": "original", "retweeter": "A"},
    {"time": 2, "tweeter": "original", "retweeter": "B"},
    {"time": 3, "tweeter": "A", "retweeter": "C"},
    {"time": 3, "tweeter": "A", "retweeter": "D"},
    {"time": 4, "tweeter": "B", "retweeter": "E"},
    {"time": 5, "tweeter": "C", "retweeter": "F"},
    {"time": 6, "tweeter": "D", "retweeter": "G"},
]

Write a Python script that: a) Constructs the cascade as a directed tree using NetworkX. b) Computes cascade depth, breadth (max nodes at any level), and size. c) Visualizes the cascade as a tree diagram with node labels showing time of activation. d) Extends the script to handle a list of cascades and compute aggregate statistics across all cascades.
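One way to sketch parts (a) and (b), reusing the event log above (the seed event with retweeter=None is skipped when adding edges):

```python
from collections import Counter
import networkx as nx

events = [
    {"time": 0, "tweeter": "original", "retweeter": None},
    {"time": 1, "tweeter": "original", "retweeter": "A"},
    {"time": 2, "tweeter": "original", "retweeter": "B"},
    {"time": 3, "tweeter": "A", "retweeter": "C"},
    {"time": 3, "tweeter": "A", "retweeter": "D"},
    {"time": 4, "tweeter": "B", "retweeter": "E"},
    {"time": 5, "tweeter": "C", "retweeter": "F"},
    {"time": 6, "tweeter": "D", "retweeter": "G"},
]

# (a) build the cascade tree (edge: tweeter -> retweeter)
T = nx.DiGraph()
for e in events:
    if e["retweeter"] is not None:
        T.add_edge(e["tweeter"], e["retweeter"], time=e["time"])

# (b) depth, breadth, size
levels = nx.shortest_path_length(T, source="original")
depth = max(levels.values())                      # longest root-to-leaf path
breadth = max(Counter(levels.values()).values())  # widest level
size = T.number_of_nodes()
print(depth, breadth, size)  # 3 3 8
```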


Exercise 23.14 — Community Detection and Modularity

Write a Python script that: a) Generates a synthetic network with 4 planted communities of 25 nodes each, with high within-community edge probability (0.3) and low between-community edge probability (0.02). b) Applies the Louvain algorithm to detect communities. c) Computes the modularity of both the planted partition and the detected partition. d) Visualizes the network with nodes colored by detected community. e) Computes the fraction of edges that cross community boundaries. f) Compares the detected communities against the planted communities using Normalized Mutual Information (NMI).
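A sketch of parts (a), (b), and (e), assuming NetworkX ≥ 3.0 (where nx.community.louvain_communities and nx.stochastic_block_model are available); NMI would come from e.g. scikit-learn's normalized_mutual_info_score:

```python
import networkx as nx

# (a) planted partition: 4 blocks of 25 nodes, p_in = 0.3, p_out = 0.02
sizes = [25] * 4
probs = [[0.3 if i == j else 0.02 for j in range(4)] for i in range(4)]
G = nx.stochastic_block_model(sizes, probs, seed=7)

# (b) Louvain communities and their modularity
communities = nx.community.louvain_communities(G, seed=7)
Q = nx.community.modularity(G, communities)

# (e) fraction of edges crossing detected community boundaries
label = {n: i for i, c in enumerate(communities) for n in c}
cross = sum(label[u] != label[v] for u, v in G.edges()) / G.number_of_edges()

print(len(communities), round(Q, 3), round(cross, 3))
```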


Exercise 23.15 — Independent Cascade Simulation

Implement the Independent Cascade model from scratch in Python (do not use an existing diffusion library): a) Function signature: ic_simulate(G, seeds, transmission_prob, num_simulations=100) returning an estimate of the expected number of activated nodes. b) The function should run num_simulations independent simulations and return the mean and standard deviation of activation counts. c) Apply to a Barabási-Albert network with 200 nodes and attachment parameter m=3. d) Compare the expected spread from seed sets selected by: (i) top-3 by degree, (ii) top-3 by betweenness, (iii) random selection. Which performs best? e) Plot the distribution of activation counts across simulations for each seed set.
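A minimal from-scratch sketch of parts (a)-(c) (the rng argument is an addition here, for reproducibility):

```python
import random
import networkx as nx

def ic_simulate(G, seeds, transmission_prob, num_simulations=100, rng=None):
    """Monte-Carlo IC: returns (mean, std) of the number of activated nodes."""
    rng = rng or random.Random()
    counts = []
    for _ in range(num_simulations):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in G.neighbors(u):
                    # each live edge gets exactly one activation attempt
                    if v not in active and rng.random() < transmission_prob:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        counts.append(len(active))
    mean = sum(counts) / len(counts)
    std = (sum((c - mean) ** 2 for c in counts) / len(counts)) ** 0.5
    return mean, std

# (c) BA network with 200 nodes, m = 3; seed with the top-3 nodes by degree
G = nx.barabasi_albert_graph(200, 3, seed=1)
seeds = sorted(G, key=G.degree, reverse=True)[:3]
print(ic_simulate(G, seeds, 0.1, num_simulations=200, rng=random.Random(1)))
```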


Exercise 23.16 — Centrality-Based Super-Spreader Identification

Using the NetworkX retweet network example code (or your own synthetic data): a) Identify the top 10 accounts by each of the four centrality measures. b) Compute the overlap between these top-10 lists using Jaccard similarity. c) For the accounts in the top-10 by at least two measures, retrieve (or simulate) their posting content and assess whether they show signs of coordinated behavior (high retweet percentage, narrow topic range, suspicious posting intervals). d) Write a brief report (as comments in your code) interpreting your findings.


Exercise 23.17 — Network Visualization

Using NetworkX and matplotlib: a) Construct a directed retweet network from a synthetic dataset of 100 nodes. b) Create four visualizations using different layout algorithms: spring layout, Kamada-Kawai layout, circular layout, and spectral layout. c) In the spring layout visualization, size nodes by degree centrality and color them by community. d) Add a legend showing community colors and a colorbar for node size. e) Save all four plots to a single PDF file with descriptive captions.


Exercise 23.18 — Temporal Network Analysis

Social networks change over time. Write a Python script that: a) Generates a synthetic temporal retweet dataset spanning 7 days. b) Constructs separate daily network snapshots. c) Tracks how the top-5 nodes by betweenness centrality change from day to day. d) Computes the Jaccard similarity between consecutive days' top-5 lists. e) Plots the network diameter and average clustering coefficient over time. f) Interprets the results: does the network become more or less echo-chambered over the week?


Exercise 23.19 — Cross-Platform URL Tracking

Write a Python script that simulates cross-platform URL tracking: a) Generate a synthetic dataset with 500 URL share events across three platforms (Twitter, Facebook, Reddit) with associated timestamps, account IDs, and platform labels. b) Construct a bipartite network where one set of nodes represents URLs and the other represents accounts. c) Project the bipartite network onto the account set (connect accounts that shared the same URL). d) Analyze the resulting network: identify accounts that bridge platforms, compute their centrality measures. e) Determine what fraction of URL sharing happens within-platform versus cross-platform.
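Parts (b) and (c) are a standard bipartite projection; a toy sketch with hypothetical URL and account IDs:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Synthetic share events as (url, account) pairs (toy data).
shares = [("u1", "acctA"), ("u1", "acctB"), ("u2", "acctB"),
          ("u2", "acctC"), ("u3", "acctA"), ("u3", "acctC")]

# (b) bipartite network: URLs on one side, accounts on the other
B = nx.Graph()
B.add_nodes_from({u for u, _ in shares}, bipartite=0)
B.add_nodes_from({a for _, a in shares}, bipartite=1)
B.add_edges_from(shares)

# (c) project onto accounts: link accounts that shared the same URL
accounts = {n for n, d in B.nodes(data=True) if d["bipartite"] == 1}
P = bipartite.projected_graph(B, accounts)
print(sorted(map(sorted, P.edges())))
# [['acctA', 'acctB'], ['acctA', 'acctC'], ['acctB', 'acctC']]
```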


Exercise 23.20 — Influence Maximization Implementation

Implement the greedy influence maximization algorithm: a) Use the IC model simulation from Exercise 23.15 to estimate expected influence for each candidate seed. b) Implement greedy seed selection: iteratively add the node with the highest marginal increase in expected influence. c) Compare greedy IM against degree-centrality-based seed selection across networks of different sizes (50, 100, 200 nodes) and transmission probabilities (0.1, 0.3, 0.5). d) Plot the influence spread curve (expected activated nodes vs. seed set size k from 1 to 10) for both strategies. e) Compute and plot the approximation ratio (greedy performance / degree-based performance) across parameter settings.
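A sketch of parts (a) and (b); the helper names (ic_spread, greedy_im) and the small network size are illustrative, and the IC helper repeats the Exercise 23.15 logic so the block is self-contained:

```python
import random
import networkx as nx

def ic_spread(G, seeds, p, sims, rng):
    """Monte-Carlo estimate of expected IC spread for a seed set."""
    total = 0
    for _ in range(sims):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in G.neighbors(u):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / sims

def greedy_im(G, k, p, sims=100, seed=0):
    """Greedy seed selection: repeatedly add the node with the largest marginal gain."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        base = ic_spread(G, chosen, p, sims, rng) if chosen else 0.0
        gains = {v: ic_spread(G, chosen + [v], p, sims, rng) - base
                 for v in G if v not in chosen}
        chosen.append(max(gains, key=gains.get))
    return chosen

G = nx.barabasi_albert_graph(50, 2, seed=3)
seeds_greedy = greedy_im(G, 3, p=0.1, sims=50)
seeds_degree = sorted(G, key=G.degree, reverse=True)[:3]
print(seeds_greedy, seeds_degree)
```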


Exercise 23.21 — SIR Model on a Network

Implement the network SIR model: a) Place the SIR model on a Watts-Strogatz network with n=500, k=6, p=0.1. b) Initialize with a single infected node (the highest-degree node). c) Simulate for 50 time steps with β=0.2 and γ=0.1. d) Plot S, I, R trajectories over time. e) Repeat with the seed node being a random low-degree node instead of the highest-degree node. Compare the resulting epidemic curves. f) Repeat with β=0.05 (R₀ < 1). Does the information die out as predicted?


Exercise 23.22 — Echo Chamber Measurement

Define and operationalize an "echo chamber score" for a social network community: a) Propose at least four quantitative metrics that together constitute an echo chamber score. Justify each. b) Implement the computation of each metric in Python for a given community subgraph. c) Apply your score to 5 synthetic communities with varying levels of isolation. d) Validate your score against a gold-standard dataset of known echo chambers and non-echo-chamber communities (you may use synthetic data with planted properties). e) Discuss the limitations of automated echo chamber detection.


Exercise 23.23 — Reproducing Vosoughi et al. Key Statistics

Using the synthetic cascade data in code/case-study-code.py: a) Compute the mean cascade depth, breadth, and size separately for "true" and "false" labeled cascades. b) Test whether the differences are statistically significant using appropriate tests (Mann-Whitney U test for non-normal distributions). c) Compute the "speed to reach N users" for N = 100, 500, 1000 for both cascade types. d) Create visualizations replicating the key figures from Vosoughi et al.: depth distributions, size distributions, and speed comparisons. e) Write a brief interpretation paragraph for each visualization.


Exercise 23.24 — Network Resilience Analysis

Analyze the resilience of an information network to targeted node removal: a) Construct a Barabási-Albert network with 300 nodes. b) Simulate three node removal strategies: (i) random removal, (ii) targeted removal of highest-degree nodes, (iii) targeted removal of highest-betweenness nodes. c) For each strategy, remove nodes in increments of 1% of the network and measure the size of the largest connected component after each removal. d) Plot the resilience curves for all three strategies on the same plot. e) Interpret the results in terms of the implications for deplatforming high-profile misinformation accounts.
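A sketch of parts (a)-(c) for the random and degree-targeted strategies (the helper name largest_cc_curve is illustrative; step=3 corresponds to 1% of 300 nodes):

```python
import random
import networkx as nx

def largest_cc_curve(G, order, step=3):
    """Size of the largest connected component as nodes are removed in `order`."""
    H = G.copy()
    sizes = [len(max(nx.connected_components(H), key=len))]
    for i in range(0, len(order), step):
        H.remove_nodes_from(order[i:i + step])
        comps = list(nx.connected_components(H))
        sizes.append(len(max(comps, key=len)) if comps else 0)
    return sizes

G = nx.barabasi_albert_graph(300, 2, seed=11)
by_degree = sorted(G, key=G.degree, reverse=True)  # targeted: hubs first
random_order = list(G)
random.Random(11).shuffle(random_order)

targeted = largest_cc_curve(G, by_degree)
rand = largest_cc_curve(G, random_order)
# Targeted hub removal should shatter the giant component far faster.
print(targeted[:10])
print(rand[:10])
```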


Exercise 23.25 — Full Pipeline Analysis

Design and implement a complete network analysis pipeline for a synthetic misinformation event: a) Generate a synthetic dataset simulating a misinformation cascade over 7 days in a 1,000-node network with planted community structure. b) Construct the retweet network and compute all centrality measures. c) Detect communities and compute modularity. d) Identify the top 10 super-spreaders using a composite centrality score. e) Simulate what would have happened if the top 3 super-spreaders had been removed at day 2 of the cascade. f) Visualize the full cascade, the community structure, and the counterfactual comparison. g) Write a 300-word analytical report (as a multiline string in your code) summarizing the findings.


Exercise 23.26 — Comparative Diffusion Models

On the same network: a) Run the SIR, IC, and LT models with comparable parameters. b) For each model, record the fraction of nodes ultimately activated. c) Track the activation curve (fraction activated vs. time step) for each model. d) Analyze how network clustering coefficient affects the final activation fraction differently across models. e) Discuss which model best captures the dynamics you observed in a recent real-world misinformation event you have researched.


Exercise 23.27 — Reporting for a Non-Technical Audience

You have completed a network analysis of a misinformation campaign. Write a 500-word report (suitable for a journalism outlet) that: a) Explains what network analysis revealed about the campaign without using jargon. b) Includes simplified descriptions of the three most important findings. c) Addresses potential limitations of the analysis. d) Proposes three concrete actions that platforms, regulators, or users could take based on the findings. e) Reflects on what the analysis could not tell us.