Chapter 40: Exercises
Exercise 40.1 --- Inference Scaling Analysis (Conceptual)
Consider a language model that achieves 60% accuracy on a math benchmark with standard decoding (single pass, greedy). You have the following inference scaling options:
- Best-of-N sampling with a perfect verifier: accuracy scales as $1 - (1 - p)^N$ where $p = 0.6$.
- Chain-of-thought with extended tokens: accuracy scales as $A - B \cdot c^{-0.4}$ with $A = 0.95$, $B = 0.35$, where $c$ is the number of reasoning tokens divided by 100.
(a) For best-of-N, how many candidates $N$ are needed to reach 95% accuracy?
(b) For chain-of-thought scaling, how many reasoning tokens are needed to reach 90% accuracy?
(c) If each best-of-N candidate costs 500 tokens and chain-of-thought compute is counted directly in reasoning tokens, which strategy is more compute-efficient to reach 90% accuracy?
(d) Discuss what happens when the verifier in best-of-N is imperfect (e.g., 80% reliable). How does this change the analysis?
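As a starting point for parts (a)-(c), the two scaling laws can be wrapped in small helpers and tabulated to bracket the answers before solving the inequalities by hand (the function names are illustrative, not part of the exercise):

```python
def best_of_n_accuracy(p: float, n: int) -> float:
    """Best-of-N with a perfect verifier: 1 - (1 - p)^N."""
    return 1.0 - (1.0 - p) ** n

def cot_accuracy(tokens: float, a: float = 0.95, b: float = 0.35) -> float:
    """Chain-of-thought scaling: A - B * c^(-0.4), with c = tokens / 100."""
    c = tokens / 100.0
    return a - b * c ** -0.4

# Tabulate both curves to see roughly where each crosses the targets.
for n in range(1, 6):
    print(f"N={n}: {best_of_n_accuracy(0.6, n):.4f}")
for tokens in (100, 1000, 10000, 20000):
    print(f"{tokens} tokens: {cot_accuracy(tokens):.4f}")
```

Comparing the token totals at the 90% crossing points then answers part (c) directly.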
Exercise 40.2 --- World Model Error Accumulation (Mathematical)
A world model predicts the next state with mean squared error $\epsilon$ per step. Assume errors are independent and additive.
(a) Derive the expected cumulative error after $T$ prediction steps.
(b) If the model has a "re-grounding" mechanism that resets the state to ground truth every $K$ steps, what is the expected cumulative error over $T$ steps?
(c) Find the optimal re-grounding interval $K^*$ that minimizes total cost, given that each re-grounding step costs $C_r$ and each prediction error contributes cost proportional to the squared error. Express $K^*$ in terms of $\epsilon$, $C_r$, and $T$.
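Before deriving part (a) analytically, it can help to check the consequence of the independence assumption numerically. The sketch below assumes zero-mean Gaussian step errors with variance $\epsilon$ (one specific error model consistent with the problem statement) and verifies that the cumulative MSE grows roughly linearly in $T$:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01      # per-step mean squared error
T = 50          # prediction horizon
trials = 20000  # Monte Carlo rollouts

# Each step adds an independent zero-mean error with variance eps.
steps = rng.normal(0.0, np.sqrt(eps), size=(trials, T))
cumulative = np.cumsum(steps, axis=1)           # accumulated state error
mse_over_time = (cumulative ** 2).mean(axis=0)  # empirical MSE at each step

# Under independence, the MSE at step t should be close to t * eps.
print(mse_over_time[0], mse_over_time[-1], T * eps)
```

The same simulation, with a reset every $K$ steps, gives a quick numerical check on your answers to parts (b) and (c).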
Exercise 40.3 --- Elastic Weight Consolidation (Coding)
Implement a simplified version of Elastic Weight Consolidation (EWC) for a two-task continual learning scenario.
(a) Create a simple MLP with two hidden layers (128 units each, ReLU activation).
(b) Train on Task A: classify digits 0--4 from a synthetic dataset.
(c) Compute the Fisher information matrix (diagonal approximation) after training on Task A.
(d) Train on Task B: classify digits 5--9, with the EWC penalty term.
(e) Compare the accuracy on Task A with and without EWC after training on Task B.
(f) Plot the accuracy on both tasks as a function of the EWC regularization strength $\lambda \in \{0, 0.1, 1, 10, 100, 1000\}$.
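The exercise asks for an MLP; the sketch below strips the idea down to logistic regression in NumPy so the two EWC ingredients, the diagonal Fisher estimate and the quadratic penalty, are visible in a few lines. All choices here (λ = 100, learning rate, orthogonal cluster placement) are illustrative, and how much forgetting appears depends on the task geometry; treat this as scaffolding for the full MLP version:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(centers, n=200):
    """Synthetic 2-class task: Gaussian clusters around the given centers."""
    X = np.vstack([rng.normal(c, 0.5, size=(n, 2)) for c in centers])
    y = np.repeat([0, 1], n)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, w, lam=0.0, w_star=None, fisher=None, lr=0.1, steps=500):
    """Logistic regression with an optional EWC penalty lam * F * (w - w*)^2."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add bias column
    for _ in range(steps):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        if lam > 0:
            grad += lam * fisher * (w - w_star)  # pull toward task-A optimum
        w -= lr * grad
    return w

def diag_fisher(X, y, w):
    """Diagonal Fisher approximation: mean squared per-example gradient."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    per_example = Xb * (sigmoid(Xb @ w) - y)[:, None]
    return (per_example ** 2).mean(axis=0)

def acc(X, y, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return ((sigmoid(Xb @ w) > 0.5) == y).mean()

XA, yA = make_task([(-2, 0), (2, 0)])   # task A: separated along x
XB, yB = make_task([(0, -2), (0, 2)])   # task B: separated along y

w_star = train(XA, yA, np.zeros(3))
F = diag_fisher(XA, yA, w_star)

w_no_ewc = train(XB, yB, w_star.copy())
w_ewc = train(XB, yB, w_star.copy(), lam=100.0, w_star=w_star, fisher=F)
print("task A after B, no EWC:", acc(XA, yA, w_no_ewc))
print("task A after B, EWC:   ", acc(XA, yA, w_ewc))
```

For part (f), sweep `lam` over the given grid and record both accuracies at each value.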
Exercise 40.4 --- Neurosymbolic Reasoning (Coding)
Build a simple neurosymbolic system that combines a neural text classifier with a symbolic rule engine.
(a) Train a neural network to classify short text inputs into categories (e.g., "math question," "factual question," "opinion question").
(b) Implement a symbolic rule engine with the following rules:
- If category is "math question," route to a Python code executor.
- If category is "factual question," route to a knowledge base lookup.
- If category is "opinion question," return a disclaimer.
(c) Evaluate the system on 20 test inputs and measure routing accuracy.
(d) Discuss: What are the failure modes of this architecture? How would you improve it?
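A minimal sketch of part (b)'s routing layer, with the neural classifier from part (a) replaced by a keyword heuristic stub so the symbolic dispatch can be exercised on its own (the category names, handlers, and keywords are all placeholders):

```python
def classify(text: str) -> str:
    """Stand-in for the neural classifier from part (a)."""
    lowered = text.lower()
    if any(tok in lowered for tok in ("solve", "+", "integral", "compute")):
        return "math question"
    if lowered.startswith(("who", "when", "where", "what")):
        return "factual question"
    return "opinion question"

# Symbolic layer: one rule per category, dispatched by lookup.
RULES = {
    "math question": lambda q: f"[python-executor] {q}",
    "factual question": lambda q: f"[kb-lookup] {q}",
    "opinion question": lambda q: "This system does not provide opinions.",
}

def route(query: str) -> str:
    return RULES[classify(query)](query)

print(route("What is 2 + 2?"))  # the "+" keyword wins, so this routes to the executor
```

Replacing `classify` with a trained model, and the string-returning handlers with real tools, turns this into the full system of parts (a)-(c).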
Exercise 40.5 --- Agent Architecture Design (Conceptual)
Design an AI agent that can autonomously research a given scientific topic, collect relevant papers, summarize them, and produce a structured literature review.
(a) Draw the agent architecture, identifying: perception, memory, planning, action, and reflection components.
(b) Specify the tool set the agent needs (e.g., search API, PDF parser, summarizer).
(c) Define a concrete evaluation rubric with at least five measurable criteria (e.g., coverage, accuracy, coherence, citation quality, relevance).
(d) Identify three potential failure modes and propose mitigation strategies for each.
(e) Estimate the compute cost (in API calls and tokens) for producing a 10-paper literature review. How does cost scale with the number of papers?
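For part (e), a back-of-the-envelope cost model is easier to criticize and refine once it is written down. Every per-step figure below is an assumption to be replaced with your own estimates, not a measurement:

```python
def review_cost(n_papers, search_calls=3, tokens_per_paper=8000,
                summary_tokens=1000, synthesis_overhead=2000):
    """Rough cost model: search, then parse+summarize per paper, then
    one synthesis pass. Returns (api_calls, total_tokens)."""
    api_calls = search_calls + 2 * n_papers + 1
    tokens = n_papers * (tokens_per_paper + summary_tokens) + synthesis_overhead
    return api_calls, tokens

print(review_cost(10))
```

Under these assumptions both calls and tokens scale linearly in the number of papers; the interesting part of the exercise is arguing where that linearity breaks (e.g., cross-paper synthesis context growing with the corpus).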
Exercise 40.6 --- AGI Level Assessment (Conceptual)
Using the graduated AGI framework from Section 40.7.3:
(a) For each of the following systems, assign an AGI level (0--4) and justify your rating:
- A chess engine (e.g., Stockfish)
- GPT-4 / Claude 3.5 Sonnet (frontier LLMs circa 2024)
- AlphaFold2
- A self-driving car (Level 4 autonomy)
- A hypothetical system that passes all university exams at the 95th percentile
(b) What tasks would a Level 2 system need to perform that current Level 1 systems cannot?
(c) Argue for or against the proposition: "A system that achieves Level 2 on cognitive tasks but has no physical embodiment should be considered AGI."
Exercise 40.7 --- Quantum Circuit Simulation (Coding)
Using only NumPy (no quantum libraries), simulate a simple parameterized quantum circuit.
(a) Implement single-qubit rotation gates $R_x(\theta)$, $R_y(\theta)$, $R_z(\theta)$ as $2 \times 2$ unitary matrices.
(b) Implement a two-qubit CNOT gate as a $4 \times 4$ matrix.
(c) Build a 2-qubit variational circuit: $R_y(\theta_1) \otimes R_y(\theta_2) \to \text{CNOT} \to R_y(\theta_3) \otimes R_y(\theta_4)$.
(d) Compute the expectation value $\langle Z_1 \otimes Z_2 \rangle$ for the output state as a function of $\theta = [\theta_1, \theta_2, \theta_3, \theta_4]$.
(e) Use gradient descent (finite differences for gradients) to find parameters $\theta^*$ that minimize the expectation value. What is the minimum value achievable?
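A compact NumPy sketch covering parts (a)-(d), using the standard half-angle convention for the rotation gates ($R_z$ is included for part (a) even though the circuit only uses $R_y$); the finite-difference optimization of part (e) is left to you:

```python
import numpy as np

# Part (a): single-qubit rotations, half-angle convention.
def rx(t):
    return np.array([[np.cos(t / 2), -1j * np.sin(t / 2)],
                     [-1j * np.sin(t / 2), np.cos(t / 2)]])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2), np.cos(t / 2)]], dtype=complex)

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0],
                     [0, np.exp(1j * t / 2)]])

# Part (b): CNOT with qubit 1 as control; Z x Z observable for part (d).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)
ZZ = np.kron(np.diag([1.0, -1.0]), np.diag([1.0, -1.0])).astype(complex)

def expectation(theta):
    """Parts (c)-(d): <Z1 Z2> after Ry layer -> CNOT -> Ry layer on |00>."""
    state = np.zeros(4, dtype=complex)
    state[0] = 1.0  # |00>
    state = np.kron(ry(theta[0]), ry(theta[1])) @ state
    state = CNOT @ state
    state = np.kron(ry(theta[2]), ry(theta[3])) @ state
    return float(np.real(state.conj() @ ZZ @ state))

print(expectation([0.0, 0.0, 0.0, 0.0]))  # |00> is a +1 eigenstate of Z x Z
```

Sanity checks worth adding before optimizing: each gate times its conjugate transpose should be the identity, and the expectation should always lie in $[-1, 1]$.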
Exercise 40.8 --- Career Analysis (Coding + Reflection)
(a) Write a Python script that scrapes AI job postings (or uses provided synthetic data about them) and extracts the most frequently mentioned skills, tools, and qualifications.
(b) Categorize the skills into: foundational (math, programming), technical (specific frameworks, tools), domain (NLP, CV, robotics), and soft skills (communication, leadership).
(c) Visualize the results as a horizontal bar chart showing skill frequency by category.
(d) Based on your analysis, identify three skills you should prioritize developing in the next 12 months. Write a brief learning plan (1 paragraph each) for each skill.
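A sketch of the counting and categorization steps of parts (a)-(b), using a hand-made skill taxonomy and two synthetic postings; both are placeholders for your scraped data and a fuller taxonomy:

```python
from collections import Counter

# Hypothetical skill -> category taxonomy; extend with real data.
TAXONOMY = {
    "python": "foundational", "linear algebra": "foundational",
    "pytorch": "technical", "docker": "technical",
    "nlp": "domain", "computer vision": "domain",
    "communication": "soft",
}

postings = [
    "Requires Python, PyTorch, and strong communication.",
    "NLP experience preferred; Python and Docker a plus.",
]

# Count substring matches per skill (a real version needs tokenization
# to avoid false matches inside longer words).
counts = Counter()
for post in postings:
    text = post.lower()
    for skill in TAXONOMY:
        if skill in text:
            counts[skill] += 1

by_category = Counter()
for skill, n in counts.items():
    by_category[TAXONOMY[skill]] += n
print(counts.most_common(3), dict(by_category))
```

Feeding `by_category` into a horizontal bar chart (e.g., `matplotlib.pyplot.barh`) completes part (c).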
Exercise 40.9 --- Continual Learning Benchmark (Coding)
Create a benchmark for evaluating continual learning methods.
(a) Generate a sequence of five synthetic classification tasks (each with 2D Gaussian clusters, different means).
(b) Implement three baselines:
- Naive fine-tuning (no continual learning strategy)
- EWC (from Exercise 40.3)
- Experience replay with a buffer of 100 examples
(c) After training on all five tasks sequentially, measure:
- Average accuracy across all five tasks
- Forgetting measure: average drop in accuracy on previous tasks after learning each new task
- Forward transfer: improvement on a new task relative to training from scratch
(d) Present results in a table and discuss which method works best and why.
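The metrics of part (c) are easy to get subtly wrong, so it helps to fix their definitions in code first. The helper below computes the first two from an accuracy matrix (forward transfer is omitted because it needs additional from-scratch baselines); the example matrix is made up:

```python
import numpy as np

def continual_metrics(acc):
    """Metrics from acc[i][j] = accuracy on task j after finishing
    training on task i. Returns (average final accuracy, forgetting)."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    avg_final = acc[-1].mean()
    # Forgetting: best accuracy ever reached on task j minus final accuracy,
    # averaged over all tasks except the last one learned.
    forgetting = np.mean([acc[:T - 1, j].max() - acc[-1, j]
                          for j in range(T - 1)])
    return avg_final, forgetting

A = [[0.95, 0.50, 0.50],
     [0.80, 0.93, 0.50],
     [0.70, 0.85, 0.94]]
print(continual_metrics(A))
```

Running each baseline produces one such matrix, and the table of part (d) is then one row of metrics per method.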
Exercise 40.10 --- Future Technology Assessment (Conceptual)
Choose one emerging AI technology not covered in depth in this chapter (e.g., neuromorphic computing, DNA data storage for training data, federated foundation models, AI-designed AI architectures).
(a) Write a one-page technical summary covering: what it is, how it works, current state of development, and key challenges.
(b) Assess its potential impact on AI engineering using a structured framework:
- Timeline: When might it become practically relevant? (1 year, 5 years, 10+ years)
- Impact magnitude: How much would it change current practices? (incremental, significant, transformative)
- Confidence: How confident are you in your assessment? (low, medium, high)
(c) Identify one concrete project you could undertake today to prepare for this technology's arrival.
Exercise 40.11 --- Building an Adaptive Inference System (Coding)
Build a prototype adaptive inference system that adjusts compute allocation based on query difficulty.
(a) Create a DifficultyClassifier that estimates query difficulty from input features (length, vocabulary complexity, presence of technical terms).
(b) Create a ComputeAllocator that maps difficulty scores to compute budgets (number of reasoning steps, number of candidates for best-of-N).
(c) Create a MockLLM that simulates inference with variable compute (higher compute = higher probability of correct answer, with diminishing returns).
(d) Evaluate the system on 100 synthetic queries with known difficulty levels. Compare:
- Fixed low compute for all queries
- Fixed high compute for all queries
- Adaptive compute allocation
(e) Plot the accuracy vs. total compute trade-off for each strategy.
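A sketch of the MockLLM of part (c), modeling diminishing returns as exponential saturation toward a ceiling (one of several reasonable choices); all constants are illustrative:

```python
import math
import random

class MockLLM:
    """Simulated model: success probability rises with compute and
    saturates, with harder queries needing more compute to saturate."""

    def __init__(self, floor=0.4, ceiling=0.95, rate=0.3, seed=0):
        self.floor, self.ceiling, self.rate = floor, ceiling, rate
        self.rng = random.Random(seed)

    def p_correct(self, compute, difficulty):
        gain = (self.ceiling - self.floor) * (
            1 - math.exp(-self.rate * compute / difficulty))
        return self.floor + gain

    def answer_correct(self, compute, difficulty):
        """One simulated inference call; True if the answer is correct."""
        return self.rng.random() < self.p_correct(compute, difficulty)

llm = MockLLM()
for compute in (1, 4, 16, 64):
    print(compute, round(llm.p_correct(compute, difficulty=2.0), 3))
```

The evaluation in part (d) then amounts to summing `compute` and counting `answer_correct` outcomes under each allocation strategy.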
Exercise 40.12 --- Capstone Reflection Essay
Write a 500--800 word reflective essay addressing:
(a) Which three topics from this book were most valuable to your understanding of AI engineering, and why?
(b) Which topic do you wish had been covered in more depth?
(c) How has your understanding of AI changed from Chapter 1 to Chapter 40?
(d) What is your personal "next step" in your AI engineering journey?
This exercise has no single correct answer. The goal is thoughtful, honest reflection on your learning journey.