Chapter 3 Quiz: Calculus, Optimization, and Automatic Differentiation

Test your understanding of the key concepts from this chapter. Each question has one correct answer unless otherwise stated. Try to answer each question before revealing the solution.


Question 1: Derivative Basics

What is the derivative of f(x) = x ln(x) - x?

  • (a) ln(x)
  • (b) ln(x) - 1
  • (c) ln(x) + 1
  • (d) x / ln(x)
Show Answer **(a) ln(*x*)** Using the product rule on *x* ln(*x*): d/d*x* [*x* ln(*x*)] = ln(*x*) + *x* * (1/*x*) = ln(*x*) + 1. Then subtracting the derivative of *x*: ln(*x*) + 1 - 1 = ln(*x*).

Question 2: Partial Derivatives

For f(x, y) = x^2 e^y, what is the partial derivative with respect to y?

  • (a) 2x e^y
  • (b) x^2 e^y
  • (c) 2x y e^y
  • (d) x^2 y e^y
Show Answer **(b) *x*^2 *e*^*y*** When differentiating with respect to *y*, we treat *x*^2 as a constant. The derivative of *e*^*y* with respect to *y* is *e*^*y*, so the partial derivative is *x*^2 * *e*^*y*.

Question 3: Gradient Direction

The gradient of a scalar function at a point:

  • (a) Points in the direction of steepest descent
  • (b) Points in the direction of steepest ascent
  • (c) Is always perpendicular to the function surface
  • (d) Points toward the global minimum
Show Answer **(b) Points in the direction of steepest ascent** The gradient points in the direction of greatest increase of the function. This is why gradient *descent* moves in the *negative* gradient direction: to decrease the function, we move opposite to the gradient.

Question 4: Chain Rule

If y = sin(u) and u = x^2, what is dy/dx?

  • (a) cos(x^2)
  • (b) 2x cos(x^2)
  • (c) 2x sin(x^2)
  • (d) -2x sin(x^2)
Show Answer **(b) 2*x* cos(*x*^2)** By the chain rule: d*y*/d*x* = (d*y*/d*u*) * (d*u*/d*x*) = cos(*u*) * 2*x* = 2*x* cos(*x*^2).

Question 5: Sigmoid Derivative

The sigmoid function is sigma(x) = 1/(1 + e^(-x)). Its derivative is:

  • (a) sigma(x) * (1 + sigma(x))
  • (b) sigma(x) * (1 - sigma(x))
  • (c) sigma(x)^2
  • (d) 1 - sigma(x)^2
Show Answer **(b) *sigma*(*x*) * (1 - *sigma*(*x*))** The sigmoid derivative is *sigma*'(*x*) = *sigma*(*x*)(1 - *sigma*(*x*)). This elegant form is why the sigmoid was historically popular: its derivative is easy to compute from its output value. Note that the maximum value of the derivative is 0.25 (at *x* = 0), which contributes to the vanishing gradient problem.
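
As a quick numerical check, a minimal sketch (the `sigmoid` helper and test points are ours, not the chapter's) comparing the identity against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
analytic = sigmoid(x) * (1 - sigmoid(x))               # sigma(x)(1 - sigma(x))
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference

print(np.max(np.abs(analytic - numeric)))  # near zero: the identity holds
print(analytic.max())                      # 0.25, attained at x = 0
```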

Question 6: Critical Points

A saddle point of a function is characterized by:

  • (a) All eigenvalues of the Hessian are positive
  • (b) All eigenvalues of the Hessian are negative
  • (c) The Hessian has both positive and negative eigenvalues
  • (d) All eigenvalues of the Hessian are zero
Show Answer **(c) The Hessian has both positive and negative eigenvalues** At a saddle point, the function curves upward in some directions and downward in others. This is captured by the Hessian having a mix of positive eigenvalues (upward curvature) and negative eigenvalues (downward curvature). A local minimum has all positive eigenvalues, and a local maximum has all negative eigenvalues.
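
To make this concrete, a small sketch (the example function is ours): f(x, y) = x^2 - y^2 has a critical point at the origin whose Hessian has one eigenvalue of each sign.

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the origin: curves up along x, down along y
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.eigvalsh(H))   # [-2.  2.]: mixed signs -> saddle point
```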

Question 7: Convexity

Which of the following functions are convex? (Select all that apply.)

  • (a) f(x) = sin(x)
  • (b) f(x) = x^4 - x^2
  • (c) f(x) = e^x
  • (d) f(x) = -ln(x)
Show Answer **(c) *e*^*x*** and **(d) -ln(*x*)** Both are convex. *e*^*x* has second derivative *e*^*x* > 0 for all *x*, confirming convexity, and -ln(*x*) has second derivative 1/*x*^2 > 0 for *x* > 0, so it is convex on its domain. sin(*x*) is neither convex nor concave globally. *x*^4 - *x*^2 has second derivative 12*x*^2 - 2, which is negative near *x* = 0, so it is not convex.

Question 8: Learning Rate

In gradient descent, if the learning rate is too large:

  • (a) Convergence is guaranteed but slow
  • (b) The algorithm may overshoot and diverge
  • (c) The algorithm finds the global minimum
  • (d) Only local minima are found
Show Answer **(b) The algorithm may overshoot and diverge** A learning rate that is too large causes the parameter updates to overshoot the minimum, leading to oscillation or divergence. For a quadratic function *f*(*x*) = 0.5 *L* *x*^2, the maximum stable learning rate is 2/*L*. Beyond this, the updates grow without bound.
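
A minimal sketch of this threshold (the curvature value and step counts are arbitrary choices for illustration):

```python
L = 4.0  # curvature of f(x) = 0.5 * L * x**2, so f'(x) = L * x

def run_gd(lr, steps=20, x0=1.0):
    x = x0
    for _ in range(steps):
        x = x - lr * L * x   # each step multiplies x by (1 - lr * L)
    return x

print(run_gd(lr=0.4))   # lr < 2/L = 0.5: |1 - lr*L| < 1, x shrinks toward 0
print(run_gd(lr=0.6))   # lr > 2/L: |1 - lr*L| > 1, x grows without bound
```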

Question 9: Batch vs. Stochastic

Compared to batch gradient descent, stochastic gradient descent (single-example updates):

  • (a) Has a smoother loss curve but is slower per epoch
  • (b) Is faster per update and has a noisier loss curve
  • (c) Always converges to a better minimum
  • (d) Requires more memory per iteration
Show Answer **(b) Is faster per update and has a noisier loss curve** SGD processes a single example per update, making each step very fast. However, the gradient estimate from one example is noisy, leading to a noisy loss curve. This noise can be beneficial (helping escape shallow local minima) but makes convergence less stable. Batch GD produces a smoother loss curve, but it makes only one (expensive) update per full pass over the data, so its progress per epoch is much slower on large datasets.
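
A minimal SGD sketch on least squares (the data, step size, and epoch count are made up for illustration): each update touches a single example, so steps are cheap but individually noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr = 0.01
for epoch in range(5):
    for i in rng.permutation(len(X)):      # one example per update
        g = (X[i] @ w - y[i]) * X[i]       # noisy single-example gradient
        w -= lr * g
print(w)  # roughly recovers [1, -2, 0.5] despite the noisy steps
```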

Question 10: Mini-Batch Size

In practice, mini-batch sizes are often chosen as powers of 2 (32, 64, 128) primarily because:

  • (a) It makes the math simpler
  • (b) It aligns with GPU memory architecture for efficient computation
  • (c) It guarantees convergence
  • (d) It minimizes the variance of gradient estimates
Show Answer **(b) It aligns with GPU memory architecture for efficient computation** Modern GPUs are designed with memory and compute units that operate most efficiently on data aligned to powers of 2. Using batch sizes that are powers of 2 maximizes hardware utilization. While larger batch sizes do reduce gradient variance, this is not the primary reason for choosing powers of 2 specifically.

Question 11: Momentum

The primary benefit of adding momentum to gradient descent is:

  • (a) It reduces the learning rate automatically
  • (b) It dampens oscillations in narrow valleys and accelerates progress along consistent gradient directions
  • (c) It guarantees finding the global minimum
  • (d) It eliminates the need for a learning rate
Show Answer **(b) It dampens oscillations in narrow valleys and accelerates progress along consistent gradient directions** Momentum maintains a running average of past gradients. In a narrow valley, the gradients perpendicular to the valley direction alternate in sign, so they cancel in the average. Gradients along the valley point consistently in the same direction, so they accumulate. This leads to faster progress along the valley with less oscillation across it.
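
One common formulation of the update, as a sketch (some variants scale the gradient term by 1 - *beta* or fold the learning rate into the velocity; the names here are ours):

```python
# Heavy-ball momentum: velocity accumulates consistent gradient directions
def momentum_step(w, v, grad_fn, lr=0.01, beta=0.9):
    g = grad_fn(w)
    v = beta * v + g   # oscillating components cancel, consistent ones add up
    w = w - lr * v
    return w, v
```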

Question 12: Nesterov Momentum

How does Nesterov momentum differ from standard momentum?

  • (a) It uses a larger momentum coefficient
  • (b) It computes the gradient at the "look-ahead" position rather than the current position
  • (c) It eliminates the velocity term
  • (d) It uses a second-order approximation
Show Answer **(b) It computes the gradient at the "look-ahead" position rather than the current position** Nesterov momentum first takes a step in the direction of the accumulated velocity, then computes the gradient at this anticipated future position. This "look-ahead" provides a correction that often leads to better convergence, especially near the optimum where standard momentum tends to overshoot.
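
One of several algebraically equivalent formulations, sketched with illustrative names:

```python
# Nesterov momentum: evaluate the gradient at the look-ahead position
def nesterov_step(w, v, grad_fn, lr=0.01, beta=0.9):
    lookahead = w - lr * beta * v   # where the accumulated velocity points
    g = grad_fn(lookahead)          # gradient at the anticipated future position
    v = beta * v + g
    w = w - lr * v
    return w, v
```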

Question 13: RMSProp

RMSProp adapts the learning rate for each parameter by:

  • (a) Dividing by the running average of past gradients
  • (b) Dividing by the square root of the running average of squared gradients
  • (c) Multiplying by the running average of past gradients
  • (d) Adding the running average of squared gradients
Show Answer **(b) Dividing by the square root of the running average of squared gradients** RMSProp maintains an exponential moving average of squared gradients for each parameter. Parameters with consistently large gradients get a smaller effective learning rate (divided by a large number), while parameters with small gradients get a relatively larger one. The "RMS" in RMSProp stands for "root mean square."
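
A sketch of the per-parameter update (hyperparameter values are typical defaults, not from the chapter):

```python
import numpy as np

def rmsprop_step(w, s, g, lr=1e-3, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g**2        # EMA of squared gradients
    w = w - lr * g / (np.sqrt(s) + eps)   # large past gradients -> smaller step
    return w, s
```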

Question 14: Adam Components

The Adam optimizer combines ideas from:

  • (a) SGD and batch gradient descent
  • (b) Nesterov momentum and gradient clipping
  • (c) Momentum (first moment) and RMSProp (second moment)
  • (d) Newton's method and conjugate gradient
Show Answer **(c) Momentum (first moment) and RMSProp (second moment)** Adam maintains two running averages: the first moment (mean of gradients, like momentum) and the second moment (mean of squared gradients, like RMSProp). It also includes bias correction to account for the initialization at zero. The name "Adam" stands for "Adaptive Moment Estimation."

Question 15: Adam Bias Correction

The bias correction in Adam is necessary because:

  • (a) The gradients are biased estimators of the true gradient
  • (b) The moment estimates are initialized to zero, biasing them toward zero in early steps
  • (c) The learning rate is too large at the beginning
  • (d) The momentum coefficient needs to be adjusted
Show Answer **(b) The moment estimates are initialized to zero, biasing them toward zero in early steps** Since the first and second moments are initialized as zero vectors and updated using exponential moving averages, they are biased toward zero early in training. The bias correction divides by (1 - *beta*^t), which is small when *t* is small (large correction) and approaches 1 as *t* grows (negligible correction). This ensures the moment estimates are unbiased throughout training.
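
Putting Questions 14 and 15 together, a sketch of the full Adam update (variable names and defaults are illustrative; *t* counts steps starting from 1):

```python
import numpy as np

def adam_step(w, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first moment, as in momentum
    v = b2 * v + (1 - b2) * g**2     # second moment, as in RMSProp
    m_hat = m / (1 - b1**t)          # bias correction: large when t is small,
    v_hat = v / (1 - b2**t)          # negligible as t grows
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```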

Question 16: Automatic Differentiation

Automatic differentiation differs from numerical and symbolic differentiation because:

  • (a) It approximates derivatives using finite differences
  • (b) It produces symbolic expressions for derivatives
  • (c) It computes exact derivatives by decomposing computation into elementary operations and applying the chain rule
  • (d) It only works for linear functions
Show Answer **(c) It computes exact derivatives by decomposing computation into elementary operations and applying the chain rule** Automatic differentiation is neither numerical (finite differences) nor symbolic (expression manipulation). It evaluates derivatives exactly (up to floating-point precision) by recording the sequence of elementary operations and applying the chain rule to each. This avoids both the numerical errors of finite differences and the expression swell of symbolic differentiation.

Question 17: Forward vs. Reverse Mode

For a function f: R^1000 -> R^1, which mode of automatic differentiation is more efficient?

  • (a) Forward mode, requiring 1 pass
  • (b) Forward mode, requiring 1000 passes
  • (c) Reverse mode, requiring 1 pass
  • (d) Reverse mode, requiring 1000 passes
Show Answer **(c) Reverse mode, requiring 1 pass** Reverse mode computes the gradient of one output with respect to all inputs in a single backward pass (plus one forward pass). For *f*: R^1000 -> R^1, this means computing all 1000 partial derivatives in one pass. Forward mode would require 1000 passes (one per input). Since deep learning typically has one scalar loss and many parameters, reverse mode (backpropagation) is the standard choice.

Question 18: Computational Graphs

In a computational graph, what does each edge represent?

  • (a) A parameter of the model
  • (b) A data dependency between operations
  • (c) A gradient value
  • (d) A learning rate
Show Answer **(b) A data dependency between operations** In a computational graph, nodes represent variables (inputs, intermediate values, or outputs) and operations. Edges represent data flow, indicating which values are used as inputs to which operations. During the backward pass, gradients flow in the reverse direction along these edges.
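
To make the node/edge picture concrete, here is a minimal reverse-mode sketch in the spirit of toy autodiff engines (not the chapter's code): each `Value` node records its parents (the incoming edges), and `backward()` walks the graph in reverse.

```python
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # incoming edges: data dependencies
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad        # d(a+b)/da = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order: a node's grad is complete before its parents run
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x            # z = xy + x, so dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)    # 4.0 2.0
```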

Question 19: Vanishing Gradients

The vanishing gradient problem is most commonly associated with:

  • (a) Using ReLU activations
  • (b) Using sigmoid or tanh activations in deep networks
  • (c) Using too large a learning rate
  • (d) Using batch normalization
Show Answer **(b) Using sigmoid or tanh activations in deep networks** Sigmoid and tanh activations saturate for large inputs, producing derivatives near zero. When many such small derivatives are multiplied together through the chain rule in a deep network, the gradient shrinks exponentially, effectively preventing early layers from learning. ReLU activations help mitigate this because their derivative is 1 for positive inputs. Batch normalization also helps by keeping activations in the non-saturating regime.
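
Ignoring the weight matrices, a crude sketch of the exponential shrinkage: even in the best case, each sigmoid layer multiplies the gradient by at most 0.25.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

grad = 1.0
for _ in range(20):   # chain rule through 20 sigmoid "layers"
    grad *= sigmoid(0.0) * (1 - sigmoid(0.0))   # 0.25, the maximum possible
print(grad)   # 0.25**20 ~ 9e-13: early layers receive almost no signal
```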

Question 20: Gradient Clipping

Gradient clipping by norm:

  • (a) Sets all gradient components to zero if any exceeds the threshold
  • (b) Clips each component independently to the range [-c, c]
  • (c) Scales the gradient to have a maximum norm of c, preserving direction
  • (d) Doubles the gradient if it is below the threshold
Show Answer **(c) Scales the gradient to have a maximum norm of *c*, preserving direction** Clip-by-norm checks the total norm of the gradient vector. If it exceeds the threshold *c*, the entire gradient is scaled by *c* / ||**g**||, which reduces its magnitude to *c* while preserving its direction. This is generally preferred over clip-by-value because it maintains the relative importance of different gradient components.
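
A minimal sketch (the helper name is ours):

```python
import numpy as np

def clip_by_norm(g, c):
    norm = np.linalg.norm(g)
    if norm > c:
        g = g * (c / norm)     # magnitude becomes c, direction unchanged
    return g

g = np.array([3.0, 4.0])       # norm 5
print(clip_by_norm(g, 1.0))    # [0.6 0.8]: norm 1, same direction
```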

Question 21: Taylor Series and Gradient Descent

Gradient descent can be understood through the lens of Taylor series because:

  • (a) It uses the second-order approximation to find the exact minimum
  • (b) It uses the first-order (linear) approximation to determine the descent direction
  • (c) It requires computing all terms of the Taylor series
  • (d) It only works for polynomial functions
Show Answer **(b) It uses the first-order (linear) approximation to determine the descent direction** Gradient descent approximates the function locally by its first-order Taylor expansion: *f*(**x** + **delta**) approximately equals *f*(**x**) + nabla *f*(**x**)^T **delta**. To decrease *f*, we choose **delta** to be proportional to -nabla *f*(**x**). This is why gradient descent is called a "first-order method." Second-order methods like Newton's method use the quadratic Taylor expansion, which includes the Hessian.

Question 22: Dual Numbers

In forward-mode autodiff using dual numbers, the dual number a + b epsilon (where epsilon^2 = 0) encodes:

  • (a) A complex number
  • (b) A function value (a) and its derivative (b)
  • (c) Two independent function values
  • (d) A value and its second derivative
Show Answer **(b) A function value (*a*) and its derivative (*b*)** Dual numbers carry both the value and derivative through a computation simultaneously. When we evaluate *f*(*x* + *epsilon*), the result is *f*(*x*) + *f*'(*x*) *epsilon*. The real part gives the function value, and the *epsilon* coefficient gives the derivative. This is because all terms with *epsilon*^2 and higher powers vanish.
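
A minimal dual-number sketch (supporting only addition and multiplication; the class is ours, for illustration):

```python
# Dual numbers: a + b*eps with eps**2 = 0 carries (value, derivative)
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b          # value and derivative coefficient

    def __add__(self, o):
        return Dual(self.a + o.a, self.b + o.b)

    def __mul__(self, o):
        # (a1 + b1 eps)(a2 + b2 eps) = a1 a2 + (a1 b2 + b1 a2) eps
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)

x = Dual(3.0, 1.0)                     # seed derivative dx/dx = 1
y = x * x + x                          # f(x) = x^2 + x
print(y.a, y.b)                        # 12.0 7.0 -> f(3) = 12, f'(3) = 7
```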

Question 23: Condition Number

For gradient descent on a quadratic function, the convergence rate depends primarily on:

  • (a) The dimension of the input
  • (b) The condition number of the Hessian (ratio of largest to smallest eigenvalue)
  • (c) The absolute value of the gradients
  • (d) The initial point
Show Answer **(b) The condition number of the Hessian (ratio of largest to smallest eigenvalue)** The condition number *kappa* = *lambda*_max / *lambda*_min determines how "elongated" the loss surface is. A large condition number means the surface is like a narrow valley, causing gradient descent to oscillate. The convergence rate is approximately ((*kappa* - 1)/(*kappa* + 1))^*t*, so larger condition numbers mean slower convergence. This motivates the use of preconditioners and adaptive methods.
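
A back-of-envelope sketch of what this rate means in practice (the target precision is an arbitrary choice):

```python
import math

# Iterations for the error to shrink by 1e6 at contraction rate (k-1)/(k+1)
for kappa in [2, 10, 100, 1000]:
    rate = (kappa - 1) / (kappa + 1)
    t = math.ceil(math.log(1e-6) / math.log(rate))
    print(f"kappa={kappa:5d}  rate={rate:.4f}  iterations~{t}")
```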

Question 24: AdamW vs. Adam

AdamW differs from Adam in that:

  • (a) It uses a different momentum coefficient
  • (b) It decouples weight decay from the adaptive learning rate
  • (c) It removes the bias correction
  • (d) It uses a constant learning rate
Show Answer **(b) It decouples weight decay from the adaptive learning rate** In standard Adam with L2 regularization, the regularization gradient is scaled by the adaptive learning rate, which reduces its effect for parameters with large gradient magnitudes. AdamW applies weight decay directly to the parameters, independent of the adaptive learning rate. This decoupling leads to better generalization and is the standard for training transformers and other large models.
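
A sketch of the decoupled update (names and defaults are illustrative; with ordinary L2 regularization the `wd * w` term would instead be added to the gradient *before* the moment updates):

```python
import numpy as np

def adamw_step(w, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # weight decay bypasses the adaptive scaling and acts on w directly
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```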

Question 25: Backpropagation Memory

The primary memory cost of reverse-mode automatic differentiation (backpropagation) compared to forward-mode is:

  • (a) Storing the model parameters
  • (b) Storing all intermediate values from the forward pass needed for the backward pass
  • (c) Storing the gradient vector
  • (d) Storing the Hessian matrix
Show Answer **(b) Storing all intermediate values from the forward pass needed for the backward pass** Reverse-mode autodiff must store the values computed during the forward pass because they are needed to compute the local derivatives during the backward pass. For a deep network, this means storing activations at every layer. Gradient checkpointing is a technique that trades memory for computation: it recomputes some intermediate values during the backward pass instead of storing them all, reducing memory from O(*L*) to O(sqrt(*L*)) for *L* layers.