Chapter 34: Exercises -- MLOps and LLMOps

Section 1: Experiment Tracking

Exercise 34.1: W&B Integration

Set up a Weights & Biases experiment tracker for a simple PyTorch classification model. Log hyperparameters, training loss, validation accuracy, and learning rate at each epoch. Compare at least 3 different hyperparameter configurations in the W&B dashboard.
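
A minimal starting point might look like the sketch below; the project name, toy model, and random data are placeholders, and it assumes you are already logged in to W&B. Repeating it with three different config dicts gives runs you can compare side by side in the dashboard.

    import torch
    import wandb
    from torch import nn

    config = {"lr": 1e-3, "batch_size": 32, "epochs": 5, "hidden": 64}
    run = wandb.init(project="ch34-classification", config=config)  # hypothetical project name

    model = nn.Sequential(nn.Linear(20, config["hidden"]), nn.ReLU(),
                          nn.Linear(config["hidden"], 3))
    opt = torch.optim.Adam(model.parameters(), lr=config["lr"])
    X, y = torch.randn(512, 20), torch.randint(0, 3, (512,))        # stand-in dataset

    for epoch in range(config["epochs"]):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
        val_acc = (model(X).argmax(dim=1) == y).float().mean().item()  # toy "validation" accuracy
        wandb.log({"epoch": epoch, "train/loss": loss.item(),
                   "val/accuracy": val_acc, "lr": opt.param_groups[0]["lr"]})

    run.finish()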

Exercise 34.2: Custom Metrics Logging

Extend the experiment tracker from Exercise 34.1 to log custom metrics: per-class F1 scores, confusion matrices as images, and model gradient norms. Create a W&B Table that tracks predictions vs ground truth for a validation subset.

Exercise 34.3: Hyperparameter Sweeps

Configure a W&B sweep using Bayesian optimization to search over learning rate (1e-5 to 1e-2, log scale), batch size (16, 32, 64), and dropout rate (0.1 to 0.5). Run at least 20 trials and identify the Pareto-optimal configurations for accuracy vs training time.
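
One way to express this sweep as a Python config is sketched below; the project name and the body of train() are placeholders, and train() is expected to read its hyperparameters from wandb.config and log "val/accuracy".

    import wandb

    sweep_config = {
        "method": "bayes",
        "metric": {"name": "val/accuracy", "goal": "maximize"},
        "parameters": {
            "lr": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-2},
            "batch_size": {"values": [16, 32, 64]},
            "dropout": {"distribution": "uniform", "min": 0.1, "max": 0.5},
        },
    }

    def train():
        run = wandb.init()          # hyperparameters arrive via run.config
        cfg = run.config
        # ... build the model with cfg.lr, cfg.batch_size, cfg.dropout,
        # train it, and wandb.log({"val/accuracy": ...}) each epoch ...
        run.finish()

    sweep_id = wandb.sweep(sweep_config, project="ch34-sweeps")   # hypothetical project
    wandb.agent(sweep_id, function=train, count=20)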

Exercise 34.4: Experiment Reproducibility

Implement a complete reproducibility checkpoint that captures: random seeds, library versions, git commit hash, data checksums, and hyperparameters. Verify that restoring from this checkpoint produces identical training results.
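
A snapshot along the lines of the sketch below covers the listed items; the file paths and the package list are illustrative, and the script assumes it runs inside a git repository.

    import hashlib, json, platform, random, subprocess
    import numpy as np
    import torch

    def data_checksum(path: str) -> str:
        """SHA-256 of a data file, for detecting silent data changes."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def capture_snapshot(seed: int, hparams: dict, data_files: list[str]) -> dict:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        return {
            "seed": seed,
            "hyperparameters": hparams,
            "python": platform.python_version(),
            "torch": torch.__version__,
            "numpy": np.__version__,
            "git_commit": subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True).strip(),
            "data_checksums": {p: data_checksum(p) for p in data_files},
        }

    snapshot = capture_snapshot(seed=42, hparams={"lr": 1e-3}, data_files=["train.csv"])
    with open("repro_snapshot.json", "w") as f:
        json.dump(snapshot, f, indent=2)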

Exercise 34.5: MLflow Comparison

Implement the same experiment tracking workflow from Exercise 34.1 using MLflow instead of W&B. Compare the two tools across: ease of setup, logging API, visualization, model registry integration, and self-hosting capability.
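
For reference, the core of the W&B loop from Exercise 34.1 maps onto the MLflow tracking API roughly as follows (the metric values here are stand-ins for real training output):

    import mlflow

    params = {"lr": 1e-3, "batch_size": 32, "epochs": 5}

    with mlflow.start_run(run_name="baseline"):
        mlflow.log_params(params)
        for epoch in range(params["epochs"]):
            train_loss, val_acc = 0.5 / (epoch + 1), 0.7 + 0.05 * epoch   # stand-in values
            mlflow.log_metrics({"train_loss": train_loss,
                                "val_accuracy": val_acc}, step=epoch)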

Section 2: Data Versioning and Validation

Exercise 34.6: DVC Pipeline

Create a DVC pipeline with three stages: (1) data preprocessing, (2) feature engineering, (3) model training. Version the input data and intermediate artifacts. Demonstrate rolling back to a previous data version.

Exercise 34.7: Data Validation with Great Expectations

Define a suite of data validation expectations for a tabular dataset: column types, value ranges, null percentages, distribution statistics, and referential integrity. Run the validation suite and generate a report.

Exercise 34.8: Schema Drift Detection

Implement a schema drift detector that compares the current data batch against a reference schema. Detect: new columns, removed columns, type changes, and distribution shifts (using KL divergence or PSI). Trigger alerts when drift exceeds configurable thresholds.
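
The column-level half of the detector can be as simple as the sketch below (the report format and alerting thresholds are up to you); the distribution-shift half can reuse a PSI function like the one sketched under Exercise 34.17.

    import pandas as pd

    def schema_drift(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
        ref_cols, cur_cols = set(reference.columns), set(current.columns)
        report = {
            "new_columns": sorted(cur_cols - ref_cols),
            "removed_columns": sorted(ref_cols - cur_cols),
            "type_changes": {
                c: (str(reference[c].dtype), str(current[c].dtype))
                for c in ref_cols & cur_cols
                if reference[c].dtype != current[c].dtype
            },
        }
        report["drift_detected"] = any(report.values())
        return report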

Exercise 34.9: Data Lineage Tracking

Build a simple data lineage tracker that records which raw datasets, transformations, and feature engineering steps produced each training dataset. Implement forward and backward lineage queries.
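
A minimal in-memory version treats lineage as a directed graph and answers both query directions with the same traversal; the artifact names in the usage example are made up.

    from collections import defaultdict

    class LineageTracker:
        def __init__(self):
            self.parents = defaultdict(set)    # artifact -> direct upstream artifacts
            self.children = defaultdict(set)   # artifact -> direct downstream artifacts

        def record(self, output: str, inputs: list[str], transform: str):
            for inp in inputs:
                self.parents[output].add(inp)
                self.children[inp].add(output)

        def _walk(self, start, edges):
            seen, stack = set(), [start]
            while stack:
                for nxt in edges[stack.pop()]:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            return seen

        def backward(self, artifact):   # everything the artifact was built from
            return self._walk(artifact, self.parents)

        def forward(self, artifact):    # everything built from the artifact
            return self._walk(artifact, self.children)

    tracker = LineageTracker()
    tracker.record("features_v1.parquet", ["raw_events.csv"], transform="clean+aggregate")
    tracker.record("train_v1.parquet", ["features_v1.parquet", "labels.csv"], transform="join")
    print(tracker.backward("train_v1.parquet"))  # raw_events.csv, features_v1.parquet, labels.csv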

Exercise 34.10: Feature Store Design

Design and implement a minimal feature store that supports: feature registration with metadata, point-in-time retrieval, feature versioning, and feature statistics computation. Test with at least 5 features across 2 entity types.
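
For the point-in-time retrieval requirement, pandas.merge_asof captures the core semantics: for each label timestamp, join the latest feature value known at or before that time, which prevents feature leakage. A toy example with made-up entities and features:

    import pandas as pd

    features = pd.DataFrame({
        "user_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-05"]),
        "purchases_7d": [2, 5, 1],
    }).sort_values("event_time")

    labels = pd.DataFrame({
        "user_id": [1, 2],
        "label_time": pd.to_datetime(["2024-01-08", "2024-01-09"]),
        "churned": [0, 1],
    }).sort_values("label_time")

    training_set = pd.merge_asof(
        labels, features,
        left_on="label_time", right_on="event_time",
        by="user_id", direction="backward",
    )
    print(training_set)   # user 1 gets purchases_7d=2, the value known on 2024-01-08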

Section 3: Model Deployment and Serving

Exercise 34.11: Model Packaging

Package a PyTorch model for deployment using TorchScript (tracing and scripting). Compare inference latency and output equivalence between the original model and the TorchScript versions.
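
The two export paths and the equivalence check might look like this, with TinyNet standing in for your trained model:

    import torch
    from torch import nn

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(20, 3)
        def forward(self, x):
            return torch.softmax(self.fc(x), dim=-1)

    model = TinyNet().eval()
    example = torch.randn(1, 20)

    traced = torch.jit.trace(model, example)   # records the ops run on this example input
    scripted = torch.jit.script(model)         # compiles the source (handles control flow)

    with torch.no_grad():
        assert torch.allclose(model(example), traced(example), atol=1e-6)
        assert torch.allclose(model(example), scripted(example), atol=1e-6)

    traced.save("tinynet_traced.pt")           # deployable without the Python class definition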

Exercise 34.12: REST API Serving

Build a FastAPI service that serves a PyTorch model with: input validation using Pydantic, batch prediction endpoint, health check endpoint, and request/response logging. Measure p50/p95/p99 latency under load.
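
A skeleton of the service is sketched below, with a stand-in model and illustrative endpoint paths; serve it with uvicorn and drive the latency measurements with a load-testing tool of your choice.

    import logging, time
    import torch
    from fastapi import FastAPI
    from pydantic import BaseModel

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("serving")
    app = FastAPI()
    model = torch.nn.Linear(4, 2).eval()          # stand-in for the real model

    class PredictRequest(BaseModel):
        instances: list[list[float]]              # batch of feature vectors

    class PredictResponse(BaseModel):
        predictions: list[int]
        latency_ms: float

    @app.get("/health")
    def health():
        return {"status": "ok"}

    @app.post("/predict", response_model=PredictResponse)
    def predict(req: PredictRequest):
        start = time.perf_counter()
        with torch.no_grad():
            preds = model(torch.tensor(req.instances)).argmax(dim=1).tolist()
        latency_ms = (time.perf_counter() - start) * 1000
        logger.info("batch_size=%d latency_ms=%.2f", len(req.instances), latency_ms)
        return PredictResponse(predictions=preds, latency_ms=latency_ms)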

Exercise 34.13: A/B Testing Framework

Implement an A/B testing framework that routes traffic between two model versions based on configurable percentages. Collect metrics for each variant and compute statistical significance using a chi-squared test.
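
The statistical part reduces to a 2x2 contingency table of successes and failures per variant, and the routing part only needs a stable hash so a user always lands in the same bucket. The counts below are invented:

    import hashlib
    from scipy.stats import chi2_contingency

    def route(user_id: str, treatment_fraction: float = 0.5) -> str:
        """Deterministic assignment: the same user always sees the same variant."""
        bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        return "B" if bucket < treatment_fraction * 100 else "A"

    # rows: model A, model B; columns: successes, failures (invented counts)
    table = [[420, 580],
             [465, 535]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, p={p_value:.4f}")
    if p_value < 0.05:
        print("difference between variants is statistically significant")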

Exercise 34.14: Canary Deployment

Implement a canary deployment system that gradually shifts traffic from the old model to the new model (1%, 5%, 25%, 50%, 100%) while monitoring error rates. Automatically roll back if the error rate exceeds a threshold.
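
The promotion loop itself is only a few lines; the real work is in where observed_error_rate() gets its numbers, which is simulated here:

    import random

    STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]
    ERROR_THRESHOLD = 0.02        # illustrative: abort if more than 2% of canary requests fail

    def observed_error_rate(fraction: float) -> float:
        """Placeholder for real monitoring; simulates a healthy canary."""
        return random.uniform(0.0, 0.01)

    def run_canary() -> bool:
        for fraction in STAGES:
            err = observed_error_rate(fraction)
            print(f"canary at {fraction:.0%}: error rate {err:.3f}")
            if err > ERROR_THRESHOLD:
                print("error budget exceeded, rolling back to the old model")
                return False
        print("canary healthy, new model fully promoted")
        return True

    run_canary()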

Exercise 34.15: Model Compression

Apply three compression techniques to a trained model: (a) post-training quantization (INT8), (b) knowledge distillation to a smaller student, (c) pruning 30% of weights. Compare accuracy, latency, and model size for each.
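
Part (a) is the quickest to try with dynamic INT8 quantization of the Linear layers; the model below is a stand-in, and newer PyTorch releases also expose the helper under torch.ao.quantization. Parts (b) and (c) follow the same measure-and-compare pattern, with torch.nn.utils.prune covering the pruning step.

    import os, torch
    from torch import nn

    model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    def size_mb(m: nn.Module) -> float:
        torch.save(m.state_dict(), "_tmp.pt")
        mb = os.path.getsize("_tmp.pt") / 1e6
        os.remove("_tmp.pt")
        return mb

    x = torch.randn(32, 256)
    print("outputs close:", torch.allclose(model(x), quantized(x), atol=1e-1))
    print(f"fp32 size: {size_mb(model):.2f} MB, int8 size: {size_mb(quantized):.2f} MB")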

Section 4: Monitoring and Observability

Exercise 34.16: Prediction Monitoring Dashboard

Build a monitoring system that tracks: prediction volume, latency distribution, error rates, feature distributions, and prediction distribution. Implement time-series tracking with rolling windows.

Exercise 34.17: Data Drift Detection

Implement a data drift detector using Population Stability Index (PSI) for numerical features and chi-squared test for categorical features. Test on synthetic data with injected drift at various magnitudes.
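
A self-contained PSI implementation for the numerical case is sketched below, with a synthetic mean shift injected as drift; the common convention of treating PSI above 0.25 as major drift is a rule of thumb, not a hard threshold.

    import numpy as np

    def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
        """Population Stability Index; bin edges come from the reference distribution."""
        edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]   # interior cut points
        ref_pct = np.bincount(np.searchsorted(edges, reference), minlength=bins) / len(reference)
        cur_pct = np.bincount(np.searchsorted(edges, current), minlength=bins) / len(current)
        ref_pct = np.clip(ref_pct, 1e-6, None)    # avoid log(0) and division by zero
        cur_pct = np.clip(cur_pct, 1e-6, None)
        return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

    rng = np.random.default_rng(0)
    baseline = rng.normal(0, 1, 10_000)
    shifted = rng.normal(0.5, 1, 10_000)          # injected mean shift
    print(f"PSI (no drift):  {psi(baseline, rng.normal(0, 1, 10_000)):.3f}")
    print(f"PSI (mean +0.5): {psi(baseline, shifted):.3f}")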

Exercise 34.18: Concept Drift Detection

Implement the ADWIN (Adaptive Windowing) algorithm for detecting concept drift in a streaming prediction setting. Compare its detection latency with a fixed-window approach on synthetic data with abrupt and gradual drift.

Exercise 34.19: Alerting Pipeline

Design an alerting pipeline with three severity levels: (a) INFO for minor metric changes, (b) WARNING for significant drift, (c) CRITICAL for model failures. Implement configurable thresholds and notification routing.

Exercise 34.20: Model Performance Attribution

When model accuracy drops, implement a diagnostic pipeline that identifies whether the cause is: (a) data quality issues, (b) feature drift, (c) label shift, or (d) model staleness. Test on synthetic scenarios for each cause.

Section 5: LLMOps

Exercise 34.21: Prompt Version Control

Build a prompt management system that versions prompts with metadata (model target, task type, author). Implement prompt A/B testing and track performance metrics per prompt version.

Exercise 34.22: LLM Evaluation Pipeline

Create an evaluation pipeline for an LLM application that measures: answer correctness, faithfulness, relevance, harmlessness, and latency. Use LLM-as-judge for automated evaluation and compare with human annotations.

Exercise 34.23: Cost Monitoring

Implement a cost tracking system for LLM API calls that monitors: tokens per request, cost per query, daily/weekly budgets, and per-user usage. Alert when spending approaches budget limits.
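
A toy in-memory tracker shows the shape of the solution; the per-token prices and the budget below are placeholders, not real provider rates.

    from collections import defaultdict

    PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}   # assumed prices, USD per 1K tokens
    DAILY_BUDGET_USD = 50.0

    class CostTracker:
        def __init__(self):
            self.spend_by_user = defaultdict(float)
            self.total = 0.0

        def record(self, user: str, input_tokens: int, output_tokens: int) -> float:
            cost = (input_tokens / 1000) * PRICE_PER_1K["input"] \
                 + (output_tokens / 1000) * PRICE_PER_1K["output"]
            self.spend_by_user[user] += cost
            self.total += cost
            if self.total > 0.8 * DAILY_BUDGET_USD:
                print(f"WARNING: {self.total:.2f} USD spent, approaching the daily budget")
            return cost

    tracker = CostTracker()
    tracker.record("alice", input_tokens=1200, output_tokens=300)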

Exercise 34.24: Guardrails Implementation

Implement input and output guardrails for an LLM application: (a) topic filtering (reject off-topic queries), (b) PII detection and redaction, (c) output safety filtering, (d) factuality checking against a knowledge base.
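
Part (b) can start from regex-based detection and redaction; the patterns below are illustrative and deliberately not exhaustive.

    import re

    PII_PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s-]?)?(?:\(?\d{3}\)?[\s-]?)\d{3}[\s-]?\d{4}\b"),
        "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def redact_pii(text: str) -> str:
        for label, pattern in PII_PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    print(redact_pii("Contact me at jane.doe@example.com or 555-123-4567."))
    # -> "Contact me at [EMAIL] or [PHONE]."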

Exercise 34.25: RAG Monitoring

Build a monitoring system for a RAG pipeline that tracks: retrieval latency, context relevance scores, generation faithfulness, answer completeness, and end-to-end latency. Implement dashboards and drift alerts.

Section 6: CI/CD for ML

Exercise 34.26: ML Testing Pipeline

Implement a testing pipeline for an ML model with: (a) unit tests for data transformations, (b) integration tests for the training pipeline, (c) model quality gates (minimum accuracy thresholds), (d) fairness tests across demographic groups.

Exercise 34.27: Automated Retraining

Build an automated retraining trigger system that initiates retraining when: (a) scheduled (weekly), (b) data drift exceeds threshold, (c) performance drops below minimum, or (d) new labeled data exceeds a count threshold.
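
The trigger logic itself is simple once its inputs (drift score, live accuracy, new-label count) are available; the thresholds below are placeholders.

    from datetime import datetime, timedelta

    def should_retrain(last_trained: datetime,
                       drift_score: float,
                       current_accuracy: float,
                       new_labels: int,
                       *,
                       max_age=timedelta(days=7),
                       drift_threshold=0.25,
                       min_accuracy=0.85,
                       label_threshold=10_000) -> list[str]:
        """Return the list of trigger reasons (empty means no retraining needed)."""
        reasons = []
        if datetime.now() - last_trained > max_age:
            reasons.append("schedule")
        if drift_score > drift_threshold:
            reasons.append("data_drift")
        if current_accuracy < min_accuracy:
            reasons.append("performance_drop")
        if new_labels > label_threshold:
            reasons.append("new_labels")
        return reasons

    print(should_retrain(datetime.now() - timedelta(days=10),
                         drift_score=0.1, current_accuracy=0.9, new_labels=2_000))
    # -> ['schedule']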

Exercise 34.28: Model Registry

Implement a model registry that supports: model versioning, stage transitions (staging, production, archived), metadata tracking, rollback capability, and access control policies.

Exercise 34.29: Infrastructure as Code

Define the infrastructure for an ML serving system using configuration files: compute resources, autoscaling policies, networking, logging, and monitoring. Demonstrate scaling up and down based on request volume.

Exercise 34.30: End-to-End ML Pipeline

Build a complete ML pipeline that integrates all components: data validation, experiment tracking, model training, evaluation gates, model registry, deployment, monitoring, and automated retraining triggers.