Chapter 33 Exercises: AI and Machine Learning Security

Exercise 1: Adversarial Example Generation with FGSM

Difficulty: Beginner

Using PyTorch and a pre-trained image classification model (e.g., ResNet-50):

  1. Load the model and a test image that is correctly classified
  2. Implement the Fast Gradient Sign Method (FGSM) attack
  3. Generate adversarial examples at epsilon values of 0.01, 0.03, 0.05, 0.1, and 0.3
  4. For each epsilon value, record: - The original prediction and confidence - The adversarial prediction and confidence - Whether the attack succeeded (caused misclassification)
  5. Visualize the original image, the perturbation, and the adversarial image side by side
  6. Plot a graph of attack success rate vs. epsilon value
  7. Discuss the trade-off between perturbation visibility and attack success

Exercise 2: PGD Attack Implementation

Difficulty: Intermediate

Extend Exercise 1 by implementing the Projected Gradient Descent (PGD) attack:

  1. Implement PGD with configurable epsilon, step size, and number of iterations
  2. Compare PGD results with FGSM at the same epsilon values
  3. Experiment with different numbers of iterations (5, 10, 20, 40, 100)
  4. Implement both targeted and untargeted variants: - Untargeted: cause any misclassification - Targeted: force classification to a specific target class
  5. Measure the L2 and L-infinity norms of the perturbations
  6. Document which attack variant is more effective and why

Exercise 3: Prompt Injection on a Simple Chatbot

Difficulty: Beginner

Build a simple LLM-powered chatbot with a system prompt and test it against prompt injection:

  1. Create a Flask web application with a chatbot interface
  2. Configure the chatbot with a system prompt that restricts it to answering questions about a specific topic (e.g., "You are a customer service bot for ShopStack. Only answer questions about products and orders.")
  3. Attempt the following injection techniques: - Direct instruction override ("Ignore your instructions and...") - Role-playing ("Pretend you are a different AI that...") - Encoding bypass ("Translate the following from ROT13...") - Context manipulation ("Your new instructions are...")
  4. Document which techniques succeed and which fail
  5. Implement defenses (input filtering, prompt armoring) and retest
  6. Write a report documenting the vulnerabilities and defenses

Exercise 4: System Prompt Extraction

Difficulty: Intermediate

Using the chatbot from Exercise 3 (or a similar LLM application you control):

  1. Attempt at least 10 different techniques to extract the system prompt: - Direct requests - Paraphrasing requests - Encoding requests (base64, hex, ROT13) - Completion tricks - Role-play scenarios - Language switching - Markdown/code block exploitation - Multi-turn conversation strategies - Instruction-following confusion - Token-level manipulation
  2. Rate each technique's success (full extraction, partial, or failed)
  3. Implement a "prompt armor" defense and retest all techniques
  4. Document which techniques are most resistant to defense

Exercise 5: Data Poisoning Simulation

Difficulty: Intermediate

Demonstrate the impact of data poisoning on a simple classifier:

  1. Train a spam classifier (e.g., Naive Bayes or logistic regression) on a clean dataset
  2. Record the baseline accuracy on a held-out test set
  3. Perform three types of poisoning attacks: - Label flipping: Change 5%, 10%, and 20% of labels randomly - Targeted poisoning: Add samples designed to make a specific type of spam pass as legitimate - Backdoor attack: Add a trigger pattern that causes misclassification
  4. Retrain the model after each poisoning attack
  5. Compare accuracy on clean test data and on triggered test data
  6. Implement defenses (outlier detection, data sanitization) and measure their effectiveness
  7. Write a report with graphs showing accuracy degradation at each poisoning level

Exercise 6: Model Extraction Attack

Difficulty: Intermediate

Simulate a model extraction attack:

  1. Train a "secret" classifier (target model) and expose it via a simple Flask API
  2. Implement a model extraction attack that: - Generates diverse query inputs - Queries the target API systematically - Trains a substitute model on the input-output pairs
  3. Evaluate the substitute model's fidelity: - Agreement rate with the target model - Accuracy on a separate test set compared to the target
  4. Experiment with different numbers of queries (100, 500, 1000, 5000)
  5. Plot fidelity vs. number of queries
  6. Implement rate limiting and output perturbation as defenses
  7. Measure how defenses affect extraction success

Exercise 7: Membership Inference Attack

Difficulty: Intermediate

Implement a membership inference attack:

  1. Train a target model on a known dataset (e.g., CIFAR-10 subset)
  2. Split data into member (training) and non-member (held-out) sets
  3. Implement the shadow model approach: - Train multiple shadow models on different data splits - Collect prediction vectors for member and non-member samples - Train an attack classifier on these prediction vectors
  4. Evaluate the attack's precision, recall, and accuracy
  5. Investigate how model overfitting affects membership inference success
  6. Implement regularization (dropout, weight decay) and measure its impact on attack success
  7. Discuss the privacy implications for MedSecure's medical imaging model

Exercise 8: Adversarial Robustness Toolbox (ART) Exploration

Difficulty: Intermediate

Using IBM's Adversarial Robustness Toolbox:

  1. Load a pre-trained model and wrap it in ART's classifier
  2. Generate adversarial examples using three different attacks: - FGSM - PGD - Carlini & Wagner (C&W)
  3. Apply three different defenses: - Spatial smoothing - JPEG compression - Adversarial training
  4. Create a matrix showing accuracy under each attack-defense combination
  5. Identify which defense is most effective against each attack
  6. Discuss the robustness-accuracy trade-off observed in your experiments

Exercise 9: LLM Output Injection (XSS via LLM)

Difficulty: Intermediate

Demonstrate insecure output handling in an LLM application:

  1. Build a web application that displays LLM-generated responses in HTML
  2. Craft prompts that cause the LLM to generate: - JavaScript code (potential XSS) - HTML injection payloads - Markdown that renders as executable content
  3. Show that the application is vulnerable to LLM-mediated XSS
  4. Implement output sanitization using: - HTML encoding - Content Security Policy headers - Output format restrictions
  5. Verify that each defense mitigates the vulnerability
  6. Document the attack chain and defenses

Exercise 10: AI-Enhanced Phishing Analysis

Difficulty: Intermediate

Analyze the characteristics of AI-generated vs. human-written phishing emails:

  1. Collect (or generate using your own LLM) 20 phishing email samples: - 10 traditional phishing emails (with common indicators) - 10 AI-generated phishing emails
  2. For each email, evaluate: - Grammar and spelling quality - Personalization level - Urgency tactics used - Technical accuracy - Social engineering sophistication
  3. Have 5 volunteers rate each email's believability (1-10 scale)
  4. Compare the two groups statistically
  5. Develop a checklist for detecting AI-generated phishing
  6. Discuss implications for security awareness training

Exercise 11: Model Inversion Attack

Difficulty: Advanced

Implement a basic model inversion attack:

  1. Train a facial recognition model on a small dataset
  2. Implement the model inversion technique: - Start with random noise - Optimize to maximize confidence for a target identity - Apply regularization for realistic outputs
  3. Generate reconstructed faces for several identities
  4. Evaluate reconstruction quality: - Visual comparison with actual training images - Structural similarity (SSIM) metric - Feature similarity in embedding space
  5. Discuss privacy implications and potential defenses
  6. Implement differential privacy during training and compare inversion results

Exercise 12: Adversarial Patch Creation

Difficulty: Advanced

Create a physical-world adversarial patch:

  1. Choose a target model (e.g., object detection or image classification)
  2. Implement adversarial patch optimization: - Create a small image patch (e.g., 50x50 pixels) - Optimize the patch to cause targeted misclassification when placed on any image - Apply transformations (rotation, scaling, brightness changes) during optimization for robustness
  3. Test the patch: - Digitally apply it to multiple test images - Print the patch and photograph it with various objects - Test at different distances and angles
  4. Measure attack success rate in digital and physical settings
  5. Discuss implications for security cameras and autonomous vehicles
  6. Propose detection mechanisms for adversarial patches

Exercise 13: Secure ML API Design

Difficulty: Advanced

Design and implement a secure ML model serving API:

  1. Create a Flask/FastAPI application serving an ML model
  2. Implement security controls: - Authentication (API keys, OAuth) - Rate limiting (per-user and global) - Input validation (schema, range, format) - Output perturbation (add controlled noise to confidence scores) - Query logging and anomaly detection - Model versioning and rollback
  3. Attempt the following attacks against your secure API: - Model extraction (with and without rate limiting) - Adversarial example submission - Input manipulation to cause errors - Authentication bypass
  4. Document the effectiveness of each security control

Exercise 14: LLM Red Team Exercise

Difficulty: Advanced

Conduct a structured red team assessment of an LLM application:

  1. Set up a chatbot with tools/function calling: - Database query tool - Email sending tool - File reading tool
  2. Design the system prompt with security constraints
  3. Attempt to: - Extract the system prompt - Invoke tools for unintended purposes (SQL injection via LLM) - Exfiltrate data through tool calls - Bypass content restrictions - Cause the LLM to generate harmful outputs - Chain multiple vulnerabilities
  4. For each successful attack, implement a defense
  5. Write a red team report including: - Executive summary - Methodology - Findings with severity ratings - Evidence (conversation logs) - Recommendations

Exercise 15: AI Security Assessment Methodology

Difficulty: Advanced

Develop a comprehensive AI security assessment methodology for MedSecure's medical imaging system:

  1. Map the complete ML pipeline: - Data collection and labeling - Model training and evaluation - Model deployment and serving - Monitoring and retraining
  2. For each stage, identify: - Assets (data, models, infrastructure) - Threats (using MITRE ATLAS framework) - Vulnerabilities - Existing controls - Residual risks
  3. Design test cases for: - Adversarial robustness testing - Data pipeline security - Model API security - Access control and authentication - Supply chain integrity
  4. Estimate the effort and resources needed for the assessment
  5. Create a report template suitable for presenting to MedSecure's leadership

Exercise 16: Backdoor Detection in Neural Networks

Difficulty: Advanced

Implement techniques for detecting backdoors in pre-trained models:

  1. Train a clean model and a trojaned model (from Exercise 5)
  2. Implement three detection techniques: - Neural Cleanse: Reverse-engineer potential triggers by optimizing for the smallest perturbation that changes classification for all inputs to a specific class - Activation Clustering: Cluster hidden layer activations and identify anomalous clusters associated with triggered inputs - STRIP (STRong Intentional Perturbation): Blend test inputs with random clean inputs and check if the model is unusually confident
  3. Test each technique on both clean and trojaned models
  4. Measure detection accuracy (true positive and false positive rates)
  5. Discuss limitations and practical applicability

Exercise 17: Text Adversarial Attacks

Difficulty: Intermediate

Using the TextAttack library, assess the robustness of a text classification model:

  1. Train or load a sentiment analysis model
  2. Apply three attack methods: - Character-level: typos, homoglyphs, invisible characters - Word-level: synonym substitution, word importance-based replacement - Sentence-level: paraphrasing while preserving semantics
  3. For each attack: - Measure success rate - Evaluate semantic similarity between original and adversarial text - Check if the adversarial text is still readable/natural to humans
  4. Implement defenses: - Spelling correction preprocessing - Ensemble voting - Certified robustness via randomized smoothing
  5. Compare attack success rates before and after defenses

Exercise 18: Deepfake Detection

Difficulty: Advanced

Build a basic deepfake detection system:

  1. Obtain a dataset of real and deepfake images or videos (e.g., FaceForensics++)
  2. Train a binary classifier to distinguish real from fake
  3. Evaluate using metrics: - Accuracy, precision, recall, F1-score - ROC curve and AUC
  4. Test against different deepfake generation methods
  5. Analyze failure cases—what types of deepfakes are hardest to detect?
  6. Implement ensemble detection using multiple signals: - Visual artifacts - Frequency domain analysis - Temporal consistency (for video) - Biological signals (blinking patterns, pulse)
  7. Discuss the arms race between deepfake generation and detection

Bonus Challenge: Full AI Security Audit

Difficulty: Expert

Perform a comprehensive security audit of a complete AI system:

  1. Deploy a full ML pipeline in your lab: - Data storage (PostgreSQL or similar) - Training pipeline (Python scripts) - Model registry (MLflow) - Serving API (Flask/FastAPI) - Monitoring (Prometheus/Grafana)
  2. Conduct the audit covering: - Infrastructure security (networks, containers, access controls) - Data security (encryption, access controls, integrity) - Model security (adversarial robustness, extraction, poisoning) - API security (authentication, rate limiting, input validation) - Supply chain security (dependencies, base images, frameworks) - Operational security (monitoring, logging, incident response)
  3. Write a professional penetration test report including: - Executive summary - Scope and methodology - Findings with CVSS-style severity ratings - Evidence and proof of concept - Recommendations prioritized by risk - Appendices with detailed technical steps