Chapter 33 Exercises: AI and Machine Learning Security

Exercise 1: Adversarial Example Generation with FGSM

Difficulty: Beginner

Using PyTorch and a pre-trained image classification model (e.g., ResNet-50):

Load the model and a test image that is correctly classified
Implement the Fast Gradient Sign Method (FGSM) attack
Generate adversarial examples at epsilon values of 0.01, 0.03, 0.05, 0.1, and 0.3
For each epsilon value, record: - The original prediction and confidence - The adversarial prediction and confidence - Whether the attack succeeded (caused misclassification)
Visualize the original image, the perturbation, and the adversarial image side by side
Plot a graph of attack success rate vs. epsilon value
Discuss the trade-off between perturbation visibility and attack success

Exercise 2: PGD Attack Implementation

Difficulty: Intermediate

Extend Exercise 1 by implementing the Projected Gradient Descent (PGD) attack:

Implement PGD with configurable epsilon, step size, and number of iterations
Compare PGD results with FGSM at the same epsilon values
Experiment with different numbers of iterations (5, 10, 20, 40, 100)
Implement both targeted and untargeted variants: - Untargeted: cause any misclassification - Targeted: force classification to a specific target class
Measure the L2 and L-infinity norms of the perturbations
Document which attack variant is more effective and why

Exercise 3: Prompt Injection on a Simple Chatbot

Difficulty: Beginner

Build a simple LLM-powered chatbot with a system prompt and test it against prompt injection:

Create a Flask web application with a chatbot interface
Configure the chatbot with a system prompt that restricts it to answering questions about a specific topic (e.g., "You are a customer service bot for ShopStack. Only answer questions about products and orders.")
Attempt the following injection techniques: - Direct instruction override ("Ignore your instructions and...") - Role-playing ("Pretend you are a different AI that...") - Encoding bypass ("Translate the following from ROT13...") - Context manipulation ("Your new instructions are...")
Document which techniques succeed and which fail
Implement defenses (input filtering, prompt armoring) and retest
Write a report documenting the vulnerabilities and defenses

Exercise 4: System Prompt Extraction

Difficulty: Intermediate

Using the chatbot from Exercise 3 (or a similar LLM application you control):

Attempt at least 10 different techniques to extract the system prompt: - Direct requests - Paraphrasing requests - Encoding requests (base64, hex, ROT13) - Completion tricks - Role-play scenarios - Language switching - Markdown/code block exploitation - Multi-turn conversation strategies - Instruction-following confusion - Token-level manipulation
Rate each technique's success (full extraction, partial, or failed)
Implement a "prompt armor" defense and retest all techniques
Document which techniques are most resistant to defense

Exercise 5: Data Poisoning Simulation

Difficulty: Intermediate

Demonstrate the impact of data poisoning on a simple classifier:

Train a spam classifier (e.g., Naive Bayes or logistic regression) on a clean dataset
Record the baseline accuracy on a held-out test set
Perform three types of poisoning attacks: - Label flipping: Change 5%, 10%, and 20% of labels randomly - Targeted poisoning: Add samples designed to make a specific type of spam pass as legitimate - Backdoor attack: Add a trigger pattern that causes misclassification
Retrain the model after each poisoning attack
Compare accuracy on clean test data and on triggered test data
Implement defenses (outlier detection, data sanitization) and measure their effectiveness
Write a report with graphs showing accuracy degradation at each poisoning level

Exercise 6: Model Extraction Attack

Difficulty: Intermediate

Simulate a model extraction attack:

Train a "secret" classifier (target model) and expose it via a simple Flask API
Implement a model extraction attack that: - Generates diverse query inputs - Queries the target API systematically - Trains a substitute model on the input-output pairs
Evaluate the substitute model's fidelity: - Agreement rate with the target model - Accuracy on a separate test set compared to the target
Experiment with different numbers of queries (100, 500, 1000, 5000)
Plot fidelity vs. number of queries
Implement rate limiting and output perturbation as defenses
Measure how defenses affect extraction success

Exercise 7: Membership Inference Attack

Difficulty: Intermediate

Implement a membership inference attack:

Train a target model on a known dataset (e.g., CIFAR-10 subset)
Split data into member (training) and non-member (held-out) sets
Implement the shadow model approach: - Train multiple shadow models on different data splits - Collect prediction vectors for member and non-member samples - Train an attack classifier on these prediction vectors
Evaluate the attack's precision, recall, and accuracy
Investigate how model overfitting affects membership inference success
Implement regularization (dropout, weight decay) and measure its impact on attack success
Discuss the privacy implications for MedSecure's medical imaging model

Exercise 8: Adversarial Robustness Toolbox (ART) Exploration

Difficulty: Intermediate

Using IBM's Adversarial Robustness Toolbox:

Load a pre-trained model and wrap it in ART's classifier
Generate adversarial examples using three different attacks: - FGSM - PGD - Carlini & Wagner (C&W)
Apply three different defenses: - Spatial smoothing - JPEG compression - Adversarial training
Create a matrix showing accuracy under each attack-defense combination
Identify which defense is most effective against each attack
Discuss the robustness-accuracy trade-off observed in your experiments

Exercise 9: LLM Output Injection (XSS via LLM)

Difficulty: Intermediate

Demonstrate insecure output handling in an LLM application:

Build a web application that displays LLM-generated responses in HTML
Craft prompts that cause the LLM to generate: - JavaScript code (potential XSS) - HTML injection payloads - Markdown that renders as executable content
Show that the application is vulnerable to LLM-mediated XSS
Implement output sanitization using: - HTML encoding - Content Security Policy headers - Output format restrictions
Verify that each defense mitigates the vulnerability
Document the attack chain and defenses

Exercise 10: AI-Enhanced Phishing Analysis

Difficulty: Intermediate

Analyze the characteristics of AI-generated vs. human-written phishing emails:

Collect (or generate using your own LLM) 20 phishing email samples: - 10 traditional phishing emails (with common indicators) - 10 AI-generated phishing emails
For each email, evaluate: - Grammar and spelling quality - Personalization level - Urgency tactics used - Technical accuracy - Social engineering sophistication
Have 5 volunteers rate each email's believability (1-10 scale)
Compare the two groups statistically
Develop a checklist for detecting AI-generated phishing
Discuss implications for security awareness training

Exercise 11: Model Inversion Attack

Difficulty: Advanced

Implement a basic model inversion attack:

Train a facial recognition model on a small dataset
Implement the model inversion technique: - Start with random noise - Optimize to maximize confidence for a target identity - Apply regularization for realistic outputs
Generate reconstructed faces for several identities
Evaluate reconstruction quality: - Visual comparison with actual training images - Structural similarity (SSIM) metric - Feature similarity in embedding space
Discuss privacy implications and potential defenses
Implement differential privacy during training and compare inversion results

Exercise 12: Adversarial Patch Creation

Difficulty: Advanced

Create a physical-world adversarial patch:

Choose a target model (e.g., object detection or image classification)
Implement adversarial patch optimization: - Create a small image patch (e.g., 50x50 pixels) - Optimize the patch to cause targeted misclassification when placed on any image - Apply transformations (rotation, scaling, brightness changes) during optimization for robustness
Test the patch: - Digitally apply it to multiple test images - Print the patch and photograph it with various objects - Test at different distances and angles
Measure attack success rate in digital and physical settings
Discuss implications for security cameras and autonomous vehicles
Propose detection mechanisms for adversarial patches

Exercise 13: Secure ML API Design

Difficulty: Advanced

Design and implement a secure ML model serving API:

Create a Flask/FastAPI application serving an ML model
Implement security controls: - Authentication (API keys, OAuth) - Rate limiting (per-user and global) - Input validation (schema, range, format) - Output perturbation (add controlled noise to confidence scores) - Query logging and anomaly detection - Model versioning and rollback
Attempt the following attacks against your secure API: - Model extraction (with and without rate limiting) - Adversarial example submission - Input manipulation to cause errors - Authentication bypass
Document the effectiveness of each security control

Exercise 14: LLM Red Team Exercise

Difficulty: Advanced

Conduct a structured red team assessment of an LLM application:

Set up a chatbot with tools/function calling: - Database query tool - Email sending tool - File reading tool
Design the system prompt with security constraints
Attempt to: - Extract the system prompt - Invoke tools for unintended purposes (SQL injection via LLM) - Exfiltrate data through tool calls - Bypass content restrictions - Cause the LLM to generate harmful outputs - Chain multiple vulnerabilities
For each successful attack, implement a defense
Write a red team report including: - Executive summary - Methodology - Findings with severity ratings - Evidence (conversation logs) - Recommendations

Exercise 15: AI Security Assessment Methodology

Difficulty: Advanced

Develop a comprehensive AI security assessment methodology for MedSecure's medical imaging system:

Map the complete ML pipeline: - Data collection and labeling - Model training and evaluation - Model deployment and serving - Monitoring and retraining
For each stage, identify: - Assets (data, models, infrastructure) - Threats (using MITRE ATLAS framework) - Vulnerabilities - Existing controls - Residual risks
Design test cases for: - Adversarial robustness testing - Data pipeline security - Model API security - Access control and authentication - Supply chain integrity
Estimate the effort and resources needed for the assessment
Create a report template suitable for presenting to MedSecure's leadership

Exercise 16: Backdoor Detection in Neural Networks

Difficulty: Advanced

Implement techniques for detecting backdoors in pre-trained models:

Train a clean model and a trojaned model (from Exercise 5)
Implement three detection techniques: - Neural Cleanse: Reverse-engineer potential triggers by optimizing for the smallest perturbation that changes classification for all inputs to a specific class - Activation Clustering: Cluster hidden layer activations and identify anomalous clusters associated with triggered inputs - STRIP (STRong Intentional Perturbation): Blend test inputs with random clean inputs and check if the model is unusually confident
Test each technique on both clean and trojaned models
Measure detection accuracy (true positive and false positive rates)
Discuss limitations and practical applicability

Exercise 17: Text Adversarial Attacks

Difficulty: Intermediate

Using the TextAttack library, assess the robustness of a text classification model:

Train or load a sentiment analysis model
Apply three attack methods: - Character-level: typos, homoglyphs, invisible characters - Word-level: synonym substitution, word importance-based replacement - Sentence-level: paraphrasing while preserving semantics
For each attack: - Measure success rate - Evaluate semantic similarity between original and adversarial text - Check if the adversarial text is still readable/natural to humans
Implement defenses: - Spelling correction preprocessing - Ensemble voting - Certified robustness via randomized smoothing
Compare attack success rates before and after defenses

Exercise 18: Deepfake Detection

Difficulty: Advanced

Build a basic deepfake detection system:

Obtain a dataset of real and deepfake images or videos (e.g., FaceForensics++)
Train a binary classifier to distinguish real from fake
Evaluate using metrics: - Accuracy, precision, recall, F1-score - ROC curve and AUC
Test against different deepfake generation methods
Analyze failure cases—what types of deepfakes are hardest to detect?
Implement ensemble detection using multiple signals: - Visual artifacts - Frequency domain analysis - Temporal consistency (for video) - Biological signals (blinking patterns, pulse)
Discuss the arms race between deepfake generation and detection

Bonus Challenge: Full AI Security Audit

Difficulty: Expert

Perform a comprehensive security audit of a complete AI system:

Deploy a full ML pipeline in your lab: - Data storage (PostgreSQL or similar) - Training pipeline (Python scripts) - Model registry (MLflow) - Serving API (Flask/FastAPI) - Monitoring (Prometheus/Grafana)
Conduct the audit covering: - Infrastructure security (networks, containers, access controls) - Data security (encryption, access controls, integrity) - Model security (adversarial robustness, extraction, poisoning) - API security (authentication, rate limiting, input validation) - Supply chain security (dependencies, base images, frameworks) - Operational security (monitoring, logging, incident response)
Write a professional penetration test report including: - Executive summary - Scope and methodology - Findings with CVSS-style severity ratings - Evidence and proof of concept - Recommendations prioritized by risk - Appendices with detailed technical steps