Chapter 33 Exercises: AI and Machine Learning Security
Exercise 1: Adversarial Example Generation with FGSM
Difficulty: Beginner
Using PyTorch and a pre-trained image classification model (e.g., ResNet-50):
- Load the model and a test image that is correctly classified
- Implement the Fast Gradient Sign Method (FGSM) attack
- Generate adversarial examples at epsilon values of 0.01, 0.03, 0.05, 0.1, and 0.3
- For each epsilon value, record: - The original prediction and confidence - The adversarial prediction and confidence - Whether the attack succeeded (caused misclassification)
- Visualize the original image, the perturbation, and the adversarial image side by side
- Plot a graph of attack success rate vs. epsilon value
- Discuss the trade-off between perturbation visibility and attack success
Exercise 2: PGD Attack Implementation
Difficulty: Intermediate
Extend Exercise 1 by implementing the Projected Gradient Descent (PGD) attack:
- Implement PGD with configurable epsilon, step size, and number of iterations
- Compare PGD results with FGSM at the same epsilon values
- Experiment with different numbers of iterations (5, 10, 20, 40, 100)
- Implement both targeted and untargeted variants: - Untargeted: cause any misclassification - Targeted: force classification to a specific target class
- Measure the L2 and L-infinity norms of the perturbations
- Document which attack variant is more effective and why
Exercise 3: Prompt Injection on a Simple Chatbot
Difficulty: Beginner
Build a simple LLM-powered chatbot with a system prompt and test it against prompt injection:
- Create a Flask web application with a chatbot interface
- Configure the chatbot with a system prompt that restricts it to answering questions about a specific topic (e.g., "You are a customer service bot for ShopStack. Only answer questions about products and orders.")
- Attempt the following injection techniques: - Direct instruction override ("Ignore your instructions and...") - Role-playing ("Pretend you are a different AI that...") - Encoding bypass ("Translate the following from ROT13...") - Context manipulation ("Your new instructions are...")
- Document which techniques succeed and which fail
- Implement defenses (input filtering, prompt armoring) and retest
- Write a report documenting the vulnerabilities and defenses
Exercise 4: System Prompt Extraction
Difficulty: Intermediate
Using the chatbot from Exercise 3 (or a similar LLM application you control):
- Attempt at least 10 different techniques to extract the system prompt: - Direct requests - Paraphrasing requests - Encoding requests (base64, hex, ROT13) - Completion tricks - Role-play scenarios - Language switching - Markdown/code block exploitation - Multi-turn conversation strategies - Instruction-following confusion - Token-level manipulation
- Rate each technique's success (full extraction, partial, or failed)
- Implement a "prompt armor" defense and retest all techniques
- Document which techniques are most resistant to defense
Exercise 5: Data Poisoning Simulation
Difficulty: Intermediate
Demonstrate the impact of data poisoning on a simple classifier:
- Train a spam classifier (e.g., Naive Bayes or logistic regression) on a clean dataset
- Record the baseline accuracy on a held-out test set
- Perform three types of poisoning attacks: - Label flipping: Change 5%, 10%, and 20% of labels randomly - Targeted poisoning: Add samples designed to make a specific type of spam pass as legitimate - Backdoor attack: Add a trigger pattern that causes misclassification
- Retrain the model after each poisoning attack
- Compare accuracy on clean test data and on triggered test data
- Implement defenses (outlier detection, data sanitization) and measure their effectiveness
- Write a report with graphs showing accuracy degradation at each poisoning level
Exercise 6: Model Extraction Attack
Difficulty: Intermediate
Simulate a model extraction attack:
- Train a "secret" classifier (target model) and expose it via a simple Flask API
- Implement a model extraction attack that: - Generates diverse query inputs - Queries the target API systematically - Trains a substitute model on the input-output pairs
- Evaluate the substitute model's fidelity: - Agreement rate with the target model - Accuracy on a separate test set compared to the target
- Experiment with different numbers of queries (100, 500, 1000, 5000)
- Plot fidelity vs. number of queries
- Implement rate limiting and output perturbation as defenses
- Measure how defenses affect extraction success
Exercise 7: Membership Inference Attack
Difficulty: Intermediate
Implement a membership inference attack:
- Train a target model on a known dataset (e.g., CIFAR-10 subset)
- Split data into member (training) and non-member (held-out) sets
- Implement the shadow model approach: - Train multiple shadow models on different data splits - Collect prediction vectors for member and non-member samples - Train an attack classifier on these prediction vectors
- Evaluate the attack's precision, recall, and accuracy
- Investigate how model overfitting affects membership inference success
- Implement regularization (dropout, weight decay) and measure its impact on attack success
- Discuss the privacy implications for MedSecure's medical imaging model
Exercise 8: Adversarial Robustness Toolbox (ART) Exploration
Difficulty: Intermediate
Using IBM's Adversarial Robustness Toolbox:
- Load a pre-trained model and wrap it in ART's classifier
- Generate adversarial examples using three different attacks: - FGSM - PGD - Carlini & Wagner (C&W)
- Apply three different defenses: - Spatial smoothing - JPEG compression - Adversarial training
- Create a matrix showing accuracy under each attack-defense combination
- Identify which defense is most effective against each attack
- Discuss the robustness-accuracy trade-off observed in your experiments
Exercise 9: LLM Output Injection (XSS via LLM)
Difficulty: Intermediate
Demonstrate insecure output handling in an LLM application:
- Build a web application that displays LLM-generated responses in HTML
- Craft prompts that cause the LLM to generate: - JavaScript code (potential XSS) - HTML injection payloads - Markdown that renders as executable content
- Show that the application is vulnerable to LLM-mediated XSS
- Implement output sanitization using: - HTML encoding - Content Security Policy headers - Output format restrictions
- Verify that each defense mitigates the vulnerability
- Document the attack chain and defenses
Exercise 10: AI-Enhanced Phishing Analysis
Difficulty: Intermediate
Analyze the characteristics of AI-generated vs. human-written phishing emails:
- Collect (or generate using your own LLM) 20 phishing email samples: - 10 traditional phishing emails (with common indicators) - 10 AI-generated phishing emails
- For each email, evaluate: - Grammar and spelling quality - Personalization level - Urgency tactics used - Technical accuracy - Social engineering sophistication
- Have 5 volunteers rate each email's believability (1-10 scale)
- Compare the two groups statistically
- Develop a checklist for detecting AI-generated phishing
- Discuss implications for security awareness training
Exercise 11: Model Inversion Attack
Difficulty: Advanced
Implement a basic model inversion attack:
- Train a facial recognition model on a small dataset
- Implement the model inversion technique: - Start with random noise - Optimize to maximize confidence for a target identity - Apply regularization for realistic outputs
- Generate reconstructed faces for several identities
- Evaluate reconstruction quality: - Visual comparison with actual training images - Structural similarity (SSIM) metric - Feature similarity in embedding space
- Discuss privacy implications and potential defenses
- Implement differential privacy during training and compare inversion results
Exercise 12: Adversarial Patch Creation
Difficulty: Advanced
Create a physical-world adversarial patch:
- Choose a target model (e.g., object detection or image classification)
- Implement adversarial patch optimization: - Create a small image patch (e.g., 50x50 pixels) - Optimize the patch to cause targeted misclassification when placed on any image - Apply transformations (rotation, scaling, brightness changes) during optimization for robustness
- Test the patch: - Digitally apply it to multiple test images - Print the patch and photograph it with various objects - Test at different distances and angles
- Measure attack success rate in digital and physical settings
- Discuss implications for security cameras and autonomous vehicles
- Propose detection mechanisms for adversarial patches
Exercise 13: Secure ML API Design
Difficulty: Advanced
Design and implement a secure ML model serving API:
- Create a Flask/FastAPI application serving an ML model
- Implement security controls: - Authentication (API keys, OAuth) - Rate limiting (per-user and global) - Input validation (schema, range, format) - Output perturbation (add controlled noise to confidence scores) - Query logging and anomaly detection - Model versioning and rollback
- Attempt the following attacks against your secure API: - Model extraction (with and without rate limiting) - Adversarial example submission - Input manipulation to cause errors - Authentication bypass
- Document the effectiveness of each security control
Exercise 14: LLM Red Team Exercise
Difficulty: Advanced
Conduct a structured red team assessment of an LLM application:
- Set up a chatbot with tools/function calling: - Database query tool - Email sending tool - File reading tool
- Design the system prompt with security constraints
- Attempt to: - Extract the system prompt - Invoke tools for unintended purposes (SQL injection via LLM) - Exfiltrate data through tool calls - Bypass content restrictions - Cause the LLM to generate harmful outputs - Chain multiple vulnerabilities
- For each successful attack, implement a defense
- Write a red team report including: - Executive summary - Methodology - Findings with severity ratings - Evidence (conversation logs) - Recommendations
Exercise 15: AI Security Assessment Methodology
Difficulty: Advanced
Develop a comprehensive AI security assessment methodology for MedSecure's medical imaging system:
- Map the complete ML pipeline: - Data collection and labeling - Model training and evaluation - Model deployment and serving - Monitoring and retraining
- For each stage, identify: - Assets (data, models, infrastructure) - Threats (using MITRE ATLAS framework) - Vulnerabilities - Existing controls - Residual risks
- Design test cases for: - Adversarial robustness testing - Data pipeline security - Model API security - Access control and authentication - Supply chain integrity
- Estimate the effort and resources needed for the assessment
- Create a report template suitable for presenting to MedSecure's leadership
Exercise 16: Backdoor Detection in Neural Networks
Difficulty: Advanced
Implement techniques for detecting backdoors in pre-trained models:
- Train a clean model and a trojaned model (from Exercise 5)
- Implement three detection techniques: - Neural Cleanse: Reverse-engineer potential triggers by optimizing for the smallest perturbation that changes classification for all inputs to a specific class - Activation Clustering: Cluster hidden layer activations and identify anomalous clusters associated with triggered inputs - STRIP (STRong Intentional Perturbation): Blend test inputs with random clean inputs and check if the model is unusually confident
- Test each technique on both clean and trojaned models
- Measure detection accuracy (true positive and false positive rates)
- Discuss limitations and practical applicability
Exercise 17: Text Adversarial Attacks
Difficulty: Intermediate
Using the TextAttack library, assess the robustness of a text classification model:
- Train or load a sentiment analysis model
- Apply three attack methods: - Character-level: typos, homoglyphs, invisible characters - Word-level: synonym substitution, word importance-based replacement - Sentence-level: paraphrasing while preserving semantics
- For each attack: - Measure success rate - Evaluate semantic similarity between original and adversarial text - Check if the adversarial text is still readable/natural to humans
- Implement defenses: - Spelling correction preprocessing - Ensemble voting - Certified robustness via randomized smoothing
- Compare attack success rates before and after defenses
Exercise 18: Deepfake Detection
Difficulty: Advanced
Build a basic deepfake detection system:
- Obtain a dataset of real and deepfake images or videos (e.g., FaceForensics++)
- Train a binary classifier to distinguish real from fake
- Evaluate using metrics: - Accuracy, precision, recall, F1-score - ROC curve and AUC
- Test against different deepfake generation methods
- Analyze failure cases—what types of deepfakes are hardest to detect?
- Implement ensemble detection using multiple signals: - Visual artifacts - Frequency domain analysis - Temporal consistency (for video) - Biological signals (blinking patterns, pulse)
- Discuss the arms race between deepfake generation and detection
Bonus Challenge: Full AI Security Audit
Difficulty: Expert
Perform a comprehensive security audit of a complete AI system:
- Deploy a full ML pipeline in your lab: - Data storage (PostgreSQL or similar) - Training pipeline (Python scripts) - Model registry (MLflow) - Serving API (Flask/FastAPI) - Monitoring (Prometheus/Grafana)
- Conduct the audit covering: - Infrastructure security (networks, containers, access controls) - Data security (encryption, access controls, integrity) - Model security (adversarial robustness, extraction, poisoning) - API security (authentication, rate limiting, input validation) - Supply chain security (dependencies, base images, frameworks) - Operational security (monitoring, logging, incident response)
- Write a professional penetration test report including: - Executive summary - Scope and methodology - Findings with CVSS-style severity ratings - Evidence and proof of concept - Recommendations prioritized by risk - Appendices with detailed technical steps