Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM
Project Overview
In this capstone project, you will take a pre-trained open-source large language model and adapt it to a specific domain through data preparation, parameter-efficient fine-tuning, rigorous evaluation, quantization, and production deployment. By the end, you will have a specialized model that demonstrably outperforms the base model on your chosen domain, served through a high-performance inference endpoint.
This project synthesizes concepts from Chapters 10 (Embeddings and Tokenization), 12 (Pre-Training and Transfer Learning), 14 (Language Model Pre-Training), 15 (Text Generation), 16 (Parameter-Efficient Fine-Tuning), 17 (Alignment and RLHF), 28 (Data Engineering), 31 (Evaluation), 33 (Model Compression), and 34 (Model Serving).
Estimated Time: 60-80 hours over 4-6 weeks.
Team Size: 1-3 people.
Domain Selection
Choose one of the following domains (or propose your own with instructor approval):
- Legal: Contract analysis, legal question answering, clause generation. Training data from open legal corpora (e.g., CaseLaw Access Project, EUR-Lex).
- Medical/Clinical: Medical question answering, clinical note summarization. Training data from PubMed abstracts, medical QA datasets (MedQA, PubMedQA).
- Financial: Financial report analysis, earnings call summarization, financial QA. Training data from SEC filings (EDGAR), financial news.
- Scientific: Research paper summarization, methodology extraction, scientific QA. Training data from arXiv abstracts, Semantic Scholar.
- Code/Technical: Code generation, documentation writing, technical QA for a specific framework or language. Training data from GitHub, Stack Overflow.
The domain should have sufficient publicly available text data (at least 10,000 high-quality examples for instruction tuning) and a clear way to evaluate domain-specific performance.
System Architecture
+------------------------------------------------------------------+
| DATA PIPELINE |
| |
| +------------------+ +------------------+ +--------------+ |
| | Raw Data | | Data Processing | | Training | |
| | Collection |--->| & Curation |--->| Dataset | |
| | | | | | (JSONL) | |
| | - Web scraping | | - Cleaning | | | |
| | - API access | | - Deduplication | | - train.jsonl| |
| | - Public datasets| | - Quality filter | | - val.jsonl | |
| | | | - Formatting | | - test.jsonl | |
| +------------------+ +------------------+ +--------------+ |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| TRAINING PIPELINE |
| |
| +------------------+ +------------------+ +--------------+ |
| | Base Model | | LoRA / QLoRA | | Trained | |
| | (Llama 3.1 8B, |--->| Fine-Tuning |--->| Adapter | |
| | Mistral 7B, or | | | | Weights | |
| | Gemma 2 9B) | | - rank, alpha | | | |
| | | | - target modules | | + Merged | |
| | | | - learning rate | | Model | |
| +------------------+ | - scheduler | +--------------+ |
| +------------------+ |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| EVALUATION PIPELINE |
| |
| +-------------------+ +-------------------+ +----------------+ |
| | Automated Metrics | | Domain-Specific | | Human | |
| | | | Benchmarks | | Evaluation | |
| | - Perplexity | | | | | |
| | - ROUGE / BLEU | | - Domain QA acc | | - Correctness | |
| | - BERTScore | | - Task completion | | - Helpfulness | |
| | - Exact match | | - Terminology use | | - Fluency | |
| +-------------------+ +-------------------+ +----------------+ |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| DEPLOYMENT PIPELINE |
| |
| +------------------+ +------------------+ +--------------+ |
| | Quantization | | Serving Engine | | API Gateway | |
| | | | | | | |
| | - GPTQ (4-bit) | | - vLLM | | - FastAPI | |
| | - AWQ (4-bit) | | - TGI | | - Auth | |
| | - GGUF (llama.cpp)| | - Continuous | | - Rate limit | |
| | | | batching | | - Logging | |
| +------------------+ +------------------+ +--------------+ |
+------------------------------------------------------------------+
Milestone 1: Data Collection and Preparation (Week 1)
Objectives
Assemble, clean, and format a high-quality domain-specific dataset for instruction tuning.
Requirements
1.1 Data Collection
- Collect raw domain text from at least two different sources.
- Document each source: URL, license/terms of use, approximate size, collection method.
- Collect at least 50,000 raw text passages (before filtering).
1.2 Data Cleaning and Filtering
- Remove duplicates (exact and near-duplicate using MinHash or similar; a minimal sketch follows this list).
- Filter for quality using at least two heuristics:
  - Minimum length (e.g., at least 50 words).
  - Language detection (ensure all text is in the target language).
  - Perplexity filtering: use a reference language model to remove text that is abnormally high or low perplexity (gibberish or boilerplate).
- Remove personally identifiable information (PII) where applicable.
- Handle encoding issues, broken formatting, and special characters.
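For the near-duplicate step, a minimal sketch using the datasketch library (one option among several MinHash/LSH implementations) is shown below; the 0.8 Jaccard threshold is an assumption you should tune on your own data.

from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def deduplicate(passages, threshold=0.8):
    # Keep a passage only if no already-kept passage is near-identical (Jaccard >= threshold).
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(passages):
        m = minhash_of(text)
        if not lsh.query(m):  # no near-duplicate among passages kept so far
            lsh.insert(str(i), m)
            kept.append(text)
    return kept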
1.3 Instruction Dataset Creation
- Create an instruction-tuning dataset in the conversational format expected by your chosen base model. Each example should include:
  - A system message (optional, defining the domain expert role).
  - A user instruction/question.
  - An assistant response.
- Create examples using a combination of:
  - Existing QA datasets in the domain (if available).
  - LLM-generated examples: use a strong model (GPT-4, Claude) to generate instruction-response pairs from your domain text. This is the synthetic data generation approach (see the sketch after this list).
  - Manual curation: hand-write or verify at least 100 high-quality examples.
- Target dataset sizes:
  - Training set: at least 5,000 examples (10,000+ preferred).
  - Validation set: 500 examples.
  - Test set: 500 examples (must be completely held out from training).
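A minimal sketch of the synthetic-generation step using the OpenAI Python client is shown below. The model name, prompt wording, and domain are placeholders; the same pattern works with any strong instruction-following model. Always spot-check a sample of generated pairs by hand before adding them to the training set.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are helping build a {domain} instruction-tuning dataset.\n"
    "Given the passage below, write one realistic user question and a concise, "
    "accurate answer grounded in the passage. Respond as JSON with keys "
    "'question' and 'answer'.\n\nPassage:\n{passage}"
)

def generate_pair(passage, domain="legal"):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever strong model your budget allows
        messages=[{"role": "user", "content": PROMPT.format(domain=domain, passage=passage)}],
        response_format={"type": "json_object"},
        temperature=0.7,
    )
    pair = json.loads(response.choices[0].message.content)
    return {
        "messages": [
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ],
        "source": "synthetic_gpt4o",
    }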
1.4 Data Format
- Store data in JSONL format with the following schema:
{
"messages": [
{"role": "system", "content": "You are an expert in [domain]..."},
{"role": "user", "content": "What is..."},
{"role": "assistant", "content": "Based on..."}
],
"source": "pubmed_qa",
"quality_score": 0.92
}
- Create a data card documenting: sources, sizes, creation methodology, filtering steps, and known biases.
Deliverables
- data/train.jsonl, data/val.jsonl, data/test.jsonl.
- Data collection and processing scripts (reproducible pipeline).
- Data card document.
- Exploratory data analysis notebook: distribution of lengths, topics, sources.
Milestone 2: Fine-Tuning with LoRA (Week 2-3)
Objectives
Fine-tune the base model using parameter-efficient methods, conducting systematic hyperparameter experiments.
Requirements
2.1 Base Model Selection
- Choose an open-source base model. Recommended options:
  - Llama 3.1 8B Instruct: Strong baseline, well-documented.
  - Mistral 7B Instruct v0.3: Efficient architecture with a 32K-token context window.
  - Gemma 2 9B Instruct: Strong performance per parameter.
- Document your selection rationale, including: model size, architecture, license, known strengths/weaknesses, and memory requirements.
2.2 LoRA Configuration
- Implement fine-tuning using the HuggingFace peft and trl libraries.
- Configure LoRA with the following as starting points (then experiment):
- Rank (r): 8, 16, 32, 64.
- Alpha: 16, 32 (typically 2x rank).
- Target modules: attention projections (q_proj, k_proj, v_proj, o_proj) and optionally MLP layers (gate_proj, up_proj, down_proj).
- Dropout: 0.05.
- Use QLoRA (4-bit base model quantization) if GPU memory is limited.
2.3 Training Configuration
- Optimizer: AdamW with weight decay 0.01.
- Learning rate: sweep over {5e-6, 1e-5, 2e-5, 5e-5} with a cosine schedule.
- Warmup: 10% of total steps.
- Batch size: effective batch size of 16-64 (use gradient accumulation as needed).
- Maximum sequence length: 2048 tokens (or 4096 if resources allow).
- Number of epochs: 1-3 (monitor validation loss for overfitting).
- Use mixed precision (bf16 or fp16).
2.4 Experiment Tracking
- Track all experiments using Weights & Biases (wandb) or MLflow (a minimal sweep sketch follows this list).
- For each run, log: hyperparameters, training loss curve, validation loss curve, learning rate schedule, GPU utilization, peak memory, training time.
- Run at least 6 experiments varying:
  - LoRA rank (at least 3 values).
  - Learning rate (at least 2 values).
  - Target modules (attention-only vs. attention+MLP).
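One simple way to organize the sweep is to drive your training entry point from a small grid and let wandb record each run. In the sketch below, run_training is a placeholder for your own training function; only the wandb calls are the point.

import itertools
import wandb

ranks = [8, 16, 32]
learning_rates = [1e-5, 5e-5]

for r, lr in itertools.product(ranks, learning_rates):
    config = {"lora_r": r, "lora_alpha": 2 * r, "learning_rate": lr,
              "target_modules": "attention_only"}
    run = wandb.init(project="domain-llm-finetune", config=config, reinit=True)
    metrics = run_training(config)  # placeholder: your training script, returning final metrics
    wandb.log({"final_val_loss": metrics["val_loss"],
               "train_time_min": metrics["train_time_min"],
               "peak_mem_gb": metrics["peak_mem_gb"]})
    run.finish()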
2.5 Training Best Practices
- Implement early stopping based on validation loss (see the sketch after this list).
- Save checkpoints at regular intervals.
- Use gradient clipping (max norm 1.0).
- Monitor for catastrophic forgetting by periodically evaluating on a general-purpose benchmark (e.g., a subset of MMLU).
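Early stopping and gradient clipping plug directly into the HuggingFace trainer. The sketch below shows only the pieces that change relative to the starter template at the end of this handout; check the argument names against your installed transformers/trl versions.

from transformers import EarlyStoppingCallback

# Additions to the SFTConfig from the starter template:
training_config = SFTConfig(
    output_dir="./output",
    max_grad_norm=1.0,                  # gradient clipping at max norm 1.0
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
)

trainer = SFTTrainer(model=model, args=training_config,
                     train_dataset=dataset["train"], eval_dataset=dataset["val"])
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))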
Deliverables
- Training scripts with configuration management (command-line arguments or config file).
- Experiment tracking dashboard (wandb project or equivalent) with all runs.
- A table summarizing key experiments: hyperparameters, final validation loss, training time, peak memory.
- The best model checkpoint (adapter weights).
- A short analysis (~500 words) of what worked, what did not, and why.
Milestone 3: Comprehensive Evaluation (Week 4)
Objectives
Rigorously evaluate the fine-tuned model against the base model and, if possible, against commercial API baselines.
Requirements
3.1 Automated Evaluation
- Evaluate on the held-out test set using:
  - Perplexity: compare base model vs. fine-tuned model on domain text.
  - Generation quality: ROUGE-1, ROUGE-L, and BERTScore against reference answers (a metric-computation sketch follows this list).
  - Exact match / F1: for extractive QA-style questions.
  - Domain terminology accuracy: check that the model uses correct domain-specific terms (custom metric: fraction of expected terms present in model output).
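The ROUGE and BERTScore numbers can come straight from the evaluate library (already in the pip list below). A minimal sketch, assuming predictions and references are parallel lists of strings from the held-out test set:

import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def automated_metrics(predictions, references):
    rouge_scores = rouge.compute(predictions=predictions, references=references)
    bs = bertscore.compute(predictions=predictions, references=references, lang="en")
    return {
        "rouge1": rouge_scores["rouge1"],
        "rougeL": rouge_scores["rougeL"],
        "bertscore_f1": sum(bs["f1"]) / len(bs["f1"]),
    }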
3.2 Domain-Specific Benchmark
- Create or adapt a domain-specific benchmark with at least 100 questions spanning:
  - Factual recall (e.g., "What is the standard treatment for X?").
  - Reasoning (e.g., "Given these symptoms, what is the most likely diagnosis?").
  - Summarization (e.g., "Summarize the key findings of this report.").
  - Generation (e.g., "Draft a clause for X.").
- Evaluate the base model, the fine-tuned model, and (if budget allows) GPT-4o-mini or Claude Haiku as an upper-bound reference.
- Report accuracy, quality scores, and response latency for each model.
3.3 Human Evaluation
- Conduct human evaluation on at least 50 test examples.
- For each example, a human rater (you, a teammate, or a recruited evaluator) scores responses from the base model and the fine-tuned model on:
  - Correctness (1-5): is the information factually accurate?
  - Helpfulness (1-5): does the response actually answer the question?
  - Fluency (1-5): is the response well-written and natural?
  - Domain appropriateness (1-5): does the model use appropriate domain language and conventions?
- Compute inter-rater agreement if multiple raters are available.
- Perform a paired statistical test (e.g., the Wilcoxon signed-rank test; a sketch follows this list) to determine whether the fine-tuned model's improvement is statistically significant.
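For the paired significance test, scipy's wilcoxon operates on per-example score pairs. A minimal sketch (the score lists are placeholders; use all 50+ paired ratings for the same examples):

from scipy.stats import wilcoxon

# Correctness ratings (1-5) for the same examples, one entry per example per model
base_scores = [3, 4, 2, 3, 4]        # placeholder values
finetuned_scores = [4, 4, 3, 4, 5]   # placeholder values

stat, p_value = wilcoxon(finetuned_scores, base_scores)
print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Improvement is statistically significant at the 5% level.")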
3.4 Safety and Robustness Evaluation
- Test for:
  - Hallucination rate: fraction of responses containing fabricated information.
  - Refusal appropriateness: does the model refuse to answer questions outside its expertise?
  - Adversarial robustness: test with intentionally misleading or adversarial prompts.
  - General capability retention: evaluate on 50 questions from MMLU (diverse topics) to check for catastrophic forgetting.
3.5 Error Analysis
- Manually categorize errors from the test set into types:
  - Factual errors (wrong information).
  - Incomplete answers (missing key details).
  - Hallucinations (fabricated information).
  - Format errors (wrong structure or style).
  - Refusal errors (refuses a legitimate question).
- For each error type, provide 2-3 concrete examples and analysis.
Deliverables
- Evaluation report with all automated metrics in tables.
- Human evaluation results with statistical significance tests.
- Error analysis with categorized examples.
- A comparison table: base model vs. fine-tuned model vs. reference (if available) across all metrics.
Milestone 4: Quantization (Week 5)
Objectives
Apply quantization to the fine-tuned model to reduce its memory footprint and increase inference speed, while carefully measuring the impact on quality.
Requirements
4.1 Merge and Export
- Merge the LoRA adapter weights into the base model to create a standalone model (see the sketch after this list).
- Save the merged model in HuggingFace format.
- Verify that the merged model produces equivalent outputs to the adapter-based model. Note that if you trained with QLoRA, merging the adapter into the full-precision base weights can introduce small numerical differences; check that quality metrics match rather than expecting token-for-token identical outputs.
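Merging is a few lines with peft. A minimal sketch, assuming the adapter was saved to ./final_adapter as in the starter template:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./final_adapter")
merged = model.merge_and_unload()  # folds the LoRA weights into the base layers

merged.save_pretrained("./merged_model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("./merged_model")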
4.2 Quantization Methods
- Apply at least two of the following quantization methods:
- GPTQ (4-bit, 128 group size): Use the auto-gptq library with a calibration dataset of 128-256 examples from your training data.
- AWQ (4-bit): Use the autoawq library. Compare with GPTQ on quality and speed.
- GGUF (for llama.cpp): Quantize to Q4_K_M and Q5_K_M formats for CPU-friendly inference.
- BitsAndBytes (4-bit NF4): For on-the-fly quantization during loading.
- For each quantized model, measure:
- Model size on disk (GB).
- GPU memory required for inference.
- Inference speed (tokens per second) at batch size 1 and batch size 8.
- All automated evaluation metrics from Milestone 3 (perplexity, ROUGE, BERTScore, domain accuracy).
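One way to produce the GPTQ variant is through transformers' GPTQConfig integration, which calls into the optimum/auto-gptq backends from the pip list below. A sketch, assuming the merged model from 4.1; load_calibration_texts is a placeholder helper that returns a list of raw calibration strings from your training data.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

calibration_texts = load_calibration_texts("data/train.jsonl", n=256)  # placeholder helper

tokenizer = AutoTokenizer.from_pretrained("./merged_model")
gptq_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset=calibration_texts,  # list of raw strings used for calibration
    tokenizer=tokenizer,
)
quantized = AutoModelForCausalLM.from_pretrained(
    "./merged_model", quantization_config=gptq_config, device_map="auto"
)
quantized.save_pretrained("./merged_model-gptq-4bit")
tokenizer.save_pretrained("./merged_model-gptq-4bit")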
4.3 Quality-Speed Tradeoff Analysis
- Create a plot showing: x-axis = inference speed (tokens/sec), y-axis = quality metric (e.g., average score), with each point labeled by quantization method and bit width (a plotting sketch follows this list).
- Identify the best quantization method for your use case (the one offering the best quality-speed tradeoff).
- Document any quality degradation that would be unacceptable for production use.
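A minimal matplotlib sketch for the tradeoff plot; the numbers are placeholders for your own measurements from 4.2.

import matplotlib.pyplot as plt

# (method, tokens/sec at batch size 1, average quality score) -- placeholder measurements
results = [
    ("bf16 (no quant)", 42, 4.1),
    ("GPTQ 4-bit", 61, 3.9),
    ("AWQ 4-bit", 65, 4.0),
]

fig, ax = plt.subplots()
for name, speed, quality in results:
    ax.scatter(speed, quality)
    ax.annotate(name, (speed, quality), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("Inference speed (tokens/sec, batch size 1)")
ax.set_ylabel("Average quality score")
ax.set_title("Quality-speed tradeoff across quantization methods")
fig.savefig("quality_speed_tradeoff.png", dpi=150)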
Deliverables
- Quantized model files for each method.
- Quantization comparison table: method, model size, memory, speed, and quality metrics.
- Quality-speed tradeoff plot.
- Recommendation for which quantized model to deploy, with justification.
Milestone 5: Deployment with vLLM (Week 5-6)
Objectives
Deploy the fine-tuned (and optionally quantized) model as a production inference service with proper API design and monitoring.
Requirements
5.1 Serving Engine Setup
- Deploy the model using vLLM (recommended) or HuggingFace Text Generation Inference (TGI).
- Configure:
  - Tensor parallelism (if multiple GPUs are available).
  - Maximum model length.
  - GPU memory utilization target (e.g., 90%).
  - Maximum number of concurrent requests.
- Verify the model loads and generates correctly (a verification sketch follows this list).
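To verify loading and generation before standing up the server, the vLLM offline Python API is enough; the same model path and settings then carry over to vLLM's OpenAI-compatible server. The model path and prompt below are placeholders.

from vllm import LLM, SamplingParams

llm = LLM(
    model="./merged_model",  # or the quantized variant chosen in Milestone 4
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)
params = SamplingParams(temperature=0.2, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Summarize the main risk factors discussed in this filing."], params)
print(outputs[0].outputs[0].text)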
5.2 API Layer
- Build a FastAPI wrapper around the serving engine with endpoints:
- POST /generate -- Single completion request.
- POST /generate/stream -- Streaming completion (SSE).
- POST /chat -- Multi-turn chat completion (handles message formatting).
- GET /health -- Health check (model loaded, GPU available).
- GET /model/info -- Return model name, quantization method, max length.
- Implement proper request validation:
- Maximum input length.
- Valid temperature range (0.0 to 2.0).
- Valid top_p range (0.0 to 1.0).
- Maximum output tokens.
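The validation rules above map naturally onto a pydantic request model. A sketch of the /generate endpoint, assuming the serving engine exposes an OpenAI-compatible completions endpoint at ENGINE_URL (the URL, served model name, and limits are assumptions to adapt to your setup):

import httpx
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()
ENGINE_URL = "http://localhost:8000/v1/completions"  # e.g., vLLM's OpenAI-compatible server

class GenerateRequest(BaseModel):
    prompt: str = Field(..., max_length=8000)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    top_p: float = Field(0.9, ge=0.0, le=1.0)
    max_tokens: int = Field(256, ge=1, le=1024)

@app.post("/generate")
async def generate(req: GenerateRequest):
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(ENGINE_URL, json={
            "model": "domain-llm",  # must match the name the serving engine was started with
            "prompt": req.prompt,
            "temperature": req.temperature,
            "top_p": req.top_p,
            "max_tokens": req.max_tokens,
        })
    resp.raise_for_status()
    return {"completion": resp.json()["choices"][0]["text"]}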
5.3 Performance Optimization
- Benchmark the serving setup:
- Latency: Time to first token (TTFT) and inter-token latency (ITL) at various input lengths (128, 512, 1024, 2048 tokens).
- Throughput: Maximum requests per second at different concurrency levels (1, 4, 8, 16, 32 concurrent requests).
- Use a load-testing tool (e.g., locust, vegeta, or a custom script with asyncio + httpx).
- Optimize based on results:
- Tune max_num_seqs (maximum batch size).
- Experiment with different gpu_memory_utilization values.
- Enable prefix caching if applicable.
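A small asyncio + httpx script is enough for the load test. This sketch fires batches of concurrent requests at the /generate gateway from 5.2 (the URL and prompt are placeholders) and reports mean and p95 end-to-end latency plus throughput; measuring time to first token additionally requires consuming the streaming endpoint.

import asyncio, statistics, time
import httpx

API_URL = "http://localhost:8080/generate"  # placeholder gateway address
PROMPT = "Summarize the main findings of the attached report."  # placeholder test prompt

async def one_request(client):
    start = time.perf_counter()
    resp = await client.post(API_URL, json={"prompt": PROMPT, "max_tokens": 128})
    resp.raise_for_status()
    return time.perf_counter() - start

async def load_test(concurrency=8, total=64):
    async with httpx.AsyncClient(timeout=300) as client:
        latencies = []
        started = time.perf_counter()
        for _ in range(total // concurrency):
            batch = await asyncio.gather(*[one_request(client) for _ in range(concurrency)])
            latencies.extend(batch)
        wall = time.perf_counter() - started
        latencies.sort()
        print(f"concurrency={concurrency}  mean={statistics.mean(latencies):.2f}s  "
              f"p95={latencies[int(0.95 * len(latencies))]:.2f}s  "
              f"throughput={total / wall:.2f} req/s")

asyncio.run(load_test())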
5.4 Monitoring and Logging
- Log every request: timestamp, input length, output length, latency, tokens/second, model parameters (temperature, top_p).
- Track GPU utilization and memory in real time.
- Implement a simple dashboard or log summary script.
5.5 Containerization
- Create a Dockerfile for the complete serving stack.
- Provide a docker-compose.yml if multiple services are involved.
- Document GPU passthrough configuration for Docker.
- Include a startup script that handles model download/loading.
Deliverables
- Running inference service accessible via HTTP.
- Load test results: latency and throughput at various concurrency levels.
- Docker configuration for deployment.
- Performance tuning documentation.
Milestone 6: Final Report and Demo (Week 6)
Objectives
Produce a comprehensive project report and live demonstration.
Requirements
6.1 Final Report
The report should be 8-12 pages and include:
- Executive Summary (0.5 page): Domain, approach, key results.
- Data (1-2 pages): Sources, collection methodology, cleaning, statistics, data card.
- Training (2-3 pages): Model selection rationale, LoRA configuration, hyperparameter experiments, training curves, key findings.
- Evaluation (2-3 pages): All metrics, human evaluation results, error analysis, comparison with baselines, statistical significance.
- Quantization (1 page): Methods compared, quality-speed tradeoff, recommendation.
- Deployment (1-2 pages): Architecture, performance benchmarks, monitoring setup.
- Limitations and Future Work (0.5-1 page): Honest assessment of shortcomings and potential improvements.
6.2 Live Demo
- Demonstrate the deployed model answering domain-specific questions.
- Show side-by-side comparison with the base model.
- Demonstrate the monitoring dashboard.
- Be prepared to answer questions about design decisions, failure modes, and alternative approaches.
6.3 Code Repository
- All code in a clean, well-organized repository.
- README with setup instructions.
- Requirements file or environment specification.
- All scripts should be runnable with clear documentation.
Deliverables
- Final report (PDF).
- Live demo (in-person or recorded video, 15-20 minutes).
- Code repository with README.
Grading Rubric
| Component | Weight | Criteria |
|---|---|---|
| Data Preparation | 15% | Diverse sources, thorough cleaning, well-formatted instruction dataset, documented methodology, sufficient volume. |
| Fine-Tuning | 25% | Correct LoRA implementation, systematic hyperparameter experiments (at least 6 runs), proper experiment tracking, thoughtful analysis of results. |
| Evaluation | 25% | Comprehensive automated metrics, domain-specific benchmark, human evaluation with statistical tests, error analysis, comparison with baselines. |
| Quantization | 10% | At least two methods compared, quality-speed tradeoff analysis, clear recommendation with justification. |
| Deployment | 15% | Working inference service, load testing with reported results, containerized deployment, monitoring. |
| Report and Presentation | 10% | Clear writing, complete coverage of all milestones, honest discussion of limitations, professional presentation. |
Grade Thresholds
- A (90-100%): The fine-tuned model shows clear, statistically significant improvement over the base model. All milestones completed with high quality. Experiments are well-designed and thoroughly analyzed. Deployment is production-ready.
- B (80-89%): The fine-tuned model shows improvement on most metrics. All milestones completed but with some gaps in depth. Good experiment tracking and analysis.
- C (70-79%): Basic fine-tuning works but experiments are limited. Evaluation lacks depth or rigor. Deployment is functional but not optimized.
- D (60-69%): Fine-tuning produces a model but with minimal evaluation. Significant milestones incomplete.
- F (<60%): Project is incomplete or the fine-tuned model fails to work correctly.
Technical Recommendations
Compute Requirements
- Minimum: 1 GPU with 24GB VRAM (e.g., RTX 3090/4090, A5000). Use QLoRA for 7-8B models.
- Recommended: 1 GPU with 40-80GB VRAM (e.g., A100 40GB/80GB). Enables full LoRA without quantized base model.
- Cloud options: Google Colab Pro+ (A100), Lambda Labs, RunPod, or AWS/GCP spot instances.
Key Libraries
pip install torch transformers datasets peft trl accelerate bitsandbytes
pip install auto-gptq autoawq # For quantization
pip install vllm # For serving
pip install wandb evaluate rouge-score bert-score
pip install fastapi uvicorn httpx
Starter Code Template
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load base model with 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
# Prepare the 4-bit model for k-bit training (casts norms, enables input gradients)
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Load dataset
dataset = load_dataset("json", data_files={"train": "data/train.jsonl", "val": "data/val.jsonl"})
# Training configuration
training_config = SFTConfig(
output_dir="./output",
num_train_epochs=2,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-5,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
bf16=True,
logging_steps=10,
eval_strategy="steps",
eval_steps=100,
save_strategy="steps",
save_steps=100,
max_seq_length=2048,
report_to="wandb",
)
# Train
trainer = SFTTrainer(
model=model,
args=training_config,
train_dataset=dataset["train"],
eval_dataset=dataset["val"],
    tokenizer=tokenizer,  # in newer trl versions this argument is named processing_class
)
trainer.train()
trainer.save_model("./final_adapter")
Advice
- Start with a small experiment (100 training examples, 50 steps) to verify the pipeline works end-to-end before scaling up.
- Keep your data pipeline and training pipeline modular and configurable. You will run many experiments and need to iterate quickly.
- Domain-specific evaluation is the most important part of this project. Generic metrics like perplexity tell you something, but a well-designed domain benchmark tells you whether your model is actually useful.
- Be honest about failures. A thoughtful analysis of why something did not work is more valuable than hiding negative results.