Capstone Project 2: Fine-Tune and Deploy a Domain-Specific LLM

Project Overview

In this capstone project, you will take a pre-trained open-source large language model and adapt it to a specific domain through data preparation, parameter-efficient fine-tuning, rigorous evaluation, quantization, and production deployment. By the end, you will have a specialized model that demonstrably outperforms the base model on your chosen domain, served through a high-performance inference endpoint.

This project synthesizes concepts from Chapters 10 (Embeddings and Tokenization), 12 (Pre-Training and Transfer Learning), 14 (Language Model Pre-Training), 15 (Text Generation), 16 (Parameter-Efficient Fine-Tuning), 17 (Alignment and RLHF), 28 (Data Engineering), 31 (Evaluation), 33 (Model Compression), and 34 (Model Serving).

Estimated Time: 60-80 hours over 4-6 weeks.

Team Size: 1-3 people.


Domain Selection

Choose one of the following domains (or propose your own with instructor approval):

  1. Legal: Contract analysis, legal question answering, clause generation. Training data from open legal corpora (e.g., CaseLaw Access Project, EUR-Lex).
  2. Medical/Clinical: Medical question answering, clinical note summarization. Training data from PubMed abstracts, medical QA datasets (MedQA, PubMedQA).
  3. Financial: Financial report analysis, earnings call summarization, financial QA. Training data from SEC filings (EDGAR), financial news.
  4. Scientific: Research paper summarization, methodology extraction, scientific QA. Training data from arXiv abstracts, Semantic Scholar.
  5. Code/Technical: Code generation, documentation writing, technical QA for a specific framework or language. Training data from GitHub, Stack Overflow.

The domain should have sufficient publicly available text data (at least 10,000 high-quality examples for instruction tuning) and a clear way to evaluate domain-specific performance.


System Architecture

+------------------------------------------------------------------+
|                      DATA PIPELINE                                |
|                                                                   |
|  +------------------+    +------------------+    +--------------+ |
|  | Raw Data         |    | Data Processing  |    | Training     | |
|  | Collection       |--->| & Curation       |--->| Dataset      | |
|  |                  |    |                  |    | (JSONL)      | |
|  | - Web scraping   |    | - Cleaning       |    |              | |
|  | - API access     |    | - Deduplication  |    | - train.jsonl| |
|  | - Public datasets|    | - Quality filter |    | - val.jsonl  | |
|  |                  |    | - Formatting     |    | - test.jsonl | |
|  +------------------+    +------------------+    +--------------+ |
+------------------------------------------------------------------+
                                                          |
                                                          v
+------------------------------------------------------------------+
|                    TRAINING PIPELINE                               |
|                                                                   |
|  +------------------+    +------------------+    +--------------+ |
|  | Base Model       |    | LoRA / QLoRA     |    | Trained      | |
|  | (Llama 3.1 8B,   |--->| Fine-Tuning      |--->| Adapter      | |
|  |  Mistral 7B, or  |    |                  |    | Weights      | |
|  |  Gemma 2 9B)     |    | - rank, alpha    |    |              | |
|  |                  |    | - target modules |    | + Merged     | |
|  |                  |    | - learning rate  |    |   Model      | |
|  +------------------+    | - scheduler      |    +--------------+ |
|                          +------------------+                     |
+------------------------------------------------------------------+
                                                          |
                                                          v
+------------------------------------------------------------------+
|                   EVALUATION PIPELINE                              |
|                                                                   |
|  +-------------------+  +-------------------+ +----------------+  |
|  | Automated Metrics |  | Domain-Specific   | | Human          |  |
|  |                   |  | Benchmarks        | | Evaluation     |  |
|  | - Perplexity      |  |                   | |                |  |
|  | - ROUGE / BLEU    |  | - Domain QA acc   | | - Correctness  |  |
|  | - BERTScore       |  | - Task completion | | - Helpfulness  |  |
|  | - Exact match     |  | - Terminology use | | - Fluency      |  |
|  +-------------------+  +-------------------+ +----------------+  |
+------------------------------------------------------------------+
                                                          |
                                                          v
+------------------------------------------------------------------+
|                   DEPLOYMENT PIPELINE                              |
|                                                                   |
|  +------------------+    +------------------+    +--------------+ |
|  | Quantization     |    | Serving Engine   |    | API Gateway  | |
|  |                  |    |                  |    |              | |
|  | - GPTQ (4-bit)   |    | - vLLM           |    | - FastAPI    | |
|  | - AWQ (4-bit)    |    | - TGI            |    | - Auth       | |
|  | - GGUF/llama.cpp |    | - Continuous     |    | - Rate limit | |
|  |                  |    |   batching       |    | - Logging    | |
|  +------------------+    +------------------+    +--------------+ |
+------------------------------------------------------------------+

Milestone 1: Data Collection and Preparation (Week 1)

Objectives

Assemble, clean, and format a high-quality domain-specific dataset for instruction tuning.

Requirements

1.1 Data Collection
  • Collect raw domain text from at least two different sources.
  • Document each source: URL, license/terms of use, approximate size, collection method.
  • Collect at least 50,000 raw text passages (before filtering).

1.2 Data Cleaning and Filtering
  • Remove duplicates (exact and near-duplicate, using MinHash or similar).
  • Filter for quality using at least two heuristics:
    - Minimum length (e.g., at least 50 words).
    - Language detection (ensure all text is in the target language).
    - Perplexity filtering: use a reference language model to remove text with abnormally high or low perplexity (gibberish or boilerplate).
  • Remove personally identifiable information (PII) where applicable.
  • Handle encoding issues, broken formatting, and special characters.
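
As a concrete starting point, near-duplicate removal and the length heuristic can be combined in a few lines. The sketch below assumes the datasketch package (any MinHash implementation works); the 0.8 similarity threshold and 3-word shingles are illustrative choices, not requirements:

from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over 3-word shingles of a passage."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for shingle in zip(tokens, tokens[1:], tokens[2:]):
        m.update(" ".join(shingle).encode("utf-8"))
    return m

def deduplicate(passages: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(passages):
        if len(text.split()) < 50:          # minimum-length heuristic
            continue
        sig = minhash_signature(text)
        if lsh.query(sig):                  # an earlier near-duplicate exists
            continue
        lsh.insert(f"doc-{i}", sig)
        kept.append(text)
    return kept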

1.3 Instruction Dataset Creation
  • Create an instruction-tuning dataset in the conversational format expected by your chosen base model. Each example should include:
    - A system message (optional, defining the domain expert role).
    - A user instruction/question.
    - An assistant response.
  • Create examples using a combination of:
    - Existing QA datasets in the domain (if available).
    - LLM-generated examples: use a strong model (GPT-4, Claude) to generate instruction-response pairs from your domain text. This is the synthetic data generation approach.
    - Manual curation: hand-write or verify at least 100 high-quality examples.
  • Target dataset sizes:
    - Training set: at least 5,000 examples (10,000+ preferred).
    - Validation set: 500 examples.
    - Test set: 500 examples (must be completely held out from training).

1.4 Data Format
  • Store data in JSONL format with the following schema:

{
  "messages": [
    {"role": "system", "content": "You are an expert in [domain]..."},
    {"role": "user", "content": "What is..."},
    {"role": "assistant", "content": "Based on..."}
  ],
  "source": "pubmed_qa",
  "quality_score": 0.92
}
  • Create a data card documenting: sources, sizes, creation methodology, filtering steps, and known biases.
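
Before training, it is worth validating every record against this schema. A minimal sketch (the strictness checks beyond the required fields are our own assumptions):

import json

REQUIRED_ROLES = {"user", "assistant"}  # the system message is optional

def validate_jsonl(path: str) -> list[int]:
    """Return the line numbers of records that violate the schema above."""
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
                roles = [m["role"] for m in record["messages"]]
                assert REQUIRED_ROLES.issubset(roles)
                assert all(m["content"].strip() for m in record["messages"])
            except (json.JSONDecodeError, KeyError, AssertionError):
                bad_lines.append(lineno)
    return bad_lines

# Example usage:
# print(validate_jsonl("data/train.jsonl"))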

Deliverables

  • data/train.jsonl, data/val.jsonl, data/test.jsonl.
  • Data collection and processing scripts (reproducible pipeline).
  • Data card document.
  • Exploratory data analysis notebook: distribution of lengths, topics, sources.

Milestone 2: Fine-Tuning with LoRA (Weeks 2-3)

Objectives

Fine-tune the base model using parameter-efficient methods, conducting systematic hyperparameter experiments.

Requirements

2.1 Base Model Selection
  • Choose an open-source base model. Recommended options:
    - Llama 3.1 8B Instruct: strong baseline, well-documented.
    - Mistral 7B Instruct v0.3: efficient architecture with sliding window attention.
    - Gemma 2 9B Instruct: strong performance per parameter.
  • Document your selection rationale, including model size, architecture, license, known strengths/weaknesses, and memory requirements.

2.2 LoRA Configuration
  • Implement fine-tuning using the HuggingFace peft and trl libraries.
  • Configure LoRA with the following as starting points (then experiment):
    - Rank (r): 8, 16, 32, 64.
    - Alpha: 16, 32 (typically 2x rank).
    - Target modules: attention projections (q_proj, k_proj, v_proj, o_proj) and optionally MLP layers (gate_proj, up_proj, down_proj).
    - Dropout: 0.05.
  • Use QLoRA (4-bit base model quantization) if GPU memory is limited.

2.3 Training Configuration
  • Optimizer: AdamW with weight decay 0.01.
  • Learning rate: sweep over {5e-6, 1e-5, 2e-5, 5e-5} with a cosine schedule.
  • Warmup: 10% of total steps.
  • Batch size: effective batch size of 16-64 (use gradient accumulation as needed).
  • Maximum sequence length: 2048 tokens (or 4096 if resources allow).
  • Number of epochs: 1-3 (monitor for overfitting on validation loss).
  • Use mixed precision (bf16 or fp16).

2.4 Experiment Tracking
  • Track all experiments using Weights & Biases (wandb) or MLflow.
  • For each run, log: hyperparameters, training loss curve, validation loss curve, learning rate schedule, GPU utilization, peak memory, and training time.
  • Run at least 6 experiments varying:
    - LoRA rank (at least 3 values).
    - Learning rate (at least 2 values).
    - Target modules (attention-only vs. attention+MLP).
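
One way to keep the sweep organized is a small driver that names each run after its hyperparameters. The sketch below assumes a hypothetical train_one_run() wrapper around the training loop in the starter template at the end of this document; the grid values mirror the requirements above:

import itertools
import wandb

# Hypothetical wrapper around the training loop in the starter template below.
from train import train_one_run  # assumed to exist in your repository

ranks = [8, 16, 32]
learning_rates = [1e-5, 2e-5]
module_sets = {
    "attn": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "attn+mlp": ["q_proj", "k_proj", "v_proj", "o_proj",
                 "gate_proj", "up_proj", "down_proj"],
}

for r, lr, (tag, modules) in itertools.product(ranks, learning_rates, module_sets.items()):
    run_name = f"lora-r{r}-lr{lr}-{tag}"
    with wandb.init(project="domain-llm", name=run_name,
                    config={"rank": r, "lr": lr, "modules": tag}):
        train_one_run(rank=r, lora_alpha=2 * r, learning_rate=lr,
                      target_modules=modules, run_name=run_name)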

2.5 Training Best Practices
  • Implement early stopping based on validation loss.
  • Save checkpoints at regular intervals.
  • Use gradient clipping (max norm 1.0).
  • Monitor for catastrophic forgetting by periodically evaluating on a general-purpose benchmark (e.g., a subset of MMLU).

Deliverables

  • Training scripts with configuration management (command-line arguments or config file).
  • Experiment tracking dashboard (wandb project or equivalent) with all runs.
  • A table summarizing key experiments: hyperparameters, final validation loss, training time, peak memory.
  • The best model checkpoint (adapter weights).
  • A short analysis (~500 words) of what worked, what did not, and why.

Milestone 3: Comprehensive Evaluation (Week 4)

Objectives

Rigorously evaluate the fine-tuned model against the base model and, if possible, against commercial API baselines.

Requirements

3.1 Automated Evaluation
  • Evaluate on the held-out test set using:
    - Perplexity: compare the base model vs. the fine-tuned model on domain text.
    - Generation quality: ROUGE-1, ROUGE-L, and BERTScore against reference answers.
    - Exact match / F1: for extractive QA-style questions.
    - Domain terminology accuracy: check that the model uses correct domain-specific terms (custom metric: fraction of expected terms present in the model output).
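
Because the terminology metric is custom, pin down exactly what is counted. The sketch below shows one simple version of it, along with a per-example-averaged perplexity estimate (a token-weighted average is slightly more precise but rarely changes the base-vs-fine-tuned comparison):

import math
import torch

def perplexity(model, tokenizer, texts, max_length=2048):
    """Mean perplexity of a model over a list of domain passages."""
    losses = []
    model.eval()
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).to(model.device)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

def terminology_accuracy(output: str, expected_terms: list[str]) -> float:
    """Fraction of expected domain terms that appear in the model output."""
    text = output.lower()
    hits = sum(term.lower() in text for term in expected_terms)
    return hits / len(expected_terms) if expected_terms else 0.0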

3.2 Domain-Specific Benchmark
  • Create or adapt a domain-specific benchmark with at least 100 questions spanning:
    - Factual recall (e.g., "What is the standard treatment for X?").
    - Reasoning (e.g., "Given these symptoms, what is the most likely diagnosis?").
    - Summarization (e.g., "Summarize the key findings of this report.").
    - Generation (e.g., "Draft a clause for X.").
  • Evaluate the base model, the fine-tuned model, and (if budget allows) GPT-4o-mini or Claude Haiku as an upper-bound reference.
  • Report accuracy, quality scores, and response latency for each model.

3.3 Human Evaluation
  • Conduct human evaluation on at least 50 test examples.
  • For each example, a human rater (you, a teammate, or a recruited evaluator) scores responses from the base model and the fine-tuned model on:
    - Correctness (1-5): is the information factually accurate?
    - Helpfulness (1-5): does the response actually answer the question?
    - Fluency (1-5): is the response well-written and natural?
    - Domain appropriateness (1-5): does the model use appropriate domain language and conventions?
  • Compute inter-rater agreement if multiple raters are available.
  • Perform a paired statistical test (e.g., the Wilcoxon signed-rank test) to determine whether the fine-tuned model's improvement is statistically significant.
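
The paired test is a single scipy call. A sketch with placeholder ratings (not real results); in practice the two lists come from your rating spreadsheet, one paired entry per evaluated example:

from scipy.stats import wilcoxon

# Paired per-example ratings (e.g., the mean of the four 1-5 criteria per response).
# The numbers below are placeholders, not real results.
base_scores = [3.0, 2.5, 4.0, 3.5, 2.0]
finetuned_scores = [4.0, 3.5, 4.0, 4.5, 3.0]

stat, p_value = wilcoxon(finetuned_scores, base_scores)
print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Improvement is statistically significant at the 5% level.")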

3.4 Safety and Robustness Evaluation
  • Test for:
    - Hallucination rate: fraction of responses containing fabricated information.
    - Refusal appropriateness: does the model refuse to answer questions outside its expertise?
    - Adversarial robustness: test with intentionally misleading or adversarial prompts.
    - General capability retention: evaluate on 50 questions from MMLU (diverse topics) to check for catastrophic forgetting.

3.5 Error Analysis
  • Manually categorize errors from the test set into types:
    - Factual errors (wrong information).
    - Incomplete answers (missing key details).
    - Hallucinations (fabricated information).
    - Format errors (wrong structure or style).
    - Refusal errors (refusing a legitimate question).
  • For each error type, provide 2-3 concrete examples and analysis.

Deliverables

  • Evaluation report with all automated metrics in tables.
  • Human evaluation results with statistical significance tests.
  • Error analysis with categorized examples.
  • A comparison table: base model vs. fine-tuned model vs. reference (if available) across all metrics.

Milestone 4: Quantization (Week 5)

Objectives

Apply quantization to the fine-tuned model to reduce its memory footprint and increase inference speed, while carefully measuring the impact on quality.

Requirements

4.1 Merge and Export
  • Merge the LoRA adapter weights into the base model to create a standalone model.
  • Save the merged model in HuggingFace format.
  • Verify that the merged model produces identical outputs to the adapter-based model.
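
Merging takes only a few lines with peft. A sketch, assuming the adapter was saved to ./final_adapter as in the starter template; the paths and the bf16 dtype are illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"   # the same base model used for training
adapter_dir = "./final_adapter"                # illustrative path

# Load the base model in full precision (not 4-bit) before merging.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()              # folds the LoRA deltas into the base weights

merged.save_pretrained("./merged_model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("./merged_model")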

4.2 Quantization Methods
  • Apply at least two of the following quantization methods:
    - GPTQ (4-bit, group size 128): use the auto-gptq library with a calibration dataset of 128-256 examples from your training data.
    - AWQ (4-bit): use the autoawq library. Compare with GPTQ on quality and speed.
    - GGUF (for llama.cpp): quantize to the Q4_K_M and Q5_K_M formats for CPU-friendly inference.
    - BitsAndBytes (4-bit NF4): for on-the-fly quantization during loading.
  • For each quantized model, measure:
    - Model size on disk (GB).
    - GPU memory required for inference.
    - Inference speed (tokens per second) at batch size 1 and batch size 8.
    - All automated evaluation metrics from Milestone 3 (perplexity, ROUGE, BERTScore, domain accuracy).
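
Speed and memory numbers are easy to skew if generation length varies between runs, so fix max_new_tokens and use greedy decoding when benchmarking. A sketch that works with any transformers-compatible checkpoint, quantized or not (model loading omitted):

import time
import torch

def benchmark_generation(model, tokenizer, prompt: str, max_new_tokens: int = 256):
    """Measure tokens/second and peak GPU memory for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return {
        "tokens_per_second": new_tokens / elapsed,
        "peak_memory_gb": torch.cuda.max_memory_allocated() / 1e9,
    }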

4.3 Quality-Speed Tradeoff Analysis
  • Create a plot: x-axis = inference speed (tokens/sec), y-axis = quality metric (e.g., average score), with each point labeled by quantization method and bit width.
  • Identify the best quantization method for your use case (the one offering the best quality-speed tradeoff).
  • Document any quality degradation that would be unacceptable for production use.

Deliverables

  • Quantized model files for each method.
  • Quantization comparison table: method, model size, memory, speed, and quality metrics.
  • Quality-speed tradeoff plot.
  • Recommendation for which quantized model to deploy, with justification.

Milestone 5: Deployment with vLLM (Weeks 5-6)

Objectives

Deploy the fine-tuned (and optionally quantized) model as a production inference service with proper API design and monitoring.

Requirements

5.1 Serving Engine Setup
  • Deploy the model using vLLM (recommended) or HuggingFace Text Generation Inference (TGI).
  • Configure:
    - Tensor parallelism (if multiple GPUs are available).
    - Maximum model length.
    - GPU memory utilization target (e.g., 90%).
    - Maximum number of concurrent requests.
  • Verify that the model loads and generates correctly.
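
A quick way to verify loading and exercise these configuration knobs is vLLM's offline Python API, before standing up the HTTP server. The paths and values below are illustrative:

from vllm import LLM, SamplingParams

# Illustrative configuration; adjust to your GPU and model.
llm = LLM(
    model="./merged_model",          # or a quantized checkpoint
    max_model_len=2048,              # maximum model length
    gpu_memory_utilization=0.90,     # GPU memory utilization target
    tensor_parallel_size=1,          # >1 if multiple GPUs are available
)

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["What is the standard treatment for X?"], params)
print(outputs[0].outputs[0].text)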

5.2 API Layer
  • Build a FastAPI wrapper around the serving engine with endpoints:
    - POST /generate -- single completion request.
    - POST /generate/stream -- streaming completion (SSE).
    - POST /chat -- multi-turn chat completion (handles message formatting).
    - GET /health -- health check (model loaded, GPU available).
    - GET /model/info -- return model name, quantization method, max length.
  • Implement proper request validation:
    - Maximum input length.
    - Valid temperature range (0.0 to 2.0).
    - Valid top_p range (0.0 to 1.0).
    - Maximum output tokens.
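
A minimal sketch of the /generate endpoint is shown below. It assumes the model sits behind vLLM's OpenAI-compatible server on localhost:8000; the request fields and limits are our own choices, and the remaining endpoints follow the same pattern:

import httpx
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/completions"
MODEL_NAME = "./merged_model"   # must match the model name vLLM was started with

class GenerateRequest(BaseModel):
    prompt: str = Field(..., max_length=8000)              # maximum input length (characters)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    top_p: float = Field(default=0.9, ge=0.0, le=1.0)
    max_tokens: int = Field(default=256, ge=1, le=1024)    # maximum output tokens

@app.post("/generate")
async def generate(req: GenerateRequest):
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(VLLM_URL, json={
            "model": MODEL_NAME,
            "prompt": req.prompt,
            "temperature": req.temperature,
            "top_p": req.top_p,
            "max_tokens": req.max_tokens,
        })
    resp.raise_for_status()
    return {"completion": resp.json()["choices"][0]["text"]}

@app.get("/health")
async def health():
    return {"status": "ok"}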

5.3 Performance Optimization
  • Benchmark the serving setup:
    - Latency: time to first token (TTFT) and inter-token latency (ITL) at various input lengths (128, 512, 1024, 2048 tokens).
    - Throughput: maximum requests per second at different concurrency levels (1, 4, 8, 16, 32 concurrent requests).
  • Use a load-testing tool (e.g., locust, vegeta, or a custom script with asyncio + httpx).
  • Optimize based on the results:
    - Tune max_num_seqs (maximum batch size).
    - Experiment with different gpu_memory_utilization values.
    - Enable prefix caching if applicable.
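
A custom asyncio + httpx load test can be as short as the sketch below, which reports mean latency and throughput at each concurrency level. The URL and payload target the /generate wrapper from the previous sketch; the port and request count are illustrative:

import asyncio
import time
import httpx

API_URL = "http://localhost:8080/generate"   # the FastAPI wrapper, not vLLM directly
PAYLOAD = {"prompt": "Summarize the key findings of this report: ...", "max_tokens": 128}

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    resp = await client.post(API_URL, json=PAYLOAD)
    resp.raise_for_status()
    return time.perf_counter() - start

async def load_test(concurrency: int, total_requests: int = 64):
    async with httpx.AsyncClient(timeout=300.0) as client:
        sem = asyncio.Semaphore(concurrency)

        async def bounded():
            async with sem:
                return await one_request(client)

        start = time.perf_counter()
        latencies = await asyncio.gather(*[bounded() for _ in range(total_requests)])
        wall = time.perf_counter() - start
    print(f"concurrency={concurrency}  mean latency={sum(latencies)/len(latencies):.2f}s  "
          f"throughput={total_requests / wall:.2f} req/s")

if __name__ == "__main__":
    for c in (1, 4, 8, 16, 32):
        asyncio.run(load_test(c))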

5.4 Monitoring and Logging
  • Log every request: timestamp, input length, output length, latency, tokens/second, and model parameters (temperature, top_p).
  • Track GPU utilization and memory in real time.
  • Implement a simple dashboard or log summary script.

5.5 Containerization
  • Create a Dockerfile for the complete serving stack.
  • Provide a docker-compose.yml if multiple services are involved.
  • Document GPU passthrough configuration for Docker.
  • Include a startup script that handles model download/loading.

Deliverables

  • Running inference service accessible via HTTP.
  • Load test results: latency and throughput at various concurrency levels.
  • Docker configuration for deployment.
  • Performance tuning documentation.

Milestone 6: Final Report and Demo (Week 6)

Objectives

Produce a comprehensive project report and live demonstration.

Requirements

6.1 Final Report
The report should be 8-12 pages and include:

  1. Executive Summary (0.5 page): Domain, approach, key results.
  2. Data (1-2 pages): Sources, collection methodology, cleaning, statistics, data card.
  3. Training (2-3 pages): Model selection rationale, LoRA configuration, hyperparameter experiments, training curves, key findings.
  4. Evaluation (2-3 pages): All metrics, human evaluation results, error analysis, comparison with baselines, statistical significance.
  5. Quantization (1 page): Methods compared, quality-speed tradeoff, recommendation.
  6. Deployment (1-2 pages): Architecture, performance benchmarks, monitoring setup.
  7. Limitations and Future Work (0.5-1 page): Honest assessment of shortcomings and potential improvements.

6.2 Live Demo
  • Demonstrate the deployed model answering domain-specific questions.
  • Show a side-by-side comparison with the base model.
  • Demonstrate the monitoring dashboard.
  • Be prepared to answer questions about design decisions, failure modes, and alternative approaches.

6.3 Code Repository
  • All code in a clean, well-organized repository.
  • README with setup instructions.
  • Requirements file or environment specification.
  • All scripts should be runnable, with clear documentation.

Deliverables

  • Final report (PDF).
  • Live demo (in-person or recorded video, 15-20 minutes).
  • Code repository with README.

Grading Rubric

Each component is weighted as follows:

  • Data Preparation (15%): Diverse sources, thorough cleaning, well-formatted instruction dataset, documented methodology, sufficient volume.
  • Fine-Tuning (25%): Correct LoRA implementation, systematic hyperparameter experiments (at least 6 runs), proper experiment tracking, thoughtful analysis of results.
  • Evaluation (25%): Comprehensive automated metrics, domain-specific benchmark, human evaluation with statistical tests, error analysis, comparison with baselines.
  • Quantization (10%): At least two methods compared, quality-speed tradeoff analysis, clear recommendation with justification.
  • Deployment (15%): Working inference service, load testing with reported results, containerized deployment, monitoring.
  • Report and Presentation (10%): Clear writing, complete coverage of all milestones, honest discussion of limitations, professional presentation.

Grade Thresholds

  • A (90-100%): The fine-tuned model shows clear, statistically significant improvement over the base model. All milestones completed with high quality. Experiments are well-designed and thoroughly analyzed. Deployment is production-ready.
  • B (80-89%): The fine-tuned model shows improvement on most metrics. All milestones completed but with some gaps in depth. Good experiment tracking and analysis.
  • C (70-79%): Basic fine-tuning works but experiments are limited. Evaluation lacks depth or rigor. Deployment is functional but not optimized.
  • D (60-69%): Fine-tuning produces a model but with minimal evaluation. Significant milestones incomplete.
  • F (<60%): Project is incomplete or the fine-tuned model fails to work correctly.

Technical Recommendations

Compute Requirements

  • Minimum: 1 GPU with 24GB VRAM (e.g., RTX 3090/4090, A5000). Use QLoRA for 7-8B models.
  • Recommended: 1 GPU with 40-80GB VRAM (e.g., A100 40GB/80GB). Enables full LoRA without quantized base model.
  • Cloud options: Google Colab Pro+ (A100), Lambda Labs, RunPod, or AWS/GCP spot instances.

Key Libraries

pip install torch transformers datasets peft trl accelerate bitsandbytes
pip install auto-gptq autoawq  # For quantization
pip install vllm                # For serving
pip install wandb evaluate rouge-score bert-score
pip install fastapi uvicorn httpx

Starter Code Template

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load base model with 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Prepare the 4-bit quantized model for training, then attach the LoRA adapters
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load dataset
dataset = load_dataset("json", data_files={"train": "data/train.jsonl", "val": "data/val.jsonl"})

# Training configuration
training_config = SFTConfig(
    output_dir="./output",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    max_seq_length=2048,
    report_to="wandb",
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["val"],
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./final_adapter")

Advice

  • Start with a small experiment (100 training examples, 50 steps) to verify the pipeline works end-to-end before scaling up.
  • Keep your data pipeline and training pipeline modular and configurable. You will run many experiments and need to iterate quickly.
  • Domain-specific evaluation is the most important part of this project. Generic metrics like perplexity tell you something, but a well-designed domain benchmark tells you whether your model is actually useful.
  • Be honest about failures. A thoughtful analysis of why something did not work is more valuable than hiding negative results.