Case Study 2: Benchmarking Open-Source LLMs

Overview

In this case study, we build a lightweight benchmarking pipeline for open-source large language models. We evaluate models from different families (Llama, Mistral, and others) on a curated set of tasks spanning knowledge, reasoning, and code generation. The goal is not to reproduce a full HELM-style evaluation but to develop the practical skills needed to compare models rigorously and interpret the results critically.

By working through this case study, you will understand the subtleties of LLM evaluation: prompt sensitivity, metric choice, contamination awareness, and the gap between benchmark scores and real-world utility.

Learning Objectives

  • Load and run inference with open-source LLMs using HuggingFace Transformers.
  • Implement standard evaluation protocols for multiple-choice and free-form tasks.
  • Compute and interpret pass@k for code generation.
  • Analyze the effect of prompt format on benchmark scores.
  • Visualize and compare model capabilities across multiple dimensions.

Prerequisites

  • A machine with at least 16 GB RAM (GPU recommended for larger models).
  • pip install transformers torch accelerate datasets tqdm (accelerate is needed for device_map="auto" GPU loading).
  • For code evaluation: pip install human-eval (optional, see notes).
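
Before loading any models, it is worth verifying the environment with a quick check (a minimal sketch; no particular package versions are required by this case study):

import torch
import transformers

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")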

Part 1: Setting Up the Evaluation Framework

"""Evaluation framework for benchmarking open-source LLMs.

Provides a modular pipeline for loading models, running inference,
and computing metrics across different benchmark tasks.
"""

import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(42)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


@dataclass
class BenchmarkResult:
    """Container for benchmark evaluation results.

    Args:
        model_name: Name of the evaluated model.
        task_name: Name of the benchmark task.
        score: Primary metric score.
        metric_name: Name of the metric used.
        n_examples: Number of examples evaluated.
        details: Additional per-example details.
        elapsed_seconds: Wall-clock time for evaluation.
    """
    model_name: str
    task_name: str
    score: float
    metric_name: str
    n_examples: int
    details: Dict[str, Any] = field(default_factory=dict)
    elapsed_seconds: float = 0.0


def load_model(
    model_name: str, dtype: torch.dtype = torch.float16
) -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
    """Load a model and tokenizer from HuggingFace.

    Args:
        model_name: HuggingFace model identifier.
        dtype: Data type for model weights.

    Returns:
        Tuple of (model, tokenizer).
    """
    print(f"Loading {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=dtype,
        device_map="auto" if DEVICE == "cuda" else None,
        trust_remote_code=True,
    )
    if DEVICE == "cpu":
        model = model.to(DEVICE)
    model.eval()
    print(f"  Loaded. Parameters: {sum(p.numel() for p in model.parameters()):,}")
    return model, tokenizer
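
A quick smoke test of the loader, in the same commented-usage style as the later parts (gpt2 is just a small stand-in; any causal LM identifier on the HuggingFace Hub should work):

# Example usage (smoke test with a small model):
# model, tokenizer = load_model("gpt2")
# inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
# output = model.generate(**inputs, max_new_tokens=5, do_sample=False,
#                         pad_token_id=tokenizer.pad_token_id)
# print(tokenizer.decode(output[0], skip_special_tokens=True))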

Part 2: Multiple-Choice Evaluation (MMLU-Style)

We implement the standard MMLU evaluation protocol using log-probability comparison: the prompt ends with "Answer:", and the model's prediction is the answer letter (A, B, C, or D) whose token receives the highest log-probability at the next position.

"""MMLU-style multiple-choice evaluation.

Evaluates models on multiple-choice knowledge questions using
the log-probability of answer tokens.
"""

from typing import List, Tuple

import torch
import torch.nn.functional as F

torch.manual_seed(42)


# Sample MMLU-style questions (subset for demonstration)
SAMPLE_MMLU_QUESTIONS = [
    {
        "question": "What is the powerhouse of the cell?",
        "choices": ["Nucleus", "Mitochondria", "Ribosome", "Golgi apparatus"],
        "answer": 1,
        "subject": "biology",
    },
    {
        "question": "Which data structure uses LIFO ordering?",
        "choices": ["Queue", "Stack", "Heap", "Linked List"],
        "answer": 1,
        "subject": "computer_science",
    },
    {
        "question": "What is the derivative of sin(x)?",
        "choices": ["-cos(x)", "cos(x)", "tan(x)", "-sin(x)"],
        "answer": 1,
        "subject": "mathematics",
    },
    {
        "question": "Newton's second law states that F equals:",
        "choices": ["mv", "ma", "mg", "mv^2"],
        "answer": 1,
        "subject": "physics",
    },
    {
        "question": "Which amendment guarantees freedom of speech in the US?",
        "choices": ["Second", "First", "Fourth", "Fifth"],
        "answer": 1,
        "subject": "law",
    },
    {
        "question": "What is the chemical formula for water?",
        "choices": ["CO2", "H2O", "NaCl", "O2"],
        "answer": 1,
        "subject": "chemistry",
    },
    {
        "question": "The GDP of a country measures:",
        "choices": [
            "Total population",
            "Total value of goods and services produced",
            "Total land area",
            "Total government spending only",
        ],
        "answer": 1,
        "subject": "economics",
    },
    {
        "question": "In Python, which keyword is used to define a function?",
        "choices": ["func", "define", "def", "function"],
        "answer": 2,
        "subject": "computer_science",
    },
]


def format_mmlu_prompt(
    question: str,
    choices: List[str],
    n_shot_examples: List[dict] | None = None,
) -> str:
    """Format a multiple-choice question in MMLU style.

    Args:
        question: The question text.
        choices: List of answer choices.
        n_shot_examples: Optional few-shot examples.

    Returns:
        Formatted prompt string.
    """
    prompt = ""
    if n_shot_examples:
        for ex in n_shot_examples:
            prompt += f"Question: {ex['question']}\n"
            for i, c in enumerate(ex["choices"]):
                prompt += f"{'ABCD'[i]}. {c}\n"
            prompt += f"Answer: {'ABCD'[ex['answer']]}\n\n"

    prompt += f"Question: {question}\n"
    for i, choice in enumerate(choices):
        prompt += f"{'ABCD'[i]}. {choice}\n"
    prompt += "Answer:"
    return prompt
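
For the first sample question, format_mmlu_prompt produces the following zero-shot prompt; the evaluation below scores the model on which answer letter it assigns the highest probability to after "Answer:":

# format_mmlu_prompt(SAMPLE_MMLU_QUESTIONS[0]["question"], SAMPLE_MMLU_QUESTIONS[0]["choices"]) returns:
#
# Question: What is the powerhouse of the cell?
# A. Nucleus
# B. Mitochondria
# C. Ribosome
# D. Golgi apparatus
# Answer: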


@torch.no_grad()
def evaluate_mmlu_logprob(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    questions: List[dict],
    n_shot: int = 0,
) -> BenchmarkResult:
    """Evaluate model on MMLU-style questions using log-probabilities.

    For each question, computes the log-probability of each answer
    token (A, B, C, D) given the prompt and selects the highest.

    Args:
        model: The language model.
        tokenizer: The tokenizer.
        questions: List of question dictionaries.
        n_shot: Number of few-shot examples to include.

    Returns:
        BenchmarkResult with accuracy score.
    """
    correct = 0
    total = 0
    per_subject: Dict[str, List[bool]] = {}
    answer_tokens = [
        tokenizer.encode(" A", add_special_tokens=False)[-1],
        tokenizer.encode(" B", add_special_tokens=False)[-1],
        tokenizer.encode(" C", add_special_tokens=False)[-1],
        tokenizer.encode(" D", add_special_tokens=False)[-1],
    ]

    t0 = time.time()

    for q in questions:
        # Use other questions as few-shot examples
        few_shot = [x for x in questions if x != q][:n_shot] if n_shot > 0 else None

        prompt = format_mmlu_prompt(q["question"], q["choices"], few_shot)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model(**inputs)

        # Get logits at the last position
        last_logits = outputs.logits[0, -1, :]
        answer_logits = torch.tensor([last_logits[t].item() for t in answer_tokens])
        predicted = answer_logits.argmax().item()

        is_correct = predicted == q["answer"]
        correct += int(is_correct)
        total += 1

        subject = q.get("subject", "unknown")
        if subject not in per_subject:
            per_subject[subject] = []
        per_subject[subject].append(is_correct)

    elapsed = time.time() - t0
    accuracy = correct / total if total > 0 else 0.0

    subject_scores = {
        subj: sum(results) / len(results)
        for subj, results in per_subject.items()
    }

    return BenchmarkResult(
        model_name=str(getattr(model, "name_or_path", "unknown")),
        task_name="MMLU-style",
        score=accuracy,
        metric_name="accuracy",
        n_examples=total,
        details={"per_subject": subject_scores},
        elapsed_seconds=elapsed,
    )


# Example usage (with a small model for demonstration):
# model, tokenizer = load_model("gpt2")
# result = evaluate_mmlu_logprob(model, tokenizer, SAMPLE_MMLU_QUESTIONS, n_shot=2)
# print(f"MMLU accuracy: {result.score:.2%}")
# print(f"Per-subject: {result.details['per_subject']}")

Part 3: Reasoning Evaluation (GSM8K-Style)

"""GSM8K-style math reasoning evaluation.

Evaluates models on multi-step math word problems,
extracting the final numerical answer.
"""

import re
from typing import List, Optional

torch.manual_seed(42)


SAMPLE_GSM8K_QUESTIONS = [
    {
        "question": (
            "Janet has 3 cats. Each cat eats 2 cans of food per day. "
            "How many cans does Janet need for 7 days?"
        ),
        "answer": 42,
    },
    {
        "question": (
            "A store sells apples for $2 each and oranges for $3 each. "
            "Tom buys 5 apples and 4 oranges. How much does he spend?"
        ),
        "answer": 22,
    },
    {
        "question": (
            "A train travels at 60 mph. How far does it travel in 2.5 hours?"
        ),
        "answer": 150,
    },
    {
        "question": (
            "Maria has 48 cookies. She gives 1/3 to her brother and "
            "1/4 of the remainder to her sister. How many cookies does Maria have left?"
        ),
        "answer": 24,
    },
    {
        "question": (
            "A rectangle has a length of 12 cm and a width of 8 cm. "
            "What is its area in square centimeters?"
        ),
        "answer": 96,
    },
]


def extract_number(text: str) -> Optional[float]:
    """Extract the last number from a text string.

    Handles integers, decimals, and negative numbers.
    Looks for the pattern '#### <number>' first (GSM8K format),
    then falls back to the last number in the text.

    Args:
        text: The model's generated text.

    Returns:
        Extracted number, or None if no number found.
    """
    # Try GSM8K-style answer format first
    match = re.search(r"####\s*([-+]?\d*\.?\d+)", text)
    if match:
        return float(match.group(1))

    # Fall back to last number in text
    numbers = re.findall(r"[-+]?\d*\.?\d+", text)
    if numbers:
        return float(numbers[-1])
    return None
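
A few quick checks of the extraction logic (the strings are illustrative, not actual model outputs):

# extract_number("Each cat eats 2 cans per day, so the total is #### 42")  -> 42.0
# extract_number("So Tom spends 10 + 12 = 22 dollars in total.")           -> 22.0
# extract_number("I am not sure.")                                         -> None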


def format_gsm8k_prompt(
    question: str, use_cot: bool = True
) -> str:
    """Format a math word problem with optional chain-of-thought.

    Args:
        question: The math word problem.
        use_cot: Whether to encourage step-by-step reasoning.

    Returns:
        Formatted prompt string.
    """
    if use_cot:
        return (
            f"Question: {question}\n"
            f"Let's solve this step by step:\n"
        )
    else:
        return (
            f"Question: {question}\n"
            f"Answer (number only): "
        )


@torch.no_grad()
def evaluate_gsm8k(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    questions: List[dict],
    use_cot: bool = True,
    max_new_tokens: int = 256,
) -> BenchmarkResult:
    """Evaluate model on GSM8K-style math problems.

    Args:
        model: The language model.
        tokenizer: The tokenizer.
        questions: List of question dictionaries with 'question' and 'answer'.
        use_cot: Whether to use chain-of-thought prompting.
        max_new_tokens: Maximum tokens to generate.

    Returns:
        BenchmarkResult with exact-match accuracy.
    """
    correct = 0
    total = 0
    examples = []
    t0 = time.time()

    for q in questions:
        prompt = format_gsm8k_prompt(q["question"], use_cot=use_cot)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # Greedy decoding; temperature is omitted because it is ignored when do_sample=False
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
        )

        generated = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )

        predicted = extract_number(generated)
        expected = q["answer"]
        is_correct = predicted is not None and abs(predicted - expected) < 1e-3

        correct += int(is_correct)
        total += 1
        examples.append({
            "question": q["question"],
            "expected": expected,
            "predicted": predicted,
            "correct": is_correct,
            "generation": generated[:200],
        })

    elapsed = time.time() - t0
    accuracy = correct / total if total > 0 else 0.0

    return BenchmarkResult(
        model_name=str(getattr(model, "name_or_path", "unknown")),
        task_name="GSM8K-style",
        score=accuracy,
        metric_name="exact_match",
        n_examples=total,
        details={"examples": examples},
        elapsed_seconds=elapsed,
    )
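
As in Part 2, a commented usage example (gpt2 is only a convenient small model here and will score poorly on multi-step arithmetic):

# Example usage:
# model, tokenizer = load_model("gpt2")
# result = evaluate_gsm8k(model, tokenizer, SAMPLE_GSM8K_QUESTIONS, use_cot=True)
# print(f"GSM8K accuracy: {result.score:.2%}")
# for ex in result.details["examples"]:
#     print(ex["expected"], ex["predicted"], ex["correct"])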

Part 4: Code Generation Evaluation (HumanEval-Style)

"""HumanEval-style code generation evaluation.

Evaluates models on Python coding tasks using pass@k metrics.
Uses a safe execution sandbox to run generated code.
"""

import contextlib
import io
import signal
import traceback
from math import comb
from typing import List

torch.manual_seed(42)


SAMPLE_CODE_PROBLEMS = [
    {
        "prompt": 'def fibonacci(n: int) -> int:\n    """Return the nth Fibonacci number (0-indexed)."""\n',
        "test": "assert fibonacci(0) == 0\nassert fibonacci(1) == 1\nassert fibonacci(5) == 5\nassert fibonacci(10) == 55\n",
        "entry_point": "fibonacci",
    },
    {
        "prompt": 'def is_palindrome(s: str) -> bool:\n    """Check if a string is a palindrome, ignoring case and non-alphanumeric characters."""\n',
        "test": 'assert is_palindrome("racecar") == True\nassert is_palindrome("hello") == False\nassert is_palindrome("A man a plan a canal Panama") == True\n',
        "entry_point": "is_palindrome",
    },
    {
        "prompt": 'def flatten(lst: list) -> list:\n    """Flatten a nested list of arbitrary depth into a single flat list."""\n',
        "test": "assert flatten([1, [2, 3], [4, [5, 6]]]) == [1, 2, 3, 4, 5, 6]\nassert flatten([]) == []\nassert flatten([1, 2, 3]) == [1, 2, 3]\n",
        "entry_point": "flatten",
    },
    {
        "prompt": 'def two_sum(nums: list, target: int) -> list:\n    """Return indices of two numbers in nums that add up to target."""\n',
        "test": "result = two_sum([2, 7, 11, 15], 9)\nassert sorted(result) == [0, 1]\nresult = two_sum([3, 2, 4], 6)\nassert sorted(result) == [1, 2]\n",
        "entry_point": "two_sum",
    },
]


def safe_execute(code: str, timeout: int = 5) -> Tuple[bool, str]:
    """Execute code in a restricted environment with a timeout.

    The timeout is enforced with signal.alarm, which only works on Unix
    and in the main thread; on other platforms the code runs unbounded.

    Args:
        code: Python code to execute.
        timeout: Maximum execution time in seconds.

    Returns:
        Tuple of (success, error_message).
    """
    def _timeout_handler(signum, frame):
        raise TimeoutError(f"execution exceeded {timeout} seconds")

    use_alarm = hasattr(signal, "SIGALRM")
    if use_alarm:
        old_handler = signal.signal(signal.SIGALRM, _timeout_handler)
        signal.alarm(timeout)
    try:
        stdout_capture = io.StringIO()
        exec_globals: Dict[str, Any] = {}
        with contextlib.redirect_stdout(stdout_capture):
            exec(code, exec_globals)
        return True, ""
    except AssertionError as e:
        return False, f"AssertionError: {e}"
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"
    finally:
        if use_alarm:
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old_handler)


def compute_pass_at_k(n: int, c: int, k: int) -> float:
    """Compute pass@k metric.

    Args:
        n: Total number of samples.
        c: Number of correct samples.
        k: Number of samples to consider.

    Returns:
        pass@k probability.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
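
pass@k estimates the probability that at least one of k samples chosen (without replacement) from the n generated samples passes the tests, hence the complement term comb(n - c, k) / comb(n, k). A worked example with n = 5 samples of which c = 2 pass:

# compute_pass_at_k(5, 2, 1)  -> 1 - comb(3, 1) / comb(5, 1) = 1 - 3/5  = 0.4
# compute_pass_at_k(5, 2, 3)  -> 1 - comb(3, 3) / comb(5, 3) = 1 - 1/10 = 0.9
# compute_pass_at_k(5, 2, 5)  -> 1.0  (n - c < k, so every draw of 5 includes a passing sample)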


@torch.no_grad()
def evaluate_code_generation(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    problems: List[dict],
    n_samples: int = 5,
    max_new_tokens: int = 256,
    temperature: float = 0.8,
) -> BenchmarkResult:
    """Evaluate model on code generation tasks with pass@k.

    Args:
        model: The language model.
        tokenizer: The tokenizer.
        problems: List of problem dictionaries.
        n_samples: Number of code samples to generate per problem.
        max_new_tokens: Maximum tokens to generate.
        temperature: Sampling temperature.

    Returns:
        BenchmarkResult with pass@1 score.
    """
    all_results = []
    t0 = time.time()

    for problem in problems:
        n_correct = 0
        samples = []

        for sample_idx in range(n_samples):
            inputs = tokenizer(
                problem["prompt"], return_tensors="pt"
            ).to(model.device)

            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature if temperature > 0 else 1.0,
                do_sample=temperature > 0,
                top_p=0.95,
                pad_token_id=tokenizer.pad_token_id,
            )

            generated = tokenizer.decode(
                outputs[0][inputs["input_ids"].shape[1]:],
                skip_special_tokens=True,
            )

            # Stop at function boundary
            lines = generated.split("\n")
            code_lines = []
            for line in lines:
                if line.strip().startswith("def ") and code_lines:
                    break
                code_lines.append(line)
            generated_code = "\n".join(code_lines)

            # Combine prompt + generated code + test
            full_code = problem["prompt"] + generated_code + "\n" + problem["test"]
            success, error = safe_execute(full_code)
            n_correct += int(success)
            samples.append({
                "code": generated_code[:300],
                "success": success,
                "error": error if not success else "",
            })

        pass_at_1 = compute_pass_at_k(n_samples, n_correct, k=1)
        # When n_samples < 5 this is effectively pass@n_samples rather than pass@5
        pass_at_5 = compute_pass_at_k(n_samples, n_correct, k=min(5, n_samples))

        all_results.append({
            "entry_point": problem["entry_point"],
            "n_correct": n_correct,
            "n_samples": n_samples,
            "pass_at_1": pass_at_1,
            "pass_at_5": pass_at_5,
            "samples": samples,
        })

    elapsed = time.time() - t0
    avg_pass_at_1 = sum(r["pass_at_1"] for r in all_results) / len(all_results)

    return BenchmarkResult(
        model_name=str(getattr(model, "name_or_path", "unknown")),
        task_name="HumanEval-style",
        score=avg_pass_at_1,
        metric_name="pass@1",
        n_examples=len(problems),
        details={"per_problem": all_results},
        elapsed_seconds=elapsed,
    )
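
And a commented usage example for the code benchmark (again with gpt2 purely as a small placeholder model):

# Example usage:
# model, tokenizer = load_model("gpt2")
# result = evaluate_code_generation(model, tokenizer, SAMPLE_CODE_PROBLEMS, n_samples=3)
# print(f"Code pass@1: {result.score:.2%}")
# for problem in result.details["per_problem"]:
#     print(problem["entry_point"], problem["n_correct"], "/", problem["n_samples"])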

Part 5: Comparative Analysis and Visualization

"""Comparative analysis and visualization of benchmark results.

Aggregates results across models and tasks, produces comparison
charts, and analyzes prompt sensitivity.
"""

import json
from typing import Dict, List

import torch

torch.manual_seed(42)


def run_full_evaluation(
    model_names: List[str],
) -> Dict[str, List[BenchmarkResult]]:
    """Run the full benchmark suite on multiple models.

    Args:
        model_names: List of HuggingFace model identifiers.

    Returns:
        Dictionary mapping model names to lists of BenchmarkResults.
    """
    all_results = {}

    for model_name in model_names:
        print(f"\n{'='*60}")
        print(f"Evaluating: {model_name}")
        print(f"{'='*60}")

        try:
            model, tokenizer = load_model(model_name)
        except Exception as e:
            print(f"  Failed to load: {e}")
            continue

        results = []

        # MMLU-style evaluation
        print("\n  Running MMLU-style evaluation...")
        mmlu_result = evaluate_mmlu_logprob(
            model, tokenizer, SAMPLE_MMLU_QUESTIONS, n_shot=2
        )
        results.append(mmlu_result)
        print(f"  MMLU accuracy: {mmlu_result.score:.2%}")

        # GSM8K-style evaluation (with and without CoT)
        print("\n  Running GSM8K-style evaluation (with CoT)...")
        gsm_result_cot = evaluate_gsm8k(
            model, tokenizer, SAMPLE_GSM8K_QUESTIONS, use_cot=True
        )
        gsm_result_cot.task_name = "GSM8K-CoT"
        results.append(gsm_result_cot)
        print(f"  GSM8K (CoT) accuracy: {gsm_result_cot.score:.2%}")

        print("\n  Running GSM8K-style evaluation (direct)...")
        gsm_result_direct = evaluate_gsm8k(
            model, tokenizer, SAMPLE_GSM8K_QUESTIONS, use_cot=False
        )
        gsm_result_direct.task_name = "GSM8K-Direct"
        results.append(gsm_result_direct)
        print(f"  GSM8K (Direct) accuracy: {gsm_result_direct.score:.2%}")

        # Code generation evaluation
        print("\n  Running code generation evaluation...")
        code_result = evaluate_code_generation(
            model, tokenizer, SAMPLE_CODE_PROBLEMS, n_samples=3
        )
        results.append(code_result)
        print(f"  Code pass@1: {code_result.score:.2%}")

        all_results[model_name] = results

        # Clean up GPU memory
        del model, tokenizer
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    return all_results


def print_comparison_table(all_results: Dict[str, List[BenchmarkResult]]) -> None:
    """Print a formatted comparison table.

    Args:
        all_results: Dictionary mapping model names to benchmark results.
    """
    tasks = ["MMLU-style", "GSM8K-CoT", "GSM8K-Direct", "HumanEval-style"]

    print(f"\n{'Model':<30} ", end="")
    for task in tasks:
        print(f"| {task:<18} ", end="")
    print()
    print("-" * (30 + 21 * len(tasks)))

    for model_name, results in all_results.items():
        short_name = model_name.split("/")[-1][:28]
        print(f"{short_name:<30} ", end="")
        for task in tasks:
            matching = [r for r in results if r.task_name == task]
            if matching:
                score = matching[0].score
                print(f"| {score:>6.1%}             ", end="")
            else:
                print(f"| {'N/A':>6}             ", end="")
        print()


def analyze_prompt_sensitivity(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
) -> Dict[str, float]:
    """Analyze how prompt format affects MMLU scores.

    Evaluates the same questions under three few-shot settings (0-, 2-, and 5-shot).

    Args:
        model: The language model.
        tokenizer: The tokenizer.

    Returns:
        Dictionary mapping prompt settings to accuracy scores.
    """
    results = {}

    # Zero-shot: question and choices only, no worked examples
    r1 = evaluate_mmlu_logprob(model, tokenizer, SAMPLE_MMLU_QUESTIONS, n_shot=0)
    results["zero-shot"] = r1.score

    # 2-shot: two other questions prepended as worked examples
    r2 = evaluate_mmlu_logprob(model, tokenizer, SAMPLE_MMLU_QUESTIONS, n_shot=2)
    results["2-shot"] = r2.score

    # 5-shot: five of the remaining questions prepended as worked examples
    r3 = evaluate_mmlu_logprob(model, tokenizer, SAMPLE_MMLU_QUESTIONS, n_shot=5)
    results["5-shot"] = r3.score

    print("\nPrompt Sensitivity Analysis:")
    for fmt, score in results.items():
        print(f"  {fmt:<15}: {score:.2%}")

    return results
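
The json module is imported above but the results are otherwise only printed; a small helper for persisting aggregate scores to disk might look like the following (a sketch; save_results and its default filename are not part of the original pipeline):

def save_results(
    all_results: Dict[str, List[BenchmarkResult]],
    path: str = "benchmark_results.json",
) -> None:
    """Write aggregate benchmark scores to a JSON file."""
    serializable = {
        model_name: [
            {
                "task_name": r.task_name,
                "score": r.score,
                "metric_name": r.metric_name,
                "n_examples": r.n_examples,
                "elapsed_seconds": r.elapsed_seconds,
            }
            for r in results
        ]
        for model_name, results in all_results.items()
    }
    with open(path, "w") as f:
        json.dump(serializable, f, indent=2)
    print(f"Results saved to {path}")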


# Example usage:
# models_to_evaluate = ["gpt2", "gpt2-medium", "gpt2-large"]
# all_results = run_full_evaluation(models_to_evaluate)
# print_comparison_table(all_results)

Part 6: Visualization

"""Visualization of benchmark comparison results."""

try:
    import matplotlib.pyplot as plt
    import numpy as np

    def plot_model_comparison(
        all_results: Dict[str, List[BenchmarkResult]],
        output_path: str = "benchmark_comparison.png",
    ) -> None:
        """Create a radar chart comparing models across benchmarks.

        Args:
            all_results: Dictionary mapping model names to benchmark results.
            output_path: Path to save the figure.
        """
        tasks = ["MMLU-style", "GSM8K-CoT", "GSM8K-Direct", "HumanEval-style"]
        models = list(all_results.keys())
        n_tasks = len(tasks)

        fig, axes = plt.subplots(1, 2, figsize=(16, 6))

        # Bar chart comparison
        ax = axes[0]
        x = np.arange(n_tasks)
        width = 0.8 / len(models)
        colors = ["#4C72B0", "#DD8452", "#55A868", "#C44E52", "#8172B3"]

        for i, model_name in enumerate(models):
            scores = []
            for task in tasks:
                matching = [r for r in all_results[model_name] if r.task_name == task]
                scores.append(matching[0].score if matching else 0.0)
            short_name = model_name.split("/")[-1]
            ax.bar(
                x + i * width - (len(models) - 1) * width / 2,
                scores,
                width,
                label=short_name,
                color=colors[i % len(colors)],
            )

        ax.set_xlabel("Benchmark")
        ax.set_ylabel("Score")
        ax.set_title("Model Comparison Across Benchmarks")
        ax.set_xticks(x)
        ax.set_xticklabels([t.replace("-style", "") for t in tasks], rotation=15)
        ax.legend(loc="upper right")
        ax.set_ylim(0, 1.0)
        ax.grid(axis="y", alpha=0.3)

        # CoT vs Direct comparison
        ax = axes[1]
        for i, model_name in enumerate(models):
            cot_results = [r for r in all_results[model_name] if r.task_name == "GSM8K-CoT"]
            direct_results = [r for r in all_results[model_name] if r.task_name == "GSM8K-Direct"]
            if cot_results and direct_results:
                short_name = model_name.split("/")[-1]
                ax.scatter(
                    direct_results[0].score,
                    cot_results[0].score,
                    s=100,
                    color=colors[i % len(colors)],
                    label=short_name,
                    zorder=5,
                )

        ax.plot([0, 1], [0, 1], "k--", alpha=0.3, label="Equal performance")
        ax.set_xlabel("GSM8K Direct Answer Accuracy")
        ax.set_ylabel("GSM8K Chain-of-Thought Accuracy")
        ax.set_title("Effect of Chain-of-Thought Prompting")
        ax.legend()
        ax.set_xlim(-0.05, 1.05)
        ax.set_ylim(-0.05, 1.05)
        ax.grid(True, alpha=0.3)

        plt.tight_layout()
        plt.savefig(output_path, dpi=150, bbox_inches="tight")
        plt.show()
        print(f"Plot saved to {output_path}")

except ImportError:
    print("matplotlib not available; skipping visualization functions")

Discussion Questions

  1. Prompt sensitivity: You observed that the same model can score very differently depending on the prompt format (zero-shot vs. few-shot, CoT vs. direct). What does this imply for the reliability of benchmark rankings?

  2. Contamination: How would you determine whether a model has been trained on your evaluation questions? Propose a concrete experiment.

  3. Scale and capability: If you evaluated models across a range of sizes (e.g., GPT-2 117M, 345M, 774M, 1.5B), would you expect the improvements to be uniform across all tasks? Which tasks might show the largest gains from scale?

  4. pass@k interpretation: A model achieves pass@1 = 0.30 and pass@5 = 0.65 on code generation. What does this gap suggest about the model's code generation capability, and how should a user deploy it in practice?

  5. Beyond benchmarks: Benchmark scores do not always correlate with real-world utility. Describe a scenario where a model with lower MMLU scores might be preferred over one with higher scores.

  6. Evaluation cost: Running full evaluations on large models is expensive. Propose a strategy for efficiently comparing 20 different models without running the complete benchmark suite on each.