Case Study 35.1: MITRE ATT&CK Evaluations — How Vendors Perform Against Real APT Techniques
Background
In 2018, MITRE launched the ATT&CK Evaluations program, an initiative to evaluate cybersecurity products against the real-world techniques of specific threat actors. Unlike traditional product testing that measures detection against malware signatures or synthetic tests, the ATT&CK Evaluations replicate actual adversary behavior step by step and measure whether security products can detect each technique.
The program was created to address a persistent problem in the cybersecurity industry: the gap between vendor marketing claims and actual detection capabilities. Vendors routinely claimed "99% detection rates" or "comprehensive coverage" against advanced threats, but these claims were difficult to verify independently. The ATT&CK Evaluations provided transparency by using a standardized, publicly documented methodology and publishing detailed results for each participating vendor.
The Evaluation Methodology
Emulation-Based Testing
Each ATT&CK Evaluation round focuses on a specific threat actor and replicates their known tactics, techniques, and procedures. The evaluation team at MITRE studies the threat actor's published campaigns, develops an emulation plan, and executes the techniques against systems protected by each participating vendor's product.
Round 1 (2018): APT3 (Gothic Panda)
- Chinese cyber espionage group
- Known for exploiting vulnerabilities in internet-facing applications
- Focused on technology and aerospace sectors
- 10 vendors participated in the initial round

Round 2 (2020): APT29 (Cozy Bear)
- Russian intelligence service (SVR)
- Sophisticated tradecraft with an emphasis on stealth
- Two-day evaluation simulating a targeted intrusion: Day 1 covered initial compromise, discovery, privilege escalation, and credential access; Day 2 covered lateral movement, collection, and exfiltration
- 21 vendors participated

Round 3 (2021): Carbanak + FIN7
- Financial crime groups targeting banks and retail
- Blended espionage techniques with financial theft objectives
- Extended evaluation that introduced Linux coverage
- 30 vendors participated

Round 4 (2022): Wizard Spider + Sandworm
- Ransomware operations (Wizard Spider/Ryuk/Conti) and destructive attacks (Sandworm/NotPetya)
- Most complex evaluation to date
- Included data destruction and ransomware scenarios
- 30 vendors participated

Round 5 (2023): Turla
- Russian FSB-linked group known for sophisticated malware, including the Carbon and Snake backdoors
- Complex multi-stage attack chain
- Included a managed security services evaluation
Detection Categories
MITRE categorizes detections into several types, providing nuanced insight into how products detect threats:
- None: No detection of the technique
- Telemetry: The product collected relevant data but did not generate an alert
- General: The product identified suspicious or malicious activity without specifying the ATT&CK technique
- Tactic: The product identified the tactical goal (e.g., "credential access" without specifying the technique)
- Technique: The product correctly identified the specific ATT&CK technique used
This categorization is crucial because it distinguishes between products that merely collect logs and products that can identify what an adversary is actually doing. A product that generates telemetry for a technique but does not alert on it requires a human analyst to notice the suspicious activity -- which may or may not happen in a real SOC environment.
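The category hierarchy above can be tallied into simple coverage metrics that separate "the data exists somewhere" from "the product actually told the analyst what happened." A minimal sketch, assuming a hypothetical per-technique results mapping (the technique IDs and outcomes below are illustrative, not actual evaluation data):

```python
from collections import Counter

# Hypothetical outcomes for one product, one per technique tested.
# Categories follow the MITRE hierarchy: None < Telemetry < General < Tactic < Technique.
results = {
    "T1059.001": "Technique",   # PowerShell execution correctly identified
    "T1003.001": "Tactic",      # flagged as "credential access", technique not named
    "T1021.002": "Telemetry",   # SMB session logged, but no alert raised
    "T1071.001": "General",     # generic "suspicious activity" alert
    "T1055":     "None",        # process injection missed entirely
}

counts = Counter(results.values())
total = len(results)

# "Visibility": any data collected at all; "analytic coverage": an alert of
# any kind; "technique coverage": the strictest bar, the technique was named.
visibility = sum(v for k, v in counts.items() if k != "None") / total
analytic   = sum(v for k, v in counts.items() if k in ("General", "Tactic", "Technique")) / total
technique  = counts["Technique"] / total

print(f"visibility {visibility:.0%}, analytic {analytic:.0%}, technique {technique:.0%}")
```

The three numbers diverge by design: a product can score high on visibility while scoring low on technique-level detection, which is exactly the visibility gap discussed below.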
Key Findings Across Evaluations
The Visibility Gap
One of the most striking findings across all evaluation rounds is the significant gap between what products can theoretically detect and what they actually alert on. Many products collect relevant telemetry (log data that contains evidence of the technique) but fail to generate alerts or identify the specific technique being used.
In the APT29 evaluation, for example, some products collected telemetry for 90% or more of techniques but only generated technique-level detections for 30-40% of them. This means that a SOC analyst would need to manually sift through enormous amounts of telemetry to identify the attack, essentially requiring the analyst to do the detection work that the product should be doing.
Configuration Matters
The evaluations revealed that product configuration dramatically affects detection capabilities. Vendors were allowed to configure their products with "detection-focused" settings for the evaluation, which may differ from typical customer deployments. This raises an important question: Are customers getting the detection coverage they think they are paying for?
Products configured with aggressive detection rules detected more techniques but also generated more noise. Products configured conservatively missed more techniques but were quieter. This tradeoff between sensitivity and specificity is a fundamental challenge in security operations.
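The tradeoff can be made concrete with base-rate arithmetic: even a tiny false-positive rate multiplied by millions of benign events produces a steady stream of noise. The following sketch uses entirely invented numbers (event volume, detection rates, and the two profiles are assumptions for illustration):

```python
# Illustrative comparison of an aggressive vs. a conservative detection profile.
EVENTS_PER_DAY = 5_000_000   # benign events observed daily (assumed)
TRUE_TECHNIQUES = 20         # adversary techniques executed in one intrusion (assumed)

profiles = {
    # name:         (true positive rate, false positive rate per benign event)
    "aggressive":   (0.90, 2e-6),
    "conservative": (0.60, 1e-7),
}

summary = {}
for name, (tpr, fpr) in profiles.items():
    caught = tpr * TRUE_TECHNIQUES        # expected techniques detected
    noise = fpr * EVENTS_PER_DAY          # expected false alerts per day
    summary[name] = (caught, noise)
    print(f"{name:12s} detects ~{caught:.0f}/{TRUE_TECHNIQUES} techniques, "
          f"~{noise:.1f} false alerts/day")
```

With these assumed rates, the aggressive profile catches more techniques but generates an order of magnitude more false alerts per day, which is the sensitivity/specificity tension in miniature.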
Linux Coverage Gaps
The Carbanak + FIN7 evaluation introduced Linux testing, revealing significant gaps in Linux detection capabilities across the industry. Many products that performed well on Windows had substantially reduced coverage on Linux. Given the prevalence of Linux in server environments, cloud infrastructure, and container workloads, this gap is concerning.
Analytics vs. Signatures
The evaluations consistently showed that detection based on behavioral analytics (identifying suspicious patterns of activity) outperformed signature-based detection (matching known malware signatures). Adversaries in the evaluations used custom or modified tools that evaded signature detection, but their behavioral patterns -- the techniques themselves -- remained detectable by products with strong analytics capabilities.
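The distinction can be illustrated with a toy behavioral analytic: flagging an office application spawning a shell, a pattern that survives tool renaming and recompilation, where a hash-based signature would not. This is a minimal sketch; the process names, event format, and rule are illustrative assumptions, not any vendor's actual logic:

```python
# Toy behavioral analytic: alert on the behavior (office app -> shell),
# not on any file hash or tool name.
SUSPICIOUS_PARENTS = {"winword.exe", "excel.exe", "outlook.exe"}
SHELL_CHILDREN = {"powershell.exe", "cmd.exe", "wscript.exe"}

def flag_behavior(event: dict) -> bool:
    """Return True when an office process spawns a scripting shell."""
    return (event["parent"].lower() in SUSPICIOUS_PARENTS
            and event["child"].lower() in SHELL_CHILDREN)

events = [
    {"parent": "explorer.exe", "child": "chrome.exe"},       # benign
    {"parent": "WINWORD.EXE",  "child": "powershell.exe"},   # macro launching a stager
    {"parent": "winword.exe",  "child": "custom_tool.exe"},  # custom binary: this rule misses it
]

alerts = [e for e in events if flag_behavior(e)]
print(len(alerts))  # the macro->PowerShell chain is caught with no signature match
```

Note the third event: a custom binary slips past this particular rule, which is why real behavioral coverage layers many such analytics (command-line arguments, API call patterns, network behavior) rather than relying on any single one.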
Impact on the Industry
Transparency Revolution
The ATT&CK Evaluations forced a level of transparency previously unseen in the cybersecurity industry. Vendors could no longer make vague claims about detection capabilities. Instead, their performance against specific, documented techniques was publicly available for anyone to review.
Some vendors initially resisted the evaluations or published their own interpretations of results. Over time, however, participation became essential for market credibility. By Round 4, 30 vendors participated, including virtually every major EDR vendor.
Detection Engineering
The evaluations catalyzed the "detection engineering" discipline. Organizations began systematically evaluating their detection coverage against ATT&CK techniques, rather than relying solely on vendor assurances. Purple team exercises (Section 35.6) became standard practice for validating and improving detection coverage.
Procurement Decisions
Security teams began using ATT&CK Evaluation results as a factor in product procurement. While MITRE explicitly states that the evaluations are not rankings, the detailed per-technique results allow organizations to compare products against the specific techniques they are most concerned about based on their threat model.
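A threat-model-driven comparison might look like the following sketch, which scores each product only against the techniques an organization actually cares about. All product names, technique selections, and results are hypothetical:

```python
# Score products against an organization's own priority techniques,
# not against the full evaluation set. Data is entirely hypothetical.
threat_model = {"T1566.001", "T1059.001", "T1003.001", "T1486"}

product_results = {
    "ProductA": {"T1566.001": "Technique", "T1059.001": "Telemetry",
                 "T1003.001": "Technique", "T1486": "None",
                 "T1021.002": "Technique"},
    "ProductB": {"T1566.001": "Tactic", "T1059.001": "Technique",
                 "T1003.001": "Technique", "T1486": "Technique",
                 "T1021.002": "None"},
}

ALERTING = {"General", "Tactic", "Technique"}  # categories that raise an alert

scores = {}
for product, results in product_results.items():
    relevant = [results.get(t, "None") for t in threat_model]
    scores[product] = sum(1 for r in relevant if r in ALERTING) / len(threat_model)
    print(f"{product}: alerts on {scores[product]:.0%} of threat-model techniques")
```

In this invented example, ProductA detects a technique (T1021.002) that is outside the threat model while missing one inside it, so its overall evaluation score and its threat-model-relevant score differ. That gap is why MITRE insists the results are not a ranking.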
Vendor Improvement
The competitive pressure from public evaluation results drove significant product improvements. Vendors invested in closing detection gaps identified by the evaluations, and subsequent rounds showed measurable improvement in average detection coverage across the industry.
Controversies and Limitations
"Not a Ranking"
MITRE emphasizes that the evaluations are not product rankings. Different organizations have different threat models, and the "best" product depends on the specific threats an organization faces, its operational maturity, and its tolerance for alert volume. However, the cybersecurity media and analyst community inevitably create rankings from the data, sometimes oversimplifying the results.
Test Conditions vs. Real World
The evaluations run in controlled laboratory conditions with products configured by vendor engineers. Real-world conditions involve noisy networks, resource-constrained systems, misconfigured products, and overwhelmed SOC teams. A product that performs well in the evaluation may underperform in practice if it requires extensive tuning or generates too many false positives.
Known Techniques
The evaluations use known, published techniques from documented threat actors. This tests detection of known adversary behavior but does not measure a product's ability to detect novel techniques or zero-day exploits. Real-world adversaries continuously evolve their tradecraft.
Coverage Does Not Equal Outcome
Detecting a technique does not guarantee a successful defensive outcome. Detection must be followed by investigation, containment, and remediation. A product that detects 90% of techniques but generates 10,000 alerts per day may produce worse outcomes than a product that detects 70% of techniques but presents clear, actionable alerts.
Discussion Questions
- Evaluation design: How should the ATT&CK Evaluations evolve to better reflect real-world conditions? What additional factors should be measured beyond technique detection?
- Vendor transparency: The evaluations forced unprecedented transparency in the security industry. What other areas of cybersecurity would benefit from similar independent evaluation programs?
- Detection vs. telemetry: Is telemetry without alerting valuable? In what situations would an organization rely on telemetry-only coverage, and when is automated alerting essential?
- Product selection: How should an organization use ATT&CK Evaluation results when selecting security products? What other factors should influence the decision?
- Red team implications: How should red teams use ATT&CK Evaluation results in their work? How can the public data inform engagement planning?
- Continuous validation: The evaluations occur periodically. How should organizations continuously validate their detection coverage between evaluation rounds?
Connections to Chapter Content
This case study directly connects to Section 35.3 (MITRE ATT&CK for Red Teams), demonstrating how the ATT&CK framework serves as both a planning tool for offense and a measurement tool for defense. It reinforces Section 35.6 (Purple Teaming) by showing how structured testing against ATT&CK techniques drives detection improvement. The evaluation methodology provides a model for the purple team exercises described throughout this chapter.