Chapter 7: DNA Analysis: How Genetic Evidence Revolutionized Criminal Investigation

DataField.Dev

39 min read

> "DNA evidence is the closest thing forensic science has to a truth serum — which is exactly why it deserves the most skepticism, not the least."

Prerequisites

1
3
4
5
6

Learning Objectives

Explain why DNA analysis is the one forensic method built on a rigorous, quantified foundation, and what that distinction means for the validity spectrum.
Describe, at a working level, the biology that makes DNA typing possible: where DNA lives in the cell, why it is the same in every cell of one body, and what varies between people.
Walk through the standard STR workflow — extraction, PCR amplification, capillary electrophoresis, and the electropherogram — and state the failure mode of each step.
Read a DNA profile as a set of loci and alleles, and explain what CODIS does and does not do when it returns a hit.
Interpret the random match probability honestly — as a fraction and in words — and refute the prosecutor's fallacy at first sight.
State, precisely, what a DNA match cannot establish: not when the DNA was deposited, not how, and not that its source is guilty.

In This Chapter

Overview
Learning Paths
7.1 Why DNA changed everything
7.2 The biology you need: cells, chromosomes, and what varies between people
7.3 STR analysis: extraction → PCR → capillary electrophoresis → electropherogram
7.4 The DNA profile and CODIS
7.5 The random match probability: how rare is "rare"?
7.6 What DNA cannot tell you (presence ≠ guilt; the transfer problem)
🗂️ The Case File
Conclusion
Key Terms
Spaced Review

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 7: DNA Analysis: How Genetic Evidence Revolutionized Criminal Investigation

"DNA evidence is the closest thing forensic science has to a truth serum — which is exactly why it deserves the most skepticism, not the least." — constructed line, in the voice of this book; the point it makes is the chapter's argument [constructed teaching example]

Overview

For six chapters we have been taking methods apart and finding them weaker than advertised — bite marks that match no one, fingerprint examiners certain and wrong, fire investigators reading folklore off a burned floor. You could be forgiven for thinking this is a book about how forensic science fails. It is not. It is a book about how to tell the strong methods from the weak ones, and now we arrive at the strongest. DNA analysis is the one forensic discipline that earned its place in the courtroom the way chemistry and molecular biology earned theirs: by being built outside the courtroom, by population geneticists and molecular biologists with no stake in any verdict, and then validated, quantified, and stress-tested before it was ever asked to convict anyone. When DNA says "this profile would occur in roughly one person in a billion," that number is not a courtroom flourish. It is a measurement, with a method behind it that you can interrogate.

That is why DNA matters so much to this book's argument, and it cuts two ways. DNA is the yardstick — the thing every other method is measured against and mostly found wanting (the validity spectrum from Chapter 1 has DNA at the top for a reason). And DNA is the instrument that proved the other methods wrong: the more than three hundred and seventy-five post-conviction DNA exonerations in the United States (Chapter 6) are, every one of them, a case where genetic evidence overturned a conviction built on something less reliable — a hair, a bite mark, a confident eyewitness, a confession.

But here is the warning that runs under this whole chapter, and the reason its epigraph says what it does. The very power of DNA — its sensitivity, its statistical authority, the near-magical reputation it has earned — makes it the most dangerous evidence in the room when it is misunderstood. A clean number can hide a contaminated tube. A match can be reported as if it settled guilt when all it settled was the source of a few cells. And a jury that has learned from television that "the DNA matched" means "case closed" will convict on a result that the science, honestly stated, does not support. We are going to learn how DNA typing actually works — the biology, the bench, the statistics — precisely so that we can also learn its limits, because a method this strong is only safe in the hands of someone who knows exactly where it stops.

In this chapter, you will learn to:

Explain why DNA is the one forensic method with a rigorous, quantified scientific foundation, and what that buys you on the validity spectrum.
Describe the biology that makes typing possible — chromosomes, loci, and the repeating stretches of DNA that vary between people.
Walk the standard STR workflow from swab to profile, and name what can go wrong at each step.
Read a DNA profile and explain what CODIS does — and does not do — when it returns a hit.
Interpret the random match probability honestly, and recognize the prosecutor's fallacy before it is finished being spoken.
State plainly what a DNA match cannot tell you: not the time of deposit, not the means, and never the guilt of the source.

Learning Paths

This chapter is the spine of Part II and is cited in nearly every later chapter; everyone should read it closely. Here is where each path should lean:

🔎 Investigator/CSI: Your decisions at the scene set the ceiling on what DNA can do. Weight §7.2 (what carries DNA, and what destroys it) and §7.6 (the transfer problem) — collection errors you make in hour one cannot be fixed in the lab in week six. 🧪 Lab analyst: §7.3 is your bench, step by step, with every failure mode named; §7.5 is the number you will have to defend. Know both cold. ⚖️ Law/courtroom: §7.4 (what a CODIS hit legally is — a lead, not a verdict), §7.5 (the random match probability and the prosecutor's fallacy), and §7.6 (what the match does not prove) are where the cross-examination lives. 👥 General reader/juror: §7.1 and §7.6 are the antidote to "the DNA matched, so he did it." A match is a strong statement about a source. Whether that source is guilty is a different question the DNA cannot answer.

7.1 Why DNA changed everything

To understand why DNA analysis reset the standards of an entire field, you have to remember what the field looked like before it. In 1985, if a crime scene yielded a bloodstain, the best a forensic serologist could usually do was determine its ABO blood type and perhaps a handful of protein markers — narrowing the donor to, say, "a type-O secretor," a category that included tens of millions of people. That was class evidence (Chapter 1): it could exclude a suspect whose blood type differed, but it could never point to one person. The pattern-comparison disciplines — hair, bite marks, handwriting — claimed to point to one person but, as we now know, had no validated basis for doing so. The field had methods that were honest but weak, and methods that were strong-sounding but unvalidated. What it did not have was a method that could individualize and defend the claim with a number.

That changed in a British laboratory in 1984. A geneticist named Alec Jeffreys, studying inherited variation in human DNA at the University of Leicester, noticed that certain regions of the genome contained sequences that repeated over and over, and that the number of repeats varied enormously from person to person. When he developed a method to visualize these regions, the result was a pattern of bands — he called it a "DNA fingerprint" — that was different for every individual he tested except identical twins. Jeffreys had stumbled onto something no prior forensic method possessed: a feature of the human body that was both effectively unique to an individual and gridded onto a measurable, inheritable, statistical framework. You could not only say two samples matched; you could estimate, from population data, how often such a match would occur by chance.

The first time this was used in a criminal case, it did the thing this book keeps insisting is forensic science's truest power: it excluded the wrong man. We will study that case — the 1980s Leicestershire murders that led to the conviction of Colin Pitchfork — in detail in this chapter's first case study. For now, hold onto its shape, because it is the perfect parable for everything that follows. A teenager had confessed to one of the murders. The new DNA test proved he could not have committed either. The same test, applied later to thousands of local men, eventually identified the real killer. DNA's first forensic act was to free an innocent suspect, and only second to convict a guilty one. That order is not an accident. It is the asymmetry from Chapter 1, written into the technology itself.

🔬 At the Bench Jeffreys's original method — restriction fragment length polymorphism (RFLP) — is mostly of historical interest now, but it is worth knowing why it was replaced, because the reason is a recurring theme. RFLP needed a relatively large amount of fairly intact DNA — a coin-sized bloodstain, not a touched doorknob — and took weeks. The technique that replaced it in the 1990s, built on the polymerase chain reaction (§7.3), needs only a trace of DNA and can work with degraded material. Every gain in sensitivity, though, comes with a cost we will return to throughout Part II: the more easily a method detects tiny amounts of DNA, the more easily it detects DNA that has nothing to do with the crime. The history of forensic DNA is a history of getting more sensitive, and of slowly learning to be more careful in exact proportion.

Why does DNA sit alone at the top of the validity spectrum? Three reasons, and they are precisely the things the other pattern methods lack. First, a biological basis that is understood: we know why DNA varies between people (inheritance, mutation, recombination) and we can model that variation mathematically. Second, a quantified error structure: the random match probability is not a vibe, it is a calculation from measured population frequencies, and the laboratory error rate, while real and non-zero, has been studied. Third, objectivity at the core measurement: the size of a DNA fragment is read by an instrument against a standard, not judged by an examiner's eye the way a fingerprint or a toolmark is. None of this makes DNA infallible — the rest of this chapter and all of Chapters 8 and 9 are about where it can still go wrong — but it makes DNA the only forensic method whose central claim was validated before it was trusted, rather than after innocent people had already been convicted on it. That sequence is the whole difference, and it is why this chapter is the hinge of the book.

7.2 The biology you need: cells, chromosomes, and what varies between people

You do not need a molecular-biology degree to reason well about DNA evidence, but you need a working model of a few facts, because every limit and every strength of the method flows from them. Let us build that model from the ground up.

Deoxyribonucleic acid (DNA) is the molecule that carries the genetic instructions in nearly every cell of every living thing. In humans, almost every nucleated cell — white blood cells, skin cells, the cells in saliva, semen, the root of a hair — carries a complete copy of that person's DNA, and (this is the load-bearing fact) it is the same copy in every one of those cells. The DNA in a drop of your blood is identical to the DNA in a cheek swab is identical to the DNA in a skin cell you shed onto a doorknob. This is why a reference sample taken painlessly from a suspect's cheek can be compared to a bloodstain from a scene: the body does not keep different DNA in different tissues.

DNA is organized into structures called chromosomes. Humans have 46 of them, arranged in 23 pairs — one member of each pair inherited from each biological parent. Twenty-two of these pairs are the same in both sexes (the autosomes); the 23rd is the sex chromosomes (XX or XY), which forensic methods can use to determine the sex of a sample's source. The molecule itself is the famous double helix — two strands wound around each other, each strand a long sequence of four chemical "letters," the bases: adenine (A), thymine (T), guanine (G), and cytosine (C). The order of those letters is the genetic code.

Here is the fact that makes forensic DNA both possible and humane: more than 99.9% of the human genome is identical from one person to the next. We are, at the level of the code, almost entirely the same. Forensic DNA typing does not read your genes — it does not look at the parts that determine eye color, disease risk, or anything about who you are. Instead, it targets a small set of locations in the non-coding DNA — stretches between genes that do not, as far as we know, do anything — where the sequence happens to vary a great deal between people. This is a deliberate choice, and an ethical one: by deliberately looking only at "junk" regions, standard forensic typing reveals essentially nothing about your health, your traits, or your ancestry. It is built to answer one question — whose cells are these? — and to be blind to every other.

The specific kind of variation forensic typing exploits is the short tandem repeat, or STR: a short sequence of DNA bases (typically 2–6 letters long) that is repeated, back to back, a variable number of times. Picture the four-letter sequence "GATA" repeated in a row. One person might carry eleven copies of it at a particular spot on a chromosome; another might carry fourteen; another, nine. The spot on the chromosome is called a locus (plural loci) — a specific, named, physical address in the genome where forensic scientists agree to look. The version a given person has at that locus — eleven repeats, or fourteen, or nine — is called an allele.

   A SINGLE STR LOCUS (illustrative; the locus name and repeats are schematic)

   Locus "D-EXAMPLE", on a chromosome:   ...─[GATA][GATA][GATA] ... [GATA]─...
                                              └──────── repeated N times ────────┘

   Person A:  allele = 11   (the GATA block repeats 11 times)
   Person B:  allele = 14
   Person C:  allele = 9

   Because you inherit one chromosome of each pair from each parent, you carry
   TWO alleles at each locus — e.g. (11, 14). Those two numbers are your
   "genotype" at this locus. A full profile is just this, read at many loci.

Because you have two copies of each chromosome (one from each parent), you have two alleles at every locus — one inherited from each. So at our example locus a person might be (11, 14): an 11-repeat allele from one parent, a 14-repeat allele from the other. If both parents passed on the same allele, the person is (12, 12) — called homozygous at that locus — and shows just one value. This pair of numbers at one locus is not, by itself, rare; lots of people are (11, 14). The power comes from reading many loci at once. The standard American forensic test (§7.3) currently examines 20 core STR loci plus a sex marker. The chance that two unrelated people share an allele or two is decent; the chance that they share the same pair at all twenty loci is what produces the famous one-in-a-billion numbers (§7.5). DNA typing is, at bottom, just this fact — variable repeats at fixed addresses — multiplied across enough addresses that the combination becomes effectively unique.

🔍 Check Your Understanding 1. Why can a cheek swab from a suspect be compared to a bloodstain from a scene — what biological fact makes those two tissues comparable? 2. Forensic STR typing deliberately targets non-coding "junk" DNA rather than genes. Give one practical and one ethical reason this is the right design choice. 3. A person shows a single value — say, 12 — at a locus. Does that mean they have only one allele there? (Think about what homozygous means.)

7.3 STR analysis: extraction → PCR → capillary electrophoresis → electropherogram

Now to the bench. The modern forensic DNA workflow — the one running in essentially every accredited crime lab — turns a stain or a swab into a profile through four major stages. We will walk each one and, in keeping with this book's discipline, name what can go wrong at every step, because a method's failure modes are as much a part of it as its successes.

Stage 1 — Extraction. First the DNA must be liberated from the cells that hold it and cleaned of everything else — the blood proteins, the denim fibers, the dirt, the chemicals — that would interfere with later steps. The cells are broken open (lysed), and the DNA is separated from the cellular debris and purified. The output is a tiny, invisible quantity of purified DNA in solution.

The failure modes here are foundational. If the sample contains an inhibitor — a substance like the dyes in denim, certain soil components, or heme from large amounts of blood — that survives extraction, it can poison the next stage and produce a partial profile or none at all. And extraction is one of the points where contamination (Chapter 4) is most dangerous: a stray skin cell from an analyst, or DNA carried over from a previous sample, gets purified right along with the evidence. This is why labs use negative controls — blank samples carried through the whole process — to catch contamination, and why analysts wear masks and change gloves obsessively.

Stage 2 — Quantification. Before amplifying, the lab measures how much human DNA the extract actually contains, because the next stage works best within a specific range. Too little DNA and the profile will be incomplete or unreliable; too much and the signal can overload the instrument. (Samples with very little DNA — the touched surfaces and trace amounts — push the method to its edge, and the special problems they create belong to Chapter 8.)

Stage 3 — PCR amplification. This is the step that made modern forensic DNA possible, and it deserves its own callout. The polymerase chain reaction (PCR) is a laboratory technique that copies a specific target region of DNA over and over, doubling the number of copies in each cycle, until a trace amount becomes billions of copies — enough to measure. In forensic typing, PCR is aimed at exactly the STR loci we care about: short primers (small pieces of DNA) flank each target locus and tell the copying enzyme where to start, so that only the STR regions are amplified, each tagged with a colored fluorescent label.

🔬 At the Bench PCR is a thermal cycle. The reaction mixture — the template DNA, the primers, the building-block bases, and a heat-stable enzyme called Taq polymerase — is heated and cooled in repeating steps: heat to ~94 °C to separate the two DNA strands (denaturation), cool to ~55–60 °C so primers can bind their targets (annealing), warm to ~72 °C so the polymerase builds new strands (extension). Each cycle doubles the copies of the target; after ~28–30 cycles, one starting molecule has become hundreds of millions. The exponential math is the magic — and the menace. Because PCR amplifies so powerfully, it amplifies whatever template is present, including a few contaminating molecules. A single stray cell, copied a billion-fold, can become a detectable "profile" that was never relevant to the crime. PCR's sensitivity is its gift and its standing risk, and every safeguard in a DNA lab exists because of that exponential curve.

Stage 4 — Capillary electrophoresis (CE) and the electropherogram. Now the lab has billions of copies of each STR locus, each fragment a length that depends on how many repeats it contains — more repeats, longer fragment — and each tagged with a fluorescent dye. To read the alleles, the fragments are sorted by size. The technique is capillary electrophoresis: the labeled DNA fragments are injected into an ultra-thin tube (a capillary) filled with a gel-like polymer, and an electric field pulls them through it. Because DNA carries a negative charge, the fragments migrate toward the positive end; because smaller fragments slip through the polymer faster than larger ones, they separate by size, smallest arriving first. As each fragment passes a detector window, a laser excites its fluorescent tag and the instrument records the color and the time of arrival.

The instrument plots that record as an electropherogram — the visual readout of a DNA profile, a graph of fluorescence over fragment size, in which each allele appears as a peak. The position of a peak along the horizontal axis tells you the fragment's size, which the software converts to a repeat number (the allele) by comparing it against an internal size standard run alongside. The height of the peak reflects how much of that fragment is present. A single-source sample at one locus shows either one peak (homozygous — both alleles the same) or two (heterozygous — two different alleles).

🔬 Read the Evidence

text FIGURE 7.1 — "One locus on an electropherogram" [constructed teaching example] THE ITEM A schematic single-source electropherogram trace at one STR locus, read from a clean reference cheek swab (the kind of pristine sample real casework rarely gives you). THE CONTEXT Run on a calibrated capillary instrument; a size standard ran in the same injection; a reagent-blank negative control ran clean (no peaks), arguing against contamination. WHAT IT SHOWS Two peaks of similar height at allele positions 11 and 14 → genotype (11, 14) at this locus. The peaks are sharp, balanced, and well above the analytical threshold. WHAT IT DOESN'T One locus is not a profile. (11, 14) is common; on its own it excludes almost no one and identifies no one. It says nothing about *when* or *how* the cells were deposited. THE INFERENCE At THIS locus, the donor is (11, 14); a suspect typing (11, 14) is not excluded here, a suspect typing (9, 12) IS excluded by this locus alone. THE LESSON The strength is in the stack: this same clean read, repeated across 20 loci, is what produces a one-in-a-billion profile. A single locus is a class characteristic; the full set approaches an individual one. (Compare Ch. 1's class-vs-individual arrow.)

   A SCHEMATIC SINGLE-LOCUS ELECTROPHEROGRAM (not to scale)

   fluorescence
      │
      │            ▲                    ▲
      │            █                    █
      │            █                    █
      │            █                    █
      │   _________█____________________█_________   ← analytical threshold (peaks below = noise)
      │            │                    │
      └────────────┴────────────────────┴──────────────►  fragment size →
                  allele 11            allele 14

   Two balanced peaks → heterozygous genotype (11, 14) at this locus.
   A single peak would mean homozygous (e.g., (12, 12)). Peak HEIGHT carries
   information too — it is central to reading mixtures and trace samples (Ch. 8).

That is the whole pipeline: a stain becomes purified DNA (extraction), the right regions are copied a billion-fold (PCR), the copies are sorted by size and read by color (capillary electrophoresis), and the result is plotted as peaks (the electropherogram) that a trained analyst — and increasingly, validated software — interprets into a list of alleles. Where on the validity spectrum does this sit? For a clean, single-source sample, the core measurement is about as objective and reproducible as forensic science gets: this is why DNA anchors the top of the spectrum. But notice the qualifier. The validity is highest for pristine, single-source material; the moment the sample is a mixture of several people, or a trace so small the peaks are unreliable, interpretation re-enters and the error modes multiply. That is not a footnote — it is the entire subject of Chapter 8, and the reason the cold case's gas-can sample (below) is going to be harder than this clean figure suggests.

🧠 Cognitive-Bias Watch Reading an electropherogram is mostly objective at the level of "is there a peak at this size?" — but judgment creeps in at the edges: Is that small bump a real minor-contributor allele or background noise? Should this off-ladder blip be called a peak? Studies of forensic DNA interpretation have shown that when analysts are told what result the case "needs" — that the suspect's profile is expected in a mixture — their borderline calls can drift toward confirming it. The fix is the same one we will name formally in Chapter 31: keep the interpreter from knowing the suspect's profile (or the detective's theory) until after the evidence profile is independently called. The instrument does not have expectations. The human reading its output does.

7.4 The DNA profile and CODIS

When the analyst has read every locus, the result is a DNA profile: the complete set of alleles a sample shows across all the tested loci — essentially a list, locus by locus, of the one or two allele values found at each. A modern profile looks like a table: at locus D3, the genotype is (15, 16); at locus vWA, (17, 18); at locus FGA, (21, 24); and so on across the full panel. That list is the profile. It is not a picture of a person; it is a string of numbers that, taken together, is shared by very few people on Earth.

Two profiles are compared simply by checking whether they have the same alleles at every locus. If they differ at even one locus — the evidence shows (15, 16) at D3 but the suspect is (14, 17) — that is, barring a few well-understood biological exceptions, an exclusion: the suspect is not the source. This is the asymmetry of Chapter 1 in its purest form. A single clean, reproducible mismatch can exclude with near-certainty. Agreement at every locus, by contrast, does not prove the suspect is the source; it makes the suspect consistent with the source, and the strength of that consistency is then a question of probability (§7.5). Mismatch refutes; match supports. Burn this in: it is the most reliable thing forensic DNA does.

Now, what happens when you have a profile from a scene but no suspect to compare it against? This is where the database comes in. The Combined DNA Index System (CODIS) is the United States' national DNA database, administered by the FBI, which links DNA profiles across local, state, and national levels so that an unknown crime-scene profile can be searched against profiles already on file. CODIS holds, in separate indexes, profiles from convicted offenders (and, in many jurisdictions, arrestees), profiles from forensic casework (unknown profiles from unsolved crime scenes), and other categories. When a lab uploads an eligible crime-scene profile, the system searches for matches — both against known individuals (an offender hit) and against other unsolved cases (a forensic hit, which links two crimes to the same unknown person even if no one knows who that person is yet).

🔬 At the Bench CODIS does not store anyone's name, photograph, or genome — it stores the numerical STR profile and a specimen identifier. A "hit" is not a name appearing on a screen; it is the system flagging that two profiles are candidate matches. A human analyst must then confirm the match, and the originating laboratory must be contacted to release the identity tied to that specimen. The database is a pointer, not an answer. This matters for both accuracy and privacy: the system that produces investigative leads is deliberately built to contain numbers, not identities, and the link back to a person is a separate, controlled step.

Here is the part the courtroom path must hold onto, because it is routinely misunderstood by everyone except the people who do it for a living: a CODIS hit is a lead, not proof. It tells investigators where to look; it does not, by itself, make a case. Once the database points to a candidate, the laboratory obtains a fresh reference sample from that person and re-types it, and it is that direct comparison — not the database search — that becomes evidence. There is even a statistical subtlety the defense bar rightly raises: searching a large database and finding a match is not quite the same probabilistic event as testing one pre-identified suspect and finding a match, because searching many profiles gives chance more opportunities to throw up a coincidental hit. The honest treatment of that "database search effect" belongs to the statistics in Chapter 9; the principle for now is simpler and non-negotiable. The database generates a suspect. The case against that suspect must still be built — and the DNA portion of it rests on the confirmatory re-test, reported with its random match probability, not on the fact that a computer flagged a candidate.

⚖️ In the Courtroom An expert may testify that an evidence profile matches a defendant's reference profile and state the random match probability. An expert may not properly testify that "the database identified the defendant as the perpetrator" — the database identified a candidate source of the DNA, which is a narrower and more honest claim. Watch, too, for the unspoken leap from "his DNA is on the object" to "he committed the crime." Those are different propositions; §7.6 is about the gap between them, and a good cross-examination lives in that gap.

This database power is also what makes the year's most celebrated forensic success possible, and it points forward to a technique that Chapter 8 will develop in full. When a crime-scene profile draws a blank in CODIS — the offender is not in the database — investigators in recent years have turned to a different kind of genetic search entirely: uploading crime-scene DNA to genealogy databases populated by ordinary people's ancestry tests, finding distant relatives, and building family trees down to a suspect. That is the method that finally identified the Golden State Killer, Joseph James DeAngelo, in 2018, four decades after his crimes, and we treat it as this book's emblem of real, validated forensic progress. But it is a different beast from a CODIS STR match — it searches different genetic markers, raises serious privacy questions, and produces an investigative lead that must then be confirmed by ordinary STR typing of a sample taken directly from the suspect. We name it here only to set the table; Chapter 8 owns it. The point that belongs to this chapter is the one the DeAngelo case and the Pitchfork case share across thirty years: the genetic search, whatever its form, generates a candidate. The confirmation is still an STR profile and a random match probability — the very things we are learning to read.

7.5 The random match probability: how rare is "rare"?

We have reached the number — the one quantitative idea this book asks you to take seriously, because misunderstanding it is how DNA's authority gets abused. When two profiles match at every locus, the question the court actually needs answered is not "do they match?" (they do) but "how surprising is that match if the defendant is not the source?" The answer is the random match probability (RMP): the probability that a randomly chosen, unrelated person from the relevant population would happen to share the evidence profile by pure chance.

Here is the logic in plain terms. At any single locus, some genotypes are common and some are rare; population studies have measured, for each allele, how often it occurs in various populations. The probability that a random person matches the evidence at one locus might be, say, 1 in 10 — not impressive on its own. But the loci are inherited largely independently of one another, which means (with appropriate statistical care) the per-locus probabilities can be multiplied together. One in ten at the first locus, times one in twelve at the second, times one in eight at the third, and so on across twenty loci — and the product becomes a vanishingly small fraction. This is why a full single-source profile produces numbers like one in a billion, or far rarer: it is the multiplication of twenty modest rarities into one extreme one.

🔬 At the Bench A worked example, with illustrative numbers (these are made up to show the arithmetic, not real allele frequencies): suppose at each of just five loci, the chance a random person matches the evidence genotype is about 1 in 8. Treating the loci as independent, the chance of matching at all five is $(1/8)^5 = 1/32{,}768$ — about 1 in 33,000 from only five loci. Extend the same kind of multiplication across the full 20-locus panel and the figure routinely falls below one in a trillion. The real calculation is more careful than this — it uses measured per-locus frequencies, not a flat 1 in 8, and it applies corrections (a factor often written $\theta$ or $F_{ST}$) to account for the fact that real populations are not perfectly random-mating, which makes the reported number more conservative, i.e., larger and more favorable to the defendant. But the shape is exactly this: modest rarities, multiplied, become extreme ones. That multiplication is the engine of DNA's statistical power.

State the number both ways, always. "The random match probability is approximately one in 1.2 billion" is precise but cold; pair it with "—that is, you would expect to find this profile in roughly one person out of more than a billion, far more than the population of the United States." Expressing the figure as a fraction and in human terms is not a courtesy; it is the difference between a juror understanding the evidence and a juror being overwhelmed by it.

And now the trap — the single most important misuse of statistics in all of forensic science, and the reason this book refutes it every time a probability appears. It is called the prosecutor's fallacy (we will dissect it fully in Chapter 9, but you must be able to spot it from this chapter on). The fallacy is the move from "the chance of this profile in a random innocent person is one in a billion" to "therefore the chance the defendant is innocent is one in a billion." Those two statements sound identical and are completely different. The random match probability answers: given that the person is innocent, how likely is a matching profile? It does not answer: given a matching profile, how likely is the person innocent? The second question depends on other things entirely — how many other people had access, what else the evidence shows, how the DNA might have gotten there. A one-in-a-billion RMP is overwhelming evidence about a source; it is not a one-in-a-billion statement about guilt, and anyone who slides from the first to the second — a prosecutor, an expert, a juror — has committed the error that has helped convict innocent people.

⚖️ In the Courtroom The clean way for an expert to state the result: "This profile is found in approximately one in 1.2 billion unrelated individuals in the relevant population." The forbidden gloss: "There is a one-in-1.2-billion chance the defendant is not the source," or worse, "...is innocent." If an attorney or expert offers the second formulation, that is the prosecutor's fallacy, and it is grounds for objection and correction. Some experts now prefer to state results as a likelihood ratio — "this evidence is X times more probable if the defendant is the source than if an unknown unrelated person is" — precisely because it keeps the logic honest. The likelihood ratio is the proper subject of Chapter 9; flag it here, and know that the fallacy is what it is designed to defeat.

Two more honest cautions about the number, because even DNA's great statistic has edges. First, the RMP is computed for unrelated people. Close relatives share much more DNA than strangers, so the chance that the defendant's brother matches the profile is far higher than one in a billion — sometimes only one in a few hundred. A responsible report acknowledges that the impressive figure applies to unrelated individuals and that a sibling cannot be excluded by the statistic alone. Second, the RMP describes the coincidental-match risk only; it says nothing about laboratory error — a swapped sample, a contaminated tube, a mislabeled tube — which is a separate and, realistically, larger risk than a one-in-a-billion coincidence. The chance of a lab error is not one in a billion; it is whatever the lab's real error rate is, and it does not shrink no matter how many loci you type. An honest expert keeps these two risks distinct and never lets the astronomical coincidence figure stand in for the everyday, human risk of a mistake.

🔍 Check Your Understanding 1. An expert testifies: "The random match probability is one in a billion, so there's a one-in-a-billion chance the defendant is innocent." Name the fallacy and explain, in one sentence, what the one-in-a-billion figure actually describes. 2. Why does multiplying per-locus probabilities require that the loci be (approximately) independent — and what would over-stating the rarity look like if they were not? 3. The RMP for a profile is one in 50 billion. Does this number account for the possibility that the suspect's identical twin, or the lab, made the match happen? Explain why not.

7.6 What DNA cannot tell you (presence ≠ guilt; the transfer problem)

We end where this chapter's epigraph began: with the limits, because a method this powerful is only safe in the hands of someone who knows exactly what it does not say. A DNA match — even a pristine, single-source, one-in-a-trillion match — establishes one thing and one thing only: that the cells came from this person. Everything beyond that is inference, and the most consequential errors in DNA evidence come from treating the source attribution as if it answered questions it cannot touch.

DNA cannot tell you when the cells were deposited. A profile recovered from a doorknob does not carry a timestamp; the suspect's DNA could have arrived an hour before the crime or three weeks earlier, on an innocent visit. DNA cannot tell you how the cells got there — whether they were deposited by the person directly, transferred from somewhere else, or planted. And DNA cannot tell you what the person was doing, or whether they did anything wrong at all. "His DNA is on the weapon" and "he wielded the weapon" are different statements, and the distance between them is exactly the distance between forensic science and a jury's verdict. The DNA is a fact about cells. Guilt is a conclusion about a person, and it requires far more than a profile.

The sharpest version of this limit, and the one that has grown more dangerous as the method has grown more sensitive, is the transfer problem. DNA can move. You shed cells onto objects you touch, and those objects can carry your cells onto other objects and people you never touched. If you shake someone's hand and they then grip a railing, your DNA may end up on that railing — secondary transfer, the deposit of a person's DNA at a location they were never present at. As STR typing has become sensitive enough to read profiles from a few dozen cells (the touched-surface "touch DNA" that Chapter 8 owns in full), the risk has grown that an evidence profile reflects not the suspect's presence at the crime but a chain of innocent contact. There are documented instances of a person's DNA turning up at a scene they had never visited, carried there on an object or another person — including at least one case where a hospital patient's DNA reached a homicide scene via paramedics who had handled both. The lesson is not that DNA is unreliable; it is that where DNA is found and how it got there are two different questions, and only the first is answered by the profile.

⚠️ Junk-Science Alert The overstatement to guard against here is not a debunked method — DNA typing is the opposite of junk science — but a debunked inference: the slide from "the defendant's DNA was on the object" to "the defendant committed the act." That move is not licensed by the science, no matter how small the random match probability. When you hear (or are tempted to make) the leap from source to act, stop and ask the three questions the profile cannot answer: When was it deposited? How did it get there? Does its presence at this location mean what the prosecution says it means? A one-in-a-trillion match that has been misattributed in meaning can convict an innocent person as surely as a bite mark — and with far more persuasive force, because the number sounds like certainty.

🧠 Cognitive-Bias Watch The transfer problem interacts with bias in a way worth naming. Once a DNA match focuses suspicion on a person, every other piece of evidence in the case tends to be re-read in light of it — the ambiguous statement now sounds like a lie, the weak eyewitness now seems credible. A strong DNA result can become a bias engine for the rest of the investigation, lending its statistical authority to inferences it does not actually support. The discipline (Chapter 31) is to keep the DNA's claim contained to what it proves — source — and to evaluate the question of act on its own evidence.

So where does DNA finally sit? At the top of the validity spectrum for its actual claim — identifying the source of a biological sample — and nowhere near the top for claims it was never built to make: timing, mechanism, guilt. This is the same lesson Chapter 1 taught about odontology (valid for identifying a body, worthless for bite marks): a discipline's place on the spectrum depends on which question you ask it. Ask DNA "whose cells are these?" and it is the finest tool forensic science has. Ask it "is this person guilty?" and it is silent — and the most dangerous moments in a courtroom are the ones where someone mistakes that silence for a yes.

🗂️ The Case File

Carrow County — the gas can on the cabin floor. The evidence inventory from Chapter 3 logged a one-gallon metal gas can, found on its side in the front room of the Mill Creek cabin near the area of heaviest fire damage. It is the kind of object that, in a suspected arson, you swab for DNA on the handle and the cap — the surfaces a person grips. The state lab (recall the county sends major evidence out, Chapter 4) has now done exactly that, and a profile has come back from the handle.

Here is what we can honestly say, and not a word more. The lab recovered a partial DNA profile from the gas-can handle — the kind of small, touched-surface deposit ("touch DNA," which Chapter 8 will develop) that pushes the method toward its sensitivity limit, and which a charred object near a fire makes harder still, since heat degrades DNA. The recovered profile was searched against CODIS. What we have, therefore, is this: a profile exists, and it was developed from an object central to the suspected arson. What we do not have is a source, a time, or a meaning. The profile alone does not tell us whose cells they are (the CODIS search result, and whether it points anywhere, is the kind of lead we now know must be confirmed by direct re-typing); it does not tell us when the cells reached the handle (a gas can in a renovation project is an object many hands touch, for innocent reasons); and it certainly does not tell us that the person who touched it set the fire. Every limit from §7.6 applies at once.

Two further complications are flagged, not resolved, and we leave them honestly open. First, a touched surface near a fire is a prime candidate to yield not one person's cells but several, mixed together and partly degraded — and a mixture is a different and harder interpretive problem than the clean single-source figure in §7.3. Whether this profile is single-source or a mixture is the question Chapter 8 picks up. Second, even a clean match here would establish only that someone touched the can, which on a renovation site is not, by itself, incriminating.

Status after Chapter 7: A DNA profile exists from the gas-can handle; its source, its timing, and its meaning are all undetermined. We have added the single most powerful identification tool in forensic science to the file — and, in the same breath, reminded ourselves that even it answers only whose cells, never who is guilty. No one is included; no one is excluded; the assumption of an accidental fire still stands, untouched by this evidence alone. We have a thread. We do not yet have a hand to put on the can.

Conclusion

DNA analysis is the forensic method that earned its authority honestly — validated by outsiders, grounded in understood biology, and quantified with a real number — and it reset the standard for what "individualization" should mean by being the one discipline that could actually back the claim. We built the method from the ground up: the biology of loci and alleles (§7.2); the STR workflow of extraction, PCR, capillary electrophoresis, and the electropherogram, each step with its failure mode (§7.3); the DNA profile and the CODIS database that turns an unknown profile into an investigative lead (§7.4); the random match probability and the prosecutor's fallacy that lies in wait beside it (§7.5); and, above all, the limits — that a match identifies a source and never a time, a mechanism, or a guilt (§7.6).

Two of the book's themes ran straight through this chapter and should now feel sharper. The validity spectrum has its anchor: DNA sits at the top for its actual claim, and seeing why — understood biology, quantified error, objective core measurement — gives you the criteria to judge every weaker method against it. And exclusion over proof found its purest expression: a single clean mismatch excludes with near-certainty, while even a one-in-a-trillion match only supports a source and must still answer the questions the profile cannot. The prosecutor's fallacy, lurking in §7.5, was the chapter's quiet third theme — a preview of how even the strongest evidence is corrupted the moment its statistics are misstated.

But we left the method at its edge, and that edge is the next chapter. The clean, single-source profile of §7.3 is the easy case. The cold case's gas-can sample is not clean: it is a touched, heat-degraded deposit that may well be a mixture of several people. What happens to DNA's authority when the sample is a trace too small to trust, or a blend of contributors that must be teased apart, or so degraded that only fragments survive? And what is the genealogy method that caught the Golden State Killer, and what does it cost? Chapter 8 takes DNA to its limits — where the method is still powerful, but where interpretation, and the possibility of error, quietly return.

Key Terms

DNA (deoxyribonucleic acid) — the molecule carrying genetic instructions, present and identical in nearly every nucleated cell of one individual; the basis of forensic genetic identification.
STR (short tandem repeat) — a short DNA sequence (typically 2–6 bases) repeated a variable number of times at a fixed location, the variation forensic typing exploits.
Locus (plural loci) — a specific, named physical location in the genome that forensic scientists target for typing.
Allele — a specific version (e.g., a particular repeat number) found at a locus; each person carries two alleles per locus, one inherited from each parent.
PCR (polymerase chain reaction) — a laboratory technique that copies a targeted DNA region exponentially, turning a trace amount into billions of copies; the basis of modern STR typing and the source of its sensitivity-driven contamination risk.
Electropherogram — the visual readout of a DNA profile produced by capillary electrophoresis, a graph of fluorescence versus fragment size in which each allele appears as a peak.
DNA profile — the complete set of alleles a sample shows across all tested loci; the numerical "fingerprint" compared between samples.
CODIS (Combined DNA Index System) — the U.S. national DNA database that searches unknown crime-scene profiles against offender and casework profiles; it stores numerical profiles, not identities, and produces leads, not verdicts.
Random match probability (RMP) — the probability that a randomly chosen, unrelated person from the relevant population would coincidentally share the evidence profile; states the rarity of a match, not the probability of innocence.

Spaced Review

(This chapter.) Explain, in one sentence each, why a single-locus mismatch can exclude a suspect but a 20-locus match only supports that the suspect is the source. (§7.4, §7.6)
(Chapter 6.) The post-conviction DNA exonerations revealed several leading causes of wrongful conviction. Why is DNA uniquely suited to prove a past conviction wrong, where the original (non-DNA) evidence could not prove it right? Connect this to the exclusion-over-proof theme. (Ch. 6; §7.1)
(Chapter 5.) A defense attorney challenges DNA evidence under Daubert. Of the Daubert factors — testability, known error rate, peer review, general acceptance — which does DNA typing satisfy most clearly, and why does that satisfaction not extend automatically to a degraded mixture? (Ch. 5; §7.3, §7.5)
(Chapter 3.) A gas-can handle is swabbed and yields a profile. Using Locard's exchange principle, explain why the presence of a person's DNA on the can is expected on a renovation site and therefore weak as associative evidence by itself. (Ch. 3; §7.6, Case File)
(Validity spectrum.) Where does single-source nuclear DNA sit on the NAS 2009 / PCAST 2016 validity spectrum, and name the one condition under which that high-validity method can still produce a wrong result. (§7.1, §7.3; theme 2)