In This Chapter
- 6.1 How Computers "See": Pixels, Features, Patterns
- 6.2 Image Classification: Teaching Machines to Categorize
- 6.3 Object Detection, Segmentation, and Beyond
- 6.4 Facial Recognition: Power and Peril
- 6.5 Deepfakes and Synthetic Media
- 6.6 When Vision Fails: Adversarial Examples and Edge Cases
- 6.7 Chapter Summary
- Spaced Review — Looking Back
- Project Checkpoint: AI Audit Report
- What's Next
- Optional Python Code
Chapter 6: Computer Vision — How Machines See the World
"The question is not whether machines can see. It's what they see that we don't — and what we see that they never will." — Fei-Fei Li, Co-Director, Stanford Human-Centered AI Institute
What you'll learn in this chapter:
- How computers turn images into numbers they can process
- How convolutional neural networks learn to recognize visual patterns
- Where computer vision shows up in your daily life (hint: more than you think)
- Why vision systems fail in ways that surprise us
- Why facial recognition is one of the most debated AI technologies in the world
Why it matters: Every day, AI systems are interpreting the visual world on your behalf — unlocking your phone, filtering your photos, guiding surgical instruments, scanning crowds for faces. These systems are powerful, but they are not your eyes. Understanding what they actually do, where they break, and who controls them is one of the most important pieces of AI literacy you can develop.
6.1 How Computers "See": Pixels, Features, Patterns
Close your eyes for a moment and then open them. In the fraction of a second it takes you to focus, your brain has identified objects, estimated distances, recognized faces, read text, and assessed whether anything in your environment is dangerous. You do this constantly and without conscious effort. It is so natural that it seems simple.
It is not simple. Vision is one of the hardest problems in all of artificial intelligence, and the reason comes down to a fundamental mismatch: you see a world of objects, meaning, and context. A computer sees a grid of numbers.
The Grid of Numbers
Every digital image is, at its most basic level, a two-dimensional grid of tiny squares called pixels. A standard smartphone photo might be 4,000 pixels wide and 3,000 pixels tall — that's 12 million pixels in a single image. Each pixel stores a color value. In a grayscale image, that value is a single number between 0 (pure black) and 255 (pure white). In a color image, each pixel stores three numbers — one for red, one for green, one for blue — each between 0 and 255. Mix them together and you can represent about 16.7 million different colors.
So when a computer "looks" at a photo of a cat sitting on a couch, it does not see a cat or a couch. It sees something like this:
[142, 138, 125], [144, 139, 127], [147, 141, 130], ...
[140, 136, 124], [143, 138, 126], [145, 140, 129], ...
[139, 135, 123], [141, 137, 125], [144, 139, 128], ...
...
Millions of number triplets. No whiskers, no cushions, no "catness." Just numbers. The entire challenge of computer vision is getting from those numbers to meaning.
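With a library like NumPy, you can inspect this grid-of-numbers representation directly. A minimal sketch (the pixel values are the hypothetical triplets above; the grayscale weights are the standard Rec. 601 luma coefficients):

```python
import numpy as np

# A hypothetical 2-by-3 patch of an RGB image: each pixel is an [R, G, B]
# triplet of integers between 0 (darkest) and 255 (brightest).
patch = np.array([
    [[142, 138, 125], [144, 139, 127], [147, 141, 130]],
    [[140, 136, 124], [143, 138, 126], [145, 140, 129]],
], dtype=np.uint8)

# One common way to reduce color to grayscale is a weighted sum of the
# three channels, with weights roughly matching human brightness perception.
weights = np.array([0.299, 0.587, 0.114])
gray = (patch @ weights).round().astype(np.uint8)

print(patch.shape)  # (2, 3, 3): height, width, channels
print(gray.shape)   # (2, 3): one brightness number per pixel
```

Everything else in this chapter builds on arrays like these.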
💡 Intuition: Imagine someone handed you a novel, but instead of words, it was encoded as a sequence of numbers where each number represented a letter (A=1, B=2, and so on). You could decode it — eventually. But understanding the story would require much more than knowing the code. You'd need grammar, vocabulary, cultural context, and life experience. Computer vision faces the same gap: the numbers are easy; the meaning is hard.
From Pixels to Features
You don't perceive individual pixels any more than you perceive individual letters when you're reading this sentence. Your visual system groups pixels into features — edges, textures, shapes, contours — and then assembles those features into objects. A straight vertical line next to a horizontal line might be the corner of a table. A circular shape with two dark spots near the top might be a face.
Early computer vision researchers tried to program these rules by hand. In the 1960s and 1970s, they wrote explicit instructions: "Look for edges. Group edges into shapes. Match shapes to a catalog of known objects." This approach, called classical computer vision, worked reasonably well for controlled environments — say, identifying parts on a factory assembly line where the lighting was consistent and the objects were predictable.
But it fell apart in the real world. The real world has shadows, occlusion (objects blocking other objects), wildly variable lighting, and near-infinite variety. A chair can be wooden, metal, upholstered, folding, beanbag-shaped, or a tree stump someone sits on. No set of hand-written rules could capture that range.
🔄 Check Your Understanding: Why would a hand-programmed computer vision system that works perfectly in a well-lit factory struggle to identify the same objects outdoors? Think of at least three reasons before reading on.
Some answers: changing sunlight and shadows, varying angles of view, objects partially hidden behind other objects, weather effects like rain or glare, background clutter that doesn't exist in a factory.
The Hierarchy of Recognition
Here is what turned out to be the key insight: vision is hierarchical. You don't jump from pixels to "cat." You go through layers:
- Pixels → raw numbers
- Edges → where sharp changes in brightness occur
- Textures and shapes → patterns of edges (fur-like texture, round shape)
- Parts → combinations of shapes (pointy ears, oval body, thin tail)
- Objects → the whole thing (a cat)
- Scenes → objects in context (a cat on a couch in a living room)
This hierarchy is exactly what modern computer vision systems learn to extract — not through hand-written rules, but through training on thousands or millions of examples.
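The first rung of this hierarchy, edges as sharp changes in brightness, is simple enough to compute by hand. A toy sketch (the image and the threshold of 100 are invented for illustration):

```python
import numpy as np

# A toy grayscale image: a dark region (20) on the left, bright (220) on the right.
img = np.full((5, 8), 20, dtype=np.int32)
img[:, 4:] = 220

# An edge is a place where brightness changes sharply. The simplest
# detector is the difference between horizontally adjacent pixels.
horizontal_diff = np.abs(np.diff(img, axis=1))

# The edge shows up as a large response exactly at the dark/bright boundary.
edge_columns = np.where(horizontal_diff[0] > 100)[0]
print(edge_columns)  # [3]: the jump happens between columns 3 and 4
```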
6.2 Image Classification: Teaching Machines to Categorize
In Chapter 3, we discussed how machine learning works in general: show a model lots of examples, let it find patterns, and then see if it can apply those patterns to new examples it hasn't seen before. Image classification is this exact process applied to images: given a picture, assign it a label. "This is a dog." "This is a traffic sign." "This is melanoma."
The Architecture That Changed Everything: CNNs
The breakthrough that made modern computer vision possible came from an architecture called a convolutional neural network, or CNN. If you remember from Chapter 3 that neural networks are loosely inspired by how neurons connect in the brain, CNNs are loosely inspired by how the visual cortex processes images — in layers, from simple to complex.
Here is the core idea, no math required: a CNN slides a small window — called a filter or kernel — across the image, checking whether a particular pattern is present at each location. One filter might look for vertical edges. Another might look for horizontal edges. Another might look for diagonal lines. These filters aren't designed by humans; the network learns what patterns to look for during training.
💡 Intuition: Imagine you're searching a "Where's Waldo?" book. You're not examining every pixel. You're scanning for a specific pattern — red and white stripes, a bobble hat, round glasses. You slide your attention across the page looking for that combination. A CNN does something similar, except it learns thousands of different patterns to scan for, and it does it across multiple layers of increasing complexity.
In the first layer, the CNN might learn to detect edges and simple textures. In the second layer, it combines those edges into corners, curves, and shapes. In the third layer, those shapes combine into parts of objects — an eye, a wheel, a leaf. By the final layers, the CNN is recognizing whole objects: faces, cars, animals.
This is exactly the hierarchy from Section 6.1, but the CNN learns it from data rather than having it programmed in.
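The sliding-window operation itself fits in a few lines. Here is a minimal sketch of one convolution with a hand-picked vertical-edge filter; in a real CNN the filter weights would be learned during training, not fixed like this:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel across an image, computing one response per
    position: the core operation of a convolutional layer (no padding,
    stride 1, and no learned weights, just the sliding mechanism)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# A vertical-edge filter: it responds strongly where brightness
# changes from left to right.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]])

img = np.zeros((5, 5))
img[:, 3:] = 1.0  # bright region on the right
response = convolve2d(img, vertical_edge)
print(response)   # strongest where a filter window straddles the boundary
```

Stacking many such filters, and then layers of them, is what lets a CNN build the edge, shape, part, object hierarchy from Section 6.1.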
The ImageNet Moment
The moment computer vision went from a niche research area to a world-changing technology has a specific date: September 30, 2012. That's when a deep CNN called AlexNet, built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto, won the ImageNet Large Scale Visual Recognition Challenge — a competition in which systems had to classify images into 1,000 categories, everything from "goldfish" to "volcano" to "iPod."
AlexNet didn't just win. It cut the error rate nearly in half compared to the previous year's best system. The difference between AlexNet and its competitors was so large that it effectively ended the debate about whether deep learning could outperform traditional approaches to computer vision. Within two years, nearly every top-performing system in the competition used deep neural networks.
📊 Real-World Application: Today, image classification powers systems you encounter constantly. Your smartphone's photo app automatically organizes pictures by who's in them. Google Photos can search for "beach sunset" and find matching images you took years ago. Social media platforms scan uploaded photos for policy violations. E-commerce sites let you search for products by uploading a picture. Each of these relies on a descendant of the approach AlexNet demonstrated.
MedAssist AI: Seeing What Doctors See (And What They Might Miss)
Remember MedAssist AI from Chapter 1? The hospital diagnostic system that helps physicians interpret medical images? This is where its story deepens. MedAssist uses a CNN trained on hundreds of thousands of labeled medical images — X-rays, CT scans, MRI images, and pathology slides — to flag potential abnormalities.
Here's how it works in practice: a radiologist reviews a chest X-ray. MedAssist AI processes the same image and highlights areas that its CNN identifies as potentially showing signs of pneumonia, lung nodules, or other conditions. It provides a confidence score — say, "87% probability of pneumonia in the right lower lobe."
The system doesn't replace the radiologist. It acts as a second pair of eyes. And in studies, AI-assisted radiologists have caught findings they might otherwise have missed, particularly for subtle abnormalities that are easy to overlook when a physician is reviewing dozens of scans during a long shift.
But — and this is a crucial "but" we'll return to — MedAssist's CNN was primarily trained on images from teaching hospitals that serve predominantly white, affluent patient populations. When deployed at a community hospital serving a diverse patient base, its performance dropped measurably for certain conditions and certain demographic groups. The patterns it learned were real, but they were not universal.
🔗 Connection: This connects directly to Chapter 4's threshold concept: "Data is never neutral — it encodes the world that created it." MedAssist's training data encoded the demographics, equipment, and protocols of specific hospitals. A broader dataset would have taught it broader patterns.
🧩 Productive Struggle: Before reading the next section, think about this: how would you build a system that can not only tell you what is in an image, but where each object is and which pixels belong to which object? Image classification gives you one label for the whole image. What would you need to do differently to locate specific objects within a complex scene? Jot down your intuition before continuing.
6.3 Object Detection, Segmentation, and Beyond
Image classification answers a simple question: "What is this picture of?" But the real world rarely presents us with images that contain a single, centered object. A photo from a street corner might contain cars, pedestrians, traffic lights, buildings, a dog, a bicycle, and a pigeon. For many applications, we need to know not just what's in the image, but where each thing is.
Object Detection: Drawing the Boxes
Object detection takes classification a step further: it identifies multiple objects in an image and draws a bounding box around each one, along with a label and a confidence score. Instead of just "this image contains a car," an object detection system says "there is a car at this location, a pedestrian at that location, and a traffic sign in the upper right corner."
The systems that do this — architectures with names like YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), and Faster R-CNN — process images at remarkable speed. YOLO, for instance, can detect objects in video at 30 frames per second or faster, making it suitable for real-time applications like autonomous driving and security monitoring.
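One standard way to score how well a predicted bounding box matches a ground-truth box is intersection over union (IoU): the area of overlap divided by the area the two boxes cover together. A minimal sketch (the box coordinates are invented for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union between two boxes given as
    (x_min, y_min, x_max, y_max). 1.0 = identical, 0.0 = no overlap."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted pedestrian box versus a hand-labeled ground-truth box:
predicted = (50, 50, 150, 200)
ground_truth = (60, 40, 160, 190)
print(round(iou(predicted, ground_truth), 2))  # 0.72: substantial overlap
```

Detectors are typically evaluated by counting a prediction as correct when its IoU with a ground-truth box clears a threshold such as 0.5.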
Segmentation: Coloring Within the Lines
Object detection gives you a rough box around each object, but sometimes you need more precision. Image segmentation assigns a label to every single pixel in the image. Instead of a box around the pedestrian, you get a precise silhouette. This is essential in medical imaging (where you need to know the exact boundary of a tumor), autonomous driving (where you need to know exactly where the road surface ends), and augmented reality (where you need to overlay virtual objects on the real world seamlessly).
There are two flavors:
- Semantic segmentation labels every pixel by category: "these pixels are road, these are sidewalk, these are sky, these are car." But it doesn't distinguish between individual instances. Two adjacent cars are both just "car."
- Instance segmentation goes further: it labels each pixel and separates individual objects. The two adjacent cars become "Car 1" and "Car 2."
📊 Real-World Application: When you use a video call background blur feature, segmentation is what separates "you" from "everything else." When your phone's camera creates a portrait mode photo with an artificially blurred background, it's using segmentation to figure out which pixels are the subject and which are the background. The technology feels seamless, but underneath, a neural network is making millions of per-pixel decisions in real time.
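The per-pixel nature of segmentation is easy to see in code. A toy sketch, assuming a hypothetical network has already produced one score per class for every pixel of a tiny 2-by-3 image:

```python
import numpy as np

# Hypothetical per-pixel class scores, shape (height=2, width=3, num_classes=3).
classes = ["road", "sidewalk", "car"]
scores = np.array([
    [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.2, 0.1, 0.7]],
    [[0.7, 0.2, 0.1],   [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
])

# Semantic segmentation is, per pixel, "pick the highest-scoring class":
label_map = scores.argmax(axis=-1)
print(label_map)  # pixel-wise winning class indices

# Translate the top row of indices back into class names:
print([classes[i] for i in label_map[0]])  # ['road', 'road', 'car']
```

Instance segmentation would additionally split the "car" pixels into Car 1 and Car 2, which requires grouping pixels into objects rather than just labeling them.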
Beyond Still Images: Video, 3D, and Multimodal Vision
Computer vision doesn't stop at photographs:
- Video analysis adds the dimension of time. Tracking objects across frames, recognizing actions ("the person is walking," "the car is turning"), and detecting anomalies (an unattended bag in an airport) all require understanding how scenes change over time.
- 3D vision reconstructs depth from images. This is how autonomous vehicles understand the geometry of their environment and how AR apps place virtual furniture in your living room.
- Multimodal vision combines images with other data — text, audio, sensor readings — to build richer understanding. The vision-language models we hear about today, like those that can describe an image in natural language or answer questions about a photo, sit at this intersection.
🔄 Check Your Understanding: A self-driving car needs to know that the object ahead is a pedestrian (classification), where exactly the pedestrian is (detection), the precise outline of the pedestrian versus the road (segmentation), and whether the pedestrian is walking into the street (action recognition). Why would getting any one of these wrong be dangerous? Consider a scenario for each.
6.4 Facial Recognition: Power and Peril
Of all the applications of computer vision, none has generated more public debate than facial recognition technology (FRT). It is simultaneously one of the most impressive and one of the most concerning uses of AI. Understanding both sides of that equation is essential for AI literacy.
How Facial Recognition Works
At a high level, facial recognition involves three steps:
1. Face detection — Finding faces in an image. This is the technology that draws a box around faces in your camera's viewfinder. It's relatively straightforward and is embedded in nearly every smartphone and digital camera.
2. Feature extraction — Measuring the unique geometry of each face: the distance between the eyes, the shape of the jawline, the width of the nose, the depth of the eye sockets. Modern systems convert these measurements into a mathematical representation called a face embedding — a string of numbers that serves as a compact "fingerprint" of that face.
3. Face matching — Comparing the embedding against a database of known faces to find a match. This is where the "recognition" happens.
💡 Intuition: Think of face embeddings like a recipe that describes a face. Instead of "two cups flour, one cup sugar," it's "eye spacing: 0.42, nose width: 0.38, jaw angle: 0.71" — except with hundreds of measurements, and computed by a neural network rather than measured by hand. Two photos of the same person produce similar recipes. Two different people produce different recipes. The system compares recipes, not photos.
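The recipe comparison can be made concrete with cosine similarity, one common way to compare embeddings. A toy sketch with invented 4-number embeddings (real systems use hundreds of dimensions, and real match thresholds are tuned with great care):

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two embeddings: near 1.0 = same direction (likely
    the same face), lower values = increasingly different faces."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two photos of the same person yield similar "recipes"; a different
# person yields a different one. (Numbers invented for illustration.)
photo_1_same_person = [0.42, 0.38, 0.71, 0.15]
photo_2_same_person = [0.43, 0.36, 0.70, 0.17]
different_person = [0.10, 0.85, 0.20, 0.60]

same = cosine_similarity(photo_1_same_person, photo_2_same_person)
diff = cosine_similarity(photo_1_same_person, different_person)

# A "match" is declared when similarity clears a chosen threshold.
THRESHOLD = 0.9
print(same > THRESHOLD, diff > THRESHOLD)  # True False
```

Where that threshold sits controls the false-match rate, which is one reason the accuracy disparities discussed in this section matter so much: a threshold tuned on one population can misfire on another.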
Where Facial Recognition Is Used
Facial recognition is already deployed far more widely than most people realize:
- Phone unlocking: Apple's Face ID and similar systems on Android devices use facial recognition to authenticate the device owner.
- Airport security: Customs and Border Protection uses facial recognition at many U.S. airports to verify travelers' identities. The European Union and many other countries are expanding similar programs.
- Law enforcement: Police departments use facial recognition to identify suspects from surveillance footage, mugshot databases, and sometimes social media. The FBI's facial recognition system has access to over 600 million photos.
- Retail: Some stores use facial recognition to identify known shoplifters or to track customer demographics and behavior.
- Social media: Platforms have used (and in some cases discontinued, under pressure) facial recognition for automatic photo tagging.
- Workplace access: Offices and secure facilities use face scanning instead of keycards.
The Accuracy Problem
Here's where it gets complicated — and where one of the most important pieces of AI research comes in.
In 2018, MIT researcher Joy Buolamwini and Timnit Gebru published a study that sent shockwaves through the AI community. They evaluated three commercial facial recognition systems — from Microsoft, IBM, and a company called Face++ — and found dramatic accuracy disparities. The systems performed best on lighter-skinned male faces (error rates below 1%) and worst on darker-skinned female faces (error rates as high as 34.7%). The technology was not failing randomly. It was failing along the exact lines of race and gender.
Why? The training data. The datasets used to train these systems were disproportionately composed of lighter-skinned faces. The systems had simply seen more examples of some faces than others, and they learned their patterns accordingly. This is the "data is never neutral" principle from Chapter 4 playing out with real-world consequences.
📜 Historical Context: The Buolamwini and Gebru study, titled "Gender Shades," didn't just identify a technical problem. It catalyzed a social movement. IBM improved its systems and eventually exited the facial recognition market for law enforcement. Microsoft adopted stronger principles for selling the technology. Multiple cities — including San Francisco, Boston, and Minneapolis — banned government use of facial recognition. The study demonstrated that rigorous, independent AI auditing can drive change.
The Civil Liberties Debate
Even if facial recognition were perfectly accurate across all demographics — and it is getting closer — profound questions would remain:
- Consent: In most deployments, the people being scanned didn't choose to participate. You walk through an airport, attend a concert, or drive past a traffic camera, and your face is processed without your explicit agreement.
- Chilling effects: When people know they're being watched, they change their behavior. The knowledge that facial recognition is operating at a protest, for instance, may deter people from exercising their right to assemble. Research from the Brookings Institution has documented this effect.
- Function creep: A system deployed for one purpose (finding missing children) can expand to another (tracking political dissidents). History is full of surveillance technologies that began with benign justifications and expanded far beyond them.
- Power asymmetry: Facial recognition is primarily used by powerful institutions — governments, corporations, law enforcement — to identify and track individuals. The power flows in one direction.
⚠️ Common Pitfall: It's tempting to frame facial recognition as a simple "privacy vs. security" trade-off. But this framing assumes that more surveillance always produces more security and that privacy is a luxury rather than a right. In practice, facial recognition has led to wrongful arrests (at least six documented cases in the U.S. as of 2024, nearly all involving Black people), which means it can make people less safe. The trade-off is not as clean as it appears.
🔄 Check Your Understanding: Imagine your university announces it will use facial recognition to automate attendance tracking. List three potential benefits and three potential concerns. Then ask yourself: who gets to make this decision, and who should?
6.5 Deepfakes and Synthetic Media
In 2017, a new word entered the cultural lexicon: deepfake. A portmanteau of "deep learning" and "fake," it describes synthetic media — primarily video and audio — generated or manipulated by AI to depict people saying or doing things they never actually said or did.
How Deepfakes Work
Most deepfake systems rely on a type of neural network called a generative adversarial network (GAN) or, more recently, diffusion models. The basic principle of a GAN involves two networks in competition:
- A generator creates fake images or video.
- A discriminator tries to distinguish real from fake.
They train against each other: the generator gets better at fooling the discriminator, and the discriminator gets better at detecting fakes. The result, after extensive training, is a generator that can produce images and videos realistic enough to deceive not just the discriminator but human observers.
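The alternating structure of that training can be illustrated with a deliberately tiny stand-in: no neural networks, just one number per player, so the shape of the game stays visible. This is a toy analogy rather than a real GAN; the "generator" tries to produce numbers that land on the "real" side of the "discriminator's" moving boundary, and the two update in turn:

```python
import random

random.seed(0)

REAL_MEAN = 4.0      # "real data" are numbers drawn near 4
gen_mean = 0.0       # the generator's single parameter: where its fakes land
disc_boundary = 2.0  # the discriminator's single parameter: "real if x > boundary"

for step in range(200):
    real = random.gauss(REAL_MEAN, 0.5)  # a sample of real data
    fake = random.gauss(gen_mean, 0.5)   # a sample from the generator

    # Discriminator update: move the boundary toward the midpoint
    # between the real and fake samples it just saw.
    disc_boundary += 0.05 * ((real + fake) / 2 - disc_boundary)

    # Generator update: if the fake was caught (fell on the "fake" side),
    # shift the generator toward the boundary to fool the discriminator.
    if fake <= disc_boundary:
        gen_mean += 0.05 * (disc_boundary - fake)

# After many rounds of the game, the generator's output has drifted
# toward the real data, and the boundary has chased it upward.
print(round(gen_mean, 1), round(disc_boundary, 1))
```

A real GAN replaces each single number with millions of network weights and each update with backpropagation, but the back-and-forth structure of the game is the same.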
For face-swapping deepfakes, the process typically involves:
1. Collecting many images or video clips of the target person
2. Training a model to learn the structure and expressions of that face
3. Mapping those learned features onto a source video, replacing the original face
The technology has improved rapidly. Early deepfakes had telltale flaws — unnatural blinking, blurry edges around the face, inconsistent lighting. Today's best deepfakes are extremely difficult for untrained observers to detect.
The Spectrum of Harm
Not all synthetic media is harmful, and it is important to resist the temptation to view the entire technology as villainous:
- Entertainment: Film studios use the same technology to de-age actors, create digital doubles, and bring deceased performers back to the screen (with the estate's permission). Video game and animation studios use it for character creation.
- Accessibility: AI-generated avatars can represent people in video calls when they don't want to appear on camera. Synthetic voice technology helps people who have lost the ability to speak.
- Education and art: Synthetic media enables historical recreations, art installations, and educational simulations.
But the harms are real and serious:
- Non-consensual intimate imagery: The majority of deepfakes online are non-consensual pornography, overwhelmingly targeting women. This is not a hypothetical concern — it is the most common use of the technology.
- Political manipulation: Deepfakes of politicians making inflammatory statements could influence elections, especially if released in the final hours before voting when there's no time for debunking.
- Fraud: AI-generated voice clones have been used in scam phone calls where criminals impersonate family members in distress.
- Erosion of trust: Perhaps the most insidious effect is what researchers call the "liar's dividend." Once people know deepfakes exist, real evidence can be dismissed as fake. A politician caught on video making a damaging statement can simply claim, "That's a deepfake."
📊 Real-World Application: In 2024, a finance worker at a multinational firm was tricked into transferring $25 million after a video call in which AI-generated deepfakes impersonated the company's chief financial officer and other colleagues. The worker initially had suspicions about a phishing email, but the video call — in which everyone looked and sounded like real colleagues — convinced him the request was legitimate. This case illustrates how deepfakes can bypass not just technical systems but human judgment.
Detection and Defense
Can we detect deepfakes? Yes — sometimes. Researchers have developed detection tools that look for inconsistencies in lighting, skin texture, eye reflections, and other subtle artifacts. Some approaches analyze the biological signals embedded in video — genuine video of a living person shows tiny fluctuations in skin color as blood pulses through capillaries, and current deepfake generators don't replicate this.
But detection is an arms race. As detection tools improve, generators adapt. The long-term solution may not be detection at all but rather provenance — establishing a verified chain of custody for media. Technologies like C2PA (Coalition for Content Provenance and Authenticity) aim to embed cryptographic signatures in photos and videos at the moment of capture, creating a tamper-evident record of where media came from and whether it's been modified.
🔗 Connection: The deepfake challenge connects to the broader theme of "capability vs. understanding" that runs through this book. These AI systems are extraordinarily capable at generating realistic images and video. But they don't understand what they're creating, and they certainly don't understand the consequences. The capability outpaces the safeguards.
6.6 When Vision Fails: Adversarial Examples and Edge Cases
Computer vision systems can seem almost magical in their accuracy. But they fail in ways that reveal something important about how they work — and how different their "seeing" is from yours.
Adversarial Examples: Fooling AI on Purpose
In 2013, researchers made a discovery that was both fascinating and alarming: you could make tiny, almost imperceptible changes to an image — changes so small that no human would notice them — and completely change how a neural network classified it. In one now-famous demonstration, a picture of a panda, modified by adding a carefully calculated layer of noise invisible to the human eye, was classified by a CNN as a gibbon with 99% confidence.
These are called adversarial examples, and they expose a fundamental difference between human and machine vision. Humans see objects; CNNs see statistical patterns in pixels. When you add adversarial noise, you're not changing the object in the image — you're changing the statistical pattern in a way that redirects the network's computation.
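The mechanics can be shown on a deliberately simple stand-in: a linear classifier over a four-pixel "image." Real attacks such as the fast gradient sign method target deep networks, but the move is the same: nudge every pixel a tiny amount in the direction that shifts the score. All numbers here are invented for illustration:

```python
import numpy as np

# A toy linear classifier: score > 0 means "panda", score < 0 means "gibbon".
weights = np.array([0.5, -0.3, 0.8, -0.2])
image = np.array([0.4, 0.6, 0.3, 0.5])

print(float(weights @ image) > 0)  # True: classified as "panda"

# Adversarial perturbation: move every pixel by at most epsilon, each
# against the sign of its weight (the gradient of the score), so the
# score drops as fast as possible for a given per-pixel budget.
epsilon = 0.1
adversarial = image - epsilon * np.sign(weights)

# The change per pixel is bounded by epsilon...
print(float(np.abs(adversarial - image).max()) <= epsilon + 1e-9)  # True
# ...yet the classification flips:
print(float(weights @ adversarial) < 0)  # True: now "gibbon"
```

The panda-to-gibbon example works the same way, just in a space of millions of pixels and learned weights rather than four.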
This is not just an academic curiosity:
- Physical-world attacks: Researchers have shown that placing specific stickers on a stop sign can cause a self-driving car's vision system to classify it as a speed limit sign. Small patches applied to clothing can make a person invisible to object detection systems.
- Medical imaging: Adversarial perturbations added to medical scans could, in principle, cause a diagnostic AI to misread the scan — though this scenario remains theoretical for now.
- Security systems: If facial recognition can be fooled by subtle makeup patterns or specially designed glasses, the reliability of AI-based security is called into question.
Myth vs. Reality
Myth: "AI vision systems see the way humans do, just faster."
Reality: AI vision systems process pixel statistics in ways fundamentally different from human perception. They can be fooled by imperceptible noise, they don't understand context or physics, and they have no common sense about what objects "should" look like. A CNN might correctly identify 10,000 cats in a row and then confidently classify a slightly modified cat image as a toaster. Human vision is robust in ways we take for granted; machine vision is brittle in ways we're still discovering.
Edge Cases: The Long Tail of the Unusual
Beyond deliberate attacks, computer vision systems struggle with the sheer variety of the real world. These are called edge cases — situations that fall outside the patterns the training data captured:
- Unusual weather: Self-driving car systems trained primarily on clear-weather data may struggle with fog, heavy rain, snow glare, or the low sun angle common at dawn and dusk.
- Unfamiliar objects: A vision system might never have seen a person in a wheelchair, a horse-drawn carriage on a highway, or a mattress that fell off a truck. Objects that don't fit its learned categories create dangerous uncertainty.
- Context confusion: A vision system might correctly detect a person and a surfboard separately but fail to understand that the person is carrying the surfboard, not standing next to a free-floating board.
- Demographic gaps: As we saw with MedAssist AI and facial recognition, systems trained on non-representative data perform worse on underrepresented groups. Skin detection algorithms have misidentified dark-skinned hands. Content moderation systems have flagged photos of Black people at higher rates.
📊 Real-World Application: In 2018, an Uber self-driving test vehicle struck and killed a pedestrian in Tempe, Arizona. The investigation revealed that the car's perception system detected the pedestrian but repeatedly reclassified her — first as an unknown object, then as a vehicle, then as a bicycle — and never settled on a classification long enough to trigger emergency braking. The system had not been trained to recognize a pedestrian walking a bicycle across an unlit road at night. The case is discussed in depth in Case Study 1.
Why These Failures Matter
These failures are not just technical bugs to be patched. They reveal something fundamental: computer vision systems learn correlations in pixel data, not concepts about the physical world. They don't know that stop signs mean "stop" or that people are fragile. They don't reason about physics, context, or consequences. They are sophisticated pattern-matching systems, and patterns have limits.
This connects to the recurring theme of this book: capability versus understanding. A computer vision system that identifies cancer in medical scans with 95% accuracy is enormously capable. But it doesn't understand what cancer is, why it matters, or what happens to the patient if it's wrong. Humans in the loop — radiologists, drivers, security officers — remain essential not because the AI isn't good enough, but because the AI doesn't comprehend the stakes.
🔄 Check Your Understanding: Consider a computer vision system used for quality control in a food processing plant. It's trained to detect defective products on a conveyor belt. Describe two realistic edge cases where this system might fail. Then explain why a human quality inspector would handle those cases better.
6.7 Chapter Summary
Computer vision has come remarkably far in a remarkably short time. From hand-coded rules in the 1960s to deep convolutional neural networks that outperform humans on specific image recognition tasks, the field has transformed what machines can do with visual information.
But throughout this chapter, a consistent theme has emerged: seeing is not understanding. Computer vision systems process pixel statistics with extraordinary speed and accuracy, but they lack the common sense, context, and causal reasoning that humans bring to visual interpretation. This gap matters — in medicine, in driving, in policing, in the integrity of the information we trust.
Key concepts from this chapter:
- Pixels and features: Images are grids of numbers; vision systems learn to extract hierarchical features (edges, textures, shapes, objects) from those grids
- Convolutional neural networks (CNNs): Architectures that learn visual patterns through layers of filters, from simple features to complex objects
- Object detection and segmentation: Moving beyond "what is this?" to "where is everything, and which pixels belong to which thing?"
- Facial recognition: A powerful technology with documented accuracy disparities across race and gender, raising fundamental questions about consent, surveillance, and civil liberties
- Deepfakes and synthetic media: AI-generated visual content that challenges our ability to trust what we see
- Adversarial examples and edge cases: Failures that reveal the fundamental difference between machine pattern-matching and human understanding
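The first two concepts above — images as grids of numbers, and filters that extract features from them — can be demonstrated in a few lines. This sketch (NumPy, with a hand-picked kernel standing in for what a CNN's first layer would learn) slides a vertical-edge filter across a synthetic 8×8 image and shows that the output responds only where brightness changes.

```python
import numpy as np

# An 8x8 "image": left half dark (0.0), right half bright (1.0).
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# A vertical-edge filter, like those a CNN's first layer learns on its own.
kernel = np.array([[-1.0, 1.0]])

# Slide the filter across each row (cross-correlation, as in CNN libraries).
out = np.zeros((8, 7))
for i in range(8):
    for j in range(7):
        out[i, j] = np.sum(img[i, j:j + 2] * kernel)

print(out[0])  # [0. 0. 0. 1. 0. 0. 0.] -- fires only at the edge
```

Flat regions produce zero because the filter's weights cancel; only the dark-to-bright boundary produces a response. Stacking many such filters, layer upon layer, is how a CNN builds up from edges to textures to objects.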
The recurring themes in action:
- Tools built by humans: Vision systems trained on biased data reproduce and amplify those biases. MedAssist AI worked better for some patients than others because of who was in the training set.
- Capability vs. understanding: CNNs can classify images with superhuman accuracy on benchmarks, but they don't understand what they're looking at. Adversarial examples prove it.
- Who benefits, who is harmed: Facial recognition offers convenience and security for some while subjecting others — disproportionately people of color — to false matches and unwarranted scrutiny.
- Human in the loop: Whether in radiology, autonomous driving, or law enforcement, humans remain essential because machines process pixels, not consequences.
Spaced Review — Looking Back
These questions revisit material from earlier chapters to strengthen your long-term retention.
- From Chapter 1: We introduced the distinction between narrow AI and general AI. Computer vision systems are narrow AI — they perform specific visual tasks, often at superhuman levels. Why does narrow AI excellence at one task (like identifying objects) not bring us closer to general intelligence?
- From Chapter 3: We discussed the difference between supervised and unsupervised learning. Image classification using CNNs is a supervised learning task — the model trains on labeled images. What would an unsupervised approach to computer vision look like? (Hint: think about what the system would learn without labels.)
- From Chapter 4: We explored how bias enters data. The Gender Shades study showed that facial recognition worked best on lighter-skinned males. Using the framework from Chapter 4, trace how this bias entered the system. Where in the pipeline — data collection, labeling, model architecture, testing — did the problem originate?
Project Checkpoint: AI Audit Report
📐 Does your AI system use computer vision? This checkpoint applies whether the answer is yes or no.
If your system processes visual information:
- What types of visual tasks does it perform? (Classification, detection, segmentation, facial recognition, other?)
- What training data was likely used? Consider the demographics, contexts, and conditions represented.
- Can you identify potential edge cases — situations where the visual processing might fail?
- Are there documented accuracy disparities across different user groups?

If your system does not use computer vision:
- Could computer vision be added to your system? Would it help or create new risks?
- Does your system process any sensory data (audio, text, sensor readings)? How does the pixel-to-meaning gap in computer vision parallel challenges in your system?

Add to your audit report:
- A section on visual processing capabilities (or their absence)
- At least two potential failure scenarios involving visual inputs
- An assessment of whether the system's visual training data is representative of its deployment context
What's Next
In Chapter 7, we move from understanding individual AI capabilities — language models, computer vision — to understanding how AI systems make decisions. How does an AI system go from processing inputs to producing an output that affects someone's life? What does the decision pipeline look like, and where can things go wrong? We'll trace the full journey from data input to actionable decision, using ContentGuard and CityScope Predict to see how AI decisions play out in content moderation and predictive policing.
Optional Python Code
🐍 Optional Code: This code example is supplementary. You can understand the chapter fully without it. See Appendix E for setup instructions.
The following example uses a pre-trained image classification model to classify an image. It demonstrates in just a few lines how a CNN processes an image and produces predictions.
```python
# Image classification with a pre-trained model
# Requires: pip install torch torchvision Pillow
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pre-trained ResNet model (trained on ImageNet)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Prepare an image (replace 'your_image.jpg' with any image file)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
img = Image.open("your_image.jpg").convert("RGB")  # ensure 3 color channels
input_tensor = preprocess(img).unsqueeze(0)        # add a batch dimension

# Classify the image
with torch.no_grad():
    output = model(input_tensor)
probabilities = torch.nn.functional.softmax(output[0], dim=0)

# Print the top 5 predictions
_, indices = torch.topk(probabilities, 5)
categories = models.ResNet18_Weights.DEFAULT.meta["categories"]
for idx in indices:
    print(f"{categories[idx]:30s} {probabilities[idx]:.2%}")
```
Try classifying different images — a pet, a household object, a food item — and observe what the model gets right and wrong. Pay attention to its confidence scores. Does a 95% confidence score mean the model is "sure"?
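One way to see why a 95% confidence score is not the same as being "sure": the score comes from the softmax function, which measures the gaps between the model's raw outputs (logits), not any grounded certainty about the world. This sketch uses two hypothetical sets of logits for a 5-class model — the numbers are made up for illustration — to show how the same arithmetic produces moderate or near-total "confidence" depending only on how spread out the logits are.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()       # exponentiate and normalize to sum to 1

# Hypothetical raw outputs (logits) for a 5-class model.
close_logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])   # top logit barely ahead
wide_logits  = np.array([10.0, 1.0, 0.5, 0.0, -1.0])  # top logit far ahead

print(softmax(close_logits).max())  # ~0.56: moderate "confidence"
print(softmax(wide_logits).max())   # >0.999: near-certainty from the same math
```

An image utterly unlike anything in the training set can still push one logit far ahead of the others, yielding a high confidence score. The score tells you which pattern matched most strongly — not whether matching patterns was the right thing to do.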