Learning Objectives

  • Create and interpret histograms, bar charts, pie charts, and stem-and-leaf plots
  • Describe the shape of a distribution (symmetric, skewed, unimodal, bimodal)
  • Identify outliers visually and explain their potential impact
  • Choose the appropriate graph type for different variable types
  • Create basic visualizations using Python (matplotlib/seaborn) and Excel

Chapter 5: Exploring Data: Graphs and Descriptive Statistics

"The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey, statistician

Chapter Overview

Let me ask you a strange question: have you ever looked at a spreadsheet and felt... nothing?

You're staring at rows and columns of numbers. Hundreds of them. Maybe thousands. Your eyes glaze over. The data is right there, but it's not telling you anything. It's like trying to understand a song by reading the sheet music without ever hearing it played.

That's because numbers in a table are raw ingredients. They're flour and sugar and eggs sitting on a counter. A graph is what happens when you bake them into something — when the patterns, shapes, and stories hiding inside all those rows suddenly become visible.

This chapter is about learning to see data.

And I don't mean that metaphorically. I mean literally: you're going to take columns of numbers and transform them into pictures that reveal things you'd never notice otherwise. You'll see that flu cases don't spread evenly — they cluster. You'll see that watch-time data has a long tail of binge-watchers. You'll see that a basketball player's shooting percentage isn't just a single number — it's a distribution with a shape that tells a story.

That shift — from looking at individual numbers to seeing the shape of an entire distribution — is one of the most important mental leaps in all of statistics. We'll call it distribution thinking, and by the end of this chapter, you won't be able to look at data the same way again.

Fast Track: If you've created histograms and bar charts before and can already describe distributions as "skewed right" or "bimodal," skim Sections 5.1-5.3 and jump to Section 5.7 ("Distribution Shapes: The Vocabulary of Shape"). Complete quiz questions 1, 7, and 14 to verify your foundation.

Deep Dive: After this chapter, read Case Study 1 (misleading graphs in the media) for a sharp lesson in how visualizations can deceive, then Case Study 2 (Florence Nightingale's revolutionary data visualization) for inspiration on what great graphs can accomplish.


5.1 Why We Graph Data: Seeing What Numbers Can't Show

Let's start with Dr. Maya Chen.

Maya has been tracking flu cases across three communities in her county. Last week, her colleague emailed her a spreadsheet with flu case counts for 200 patients — each row containing a patient's age, the community they live in, the date they were diagnosed, and whether they were hospitalized.

She could calculate the average age of flu patients in each community. In fact, if you remember .describe() from Chapter 3, she could get the mean, median, min, max, and standard deviations for all three communities in a single line of code. And those numbers would tell her... something.

But here's what happened when she made a histogram instead.

Maya's Histogram: The Picture That Changed Everything

Visual description (histogram): A histogram showing the age distribution of flu cases in Community A. The horizontal axis shows age in 10-year bins from 0 to 90. The vertical axis shows the number of cases. Two prominent peaks appear: one tall bar around ages 0-9 (about 35 cases) and another tall bar around ages 60-69 (about 30 cases). The bars for ages 20-49 are much shorter, with only 5-10 cases each. The result is a clearly bimodal distribution — two humps with a valley in between.

The average age of flu patients in Community A? About 38 years old. Which sounds completely unremarkable. But look at that histogram: almost nobody around age 38 is getting the flu. The disease is hitting two very different groups — young children and older adults — and the "average" patient doesn't exist.

This is why we graph data. The average told Maya nothing. The histogram told her everything.

The fundamental lesson: A single number can never fully describe a dataset. A graph reveals the shape, the spread, the clusters, the gaps, and the surprises that summary statistics hide.
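To see the same effect in numbers, here is a minimal sketch using made-up ages (not Maya's actual data) for a two-hump group of patients. It relies only on Python's built-in statistics module and counts how many patients are actually close to the mean age.

```python
from statistics import mean

# Hypothetical ages: a cluster of young children and a cluster of older adults
ages = [2, 3, 4, 5, 6, 7, 8, 9, 62, 64, 65, 66, 68, 70, 71, 73]

average_age = mean(ages)

# How many patients are within 10 years of the "average" patient?
near_average = [a for a in ages if abs(a - average_age) <= 10]

print(f"Mean age: {average_age:.1f}")
print(f"Patients within 10 years of the mean: {len(near_average)}")
```

With these invented values the mean lands in the mid-30s, yet not a single patient is within 10 years of it. The summary number is arithmetically correct and still describes nobody.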

If statistics is a superpower — and I've argued since Chapter 1 that it is — then visualization is the moment your superpower first switches on. It's the moment you stop flying blind and start seeing.


5.2 Bar Charts: Picturing Categorical Data

Let's start with the simplest and most common type of graph: the bar chart.

You already know from Chapter 2 that variables come in two fundamental types: categorical and numerical. Bar charts are designed for categorical variables — variables where the values are categories or groups, not numbers on a scale.

What Is a Bar Chart?

A bar chart displays the frequency (count) or relative frequency (proportion) of each category using rectangular bars. Each category gets its own bar, and the height (or length) of the bar represents how many observations fall in that category.

Here's the key feature: the bars in a bar chart don't touch. There are gaps between them. This matters more than you might think — those gaps signal that the categories are separate and distinct. Moving from one bar to the next doesn't mean moving along a continuous scale. It means jumping to a completely different group.
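If it helps to see where the bar heights come from, the sketch below counts categories from raw labels using Python's built-in Counter. The ten device labels are invented for illustration; they are not StreamVibe's data.

```python
from collections import Counter

# Hypothetical raw data: one device label per user
device_type = ["Phone", "Smart TV", "Laptop", "Smart TV", "Phone",
               "Tablet", "Smart TV", "Phone", "Desktop", "Smart TV"]

counts = Counter(device_type)      # frequency of each category
total = sum(counts.values())

# most_common() lists categories from most to least frequent
for device, count in counts.most_common():
    print(f"{device:<10} count={count}  proportion={count / total:.2f}")
```

Each count becomes the height of one bar in a frequency bar chart, and each proportion becomes the height in a relative frequency bar chart.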

Alex's Streaming Data

Alex Rivera is analyzing user data at StreamVibe. She wants to know which device types her users prefer for streaming. Her dataset has a categorical variable called device_type with five categories: Phone, Tablet, Laptop, Smart TV, and Desktop.

Here's what her bar chart looks like:

Visual description (bar chart): A vertical bar chart showing StreamVibe device usage. The horizontal axis lists five device categories: Phone, Tablet, Laptop, Smart TV, and Desktop. The vertical axis shows the number of users, ranging from 0 to 5,000. The bars have gaps between them. Smart TV is the tallest bar (about 4,200 users), followed by Phone (about 3,800), Laptop (about 2,500), Tablet (about 1,500), and Desktop (about 800). Each bar is the same width and a uniform blue color.

Immediately, Alex can see the story: Smart TVs and phones dominate. Desktop is dying. This wasn't obvious from a table of numbers — but in one glance, the picture is unmistakable.

Bar Chart Rules (the Ones That Matter)

  1. Bars don't touch. Gaps between bars show that categories are distinct groups, not points on a continuous scale.
  2. Bar width should be uniform. If one bar is wider than another, your brain reads it as "more" — even if it isn't. Keep all bars the same width.
  3. The axis must start at zero. If you start the vertical axis at, say, 2,000 instead of 0, a bar representing 4,200 users is drawn more than twice as tall as a bar representing 3,000 users (2,200 units versus 1,000), even though 4,200 is only 40% more than 3,000. Starting at zero keeps visual comparisons honest.
  4. Order categories thoughtfully. For nominal categories (no natural order), sort by frequency — tallest to shortest — to make comparisons easier. For ordinal categories (like satisfaction ratings from "Very Dissatisfied" to "Very Satisfied"), keep the natural order.
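The arithmetic behind rule 3 is worth checking for yourself. This short sketch compares the drawn heights of two bars when the axis starts at 0 and when it starts at 2,000; the helper name drawn_height is made up for this example.

```python
def drawn_height(value, axis_start=0):
    """Height of a bar as drawn when the vertical axis begins at axis_start."""
    return value - axis_start

big, small = 4200, 3000

true_ratio = big / small
honest_ratio = drawn_height(big) / drawn_height(small)
truncated_ratio = drawn_height(big, 2000) / drawn_height(small, 2000)

print(f"True ratio of the counts:        {true_ratio:.1f}")
print(f"Bar ratio, axis starts at 0:     {honest_ratio:.1f}")
print(f"Bar ratio, axis starts at 2,000: {truncated_ratio:.1f}")
```

Only the zero baseline makes the ratio of bar heights equal the ratio of the counts.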

Python: Creating a Bar Chart

If you completed Chapter 3, you already have pandas installed. Now we're adding two new libraries: matplotlib and seaborn. Together, they're the standard toolkit for data visualization in Python.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample data: StreamVibe device usage
devices = pd.Series(['Smart TV', 'Phone', 'Laptop', 'Tablet', 'Desktop'],
                     name='Device')
counts = pd.Series([4200, 3800, 2500, 1500, 800], name='Users')

plt.figure(figsize=(8, 5))
sns.barplot(x=devices, y=counts, color='steelblue')
plt.title('StreamVibe Users by Device Type')
plt.xlabel('Device')
plt.ylabel('Number of Users')
plt.tight_layout()
plt.show()

That's it. Ten lines. And you get a clean, professional-looking bar chart.

If you're working from a DataFrame (which you probably are if you loaded a CSV in Chapter 3), it's even simpler:

# If 'device_type' is a column in your DataFrame df:
sns.countplot(data=df, x='device_type', color='steelblue',
              order=df['device_type'].value_counts().index)
plt.title('StreamVibe Users by Device Type')
plt.xlabel('Device')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

The order=df['device_type'].value_counts().index part sorts the bars from most to least common — following our rule #4 about thoughtful ordering.

Excel/Google Sheets: Creating a Bar Chart

  1. Enter your categories in Column A and your counts in Column B
  2. Select both columns (including headers)
  3. Excel: Insert tab → Charts group → click "Bar Chart" or "Column Chart" → select the simple 2D option (avoid 3D — more on this later)
  4. Google Sheets: Insert menu → Chart → Chart type → select "Column chart"
  5. Click on the chart title to rename it
  6. Right-click the vertical axis to format it — make sure it starts at zero

5.3 Pie Charts: Love Them or Leave Them

Let's address the elephant in the room: pie charts.

A pie chart displays the proportion of each category as a slice of a circle. The whole circle represents 100% of the data, and each slice's angle is proportional to that category's share.

Here's the same StreamVibe device data as a pie chart:

Visual description (pie chart): A circular pie chart divided into five slices representing StreamVibe device usage. Smart TV takes up the largest slice (about 33%), followed by Phone (about 30%), Laptop (about 19%), Tablet (about 12%), and Desktop (about 6%). Each slice is a different color. Percentage labels appear on each slice.

Pie charts are probably the most used chart type in the world — and also the most criticized by statisticians and data visualization experts. Here's why both sides have a point:

When pie charts work:

  • You have a small number of categories (3-5)
  • You want to show parts of a whole (must sum to 100%)
  • One or two categories dominate, and that dominance is the main story
  • Your audience is non-technical and familiar with pie charts

When pie charts fail:

  • You have more than 5-6 categories (the slices become impossible to compare)
  • Categories have similar proportions (can you tell the difference between 22% and 24% slices? Neither can anyone else)
  • The data doesn't represent parts of a whole
  • You need precise comparisons (bar charts are always better for this)

Here's my honest take: bar charts do everything pie charts do, and they do it better. The human eye is great at comparing lengths (bar heights) but terrible at comparing angles (pie slices). That said, pie charts aren't evil — they're familiar and intuitive for simple data. Use them when the audience and the message justify it, but default to bar charts when in doubt.

Python: Creating a Pie Chart

import matplotlib.pyplot as plt

# Same StreamVibe device counts used for the bar chart
labels = ['Smart TV', 'Phone', 'Laptop', 'Tablet', 'Desktop']
sizes = [4200, 3800, 2500, 1500, 800]

plt.figure(figsize=(7, 7))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('StreamVibe Users by Device Type')
plt.tight_layout()
plt.show()

The autopct='%1.1f%%' adds percentage labels to each slice. The startangle=90 rotates the chart so the first slice starts at 12 o'clock.
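If you're curious where the slice sizes come from, this sketch does the calculation by hand, using the same device counts. It assumes nothing beyond plain Python: each share is the count divided by the total, and each angle is that share of 360 degrees.

```python
labels = ['Smart TV', 'Phone', 'Laptop', 'Tablet', 'Desktop']
sizes = [4200, 3800, 2500, 1500, 800]

total = sum(sizes)
shares = [size / total for size in sizes]      # proportions of the whole
angles = [share * 360 for share in shares]     # slice angles in degrees

for label, share, angle in zip(labels, shares, angles):
    print(f"{label:<9} {share:6.1%}  {angle:6.1f} degrees")
```

The shares add up to 1 and the angles add up to 360, which is exactly the "parts of a whole" requirement for a pie chart.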


Check Your Understanding — Retrieval Practice #1 (try to answer without scrolling up)

  1. What's the key visual difference between a bar chart and a pie chart?
  2. Why must the vertical axis of a bar chart start at zero?
  3. Alex's bar chart showed five device categories. Should the bars touch each other? Why or why not?
  4. Name one situation where a pie chart works well and one where it doesn't.

Check your thinking

  1. A bar chart uses rectangular bars with heights proportional to frequency; a pie chart uses slices of a circle proportional to relative frequency. Bar charts display counts or proportions; pie charts always display proportions (parts of a whole).
  2. If the axis doesn't start at zero, the visual ratios between bars become distorted. A bar that's twice as tall should represent twice the frequency — but that only works when the baseline is zero.
  3. No. Gaps between bars signal that the categories are distinct groups rather than points on a continuous scale. (That holds even for ordinal categories, which have an order but no continuity.) Touching bars (no gaps) are a feature of histograms, where the data is continuous.
  4. Works well: showing that Smart TV and Phone together account for over 60% of users (few categories, clear dominance, parts-of-a-whole story). Doesn't work: comparing 6+ categories with similar percentages — the angle differences become impossible to distinguish.

5.4 Histograms: Picturing Numerical Data

Now we get to the most important graph in introductory statistics: the histogram.

If bar charts are for categorical data, histograms are for numerical data — variables where the values are numbers on a meaningful scale (ages, incomes, test scores, watch times). And understanding histograms is the gateway to one of the most powerful ideas in all of statistics: distribution thinking.

What Is a Histogram?

A histogram divides numerical data into equal-width intervals called bins and displays the count (or proportion) of observations falling in each bin as a rectangular bar. Unlike bar charts, the bars in a histogram touch — because the data is continuous. Moving from one bin to the next means moving along a continuous number line, not jumping to a different category.
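Binning is simple enough to do by hand. The sketch below sorts a handful of made-up ages into 10-year bins and prints a rough text histogram; the data and the bin width are assumptions chosen only for illustration.

```python
from collections import Counter

# Hypothetical ages, used only to illustrate binning
ages = [3, 7, 12, 18, 25, 31, 38, 44, 47, 52, 55, 58, 61, 63, 67, 71]
bin_width = 10

# Integer division maps each age to the lower edge of its bin (e.g., 47 -> 40)
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

for lower in range(0, 80, bin_width):
    count = bin_counts.get(lower, 0)
    print(f"{lower:>2}-{lower + bin_width - 1:<2} | {'#' * count}")
```

Every observation lands in exactly one bin, and the bins run in numerical order. That ordering is what gives a histogram its shape.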

A Concrete Example: Sam's Shooting Data

Sam Okafor, our sports analytics intern, has game-by-game shooting percentages for every player on the Riverside Raptors roster — 15 players, 30 games each, giving him 450 individual game shooting percentages. He could calculate the team's average (let's say it's 44.2%), but he wants to see the full picture.

He creates a histogram:

Visual description (histogram): A histogram of game-by-game shooting percentages for the Riverside Raptors. The horizontal axis shows shooting percentage from 0% to 80% in bins of 5 percentage points. The vertical axis shows frequency (number of games). The distribution is roughly bell-shaped and centered around 42-47%, with the tallest bars at 40-45% (about 85 games) and 45-50% (about 78 games). The distribution tails off on both sides, with very few games below 15% or above 70%. It is approximately symmetric with a very slight right skew.

Now Sam can see something crucial: the team's shooting isn't clustered tightly around 44%. It's spread out across a wide range — from below 15% (terrible games) to above 70% (incredible games). The shape tells him how consistent the team is, not just how good they are on average.

Histogram vs. Bar Chart: The Confusion Everyone Has

This is, without exaggeration, one of the most common points of confusion in all of introductory statistics. Let me lay it out clearly:

| Feature | Bar Chart | Histogram |
|---|---|---|
| Used for | Categorical variables | Numerical variables |
| Bars touch? | No (gaps between bars) | Yes (bars are adjacent) |
| X-axis | Category labels | Numerical scale (bins) |
| Bar order | Can be rearranged | Must follow numerical order |
| Bar width | Arbitrary (cosmetic) | Meaningful (defines bin width) |
| What bars represent | Count per category | Count per interval |

The simplest way to remember: if the x-axis has words, it's a bar chart. If the x-axis has numbers, it's probably a histogram.

And here's the deeper reason the distinction matters: rearranging the bars of a bar chart changes nothing — putting "Desktop" first or last doesn't change the story. But rearranging the bars of a histogram would destroy it — because the order of the bins represents the number line. The shape of a histogram is the story, and shape depends on order.

Building a Frequency Distribution Table

Before we make a histogram, it helps to see the underlying data structure: a frequency distribution.

A frequency distribution organizes data into classes (bins) and counts how many observations fall in each class. Here's a simplified example using Maya's flu patient ages:

| Age Group (Bin) | Frequency (Count) | Relative Frequency (Proportion) |
|---|---|---|
| 0–9 | 35 | 0.175 |
| 10–19 | 18 | 0.090 |
| 20–29 | 8 | 0.040 |
| 30–39 | 10 | 0.050 |
| 40–49 | 12 | 0.060 |
| 50–59 | 22 | 0.110 |
| 60–69 | 30 | 0.150 |
| 70–79 | 38 | 0.190 |
| 80–89 | 27 | 0.135 |
| Total | 200 | 1.000 |

The frequency is the raw count. The relative frequency is the proportion — the count divided by the total number of observations. Relative frequencies always sum to 1 (or 100%).

Why do we care about relative frequencies? Because they let you compare datasets of different sizes. If Community A has 200 flu patients and Community B has 800, raw counts would make Community B look like it always has more cases in every age group. Relative frequencies show the shape of each distribution on the same scale.
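Here is a small sketch of that comparison. Community A's counts come from the frequency table above; Community B is hypothetical, invented here with four times as many patients in every age group so that the two communities share a shape but not a size.

```python
bins = ["0-9", "10-19", "20-29", "30-39", "40-49",
        "50-59", "60-69", "70-79", "80-89"]
community_a = [35, 18, 8, 10, 12, 22, 30, 38, 27]   # counts from the table above

# Hypothetical Community B: four times as many patients, same age pattern
community_b = [count * 4 for count in community_a]

def relative_frequencies(counts):
    """Divide each count by the total so the results sum to 1."""
    total = sum(counts)
    return [count / total for count in counts]

rel_a = relative_frequencies(community_a)
rel_b = relative_frequencies(community_b)

for label, a, b in zip(bins, rel_a, rel_b):
    print(f"{label:>5}  A: {a:.3f}  B: {b:.3f}")
```

The raw counts differ by a factor of four in every row, but the relative frequencies match exactly, so the two histograms would have the same shape.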

How Many Bins? The Goldilocks Problem

Choose too few bins and you smush everything together — the histogram looks like a blob and you can't see the details. Choose too many bins and the histogram looks like a jagged mess of spikes with no clear pattern. You want something in between.

Rules of thumb:

  • For small datasets (< 50 observations): 5-7 bins
  • For medium datasets (50-300): 8-15 bins
  • For large datasets (300+): 15-25 bins
  • A popular formula: number of bins ≈ √n (the square root of the number of observations)

The good news: Python and Excel will choose a reasonable number of bins for you automatically. You can always override it if the default doesn't look right.
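As a quick check on the square-root rule, this sketch computes the suggested bin count for three sample sizes: 30 is an arbitrary small dataset, 200 matches Maya's patient count, and 450 matches Sam's shooting data. The function name sqrt_rule_bins is made up for this example.

```python
import math

def sqrt_rule_bins(n):
    """Suggested number of bins: the square root of n, rounded to a whole number."""
    return max(1, round(math.sqrt(n)))

for n in (30, 200, 450):
    print(f"n = {n:>3}  ->  about {sqrt_rule_bins(n)} bins")
```

Treat the result as a starting point, not a verdict. If the histogram looks blobby or jagged, nudge the number up or down.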

Python: Creating a Histogram

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'ages' is a pandas Series or column from a DataFrame
# Example: ages = df['age']

plt.figure(figsize=(8, 5))
sns.histplot(data=ages, bins=10, color='steelblue', edgecolor='white')
plt.title('Age Distribution of Flu Patients — Community A')
plt.xlabel('Age (years)')
plt.ylabel('Number of Patients')
plt.tight_layout()
plt.show()

Want to try different bin counts? Just change bins=10 to bins=15 or bins=20 and run the cell again. Watch how the shape changes — this builds intuition for the Goldilocks problem.

For a relative frequency histogram (showing proportions instead of counts), add stat='proportion':

sns.histplot(data=ages, bins=10, stat='proportion',
             color='steelblue', edgecolor='white')

Excel/Google Sheets: Creating a Histogram

Excel (2016+):

  1. Select your numerical data column (including the header)
  2. Insert tab → Charts group → click "Statistical Chart" (the icon with a histogram shape) → Histogram
  3. Double-click the horizontal axis to adjust bin width and number of bins
  4. Format as needed (title, labels, colors)

Google Sheets:

  1. Select your numerical data column
  2. Insert → Chart → Chart type → select "Histogram"
  3. In the Chart Editor panel on the right, click "Customize" → "Histogram" to adjust bucket size (bin width)

Older Excel versions (before 2016): You'll need the Analysis ToolPak add-in. Go to File → Options → Add-ins → select "Analysis ToolPak" → OK. Then Data tab → Data Analysis → Histogram. You'll need to define your bin boundaries manually.


5.5 Stem-and-Leaf Plots: The Analog Histogram

Before computers made histograms easy, statisticians used a clever alternative: the stem-and-leaf plot (also called a stemplot). It's partly a table, partly a graph, and it preserves the actual data values — something a histogram loses.

How It Works

In a stem-and-leaf plot, each number is split into a stem (leading digits) and a leaf (last digit). All values sharing the same stem are grouped on the same row.

Here's an example using the ages of 20 flu patients from Maya's dataset:

Raw data: 3, 5, 7, 12, 14, 15, 23, 24, 31, 33,
          45, 47, 52, 58, 61, 63, 67, 71, 74, 82

Stem | Leaves
  0  | 3 5 7
  1  | 2 4 5
  2  | 3 4
  3  | 1 3
  4  | 5 7
  5  | 2 8
  6  | 1 3 7
  7  | 1 4
  8  | 2

Turn your head sideways (or imagine the plot rotated 90 degrees) and you've got a histogram — but one where you can still read every individual value. The stem "6" has three leaves (1, 3, 7), meaning three patients were aged 61, 63, and 67.
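You can also let Python do the splitting. The following is a minimal sketch, assuming whole-number values below 100 so that the tens digit is the stem and the ones digit is the leaf; it rebuilds the plot above from the same 20 ages.

```python
from collections import defaultdict

ages = [3, 5, 7, 12, 14, 15, 23, 24, 31, 33,
        45, 47, 52, 58, 61, 63, 67, 71, 74, 82]

# Split each value into a stem (tens digit) and a leaf (ones digit)
stems = defaultdict(list)
for age in sorted(ages):
    stem, leaf = divmod(age, 10)
    stems[stem].append(leaf)

print("Stem | Leaves")
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"  {stem}  | {leaves}")
```

Looping over every stem from the smallest to the largest (rather than only the stems that have data) keeps empty rows visible, so gaps in the data still show up in the plot.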

When to Use Stem-and-Leaf Plots

  • Small datasets (10-50 values): stem-and-leaf plots shine because you see both shape and exact values
  • Comparing two groups: Back-to-back stem-and-leaf plots (leaves spreading left for one group, right for another) are great for quick comparisons
  • Quick analysis by hand: When you need a fast visual and don't have a computer handy

For datasets with more than 50 or so observations, histograms are usually better — stem-and-leaf plots get too long and unwieldy.

Back-to-Back Stem-and-Leaf: Comparing Two Groups

Sam wants to compare two players' scoring performances. Here's a back-to-back stem-and-leaf plot:

        Player A | Stem | Player B
       8 5 3 2 0 |   1  | 2 4 6 8
         7 5 3 1 |   2  | 0 3 5 7 9
           8 4 2 |   3  | 1 3 4 6 8
               6 |   4  | 2 5 8
                 |   5  | 1 3

Read Player A's leaves right-to-left from the stem: 10, 12, 13, 15, 18, 21, 23, 25, 27... Read Player B's leaves left-to-right: 12, 14, 16, 18, 20, 23, 25, 27, 29...

At a glance, you can see that Player B tends to score higher, with more values in the 30s and 40s.
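Because a stem-and-leaf plot keeps every value, you can back up that impression with a quick calculation. The sketch below types in the values read off the plot (stem as tens, leaf as ones) and compares the two players using the median and a count of 30+ games; choosing 30 as the cutoff is just an illustration.

```python
from statistics import median

# Values read off the back-to-back plot above (stem = tens, leaf = ones)
player_a = [10, 12, 13, 15, 18, 21, 23, 25, 27, 32, 34, 38, 46]
player_b = [12, 14, 16, 18, 20, 23, 25, 27, 29,
            31, 33, 34, 36, 38, 42, 45, 48, 51, 53]

for name, scores in (("Player A", player_a), ("Player B", player_b)):
    high_games = sum(1 for s in scores if s >= 30)
    print(f"{name}: median = {median(scores)}, games with 30+ = {high_games}")
```

Player B comes out ahead on both measures, which matches what the picture suggests.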


5.6 Productive Struggle: Before I Tell You the Vocabulary

Before I teach you the official terms for describing distribution shapes, I want you to try something. Look at the three histogram descriptions below and, in your own words, describe what you see. Don't worry about using the "right" terms — just describe the shape as if you were telling a friend what the picture looks like.

Histogram A: A histogram of household income in the United States. Most bars are clustered on the left side (lower incomes), with the tallest bars around $30,000-$50,000. The bars get progressively shorter as you move to the right, with a long thin tail stretching out toward $200,000 and beyond. A few very short bars appear far to the right near $500,000+.

Histogram B: A histogram of human body temperatures (in Fahrenheit) for 1,000 healthy adults. The bars form a neat, bell-shaped mound centered around 98.6°F. The bars on the left side of the mound are roughly a mirror image of the bars on the right side. Almost all observations fall between 97.0°F and 100.0°F.

Histogram C: A histogram of arrival times at a national park visitor center on a summer Saturday. There are two distinct humps: one in the morning (peaking around 10 AM) and one in the afternoon (peaking around 2 PM). The bars are low around noon (lunch time) and very low before 8 AM and after 5 PM.

Your task: For each histogram, answer these questions in your own words:

  1. Where is the "center" of the data? (Or does it even have a single center?)
  2. How spread out is the data?
  3. Is the shape balanced/symmetric, or does it lean to one side?
  4. Are there any unusual features — gaps, isolated bars, multiple humps?

Take a genuine minute with this before moving on. Your descriptions don't have to be fancy. "It looks like a hill that leans to the right" is a perfectly good answer.

Compare your descriptions to the vocabulary below

Histogram A (household income): You probably noticed that the data is piled up on the left with a long tail stretching to the right. That's called skewed right (or positively skewed). There's no single clear center: the mean is pulled toward the tail (higher than the median). The data is spread across a huge range, from near zero to $500,000+. There are a few extreme values far to the right, and those are outliers.

Histogram B (body temperatures): You probably described this as "balanced," "symmetric," or "bell-shaped." The data has a single clear center (around 98.6°F) and tapers off equally on both sides. Statisticians call this symmetric and unimodal (one peak/hump). This shape is incredibly common in nature, and we'll meet it again as the normal distribution in Chapter 10.

Histogram C (park arrival times): You probably noticed the two humps. That makes it bimodal: two modes, two peaks. The data doesn't have a single center; it has two. Bimodal distributions often mean your data contains two distinct groups behaving differently (morning visitors and afternoon visitors).

If your descriptions captured the key ideas — even without the official terms — you're already thinking about distributions the right way. Now let's formalize it.


5.7 Distribution Shapes: The Vocabulary of Shape

When statisticians look at a histogram, they describe its shape using a specific vocabulary. This vocabulary is your toolkit for communicating what data looks like — quickly, precisely, and in a way other analysts will understand immediately.

The Four Things to Describe

Every time you look at a histogram, describe these four features:

  1. Shape — Is it symmetric? Skewed? How many peaks?
  2. Center — Where is the "middle" of the data?
  3. Spread — How far does the data stretch?
  4. Unusual features — Outliers? Gaps? Clusters?

Let's tackle shape first.

Symmetric vs. Skewed

A distribution is symmetric if the left side is (approximately) a mirror image of the right side. If you folded the histogram in half at the center, the two halves would roughly overlap.

A distribution is skewed if one tail is longer than the other. There are two types:

  • Skewed right (positively skewed): The right tail is longer. Data is piled up on the left with a few extreme values stretching to the right. Example: household income (most people earn moderate amounts; a few earn millions).

  • Skewed left (negatively skewed): The left tail is longer. Data is piled up on the right with a few extreme values stretching to the left. Example: exam scores on an easy test (most students score high; a few bomb it).

Memory trick: The skew is named for the direction of the tail, not the hump. A right-skewed distribution has its hump on the left and its tail pointing right. This trips people up, so let me say it again: the tail tells you the skew.
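One practical consequence of a long right tail is that the mean gets dragged toward it while the median barely moves. Here is a minimal sketch with ten invented incomes (hypothetical values, chosen so that one earner is far above the rest).

```python
from statistics import mean, median

# Hypothetical annual incomes (in dollars): most are moderate, one is very large
incomes = [28_000, 32_000, 35_000, 38_000, 41_000,
           45_000, 48_000, 52_000, 60_000, 400_000]

print(f"Median: {median(incomes):,.0f}")
print(f"Mean:   {mean(incomes):,.0f}")
```

Nine of the ten people earn less than the mean here. When you see a mean well above the median, suspect a right skew and make the histogram to confirm it.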

Unimodal, Bimodal, and Multimodal

A mode in a histogram is a peak — a bar that's taller than its neighbors on both sides. (This is related to but not identical to the statistical mode, which is the most frequent value.)

  • Unimodal: One peak. The most common shape. Body temperatures, test scores, heights.
  • Bimodal: Two peaks. Often signals two distinct groups in the data. Maya's flu cases (young children and older adults). Park arrival times. Geyser eruption durations.
  • Multimodal: Three or more peaks. Less common, but it happens. Course evaluation ratings sometimes show three peaks (students who loved it, hated it, or were neutral).
  • Uniform: No peaks at all — all bars are roughly the same height. Rolling a fair die many times produces a uniform distribution.
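The "taller than its neighbors on both sides" definition translates directly into code. This sketch counts peaks in a list of bar heights, treating the space beyond each end of the histogram as a bar of height 0. The function name count_peaks and the second dataset are made up for illustration; the first list is the flu frequency table from Section 5.4.

```python
def count_peaks(bar_heights):
    """Count bars strictly taller than both neighbors (a missing neighbor counts as 0)."""
    padded = [0] + list(bar_heights) + [0]
    return sum(
        1
        for left, bar, right in zip(padded, padded[1:], padded[2:])
        if bar > left and bar > right
    )

flu_counts = [35, 18, 8, 10, 12, 22, 30, 38, 27]   # frequency table from Section 5.4
one_hump = [2, 9, 21, 30, 22, 10, 3]               # hypothetical unimodal histogram

print(count_peaks(flu_counts))
print(count_peaks(one_hump))
```

Real data is noisier than these examples, so a literal count like this can report small bumps as extra peaks. Use it as a prompt to look at the histogram, not as a replacement for your own judgment about which humps matter.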

Outliers

An outlier is an observation that falls far from the rest of the data. In a histogram, outliers show up as isolated bars separated from the main body of the distribution by a gap.

Here's the important thing about outliers: they're not automatically errors, and they're not automatically unimportant. An outlier could be:

  • A data entry error: Someone typed 1000 instead of 100. (Fix it.)
  • A measurement anomaly: The thermometer malfunctioned. (Investigate and possibly remove it.)
  • A genuine extreme value: Someone actually earns $50 million a year. (Keep it, but be aware it pulls the mean.)
  • The most interesting part of your data: A patient who recovered impossibly fast might be the key to a medical breakthrough. (Study it carefully.)

The right response to an outlier is always to investigate, not to automatically delete it.
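To show what "separated from the main body by a gap" can look like in code, here is a rough sketch that flags values cut off from the largest cluster by an unusually big jump. The function name isolated_values and the gap_factor setting are assumptions invented for this example, not a standard statistical rule, and the readings are hypothetical.

```python
def isolated_values(values, gap_factor=5):
    """Return values separated from the main body by an unusually large gap.

    gap_factor is an arbitrary choice for this sketch, not a standard rule.
    """
    data = sorted(values)
    if len(data) < 3:
        return []
    gaps = [b - a for a, b in zip(data, data[1:])]
    typical_gap = sorted(gaps)[len(gaps) // 2]        # a middle-sized gap
    threshold = gap_factor * max(typical_gap, 1)      # avoid a zero threshold

    # Start a new cluster wherever the jump to the next value is too large
    clusters = [[data[0]]]
    for value, gap in zip(data[1:], gaps):
        if gap > threshold:
            clusters.append([value])
        else:
            clusters[-1].append(value)

    main_body = max(clusters, key=len)                # the biggest cluster
    return [v for cluster in clusters if cluster is not main_body for v in cluster]

# Hypothetical readings where 100 was mistyped as 1000
readings = [95, 98, 99, 100, 101, 102, 105, 1000]
print(isolated_values(readings))
```

A flag from a heuristic like this is only a list of candidates to investigate. It cannot tell you whether a value is a typo, a broken instrument, or the most interesting observation in your dataset.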


Check Your Understanding — Retrieval Practice #2 (try to answer without scrolling up)

  1. What does "skewed right" mean? Which way does the tail point? Which way does the hump lean?
  2. Maya's flu data showed two peaks — one for children, one for older adults. What's the term for this shape?
  3. What's the difference between frequency and relative frequency?
  4. Name three possible explanations for an outlier.

Check your thinking

  1. Skewed right means the right tail is longer. The tail points to the right (toward larger values). The bulk of the data (the hump) is on the left (at smaller values). Income distributions are a classic example.
  2. Bimodal — two modes (peaks). Bimodal distributions often indicate two distinct subgroups in the data.
  3. Frequency is the raw count of observations in a category or bin. Relative frequency is the proportion — frequency divided by the total number of observations. Relative frequencies sum to 1 (or 100%).
  4. (Any three of): data entry error, measurement anomaly, genuine extreme value, or an observation that's actually the most important part of your data and deserves deeper investigation.

5.8 The Threshold Concept: Distribution Thinking

THRESHOLD CONCEPT: Distribution Thinking

Here's the mental shift that separates someone who "does statistics" from someone who thinks statistically.

Before this chapter, you probably thought of data as individual numbers. Sam's player shot 44% from the field. Maya's average flu patient is 38 years old. Alex's users watch an average of 47 minutes per session.

After this chapter, you should start seeing data as distributions — entire shapes with centers, spreads, peaks, tails, and outliers. Not "the average is 44%" but "the distribution of shooting percentages is symmetric and unimodal, centered around 44%, with a spread from about 15% to 70%." Not "the average age is 38" but "the age distribution is bimodal, with peaks in children and older adults, meaning the average age describes almost nobody."

This is distribution thinking: the habit of asking not just "what's the typical value?" but "what does the whole shape look like?"

Why does this matter? Because two datasets can have the same average but completely different shapes — and therefore completely different stories. A symmetric distribution with mean 50 tells a very different story than a skewed distribution with mean 50. An average of $60,000 means something very different in a community where everyone earns between $50,000 and $70,000 versus a community where half earn $30,000 and half earn $90,000.
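The two-communities example is easy to verify. The sketch below builds two hypothetical groups of ten earners that follow the description above (one tightly packed between $50,000 and $70,000, one split between $30,000 and $90,000) and compares their means and ranges.

```python
from statistics import mean

# Hypothetical communities of ten earners each, both averaging $60,000
tight_community = [50_000, 52_000, 55_000, 58_000, 60_000,
                   60_000, 62_000, 65_000, 68_000, 70_000]
split_community = [30_000] * 5 + [90_000] * 5

for name, incomes in (("Tight", tight_community), ("Split", split_community)):
    spread = max(incomes) - min(incomes)
    print(f"{name}: mean = {mean(incomes):,.0f}, range = {spread:,}")
```

The means are identical, but in the split community nobody earns anything close to $60,000. The summary number is the same; the stories are not.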

The shape is the story. And from now on, every time someone gives you a single summary number — a mean, a percentage, a "typical value" — I want you to ask: but what does the distribution look like?

This threshold concept will resurface in Chapter 7 (checking distributions during data cleaning), Chapter 10 (the normal distribution as a model), and Chapter 15 (choosing the right test based on distribution shape).


5.9 Which Graph Should I Use? A Decision Guide

This is the question students ask me most often, and it has a clear answer. Here's your decision flowchart:

What type of variable(s) are you graphing?
│
├── ONE CATEGORICAL variable
│   ├── Bar chart (default — always works)
│   └── Pie chart (only if ≤ 5 categories and showing parts of a whole)
│
├── ONE NUMERICAL variable
│   ├── Histogram (default — always works)
│   ├── Stem-and-leaf plot (small datasets, ≤ 50 observations)
│   └── Box plot (comparing groups — see Chapter 6)
│
├── TWO CATEGORICAL variables
│   ├── Grouped bar chart (side-by-side bars)
│   └── Stacked bar chart (bars divided into segments)
│
├── ONE CATEGORICAL + ONE NUMERICAL
│   ├── Side-by-side histograms
│   ├── Side-by-side box plots (Chapter 6)
│   └── Overlaid histograms (with transparency)
│
└── TWO NUMERICAL variables
    └── Scatterplot (Chapter 22 — quick preview below)

The Quick Reference Table

| Your Data | Best Graph | Why |
|---|---|---|
| Categories (one variable) | Bar chart | Compares frequencies across groups |
| Categories (parts of whole) | Pie chart (use sparingly) | Shows proportions summing to 100% |
| Numbers (one variable) | Histogram | Shows distribution shape |
| Numbers (small dataset) | Stem-and-leaf | Preserves exact values + shows shape |
| Categories vs. numbers | Side-by-side histograms or box plots | Compares distributions across groups |
| Numbers vs. numbers | Scatterplot | Shows relationship between two numerical variables |

Remember from Chapter 2: the first step is always identifying your variable types. Categorical variables get bar charts. Numerical variables get histograms. This connection between variable classification and graph choice is one of the most practical skills you'll use throughout the course.
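If you like, you can turn the first branch of the decision guide into a tiny helper. This is only a sketch: suggest_graph is a hypothetical function name (not part of pandas), and it relies on the column's dtype, so numbers that are really category codes (ZIP codes, 1-to-5 star ratings) will fool it. You still have to think about what the variable means.

```python
import pandas as pd

def suggest_graph(column: pd.Series) -> str:
    """Suggest a default graph for one column, based only on its dtype."""
    # True/False columns count as numeric in pandas, but they are really categories
    if pd.api.types.is_bool_dtype(column):
        return "bar chart"
    if pd.api.types.is_numeric_dtype(column):
        return "histogram"
    return "bar chart"

# Tiny made-up DataFrame, just to try the helper out
example_df = pd.DataFrame({
    "device_type": ["Phone", "Tablet", "Phone", "Smart TV"],
    "watch_time_minutes": [22.5, 41.0, 18.0, 95.5],
})

for name in example_df.columns:
    print(name, "->", suggest_graph(example_df[name]))
```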

A Quick Preview: Time Series Plots

There's one more graph type you'll encounter often, even though we won't study it in depth until later: the time series plot (also called a line graph). A time series plot shows how a numerical variable changes over time, with time on the horizontal axis and the measured variable on the vertical axis. Points are connected by lines to emphasize the trend.

Maya might plot weekly flu case counts over a year to see the seasonal pattern. Alex might plot daily average watch time over six months to spot the effect of a new feature launch.

We mention this here because time series plots are everywhere — in news articles, business dashboards, COVID tracking websites — and you should be able to read them. But the statistical techniques for analyzing time series data are beyond the scope of this chapter.
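If you're curious what one looks like in code, here is a minimal sketch with matplotlib. The weekly counts are invented for illustration and are not Maya's data; the only real ingredients are time on the horizontal axis and points joined by a line.

```python
import matplotlib.pyplot as plt

# Made-up weekly counts for illustration only (not Maya's real data)
weeks = list(range(1, 9))
cases = [120, 135, 160, 150, 110, 80, 55, 40]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(weeks, cases, marker="o")  # time on the x-axis, points joined by lines
ax.set_title("Weekly Flu Cases (illustrative data)")
ax.set_xlabel("Week")
ax.set_ylabel("Number of Cases")
fig.tight_layout()
```

In a script, add plt.show() at the end to display the figure; in a Jupyter notebook it usually appears on its own.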


5.10 Python: Building Your Visualization Toolkit

Let's put it all together. Here are the Python patterns you'll use most often, building on the pandas skills from Chapter 3.

Setting Up Your Visualization Environment

At the top of any notebook where you'll make graphs, include these imports:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: make plots look nicer
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 5)

The Complete Histogram Workflow

Let's say Maya loaded her flu data into a DataFrame called flu_df with a column called age:

# Basic histogram
plt.figure(figsize=(8, 5))
sns.histplot(data=flu_df, x='age', bins=10,
             color='steelblue', edgecolor='white')
plt.title('Age Distribution of Flu Patients')
plt.xlabel('Age (years)')
plt.ylabel('Number of Patients')
plt.tight_layout()
plt.show()

Comparing Groups with Overlaid Histograms

Maya wants to compare age distributions across two communities:

plt.figure(figsize=(9, 5))
sns.histplot(data=flu_df, x='age', hue='community',
             bins=10, alpha=0.5, edgecolor='white')
plt.title('Age Distribution by Community')
plt.xlabel('Age (years)')
plt.ylabel('Number of Patients')
plt.tight_layout()
plt.show()

The hue='community' parameter splits the data by the community column and uses different colors. The alpha=0.5 makes the bars semi-transparent so overlapping areas are visible.
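When the overlap gets hard to read, the decision guide's other option is side-by-side histograms. Below is one possible sketch using plain matplotlib. The demo_df DataFrame and its community names "A" and "B" are made up as a stand-in for flu_df. The important choice is sharing both axes, so the two panels use the same scales and can be compared fairly.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tiny made-up stand-in for flu_df; the community labels are hypothetical
demo_df = pd.DataFrame({
    "age": [3, 5, 7, 8, 34, 66, 71, 75, 22, 25, 31, 38, 41, 45, 52, 29],
    "community": ["A"] * 8 + ["B"] * 8,
})

groups = sorted(demo_df["community"].unique())
fig, axes = plt.subplots(1, len(groups), figsize=(10, 4),
                         sharex=True, sharey=True)  # same scales in every panel

bin_edges = list(range(0, 90, 10))  # 0-10, 10-20, ..., 70-80 for every group
for ax, name in zip(axes, groups):
    ages = demo_df.loc[demo_df["community"] == name, "age"]
    ax.hist(ages, bins=bin_edges, color="steelblue", edgecolor="white")
    ax.set_title(f"Community {name}")
    ax.set_xlabel("Age (years)")

axes[0].set_ylabel("Number of Patients")
fig.tight_layout()
```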

Bar Chart from a DataFrame

plt.figure(figsize=(8, 5))
sns.countplot(data=stream_df, x='device_type',
              order=stream_df['device_type'].value_counts().index,
              color='steelblue')
plt.title('Users by Device Type')
plt.xlabel('Device')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Saving Your Plots

Instead of just displaying a plot, you can save it as an image file:

plt.savefig('flu_age_histogram.png', dpi=150, bbox_inches='tight')

Put this line before plt.show(). Once show() has displayed the figure, matplotlib typically closes it, so a savefig() call placed afterward tends to write out a blank image.
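Here is a short sketch of that ordering from start to finish. The ages are made up, and writing to the system's temporary folder is just a choice for this example so it doesn't clutter your project; in your own work you would pass whatever filename you like.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist([2, 4, 5, 7, 35, 62, 68, 71, 74, 80], bins=8,
        color="steelblue", edgecolor="white")  # made-up ages, for illustration
ax.set_title("Age Distribution (illustrative data)")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Number of Patients")

# 1. Save first, while the figure still exists
out_path = os.path.join(tempfile.gettempdir(), "flu_age_histogram.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")

# 2. Only then display it (plt.show() would go here in a script)
print("Saved to", out_path)
```

Saving through the figure object (fig.savefig) instead of plt.savefig also removes any doubt about which figure is being written.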

From .describe() to Visualization

In Chapter 3, you used .describe() to get summary statistics. Now you can see what those numbers mean. Compare:

# Chapter 3 approach — numbers
print(flu_df['age'].describe())

# Chapter 5 approach — picture
sns.histplot(data=flu_df, x='age', bins=10)
plt.show()

The .describe() output tells you the mean is 38 and the standard deviation is 25. The histogram shows you why — it's because the distribution is bimodal with peaks at both ends of the age range, not because everyone clusters around 38. The numbers and the pictures complement each other. Neither is complete without the other.
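You can check the "nobody is near the mean" idea with a few lines of plain Python. The ages below are synthetic, invented to mimic a bimodal pattern with a mean of 38; they are not Maya's actual records.

```python
import statistics

# Synthetic ages with two clusters (children and older adults), for illustration
ages = [2, 3, 4, 5, 6, 7, 8, 66, 68, 70, 71, 72, 74, 76]

mean_age = statistics.mean(ages)

# Who is actually close to the "average" patient?
near_mean = [a for a in ages if abs(a - mean_age) <= 10]

print(f"Mean age: {mean_age:.1f}")
print(f"Patients within 10 years of the mean: {len(near_mean)} of {len(ages)}")
```

The mean is 38, and not one of the 14 patients is within ten years of it.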


5.11 Common Graphing Mistakes (and How to Avoid Them)

Data visualization is powerful — which means it's also easy to misuse, whether intentionally or accidentally. Here are the mistakes you'll see most often:

Mistake 1: Truncated Axes

A bar chart where the vertical axis starts at 50 instead of 0. Suddenly a bar at 55 looks five times taller than a bar at 51 — even though the actual difference is tiny. This is the single most common way graphs mislead.

The fix: For bar charts, always start the axis at zero. (Line graphs and scatterplots are different — zero isn't always meaningful for those, and starting at zero can actually compress the interesting variation into a flat line.)
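The arithmetic behind that distortion is simple enough to check yourself. In this sketch, visual_ratio is a hypothetical helper written for this example: it compares how tall two bars are drawn, which depends on where the axis starts.

```python
def visual_ratio(taller: float, shorter: float, axis_start: float = 0.0) -> float:
    """How many times taller one bar is drawn than another, given the axis start."""
    return (taller - axis_start) / (shorter - axis_start)

honest = visual_ratio(55, 51, axis_start=0)      # bars drawn from zero
truncated = visual_ratio(55, 51, axis_start=50)  # bars drawn from 50

print(f"Axis from 0:  the taller bar looks {honest:.2f}x as tall")
print(f"Axis from 50: the taller bar looks {truncated:.2f}x as tall")
```

Starting at zero, the bars differ by about 8% (a ratio of 1.08). Starting at 50, the same two values are drawn at a ratio of 5.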

Mistake 2: 3D Charts

Three-dimensional bar charts and pie charts look flashy. They're also actively misleading. The perspective distortion makes bars in the back look shorter than bars in the front, even when they represent the same value. Pie slices angled toward the viewer look larger than slices angled away.

The fix: Never use 3D charts for serious data analysis. Ever. If someone sends you a 3D pie chart, politely ask for the 2D version.

Mistake 3: Misleading Bin Widths in Histograms

Using unequal bin widths in a histogram makes bars with wider bins look more prominent — not because they contain more data, but because they take up more space. This distorts the shape of the distribution.

The fix: Use equal bin widths. Let Python or Excel choose the default, or set a consistent width manually.
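To see the distortion in numbers, here is a small sketch with NumPy's histogram function. The data are the whole numbers 0 through 39, chosen because they are perfectly evenly spread, so an honest histogram should be flat.

```python
import numpy as np

values = np.arange(40)  # 0, 1, ..., 39: perfectly evenly spread

# Four bins of width 10 versus two bins of width 10 plus one of width 20
equal_counts, _ = np.histogram(values, bins=[0, 10, 20, 30, 40])
unequal_counts, _ = np.histogram(values, bins=[0, 10, 20, 40])

print("Equal-width bins:  ", equal_counts)
print("Unequal-width bins:", unequal_counts)
```

With equal widths every bin holds 10 values. With the wide last bin, the counts are 10, 10, and 20, which looks like a peak on the right even though the data have no peak at all.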

Mistake 4: Using a Bar Chart for Numerical Data (or a Histogram for Categorical Data)

This is the bar-chart-vs-histogram confusion in action. If you make a bar chart of ages (with one bar for age 23, another for age 24, another for age 25...), you'll get a useless mess of dozens of skinny bars. If you make a histogram of device types, Python won't necessarily stop you: sns.histplot() will draw one bar per category, but the touching bars suggest a number line that doesn't exist. What you get is a bar chart in disguise, and a confusing one.

The fix: Check your variable type first. Categorical → bar chart. Numerical → histogram. Always.

Mistake 5: Pie Charts with Too Many Categories

A pie chart with 15 slices is a circle of confusion. Nobody can compare fifteen narrow slices and extract meaning.

The fix: Limit pie charts to 5 categories, max. If you have more, use a bar chart — or group smaller categories into an "Other" category.
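Grouping the small categories takes only a couple of lines in pandas. The sketch below is one way to do it; lump_small_categories is a hypothetical helper name and the genre labels are invented for the example.

```python
import pandas as pd

def lump_small_categories(column: pd.Series, keep: int = 4) -> pd.Series:
    """Keep the `keep` most common categories and relabel the rest as 'Other'."""
    top = column.value_counts().index[:keep]
    # where() keeps values that pass the test and replaces the rest
    return column.where(column.isin(top), "Other")

# Hypothetical genre labels, invented for this example
genres = pd.Series(["Drama"] * 5 + ["Comedy"] * 4 + ["Action"] * 3 +
                   ["Sci-Fi"] * 2 + ["Horror", "Western", "Musical"])

lumped = lump_small_categories(genres, keep=4)
print(lumped.value_counts())
```

Seven genres become five slices: the four most common plus "Other". Keeping four categories is just the choice that lands on the five-slice limit; adjust keep for your own data.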

Mistake 6: Missing Labels and Titles

A graph without axis labels is like a sentence without a verb: technically possible, but it doesn't communicate. Every graph needs:

  • A descriptive title
  • Labels on both axes (with units!)
  • A legend if multiple groups are plotted

The fix: Always add plt.title(), plt.xlabel(), and plt.ylabel() in your Python code. In Excel, click the "+" button next to the chart to add chart elements.
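If you tend to forget labels, a small self-check can help. The sketch below assumes you built your plot on a matplotlib Axes object; missing_labels is a hypothetical helper written for this example, and it only checks that the labels exist, not that they are good ones.

```python
import matplotlib.pyplot as plt

def missing_labels(ax) -> list:
    """Return the names of any basic labels that are still empty on an Axes."""
    checks = {
        "title": ax.get_title(),
        "x-axis label": ax.get_xlabel(),
        "y-axis label": ax.get_ylabel(),
    }
    return [name for name, text in checks.items() if not text.strip()]

fig, ax = plt.subplots()
ax.bar(["Free", "Basic", "Premium"], [3, 2, 1])  # made-up counts
ax.set_title("Users by Subscription Tier")
ax.set_xlabel("Tier")
# Oops: no y-axis label yet

print(missing_labels(ax))
```

Here the check reports the y-axis label as missing, which is your cue to add ax.set_ylabel('Count').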


Check Your Understanding — Retrieval Practice #3 (try to answer without scrolling up)

  1. Name three common graphing mistakes that can make a visualization misleading.
  2. In the graph decision guide, what type of graph do you use for one categorical variable? For one numerical variable?
  3. What Python library do you use for creating statistical visualizations? What's the function for making a histogram?
  4. Why does Maya's histogram tell a different story than her .describe() output?

Check your thinking

  1. (Any three of): truncated axes (not starting at zero), 3D charts, unequal bin widths, wrong graph type for the variable, pie charts with too many categories, missing labels/titles.
  2. One categorical variable → bar chart. One numerical variable → histogram.
  3. seaborn (often abbreviated sns). The function for histograms is sns.histplot(). (matplotlib's plt.hist() also works but seaborn provides nicer defaults.)
  4. .describe() reported a mean age of about 38, which sounds like flu patients are middle-aged. But the histogram revealed a bimodal distribution — most patients are either very young (children) or older adults. The "average" patient (age 38) barely exists in the data. The shape of the distribution tells a completely different story than the single mean value.

5.12 Putting It All Together: Alex's StreamVibe Dashboard

Let's walk through a complete visualization analysis using Alex Rivera's StreamVibe data. She has a dataset of 5,000 users with several variables:

  • device_type (categorical — Phone, Tablet, Laptop, Smart TV, Desktop)
  • subscription_tier (categorical — Free, Basic, Premium)
  • watch_time_minutes (numerical — minutes watched per session)
  • age (numerical — user age in years)
  • satisfaction_rating (ordinal — 1 to 5 stars)

Step 1: Categorical Variables → Bar Charts

Alex starts by visualizing her categorical variables:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Device type bar chart
sns.countplot(data=stream_df, x='device_type',
              order=stream_df['device_type'].value_counts().index,
              color='steelblue', ax=axes[0])
axes[0].set_title('Users by Device Type')
axes[0].set_xlabel('Device')
axes[0].set_ylabel('Count')

# Subscription tier bar chart
sns.countplot(data=stream_df, x='subscription_tier',
              order=['Free', 'Basic', 'Premium'],
              color='coral', ax=axes[1])
axes[1].set_title('Users by Subscription Tier')
axes[1].set_xlabel('Tier')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

Visual description (side-by-side bar charts): Two bar charts displayed side by side. The left chart shows device type usage: Smart TV is the tallest bar (about 1,700 users), followed by Phone (about 1,500), Laptop (about 900), Tablet (about 550), and Desktop (about 350). The right chart shows subscription tiers: Free is tallest (about 2,200), Basic is next (about 1,800), and Premium is shortest (about 1,000). Both charts have gaps between bars, axes starting at zero, and clear labels.
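Counts are easier to compare once you turn them into relative frequencies (count divided by total). Here is a quick sketch using the approximate counts from the device-type chart described above. On a real DataFrame, stream_df['device_type'].value_counts(normalize=True) gives the same kind of result directly.

```python
# Approximate counts read off the device-type bar chart described above
device_counts = {"Smart TV": 1700, "Phone": 1500, "Laptop": 900,
                 "Tablet": 550, "Desktop": 350}

total = sum(device_counts.values())

# Relative frequency = frequency / total
relative_freq = {device: count / total for device, count in device_counts.items()}

for device, share in relative_freq.items():
    print(f"{device:<9} {share:.1%}")
```

The counts add up to 5,000 users, and Smart TV accounts for 34% of them.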

Step 2: Numerical Variables → Histograms

Now Alex visualizes her numerical variables:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Watch time histogram
sns.histplot(data=stream_df, x='watch_time_minutes', bins=20,
             color='steelblue', edgecolor='white', ax=axes[0])
axes[0].set_title('Distribution of Watch Time per Session')
axes[0].set_xlabel('Minutes')
axes[0].set_ylabel('Number of Sessions')

# Age histogram
sns.histplot(data=stream_df, x='age', bins=15,
             color='coral', edgecolor='white', ax=axes[1])
axes[1].set_title('Age Distribution of Users')
axes[1].set_xlabel('Age (years)')
axes[1].set_ylabel('Number of Users')

plt.tight_layout()
plt.show()

Visual description (watch time histogram): A histogram of session watch times. The distribution is clearly skewed right: most sessions are clustered in the 10-40 minute range (with the tallest bars around 20-30 minutes), but a long tail stretches to the right, with a few sessions lasting 120-180+ minutes. These are the binge-watchers — a small but important group. The center is around 30 minutes, but the mean is pulled to the right by the long tail, so it's higher than the median.

Visual description (age histogram): A histogram of user ages. The distribution is roughly unimodal and approximately symmetric, centered around 30-35 years old. It tails off gradually on both sides, with fewer users under 18 and over 55. There are no obvious outliers or gaps.
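The claim in the watch-time description, that a long right tail pulls the mean above the median, is easy to verify on a small example. The session lengths below are made up: mostly short sessions plus two binge sessions.

```python
import statistics

# Made-up session lengths in minutes: mostly short, plus two binge sessions
sessions = [10, 15, 20, 20, 25, 25, 30, 30, 35, 40, 150, 180]

mean_minutes = statistics.mean(sessions)
median_minutes = statistics.median(sessions)

print(f"Mean:   {mean_minutes:.1f} minutes")
print(f"Median: {median_minutes:.1f} minutes")
```

The median is 27.5 minutes, but the two long sessions drag the mean up to about 48.3 minutes.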

Step 3: Describe What You See

Now Alex writes up her findings using the vocabulary from Section 5.7:

"Watch time is skewed right and unimodal, with most sessions lasting 10-40 minutes and a long tail of binge sessions extending past 120 minutes. The outliers on the high end represent our most engaged users — roughly 3% of sessions exceed 90 minutes. These users likely have very different viewing habits from typical users and may warrant separate analysis.

User age is approximately symmetric and unimodal, centered around 30-35 years old. The distribution doesn't show major skew or any obvious bimodal pattern, suggesting our platform appeals fairly evenly across age groups in the 20-45 range, with drop-off at both extremes."

This is distribution thinking in action. Alex isn't just reporting means — she's describing shapes, identifying features, and drawing preliminary conclusions.
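A statement like "roughly 3% of sessions exceed 90 minutes" comes from a one-line calculation. The sketch below shows the idea on a tiny made-up sample, so the percentage it prints is not Alex's 3%. The helper name share_above is hypothetical; with her data you would pass stream_df['watch_time_minutes'].

```python
import pandas as pd

def share_above(values: pd.Series, threshold: float) -> float:
    """Fraction of observations strictly greater than `threshold`."""
    # Comparing gives True/False; the mean of True/False values is the proportion of Trues
    return (values > threshold).mean()

# Tiny made-up sample of session lengths in minutes
watch_time = pd.Series([12, 18, 22, 25, 27, 31, 36, 44, 95, 130])

print(f"{share_above(watch_time, 90):.0%} of sessions exceed 90 minutes")
```

In this toy sample, 2 of 10 sessions are longer than 90 minutes, so the result is 20%.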


5.13 Project Checkpoint: Your Turn

DATA DETECTIVE PORTFOLIO — Chapter 5

It's time to apply what you've learned to your own dataset. If you've been following along with the progressive project, you already have a dataset loaded in a Jupyter notebook from Chapter 3 and you've evaluated its collection method in Chapter 4. Now you'll make it visual.

Your tasks:

  1. Identify 3-4 variables in your dataset — at least one categorical and at least one numerical.

  2. Create appropriate graphs for each variable:

     • Bar charts for categorical variables
     • Histograms for numerical variables
     • At least one graph that compares groups (e.g., overlaid histograms or side-by-side bar charts)

  3. Describe each distribution using the vocabulary from this chapter:

     • Shape (symmetric, skewed left/right, unimodal, bimodal)
     • Center (approximately where?)
     • Spread (narrow or wide?)
     • Unusual features (outliers, gaps, clusters)

  4. Write a 2-3 paragraph summary of what your graphs reveal that .describe() alone didn't show you. (Remember how Maya's mean age of 38 hid the bimodal pattern? Look for surprises like that in your own data.)

Code template to get started:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load your dataset (update the path)
df = pd.read_csv('your_dataset.csv')

# Categorical variable bar chart
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='your_categorical_column',
              order=df['your_categorical_column'].value_counts().index,
              color='steelblue')
plt.title('Distribution of [Your Variable]')
plt.xlabel('[Variable Name]')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Numerical variable histogram
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='your_numerical_column', bins=15,
             color='coral', edgecolor='white')
plt.title('Distribution of [Your Variable]')
plt.xlabel('[Variable Name and Units]')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
```
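Task 2 also asks for a graph that compares groups. Here is one possible starting point, written with plain matplotlib. The function name overlaid_histograms and the demo column names are placeholders, so swap in your own DataFrame and columns. It assumes the numerical column contains at least two different values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def overlaid_histograms(df, value_col, group_col, bins=15):
    """Draw one semi-transparent histogram per group on the same axes."""
    fig, ax = plt.subplots(figsize=(9, 5))
    # Shared bin edges so every group is counted over the same intervals
    edges = np.linspace(df[value_col].min(), df[value_col].max(), bins + 1)
    for name, group in df.groupby(group_col):
        ax.hist(group[value_col].dropna(), bins=edges, alpha=0.5,
                edgecolor="white", label=str(name))
    ax.set_xlabel(value_col)
    ax.set_ylabel("Frequency")
    ax.legend(title=group_col)
    fig.tight_layout()
    return ax

# Made-up demo data; replace with your own DataFrame and column names
demo = pd.DataFrame({"score": [1, 2, 2, 3, 3, 3, 6, 7, 7, 8, 8, 8],
                     "group": ["x"] * 6 + ["y"] * 6})
ax = overlaid_histograms(demo, "score", "group", bins=7)
ax.set_title("Score by Group (illustrative data)")
```

Using the same bin edges for every group matters here: if each group got its own bins, the bars would not line up and the comparison would be misleading.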

Suggested datasets and what to look for:

  • CDC BRFSS: Try _BMI5 (BMI), which is likely skewed right. Try GENHLTH (general health rating), which is ordinal, so use a bar chart.
  • Gapminder: Try lifeExp, which may be bimodal (developing vs. developed countries). Try continent with a bar chart.
  • College Scorecard: Try ADM_RATE (admission rate), which has an interesting shape. Try REGION with a bar chart.
  • World Happiness Report: Try Happiness Score, which is approximately symmetric. Try Region with a bar chart.
  • NOAA Climate: Try temperature, which may be bimodal (summer vs. winter if you have a full year).


5.14 Spaced Review: Strengthening Previous Learning

These questions revisit concepts from earlier chapters at expanding intervals, helping you build long-term retention.

SR.1 (From Chapter 1 — Descriptive vs. Inferential Statistics): Everything you did in this chapter — creating histograms, describing distribution shapes, identifying outliers — falls under descriptive statistics. Explain why. Then describe a question about one of the graphs in this chapter that would require inferential statistics to answer.

Check your thinking

All of the graphs and descriptions in this chapter summarize and display the data we already have, and that's descriptive statistics. We're describing the sample, not drawing conclusions about a larger population.

An inferential question might be: "Based on the skewed-right watch-time distribution in our sample of 5,000 StreamVibe users, can we conclude that watch-time is also skewed right for all StreamVibe users?" Or: "Is the bimodal pattern in Maya's flu data from Community A statistically different from the distribution in Community B, or could the difference be due to random chance?" These questions require inferential methods (confidence intervals, hypothesis tests) that we'll learn in Chapters 12-13.

SR.2 (From Chapter 3 — pandas .describe()): In Chapter 3, you used .describe() to get summary statistics like the mean, standard deviation, min, and max. Now that you've seen histograms, explain a situation where .describe() alone could be misleading. What does a histogram reveal that .describe() hides?

Check your thinking

.describe() gives you single numbers: mean of 38, standard deviation of 25, min of 2, max of 87. But these numbers don't tell you the shape of the distribution. Maya's flu data had a mean age of 38, but the histogram revealed the distribution was bimodal: most patients were either very young or very old, and almost nobody was near the "average" age of 38.

Two completely different distributions can have the same mean and standard deviation. The histogram reveals shape (symmetric vs. skewed, unimodal vs. bimodal), clusters, gaps, and outliers, none of which appear in .describe() output. This is why Chapter 5 follows Chapter 3: tools first, then visualization to understand what the tools are showing you.

SR.3 (From Chapter 2 — Choosing Graphs by Variable Type): In Chapter 2, you learned to classify variables as categorical (nominal, ordinal) or numerical (discrete, continuous). Explain how this classification directly determines which graph you should use. Give an example of choosing the wrong graph type for a variable and explain why it fails.

Check your thinking

Categorical variables → bar charts (or pie charts). Numerical variables → histograms (or stem-and-leaf plots). The variable type determines the graph type because categorical data has distinct groups (bars with gaps) while numerical data falls on a continuous scale (bars touching, bins on a number line).

Wrong graph example: Making a histogram of device_type (Phone, Laptop, Tablet, Smart TV, Desktop). This fails because the x-axis of a histogram is a continuous number line, and you can't place category names on it meaningfully. There's no ordering where "Phone" is between "Laptop" and "Tablet" on a numerical scale.

The reverse mistake, making a bar chart of ages with one bar per year, fails because you'd get 70+ separate bars with tiny counts, and the gaps between bars would falsely suggest the ages are distinct categories rather than points on a continuous scale.

Chapter Summary

Let's step back and see the big picture of what you've learned.

The Big Ideas

  1. Graphs reveal what numbers hide. A single summary statistic (like the mean) can't capture the shape of a distribution. Histograms show you the full picture — peaks, tails, clusters, gaps, and outliers that summary statistics alone would miss.

  2. Variable type determines graph type. Categorical variables → bar charts (or pie charts). Numerical variables → histograms (or stem-and-leaf plots). Getting this match right is the first step in any visualization.

  3. Distribution thinking is a threshold concept. Instead of seeing data as individual numbers, start seeing it as a distribution with a shape. That shape — symmetric or skewed, unimodal or bimodal — tells you the story of your data.

  4. Describing distributions requires four elements: Shape, center, spread, and unusual features. Develop the habit of reporting all four every time you look at a histogram.

  5. Graphs can mislead as easily as they inform. Truncated axes, 3D effects, wrong graph types, and missing labels can all distort the truth. Be a critical consumer of visualizations — and an honest creator of them.

Key Terms

| Term | Definition |
|---|---|
| Histogram | A graph that divides numerical data into equal-width bins and displays the count per bin as touching bars |
| Bar chart | A graph that displays the frequency of each category as separate bars with gaps between them |
| Pie chart | A circular graph that shows the proportion of each category as a slice |
| Stem-and-leaf plot | A display that splits each value into a stem and leaf, showing both shape and exact values |
| Frequency distribution | A table organizing data into classes with their counts |
| Relative frequency | The proportion of observations in a class (frequency / total) |
| Distribution shape | The overall pattern of a histogram: symmetric, skewed, unimodal, bimodal, etc. |
| Symmetric | A distribution whose left and right sides are approximate mirror images |
| Skewed right | A distribution with a longer tail extending to the right (toward larger values) |
| Skewed left | A distribution with a longer tail extending to the left (toward smaller values) |
| Unimodal | A distribution with one peak |
| Bimodal | A distribution with two peaks |
| Outlier | An observation that falls far from the rest of the data |

What's Next

You can now see data. You can create histograms and bar charts, describe the shapes of distributions, spot outliers, and choose the right graph for the right variable. That's a powerful foundation.

But you've probably noticed that describing a distribution's "center" and "spread" has been somewhat vague — "around 44%" or "stretching from 15% to 70%." In Chapter 6: Numerical Summaries — Center, Spread, and Shape, you'll learn to quantify those descriptions with precise numbers: means, medians, standard deviations, percentiles, and the five-number summary. You'll also meet the box plot — a compact graph that summarizes an entire distribution in five numbers.

Think of it this way: Chapter 5 taught you to see the shape. Chapter 6 will teach you to measure it.

In Chapter 7: Data Wrangling, you'll learn how to clean messy data — handling missing values, fixing data types, and making decisions about outliers you identified in this chapter. Because real data is never as clean as the examples in textbooks.

The interplay between visualization and summary statistics is at the heart of exploratory data analysis. You need both. The numbers without the pictures can mislead (Maya's mean of 38). The pictures without the numbers lack precision (how much is the distribution skewed?). Together, they tell the complete story.

You've started to see data as distributions. That vision will only get sharper from here.