Key Takeaways: Exploring Data — Graphs and Descriptive Statistics
One-Sentence Summary
Data visualization transforms raw numbers into pictures that reveal patterns, shapes, and surprises — and the first skill of statistical thinking is learning to see data as a distribution, not as individual values.
Core Concepts at a Glance
| Concept | Definition | Why It Matters |
|---|---|---|
| Histogram | Divides numerical data into equal-width bins with touching bars | Reveals the shape of a distribution — the single most important graph in introductory statistics |
| Bar chart | Displays categorical frequencies with separate (non-touching) bars | Shows how observations are spread across categories |
| Distribution shape | The overall pattern: symmetric, skewed, unimodal, bimodal | The shape tells the story — two datasets with the same mean can have completely different shapes |
| Outlier | An observation far from the rest of the data | May be an error, an anomaly, a genuine extreme, or the most important data point — investigate before acting |
| Distribution thinking | Seeing data as a whole distribution rather than individual numbers | The threshold concept that separates looking at data from truly understanding it |
Graph Selection Guide
What type of variable(s) are you graphing?
│
├── ONE CATEGORICAL variable
│ ├── Bar chart ← DEFAULT (always works)
│ └── Pie chart (≤ 5 categories, parts of a whole only)
│
├── ONE NUMERICAL variable
│ ├── Histogram ← DEFAULT (always works)
│ └── Stem-and-leaf plot (≤ 50 observations, want exact values)
│
├── TWO CATEGORICAL variables
│ ├── Grouped bar chart (side-by-side)
│ └── Stacked bar chart
│
├── ONE CATEGORICAL + ONE NUMERICAL
│ ├── Side-by-side histograms
│ └── Side-by-side box plots (Chapter 6)
│
└── TWO NUMERICAL variables
  └── Scatterplot (Chapter 22)
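The decision tree above can be read as a lookup from variable types to a starting graph. The sketch below is one hypothetical way to encode it. `suggest_graph` is an illustrative name, not a library function, and for the two-variable cases (where the guide marks no default) it simply returns the first option listed.

```python
def suggest_graph(*variable_kinds):
    """Return a starting graph type for one or two variables.

    Each argument is the string "categorical" or "numerical".
    This is a hypothetical helper that mirrors the guide above.
    """
    # Sort so that ("numerical", "categorical") and
    # ("categorical", "numerical") hit the same entry.
    key = tuple(sorted(variable_kinds))
    first_choice = {
        ("categorical",): "bar chart",
        ("numerical",): "histogram",
        ("categorical", "categorical"): "grouped bar chart",
        ("categorical", "numerical"): "side-by-side histograms",
        ("numerical", "numerical"): "scatterplot",
    }
    if key not in first_choice:
        raise ValueError(f"No suggestion for variable kinds: {variable_kinds}")
    return first_choice[key]


print(suggest_graph("numerical"))
print(suggest_graph("numerical", "categorical"))
```

The lookup only picks a starting point. The conditions in the guide (number of categories, number of observations) still decide whether an alternative such as a pie chart or stem-and-leaf plot is appropriate.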
Quick Reference: Bar Chart vs. Histogram
| Feature | Bar Chart | Histogram |
|---|---|---|
| Variable type | Categorical | Numerical |
| Bars touch? | No (gaps between bars) | Yes (bars are adjacent) |
| X-axis shows | Category names | Numerical scale (bins) |
| Bar order | Can rearrange | Must follow number line |
| Bar width | Cosmetic only | Defines bin width |
The simplest test: If the x-axis has words, it's a bar chart. If the x-axis has numbers, it's probably a histogram.
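The difference also shows up in how the bar heights are computed. The following is a minimal sketch using only the standard library, with made-up data: a bar chart counts labels, while a histogram first assigns each number to an equal-width bin. The variable names, the bin width of 10, and the starting edge of 50 are all assumptions chosen for illustration.

```python
from collections import Counter

# Categorical data: count each label. The bars could be drawn in any order.
favorite_colors = ["red", "blue", "red", "green", "blue", "red"]
category_counts = Counter(favorite_colors)

# Numerical data: assign each value to an equal-width bin on the number line.
exam_scores = [52, 57, 61, 64, 68, 70, 73, 75, 79, 88]
bin_start, bin_width = 50, 10
bin_index_counts = Counter((score - bin_start) // bin_width for score in exam_scores)

# Rebuild (lower edge, upper edge, count) rows in number-line order.
# Each bin includes its lower edge and excludes its upper edge.
histogram_rows = [
    (bin_start + i * bin_width, bin_start + (i + 1) * bin_width, bin_index_counts[i])
    for i in range(max(bin_index_counts) + 1)
]

print(dict(category_counts))
for lower, upper, count in histogram_rows:
    print(f"{lower}-{upper}: {'#' * count}")
```

Changing `bin_width` changes the histogram rows, which is why bar width in a histogram is not cosmetic.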
Distribution Shape Vocabulary
| Term | What It Looks Like | Real-World Example |
|---|---|---|
| Symmetric | Left and right sides are mirror images | Human body temperatures |
| Skewed right | Long tail stretches to the right (higher values) | Household income |
| Skewed left | Long tail stretches to the left (lower values) | Easy exam scores |
| Unimodal | One peak | Heights of adult women |
| Bimodal | Two peaks | Flu cases by age (children + elderly) |
| Uniform | All bars roughly equal height | Rolling a fair die |
Memory trick: The skew is named for the direction of the tail, not the hump. Right-skewed = tail points right, hump on the left.
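To make the "tail, not the hump" idea concrete, here is a rough sketch that compares how far the data stretch on each side of the median. This is a crude illustration, not a formal skewness measure: a single extreme value decides the result, and the function name and both datasets are invented for the example.

```python
import statistics


def tail_direction(values):
    """Name the side of the median on which the data stretch farther."""
    middle = statistics.median(values)
    left_tail = middle - min(values)
    right_tail = max(values) - middle
    if right_tail > left_tail:
        return "right"
    if left_tail > right_tail:
        return "left"
    return "neither"


# Hypothetical data, in thousands of dollars and in points.
household_incomes = [28, 31, 35, 38, 40, 42, 45, 52, 75, 140]
easy_exam_scores = [55, 78, 85, 88, 90, 92, 94, 95, 97, 98]

print(tail_direction(household_incomes))
print(tail_direction(easy_exam_scores))
```

In both datasets most values cluster on the side opposite the tail, which matches the memory trick: the hump sits away from the named direction.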
The Four-Part Description (Use Every Time)
When you look at any histogram, describe:
- Shape — Symmetric or skewed? Unimodal, bimodal, or uniform?
- Center — Where is the approximate middle?
- Spread — How wide is the distribution? (Range from min to max)
- Unusual features — Any outliers? Gaps? Clusters?
Example: "The distribution of watch times is skewed right and unimodal, centered around 25-30 minutes, with a spread from 5 to 180 minutes. A few outliers beyond 120 minutes represent binge-watching sessions."
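Center, spread, and unusual features can be backed up with a few numbers before writing the description. The sketch below uses invented watch times and treats the median as the "approximate middle". The 120-minute cutoff is taken from the example sentence, not from any formal outlier rule.

```python
import statistics

# Hypothetical watch times in minutes.
watch_minutes = [5, 12, 18, 22, 25, 26, 28, 30, 33, 41, 55, 130, 180]

center = statistics.median(watch_minutes)         # approximate middle
low, high = min(watch_minutes), max(watch_minutes)
spread = high - low                               # range from min to max
long_sessions = [m for m in watch_minutes if m > 120]

print(f"Center: about {center} minutes")
print(f"Spread: {low} to {high} minutes (range {spread})")
print(f"Unusual: {len(long_sessions)} sessions beyond 120 minutes: {long_sessions}")
```

Shape is the one part these numbers do not provide. It still has to be read from the histogram.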
Common Graphing Mistakes
| Mistake | The Problem | The Fix |
|---|---|---|
| Truncated axis | Bar heights don't reflect true ratios | Start bar chart axes at zero |
| 3D effects | Perspective distortion skews comparisons | Always use 2D charts |
| Unequal bin widths | Wider bins collect more observations, so their bars look taller than the data warrant | Use equal-width bins |
| Wrong graph type | E.g., histogram for categorical data | Match graph to variable type |
| Too many pie slices | Impossible to compare similar-sized slices | Limit to 5 categories or use a bar chart |
| Missing labels | Reader can't interpret the graph | Always include title, axis labels, and units |
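The truncated-axis problem is simple arithmetic. As a small sketch with made-up values, the function below (a hypothetical helper, not part of any plotting library) computes how many times taller one bar is drawn than another for a given axis starting point.

```python
def drawn_height_ratio(smaller, larger, axis_start=0.0):
    """Ratio of drawn bar heights when the axis begins at axis_start."""
    return (larger - axis_start) / (smaller - axis_start)


# Two hypothetical bar values: 50 and 55, a true difference of 10%.
honest_ratio = drawn_height_ratio(50, 55)                    # axis starts at zero
truncated_ratio = drawn_height_ratio(50, 55, axis_start=45)  # axis starts at 45

print(f"Axis from 0:  second bar drawn {honest_ratio:.1f}x as tall")
print(f"Axis from 45: second bar drawn {truncated_ratio:.1f}x as tall")
```

With the axis starting at 45, a 10% difference is drawn as a bar twice as tall, which is why bar chart axes should start at zero.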
Python Quick Reference
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assumes df is an existing pandas DataFrame; the column names below are placeholders.
# Bar chart (categorical variable)
sns.countplot(data=df, x='category_column', color='steelblue')
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Count')
plt.show()
# Histogram (numerical variable)
sns.histplot(data=df, x='number_column', bins=15, edgecolor='white')
plt.title('Title')
plt.xlabel('X Label (units)')
plt.ylabel('Frequency')
plt.show()
# Overlaid histograms (comparing groups)
sns.histplot(data=df, x='number_column', hue='group_column',
             bins=10, alpha=0.5, edgecolor='white')
plt.show()
Key Terms
| Term | Definition |
|---|---|
| Histogram | Numerical data divided into equal-width bins, displayed as touching bars |
| Bar chart | Categorical data displayed as separate bars with gaps |
| Pie chart | Proportions shown as slices of a circle |
| Stem-and-leaf plot | Data split into stems and leaves, preserving exact values |
| Frequency distribution | Table organizing data into classes with counts |
| Relative frequency | Proportion of observations in a class (count / total) |
| Distribution shape | Overall pattern of a histogram (symmetric, skewed, etc.) |
| Symmetric | Left and right sides are approximately mirror images |
| Skewed right | Longer tail extends toward larger values |
| Skewed left | Longer tail extends toward smaller values |
| Unimodal | Distribution with one peak |
| Bimodal | Distribution with two peaks |
| Outlier | Observation far from the rest of the data |
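Two of these terms are easy to compute by hand. The following sketch builds a frequency distribution and the matching relative frequencies (count / total) from a short list of invented letter grades, using only the standard library.

```python
from collections import Counter

# Hypothetical categorical data.
letter_grades = ["A", "B", "A", "C", "A", "B", "A", "B"]

frequency = Counter(letter_grades)   # class -> count
total = sum(frequency.values())
relative_frequency = {grade: count / total for grade, count in frequency.items()}

print("Grade  Count  Relative")
for grade in sorted(frequency):
    print(f"{grade:<6} {frequency[grade]:<6} {relative_frequency[grade]:.3f}")
```

The relative frequencies always sum to 1, which is a quick check that no observation was dropped or double-counted.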
The One Thing to Remember
If you forget everything else from this chapter, remember this:
A single number can never fully describe a dataset. Two distributions can have the same mean but completely different shapes — and completely different stories. Always look at the shape. The shape is the story.
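A tiny illustration of that point, using two invented datasets: both have a mean of 50, but once the values are binned the shapes are nothing alike. The helper name and the bin width of 20 are assumptions made for this sketch.

```python
import statistics
from collections import Counter

# Two hypothetical datasets with the same mean.
clustered = [48, 49, 50, 50, 51, 52]
split = [10, 10, 10, 90, 90, 90]


def bin_counts(values, bin_width=20):
    """Count values per equal-width bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


print(statistics.mean(clustered), statistics.mean(split))
print(bin_counts(clustered))
print(bin_counts(split))
```

The first dataset falls into a single bin around the mean. The second has two separate clusters and no observations anywhere near 50.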