Key Takeaways: Your First Data Analysis
This is your reference card for Chapter 6 — the chapter where data science stopped being abstract and became something you actually did. Keep this nearby every time you sit down with a new dataset.
The EDA Workflow Checklist
Every data exploration follows the same arc. Use this checklist until it becomes second nature:
1. **Define questions.** What do I want to learn from this data? (Write 2-3 specific questions before touching data.)
2. **Document provenance.** Where did this data come from? (Source, collection date, known limitations.)
3. **Load and verify.** Does the file load correctly? (Row count, column names, no obvious corruption.)
4. **Inspect structure.** What does each column contain? (Data types, unique values, first/last rows.)
5. **Assess quality.** What's missing or broken? (Missing values, outliers, inconsistencies.)
6. **Compute statistics.** What do the numbers say? (Count, min, max, mean, median — overall and by group.)
7. **Compare groups.** Do patterns differ across categories? (Break down stats by region, year, category.)
8. **Document findings.** What did I learn? (Plain-English observations between code cells.)
9. **List next questions.** What new questions did this raise? (Every good EDA ends with more questions than it started with.)
Question Formulation Guide
Good data science starts with good questions. Use this framework to sharpen your questions before you start coding.
| Criterion | Good Example | Weak Example | Why It Matters |
|---|---|---|---|
| Specific | "What is the mean MCV1 coverage in AFRO for 2022?" | "What's going on with the data?" | Specific questions produce concrete, verifiable answers |
| Answerable with the data | "How does coverage vary by region?" | "Why do some countries have low coverage?" | "Why" usually requires causal information not in the dataset |
| Connected to a decision | "Which regions should receive additional funding?" | "How many rows are in the file?" | Analysis should inform understanding or action |
| Honest about limitations | "What patterns exist (noting this is correlation, not causation)?" | "This proves that income causes higher turnout" | Overstating conclusions undermines credibility |
Three types of questions:
- **Descriptive:** What happened? What does the data look like? (Start here.)
- **Predictive:** What is likely to happen? (Requires modeling — Part V.)
- **Causal:** What would happen if we changed something? (Requires experimental design — Chapter 24.)
Summary Statistics Reference
These are the statistics you should compute for every numeric column in a new dataset:
| Statistic | What It Tells You | Pure Python |
|---|---|---|
| Count | How many non-missing values exist | len(values), after filtering out missing entries |
| Min | The smallest value | min(values) |
| Max | The largest value | max(values) |
| Range | The spread from smallest to largest | max(values) - min(values) |
| Mean | The arithmetic average — sensitive to outliers | sum(values) / len(values) |
| Median | The middle value when sorted — resistant to outliers | Sort, then take middle value |
Interpreting mean vs. median:
- Mean ≈ Median: roughly symmetric distribution
- Mean > Median: right-skewed (pulled up by high outliers)
- Mean < Median: left-skewed (pulled down by low outliers)
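The statistics in the table above can be computed in one small helper. The sample values are made up; note how the single high outlier pulls the mean above the median:

```python
def summarize(values):
    """Count, min, max, range, mean, and median for a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    # Median: middle value, or average of the two middle values when n is even.
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "count": n,
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "mean": sum(ordered) / n,
        "median": median,
    }

stats = summarize([55, 60, 62, 64, 95])  # one high outlier
print(stats["mean"])    # 67.2 — pulled up by the outlier
print(stats["median"])  # 62 — resistant to it
```

Mean > median here, so this toy distribution is right-skewed, exactly the pattern described above.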
Data Quality Checklist
Run these checks on every new dataset before trusting any results:
- [ ] Completeness: Count missing values for every column. Note any column above 5% missing.
- [ ] Validity: Check that values fall within expected ranges (no negative ages, no percentages above 100, no dates in the future).
- [ ] Consistency: Look for multiple representations of the same thing ("US" vs. "United States" vs. "USA").
- [ ] Uniqueness: Check for unexpected duplicate rows.
- [ ] Outliers: Identify extreme values and determine whether they're real or erroneous.
- [ ] Type correctness: Confirm that numeric columns are actually numeric (remember: CSV loads everything as strings).
Three types of missing data:
- **MCAR (Missing Completely at Random):** No pattern to the missingness. Least problematic.
- **MAR (Missing at Random):** Missingness is related to an observed variable. Manageable with care.
- **MNAR (Missing Not at Random):** Missingness is related to the missing value itself. Most problematic — can systematically bias your analysis.
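The completeness check from the quality checklist can be sketched in a few lines of pure Python. The rows here are hypothetical; in a CSV, missing values arrive as empty strings:

```python
# Hypothetical rows as loaded by csv.DictReader (all values are strings).
rows = [
    {"country": "Kenya", "coverage": "89"},
    {"country": "Chad", "coverage": ""},   # empty string = missing in CSV
    {"country": "India", "coverage": "89"},
]

# Count missing values per column.
missing = {col: 0 for col in rows[0]}
for row in rows:
    for col, value in row.items():
        if value == "" or value is None:
            missing[col] += 1

print(missing)  # {'country': 0, 'coverage': 1}

# Flag any column above the 5% threshold from the checklist.
for col, n in missing.items():
    pct = 100 * n / len(rows)
    if pct > 5:
        print(f"Flag: {col} is {pct:.0f}% missing")
```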
Key Concepts
- **Exploratory data analysis (EDA)** is a conversation with your data. You ask, the data answers, and each answer leads to the next question. The purpose is to discover patterns and generate hypotheses, not to confirm them.
- **Data loading** with `csv.DictReader` gives you a list of dictionaries. All values are strings. You must convert to numeric types before doing math and handle empty strings carefully.
- **Data provenance** documents where your data came from, who collected it, when, how, and for what purpose. Without provenance, you can't assess trustworthiness.
- A **data dictionary** describes each column: its name, meaning, expected type, valid range, and quality notes. Create one as your first analysis step.
- **Notebook narrative** means writing your Jupyter notebook as a document that tells a story — with headers, questions, interpretation, and conclusions between code cells. A notebook without narrative is just a code dump.
- **Reproducibility** means someone else can rerun your notebook and get the same results. This requires accessible data, cells that run in order, documented dependencies, and seeded randomness.
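The string-conversion point deserves emphasis because it trips up nearly everyone. A minimal sketch of the skip-and-convert pattern, using made-up raw values:

```python
# All csv.DictReader values arrive as strings; convert before doing math.
# Empty strings (missing values) and stray text are skipped via try/except.
raw_values = ["89", "55", "", "72", "n/a"]

numbers = []
for v in raw_values:
    try:
        numbers.append(float(v))
    except ValueError:  # empty string or non-numeric text
        pass

print(numbers)                      # [89.0, 55.0, 72.0]
print(sum(numbers) / len(numbers))  # 72.0
```

Computing `sum(raw_values)` directly would raise a `TypeError`; conversion always comes first.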
Terms to Remember
| Term | Definition |
|---|---|
| Exploratory data analysis (EDA) | The systematic process of examining a dataset to discover patterns, spot anomalies, test assumptions, and generate hypotheses |
| Data loading | The process of reading data from a file into a program's memory for analysis |
| Data inspection | Examining basic properties of a dataset: shape, column names, data types, unique values, first/last rows |
| Data quality | The degree to which data is accurate, complete, consistent, and suitable for its intended use |
| Missing values | Data points that are absent — represented as empty strings in CSV, None in Python, or NaN in pandas |
| Outlier | A data point that falls far outside the typical range of values — may be a genuine extreme or an error |
| Summary statistics | Numerical measures that describe a dataset's central tendency (mean, median) and spread (range, standard deviation) |
| Data dictionary | A reference document describing each column's name, meaning, data type, valid values, and quality notes |
| Data provenance | Documentation of a dataset's origin: who collected it, when, how, why, and any known limitations |
| Notebook narrative | The practice of writing a Jupyter notebook as a readable document that combines code, output, and explanatory text |
| Reproducibility | The principle that an analysis can be independently replicated to produce the same results |
What You Should Be Able to Do Now
Use this checklist to verify you've absorbed the chapter. If any item feels shaky, revisit the relevant section.
- [ ] Load a CSV file into a list of dictionaries using `csv.DictReader`
- [ ] Inspect a dataset's basic properties: row count, column names, first/last rows, unique values
- [ ] Compute summary statistics (count, min, max, mean, median) using pure Python
- [ ] Handle missing values when extracting numeric data (skip empty strings, use try/except)
- [ ] Assess data quality by counting missing values, checking ranges, and spotting inconsistencies
- [ ] Break down statistics by group (e.g., mean coverage by region) using loops and dictionaries
- [ ] Formulate specific, answerable questions before exploring data
- [ ] Write a notebook narrative that combines code, output, and plain-English interpretation
- [ ] Explain why pure Python is limited for data analysis and what tools (pandas, matplotlib) will address those limitations
- [ ] Describe the EDA workflow and apply it to any new dataset
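The group-breakdown skill from the checklist above can be sketched with nothing but loops and dictionaries. The region and coverage values here are hypothetical:

```python
# Mean coverage by region using plain loops and dictionaries (hypothetical data,
# assumed already converted from strings to floats).
rows = [
    {"region": "AFRO", "coverage": 89.0},
    {"region": "AFRO", "coverage": 55.0},
    {"region": "SEARO", "coverage": 91.0},
]

totals, counts = {}, {}
for row in rows:
    region = row["region"]
    totals[region] = totals.get(region, 0.0) + row["coverage"]
    counts[region] = counts.get(region, 0) + 1

means = {region: totals[region] / counts[region] for region in totals}
print(means)  # {'AFRO': 72.0, 'SEARO': 91.0}
```

This accumulate-then-divide pattern is exactly what pandas' `groupby` will automate in Part II.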
If you checked every box, congratulations — you've completed Part I. You are now a person who does data science, not just a person who reads about it. Part II is going to arm you with power tools that make everything faster. Let's go.