Key Takeaways: Your First Data Analysis
This is your reference card for Chapter 6 — the chapter where data science stopped being abstract and became something you actually did. Keep this nearby every time you sit down with a new dataset.
The EDA Workflow Checklist
Every data exploration follows the same arc. Use this checklist until it becomes second nature:
1. **Define questions.** What do I want to learn from this data? (Write 2-3 specific questions before touching data.)
2. **Document provenance.** Where did this data come from? (Source, collection date, known limitations.)
3. **Load and verify.** Does the file load correctly? (Row count, column names, no obvious corruption.)
4. **Inspect structure.** What does each column contain? (Data types, unique values, first/last rows.)
5. **Assess quality.** What's missing or broken? (Missing values, outliers, inconsistencies.)
6. **Compute statistics.** What do the numbers say? (Count, min, max, mean, median — overall and by group.)
7. **Compare groups.** Do patterns differ across categories? (Break down stats by region, year, category.)
8. **Document findings.** What did I learn? (Plain-English observations between code cells.)
9. **List next questions.** What new questions did this raise? (Every good EDA ends with more questions than it started with.)
Question Formulation Guide
Good data science starts with good questions. Use this framework to sharpen your questions before you start coding.
| Criterion | Good Example | Weak Example | Why It Matters |
|---|---|---|---|
| Specific | "What is the mean MCV1 coverage in AFRO for 2022?" | "What's going on with the data?" | Specific questions produce concrete, verifiable answers |
| Answerable with the data | "How does coverage vary by region?" | "Why do some countries have low coverage?" | "Why" usually requires causal information not in the dataset |
| Connected to a decision | "Which regions should receive additional funding?" | "How many rows are in the file?" | Analysis should inform understanding or action |
| Honest about limitations | "What patterns exist (noting this is correlation, not causation)?" | "This proves that income causes higher turnout" | Overstating conclusions undermines credibility |
Three types of questions:
- **Descriptive:** What happened? What does the data look like? (Start here.)
- **Predictive:** What is likely to happen? (Requires modeling — Part V.)
- **Causal:** What would happen if we changed something? (Requires experimental design — Chapter 24.)
Summary Statistics Reference
These are the statistics you should compute for every numeric column in a new dataset:
| Statistic | What It Tells You | Pure Python |
|---|---|---|
| Count | How many non-missing values exist | len(values), after filtering out missing entries |
| Min | The smallest value | min(values) |
| Max | The largest value | max(values) |
| Range | The spread from smallest to largest | max(values) - min(values) |
| Mean | The arithmetic average — sensitive to outliers | sum(values) / len(values) |
| Median | The middle value when sorted — resistant to outliers | Sort, then take middle value |
Interpreting mean vs. median:
- Mean ≈ Median: roughly symmetric distribution
- Mean > Median: right-skewed (pulled up by high outliers)
- Mean < Median: left-skewed (pulled down by low outliers)
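The statistics in the table above can be computed in one small helper. The sample values are made up; note how the single high outlier pulls the mean above the median:

```python
def summarize(values):
    """Count, min, max, range, mean, and median for a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    # Median: middle value, or average of the two middle values when n is even.
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "count": n,
        "min": ordered[0],
        "max": ordered[-1],
        "range": ordered[-1] - ordered[0],
        "mean": sum(ordered) / n,
        "median": median,
    }

stats = summarize([55, 60, 62, 64, 95])  # one high outlier
print(stats["mean"])    # 67.2 — pulled up by the outlier
print(stats["median"])  # 62 — resistant to it
```

Mean > median here, so this toy distribution is right-skewed, exactly the pattern described above.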
Data Quality Checklist
Run these checks on every new dataset before trusting any results:
- [ ] Completeness: Count missing values for every column. Note any column above 5% missing.
- [ ] Validity: Check that values fall within expected ranges (no negative ages, no percentages above 100, no dates in the future).
- [ ] Consistency: Look for multiple representations of the same thing ("US" vs. "United States" vs. "USA").
- [ ] Uniqueness: Check for unexpected duplicate rows.
- [ ] Outliers: Identify extreme values and determine whether they're real or erroneous.
- [ ] Type correctness: Confirm that numeric columns are actually numeric (remember: CSV loads everything as strings).
Three types of missing data:
- **MCAR (Missing Completely at Random):** No pattern to the missingness. Least problematic.
- **MAR (Missing at Random):** Missingness is related to an observed variable. Manageable with care.
- **MNAR (Missing Not at Random):** Missingness is related to the missing value itself. Most problematic — can systematically bias your analysis.
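The completeness check from the quality checklist can be sketched in a few lines of pure Python. The rows here are hypothetical; in a CSV, missing values arrive as empty strings:

```python
# Hypothetical rows as loaded by csv.DictReader (all values are strings).
rows = [
    {"country": "Kenya", "coverage": "89"},
    {"country": "Chad", "coverage": ""},   # empty string = missing in CSV
    {"country": "India", "coverage": "89"},
]

# Count missing values per column.
missing = {col: 0 for col in rows[0]}
for row in rows:
    for col, value in row.items():
        if value == "" or value is None:
            missing[col] += 1

print(missing)  # {'country': 0, 'coverage': 1}

# Flag any column above the 5% threshold from the checklist.
for col, n in missing.items():
    pct = 100 * n / len(rows)
    if pct > 5:
        print(f"Flag: {col} is {pct:.0f}% missing")
```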
Key Concepts
- **Exploratory data analysis (EDA)** is a conversation with your data. You ask, the data answers, and each answer leads to the next question. The purpose is to discover patterns and generate hypotheses, not to confirm them.
- **Data loading** with `csv.DictReader` gives you a list of dictionaries. All values are strings. You must convert to numeric types before doing math and handle empty strings carefully.
- **Data provenance** documents where your data came from, who collected it, when, how, and for what purpose. Without provenance, you can't assess trustworthiness.
- A **data dictionary** describes each column: its name, meaning, expected type, valid range, and quality notes. Create one as your first analysis step.
- **Notebook narrative** means writing your Jupyter notebook as a document that tells a story — with headers, questions, interpretation, and conclusions between code cells. A notebook without narrative is just a code dump.
- **Reproducibility** means someone else can rerun your notebook and get the same results. This requires accessible data, cells that run in order, documented dependencies, and seeded randomness.
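The string-conversion point deserves emphasis because it trips up nearly everyone. A minimal sketch of the skip-and-convert pattern, using made-up raw values:

```python
# All csv.DictReader values arrive as strings; convert before doing math.
# Empty strings (missing values) and stray text are skipped via try/except.
raw_values = ["89", "55", "", "72", "n/a"]

numbers = []
for v in raw_values:
    try:
        numbers.append(float(v))
    except ValueError:  # empty string or non-numeric text
        pass

print(numbers)                      # [89.0, 55.0, 72.0]
print(sum(numbers) / len(numbers))  # 72.0
```

Computing `sum(raw_values)` directly would raise a `TypeError`; conversion always comes first.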
Terms to Remember
| Term | Definition |
|---|---|
| Exploratory data analysis (EDA) | The systematic process of examining a dataset to discover patterns, spot anomalies, test assumptions, and generate hypotheses |
| Data loading | The process of reading data from a file into a program's memory for analysis |
| Data inspection | Examining basic properties of a dataset: shape, column names, data types, unique values, first/last rows |
| Data quality | The degree to which data is accurate, complete, consistent, and suitable for its intended use |
| Missing values | Data points that are absent — represented as empty strings in CSV, None in Python, or NaN in pandas |
| Outlier | A data point that falls far outside the typical range of values — may be a genuine extreme or an error |
| Summary statistics | Numerical measures that describe a dataset's central tendency (mean, median) and spread (range, standard deviation) |
| Data dictionary | A reference document describing each column's name, meaning, data type, valid values, and quality notes |
| Data provenance | Documentation of a dataset's origin: who collected it, when, how, why, and any known limitations |
| Notebook narrative | The practice of writing a Jupyter notebook as a readable document that combines code, output, and explanatory text |
| Reproducibility | The principle that an analysis can be independently replicated to produce the same results |
What You Should Be Able to Do Now
Use this checklist to verify you've absorbed the chapter. If any item feels shaky, revisit the relevant section.
- [ ] Load a CSV file into a list of dictionaries using `csv.DictReader`
- [ ] Inspect a dataset's basic properties: row count, column names, first/last rows, unique values
- [ ] Compute summary statistics (count, min, max, mean, median) using pure Python
- [ ] Handle missing values when extracting numeric data (skip empty strings, use try/except)
- [ ] Assess data quality by counting missing values, checking ranges, and spotting inconsistencies
- [ ] Break down statistics by group (e.g., mean coverage by region) using loops and dictionaries
- [ ] Formulate specific, answerable questions before exploring data
- [ ] Write a notebook narrative that combines code, output, and plain-English interpretation
- [ ] Explain why pure Python is limited for data analysis and what tools (pandas, matplotlib) will address those limitations
- [ ] Describe the EDA workflow and apply it to any new dataset
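The group-breakdown skill from the checklist above can be sketched with nothing but loops and dictionaries. The region and coverage values here are hypothetical:

```python
# Mean coverage by region using plain loops and dictionaries (hypothetical data,
# assumed already converted from strings to floats).
rows = [
    {"region": "AFRO", "coverage": 89.0},
    {"region": "AFRO", "coverage": 55.0},
    {"region": "SEARO", "coverage": 91.0},
]

totals, counts = {}, {}
for row in rows:
    region = row["region"]
    totals[region] = totals.get(region, 0.0) + row["coverage"]
    counts[region] = counts.get(region, 0) + 1

means = {region: totals[region] / counts[region] for region in totals}
print(means)  # {'AFRO': 72.0, 'SEARO': 91.0}
```

This accumulate-then-divide pattern is exactly what pandas' `groupby` will automate in Part II.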
If you checked every box, congratulations — you've completed Part I. You are now a person who does data science, not just a person who reads about it. Part II is going to arm you with power tools that make everything faster. Let's go.