Capstone Project 3: The Data Dashboard

"Data is not information. Information is not knowledge. Knowledge is not wisdom. Your job as a programmer is to build the bridge between the first and the second — and to do it honestly."

Project Overview

Elena Vasquez has been with you since Chapter 1. She started with a messy spreadsheet and a manual process that took four hours every week. By Chapter 24, she had an automated pipeline that processed her nonprofit's data, generated reports, and emailed them to stakeholders. But Elena's pipeline is custom-built for one dataset with one structure. What if you could build a general-purpose tool — one that reads any well-structured dataset, analyzes it, and produces useful output?

That's the Data Dashboard. It's a command-line application that reads CSV or JSON datasets, lets the user explore the data interactively (filter, sort, search), performs statistical analysis on numeric columns, generates text-based visualizations, and exports formatted reports. It's the kind of tool that a data analyst, a researcher, or a curious person with a spreadsheet would actually reach for.

This project is the most algorithmically demanding of the three capstones. You'll implement statistical functions from scratch, build a text rendering engine for charts, parse and validate messy real-world data, and design an architecture that works with any tabular dataset — not just one you've hardcoded for. If the Finance Tracker tests your OOP skills and the Trivia Game tests your state management, the Data Dashboard tests your ability to write general-purpose, data-agnostic code.

Why Data Analysis?

Three reasons.

First, data analysis is a universally valuable skill. Every field — science, business, journalism, public policy, healthcare — needs people who can look at a dataset and extract meaning. Building a tool that does this teaches you both the programming and the thinking.

Second, this project exercises Python's strengths. Python dominates data analysis for a reason: its string handling is excellent (Chapter 7), its dictionaries map naturally to records (Chapter 9), its file I/O is clean (Chapter 10), and its ecosystem is unmatched (Chapter 23). You'll use all of these strengths.

Third, the Data Dashboard requires you to think abstractly. Unlike the Finance Tracker (where you know the data schema) or the Trivia Game (where you define the question format), the Data Dashboard must handle datasets it hasn't seen before. A column might be numeric, a date, or a string. A CSV might have 5 columns or 50. Your code must adapt. That's abstraction (Chapter 1) made concrete.


Required Features

1. Dataset Loading

  • Read CSV files using Python's csv module. Handle quoted fields, commas within fields, and various delimiters. (Chapter 10)
  • Read JSON files that contain an array of records (list of dictionaries). (Chapter 10)
  • Auto-detect column types on load: classify each column as numeric (int/float), date, or text based on its values. (Chapters 3, 4, 7)
  • Display a dataset summary after loading: filename, number of rows, number of columns, column names with detected types, and a preview of the first 5 rows. (Chapter 7)
  • Handle common data problems: missing values (empty cells), inconsistent formatting, files with no data rows. Report problems clearly without crashing. (Chapter 11)
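
The type-detection bullet above can be sketched as a small function. This is one possible policy, not the required one; `detect_column_type` and `_is_number` are illustrative names, and the date check here covers only ISO-style dates (the broader patterns come with the Chapter 22 feature):

```python
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # ISO dates only; looser patterns come later

def _is_number(text):
    """True if the cell parses as a float (ints parse as floats too)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def detect_column_type(values):
    """Classify a column as 'numeric', 'date', or 'text' from its cell strings.

    Policy sketch: a column gets a specific type only if EVERY non-missing
    value fits it; anything mixed falls back to 'text'.
    """
    non_missing = [v for v in values if v.strip()]
    if not non_missing:
        return "text"  # nothing to inspect; an arbitrary but safe default
    if all(_is_number(v) for v in non_missing):
        return "numeric"
    if all(ISO_DATE.match(v) for v in non_missing):
        return "date"
    return "text"
```

The all-or-nothing policy is a design decision: a column with 999 numbers and one stray word becomes text, which is exactly the kind of data quality problem your summary should report.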

2. Data Exploration

  • View data in a formatted table with aligned columns, row numbers, and truncated long values. (Chapter 7 — string formatting is critical here)
  • Filter rows by column value: equals, contains (for text), greater than / less than (for numbers), and date ranges (for dates). (Chapters 4, 5, 7, 22)
  • Sort data by any column, ascending or descending. For numeric columns, sort numerically. For text columns, sort alphabetically. (Chapters 8, 19)
  • Search data using a keyword that matches across all text columns. Support case-insensitive matching. (Chapters 5, 7)
  • Select columns — display only a subset of columns chosen by the user. (Chapter 8)
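
The filter bullet above maps naturally onto a dispatch table of operators. A sketch, with illustrative names (`filter_rows`, the operator keys) and the assumption that numeric comparisons receive cells that parse as floats:

```python
def filter_rows(rows, column, operator, value):
    """Return the rows (list of dicts) satisfying one condition.

    Sketch only: 'gt'/'lt' assume the column's cells parse as floats,
    and 'contains' is case-insensitive per the search requirement.
    """
    ops = {
        "equals":   lambda cell: cell == value,
        "contains": lambda cell: value.lower() in cell.lower(),
        "gt":       lambda cell: float(cell) > float(value),
        "lt":       lambda cell: float(cell) < float(value),
    }
    test = ops[operator]
    return [row for row in rows if test(row[column])]
```

Building a new list rather than deleting from the old one previews the immutable-transformation principle discussed under Suggested Architecture.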

3. Statistical Analysis

For every numeric column, compute and display the following statistics. Implement these calculations yourself — do not import the statistics module or numpy. The point is to practice the algorithms.

  • Count of non-missing values (Chapter 5)
  • Sum (Chapter 5)
  • Mean (arithmetic average) (Chapter 6)
  • Median (middle value; average of two middle values for even-length lists) (Chapters 8, 19)
  • Mode (most frequent value; handle ties by reporting all modes) (Chapter 9)
  • Minimum and maximum (Chapter 5)
  • Range (max - min) (Chapter 3)
  • Standard deviation (population standard deviation: square root of the average of squared deviations from the mean) (Chapters 5, 6, 17)
  • Percentiles (25th and 75th) (Chapters 8, 17)

Display these statistics in a clean, labeled format for each numeric column when the user requests a statistical summary.
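
Two of the trickier statistics from the list, sketched under the implement-it-yourself rule. The names match the suggested stats.py module, but the exact signatures are up to you:

```python
def median(values):
    """Middle value; average of the two middle values for even-length lists."""
    ordered = sorted(values)          # sorted() copies, so the caller's list is untouched
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def mode(values):
    """Every value tied for the highest frequency, as a list (handles ties)."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    highest = max(counts.values())
    return [v for v, c in counts.items() if c == highest]
```

Returning a list from `mode` is how this sketch satisfies the "report all modes" requirement: a single mode comes back as a one-element list.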

4. Text-Based Visualizations

Generate the following visualizations using only text characters. No external libraries allowed for the required visualizations — this exercises your string formatting skills from Chapter 7 and algorithmic thinking from Chapter 17.

Horizontal Bar Chart

Category Breakdown
==================
Electronics   ##### ##### ##### ##           (32)
Clothing      ##### ##### ###                (23)
Food          ##### ##### ##### ##### ###    (43)
Books         ##### ##                       (12)

Scale bars proportionally to the largest value. Include labels and numeric values. Handle long category names by truncating or aligning.

Histogram

Distribution of Price
=====================
  0- 20  |################            (16)
 20- 40  |########################    (24)
 40- 60  |####################        (20)
 60- 80  |############                (12)
 80-100  |########                    ( 8)

Automatically determine bin ranges based on data. Default to 10 bins, but allow the user to specify a different count.
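
One way to compute those bins, assuming numeric values and equal-width binning; `bin_counts` is an illustrative name:

```python
def bin_counts(values, bins=10):
    """Split values into equal-width bins; returns a list of (low, high, count).

    Sketch: the maximum value would fall just past the last bin edge,
    so its index is clamped into the final bin.
    """
    low, high = min(values), max(values)
    width = (high - low) / bins or 1      # guard: all-equal values give width 0
    edges = [low + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        i = min(int((v - low) / width), bins - 1)   # clamp the max into the last bin
        counts[i] += 1
    return [(edges[i], edges[i + 1], counts[i]) for i in range(bins)]
```

The clamp on the final bin is the classic off-by-one in histogram code; it is exactly the kind of edge case your tests should pin down.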

Frequency Table

Top 10 Values: Category
========================
Food           43  =============================
Electronics    32  =====================
Clothing       23  ===============
Books          12  ========

For text columns, show the most frequent values with counts and a visual bar.
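
The counting half of this feature is a dictionary accumulation followed by a sort; `value_counts` is an illustrative name matching the suggested Column method:

```python
def value_counts(values, top=10):
    """Most frequent values with their counts, highest count first."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # Sort by count, descending; Python's sort is stable, so ties keep
    # their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```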

5. Report Generation

  • Generate a comprehensive text report that includes:
      • Dataset metadata (filename, dimensions, load time)
      • Column type summary
      • Statistical summary for all numeric columns
      • Frequency tables for all text columns (top 10 values each)
      • One histogram for each numeric column
      • Any data quality notes (missing values, type inconsistencies)
  • Save reports to a text file with a timestamped filename. (Chapter 10)
  • Allow the user to generate a filtered report — apply filters first, then generate the report on the filtered subset. (Chapters 5, 6)

6. Data Export

  • Export filtered data to a new CSV file. (Chapter 10)
  • Preserve the original column headers and formatting. (Chapter 7)
  • Report how many rows were exported. (Chapter 7)

7. Regular Expressions (Chapter 22)

  • Use regex for search — the keyword search feature should support regex patterns (e.g., searching for \d{3}-\d{4} to find phone number patterns). (Chapter 22)
  • Use regex for date detection during column type classification (recognize patterns like 2026-03-14, 03/14/2026, March 14, 2026). (Chapter 22)
  • Use regex for data cleaning — provide a command that lets the user apply a regex substitution to a column (e.g., strip non-numeric characters from a phone number column). (Chapter 22)
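
A sketch of the date-detection and cleaning pieces. The three patterns mirror the example formats above but are deliberately strict; real data will need looser ones, and the names here are illustrative:

```python
import re

# One pattern per example format mentioned above.
DATE_PATTERNS = [
    re.compile(r"^\d{4}-\d{2}-\d{2}$"),            # 2026-03-14
    re.compile(r"^\d{2}/\d{2}/\d{4}$"),            # 03/14/2026
    re.compile(r"^[A-Z][a-z]+ \d{1,2}, \d{4}$"),   # March 14, 2026
]

def looks_like_date(text):
    """True if the cell matches any known date format."""
    return any(p.match(text) for p in DATE_PATTERNS)

def clean_column(values, pattern, replacement):
    """Apply one regex substitution to every value in a column."""
    compiled = re.compile(pattern)
    return [compiled.sub(replacement, v) for v in values]
```

For example, `clean_column(phones, r"\D", "")` strips every non-digit character from a phone number column, which is the cleaning command the bullet above describes.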

8. Error Handling

  • Handle file not found, permission denied, and encoding errors with specific messages. (Chapter 11)
  • Handle malformed CSV (inconsistent column counts, unexpected delimiters) without crashing. (Chapter 11)
  • Handle type conversion failures gracefully when computing statistics on columns with mixed types. (Chapter 11)
  • Validate all user input in the interactive menu. (Chapter 11)

9. Testing

  • Write at least 15 unit tests using pytest covering:
      • Statistical calculations with known inputs and expected outputs
      • CSV and JSON loading with well-formed and malformed data
      • Column type detection
      • Filtering logic (numeric comparisons, text matching, date ranges)
      • Sorting (numeric vs. alphabetic)
      • Visualization output (check that bar lengths are proportional)
      • Edge cases: empty dataset, single-row dataset, single-column dataset, all-missing column
  • Use fixtures that create small, predictable test datasets. (Chapter 13)
  • Test your statistics functions against hand-calculated values — this is the most important testing in the project. (Chapter 13)
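
A minimal fixture-plus-test pair showing the shape this might take; `tiny_rows` and `test_mean_of_prices` are illustrative names:

```python
# conftest.py — a fixture that builds a small, predictable dataset.
import pytest

@pytest.fixture
def tiny_rows():
    """Three rows whose statistics are trivial to verify by hand."""
    return [
        {"category": "Food", "price": "10"},
        {"category": "Food", "price": "30"},
        {"category": "Books", "price": "20"},
    ]

# test_stats.py — pytest injects the fixture by matching the parameter name.
def test_mean_of_prices(tiny_rows):
    prices = [float(row["price"]) for row in tiny_rows]
    assert sum(prices) / len(prices) == 20.0
```

The point of the fixture is that every expected value in your assertions can be computed by hand in seconds, so a failing test unambiguously means the code is wrong.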

Technical Requirements

Requirement           Relevant Chapters   Details
Variables and types   Ch 3                Numeric parsing, type detection, running totals
Conditionals          Ch 4                Filter logic, type branching, menu routing
Loops                 Ch 5                Row iteration, accumulation, statistical computation
Functions             Ch 6                At least 20 well-named functions with docstrings
Strings               Ch 7                Table formatting, bar chart rendering, report output
Lists and tuples      Ch 8                Column data, sorted results, percentile computation
Dictionaries          Ch 9                Records, frequency counts, column metadata
File I/O              Ch 10               CSV/JSON reading, report writing, export
Error handling        Ch 11               Robust parsing, file errors, type conversion
Modules               Ch 12               Code split across at least 5 modules
Testing               Ch 13               pytest suite with 15+ tests and fixtures
OOP                   Ch 14-16            Dataset, Column, Report, Chart classes
Algorithms            Ch 17               Statistics implementation, binning, sorting
Searching/Sorting     Ch 19               Column sorting, filtered searches
Regular expressions   Ch 22               Search, date detection, data cleaning

Suggested Architecture

Module and Class Structure

data_dashboard/
    __init__.py
    dataset.py         # Dataset and Column classes
    stats.py           # Statistical computation functions
    charts.py          # TextBarChart, Histogram, FrequencyTable classes
    report.py          # ReportGenerator class
    loader.py          # CSVLoader, JSONLoader (file parsing)
    cli.py             # Interactive menu and display
    main.py            # Entry point
data/
    sample_sales.csv   # Sample dataset for testing
    sample_weather.json # Sample dataset for testing
tests/
    test_stats.py
    test_dataset.py
    test_charts.py
    test_loader.py
    test_report.py
    conftest.py

Core Classes

Column (Chapter 14)

  • Attributes: name (str), data (list), column_type (str: "numeric", "text", "date"), missing_count (int)
  • Methods: numeric_values() (returns non-missing values as floats), text_values(), unique_values(), value_counts() (returns dict of value frequencies), __len__(), __str__()
  • Properties: is_numeric, is_text, is_date
  • The Column class is the fundamental unit of analysis — every statistical function operates on a Column

Dataset (Chapters 14, 16)

  • Attributes: name (str), columns (dict mapping column name to Column), row_count (int), source_path (Path)
  • Methods: filter(column, operator, value), sort(column, reverse=False), select(column_names), and search(pattern, regex=False), each returning a new Dataset; plus head(n), to_rows(), and summary()
  • Filtering and sorting return new Dataset objects rather than modifying the original — this is the immutability principle from Chapter 7 applied at the object level
  • Composition: a Dataset has Columns; it is not a Column

CSVLoader and JSONLoader (Chapters 10, 14)

  • Methods: load(filepath) returns a Dataset; detect_types(raw_data) classifies columns
  • These are separate from Dataset because loading is a different responsibility than analysis (separation of concerns, Chapter 16)
  • Handle encoding detection, delimiter sniffing (for CSV), and schema validation (for JSON)

Statistical Functions (a module, not a class) (Chapters 6, 17)

  • mean(values), median(values), mode(values), std_dev(values), percentile(values, p), correlation(values_x, values_y) (stretch goal)
  • These are pure functions — they take a list of numbers and return a number. No side effects, no I/O, easy to test.
  • This is a deliberate functional design choice: not everything needs to be a class. Functions that take data and return results are a perfectly valid Python pattern, as discussed in Chapter 16.

TextBarChart, Histogram, FrequencyTable (Chapters 7, 14)

  • Each takes data and rendering options (width, character, max_bars, etc.)
  • Method: render() returns a multi-line string
  • Returning a string rather than printing directly makes these classes testable (Chapter 13) and composable (you can embed their output in reports)

ReportGenerator (Chapters 6, 14)

  • Methods: full_report(dataset), statistical_summary(dataset), save(filepath)
  • Composes the statistical functions and chart classes to produce a complete document
  • Accepts an optional filter specification so reports can cover subsets of the data

Design Principles (Chapter 16)

  • Immutable transformations: Filtering and sorting return new Datasets. The original data is never modified. This prevents a whole category of bugs where filtered views accidentally corrupt the source data.
  • Pure functions for computation: Statistics are pure functions, not methods on Dataset. This makes them independently testable and reusable outside the Dataset context.
  • Render-to-string: Charts and reports produce strings, not print output. This separates computation from I/O and makes everything testable.
  • Loader separation: The code that parses CSV is separate from the code that analyzes data. If you later want to add Excel or database loading, you add a new loader without touching the analysis code.
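
The first three principles show up together in a stripped-down sketch. This Dataset is deliberately simpler than the suggested class (it holds raw rows instead of Column objects) just to isolate the immutability idea:

```python
import copy

class Dataset:
    """Stripped-down sketch: transformations return new objects."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # list of dicts, one per record

    def filter(self, column, predicate):
        """Return a NEW Dataset; self.rows is never modified."""
        kept = [copy.copy(row) for row in self.rows if predicate(row[column])]
        return Dataset(self.name + " (filtered)", kept)
```

After `big = ds.filter("price", lambda v: v > 100)`, the original `ds` still holds every row; the copied dicts also mean later edits to the filtered view cannot corrupt the source.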

Development Milestones

Phase 1: Loading and Display (Days 1-3)

Goal: Read a CSV file and display its contents in a formatted table.

  • Create the CSVLoader class that reads a CSV file into a list of dictionaries
  • Create the Column class with basic type detection (numeric vs. text)
  • Create the Dataset class with summary() and head() methods
  • Write a display function that prints data in an aligned table with column headers
  • Test: Write tests for CSV loading (well-formed file, missing values, quoted fields)

Checkpoint: You can point the tool at any CSV file and see a clean summary and data preview. This is the "first useful output" milestone — it already does something a raw cat command can't.

Phase 2: Statistics (Days 4-7)

Goal: Compute and display statistics for numeric columns.

  • Implement all statistical functions: mean, median, mode, std_dev, percentile, min, max, range, count, sum
  • Create a statistical_summary() function that computes all stats for all numeric columns
  • Handle edge cases: empty columns, single-value columns, columns with missing values
  • Test: Write tests for every statistical function with hand-calculated expected values. This is the most critical test coverage in the project.

Checkpoint: Load a dataset, run statistical summary, and verify the numbers match what a spreadsheet computes. The hand-calculation tests from Chapter 13 are essential here — if your median function fails on an even-length list, you want to find out from a test, not from a user.

Implementation note on standard deviation: You're computing population standard deviation. The formula is: take each value, subtract the mean, square the result, take the mean of those squared differences, then take the square root. In code:

def std_dev(values):
    """Population standard deviation."""
    n = len(values)
    if n == 0:
        return 0.0
    avg = mean(values)  # the mean() function you implemented earlier in stats.py
    squared_diffs = [(x - avg) ** 2 for x in values]
    return (sum(squared_diffs) / n) ** 0.5

That's a list comprehension (Chapter 8), the sum() builtin (Chapter 5), and the exponentiation operator (Chapter 3) — all skills you already have.

Phase 3: Visualization (Days 8-11)

Goal: Generate text-based charts from the data.

  • Implement TextBarChart for horizontal bar charts with proportional scaling
  • Implement Histogram with automatic bin calculation
  • Implement FrequencyTable for text column value distributions
  • Handle edge cases: very long labels (truncate), very small values (minimum bar width of 1), zero values
  • Integrate charts into the interactive menu
  • Test: Write tests that verify bar lengths are proportional to values, bin ranges cover the full data range, and frequency counts are correct

Checkpoint: Your dashboard produces visual output that communicates patterns in the data at a glance. This is where the project shifts from "useful" to "impressive."

Scaling note: To render proportional bars, find the maximum value, decide on a maximum bar width (say, 50 characters), and compute each bar's width as int(value / max_value * max_width). Handle the zero case. This is the same proportional scaling concept behind every real charting library — you're just implementing it with # characters instead of pixels.
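
That scaling logic in code, with the zero and minimum-width cases handled; `bar` is an illustrative helper name:

```python
def bar(value, max_value, max_width=50, char="#"):
    """Render one proportional bar, scaled against the largest value."""
    if max_value <= 0:
        return ""                       # nothing sensible to scale against
    width = int(value / max_value * max_width)
    if value > 0:
        width = max(width, 1)           # small nonzero values still get a visible bar
    return char * width
```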

Phase 4: Filtering, Sorting, and Search (Days 12-14)

Goal: Interactive data exploration.

  • Implement Dataset.filter() with support for numeric comparisons and text matching
  • Implement Dataset.sort() for any column with correct type-aware ordering
  • Implement keyword search across all text columns (case-insensitive)
  • Add regex support for search using the re module (Chapter 22)
  • Add the regex data cleaning command
  • Implement column selection
  • Build the interactive menu that chains these operations (filter, then sort, then view)
  • Test: Write tests for filtering logic, sort correctness, and regex search

Checkpoint: You can load a sales dataset, filter for transactions over $100 in the Electronics category, sort by date descending, and display only the columns you care about — all through an interactive menu. That's a genuinely powerful workflow.

Phase 5: Reports, Export, and Polish (Days 15-18)

Goal: Comprehensive reports and professional finish.

  • Create the ReportGenerator class that combines statistics, charts, and metadata into a full text report
  • Implement report saving to a timestamped text file
  • Implement CSV export of filtered data
  • Add JSON loading support
  • Audit all error handling — every file operation, every type conversion, every user input
  • Expand test suite to 15+ tests
  • Split code into modules if you haven't already
  • Write a README with usage examples
  • Review against the rubric (see capstone-rubric.md)

Checkpoint: The complete tool. Load any CSV or JSON dataset, explore it interactively, generate a full report with statistics and charts, and export filtered subsets. If Elena Vasquez saw this, she'd want a copy.


Sample Datasets

To test your tool, you'll need datasets. Here are some suggestions for freely available CSV files that exercise different features:

  • Iris dataset (150 rows, 5 columns — 4 numeric, 1 categorical): A classic. Small, clean, and good for testing statistics and histograms.
  • World cities (1,000+ rows — population, latitude, longitude, country): Good for filtering, sorting, and large datasets.
  • Movie ratings (mixed types — titles, years, ratings, genres): Tests text handling, numeric statistics, and frequency tables.
  • Weather data (dates, temperatures, precipitation): Tests date handling and time-based filtering.

You can also generate synthetic datasets using Python scripts — that's actually a useful exercise in its own right.


Stretch Goals

Correlation Analysis (Extends Chapter 17)

Implement Pearson correlation coefficient between two numeric columns. Display a correlation matrix for all numeric columns. The formula involves the same building blocks as standard deviation (means, sums of squared differences), so if you've implemented std_dev, correlation is within reach.
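
A sketch of the Pearson formula built from the same pieces as std_dev. It assumes equal-length, non-empty inputs where both columns actually vary (otherwise the denominator is zero):

```python
def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance over the product of the population standard deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
    std_y = (sum((y - mean_y) ** 2 for y in ys) / n) ** 0.5
    return cov / (std_x * std_y)
```

A production version would guard against zero standard deviation and mismatched lengths; that guard logic is a good candidate for your error-handling tests.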

Trend Analysis (Extends Chapters 17, 22)

For datasets with a date column, compute and display trends: moving averages, month-over-month changes, and growth rates. This requires date parsing (regex), sorting by date, and sliding window calculations (a nice algorithmic exercise).
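
The sliding-window piece is compact once the rows are sorted by date; `moving_average` is an illustrative name:

```python
def moving_average(values, window=3):
    """Average of each consecutive window of values, in order.

    Returns an empty list when there are fewer values than the window size.
    """
    if window > len(values):
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```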

Multiple Dataset Comparison (Extends Chapter 16)

Load two datasets simultaneously and compare them: side-by-side statistics, shared columns, data merge operations. This exercises your architecture — your Dataset class needs to be general enough that two instances can coexist and interact.

Matplotlib Integration (Extends Chapter 23)

Add an optional --graphical flag that generates actual matplotlib charts (scatter plots, line graphs, box plots) alongside the text-based ones. Save them as PNG files. This tests your ability to integrate third-party libraries (Chapter 23) while maintaining a clean architecture.

Interactive Column Calculator (Extends Chapter 3)

Let the user create derived columns using expressions like profit = revenue - cost. Parse the expression, evaluate it for each row, and add the new column to the dataset. This is a challenging stretch goal that exercises string parsing, expression evaluation, and dataset mutation.

Export to HTML (Extends Chapter 7)

Generate an HTML version of the report with properly formatted tables and inline CSS. This is a pure string formatting exercise — you're building HTML the same way you built text-based charts, just with <table> and <tr> tags instead of | and - characters.


What This Project Demonstrates

The Data Dashboard is the most technically demanding capstone because it requires general-purpose thinking. You're not building for one dataset — you're building for any dataset. That generality exercises:

  • Abstraction (Chapter 1): Your code handles datasets it has never seen before, because you abstracted the concept of "a column of data" and "a row of records" rather than hardcoding a specific schema
  • Algorithm implementation (Chapter 17): Every statistical function is an algorithm you implemented from scratch — mean, median, mode, standard deviation, percentile, binning for histograms
  • String mastery (Chapter 7): The table formatter, bar chart renderer, and report generator are all exercises in precise string formatting — alignment, padding, truncation, proportional scaling
  • Data processing (Chapters 8, 9, 10): Reading, parsing, cleaning, filtering, sorting, and exporting structured data
  • Regular expressions (Chapter 22): Pattern-based search and date detection in real-world data
  • Modular design (Chapters 12, 16): Five or more modules with clean interfaces and no circular dependencies
  • Robust error handling (Chapter 11): Graceful handling of messy, incomplete, and malformed real-world data
  • Testing (Chapter 13): Verifiable correctness of statistical calculations — the kind of code where a subtle bug silently produces wrong numbers

If the Finance Tracker proves you can build a complete application and the Trivia Game proves you can manage interactive state, the Data Dashboard proves you can write general-purpose tools. That's the skill that separates someone who programs from someone who is a programmer.


Getting Started

Start by loading a CSV file and printing its contents in a table. That's Phase 1, and it connects directly to Chapter 10's file I/O and Chapter 7's string formatting. Once you can display data cleanly, add one statistic (mean). Then another (median). Then build a bar chart. Each addition is incremental, and each addition makes the tool more useful.

The statistical functions are the most satisfying part of this project to test. Write a test that computes the mean of [2, 4, 6, 8, 10] and asserts it equals 6.0. Write a test that computes the median of [1, 3, 5, 7] and asserts it equals 4.0. Write a test for standard deviation with known values. Every passing test is a guarantee that your analysis produces correct results. That matters when someone trusts your tool's output to make a decision.

If your table formatting looks ragged, revisit Chapter 7's section on str.ljust(), str.rjust(), and f-string format specifications like {value:<20}. Aligned output is the difference between a tool that looks amateur and one that looks professional. Take the time to get it right.
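
As a concrete reminder of what those format specs buy you, here is a tiny alignment helper; `format_cell` is an illustrative name, not part of the assignment:

```python
def format_cell(value, width, numeric=False):
    """Left-align text, right-align numbers, truncate anything too long."""
    text = str(value)[:width]      # truncate long values to the column width
    if numeric:
        return f"{text:>{width}}"  # right-align numbers so digits line up
    return f"{text:<{width}}"      # left-align text for readability
```

Joining a list of formatted cells with `" | "` produces one aligned table row; repeat per row and the ragged-table problem disappears.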

Dr. Patel would approve. Elena would adopt it. And your future self, the one who takes a data science course next semester, will thank you for every statistical function you implemented by hand — because you'll actually understand what the library functions are doing when you call numpy.std() for the first time.


Assessment: See capstone-rubric.md for detailed grading criteria.