Case Study 1: Data Modeling Gone Wrong
"We'll Fix It Later" — How a Growing Retailer's Shortcuts Became a $2.3 Million Problem
The Company
NorthStar Home Goods was a mid-size online retailer based in Minneapolis, specializing in kitchen and home furnishing products. Founded in 2009, they grew rapidly — from $4 million in annual revenue to $85 million over a decade. By 2019, they had 1.2 million customers, 15,000 active SKUs, and were processing 8,000 orders per day.
Their original database had been built in 2010 by a talented application developer named Marcus, who had deep knowledge of Python and web frameworks but limited formal training in database design. Marcus built the system that got NorthStar off the ground, and for that he deserves credit. But the design decisions he made — expedient at the time — would cost the company dearly.
The Original Design
Marcus created the database with six tables. The most important was the ORDERS table, which looked like this:
ORDERS
---
ORDER_ID INTEGER (auto-increment)
CUSTOMER_EMAIL VARCHAR(254)
CUSTOMER_NAME VARCHAR(200)
CUSTOMER_ADDRESS VARCHAR(500)
CUSTOMER_PHONE VARCHAR(20)
ITEMS TEXT -- JSON blob: [{"sku":"K-1001","name":"Chef Knife","qty":2,"price":49.99}, ...]
ORDER_DATE DATE
SHIPPING_METHOD VARCHAR(50)
SHIPPING_COST DOUBLE
TOTAL_AMOUNT DOUBLE
PAYMENT_METHOD VARCHAR(50)
STATUS VARCHAR(50)
NOTES TEXT
There was no CUSTOMER table. There was no ORDER_ITEM table. There was no PRODUCT table with authoritative names and prices. Everything was in one wide table, with the order items stored as a JSON blob in a TEXT column.
Where Things Went Wrong
Years 1–3: It Worked Fine
With 200 orders a day and 10,000 customers, the design was manageable. Queries were fast enough. Marcus wrote application code to parse the JSON items column. Reports were simple. Nobody complained.
Year 4: The First Signs
By 2014, the ORDERS table had 800,000 rows. The first problems appeared:
Customer data inconsistency. A customer who had placed 15 orders over three years had 15 different address entries — some current, some outdated, and two with typos. When the marketing team tried to build a mailing list, they found 340,000 "unique" customers by email, but the customer name and address did not match across orders for 28% of them. Which address was current? Nobody knew.
Price discrepancy. A product called "Chef's Choice Knife Set" had been sold at $129.99 for its first year, then raised to $149.99. The JSON blobs in old orders stored the old price (correctly — that was the price at the time of purchase). But the reporting team's query to calculate "total revenue from Chef's Choice Knife Set" summed amounts from the JSON blobs and got a number that did not match the accounting system. It took two weeks to figure out that the product name had been spelled three different ways across orders ("Chef's Choice Knife Set", "Chefs Choice Knife Set", "Chef Choice Knife - Set"), and the query was only matching one spelling.
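The undercounting is mechanical, and easy to reproduce. A minimal Python sketch (with hypothetical quantities and the three spellings quoted above) of what the reporting query was effectively doing:

```python
# Hypothetical item rows as they sat inside the JSON blobs: one product
# recorded under three different spellings. An exact-match query sees
# only one of them.
items = [
    {"name": "Chef's Choice Knife Set", "qty": 1, "price": 129.99},
    {"name": "Chefs Choice Knife Set",  "qty": 2, "price": 129.99},
    {"name": "Chef Choice Knife - Set", "qty": 1, "price": 149.99},
]

# The reporting team's query, in effect: match one spelling exactly.
reported = sum(i["qty"] * i["price"] for i in items
               if i["name"] == "Chef's Choice Knife Set")

# What accounting saw: every sale of the product, regardless of spelling.
actual = sum(i["qty"] * i["price"] for i in items)

print(round(reported, 2), round(actual, 2))  # 129.99 vs 539.96
```

With a normalized PRODUCT table holding one authoritative name per SKU, there is nothing to misspell at order time, and the two numbers cannot diverge this way.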
Performance degradation. Generating a monthly sales-by-product report required parsing JSON from every order row. With 800,000 rows, the report took 45 minutes. Marcus added a caching layer, which helped — until the cache went stale and served last month's numbers for a board meeting.
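The 45-minute report follows directly from the design: the aggregation cannot run inside the database, so every row's blob must be fetched and parsed in application code. A minimal sqlite3 sketch (hypothetical rows, simplified column names) of that access pattern:

```python
import json
import sqlite3

# Minimal in-memory stand-in for the ORDERS table described above, with
# order items stored as a JSON blob in a TEXT column (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, items TEXT)")
conn.executemany(
    "INSERT INTO orders (items) VALUES (?)",
    [
        (json.dumps([{"sku": "K-1001", "qty": 2, "price": 49.99}]),),
        (json.dumps([{"sku": "K-1001", "qty": 1, "price": 49.99},
                     {"sku": "B-2002", "qty": 3, "price": 12.50}]),),
    ],
)

# The sales-by-product report: every row crosses the wire and its JSON is
# parsed in application code, because the database cannot index or
# aggregate inside the blob. At 800,000 rows, this full scan is the
# 45-minute report.
revenue = {}
for (blob,) in conn.execute("SELECT items FROM orders"):
    for item in json.loads(blob):
        revenue[item["sku"]] = revenue.get(item["sku"], 0) + item["qty"] * item["price"]

print(revenue)
```

With a separate ORDER_ITEM table, the same report is a single `SUM(...) GROUP BY` that the database can execute against an index, never leaving the server.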
Year 6: The Breaking Point
By 2016, the table had 2.3 million rows. Three incidents occurred in rapid succession:
Incident 1 — The Shipping Address Bug. A customer reported receiving someone else's order. Investigation revealed that the application code had a race condition: when two orders were placed within the same second by different customers, the shipping address from one could be written to the other's order row. The root cause was that the application was doing a read-modify-write cycle on the JSON items column, and the customer address was being updated in the same operation. A properly normalized design with a separate CUSTOMER table and ORDER_ITEM table would have made this race condition structurally impossible.
Incident 2 — The Tax Audit. The state of Minnesota audited NorthStar's sales tax collection. The auditors asked for a report of total sales by product category, by quarter, for the past three years. Generating this report required parsing 2.3 million JSON blobs, cross-referencing product SKUs with a separate spreadsheet (not a database table) that mapped SKUs to categories, and handling all the naming inconsistencies. The effort took three employees four weeks. The auditors found discrepancies and assessed a $180,000 penalty for underreported taxable sales — largely because some orders had been miscategorized due to inconsistent product names.
Incident 3 — The Loyalty Program Failure. Marketing launched a loyalty program based on total customer spending. The query to calculate lifetime spend per customer required grouping by CUSTOMER_EMAIL, summing TOTAL_AMOUNT — but 12% of customers had used multiple email addresses over the years. The loyalty program incorrectly split their purchase history, and top customers received lower loyalty tiers than they deserved. Sixty-three customers complained publicly on social media. NorthStar issued $45,000 in gift cards and spent $120,000 on a PR response.
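The failure mode is mechanical. A minimal sketch (hypothetical customer data and a hypothetical $1,000 "gold" threshold) of lifetime spend keyed by email:

```python
# One real customer, two email addresses over the years. Keying lifetime
# spend by CUSTOMER_EMAIL splits her history in two.
orders = [
    {"email": "dana@example.com",   "total": 400.00},
    {"email": "dana@example.com",   "total": 350.00},
    {"email": "dana.s@example.com", "total": 500.00},  # same person, new email
]

spend_by_email = {}
for o in orders:
    spend_by_email[o["email"]] = spend_by_email.get(o["email"], 0.0) + o["total"]

# A $1,250 customer now looks like two smaller ones, and neither fragment
# clears the (hypothetical) $1,000 gold tier.
tiers = {email: "gold" if total >= 1000 else "silver"
         for email, total in spend_by_email.items()}
print(tiers)  # {'dana@example.com': 'silver', 'dana.s@example.com': 'silver'}
```

A CUSTOMER table with a surrogate CUSTOMER_ID, where email is just a mutable attribute, keeps purchase history attached to the person rather than to whichever address they typed that day.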
The Remediation
In late 2016, NorthStar hired a database architect named Priya, who assessed the situation and presented a remediation plan to the board. Her findings:
- No referential integrity. There was no guarantee that an order referenced a real customer or real products.
- No normalization. Customer data was duplicated in every order. Product data existed only inside JSON blobs.
- Wrong data types. DOUBLE was used for monetary amounts, causing penny-level rounding errors that had accumulated to a $14,000 discrepancy over six years.
- No constraints. The STATUS column contained 23 different values, including "Shipped", "shipped", "SHIPPED", "Ship", and "Shiped" (sic).
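The DOUBLE finding is easy to demonstrate. A minimal Python sketch (illustrative amounts, not NorthStar's figures) contrasting binary floating point with exact decimal arithmetic:

```python
from decimal import Decimal

# Ten line items at $0.10 each. Binary floating point cannot represent
# 0.10 exactly, and the tiny per-item error survives the addition.
as_float = sum([0.10] * 10)
as_decimal = sum([Decimal("0.10")] * 10, Decimal("0"))

print(as_float)                       # 0.9999999999999999
print(as_decimal)                     # 1.00
print(as_float == 1.0)                # False
print(as_decimal == Decimal("1.00"))  # True
```

Scaled across millions of order rows and six years of arithmetic, fractions of a cent of this kind are how a $14,000 discrepancy accumulates. DECIMAL(12,2) stores exact cents and makes the drift impossible.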
Priya designed a normalized schema:
CUSTOMER (CUSTOMER_ID PK, EMAIL, NAME, PHONE, ...)
CUSTOMER_ADDRESS (ADDRESS_ID PK, CUSTOMER_ID FK, ADDRESS_TYPE, ...)
PRODUCT (PRODUCT_ID PK, SKU, NAME, CATEGORY_ID FK, CURRENT_PRICE, ...)
PRODUCT_CATEGORY (CATEGORY_ID PK, CATEGORY_NAME, TAX_RATE, ...)
ORDERS (ORDER_ID PK, CUSTOMER_ID FK, ORDER_DATE, SHIPPING_ADDRESS_ID FK, ...)
ORDER_ITEM (ORDER_ID FK, PRODUCT_ID FK, QUANTITY, UNIT_PRICE_AT_SALE, ...)
Key design decisions:
- UNIT_PRICE_AT_SALE in ORDER_ITEM captured the historical price, while PRODUCT.CURRENT_PRICE held the current price. This solved the price-at-time-of-sale problem.
- All monetary columns used DECIMAL(12,2).
- CHECK constraints enforced valid STATUS values.
- Foreign keys enforced referential integrity.
- Customer deduplication used email matching with manual review for the 12% who had multiple emails.
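These decisions can be sketched in executable DDL. The following uses SQLite through Python's sqlite3 as a stand-in (the case study does not name NorthStar's actual engine), with simplified column lists and a hypothetical status vocabulary; SQLite treats DECIMAL(12,2) as numeric affinity, so it is illustrative here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL
);
CREATE TABLE product (
    product_id    INTEGER PRIMARY KEY,
    sku           TEXT NOT NULL UNIQUE,
    name          TEXT NOT NULL,
    current_price DECIMAL(12,2) NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  DATE NOT NULL,
    status      TEXT NOT NULL
        -- hypothetical vocabulary; the point is that only these spellings exist
        CHECK (status IN ('PENDING', 'SHIPPED', 'DELIVERED', 'CANCELLED'))
);
CREATE TABLE order_item (
    order_id           INTEGER NOT NULL REFERENCES orders(order_id),
    product_id         INTEGER NOT NULL REFERENCES product(product_id),
    quantity           INTEGER NOT NULL CHECK (quantity > 0),
    unit_price_at_sale DECIMAL(12,2) NOT NULL,  -- historical price, frozen at sale
    PRIMARY KEY (order_id, product_id)
);
""")

conn.execute("INSERT INTO customer (email, name) VALUES ('a@example.com', 'Ada')")

# The database itself now rejects the failure modes from the case study:
try:  # an order pointing at a customer that does not exist
    conn.execute("INSERT INTO orders (customer_id, order_date, status) "
                 "VALUES (999, '2017-01-05', 'SHIPPED')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:  # a misspelled status value
    conn.execute("INSERT INTO orders (customer_id, order_date, status) "
                 "VALUES (1, '2017-01-05', 'Shiped')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

No application code, however buggy, can write an order for a nonexistent customer or invent a 24th spelling of "Shipped" through this schema.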
The Cost
The migration project took eight months. Together with the penalties and incident costs already incurred, the total came to $1,455,000:
| Item | Cost |
|---|---|
| Database architect (Priya) — 8 months | $160,000 |
| Two additional developers — 6 months | $180,000 |
| Data cleansing consultant — 4 months | $120,000 |
| QA and testing | $90,000 |
| Infrastructure (new database servers, migration tooling) | $75,000 |
| Lost engineering productivity (feature work delayed) | $450,000 |
| Customer-facing incidents during migration | $35,000 |
| Previous tax penalty | $180,000 |
| Loyalty program remediation | $165,000 |
| Total | $1,455,000 |
Adding the costs the table does not capture (the three employees who spent four weeks assembling the audit report, lost customers, and years of slowed feature work), the total cost of the original design shortcuts exceeded $2.3 million.
Lessons for Meridian Bank
- Normalize from the start. The cost of normalizing an existing database with years of inconsistent data is orders of magnitude higher than designing it correctly initially.
- Separate entity types into separate tables. Customers, products, and orders are different things with different lifecycles. They belong in different tables.
- Use the right data types. DECIMAL for money. CHAR for fixed codes. VARCHAR for variable text. CHECK constraints for enumerated values. These are not premature optimization — they are basic hygiene.
- Enforce referential integrity in the database. Application code has bugs. Race conditions exist. Multiple applications access the same data. Only the database can enforce integrity universally and atomically.
- The spreadsheet is not a database. NorthStar's product category mapping lived in a spreadsheet. When the spreadsheet and the database disagreed, nobody knew which was correct.
The Meridian Bank core tables we designed in Section 2.9 — with their primary keys, foreign keys, CHECK constraints, and normalization to 3NF — embody every lesson from NorthStar's failure. These design choices are not academic exercises. They are insurance policies against exactly the kind of slow-motion disaster that NorthStar experienced.
"The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time." — Tom Cargill, Bell Labs
In data modeling, the ratio is worse. The first design takes 10% of the effort. Fixing it after five years of bad data takes the other 190%.