Case Study 1: From O(n²) to O(n log n) — Optimizing a Real Program

The Problem

A student named Tomás writes a program to find all duplicate expenses in PennyWise. His first implementation uses a brute-force approach:

function FindDuplicatesBrute(const Expenses: TExpenseArray;
                             Count: Integer): Integer;
var
  I, J: Integer;
  Found: Integer;
begin
  Found := 0;
  for I := 0 to Count - 2 do
    for J := I + 1 to Count - 1 do
      if (Expenses[I].Amount = Expenses[J].Amount) and
         (Expenses[I].Date = Expenses[J].Date) then begin
        Inc(Found);
        WriteLn('  Duplicate: $', Expenses[I].Amount:0:2,
                ' on ', FormatDateTime('yyyy-mm-dd', Expenses[I].Date));
      end;
  WriteLn('  Total duplicates: ', Found);
  Result := Found;
end;

Performance on Tomás's Data

  Expenses    Time (brute force)
     100            1 ms
   1,000           85 ms
   5,000        2,100 ms
  10,000        8,400 ms
  50,000      210,000 ms  (3.5 minutes!)

When the input size doubles, the running time grows by roughly 4x (for example, 2,100 ms at 5,000 expenses versus 8,400 ms at 10,000) — consistent with O(n²).

The Optimization

Tomás realizes he can sort the expenses first, then scan for adjacent duplicates:

function FindDuplicatesSorted(var Expenses: TExpenseArray;
                              Count: Integer): Integer;
var
  I: Integer;
  Found: Integer;
begin
  { Sort by date, then amount — O(n log n) }
  QuickSortExpenses(Expenses, 0, Count - 1, @CompareByDateThenAmount);

  { Scan for adjacent duplicates — O(n) }
  Found := 0;
  for I := 0 to Count - 2 do
    if (Expenses[I].Amount = Expenses[I+1].Amount) and
       (Expenses[I].Date = Expenses[I+1].Date) then begin
      Inc(Found);
      WriteLn('  Duplicate: $', Expenses[I].Amount:0:2,
              ' on ', FormatDateTime('yyyy-mm-dd', Expenses[I].Date));
    end;
  WriteLn('  Total duplicates: ', Found);
  Result := Found;
end;
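The sort call above assumes a QuickSortExpenses routine and a CompareByDateThenAmount comparator, neither of which is shown. A minimal sketch of the comparator — assuming a TExpense record with Date: TDateTime and Amount: Currency fields — might look like this:

```pascal
{ Hypothetical sketch: orders expenses by date, then by amount,
  so that exact duplicates land in adjacent positions after sorting. }
function CompareByDateThenAmount(const A, B: TExpense): Integer;
begin
  if A.Date < B.Date then
    Result := -1
  else if A.Date > B.Date then
    Result := 1
  else if A.Amount < B.Amount then
    Result := -1
  else if A.Amount > B.Amount then
    Result := 1
  else
    Result := 0;   { equal on both keys: a potential duplicate }
end;
```

Any total order works, as long as records with equal date and amount compare as equal; the scan only relies on duplicates being adjacent.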

Performance After Optimization

  Expenses    Brute Force     Sort + Scan    Speedup
     100            1 ms           1 ms          1x
   1,000           85 ms           2 ms         42x
   5,000        2,100 ms           8 ms        262x
  10,000        8,400 ms          18 ms        466x
  50,000      210,000 ms          95 ms      2,210x

The optimized version is over 2,000 times faster for 50,000 expenses. What took 3.5 minutes now takes under a tenth of a second.

Analysis

Brute force: O(n²), since it checks every pair. Sort + scan: O(n log n) for the sort plus O(n) for the scan, which is O(n log n) overall. At n = 50,000 the gap is enormous: roughly 50,000 × 49,999 / 2 ≈ 1.25 billion pairwise comparisons, versus about 50,000 × log₂ 50,000 ≈ 780,000 comparisons during the sort.

The doubling ratio for the optimized version is approximately 2.2x — consistent with O(n log n).
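The two doubling ratios can be read straight off the tables above. As a quick sketch of the check:

```pascal
program GrowthCheck;
{ Sketch: empirical doubling ratios from the measured times above.
  O(n^2) predicts roughly 4x per doubling of n;
  O(n log n) predicts a little over 2x. }
begin
  WriteLn('Brute force, 5,000 -> 10,000: ', (8400 / 2100):0:2, 'x');
  WriteLn('Sort + scan, 5,000 -> 10,000: ', (18 / 8):0:2, 'x');
end.
```

The first ratio comes out at 4.00 and the second at 2.25, matching the O(n²) and O(n log n) predictions respectively.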

Verification

To verify that the optimization produces the same results, Tomás has each version return its duplicate count, runs both on the same data, and asserts the counts match:

{ Assert: both approaches find the same number of duplicates }
BruteCount := FindDuplicatesBrute(Expenses, Count);
SortedCount := FindDuplicatesSorted(Expenses, Count);
Assert(BruteCount = SortedCount,
       'Optimization error: different duplicate counts!');

One caveat: the two counts agree only if no expense appears more than twice. An expense repeated k times produces k·(k−1)/2 matching pairs in the brute-force version but only k−1 adjacent matches in the scan, so the assertion can fail on triplicates even when both versions are correct.
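One way to make the two counts agree even when an expense appears three or more times is to count redundant records: each record that duplicates an earlier one counts exactly once. A group of k identical expenses then contributes k − 1 in both versions, since the sorted scan already reports k − 1 adjacent matches per group. A hedged sketch of the brute-force side (CountRedundantBrute is a hypothetical name, not from the original program):

```pascal
{ Sketch: count records that duplicate some earlier record.
  A group of k identical expenses contributes exactly k - 1,
  matching the adjacent-match count of the sorted scan. }
function CountRedundantBrute(const Expenses: TExpenseArray;
                             Count: Integer): Integer;
var
  I, J: Integer;
begin
  Result := 0;
  for J := 1 to Count - 1 do
    for I := 0 to J - 1 do
      if (Expenses[I].Amount = Expenses[J].Amount) and
         (Expenses[I].Date = Expenses[J].Date) then
      begin
        Inc(Result);   { J duplicates an earlier record }
        Break;         { count J at most once }
      end;
end;
```

This keeps the brute-force version O(n²) but makes its count directly comparable to the sorted scan's.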

The Lesson

Tomás did not change his hardware. He did not rewrite his program in assembly. He did not use compiler optimization flags. He changed the algorithm — from O(n²) brute-force comparison to O(n log n) sort-and-scan — and gained a 2,000x speedup.

This is the most important lesson in all of computer science optimization: improve the algorithm first.

Discussion Questions

  1. For n = 100, the brute force and optimized versions both take about 1 ms. Why would you still prefer the optimized version?

  2. Could Tomás achieve even better performance using a hash table? What would the expected complexity be?

  3. The sort changes the order of Tomás's expense array. What if he needs to preserve the original order? How would he modify the approach?

  4. Tomás's optimization assumes that duplicates have exactly matching amounts and dates. What if "duplicate" means amounts within $0.01 and dates within 1 day? How does this affect the algorithm choice?