Case Study 1: From O(n²) to O(n log n) — Optimizing a Real Program

The Problem

A student named Tomás writes a program to find all duplicate expenses in PennyWise. His first implementation uses a brute-force approach:

function FindDuplicatesBrute(const Expenses: TExpenseArray;
                             Count: Integer): Integer;
var
  I, J: Integer;
  Found: Integer;
begin
  Found := 0;
  for I := 0 to Count - 2 do
    for J := I + 1 to Count - 1 do
      if (Expenses[I].Amount = Expenses[J].Amount) and
         (Expenses[I].Date = Expenses[J].Date) then begin
        Inc(Found);
        WriteLn('  Duplicate: $', Expenses[I].Amount:0:2,
                ' on ', FormatDateTime('yyyy-mm-dd', Expenses[I].Date));
      end;
  WriteLn('  Total duplicates: ', Found);
  Result := Found;
end;

Performance on Tomás's Data

  Expenses    Time (brute force)
     100            1 ms
   1,000           85 ms
   5,000        2,100 ms
  10,000        8,400 ms
  50,000      210,000 ms  (3.5 minutes!)

When the input size doubles, the running time grows by roughly 4x (for example, 2,100 ms at 5,000 expenses versus 8,400 ms at 10,000) — consistent with O(n²).

The Optimization

Tomás realizes he can sort the expenses first, then scan for adjacent duplicates:

function FindDuplicatesSorted(var Expenses: TExpenseArray;
                              Count: Integer): Integer;
var
  I: Integer;
  Found: Integer;
begin
  { Sort by date, then amount — O(n log n) }
  QuickSortExpenses(Expenses, 0, Count - 1, @CompareByDateThenAmount);

  { Scan for adjacent duplicates — O(n) }
  Found := 0;
  for I := 0 to Count - 2 do
    if (Expenses[I].Amount = Expenses[I+1].Amount) and
       (Expenses[I].Date = Expenses[I+1].Date) then begin
      Inc(Found);
      WriteLn('  Duplicate: $', Expenses[I].Amount:0:2,
              ' on ', FormatDateTime('yyyy-mm-dd', Expenses[I].Date));
    end;
  WriteLn('  Total duplicates: ', Found);
  Result := Found;
end;
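The sort call above assumes a QuickSortExpenses routine and a CompareByDateThenAmount comparator, neither of which is shown. A minimal sketch of the comparator — assuming a TExpense record with Date: TDateTime and Amount: Currency fields — might look like this:

```pascal
{ Hypothetical sketch: orders expenses by date, then by amount,
  so that exact duplicates land in adjacent positions after sorting. }
function CompareByDateThenAmount(const A, B: TExpense): Integer;
begin
  if A.Date < B.Date then
    Result := -1
  else if A.Date > B.Date then
    Result := 1
  else if A.Amount < B.Amount then
    Result := -1
  else if A.Amount > B.Amount then
    Result := 1
  else
    Result := 0;   { equal on both keys: a potential duplicate }
end;
```

Any total order works, as long as records with equal date and amount compare as equal; the scan only relies on duplicates being adjacent.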

Performance After Optimization

  Expenses    Brute Force     Sort + Scan    Speedup
     100            1 ms           1 ms          1x
   1,000           85 ms           2 ms         42x
   5,000        2,100 ms           8 ms        262x
  10,000        8,400 ms          18 ms        466x
  50,000      210,000 ms          95 ms      2,210x

The optimized version is over 2,000 times faster for 50,000 expenses. What took 3.5 minutes now takes under a tenth of a second.

Analysis

Brute force: O(n²), since it checks every pair. Sort + scan: O(n log n) for the sort plus O(n) for the scan, which is O(n log n) overall. At n = 50,000 the gap is enormous: roughly 50,000 × 49,999 / 2 ≈ 1.25 billion pairwise comparisons, versus about 50,000 × log₂ 50,000 ≈ 780,000 comparisons during the sort.

The doubling ratio for the optimized version is approximately 2.2x — consistent with O(n log n).
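The two doubling ratios can be read straight off the tables above. As a quick sketch of the check:

```pascal
program GrowthCheck;
{ Sketch: empirical doubling ratios from the measured times above.
  O(n^2) predicts roughly 4x per doubling of n;
  O(n log n) predicts a little over 2x. }
begin
  WriteLn('Brute force, 5,000 -> 10,000: ', (8400 / 2100):0:2, 'x');
  WriteLn('Sort + scan, 5,000 -> 10,000: ', (18 / 8):0:2, 'x');
end.
```

The first ratio comes out at 4.00 and the second at 2.25, matching the O(n²) and O(n log n) predictions respectively.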

Verification

To verify that the optimization produces the same results, Tomás has each version return its duplicate count, runs both on the same data, and asserts the counts match:

{ Assert: both approaches find the same number of duplicates }
BruteCount := FindDuplicatesBrute(Expenses, Count);
SortedCount := FindDuplicatesSorted(Expenses, Count);
Assert(BruteCount = SortedCount,
       'Optimization error: different duplicate counts!');

One caveat: the two counts agree only if no expense appears more than twice. An expense repeated k times produces k·(k−1)/2 matching pairs in the brute-force version but only k−1 adjacent matches in the scan, so the assertion can fail on triplicates even when both versions are correct.
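One way to make the two counts agree even when an expense appears three or more times is to count redundant records: each record that duplicates an earlier one counts exactly once. A group of k identical expenses then contributes k − 1 in both versions, since the sorted scan already reports k − 1 adjacent matches per group. A hedged sketch of the brute-force side (CountRedundantBrute is a hypothetical name, not from the original program):

```pascal
{ Sketch: count records that duplicate some earlier record.
  A group of k identical expenses contributes exactly k - 1,
  matching the adjacent-match count of the sorted scan. }
function CountRedundantBrute(const Expenses: TExpenseArray;
                             Count: Integer): Integer;
var
  I, J: Integer;
begin
  Result := 0;
  for J := 1 to Count - 1 do
    for I := 0 to J - 1 do
      if (Expenses[I].Amount = Expenses[J].Amount) and
         (Expenses[I].Date = Expenses[J].Date) then
      begin
        Inc(Result);   { J duplicates an earlier record }
        Break;         { count J at most once }
      end;
end;
```

This keeps the brute-force version O(n²) but makes its count directly comparable to the sorted scan's.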

The Lesson

Tomás did not change his hardware. He did not rewrite his program in assembly. He did not use compiler optimization flags. He changed the algorithm — from O(n²) brute-force comparison to O(n log n) sort-and-scan — and gained a 2,000x speedup.

This is the most important lesson in all of computer science optimization: improve the algorithm first.

Discussion Questions

  1. For n = 100, the brute force and optimized versions both take about 1 ms. Why would you still prefer the optimized version?

  2. Could Tomás achieve even better performance using a hash table? What would the expected complexity be?

  3. The sort changes the order of Tomás's expense array. What if he needs to preserve the original order? How would he modify the approach?

  4. Tomás's optimization assumes that duplicates have exactly matching amounts and dates. What if "duplicate" means amounts within $0.01 and dates within 1 day? How does this affect the algorithm choice?