Case Study 2: Parsing Real-World Data — CSV Files
Overview
Comma-Separated Values (CSV) is one of the most common data exchange formats in the world. Banks export transaction histories in CSV. Spreadsheets save to CSV. Government agencies publish open data as CSV. Despite its apparent simplicity — "just split on commas" — CSV parsing is surprisingly tricky once you encounter real-world data.
In this case study, we build a robust CSV parser in Pascal that handles the edge cases that trip up naive implementations: quoted fields, commas embedded within values, escaped quotes, blank lines, and malformed rows. We then use it to process a file of financial transactions, demonstrating the practical payoff of careful string processing.
Part 1: Why CSV Is Harder Than It Looks
Consider these CSV lines:
John Smith,42,New York
"Jane ""JD"" Doe",38,"Portland, Oregon"
Bob Wilson,,"No age given"
,25,
"Multi
line field",99,Test
A naive "split on commas" approach fails on line 2 (the comma inside `"Portland, Oregon"` is data, not a delimiter) and on line 5 (the quoted field spans two physical lines). A correct parser must handle:
- Quoted fields: A field enclosed in double quotes can contain commas, which are data, not delimiters.
- Escaped quotes: A double quote within a quoted field is represented as two double quotes (`""`).
- Empty fields: Two consecutive commas indicate an empty field.
- Multiline fields: A quoted field can contain line breaks (though we will handle this as an extension — single-line parsing first).
- Trailing whitespace: Some exporters add spaces around delimiters.
- Inconsistent column counts: Malformed rows may have too many or too few fields.
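To see the failure concretely, here is a minimal sketch of the naive approach applied to line 2. (In a Pascal string literal, a double quote needs no escaping, so the `""` sequences in the literal really are two double-quote characters.)

```pascal
program NaiveSplitDemo;
{$mode objfpc}{$H+}
var
  line, field: String;
  i, n: Integer;
begin
  line := '"Jane ""JD"" Doe",38,"Portland, Oregon"';
  field := '';
  n := 0;
  { Naive split: treat every comma as a delimiter }
  for i := 1 to Length(line) do
    if line[i] = ',' then
    begin
      Inc(n);
      WriteLn(n, ': ', field);
      field := '';
    end
    else
      field := field + line[i];
  Inc(n);
  WriteLn(n, ': ', field);
end.
```

This prints four fields instead of three: `"Portland` and ` Oregon"` come out as separate fields, and the quote characters are still attached to the data.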
Part 2: The State Machine Approach
The most reliable way to parse CSV is with a finite state machine. At any point during parsing, we are in one of three states:
| State | Description |
|---|---|
| `FIELD_START` | We are at the beginning of a new field |
| `IN_UNQUOTED` | We are reading characters inside an unquoted field |
| `IN_QUOTED` | We are reading characters inside a quoted field |
Transitions between states are triggered by the current character:
- In `FIELD_START`: if we see `"`, transition to `IN_QUOTED`. If we see `,`, emit an empty field and stay in `FIELD_START`. Otherwise, transition to `IN_UNQUOTED`.
- In `IN_UNQUOTED`: if we see `,`, emit the field and transition to `FIELD_START`. Otherwise, append the character.
- In `IN_QUOTED`: if we see `"`, peek at the next character. If it is also `"`, append a single `"` (escaped quote). If it is `,` or end of line, emit the field and transition to `FIELD_START`. Otherwise, it is a malformed quote.
Implementation
program CSVParserDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  MAX_FIELDS = 50;
  MAX_ROWS   = 1000;

type
  TParseState = (psFieldStart, psInUnquoted, psInQuoted);

  TCSVRow = record
    Fields    : array[0..MAX_FIELDS - 1] of String;
    FieldCount: Integer;
  end;

  TCSVData = record
    Rows    : array[0..MAX_ROWS - 1] of TCSVRow;
    RowCount: Integer;
  end;
function ParseCSVLine(const line: String; var row: TCSVRow): Boolean;
var
  state: TParseState;
  i: Integer;
  current: String;
  c: Char;
begin
  Result := True;
  row.FieldCount := 0;
  current := '';
  state := psFieldStart;
  i := 1;
  while i <= Length(line) do
  begin
    c := line[i];
    case state of
      psFieldStart:
        begin
          if c = '"' then
          begin
            state := psInQuoted;
            current := '';
          end
          else if c = ',' then
          begin
            { Empty field }
            if row.FieldCount < MAX_FIELDS then
            begin
              row.Fields[row.FieldCount] := '';
              Inc(row.FieldCount);
            end;
            { Stay in psFieldStart }
          end
          else if (c = ' ') or (c = #9) then
          begin
            { Skip leading whitespace before field }
          end
          else
          begin
            state := psInUnquoted;
            current := c;
          end;
        end;
      psInUnquoted:
        begin
          if c = ',' then
          begin
            if row.FieldCount < MAX_FIELDS then
            begin
              row.Fields[row.FieldCount] := TrimRight(current);
              Inc(row.FieldCount);
            end;
            current := '';
            state := psFieldStart;
          end
          else
            current := current + c;
        end;
      psInQuoted:
        begin
          if c = '"' then
          begin
            { Peek at next character }
            if (i < Length(line)) and (line[i + 1] = '"') then
            begin
              { Escaped quote: append a single quote }
              current := current + '"';
              Inc(i); { Skip the second quote }
            end
            else
            begin
              { End of quoted field }
              if row.FieldCount < MAX_FIELDS then
              begin
                row.Fields[row.FieldCount] := current;
                Inc(row.FieldCount);
              end;
              current := '';
              state := psFieldStart;
              { Skip to the comma (or end of line) }
              Inc(i);
              while (i <= Length(line)) and (line[i] <> ',') do
                Inc(i);
              { i now points to the comma or past the end; the main
                loop's Inc(i) below advances past the comma }
            end;
          end
          else
            current := current + c;
        end;
    end; { case }
    Inc(i);
  end; { while }
  { Handle the last field (if line does not end with comma) }
  if (state = psInUnquoted) or (state = psFieldStart) then
  begin
    if (state = psInUnquoted) or (current <> '') then
    begin
      if row.FieldCount < MAX_FIELDS then
      begin
        row.Fields[row.FieldCount] := TrimRight(current);
        Inc(row.FieldCount);
      end;
    end
    else if (Length(line) > 0) and (line[Length(line)] = ',') then
    begin
      { Trailing comma means trailing empty field }
      if row.FieldCount < MAX_FIELDS then
      begin
        row.Fields[row.FieldCount] := '';
        Inc(row.FieldCount);
      end;
    end;
  end
  else if state = psInQuoted then
  begin
    { Unterminated quote: this is malformed }
    WriteLn('Warning: Unterminated quoted field');
    if row.FieldCount < MAX_FIELDS then
    begin
      row.Fields[row.FieldCount] := current;
      Inc(row.FieldCount);
    end;
    Result := False; { Signal malformed row }
  end;
end;
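To exercise the parser, a small driver can feed it the tricky line from Part 1. This is a sketch: it assumes it is appended as the main body of the `CSVParserDemo` program, after the `ParseCSVLine` declaration.

```pascal
{ Main program body (a sketch; append after ParseCSVLine above) }
var
  demoRow: TCSVRow;
  k: Integer;
begin
  if ParseCSVLine('"Jane ""JD"" Doe",38,"Portland, Oregon"', demoRow) then
    for k := 0 to demoRow.FieldCount - 1 do
      WriteLn(k, ': [', demoRow.Fields[k], ']');
end.
```

This prints the three fields with quoting and escaping resolved: `Jane "JD" Doe`, `38`, and `Portland, Oregon`. The embedded comma survives, and the doubled quotes have collapsed to single quotes.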
Part 3: Processing Financial Transactions
Now let us use the parser on a realistic dataset. Suppose we have a file transactions.csv:
Date,Category,Amount,Description,Account
2025-03-01,Income,3200.00,Salary - ACME Corp,Checking
2025-03-02,Rent,-1200.00,March Rent Payment,Checking
2025-03-03,Groceries,-87.42,Whole Foods Market,Credit Card
2025-03-04,Utilities,-145.00,"Electric, Gas & Water",Checking
2025-03-05,Dining,-32.50,"Joe's Diner",Credit Card
2025-03-06,Groceries,-62.18,Trader Joe's,Credit Card
2025-03-07,Transport,-45.00,Monthly Bus Pass,Checking
2025-03-08,Entertainment,-15.99,Netflix Subscription,Credit Card
2025-03-09,Groceries,-94.33,"Fresh Market, Organic Section",Credit Card
2025-03-10,Income,250.00,Freelance Project Payment,Savings
2025-03-11,Medical,-25.00,Dr. Smith Co-pay,Checking
2025-03-12,Dining,-28.75,"Sushi ""Zen"" Restaurant",Credit Card
Note the test cases: line 4 has a comma in a quoted field, line 5 has an apostrophe, line 9 has a comma in a quoted field, and line 12 has escaped quotes inside a quoted field.
Loading and Analyzing the Data
type
  TTransaction = record
    TransDate  : String;
    Category   : String;
    Amount     : Real;
    Description: String;
    Account    : String;
  end;

procedure LoadTransactions(const data: TCSVData;
                           var trans: array of TTransaction;
                           var count: Integer);
var
  i, code: Integer;
  amountStr: String;
begin
  count := 0;
  { Skip header row (i = 1 instead of 0) }
  for i := 1 to data.RowCount - 1 do
  begin
    if count > High(trans) then
      Break; { Caller's array is full; ignore remaining rows }
    if data.Rows[i].FieldCount < 5 then
    begin
      WriteLn(Format('Warning: Row %d has only %d fields, skipping.',
                     [i + 1, data.Rows[i].FieldCount]));
      Continue;
    end;
    trans[count].TransDate := data.Rows[i].Fields[0];
    trans[count].Category  := data.Rows[i].Fields[1];
    amountStr := Trim(data.Rows[i].Fields[2]);
    Val(amountStr, trans[count].Amount, code);
    if code <> 0 then
    begin
      WriteLn(Format('Warning: Invalid amount "%s" on row %d, skipping.',
                     [amountStr, i + 1]));
      Continue;
    end;
    trans[count].Description := data.Rows[i].Fields[3];
    trans[count].Account     := data.Rows[i].Fields[4];
    Inc(count);
  end;
end;
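The glue that fills a `TCSVData` from disk is not shown above; a minimal sketch might look like the following. (`LoadCSVFile` is a hypothetical name; it handles single-line fields only and relies on `ParseCSVLine` and `MAX_ROWS` from Part 2.)

```pascal
{ Read a CSV file into a TCSVData record, one line per row.
  A sketch: assumes no field contains an embedded line break. }
procedure LoadCSVFile(const fileName: String; var data: TCSVData);
var
  f: TextFile;
  line: String;
begin
  data.RowCount := 0;
  AssignFile(f, fileName);
  Reset(f);
  while (not Eof(f)) and (data.RowCount < MAX_ROWS) do
  begin
    ReadLn(f, line);
    if Trim(line) = '' then
      Continue; { Skip blank lines }
    if not ParseCSVLine(line, data.Rows[data.RowCount]) then
      WriteLn('Warning: malformed row ', data.RowCount + 1);
    Inc(data.RowCount);
  end;
  CloseFile(f);
end;
```

With this in place, the whole pipeline is `LoadCSVFile('transactions.csv', data)` followed by `LoadTransactions(data, trans, count)`.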
Summary Report
procedure PrintSummary(const trans: array of TTransaction;
                       count: Integer);
var
  i: Integer;
  totalIncome, totalExpenses: Real;
  categories: array[0..49] of String;
  catTotals: array[0..49] of Real;
  catCount: Integer;
  found: Boolean;
  j: Integer;
begin
  totalIncome := 0;
  totalExpenses := 0;
  catCount := 0;
  for i := 0 to count - 1 do
  begin
    if trans[i].Amount >= 0 then
      totalIncome := totalIncome + trans[i].Amount
    else
      totalExpenses := totalExpenses + trans[i].Amount;
    { Track category totals }
    found := False;
    for j := 0 to catCount - 1 do
    begin
      if LowerCase(categories[j]) = LowerCase(trans[i].Category) then
      begin
        catTotals[j] := catTotals[j] + trans[i].Amount;
        found := True;
        Break;
      end;
    end;
    if not found then
    begin
      categories[catCount] := trans[i].Category;
      catTotals[catCount] := trans[i].Amount;
      Inc(catCount);
    end;
  end;
  { Print report }
  WriteLn;
  WriteLn('=== Financial Summary ===');
  WriteLn(Format('Total transactions: %d', [count]));
  WriteLn(Format('Total income: $%10.2f', [totalIncome]));
  WriteLn(Format('Total expenses: $%10.2f', [totalExpenses]));
  WriteLn(Format('Net: $%10.2f', [totalIncome + totalExpenses]));
  WriteLn;
  WriteLn('--- By Category ---');
  WriteLn(Format('%-15s %10s', ['Category', 'Total']));
  WriteLn(StringOfChar('-', 27));
  for j := 0 to catCount - 1 do
    WriteLn(Format('%-15s $%9.2f', [categories[j], catTotals[j]]));
end;
Expected output:
=== Financial Summary ===
Total transactions: 12
Total income: $   3450.00
Total expenses: $  -1736.17
Net: $   1713.83

--- By Category ---
Category             Total
---------------------------
Income          $  3450.00
Rent            $ -1200.00
Groceries       $  -243.93
Utilities       $  -145.00
Dining          $   -61.25
Transport       $   -45.00
Entertainment   $   -15.99
Medical         $   -25.00
Part 4: Validation and Error Handling
Real-world CSV files are messy. Here are common problems and how to handle them:
Problem 1: Inconsistent Column Counts
procedure ValidateRow(const row: TCSVRow; expectedFields: Integer;
                      lineNum: Integer; var errors: Integer);
begin
  if row.FieldCount <> expectedFields then
  begin
    WriteLn(Format('ERROR line %d: Expected %d fields, found %d',
                   [lineNum, expectedFields, row.FieldCount]));
    Inc(errors);
  end;
end;
Problem 2: Invalid Date Formats
function IsValidDate(const s: String): Boolean;
var
  year, month, day, code: Integer;
begin
  Result := False;
  if Length(s) <> 10 then Exit;
  if (s[5] <> '-') or (s[8] <> '-') then Exit;
  Val(Copy(s, 1, 4), year, code);
  if code <> 0 then Exit;
  Val(Copy(s, 6, 2), month, code);
  if code <> 0 then Exit;
  Val(Copy(s, 9, 2), day, code);
  if code <> 0 then Exit;
  Result := (year >= 2000) and (year <= 2099) and
            (month >= 1) and (month <= 12) and
            (day >= 1) and (day <= 31);
end;
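Note that the range check above still accepts impossible dates such as `2025-02-31`. A stricter variant can delegate the calendar logic to `TryEncodeDate` from `SysUtils`, which knows month lengths and leap years. This is a sketch; `IsValidDateStrict` is a hypothetical name, not part of the chapter's code.

```pascal
uses
  SysUtils;

function IsValidDateStrict(const s: String): Boolean;
var
  year, month, day: Word;
  code: Integer;
  dt: TDateTime;
begin
  Result := False;
  if (Length(s) <> 10) or (s[5] <> '-') or (s[8] <> '-') then Exit;
  Val(Copy(s, 1, 4), year, code);
  if code <> 0 then Exit;
  Val(Copy(s, 6, 2), month, code);
  if code <> 0 then Exit;
  Val(Copy(s, 9, 2), day, code);
  if code <> 0 then Exit;
  { TryEncodeDate rejects out-of-range combinations,
    e.g. month 13 or February 31 }
  Result := TryEncodeDate(year, month, day, dt);
end;
```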
Problem 3: Encoding Issues and Special Characters
function SanitizeField(const s: String): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if Ord(s[i]) >= 32 then { Only printable characters }
      Result := Result + s[i]
    else if s[i] = #9 then
      Result := Result + ' '; { Replace tab with space }
    { else skip control characters }
  end;
  Result := Trim(Result);
end;
Part 5: Writing CSV Output
Parsing is only half the story — sometimes you need to write CSV files too. The key rule: if a field contains a comma, a double quote, or a newline, it must be quoted, and any double quotes within it must be doubled.
function CSVEscapeField(const field: String): String;
var
  needsQuoting: Boolean;
begin
  needsQuoting := (Pos(',', field) > 0) or
                  (Pos('"', field) > 0) or
                  (Pos(#10, field) > 0) or
                  (Pos(#13, field) > 0);
  if needsQuoting then
    Result := '"' + StringReplace(field, '"', '""', [rfReplaceAll]) + '"'
  else
    Result := field;
end;
procedure WriteCSVRow(var f: TextFile; const fields: array of String);
var
  i: Integer;
begin
  for i := Low(fields) to High(fields) do
  begin
    if i > Low(fields) then
      Write(f, ',');
    Write(f, CSVEscapeField(fields[i]));
  end;
  WriteLn(f);
end;
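For example, writing the trickiest transaction from Part 3 back out exercises the quoting rule. This sketch assumes the two routines above and writes to a hypothetical `out.csv`:

```pascal
{ A sketch: round-trip one row through WriteCSVRow }
var
  f: TextFile;
begin
  AssignFile(f, 'out.csv');
  Rewrite(f);
  WriteCSVRow(f, ['2025-03-12', 'Dining', '-28.75',
                  'Sushi "Zen" Restaurant', 'Credit Card']);
  CloseFile(f);
end.
```

The resulting line is `2025-03-12,Dining,-28.75,"Sushi ""Zen"" Restaurant",Credit Card`: only the field containing quotes gets quoted, with its inner quotes doubled, so `ParseCSVLine` reads it back unchanged.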
Discussion Questions
- Our state machine parser uses three states. A more complete CSV parser (handling multiline quoted fields) would need to read across line boundaries. How would you modify the parser to support this? What additional state would you need?
- CSV has no formal standard, though RFC 4180 provides guidelines. What ambiguities exist in CSV (e.g., should leading/trailing spaces be preserved? what encoding to use?) and how would you resolve them in a production parser?
- Compare the state machine approach to the `ParseCSVField` function from Section 10.9 of the chapter. What are the advantages of each? Which is easier to extend with new features?
- Our validation function checks for the correct number of fields per row. What other validation rules would be important for financial transaction data? Think about data types, ranges, and business rules.
- The `CSVEscapeField` function decides whether quoting is needed. Some CSV writers always quote every field. What are the trade-offs?
Extension Activities
Activity 1: CSV Diff Tool. Write a program that reads two CSV files and reports the differences between them — added rows, deleted rows, and modified cells. Use a key column (e.g., date + description) to match rows between the two files.
Activity 2: CSV to HTML Table. Write a program that converts a CSV file into a formatted HTML table, with proper escaping of HTML special characters (<, >, &, ").
Activity 3: Multiline Field Support. Extend the CSV parser to handle quoted fields that span multiple lines. This requires reading the file differently — you can no longer process one line at a time with `ReadLn`. Instead, read character by character and track whether you are inside a quoted field.
Activity 4: Performance Comparison. Create a CSV file with 10,000 rows. Time how long it takes to parse using (a) the state machine parser, (b) the simple `SplitString` approach from Section 10.4, and (c) the `ParseCSVField` approach from Section 10.9. Which is fastest? Which produces the most correct results?