Case Study 2: Parsing Real-World Data — CSV Files

Overview

Comma-Separated Values (CSV) is one of the most common data exchange formats in the world. Banks export transaction histories in CSV. Spreadsheets save to CSV. Government agencies publish open data as CSV. Despite its apparent simplicity — "just split on commas" — CSV parsing is surprisingly tricky once you encounter real-world data.

In this case study, we build a robust CSV parser in Pascal that handles the edge cases that trip up naive implementations: quoted fields, commas embedded within values, escaped quotes, blank lines, and malformed rows. We then use it to process a file of financial transactions, demonstrating the practical payoff of careful string processing.


Part 1: Why CSV Is Harder Than It Looks

Consider these CSV lines:

John Smith,42,New York
"Jane ""JD"" Doe",38,"Portland, Oregon"
Bob Wilson,,"No age given"
,25,
"Multi
line field",99,Test

A naive "split on commas" approach fails on line 2 (the comma in "Portland, Oregon" is data, not a delimiter) and on line 5 (the quoted field spans two physical lines). A correct parser must handle:

  1. Quoted fields: A field enclosed in double quotes can contain commas, which are data, not delimiters.
  2. Escaped quotes: A double quote within a quoted field is represented as two double quotes ("").
  3. Empty fields: Two consecutive commas indicate an empty field.
  4. Multiline fields: A quoted field can contain line breaks (though we will handle this as an extension — single-line parsing first).
  5. Trailing whitespace: Some exporters add spaces around delimiters.
  6. Inconsistent column counts: Malformed rows may have too many or too few fields.
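To see the first failure concretely, here is a minimal naive splitter. (SplitNaive is a throwaway illustration, not part of the parser we build below; it only counts fields, which is enough to show the problem.)

```pascal
{ A naive splitter: treats every comma as a delimiter, quoted or not. }
function SplitNaive(const line: String): Integer;
var
  i, count: Integer;
begin
  count := 1;
  for i := 1 to Length(line) do
    if line[i] = ',' then
      Inc(count);
  Result := count;  { Number of "fields" found }
end;

{ SplitNaive('"Jane ""JD"" Doe",38,"Portland, Oregon"') returns 4,
  but the line has only 3 fields: the comma inside
  "Portland, Oregon" is data, not a delimiter. }
```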

Part 2: The State Machine Approach

The most reliable way to parse CSV is with a finite state machine. At any point during parsing, we are in one of three states:

State         Description
-----------   -------------------------------------------------
FIELD_START   We are at the beginning of a new field
IN_UNQUOTED   We are reading characters inside an unquoted field
IN_QUOTED     We are reading characters inside a quoted field

Transitions between states are triggered by the current character:

  • In FIELD_START: if we see ", transition to IN_QUOTED. If we see ,, emit an empty field and stay in FIELD_START. Otherwise, transition to IN_UNQUOTED.
  • In IN_UNQUOTED: if we see ,, emit the field and transition to FIELD_START. Otherwise, append the character.
  • In IN_QUOTED: if we see ", peek at the next character. If it is also ", append a single " (an escaped quote) and skip the second one. Otherwise the quoted field has ended: emit it, transition to FIELD_START, and skip ahead to the next comma or end of line (strictly, any stray characters between the closing quote and the comma are malformed; our implementation silently discards them).

Implementation

program CSVParserDemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

const
  MAX_FIELDS = 50;
  MAX_ROWS   = 1000;

type
  TParseState = (psFieldStart, psInUnquoted, psInQuoted);

  TCSVRow = record
    Fields   : array[0..MAX_FIELDS - 1] of String;
    FieldCount: Integer;
  end;

  TCSVData = record
    Rows    : array[0..MAX_ROWS - 1] of TCSVRow;
    RowCount: Integer;
  end;

function ParseCSVLine(const line: String; var row: TCSVRow): Boolean;
var
  state: TParseState;
  i: Integer;
  current: String;
  c: Char;
begin
  Result := True;
  row.FieldCount := 0;
  current := '';
  state := psFieldStart;

  i := 1;
  while i <= Length(line) do
  begin
    c := line[i];

    case state of
      psFieldStart:
      begin
        if c = '"' then
        begin
          state := psInQuoted;
          current := '';
        end
        else if c = ',' then
        begin
          { Empty field }
          if row.FieldCount < MAX_FIELDS then
          begin
            row.Fields[row.FieldCount] := '';
            Inc(row.FieldCount);
          end;
          { Stay in psFieldStart }
        end
        else if (c = ' ') or (c = #9) then
        begin
          { Skip leading whitespace before field }
        end
        else
        begin
          state := psInUnquoted;
          current := c;
        end;
      end;

      psInUnquoted:
      begin
        if c = ',' then
        begin
          if row.FieldCount < MAX_FIELDS then
          begin
            row.Fields[row.FieldCount] := TrimRight(current);
            Inc(row.FieldCount);
          end;
          current := '';
          state := psFieldStart;
        end
        else
          current := current + c;
      end;

      psInQuoted:
      begin
        if c = '"' then
        begin
          { Peek at next character }
          if (i < Length(line)) and (line[i + 1] = '"') then
          begin
            { Escaped quote — append single quote }
            current := current + '"';
            Inc(i);  { Skip the second quote }
          end
          else
          begin
            { End of quoted field }
            if row.FieldCount < MAX_FIELDS then
            begin
              row.Fields[row.FieldCount] := current;
              Inc(row.FieldCount);
            end;
            current := '';
            state := psFieldStart;
            { Skip to the comma (or end of line) }
            Inc(i);
            while (i <= Length(line)) and (line[i] <> ',') do
              Inc(i);
            { i now points to comma or is past end; the main loop Inc will advance past comma }
          end;
        end
        else
          current := current + c;
      end;
    end;  { case }

    Inc(i);
  end;  { while }

  { Handle the last field (if line does not end with comma) }
  if (state = psInUnquoted) or (state = psFieldStart) then
  begin
    if (state = psInUnquoted) or (current <> '') then
    begin
      if row.FieldCount < MAX_FIELDS then
      begin
        row.Fields[row.FieldCount] := TrimRight(current);
        Inc(row.FieldCount);
      end;
    end
    else if (Length(line) > 0) and (line[Length(line)] = ',') then
    begin
      { Trailing comma means trailing empty field }
      if row.FieldCount < MAX_FIELDS then
      begin
        row.Fields[row.FieldCount] := '';
        Inc(row.FieldCount);
      end;
    end;
  end
  else if state = psInQuoted then
  begin
    { Unterminated quote — this is malformed }
    WriteLn('Warning: Unterminated quoted field');
    if row.FieldCount < MAX_FIELDS then
    begin
      row.Fields[row.FieldCount] := current;
      Inc(row.FieldCount);
    end;
    Result := False;  { Signal malformed row }
  end;
end;
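Part 3 assumes the file has already been read into a TCSVData record, but the program above never shows that step. A minimal loader might look like the following sketch; ReadCSVFile is our own helper (not a library routine), built on the standard AssignFile/Reset/ReadLn calls. It skips blank lines and keeps whatever ParseCSVLine recovered from a malformed row.

```pascal
function ReadCSVFile(const fileName: String; var data: TCSVData): Boolean;
var
  f: TextFile;
  line: String;
begin
  Result := False;
  data.RowCount := 0;

  if not FileExists(fileName) then
  begin
    WriteLn('Error: file not found: ', fileName);
    Exit;
  end;

  AssignFile(f, fileName);
  Reset(f);
  while (not Eof(f)) and (data.RowCount < MAX_ROWS) do
  begin
    ReadLn(f, line);
    if Trim(line) = '' then
      Continue;  { Skip blank lines }
    if not ParseCSVLine(line, data.Rows[data.RowCount]) then
      WriteLn('Note: row ', data.RowCount + 1, ' was malformed; kept partial parse');
    Inc(data.RowCount);
  end;
  CloseFile(f);
  Result := True;
end;
```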

Part 3: Processing Financial Transactions

Now let us use the parser on a realistic dataset. Suppose we have a file transactions.csv:

Date,Category,Amount,Description,Account
2025-03-01,Income,3200.00,Salary - ACME Corp,Checking
2025-03-02,Rent,-1200.00,March Rent Payment,Checking
2025-03-03,Groceries,-87.42,Whole Foods Market,Credit Card
2025-03-04,Utilities,-145.00,"Electric, Gas & Water",Checking
2025-03-05,Dining,-32.50,"Joe's Diner",Credit Card
2025-03-06,Groceries,-62.18,Trader Joe's,Credit Card
2025-03-07,Transport,-45.00,Monthly Bus Pass,Checking
2025-03-08,Entertainment,-15.99,Netflix Subscription,Credit Card
2025-03-09,Groceries,-94.33,"Fresh Market, Organic Section",Credit Card
2025-03-10,Income,250.00,Freelance Project Payment,Savings
2025-03-11,Medical,-25.00,Dr. Smith Co-pay,Checking
2025-03-12,Dining,-28.75,"Sushi ""Zen"" Restaurant",Credit Card

Note the test cases: the 2025-03-04 row has a comma inside a quoted field, the 2025-03-05 row has an apostrophe, the 2025-03-09 row has another embedded comma, and the 2025-03-12 row has escaped quotes inside a quoted field.
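As a sanity check, here is what ParseCSVLine recovers from the trickiest row; the expected values below are worked out by hand from the state machine rules, so verify them when you run the program.

```pascal
var
  row: TCSVRow;
begin
  ParseCSVLine('2025-03-12,Dining,-28.75,"Sushi ""Zen"" Restaurant",Credit Card', row);
  { row.FieldCount = 5 }
  { row.Fields[3] = 'Sushi "Zen" Restaurant' — each pair of doubled
    quotes collapses to a single quote in the parsed field }
end;
```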

Loading and Analyzing the Data

type
  TTransaction = record
    TransDate  : String;
    Category   : String;
    Amount     : Real;
    Description: String;
    Account    : String;
  end;

procedure LoadTransactions(const data: TCSVData;
                           var trans: array of TTransaction;
                           var count: Integer);
var
  i, code: Integer;
  amountStr: String;
begin
  count := 0;
  { Skip header row (i = 1 instead of 0) }
  for i := 1 to data.RowCount - 1 do
  begin
    { Guard against overrunning the caller's buffer }
    if count > High(trans) then
    begin
      WriteLn('Warning: transaction buffer full; remaining rows skipped.');
      Break;
    end;

    if data.Rows[i].FieldCount < 5 then
    begin
      WriteLn(Format('Warning: Row %d has only %d fields, skipping.',
                     [i + 1, data.Rows[i].FieldCount]));
      Continue;
    end;

    trans[count].TransDate   := data.Rows[i].Fields[0];
    trans[count].Category    := data.Rows[i].Fields[1];

    amountStr := Trim(data.Rows[i].Fields[2]);
    Val(amountStr, trans[count].Amount, code);
    if code <> 0 then
    begin
      WriteLn(Format('Warning: Invalid amount "%s" on row %d, skipping.',
                     [amountStr, i + 1]));
      Continue;
    end;

    trans[count].Description := data.Rows[i].Fields[3];
    trans[count].Account     := data.Rows[i].Fields[4];
    Inc(count);
  end;
end;

Summary Report

procedure PrintSummary(const trans: array of TTransaction;
                       count: Integer);
var
  i: Integer;
  totalIncome, totalExpenses: Real;
  categories: array[0..49] of String;
  catTotals: array[0..49] of Real;
  catCount: Integer;
  found: Boolean;
  j: Integer;
begin
  totalIncome := 0;
  totalExpenses := 0;
  catCount := 0;

  for i := 0 to count - 1 do
  begin
    if trans[i].Amount >= 0 then
      totalIncome := totalIncome + trans[i].Amount
    else
      totalExpenses := totalExpenses + trans[i].Amount;

    { Track category totals }
    found := False;
    for j := 0 to catCount - 1 do
    begin
      if LowerCase(categories[j]) = LowerCase(trans[i].Category) then
      begin
        catTotals[j] := catTotals[j] + trans[i].Amount;
        found := True;
        Break;
      end;
    end;
    if (not found) and (catCount <= High(categories)) then
    begin
      categories[catCount] := trans[i].Category;
      catTotals[catCount] := trans[i].Amount;
      Inc(catCount);
    end;
  end;

  { Print report }
  WriteLn;
  WriteLn('=== Financial Summary ===');
  WriteLn(Format('Total transactions: %d', [count]));
  WriteLn(Format('Total income:      $%10.2f', [totalIncome]));
  WriteLn(Format('Total expenses:    $%10.2f', [totalExpenses]));
  WriteLn(Format('Net:               $%10.2f', [totalIncome + totalExpenses]));
  WriteLn;
  WriteLn('--- By Category ---');
  WriteLn(Format('%-15s %10s', ['Category', 'Total']));
  WriteLn(StringOfChar('-', 27));
  for j := 0 to catCount - 1 do
    WriteLn(Format('%-15s $%9.2f', [categories[j], catTotals[j]]));
end;

Expected output:

=== Financial Summary ===
Total transactions: 12
Total income:      $   3450.00
Total expenses:    $  -1736.17
Net:               $   1713.83

--- By Category ---
Category             Total
---------------------------
Income          $  3450.00
Rent            $ -1200.00
Groceries       $  -243.93
Utilities       $  -145.00
Dining          $   -61.25
Transport       $   -45.00
Entertainment   $   -15.99
Medical         $   -25.00

Part 4: Validation and Error Handling

Real-world CSV files are messy. Here are common problems and how to handle them:

Problem 1: Inconsistent Column Counts

procedure ValidateRow(const row: TCSVRow; expectedFields: Integer;
                      lineNum: Integer; var errors: Integer);
begin
  if row.FieldCount <> expectedFields then
  begin
    WriteLn(Format('ERROR line %d: Expected %d fields, found %d',
                   [lineNum, expectedFields, row.FieldCount]));
    Inc(errors);
  end;
end;

Problem 2: Invalid Date Formats

function IsValidDate(const s: String): Boolean;
var
  year, month, day, code: Integer;
begin
  Result := False;
  if Length(s) <> 10 then Exit;
  if (s[5] <> '-') or (s[8] <> '-') then Exit;

  Val(Copy(s, 1, 4), year, code);
  if code <> 0 then Exit;
  Val(Copy(s, 6, 2), month, code);
  if code <> 0 then Exit;
  Val(Copy(s, 9, 2), day, code);
  if code <> 0 then Exit;

  Result := (year >= 2000) and (year <= 2099) and
            (month >= 1) and (month <= 12) and
            (day >= 1) and (day <= 31);
end;
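A few spot checks of the validator (FPC's WriteLn prints Boolean values as TRUE/FALSE):

```pascal
WriteLn(IsValidDate('2025-03-01'));  { TRUE  — well-formed ISO date }
WriteLn(IsValidDate('03/01/2025'));  { FALSE — wrong separators }
WriteLn(IsValidDate('2025-13-01'));  { FALSE — month out of range }
```

Note that the range check accepts day 31 for every month; a stricter validator would check days-per-month (and leap years).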

Problem 3: Encoding Issues and Special Characters

function SanitizeField(const s: String): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if Ord(s[i]) >= 32 then  { Only printable characters }
      Result := Result + s[i]
    else if s[i] = #9 then
      Result := Result + ' '  { Replace tab with space }
    { else skip control characters }
  end;
  Result := Trim(Result);
end;

Part 5: Writing CSV Output

Parsing is only half the story — sometimes you need to write CSV files too. The key rule: if a field contains a comma, a double quote, or a newline, it must be quoted, and any double quotes within it must be doubled.

function CSVEscapeField(const field: String): String;
var
  needsQuoting: Boolean;
begin
  needsQuoting := (Pos(',', field) > 0) or
                  (Pos('"', field) > 0) or
                  (Pos(#10, field) > 0) or
                  (Pos(#13, field) > 0);

  if needsQuoting then
    Result := '"' + StringReplace(field, '"', '""', [rfReplaceAll]) + '"'
  else
    Result := field;
end;

procedure WriteCSVRow(var f: TextFile; const fields: array of String);
var
  i: Integer;
begin
  for i := Low(fields) to High(fields) do
  begin
    if i > Low(fields) then
      Write(f, ',');
    Write(f, CSVEscapeField(fields[i]));
  end;
  WriteLn(f);
end;
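Putting the two routines together, here is a sketch that writes back one of the tricky transaction rows from the sample data (the output file name out.csv is arbitrary):

```pascal
var
  f: TextFile;
begin
  AssignFile(f, 'out.csv');
  Rewrite(f);
  WriteCSVRow(f, ['2025-03-04', 'Utilities', '-145.00',
                  'Electric, Gas & Water', 'Checking']);
  CloseFile(f);
  { The description contains a comma, so CSVEscapeField quotes it:
    2025-03-04,Utilities,-145.00,"Electric, Gas & Water",Checking }
end;
```

A round trip through ParseCSVLine and WriteCSVRow should reproduce the original row, which makes a good regression test for both routines.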

Discussion Questions

  1. Our state machine parser uses three states. A more complete CSV parser (handling multiline quoted fields) would need to read across line boundaries. How would you modify the parser to support this? What additional state would you need?

  2. CSV has no formal standard, though RFC 4180 provides guidelines. What ambiguities exist in CSV (e.g., should leading/trailing spaces be preserved? what encoding to use?) and how would you resolve them in a production parser?

  3. Compare the state machine approach to the ParseCSVField function from Section 10.9 of the chapter. What are the advantages of each? Which is easier to extend with new features?

  4. Our validation function checks for the correct number of fields per row. What other validation rules would be important for financial transaction data? Think about data types, ranges, and business rules.

  5. The CSVEscapeField function decides whether quoting is needed. Some CSV writers always quote every field. What are the trade-offs?


Extension Activities

Activity 1: CSV Diff Tool. Write a program that reads two CSV files and reports the differences between them — added rows, deleted rows, and modified cells. Use a key column (e.g., date + description) to match rows between the two files.

Activity 2: CSV to HTML Table. Write a program that converts a CSV file into a formatted HTML table, with proper escaping of HTML special characters (<, >, &, ").

Activity 3: Multiline Field Support. Extend the CSV parser to handle quoted fields that span multiple lines. This requires reading the file differently — you can no longer process one line at a time with ReadLn. Instead, read character by character and track whether you are inside a quoted field.

Activity 4: Performance Comparison. Create a CSV file with 10,000 rows. Time how long it takes to parse using (a) the state machine parser, (b) the simple SplitString approach from Section 10.4, and (c) the ParseCSVField approach from Section 10.9. Which is fastest? Which produces the most correct results?