Case Study 2: Text Analysis with Character Sets

The Scenario

You have been asked to build a simple text analysis tool for a writing class. The tool reads a passage of text and produces a detailed breakdown: how many vowels, consonants, digits, punctuation marks, and whitespace characters it contains. It should also identify which specific characters from each category appear and which do not. This is a task that character sets handle with remarkable elegance.

Designing with Character Sets

The key insight is that every character classification question — "Is this character a vowel?", "Is this a digit?", "Is this punctuation?" — is a set membership test. We define our character classes as constants up front, and the rest of the program practically writes itself.

program TextAnalyzer;

{$mode objfpc}{$H+}

const
  Vowels      = ['A','E','I','O','U','a','e','i','o','u'];
  Consonants  = ['B'..'D','F'..'H','J'..'N','P'..'T','V'..'Z',
                 'b'..'d','f'..'h','j'..'n','p'..'t','v'..'z'];
  Letters     = ['A'..'Z', 'a'..'z'];
  Digits      = ['0'..'9'];
  WhiteSpace  = [' ', #9, #10, #13];  { space, tab, LF, CR }
  Punctuation = ['.', ',', ';', ':', '!', '?', '''', '"',
                 '(', ')', '-', '/'];
  Brackets    = ['(', ')', '[', ']', '{', '}'];

Notice how we define Consonants explicitly as all alphabetic characters that are not vowels. We could also compute this: Letters - Vowels would give us the same set. Both approaches are valid; the explicit definition is slightly more readable, while the computed version is easier to maintain if the vowel list changes.
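If we preferred the computed form, it would look like the sketch below. Note that Letters and Vowels must be declared before Consonants, because constant expressions are evaluated in declaration order:

```pascal
const
  Letters    = ['A'..'Z', 'a'..'z'];
  Vowels     = ['A','E','I','O','U','a','e','i','o','u'];
  Consonants = Letters - Vowels;  { set difference, evaluated at compile time }
```

With this version, adding 'y' to Vowels automatically removes it from Consonants — the two constants can never drift out of sync.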

The Analysis Record

We collect all our results into a single record:

type
  TTextStats = record
    TotalChars:    Integer;
    VowelCount:    Integer;
    ConsonantCount: Integer;
    DigitCount:    Integer;
    PunctCount:    Integer;
    SpaceCount:    Integer;
    OtherCount:    Integer;
    UniqueVowels:  set of Char;
    UniqueConsonants: set of Char;
    UniqueDigits:  set of Char;
    UniquePunct:   set of Char;
  end;

The Unique* fields track which specific characters from each category appear in the text. This is a natural use of sets: as we encounter each character, we add it to the appropriate set. At the end, the set contains exactly the distinct characters that appeared.
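A tiny standalone sketch (not part of the analyzer) illustrates this accumulation — Include is idempotent, so a set records presence, never multiplicity:

```pascal
program IncludeDemo;
{$mode objfpc}
var
  Seen: set of Char;
begin
  Seen := [];
  Include(Seen, 'a');
  Include(Seen, 'b');
  Include(Seen, 'a');         { adding 'a' a second time changes nothing }
  WriteLn('a' in Seen);       { TRUE }
  WriteLn(Seen = ['a', 'b']); { TRUE — exactly the distinct characters seen }
end.
```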

The Core Analysis Function

function AnalyzeText(const Text: string): TTextStats;
var
  I: Integer;
  Ch: Char;
begin
  { Initialize all fields to zero/empty }
  Result.TotalChars := Length(Text);
  Result.VowelCount := 0;
  Result.ConsonantCount := 0;
  Result.DigitCount := 0;
  Result.PunctCount := 0;
  Result.SpaceCount := 0;
  Result.OtherCount := 0;
  Result.UniqueVowels := [];
  Result.UniqueConsonants := [];
  Result.UniqueDigits := [];
  Result.UniquePunct := [];

  for I := 1 to Length(Text) do
  begin
    Ch := Text[I];

    if Ch in Vowels then
    begin
      Inc(Result.VowelCount);
      Include(Result.UniqueVowels, Ch);
    end
    else if Ch in Consonants then
    begin
      Inc(Result.ConsonantCount);
      Include(Result.UniqueConsonants, Ch);
    end
    else if Ch in Digits then
    begin
      Inc(Result.DigitCount);
      Include(Result.UniqueDigits, Ch);
    end
    else if Ch in Punctuation then
    begin
      Inc(Result.PunctCount);
      Include(Result.UniquePunct, Ch);
    end
    else if Ch in WhiteSpace then
      Inc(Result.SpaceCount)
    else
      Inc(Result.OtherCount);
  end;
end;

Each character is tested against our predefined sets. The in operator does the classification, and Include accumulates the unique characters. The entire loop body is a clean chain of set membership tests — no nested conditions, no ASCII arithmetic, no character-by-character comparisons.

Displaying the Results

We need a helper to print the contents of a character set. Free Pascal does not allow an anonymous set of Char in a parameter declaration, so we introduce a named set type first:

type
  TCharSet = set of Char;

procedure PrintCharSet(const S: TCharSet; const Label_: string);
var
  Ch: Char;
  First: Boolean;
begin
  Write(Label_, ': {');
  First := True;
  for Ch := #0 to #255 do
    if Ch in S then
    begin
      if not First then Write(', ');
      if Ch >= ' ' then
        Write(Ch)
      else
        Write('#', Ord(Ch));  { Show control characters by number }
      First := False;
    end;
  WriteLn('}');
end;

And the main display procedure:

procedure DisplayStats(const Stats: TTextStats);
var
  LetterCount: Integer;
begin
  LetterCount := Stats.VowelCount + Stats.ConsonantCount;

  WriteLn('=== Text Analysis Report ===');
  WriteLn;
  WriteLn('Total characters:  ', Stats.TotalChars);
  WriteLn('  Letters:         ', LetterCount);
  WriteLn('    Vowels:        ', Stats.VowelCount);
  WriteLn('    Consonants:    ', Stats.ConsonantCount);
  WriteLn('  Digits:          ', Stats.DigitCount);
  WriteLn('  Punctuation:     ', Stats.PunctCount);
  WriteLn('  Whitespace:      ', Stats.SpaceCount);
  WriteLn('  Other:           ', Stats.OtherCount);
  WriteLn;

  if LetterCount > 0 then
  begin
    WriteLn('Vowel ratio: ', (Stats.VowelCount * 100) div LetterCount, '% of letters');
    WriteLn('Consonant ratio: ', (Stats.ConsonantCount * 100) div LetterCount, '% of letters');
  end;
  WriteLn;

  PrintCharSet(Stats.UniqueVowels, 'Unique vowels found');
  PrintCharSet(Stats.UniqueConsonants, 'Unique consonants found');
  PrintCharSet(Stats.UniqueDigits, 'Unique digits found');
  PrintCharSet(Stats.UniquePunct, 'Unique punctuation found');
end;

Finding Missing Characters

One of the most useful features of sets is computing what is not present. We can find which vowels or digits are missing from the text using set difference:

procedure DisplayMissing(const Stats: TTextStats);
var
  MissingVowels: set of Char;
  MissingDigits: set of Char;
begin
  MissingVowels := Vowels - Stats.UniqueVowels;
  MissingDigits := Digits - Stats.UniqueDigits;

  WriteLn;
  if MissingVowels = [] then
    WriteLn('All vowels are represented in the text (pangram-like for vowels).')
  else
    PrintCharSet(MissingVowels, 'Missing vowels');

  if MissingDigits = [] then
    WriteLn('All digits 0-9 appear in the text.')
  else
    PrintCharSet(MissingDigits, 'Missing digits');
end;

The expression Vowels - Stats.UniqueVowels gives us the vowels that are in our reference set but not found in the text. On a 256-element character set this compiles down to a handful of bitwise machine operations, and the result is immediately useful.

Sentence Analysis

We can extend the analyzer to count sentences by tracking sentence-ending punctuation:

const
  SentenceEnders = ['.', '!', '?'];

function CountSentences(const Text: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(Text) do
    if Text[I] in SentenceEnders then
      Inc(Result);
end;

And we can compute readability metrics:

procedure DisplayReadability(const Text: string; const Stats: TTextStats);
var
  WordCount, SentenceCount: Integer;
  AvgWordLength: Real;
  LetterCount: Integer;
begin
  SentenceCount := CountSentences(Text);
  WordCount := CountWords(Text);
  LetterCount := Stats.VowelCount + Stats.ConsonantCount;

  WriteLn;
  WriteLn('=== Readability Metrics ===');
  WriteLn('Words:     ', WordCount);
  WriteLn('Sentences: ', SentenceCount);

  if WordCount > 0 then
  begin
    AvgWordLength := LetterCount / WordCount;
    WriteLn('Avg word length: ', AvgWordLength:0:1, ' letters');
  end;

  if SentenceCount > 0 then
    WriteLn('Avg sentence length: ', WordCount div SentenceCount, ' words');
end;

Where CountWords uses a set to detect word boundaries (in the assembled program, this function must appear before DisplayReadability, since Pascal requires a routine to be declared before it is called):

function CountWords(const Text: string): Integer;
var
  I: Integer;
  InWord: Boolean;
begin
  Result := 0;
  InWord := False;
  for I := 1 to Length(Text) do
  begin
    if Text[I] in Letters + Digits then
    begin
      if not InWord then
      begin
        Inc(Result);
        InWord := True;
      end;
    end
    else
      InWord := False;
  end;
end;

The word-boundary detection uses Letters + Digits — the union of two sets — to define what constitutes a word character. This is essentially the word-character class that regular-expression engines call \w (theirs usually adds the underscore).
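Since Letters + Digits is a constant expression, we could also give the union a name of its own. WordChars below is a suggested constant, not part of the program above:

```pascal
const
  WordChars = Letters + Digits;  { union of the two constant sets }
```

The loop test then reads if Text[I] in WordChars, which states the intent even more directly and keeps the definition of "word character" in one place.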

The Main Program

var
  InputText: string;
  Stats: TTextStats;
begin
  WriteLn('Enter text to analyze (press Enter when done):');
  ReadLn(InputText);

  if Length(InputText) = 0 then
  begin
    WriteLn('No text entered.');
    Halt(1);
  end;

  Stats := AnalyzeText(InputText);
  DisplayStats(Stats);
  DisplayMissing(Stats);
  DisplayReadability(InputText, Stats);
end.

Sample Run

Enter text to analyze (press Enter when done):
The quick brown fox jumps over 13 lazy dogs! Does it, really?

=== Text Analysis Report ===

Total characters:  61
  Letters:         45
    Vowels:        15
    Consonants:    30
  Digits:          2
  Punctuation:     3
  Whitespace:      11
  Other:           0

Vowel ratio: 33% of letters
Consonant ratio: 66% of letters

Unique vowels found: {a, e, i, o, u}
Unique consonants found: {D, T, b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z}
Unique digits found: {1, 3}
Unique punctuation found: {!, ,, ?}

Missing vowels: {A, E, I, O, U}
Missing digits: {0, 2, 4, 5, 6, 7, 8, 9}

=== Readability Metrics ===
Words:     12
Sentences: 2
Avg word length: 3.8 letters
Avg sentence length: 6 words

Analysis

Sets as Classification Tables

The fundamental pattern in this case study is using constant sets as classification tables. Instead of writing functions full of if statements to categorize characters, we declare the categories as sets and let the in operator do the work. This approach is:

  • Declarative — the sets say what each category contains, not how to check membership
  • Efficient — each in test is a single bit operation
  • Maintainable — adding a character to a category means adding it to one constant
  • Composable — Letters + Digits creates a new set from existing ones without any looping

The Power of Set Difference

The "missing characters" feature highlights set difference as an analytical tool. The question "which vowels are NOT in this text?" is answered in one expression: Vowels - Stats.UniqueVowels. Try expressing that as concisely without sets — you would need a loop, a boolean array, or a series of individual checks.
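For comparison, here is roughly what that one-line expression costs without sets — a hypothetical helper (not part of the analyzer above) that does the same job with a boolean array and explicit loops:

```pascal
function MissingVowelsWithoutSets(const Text: string): string;
const
  AllVowels = 'AEIOUaeiou';
var
  Seen: array[Char] of Boolean;
  Ch: Char;
  I: Integer;
begin
  { Mark every character that occurs in the text }
  for Ch := Low(Char) to High(Char) do
    Seen[Ch] := False;
  for I := 1 to Length(Text) do
    Seen[Text[I]] := True;

  { Collect the reference vowels that were never marked }
  Result := '';
  for I := 1 to Length(AllVowels) do
    if not Seen[AllVowels[I]] then
      Result := Result + AllVowels[I];
end;
```

Three loops, an auxiliary array, and a result string — all to replicate what Vowels - Stats.UniqueVowels expresses in a single subtraction.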

Real-World Extensions

This text analyzer could be extended in several directions:

  • Frequency analysis for cryptography exercises (use an array[Char] of Integer alongside the sets)
  • Language detection based on vowel/consonant ratios and character set usage
  • Readability scoring using standard formulas (Flesch-Kincaid, etc.)
  • Pattern detection — finding words that contain all vowels, or sentences with unusual punctuation patterns

All of these extensions benefit from the set-based character classification foundation.

Exercises for This Case Study

  1. Extend the analyzer to track uppercase vs. lowercase letter distribution. Add UpperCount and LowerCount fields to TTextStats. Use the sets ['A'..'Z'] and ['a'..'z'] for classification.

  2. Add a "character frequency" mode that counts how many times each character appears (not just whether it appears). You will need an array[Char] of Integer for this — sets only track presence, not count.

  3. Build a pangram checker: a function IsPangram(const Text: string): Boolean that returns True if the text contains every letter of the alphabet. (Hint: collect all letters found into a set of Char and compare against ['a'..'z'] after converting to lowercase.)

  4. Create a "text fingerprint" that summarizes a passage as the set of character classes it uses. For example, a passage containing letters, digits, and exclamation marks would have the fingerprint [fcLetters, fcDigits, fcPunctuation] where TFingerClass is an enumeration you define.

  5. Extend the word counter to also identify the longest word and shortest word, using sets to determine word boundaries. Track the words as strings but use the character sets for boundary detection.
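For exercise 4, the fingerprint types might be declared along these lines (the names are suggestions only; adjust to taste):

```pascal
type
  TFingerClass = (fcLetters, fcDigits, fcPunctuation, fcWhitespace, fcOther);
  TFingerprint = set of TFingerClass;
```

A function Fingerprint(const Text: string): TFingerprint would then walk the text once, using Include to add the matching class for each character it classifies.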