Case Study 2: Text Analysis with Character Sets
The Scenario
You have been asked to build a simple text analysis tool for a writing class. The tool reads a passage of text and produces a detailed breakdown: how many vowels, consonants, digits, punctuation marks, and whitespace characters it contains. It should also identify which specific characters from each category appear and which do not. This is a task that character sets handle with remarkable elegance.
Designing with Character Sets
The key insight is that every character classification question — "Is this character a vowel?", "Is this a digit?", "Is this punctuation?" — is a set membership test. We define our character classes as constants up front, and the rest of the program practically writes itself.
program TextAnalyzer;
{$mode objfpc}{$H+}
const
  Vowels      = ['A','E','I','O','U','a','e','i','o','u'];
  Consonants  = ['B'..'D','F'..'H','J'..'N','P'..'T','V'..'Z',
                 'b'..'d','f'..'h','j'..'n','p'..'t','v'..'z'];
  Letters     = ['A'..'Z', 'a'..'z'];
  Digits      = ['0'..'9'];
  WhiteSpace  = [' ', #9, #10, #13];  { space, tab, LF, CR }
  Punctuation = ['.', ',', ';', ':', '!', '?', '''', '"',
                 '(', ')', '-', '/'];
  Brackets    = ['(', ')', '[', ']', '{', '}'];
Notice how we define Consonants explicitly as all alphabetic characters that are not vowels. We could also compute this: Letters - Vowels would give us the same set. Both approaches are valid; the explicit definition is slightly more readable, while the computed version is easier to maintain if the vowel list changes.
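The equivalence claimed above can be checked directly. The following is a minimal sketch that redeclares the relevant constants from the listing, adds a computed constant (ComputedConsonants is a name introduced here for illustration, not part of the original program), and compares the two definitions with set equality.

```pascal
program ConsonantCheck;
{$mode objfpc}{$H+}

const
  Vowels     = ['A','E','I','O','U','a','e','i','o','u'];
  Letters    = ['A'..'Z', 'a'..'z'];
  Consonants = ['B'..'D','F'..'H','J'..'N','P'..'T','V'..'Z',
                'b'..'d','f'..'h','j'..'n','p'..'t','v'..'z'];
  { Hypothetical name: the same set derived by set difference }
  ComputedConsonants = Letters - Vowels;

begin
  { Set equality compares all members at once }
  if Consonants = ComputedConsonants then
    WriteLn('Explicit and computed definitions agree.')
  else
    WriteLn('Definitions differ: check the ranges.');
end.
```

A check like this is cheap insurance when the explicit ranges are typed by hand, since an off-by-one range such as 'J'..'M' would otherwise go unnoticed.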
The Analysis Record
We collect all our results into a single record:
type
  TTextStats = record
    TotalChars: Integer;
    VowelCount: Integer;
    ConsonantCount: Integer;
    DigitCount: Integer;
    PunctCount: Integer;
    SpaceCount: Integer;
    OtherCount: Integer;
    UniqueVowels: set of Char;
    UniqueConsonants: set of Char;
    UniqueDigits: set of Char;
    UniquePunct: set of Char;
  end;
The Unique* fields track which specific characters from each category appear in the text. This is a natural use of sets: as we encounter each character, we add it to the appropriate set. At the end, the set contains exactly the distinct characters that appeared.
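To see the accumulation pattern in isolation, here is a small sketch that is independent of TTextStats. The variable names and the sample string are chosen for illustration only. It shows that calling Include on a character that is already a member changes nothing, which is why the set ends up holding each distinct character exactly once.

```pascal
program DistinctDemo;
{$mode objfpc}{$H+}

var
  Seen: set of Char;
  S: string;
  Ch: Char;
  I: Integer;
begin
  S := 'banana';
  Seen := [];
  for I := 1 to Length(S) do
    Include(Seen, S[I]);   { including an existing member is a no-op }

  { Walk the candidate range in character order to list the members }
  for Ch := 'a' to 'z' do
    if Ch in Seen then
      Write(Ch, ' ');
  WriteLn;                 { prints: a b n }
end.
```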
The Core Analysis Function
function AnalyzeText(const Text: string): TTextStats;
var
  I: Integer;
  Ch: Char;
begin
  { Initialize all fields to zero/empty }
  Result.TotalChars := Length(Text);
  Result.VowelCount := 0;
  Result.ConsonantCount := 0;
  Result.DigitCount := 0;
  Result.PunctCount := 0;
  Result.SpaceCount := 0;
  Result.OtherCount := 0;
  Result.UniqueVowels := [];
  Result.UniqueConsonants := [];
  Result.UniqueDigits := [];
  Result.UniquePunct := [];
  for I := 1 to Length(Text) do
  begin
    Ch := Text[I];
    if Ch in Vowels then
    begin
      Inc(Result.VowelCount);
      Include(Result.UniqueVowels, Ch);
    end
    else if Ch in Consonants then
    begin
      Inc(Result.ConsonantCount);
      Include(Result.UniqueConsonants, Ch);
    end
    else if Ch in Digits then
    begin
      Inc(Result.DigitCount);
      Include(Result.UniqueDigits, Ch);
    end
    else if Ch in Punctuation then
    begin
      Inc(Result.PunctCount);
      Include(Result.UniquePunct, Ch);
    end
    else if Ch in WhiteSpace then
      Inc(Result.SpaceCount)
    else
      Inc(Result.OtherCount);
  end;
end;
Each character is tested against our predefined sets. The in operator does the classification, and Include accumulates the unique characters. The entire loop body is a clean chain of set membership tests — no nested conditions, no ASCII arithmetic, no character-by-character comparisons.
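One assumption is worth making explicit: the if/else chain assigns each character to the first category that matches, so the counts are independent of the order of the tests only when the sets do not overlap. The sketch below checks this with set intersection (the * operator). The alias TCharSet and the helper Disjoint are names introduced here for illustration; the constants are copied from the listing above.

```pascal
program DisjointCheck;
{$mode objfpc}{$H+}

type
  TCharSet = set of Char;   { hypothetical alias, not in the original listing }

const
  Vowels      = ['A','E','I','O','U','a','e','i','o','u'];
  Consonants  = ['B'..'D','F'..'H','J'..'N','P'..'T','V'..'Z',
                 'b'..'d','f'..'h','j'..'n','p'..'t','v'..'z'];
  Digits      = ['0'..'9'];
  WhiteSpace  = [' ', #9, #10, #13];
  Punctuation = ['.', ',', ';', ':', '!', '?', '''', '"',
                 '(', ')', '-', '/'];
  Brackets    = ['(', ')', '[', ']', '{', '}'];

{ True when the two sets share no members }
function Disjoint(const A, B: TCharSet): Boolean;
begin
  Result := (A * B) = [];
end;

begin
  WriteLn('Vowels/Consonants disjoint:      ', Disjoint(Vowels, Consonants));
  WriteLn('Digits/Punctuation disjoint:     ', Disjoint(Digits, Punctuation));
  WriteLn('Punctuation/WhiteSpace disjoint: ', Disjoint(Punctuation, WhiteSpace));
  WriteLn('Punctuation/Brackets disjoint:   ', Disjoint(Punctuation, Brackets));
end.
```

The five sets used by AnalyzeText are pairwise disjoint, so the order of its tests does not affect the counts. Brackets, which the function does not use, shares '(' and ')' with Punctuation; if it were ever added to the chain, its position relative to Punctuation would decide where those two characters are counted.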
Displaying the Results
We need a helper to print the contents of a character set:
type
  { A named set type, so the set can be passed as a parameter }
  TCharSet = set of Char;

procedure PrintCharSet(const S: TCharSet; const Label_: string);
var
  Ch: Char;
  First: Boolean;
begin
  Write(Label_, ': {');
  First := True;
  for Ch := #0 to #255 do
    if Ch in S then
    begin
      if not First then Write(', ');
      if Ch >= ' ' then
        Write(Ch)
      else
        Write('#', Ord(Ch));  { Show control characters by number }
      First := False;
    end;
  WriteLn('}');
end;
And the main display procedure:
procedure DisplayStats(const Stats: TTextStats);
var
  LetterCount: Integer;
begin
  LetterCount := Stats.VowelCount + Stats.ConsonantCount;
  WriteLn('=== Text Analysis Report ===');
  WriteLn;
  WriteLn('Total characters: ', Stats.TotalChars);
  WriteLn(' Letters: ', LetterCount);
  WriteLn(' Vowels: ', Stats.VowelCount);
  WriteLn(' Consonants: ', Stats.ConsonantCount);
  WriteLn(' Digits: ', Stats.DigitCount);
  WriteLn(' Punctuation: ', Stats.PunctCount);
  WriteLn(' Whitespace: ', Stats.SpaceCount);
  WriteLn(' Other: ', Stats.OtherCount);
  WriteLn;
  if LetterCount > 0 then
  begin
    WriteLn('Vowel ratio: ', (Stats.VowelCount * 100) div LetterCount, '% of letters');
    WriteLn('Consonant ratio: ', (Stats.ConsonantCount * 100) div LetterCount, '% of letters');
  end;
  WriteLn;
  PrintCharSet(Stats.UniqueVowels, 'Unique vowels found');
  PrintCharSet(Stats.UniqueConsonants, 'Unique consonants found');
  PrintCharSet(Stats.UniqueDigits, 'Unique digits found');
  PrintCharSet(Stats.UniquePunct, 'Unique punctuation found');
end;
Finding Missing Characters
One of the most useful features of sets is computing what is not present. We can find which vowels or digits are missing from the text using set difference:
procedure DisplayMissing(const Stats: TTextStats);
var
  MissingVowels: set of Char;
  MissingDigits: set of Char;
begin
  MissingVowels := Vowels - Stats.UniqueVowels;
  MissingDigits := Digits - Stats.UniqueDigits;
  WriteLn;
  if MissingVowels = [] then
    WriteLn('All vowels are represented in the text (pangram-like for vowels).')
  else
    PrintCharSet(MissingVowels, 'Missing vowels');
  if MissingDigits = [] then
    WriteLn('All digits 0-9 appear in the text.')
  else
    PrintCharSet(MissingDigits, 'Missing digits');
end;
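The expression Vowels - Stats.UniqueVowels yields the vowels that are in our reference set but were not found in the text. Because a set of Char is stored as a 256-bit bitmap, the difference is computed with a handful of bitwise operations rather than a character-by-character search, and the result is immediately useful.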
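As a self-contained illustration of the same idea, the sketch below collects the digits found in a short string and subtracts them from the reference set. The sample string and variable names are made up for the example.

```pascal
program MissingDigitsDemo;
{$mode objfpc}{$H+}

const
  Digits = ['0'..'9'];

var
  Found, Missing: set of Char;
  S: string;
  Ch: Char;
  I: Integer;
begin
  S := 'Room 101, floor 3';
  Found := [];
  for I := 1 to Length(S) do
    if S[I] in Digits then
      Include(Found, S[I]);

  Missing := Digits - Found;   { members of Digits that are not in Found }

  Write('Missing digits:');
  for Ch := '0' to '9' do
    if Ch in Missing then
      Write(' ', Ch);
  WriteLn;                     { prints: Missing digits: 2 4 5 6 7 8 9 }
end.
```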
Sentence Analysis
We can extend the analyzer to count sentences by tracking sentence-ending punctuation:
const
  SentenceEnders = ['.', '!', '?'];

function CountSentences(const Text: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(Text) do
    if Text[I] in SentenceEnders then
      Inc(Result);
end;
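Because CountSentences counts every sentence-ending character, an ellipsis ("...") adds three to the total and "?!" adds two. One possible refinement, sketched here under the hypothetical name CountSentenceRuns, is to count each run of consecutive enders once. It reuses the in-run flag technique and is only a rough heuristic: a decimal point in "3.14" would still be counted.

```pascal
program SentenceRuns;
{$mode objfpc}{$H+}

const
  SentenceEnders = ['.', '!', '?'];

{ Hypothetical variant: counts each run of consecutive enders once }
function CountSentenceRuns(const Text: string): Integer;
var
  I: Integer;
  InRun: Boolean;
begin
  Result := 0;
  InRun := False;
  for I := 1 to Length(Text) do
    if Text[I] in SentenceEnders then
    begin
      if not InRun then
        Inc(Result);       { first ender of a new run }
      InRun := True;
    end
    else
      InRun := False;
end;

begin
  { Runs are '...', '?!' and '.', so this prints 3 }
  WriteLn(CountSentenceRuns('Wait... what?! Fine.'));
end.
```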
And we can compute readability metrics:
procedure DisplayReadability(const Text: string; const Stats: TTextStats);
var
  WordCount, SentenceCount: Integer;
  AvgWordLength: Real;
  LetterCount: Integer;
begin
  SentenceCount := CountSentences(Text);
  WordCount := CountWords(Text);
  LetterCount := Stats.VowelCount + Stats.ConsonantCount;
  WriteLn;
  WriteLn('=== Readability Metrics ===');
  WriteLn('Words: ', WordCount);
  WriteLn('Sentences: ', SentenceCount);
  if WordCount > 0 then
  begin
    AvgWordLength := LetterCount / WordCount;
    WriteLn('Avg word length: ', AvgWordLength:0:1, ' letters');
  end;
  if SentenceCount > 0 then
    WriteLn('Avg sentence length: ', WordCount div SentenceCount, ' words');
end;
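CountWords must appear before DisplayReadability in the source file, because Pascal requires a routine to be declared before it is called. It uses a set to detect word boundaries: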
function CountWords(const Text: string): Integer;
var
  I: Integer;
  InWord: Boolean;
begin
  Result := 0;
  InWord := False;
  for I := 1 to Length(Text) do
  begin
    if Text[I] in Letters + Digits then
    begin
      if not InWord then
      begin
        Inc(Result);
        InWord := True;
      end;
    end
    else
      InWord := False;
  end;
end;
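The word-boundary detection uses Letters + Digits, the union of two sets, to define what constitutes a word character. Any other character ends the current word, so a contraction such as "don't" is counted as two words; widening the set is a one-line change if that matters for your text.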
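To experiment with different definitions of a word character, the set can be passed in as a parameter. The sketch below is one way to do that; CountWordsWith and the alias TCharSet are hypothetical names introduced for this example, and the loop is the same state machine as CountWords.

```pascal
program WordCharsDemo;
{$mode objfpc}{$H+}

type
  TCharSet = set of Char;   { hypothetical alias for passing sets as parameters }

const
  Letters = ['A'..'Z', 'a'..'z'];
  Digits  = ['0'..'9'];

{ Same logic as CountWords, with the word-character set supplied by the caller }
function CountWordsWith(const Text: string; const WordChars: TCharSet): Integer;
var
  I: Integer;
  InWord: Boolean;
begin
  Result := 0;
  InWord := False;
  for I := 1 to Length(Text) do
    if Text[I] in WordChars then
    begin
      if not InWord then
        Inc(Result);       { first character of a new word }
      InWord := True;
    end
    else
      InWord := False;
end;

begin
  { Letters + Digits splits the contraction: I, don, t, know }
  WriteLn(CountWordsWith('I don''t know', Letters + Digits));           { prints 4 }
  { Adding the apostrophe keeps it together: I, don't, know }
  WriteLn(CountWordsWith('I don''t know', Letters + Digits + ['''']));  { prints 3 }
end.
```

Treating the apostrophe as a word character has its own cost: a quoted word such as 'hello' keeps its quotation marks, and a stray apostrophe standing alone counts as a word. Which trade-off is acceptable depends on the text being analyzed.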
The Main Program
var
  InputText: string;
  Stats: TTextStats;
begin
  WriteLn('Enter text to analyze (press Enter when done):');
  ReadLn(InputText);
  if Length(InputText) = 0 then
  begin
    WriteLn('No text entered.');
    Halt(1);
  end;
  Stats := AnalyzeText(InputText);
  DisplayStats(Stats);
  DisplayMissing(Stats);
  DisplayReadability(InputText, Stats);
end.
Sample Run
Enter text to analyze (press Enter when done):
The quick brown fox jumps over 13 lazy dogs! Does it, really?
=== Text Analysis Report ===
Total characters: 61
Letters: 45
Vowels: 15
Consonants: 30
Digits: 2
Punctuation: 3
Whitespace: 11
Other: 0
Vowel ratio: 33% of letters
Consonant ratio: 66% of letters
Unique vowels found: {a, e, i, o, u}
Unique consonants found: {D, T, b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z}
Unique digits found: {1, 3}
Unique punctuation found: {!, ,, ?}
Missing vowels: {A, E, I, O, U}
Missing digits: {0, 2, 4, 5, 6, 7, 8, 9}
=== Readability Metrics ===
Words: 12
Sentences: 2
Avg word length: 3.8 letters
Avg sentence length: 6 words
Analysis
Sets as Classification Tables
The fundamental pattern in this case study is using constant sets as classification tables. Instead of writing functions full of if statements to categorize characters, we declare the categories as sets and let the in operator do the work. This approach is:
- Declarative: the sets say what each category contains, not how to check membership
- Efficient: each in test is a single bit operation
- Maintainable: adding a character to a category means adding it to one constant
- Composable: Letters + Digits creates a new set from existing ones without any looping
The Power of Set Difference
The "missing characters" feature highlights set difference as an analytical tool. The question "which vowels are NOT in this text?" is answered in one expression: Vowels - Stats.UniqueVowels. Try expressing that as concisely without sets — you would need a loop, a boolean array, or a series of individual checks.
Real-World Extensions
This text analyzer could be extended in several directions:
- Frequency analysis for cryptography exercises (use an array[Char] of Integer alongside the sets)
- Language detection based on vowel/consonant ratios and character set usage
- Readability scoring using standard formulas (Flesch-Kincaid, etc.)
- Pattern detection: finding words that contain all vowels, or sentences with unusual punctuation patterns
All of these extensions benefit from the set-based character classification foundation.
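As one illustration of the pattern-detection idea, the sketch below tests whether a word contains every vowel. It is a minimal sketch, not part of the original program: HasAllVowels and LowerVowels are names introduced here, and case is folded with LowerCase from SysUtils so that only five vowels need to be tracked. The final comparison uses the subset operator (<=).

```pascal
program AllVowelsDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;   { for LowerCase }

const
  LowerVowels = ['a', 'e', 'i', 'o', 'u'];

{ Hypothetical helper: True when S contains every vowel at least once }
function HasAllVowels(const S: string): Boolean;
var
  Seen: set of Char;
  Folded: string;
  I: Integer;
begin
  Folded := LowerCase(S);   { fold case so 'A' and 'a' count as the same vowel }
  Seen := [];
  for I := 1 to Length(Folded) do
    if Folded[I] in LowerVowels then
      Include(Seen, Folded[I]);
  Result := LowerVowels <= Seen;   { subset test: every vowel was seen }
end;

begin
  WriteLn(HasAllVowels('Education'));   { TRUE }
  WriteLn(HasAllVowels('Pascal'));      { FALSE }
end.
```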
Exercises for This Case Study
- Extend the analyzer to track uppercase vs. lowercase letter distribution. Add UpperCount and LowerCount fields to TTextStats. Use the sets ['A'..'Z'] and ['a'..'z'] for classification.
- Add a "character frequency" mode that counts how many times each character appears (not just whether it appears). You will need an array[Char] of Integer for this, because sets only track presence, not count.
- Build a pangram checker: a function IsPangram(const Text: string): Boolean that returns True if the text contains every letter of the alphabet. (Hint: collect all letters found into a set of Char and compare against ['a'..'z'] after converting to lowercase.)
- Create a "text fingerprint" that summarizes a passage as the set of character classes it uses. For example, a passage containing letters, digits, and exclamation marks would have the fingerprint [fcLetters, fcDigits, fcPunctuation], where TFingerClass is an enumeration you define.
- Extend the word counter to also identify the longest word and shortest word, using sets to determine word boundaries. Track the words as strings but use the character sets for boundary detection.