Chapter 8: Built-in Functions — Strings, Numbers, Dates, and Type Conversion

DataField.Dev

25 min read

> Where you are: Part II, Chapter 8 of 40. You can select, filter, join, and aggregate. But real data is messy — names need formatting, dates need slicing, types need converting. This chapter is the toolkit you'll reach for in nearly every query.

In This Chapter

The functions you use constantly
String functions
Numeric functions
Date and time functions
Type conversion (casting)
Conditional expressions
Putting it together
Regular expressions: pattern matching beyond LIKE
Generating rows: generate_series and date spines
More string tools, and the format function
A wider gallery of function recipes
A worked scenario: cleaning a messy import
A performance note (preview of Chapter 23)
When built-in functions aren't enough
Why transform data in the database, not the application
A field guide to the gotchas
Finding the function you need
Progressive project: clean and format your data
A practitioner's reference card
Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 8: Built-in Functions — Strings, Numbers, Dates, and Type Conversion

Where you are: Part II, Chapter 8 of 40. You can select, filter, join, and aggregate. But real data is messy — names need formatting, dates need slicing, types need converting. This chapter is the toolkit you'll reach for in nearly every query.

Learning paths: 💻 📊 🔬 🏗️ — everyone. 📊 analysts especially: cleaning and formatting values is half of real analytics work. Treat this chapter as a working reference (Appendix C summarizes it).

The functions you use constantly

You've been computing simple expressions (price * 1.08) since Chapter 5. PostgreSQL ships with hundreds of built-in functions for transforming values: trimming whitespace from a name, extracting the month from a timestamp, rounding a number, converting text to a date. You won't memorize all of them — nobody does — but you'll use a core set daily, and you should know what categories exist so you know what to look up.

This chapter is organized by data type: strings, numbers, dates/times, type conversion, and conditional expressions. Skim it once, then return as a reference. Every example runs against Mercado.

Why this matters. Functions are where SQL meets the reality that data is rarely in the exact shape you need. The competent practitioner doesn't export to a spreadsheet to clean data — they do it in the query, repeatably, at the source.

String functions

SELECT
    first_name || ' ' || last_name           AS full_name,    -- concatenate
    concat(first_name, ' ', last_name)        AS full_name2,   -- concat() ignores NULLs
    length(email)                             AS email_len,
    upper(last_name)                          AS shout,
    lower(email)                              AS normalized,
    initcap('new york city')                  AS titlecased     -- 'New York City'
FROM customers
LIMIT 3;

The everyday string toolkit:

Function	Does	Example → result
`a \|\| b`	concatenate (NULL if any part is NULL)	`'a' \|\| 'b'` → `ab`
`concat(...)`	concatenate, ignoring NULLs	`concat('a', NULL, 'b')` → `ab`
`length(s)`	character count	`length('hi')` → `2`
`upper/lower/initcap`	change case	`initcap('foo bar')` → `Foo Bar`
`trim/ltrim/rtrim`	strip whitespace (or chars)	`trim(' x ')` → `x`
`substring(s from a for b)`	extract part	`substring('abcdef' from 2 for 3)` → `bcd`
`position(sub in s)`	find index (1-based, 0 if absent)	`position('c' in 'abc')` → `3`
`replace(s, from, to)`	replace all occurrences	`replace('a-b-c','-','_')` → `a_b_c`
`split_part(s, delim, n)`	nth piece	`split_part('a.b.c','.',2)` → `b`
`left/right(s, n)`	first/last n chars	`left('abcdef',3)` → `abc`
`lpad/rpad(s, len, fill)`	pad to length	`lpad('7',3,'0')` → `007`

A practical Mercado example — extract the email domain and normalize names:

SELECT initcap(first_name) || ' ' || initcap(last_name) AS name,
       split_part(email, '@', 2)                         AS email_domain
FROM customers
ORDER BY email_domain, name;

Common mistake. Forgetting that || yields NULL if any operand is NULL. first_name || ' ' || middle_name is NULL for everyone without a middle name. Use concat() (which skips NULLs) or COALESCE(middle_name, '') when a part may be missing.

Dialect Difference. length measures characters; octet_length measures bytes (they differ for non-ASCII text). String indexing is 1-based in SQL (not 0-based like most programming languages). initcap is PostgreSQL/Oracle; not all databases have it. See Appendix J.

Numeric functions

SELECT
    round(price, 1)          AS rounded,      -- 1299.0
    ceil(price / 100.0)      AS hundreds_up,
    floor(price / 100.0)     AS hundreds_dn,
    trunc(price)             AS no_cents,
    mod(product_id, 2)       AS even_odd,
    abs(-5)                  AS absolute
FROM products
LIMIT 3;

Function	Does
`round(n, d)`	round to `d` decimals (banker's? no — round-half-away)
`ceil(n)` / `floor(n)`	round up / down to integer
`trunc(n, d)`	truncate toward zero
`abs(n)`	absolute value
`mod(a, b)` / `a % b`	remainder
`power(a, b)` / `a ^ b`	exponent
`sqrt(n)`	square root
`random()`	random float in [0,1)

Common mistake — integer division. In SQL, integer / integer yields an integer, discarding the remainder: 5 / 2 → 2, not 2.5. To get a decimal, make one side numeric: 5.0 / 2 → 2.5, or 5 / 2.0, or cast: 5::numeric / 2. This silently corrupts averages and ratios — e.g., successes / attempts will be 0 whenever successes < attempts. Always ensure at least one operand is non-integer for division you expect to be fractional.

-- BUG: integer division gives 0 for every product cheaper than its category avg... 
SELECT product_id, (1 / 2) AS oops;          -- 0
-- FIX:
SELECT product_id, (1.0 / 2) AS better;      -- 0.5

Date and time functions

Dates and times are where beginners stumble most, because they involve time zones, intervals, and formatting. PostgreSQL's date/time support is excellent once you know the core verbs.

SELECT
    now()                              AS current_timestamp,   -- timestamptz, "right now"
    current_date                       AS today,
    order_date::date                   AS order_day,           -- drop the time part
    extract(year FROM order_date)      AS yr,                  -- 2024
    extract(month FROM order_date)     AS mo,
    date_trunc('month', order_date)    AS month_start,         -- 2024-03-01 00:00:00
    age(now(), order_date)             AS how_long_ago
FROM orders
LIMIT 3;

The core verbs:

Function	Does	Example
`now()` / `current_date`	current timestamp / date	—
`extract(field FROM ts)`	pull a part (`year`, `month`, `day`, `dow`, `hour`)	`extract(month FROM order_date)`
`date_trunc('unit', ts)`	round down to a unit	`date_trunc('month', order_date)`
`age(a, b)`	interval between two timestamps	`age(now(), hired_at)`
`ts + interval '1 day'`	date arithmetic with intervals	`order_date + interval '7 days'`
`to_char(ts, fmt)`	format as text	`to_char(order_date,'YYYY-MM-DD')`

date_trunc is the analyst's best friend for time-series grouping — "orders per month" is just:

SELECT date_trunc('month', order_date) AS month, COUNT(*) AS orders
FROM orders
GROUP BY date_trunc('month', order_date)
ORDER BY month;

Interval arithmetic handles "within the last 30 days," "due in a week," and the like:

SELECT order_id, order_date
FROM orders
WHERE order_date >= now() - interval '90 days';

Common mistake — time zones. timestamp (without time zone) and timestamptz (with time zone) are different types. Mixing them, or assuming everything is in your local zone, causes off-by-hours and off-by-a-day bugs. Mercado uses timestamptz throughout (the right default). When comparing dates, be aware that order_date >= '2024-03-01' includes the boundary; for "all of March," use >= '2024-03-01' AND < '2024-04-01' (a half-open range) rather than BETWEEN ... AND '2024-03-31', which misses the last day's times.

Dialect Difference. Date functions vary wildly across databases: PostgreSQL's extract/date_trunc/age/interval differ from MySQL's YEAR()/DATE_FORMAT(), SQL Server's DATEPART()/DATEADD(), and Oracle's EXTRACT/TRUNC. Date handling is the least portable part of SQL — Appendix J maps the common ones.

Type conversion (casting)

Sometimes you need to change a value's type — text to number, timestamp to date, number to text. PostgreSQL offers CAST(value AS type) and the shorthand value::type:

SELECT
    '42'::integer            AS text_to_int,
    42::text                 AS int_to_text,
    price::integer           AS dollars_only,     -- truncates the cents
    order_date::date         AS just_the_date,
    '2024-06-01'::date       AS literal_date,
    CAST('3.14' AS numeric)  AS explicit_cast
FROM products, orders
LIMIT 1;

::type is PostgreSQL shorthand; CAST(x AS type) is standard SQL. They're equivalent — use CAST for portability, :: for brevity.
For formatted conversions (text ↔ number/date with a specific pattern), use to_char, to_number, to_date: sql SELECT to_char(1234567.89, 'FM999,999,990.00'); -- '1,234,567.89' SELECT to_date('06/01/2024', 'MM/DD/YYYY'); -- 2024-06-01

Common mistake. Casting text that isn't valid for the target type errors out: '12x'::integer fails. When converting user-supplied or messy text, validate or clean it first, or you'll get runtime errors mid-query.

Conditional expressions

These compute different values based on conditions — the "if/then" of SQL.

CASE — the SQL conditional

SELECT name, price,
       CASE
           WHEN price >= 1000 THEN 'premium'
           WHEN price >= 200  THEN 'mid-range'
           ELSE 'budget'
       END AS price_band
FROM products
ORDER BY price DESC;

CASE evaluates conditions top to bottom and returns the first match's result (or the ELSE, or NULL if no ELSE). It works anywhere an expression is allowed — SELECT, WHERE, ORDER BY, even inside aggregates (the conditional-aggregate trick):

-- Count delivered vs cancelled in one pass:
SELECT
    COUNT(*) FILTER (WHERE status = 'delivered') AS delivered,   -- modern way
    SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled  -- classic way
FROM orders;

COALESCE — first non-NULL

COALESCE(a, b, c, ...) returns the first argument that isn't NULL — perfect for defaults:

SELECT first_name, COALESCE(phone, 'no phone on file') AS contact FROM customers;
SELECT COALESCE(SUM(amount), 0) FROM payments WHERE order_id = 999;  -- 0, not NULL

You've already used COALESCE to turn missing aggregates into zero (Chapter 7) — it's one of the most useful functions in SQL.

NULLIF, GREATEST, LEAST

SELECT NULLIF(quantity, 0);          -- returns NULL if quantity is 0 (avoid div-by-zero)
SELECT a / NULLIF(b, 0);             -- safe division: NULL instead of error when b=0
SELECT GREATEST(1, 5, 3), LEAST(1, 5, 3);   -- 5, 1  (per-row max/min across columns)

NULLIF(x, 0) is the standard guard against division-by-zero: total / NULLIF(count, 0) yields NULL (not an error) when count is 0.

Putting it together

A realistic query combining several function families — a customer-facing order summary:

SELECT
    o.order_id,
    initcap(c.first_name) || ' ' || initcap(c.last_name)  AS customer,
    to_char(o.order_date, 'Mon DD, YYYY')                 AS placed_on,
    CASE o.status
        WHEN 'delivered' THEN '✓ Delivered'
        WHEN 'cancelled' THEN '✗ Cancelled'
        ELSE initcap(o.status)
    END                                                    AS status_label,
    COALESCE(to_char(SUM(oi.quantity * oi.unit_price), 'FM999,990.00'), '0.00') AS total
FROM orders o
JOIN customers c    ON c.customer_id = o.customer_id
LEFT JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.order_id, c.first_name, c.last_name, o.order_date, o.status
ORDER BY o.order_date DESC;

This single query cleans names, formats a date, labels a status, and formats a money total — all at the database, repeatably. That's the power of functions.

Regular expressions: pattern matching beyond LIKE

LIKE handles simple patterns (%, _), but real text work — validating an email shape, pulling the digits out of a SKU, splitting on a flexible delimiter — needs regular expressions, and PostgreSQL's support is first-class. The core operators are ~ (matches), ~* (matches, case-insensitive), and their negations !~ and !~*:

-- Rows whose SKU looks like three letters, a dash, then three digits:
SELECT sku, name FROM products WHERE sku ~ '^[A-Z]{3}-[0-9]{3}$';

-- Customers whose email is NOT a plausible address shape (a data-quality check):
SELECT email FROM customers WHERE email !~ '^[^@]+@[^@]+\.[^@]+$';

Beyond matching, a family of functions extracts and rewrites:

Function	Does	Example → result
`regexp_replace(s, pat, repl, flags)`	substitute matches	`regexp_replace('a1b2','[0-9]','#','g')` → `a#b#`
`regexp_match(s, pat)`	first match (with capture groups)	`regexp_match('SKU-042','([0-9]+)')` → `{042}`
`regexp_matches(s, pat, 'g')`	all matches, as rows	iterate every number in a string
`substring(s from pat)`	extract via regex	`substring('order #42' from '[0-9]+')` → `42`

-- Pull the numeric part out of every SKU:
SELECT sku, substring(sku from '[0-9]+') AS sku_number FROM products;

-- Normalize messy phone numbers to digits only:
SELECT phone, regexp_replace(phone, '[^0-9]', '', 'g') AS digits FROM customers;

The 'g' flag means "global" (replace/return all matches, not just the first), and regexp_replace with it is the standard way to strip or normalize characters. Regular expressions have a learning curve, but even a handful — anchors ^/$, character classes [...], quantifiers +/*/{n} — covers most data-cleaning needs. They're the difference between massaging text in the database and exporting it to a script.

Common mistake. Using a leading-wildcard regex or LIKE '%term%' on a large table and expecting it to be fast. Like function-wrapped columns, an unanchored pattern generally can't use a regular index (Chapter 23). For heavy text search — ranking, stemming, large volumes — the right tool is full-text search (Chapter 16), not ever-cleverer regexes.

Generating rows: `generate_series` and date spines

One PostgreSQL function deserves special mention because it solves a problem that stumps people: producing rows that don't exist in your data. generate_series manufactures a sequence of numbers or timestamps, which is invaluable for building a complete "spine" of time periods so your report shows zeros for empty periods instead of simply omitting them.

-- Every day in March 2024, whether or not it had orders:
SELECT day::date
FROM generate_series(DATE '2024-03-01', DATE '2024-03-31', INTERVAL '1 day') AS day;

-- Orders per day with NO gaps — empty days show 0, not missing rows:
SELECT d.day::date AS day, COUNT(o.order_id) AS orders
FROM generate_series(DATE '2024-03-01', DATE '2024-03-31', INTERVAL '1 day') AS d(day)
LEFT JOIN orders o ON o.order_date >= d.day AND o.order_date < d.day + INTERVAL '1 day'
GROUP BY d.day
ORDER BY d.day;

That second query is a genuinely important pattern. A plain GROUP BY date_trunc('day', order_date) skips days with no orders entirely — so a chart drawn from it would silently hide the gaps, distorting the story. By generating the full date spine first and LEFT JOINing the data onto it, every day appears, and quiet days correctly read zero. Analysts and dashboard-builders use this constantly. generate_series also produces numeric sequences (generate_series(1, 100)), useful for test data and for "bucket" boundaries.

More string tools, and the `format` function

A few more string functions round out the toolkit, including one — format — that is the clean, injection-resistant way to build strings with placeholders:

SELECT format('Order #%s for %s: $%s',
              o.order_id, c.last_name, o.total_amount) AS summary
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
LIMIT 3;

Function	Does	Example → result
`format(fmt, ...)`	printf-style templating (`%s`, `%I`, `%L`)	see above
`translate(s, from, to)`	character-by-character map	`translate('hello','el','ip')` → `hippo`
`repeat(s, n)`	repeat a string	`repeat('ab', 3)` → `ababab`
`reverse(s)`	reverse characters	`reverse('abc')` → `cba`
`string_to_array(s, delim)`	split into an array	`string_to_array('a,b,c', ',')` → `{a,b,c}`
`starts_with(s, prefix)`	boolean prefix test	`starts_with('UltraBook','Ultra')` → `t`

format's placeholders are worth knowing: %s inserts a value as text, %I safely quotes an identifier (a table/column name), and %L safely quotes a literal (escaping quotes for you). That last pair matters when you build SQL dynamically (Chapter 15's functions, Chapter 22's migrations) — format(..., %I, %L) is the safe way, versus error-prone manual quoting. It's the same defensive instinct that, scaled up, becomes the parameterized-query security lesson of Chapter 32.

A wider gallery of function recipes

Realistic one-liners you'll adapt constantly:

-- Mask all but the last 4 characters of a sensitive identifier:
SELECT 'XXXX-XXXX-' || right(card_number, 4) AS masked FROM payments;

-- A URL-friendly slug from a product name (lowercase, spaces → dashes, strip punctuation):
SELECT name,
       regexp_replace(lower(trim(name)), '[^a-z0-9]+', '-', 'g') AS slug
FROM products;

-- Bucket order ages into human labels:
SELECT order_id,
       CASE
           WHEN order_date > now() - interval '7 days'  THEN 'this week'
           WHEN order_date > now() - interval '30 days' THEN 'this month'
           ELSE 'older'
       END AS recency
FROM orders;

-- Initials from a name:
SELECT left(first_name, 1) || left(last_name, 1) AS initials FROM customers;

-- Parse the local-part and domain of an email separately:
SELECT split_part(email, '@', 1) AS local_part,
       split_part(email, '@', 2) AS domain
FROM customers;

None of these is exotic; each is the kind of small transformation that comes up weekly. The meta-lesson is the one from the chapter opening: data is rarely in exactly the shape you need, and the competent practitioner reshapes it in the query — repeatably, at the source — rather than exporting it somewhere to fix by hand. The more of this toolkit lives in your fingers, the less you'll reach for a spreadsheet or a throwaway script.

Try this. Build a single query that, for each customer, shows their name in Last, First format, their email domain, and a label for whether they joined in the last year. You'll combine ||, initcap, split_part, a date comparison, and a CASE — five tools from this chapter in one practical result. That kind of small composition is exactly what daily SQL work looks like.

A worked scenario: cleaning a messy import

Functions earn their keep most visibly when you face the universal reality of data work: data arrives dirty. Imagine Mercado just acquired a competitor and received a spreadsheet of their customers, exported to a staging table with all the usual real-world mess — inconsistent capitalization, stray whitespace, phone numbers in a dozen formats, names jammed into a single field, and dates as text in a regional format. Your job is to clean it into the shape Mercado's schema expects. This is exactly the kind of task people reflexively export to a script; watch how far the function toolkit gets you without ever leaving SQL.

Start with the names, which arrived as a single full_name field like " smith, JANE ". You need them trimmed, properly cased, and split into first and last. trim removes the surrounding whitespace; split_part separates on the comma; initcap fixes the capitalization. Composing them — inside-out, just like a subquery — gives clean columns:

SELECT
    initcap(trim(split_part(full_name, ',', 2))) AS first_name,   -- "Jane"
    initcap(trim(split_part(full_name, ',', 1))) AS last_name     -- "Smith"
FROM staging_customers;

The phone numbers came in as "(555) 123-4567", "555.123.4567", "5551234567", and worse. To normalize them to pure digits — the form you'd store and compare — strip every non-digit with a global regexp_replace. The emails arrived with inconsistent case, which matters because email comparison should be case-insensitive; lower normalizes them. And the join dates came as text like "06/15/2024", which to_date parses into a real date using an explicit format so there's no ambiguity about month-versus-day order:

SELECT
    lower(trim(email))                             AS email,
    regexp_replace(phone, '[^0-9]', '', 'g')       AS phone_digits,
    to_date(joined_on, 'MM/DD/YYYY')               AS joined_date,
    CASE
        WHEN total_spent::numeric >= 1000 THEN 'gold'
        WHEN total_spent::numeric >= 200  THEN 'silver'
        ELSE 'standard'
    END                                            AS loyalty_tier
FROM staging_customers
WHERE email ~ '^[^@]+@[^@]+\.[^@]+$';              -- skip rows with junk emails

Notice the final touches: a CASE assigns a loyalty tier from spend (a business rule applied during the cleanse), a ::numeric cast turns the text-typed total_spent into a number the comparison can use, and a regex in the WHERE quarantines rows whose email is so malformed it isn't worth importing. In a dozen lines, raw spreadsheet sludge becomes rows that conform to Mercado's expectations — title-cased names, digit-only phones, lowercased emails, real dates, assigned tiers, and obvious garbage filtered out. The whole transformation is set-oriented, runs over the entire import in one pass, and is repeatable: re-run it whenever a new batch arrives, and it cleans the new batch identically.

This scenario is a microcosm of a huge slice of real database work — the unglamorous, essential labor of turning messy reality into clean, queryable data. Chapter 31 will formalize this into proper ETL pipelines with staging tables and validation, but the engine underneath is exactly the functions of this chapter. The lesson to carry forward is confidence: when dirty data lands on your desk, your first thought should be "which functions reshape this?" not "let me write a script to loop through it." Most cleaning is a composition of trim, lower, initcap, split_part, regexp_replace, to_date, and CASE — and you now know all of them.

Why this matters. Data cleaning is where a large fraction of every data professional's time actually goes, and doing it in SQL — at the source, repeatably, over whole batches at once — is dramatically more effective than the row-by-row scripting many people default to. The functions in this chapter are not academic trivia; they are the daily tools of turning the world's messy data into something a database can trust. Mastering them is mastering a real and constant part of the job.

A performance note (preview of Chapter 23)

One important warning: applying a function to a column in a WHERE clause can prevent the database from using an index on that column. WHERE lower(email) = 'alice@example.com' can't use a plain index on email, because the index stores email, not lower(email). The fixes — an expression index on lower(email), or storing data pre-normalized — are covered in Chapter 23. For now, just register the idea: functions on columns in WHERE have a performance cost, and there's a tool for it.

When built-in functions aren't enough

The functions in this chapter cover the overwhelming majority of everyday value-shaping, but it's worth knowing where their territory ends, so you recognize when a problem has outgrown them and what comes next. Three boundaries matter.

The first is text search at scale. LIKE, ILIKE, and even regular expressions are fine for small tables and simple patterns, but they don't rank results by relevance, don't understand word stems ("running" matching "run"), don't ignore noise words, and can't use an ordinary index when the pattern has a leading wildcard. The moment you're building anything resembling a search box over a meaningful volume of text — product descriptions, articles, support tickets — you've left the function toolkit and entered full-text search (Chapter 16), with its specialized tsvector/tsquery types and GIN indexes. Recognizing "this is really a search feature, not a pattern match" is the signal to make the jump.

The second is structured and semi-structured data. When a value isn't a simple scalar but a nested document — a product's variable attributes, an event payload, an API response — string functions are the wrong tool entirely. PostgreSQL's jsonb type and its operators (Chapter 16) let you store, index, and query nested structures natively, so you extract a field with ->> rather than parsing text. Arrays similarly have their own operators and the unnest function for turning array elements into rows. Trying to pick apart JSON with split_part and substring is a sign you should be using the JSON functions instead — they're faster, safer, and far clearer.

The third is genuinely procedural logic — multi-step algorithms, loops, branching that goes beyond what CASE can express, or transformations that need to call out to other systems. SQL's built-in functions are expression-level: they compute one value from others. When you need procedural control flow, the right tools are user-defined functions in PL/pgSQL (Chapter 15), which let you write procedures inside the database, or — when the logic is truly application-specific — your application code (Part V). The art is knowing the line: most "I need a script for this" instincts are actually a composition of built-in functions away from done, but some genuinely need a procedure, and forcing procedural logic into a tangle of nested CASE expressions serves no one.

Knowing these three boundaries is itself part of mastering functions, because it keeps you from two opposite errors: reaching for heavy machinery (a full-text engine, a JSON pipeline, a procedural function) when a one-line built-in would do, and stubbornly forcing built-ins to do a job they're wrong for (parsing JSON with string functions, building search with LIKE). The toolkit is broad and you should exhaust it first — but recognizing its edges is what lets you reach for the right next tool at the right time, which is a recurring theme of the whole book.

Why transform data in the database, not the application

It's worth pausing on a question that quietly shapes whole architectures: where should data transformation happen — in the database via functions like these, or in the application code that consumes the results? The instinct of many developers, especially those who think in a programming language first and SQL second, is to pull raw rows out of the database and reshape them in Python or JavaScript. Often that's exactly the wrong call, and understanding why makes you a better engineer.

Consider a concrete scenario. A report needs each customer's name in Title Case, their email domain, and a label for how recently they joined. The application-centric approach fetches first_name, last_name, email, and created_at for every customer, then loops in application code to title-case the names, split the emails, and compute the recency labels. The database-centric approach does all of it in the SELECT — initcap, split_part, a CASE over a date interval — and returns rows that are already report-ready. Both produce the same output, but they are not equivalent.

The database approach wins on several axes at once. It moves less data over the network: you transfer the finished columns, not raw inputs plus the overhead of fetching fields you'll only use to compute something else. It's reusable: the transformation lives in one query (or one view, Chapter 15) that every consumer shares, rather than being re-implemented — slightly differently, with slightly different bugs — in the web app, the mobile backend, the nightly export, and the analyst's notebook. It's faster for set-oriented work: the database is a highly optimized engine for transforming columns across millions of rows, far outpacing a row-by-row loop in a general-purpose language. And it keeps the logic close to the data it describes, which is where it's easiest to keep correct as the schema evolves.

There are real limits, of course. Heavy procedural logic, calls to external services, and anything genuinely application-specific belong in the application, not crammed into SQL. The point is not "do everything in the database" but rather "do data-shaping in the database by default, and move to application code only when there's a reason." The reason developers get this backwards is usually unfamiliarity — they reach for the loop because they know loops better than they know initcap and split_part. This chapter exists partly to remove that excuse: once the function toolkit is in your fingers, the database is very often the cleanest place to reshape data, and reaching for it becomes second nature.

Why this matters. Theme #4 — PostgreSQL's power often eliminates other tools — applies even at this small scale. A surprising amount of "ETL" and "data cleaning" that teams build elaborate application pipelines for is, in reality, a handful of SQL functions in a well-written query. Knowing the toolkit means you can tell the difference between work that genuinely needs application code and work that's a one-line regexp_replace away from done.

A field guide to the gotchas

Built-in functions are mostly intuitive, but a handful of behaviors trip up nearly everyone, and they cluster into a few recurring themes. Gathering them in one place makes a useful pre-flight checklist whenever a function-heavy query misbehaves.

The NULL family. Anything that touches a NULL tends to become NULL. String concatenation with || yields NULL if any operand is NULL (use concat, which skips them, or COALESCE to supply a fallback). Arithmetic with NULL yields NULL. A CASE with no matching branch and no ELSE yields NULL. None of this is a bug — it's the faithful propagation of "unknown" — but it means a single unexpected NULL can hollow out a computed column for some rows. When a derived value is mysteriously empty for part of your data, suspect a NULL input first.

The numeric family. Integer division truncates: 5 / 2 is 2, and count_a / count_b is 0 whenever the numerator is smaller — the source of countless all-zero percentage columns. Force one operand non-integer (5.0 / 2, count_a::numeric / count_b) whenever you expect a fraction. Division by zero raises an error rather than returning NULL, so guard denominators with NULLIF(denom, 0). Rounding has subtleties too — round on numeric rounds half away from zero, and money should always live in numeric, never float, to avoid the binary-fraction drift that makes 0.1 + 0.2 not quite 0.3.

The date/time family. This is the least portable and most error-prone corner of SQL. The two timestamp types — with and without time zone — are genuinely different, and mixing them, or assuming a stored time is in your local zone, produces off-by-hours and off-by-a-day bugs that are maddening to track down. Filtering "all of a month" with BETWEEN and a literal last day silently drops the final day's later hours; the half-open range (>= first AND < next-first) is the bug-free idiom. And wrapping a date column in a function inside WHERE can disable an index, turning a fast lookup into a full scan. When dates are involved, slow down and be explicit.

The indexing family. A theme that spans the whole chapter: applying a function to a column in WHERE — lower(email), extract(year from order_date), an unanchored regex — can prevent the use of an ordinary index on that column, because the index stores the raw value, not the transformed one. The fixes (expression indexes, range rewrites) come in Chapter 23, but the awareness belongs here: functions on columns in WHERE have a performance cost, even when they're logically correct. Functions in the SELECT list are free in this sense; functions in WHERE deserve a second thought on large tables.

Run a misbehaving function-heavy query past these four families — NULL, numeric, date/time, indexing — and you'll diagnose the overwhelming majority of problems in seconds. They are not random quirks; each follows from a principle (unknown propagates, integers are integers, time zones are real, indexes store raw values), and knowing the principle is what turns a confusing result into an obvious one.

Finding the function you need

No one memorizes all of PostgreSQL's hundreds of functions, and you shouldn't try. The professional skill is not recall but knowing what category to look in and how to look it up quickly. When you need to do something to a value and don't know the function, the move is to identify the value's type — text, number, date, JSON — and browse that section of the documentation (or Appendix C, which condenses the everyday set). The categories are stable even when the individual function names aren't: there is always a way to extract part of a string, to round a number, to pull a field from a date, to format a value as text. You're looking for which spelling PostgreSQL uses, not whether the capability exists.

psql itself helps. The \df meta-command lists functions, and \df *trunc* finds every function with "trunc" in its name — handy when you remember roughly what it's called. The documentation's function tables are organized exactly by the categories of this chapter, so the mental index you've built here — "this is a string problem," "this is a date problem" — maps directly onto where to look. Over time, the functions you use weekly migrate into muscle memory, and the long tail stays a quick lookup away. That division — a small memorized core plus a well-understood map of the rest — is how every experienced practitioner actually works. Treat this chapter as building the map, not as a memorization exercise.

Try this. Next time you reach for application code to reshape a value, pause and ask: "Is there a SQL function for this?" Search the docs by the value's type. More often than you'd expect, the answer is yes, and the database version is shorter and faster. Building that reflex — check for a built-in first — is worth more than memorizing any particular function.

Progressive project: clean and format your data

In your domain:

Use string functions to format a display name or normalize an identifier (e.g., uppercase a book's ISBN, title-case a member name).
Use a date function to group something by month or to find records "in the last N days."
Use CASE to bucket a numeric value into categories (price bands, age groups, fine tiers).
Use COALESCE to supply a default for a nullable column in a report, and NULLIF to guard a division.

Add these to project-notes.md.

A practitioner's reference card

Because this chapter is as much a reference as a lesson, here is a condensed card of the functions worth keeping at your fingertips, organized by the four families and annotated with the gotcha that most often bites in each. Treat it as the index you return to, not something to memorize whole.

Strings. || and concat (the former propagates NULL, the latter skips it — a real difference); length, upper/lower/initcap, trim; substring, left/right, split_part (parsing); replace, translate, regexp_replace with the 'g' flag (cleaning); format with %s/%I/%L (safe templating). Remember strings are 1-indexed, and case-insensitive comparison is a lower() or ILIKE away. The gotcha: || with any NULL operand yields NULL.

Numbers. round, ceil/floor, trunc, abs, mod/%, power/^, sqrt. The gotcha that dwarfs all others: integer division truncates, so force a numeric operand (a::numeric / b) for any ratio you expect to be fractional, and guard denominators with NULLIF(b, 0) against division-by-zero errors. Money lives in numeric, never float.

Dates and times. now()/current_date; extract (pull a field), date_trunc (round to a period — the time-series workhorse), age (interval between), to_char (format); interval arithmetic (+ interval '7 days'); generate_series (build a date spine so empty periods show as zero). The gotchas: timestamp vs timestamptz are different types; use half-open ranges (>= start AND < next) for "all of a period"; wrapping a date column in a function inside WHERE can disable its index.

Conversion and conditionals. CAST(x AS type) or x::type; to_char/to_date/to_number for formatted conversions; CASE (the general conditional), COALESCE (first non-NULL, for defaults), NULLIF (safe division and "treat this value as missing"), GREATEST/LEAST (per-row max/min across columns). The gotcha: casting invalid text ('12x'::integer) errors at runtime, so validate messy input first.

Keep this card nearby for a few weeks and the everyday subset migrates into muscle memory; the long tail stays a \df or a documentation lookup away. That blend — a memorized core, a known map of the rest — is exactly how working practitioners operate. The goal was never to hold hundreds of functions in your head, but to know what's possible, what category to reach into, and which gotcha to watch for in each. With that, the database becomes the natural and powerful place to reshape data that this chapter argued it should be.

One last habit to take with you: when you discover a function you didn't know — and you will, for years — add it to your own running notes with a one-line example. The official documentation is exhaustive but vast; a personal cheat sheet of the dozen functions you actually reach for, in your domain, is worth more than any reference because it's tuned to your work. The card above is a starting point; the version you build from real use is the one that will make you fast.

Summary

PostgreSQL's built-in functions transform values in your queries. String functions (||/concat, length, upper/lower/initcap, trim, substring, split_part, replace, lpad) format and parse text — mind that || propagates NULL and SQL strings are 1-indexed. Numeric functions (round, ceil/floor, trunc, abs, mod) compute — watch integer division (5/2 = 2). Date/time functions (now, extract, date_trunc, age, interval arithmetic, to_char) slice and format temporal data — mind time zones and use half-open ranges for "all of a month." Type conversion uses CAST/::, with to_char/to_date/to_number for formatted conversions. Conditional expressions — CASE, COALESCE (first non-NULL), NULLIF (safe division), GREATEST/LEAST — bring if/then logic into SQL. And remember: a function on a column in WHERE can defeat an index (Chapter 23).

You can now: - Format and parse text with the core string functions (and handle NULL in concatenation). - Compute with numeric functions and avoid the integer-division trap. - Slice, group, and format dates/times; reason about time zones and date ranges. - Convert types with CAST/:: and formatted to_char/to_date. - Use CASE, COALESCE, NULLIF, and FILTER for conditional logic and safe division.

What's next. Chapter 9 — Subqueries — queries nested inside queries. You'll filter by the result of another query, compute per-row correlated values, and use EXISTS/IN — the first big payoff of Chapter 4's closure property (a query's result is itself a relation you can query).

Practice in exercises.md, test yourself with the quiz, apply it in the case studies, review the key takeaways, and go deeper with further reading.

In This Chapter

Chapter 8: Built-in Functions — Strings, Numbers, Dates, and Type Conversion

The functions you use constantly

String functions

Numeric functions

Date and time functions

Type conversion (casting)

Conditional expressions

CASE — the SQL conditional

COALESCE — first non-NULL

NULLIF, GREATEST, LEAST

Putting it together

Regular expressions: pattern matching beyond LIKE

Generating rows: generate_series and date spines

More string tools, and the format function

A wider gallery of function recipes

A worked scenario: cleaning a messy import

A performance note (preview of Chapter 23)

When built-in functions aren't enough

Why transform data in the database, not the application

A field guide to the gotchas

Finding the function you need

Progressive project: clean and format your data

A practitioner's reference card

Summary

Generating rows: `generate_series` and date spines

More string tools, and the `format` function