Case Study 8.1: Accessing a Struct in Assembly

The Scenario

A data processing pipeline stores sensor readings in a packed struct. A performance audit reveals that the struct-access function is on the critical path, running 100 million times per second. The team decides to write it in assembly and compare the output to what GCC -O2 produces.

The Struct

typedef struct {
    uint32_t sensor_id;    // offset  0, 4 bytes
    uint32_t flags;        // offset  4, 4 bytes
    int64_t  timestamp;    // offset  8, 8 bytes
    double   reading;      // offset 16, 8 bytes
    uint64_t checksum;     // offset 24, 8 bytes
} SensorReading;           // sizeof = 32 bytes

Note: this struct has no padding. Each field is naturally aligned (each field at an offset that is a multiple of its size), which makes for clean assembly.
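These layout claims can be verified at compile time. The following sketch (repeating the struct definition so the snippet stands alone, and assuming C11 _Static_assert) fails to compile if any offset or the total size differs from the table above:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t sensor_id;    /* offset  0 */
    uint32_t flags;        /* offset  4 */
    int64_t  timestamp;    /* offset  8 */
    double   reading;      /* offset 16 */
    uint64_t checksum;     /* offset 24 */
} SensorReading;

/* Each assertion checks one row of the layout table. */
_Static_assert(offsetof(SensorReading, sensor_id) == 0,  "sensor_id at 0");
_Static_assert(offsetof(SensorReading, flags)     == 4,  "flags at 4");
_Static_assert(offsetof(SensorReading, timestamp) == 8,  "timestamp at 8");
_Static_assert(offsetof(SensorReading, reading)   == 16, "reading at 16");
_Static_assert(offsetof(SensorReading, checksum)  == 24, "checksum at 24");
_Static_assert(sizeof(SensorReading) == 32, "no padding expected");
```

Putting such assertions next to the struct definition catches layout drift (a reordered field, a changed type) before it silently invalidates hand-written assembly.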

The C Function

// Returns true if reading is valid (flags bit 0 set, checksum matches simple sum)
bool validate_reading(const SensorReading *r) {
    if (!(r->flags & 0x1)) return false;
    uint64_t expected = (uint64_t)r->sensor_id
                      + (uint64_t)r->timestamp
                      + (uint64_t)(int64_t)r->reading;  // truncate toward zero (matches cvttsd2si)
    return (r->checksum == expected);
}

GCC -O2 Output (Annotated)

; GCC 13.2, -O2, x86-64 Linux
; RDI = r (pointer to SensorReading)
validate_reading:
    mov    eax, dword [rdi + 4]    ; eax = r->flags
    and    eax, 1                   ; eax &= 1 (test bit 0)
    je     .return_false            ; if 0, return false

    mov    eax, dword [rdi + 0]    ; eax = r->sensor_id (upper half of RAX zeroed)
    mov    rdx, qword [rdi + 8]    ; rdx = r->timestamp
    movsd  xmm0, qword [rdi + 16]  ; xmm0 = r->reading (double)
    cvttsd2si rcx, xmm0            ; rcx = (int64_t)r->reading (truncate)
    add    rax, rdx                 ; rax += timestamp
    add    rax, rcx                 ; rax += (uint64_t)reading
    cmp    rax, qword [rdi + 24]   ; cmp computed == r->checksum
    sete   al                       ; al = (ZF set ? 1 : 0)
    movzx  eax, al                  ; zero-extend to 32-bit return
    ret

.return_false:
    xor    eax, eax                 ; return false
    ret

Key Observations

Observation 1: Pure Base+Displacement Addressing

Every struct field access uses [rdi + constant_offset]. This is the canonical form for struct access:

  - [rdi + 0]  = sensor_id (offset 0)
  - [rdi + 4]  = flags (offset 4)
  - [rdi + 8]  = timestamp (offset 8)
  - [rdi + 16] = reading (offset 16)
  - [rdi + 24] = checksum (offset 24)

The offsets match exactly what offsetof(SensorReading, field) would return. No pointer arithmetic, no temporary registers, no index computation — just the struct pointer plus the compile-time-known offset.

Observation 2: Size Awareness

GCC uses the right instruction size for each field:

  - dword (32-bit) for uint32_t fields
  - qword (64-bit) for int64_t and uint64_t fields
  - movsd (64-bit SSE) for double

Notice that sensor_id is loaded with mov eax, dword [rdi]. There is no movzx with a 32-bit source; instead, writing any 32-bit register implicitly zeroes the upper 32 bits of the corresponding 64-bit register, so RAX is clean before the addition. This is essential: add rax, rdx would produce wrong results if RAX contained garbage in bits 32-63.

Observation 3: No Memory-to-Memory Operations

The comparison r->checksum == expected is not done by comparing two memory locations. The computed value is in RAX; the stored checksum is loaded by the cmp rax, [rdi+24] instruction, which reads memory into the comparison hardware without an explicit register load.

Hand-Written Assembly vs. Compiler Output

Let us write the same function by hand and compare:

; Hand-written version
; RDI = r, return bool in AL (extended to EAX)
validate_reading_manual:
    ; Check flags bit 0
    test   dword [rdi + 4], 1      ; test r->flags & 1
    jz     .false                   ; if zero, return false

    ; Compute expected checksum
    mov    eax, dword [rdi]        ; eax = sensor_id (upper half of RAX zeroed)
    add    rax, [rdi + 8]          ; rax += timestamp

    ; Load double, convert to int64
    movsd  xmm0, [rdi + 16]
    cvttsd2si rdx, xmm0            ; rdx = (int64_t)reading
    add    rax, rdx                 ; rax += truncated reading

    ; Compare with stored checksum
    cmp    rax, [rdi + 24]
    sete   al
    movzx  eax, al
    ret

.false:
    xor    eax, eax
    ret

The two versions are essentially identical. The only difference: GCC loads flags into EAX and uses and eax, 1 + je, while our version tests memory directly with test dword [rdi+4], 1 + jz. Both produce the same result. The test form saves one instruction and, because TEST only sets flags without writing a register, it leaves EAX free, so it is marginally preferred here.

Extending to an Array of Structs

Now suppose we need to validate an array:

int count_valid(const SensorReading *array, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (validate_reading(&array[i])) count++;
    }
    return count;
}

Here sizeof(SensorReading) = 32, which is not a valid scale factor (x86-64 addressing encodes only scales of 1, 2, 4, and 8). The compiler handles this by advancing a pointer rather than using a scaled index:

; GCC -O2 output (conceptual, simplified):
; RDI = array, ESI = n
count_valid:
    test   esi, esi
    jle    .return_zero

    xor    eax, eax              ; count = 0
    mov    rdx, rdi              ; rdx = current pointer (= &array[0])
    mov    ecx, esi              ; ecx = remaining count

.loop:
    ; Inline validate_reading(rdx):
    mov    r8d, [rdx + 4]        ; r8d = flags
    test   r8d, 1
    jz     .skip

    ; ... (validate logic using RDX as struct pointer)
    ; ... if valid:
    inc    eax

.skip:
    add    rdx, 32               ; advance pointer by sizeof(SensorReading)
    dec    ecx
    jnz    .loop
    ret

.return_zero:
    xor    eax, eax
    ret

The key insight: add rdx, 32 advances by the struct size. The compiler prefers pointer-increment loops over indexed loops when the element size is not 1/2/4/8, because the pointer can be used directly in addressing without an intermediate multiply.
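In C terms, the transformation is roughly equivalent to rewriting the indexed loop as a pointer walk (a sketch of the idea, not GCC's literal intermediate form; the struct and validator are repeated so the snippet compiles on its own):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t sensor_id;
    uint32_t flags;
    int64_t  timestamp;
    double   reading;
    uint64_t checksum;
} SensorReading;

bool validate_reading(const SensorReading *r) {
    if (!(r->flags & 0x1)) return false;
    uint64_t expected = (uint64_t)r->sensor_id
                      + (uint64_t)r->timestamp
                      + (uint64_t)(int64_t)r->reading;
    return (r->checksum == expected);
}

/* Pointer-increment form: the index i and the implied i*32 multiply
   are gone; p++ compiles to a single add-by-32 on the pointer. */
int count_valid_ptr(const SensorReading *array, int n) {
    int count = 0;
    if (n <= 0) return 0;                 /* mirrors the jle guard */
    const SensorReading *end = array + n; /* one past the last element */
    for (const SensorReading *p = array; p != end; p++) {
        if (validate_reading(p)) count++;
    }
    return count;
}
```

With -O2, both the indexed original and this pointer form typically compile to the same loop; writing the pointer version by hand mainly documents what the compiler will do anyway.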

What This Means for Performance

A modern out-of-order processor can execute this loop at close to one iteration per clock, assuming the data is in L1 cache. The addressing modes are not the bottleneck; the load-use latency is. Each field load ([rdx + offset]) costs roughly 4-5 cycles from L1, but the out-of-order engine issues multiple loads simultaneously, hiding most of that latency.

The practical lesson: structure your data access to be sequential (cache-friendly) and let the addressing modes be as simple as possible. Complex addressing modes (base+index×scale+disp) do not cost more than simple ones — but a cache miss costs 200+ cycles regardless of addressing mode.

Takeaways

  1. Struct field access in assembly is [pointer + field_offset]. The offset is always a compile-time constant.
  2. Size discipline matters: use the right operand size and handle sign/zero-extension explicitly.
  3. When the struct size is not 1/2/4/8, use a pointer that advances by the stride rather than a scaled index.
  4. GCC -O2 output for struct access is often nearly identical to hand-written code. The compiler wins on field accesses; it earns its keep on the surrounding loop optimization.