Case Study 8.1: Accessing a Struct in Assembly
The Scenario
A data processing pipeline stores sensor readings in a packed struct. A performance audit reveals that the struct-access function is on the critical path, running 100 million times per second. The team decides to write it in assembly and compare the output to what GCC -O2 produces.
The Struct
typedef struct {
uint32_t sensor_id; // offset 0, 4 bytes
uint32_t flags; // offset 4, 4 bytes
int64_t timestamp; // offset 8, 8 bytes
double reading; // offset 16, 8 bytes
uint64_t checksum; // offset 24, 8 bytes
} SensorReading; // sizeof = 32 bytes
Note: this struct has no padding. Each field is naturally aligned (each field at an offset that is a multiple of its size), which makes for clean assembly.
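These layout claims can be checked at compile time; a minimal sketch using C11 `_Static_assert` and `offsetof`:

```c
#include <stddef.h>   /* offsetof */
#include <stdint.h>

typedef struct {
    uint32_t sensor_id;   /* offset 0  */
    uint32_t flags;       /* offset 4  */
    int64_t  timestamp;   /* offset 8  */
    double   reading;     /* offset 16 */
    uint64_t checksum;    /* offset 24 */
} SensorReading;

/* If the compiler ever inserts padding, the build fails here. */
_Static_assert(offsetof(SensorReading, flags)     == 4,  "flags at 4");
_Static_assert(offsetof(SensorReading, timestamp) == 8,  "timestamp at 8");
_Static_assert(offsetof(SensorReading, reading)   == 16, "reading at 16");
_Static_assert(offsetof(SensorReading, checksum)  == 24, "checksum at 24");
_Static_assert(sizeof(SensorReading) == 32, "no padding: 32 bytes");
```

Pinning the layout this way is cheap insurance: every assembly offset below depends on it.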
The C Function
// Returns true if reading is valid (flags bit 0 set, checksum matches simple sum)
bool validate_reading(const SensorReading *r) {
if (!(r->flags & 0x1)) return false;
uint64_t expected = (uint64_t)r->sensor_id
+ (uint64_t)(r->timestamp)
+ (uint64_t)(r->reading); // truncated
return (r->checksum == expected);
}
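The validator can be sanity-checked from C. `seal_reading` is a hypothetical helper (not part of the original pipeline) that stamps a reading with the same truncating sum the validator recomputes:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t sensor_id, flags;
    int64_t  timestamp;
    double   reading;
    uint64_t checksum;
} SensorReading;

bool validate_reading(const SensorReading *r) {
    if (!(r->flags & 0x1)) return false;
    uint64_t expected = (uint64_t)r->sensor_id
                      + (uint64_t)r->timestamp
                      + (uint64_t)r->reading;   /* truncated */
    return r->checksum == expected;
}

/* Hypothetical helper: set the valid bit and stamp the checksum
   using the same truncating sum the validator expects. */
void seal_reading(SensorReading *r) {
    r->flags |= 0x1;
    r->checksum = (uint64_t)r->sensor_id
                + (uint64_t)r->timestamp
                + (uint64_t)r->reading;
}
```

A reading sealed this way validates; clearing bit 0 of `flags` makes the same reading fail the fast-path check.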
GCC -O2 Output (Annotated)
; GCC 13.2, -O2, x86-64 Linux
; RDI = r (pointer to SensorReading)
validate_reading:
mov eax, dword [rdi + 4] ; eax = r->flags
and eax, 1 ; eax &= 1 (test bit 0)
je .return_false ; if 0, return false
mov eax, dword [rdi + 0] ; eax = r->sensor_id (32-bit mov zero-extends)
mov rdx, qword [rdi + 8] ; rdx = r->timestamp
movsd xmm0, qword [rdi + 16] ; xmm0 = r->reading (double)
cvttsd2si rcx, xmm0 ; rcx = (int64_t)r->reading (truncate)
add rax, rdx ; rax += timestamp
add rax, rcx ; rax += (uint64_t)reading
cmp rax, qword [rdi + 24] ; cmp computed == r->checksum
sete al ; al = (ZF set ? 1 : 0)
movzx eax, al ; zero-extend to 32-bit return
ret
.return_false:
xor eax, eax ; return false
ret
Key Observations
Observation 1: Pure Base+Displacement Addressing
Every struct field access uses [rdi + constant_offset]. This is the canonical form for struct access:
- [rdi + 0] = sensor_id (offset 0)
- [rdi + 4] = flags (offset 4)
- [rdi + 8] = timestamp (offset 8)
- [rdi + 16] = reading (offset 16)
- [rdi + 24] = checksum (offset 24)
The offsets match exactly what offsetof(SensorReading, field) would return. No pointer arithmetic, no temporary registers, no index computation — just the struct pointer plus the compile-time-known offset.
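The same base-plus-displacement arithmetic can be spelled out in C. A sketch (`load_flags_by_offset` is an illustrative name) that loads `flags` via an explicit byte offset, using `memcpy` to stay well-defined under strict aliasing:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sensor_id, flags;
    int64_t  timestamp;
    double   reading;
    uint64_t checksum;
} SensorReading;

/* Load flags the way `mov eax, dword [rdi + 4]` does: base address
   plus a constant byte offset, reading exactly 4 bytes. */
uint32_t load_flags_by_offset(const SensorReading *r) {
    uint32_t v;
    memcpy(&v,
           (const unsigned char *)r + offsetof(SensorReading, flags),
           sizeof v);
    return v;
}
```

Any optimizing compiler folds this back into the single `[base + 4]` load; the C form merely makes the address computation visible.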
Observation 2: Size Awareness
GCC uses the right instruction size for each field:
- dword (32-bit) for uint32_t fields
- qword (64-bit) for int64_t and uint64_t fields
- movsd (64-bit SSE) for double
Notice that sensor_id is loaded with mov eax, dword [rdi] — in 64-bit mode, writing a 32-bit register implicitly zeroes bits 32-63 of the full 64-bit register, so RAX is clean before the addition. (There is no movzx form with a dword source; the implicit zeroing makes one unnecessary.) This is essential: add rax, rdx would produce wrong results if RAX contained garbage in bits 32-63. One caveat on the conversion: the C source casts the double to uint64_t, while cvttsd2si produces a signed int64_t. The two agree for readings below 2^63; real GCC output adds fixup code covering the full unsigned range, omitted here for clarity.
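Both conversion rules can be observed directly from C (a minimal sketch; `widen` and `truncate_reading` are illustrative names):

```c
#include <stdint.h>

/* uint32_t -> uint64_t widens with zero bits, matching the implicit
   upper-half zeroing a 32-bit mov performs on a 64-bit register. */
uint64_t widen(uint32_t x) { return (uint64_t)x; }

/* double -> int64_t truncates toward zero, which is the semantics
   cvttsd2si (note the extra 't' for "truncate") implements. */
int64_t truncate_reading(double d) { return (int64_t)d; }
```

So 3.9 truncates to 3 and -3.9 to -3: rounding toward zero, not toward negative infinity.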
Observation 3: No Memory-to-Memory Operations
The comparison r->checksum == expected is not done by comparing two memory locations (x86 has no memory-to-memory cmp). The computed value is in RAX; the stored checksum is read directly as the memory operand of cmp rax, [rdi+24], with no separate register load.
Hand-Written Assembly vs. Compiler Output
Let us write the same function by hand and compare:
; Hand-written version
; RDI = r, return bool in AL (extended to EAX)
validate_reading_manual:
; Check flags bit 0
test dword [rdi + 4], 1 ; test r->flags & 1
jz .false ; if zero, return false
; Compute expected checksum
mov eax, dword [rdi] ; eax = sensor_id (32-bit mov zero-extends into RAX)
add rax, [rdi + 8] ; rax += timestamp
; Load double, convert to int64
movsd xmm0, [rdi + 16]
cvttsd2si rdx, xmm0 ; rdx = (int64_t)reading
add rax, rdx ; rax += truncated reading
; Compare with stored checksum
cmp rax, [rdi + 24]
sete al
movzx eax, al
ret
.false:
xor eax, eax
ret
The two versions are essentially identical. The only difference: GCC loads flags into EAX and uses and eax, 1 + je, while our version uses test dword [rdi+4], 1 + jz. Both produce the same result. TEST has the small advantage of not clobbering a register (it only sets flags), and since bit 0 lives in the lowest byte, test byte [rdi+4], 1 would be more compact still.
Extending to an Array of Structs
Now suppose we need to validate an array:
int count_valid(const SensorReading *array, int n) {
int count = 0;
for (int i = 0; i < n; i++) {
if (validate_reading(&array[i])) count++;
}
return count;
}
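A quick driver with illustrative values — two of the three readings carry a checksum consistent with the truncating sum, and the middle one has the valid bit clear:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t sensor_id, flags;
    int64_t  timestamp;
    double   reading;
    uint64_t checksum;
} SensorReading;

bool validate_reading(const SensorReading *r) {
    if (!(r->flags & 0x1)) return false;
    uint64_t expected = (uint64_t)r->sensor_id
                      + (uint64_t)r->timestamp
                      + (uint64_t)r->reading;
    return r->checksum == expected;
}

int count_valid(const SensorReading *array, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (validate_reading(&array[i])) count++;
    return count;
}

/* Illustrative data: entries 0 and 2 valid, entry 1 flagged invalid. */
static const SensorReading demo[3] = {
    { .sensor_id = 1, .flags = 1, .timestamp = 10, .reading = 0.5, .checksum = 11 },
    { .sensor_id = 2, .flags = 0, .timestamp = 10, .reading = 0.5, .checksum = 12 },
    { .sensor_id = 3, .flags = 1, .timestamp = 10, .reading = 2.5, .checksum = 15 },
};
```

`count_valid(demo, 3)` counts the first and third entries: the second is rejected by the flags check before its checksum is ever computed.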
Here sizeof(SensorReading) = 32, which is not a valid scale factor. The compiler handles this by advancing a pointer rather than using a scaled index:
; GCC -O2 output (conceptual, simplified):
; RDI = array, ESI = n
count_valid:
test esi, esi
jle .return_zero
xor eax, eax ; count = 0
mov rdx, rdi ; rdx = current pointer (= &array[0])
mov ecx, esi ; ecx = remaining count
.loop:
; Inline validate_reading(rdx):
mov r8d, [rdx + 4] ; r8d = flags
test r8d, 1
jz .skip
; ... (validate logic using RDX as struct pointer)
; ... if valid:
inc eax
.skip:
add rdx, 32 ; advance pointer by sizeof(SensorReading)
dec ecx
jnz .loop
ret
.return_zero:
xor eax, eax
ret
The key insight: add rdx, 32 advances by the struct size. The compiler prefers pointer-increment loops over indexed loops when the element size is not 1/2/4/8, because the pointer can be used directly in addressing without an intermediate multiply.
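The same shape translates straight back to C as a pointer-walk loop (`count_valid_ptr` is an illustrative name, not from the original pipeline):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t sensor_id, flags;
    int64_t  timestamp;
    double   reading;
    uint64_t checksum;
} SensorReading;

bool validate_reading(const SensorReading *r) {
    if (!(r->flags & 0x1)) return false;
    uint64_t expected = (uint64_t)r->sensor_id
                      + (uint64_t)r->timestamp
                      + (uint64_t)r->reading;
    return r->checksum == expected;
}

/* Pointer-walk form: each p++ advances the address by
   sizeof(SensorReading) = 32 bytes, mirroring `add rdx, 32`. */
int count_valid_ptr(const SensorReading *p, int n) {
    const SensorReading *end = p + n;
    int count = 0;
    for (; p != end; p++)
        if (validate_reading(p)) count++;
    return count;
}
```

Written either way, GCC at -O2 typically produces the same pointer-increment loop; the indexed form is strength-reduced into it.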
What This Means for Performance
A modern out-of-order processor can execute this loop at close to one iteration per clock, assuming the data is in L1 cache. The addressing modes are not the bottleneck; the load-use latency is. Each field load ([rdx + offset]) has a 4-cycle L1 cache latency, but the out-of-order engine issues multiple loads simultaneously, hiding most of that latency.
The practical lesson: structure your data access to be sequential (cache-friendly) and let the addressing modes be as simple as possible. Complex addressing modes (base+index×scale+disp) do not cost more than simple ones — but a cache miss costs 200+ cycles regardless of addressing mode.
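A minimal sketch of how one might measure the sequential sweep (`run_benchmark` is hypothetical; clock() is coarse, and the absolute number depends entirely on the machine, compiler flags, and cache state):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    uint32_t sensor_id, flags;
    int64_t  timestamp;
    double   reading;
    uint64_t checksum;
} SensorReading;

bool validate_reading(const SensorReading *r) {
    if (!(r->flags & 0x1)) return false;
    uint64_t expected = (uint64_t)r->sensor_id
                      + (uint64_t)r->timestamp
                      + (uint64_t)r->reading;
    return r->checksum == expected;
}

/* Sweep n self-consistent readings sequentially; print ns/element and
   return the valid count (all n entries are constructed to be valid). */
int run_benchmark(int n) {
    SensorReading *a = calloc((size_t)n, sizeof *a);
    if (!a) return -1;
    for (int i = 0; i < n; i++) {
        a[i].sensor_id = (uint32_t)i;
        a[i].flags = 1;
        a[i].checksum = (uint64_t)i;   /* timestamp and reading stay 0 */
    }
    clock_t t0 = clock();
    int count = 0;
    for (int i = 0; i < n; i++)
        count += validate_reading(&a[i]);
    clock_t t1 = clock();
    printf("%.1f ns/element\n",
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)n);
    free(a);
    return count;
}
```

For serious measurement one would repeat the sweep, discard warm-up iterations, and prevent the compiler from hoisting the loop; this sketch only shows the shape of the experiment.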
Takeaways
- Struct field access in assembly is [pointer + field_offset]. The offset is always a compile-time constant.
- Size discipline matters: use the right operand size and handle sign/zero-extension explicitly.
- When the struct size is not 1/2/4/8, use a pointer that advances by the stride rather than a scaled index.
- GCC -O2 output for struct access is often nearly identical to hand-written code. The compiler wins on field accesses; it earns its keep on the surrounding loop optimization.