Chapter 15 Exercises: SIMD Programming

Exercise 15.1 — Register Identification

For each operation below, identify (a) the register width used, (b) the ISA extension required, and (c) the number of elements processed simultaneously for the given type:

movaps  xmm0, [rel data]      ; Operation A
vmovaps ymm1, [rel data]      ; Operation B
addps   xmm2, xmm3            ; Operation C
vaddps  ymm4, ymm5, ymm6      ; Operation D
paddd   xmm7, xmm8            ; Operation E
vpaddw  ymm9, ymm10, ymm11    ; Operation F

For each:

  1. Register width in bits
  2. ISA extension (SSE, SSE2, AVX, AVX2)
  3. Number of elements if operating on 32-bit floats
  4. Number of elements if operating on 8-bit integers


Exercise 15.2 — Packed Layout

A float array contains: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

After executing:

vmovaps ymm0, [rel array]    ; loads all 8 floats

Draw the layout of ymm0, labeling:

  - Which lane contains which value
  - The bit positions of lane 0 and lane 7
  - What xmm0 contains after this instruction


Exercise 15.3 — Alignment Penalty

Explain the difference between MOVAPS and MOVUPS:

  1. What happens if you use MOVAPS with a 16-byte misaligned address on modern Intel processors?
  2. What is the typical performance penalty of MOVUPS vs. MOVAPS for aligned data on Haswell?
  3. Write the NASM declaration to ensure a 32-byte aligned array in .data:
section .data
; declare float_array of 8 floats, 32-byte aligned
  4. When would you choose MOVUPS over MOVAPS?

Exercise 15.4 — Packed Arithmetic Translation

Translate the following C loop to SSE2 assembly using XMM registers (process 4 floats at a time):

void scale_array(float *dst, const float *src, float scalar, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * scalar;
    }
}

Assumptions:

  - n is a multiple of 4
  - Both arrays are 16-byte aligned
  - rdi = dst, rsi = src, xmm0 = scalar, rdx = n

Write the complete NASM function. Handle the loop setup, SIMD multiply, and return.


Exercise 15.5 — AVX2 Translation

Rewrite Exercise 15.4 using AVX2 (process 8 floats at a time with YMM registers):

; void scale_array_avx2(float *dst, const float *src, float scalar, int n)
; rdi = dst, rsi = src, xmm0 = scalar (needs broadcasting), rdx = n
; Assume n is a multiple of 8, arrays 32-byte aligned

Key challenge: scalar arrives in xmm0 as a single float. You need to broadcast it to all 8 lanes. Use VBROADCASTSS ymm1, xmm0 to accomplish this, then write the loop.


Exercise 15.6 — Horizontal Reduction

The following code attempts to sum all 4 floats in xmm0:

; xmm0 = [a, b, c, d]  (4 floats)
; Goal: xmm0[0] = a + b + c + d
addps  xmm0, ???       ; add xmm0 to itself shifted?

Complete the reduction using SHUFPS and ADDPS. The final sum should be in the low 32 bits of xmm0. Write the 3-4 instruction sequence and explain what each step produces.

For reference, SHUFPS xmm, src, imm8 selects 4 elements. The immediate byte encodes which source elements go to which destination positions.


Exercise 15.7 — AVX2 Horizontal Reduction

An 8-float horizontal sum with AVX2 requires an extra step compared to SSE2. Given:

; ymm0 = [a, b, c, d, e, f, g, h]
; Goal: compute a+b+c+d+e+f+g+h

The trick is that most AVX shuffle and horizontal instructions operate within 128-bit lanes. VHADDPS ymm0, ymm0, ymm1, for example, sums adjacent pairs within each 128-bit half independently and never crosses the boundary, so combining the two halves requires an explicit cross-lane step.

  1. Use VEXTRACTF128 xmm1, ymm0, 1 to extract the upper 128 bits
  2. Add xmm1 to xmm0 (now xmm0 = [a+e, b+f, c+g, d+h])
  3. Complete the reduction to a scalar sum

Write the complete sequence (6-8 instructions).


Exercise 15.8 — Packed Integer Operations

Translate to SSE2 assembly:

// Clamp each int32 value to [0, 255]
void clamp_to_byte(int32_t *arr, int n) {
    for (int i = 0; i < n; i++) {
        if (arr[i] < 0) arr[i] = 0;
        if (arr[i] > 255) arr[i] = 255;
    }
}

Use these SSE2 integer instructions:

  - PXOR xmm, xmm — zero a register
  - PCMPGTD xmm, src — compare packed int32 (greater than)
  - PANDN xmm, src — AND NOT (dst = ~dst & src)
  - MOVDQA — move aligned 128-bit integer data

Hint: Use the compare result as a mask.


Exercise 15.9 — SHUFPS Decode

For each SHUFPS instruction below, determine what the destination register contains. Elements are listed low to high (element 0 = A in the low 32 bits):

; xmm0 = [A, B, C, D], xmm1 = [E, F, G, H]

shufps xmm0, xmm1, 0b00_01_10_11   ; = 0x1B
shufps xmm0, xmm1, 0b00_00_00_00   ; = 0x00
shufps xmm0, xmm1, 0b11_10_01_00   ; = 0xE4 (identity)
shufps xmm0, xmm0, 0b00_00_00_00   ; broadcast element 0

For SHUFPS xmm_dst, xmm_src, imm8:

  - Bits [1:0]: select which element of dst goes to result[0]
  - Bits [3:2]: select which element of dst goes to result[1]
  - Bits [5:4]: select which element of src goes to result[2]
  - Bits [7:6]: select which element of src goes to result[3]


Exercise 15.10 — PSHUFD Decode

PSHUFD xmm_dst, xmm_src, imm8 rearranges 4 doublewords (32-bit integers) in a register. The immediate encoding is identical to SHUFPS but uses only one source.

Given xmm0 = [0x00000001, 0x00000002, 0x00000003, 0x00000004]:

What does xmm1 contain after each:

pshufd xmm1, xmm0, 0x1B    ; 0b00_01_10_11
pshufd xmm1, xmm0, 0x55    ; 0b01_01_01_01
pshufd xmm1, xmm0, 0x4E    ; 0b01_00_11_10

Describe a use case for PSHUFD xmm1, xmm0, 0x55.


Exercise 15.11 — Vectorize a Dot Product

Implement a dot product in AVX2:

float dot_product(const float *a, const float *b, int n);
// Returns sum of a[i]*b[i] for i in [0, n)
// Assume n is a multiple of 8, arrays 32-byte aligned

Your implementation must:

  1. Use VFMADD231PS (fused multiply-add: dst += src1 * src2) to accumulate products
  2. Maintain a running accumulator in ymm0
  3. Perform a horizontal reduction at the end

Write the complete NASM function. Use VZEROUPPER before ret.


Exercise 15.12 — VZEROUPPER Requirement

Explain the VZEROUPPER problem:

  1. What happens to the upper 128 bits of a YMM register after a legacy SSE instruction (e.g., ADDPS xmm0, xmm1) executes on a processor with AVX support?

  2. What is the "AVX-SSE transition penalty" and on which Intel microarchitecture does it occur? How many cycles is the penalty?

  3. In the following code sequence, where should VZEROUPPER be inserted and why?

external_avx_function:
    vmovaps ymm0, [rel data]
    vaddps  ymm0, ymm0, ymm1
    vmovaps [rel output], ymm0
    ; ... transition to calling SSE code ...
    ret
  4. Is VZEROUPPER needed if the code only uses AVX instructions throughout? Why or why not?

Exercise 15.13 — AES-NI Key Schedule

The AES-128 key schedule expands a 16-byte key into 11 round keys. The hardware instruction AESKEYGENASSIST xmm_dst, xmm_src, rcon assists with this:

  1. What does AESKEYGENASSIST compute? (Which bytes does it operate on from the source?)
  2. The key expansion formula uses PSLLDQ (byte-shift left). What does PSLLDQ xmm, 4 do?
  3. Given the partial key expansion macro:
%macro KEY_EXPAND 2
    aeskeygenassist xmm1, xmm0, %2
    pshufd          xmm1, xmm1, 0xFF
    movaps          xmm2, xmm0
    pslldq          xmm2, 4
    xorps           xmm0, xmm2
    pslldq          xmm2, 4
    xorps           xmm0, xmm2
    pslldq          xmm2, 4
    xorps           xmm0, xmm2
    xorps           xmm0, xmm1
    movaps          [%1], xmm0
%endmacro

Explain step by step what this macro computes. What is the purpose of the three successive PSLLDQ + XORPS steps?


Exercise 15.14 — AES-NI Encryption Loop

Complete the following AES-128 encryption function:

; void aes128_encrypt_block(uint8_t *ct, const uint8_t *pt, const uint8_t *key_schedule)
; rdi = ciphertext output
; rsi = plaintext input (16 bytes)
; rdx = key schedule (11 * 16 = 176 bytes)

aes128_encrypt_block:
    movdqu  xmm0, [rsi]          ; load plaintext
    movaps  xmm1, [rdx]          ; load round key 0
    pxor    xmm0, xmm1           ; initial key whitening

    ; Rounds 1-9: AESENC
    ; TODO: add 9 AESENC rounds using key schedule offsets
    ; Key schedule: round key i is at [rdx + i*16]

    ; Round 10: AESENCLAST
    ; TODO: add final round

    movdqu  [rdi], xmm0          ; store ciphertext
    ret

Fill in the 9 intermediate rounds and the final AESENCLAST round. Each round key is at [rdx + round_number * 16].


Exercise 15.15 — AES Decryption

AES-NI decryption uses AESDEC and AESDECLAST. However, decryption requires the equivalent inverse cipher key schedule, which is obtained by applying AESIMC (AES Inverse Mix Columns) to round keys 1-9.

  1. Write pseudocode (or NASM) to convert an encryption key schedule to a decryption key schedule:
     - Round key 0 stays the same
     - Round keys 1-9: apply AESIMC
     - Round key 10 stays the same

  2. Write the decryption loop body using AESDEC / AESDECLAST.

  3. Explain why round keys 0 and 10 do NOT get AESIMC applied.


Exercise 15.16 — CTR Mode Counter

In AES-CTR mode, a counter block is encrypted and XORed with plaintext. The counter increments for each 16-byte block.

; Counter block layout (big-endian convention):
; [nonce: 8 bytes][counter: 8 bytes]
; Stored in xmm_ctr

; Given xmm_ctr holds the current counter block,
; write the increment sequence using these instructions:
; PADDQ  xmm, src  — add two 64-bit integers packed in xmm
; (We only want to increment the low 64 bits — the counter)
  1. Write the counter increment instruction.
  2. In little-endian storage (x86 native), the low 64 bits of an XMM register are bits [63:0]. Does PADDQ increment the right field?
  3. For a CTR mode encrypt loop that processes one 16-byte block per iteration: write the 4-5 instruction sequence that (a) encrypts the counter, (b) XORs with plaintext, (c) stores ciphertext, (d) increments the counter.

Exercise 15.17 — Performance Analysis

You are optimizing an inner loop that applies a 4×4 float matrix to a vector of 4 floats.

The naive scalar implementation requires: 4 multiplies + 3 adds per output element × 4 output elements = 16 multiplies + 12 adds = 28 FP operations.

An FMA-based SIMD implementation using VFMADD213PS processes 4 elements simultaneously.

  1. How many VFMADD213PS instructions are needed for the complete 4×4 × 4×1 multiplication?
  2. Assuming Haswell (VFMADD213PS reciprocal throughput: 0.5 cycles/instruction, latency: 5 cycles), what is the theoretical throughput-limited time for this computation?
  3. What is the latency-limited time if the four output accumulations are independent?
  4. How does this compare to the scalar implementation (assume MULSS/ADDSS reciprocal throughput of 1 cycle each)?

Exercise 15.18 — Auto-Vectorization Analysis

Given the following C function:

void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
  1. Compile mentally: what prevents auto-vectorization? (Hint: aliasing)
  2. Add the __restrict__ qualifier to the pointers. Does this fix the aliasing problem?
  3. What GCC flags enable AVX2 auto-vectorization?
  4. After adding __restrict__ and -O3 -mavx2, GCC generates something like:
.loop:
    vmovups ymm0, [rsi + rax*4]
    vaddps  ymm0, ymm0, [rdx + rax*4]
    vmovups [rdi + rax*4], ymm0
    add     rax, 8
    cmp     rax, rcx
    jb      .loop

Why does GCC use VMOVUPS instead of VMOVAPS? Under what conditions would GCC use VMOVAPS?


Exercise 15.19 — SIMD Data Layout Conversion

A program stores RGB pixels as a struct array (AoS):

struct Pixel { uint8_t r, g, b, a; };  // 4 bytes per pixel
Pixel image[WIDTH * HEIGHT];

To apply a SIMD operation to just the red channel, you need to gather the r bytes from every 4th byte.

  1. Describe the challenge of processing this data with SIMD instructions.
  2. The alternative layout (SoA) stores: r[n], g[n], b[n], a[n] as separate arrays. Show how MOVDQU loads 16 red bytes at once with SoA.
  3. If you cannot change the data layout, PSHUFB (byte shuffle) can extract and pack bytes from a 16-byte register. Write pseudocode for extracting 4 red bytes from 16 bytes of AoS pixel data.
  4. For the grayscale formula Y = 0.299*R + 0.587*G + 0.114*B: why is integer arithmetic preferred over floating-point for byte images?

Exercise 15.20 — Security: Constant-Time AES

AES-NI instructions are specifically designed to be constant-time — their execution time does not depend on the key or data values.

  1. Explain why table-based (T-table) software AES is vulnerable to cache-timing attacks. What information leaks through cache access patterns?

  2. The following is a simplified software AES SubBytes step:

     uint8_t sbox_lookup(uint8_t byte) {
         return sbox[byte];  // table lookup
     }

     How does this leak the key material to a timing attacker?

  3. AESENC eliminates this attack. Explain why: what property of the instruction prevents timing side channels?

  4. A system uses AES-NI for encryption but then does:

     if (key[0] == 0) handle_zero_key();

     Does this reintroduce a timing channel? What type of vulnerability is this?

  5. Name one other x86-64 instruction (outside of AES-NI) that is specified to run in constant time regardless of inputs, and explain its security relevance.