Chapter 15 Exercises: SIMD Programming

Exercise 15.1 — Register Identification

For each operation below, identify (a) the register width used, (b) the ISA extension required, and (c) the number of elements processed simultaneously for the given type:

movaps  xmm0, [rel data]      ; Operation A
vmovaps ymm1, [rel data]      ; Operation B
addps   xmm2, xmm3            ; Operation C
vaddps  ymm4, ymm5, ymm6      ; Operation D
paddd   xmm7, xmm8            ; Operation E
vpaddw  ymm9, ymm10, ymm11    ; Operation F

For each:

  1. Register width in bits
  2. ISA extension (SSE, SSE2, AVX, AVX2)
  3. Number of elements if operating on 32-bit floats
  4. Number of elements if operating on 8-bit integers


Exercise 15.2 — Packed Layout

A float array contains: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

After executing:

vmovaps ymm0, [rel array]    ; loads all 8 floats

Draw the layout of ymm0, labeling:

  - Which lane contains which value
  - The bit positions of lane 0 and lane 7
  - What xmm0 contains after this instruction


Exercise 15.3 — Alignment Penalty

Explain the difference between MOVAPS and MOVUPS:

  1. What happens if you use MOVAPS with a 16-byte misaligned address on modern Intel processors?
  2. What is the typical performance penalty of MOVUPS vs. MOVAPS for aligned data on Haswell?
  3. Write the NASM declaration to ensure a 32-byte aligned array in .data:
section .data
; declare float_array of 8 floats, 32-byte aligned
  4. When would you choose MOVUPS over MOVAPS?

Exercise 15.4 — Packed Arithmetic Translation

Translate the following C loop to SSE2 assembly using XMM registers (process 4 floats at a time):

void scale_array(float *dst, const float *src, float scalar, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * scalar;
    }
}

Assumptions:

  - n is a multiple of 4
  - Both arrays are 16-byte aligned
  - rdi = dst, rsi = src, xmm0 = scalar, rdx = n

Write the complete NASM function. Handle the loop setup, SIMD multiply, and return.


Exercise 15.5 — AVX2 Translation

Rewrite Exercise 15.4 using AVX2 (process 8 floats at a time with YMM registers):

; void scale_array_avx2(float *dst, const float *src, float scalar, int n)
; rdi = dst, rsi = src, xmm0 = scalar (needs broadcasting), rdx = n
; Assume n is a multiple of 8, arrays 32-byte aligned

Key challenge: scalar arrives in xmm0 as a single float. You need to broadcast it to all 8 lanes. Use VBROADCASTSS ymm1, xmm0 to accomplish this, then write the loop.


Exercise 15.6 — Horizontal Reduction

The following code attempts to sum all 4 floats in xmm0:

; xmm0 = [a, b, c, d]  (4 floats)
; Goal: xmm0[0] = a + b + c + d
addps  xmm0, ???       ; add xmm0 to itself shifted?

Complete the reduction using SHUFPS and ADDPS. The final sum should be in the low 32 bits of xmm0. Write the 3-4 instruction sequence and explain what each step produces.

For reference, SHUFPS xmm, src, imm8 selects 4 elements. The immediate byte encodes which source elements go to which destination positions.


Exercise 15.7 — AVX2 Horizontal Reduction

An 8-float horizontal sum with AVX2 requires an extra step compared to SSE2. Given:

; ymm0 = [a, b, c, d, e, f, g, h]
; Goal: compute a+b+c+d+e+f+g+h

The trick is that most AVX shuffle and horizontal instructions operate within 128-bit lanes. VHADDPS ymm0, ymm0, ymm1, for example, sums adjacent pairs within each 128-bit half independently and never crosses the boundary, so combining the two halves requires an explicit cross-lane step.

  1. Use VEXTRACTF128 xmm1, ymm0, 1 to extract the upper 128 bits
  2. Add xmm1 to xmm0 (now xmm0 = [a+e, b+f, c+g, d+h])
  3. Complete the reduction to a scalar sum

Write the complete sequence (6-8 instructions).


Exercise 15.8 — Packed Integer Operations

Translate to SSE2 assembly:

// Clamp each int32 value to [0, 255]
void clamp_to_byte(int32_t *arr, int n) {
    for (int i = 0; i < n; i++) {
        if (arr[i] < 0) arr[i] = 0;
        if (arr[i] > 255) arr[i] = 255;
    }
}

Use these SSE2 integer instructions:

  - PXOR xmm, xmm — zero a register
  - PCMPGTD xmm, src — compare packed int32 (greater than)
  - PANDN xmm, src — AND NOT (dst = ~dst & src)
  - MOVDQA — move aligned 128-bit integer data

Hint: Use the compare result as a mask.


Exercise 15.9 — SHUFPS Decode

For each SHUFPS instruction below, determine what the destination register contains. Elements are listed low to high (element 0 = A in the low 32 bits):

; xmm0 = [A, B, C, D], xmm1 = [E, F, G, H]

shufps xmm0, xmm1, 0b00_01_10_11   ; = 0x1B
shufps xmm0, xmm1, 0b00_00_00_00   ; = 0x00
shufps xmm0, xmm1, 0b11_10_01_00   ; = 0xE4 (identity)
shufps xmm0, xmm0, 0b00_00_00_00   ; broadcast element 0

For SHUFPS xmm_dst, xmm_src, imm8:

  - Bits [1:0]: select which element of dst goes to result[0]
  - Bits [3:2]: select which element of dst goes to result[1]
  - Bits [5:4]: select which element of src goes to result[2]
  - Bits [7:6]: select which element of src goes to result[3]


Exercise 15.10 — PSHUFD Decode

PSHUFD xmm_dst, xmm_src, imm8 rearranges 4 doublewords (32-bit integers) in a register. The immediate encoding is identical to SHUFPS but uses only one source.

Given xmm0 = [0x00000001, 0x00000002, 0x00000003, 0x00000004]:

What does xmm1 contain after each:

pshufd xmm1, xmm0, 0x1B    ; 0b00_01_10_11
pshufd xmm1, xmm0, 0x55    ; 0b01_01_01_01
pshufd xmm1, xmm0, 0x4E    ; 0b01_00_11_10

Describe a use case for PSHUFD xmm1, xmm0, 0x55.


Exercise 15.11 — Vectorize a Dot Product

Implement a dot product in AVX2:

float dot_product(const float *a, const float *b, int n);
// Returns sum of a[i]*b[i] for i in [0, n)
// Assume n is a multiple of 8, arrays 32-byte aligned

Your implementation must:

  1. Use VFMADD231PS (fused multiply-add: dst += src1 * src2) to accumulate products
  2. Maintain a running accumulator in ymm0
  3. Perform a horizontal reduction at the end

Write the complete NASM function. Use VZEROUPPER before ret.


Exercise 15.12 — VZEROUPPER Requirement

Explain the VZEROUPPER problem:

  1. What happens to the upper 128 bits of a YMM register after a legacy SSE instruction (e.g., ADDPS xmm0, xmm1) executes on a processor with AVX support?

  2. What is the "AVX-SSE transition penalty" and on which Intel microarchitecture does it occur? How many cycles is the penalty?

  3. In the following code sequence, where should VZEROUPPER be inserted and why?

external_avx_function:
    vmovaps ymm0, [rel data]
    vaddps  ymm0, ymm0, ymm1
    vmovaps [rel output], ymm0
    ; ... transition to calling SSE code ...
    ret
  4. Is VZEROUPPER needed if the code only uses AVX instructions throughout? Why or why not?

Exercise 15.13 — AES-NI Key Schedule

The AES-128 key schedule expands a 16-byte key into 11 round keys. The hardware instruction AESKEYGENASSIST xmm_dst, xmm_src, rcon assists with this:

  1. What does AESKEYGENASSIST compute? (Which bytes does it operate on from the source?)
  2. The key expansion formula uses PSLLDQ (byte-shift left). What does PSLLDQ xmm, 4 do?
  3. Given the partial key expansion macro:
%macro KEY_EXPAND 2
    aeskeygenassist xmm1, xmm0, %2
    pshufd          xmm1, xmm1, 0xFF
    movaps          xmm2, xmm0
    pslldq          xmm2, 4
    xorps           xmm0, xmm2
    pslldq          xmm2, 4
    xorps           xmm0, xmm2
    pslldq          xmm2, 4
    xorps           xmm0, xmm2
    xorps           xmm0, xmm1
    movaps          [%1], xmm0
%endmacro

Explain step by step what this macro computes. What is the purpose of the three successive PSLLDQ + XORPS steps?


Exercise 15.14 — AES-NI Encryption Loop

Complete the following AES-128 encryption function:

; void aes128_encrypt_block(uint8_t *ct, const uint8_t *pt, const uint8_t *key_schedule)
; rdi = ciphertext output
; rsi = plaintext input (16 bytes)
; rdx = key schedule (11 * 16 = 176 bytes)

aes128_encrypt_block:
    movdqu  xmm0, [rsi]          ; load plaintext
    movaps  xmm1, [rdx]          ; load round key 0
    pxor    xmm0, xmm1           ; initial key whitening

    ; Rounds 1-9: AESENC
    ; TODO: add 9 AESENC rounds using key schedule offsets
    ; Key schedule: round key i is at [rdx + i*16]

    ; Round 10: AESENCLAST
    ; TODO: add final round

    movdqu  [rdi], xmm0          ; store ciphertext
    ret

Fill in the 9 intermediate rounds and the final AESENCLAST round. Each round key is at [rdx + round_number * 16].


Exercise 15.15 — AES Decryption

AES-NI decryption uses AESDEC and AESDECLAST. However, decryption requires the equivalent inverse cipher key schedule, which is obtained by applying AESIMC (AES Inverse Mix Columns) to round keys 1-9.

  1. Write pseudocode (or NASM) to convert an encryption key schedule to a decryption key schedule:
     - Round key 0 stays the same
     - Round keys 1-9: apply AESIMC
     - Round key 10 stays the same

  2. Write the decryption loop body using AESDEC / AESDECLAST.

  3. Explain why round keys 0 and 10 do NOT get AESIMC applied.


Exercise 15.16 — CTR Mode Counter

In AES-CTR mode, a counter block is encrypted and XORed with plaintext. The counter increments for each 16-byte block.

; Counter block layout (big-endian convention):
; [nonce: 8 bytes][counter: 8 bytes]
; Stored in xmm_ctr

; Given xmm_ctr holds the current counter block,
; write the increment sequence using these instructions:
; PADDQ  xmm, src  — add two 64-bit integers packed in xmm
; (We only want to increment the low 64 bits — the counter)
  1. Write the counter increment instruction.
  2. In little-endian storage (x86 native), the low 64 bits of an XMM register are bits [63:0]. Does PADDQ increment the right field?
  3. For a CTR mode encrypt loop that processes one 16-byte block per iteration: write the 4-5 instruction sequence that (a) encrypts the counter, (b) XORs with plaintext, (c) stores ciphertext, (d) increments the counter.

Exercise 15.17 — Performance Analysis

You are optimizing an inner loop that applies a 4×4 float matrix to a vector of 4 floats.

The naive scalar implementation requires: 4 multiplies + 3 adds per output element × 4 output elements = 16 multiplies + 12 adds = 28 FP operations.

An FMA-based SIMD implementation using VFMADD213PS processes 4 elements simultaneously.

  1. How many VFMADD213PS instructions are needed for the complete 4×4 × 4×1 multiplication?
  2. Assuming Haswell (VFMADD213PS reciprocal throughput: 0.5 cycles/instruction, latency: 5 cycles), what is the theoretical throughput-limited time for this computation?
  3. What is the latency-limited time if the four output accumulations are independent?
  4. How does this compare to the scalar implementation (assume MULSS/ADDSS reciprocal throughput of 1 cycle each)?

Exercise 15.18 — Auto-Vectorization Analysis

Given the following C function:

void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
  1. Compile mentally: what prevents auto-vectorization? (Hint: aliasing)
  2. Add the __restrict__ qualifier to the pointers. Does this fix the aliasing problem?
  3. What GCC flags enable AVX2 auto-vectorization?
  4. After adding __restrict__ and -O3 -mavx2, GCC generates something like:
.loop:
    vmovups ymm0, [rsi + rax*4]
    vaddps  ymm0, ymm0, [rdx + rax*4]
    vmovups [rdi + rax*4], ymm0
    add     rax, 8
    cmp     rax, rcx
    jb      .loop

Why does GCC use VMOVUPS instead of VMOVAPS? Under what conditions would GCC use VMOVAPS?


Exercise 15.19 — SIMD Data Layout Conversion

A program stores RGB pixels as a struct array (AoS):

struct Pixel { uint8_t r, g, b, a; };  // 4 bytes per pixel
Pixel image[WIDTH * HEIGHT];

To apply a SIMD operation to just the red channel, you need to gather the r bytes from every 4th byte.

  1. Describe the challenge of processing this data with SIMD instructions.
  2. The alternative layout (SoA) stores: r[n], g[n], b[n], a[n] as separate arrays. Show how MOVDQU loads 16 red bytes at once with SoA.
  3. If you cannot change the data layout, PSHUFB (byte shuffle) can extract and pack bytes from a 16-byte register. Write pseudocode for extracting 4 red bytes from 16 bytes of AoS pixel data.
  4. For the grayscale formula Y = 0.299*R + 0.587*G + 0.114*B: why is integer arithmetic preferred over floating-point for byte images?

Exercise 15.20 — Security: Constant-Time AES

AES-NI instructions are specifically designed to be constant-time — their execution time does not depend on the key or data values.

  1. Explain why table-based (T-table) software AES is vulnerable to cache-timing attacks. What information leaks through cache access patterns?

  2. The following is a simplified software AES SubBytes step:

     uint8_t sbox_lookup(uint8_t byte) {
         return sbox[byte];  // table lookup
     }

     How does this leak the key material to a timing attacker?

  3. AESENC eliminates this attack. Explain why: what property of the instruction prevents timing side channels?

  4. A system uses AES-NI for encryption but then does:

     if (key[0] == 0) handle_zero_key();

     Does this reintroduce a timing channel? What type of vulnerability is this?

  5. Name one other x86-64 instruction (outside of AES-NI) that is specified to run in constant time regardless of inputs, and explain its security relevance.