Chapter 15 Exercises: SIMD Programming
Exercise 15.1 — Register Identification
For each operation below, identify (a) the register width used, (b) the ISA extension required, and (c) the number of elements processed simultaneously for the given type:
movaps xmm0, [rel data] ; Operation A
vmovaps ymm1, [rel data] ; Operation B
addps xmm2, xmm3 ; Operation C
vaddps ymm4, ymm5, ymm6 ; Operation D
paddd xmm7, xmm8 ; Operation E
vpaddw ymm9, ymm10, ymm11 ; Operation F
For each operation:
1. Register width in bits
2. ISA extension (SSE, SSE2, AVX, AVX2)
3. Number of elements if operating on 32-bit floats
4. Number of elements if operating on 8-bit integers
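As a model, here is one decode of Operation A; the remaining five operations follow the same pattern.

```nasm
movaps xmm0, [rel data]  ; Operation A
; (1) 128-bit XMM register          (2) SSE
; (3) 4 elements as 32-bit floats   (4) 16 elements as 8-bit integers
```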
Exercise 15.2 — Packed Layout
A float array contains: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
After executing:
vmovaps ymm0, [rel array] ; loads all 8 floats
Draw the layout of ymm0, labeling:
- Which lane contains which value
- The bit positions of lane 0 and lane 7
- What xmm0 contains after this instruction
Exercise 15.3 — Alignment Penalty
Explain the difference between MOVAPS and MOVUPS:
- What happens if you use MOVAPS with a 16-byte misaligned address on modern Intel processors?
- What is the typical performance penalty of MOVUPS vs. MOVAPS for aligned data on Haswell?
- Write the NASM declaration to ensure a 32-byte aligned array in .data:
section .data
; declare float_array of 8 floats, 32-byte aligned
- When would you choose MOVUPS over MOVAPS?
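One way to write the requested declaration (a sketch; the initial values are arbitrary, and the `align=32` section attribute is needed because NASM's default .data section alignment can be lower than 32):

```nasm
section .data align=32       ; raise the section's own alignment
align 32                     ; pad so the next label sits on a 32-byte boundary
float_array: dd 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0  ; 8 floats = 32 bytes
```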
Exercise 15.4 — Packed Arithmetic Translation
Translate the following C loop to SSE2 assembly using XMM registers (process 4 floats at a time):
void scale_array(float *dst, const float *src, float scalar, int n) {
for (int i = 0; i < n; i++) {
dst[i] = src[i] * scalar;
}
}
Assumptions:
- n is a multiple of 4
- Both arrays are 16-byte aligned
- rdi = dst, rsi = src, xmm0 = scalar, rdx = n
Write the complete NASM function. Handle the loop setup, SIMD multiply, and return.
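A minimal sketch of one possible solution, under the stated assumptions (n > 0 and a multiple of 4; no tail handling needed):

```nasm
scale_array:
    shufps xmm0, xmm0, 0x00      ; broadcast scalar to all 4 lanes
    xor rax, rax                 ; element index
.loop:
    movaps xmm1, [rsi + rax*4]   ; load 4 floats from src
    mulps xmm1, xmm0             ; multiply each lane by the scalar
    movaps [rdi + rax*4], xmm1   ; store 4 results to dst
    add rax, 4
    cmp rax, rdx
    jb .loop
    ret
```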
Exercise 15.5 — AVX2 Translation
Rewrite Exercise 15.4 using AVX2 (process 8 floats at a time with YMM registers):
; void scale_array_avx2(float *dst, const float *src, float scalar, int n)
; rdi = dst, rsi = src, xmm0 = scalar (needs broadcasting), rdx = n
; Assume n is a multiple of 8, arrays 32-byte aligned
Key challenge: scalar arrives in xmm0 as a single float. You need to broadcast it to all 8 lanes. Use VBROADCASTSS ymm1, xmm0 to accomplish this, then write the loop.
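A hedged sketch under the same assumptions, using the broadcast hinted at above:

```nasm
scale_array_avx2:
    vbroadcastss ymm1, xmm0      ; replicate the scalar into all 8 lanes
    xor rax, rax
.loop:
    vmovaps ymm2, [rsi + rax*4]  ; load 8 floats
    vmulps ymm2, ymm2, ymm1      ; scale
    vmovaps [rdi + rax*4], ymm2  ; store
    add rax, 8
    cmp rax, rdx
    jb .loop
    vzeroupper                   ; avoid AVX-SSE transition penalties in callers
    ret
```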
Exercise 15.6 — Horizontal Reduction
The following code attempts to sum all 4 floats in xmm0:
; xmm0 = [a, b, c, d] (4 floats)
; Goal: xmm0[0] = a + b + c + d
addps xmm0, ??? ; add xmm0 to itself shifted?
Complete the reduction using SHUFPS and ADDPS. The final sum should be in the low 32 bits of xmm0. Write the 3-4 instruction sequence and explain what each step produces.
For reference, SHUFPS xmm, src, imm8 selects 4 elements. The immediate byte encodes which source elements go to which destination positions.
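One possible sequence to check your answer against (a sketch; the MOVAPS copies are needed because two-operand SHUFPS overwrites its destination):

```nasm
movaps xmm1, xmm0          ; keep [a, b, c, d]
shufps xmm0, xmm0, 0x4E    ; swap 64-bit halves: [c, d, a, b]
addps  xmm0, xmm1          ; [a+c, b+d, a+c, b+d]
movaps xmm1, xmm0
shufps xmm0, xmm0, 0xB1    ; swap adjacent pairs: [b+d, a+c, b+d, a+c]
addps  xmm0, xmm1          ; xmm0[0] = a+b+c+d
```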
Exercise 15.7 — AVX2 Horizontal Reduction
An 8-float horizontal sum with AVX2 requires an extra step compared to SSE2. Given:
; ymm0 = [a, b, c, d, e, f, g, h]
; Goal: compute a+b+c+d+e+f+g+h
The trick is that VADDPS ymm0, ymm0, ymm1 adds strictly element-by-element: nothing in the obvious sequence ever moves a value across the 128-bit boundary, so the upper half must be brought down explicitly.
- Use VEXTRACTF128 xmm1, ymm0, 1 to extract the upper 128 bits
- Add xmm1 to xmm0 (now xmm0 = [a+e, b+f, c+g, d+h])
- Complete the reduction to a scalar sum
Write the complete sequence (6-8 instructions).
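One possible sequence to check your answer against (a sketch; VEX three-operand forms avoid the extra register copies):

```nasm
vextractf128 xmm1, ymm0, 1       ; xmm1 = [e, f, g, h]
vaddps  xmm0, xmm0, xmm1         ; [a+e, b+f, c+g, d+h]
vshufps xmm1, xmm0, xmm0, 0x4E   ; swap 64-bit halves
vaddps  xmm0, xmm0, xmm1         ; [a+e+c+g, b+f+d+h, ...]
vshufps xmm1, xmm0, xmm0, 0xB1   ; swap adjacent pairs
vaddps  xmm0, xmm0, xmm1         ; xmm0[0] = total
vzeroupper
```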
Exercise 15.8 — Packed Integer Operations
Translate to SSE2 assembly:
// Clamp each int32 value to [0, 255]
void clamp_to_byte(int32_t *arr, int n) {
for (int i = 0; i < n; i++) {
if (arr[i] < 0) arr[i] = 0;
if (arr[i] > 255) arr[i] = 255;
}
}
Use these SSE2 integer instructions:
- PXOR xmm, xmm — zero a register
- PCMPGTD xmm, src — compare packed int32 (greater than)
- PANDN xmm, src — AND NOT (dst = ~dst & src)
- MOVDQA — move aligned 128-bit integer
Hint: Use the compare result as a mask.
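A sketch of one possible solution. It also leans on PSRLD and POR, which are not in the list above (a PAND-based merge works too), and k255 is a constant I've added:

```nasm
section .data
align 16
k255: times 4 dd 255             ; four lanes of 255

section .text
clamp_to_byte:                   ; rdi = arr, esi = n (multiple of 4)
    movsxd rsi, esi              ; widen the count
    xor rax, rax
    pxor xmm7, xmm7              ; all-zero constant
.loop:
    movdqa xmm0, [rdi + rax*4]   ; 4 int32 values
    movdqa xmm1, xmm7
    pcmpgtd xmm1, xmm0           ; mask of lanes where 0 > x
    pandn xmm1, xmm0             ; negatives forced to 0
    movdqa xmm2, xmm1
    pcmpgtd xmm2, [rel k255]     ; mask of lanes where x > 255
    movdqa xmm3, xmm2
    psrld xmm3, 24               ; 0xFFFFFFFF -> 255 in masked lanes
    pandn xmm2, xmm1             ; zero the over-255 lanes
    por xmm2, xmm3               ; write 255 into them instead
    movdqa [rdi + rax*4], xmm2
    add rax, 4
    cmp rax, rsi
    jb .loop
    ret
```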
Exercise 15.9 — SHUFPS Decode
For each SHUFPS instruction below, determine what the destination register contains, given xmm0 = [A, B, C, D] and xmm1 = [E, F, G, H] (element 0 = A at the low bits):
; xmm0 = [A, B, C, D], xmm1 = [E, F, G, H]
shufps xmm0, xmm1, 0b00_01_10_11 ; = 0x1B
shufps xmm0, xmm1, 0b00_00_00_00 ; = 0x00
shufps xmm0, xmm1, 0b11_10_01_00 ; = 0xE4 (identity only when src == dst)
shufps xmm0, xmm0, 0b00_00_00_00 ; broadcast element 0
For SHUFPS xmm_dst, xmm_src, imm8:
- Bits [1:0]: select which element of dst goes to result[0]
- Bits [3:2]: select which element of dst goes to result[1]
- Bits [5:4]: select which element of src goes to result[2]
- Bits [7:6]: select which element of src goes to result[3]
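As a worked model for the first case, decoded with the field rules above (verify the others yourself):

```nasm
shufps xmm0, xmm1, 0x1B  ; 0b00_01_10_11
; result[0] = xmm0[3] = D   (bits [1:0] = 11)
; result[1] = xmm0[2] = C   (bits [3:2] = 10)
; result[2] = xmm1[1] = F   (bits [5:4] = 01)
; result[3] = xmm1[0] = E   (bits [7:6] = 00)
; => xmm0 = [D, C, F, E]
```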
Exercise 15.10 — PSHUFD Decode
PSHUFD xmm_dst, xmm_src, imm8 rearranges 4 doublewords (32-bit integers) in a register. The immediate encoding is identical to SHUFPS but uses only one source.
Given xmm0 = [0x00000001, 0x00000002, 0x00000003, 0x00000004]:
What does xmm1 contain after each:
pshufd xmm1, xmm0, 0x1B ; 0b00_01_10_11
pshufd xmm1, xmm0, 0x55 ; 0b01_01_01_01
pshufd xmm1, xmm0, 0x4E ; 0b01_00_11_10
Describe a use case for PSHUFD xmm1, xmm0, 0x55.
Exercise 15.11 — Vectorize a Dot Product
Implement a dot product in AVX2:
float dot_product(const float *a, const float *b, int n);
// Returns sum of a[i]*b[i] for i in [0, n)
// Assume n is a multiple of 8, arrays 32-byte aligned
Your implementation must:
1. Use VFMADD231PS (Fused Multiply-Add: dst += src1 * src2) to accumulate products
2. Maintain a running accumulator in ymm0
3. Perform a horizontal reduction at the end
Write the complete NASM function. Use VZEROUPPER before ret.
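A sketch of one possible solution (rdi = a, rsi = b, rdx = n under the SysV ABI; the reduction reuses the Exercise 15.7 pattern):

```nasm
dot_product:
    vxorps ymm0, ymm0, ymm0          ; accumulator = 0
    xor rax, rax
.loop:
    vmovaps ymm1, [rdi + rax*4]      ; a[i..i+7]
    vfmadd231ps ymm0, ymm1, [rsi + rax*4]  ; acc += a * b
    add rax, 8
    cmp rax, rdx
    jb .loop
    vextractf128 xmm1, ymm0, 1       ; horizontal reduction
    vaddps  xmm0, xmm0, xmm1
    vshufps xmm1, xmm0, xmm0, 0x4E
    vaddps  xmm0, xmm0, xmm1
    vshufps xmm1, xmm0, xmm0, 0xB1
    vaddps  xmm0, xmm0, xmm1         ; result in xmm0[0], the return register
    vzeroupper
    ret
```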
Exercise 15.12 — VZEROUPPER Requirement
Explain the VZEROUPPER problem:
- What happens to the upper 128 bits of a YMM register after a legacy SSE instruction (e.g., ADDPS xmm0, xmm1) executes on a processor with AVX support?
- What is the "AVX-SSE transition penalty" and on which Intel microarchitecture does it occur? How many cycles is the penalty?
- In the following code sequence, where should VZEROUPPER be inserted and why?
external_avx_function:
vmovaps ymm0, [rel data]
vaddps ymm0, ymm0, ymm1
vmovaps [rel output], ymm0
; ... transition to calling SSE code ...
ret
- Is VZEROUPPER needed if the code only uses AVX instructions throughout? Why or why not?
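For the code sequence above, one common placement looks like this (a sketch; the point is that the upper halves must be cleared before the first legacy SSE instruction runs in the caller):

```nasm
external_avx_function:
    vmovaps ymm0, [rel data]
    vaddps ymm0, ymm0, ymm1
    vmovaps [rel output], ymm0
    vzeroupper                   ; clear upper 128 bits before SSE code runs
    ret
```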
Exercise 15.13 — AES-NI Key Schedule
The AES-128 key schedule expands a 16-byte key into 11 round keys. The hardware instruction AESKEYGENASSIST xmm_dst, xmm_src, rcon assists with this:
- What does AESKEYGENASSIST compute? (Which bytes does it operate on from the source?)
- The key expansion formula uses PSLLDQ (byte-shift left). What does PSLLDQ xmm, 4 do?
- Given the partial key expansion macro:
%macro KEY_EXPAND 2
aeskeygenassist xmm1, xmm0, %2
pshufd xmm1, xmm1, 0xFF
movaps xmm2, xmm0
pslldq xmm2, 4
xorps xmm0, xmm2
pslldq xmm2, 4
xorps xmm0, xmm2
pslldq xmm2, 4
xorps xmm0, xmm2
xorps xmm0, xmm1
movaps [%1], xmm0
%endmacro
Explain step by step what this macro computes. What is the purpose of the three successive PSLLDQ + XORPS steps?
Exercise 15.14 — AES-NI Encryption Loop
Complete the following AES-128 encryption function:
; void aes128_encrypt_block(uint8_t *ct, const uint8_t *pt, const uint8_t *key_schedule)
; rdi = ciphertext output
; rsi = plaintext input (16 bytes)
; rdx = key schedule (11 * 16 = 176 bytes)
aes128_encrypt_block:
movdqu xmm0, [rsi] ; load plaintext
movaps xmm1, [rdx] ; load round key 0
pxor xmm0, xmm1 ; initial key whitening
; Rounds 1-9: AESENC
; TODO: add 9 AESENC rounds using key schedule offsets
; Key schedule: round key i is at [rdx + i*16]
; Round 10: AESENCLAST
; TODO: add final round
movdqu [rdi], xmm0 ; store ciphertext
ret
Fill in the 9 intermediate rounds and the final AESENCLAST round. Each round key is at [rdx + round_number * 16].
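A sketch of the missing rounds, unrolled (a counted loop over the offsets is equally valid; the memory operands assume the key schedule is 16-byte aligned, as the MOVAPS above already does):

```nasm
    aesenc xmm0, [rdx + 1*16]      ; round 1
    aesenc xmm0, [rdx + 2*16]
    aesenc xmm0, [rdx + 3*16]
    aesenc xmm0, [rdx + 4*16]
    aesenc xmm0, [rdx + 5*16]
    aesenc xmm0, [rdx + 6*16]
    aesenc xmm0, [rdx + 7*16]
    aesenc xmm0, [rdx + 8*16]
    aesenc xmm0, [rdx + 9*16]      ; round 9
    aesenclast xmm0, [rdx + 10*16] ; round 10 (no MixColumns)
```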
Exercise 15.15 — AES Decryption
AES-NI decryption uses AESDEC and AESDECLAST. However, decryption requires the equivalent inverse cipher key schedule, which is obtained by applying AESIMC (AES Inverse Mix Columns) to round keys 1-9.
- Write pseudocode (or NASM) to convert an encryption key schedule to a decryption key schedule:
  - Round key 0 stays the same
  - Round keys 1-9: apply AESIMC
  - Round key 10 stays the same
- Write the decryption loop body using AESDEC/AESDECLAST.
- Explain why round keys 0 and 10 do NOT get AESIMC applied.
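A sketch of the schedule conversion (register and pointer choices are mine: rsi = encryption schedule, rdi = decryption schedule, both 16-byte aligned):

```nasm
    movdqa xmm0, [rsi]
    movdqa [rdi], xmm0           ; round key 0 copied unchanged
    mov rax, 16
.invert:
    aesimc xmm0, [rsi + rax]     ; InvMixColumns applied to keys 1-9
    movdqa [rdi + rax], xmm0
    add rax, 16
    cmp rax, 160
    jb .invert
    movdqa xmm0, [rsi + 160]
    movdqa [rdi + 160], xmm0     ; round key 10 copied unchanged
```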
Exercise 15.16 — CTR Mode Counter
In AES-CTR mode, a counter block is encrypted and XORed with plaintext. The counter increments for each 16-byte block.
; Counter block layout (big-endian convention):
; [nonce: 8 bytes][counter: 8 bytes]
; Stored in xmm_ctr
; Given xmm_ctr holds the current counter block,
; write the increment sequence using these instructions:
; PADDQ xmm, src — add two 64-bit integers packed in xmm
; (We only want to increment the low 64 bits — the counter)
- Write the counter increment instruction.
- In little-endian storage (x86 native), the low 64 bits of an XMM register are bits [63:0]. Does PADDQ increment the right field?
- For a CTR mode encrypt loop that processes one 16-byte block per iteration: write the 4-5 instruction sequence that (a) encrypts the counter, (b) XORs with plaintext, (c) stores ciphertext, (d) increments the counter.
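A sketch of one loop iteration (the register assignments are mine: xmm4 holds the counter block, xmm5 a constant with 1 in its low qword — e.g. `dq 1, 0` — and rsi/rdi point at the current plaintext/ciphertext blocks):

```nasm
    movdqa xmm0, xmm4            ; (a) feed the counter block to AES
    ; ... 10 AES-128 rounds as in Exercise 15.14 leave the keystream in xmm0 ...
    movdqu xmm1, [rsi]           ; load the plaintext block
    pxor xmm0, xmm1              ; (b) XOR keystream with plaintext
    movdqu [rdi], xmm0           ; (c) store ciphertext
    paddq xmm4, xmm5             ; (d) increment the counter (low qword += 1)
```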
Exercise 15.17 — Performance Analysis
You are optimizing an inner loop that applies a 4×4 float matrix to a vector of 4 floats.
The naive scalar implementation requires: 4 multiplies + 3 adds per output element × 4 output elements = 16 multiplies + 12 adds = 28 FP operations.
An FMA-based SIMD implementation using VFMADD213PS processes 4 elements simultaneously.
- How many VFMADD213PS instructions are needed for the complete 4×4 × 4×1 multiplication?
- Assuming Haswell (VFMADD213PS throughput: 0.5 cycles, latency: 5 cycles), what is the theoretical throughput-limited time for this computation?
- What is the latency-limited time if the four output accumulations are independent?
- How does this compare to the scalar implementation (MULSS/ADDSS: 1 cycle throughput each)?
Exercise 15.18 — Auto-Vectorization Analysis
Given the following C function:
void add_arrays(float *dst, const float *a, const float *b, int n) {
for (int i = 0; i < n; i++) {
dst[i] = a[i] + b[i];
}
}
- Compile mentally: what prevents auto-vectorization? (Hint: aliasing)
- Add the __restrict__ qualifier to the pointers. Does this fix the aliasing problem?
- What GCC flags enable AVX2 auto-vectorization?
- After adding __restrict__ and -O3 -mavx2, GCC generates something like:
.loop:
vmovups ymm0, [rsi + rax*4]
vaddps ymm0, ymm0, [rdx + rax*4]
vmovups [rdi + rax*4], ymm0
add rax, 8
cmp rax, rcx
jb .loop
Why does GCC use VMOVUPS instead of VMOVAPS? Under what conditions would GCC use VMOVAPS?
Exercise 15.19 — SIMD Data Layout Conversion
A program stores RGB pixels as a struct array (AoS):
struct Pixel { uint8_t r, g, b, a; }; // 4 bytes per pixel
Pixel image[WIDTH * HEIGHT];
To apply a SIMD operation to just the red channel, you need to gather the r bytes from every 4th byte.
- Describe the challenge of processing this data with SIMD instructions.
- The alternative layout (SoA) stores r[n], g[n], b[n], a[n] as separate arrays. Show how MOVDQU loads 16 red bytes at once with SoA.
- If you cannot change the data layout, PSHUFB (byte shuffle) can extract and pack bytes from a 16-byte register. Write pseudocode for extracting 4 red bytes from 16 bytes of AoS pixel data.
- For the grayscale formula Y = 0.299*R + 0.587*G + 0.114*B: why is integer arithmetic preferred over floating-point for byte images?
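A sketch of the PSHUFB idea (note that PSHUFB is SSSE3, one step past SSE2; rmask is a constant I've introduced, and a mask byte with its top bit set, 0x80, zeroes that destination byte):

```nasm
section .data
align 16
rmask: db 0, 4, 8, 12            ; positions of the 4 red bytes in the AoS data
       times 12 db 0x80          ; zero the remaining 12 destination bytes

section .text
    movdqu xmm0, [rsi]           ; 4 AoS pixels = 16 bytes (rsi is my choice)
    pshufb xmm0, [rel rmask]     ; red bytes packed into bytes 0-3 of xmm0
```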
Exercise 15.20 — Security: Constant-Time AES
AES-NI instructions are specifically designed to be constant-time — their execution time does not depend on the key or data values.
- Explain why table-based (T-table) software AES is vulnerable to cache-timing attacks. What information leaks through cache access patterns?
- The following is a simplified software AES SubBytes step:
uint8_t sbox_lookup(uint8_t byte) {
    return sbox[byte]; // table lookup
}
How does this leak the key material to a timing attacker?
- AESENC eliminates this attack. Explain why: what property of the instruction prevents timing side channels?
- A system uses AES-NI for encryption but then does:
if (key[0] == 0) handle_zero_key();
Does this reintroduce a timing channel? What type of vulnerability is this?
- Name one other x86-64 instruction (outside of AES-NI) that is specified to run in constant time regardless of inputs, and explain its security relevance.