Case Study 15.1: SIMD Image Processing — Grayscale Conversion
The Problem
Converting a color image to grayscale is a classic image processing operation. The standard luminance formula is:
Y = 0.299*R + 0.587*G + 0.114*B
For an 8-bit image (each channel 0-255), the naive scalar implementation processes one pixel at a time. A 1920×1080 image contains 2,073,600 pixels, so that is over two million loop iterations per frame. A single 128-bit SIMD instruction, by contrast, operates on 16 bytes (4 RGBA pixels) at once, and AVX2 doubles the width again, so speedups approaching an order of magnitude over scalar code are realistic.
Data Layout
The input image is stored in RGBA format: 4 bytes per pixel (R, G, B, A) packed consecutively. The output is a single byte per pixel (the Y value).
Memory layout per pixel:
Byte 0: R (red)
Byte 1: G (green)
Byte 2: B (blue)
Byte 3: A (alpha, ignored for grayscale)
For 4 consecutive pixels at address [base]:
Offset: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data: R0 G0 B0 A0 R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3
A 128-bit XMM register holds 16 bytes — exactly 4 RGBA pixels.
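One consequence of this layout is worth spelling out: on a little-endian machine, loading a pixel as a 32-bit word puts R in the low byte. The SIMD code below leans on exactly this when it shifts dwords to isolate a channel. A small C sketch (helper names are illustrative, not part of the routines below):

```c
#include <stdint.h>
#include <assert.h>

/* Channel extraction from one packed RGBA pixel loaded little-endian:
   byte 0 (R) occupies bits 0-7 of the 32-bit word, G bits 8-15, B bits 16-23. */
static inline uint8_t rgba_r(uint32_t px) { return (uint8_t)(px & 0xFF); }
static inline uint8_t rgba_g(uint32_t px) { return (uint8_t)((px >> 8) & 0xFF); }
static inline uint8_t rgba_b(uint32_t px) { return (uint8_t)((px >> 16) & 0xFF); }
```

The dword shifts in the SSE2 version perform these same extractions, four pixels at a time.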
Approach 1: Scalar (Baseline)
; void grayscale_scalar(uint8_t *dst, const uint32_t *src, int n)
; rdi = dst (output: one byte per pixel)
; rsi = src (input: 4 bytes per pixel, RGBA)
; rdx = n (number of pixels)
; Fixed-point coefficients, scaled by 256 to avoid floating-point:
; 0.299 * 256 = 76.544 → round to 77
; 0.587 * 256 = 150.272 → round to 150
; 0.114 * 256 = 29.184 → round to 29
; Total: 77 + 150 + 29 = 256 (exact, so no systematic bias)
; The scalar version embeds these directly as immediates, so no data
; storage is needed.
section .text
global grayscale_scalar
grayscale_scalar:
test rdx, rdx
jz .done
xor rcx, rcx ; pixel index
.pixel_loop:
movzx eax, byte [rsi + rcx*4] ; R
movzx r8d, byte [rsi + rcx*4 + 1] ; G
movzx r9d, byte [rsi + rcx*4 + 2] ; B
; Y = (77*R + 150*G + 29*B) / 256
imul eax, eax, 77
imul r8d, r8d, 150
imul r9d, r9d, 29
add eax, r8d
add eax, r9d
shr eax, 8 ; divide by 256
mov [rdi + rcx], al ; store 1 byte
inc rcx
cmp rcx, rdx
jb .pixel_loop
.done:
ret
Performance: ~8-10 instructions per pixel. For 2 million pixels: ~18 million instructions.
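For reference, here is a C rendering of the same routine, matching the prototype in the header comment. This is a sketch of the equivalent logic (the function name is illustrative), not a disassembly:

```c
#include <stdint.h>
#include <assert.h>

/* C equivalent of grayscale_scalar:
   Y = (77*R + 150*G + 29*B) >> 8, one output byte per pixel. */
void grayscale_scalar_c(uint8_t *dst, const uint32_t *src, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t px = src[i];            /* little-endian RGBA load */
        uint32_t r = px & 0xFF;
        uint32_t g = (px >> 8) & 0xFF;
        uint32_t b = (px >> 16) & 0xFF;  /* alpha (bits 24-31) ignored */
        dst[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
}
```

A compiler at -O2 turns this into essentially the instruction sequence shown above.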
Why Integer Arithmetic?
The Y formula uses floating-point coefficients (0.299, 0.587, 0.114), yet the implementations here use integers, for three reasons:
1. Scalar throughput: integer multiply (IMUL r32, r32, imm8) and scalar float multiply (MULSS) have comparable throughput on modern cores, so scalar code gains little either way; the real payoff is in SIMD width.
2. SIMD element width: with byte and word elements, one 128-bit register covers 4 complete RGBA pixels, and SSSE3's PMADDUBSW can multiply-add sixteen 8-bit values at once. With 32-bit floats, the same register holds only 4 values, a single channel of 4 pixels, and every channel would first have to be converted to float.
3. Fixed-point scaling: multiply each coefficient by 256, do integer math, then right-shift by 8 (divide by 256). The result approximates the exact formula with a maximum error of 1 out of 256 (~0.4%), imperceptible to human vision.
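The error bound in point 3 can be checked exhaustively. The C sketch below (function name is illustrative) compares the fixed-point result against the truncated floating-point value over all 2^24 RGB triples; it runs in well under a second at -O2:

```c
#include <stdint.h>
#include <assert.h>

/* Worst-case difference between the 77/150/29 fixed-point approximation
   and the truncated exact luminance, over every possible RGB triple. */
int max_luma_error(void)
{
    int worst = 0;
    for (int r = 0; r < 256; r++)
        for (int g = 0; g < 256; g++)
            for (int b = 0; b < 256; b++) {
                int fixed = (77 * r + 150 * g + 29 * b) >> 8;
                int exact = (int)(0.299 * r + 0.587 * g + 0.114 * b);
                int err = fixed > exact ? fixed - exact : exact - fixed;
                if (err > worst) worst = err;
            }
    return worst;
}
```

The maximum deviation of the scaled coefficients is under 0.46 of one level per triple, so the truncated results can differ by at most 1.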
Approach 2: SSE2 (4 Pixels at a Time)
With SSE2, we process 4 RGBA pixels at a time using 16-bit arithmetic (to avoid overflow: 255 × 150 = 38,250, which fits in 16 bits but not 8 bits).
; void grayscale_sse2(uint8_t *dst, const uint32_t *src, int n)
; rdi = dst, rsi = src, rdx = n
; Assume n is a multiple of 4 (a scalar tail loop handles any remainder)
section .data
align 16
; Mask that keeps only the low byte of each dword (used to isolate a
; channel after it has been shifted down to byte 0 of its dword):
mask_low_byte: times 4 dd 0x000000FF
; 16-bit coefficient broadcast (4 copies each):
coeff_r_v: dw 77, 77, 77, 77, 0, 0, 0, 0
coeff_g_v: dw 150, 150, 150, 150, 0, 0, 0, 0
coeff_b_v: dw 29, 29, 29, 29, 0, 0, 0, 0
section .text
global grayscale_sse2
grayscale_sse2:
test rdx, rdx
jz .done
movdqa xmm6, [rel coeff_r_v] ; xmm6 = [77, 77, 77, 77, 0, 0, 0, 0]
movdqa xmm7, [rel coeff_g_v] ; xmm7 = [150, 150, 150, 150, 0, 0, 0, 0]
xor rcx, rcx ; pixel index (advances by 4)
.pixel_loop:
; Load 4 RGBA pixels = 16 bytes
movdqu xmm0, [rsi + rcx*4] ; xmm0 = R0 G0 B0 A0 R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3
; Extract each channel into 16-bit words. With SSSE3, a single PSHUFB per
; channel would do this directly; here we stay within SSE2 using dword
; shifts, a mask, and PACKSSDW (signed dword → word pack, safe for 0-255).
pxor xmm5, xmm5 ; xmm5 = 0
; R channel (byte 0 of each dword):
movdqa xmm1, xmm0
pslld xmm1, 24 ; shift R to the top of each dword
psrld xmm1, 24 ; shift back down: each dword now holds only R
packssdw xmm1, xmm5 ; 4 dwords → 4 words: xmm1 = [R0, R1, R2, R3, 0, 0, 0, 0]
; G channel (byte 1):
movdqa xmm2, xmm0
psrld xmm2, 8 ; G is now at byte 0 of each dword
pand xmm2, [rel mask_low_byte] ; keep only the low byte of each dword
packssdw xmm2, xmm5 ; xmm2 = [G0, G1, G2, G3, 0, 0, 0, 0]
; B channel (byte 2):
movdqa xmm3, xmm0
psrld xmm3, 16
pand xmm3, [rel mask_low_byte]
packssdw xmm3, xmm5 ; xmm3 = [B0, B1, B2, B3, 0, 0, 0, 0]
; Multiply by coefficients. PMULLW multiplies 16-bit words and keeps the
; low 16 bits of each product, which is safe here: the largest product is
; 255*150 = 38,250, and even the full sum 255*256 = 65,280 fits in 16 bits.
pmullw xmm1, xmm6 ; xmm1 = [77*R0, 77*R1, 77*R2, 77*R3, ...]
pmullw xmm2, xmm7 ; xmm2 = [150*G0, 150*G1, ...]
pmullw xmm3, [rel coeff_b_v] ; xmm3 = [29*B0, 29*B1, ...]
; Sum the channels
paddw xmm1, xmm2 ; xmm1 = R*77 + G*150
paddw xmm1, xmm3 ; xmm1 = R*77 + G*150 + B*29 (= Y*256)
; Divide by 256 (right-shift 8 bits)
psrlw xmm1, 8 ; xmm1 = [Y0, Y1, Y2, Y3, 0, 0, 0, 0]
; Pack 16-bit → 8-bit (saturating unsigned)
packuswb xmm1, xmm5 ; xmm1[3:0] = Y0, Y1, Y2, Y3 (as bytes)
; Store 4 grayscale bytes
movd [rdi + rcx], xmm1 ; store 4 bytes
add rcx, 4
cmp rcx, rdx
jb .pixel_loop
.done:
ret
Implementation note: the code above takes several intermediate steps for clarity. In practice, PSHUFB (SSSE3) replaces each shift/mask/pack sequence with a single byte shuffle per channel.
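Before reaching for PSHUFB, it helps to see the SSE2 pipeline as scalar C. The sketch below (function name is illustrative) mirrors one iteration, one lane at a time, with every intermediate forced through uint16_t to show that the 16-bit vector math never overflows:

```c
#include <stdint.h>
#include <assert.h>

/* Scalar model of one SSE2 iteration: the same channel extraction,
   16-bit multiplies (PMULLW keeps the low 16 bits), word adds, and >>8
   that the vector code applies to 4 pixels at once. */
void grayscale_4px_model(uint8_t dst[4], const uint32_t src[4])
{
    for (int i = 0; i < 4; i++) {
        uint16_t r = (uint16_t)(src[i] & 0xFF);          /* pslld/psrld 24 */
        uint16_t g = (uint16_t)((src[i] >> 8) & 0xFF);   /* psrld 8 + pand */
        uint16_t b = (uint16_t)((src[i] >> 16) & 0xFF);  /* psrld 16 + pand */
        uint16_t y = (uint16_t)(r * 77 + g * 150 + b * 29); /* pmullw + paddw */
        dst[i] = (uint8_t)(y >> 8);                      /* psrlw + packuswb */
    }
}
```

The worst-case sum is 255*(77+150+29) = 65,280, which is why the 16-bit word lanes suffice.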
Approach 3: SSSE3 with PSHUFB and PMADDUBSW (4 Pixels, Fewer Instructions)
With SSSE3's PSHUFB, we can gather the R and G bytes (and, separately, the B bytes) of 4 pixels with a single shuffle each, and PMADDUBSW folds the multiply and the first add into one instruction, cutting the per-iteration instruction count roughly in half compared to the SSE2 version.
The key insight: PMADDUBSW (SSSE3) multiplies 16 unsigned 8-bit values by 16 signed 8-bit values and produces 8 signed 16-bit results, summing adjacent pairs. This enables the weighted sum of two channels in a single instruction.
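One subtlety: the second PMADDUBSW operand is signed, so a weight above 127 (such as the 256-scaled green coefficient 150) cannot be encoded directly. A common workaround is to scale the weights by 128 instead (38, 75, 15, which sum to 128) and shift right by 7. The C sketch below models the instruction's per-pair semantics with those 128-scaled weights (function names are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Model of one PMADDUBSW output word: a-values are unsigned 8-bit,
   b-values are SIGNED 8-bit; products are summed and saturated to s16. */
static int16_t pmaddubsw_pair(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1)
{
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (sum > 32767) sum = 32767;    /* saturate like the hardware */
    if (sum < -32768) sum = -32768;
    return (int16_t)sum;
}

/* One pixel through the SSSE3 pipeline: Y = (38*R + 75*G + 15*B) >> 7.
   Max R,G sum is 255*(38+75) = 28,815, so no saturation ever occurs. */
static uint8_t luma_pmadd(uint8_t r, uint8_t g, uint8_t b)
{
    int16_t rg = pmaddubsw_pair(r, g, 38, 75);
    int16_t bz = pmaddubsw_pair(b, 0, 15, 0);
    return (uint8_t)((rg + bz) >> 7);
}
```

Halving the coefficient precision costs almost nothing: the worst-case error versus the exact formula remains about one gray level.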
; SSSE3 approach for 4 pixels per iteration using PMADDUBSW
; PMADDUBSW(a[u8x16], b[s8x16]) → c[s16x8]
; c[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1] (saturated to signed 16-bit)
section .data
align 16
; For 4 pixels, interleave R,G pairs followed by B,0 pairs:
; Input register after shuffle: R0 G0 R1 G1 R2 G2 R3 G3 | B0 0 B1 0 B2 0 B3 0
; PMADDUBSW with [38, 75, 38, 75, ...]: → [38*R0+75*G0, 38*R1+75*G1, ...]
; PMADDUBSW with [15, 0, 15, 0, ...]: → [15*B0, 15*B1, ...]
; Note: PMADDUBSW's second operand is SIGNED 8-bit, so the 256-scaled
; coefficient 150 would not fit. We scale by 128 instead (38 + 75 + 15 = 128)
; and divide by 128 at the end. Max sum is 128*255 = 32,640: no saturation.
rg_shuffle: db 0, 1, 4, 5, 8, 9, 12, 13, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
b_shuffle: db 2, 0xFF, 6, 0xFF, 10, 0xFF, 14, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
; Coefficient vectors for PMADDUBSW (scaled by 128):
rg_coeff: db 38, 75, 38, 75, 38, 75, 38, 75, 0, 0, 0, 0, 0, 0, 0, 0
b_coeff: db 15, 0, 15, 0, 15, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0
section .text
global grayscale_ssse3
grayscale_ssse3:
test rdx, rdx
jz .done
movdqa xmm5, [rel rg_shuffle]
movdqa xmm6, [rel b_shuffle]
movdqa xmm4, [rel rg_coeff]
pxor xmm7, xmm7
xor rcx, rcx
.loop:
movdqu xmm0, [rsi + rcx*4] ; load 4 RGBA pixels
; Shuffle to get R0,G0,R1,G1,R2,G2,R3,G3 in low 8 bytes
movdqa xmm1, xmm0
pshufb xmm1, xmm5 ; xmm1[0:7] = R0 G0 R1 G1 R2 G2 R3 G3
; Shuffle to get B0,0,B1,0,... in low 8 bytes
movdqa xmm2, xmm0
pshufb xmm2, xmm6 ; xmm2[0:7] = B0 0 B1 0 B2 0 B3 0
; PMADDUBSW: multiply-add pairs
pmaddubsw xmm1, xmm4 ; xmm1 = [38*R0+75*G0, 38*R1+75*G1, ...] (16-bit)
pmaddubsw xmm2, [rel b_coeff] ; xmm2 = [15*B0, 15*B1, ...] (16-bit)
; Sum and shift
paddw xmm1, xmm2 ; Y*128 in 16-bit words
psrlw xmm1, 7 ; Y (divide by 128)
; Pack to bytes
packuswb xmm1, xmm7 ; Y0,Y1,Y2,Y3 in low 4 bytes
; Store
movd [rdi + rcx], xmm1
add rcx, 4
cmp rcx, rdx
jb .loop
.done:
ret
Performance Comparison
| Implementation | Pixels/Iteration | Approx Cycles/Pixel | Throughput (MP/s at 3 GHz) |
|---|---|---|---|
| Scalar (C compiler, -O2) | 1 | 6-8 | ~375-500 |
| SSE2 (4 pixels) | 4 | 1.5-2 | ~1,500-2,000 |
| SSSE3 PMADDUBSW (4 pixels) | 4 | 0.8-1.2 | ~2,500-3,750 |
| AVX2 (8 pixels) | 8 | 0.2-0.4 | ~7,500-15,000 |
For a 1920×1080 frame at 30 FPS, the pipeline must sustain about 62 million pixels per second (2,073,600 × 30). Even the scalar version clears that bar on a single core; the SIMD versions matter because grayscale conversion is rarely the only stage, and the headroom they leave funds the rest of the pipeline. AVX2 sustains well past 60 FPS with cycles to spare.
The AVX2 Version: 8 Pixels Per Iteration
With AVX2, we extend to 256-bit registers and process 8 RGBA pixels (32 bytes) at a time, producing 8 grayscale bytes per iteration. The approach extends directly: VPSHUFB operates on 256-bit registers, treating the register as two independent 128-bit halves.
; Key difference for AVX2:
vmovdqu ymm0, [rsi + rcx*4] ; load 8 RGBA pixels (32 bytes)
; Apply the same shuffle / multiply-add / shift logic with ymm registers:
; VPSHUFB ymm1, ymm0, ymm5 ; shuffles within each 128-bit half
; VPMADDUBSW ymm1, ymm1, ymm4 ; 16 multiply-add pairs simultaneously
; VPSRLW ymm1, ymm1, imm ; same shift as the 128-bit version
; VPACKUSWB ymm1, ymm1, ymm7
; NOTE: VPACKUSWB in AVX2 packs within each 128-bit lane, not across lanes,
; so the 8 output bytes end up split between the two halves:
; Bytes: Y0 Y1 Y2 Y3 0 0 0 0 | eight zeros | Y4 Y5 Y6 Y7 0 0 0 0 | eight zeros
; Y0-Y3 occupy dword 0 and Y4-Y7 occupy dword 4, so a 64-bit permute (VPERMQ)
; is too coarse to make them adjacent. Use the lane-crossing dword permute
; VPERMD, with ymm2 preloaded with the index vector [0, 4, 0, 0, 0, 0, 0, 0]:
vpermd ymm1, ymm2, ymm1 ; dwords 0 and 4 → result dwords 0 and 1
vmovq [rdi + rcx], xmm1 ; store 8 contiguous grayscale bytes
vzeroupper ; before returning to code that may use legacy SSE
Key Insight: Lane Crossing in AVX2
The most common mistake with AVX2 byte operations is forgetting that many instructions operate independently within each 128-bit half — they do NOT cross the 128-bit lane boundary. VPSHUFB shuffles within each 128-bit half. VPACKUSWB packs within each half. This means the 8 output bytes from 8 pixels end up split: 4 bytes in the low 128 bits and 4 bytes in the high 128 bits, in the wrong positions.
A lane-crossing permute is needed to consolidate the output into contiguous bytes before storing. VPERMQ does cross lanes, but it moves 64-bit quadwords, which is too coarse here: each half's result is only 4 bytes. VPERMD, which permutes 32-bit dwords across the whole register, does the job (an alternative is VEXTRACTI128 followed by two 32-bit stores).
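The split is easy to see in a scalar model of VPACKUSWB-against-zero: each 128-bit lane packs its own eight words and then appends eight zero bytes from the zero operand, so the high lane's results start at byte offset 16 rather than byte 4 (function name is illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Model of AVX2 VPACKUSWB with a zero second operand: the two 128-bit
   lanes are packed independently. Lane k's eight words land at bytes
   16k..16k+7 (saturated to 0-255), followed by eight packed zeros. */
void vpackuswb_zero_model(uint8_t out[32], const uint16_t words[16])
{
    for (int lane = 0; lane < 2; lane++) {
        for (int i = 0; i < 8; i++) {
            uint16_t w = words[lane * 8 + i];
            out[lane * 16 + i] = (uint8_t)(w > 255 ? 255 : w);
        }
        for (int i = 8; i < 16; i++)
            out[lane * 16 + i] = 0; /* bytes packed from the zero operand */
    }
}
```

With the grayscale words in positions 0-3 of each lane, the model puts Y0-Y3 at bytes 0-3 and Y4-Y7 at bytes 16-19, exactly the non-contiguous layout the dword permute has to repair.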
Practical Lesson
SIMD image processing requires understanding both:
1. Data layout: AoS (RGBA RGBA...) vs. SoA (RRR... GGG... BBB...). SoA is almost always better for SIMD; AoS requires shuffles to extract channels.
2. Lane behavior: AVX2 operates in two independent 128-bit halves for most byte/word instructions. Always check the manual: does your instruction cross lanes or not?
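When the surrounding pipeline permits a layout change, the shuffle problem disappears at the source. A sketch of the AoS-to-SoA deinterleave, done once up front (e.g. at decode time; function name is illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* AoS → SoA: deinterleave packed RGBA pixels into separate R, G, B planes.
   Once in planar form, a SIMD grayscale kernel loads each channel
   directly into a register with no per-iteration shuffles at all. */
void rgba_to_planes(uint8_t *r, uint8_t *g, uint8_t *b,
                    const uint32_t *src, int n)
{
    for (int i = 0; i < n; i++) {
        r[i] = (uint8_t)(src[i] & 0xFF);
        g[i] = (uint8_t)((src[i] >> 8) & 0xFF);
        b[i] = (uint8_t)((src[i] >> 16) & 0xFF);
    }
}
```

The conversion itself costs one pass over the data, so it pays off when several SIMD stages consume the planar form afterward.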
The SSSE3 PMADDUBSW instruction is particularly elegant for this problem: it simultaneously multiplies and sums two pairs of channels per output element, halving the instruction count compared to separate multiply-then-add.