Case Study 15.1: SIMD Image Processing — Grayscale Conversion

The Problem

Converting a color image to grayscale is a classic image processing operation. The standard luminance formula is:

Y = 0.299*R + 0.587*G + 0.114*B

For an 8-bit image (each channel 0-255), the naive scalar implementation processes one pixel at a time. A 1920×1080 image contains 2,073,600 pixels — at one pixel per loop iteration, that is over two million iterations. A 128-bit SIMD register covers 16 bytes per instruction, so byte-wide arithmetic offers up to a 16× theoretical speedup over scalar code (in practice, RGBA input means 4 pixels per 128-bit load).

Data Layout

The input image is stored in RGBA format: 4 bytes per pixel (R, G, B, A) packed consecutively. The output is a single byte per pixel (the Y value).

Memory layout per pixel:
Byte 0: R (red)
Byte 1: G (green)
Byte 2: B (blue)
Byte 3: A (alpha, ignored for grayscale)

For 4 consecutive pixels at address [base]:
Offset:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Data:   R0  G0  B0  A0  R1  G1  B1  A1  R2  G2  B2  A2  R3  G3  B3  A3

A 128-bit XMM register holds 16 bytes — exactly 4 RGBA pixels.

Approach 1: Scalar (Baseline)

; void grayscale_scalar(uint8_t *dst, const uint32_t *src, int n)
; rdi = dst (output: one byte per pixel)
; rsi = src (input: 4 bytes per pixel, RGBA)
; rdx = n (number of pixels)

section .data
; Fixed-point coefficients scaled by 256 (to avoid floating-point):
; 0.299 * 256 =  76.544 → round to  77
; 0.587 * 256 = 150.272 → round to 150
; 0.114 * 256 =  29.184 → round to  29
; Total: 77 + 150 + 29 = 256 (exact — no bias)
; (shown for reference — the scalar code below folds these in as immediates)
coeff_r: dq 77
coeff_g: dq 150
coeff_b: dq 29

section .text
global grayscale_scalar

grayscale_scalar:
    test  rdx, rdx
    jz    .done
    xor   rcx, rcx           ; pixel index

.pixel_loop:
    movzx eax, byte [rsi + rcx*4]      ; R
    movzx r8d, byte [rsi + rcx*4 + 1]  ; G
    movzx r9d, byte [rsi + rcx*4 + 2]  ; B

    ; Y = (77*R + 150*G + 29*B) / 256
    imul  eax, eax, 77
    imul  r8d, r8d, 150
    imul  r9d, r9d, 29
    add   eax, r8d
    add   eax, r9d
    shr   eax, 8             ; divide by 256

    mov   [rdi + rcx], al    ; store 1 byte

    inc   rcx
    cmp   rcx, rdx
    jb    .pixel_loop

.done:
    ret

Performance: ~8-10 instructions per pixel. For 2 million pixels: ~18 million instructions.
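For reference, the same fixed-point loop in portable C (a sketch; the function name grayscale_scalar_c is ours, chosen to mirror the prototype in the listing). On little-endian x86-64, byte 0 of each 32-bit pixel is R, matching the assembly's addressing:

```c
#include <stdint.h>

/* C equivalent of grayscale_scalar: Y = (77*R + 150*G + 29*B) >> 8.
   Pixels are packed RGBA with R in the least-significant byte. */
void grayscale_scalar_c(uint8_t *dst, const uint32_t *src, int n)
{
    for (int i = 0; i < n; i++) {
        uint32_t px = src[i];
        uint32_t r = px & 0xFF;          /* byte 0 */
        uint32_t g = (px >> 8) & 0xFF;   /* byte 1 */
        uint32_t b = (px >> 16) & 0xFF;  /* byte 2 */
        dst[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
}
```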

Why Integer Arithmetic?

The Y formula uses floating-point coefficients (0.299, 0.587, 0.114). Fixed-point integer arithmetic wins for three reasons:

  1. Throughput: Scalar integer multiply (IMUL r32, r32, imm32) has 1-per-cycle throughput and ~3-cycle latency, roughly on par with MULSS. The scalar gain is modest; the real gain comes from SIMD width.

  2. SIMD byte width: A 128-bit register holds 16 byte values but only 4 single-precision floats. With SSSE3's PMADDUBSW, one instruction performs sixteen 8-bit multiplies and eight adjacent-pair additions.

  3. Scaling: We scale each coefficient by 256, do integer math, then right-shift by 8 (divide by 256). The result is an approximation of the exact formula with a maximum error of 1 gray level out of 256 (~0.4%) — imperceptible to human vision.
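The error bound in the scaling point can be verified exhaustively — all 2^24 RGB combinations take a fraction of a second in C (a standalone check; max_fixed_point_error is our name for it):

```c
/* Exhaustive check of the /256 fixed-point approximation against the
   floating-point luminance formula, over all 2^24 RGB combinations.
   Both sides are truncated toward zero, so the comparison is floor vs floor. */
int max_fixed_point_error(void)
{
    int max_err = 0;
    for (int r = 0; r < 256; r++)
        for (int g = 0; g < 256; g++)
            for (int b = 0; b < 256; b++) {
                int fixed = (77 * r + 150 * g + 29 * b) >> 8;
                int exact = (int)(0.299 * r + 0.587 * g + 0.114 * b);
                int err   = fixed > exact ? fixed - exact : exact - fixed;
                if (err > max_err) max_err = err;
            }
    return max_err;
}
```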

Approach 2: SSE2 (4 Pixels at a Time)

With SSE2, we process 4 RGBA pixels at a time using 16-bit arithmetic (to avoid overflow: 255 × 150 = 38,250, which fits in 16 bits but not 8 bits).

; void grayscale_sse2(uint8_t *dst, const uint32_t *src, int n)
; rdi = dst, rsi = src, rdx = n
; Assumes n is a multiple of 4 (no tail handling). MOVDQU is used below,
; so src need not be 16-byte aligned.

section .data
align 16

; Mask keeping only the low byte of each dword (used for channel extraction):
mask_low_byte: dd 0x000000FF, 0x000000FF, 0x000000FF, 0x000000FF

; 16-bit coefficient broadcast (4 copies each; upper 4 words unused):
coeff_r_v: dw 77,  77,  77,  77,  0, 0, 0, 0
coeff_g_v: dw 150, 150, 150, 150, 0, 0, 0, 0
coeff_b_v: dw 29,  29,  29,  29,  0, 0, 0, 0

section .text
global grayscale_sse2

grayscale_sse2:
    test  rdx, rdx
    jz    .done

    movdqa xmm6, [rel coeff_r_v]    ; xmm6 = [77, 77, 77, 77, 0, 0, 0, 0]
    movdqa xmm7, [rel coeff_g_v]    ; xmm7 = [150, 150, 150, 150, 0, 0, 0, 0]

    xor   rcx, rcx                  ; pixel index (advances by 4)

.pixel_loop:
    ; Load 4 RGBA pixels = 16 bytes
    movdqu xmm0, [rsi + rcx*4]     ; xmm0 = R0 G0 B0 A0 R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3

    ; Extract each channel into 16-bit words. Without SSSE3's PSHUFB, we
    ; shift the channel into the low byte of each dword, mask, then use
    ; PACKSSDW to narrow the four dwords to four words.
    pxor  xmm5, xmm5               ; xmm5 = 0 (zero half for the packs)

    ; R channel (byte 0 of each dword):
    movdqa   xmm1, xmm0
    pand     xmm1, [rel mask_low_byte] ; keep R in the low byte of each dword
    packssdw xmm1, xmm5            ; xmm1 = [R0, R1, R2, R3, 0, 0, 0, 0] (words)

    ; G channel (byte 1 of each dword):
    movdqa   xmm2, xmm0
    psrld    xmm2, 8               ; G now in the low byte of each dword
    pand     xmm2, [rel mask_low_byte]
    packssdw xmm2, xmm5            ; xmm2 = [G0, G1, G2, G3, 0, 0, 0, 0]

    ; B channel (byte 2 of each dword):
    movdqa   xmm3, xmm0
    psrld    xmm3, 16              ; B now in the low byte of each dword
    pand     xmm3, [rel mask_low_byte]
    packssdw xmm3, xmm5            ; xmm3 = [B0, B1, B2, B3, 0, 0, 0, 0]

    ; Multiply by the 16-bit coefficients. PMULLW keeps the low 16 bits of
    ; each product — exact here, since the largest product is 255*150 = 38,250 < 65,536.
    pmullw xmm1, xmm6               ; xmm1 = [77*R0, 77*R1, 77*R2, 77*R3, ...]
    pmullw xmm2, xmm7               ; xmm2 = [150*G0, 150*G1, ...]
    pmullw xmm3, [rel coeff_b_v]   ; xmm3 = [29*B0, 29*B1, ...]

    ; Sum the channels
    paddw  xmm1, xmm2               ; xmm1 = R*77 + G*150
    paddw  xmm1, xmm3               ; xmm1 = R*77 + G*150 + B*29 (= Y*256)

    ; Divide by 256 (right-shift 8 bits)
    psrlw  xmm1, 8                  ; xmm1 = [Y0, Y1, Y2, Y3, 0, 0, 0, 0]

    ; Pack 16-bit → 8-bit (saturating unsigned)
    packuswb xmm1, xmm5             ; xmm1[3:0] = Y0, Y1, Y2, Y3 (as bytes)

    ; Store 4 grayscale bytes
    movd   [rdi + rcx], xmm1        ; store 4 bytes

    add   rcx, 4
    cmp   rcx, rdx
    jb    .pixel_loop

.done:
    ret

Implementation note: The above uses several intermediate steps for clarity. In practice, PSHUFB (SSSE3) makes the channel extraction much cleaner with a single byte-shuffle instruction per channel.
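The word-lane arithmetic of the SSE2 loop can also be modeled in portable C, with each array element standing in for one 16-bit lane (a simulation sketch, not the intrinsics version; grayscale_4px_model is our name):

```c
#include <stdint.h>

/* Model of one SSE2 iteration: four pixels move through the same 16-bit
   lane arithmetic as the PMULLW / PADDW / PSRLW / PACKUSWB sequence. */
void grayscale_4px_model(uint8_t dst[4], const uint32_t src[4])
{
    uint16_t rw[4], gw[4], bw[4], acc[4];

    for (int i = 0; i < 4; i++) {          /* channel extraction (shift + mask + pack) */
        rw[i] = src[i] & 0xFF;
        gw[i] = (src[i] >> 8) & 0xFF;
        bw[i] = (src[i] >> 16) & 0xFF;
    }
    for (int i = 0; i < 4; i++) {
        acc[i] = (uint16_t)(rw[i] * 77);               /* PMULLW: low 16 bits */
        acc[i] = (uint16_t)(acc[i] + (uint16_t)(gw[i] * 150)); /* PADDW */
        acc[i] = (uint16_t)(acc[i] + (uint16_t)(bw[i] * 29));
        acc[i] >>= 8;                      /* PSRLW: divide by 256 */
        dst[i] = (uint8_t)acc[i];          /* PACKUSWB: result already fits a byte */
    }
}
```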

Approach 3: SSSE3 with PSHUFB and PMADDUBSW (4 Pixels, Fewer Instructions)

With SSSE3's PSHUFB, a single byte shuffle per channel group replaces the shift-mask-pack sequences of the SSE2 version: one shuffle gathers the interleaved R,G pairs and another gathers the B bytes, cutting the per-iteration instruction count roughly in half.

The key insight: PMADDUBSW (SSSE3) multiplies 16 unsigned 8-bit values by 16 signed 8-bit values and produces 8 signed 16-bit results, summing adjacent pairs. This enables the weighted sum of two channels in a single instruction.
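PMADDUBSW's semantics can be modeled in a few lines of portable C (a sketch; pmaddubsw_model is our name). Note that the second operand is treated as signed bytes, so a coefficient such as 150 does not fit — stored in a byte it would wrap to -106:

```c
#include <stdint.h>

/* Portable model of PMADDUBSW: a holds UNSIGNED bytes, b holds SIGNED
   bytes; adjacent products are summed into signed 16-bit lanes with
   saturation. Coefficients must therefore fit in [-128, 127]. */
void pmaddubsw_model(int16_t out[8], const uint8_t a[16], const int8_t b[16])
{
    for (int i = 0; i < 8; i++) {
        int32_t sum = (int32_t)a[2*i] * b[2*i] + (int32_t)a[2*i+1] * b[2*i+1];
        if (sum > 32767)  sum = 32767;      /* signed saturation */
        if (sum < -32768) sum = -32768;
        out[i] = (int16_t)sum;
    }
}
```

With 128-scaled coefficients (38, 75) a white R,G pair gives 38*255 + 75*255 = 28815; with a wrapped coefficient of 150 (= -106 as a signed byte), the same pair would go negative.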

; SSSE3 approach for 4 pixels per iteration using PMADDUBSW
; PMADDUBSW(a[u8x16], b[s8x16]) → c[s16x8]
; c[i] = saturate(a[2i]*b[2i] + a[2i+1]*b[2i+1])

section .data
align 16

; PMADDUBSW treats its second operand as SIGNED bytes, so every coefficient
; must fit in [-128, 127] — 150 would wrap to -106. Rescale by 128 instead
; of 256 and shift right by 7 at the end:
;   0.299 * 128 = 38.272 → 38
;   0.587 * 128 = 75.136 → 75
;   0.114 * 128 = 14.592 → 15
;   38 + 75 + 15 = 128 (exact — no bias)
;
; For 4 pixels, shuffle the R,G pairs into the low 8 bytes, and the B bytes
; (paired with zeros) likewise:
; After rg_shuffle: R0 G0 R1 G1 R2 G2 R3 G3 | 0 0 0 0 0 0 0 0
; PMADDUBSW with [38, 75, ...] → [38*R0+75*G0, 38*R1+75*G1, ...]
; After b_shuffle:  B0  0 B1  0 B2  0 B3  0 | 0 0 0 0 0 0 0 0
; PMADDUBSW with [15, 0, ...]  → [15*B0, 15*B1, ...]

rg_shuffle: db 0, 1, 4, 5, 8, 9, 12, 13, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
b_shuffle:  db 2, 0xFF, 6, 0xFF, 10, 0xFF, 14, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF

; Coefficient vectors for PMADDUBSW (signed bytes):
rg_coeff:   db 38, 75, 38, 75, 38, 75, 38, 75,  0, 0, 0, 0, 0, 0, 0, 0
b_coeff:    db 15,  0, 15,  0, 15,  0, 15,  0,  0, 0, 0, 0, 0, 0, 0, 0

section .text
global grayscale_ssse3

; Assumes n > 0 and a multiple of 4 (no tail handling).
grayscale_ssse3:
    movdqa  xmm5, [rel rg_shuffle]
    movdqa  xmm6, [rel b_shuffle]
    movdqa  xmm4, [rel rg_coeff]
    pxor    xmm7, xmm7             ; zero register for the final pack

    xor     rcx, rcx

.loop:
    movdqu  xmm0, [rsi + rcx*4]    ; load 4 RGBA pixels

    ; Shuffle to get R0,G0,R1,G1,R2,G2,R3,G3 in the low 8 bytes
    movdqa  xmm1, xmm0
    pshufb  xmm1, xmm5             ; xmm1[0:7] = R0 G0 R1 G1 R2 G2 R3 G3

    ; Shuffle to get B0,0,B1,0,... in the low 8 bytes
    movdqa  xmm2, xmm0
    pshufb  xmm2, xmm6             ; xmm2[0:7] = B0 0 B1 0 B2 0 B3 0

    ; PMADDUBSW: multiply-add adjacent pairs
    pmaddubsw xmm1, xmm4           ; xmm1 = [38*R0+75*G0, 38*R1+75*G1, ...]  (16-bit)
    pmaddubsw xmm2, [rel b_coeff]  ; xmm2 = [15*B0, 15*B1, ...]  (16-bit)

    ; Sum and shift
    paddw   xmm1, xmm2             ; Y*128 in 16-bit words
    psrlw   xmm1, 7                ; Y (divide by 128)

    ; Pack to bytes
    packuswb xmm1, xmm7            ; Y0,Y1,Y2,Y3 in low 4 bytes

    ; Store
    movd    [rdi + rcx], xmm1

    add     rcx, 4
    cmp     rcx, rdx
    jb      .loop

    ret

Performance Comparison

Implementation               Pixels/Iter   Approx Cycles/Pixel   Throughput (MP/s at 3 GHz)
Scalar (C compiler, -O2)          1              6-8                   ~375-500
SSE2 (4 pixels)                   4              1.5-2                 ~1,500-2,000
SSSE3 PMADDUBSW (4 pixels)        4              0.8-1.2               ~2,500-3,750
AVX2 (8 pixels)                   8              0.2-0.4               ~7,500-15,000

For a 1920×1080 frame at 30 FPS, about 62 million pixels must be converted per second. The scalar version needs roughly 540 million instructions per second for that (18M instructions × 30 frames) — feasible, but a meaningful slice of one core; the SIMD versions leave ample headroom for 60+ FPS alongside the rest of the pipeline.

The AVX2 Version: 8 Pixels Per Iteration

With AVX2, we extend to 256-bit registers and process 8 RGBA pixels (32 bytes) at a time, producing 8 grayscale bytes per iteration. The approach extends directly: VPSHUFB operates on 256-bit registers, treating the register as two independent 128-bit halves.

; Key differences for AVX2 (8 pixels per iteration):
vmovdqu   ymm0, [rsi + rcx*4]      ; load 8 RGBA pixels (32 bytes)
; Apply the same shuffle / VPMADDUBSW / VPADDW / VPSRLW logic with ymm
; registers. After the shift, each 128-bit lane holds 4 results in its
; low 4 words:  lane 0: Y0..Y3, lane 1: Y4..Y7.
; VPACKUSWB packs within each 128-bit lane, not across lanes, so packing
; now would leave the 8 output bytes split between the two lanes.
; Gather all 8 result words into the low lane FIRST with a qword permute:
vpermq    ymm1, ymm1, 0xD8         ; qword order 0,2,1,3 → low lane = Y0..Y7 (words)
vpackuswb ymm1, ymm1, ymm7         ; low 8 bytes = Y0..Y7
vmovq     [rdi + rcx], xmm1        ; store exactly 8 bytes
; ... advance rcx by 8 and loop; VZEROUPPER before returning to SSE/scalar code
vzeroupper

Key Insight: Lane Crossing in AVX2

The most common mistake with AVX2 byte operations is forgetting that many instructions operate independently within each 128-bit half — they do NOT cross the 128-bit lane boundary. VPSHUFB shuffles within each 128-bit half. VPACKUSWB packs within each half. This means the 8 output bytes from 8 pixels end up split: 4 bytes in the low 128 bits and 4 bytes in the high 128 bits, in the wrong positions.

VPERMQ (which CAN cross lanes, operating on 64-bit quadwords) solves this: permute the 16-bit results into the low lane before the pack, so the packed bytes come out contiguous and can be stored in one go.
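VPERMQ's immediate-selector behavior is easy to model in portable C (a sketch; vpermq_model is our name):

```c
#include <stdint.h>

/* Portable model of VPERMQ ymm, ymm, imm8: each 2-bit field of the
   immediate selects a source quadword for the corresponding destination
   quadword. 0xD8 = 11 01 10 00 → destination order q0, q2, q1, q3. */
void vpermq_model(uint64_t dst[4], const uint64_t src[4], uint8_t imm)
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[(imm >> (2 * i)) & 3];
}
```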

Practical Lesson

SIMD image processing requires understanding both:

  1. Data layout — AoS (RGBA RGBA...) vs. SoA (RRR... GGG... BBB...). SoA is almost always better for SIMD; AoS requires shuffles to extract channels.

  2. Lane behavior — AVX2 operates as two independent 128-bit halves for most byte/word instructions. Always check the manual: does your instruction cross lanes or not?
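The AoS-vs-SoA point can be made concrete with a plain-C deinterleave (a sketch; rgba_to_planar is our name):

```c
#include <stdint.h>

/* AoS → SoA: deinterleave packed RGBA into separate channel planes.
   With planar data, each 128-bit SIMD load covers 16 pixels of one
   channel and no extraction shuffles are needed. */
void rgba_to_planar(uint8_t *r, uint8_t *g, uint8_t *b,
                    const uint8_t *rgba, int n)
{
    for (int i = 0; i < n; i++) {
        r[i] = rgba[4*i + 0];
        g[i] = rgba[4*i + 1];
        b[i] = rgba[4*i + 2];   /* alpha (byte 3) is dropped */
    }
}
```

Whether the one-time conversion pays off depends on how many passes run over the planar data afterward.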

The SSSE3 PMADDUBSW instruction is particularly elegant for this problem: it simultaneously multiplies and sums two pairs of channels per output element, halving the instruction count compared to separate multiply-then-add.