Chapter 18 Exercises: ARM64 Programming
Exercise 1: Array Access with Shifts
Write ARM64 code to access element arr[i] for each array type. Assume X0 = arr (base address), W1 = i (index). Load the element into X2 (or W2 for smaller types).
a) int8_t arr[] (1-byte elements)
b) int16_t arr[] (2-byte elements)
c) int32_t arr[] (4-byte elements)
d) int64_t arr[] (8-byte elements)
e) double arr[] (8-byte elements, load into D0)
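The addressing rule behind all five parts is the same: the byte address of arr[i] is the base plus the index shifted left by log2 of the element size. The C sketch below is only a model to check your address arithmetic against. The name element_address is hypothetical, and it assumes an unsigned index that is zero-extended from 32 to 64 bits; a signed index would be sign-extended instead.

```c
#include <stdint.h>

/* Hypothetical helper: byte address of element i when each element
 * occupies (1 << shift) bytes. shift is 0, 1, 2, or 3 for 1-, 2-,
 * 4-, and 8-byte elements. */
static uintptr_t element_address(uintptr_t base, uint32_t i, unsigned shift)
{
    /* The index arrives in a 32-bit register in the exercise, so it is
     * widened before the shift (assumption: unsigned index). */
    return base + ((uintptr_t)i << shift);
}
```

For example, element_address((uintptr_t)arr, 5, 2) equals the address of arr[5] when arr holds int32_t values.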
Exercise 2: memset Implementation
Implement my_memset(void *dest, int value, size_t n) that:
1. Fills one byte at a time (simple version)
2. Is then optimized to fill 8 bytes at a time using STR with an X register (for large n)
For the optimized version, what do you do with the leading bytes (if any) before the first 8-byte-aligned address?
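One common way to structure the optimized version is head, body, tail: byte stores up to the first aligned address, 8-byte stores while at least 8 bytes remain, then byte stores for the rest. The C sketch below is a reference model under that assumption, not the only valid design. The name my_memset_model is illustrative, and memcpy of 8 bytes stands in for the 64-bit store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference model for a head/body/tail fill (illustrative name). */
static void *my_memset_model(void *dest, int value, size_t n)
{
    unsigned char *p = dest;
    unsigned char byte = (unsigned char)value;

    /* Replicate the byte into all 8 byte positions of a 64-bit pattern. */
    uint64_t pattern = (uint64_t)byte * 0x0101010101010101ULL;

    /* Head: byte stores until p is 8-byte aligned (or n runs out). */
    while (n > 0 && ((uintptr_t)p & 7) != 0) {
        *p++ = byte;
        n--;
    }
    /* Body: one 8-byte store per iteration. memcpy models the store
     * and sidesteps strict-aliasing issues in C. */
    while (n >= 8) {
        memcpy(p, &pattern, 8);
        p += 8;
        n -= 8;
    }
    /* Tail: the final 0-7 bytes. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dest;
}
```

Comparing your assembly against this model on unaligned start addresses and lengths below 8 exercises the head and tail paths, which are the easiest to get wrong.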
Exercise 3: String Length Using LDRB
Implement my_strlen(const char *s) in ARM64 assembly using LDRB and CBZ. Then trace your implementation on the string "ARM64" to verify the output.
Bonus: Rewrite using the post-indexed form LDRB W2, [X0], #1 instead of explicit index tracking.
Exercise 4: Floating-Point Operations
Write ARM64 scalar floating-point code for each expression. Use D registers (double precision).
a) double result = a * b + c; (use FMADD for a single-rounding fused operation)
b) double result = sqrt(x*x + y*y); (2D Euclidean distance)
c) Convert an int64_t in X0 to a double in D0
d) Convert a double in D0 to an int64_t in X0 (truncate)
e) Copy the bit pattern of double D0 into integer register X0 without conversion
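Parts (c) through (e) differ in whether the value or the bits are preserved. The C sketch below models the intended semantics so you can check register contents in a debugger. The function names are illustrative. The truncating conversion assumes an in-range, non-NaN input, because C leaves the out-of-range case undefined.

```c
#include <stdint.h>
#include <string.h>

/* (c) Value conversion: signed 64-bit integer to double. */
static double int_to_double(int64_t x)
{
    return (double)x;
}

/* (d) Value conversion: double to signed 64-bit integer, rounding
 * toward zero. Assumes d is in range and not NaN. */
static int64_t double_to_int_trunc(double d)
{
    return (int64_t)d;
}

/* (e) Bit copy: the same 64 bits reinterpreted as an integer,
 * with no numeric conversion. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

For instance, double_to_int_trunc(-2.7) gives -2, while double_bits(1.0) gives 0x3FF0000000000000, the IEEE 754 encoding of 1.0.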
Exercise 5: NEON SIMD Basics
a) Write ARM64 NEON code to add two arrays of 4 int32_t values each (4-element vectors):
- V0.4S = {a0, a1, a2, a3}
- V1.4S = {b0, b1, b2, b3}
- Compute V2.4S = V0 + V1
b) Write code to compute the absolute values of 8 int16_t values in V0.8H.
c) Write code to find the maximum of 4 float32 values in V0.4S, storing result in S0.
d) What NEON instruction would you use to multiply all 4 float32 values in V0 by the scalar value in S1 (element 0)?
Exercise 6: NEON Array Operations
Extend the sum_float_neon function from the chapter to handle arrays that are not multiples of 4 elements. After the main NEON loop, add a tail loop that handles the remaining 0-3 elements using scalar FADD.
Exercise 7: Linux System Calls
Write a complete ARM64 Linux assembly program that:
- Uses the getpid syscall (number 172) to get the current PID
- Stores the PID in a memory variable
- Uses write to print "PID: " followed by the PID as a decimal number
- Exits normally
(Hint: You'll need to convert an integer to its ASCII decimal representation. Implement a simple int-to-string conversion using UDIV/MSUB.)
Exercise 8: argv Parsing
At program entry on Linux ARM64, the stack contains:
[SP+0] = argc (int64)
[SP+8] = argv[0] (char* pointer)
[SP+16] = argv[1] (char* pointer, or NULL if no arguments)
...
Write a _start that:
a) Loads argc into W0
b) Loads argv[0] (the program name) into X1
c) Loads argv[1] (first argument) into X2 (checking if it exists)
d) If no argument was provided (argc < 2), exits with status 1
e) Otherwise, passes argv[1] to strlen (which you also implement)
Exercise 9: NEON String Search
Using the NEON CMEQ instruction, write a function that searches for a null byte in a 16-byte-aligned string buffer, 16 bytes at a time. Return the index of the first null byte.
// find_null: X0 = string (16-byte aligned), returns X0 = index of '\0'
Use:
- LDR Q0, [X1], #16 to load 16 bytes (X1 holds a working copy of the pointer passed in X0)
- MOVI V1.16B, #0 to set up the comparison target
- CMEQ V0.16B, V0.16B, V1.16B to compare (produces 0xFF where equal)
- UMAXV B2, V0.16B to reduce (find if any byte is 0xFF)
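The instructions listed above tell you whether a block contains a null byte; the function still has to turn that into an index. The C sketch below models the whole search one 16-byte block at a time. The name find_null_model is illustrative, and the model assumes the buffer can be read in whole 16-byte blocks up to and including the block that holds the terminator.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference model of a block-wise null search (illustrative name). */
static size_t find_null_model(const char *s)
{
    size_t base = 0;

    for (;;) {
        uint8_t mask[16];   /* per-byte compare result: 0xFF or 0x00 */
        uint8_t any = 0;    /* maximum over the 16 mask bytes        */

        for (int lane = 0; lane < 16; lane++) {
            mask[lane] = (s[base + lane] == '\0') ? 0xFF : 0x00;
            if (mask[lane] > any)
                any = mask[lane];
        }

        if (any != 0) {
            /* This block contains a null byte: find the first lane. */
            for (int lane = 0; lane < 16; lane++)
                if (mask[lane] == 0xFF)
                    return base + (size_t)lane;
        }
        base += 16;         /* no match: move to the next block */
    }
}
```

The inner lane scan is the part your assembly has to solve in its own way once the reduction reports a hit, for example with a short byte loop over the matching block.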
Exercise 10: Apple Silicon vs. Linux
For each Linux ARM64 assembly snippet, rewrite it for macOS ARM64 (Apple Silicon). State what changes are required.
a) Linux write syscall:
MOV X8, #64
MOV X0, #1
ADR X1, msg
MOV X2, #len
SVC #0
b) Linux exit syscall:
MOV X8, #93
MOV X0, #0
SVC #0
c) What is the macOS equivalent of Linux's .section .rodata?
d) What command would you use to assemble and link an ARM64 program on macOS (assuming Xcode command line tools are installed)?
Exercise 11: FP Register Conventions
a) Which NEON/FP registers must a function preserve if it modifies them (callee-saved in AAPCS64)?
b) For callee-saved FP registers, how many bytes of each register must actually be preserved?
c) What instructions would you use to save D8 and D9 to the stack?
d) If V8 is used as a full 128-bit NEON register, what must be saved and restored?
Exercise 12: Performance Comparison
Given an array of 1000 float32 values:
a) Estimate how many loop iterations are needed for:
- Scalar FADD loop (1 float per iteration)
- NEON FADD V.4S loop (4 floats per iteration)
- NEON FADD V.4S with 2× loop unrolling (8 floats per iteration)
b) If each iteration takes 1 clock cycle, what is the approximate speedup of each NEON approach over the scalar loop?
c) What is the bottleneck for very large arrays? (Hint: memory bandwidth)
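The iteration counts in part (a) are ceiling divisions of the element count by the elements handled per iteration. The small C sketch below (the name iterations is illustrative) lets you check your estimates, including array lengths that are not a multiple of the vector width.

```c
#include <stddef.h>

/* Iterations needed to cover n elements at per_iter elements per
 * iteration, rounding up so a partial final group still counts.
 * Assumes per_iter is greater than zero. */
static size_t iterations(size_t n, size_t per_iter)
{
    return (n + per_iter - 1) / per_iter;
}
```

Under the one-cycle-per-iteration assumption in part (b), the speedup is the ratio of two such counts.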
Exercise 13: LDP/STP Optimization
Given this naive register save sequence:
STR X19, [SP, #-8]!
STR X20, [SP, #-8]!
STR X21, [SP, #-8]!
STR X22, [SP, #-8]!
a) Rewrite using STP instructions (two pairs). Show the equivalent stack layout.
b) How many instructions are saved?
c) Write the corresponding restore sequence using LDP.
d) Does the rewritten version maintain 16-byte stack alignment after all stores? Explain.
Exercise 14: Integer to String Conversion
Implement a function itoa(uint64_t n, char *buf) in ARM64 assembly that converts an unsigned 64-bit integer to its ASCII decimal representation in buf. Return the length.
Hint:
1. Repeatedly divide by 10 using UDIV, get remainder via MSUB
2. Store digits in reverse order, then reverse the string
3. Handle special case: n = 0
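The hinted algorithm can be modelled in C before you write the assembly, which gives you known-good outputs to compare against. The sketch below is one such model. The name itoa_model is illustrative, and writing a terminating '\0' is an assumption, since the exercise only asks for the digits and the length.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference model of divide-by-10 conversion (illustrative name).
 * Returns the digit count. buf must hold at least 21 bytes:
 * 20 digits for the largest uint64_t plus the assumed '\0'. */
static size_t itoa_model(uint64_t n, char *buf)
{
    size_t len = 0;

    /* do/while emits one digit even when n == 0. */
    do {
        uint64_t q = n / 10;            /* quotient, as UDIV gives  */
        uint64_t r = n - q * 10;        /* remainder, as MSUB gives */
        buf[len++] = (char)('0' + r);   /* least significant first  */
        n = q;
    } while (n != 0);

    /* Digits were stored in reverse: swap from both ends. */
    for (size_t i = 0, j = len - 1; i < j; i++, j--) {
        char t = buf[i];
        buf[i] = buf[j];
        buf[j] = t;
    }
    buf[len] = '\0';
    return len;
}
```

Useful test inputs are 0, a single digit, a power of 10, and the largest 64-bit value, which has 20 digits.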
Exercise 15: Dot Product (Preview of Case Study)
Write an ARM64 NEON function to compute the dot product of two float32 arrays:
dot_product(float *a, float *b, int n) → float
Use FMLA (multiply-accumulate) with V register operands. Handle the case where n is not a multiple of 4 using a scalar tail loop.
The dot product is a core building block of digital signal processing, machine learning inference, and audio processing.
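A C model of the vector-plus-tail structure can serve as a test oracle for your assembly. In the sketch below, four independent partial sums stand in for the four lanes of a vector accumulator. The name dot_product_model and the pairing order of the final horizontal add are assumptions. Because a fused multiply-accumulate rounds once where separate multiply and add round twice, results may differ from your NEON version in the last bits for general inputs.

```c
/* Reference model of a 4-lane dot product with a scalar tail
 * (illustrative name). */
static float dot_product_model(const float *a, const float *b, int n)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int i = 0;

    /* Main loop: 4 elements per iteration; lane k accumulates
     * a[i + k] * b[i + k]. */
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; k++)
            acc[k] += a[i + k] * b[i + k];

    /* Horizontal add of the four lanes (one possible pairing). */
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Scalar tail: the remaining 0-3 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Testing with small integer-valued floats keeps every intermediate result exact, so the model and the assembly should agree bit for bit on those inputs.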