What is SIMD?
SIMD = Single Instruction, Multiple Data.
It lets a CPU core apply one operation (like add, compare, multiply) to many values at once packed into a wide “vector” register.
Think: instead of computing c[i] = a[i] + b[i] one element at a time, SIMD does 8 or 16 of them in one go.
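As a minimal sketch in C (the function name add_arrays is just illustrative), here is that loop in scalar form. With optimization enabled (-O2/-O3), GCC and Clang commonly auto-vectorize exactly this pattern into SIMD adds:

```c
#include <stddef.h>

/* Plain scalar loop: one addition per iteration. Compilers will usually
 * auto-vectorize this into SIMD instructions at -O2/-O3, processing
 * several elements per iteration. */
void add_arrays(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The point: the source stays a simple loop; the SIMD happens in the generated machine code.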
How it works (mental model)
- The CPU has vector registers (128/256/512-bit).
- Each register is split into lanes (e.g., eight 32-bit floats).
- One instruction (e.g., ADD) runs over all lanes simultaneously.
| a0 a1 a2 a3 a4 a5 a6 a7 |
+ | b0 b1 b2 b3 b4 b5 b6 b7 |
= | c0 c1 c2 c3 c4 c5 c6 c7 | (one SIMD add)
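The eight-lane add in the diagram can be written almost literally using GCC/Clang vector extensions (a compiler feature, not standard portable C; the type name v8f is made up here):

```c
/* Eight 32-bit floats packed into one 256-bit vector type, mirroring
 * the diagram above. The compiler lowers the "+" to SIMD adds, e.g.
 * one AVX instruction on x86 or a pair of NEON adds on ARM. */
typedef float v8f __attribute__((vector_size(32)));

v8f add8(v8f a, v8f b) {
    return a + b;  /* c0..c7 = a0+b0 .. a7+b7, no loop */
}
```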
Why it’s fast
- Fewer instructions and less loop overhead.
- Better use of CPU pipelines and caches.
- Great for vectorized code: scans, filters, math, encoding/decoding.
Where you see it
- Analytics engines (columnar DBs): filter price < 100 on 1024 values per batch with a few SIMD compares + a mask.
- Media/ML: image ops, DSP, linear algebra.
- Cryptography, compression.
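A sketch of the compare-plus-mask idea from the columnar-DB bullet, again using GCC/Clang vector extensions (cheaper_than is a hypothetical helper, and real engines use wider batches):

```c
typedef float v4f __attribute__((vector_size(16)));
typedef int   v4i __attribute__((vector_size(16)));

/* One SIMD compare produces a lane mask: -1 (all bits set) in lanes
 * where the predicate holds, 0 where it doesn't. A columnar engine
 * sweeps such masks over a batch instead of branching per row. */
v4i cheaper_than(v4f price, float limit) {
    v4f vlimit = {limit, limit, limit, limit};
    return price < vlimit;
}
```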
Names you might recognize
- x86: SSE, AVX2, AVX-512
- ARM: NEON, SVE
SIMD vs threads
- SIMD: parallelism within one core, across data lanes.
- Multithreading: parallelism across cores (each core may also use SIMD).
Gotchas
- Works best when data is contiguous and aligned.
- Branchy code can hurt; use masks and branchless logic.
- Speedup is bounded by memory bandwidth if data can't be fed fast enough.
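The masks-and-branchless-logic gotcha can be sketched as a lane-wise select: instead of an if/else per element, combine a compare mask with bitwise ops so every lane runs the same code (select4 is an illustrative name, using the same vector-extension types as above):

```c
typedef int v4i __attribute__((vector_size(16)));

/* Branchless select: keep a[i] where the mask lane is all-ones (-1),
 * b[i] where it is zero. No per-lane branch, so all lanes execute
 * the same instruction stream. */
v4i select4(v4i mask, v4i a, v4i b) {
    return (mask & a) | (~mask & b);
}
```

This is the SIMD equivalent of `c[i] = cond[i] ? a[i] : b[i]`, done for all lanes at once.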