Python, C, Meeting – 2’500x Quicker Cosine Similarity 📐
In this fourth article of the “Less Slow” series, I’m accelerating Unum’s opensource Vector Search primitives utilized by some nice database and cloud suppliers to switch Meta’s FAISS and scaleup search of their merchandise. This time, our focus is on essentially the most frequent operation for these duties – computing the the Cosine Similarity/Distance between two vectors. It’s so frequent, even doubling it’s efficiency can have a noticeable influence on purposes economics. However in comparison with a pure Python baseline our singlethreaded efficiency grew by an element of two’500x. Let’s undergo optimizations one after the other:
Some highlights:
 Once more,
goto
doesn’t get the adore it deserves in C.  BMI2 directions on x86 are constantly missed… due to AMD Zen2, I assume.
 AVX512FP16 might be an important extension within the present AI race.
I’m nonetheless scratching my head on the 4VNNI extensions of AVX512, and couldn’t discover a great way to learn from them right here and even within the polynomial approximations of the Jensen Shannon divergence computations in the last post, so please let me know the place I ought to attempt them 🤗
Cosine Similarity#
Cosine Similarity is a strategy to test if two “vectors” are wanting in the identical path, no matter their magnitude. It’s a widespread metric utilized in Machine Studying and Data Retrieval, and it’s outlined as:
$$
cos(theta) = frac{A cdot B}B = frac{sum_{i=1}^{n} A_i B_i}{sqrt{sum_{i=1}^{n} A_i^2} sqrt{sum_{i=1}^{n} B_i^2}}
$$
The place $A$ and $B$ are two vectors with $n$ dimensions. The cosine similarity is a worth between $1$ and $1$, the place $1$ signifies that the 2 vectors are pointing in the identical path, $1$ implies that they’re pointing in reverse instructions and $0$ signifies that they’re orthogonal. Cosine Distance, in flip, is a distance operate, which is outlined as $1 – cos(theta)$. It’s a worth between $0$ and $2$, the place $0$ signifies that the 2 vectors are equivalent, and $2$ signifies that they’re pointing in reverse instructions. I could use the phrases interchangeably, however they don’t seem to be precisely the identical.
Python#
The primary implementation is essentially the most naive one, written in pure Python… by ChatGPT. It’s a direct translation of the mathematical method and may be very straightforward to learn and perceive.
I’ll run on 4th Gen Intel Xeon CPUs, codenamed Sapphire Rapids, and obtainable on AWS because the r7iz
situations. Earlier than working a benchmark, let’s generate random numbers and put them into a listing, simulating the 1536dimensional “embeddings” from the OpenAI Ada service.
Operating the benchmark with the %timeit
utility, we get 93.2 µs ± 1.75 µs per loop.. Or roughly 100 microseconds per name. Is that good or dangerous?
Our answer is Pythonic however inefficient, because it traverses every record
twice. So let’s use the frequent Pythonic zip
idiom:
Operating the benchmark, we get 65.3 µs ± 716 ns per loop., ensuing 30% financial savings! I consider it’s a good baseline.
NumPy: Between Python and C#
NumPy is a strong instrument in Python’s toolbox, serving to people work quick with arrays. Many individuals see it because the goto for doing science stuff in Python. Lots of machine studying instruments lean on it. Because it’s constructed with C, you’d suppose it’d be speedy. Let’s take our primary Python lists and make them sharper with NumPy. We’re singleprecision, halfprecision numbers, and 8bit integers.
These are wellliked decisions for quantization (making knowledge smaller) in instruments like Meta’s FAISS and Unum’s USearch. Halfprecision is fairly useful, working properly more often than not. However utilizing integers? That depends upon the AI mannequin you’re utilizing. Due to Quantization Aware Training, two of my faves — Microsoft’s E5 for simply textual content and Unum’s UForm for multimodal knowledge — work properly even compressed to 8bit numbers.
After getting our vectors arrange, I used our cosine_distance
operate to see how related the three arrays are:
floats
: 349 µs ± 5.71 µs per loop.halfs
: 525 µs ± 9.69 µs per loop.ints
: 2.31 ms ± 26 µs per loop.
However right here’s the issue. As an alternative of getting quicker, issues went 35x slower than our 65.3 µs place to begin. Why?

Reminiscence Administration: Certain, NumPy makes use of C arrays, that are cool. However each time we loop by them, we flip small byte stuff into larger Python stuff. And with reminiscence being unpredictable, it’s stunning issues didn’t go even slower.

HalfPrecision Help: Most new units help halfprecision. However the software program facet? Not a lot. Just a few AI instruments use it, they usually typically concentrate on GPU stuff, leaving out the CPU. So, changing halfprecision issues on the go will be gradual.

Integer Overflows: Doing math with our tiny integers isn’t clean. We hold getting these annoying overflow warnings. The CPU spends extra time checking issues than really doing the maths. We regularly see issues like: “RuntimeWarning: overflow encountered in scalar multiply”.
Right here’s a tip: When you’re utilizing NumPy, go all in. Mixing it with common Python can actually gradual you down!
SciPy#
NumPy can also be the muse of the SciPy library, which supplies many useful capabilities for scientific computing, together with the scipy.distance.spatial.cosine
. It would use the native NumPy operations for as a lot as attainable.
floats
: 15.8 µs ± 416 ns per loop.halfs
: 46.6 µs ± 291 ns per loop.ints
: 12.2 µs ± 37.5 ns per loop.
Now, we see the true potential of NumPy, and the underlying Fundamental Linear Algebra Subroutines (BLAS) libraries carried out in C, Fortran, and Meeting. Our pure Python baseline was 65.3 µs, and we now acquired 25 occasions quicker, relying on the info kind. Notably, halfs
are nonetheless gradual. Checking the specs of an analogous CPU, we will clearly see help for f16c
directions, which signifies that the CPU can at the least decode halfprecision values with out software program emulation, and we shouldn’t be experiencing this a lot throttling.
C is the lowestlevel hardwareagnostic programming language – therefore one of the simplest ways to optimize small numerical capabilities, just like the Cosine Similarity and Distance.
Additionally it is trivial to wrap C capabilities into pure CPython bindings for the default Python runtime, and in the event you use the FastCall conference, like we do, you can also make your customized code as quick because the builtin Python capabilities, like what I’ve described not too long ago with StringZilla, changing Python’s default str
string class with a quicker different.
In contrast to C++, nevertheless, C doesn’t help “generics” or “template capabilities”.
So we’ve to individually implement the cosine_distance
operate for every knowledge kind we need to help, or use the ugly “macros”:


It is a actual snippet from the library and depends upon yet one more macro – SIMSIMD_RSQRT(x)
, outlined because the 1 / sqrtf(x)
by default.
We later instantiate it for all the info sorts we’d like:
These macros will generate the next capabilities:
simsimd_serial_f32_cos
: for 32bitfloats
.simsimd_serial_f16_cos
: for 16bithalfs
, accumulating 32bit values.simsimd_serial_i8_cos
: for 8bitints
, accumulating 32bit values.
We benchmark these utilizing the Google Benchmark library:
floats
: 1956 ns ~ 33x quicker than Python.halfs
: 1118 ns ~ 58x quicker than Python.ints
: 299 ns ~ 218x quicker than Python.
That’s an awesome outcome, however this code depends on the compiler to carry out heavy lifting and produce environment friendly Meeting. As we all know, generally even the latest compilers, like GCC 12, can be 119x slower than handwritten Assembly. Even on the best dataparallel duties, like computing the Jensen Shannon divergence of two discrete chance distributions.
I’ve discovered a reasonably easy dataparallel workload, the place AVX512FP16 #SIMD code beats O3/fastmath meeting produced by GCC 12 by an element of 118x 🤯🤯🤯https://t.co/cFftGbMBQe
— Ash Vardanian (@ashvardanian) October 23, 2023
Meeting#
Meeting is the lowestlevel programming language, and it’s the closest to the {hardware}. Additionally it is essentially the most troublesome to write down and keep, as it isn’t moveable, and it is extremely straightforward to make errors. However it’s also essentially the most rewarding, because it permits us to write down essentially the most environment friendly code and to make use of essentially the most superior {hardware} options, like AVX512. AVX512, in flip, shouldn’t be a monolith. It’s a really complicated instruction set with the next subsets:
 AVX512F: 512bit SIMD directions.
 AVX512DQ: doubleprecision floatingpoint directions.
 AVX512BW: byte and phrase directions.
 AVX512VL: vector size extensions.
 AVX512CD: battle detection directions.
 AVX512ER: exponential and reciprocal directions.
 AVX512PF: prefetch directions.
 AVX512VBMI: vector byte manipulation directions.
 AVX512IFMA: integer fused multiplyadd directions.
 AVX512VBMI2: vector byte manipulation directions 2.
 AVX512VNNI: vector neural community directions.
 AVX512BITALG: bit algorithms directions.
 AVX512VPOPCNTDQ: vector inhabitants depend directions.
 AVX5124VNNIW: vector neural community directions phrase variable precision.
 AVX5124FMAPS: fused multiplyadd directions single precision.
 AVX512VP2INTERSECT: vector pairwise intersect directions.
 AVX512BF16:
bfloat16
directions.  AVX512FP16: halfprecision floatingpoint directions.
Fortunately, we gained’t want all of them at the moment. If you’re curious, you may learn extra about them within the Intel 64 and IA32 Architectures Software Developer’s Manual… however be prepared, it’s a really lengthy learn.
Furthermore, we don’t have to write down the Meeting per se, as we will use the SIMD Intrinsics to primarily write the Meeting directions with out leaving C.


Let’s begin with a comparatively easy implementation:
 A single
for
loop iterates by 2 vectors, scanning as much as 16 entries concurrently.  When it reaches the top of the vectors, it makes use of a masks to load the remaining entries with the
((1u << (n  i))  1u)
masks.  It doesn’t count on the vectors to be aligned, so it makes use of the
loadu
directions.  It avoids separate additions utilizing the
fmadd
instruction, which computesa * b + c
.  It makes use of the
reduce_add
intrinsic to sum all the weather within the vector, which isn’t a SIMD code, however the compiler can optimize that half for us.  It makes use of the
rsqrt14
instruction to compute the reciprocal sq. root of the sum ofa2
andb2
, very precisely approximating1/sqrt(a2)
and1/sqrt(b2)
, and avoiding LibC calls.
Benchmarking this code and its symmetric counterparts for different knowledge sorts, we get the next:
floats
: 118 ns ~ 553x quicker than Python.halfs
: 79 ns ~ 827x quicker than Python.ints
: 158 ns ~ 413x quicker than Python.
We are actually within the increased three digits.
BMI2 and goto
#
The world is stuffed with prejudice and unfairness, and a number of the largest ones in programming are:
 contemplating a
goto
to be an antipattern.  not utilizing the BMI2 household of meeting directions.
The primary one is useful when optimizing lowlevel code and avoiding pointless branching. The second is a tiny set of bzhi
, pdep
, pext
, and some different, handy for bit manipulation! We are going to restore equity and introduce them to our code, changing doubleprecision computations with singleprecision.


End result for floats
: 87.4 ns ~ 747x quicker than Python.
One of many causes the BMI2 extensions didn’t take off initially is that, AMD processors earlier than Zen 3 carried out
pdep
andpext
with a latency of 18 cycles. Newer CPUs do it with the latency of three cycles.
FP16#
Ice Lake CPUs supported all of the directions we’ve used to date. In distinction, the newer Sapphire Rapids directions add native help for halfprecision math.


The physique of our operate has modified:
 we are actually scanning 32 entries per cycle, because the scalars are 2x smaller.
 we’re utilizing the
_ph
intrinsics, the halfprecision counterparts of the_ps
intrinsics.
End result for halfs
: 51.8 ns ~ 1’260x quicker than Python.
VNNI
Immediately’s remaining instruction we’ll discover is the vpdpbusd
from the AVX512VNNI subset. It multiplies 8bit integers, producing 16bit intermediate outcomes, that are then added to a 32bit accumulator. With smaller scalars, we will home 64 of them in a single ZMM register.


After compiling, you would possibly count on to see three vpdpbusd
invocations. Nonetheless, the compiler had different concepts. As an alternative of sticking to our anticipated circulation, it duplicated vpdpbusd
calls in each branches of the if
situation to attenuate code sections and jumps. Right here’s a glimpse of our essential operational part:
.L2:
; Examine `if (n < 64)` and bounce to the L7 if it is true.
cmp rax, rdx
je .L7
; Load 64 bytes from `a` and `b` into `zmm1` and `zmm0`.
vmovdqu8 zmm1, ZMMWORD PTR [rdi]
vmovdqu8 zmm0, ZMMWORD PTR [rsi]
; Increment `a` and `b` pointers by 64.
add rdi, 64
add rsi, 64
; Carry out mixedprecision dotproducts.
vpdpbusd zmm4, zmm1, zmm0 ; ab = b * a
vpdpbusd zmm3, zmm1, zmm1 ; b2 = b * b
vpdpbusd zmm2, zmm0, zmm0 ; a2 = a * a
; Subtract the variety of remaining entries and bounce again.
sub rdx, 64
jne .L2
.L7:
; Course of the tail, by constructing the `k1` masks first.
; We keep away from `k0`, which is a hardcoded fixed used to point unmasked operations.
mov rdx, 1
bzhi rax, rdx, rax
kmovq k1, rax
; Load below 64 bytes from `a` and `b` into `zmm1` and `zmm0`,
; utilizing the masks from the `k1`, which will be utilized to any AVX512 argument.
vmovdqu8 zmm1{k1}{z}, ZMMWORD PTR [rdi]
vmovdqu8 zmm0{k1}{z}, ZMMWORD PTR [rsi]
; Carry out mixedprecision dotproducts.
vpdpbusd zmm3, zmm1, zmm1 ; b2 = b * b
vpdpbusd zmm4, zmm1, zmm0 ; ab = b * a
vpdpbusd zmm2, zmm0, zmm0 ; a2 = a * a
; Exit the loop.
jmp .L4
It’s typically intriguing to see how compilers shuffle round my directions, particularly when it appears considerably arbitrary briefly code segments. Our heavy SIMD will in any case be decoded into microops, after which the CPU rearranges them once more, disregarding compiler’s sequence. Nonetheless, upon testing this with the Google Benchmark library, we noticed the next for ints
: 25.9 ns, which is roughly 2’521 occasions quicker than our authentic Python baseline. Delivered as promised 🤗
Conclusion#
There’s a standard perception that you just want a large infrastructure, akin to what giants like OpenAI or Google possess, to create impactful knowhow. Nonetheless, I’m a proponent of the concept many wonderful improvements can spring from humble beginnings – even on a shoestring finances. When you share this sentiment and are eager on optimizing and innovating, you is likely to be excited by another libraries I’ve had a hand in. 😉
Appendix: Evaluating Compilers#
For these wanting to delve deeper, analyzing the Meeting generated by completely different compilers will be insightful. A preferred comparability is between GCC and Intel’s new ICX compiler, with the latter now being primarily based on LLVM. Earlier than diving into the Meeting particulars, let’s first benchmark their efficiency.


The above script compiles the code utilizing each compilers after which runs every binary, exporting the outcomes into distinct JSON information. Afterwards, you should use Google Benchmark’s lesserknown instrument, examine.py
, to find out if there are important efficiency variations and whether or not a deeper dive into the Meeting is warranted:


From the outcomes, we observe minimal runtime variations between the 2 compilers. The generated Meeting is remarkably related, notably within the vital sections. Intel’s output is usually 10% shorter which is usually advantageous. Essentially the most pronounced variations emerge outdoors of the for
loop. Notably, the _mm512_reduce_add_epi32
intrinsic doesn’t correspond on to a particular SIMD instruction, granting the compilers a bit extra leeway:
 GCC opts for a lengthier discount utilizing
vextracti64x4
andvextracti64x2
.  Intel prefers the succinct
vextracti128
.
Each compilers make use of the vpshufd
for shuffling however use diversified masks. For an in depth exploration, go to the Compiler Explorer.