
Python, C, Assembly – 2’500x Faster Cosine Similarity 📐

2023-12-18 10:14:22

In this fourth article of the “Less Slow” series, I’m accelerating Unum’s open-source Vector Search primitives, used by some nice database and cloud providers to replace Meta’s FAISS and scale up search in their products. This time, our focus is on the most frequent operation for these tasks – computing the Cosine Similarity/Distance between two vectors. It’s so frequent that even doubling its performance can have a noticeable impact on application economics. Compared to a pure Python baseline, however, our single-threaded performance grew by a factor of 2’500x. Let’s go through the optimizations one by one:


Some highlights:

  • Again, goto doesn’t get the love it deserves in C.
  • BMI2 instructions on x86 are consistently overlooked… thanks to AMD Zen2, I guess.
  • AVX-512 FP16 may be the most important extension in the ongoing AI race.

I’m still scratching my head over the 4VNNI extensions of AVX-512 and couldn’t find a good way to benefit from them here, or even in the polynomial approximations of the Jensen-Shannon divergence computations in the last post, so please let me know where I should try them 🤗

Cosine Similarity

Cosine Similarity is a way to check if two “vectors” are pointing in the same direction, regardless of their magnitude. It’s a common metric used in Machine Learning and Information Retrieval, and it’s defined as:

$$
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

Here $A$ and $B$ are two vectors with $n$ dimensions. The cosine similarity is a value between $-1$ and $1$, where $1$ means the two vectors point in the same direction, $-1$ means they point in opposite directions, and $0$ means they are orthogonal. Cosine Distance, in turn, is a distance function, defined as $1 - \cos(\theta)$. It’s a value between $0$ and $2$, where $0$ means the two vectors are identical, and $2$ means they point in opposite directions. I may use the terms interchangeably, but they are not exactly the same.
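
As a tiny worked example (my own numbers, not from the benchmarks): take $A = (1, 2, 2)$ and $B = (2, 1, 2)$.

$$
A \cdot B = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8, \quad \|A\| = \sqrt{1 + 4 + 4} = 3, \quad \|B\| = \sqrt{4 + 1 + 4} = 3
$$

$$
\cos(\theta) = \frac{8}{3 \cdot 3} \approx 0.89, \qquad 1 - \cos(\theta) \approx 0.11
$$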

Python

The first implementation is the most naive one, written in pure Python… by ChatGPT. It’s a direct translation of the mathematical formula and is very easy to read and understand.

import math

def cosine_distance(a, b):
    dot_product = sum(ai * bi for ai, bi in zip(a, b))
    magnitude_a = math.sqrt(sum(ai * ai for ai in a))
    magnitude_b = math.sqrt(sum(bi * bi for bi in b))
    cosine_similarity = dot_product / (magnitude_a * magnitude_b)
    return 1 - cosine_similarity
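
A quick sanity check against the worked example above (my own test, not from the post):

print(cosine_distance([1, 2, 2], [2, 1, 2]))  # ~0.111, i.e. 1 - 8/9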

I’ll be running on 4th Gen Intel Xeon CPUs, codenamed Sapphire Rapids, available on AWS as the r7iz instances. Before running a benchmark, let’s generate random numbers and put them into a list, simulating the 1536-dimensional “embeddings” from the OpenAI Ada service.

import numpy as np

np.random.seed(0)
a_list = np.random.rand(1536).tolist()
b_list = np.random.rand(1536).tolist()

Running the benchmark with the %timeit utility, we get 93.2 µs ± 1.75 µs per loop, or roughly 100 microseconds per call. Is that good or bad?
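
For reference, the measurement comes from a notebook cell along these lines (a sketch; the exact cell isn’t shown in the post):

%timeit cosine_distance(a_list, b_list)
# 93.2 µs ± 1.75 µs per loop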

Our solution is Pythonic but inefficient, as it traverses each list twice. So let’s fuse the three reductions into a single pass over zip(a, b):

def cosine_distance(a, b):
    dot_product = 0
    magnitude_a = 0
    magnitude_b = 0
    for ai, bi in zip(a, b):
        dot_product += ai * bi
        magnitude_a += ai * ai
        magnitude_b += bi * bi
    cosine_similarity = dot_product / (magnitude_a * magnitude_b)
    return 1 - cosine_similarity

Running the benchmark, we get 65.3 µs ± 716 ns per loop, a 30% saving! I consider it a good baseline.

NumPy: Between Python and C

NumPy is a powerful tool in Python’s toolbox, helping people work fast with arrays. Many see it as the go-to for scientific computing in Python, and lots of machine-learning tools lean on it. Since it’s built on C, you’d expect it to be speedy. Let’s take our basic Python lists and sharpen them with NumPy, using single-precision floats, half-precision floats, and 8-bit integers.

floats = np.random.rand(1536).astype(np.float32)
halfs = np.random.rand(1536).astype(np.float16)
ints = (np.random.rand(1536) * 100).astype(np.int8)

These are popular choices for quantization (making the data smaller) in tools like Meta’s FAISS and Unum’s USearch. Half-precision is pretty handy and works well most of the time. But using integers? That depends on the AI model you’re using. Thanks to Quantization Aware Training, two of my favorites — Microsoft’s E5 for text and Unum’s UForm for multi-modal data — work well even when compressed to 8-bit numbers.
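
For illustration, a common way to squeeze such float embeddings into 8-bit integers is a symmetric linear quantization — a sketch of the idea, not the exact scheme those models use:

# Map float32 values into the int8 range [-127, 127]; the scale factor has to be
# stored (or shared across the dataset) to keep distances comparable.
scale = 127.0 / np.abs(floats).max()
floats_as_ints = np.round(floats * scale).astype(np.int8)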


After setting up our vectors, I used our cosine_distance function to see how similar the three arrays are:

  • floats: 349 µs ± 5.71 µs per loop.
  • halfs: 525 µs ± 9.69 µs per loop.
  • ints: 2.31 ms ± 26 µs per loop.

But here’s the problem. Instead of getting faster, things went 35x slower than our 65.3 µs starting point. Why?

  1. Memory Management: Sure, NumPy uses C arrays under the hood, which is great. But every time we iterate over them in Python, the small packed scalars get converted into full-blown Python objects. With unpredictable memory accesses on top, it’s surprising things didn’t go even slower.

  2. Half-Precision Support: Most recent hardware supports half-precision. But the software side? Not so much. Only a few AI tools use it, and they typically target GPUs, leaving the CPU behind. So, converting half-precision values on the fly can be slow.

  3. Integer Overflows: Doing math with our tiny integers isn’t smooth either. We keep getting overflow warnings, and the CPU spends more time checking things than actually doing the math. We frequently see messages like: “RuntimeWarning: overflow encountered in scalar multiply”.

Here’s a tip: if you’re using NumPy, go all in. Mixing it with regular Python objects and loops can really slow you down!
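
Going “all in” here would mean letting NumPy do the whole reduction without a Python-level loop — something along these lines (a sketch of the idea, not the code benchmarked above):

def cosine_distance_numpy(a, b):
    # One vectorized pass per reduction; no per-element Python objects.
    # For the int8 vectors, cast to a wider type first, e.g. a.astype(np.int32),
    # to avoid the overflows mentioned above.
    ab = np.dot(a, b)
    a2 = np.dot(a, a)
    b2 = np.dot(b, b)
    return 1 - ab / np.sqrt(a2 * b2)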

SciPy

NumPy is also the foundation of the SciPy library, which provides many useful functions for scientific computing, including scipy.spatial.distance.cosine. It will use native NumPy operations as much as possible.
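
The call itself is a one-liner; the numbers below presumably come from cells along these lines (a sketch — the exact pairs being compared aren’t shown in the post):

from scipy.spatial import distance

%timeit distance.cosine(floats, floats)
%timeit distance.cosine(halfs, halfs)
%timeit distance.cosine(ints, ints)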

  • floats: 15.8 µs ± 416 ns per loop.
  • halfs: 46.6 µs ± 291 ns per loop.
  • ints: 12.2 µs ± 37.5 ns per loop.

Now we see the true potential of NumPy and the underlying Basic Linear Algebra Subroutines (BLAS) libraries implemented in C, Fortran, and Assembly. Our pure Python baseline was 65.3 µs, and we are now 2-5 times faster, depending on the data type. Notably, halfs are still slow. Checking the specs of a similar CPU, we can clearly see support for f16c instructions, which means the CPU can at least decode half-precision values without software emulation, and we shouldn’t be experiencing this much throttling.
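
On Linux, you can double-check what your own machine advertises by peeking at the CPU flags — a quick check, not part of the benchmark:

# True if the CPU advertises the F16C half-precision conversion extension.
print("f16c" in open("/proc/cpuinfo").read())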

C

C is the lowest-level hardware-agnostic programming language – hence the best way to optimize small numerical functions, like the Cosine Similarity and Distance. It is also trivial to wrap C functions into pure CPython bindings for the default Python runtime, and if you use the FastCall convention, like we do, you can make your custom code as fast as the built-in Python functions – something I’ve described recently with StringZilla, which replaces Python’s default str string class with a faster alternative. Unlike C++, however, C doesn’t support “generics” or “template functions”, so we have to implement the cosine_distance function separately for every data type we want to support, or use the ugly “macros”:

#define SIMSIMD_MAKE_COS(name, input_type, accumulator_type, converter)                                               \
    inline static simsimd_f32_t simsimd_##name##_##input_type##_cos(                                                  \
        simsimd_##input_type##_t const* a, simsimd_##input_type##_t const* b, simsimd_size_t n) {                     \
        simsimd_##accumulator_type##_t ab = 0, a2 = 0, b2 = 0;                                                        \
        for (simsimd_size_t i = 0; i != n; ++i) {                                                                     \
            simsimd_##accumulator_type##_t ai = converter(a[i]);                                                      \
            simsimd_##accumulator_type##_t bi = converter(b[i]);                                                      \
            ab += ai * bi;                                                                                             \
            a2 += ai * ai;                                                                                             \
            b2 += bi * bi;                                                                                             \
        }                                                                                                              \
        return ab != 0 ? 1 - ab * SIMSIMD_RSQRT(a2) * SIMSIMD_RSQRT(b2) : 1;                                           \
    }

This is a real snippet from the library, and it relies on yet another macro – SIMSIMD_RSQRT(x), defined as 1 / sqrtf(x) by default. We later instantiate it for all the data types we need:

SIMSIMD_MAKE_COS(serial, f32, f32, SIMSIMD_IDENTIFY) // `simsimd_serial_f32_cos`
SIMSIMD_MAKE_COS(serial, f16, f32, SIMSIMD_UNCOMPRESS_F16) // `simsimd_serial_f16_cos`
SIMSIMD_MAKE_COS(serial, i8, i32, SIMSIMD_IDENTIFY) // `simsimd_serial_i8_cos`

These macros will generate the following functions:

  • simsimd_serial_f32_cos: for 32-bit floats.
  • simsimd_serial_f16_cos: for 16-bit halfs, accumulating 32-bit values.
  • simsimd_serial_i8_cos: for 8-bit ints, accumulating 32-bit values.

We benchmark these using the Google Benchmark library:

  • floats: 1956 ns ~ 33x faster than Python.
  • halfs: 1118 ns ~ 58x faster than Python.
  • ints: 299 ns ~ 218x faster than Python.

That’s a great result, but this code relies on the compiler to do the heavy lifting and produce efficient Assembly. As we know, sometimes even the newest compilers, like GCC 12, can be 119x slower than hand-written Assembly – even on the simplest data-parallel tasks, like computing the Jensen-Shannon divergence of two discrete probability distributions.

Assembly

Assembly is the lowest-level programming language, and it’s the closest to the hardware. It is also the most difficult to write and maintain, as it isn’t portable and it is very easy to make mistakes. But it is also the most rewarding, as it lets us write the most efficient code and use the most advanced hardware features, like AVX-512. AVX-512, in turn, is not a monolith. It’s a very complex instruction set with the following subsets:

  • AVX-512F: 512-bit SIMD instructions.
  • AVX-512DQ: doubleword and quadword instructions.
  • AVX-512BW: byte and word instructions.
  • AVX-512VL: vector length extensions.
  • AVX-512CD: conflict detection instructions.
  • AVX-512ER: exponential and reciprocal instructions.
  • AVX-512PF: prefetch instructions.
  • AVX-512VBMI: vector byte manipulation instructions.
  • AVX-512IFMA: integer fused multiply-add instructions.
  • AVX-512VBMI2: vector byte manipulation instructions 2.
  • AVX-512VNNI: vector neural network instructions.
  • AVX-512BITALG: bit algorithms instructions.
  • AVX-512VPOPCNTDQ: vector population count instructions.
  • AVX-512 4VNNIW: vector neural network instructions, word variable precision.
  • AVX-512 4FMAPS: fused multiply-add instructions, single precision.
  • AVX-512VP2INTERSECT: vector pairwise intersect instructions.
  • AVX-512BF16: bfloat16 instructions.
  • AVX-512FP16: half-precision floating-point instructions.

Fortunately, we won’t need all of them today. If you’re curious, you can read more about them in the Intel 64 and IA-32 Architectures Software Developer’s Manual… but be warned, it’s a very long read.

Moreover, we don’t have to write the Assembly per se, as we can use SIMD Intrinsics to essentially write Assembly instructions without leaving C.

__attribute__((target("avx512f,avx512vl"))) //
inline static simsimd_f32_t
simsimd_avx512_f32_cos(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n) {
    __m512 ab_vec = _mm512_set1_ps(0);
    __m512 a2_vec = _mm512_set1_ps(0);
    __m512 b2_vec = _mm512_set1_ps(0);

    for (simsimd_size_t i = 0; i < n; i += 16) {
        __mmask16 masks = n - i >= 16 ? 0xFFFF : ((1u << (n - i)) - 1u);
        __m512 a_vec = _mm512_maskz_loadu_ps(masks, a + i);
        __m512 b_vec = _mm512_maskz_loadu_ps(masks, b + i);
        ab_vec = _mm512_fmadd_ps(a_vec, b_vec, ab_vec);
        a2_vec = _mm512_fmadd_ps(a_vec, a_vec, a2_vec);
        b2_vec = _mm512_fmadd_ps(b_vec, b_vec, b2_vec);
    }

    simsimd_f32_t ab = _mm512_reduce_add_ps(ab_vec);
    simsimd_f32_t a2 = _mm512_reduce_add_ps(a2_vec);
    simsimd_f32_t b2 = _mm512_reduce_add_ps(b2_vec);

    __m128d a2_b2 = _mm_set_pd((double)a2, (double)b2);
    __m128d rsqrts = _mm_mask_rsqrt14_pd(_mm_setzero_pd(), 0xFF, a2_b2);
    double rsqrts_array[2];
    _mm_storeu_pd(rsqrts_array, rsqrts);
    return 1 - ab * rsqrts_array[0] * rsqrts_array[1];
}

Let’s start with a relatively simple implementation:

  • A single for-loop iterates through the 2 vectors, scanning up to 16 entries at a time.
  • When it reaches the end of the vectors, it loads the remaining entries with the ((1u << (n - i)) - 1u) mask.
  • It doesn’t expect the vectors to be aligned, so it uses the loadu instructions.
  • It avoids separate additions by using the fmadd instruction, which computes a * b + c.
  • It uses the reduce_add intrinsic to sum all the elements in the vector, which isn’t a single SIMD instruction, but the compiler can optimize that part for us.
  • It uses the rsqrt14 instruction to compute the reciprocal square roots of a2 and b2, very accurately approximating 1/sqrt(a2) and 1/sqrt(b2) and avoiding LibC calls.

Benchmarking this code and its symmetric counterparts for other data types, we get the following:


  • floats: 118 ns ~ 553x faster than Python.
  • halfs: 79 ns ~ 827x faster than Python.
  • ints: 158 ns ~ 413x faster than Python.

We are now in the high three-digit range of speedups.

BMI2 and goto

The world is full of prejudice and unfairness, and some of the biggest ones in programming are:

  • considering goto an anti-pattern.
  • not using the BMI2 family of assembly instructions.

The first one is helpful when optimizing low-level code and avoiding unnecessary branching. The second is a tiny set of bzhi, pdep, pext, and a few other instructions, handy for bit manipulation! We will restore fairness and introduce both to our code, while also replacing the double-precision computations with single-precision.

__attribute__((target("avx512f,avx512vl,bmi2"))) //
inline static simsimd_f32_t
simsimd_avx512_f32_cos(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n) {
    __m512 ab_vec = _mm512_set1_ps(0);
    __m512 a2_vec = _mm512_set1_ps(0);
    __m512 b2_vec = _mm512_set1_ps(0);
    __m512 a_vec, b_vec;

simsimd_avx512_f32_cos_cycle:
    if (n < 16) {
        __mmask16 masks = _bzhi_u32(0xFFFFFFFF, n);
        a_vec = _mm512_maskz_loadu_ps(masks, a);
        b_vec = _mm512_maskz_loadu_ps(masks, b);
        n = 0;
    } else {
        a_vec = _mm512_loadu_ps(a);
        b_vec = _mm512_loadu_ps(b);
        a += 16, b += 16, n -= 16;
    }
    ab_vec = _mm512_fmadd_ps(a_vec, b_vec, ab_vec);
    a2_vec = _mm512_fmadd_ps(a_vec, a_vec, a2_vec);
    b2_vec = _mm512_fmadd_ps(b_vec, b_vec, b2_vec);
    if (n)
        goto simsimd_avx512_f32_cos_cycle;

    simsimd_f32_t ab = _mm512_reduce_add_ps(ab_vec);
    simsimd_f32_t a2 = _mm512_reduce_add_ps(a2_vec);
    simsimd_f32_t b2 = _mm512_reduce_add_ps(b2_vec);

    // Compute the reciprocal square roots of a2 and b2
    __m128 rsqrts = _mm_rsqrt14_ps(_mm_set_ps(0.f, 0.f, a2 + 1.e-9f, b2 + 1.e-9f));
    simsimd_f32_t rsqrt_a2 = _mm_cvtss_f32(rsqrts);
    simsimd_f32_t rsqrt_b2 = _mm_cvtss_f32(_mm_shuffle_ps(rsqrts, rsqrts, _MM_SHUFFLE(0, 0, 0, 1)));
    return 1 - ab * rsqrt_a2 * rsqrt_b2;
}

Result for floats: 87.4 ns ~ 747x faster than Python.

One of the reasons the BMI2 extensions didn’t take off initially is that AMD processors before Zen 3 implemented pdep and pext with a latency of 18 cycles. Newer CPUs do it with a latency of 3 cycles.
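
If the _bzhi_u32 trick in the listing above looks opaque: bzhi(x, n) simply zeroes every bit of x from position n upwards, i.e. it keeps the lowest n bits. A scalar Python model of that behavior, for intuition only:

def bzhi(x, n):
    # Keep the lowest n bits of x, zero the rest - what _bzhi_u32 does in one instruction.
    return x & ((1 << n) - 1)

assert bzhi(0xFFFFFFFF, 5) == 0b11111  # a 5-element tail enables only 5 of the 16 lanes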

FP16

Ice Lake CPUs support all the instructions we’ve used so far. In contrast, the newer Sapphire Rapids CPUs add native support for half-precision math.

simsimd_avx512_f16_cos_cycle:
    if (n < 32) {
        __mmask32 masks = _bzhi_u32(0xFFFFFFFF, n);
        a_i16_vec = _mm512_maskz_loadu_epi16(masks, a);
        b_i16_vec = _mm512_maskz_loadu_epi16(masks, b);
        n = 0;
    } else {
        a_i16_vec = _mm512_loadu_epi16(a);
        b_i16_vec = _mm512_loadu_epi16(b);
        a += 32, b += 32, n -= 32;
    }
    ab_vec = _mm512_fmadd_ph(_mm512_castsi512_ph(a_i16_vec), _mm512_castsi512_ph(b_i16_vec), ab_vec);
    a2_vec = _mm512_fmadd_ph(_mm512_castsi512_ph(a_i16_vec), _mm512_castsi512_ph(a_i16_vec), a2_vec);
    b2_vec = _mm512_fmadd_ph(_mm512_castsi512_ph(b_i16_vec), _mm512_castsi512_ph(b_i16_vec), b2_vec);
    if (n)
        goto simsimd_avx512_f16_cos_cycle;

The body of our function has changed:

  • we are now scanning 32 entries per cycle, as the scalars are 2x smaller.
  • we are using the _ph intrinsics, the half-precision counterparts of the _ps intrinsics.

Result for halfs: 51.8 ns ~ 1’260x faster than Python.

VNNI

The final instruction we’ll explore today is vpdpbusd from the AVX-512 VNNI subset. It multiplies 8-bit integers, producing 16-bit intermediate results, which are then added to a 32-bit accumulator. With smaller scalars, we can fit 64 of them in a single ZMM register.
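
To make the semantics concrete, here is a scalar Python model of what vpdpbusd does within a single 32-bit lane (an illustration, not the library’s code; the real instruction treats the first operand as unsigned bytes and the second as signed):

def vpdpbusd_lane(acc, a_bytes, b_bytes):
    # Four unsigned 8-bit values from `a` times four signed 8-bit values from `b`,
    # summed and added to the 32-bit accumulator.
    return acc + sum(ua * sb for ua, sb in zip(a_bytes, b_bytes))

print(vpdpbusd_lane(0, [1, 2, 3, 4], [10, -20, 30, 40]))  # 220

With that in mind, here is the hot loop for the 8-bit path: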

simsimd_avx512_i8_cos_cycle:
    if (n < 64) {
        __mmask64 masks = _bzhi_u64(0xFFFFFFFFFFFFFFFF, n);
        a_vec = _mm512_maskz_loadu_epi8(masks, a);
        b_vec = _mm512_maskz_loadu_epi8(masks, b);
        n = 0;
    } else {
        a_vec = _mm512_loadu_epi8(a);
        b_vec = _mm512_loadu_epi8(b);
        a += 64, b += 64, n -= 64;
    }
    ab_i32s_vec = _mm512_dpbusd_epi32(ab_i32s_vec, a_vec, b_vec);
    a2_i32s_vec = _mm512_dpbusd_epi32(a2_i32s_vec, a_vec, a_vec);
    b2_i32s_vec = _mm512_dpbusd_epi32(b2_i32s_vec, b_vec, b_vec);
    if (n)
        goto simsimd_avx512_i8_cos_cycle;

After compiling, you might expect to see three vpdpbusd invocations. However, the compiler had other ideas. Instead of sticking to our expected flow, it duplicated the vpdpbusd calls in both branches of the if condition to minimize code sections and jumps. Here’s a glimpse of our main operational section:

.L2:
        ; Check `if (n < 64)` and jump to .L7 if it's true.
        cmp     rax, rdx                            
        je      .L7
        ; Load 64 bytes from `a` and `b` into `zmm1` and `zmm0`.
        vmovdqu8        zmm1, ZMMWORD PTR [rdi]
        vmovdqu8        zmm0, ZMMWORD PTR [rsi]
        ; Increment `a` and `b` pointers by 64.
        add     rdi, 64
        add     rsi, 64
        ; Perform the mixed-precision dot-products.
        vpdpbusd        zmm4, zmm1, zmm0 ; ab = b * a
        vpdpbusd        zmm3, zmm1, zmm1 ; b2 = b * b
        vpdpbusd        zmm2, zmm0, zmm0 ; a2 = a * a
        ; Subtract the number of remaining entries and jump back.
        sub     rdx, 64
        jne     .L2
.L7:
        ; Process the tail by building the `k1` mask first.
        ; We avoid `k0`, which is a hardcoded constant used to indicate unmasked operations.
        mov     rdx, -1
        bzhi    rax, rdx, rax
        kmovq   k1, rax
        ; Load under 64 bytes from `a` and `b` into `zmm1` and `zmm0`,
        ; using the mask in `k1`, which can be applied to any AVX-512 argument.
        vmovdqu8        zmm1{k1}{z}, ZMMWORD PTR [rdi]
        vmovdqu8        zmm0{k1}{z}, ZMMWORD PTR [rsi]
        ; Perform the mixed-precision dot-products.
        vpdpbusd        zmm3, zmm1, zmm1 ; b2 = b * b
        vpdpbusd        zmm4, zmm1, zmm0 ; ab = b * a
        vpdpbusd        zmm2, zmm0, zmm0 ; a2 = a * a
        ; Exit the loop.
        jmp     .L4

It’s often intriguing to see how compilers shuffle my instructions around, especially when it seems somewhat arbitrary in short code segments. Our heavy SIMD will in any case be decoded into micro-ops, and the CPU will then rearrange them again, disregarding the compiler’s sequence. Still, testing this with the Google Benchmark library, we observe the following for ints: 25.9 ns, which is roughly 2’521 times faster than our original Python baseline. Delivered as promised 🤗

Conclusion

There’s a common belief that you need a massive infrastructure, akin to what giants like OpenAI or Google possess, to create impactful technology. However, I’m a proponent of the idea that many wonderful innovations can spring from humble beginnings – even on a shoestring budget. If you share this sentiment and are keen on optimizing and innovating, you might be interested in some other libraries I’ve had a hand in. 😉

Appendix: Comparing Compilers

For those wanting to delve deeper, analyzing the Assembly generated by different compilers can be insightful. A popular comparison is between GCC and Intel’s new ICX compiler, the latter now being based on LLVM. Before diving into the Assembly details, let’s first benchmark their performance.

$ cmake -DCMAKE_C_COMPILER="gcc-12" -DCMAKE_CXX_COMPILER="g++-12" -DCMAKE_BUILD_TYPE=Release -DSIMSIMD_BUILD_BENCHMARKS=1 -B ./build_release
$ cmake --build build_release --config Release && ./build_release/simsimd_bench --benchmark_out="gcc-12.json" --benchmark_filter=1536d

$ source /opt/intel/oneapi/setvars.sh
$ cmake -DCMAKE_C_COMPILER="icx" -DCMAKE_CXX_COMPILER="icpx" -DCMAKE_BUILD_TYPE=Release -DSIMSIMD_BUILD_BENCHMARKS=1 -B ./build_release
$ cmake --build build_release --config Release && ./build_release/simsimd_bench --benchmark_out="intel-2023.json" --benchmark_filter=1536d

The above script compiles the code with both compilers and then runs each binary, exporting the results into separate JSON files. Afterwards, you can use Google Benchmark’s lesser-known tool, compare.py, to determine whether there are significant performance differences and whether a deeper dive into the Assembly is warranted:

$ mkdir -p gbench
$ wget https://github.com/google/benchmark/blob/main/tools/compare.py?raw=true -O compare.py
$ wget https://github.com/google/benchmark/blob/main/tools/gbench/report.py?raw=true -O gbench/report.py
$ wget https://github.com/google/benchmark/blob/main/tools/gbench/util.py?raw=true -O gbench/util.py
$ python compare.py benchmarks gcc-12.json intel-2023.json

Comparing gcc-12.json to intel-2023.json
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
avx2_f16_cos_1536d/min_time:10.000/threads:1                   +0.0002         +0.0002           235           235           235           235
avx2_i8_cos_1536d/min_time:10.000/threads:1                    -0.0034         -0.0034            91            91            91            91
avx512_f16_cos_1536d/min_time:10.000/threads:1                 +0.0024         +0.0024            52            52            52            52
avx512_i8_cos_1536d/min_time:10.000/threads:1                  +0.0110         +0.0110            26            26            26            26
avx512_f32_cos_1536d/min_time:10.000/threads:1                 +0.0206         +0.0206            87            89            87            89
serial_f16_cos_1536d/min_time:10.000/threads:1                 -0.0013         -0.0013          1988          1985          1988          1985
serial_f32_cos_1536d/min_time:10.000/threads:1                 -0.0000         -0.0000          1118          1118          1118          1118
serial_i8_cos_1536d/min_time:10.000/threads:1                  -0.0001         -0.0001           299           299           299           299

From the results, we observe minimal runtime differences between the two compilers. The generated Assembly is remarkably similar, particularly in the critical sections. Intel’s output is generally about 10% shorter, which is often advantageous. The most pronounced differences emerge outside of the for-loop. Notably, the _mm512_reduce_add_epi32 intrinsic doesn’t correspond directly to a single SIMD instruction, granting the compilers a bit more leeway:

  • GCC opts for a lengthier reduction using vextracti64x4 and vextracti64x2.
  • Intel prefers the more succinct vextracti128.

Both compilers employ vpshufd for shuffling but use different masks. For a detailed exploration, visit the Compiler Explorer.


