Vectorizing Unicode conversions on real RISC-V hardware

2024-01-27 10:44:41

In this article we'll discuss how to achieve the speedup shown below for UTF-8 to UTF-16 conversion, using the RISC-V Vector extension.

UTF-8 to UTF-16 benchmarks: on average a ~3x speedup over scalar

I excluded the plain ASCII case from the graph above because it made the other results less readable; there the speedup was 8x for the C908 and 11x for the C920. More comprehensive measurements are at the end of the article.

The overwhelming majority of text you'll come across will be encoded in the UTF-8 format, but some languages and APIs use UTF-16 as their native format instead (JavaScript, Java, Windows, …).
This and other reasons might require you to convert between different Unicode encodings. As demonstrated by simdutf, this conversion process has a lot of optimization potential.

Here we'll concentrate on UTF-8 to UTF-16 conversion, and aim to develop an optimized RISC-V implementation that can be upstreamed to the simdutf library, which is used among others by Node.js and Bun. (Upstreaming is, as of writing, still work in progress.)

The RISC-V Vector extension (RVV) adds a set of 32 vector registers that are each VLEN bits wide, where VLEN is a power of two greater than or equal to 128.
Vector registers can be interpreted as multiple 8/16/32/64-bit elements and operated on accordingly, as signed/unsigned integers or single/double-precision floating-point numbers.
Since we can operate on multiple elements at a time, large speedups over scalar code are possible.

At the time of writing (early 2024) there is almost no hardware available that supports RVV.

  • The only hardware with RVV support that regular consumers can buy is the Kendryte K230. It has an in-order C908 core from XuanTie running at 1.6 GHz with a VLEN of 128 bits.
    (C908 benchmark page)
  • You can also buy two other CPUs that support an early, incompatible draft version (0.7.1) of the vector extension: the C906 and the C920 (a C910 with RVV).
    Since the C906's performance characteristics are very similar to the C908's, we won't include benchmarks for it.
    (C906 benchmark page)
  • The C920, however, is a much faster out-of-order core at 2 GHz, which makes its results more interesting than the C908's as a preview of future hardware.
    Targeting RVV 0.7.1 is a pain, as there is no official toolchain support, but I've taken the time to manually translate the generated assembly and assemble it with an older GCC branch.
    (C920 benchmark page)
  • There are a couple of open-source RVV implementations, but most are still in development/incomplete. The only one that is complete, and that we could simulate locally, is Tenstorrent's Bobcat (formerly Ocelot), but it was a proof-of-concept design and some instructions we'll be using were explicitly not optimized. (We'll discuss that later.)

Fortunately, I own a Kendryte K230 and have ssh access to a Milk-V Pioneer server with 64 C920 cores, thanks to PerfXLab.
The development itself was done via qemu emulation, as that is far simpler than working on the real hardware.

The next two sections cover the basics of RVV and Unicode; feel free to skip ahead if you're already familiar with the topics.

The RISC-V vector extension by example: UTF-8 count

I'll try to explain the RVV basics using a short example:

size_t utf8_count_rvv(char const *buf, size_t len) {
	size_t sum = 0;
	for (size_t vl; len > 0; len -= vl, buf += vl) {
		vl = __riscv_vsetvl_e8m8(len);
		vuint8m8_t v = __riscv_vle8_v_u8m8((uint8_t*)buf, vl);
		vbool1_t b = __riscv_vmsne(__riscv_vsrl(v, 6, vl), 0b10, vl);  
		sum += __riscv_vcpop(b, vl);
	}
	return sum;
}
// the corresponding assembly code:
utf8_count_rvv:
    li      a2, 0
loop:
    vsetvli a3, a1, e8, m8, ta, ma
    vle8.v  v8, (a0)
    vsrl.vi v8, v8, 6
    vmsne.vi        v16, v8, 0b10
    vcpop.m a4, v16
    add     a2, a2, a4
    sub     a1, a1, a3
    add     a0, a0, a3
    bnez    a1, loop
    mv      a0, a2
    ret

Here we're using the C intrinsics API to count the number of UTF-8 characters* in the supplied data.

The next section will describe this in more detail, but to count the number of UTF-8 characters we just need to count the number of bytes that don't match the pattern 0b10xxxxxx, assuming the input is valid.
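For comparison, a minimal scalar version of the same counting logic could look like this (my sketch, not code from the article's benchmarks):

size_t utf8_count_scalar(char const *buf, size_t len) {
	size_t sum = 0;
	for (size_t i = 0; i < len; ++i)
		/* count every byte that isn't a 0b10xxxxxx continuation byte */
		sum += ((unsigned char)buf[i] >> 6) != 0b10;
	return sum;
}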


*From here on out, when I refer to a "character" I mean a Unicode code point. This doesn't directly map to a single character on screen; for example, this emoji "🧙‍♀️" is built from three Unicode code points: 🧙 + Zero Width Joiner + ♀️ = 🧙‍♀️

As mentioned above, RVV supports different vector lengths.
To facilitate having the same code work on machines with different vector lengths, RVV has the vsetvl* instructions.
You give it the element count of your input and an element width, and it gives you back a count smaller than or equal to your supplied count that fits into a vector register (with VLEN=128 and e8m8, that's up to 128 elements per iteration).

The code above uses this to iterate over the input; vl = vsetvl_e8m8(len) is the number of elements one iteration processes.
Next, vle8_v_u8m8() loads vl 8-bit integer elements from our input into a vector register.

Then a mask is created where every active element (element where the corresponding bit in the mask is set) doesn't match the pattern 0b10xxxxxx. vsrl stands for "shift right logical" and vmsne for "mask set if not equal", so vmsne(vsrl(v, 6, vl), 0b10, vl) computes (x >> 6) != 0b10 for each element.
RVV doesn't have separate mask registers; instead, masks are stored in the lower bits of a vector register. The intrinsics API adds the vboolN_t types to provide some extra type safety.
Finally, we use vcpop to count the number of active elements in our mask and add that to our sum.

You might be wondering what the "m8" means; I've omitted that so far.
RVV has 32 VLEN-bit-wide vector registers, but with vsetvl you can also configure LMUL (the length multiplier), which causes the processor to group registers.
This means that subsequent instructions will act on a register group, and vsetvl will return a vl corresponding to the LMUL.

When LMUL=1 we've got 32 VLEN-bit-wide registers; with LMUL=2 they act like 16 registers of VLEN*2 bits, so for "m8" (LMUL=8) we've got 4 registers of VLEN*8 bits.
Here we use fewer than 5 vector registers, so using LMUL=8 gives us essentially free loop unrolling, which makes the scalar and mask operations cheaper.
Unrolling isn't the only advantage of LMUL: it also allows us to easily work with mixed-width data, which we'll be using heavily later.

This also explains why masks are stored in a vector register, more specifically in an LMUL=1 vector register, even though they only store one bit per element they refer to.
For LMUL=8 and 8-bit elements you need a full LMUL=1 register to have enough bits to represent the mask: with VLEN=128, an LMUL=8 group holds 128 byte elements, which needs exactly 128 mask bits.

Apart from the RVV features discussed already, the other features we'll be using are:

  • reductions over elements: apply an operation across all elements to produce a scalar result, e.g. summing all elements.
  • narrowing/widening arithmetic operations: operations that decrease/increase the element width and LMUL.
  • permutations: instructions to move elements around; RVV supports slides, merge (blend), compress, and gather (shuffle).

I hope this wasn't too confusing. If you don't feel ready to follow along with the rest of the article, it's worth consulting some more in-depth references on RVV first.

Quick Unicode intro and reference

Unicode defines a set of ~150,000 characters and assigns each a unique 32-bit number, called a code point.
Storing just the code points themselves is called UTF-32. This isn't done in practice, because lower code points occur far more often, so it wastes a lot of space.
There are two other encoding schemes: UTF-8 and UTF-16.

UTF-8 uses one to four bytes to represent a code point, using the format visualized below:
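code point range    | byte 1   | byte 2   | byte 3   | byte 4
U+0000..U+007F      | 0xxxxxxx |          |          |
U+0080..U+07FF      | 110xxxxx | 10xxxxxx |          |
U+0800..U+FFFF      | 1110xxxx | 10xxxxxx | 10xxxxxx |
U+10000..U+10FFFF   | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx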

Notice the small invalid range: code points between 0xD800 and 0xDFFF are unassigned and invalid; this is used to allow for the UTF-16 encoding.

UTF-16 encodes the code points of one-, two- and three-byte UTF-8 characters directly as a single 16-bit word.
The code points of four-byte UTF-8 characters are encoded in two 16-bit words, by leveraging part of the invalid code point range to signal that it's a multi-word UTF-16 character.

16-bit words in the range 0xD800-0xDBFF are called high surrogates, and words in the range 0xDC00-0xDFFF low surrogates.
A high surrogate is always followed by a low surrogate; the code points are encoded as follows:
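C' = C - 0x10000                        (C in U+10000..U+10FFFF, 20 bits remain)
high surrogate = 0xD800 + (C' >> 10)    (top 10 bits)
low surrogate  = 0xDC00 + (C' & 0x3FF)  (bottom 10 bits)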

Finally, here is a side-by-side comparison of the encodings for the string "rνṿ🧙", which contains all UTF-8 character lengths:
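            "r"        "ν"        "ṿ"        "🧙"
code point  U+0072     U+03BD     U+1E7F     U+1F9D9
UTF-8       72         CE BD      E1 B9 BF   F0 9F A7 99
UTF-16      0072       03BD       1E7F       D83E DDD9
UTF-32      00000072   000003BD   00001E7F   0001F9D9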

Our goal is to implement a fast, vectorized, validating UTF-8 to UTF-16 conversion routine, but let's tackle non-validating general UTF-8 to UTF-32 conversion first and see where that leads us.
We might end up simply converting the Unicode code points (UTF-32) to UTF-16 at the end, or find a way to use some intermediate values to get to UTF-16 directly.

So, this leaves us with a few things to do:

  • detect character positions
  • remove prefixes
  • combine to UTF-32 code points

The most important question seems to be how we deal with the different character sizes.
Initially, I had two ideas on how to approach this:

  • vdecompress:
    Skimming through the specification, the first instruction that seemed to fit the bill was vdecompress.
    It's not actually a real instruction, but rather a combination of the viota and vrgather instructions that synthesizes the inverse of vcompress.
    It uses a mask to move the nth element of the source vector to the nth active element (set mask bit) of the destination.
    This would allow us to widen every UTF-8 character to four bytes, so we could work on the different-sized characters uniformly.

  • vcompress:
    Alternatively, we could vcompress every nth byte of all UTF-8 characters into the nth of four separate vector registers.
    Then we could also write code that operates on all character sizes uniformly, but we'd need to recombine the registers to store the final code points.

The first approach seemed quite promising to me, so I sketched out the creation of the decompress mask, but got stuck on how to proceed from there.
The problem is that you now go from an input register of, let's for now assume LMUL=1, to an LMUL=4 register, and still need to do all the logic to remove prefixes and shift the bits into place.
That makes every operation we do four times slower, and we'd need quite a few operations.
Add to that that vrgather is slow at larger LMUL (see the later discussion), and this doesn't seem like such a good idea anymore.

Let's consider the vcompress approach again.
Once we've removed the prefixes from our four registers, we get the "shifting the bits into the right place" part basically for free, because we need to recombine them anyway.
Using a widening multiply makes combining them while shifting by six bits even easier than interleaving the bytes (shifting by eight bits), because we can't multiply by 1<<8, as that doesn't fit into 8 bits.

I had hoped to use masked widening adds and multiplies, but that didn't end up being worth it, as we'd need to specify a destination operand that is already widened.
Another complication is that combining the first two bytes with the last two bytes requires shifting the first two by 0, 6 or 12 bits, which doesn't translate nicely into a masked operation.

We can, however, always act as if we had a four-byte UTF-8 character and later calculate and apply a correcting right-shift amount. This also removes the need to mask the add operations, as any residual bits are shifted away.
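As a scalar model of this trick (my illustration, not part of the vector implementation): with the prefixes already removed, we can always combine four bytes and discard the surplus bits afterwards.

/* illustration: decode one character as if it were four bytes long, then
 * shift away the bits that actually belong to the following characters;
 * len is the byte length of the character, prefixes already removed */
uint32_t combine4(uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4, int len) {
	uint32_t x = (uint32_t)b1 << 18 | (uint32_t)b2 << 12 | (uint32_t)b3 << 6 | b4;
	return x >> (6 * (4 - len));
}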

So here is the game plan:

We want to process the input in chunks, but if we loaded the data directly into a single vector register, we'd need to figure out where the last full character in that register ends.
Instead of doing that, we can always look ahead three bytes and only ever treat those as continuation bytes; you'll see later why that works out quite nicely.
Here is the framework we'll be building on top of:

size_t utf8_to_utf32(char const *src, size_t count, uint32_t *dest) {
	size_t tail = 3;
	if (count < tail) return utf8_to_utf32_scalar(src, count, dest);

	size_t n = count - tail;
	uint32_t *destBeg = dest;

	for (size_t vl, vlOut; n > 0; n -= vl, src += vl, dest += vlOut) {
		vl = __riscv_vsetvl_e8m2(n);
		vuint8m2_t v0 = __riscv_vle8_v_u8m2((uint8_t const*)src, vl);

		/* TODO: extract b1/b2/b3/b4 */
		/* TODO: remove prefixes */
		/* TODO: combine to b1234 */

		__riscv_vse32(dest, b1234, vlOut);
	}

	/* reparse last character + tail */
	if (count > tail) {
		if ((src[0] >> 6) == 0b10) --dest;
		while ((src[0] >> 6) == 0b10 && tail < count)
			--src, ++tail;
	}
	size_t ret = utf8_to_utf32_scalar(src, tail, dest);
	if (ret == 0) return 0;
	return (size_t)(dest - destBeg) + ret;
}

The loop structure itself is relatively simple, but we also need to make sure we handle our lookahead of three bytes correctly.
For simplicity, we'll fall back to a scalar implementation for the tail, but we need to make sure we pass it a pointer to the start of a UTF-8 character.

Notice how we're loading a vuint8m2_t. This is a bit optimistic, but the idea is that we can get away with using 16 instead of all 32 registers. This essentially unrolls the loop and makes the scalar and mask operations cheaper.

Extracting nth UTF-8 bytes into b1/b2/b3/b4

Initially, my idea was to create a mask of all the first bytes of all UTF-8 characters, compress those, then shift the mask right to extract the second bytes, and so on.
Obtaining the mask is trivial: you just find all bytes that aren't continuation bytes: x >> 6 != 0b10.
RVV can't shift masks though, unless you treat the mask vector as a normal vector and assume it fits into 64-bit elements, or handle the carry yourself.

Instead, we can left-shift (vslide1down) the elements of the vector itself and leave the mask constant.
This comes back to why the lookahead works so well, as we can specify a scalar to shift into the last element.

The following brings the above ideas together:

/* IMPL: extract b1/b2/b3/b4 */
vuint8m2_t v1 = __riscv_vslide1down(v0, src[vl+0], vl);
vuint8m2_t v2 = __riscv_vslide1down(v1, src[vl+1], vl);
vuint8m2_t v3 = __riscv_vslide1down(v2, src[vl+2], vl);
/* mask of non-continuation bytes */
vbool4_t m = __riscv_vmsne(__riscv_vsrl(v0, 6, vl), 0b10, vl);
/* extract the first, second, third and fourth bytes of each character */
vuint8m2_t b1 = __riscv_vcompress(v0, m, vl);
vuint8m2_t b2 = __riscv_vcompress(v1, m, vl);
vuint8m2_t b3 = __riscv_vcompress(v2, m, vl);
vuint8m2_t b4 = __riscv_vcompress(v3, m, vl);

Removing the prefix

Removing the prefixes of b2/b3/b4 is trivial; we only need to mask out the two MSBs.

/* IMPL: remove prefixes */
/* remove prefix from trailing bytes */
vlOut = __riscv_vcpop(m, vl);
b2 = __riscv_vand(b2, 0b00111111, vlOut);
b3 = __riscv_vand(b3, 0b00111111, vlOut);
b4 = __riscv_vand(b4, 0b00111111, vlOut);
/* TODO: remove prefix from leading bytes */

For b1 we need to determine how many bits to mask off, depending on the prefix.
Fortunately, the first four bits are enough to determine which bits to mask off, so we could simply use a vrgather lookup table.
On current hardware, however, another approach seems to be faster.

Instead of a mask, we can also use a left shift followed immediately by a right shift by the same amount to remove the prefix bits.
The shift amounts for this can almost be calculated using a single saturating subtraction by 10:
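(ssubu denotes unsigned saturating subtraction, as in the vssubu instruction used below)

leading byte   high nibble (b1 >> 4)   prefix bits   ssubu(b1 >> 4, 10)
0xxxxxxx       0-7                     0             0
110xxxxx       12 or 13                3             2 or 3   <- 12 is the outlier
1110xxxx       14                      4             4
11110xxx       15                      5             5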

There is one outlier, which we can easily handle using a merge:

/* IMPL: remove prefix from leading bytes */
vuint8m2_t shift = __riscv_vsrl(b1, 4, vlOut);
shift = __riscv_vmerge(__riscv_vssubu(shift, 10, vlOut), 3,
                       __riscv_vmseq(shift, 12, vlOut), vlOut);
b1 = __riscv_vsll(b1, shift, vlOut);
b1 = __riscv_vsrl(b1, shift, vlOut);

The vrgather implementation should probably be revisited once more hardware is available.

Combining b1/b2/b3/b4 into b1234

As described above, we'll first treat all elements as four-byte UTF-8 characters and then right-shift the result into place.

The first half is trivial using widening operations; assuming we've already calculated the correct right-shift amount, we can simply widen it and apply:

/* IMPL: combine to b1234 */
/* unconditionally widen and combine to b1234 */
vuint16m4_t b34   = __riscv_vwaddu_wv(__riscv_vwmulu(b3,  1<<6,  vlOut), b4,  vlOut);
vuint16m4_t b12   = __riscv_vwaddu_wv(__riscv_vwmulu(b1,  1<<6,  vlOut), b2,  vlOut);
vuint32m8_t b1234 = __riscv_vwaddu_wv(__riscv_vwmulu(b12, 1<<12, vlOut), b34, vlOut);
/* TODO: compute shift amount */
b1234 = __riscv_vsrl(b1234, __riscv_vzext_vf4(shift, vlOut), vlOut);

Computing the shift amount is a bit more involved than last time, but it can be done as follows:
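character length   prefix shift   ssubu(shift, 2)   3 - that   * 6 = right shift
1 byte             0              0                 3          18
2 bytes            3              1                 2          12
3 bytes            4              2                 1          6
4 bytes            5              3                 0          0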

This maps directly to the following code:

/* IMPL: compute shift amount */
/* derive the required right-shift amount from `shift` to reduce
 * b1234 to the required number of bytes */
shift = __riscv_vmul(__riscv_vrsub(__riscv_vssubu(shift, 2, vlOut), 3, vlOut), 6, vlOut);

For validation, we use the approach described in "Validating UTF-8 In Less Than One Instruction Per Byte".
I won't go into full detail, but I'll summarize how it works:

There are seven error patterns in a two-byte UTF-8 sequence; they can all be detected by only looking at the first 12 bits.
We use three 4-bit lookup tables that map to a bitmask of the errors matching that pattern, and AND them together to determine whether there was an error.
The only errors not detected by this are related to 3/4-byte UTF-8 characters that have the wrong number of continuation bytes.
To detect these, the last remaining bit in the bitmask of the LUTs is used to indicate a pair of continuation bytes, which, combined with a few comparisons, manages to detect all invalid UTF-8 sequences.

Instead of doing three LMUL=2 vrgathers, we do six LMUL=1 vrgathers, as we don't need to cross any lanes, and this performs better because each gather has a less complex job to do (see the later discussion).
To help with that, we define the VRGATHER_u8m1x2 macro, which unrolls the vrgather for us.
Before we get into the implementation, we also need to define the error lookup tables and some constants we'll use later:

#define VRGATHER_u8m1x2(tbl, idx) \
	__riscv_vset(__riscv_vlmul_ext_u8m2( \
		__riscv_vrgather(tbl, __riscv_vget_u8m1(idx, 0), vl8m1)), 1, \
		__riscv_vrgather(tbl, __riscv_vget_u8m1(idx, 1), vl8m1))

static const uint64_t err1m[] = { 0x0202020202020202, 0x4915012180808080 };
static const uint64_t err2m[] = { 0xCBCBCB8B8383A3E7, 0xCBCBDBCBCBCBCBCB };
static const uint64_t err3m[] = { 0x0101010101010101, 0x01010101BABAAEE6 };
const vuint8m1_t err1tbl = __riscv_vreinterpret_u8m1(__riscv_vle64_v_u64m1(err1m, 2));
const vuint8m1_t err2tbl = __riscv_vreinterpret_u8m1(__riscv_vle64_v_u64m1(err2m, 2));
const vuint8m1_t err3tbl = __riscv_vreinterpret_u8m1(__riscv_vle64_v_u64m1(err3m, 2));
const size_t vl8m1 = __riscv_vsetvlmax_e8m1();
const size_t vl16m2 = __riscv_vsetvlmax_e16m2();

Now we just need to extract the 4-bit indices and do the lookup three times.

...
vuint8m2_t v3 = __riscv_vslide1down(v2, src[vl+2], vl);

vuint8m2_t s1 = __riscv_vreinterpret_u8m2(__riscv_vsrl(__riscv_vreinterpret_u16m2(v2), 4, vl16m2));
vuint8m2_t s3 = __riscv_vreinterpret_u8m2(__riscv_vsrl(__riscv_vreinterpret_u16m2(v3), 4, vl16m2));

vuint8m2_t idx2 = __riscv_vand(v2, 0xF, vl);
vuint8m2_t idx1 = __riscv_vand(s1, 0xF, vl);
vuint8m2_t idx3 = __riscv_vand(s3, 0xF, vl);

vuint8m2_t err1 = VRGATHER_u8m1x2(err1tbl, idx1);
vuint8m2_t err2 = VRGATHER_u8m1x2(err2tbl, idx2);
vuint8m2_t err3 = VRGATHER_u8m1x2(err3tbl, idx3);
vint8m2_t  errs = __riscv_vreinterpret_i8m2(
                  __riscv_vand(__riscv_vand(err1, err2, vl), err3, vl));
/* TODO: detect 3/4 byte errors */

To check for 3/4-byte errors, we check whether the previous input had a three- or four-byte character, which should be followed by two continuation bytes.
There is no error if we expect two continuations and get them, or if we don't expect two continuations and don't get them; this maps perfectly onto an XOR operation.
We use the fact that the top bit of our error bitmask indicates a pair of continuation bytes. Interpreting the byte as a signed number lets us easily check whether the MSB is set (x < 0) and whether any of the other bits are set (x > 0).

Finally, we test whether our error mask contains an error, and exit the function with an error code if so.

/* IMPL: detect 3/4 byte errors */
vbool4_t is_3 = __riscv_vmsgtu(v1, 0b11100000-1, vl);
vbool4_t is_4 = __riscv_vmsgtu(v0, 0b11110000-1, vl);
vbool4_t is_34 = __riscv_vmor(is_3, is_4, vl);
vbool4_t err34 = __riscv_vmxor(is_34, __riscv_vmslt(errs, 0, vl), vl);
vbool4_t errm = __riscv_vmor(__riscv_vmsgt(errs, 0, vl), err34, vl);

if (__riscv_vfirst(errm, vl) >= 0)
	return 0;

Finally, we mustn't forget about the first three bytes. They could, for example, all be continuation bytes, which would cause our loop to ignore them.
Before the start of our loop, we find the end of the third UTF-8 character and pass that to a scalar validation routine; here we reuse utf8_to_utf32_scalar for simplicity:

...
if (count < tail) return utf8_to_utf32_scalar(src, count, dest);
/* validate the first three bytes: skip any continuation bytes to find
 * the end of the third character, then reparse it with the scalar
 * routine (a reconstruction sketch; valid input has at most 6 bytes
 * before this boundary) */
size_t idx = tail;
while (idx < count && (src[idx] >> 6) == 0b10) ++idx;
uint32_t buf[8];
if (idx > 6 || !utf8_to_utf32_scalar(src, idx, buf))
	return 0;

Now for the fun part: making things faster.

ASCII

If we read an all-ASCII vector, we can skip the validation pass and simply widen and store the vector.
We use a max reduction to determine whether we only have ASCII bytes; we can also reuse its result for our other fast paths:

...
vuint8m2_t v0 = __riscv_vle8_v_u8m2((uint8_t const*)src, vl);
uint64_t max = __riscv_vmv_x(__riscv_vredmaxu(v0, __riscv_vmv_s_x_u8m1(0, vl), vl));

/* fast path: ASCII */
if (max < 0b10000000) {
	vlOut = vl;
	__riscv_vse32(dest, __riscv_vzext_vf4(v0, vlOut), vlOut);
	continue;
}

One and two byte UTF-8

This fast path needs to happen after validation.
We're only concerned with creating b12 from b1 and b2, which allows us to simplify the code from the general case quite a bit.

We don't need to bother with shifting to remove the prefix from b1: there are only two possibilities, and one of them is to do nothing, so a masked and fits perfectly.

We still can't use a masked widening multiply without first widening the destination operand, but we can use a simple vmerge to select between the two possible shift values.
The addition can then be done using a masked widening add, because the destination is already widened.

Now we just need to zero-extend again and we're done:

...
vuint8m2_t b2 = __riscv_vcompress(v1, m, vl);
vlOut = __riscv_vcpop(m, vl); /* needs to be moved up from its previous position to here */

/* fast path: one and two byte */
if (max < 0b11100000) {
	b2 = __riscv_vand(b2, 0b00111111, vlOut);
	vbool4_t m1 = __riscv_vmsgtu(b1, 0b10111111, vlOut);
	b1 = __riscv_vand_mu(m1, b1, b1, 63, vlOut);
	vuint16m4_t b12 = __riscv_vwmulu(b1, __riscv_vmerge(__riscv_vmv_v_x_u8m2(1, vlOut), 1<<6, m1, vlOut), vlOut);
	b12 = __riscv_vwaddu_wv_mu(m1, b12, b12, b2, vlOut);
	__riscv_vse32(dest, __riscv_vzext_vf2(b12, vlOut), vlOut);
	continue;
}

One, two and three byte UTF-8

I'll leave understanding this one as an exercise for the reader; note that the code points of all UTF-8 characters of three bytes and under fit into 16 bits.

/* fast path: one and two byte */
...

/* fast path: one, two and three byte */
if (max < 0b11110000) {
	vuint8m2_t b3 = __riscv_vcompress(v2, m, vl);
	b2 = __riscv_vand(b2, 0b00111111, vlOut);
	b3 = __riscv_vand(b3, 0b00111111, vlOut);

	vbool4_t m1 = __riscv_vmsgtu(b1, 0b10111111, vlOut);
	vbool4_t m3 = __riscv_vmsgtu(b1, 0b11011111, vlOut);

	vuint8m2_t t1 = __riscv_vand_mu(m1, b1, b1, 63, vlOut);
	b1 = __riscv_vand_mu(m3, t1, b1, 15, vlOut);

	vuint16m4_t b12 = __riscv_vwmulu(b1, __riscv_vmerge(__riscv_vmv_v_x_u8m2(1, vlOut), 1<<6, m1, vlOut), vlOut);
	b12 = __riscv_vwaddu_wv_mu(m1, b12, b12, b2, vlOut);
	vuint16m4_t b123 = __riscv_vwaddu_wv_mu(m3, b12, __riscv_vsll_vx_u16m4_mu(m3, b12, b12, 6, vlOut), b3, vlOut);
	__riscv_vse32(dest, __riscv_vzext_vf2(b123, vlOut), vlOut);
	continue;
}

Every 1 to 3 byte UTF-8 character can be represented as a single 16-bit UTF-16 code unit, so for the fast paths above we just need to widen and store to 16 instead of 32 bits.
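For the one- and two-byte fast path, for example, the store at the end becomes a plain 16-bit store (a sketch; assumes dest is now a uint16_t pointer):

/* sketch: b12 already holds 16-bit code units, so store them directly
 * instead of zero-extending to 32 bits */
__riscv_vse16(dest, b12, vlOut);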

Hey! The fast paths just got a bit faster!

UTF-8 to UTF-16

The next step starts right after we've computed the UTF-32 code point in the general path.
We treat all characters as two-word UTF-16 and convert them to surrogate pairs accordingly.
Then we vmerge between the original UTF-32 code point and the two-word UTF-16 code points.
Afterwards, we use the original UTF-32 code point to create a compression mask for our 16-bit words.
This can be done by matching the UTF-32 code points that fit into 16 bits: their upper two bytes are zero.

We predefine a mask m2even that is set at all even indices, plus two helpers, as we'll need to switch between treating the vector as 16-bit and as 32-bit elements.

size_t vl8m4 = __riscv_vsetvlmax_e8m4();
const vbool2_t m2even = __riscv_vmseq(__riscv_vand(__riscv_vid_v_u8m4(vl8m4), 1, vl8m4), 0, vl8m4);

#define DOWN __riscv_vreinterpret_u16m8
#define UP __riscv_vreinterpret_u32m8

Now we shift the bits around to create the surrogate representation of each code point, and then select between it and the UTF-32 code point using vmerge.
To obtain the correct UTF-16 output, we need to compress out all odd words that are zero, as these are only present for the single-word UTF-16 characters currently stored as UTF-32.

...
b1234 = __riscv_vsrl(b1234, ...);
/* convert [000000000000aaaa|aaaaaabbbbbbbbbb]
 * to      [110111bbbbbbbbbb|110110aaaaaaaaaa] */
vuint32m8_t sur = __riscv_vsub(b1234, 0x10000, vlOut);
sur = __riscv_vor(__riscv_vsll(sur, 16, vlOut),
                  __riscv_vsrl(sur, 10, vlOut), vlOut);
sur = __riscv_vand(sur, 0x3FF03FF, vlOut);
sur = __riscv_vor(sur, 0xDC00D800, vlOut);
/* merge 1-word b1234 and 2-word sur */
vbool4_t m4 = __riscv_vmsgtu(b1234, 0xFFFF, vlOut);
b1234 = __riscv_vmerge(b1234, sur, m4, vlOut);
/* swap b1234 two byte pairs */
/* compress and store */
vbool2_t mOut = __riscv_vmor(__riscv_vmsne(DOWN(b1234), 0, vlOut*2), m2even, vlOut*2);
b1234 = UP(__riscv_vcompress(DOWN(b1234), mOut, vlOut*2));
size_t vlDest = __riscv_vcpop(mOut, vlOut*2);
__riscv_vse16_v_u16m8(dest, DOWN(b1234), vlDest);

Finally, we need to adjust the tail handling to account for UTF-16 output.
Since we want to reparse the last character, we need to decrement the destination pointer if the last output word was a high surrogate.

/* reparse last character + tail */
if (count > tail) {
	if ((src[0] >> 6) == 0b10) --dest;
	while ((src[0] >> 6) == 0b10 && tail < count)
		--src, ++tail;
	/* go back one more when the last output was a high surrogate */
	if (dest[-1] >= 0xD800 && dest[-1] <= 0xDBFF)
		--dest;
}

Let's do one last optimization.
If you look at the general case again, you'll notice that we're doing twice the work needed whenever the average input character size is two or above.
To avoid that, we can lower the LMUL from two to one and unroll the code with an early exit between the iterations, as sketched below.
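The resulting loop has roughly the following shape (a sketch of the structure only; the elided bodies are the general case from above, now on LMUL=1 registers, each setting vlOut as before):

while (n > 0) {
	size_t vl = __riscv_vsetvl_e8m1(n), vlOut;
	/* ... general case at LMUL=1, sets vlOut ... */
	n -= vl; src += vl; dest += vlOut;
	if (n == 0) break;
	vl = __riscv_vsetvl_e8m1(n);
	/* ... the same code unrolled a second time ... */
	n -= vl; src += vl; dest += vlOut;
}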

It's a simple transformation, and you can browse the final code here.
It's a bit more complex than described, because it implements both the UTF-8 to UTF-16 and UTF-8 to UTF-32 conversions in a single function.

I used Lemire's unicode_lipsum dataset to measure the performance on inputs in different languages.
You can look at the benchmark code here; it reads an input file into a buffer and measures the average time a conversion of that input takes.

The code for the C920 needed to be manually backported to the draft 0.7.1 RVV version, which introduced a few pessimizations:

  • vmvNr needed to be replaced with one vmv.v.v and two vsetvli instructions.
  • vzext.vN needed to be replaced with N/2 vwaddu.vx and N/2+1 vsetvli instructions.
  • vredmax needed to be replaced with one vmsgtu.vx and one vfirst.m instruction.

    This shouldn't be needed for RVV 0.7.1, but the hardware I've got access to seems to produce the wrong result for vredmax. I haven't looked into this further.

The unicode_lipsum dataset is split into two parts.
One contains Lorem-ipsum-style text in various languages and was created by Hans Kratz for the simdutf8 project.

This gives us very dense input; the average character sizes are as follows: Arabic: 1.8, Chinese: 3.0, Emoji: 4.0, Hebrew: 1.8, Hindi: 2.7, Japanese: 2.9, Korean: 2.5, Latin: 1.0, Russian: 1.8.

Excluding the all-ASCII Latin input, we get an average speedup over scalar of 3x on the C920 and 3.5x on the C908. That's quite a lot!

We're lagging a bit in the Emoji test case, which consists solely of 4-byte UTF-8 emojis. We could add an extra fast path for that, but such input will rarely occur in the real world, so I don't think it's worth optimizing for.

The second part of the dataset consists of the source code of different translations of the Mars Wikipedia entry.
This is less dense, no average character length is above 2.0, but probably more representative, as a lot of text comes in some sort of structural format (XML, JSON, …) or has Arabic numerals, links, etc. in between the native-language characters.

Here, we get an average speedup over scalar of 3.6x on the C920 and 4.0x on the C908, excluding the all-ASCII English case.

Performance portability of the permutation instructions

We managed to get a speedup on real hardware, but we're still in the early days of RVV implementations.
Will the code have similar speedups on future hardware that may support larger vector lengths?

Most RVV instructions can be expected to perform well across implementations, but the permutation instructions, especially vrgather.vv and vcompress.vm, are harder to scale and already exhibit a wide range of performance characteristics.

I've compiled cycle estimates for all implementations I could get the data for:

*bobcat: Note that it was explicitly stated that they didn't optimize the permutation instructions.

*x280: the numbers are from llvm-mca, but I was told they match reality. There is also supposed to be a vrgather.vv fast path for vl<=256. I think they didn't have much incentive to optimize this, since the x280 targets AI workloads.

Now you can see why we decided to use six LMUL=1 gathers instead of three LMUL=2 gathers for our validation lookup tables.
Current implementations don't scale well to larger LMUL, and some won't perform well enough for our implementation at any LMUL.

I think the C920 results are the most representative of what to expect from future desktop CPUs, and we can safely ignore the bobcat/x280 cycle counts for that use case.
I suspect we'll see vrgather.vv perform well at LMUL=1, and then grow exponentially per element with larger LMUL, as an all-to-all mapping is quite expensive to scale.

vcompress.vm should scale better than vrgather.vv, as the work is subdividable, and I think we might see a range of implementations that scale almost linearly with respect to LMUL.

We've seen that vrgather.vv is really useful for quick lookup tables, but it provides a far more powerful all-to-all mapping than is required for that.
4-bit lookup tables in particular are very likely to be used by a bunch of other implementations as well, since the standard V extension guarantees a VLEN of at least 128, so a 4-bit LUT can always be assumed to fit into one vector register.
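As a minimal illustration (my sketch, in the same intrinsics style as above), such a 4-bit LUT at LMUL=1 is just a masked index plus one gather:

/* sketch: 16-entry byte lookup table via vrgather.vv at LMUL=1;
 * tbl holds the 16 table entries in its first 16 bytes */
static inline vuint8m1_t lut4(vuint8m1_t tbl, vuint8m1_t x, size_t vl) {
	return __riscv_vrgather(tbl, __riscv_vand(x, 0xF, vl), vl);
}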

I don't know much about hardware design, but implementing an LMUL=1 vrgather.vv for very long vector implementations that performs, per element, as fast as one for a shorter vector length seems almost impossible.
If you expect to use it for its full capabilities, that's not a big problem; often you just need to get the elements into a specific permutation. But for 4-bit lookup tables we only care about selecting one of 16 values.

Hardware might be able to have a special-case/fast-path implementation for smaller indices, but we can't expect most implementations to support that.
I think adding a dedicated vrgatherei4 instruction, which only uses the 4 LSBs as indices, to the specification might be a good idea.

We've managed to effectively use the RISC-V Vector extension to speed up UTF-8 to UTF-16 conversion by, on average, 3 to 4 times on real hardware.
Upstreaming this and other conversion functions to the simdutf library is still work in progress, and I'll get back to working on it after my exams.
Hopefully, this can contribute to bridging the optimization gap between RISC-V and other architectures, and maybe even encourage some people to give porting code to RVV a shot.

I hope you could follow this write-up, have a nice day :3

If you're wondering how the RISC-V hardware compares to existing hardware of other architectures, here are the measurements for the dataset from above, along with the simdutf implementations for the other architectures.
