Unpacking 5 Billion Varints in Only 4 Billion CPU Cycles

2023-06-04 11:46:44

Varint is a well-known technique for compressing integer streams. The idea is that it can be more efficient to encode a number using a variable-length representation instead of a fixed-size binary one. By removing leading zeros from the binary number, the overall size of the representation can be reduced. This technique works particularly well for encoding smaller numbers.

In this article, I give a brief introduction to and rationale for varint encoding. I then describe the Stream VByte format, which allows fully vectorized decoding using SSSE3 instructions, and share my findings from implementing this algorithm in Rust, including both encoding and decoding primitives and the ability to read data from both RAM and disk.

The algorithm was tested on several platforms:

CPU               Base Freq. (GHz)   Turbo Freq. (GHz)   Result (GElem/s)
Xeon E3-1245 v5   3.5                3.9                 5.0
Core i7-1068NG7   2.3                4.1                 5.5

The decoding speed is 0.75 CPU cycles per integer on average.

Varint compression is widely used in various contexts:

  • It is used in serialization formats to achieve a more compact state-transfer representation. For example, Protobuf employs varint compression.
  • Database engines often use varint compression (for example, the SQLite B-tree page format).
  • Search engines rely heavily on varint compression to compress document lists containing the IDs of documents in which a particular word occurs (so-called posting lists).
  • It can be argued that UTF-8 is a form of varint encoding. However, it is a special variant crafted to maintain compatibility with the binary representation of ASCII text.

Despite its success, varint compression faces a particular challenge: slow decoding speed. To understand why, it is necessary to look at how classical varint encoding works.

In traditional varint encoding, the most significant bit of each byte is reserved to indicate whether the byte is a continuation of the previous byte. The remaining bits carry the actual number.

Here is how numbers are encoded:

  • Numbers that fit into 7 bits (excluding leading zero bits) are encoded as 0xxxxxxx.
  • Numbers with 14 bits are encoded as 0xxxxxxx 1xxxxxxx.
  • Numbers with 21 bits are encoded as 0xxxxxxx 1xxxxxxx 1xxxxxxx, and so on.
  • A 32-bit number in this scheme is encoded as 5 bytes: 0000xxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx.
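To make this concrete, here is a minimal scalar codec sketch in Rust. Note one caveat: it uses the common LEB128/Protobuf convention (a set high bit means more bytes follow), whose bit layout differs slightly from the scheme sketched above but exhibits the same sequential behavior; the function names are mine.

```rust
// Minimal scalar varint codec sketch (LEB128 convention: the high bit of
// a byte signals that another byte follows). Illustrative only.

fn encode_varint(mut n: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7F) as u8; // low 7 bits
        n >>= 7;
        if n == 0 {
            out.push(byte); // last byte: high bit clear
            break;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}

/// Returns the decoded number and how many bytes it consumed.
fn decode_varint(buf: &[u8]) -> (u32, usize) {
    let mut n = 0u32;
    let mut shift = 0;
    for (i, &b) in buf.iter().enumerate() {
        n |= ((b & 0x7F) as u32) << shift;
        if b & 0x80 == 0 {
            return (n, i + 1);
        }
        shift += 7;
    }
    panic!("truncated varint");
}
```

Decoding must proceed strictly left to right: the offset of number N+1 is known only after number N has been fully decoded.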

However, this approach introduces a serious data dependency into the format. Decoding of the next number can only begin after the previous number has been decoded, because the offset at which the next number starts in the byte stream must be determined first. As a result, instructions cannot be executed in parallel on modern CPUs, which hurts performance.

Varint decoding can be vectorized using various techniques, including the patented varint-G8IU. One elegant solution, in my opinion, is the Stream VByte format proposed by Daniel Lemire, Nathan Kurz, and Christoph Rupp.

The approach is as follows: we separate the length information and the number data into independent streams, which allows us to decode a group of numbers in parallel.

Consider this observation: a u32 number has only 4 possible lengths in bytes. These lengths can be represented using 2 bits (00 for length 1, 11 for length 4). So using 1 byte, we can encode the lengths of four u32 numbers. We refer to this byte as the “control byte”. The sequence of control bytes forms the control stream. The second stream, called the data stream, contains the bytes of the compressed numbers laid out sequentially, without any 7-bit shenanigans.
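As a sketch of the encoding side (helper names are mine, and the layout packs the first number's length code into the most significant bits of the control byte, matching the worked example; the reference streamvbyte library may order the codes differently):

```rust
// Illustrative sketch: encode one group of four u32 numbers into a single
// control byte plus data-stream bytes. Length codes: 00 = 1 byte, 11 = 4
// bytes; the first number's code occupies the two most significant bits.

fn byte_len(n: u32) -> usize {
    // Number of bytes needed, 1..=4 (zero still occupies one byte).
    (4 - n.leading_zeros() as usize / 8).max(1)
}

fn encode_quad(nums: [u32; 4], control: &mut Vec<u8>, data: &mut Vec<u8>) {
    let mut ctrl = 0u8;
    for (i, &n) in nums.iter().enumerate() {
        let len = byte_len(n);
        ctrl |= ((len - 1) as u8) << (6 - 2 * i);
        // Keep only the `len` significant bytes, little-endian.
        data.extend_from_slice(&n.to_le_bytes()[..len]);
    }
    control.push(ctrl);
}
```

Input counts that are not a multiple of four are handled by padding the last group, a detail omitted here.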

Let’s take an example. Suppose we encode the following four numbers: 0x00000011, 0x00002222, 0x00333333, and 0x44444444. In the encoded format, they would appear as follows:

   0x1B    <- 0b00_01_10_11 – lengths 1, 2, 3 and 4 respectively

   0x11, 0x22, 0x22, 0x33, 0x33,
   0x33, 0x44, 0x44, 0x44, 0x44

Now we can read a single control byte, decode the lengths of four u32 numbers, and then decode the numbers one by one. This is already an improvement over the original scalar decoding. However, we can achieve even better performance. In fact, we can decode all four numbers in just one CPU instruction!
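The per-group scalar decoder might look like this (a sketch under the same assumed layout as the example above, with the first length code in the most significant bits) — note that all four lengths are available up front from the single control byte:

```rust
// Decode one group of four u32 numbers given its control byte and a slice
// of the data stream. Returns the numbers and the count of data bytes used.

fn decode_quad(ctrl: u8, data: &[u8]) -> ([u32; 4], usize) {
    let mut out = [0u32; 4];
    let mut pos = 0;
    for i in 0..4 {
        // The first number's 2-bit length code sits in the high bits.
        let len = ((ctrl >> (6 - 2 * i)) & 0b11) as usize + 1;
        let mut bytes = [0u8; 4];
        bytes[..len].copy_from_slice(&data[pos..pos + len]);
        out[i] = u32::from_le_bytes(bytes); // zero-extend to a full u32
        pos += len;
    }
    (out, pos)
}
```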

If we think about it carefully, all we need to do is insert zeros at the appropriate positions to align the numbers correctly.

And there is an instruction for exactly that.

PSHUFB SSSE3 instruction
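PSHUFB permutes the 16 bytes of an SSE register according to a byte mask, producing a zero wherever the mask byte has its high bit set — exactly the “insert zeros” operation we need. The scalar emulation below (my own code, using the same assumed control-byte layout as the example above) shows the idea; a real decoder would precompute the 256 possible masks in a lookup table and apply them with the `_mm_shuffle_epi8` intrinsic:

```rust
// Scalar emulation of PSHUFB (_mm_shuffle_epi8): each output byte selects
// a source byte by index, or becomes zero if the mask byte's high bit is
// set.

fn pshufb(src: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        if mask[i] & 0x80 == 0 {
            out[i] = src[(mask[i] & 0x0F) as usize];
        }
        // High bit set -> the output byte stays zero.
    }
    out
}

// Build the shuffle mask for one control byte (first length code in the
// high bits): data byte j of number i lands in byte j of output lane i.
fn mask_for(ctrl: u8) -> [u8; 16] {
    let mut mask = [0x80u8; 16]; // 0x80 = "emit a zero byte"
    let mut src = 0u8;
    for i in 0..4 {
        let len = ((ctrl >> (6 - 2 * i)) & 0b11) as usize + 1;
        for j in 0..len {
            mask[4 * i + j] = src;
            src += 1;
        }
    }
    mask
}
```

With control byte 0b00_01_10_11 and the ten data bytes from the example loaded into a 16-byte register, a single shuffle yields the four little-endian u32 numbers in their lanes.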