Is it time to stop using sentinel values for null / “NA” values?

2023-06-05 07:44:54

This blog post discusses the design and performance implications of using
bitmaps to mark null values instead of sentinel values (or special values like
NaN).

Depending on who you talk to, one controversial aspect of the Apache Arrow
columnar format is the fact that it uses a validity bitmap to mark values as
null or not null. Here is the basic idea:

  • The number of nulls is a property of an array, typically computed once and
    then stored
  • If an array has any null values at all, then a bitmap must be allocated
  • If the value at bit i is 0, then the value is null; otherwise it is not
    null (a sketch of this bit test follows the list)
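As a concrete illustration of that bit test, here is a minimal sketch (a
hypothetical helper written for this post, not the Arrow implementation):

#include <cstdint>

// Returns true if the value at index i is valid (not null).
// Bit i lives in byte i / 8, at position i % 8 within that byte.
inline bool IsValid(const uint8_t* valid_bitmap, int64_t i) {
  return (valid_bitmap[i / 8] >> (i % 8)) & 1;
}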

There are some nice benefits to the bitmap approach:

  • If an array has no nulls, the bitmap can be ignored or simply not
    allocated at all
  • Nulls can be propagated in algorithms by using word-wise AND or OR
    operations, processing 32 or 64 values at a time (sketched below)
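For example, the validity bitmap of an elementwise binary operation can be
computed by ANDing the two input bitmaps a word at a time. A minimal sketch (my
own illustration, not Arrow's implementation):

#include <cstdint>
#include <cstring>

// out[i] is valid only where both left[i] and right[i] are valid.
void AndValidityBitmaps(const uint8_t* left, const uint8_t* right,
                        uint8_t* out, int64_t length) {
  const int64_t num_bytes = (length + 7) / 8;
  int64_t i = 0;
  // Process 8 bytes (64 validity bits) per iteration.
  for (; i + 8 <= num_bytes; i += 8) {
    uint64_t l, r;
    std::memcpy(&l, left + i, 8);
    std::memcpy(&r, right + i, 8);
    const uint64_t o = l & r;
    std::memcpy(out + i, &o, 8);
  }
  // Handle any remaining bytes one at a time.
  for (; i < num_bytes; ++i) {
    out[i] = left[i] & right[i];
  }
}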

However, for systems like R which use special values to mark nulls, bitmaps can
be controversial. The argument generally centers on the performance
implications of having to check bits rather than comparing a value with
INT32_MIN or checking whether it is NaN. It is also noted that NaN values are
propagated automatically by the CPU in arithmetic operations, so propagating
nulls in the bitmap case could be fundamentally slower.

From the perspective of databases and data warehousing, reserving certain
values to mark a null (or NA) is widely considered unacceptable. NaN is valid
data, as are INT32_MIN and other common values used as sentinels.

Most SQL systems store the null-ness of a value using an extra bit or byte. File
formats like Apache Avro and Parquet have a separate null/not-null encoding.
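Schematically (this is only an illustration of the general idea, not the layout
used by any particular engine or file format), a byte-per-value validity flag
looks something like:

#include <cstdint>

// One value plus one validity byte. A columnar system would typically keep
// the validity flags in a separate array rather than interleaving them.
struct NullableInt32 {
  int32_t value;
  uint8_t is_valid;  // 1 = not null, 0 = null
};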

To show that the performance argument against bitmaps is not persuasive, I
wrote some in-memory performance benchmarks (I would like to do some
experiments with large memory-mapped datasets as well). To keep things simple,
let's consider the sum operation, which must aggregate all of the non-null
values in an array. We will also track the non-null count.

We are interested in a few typical cases (one way to generate such test data is
sketched after this list):

  • Data with no nulls
  • Data with a small percentage of nulls, say 10%
  • Data with a high percentage of nulls, say 50%
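Here is one way such test arrays could be generated; this is my own sketch, and
the actual benchmark code may set things up differently:

#include <cstdint>
#include <random>
#include <vector>

// Generate 'length' doubles plus a validity bitmap in which roughly
// 'null_probability' of the entries are marked null.
void MakeTestData(int64_t length, double null_probability,
                  std::vector<double>* values, std::vector<uint8_t>* bitmap) {
  std::mt19937_64 rng(42);
  std::uniform_real_distribution<double> value_dist(0.0, 1.0);
  std::bernoulli_distribution null_dist(null_probability);

  values->assign(length, 0.0);
  bitmap->assign((length + 7) / 8, 0);
  for (int64_t i = 0; i < length; ++i) {
    (*values)[i] = value_dist(rng);
    if (!null_dist(rng)) {
      // Mark the value as valid (not null).
      (*bitmap)[i / 8] |= static_cast<uint8_t>(1 << (i % 8));
    }
  }
}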

To keep track of the result of a sum, I use a simple struct, templated on the
type of the values:

template <typename T>
struct SumState {
  SumState() : total(0), valid_count(0) {}
  T total;
  int64_t valid_count;
};

The naive sum looks like this:

template <typename T>
struct SumNoNulls {
  static void Sum(const T* values, int64_t length, SumState<T>* state) {
    for (int64_t i = 0; i < length; ++i) {
      state->total += *values++;
    }
    state->valid_count += length;
  }
};
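For illustration only (this snippet is not part of the benchmarks), a call
would look like:

#include <vector>

int main() {
  std::vector<double> values = {1.0, 2.0, 3.0};
  SumState<double> state;
  SumNoNulls<double>::Sum(values.data(), values.size(), &state);
  // state.total is now 6.0 and state.valid_count is 3
  return 0;
}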

Let's consider double-precision floating point values, where NaN is a common
sentinel value. A sum function for double that skips NaN values is:

template <typename T>
struct SumWithNaN {
  static void Sum(const T* values, int64_t length, SumState<T>* state) {
    for (int64_t i = 0; i < length; ++i) {
      if (*values == *values) {
        // NaN is not equal to itself
        state->total += *values;
        ++state->valid_count;
      }
      ++values;
    }
  }
};

If we represent nulls with bits, we could write the sum naively like so:

#include "arrow/util/bit-util.h"

template <typename T>
struct SumBitmapNaive {
  static void Sum(const T* values, const uint8_t* valid_bitmap,
                  int64_t length, SumState<T>* state) {
    for (int64_t i = 0; i < length; ++i) {
      if (BitUtil::GetBit(valid_bitmap, i)) {
        state->total += *values;
        ++state->valid_count;
      }
      ++values;
    }
  }
};

It is possible to go much faster using bitmaps with a few optimization
techniques:

  • Eliminate the "if" statements by using floating point operations to
    conditionally add the non-null values
  • "Unroll" part of the for loop to sum 8 values at a time
  • Skip null checking for groups of 8 values with no nulls. This is made faster
    still by generating a pre-populated table of bit popcounts for uint8_t
    values.

Here is the meat of this new vectorized sum algorithm:

const int64_t whole_bytes = length / 8;
for (int64_t i = 0; i < whole_bytes; ++i) {
  const uint8_t valid_byte = valid_bitmap[i];

  if (valid_byte < 0xFF) {
    // Some nulls
    state->total += (values[0] * (valid_byte & 1)) +
      (values[1] * ((valid_byte >> 1) & 1)) +
      (values[2] * ((valid_byte >> 2) & 1)) +
      (values[3] * ((valid_byte >> 3) & 1)) +
      (values[4] * ((valid_byte >> 4) & 1)) +
      (values[5] * ((valid_byte >> 5) & 1)) +
      (values[6] * ((valid_byte >> 6) & 1)) +
      (values[7] * ((valid_byte >> 7) & 1));
    state->valid_count += kBytePopcount[valid_byte];
  } else {
    // No nulls
    state->total += values[0] + values[1] + values[2] + values[3]
      + values[4] + values[5] + values[6] + values[7];
    state->valid_count += 8;
  }
  values += 8;
}
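The snippet above only handles groups of 8 values; a scalar tail loop (not
shown) takes care of the remaining length % 8 values. The kBytePopcount lookup
table is also not shown; one way to build such a table (my own sketch, which
may differ from the benchmark code) is:

#include <array>
#include <cstdint>

// Number of set bits for each possible uint8_t value, computed once.
static const std::array<uint8_t, 256> kBytePopcount = [] {
  std::array<uint8_t, 256> table{};
  for (int v = 0; v < 256; ++v) {
    uint8_t count = 0;
    for (int bit = 0; bit < 8; ++bit) {
      count += (v >> bit) & 1;
    }
    table[v] = count;
  }
  return table;
}();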

To give a fair comparison with a sum without nulls, I also wrote similar
versions that sum 8 values at a time for the non-nullable case and the
NaN sentinel case. I also added the same benchmarks with int64 values
using INT64_MIN as the sentinel value, in case integer operations perform
differently from floating point operations.
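For reference, the unrolled NaN-sentinel version might look something like the
following; this is my reconstruction (with a hypothetical name), not
necessarily the exact code used in the benchmarks:

template <typename T>
struct SumWithNaNBatched {
  static void Sum(const T* values, int64_t length, SumState<T>* state) {
    const int64_t length_rounded = (length / 8) * 8;
    for (int64_t i = 0; i < length_rounded; i += 8) {
      // Process 8 values per outer iteration, skipping NaN sentinels.
      for (int64_t j = 0; j < 8; ++j) {
        const T v = values[i + j];
        if (v == v) {  // false only when v is NaN
          state->total += v;
          ++state->valid_count;
        }
      }
    }
    // Scalar tail loop for the remaining length % 8 values.
    for (int64_t i = length_rounded; i < length; ++i) {
      const T v = values[i];
      if (v == v) {
        state->total += v;
        ++state->valid_count;
      }
    }
  }
};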


The complete benchmarking code is here. Below are the performance results
for the three cases where there are 0%, 10%, and 50% nulls, respectively. The size
of the array being summed is 10 million values. The machine is my Xeon E3-1505M
Linux laptop.

It is definitely possible that I am doing something suboptimal in my
implementations. I would love to hear from algorithms experts!


Benchmark results

Some observations on these benchmark results:

  • The "no nulls" sum functions, both vectorized and not, perform the same for
    all 3 cases, since they do no null checking at all
  • When there are no nulls, there is no performance penalty with the optimized
    bitmap sum
  • When there are 10% nulls, the bitmap sum is about 30% faster than the
    sentinel-based sum. The performance gap widens as the percentage of nulls
    grows.
  • The vectorized bitmap sum is faster when there are 50% nulls than when there
    are 10% nulls. I am not sure why this is; I would speculate that because on
    average many of the terms in the sum are 0, the processor is able to
    avoid some floating point operations. It might be possible to further
    optimize by reducing the number of dependent FP operations using a "binary
    tree" sum (see the sketch after this list).
  • Vectorization / batching brings little benefit to the sentinel-based
    algorithm
Using a bit- or byte-based masked null representation is a requirement for a
common memory format to be accepted by database systems as well as data science
libraries, so using sentinel values at all is not really an option in a project
like Apache Arrow. But, in many cases, using bitmaps also allows faster algorithms
to be developed than are possible with the sentinel-value-based null
representation used in pandas and R.

I am not a perfect programmer, so I will be curious to see what other kinds of
optimizations others can cook up. Take a look at the benchmark code!
