# The Pluto Scarab — Hash Functions

*by* Phil Tadros

## Hash Functions

Hash functions are functions that map a bit vector to another bit vector, usually shorter than the original vector and usually of fixed size for a particular function.

There are three main uses for hash functions:

- Fast table lookup
- Message digests
- Encryption

**Fast Table Lookup**

Fast table lookup can be implemented using a hash function and a hash table. Elements are found in the hash table by calculating the hash of the element's key and using the hash value as the index into the table. This is obviously faster than other methods, such as examining each element of the table sequentially to find a match.
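The idea can be sketched as follows (a hypothetical, deliberately tiny table using linear probing; real implementations also handle resizing and missing keys):

```csharp
using System;

// Minimal hash-table sketch: the key's hash, reduced modulo the table
// size, gives the slot index directly -- no sequential scan is needed.
// Collisions are resolved by linear probing to the next free slot.
public class TinyHashTable
{
    private readonly string[] keys = new string[16];
    private readonly int[] values = new int[16];

    private int Slot(string key)
    {
        uint hash = 0;
        foreach (char c in key)
            hash = hash * 33 + c;            // multiply-by-33 string hash
        int i = (int)(hash % 16);
        while (keys[i] != null && keys[i] != key)
            i = (i + 1) % 16;                // linear probe on collision
        return i;
    }

    public void Add(string key, int value)
    {
        int i = Slot(key);
        keys[i] = key;
        values[i] = value;
    }

    public int Get(string key) => values[Slot(key)];
}
```

Because lookup recomputes the same probe sequence as insertion, `Get` lands on the same slot that `Add` used.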

**Message Digests**

Message digests allow you to compare two large bit vectors and quickly determine whether they are equal. Instead of comparing the vectors bit by bit, if the hash values of each bit vector are available you can compare the hash values. If the hash values are different, the original vectors must be different. If the hash values are the same, the original vectors are very likely to be the same if the hash function is good.

Message digests can use either cryptographic or non-cryptographic hash functions. If the purpose of the message digest is to determine whether the original message has been tampered with, you need to use a cryptographic hash function. If you just want to quickly tell whether a file is the same as another file with a different name (assuming the hash values have already been computed), you can use a non-cryptographic hash function.
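A minimal sketch of the non-cryptographic case, using the multiply-by-33 style of hash that appears later in this article (the helper name is invented for illustration): unequal digests prove the vectors differ, while equal digests only make equality very likely.

```csharp
using System;

public static class DigestCompare
{
    // A simple non-cryptographic digest over a byte vector. Comparing two
    // precomputed digests is O(1); comparing the vectors themselves is O(n).
    public static uint Digest(byte[] data)
    {
        uint hash = 0;
        foreach (byte b in data)
            hash = hash * 33 + b;
        return hash;
    }
}
```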

**Encryption**

Encryption is the transformation of data into a form unreadable by anyone without a secret decryption key. Hash functions play an important role in encryption because it is their properties that cause the encrypted data to be unreadable and the original data to be unrecoverable from the encrypted data without the decryption key.

Hash functions in this context are sometimes given other names, such as mixing functions.

A hash function maps an m-bit vector onto another, usually shorter, n-bit vector. The result is uniformly distributed, which means that for an input vector chosen at random, each output bit is equally likely to be 0 or 1 and is not correlated with the other bits (unless the size of the range is not a power of two, in which case the high bits will show correlations).

Typically m > n, which is why hash functions are referred to as *compression functions* in some applications. Because the function is non-invertible, not all m-bit input vectors can be losslessly compressed by the same function, or even by different functions if you count the bits required to indicate which compression function is to be used.

## Properties of Hash Functions

For a function to be useful as a hash function, it must exhibit the property of *uniform distribution*. Some hash functions also have the *one-way* property. If a hash function is to be used for cryptography, or for fast table lookup where the nature of the keys is unknown, the one-way property is a requirement.

**Uniform Distribution**

All good hash functions share the property of *collision avoidance*: the desired behavior is for unique inputs to give unique outputs. If the length of the input vector is greater than the length of the output vector, it is impossible to avoid all collisions because the set of input vectors is of greater order than the set of output vectors. The hash function partitions the input set into subsets of input vectors that all produce the same output.

Since collisions cannot be avoided, the goal is to minimize their probability given two arbitrary input vectors. Minimization occurs when the size of the largest partition is minimized, that is, when the input vectors are distributed as evenly as possible among the possible outputs. When this happens we say the output values are *uniformly distributed*.

**One-Way Functions**

Another important goal is that "similar" input vectors produce very different output vectors. For table lookup, table keys are usually human-language strings, index numbers, or other data that exhibit non-randomness. For message digest functions, the goal is to make it as difficult as possible to find two different messages that result in the same hash. A hash function with this property is called a *one-way* hash function.

Not all applications require hash functions to be one-way, but all cryptographic applications do. For example, a hash value for fast table lookup where the keys are telephone numbers could simply be the last four digits of the phone number. This hash function is suitable because it is likely to be uniformly distributed, but it is clearly not one-way because it is easy to devise a phone number that has a particular hash value.
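A sketch of that phone-number hash (the helper is invented for illustration), which also makes the failure of the one-way property concrete: any number ending in the target digits is a preimage.

```csharp
using System;

public static class PhoneHash
{
    // Likely uniform (the last four digits are close to random across
    // subscribers), but not one-way: given a target hash value, any
    // number ending in those four digits maps to it.
    public static int Hash(long phoneNumber) => (int)(phoneNumber % 10000);
}
```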

Collision! Two different input vectors produce the same output vector. For hash tables, this means the second table insertion will be slower because an alternate location will have to be used. For non-cryptographic message digests, it means that two messages will appear to be the same when they are not. For cryptographic message digests, it means the contents of the message and/or the identity of the sender can be undetectably altered.

## Simple Hash Function

Figure 1.

```csharp
uint Hash(string s)
{
    uint hash = 0;
    foreach (char c in s)
    {
        hash = hash * 33 + (uint) c;
    }
    return hash;
}
```

Let's look at a simple hash function. Figure 1 shows a hash function intended to be used for a hash table where the keys are character strings.

The hash value is initialized to zero at the beginning of the function; then, for each character in the string, the hash value is multiplied by 33 and the character's Unicode code point value is added to it. No overflow checking is specified, so in C# the value will simply wrap around within the range of System.UInt32 if it gets big enough.

The multiplication by 33 is called the *mixing step*, and the function *f* : `uint` → `uint` defined by *f*(*x*) = *x* * 33, where * is the C# multiply-without-overflow operator, is called the *mixing function*. The constant 33 is chosen arbitrarily.

The addition of the Unicode code point value of the character is called the *combining step*, and the function *g* : `uint` × `uint` → `uint` defined by *g*(*x*, *y*) = *x* + *y*, where + is the C# add-without-overflow operator, is called the *combining function*.

Note that this particular hash function has dubious uniform distribution properties. We'll evaluate that more later.
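As a quick check of the arithmetic in Figure 1 (restated here as a self-contained class): for the string "abc" (code points 97, 98, 99), the value is ((0·33 + 97)·33 + 98)·33 + 99 = 108966.

```csharp
using System;

public static class StringHash
{
    // The Figure 1 hash: multiply by 33 and add each Unicode code point,
    // letting the uint wrap on overflow (C# arithmetic is unchecked by default).
    public static uint Hash(string s)
    {
        uint hash = 0;
        foreach (char c in s)
            hash = hash * 33 + (uint) c;
        return hash;
    }
}
```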

## Hash Function Structure

Figure 2.

```csharp
uint Hash(string s)
{
    // initialization
    int n = s.Length;
    uint hash = 0x79FEF6E8;
    int i = 0;

    // process each block
    while (i < n - 1)
    {
        // combining step
        hash += (uint) s[i++] + (((uint) s[i++]) << 16);
        // mixing step
        hash = Mix(hash);
    }

    // process partial block
    if (i < n)
    {
        // combining step
        hash += (uint) s[i++];
        // mixing step
        hash = Mix(hash);
    }

    // post-processing step
    hash = Mix(Mix(hash));
    return hash;
}

uint Mix(uint hash)
{
    hash += hash << 11;
    hash ^= ~hash >> 5;
    hash -= hash << 13;
    return hash;
}
```

The previous hash function was too simple to illustrate all of the structures commonly found in hash functions. Figure 2 shows a more complete example.

- First is the initialization step, where the internal state of the hash function is readied.
- Next, the message is split up into zero or more blocks of equal size, and the internal state of the hash function is updated for each block. In this example, the message is divided into 32-bit blocks. For each block, the combining function is applied to the prior state and the bits from the current block, and then the mixing function is called. This step is repeated for each full-sized block in the message.
- Then the final, partial block is processed, if necessary. The combining step is modified to accommodate the incomplete block.
- Finally, some final processing is done to further randomize the internal state. In this case, the mixing step is applied two more times.

If you examine the hash function from Figure 1 again, you'll see that it almost follows this structure. The initialization step is there (the hash value is set to zero), and the message is processed in blocks of 16 bits. There is no post-processing step, or rather there is an "empty" post-processing step.

The main difference is that in Figure 1 the mixing step is performed before the combining step for each block. And since there is no post-processing step, it is clear that the bits from the last message block are not processed by the mixing function and therefore do not affect the upper 16 bits of the hash value. If you were to use this hash function for hash table lookup, you would want to make sure you use the lower-order bits of the hash value instead of the upper-order bits.

Later we'll examine the mysterious constants in this program, in the initialization step and in the mixing function, to determine whether this is a "good" hash function.

**Padding the Final Block**

For non-cryptographic use, it is not terribly important how the partial final block is handled. Usually it is fine just to pad the message with 0's. But this is unacceptable for cryptographic use. If all you do is pad with 0's, then it is easy to find two messages with the same hash value: just take a known message with a partial final block and append a zero to it. When the additional 0's are added for padding, the two messages look identical. If you just pad with 1's instead, then you can add a 1 bit to an existing message.

For this and other reasons, cryptographic hash functions always pad the original message with a number of bits even if the original message was an even block size, and they always incorporate the message length into the hash calculation. The SHA-1 hash function, for example, always pads the message with a 1 bit, and the last 64 bits of the last block contain the message length, in bits. The bits between the final 1 and the length bits are set to 0. This scheme prevents an attacker from devising two different messages that are identical after the padding operation.
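A byte-oriented sketch of that padding scheme (SHA-1 itself is specified at the bit level over 512-bit blocks; this sketch assumes whole-byte messages and is for illustration only):

```csharp
using System;
using System.Collections.Generic;

public static class Padding
{
    // SHA-1-style padding over 64-byte blocks: append the mandatory 1 bit
    // (as the byte 0x80), zero-fill until 8 bytes remain in the block, then
    // append the 64-bit message length in bits, big-endian.
    public static byte[] Pad(byte[] message)
    {
        ulong bitLength = (ulong)message.Length * 8;
        var padded = new List<byte>(message);
        padded.Add(0x80);                              // the 1 bit
        while (padded.Count % 64 != 56)
            padded.Add(0x00);                          // zero fill
        for (int i = 7; i >= 0; i--)
            padded.Add((byte)(bitLength >> (8 * i))); // length in bits
        return padded.ToArray();
    }
}
```

Note that a 56-byte message no longer fits its own block once the 0x80 byte and the 8 length bytes are appended, so it pads out to two blocks (128 bytes); this is why distinct messages can never collide after padding.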

## Mixing Functions

The heart of a hash function is its mixing step. The behavior of the mixing function largely determines whether the hash function is collision-resistant. Therefore it should not be surprising that good mixing functions have the same attributes as good hash functions, namely collision resistance and uniform distribution.

The most notable difference between a mixing function and a hash function is that the input and output of a mixing function are the same size. The purpose of the mixing function is to "scramble" or mix the internal state of the hash function. The input to the function is the current internal state, and the output of the function becomes the new internal state.

Because the input and output are the same size, mixing functions *can* attain full collision resistance, and they should. If the mixing function exhibits collisions, the hash function will exhibit more collisions than necessary. Collisions are guaranteed not to happen if the mixing function is *reversible*, that is, if it is possible to determine the input of the mixing function by examining the output. If the function is not reversible, it means that there are at least two inputs that result in the same output. By definition, that is a collision.
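To see the pigeonhole argument in action, here is a toy non-reversible 4-bit step (invented for illustration, not from the article), enumerated over all sixteen inputs:

```csharp
using System;
using System.Collections.Generic;

public static class CollisionDemo
{
    // A non-reversible 4-bit mixing step: hash += hash >> 1 (mod 16).
    public static int Mix(int hash) => (hash + (hash >> 1)) & 0xF;

    // Count how many of the 16 inputs land on an output value that was
    // already produced by an earlier input.
    public static int CountCollidingInputs()
    {
        var seen = new HashSet<int>();
        int collisions = 0;
        for (int x = 0; x < 16; x++)
            if (!seen.Add(Mix(x)))
                collisions++;
        return collisions;
    }
}
```

Three of the sixteen inputs collide with another input; for example, 2 and 13 both map to 3, so any hash built on this step inherits those collisions.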

**Reversible Operations**

Figure 3

*Reversible operations*

```csharp
hash ^= constant;
hash *= constant; // if constant is odd
hash += constant;
hash -= constant;
hash ^= hash >> constant;
hash ^= hash << constant;
hash += hash << constant;
hash -= hash << constant;
hash = (hash << 27) | (hash >> 5); // if 32 bits
```

*Non-reversible operations*

```csharp
hash |= constant;
hash &= constant;
hash <<= constant;
hash >>= constant;
hash *= constant; // if constant is even
hash /= constant;
hash %= constant;
hash += hash >> constant;
```

Figure 3 lists some reversible and non-reversible mixing operations. It is assumed that the "hash" variable is an unsigned integer and that the arithmetic operations are performed modulo 2^n, where *n* is the size of the hash variable in bits. It is also assumed that 0 < constant < 2^n. A mixing function is guaranteed to be reversible if it consists of steps that are each reversible.

For the purposes of this article it is not important *how* to reverse these operations, but you may be curious, and it is definitely not obvious for some of them. The ^= operation is easy: just repeat it. The += and -= operations can be reversed with -= and +=.

The ^= operation combined with a << or >> shift is an interesting one. To reverse `hash ^= hash >> 9`, for example, you would do `hash ^= (hash >> 9) ^ (hash >> 18) ^ (hash >> 27)`.

The += and -= operations combined with << shifts are equivalent to *= and can be restated that way.
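For example, `hash += hash << 3` is the same operation as `hash *= 9`, since x + 8x = 9x modulo 2^32. A quick check (the helper is invented for illustration):

```csharp
using System;

public static class ShiftAddEquivalence
{
    // x + (x << 3) = x + 8x = 9x (mod 2^32), so the shift-and-add form
    // and the multiply form are the same mixing step, wrap-around included.
    public static uint ViaShift(uint x) => x + (x << 3);
    public static uint ViaMultiply(uint x) => x * 9;
}
```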

The *= operation is the most difficult to reverse. Look up *modular inverse* and *Euclidean algorithm* in your favorite search engine for details. Because of this difficulty, it is tempting to use this kind of mixing step for one-way hash functions. Whether or not this is a good choice largely depends on the performance of the CPU. Some CPUs perform multiplication quickly, and some take much longer than addition and subtraction. If multiplication is slow and the constant has a small number of bits set to 1, then equivalent operations using += and left shifts will work.
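One way to carry out that reversal, sketched here with Newton's iteration rather than the extended Euclidean algorithm (a standard alternative when the modulus is a power of two; the helper is invented for illustration):

```csharp
using System;

public static class ModularInverse
{
    // Inverse of an odd constant modulo 2^32 via Newton's iteration
    // x <- x * (2 - a * x). Each pass doubles the number of correct
    // low-order bits, so four passes reach all 32 bits.
    public static uint Invert(uint a)
    {
        uint x = a;              // a * a = 1 (mod 8), so a inverts itself mod 8
        for (int i = 0; i < 4; i++)
            x *= 2u - a * x;
        return x;
    }
}
```

For the constant 2654435761 used later in this article, multiplying by `Invert(2654435761U)` undoes the `*=` mixing step exactly.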

## Avalanche!

The purpose of the mixing function is to spread the effect of each message bit throughout all the bits of the internal state. Ideally every bit in the hash state is affected by every bit in the message. And we want to accomplish that as quickly as possible, simply for the sake of program performance.

A function is said to satisfy the *strict avalanche criterion* if, whenever a single input bit is complemented (toggled between 0 and 1), each of the output bits changes with a probability of one half, for an arbitrary selection of the remaining input bits.

The avalanche criterion can be tested for the hash function as a whole, or for just the mixing function. As part of the overall evaluation of the hash function, it makes sense to examine the avalanche behavior of the mixing function first.

The avalanche behavior can be exactly determined if the size of the hash state is small. If it is larger, such as 160 bits for the SHA-1 hash, it usually has to be estimated. It is possible to construct a mixing function with a large bit size for which the avalanche behavior can be exactly calculated, but such mixing functions are usually slow.

**Calculating Avalanche**

Figure 4

`hash += hash << 1`

| Input (dec) | Input (binary) | Output (dec) | Output (binary) |
|---|---|---|---|
| 0 | 0000 | 0 | 0000 |
| 1 | 0001 | 3 | 0011 |
| 2 | 0010 | 6 | 0110 |
| 3 | 0011 | 9 | 1001 |
| 4 | 0100 | 12 | 1100 |
| 5 | 0101 | 15 | 1111 |
| 6 | 0110 | 2 | 0010 |
| 7 | 0111 | 5 | 0101 |
| 8 | 1000 | 8 | 1000 |
| 9 | 1001 | 11 | 1011 |
| 10 | 1010 | 14 | 1110 |
| 11 | 1011 | 1 | 0001 |
| 12 | 1100 | 4 | 0100 |
| 13 | 1101 | 7 | 0111 |
| 14 | 1110 | 10 | 1010 |
| 15 | 1111 | 13 | 1101 |

Let's look at a 4-bit mixing function to illustrate the general method for calculating the probabilities that each input bit will affect each output bit. Figure 4 shows a mixing operation and lists all possible input values and their corresponding output values. By examining this table we can determine the probability that bit 1, for example, will affect bit 3.

To determine the effect of bit 1, we look at all eight possible combinations of the other three input bits and group the sixteen input values into eight pairs, where the elements of each pair differ only in bit 1. Input values 4 and 6 form such a pair if we are considering bit 1. These two input values match in bit positions 0, 2, and 3. The other pairs for bit 1 are 0 and 2, 1 and 3, 5 and 7, 8 and 10, etc.

Next we examine the corresponding pairs of output values to see which bits change when bit 1 changes. For input values 4 and 6 we see that the corresponding output values change in bits 1, 2, and 3, but output bit 0 does not change when bit 1 of the input changes. Therefore we say that bit 1 of the input affects bits 1, 2, and 3 of the output, but not bit 0. If we look at all eight input pairs for bit 1, we find that bit 0 changes for none of the pairs, bit 1 changes for all of the pairs, bit 2 changes for 50% of the pairs, and bit 3 changes for 75% of the pairs. Therefore this mixing step *by itself* does not come close to satisfying the strict avalanche criterion. We want to see about 50% for each output bit position for each input bit.
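The hand calculation above can be automated. This sketch (invented for illustration) enumerates all sixteen inputs of the Figure 4 step and reproduces the 0%, 100%, 50%, and 75% figures for input bit 1:

```csharp
using System;

public static class Avalanche4
{
    // The Figure 4 mixing step on 4 bits: hash += hash << 1 (mod 16).
    public static int Mix(int h) => (h + (h << 1)) & 0xF;

    // For a given input bit, return for each output bit the fraction of
    // inputs whose output bit flips when that input bit is toggled.
    public static double[] Effect(int inputBit)
    {
        var counts = new int[4];
        for (int x = 0; x < 16; x++)
        {
            int diff = Mix(x) ^ Mix(x ^ (1 << inputBit));
            for (int j = 0; j < 4; j++)
                if ((diff & (1 << j)) != 0)
                    counts[j]++;
        }
        var probs = new double[4];
        for (int j = 0; j < 4; j++)
            probs[j] = counts[j] / 16.0;
        return probs;
    }
}
```

Each of the eight pairs is counted twice (once from each element), which leaves the ratios unchanged.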

**Propagation Criterion**

The strict avalanche criterion is a special case of the *propagation criterion*. It is called the propagation criterion of degree 1, which means that if you toggle any single bit of the input, each output bit will change 50% of the time when you consider all possible combinations of the *n* − 1 remaining bits.

A function satisfies the propagation criterion of degree 2 if, when you toggle any *two* bits in the input, each output bit changes 50% of the time when you consider all possible combinations of the *n* − 2 remaining bits.

This method of calculating the avalanche behavior is only feasible because there are only four bits to deal with. For a 32-bit mixing function, we would have to examine sixty-four thousand million pairs. This could be done with a PC given sufficient time, but it would be out of the question for a 64-bit, 96-bit, or larger mixing function.

For larger mixing functions we can estimate the avalanche behavior by choosing input values at random and toggling each bit on and off to see which output bits are affected. If we keep a running total of the results for each random choice, then by choosing enough random input values we can get a good estimate of the probabilities.

**Actual C# Code**

Figure 5

*Base class for mixing functions. It provides a data member for holding the internal hash state, a Size property to query the number of bits that the mixing function handles, and a method for actually mixing the state.*

```csharp
public abstract class MixingFunction
{
    protected uint state = 0;

    public abstract int Size { get; }

    public abstract void Mix();
}
```

*Bob Jenkins' 32-bit hash for integer hash table keys. Note that the mixing function uses only reversible operations.*

```csharp
public class Jenkins32 : MixingFunction
{
    public override int Size { get { return 32; } }

    public override void Mix()
    {
        state += (state << 12);
        state ^= (state >> 22);
        state += (state << 4);
        state ^= (state >> 9);
        state += (state << 10);
        state ^= (state >> 2);
        state += (state << 7);
        state ^= (state >> 12);
    }
}
```

Before we write a program to calculate bit-to-bit avalanche behavior, we'll need a "harness" into which we can strap different mixing functions. Figure 5 shows an abstract base class for a mixing function that ensures that every mixing function will have the same interface. I used an abstract class instead of an `interface` because we need to have a member variable for the hash state. This version has a 32-bit member variable called `state`.

Figure 5 also shows a 32-bit mixing function designed by Bob Jenkins. It is intended to be used for a hash table with integer keys. It is tempting to think you can tell which bits affect which other bits just by looking at the code. For example, after the first step, `state += (state << 12)`, it is fairly clear that bits 0 through 19 will affect bits 12 through 31 in some way. But when you combine this step with subsequent steps, it is possible that some of these effects will cancel out earlier effects. The last step shifts things 12 bits to the right, for example. And if you look at all the steps *except* the last and add up all the shifts, counting left shifts as positive and right shifts as negative, you'll see that the total amount of shifting before the last step is zero. It's complicated. We'll just have to measure it to see how it does.

We'll create a function for calculating the *avalanche matrix*. Each element `i,j` of this 32×32 matrix of floating-point values will tell us the probability that bit `i` of the input will affect bit `j` of the output. We want to see a value of 0.5 for every combination.

The best place to put this function is in the `MixingFunction` base class, which means the function automatically becomes available as a member function for any mixing function we create. It essentially becomes a feature of mixing functions in general, instead of having to create this method in some other class and then pass the mixing function as a parameter to the function.

Figure 6

```csharp
public double[,] AvalancheMatrix(int trials, int repetitions)
{
    int size = this.Size;
    if (size != 32)
        throw new InvalidOperationException("Hash must be 32 bits.");
    if (trials <= 0 || repetitions <= 0)
        throw new ArgumentOutOfRangeException();

    uint save, inb, outb;
    double dTrials = trials;
    int nBytes = 4;
    byte[] bytes = new byte[nBytes];
    RandomNumberGenerator rng = new RNGCryptoServiceProvider();
    int[,] t = new int[size, size];

    while (trials-- > 0)
    {
        rng.GetBytes(bytes);
        save = state = (uint) (bytes[0] + 256U * bytes[1] +
            65536U * bytes[2] + 16777216U * bytes[3]);
        for (int r = 0; r < repetitions; r++)
            Mix();
        inb = state;

        for (int i = 0; i < size; i++)
        {
            state = save ^ (1U << i);
            for (int r = 0; r < repetitions; r++)
                Mix();
            outb = state ^ inb;
            for (int j = 0; j < size; j++)
            {
                if ((outb & 1) != 0)
                    t[i, j]++;
                outb >>= 1;
            }
        }
    }

    double[,] result = new double[size, size];
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            result[i, j] = t[i, j] / dTrials;
    return result;
}
```

Figure 6 shows the code for the `AvalancheMatrix` method of the `MixingFunction` class. It takes two parameters. The first is `trials`, which is the number of random input vectors we are going to generate. Evaluating all 2^31 different combinations of input vectors for each bit combination would take too long, so we'll just pick about a million of them at random. That's the `trials` parameter.

The second parameter is `repetitions`. We'll see in a little while that even when a mixing function doesn't achieve avalanche in a single call, it will often do better if the mixing is performed two or more times in a row. The `repetitions` parameter lets us easily test the behavior of multiple applications of the mixing function.

The first thing the function does is to ensure that the mixing function is a 32-bit mixing function, since this version of the algorithm doesn't support other sizes. We also check that the function parameters are valid.

Next we declare our variables, initialize some of them (C# initializes everything else to default values automatically), allocate some memory, and initialize the random number generator. I used RNGCryptoServiceProvider from the System.Security.Cryptography namespace because I know the System.Random class has linear correlations between the values *x*_n, *x*_{n−34}, and *x*_{n−55}, since it is a two-tap linear feedback PRNG. I'm not 100% sure what RNGCryptoServiceProvider is doing, but the scanty documentation I've seen suggests it uses entropy sources for seed generation and updating, and a cryptographic hash for output whitening. I like that better, although it is really slow.

Within the main loop we start by getting 32 random bits and setting the internal state. We also save that state for later. Then we call the mixing function the appropriate number of times and save the new state.

Then we toggle each of the 32 input bits one at a time with the statement `state = save ^ (1U << i)`, then we mix that new state and compare it to the original mixed state by doing `outb = state ^ inb`, which gives us 0's where the two states match and 1's where they differ. Then we count the 1's and keep a tally.

Finally, after all the trials are finished, we divide each count by the total number of trials to normalize them into the [0, 1] range. Remember that 0.5 is our goal.

## The Verdict

Figure 7

*Here's how to call the AvalancheMatrix calculator:*

```csharp
MixingFunction mf = new Jenkins32();
double[,] am = mf.AvalancheMatrix(1000000, 1);
```

Now we're ready to find out how well Bob Jenkins did with this mixing function with regard to achieving avalanche. Figure 7 shows the two lines of code we need to create an instance of the `Jenkins32` mixing function and call the `AvalancheMatrix` method.

Here are the results, shown as percentages from 0 to 100, rounded to 1%:

```
     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
 0  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 52 50 50 52 50 54
 1  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 52 50 50 52 50
 2  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 52 50 50 50
 3  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 52 49 50
 4  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 51 50
 5  51 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 51 54
 6  50 51 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
 7  49 50 51 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
 8  50 49 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
 9  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
10  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 51
11  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
12  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
13  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
14  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
15  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
16  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 49 50 50 50 50 50 50 50
17  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 49 50 50 50 50 50 50
18  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 49 50 50 50 50 50
19  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 49 50 50 50 50
20  50 55 50 54 50 50 48 50 51 50 50 50 50 50 50 50 50 50 50 50 51 50 50 50 50 50 50 50 50 50 50 50
21  53 50 54 50 52 50 50 48 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 47 50
22  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
23  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
24  50 50 50 50 50 49 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
25  50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 51 50
26  49 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 51 50
27  49 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 51 54
28  50 48 49 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 51 50 50 50 50 50 50 51 46
29  50 50 48 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 52 50 50 50 50 50 49 51
30  48 50 50 48 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 52 50 50 50 50 50 50
31  50 48 50 50 50 51 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 53 50 50 50 49 51
```

Each row represents an input bit and each column represents an output bit. So to determine the probability that bit 6 affects bit 1, look at the row numbered "6" (the seventh row) and the column numbered "1", and you'll see that the result is 51%.

The technical term for these results is *sweet*. We can be fairly confident that this mixing function will do a good job at collision avoidance with 32-bit integers for the purposes of hash table lookup.

But if we examine the data *very* closely, we do find that this function doesn't behave quite the same as a function that satisfies the strict avalanche criterion (SAC) would. This does *not* suggest that this is a bad mixing function. It's fine. I'm just using it as an example to show how to analyze avalanche behavior.

Several entries differ by at least 4% from the expected value; for example, the 0th bit affects the 31st bit 54% of the time. To tell whether this is significant, or whether it is just random variation due to the fact that we didn't actually check *every* combination of input bits and instead just took a sample, we need to compare this to a mixing function that satisfies SAC exactly.

Such a function would have exactly a 50% probability in every position, but because we are sampling there would be variations. Since we did a million samples, each position would behave like flipping a coin a million times, and we would see a binomial distribution. This looks essentially the same as a normal distribution with a mean of 50 and a standard deviation of 0.05. But some positions would be worse than others. With 1024 elements (32 × 32), we can expect that the *worst* positive deviation is going to be about 3.5 standard deviations above the mean, or about 0.175. Because this is well below 0.5, we shouldn't see any non-50's in the table if we round to the nearest 1%.
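The 0.05 figure is just the binomial standard error of a million coin flips, expressed in percentage points; a one-line check (the helper is invented for illustration):

```csharp
using System;

public static class SamplingError
{
    // Standard error, in percentage points, of an estimated 50%
    // probability after n independent trials: 100 * sqrt(p * (1 - p) / n).
    public static double StdDevPercent(int n)
        => 100.0 * Math.Sqrt(0.5 * 0.5 / n);
}
```

With n = 1,000,000 this gives 0.05, and 3.5 such standard deviations is the 0.175 quoted above.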

Figure 8

```csharp
public class Artificial32 : MixingFunction
{
    RandomNumberGenerator rng = new RNGCryptoServiceProvider();
    byte[] b = new byte[4];

    public override int Size { get { return 32; } }

    public override void Mix()
    {
        rng.GetBytes(b);
        state = BitConverter.ToUInt32(b, 0);
    }
}
```

Since we don't (yet) know of a mixing function that obeys SAC exactly, we can instead create one that actually does a (simulated) 50% coin flip. This isn't a real mixing function, because the outputs don't depend on the inputs, but it will fool the avalanche calculator into thinking it's real. Figure 8 shows this mixing function.

I won't show the results of this function here because it would be a complete waste of bandwidth, but you can run it yourself and see that it shows 50's in every position, and the raw data shows that the worst positions probably deviate from this by about 0.175 percentage points (it varies each time you run it, of course, so your results may differ).

**How Bad Can It Get?**

Figure 9

*Knuth multiplicative mixing function*

```csharp
public class KnuthMultiplicative : MixingFunction
{
    public override int Size { get { return 32; } }

    public override void Mix()
    {
        state *= 2654435761U;
    }
}
```

*A sub-section of the avalanche matrix:*

|       | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|
| **0** | 100 | 0   | 0   | 0   | 100 | 50  | 75  | 63  |
| **1** | 0   | 100 | 0   | 0   | 0   | 100 | 50  | 75  |
| **2** | 0   | 0   | 100 | 0   | 0   | 0   | 100 | 50  |
| **3** | 0   | 0   | 0   | 100 | 0   | 0   | 0   | 100 |
| **4** | 0   | 0   | 0   | 0   | 100 | 50  | 25  | 13  |
| **5** | 0   | 0   | 0   | 0   | 0   | 100 | 50  | 25  |
| **6** | 0   | 0   | 0   | 0   | 0   | 0   | 100 | 50  |
| **7** | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 100 |

If you compare the Jenkins32 function to a more primitive multiplicative mixer that people have used in real hash functions, you'll see that we could do much, much worse. Figure 9 shows such a function and a sub-section of the results matrix. It clearly shows that this hash function is bad at propagating high-order bits into low-order bit positions. Whether that's acceptable or not depends on the application. And it's not really as bad as it looks, because the intended use of this function involves taking a prime modulus of the final value, which does scramble the lower bits somewhat.
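The exact 0's and 100's in that sub-section are no accident: multiplication only carries information upward, so flipping input bit *i* always flips output bit *i* and can never touch any lower bit. A quick sketch (mine, not from the article) confirms this:

```csharp
using System;

public class KnuthDiagonal
{
    public static void Main()
    {
        var rng = new Random(1);
        for (int trial = 0; trial < 100000; trial++)
        {
            uint x = (uint) rng.Next() ^ ((uint) rng.Next() << 16);
            for (int i = 0; i < 32; i++)
            {
                // Difference between hashes of x and x with bit i flipped.
                uint diff = (x * 2654435761U) ^ ((x ^ (1U << i)) * 2654435761U);

                // No bit below position i ever changes (the 0% cells)...
                uint lowMask = (1U << i) - 1U;
                if ((diff & lowMask) != 0) throw new Exception("low bits changed");

                // ...and bit i itself always changes (the 100% cells),
                // because the multiplier is odd.
                if ((diff & (1U << i)) == 0) throw new Exception("bit i did not flip");
            }
        }
        Console.WriteLine("ok");
    }
}
```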

**Can We Do Better?**

Is it possible to do better than Jenkins32? The answer is yes. To illustrate, let's just run Jenkins32 *twice*. One pass already does a great job of mixing, so two passes should be even better, right? We can easily test this by calling the `AvalancheMatrix` method with a `repetitions` parameter of 2. And we find that, in fact, the results are near-perfect; they're virtually indistinguishable from a SAC-perfect mixing function.

Note that this *isn't* true of every mixing function. With `KnuthMultiplicative`, those 0's and 100's never go away, because of where they're located.

You might wonder whether a mixing function that exactly satisfies SAC even exists. Once again the answer is yes. Figure 10 shows a 4-bit mixing function with this property. The truth table of this function is shown so you can verify that SAC is actually satisfied.

Figure 10

*A four-bit mixing function that exactly satisfies the strict avalanche criterion.*

```csharp
public void Mix()
{
    uint[] tab = { 8, 7, 0, 10, 1, 3, 5, 12, 11, 13, 15, 14, 2, 6, 9, 4 };
    state = tab[state];
}
```

*The truth table for the above function:*

| Input |      | Output |      |
|-------|------|--------|------|
| 0     | 0000 | 8      | 1000 |
| 1     | 0001 | 7      | 0111 |
| 2     | 0010 | 0      | 0000 |
| 3     | 0011 | 10     | 1010 |
| 4     | 0100 | 1      | 0001 |
| 5     | 0101 | 3      | 0011 |
| 6     | 0110 | 5      | 0101 |
| 7     | 0111 | 12     | 1100 |
| 8     | 1000 | 11     | 1011 |
| 9     | 1001 | 13     | 1101 |
| 10    | 1010 | 15     | 1111 |
| 11    | 1011 | 14     | 1110 |
| 12    | 1100 | 2      | 0010 |
| 13    | 1101 | 6      | 0110 |
| 14    | 1110 | 9      | 1001 |
| 15    | 1111 | 4      | 0100 |
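Checking SAC from the truth table is mechanical: for each of the 4×4 input/output bit pairs, flipping the input bit must flip the output bit in exactly 8 of the 16 possible inputs. A small verification sketch (mine):

```csharp
using System;

public class SacCheck
{
    public static void Main()
    {
        uint[] tab = { 8, 7, 0, 10, 1, 3, 5, 12, 11, 13, 15, 14, 2, 6, 9, 4 };
        for (int i = 0; i < 4; i++)      // input bit
        for (int j = 0; j < 4; j++)      // output bit
        {
            int flips = 0;
            for (uint x = 0; x < 16; x++)
                if (((tab[x] ^ tab[x ^ (1U << i)]) & (1U << j)) != 0)
                    flips++;
            if (flips != 8) throw new Exception("SAC violated");
        }
        Console.WriteLine("SAC satisfied exactly"); // all 16 bit pairs pass
    }
}
```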

We know that we can get arbitrarily close just by concatenating a good-enough function two or more times. The challenge is to get closer without adding more computation time.

**Mixing Function Search**

Warning! I can think of no practical reason why you would need "perfect" avalanche instead of the excellent avalanche achieved by the `Jenkins32` integer hash. The rest of this page is motivated purely by intellectual curiosity.

To create the mixing function of Figure 10, I did a random search of different permutation tables until I came across one that satisfied the SAC. That works for a 4-bit function, but a lookup table isn't going to be feasible for a 32-bit or larger function. So we can choose to limit our search to functions that are a combination of the reversible operations from the previous page. That's where those "magic numbers" come in: we choose them at random! Some values will be good, some will be bad, but we have a way to measure which are the best.

But a completely random search would be too slow. Even if we restrict our search to functions of the same form as `Jenkins32`, that's still 31^8 combinations, or almost a million million. And most of those will be complete garbage.

Figure 11

*Results of searching for coefficient vectors adjacent to known-good vectors. The starting vector (12, 22, 4, 9, 10, 2, 7, 12) corresponds to the eight constants in the Jenkins32 mixing function.*

| Error  | Vector                 |
|--------|------------------------|
| 0.0257 | 12 22 4 9 10 2 7 12    |
| 0.0185 | 12 22 4 9 10 2 7 16    |
| 0.0116 | 12 18 4 9 10 2 7 16    |
| 0.0112 | 12 18 4 9 10 2 7 15    |
| 0.0105 | 12 18 4 9 10 2 8 15    |
| 0.0088 | 12 18 5 9 10 2 8 15    |
| 0.0085 | 12 18 5 9 10 3 8 15    |
| 0.0083 | 16 18 5 9 10 3 8 15    |
| 0.0080 | 16 17 5 9 10 3 8 15    |
| 0.0052 | 16 17 5 9 10 3 6 15    |
| 0.0052 | 16 13 5 9 10 3 6 15    |
| 0.0048 | 16 13 5 9 10 3 6 17    |
| 0.0047 | 16 13 5 7 10 3 6 17    |
| 0.0044 | 16 13 5 7 10 2 6 17    |
| 0.0039 | 16 13 5 7 10 2 6 16    |
| 0.0039 | 16 13 4 7 10 2 6 16    |
| 0.0038 | 16 13 4 7 10 2 8 16    |
| 0.0024 | 16 13 4 7 10 5 8 16    |

Instead, what we'll do is start with a known-good function and try modifying each of the constants, one at a time, to see if we can improve on the previous avalanche matrix. In order to compare two functions and determine which is better, we need a single number that summarizes the avalanche matrix results. I chose to use the "sum of squared errors", which just means I take the difference between each matrix entry and the expected value of 0.5, square it, and then sum all 1024 values together.
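A sketch of that error measure (the names here are mine, not the article's):

```csharp
using System;

public class ErrorMeasure
{
    // Sum of squared deviations of each avalanche-matrix entry from the
    // ideal probability of 0.5. Entries are probabilities in [0, 1].
    public static double SumSquaredError(double[,] m)
    {
        double sum = 0.0;
        for (int i = 0; i < m.GetLength(0); i++)
            for (int j = 0; j < m.GetLength(1); j++)
            {
                double e = m[i, j] - 0.5;
                sum += e * e;
            }
        return sum;
    }

    public static void Main()
    {
        // A 32×32 matrix in which every entry is off by half a percentage
        // point scores 1024 * 0.005^2 = 0.0256.
        var m = new double[32, 32];
        for (int i = 0; i < 32; i++)
            for (int j = 0; j < 32; j++)
                m[i, j] = 0.505;
        Console.WriteLine("{0:F4}", SumSquaredError(m)); // prints 0.0256
    }
}
```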

I used 100 thousand trials for the matrix calculation. If you use too few trials there is too much "noise" in the results and you can't reliably compare values. The minimum result we can expect to get this way is 0.00256, which is what you'd see if each matrix value were distributed binomially, as you'd expect from a function that satisfies SAC. `Jenkins32` gets about 0.0257, about 10 times the expected squared error.

Figure 11 shows the trajectory of the best "search path" that I've been able to find using `Jenkins32` as the starting point. It shows that we're able to reduce the squared error to essentially the theoretical minimum, give or take some sampling noise.

## Evaluating Hash Functions

Figure 12

*Abstract base class for hash functions.*

```csharp
public delegate byte[] GetRandomKey();

public abstract class HashFunction
{
    // PRNG for generating random keys
    static RandomNumberGenerator rng = new RNGCryptoServiceProvider();
    static byte[] b = new byte[4];

    // main hashing operation
    public abstract uint ComputeHash(byte[] data);

    // generate a random key length
    private static int GetRandomLength()
    {
        rng.GetBytes(b);
        uint s = (uint) (b[0] + 256U*b[1] + 65536U*b[2] + 16777216U*b[3]);
        double x = 1 - s/(uint.MaxValue + 1.0);
        return (int) Math.Floor(Math.Sqrt(-800.0 * Math.Log(x)));
    }

    // generate a key with random octets
    public static byte[] GetUniformKey()
    {
        int length = GetRandomLength() + 2;
        byte[] key = new byte[length];
        rng.GetBytes(key);
        return key;
    }

    // generate a key with "text" octets
    public static byte[] GetTextKey()
    {
        int length = GetRandomLength() + 4;
        byte[] key = new byte[length];
        rng.GetBytes(key);
        for (int i = 0; i < length; i++)
            key[i] = (byte) (65 + (key[i]*key[i]*26)/65026);
        return key;
    }

    // generate a key with sparse octets
    public static byte[] GetSparseKey()
    {
        int length = GetRandomLength() + 6;
        byte[] key = new byte[length];
        rng.GetBytes(key);
        for (int i = 0; i < length; i++)
            key[i] = (byte) (1 << (key[i] & 7));
        return key;
    }
}
```

Note that the ComputeHash prototype isn't sufficient for all real-world hashing scenarios. It requires that all of the data be contained in a single octet array, whereas in reality you may need to supply data in chunks, or in a format other than octets. But this definition is sufficient for our evaluation scenario, where we control the keys.

The base class includes functionality for generating random keys for the various tests. This functionality is independent of the behavior of any particular hash function.

In the preceding section we looked at the avalanche behavior of mixing functions. Now we're ready to look at the behavior of full hash functions, including the initialization, combining, mixing, and post-processing steps. We'll evaluate the avalanche, distribution, and correlation performance of several existing string hash functions, including a hash with 96-bit internal state by Bob Jenkins, the Fowler/Noll/Vo (FNV) 32-bit hash, a simplistic hash for a baseline comparison, and a cryptographic hash.

First we'll describe the tests and provide results for our baseline hash function (defined below); in subsequent sections we'll evaluate each hash function individually. Finally, we'll summarize the results for all of our test candidates.

## What to Look For in a Hash Function

Before we do any testing, I want to point out that the real test involves testing the hash function *in situ*. Different applications have different hash requirements, and what works well in one situation may not work well in another. What we're going to look at is the generic behavior of hash functions, assuming we know nothing about the keys that will be hashed. This should tell us something about the worst-case performance of the functions.

Since we assume no knowledge about the hash keys, we'll use completely random keys. Keys will be arrays of octets. We'll use keys that consist of uniformly distributed octets, keys that simulate text, and keys that are very bit-sparse (a single "1" bit per octet). Why not just use random keys with uniformly distributed octets? Because a hash function isn't really needed in that scenario: if you know ahead of time that your keys are composed of completely random bits, you can just use a subsection of the key as your hash value and be assured of good distribution properties. Adding "text" keys and sparse keys covers a broader range of real-world requirements.

Good *general purpose* hash functions should have these properties:

- The output should be uniformly distributed. Each pattern of output bits should be equally likely. We'll use a one-tailed χ² bucket test for this. See this site by Amar Patel for a very clear explanation of the χ² test, or this page for a more formal explanation.
- Every input bit should affect every output bit about 50% of the time, considering all possible combinations of the other input bits. This is the avalanche condition. Conversely, every output bit should be affected by every input bit. This is similar to the "no funnels" condition described by Bob Jenkins. We'll need to modify our avalanche evaluator to handle variable-sized input bit vectors.
- There should be no correlations between pairs of output bits. If output bits are correlated, you're not getting a full *n* bits of output. We won't test for this, because it's easy for all but the most unsuitable hash functions to avoid such correlations.
- It should be computationally infeasible to find two keys that produce the same hash value, or to find a key that produces a given hash value. This is a requirement for cryptographic hashes, which are much longer than 32 bits. Since it's easy to find collisions for even an ideal 32-bit hash, this isn't a requirement we'll use.

Figure 13

*A rudimentary hash function.*

```csharp
public class SimpleHash : HashFunction
{
    public override uint ComputeHash(byte[] data)
    {
        uint hash = 0;
        foreach (byte b in data)
            hash = (hash + b) * 0x50003;
        return hash;
    }
}
```

## Test Harness

We first need to create a test harness for hash functions, just as we did for mixing functions. Figure 12 shows the `HashFunction` abstract base class which we'll use for all of the hashes we examine. We're going to assume a 32-bit hash output for all of the hash functions.

Different types of keys require different lengths. In order for the χ² tests with 2^16 buckets to be valid, all of the keys need to contain at least 16 bits of information. The sparse keys contain only three bits of information per octet, so we need at least six octets to ensure 16 bits of information. The "text" keys contain 4.34 bits per octet on average, so we need at least four octets for those. We only need two octets for the uniformly distributed keys.
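The 4.34 figure can be derived from the quadratic map used in `GetTextKey`: the Shannon entropy of the letter distribution it produces works out to about 4.34 bits per octet. A sketch of that calculation (mine, not part of the original harness):

```csharp
using System;

public class TextKeyEntropy
{
    public static void Main()
    {
        // Count how often each letter 'A'..'Z' is produced when a uniformly
        // distributed octet v is mapped through (v*v*26)/65026, as in GetTextKey.
        var counts = new int[26];
        for (int v = 0; v < 256; v++)
            counts[(v * v * 26) / 65026]++;

        // Shannon entropy of that letter distribution, in bits.
        double entropy = 0.0;
        foreach (int c in counts)
        {
            if (c == 0) continue;
            double p = c / 256.0;
            entropy -= p * Math.Log(p, 2.0);
        }
        Console.WriteLine("{0:F2} bits per octet", entropy);
    }
}
```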

Figure 13 lists a rudimentary hash function which we'll use to work out our testing methodology. The constant `0x50003` was chosen for the ability of the "3" to mix key bits into the lower half of the hash value and the "5" to mix key bits into the upper half. We avoid using the same number for both halves to minimize correlations between the two sets of bits.

Figure 14

*χ² test results for the SimpleHash function. Tests were performed by bucketizing values obtained from the upper and lower ends of the 32-bit hash values, for bucket counts 2^1 through 2^16.*

| Bits | Uniform, lower | Uniform, upper | Text, lower | Text, upper | Sparse, lower | Sparse, upper |
|------|----------------|----------------|-------------|-------------|---------------|---------------|
| 1    | 0.480 | 1.000 | 0.396 | 0.203 | 0.322 | 0.396 |
| 2    | 0.585 | 0.678 | 0.835 | 0.932 | 0.873 | 0.864 |
| 3    | 0.766 | 0.694 | 0.377 | 0.862 | 0.139 | 0.496 |
| 4    | 0.982 | 0.527 | 0.607 | 0.463 | 0.225 | 0.886 |
| 5    | 0.847 | 0.469 | 0.683 | 0.921 | 0.391 | 0.371 |
| 6    | 0.943 | 0.624 | 0.773 | 0.679 | 0.117 | 0.860 |
| 7    | 0.415 | 0.854 | 0.161 | 0.184 | 0.318 | 0.883 |
| 8    | 0.682 | 0.687 | 0.886 | 0.335 | 0.248 | 0.919 |
| 9    | 0.647 | 0.906 | 0.302 | 0.640 | 0.475 | 0.904 |
| 10   | 0.565 | 0.740 | 0.779 | 0.362 | 0.421 | 0.481 |
| 11   | 0.693 | 0.875 | 0.268 | 0.755 | 0.814 | 0.111 |
| 12   | 0.499 | 0.476 | 0.443 | 0.846 | 0.903 | 0.794 |
| 13   | 0.140 | 0.852 | 0.234 | 0.856 | 0.640 | 0.665 |
| 14   | 0.390 | 0.436 | ~~0.000~~ | 0.624 | 0.129 | 0.756 |
| 15   | ~~0.000~~ | ~~0.003~~ | ~~0.000~~ | 0.079 | 0.113 | 0.276 |
| 16   | ~~0.000~~ | ~~0.000~~ | ~~0.000~~ | *0.031* | ~~0.000~~ | 0.108 |

*~~Struck~~ values show failures at the 1% significance level, and italic values show failures at the 5% level. This hash has severe weaknesses for hash tables with 2^14 or more buckets, although it did acceptably well in the upper bits for sparse keys all the way up to 2^16 buckets. It is nevertheless a fast, simple, easy-to-remember function suitable for small hash tables of up to 8192 buckets for all kinds of keys.*

## Testing Uniform Distribution

Hash functions used with hash tables of size 2^n would typically use only the lower-order bits of a 32-bit hash output. But to be thorough, our tests will look at the distribution properties of both the low-order and high-order bits, up to 16 bits long.

For each test we'll call the hash function repeatedly with random keys and note which bucket each key lands in. We'll count the hits to each bucket and compare the result to what a perfectly uniform distribution would produce, using the one-tailed χ² test. (We don't care if the distribution is "too uniform", since we're throwing random keys at each function; if we were testing a random number generator we would want a two-tailed test.) The χ² degrees of freedom for each measurement will be 2^m − 1, where *m* is the bit-size of the sub-range we're testing.
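A minimal sketch of such a bucket test (the "hash" here is a stand-in random value, so the statistic should land near its mean; a real run would substitute an actual `ComputeHash` call on a random key):

```csharp
using System;

public class ChiSquareBuckets
{
    public static void Main()
    {
        const int bits = 8;
        const int buckets = 1 << bits;
        const int perBucket = 100;          // average keys per bucket
        const int keys = buckets * perBucket;

        var counts = new int[buckets];
        var rng = new Random(12345);
        for (int k = 0; k < keys; k++)
        {
            uint h = (uint) rng.Next();     // stand-in for ComputeHash(key)
            counts[h & (buckets - 1)]++;    // bucket by the low-order bits
        }

        double chi2 = 0.0;
        foreach (int c in counts)
        {
            double d = c - perBucket;
            chi2 += d * d / perBucket;
        }

        // For uniform data the statistic should be near its mean, which
        // equals the degrees of freedom (buckets - 1 = 255).
        Console.WriteLine("chi2 = {0:F1}, df = {1}", chi2, buckets - 1);
    }
}
```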

The result of each χ² test is a p-value, a number from 0.0 to 1.0, and we would expect these to be uniformly distributed in that range if the hash output is uniform. The p-value indicates the probability that a truly uniform distribution would be worse than the observed distribution by chance alone. For example, a p-value of 0.01 indicates that a truly uniform distribution would be worse than the observed distribution only 1% of the time, which is bad.

Even if we define p = 0.01 as the cutoff for a "bad" distribution, a truly uniform distribution will have a p-value worse than this 1% of the time, by definition. So we need to look at each failure in context to see whether it's likely just random chance or whether it forms part of a pattern. Sometimes I'll re-run a test several times to determine which results are significant and which aren't, and I only present the most representative results. I wouldn't do this in a scientific context, but this isn't that.

Figure 15

*Avalanche behavior for the SimpleHash function. Each row represents one input bit, and the 32 squares in each row indicate the influence of that input bit on each hash bit. Green indicates acceptable avalanche behavior, orange is unacceptable, and red means that there was no mixing at all (either 0% or 100% influence). The top row corresponds to the LSB of the 0th-index key octet and the bottom row corresponds to the MSB of the last key octet.*

*For two-octet keys:*

*For four-octet keys:*

*For 256-octet keys:*

*This hash has absolutely terrible avalanche behavior. Some output bits aren't mixed at all. The fact that the LSB doesn't get mixed at all means that this hash wouldn't produce a uniform distribution if the LSBs of the keys weren't already uniform.*

This overall behavior is typical of multiplicative hashes unless additional steps are taken during post-processing to improve the mixing.

For each test, we'll generate enough random keys to fill each bucket with an average of 100 keys. The key length will be the random variable

*k* + floor(sqrt(−800 · ln(*x*)))

where *x* is a uniformly distributed random variable on (0, 1] and *k* is the minimum key length required to provide 16 bits of information in the key, which varies by the type of key. The function `floor` indicates rounding down, `sqrt` indicates square root, and `ln` is the natural logarithm. We need a minimum key length of at least *k* because we'll be doing χ² tests with up to 2^16 buckets, and the presence of a few too-short keys would force non-uniform output in those cases and cause those tests to fail regardless of the hash function used.

Figure 14 shows the results for the `SimpleHash` function. It shows weakness in both the upper and lower bits when 14 or more of the bits are used. The upper bits are a little stronger; that half of the hash gets some "overflow" from the multiplications performed on the lower half.

## Avalanche Testing

If a hash function passes the uniform distribution test, why do any of the other tests matter? While it's true that the primary use for most hashes is hash table lookup, that's not the only use; small hashes can be used for non-cryptographic document fingerprinting, for example. Also, the other tests can identify classes of keys for which the distribution might fail even when the tests with random keys succeed.

Unlike the previous section, which pursued the Strict Avalanche Criterion to paranoid lengths, we'll consider avalanche "good enough" when each input bit affects each output bit between 1/3 and 2/3 of the time. The mixing function will be applied repeatedly in most cases, which only improves the avalanche behavior if it's good enough to start with. If single-round avalanche is borderline, we might consider calling the mixing function one or two extra times as part of the post-processing step. This test will tell us which functions would benefit from that.

The definition of avalanche requires that we only evaluate keys with uniformly distributed input octets, so this test won't use GetTextKey or GetSparseKey. Also, it's impractical to test every input bit when keys get very long, so we're going to restrict keys to specific lengths. We'll examine two-octet keys, which allow us to calculate the avalanche matrix exactly, as well as four-octet keys, for which we'll have to sample just as before. Finally, we'll use 256-octet keys to see the avalanche behavior when many rounds of the mixing function are used, and for those we'll look at input bits in the first and last octets.

The 16×32 and 32×32 avalanche matrices are cumbersome to present on a web page, so we'll use a graphical summary of the results instead: green squares for bit combinations that achieved avalanche, orange squares for those that didn't, and red squares where the input bit had either no effect or a direct effect on the output (0% or 100% influence).
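The coloring rule, reconstructed from the description above (the thresholds come from the 1/3–2/3 criterion; the names are mine):

```csharp
using System;

public class AvalancheColor
{
    // Map an avalanche probability to a diagram color.
    public static char Classify(double p)
    {
        if (p == 0.0 || p == 1.0) return 'R';           // red: no mixing at all
        if (p < 1.0 / 3.0 || p > 2.0 / 3.0) return 'O'; // orange: poor avalanche
        return 'G';                                     // green: good enough
    }

    public static void Main()
    {
        Console.WriteLine(Classify(0.50)); // G
        Console.WriteLine(Classify(0.90)); // O
        Console.WriteLine(Classify(1.00)); // R
    }
}
```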

Figure 15 shows the results of the avalanche tests for the `SimpleHash` function. Next we'll look at some other well-known hash functions.

## Fowler/Noll/Vo Hash

Figure fnv1

*Fowler/Noll/Vo (FNV) 32-bit hash function.*

```csharp
public class FNV : HashFunction
{
    int shift;
    uint mask;

    // hash without xor-folding
    public FNV()
    {
        shift = 0;
        mask = 0xFFFFFFFF;
    }

    // hash with xor-folding
    public FNV(int bits)
    {
        shift = 32 - bits;
        mask = (1U << shift) - 1U;
    }

    public override uint ComputeHash(byte[] data)
    {
        uint hash = 2166136261;
        foreach (byte b in data)
            hash = (hash * 16777619) ^ b;
        if (shift == 0)
            return hash;
        return (hash ^ (hash >> shift)) & mask;
    }
}
```

This hash function is described here. It's a simple multiplicative hash with the addition of a post-processing step called *xor-folding* to remove some linearity in the lower bits. The FNV authors recommend using xor-folding if the desired hash size is not a power of two, but what they really mean is that you should use xor-folding whenever the hash size is not a power of two *or* is less than 32 bits.

There are different initialization and multiplication constants for use with 32-bit, 64-bit, 128-bit, etc. hashes, but we'll only examine the 32-bit version of FNV here. In our uniform distribution test we'll use xor-folding for the lower bits, since that's the intended use of this hash, but we'll examine the upper bits as-is. Since our avalanche test is a full 32-bit test, we can't use xor-folding there. This will let us identify classes of keys for which xor-folding is a necessity.

The primary appeals of this hash are its simplicity and its speed, though its speed depends on whether the CPU architecture supports fast integer multiplication. The Jenkins shift-add-xor hashes are faster on CPU architectures without fast multiplication.

Figure fnv1 shows the listing for the FNV 32-bit hash.

**Uniform Distribution Test**

We examine the distribution of numbers derived from the lower and upper bits of the hash output, in sizes of 1 through 16 bits.

Figure fnv3 shows the results of this test for the FNV hash. It indicates that the FNV 32-bit hash with xor-folding produces uniformly distributed values for power-of-two hash tables up to at least 2^14, whether the key octets are uniformly distributed, distributed like alphabetic text, or sparsely distributed.

Figure fnv2

*Avalanche behavior of the FNV 32-bit hash.*

*For two-octet keys:*

*For four-octet keys:*

*For 256-octet keys:*

Figure fnv3

*χ² test results for the FNV 32-bit hash, with xor-folding for the lower-bit tests.*

| Bits | Uniform, lower | Uniform, upper | Text, lower | Text, upper | Sparse, lower | Sparse, upper |
|------|----------------|----------------|-------------|-------------|---------------|---------------|
| 1    | 0.777 | 0.888 | 0.888 | 0.888 | 0.480 | 0.480 |
| 2    | 0.967 | 0.326 | 0.407 | 0.197 | 0.513 | 0.720 |
| 3    | 0.109 | 0.390 | 0.498 | 0.103 | 0.573 | 0.016 |
| 4    | 0.548 | 0.416 | 0.649 | 0.210 | 0.143 | 0.469 |
| 5    | 0.360 | 0.606 | 0.931 | 0.992 | 0.665 | 0.201 |
| 6    | 0.328 | 0.753 | 0.584 | 0.416 | 0.882 | 0.361 |
| 7    | 0.436 | 0.560 | 0.995 | 0.302 | 0.981 | 0.124 |
| 8    | 0.297 | 0.729 | 0.856 | 0.730 | 0.472 | 0.113 |
| 9    | 0.222 | 0.349 | 0.629 | 0.951 | 0.701 | 0.769 |
| 10   | 0.731 | 0.208 | 0.066 | 0.646 | 0.875 | 0.551 |
| 11   | 0.813 | 0.356 | 0.678 | 0.820 | 0.519 | 0.556 |
| 12   | 0.076 | 0.229 | 0.521 | 0.068 | 0.091 | 0.474 |
| 13   | 0.780 | 0.565 | 0.719 | 0.090 | 0.117 | 0.132 |
| 14   | 0.225 | 0.813 | 0.269 | 0.251 | 0.855 | 0.568 |
| 15   | 0.583 | 0.005 | 0.699 | 0.370 | 0.571 | 0.218 |
| 16   | 0.004 | 0.000 | 0.117 | 0.012 | 0.024 | 0.211 |

The upper bits aren't uniformly distributed if you use more than 14 or 15 of them. Because of this, I don't recommend using this hash with hash tables larger than 2^14 buckets. The results are acceptable up to 2^16 buckets for text keys, so you may be able to use it for that purpose if you carefully test the performance in your particular application.

**Avalanche Test**

We examine the diffusion of input bits into the different bit positions of the output.

Figure fnv2 shows the results of this test for the FNV hash. It indicates that the FNV hash has poor avalanche behavior, as do all simple multiplicative hashes. This means there will be some classes of keys for which the hash function doesn't produce uniform output, even for the small bucket counts where the χ² tests succeeded above.

Of particular concern are the two low-order bits. These are always just a simple linear function of the two low-order bits of the key octets. For example, in the full 32-bit hash value, the low-order bit is always just the XOR of the LSBs of the key octets and the LSB of the initialization constant. Also of concern is the fact that the upper bits of the key octets have no influence on the low-order bits of the hash output (without xor-folding). The MSB of each key octet doesn't diffuse at all into the entire lower 8 bits of the hash value. This means that you *must* follow the xor-folding recommendations for all classes of keys.
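This linearity is easy to demonstrate: multiplying by the odd constant 16777619 never changes bit 0, so the low bit of the unfolded hash is just the parity of the key octets' low bits xored with the low bit of the offset basis. A sketch (mine):

```csharp
using System;

public class FnvLowBit
{
    // Unfolded FNV-1, as in the listing above.
    public static uint Fnv(byte[] data)
    {
        uint hash = 2166136261;
        foreach (byte b in data)
            hash = (hash * 16777619) ^ b;
        return hash;
    }

    public static void Main()
    {
        var rng = new Random(99);
        for (int trial = 0; trial < 10000; trial++)
        {
            var key = new byte[rng.Next(1, 20)];
            rng.NextBytes(key);

            // Predicted LSB: parity of key LSBs xor LSB of 2166136261 (odd).
            uint predicted = 1;
            foreach (byte b in key)
                predicted ^= (uint) (b & 1);

            if ((Fnv(key) & 1) != predicted) throw new Exception("mismatch");
        }
        Console.WriteLine("low bit is a linear function of key LSBs");
    }
}
```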

Also, the mixing step of this hash is never applied to the last octet of the key, and this shows clearly in the avalanche results. The authors offer an alternate form of the hash in which the combining step is done *before* the mixing step, and I recommend that you adopt this alternative if you use FNV. There is no reason to do otherwise, and I'm surprised that the authors don't recommend it by default.

**Conclusion**

Figure fnv4

*Modified FNV with good avalanche behavior and uniform distribution at larger hash sizes.*

```csharp
public class ModifiedFNV : HashFunction
{
    public override uint ComputeHash(byte[] data)
    {
        const uint p = 16777619;
        uint hash = 2166136261;
        foreach (byte b in data)
            hash = (hash ^ b) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        return hash;
    }
}
```

I don't recommend using 32-bit FNV as a general-purpose hash as-is. It can produce uniform output, but its suitability needs to be tested for each class of keys you intend to use it with. It's likely to produce non-uniform output if you have more than 2^14 or 2^15 hash buckets. If the CPU architecture doesn't have fast integer multiplication, use a shift-add-xor hash instead.

If you want to use an FNV-style hash function, I recommend the modified version listed in Figure fnv4. This version passes all of the uniform distribution tests above and achieves avalanche for every tested combination of input and output bits (green squares everywhere). No xor-folding step is required.

The only difference between this version and the original is that the mixing step occurs *after* the combining step in each round, plus a post-processing step that mixes the bits even further. These two changes completely correct the avalanche behavior of the function. As a result, this version of FNV passes all of the χ² tests above, all the way up to 2^16 buckets. I haven't tested larger sizes, but I suspect it would do well there too.