jump

package module
v1.0.2 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 16, 2021 License: MIT Imports: 5 Imported by: 19

README

Jump Consistent Hash

Build Status Godoc

Go implementation of the jump consistent hash algorithm[1] by John Lamping and Eric Veach.

[1] http://arxiv.org/pdf/1406.2294v1.pdf

Usage

import jump "github.com/lithammer/go-jump-consistent-hash"

func main() {
    h := jump.Hash(256, 1024)  // h = 520
}

Includes a helper function for using a string as key instead of an uint64. This requires a hasher that computes the string into a format accepted by Hash(). Such a hasher that uses CRC-64 (ECMA) is also included for convenience.

h := jump.HashString("127.0.0.1", 8, jump.NewCRC64())  // h = 7

In reality though you probably want to use a Hasher so you won't have to repeat the bucket size and which key hasher used. It also uses more convenient types, like int instead of int32.

hasher := jump.New(8, jump.NewCRC64())
h := hasher.Hash("127.0.0.1")  // h = 7

If you want to use your own algorithm, you must implement the KeyHasher interface, which is a subset of the hash.Hash64 interface available in the standard library.

Here's an example of a custom KeyHasher that uses Google's FarmHash algorithm (the successor of CityHash) to compute the final key.

type FarmHash struct {
    buf bytes.Buffer
}

func (f *FarmHash) Write(p []byte) (n int, err error) {
    return f.buf.Write(p)
}

func (f *FarmHash) Reset() {
    f.buf.Reset()
}

func (f *FarmHash) Sum64() uint64 {
    // https://github.com/dgryski/go-farm
    return farm.Hash64(f.buf.Bytes())
}

hasher := jump.New(8, &FarmHash{})
h := hasher.Hash("127.0.0.1")  // h = 5

License

MIT

Documentation

Overview

Package jump implements the "jump consistent hash" algorithm.

Example

h := jump.Hash(256, 1024)  // h = 520

Reference C++ implementation[1]

int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
  int64_t b = -1, j = 0;
  while (j < num_buckets) {
    b = j;
    key = key * 2862933555777941757ULL + 1;
    j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
  }
  return b;
}

Explanation of the algorithm

Jump consistent hash works by computing when its output changes as the number of buckets increases. Let ch(key, num_buckets) be the consistent hash for the key when there are num_buckets buckets. Clearly, for any key, k, ch(k, 1) is 0, since there is only the one bucket. In order for the consistent hash function to balanced, ch(k, 2) will have to stay at 0 for half the keys, k, while it will have to jump to 1 for the other half. In general, ch(k, n+1) has to stay the same as ch(k, n) for n/(n+1) of the keys, and jump to n for the other 1/(n+1) of the keys.

Here are examples of the consistent hash values for three keys, k1, k2, and k3, as num_buckets goes up:

   │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │ 10 │ 11 │ 12 │ 13 │ 14
───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼────┼────┼────┼────┼────
k1 │ 0 │ 0 │ 2 │ 2 │ 4 │ 4 │ 4 │ 4 │ 4 │  4 │  4 │  4 │  4 │  4
───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼────┼────┼────┼────┼────
k2 │ 0 │ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │ 7 │ 7 │  7 │  7 │  7 │  7 │  7
───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼────┼────┼────┼────┼────
k3 │ 0 │ 1 │ 1 │ 1 │ 1 │ 5 │ 5 │ 7 │ 7 │  7 │ 10 │ 10 │ 10 │ 10

A linear time algorithm can be defined by using the formula for the probability of ch(key, j) jumping when j increases. It essentially walks across a row of this table. Given a key and number of buckets, the algorithm considers each successive bucket, j, from 1 to num_buckets­1, and uses ch(key, j) to compute ch(key, j+1). At each bucket, j, it decides whether to keep ch(k, j+1) the same as ch(k, j), or to jump its value to j. In order to jump for the right fraction of keys, it uses a pseudo­random number generator with the key as its seed. To jump for 1/(j+1) of keys, it generates a uniform random number between 0.0 and 1.0, and jumps if the value is less than 1/(j+1). At the end of the loop, it has computed ch(k, num_buckets), which is the desired answer. In code:

int ch(int key, int num_buckets) {
  random.seed(key);
  int b = 0; // This will track ch(key,j+1).
  for (int j = 1; j < num_buckets; j++) {
    if (random.next() < 1.0 / (j + 1)) b = j;
  }
  return b;
}

We can convert this to a logarithmic time algorithm by exploiting that ch(key, j+1) is usually unchanged as j increases, only jumping occasionally. The algorithm will only compute the destinations of jumps ­­ the j’s for which ch(key, j+1) ≠ ch(key, j). Also notice that for these j’s, ch(key, j+1) = j. To develop the algorithm, we will treat ch(key, j) as a random variable, so that we can use the notation for random variables to analyze the fractions of keys for which various propositions are true. That will lead us to a closed form expression for a pseudo­random variable whose value gives the destination of the next jump.

Suppose that the algorithm is tracking the bucket numbers of the jumps for a particular key, k. And suppose that b was the destination of the last jump, that is, ch(k, b) ≠ ch(k, b+1), and ch(k, b+1) = b. Now, we want to find the next jump, the smallest j such that ch(k, j+1) ≠ ch(k, b+1), or equivalently, the largest j such that ch(k, j) = ch(k, b+1). We will make a pseudo­random variable whose value is that j. To get a probabilistic constraint on j, note that for any bucket number, i, we have j ≥ i if and only if the consistent hash hasn’t changed by i, that is, if and only if ch(k, i) = ch(k, b+1). Hence, the distribution of j must satisfy

P(j ≥ i) = P( ch(k, i) = ch(k, b+1) )

Fortunately, it is easy to compute that probability. Notice that since P( ch(k, 10) = ch(k, 11) ) is 10/11, and P( ch(k, 11) = ch(k, 12) ) is 11/12, then P( ch(k, 10) = ch(k, 12) ) is 10/11 * 11/12 = 10/12. In general, if n ≥ m, P( ch(k, n) = ch(k, m) ) = m / n. Thus for any i > b,

P(j ≥ i) = P( ch(k, i) = ch(k, b+1) ) = (b+1) / i .

Now, we generate a pseudo­random variable, r, (depending on k and j) that is uniformly distributed between 0 and 1. Since we want P(j ≥ i) = (b+1) / i, we set P(j ≥ i) iff r ≤ (b+1) / i. Solving the inequality for i yields P(j ≥ i) iff i ≤ (b+1) / r. Since i is a lower bound on j, j will equal the largest i for which P(j ≥ i), thus the largest i satisfying i ≤ (b+1) / r. Thus, by the definition of the floor function, j = floor((b+1) / r).

Using this formula, jump consistent hash finds ch(key, num_buckets) by choosing successive jump destinations until it finds a position at or past num_buckets. It then knows that the previous jump destination is the answer.

int ch(int key, int num_buckets) {
  random.seed(key);
  int b = -1; // bucket number before the previous jump
  int j = 0;  // bucket number before the current jump
    while (j < num_buckets) {
      b = j;
      r = random.next();
      j = floor((b + 1) / r);
  }
  return = b;
}

To turn this into the actual code of figure 1, we need to implement random. We want it to be fast, and yet to also to have well distributed successive values. We use a 64­bit linear congruential generator; the particular multiplier we use produces random numbers that are especially well distributed in higher dimensions (i.e., when successive random values are used to form tuples). We use the key as the seed. (For keys that don’t fit into 64 bits, a 64 bit hash of the key should be used.) The congruential generator updates the seed on each iteration, and the code derives a double from the current seed. Tests show that this generator has good speed and distribution.

It is worth noting that unlike the algorithm of Karger et al., jump consistent hash does not require the key to be hashed if it is already an integer. This is because jump consistent hash has an embedded pseudorandom number generator that essentially rehashes the key on every iteration. The hash is not especially good (i.e., linear congruential), but since it is applied repeatedly, additional hashing of the input key is not necessary.

[1] http://arxiv.org/pdf/1406.2294v1.pdf

Index

Examples

Constants

This section is empty.

Variables

View Source
var (
	// CRC32 uses the 32-bit Cyclic Redundancy Check (CRC-32) with the IEEE
	// polynomial.
	NewCRC32 func() hash.Hash64 = func() hash.Hash64 { return &crc32Hasher{crc32.NewIEEE()} }
	// CRC64 uses the 64-bit Cyclic Redundancy Check (CRC-64) with the ECMA
	// polynomial.
	NewCRC64 func() hash.Hash64 = func() hash.Hash64 { return crc64.New(crc64.MakeTable(crc64.ECMA)) }
	// FNV1 uses the non-cryptographic hash function FNV-1.
	NewFNV1 func() hash.Hash64 = func() hash.Hash64 { return fnv.New64() }
	// FNV1a uses the non-cryptographic hash function FNV-1a.
	NewFNV1a func() hash.Hash64 = func() hash.Hash64 { return fnv.New64a() }

	// These are deprecated because they're not safe for concurrent use. Please
	// use the New* functions instead.
	CRC32 hash.Hash64 = &crc32Hasher{crc32.NewIEEE()}
	CRC64 hash.Hash64 = crc64.New(crc64.MakeTable(crc64.ECMA))
	FNV1  hash.Hash64 = fnv.New64()
	FNV1a hash.Hash64 = fnv.New64a()
)

KeyHashers available in the standard library for use with HashString() and Hasher.

Functions

func Hash

func Hash(key uint64, buckets int32) int32

Hash takes a 64 bit key and the number of buckets. It outputs a bucket number in the range [0, buckets). If the number of buckets is less than or equal to 0 then one 1 is used.

Example
fmt.Print(Hash(256, 1024))
Output:

520

func HashString added in v1.0.1

func HashString(key string, buckets int32, h KeyHasher) int32

HashString takes string as key instead of an int and uses a KeyHasher to generate a key compatible with Hash().

Example
fmt.Print(HashString("127.0.0.1", 8, NewCRC64()))
Output:

7

Types

type Hasher added in v1.0.1

type Hasher struct {
	// contains filtered or unexported fields
}

Hasher represents a jump consistent hasher using a string as key.

func New added in v1.0.1

func New(n int, h KeyHasher) *Hasher

New returns a new instance of of Hasher.

func (*Hasher) Hash added in v1.0.1

func (h *Hasher) Hash(key string) int

Hash returns the integer hash for the given key.

func (*Hasher) N added in v1.0.1

func (h *Hasher) N() int

N returns the number of buckets the hasher can assign to.

type KeyHasher added in v1.0.1

type KeyHasher interface {
	// Write (via the embedded io.Writer interface) adds more data to the
	// running hash.
	// It never returns an error.
	io.Writer

	// Reset resets the KeyHasher to its initial state.
	Reset()

	// Return the result of the added bytes (via io.Writer).
	Sum64() uint64
}

KeyHasher is a subset of hash.Hash64 in the standard library.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL