colsketch

package module
v0.0.0-...-ea58d95 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 18, 2023 License: MIT Imports: 2 Imported by: 0

README

colsketch

Background

Package colsketch provides a single special-purpose lossy compression code, designed for use as a "scan accelerator" for database storage. Such codes are not replacements for underlying values; rather they provide cheap approximate answers to predicates that may be sufficient to elide accessing the underlying data, similar to the way bloom filters can elide lookups, but supporting more general predicates (eg. tabulations, range-queries).

Put another way: rewriting a query on the underlying data values to a query on codes can produce false positives -- requiring a secondary query of the underlying data -- but no false negatives. And for half the codes in a given dictionary (the "exact" codes, assigned to high-frequency inputs), they also do not produce false positives.

The codes are "cheap" (i.e. actually useful for acceleration) for three reasons:

  1. They are small, so conserve memory bandwidth: 1 or 2 bytes per code, vs. 8 bytes for an underlying float/u64 value, or more for a string, high resolution timestamp, uuid or large-decimal type.

  2. They are simple integers, where the underlying data may be something more costly to process.

  3. They are SIMD-friendly: an AVX2 scan can look at 16 or 32 codes at a time, and a GPU scan can look at hundreds at a time.

The package is equally usable for numeric, textual or categorical data. All it needs is something ordered. It includes wrapper types for floating point.

The codes it produces have the following characteristics:

  1. Each code value is logically 8 or 16 bits (depending on the Mode enum). The user decides whether to operate with 8 or 16 bits: 8 bit codes should be used for memory-only scans, to elide 64-byte cache-line accesses; 16 bit codes should be used for disk scans, to elide 4k page accesses.

  2. Code value 0 is unused, so that subsequent compression can use it as a sentinel or missing-value code.

  3. All other codes alternate between even/exact (representing a specific value in the input) and odd/inexact (representing an open interval of possible input values). Values 1 and 0xff (or 0xffff, or whatever the final odd code is in the dictionary) thus encode one-sided lower and upper open intervals.

  4. Codes are assigned to cover a sample provided by the user, which is internally sorted and then partitioned into equal-sized bins, including duplicates. Then each run of duplicates within a bin is counted. The sample value with the longest run -- i.e. the highest-frequency sample value -- within each bin is given an (even) exact code. Then an (odd) inexact code is given to each open interval of sample values between sample values that were given exact codes. The provided sample should therefore be big enough to be representative of the total input; but if it is not representative, encoding still works, it just loses efficiency.

  5. The assigned codes imply order and preserve equality, specifically:

    • code(a) < code(b) implies a < b
    • a < b implies code(a) <= code(b)
    • a == b implies code(a) == code(b)

Reference

Brian Hentschel, Michael S. Kester, and Stratos Idreos. 2018. Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 857–872.

DOI: https://doi.org/10.1145/3183713.3196911

https://stratos.seas.harvard.edu/files/stratos/files/sketches.pdf

Based on https://github.com/graydon/ordbog

TODO

  • Port over property based tests
  • Port over random dist example as test
  • Ensure Dict constructed with samples with float NaNs behaves correctly
  • Write test that exercises the dictionary to construct a full sketch and leverages it for predicate evaluation
  • Benchmark
  • CI
  • SIMD?

Documentation

Overview

Package colsketch provides a single special-purpose lossy compression code, designed for use as a "scan accelerator" for database storage. Such codes are not replacements for underlying values; rather they provide cheap approximate answers to predicates that may be sufficient to elide accessing the underlying data, similar to the way bloom filters can elide lookups, but supporting more general predicates (eg. tabulations, range-queries).

Put another way: rewriting a query on the underlying data values to a query on codes can produce false positives -- requiring a secondary query of the underlying data -- but no false negatives. And for half the codes in a given dictionary (the "exact" codes, assigned to high-frequency inputs), they also do not produce false positives.

The codes are "cheap" (i.e. actually useful for acceleration) for three reasons:

  1. They are small, so conserve memory bandwidth: 1 or 2 bytes per code, vs. 8 bytes for an underlying float/u64 value, or more for a string, high resolution timestamp, uuid or large-decimal type.

  2. They are simple integers, where the underlying data may be something more costly to process.

  3. They are SIMD-friendly: an AVX2 scan can look at 16 or 32 codes at a time, and a GPU scan can look at hundreds at a time.

The package is equally usable for numeric, textual or categorical data. All it needs is something ordered. It includes wrapper types for floating point.

The codes it produces have the following characteristics:

  1. Each code value is logically 8 or 16 bits (depending on the `Mode` enum). The user decides whether to operate with 8 or 16 bits: 8 bit codes should be used for memory-only scans, to elide 64-byte cache-line accesses; 16 bit codes should be used for disk scans, to elide 4k page accesses.

  2. Code value 0 is unused, so that subsequent compression can use it as a sentinel or missing-value code.

  3. All other codes alternate between even/exact (representing a specific value in the input) and odd/inexact (representing an open interval of possible input values). Values 1 and 0xff (or 0xffff, or whatever the final odd code is in the dictionary) thus encode one-sided lower and upper open intervals.

  4. Codes are assigned to cover a _sample_ provided by the user, which is internally sorted and then partitioned into equal-sized bins, including duplicates. Then each run of duplicates within a bin is counted. The sample value with the longest run -- i.e. the highest-frequency sample value -- within each bin is given an (even) exact code. Then an (odd) inexact code is given to each open interval of sample values between sample values that were given exact codes. The provided sample should therefore be big enough to be representative of the total input; but if it is not representative, encoding still works, it just loses efficiency.

  5. The assigned codes imply order and preserve equality, specifically: - `code(a) < code(b)` implies `a < b` - `a < b` implies `code(a) <= code(b)` - `a == b` implies `code(a) == code(b)`

## Reference

Brian Hentschel, Michael S. Kester, and Stratos Idreos. 2018. Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 857–872.

DOI: <https://doi.org/10.1145/3183713.3196911>

<https://stratos.seas.harvard.edu/files/stratos/files/sketches.pdf>

Based on https://github.com/graydon/ordbog

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Code

type Code uint16

Code represents a dictionary code value.

func (Code) IsExact

func (c Code) IsExact() bool

IsExact returns true iff the code is an _exact_ code, i.e. a code which represents a single underlying value rather than a range of possible values. This is true iff the code is an even number.

type Dict

type Dict[T cmp.Ordered] struct {
	// contains filtered or unexported fields
}

Dict is dictionary over an underlying type `T` conforming to cmp.Ordered. The dictionary maps underlying values to Codes to use in a sketch, using the Encode method.

func NewDict

func NewDict[T cmp.Ordered](mode Mode, sample []T) Dict[T]

NewDict builds a dictionary with a given Mode over a provided sample.

func (*Dict[T]) Encode

func (d *Dict[T]) Encode(value T) Code

Encode looks up the code for a value of the underlying value type `T`.

func (*Dict[T]) Len

func (d *Dict[T]) Len() int

Len returns the number of codes in the dictionary.

type Mode

type Mode uint16

Mode indicates whether to build a small Dict of up to 255 values or a larger one of up to 65535 values.

const (
	// Byte builds a Dict with up to 255 codes ranging over `[1,255]`. This mode is
	// most appropriate when building a sketch that elides accesses to smaller
	// underlying storage blocks like 64-byte cache lines, where the small
	// repertoire of codes (and thus higher per-element chance of a false
	// positive) is offset by the small size of each storage block (and thus
	// small number of elements).
	Byte Mode = iota

	// Word builds a Dict with up to 65535 codes ranging over `[1,65535]`. This
	// mode is most appropriate when building a sketch that elides access to
	// larger underlying storage blocks like 4096-byte pages, where the larger
	// number of elements per storage block demands a comparatively low false
	// positive probability per element.
	Word
)

func (Mode) MaxExactCode

func (m Mode) MaxExactCode() Code

MaxExactCode returns the maximum exact code in the mode.

func (Mode) MaxInexactCode

func (m Mode) MaxInexactCode() Code

MaxInexactCode returns the maximum inexact code in the mode.

func (Mode) NumExactCodes

func (m Mode) NumExactCodes() int

NumExactCodes returns the count of exact codes in the mode.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL