colsketch

package module

v0.0.0-...-ea58d95 Latest Latest Go to latest Published: Aug 18, 2023 License: MIT Imports: 2 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/tsenart/colsketch

Links

Open Source Insights

README ¶

colsketch

Background

Package colsketch provides a single special-purpose lossy compression code, designed for use as a "scan accelerator" for database storage. Such codes are not replacements for underlying values; rather they provide cheap approximate answers to predicates that may be sufficient to elide accessing the underlying data, similar to the way bloom filters can elide lookups, but supporting more general predicates (eg. tabulations, range-queries).

Put another way: rewriting a query on the underlying data values to a query on codes can produce false positives -- requiring a secondary query of the underlying data -- but no false negatives. And for half the codes in a given dictionary (the "exact" codes, assigned to high-frequency inputs), they also do not produce false positives.

The codes are "cheap" (i.e. actually useful for acceleration) for three reasons:

They are small, so conserve memory bandwidth: 1 or 2 bytes per code, vs. 8 bytes for an underlying float/u64 value, or more for a string, high resolution timestamp, uuid or large-decimal type.
They are simple integers, where the underlying data may be something more costly to process.
They are SIMD-friendly: an AVX2 scan can look at 16 or 32 codes at a time, and a GPU scan can look at hundreds at a time.

The package is equally usable for numeric, textual or categorical data. All it needs is something ordered. It includes wrapper types for floating point.

The codes it produces have the following characteristics:

Each code value is logically 8 or 16 bits (depending on the Mode enum). The user decides whether to operate with 8 or 16 bits: 8 bit codes should be used for memory-only scans, to elide 64-byte cache-line accesses; 16 bit codes should be used for disk scans, to elide 4k page accesses.
Code value 0 is unused, so that subsequent compression can use it as a sentinel or missing-value code.
All other codes alternate between even/exact (representing a specific value in the input) and odd/inexact (representing an open interval of possible input values). Values 1 and 0xff (or 0xffff, or whatever the final odd code is in the dictionary) thus encode one-sided lower and upper open intervals.
Codes are assigned to cover a sample provided by the user, which is internally sorted and then partitioned into equal-sized bins, including duplicates. Then each run of duplicates within a bin is counted. The sample value with the longest run -- i.e. the highest-frequency sample value -- within each bin is given an (even) exact code. Then an (odd) inexact code is given to each open interval of sample values between sample values that were given exact codes. The provided sample should therefore be big enough to be representative of the total input; but if it is not representative, encoding still works, it just loses efficiency.
The assigned codes imply order and preserve equality, specifically:
- code(a) < code(b) implies a < b
- a < b implies code(a) <= code(b)
- a == b implies code(a) == code(b)

Reference

Brian Hentschel, Michael S. Kester, and Stratos Idreos. 2018. Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 857–872.

DOI: https://doi.org/10.1145/3183713.3196911

https://stratos.seas.harvard.edu/files/stratos/files/sketches.pdf

Based on https://github.com/graydon/ordbog

TODO

Port over property based tests
Port over random dist example as test
Ensure Dict constructed with samples with float NaNs behaves correctly
Write test that exercises the dictionary to construct a full sketch and leverages it for predicate evaluation
Benchmark
CI
SIMD?

Documentation ¶

Overview ¶

Package colsketch provides a single special-purpose lossy compression code, designed for use as a "scan accelerator" for database storage. Such codes are not replacements for underlying values; rather they provide cheap approximate answers to predicates that may be sufficient to elide accessing the underlying data, similar to the way bloom filters can elide lookups, but supporting more general predicates (eg. tabulations, range-queries).

Put another way: rewriting a query on the underlying data values to a query on codes can produce false positives -- requiring a secondary query of the underlying data -- but no false negatives. And for half the codes in a given dictionary (the "exact" codes, assigned to high-frequency inputs), they also do not produce false positives.

The codes are "cheap" (i.e. actually useful for acceleration) for three reasons:

They are small, so conserve memory bandwidth: 1 or 2 bytes per code, vs. 8 bytes for an underlying float/u64 value, or more for a string, high resolution timestamp, uuid or large-decimal type.
They are simple integers, where the underlying data may be something more costly to process.
They are SIMD-friendly: an AVX2 scan can look at 16 or 32 codes at a time, and a GPU scan can look at hundreds at a time.

The package is equally usable for numeric, textual or categorical data. All it needs is something ordered. It includes wrapper types for floating point.

The codes it produces have the following characteristics:

Each code value is logically 8 or 16 bits (depending on the `Mode` enum). The user decides whether to operate with 8 or 16 bits: 8 bit codes should be used for memory-only scans, to elide 64-byte cache-line accesses; 16 bit codes should be used for disk scans, to elide 4k page accesses.
Code value 0 is unused, so that subsequent compression can use it as a sentinel or missing-value code.
All other codes alternate between even/exact (representing a specific value in the input) and odd/inexact (representing an open interval of possible input values). Values 1 and 0xff (or 0xffff, or whatever the final odd code is in the dictionary) thus encode one-sided lower and upper open intervals.
Codes are assigned to cover a _sample_ provided by the user, which is internally sorted and then partitioned into equal-sized bins, including duplicates. Then each run of duplicates within a bin is counted. The sample value with the longest run -- i.e. the highest-frequency sample value -- within each bin is given an (even) exact code. Then an (odd) inexact code is given to each open interval of sample values between sample values that were given exact codes. The provided sample should therefore be big enough to be representative of the total input; but if it is not representative, encoding still works, it just loses efficiency.
The assigned codes imply order and preserve equality, specifically: - `code(a) < code(b)` implies `a < b` - `a < b` implies `code(a) <= code(b)` - `a == b` implies `code(a) == code(b)`

## Reference

Brian Hentschel, Michael S. Kester, and Stratos Idreos. 2018. Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 857–872.

DOI: <https://doi.org/10.1145/3183713.3196911>

<https://stratos.seas.harvard.edu/files/stratos/files/sketches.pdf>

Based on https://github.com/graydon/ordbog

Index ¶

type Code
- func (c Code) IsExact() bool
type Dict
- func NewDict[T cmp.Ordered](mode Mode, sample []T) Dict[T]
- func (d *Dict[T]) Encode(value T) Code
- func (d *Dict[T]) Len() int
type Mode

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Code ¶

type Code uint16

Code represents a dictionary code value.

func (Code) IsExact ¶

func (c Code) IsExact() bool

IsExact returns true iff the code is an _exact_ code, i.e. a code which represents a single underlying value rather than a range of possible values. This is true iff the code is an even number.

type Dict ¶

type Dict[T cmp.Ordered] struct {
	// contains filtered or unexported fields
}

Dict is dictionary over an underlying type `T` conforming to cmp.Ordered. The dictionary maps underlying values to Codes to use in a sketch, using the Encode method.

func NewDict ¶

func NewDict[T cmp.Ordered](mode Mode, sample []T) Dict[T]

NewDict builds a dictionary with a given Mode over a provided sample.

func (*Dict[T]) Encode ¶

func (d *Dict[T]) Encode(value T) Code

Encode looks up the code for a value of the underlying value type `T`.

func (*Dict[T]) Len ¶

func (d *Dict[T]) Len() int

Len returns the number of codes in the dictionary.

type Mode ¶

type Mode uint16

Mode indicates whether to build a small Dict of up to 255 values or a larger one of up to 65535 values.

const (
	// Byte builds a Dict with up to 255 codes ranging over `[1,255]`. This mode is
	// most appropriate when building a sketch that elides accesses to smaller
	// underlying storage blocks like 64-byte cache lines, where the small
	// repertoire of codes (and thus higher per-element chance of a false
	// positive) is offset by the small size of each storage block (and thus
	// small number of elements).
	Byte Mode = iota

	// Word builds a Dict with up to 65535 codes ranging over `[1,65535]`. This
	// mode is most appropriate when building a sketch that elides access to
	// larger underlying storage blocks like 4096-byte pages, where the larger
	// number of elements per storage block demands a comparatively low false
	// positive probability per element.
	Word
)

func (Mode) MaxExactCode ¶

func (m Mode) MaxExactCode() Code

MaxExactCode returns the maximum exact code in the mode.

func (Mode) MaxInexactCode ¶

func (m Mode) MaxInexactCode() Code

MaxInexactCode returns the maximum inexact code in the mode.

func (Mode) NumExactCodes ¶

func (m Mode) NumExactCodes() int

NumExactCodes returns the count of exact codes in the mode.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL