colblk

Documentation

Overview

Package colblk implements a columnar block format.

The columnar block format organizes data into columns. The number of columns and their data types are dependent on the use and are configurable.

Format

Every columnar block begins with a header describing the structure and schema of the block. Then column values are encoded in sequence. The block ends with a single padding byte.

The block header begins with:

  • Version number (1 byte)
  • The number of columns in the block (2 bytes)
  • The number of rows in the block (4 bytes)

Then follows a column-header for each column. Each column header encodes the data type (1 byte) and the offset into the block where the column data begins (4 bytes).

+-----------+
| Vers (1B) |
+-------------------+--------------------------------+
| # columns (2B)    | # of rows (4B)                 |
+-----------+-------+---------------------+----------+
| Type (1B) | Page offset (4B)                | Col 0
+-----------+---------------------------------+
| Type (1B) | Page offset (4B)                | Col 1
+-----------+---------------------------------+
| ...	    | ...                             |
+-----------+---------------------------------+
| Type (1B) | Page offset (4B)                | Col n-1
+-----------+----------------------------------
|  column 0 data                                ...
+----------------------------------------------
|  column 1 data                                ...
+----------------------------------------------
| ...
+----------------------------------------------
|  column n-1 data                              ...
+-------------+--------------------------------
| Unused (1B) |
+-------------+

The encoding of column data itself is dependent on the data type.

The trailing padding byte exists to allow columns to represent the end of column data using a pointer to the byte after the end of the column. The padding byte ensures that the pointer will always fall within the allocated memory. Without the padding byte, such a pointer would be invalid according to Go's pointer rules.
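
As a rough illustration, the 7-byte header described above could be decoded as follows (using encoding/binary). This is a sketch only: the byte order of the multi-byte fields is not specified here and little-endian is assumed; in practice the package's own Header and DecodeHeader (documented below) should be used.

func decodeHeaderSketch(data []byte) (version uint8, columns uint16, rows uint32) {
	// Layout per the diagram above: 1-byte version, 2-byte column count,
	// 4-byte row count. Little-endian is an assumption.
	version = data[0]
	columns = binary.LittleEndian.Uint16(data[1:3])
	rows = binary.LittleEndian.Uint32(data[3:7])
	return version, columns, rows
}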

Alignment

Block buffers are required to be word-aligned during encoding and decoding. This ensures that if any individual column or piece of data requires word alignment, the writer can pad the offset into the block buffer appropriately so that the data is word-aligned.
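
For example, a writer that needs a column to start at a word-aligned offset can round the offset up before encoding. A minimal sketch (alignment must be a power of two):

func alignOffset(offset, alignment uint32) uint32 {
	// Round offset up to the next multiple of alignment.
	return (offset + alignment - 1) &^ (alignment - 1)
}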

Keyspan blocks

Blocks encoding key spans (range deletions, range keys) decompose the fields of keyspan.Key into columns. Key spans are always stored fragmented, such that all overlapping keys have identical bounds. When encoding key spans to a columnar block, we take advantage of this fragmentation to store the set of unique user key span boundaries in a separate column. This column does not have the same number of rows as the other columns. Each individual fragment stores the index of its start boundary user key within the user key column.

For example, consider the three unfragmented spans:

a-e:{(#0,RANGEDEL)}
b-d:{(#100,RANGEDEL)}
f-g:{(#20,RANGEDEL)}

Once fragmented, these spans become five keyspan.Keys:

a-b:{(#0,RANGEDEL)}
b-d:{(#100,RANGEDEL), (#0,RANGEDEL)}
d-e:{(#0,RANGEDEL)}
f-g:{(#20,RANGEDEL)}

When encoded within a columnar keyspan block, the boundary columns (user key and start indices) would hold six rows:

+-----------------+-----------------+
| User key        | Start index     |
+-----------------+-----------------+
| a               | 0               |
+-----------------+-----------------+
| b               | 1               |
+-----------------+-----------------+
| d               | 3               |
+-----------------+-----------------+
| e               | 4               |
+-----------------+-----------------+
| f               | 4               |
+-----------------+-----------------+
| g               | 5               |
+-----------------+-----------------+

The remaining keyspan.Key columns would look like:

+-----------------+-----------------+-----------------+
| Trailer         | Suffix          | Value           |
+-----------------+-----------------+-----------------+
| (#0,RANGEDEL)   | -               | -               | (0)
+-----------------+-----------------+-----------------+
| (#100,RANGEDEL) | -               | -               | (1)
+-----------------+-----------------+-----------------+
| (#0,RANGEDEL)   | -               | -               | (2)
+-----------------+-----------------+-----------------+
| (#0,RANGEDEL)   | -               | -               | (3)
+-----------------+-----------------+-----------------+
| (#20,RANGEDEL)  | -               | -               | (4)
+-----------------+-----------------+-----------------+

This encoding does not explicitly encode the mapping of keyspan.Key to boundary keys. Rather, each boundary key stores the index of the first keyspan.Key whose start boundary is greater than or equal to that boundary key. Readers look up the key start index for the start boundary (s) and the end boundary (e). Any keys within indexes [s,e) have the corresponding bounds.
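
A sketch of that lookup using the example columns above (the slice names are illustrative, not part of the package):

// boundaries and startIndex mirror the boundary columns in the example.
boundaries := []string{"a", "b", "d", "e", "f", "g"}
startIndex := []int{0, 1, 3, 4, 4, 5}

// Keys for the span [b, d): s is the index stored for boundary "b", e is the
// index stored for boundary "d". Rows [s, e) = rows 1 and 2 hold its two keys.
s, e := startIndex[1], startIndex[2]
_, _, _ = boundaries, s, e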

Both range deletions and range keys are encoded with the same schema. Range deletion keyspan.Keys never contain suffixes or values. When one of these columns is encoded, the RawBytes encoding uses uintEncodingAllZero to avoid encoding N offsets. Each of these empty columns is encoded in just 1 byte of column data.

Index

Constants

const KeySeekerMetadataSize = 176

KeySeekerMetadataSize is chosen to fit the CockroachDB key seeker implementation.

const UintEncodingRowThreshold = 8

UintEncodingRowThreshold is the threshold under which the number of rows can affect the best encoding. This happens when the constant 8 bytes for the delta base doesn't make up for saving a byte or two in the per-row encoding.

Variables

This section is empty.

Functions

func AssertKeyCompare

func AssertKeyCompare(comparer *base.Comparer, a, b []byte, kcmp KeyComparison)

AssertKeyCompare compares two keys using the provided comparer, ensuring that the provided KeyComparison accurately describes the result. Panics if the assertion does not hold.

func Clone

func Clone[V any](a Array[V], n int) []V

Clone clones the first n elements of the array a.

func DecodeColumn

func DecodeColumn[V any](
	d *BlockDecoder, col int, rows int, dataType DataType, decodeFunc DecodeFunc[V],
) V

DecodeColumn decodes the col'th column of the provided decoder's block as a column of dataType using decodeFunc.
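
For example, a bool column could be decoded into a Bitmap via DecodeBitmap, whose signature matches DecodeFunc[Bitmap]. A hedged usage sketch, assuming the colblk package is imported and column col of the block holds a DataTypeBool column:

// d is a *colblk.BlockDecoder initialized elsewhere.
bm := colblk.DecodeColumn(d, col, d.Rows(), colblk.DataTypeBool, colblk.DecodeBitmap)
_ = bm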

func FinishBlock

func FinishBlock(rows int, writers []ColumnWriter) []byte

FinishBlock writes the columnar block to a heap-allocated byte slice. FinishBlock assumes all columns have the same number of rows. If that's not the case, the caller should manually construct their own block.
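
A minimal end-to-end sketch, assuming the colblk package is imported and that a zero-value BitmapBuilder is ready for use: write a single bool column, finish the block, and read it back.

var b colblk.BitmapBuilder
b.Set(0)
b.Set(3)
const rows = 5
// FinishBlock writes the block header, the single column, and the padding byte.
blk := colblk.FinishBlock(rows, []colblk.ColumnWriter{&b})

d := colblk.DecodeBlock(blk, 0 /* customHeaderSize */)
bm := d.Bitmap(0)
_ = bm.At(3) // true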

func InitDataBlockMetadata

func InitDataBlockMetadata(schema *KeySchema, md *block.Metadata, data []byte) (err error)

InitDataBlockMetadata initializes the metadata for a data block.

func InitIndexBlockMetadata

func InitIndexBlockMetadata(md *block.Metadata, data []byte) (err error)

InitIndexBlockMetadata initializes the metadata for an index block.

func InitKeyspanBlockMetadata

func InitKeyspanBlockMetadata(md *block.Metadata, data []byte) (err error)

InitKeyspanBlockMetadata initializes the metadata for a rangedel or range key block.

Types

type Array

type Array[V any] interface {
	// At returns the i'th value in the array.
	At(i int) V
}

An Array provides indexed access to an array of values.

type Bitmap

type Bitmap struct {
	// contains filtered or unexported fields
}

Bitmap is a bitmap structure built on a []uint64. A bitmap utilizes ~1 physical bit/logical bit (~0.125 bytes/row). The bitmap is encoded into an 8-byte aligned array of 64-bit words which is (nRows+63)/64 words in length.

A summary bitmap is stored after the primary bitmap in which each bit in the summary bitmap corresponds to 1 word in the primary bitmap. A bit is set in the summary bitmap if the corresponding word in the primary bitmap is non-zero. The summary bitmap accelerates predecessor and successor operations.
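
A rough size calculation following this layout (the exact padding of the summary bitmap to word granularity is an assumption):

func bitmapWords(nRows int) (primaryWords, summaryWords int) {
	primaryWords = (nRows + 63) / 64
	// One summary bit per primary word, rounded up to whole words.
	summaryWords = (primaryWords + 63) / 64
	return primaryWords, summaryWords
}

// e.g. nRows = 1000: 16 primary words + 1 summary word = 17 words = 136 bytes.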

func DecodeBitmap

func DecodeBitmap(b []byte, off uint32, bitCount int) (bitmap Bitmap, endOffset uint32)

DecodeBitmap decodes the structure of a Bitmap and returns a Bitmap that reads from b supporting bitCount logical bits. No bounds checking is performed, so the caller must guarantee the bitmap is appropriately sized and the provided bitCount correctly identifies the number of bits in the bitmap.

func (Bitmap) At

func (b Bitmap) At(i int) bool

At returns true if the bit at position i is set and false otherwise.

func (Bitmap) SeekSetBitGE

func (b Bitmap) SeekSetBitGE(i int) int

SeekSetBitGE returns the next bit greater than or equal to i set in the bitmap. The i parameter must be ≥ 0. Returns the number of bits represented by the bitmap if no next bit is set (or if i >= bitCount).

func (Bitmap) SeekSetBitLE

func (b Bitmap) SeekSetBitLE(i int) int

SeekSetBitLE returns the previous bit less than or equal to i set in the bitmap. The i parameter must be in [0, bitCount). Returns -1 if no previous bit is set.

func (Bitmap) SeekUnsetBitGE

func (b Bitmap) SeekUnsetBitGE(i int) int

SeekUnsetBitGE returns the next bit greater than or equal to i that is unset in the bitmap. The i parameter must be in [0, bitCount). Returns the number of bits represented by the bitmap if no next bit is unset.

func (Bitmap) SeekUnsetBitLE

func (b Bitmap) SeekUnsetBitLE(i int) int

SeekUnsetBitLE returns the previous bit less than or equal to i that is unset in the bitmap. The i parameter must be in [0, bitCount). Returns -1 if no previous bit is unset.

func (Bitmap) String

func (b Bitmap) String() string

String returns a string representation of the entire bitmap.

type BitmapBuilder

type BitmapBuilder struct {
	// contains filtered or unexported fields
}

BitmapBuilder constructs a Bitmap. Bits default to false.

func (*BitmapBuilder) DataType

func (b *BitmapBuilder) DataType(int) DataType

DataType implements the ColumnWriter interface.

func (*BitmapBuilder) Finish

func (b *BitmapBuilder) Finish(col, nRows int, offset uint32, buf []byte) uint32

Finish finalizes the bitmap, computing the per-word summary bitmap and writing the resulting data to buf at offset.

func (*BitmapBuilder) Invert

func (b *BitmapBuilder) Invert(nRows int)

Invert inverts the bitmap, setting all bits that are not set and clearing all bits that are set. If the bitmap's tail is sparse and is not large enough to represent nRows rows, it's first materialized.

Note that Invert can affect the Size of the bitmap. Use InvertedSize() if you intend to invert the bitmap before finishing.

func (*BitmapBuilder) InvertedSize

func (b *BitmapBuilder) InvertedSize(rows int, offset uint32) uint32

InvertedSize returns the size of the encoded bitmap, assuming Invert will be called.

func (*BitmapBuilder) NumColumns

func (b *BitmapBuilder) NumColumns() int

NumColumns implements the ColumnWriter interface.

func (*BitmapBuilder) Reset

func (b *BitmapBuilder) Reset()

Reset resets the bitmap to the empty state.

func (*BitmapBuilder) Set

func (b *BitmapBuilder) Set(i int)

Set sets the bit at position i to true.

func (*BitmapBuilder) Size

func (b *BitmapBuilder) Size(rows int, offset uint32) uint32

Size implements the ColumnWriter interface.

func (*BitmapBuilder) WriteDebug

func (b *BitmapBuilder) WriteDebug(w io.Writer, rows int)

WriteDebug implements the ColumnWriter interface.

type BlockDecoder

type BlockDecoder struct {
	// contains filtered or unexported fields
}

A BlockDecoder holds metadata for accessing the columns of a columnar block.

func DecodeBlock

func DecodeBlock(data []byte, customHeaderSize uint32) BlockDecoder

DecodeBlock decodes the header of the provided columnar block and returns a new BlockDecoder configured to read from the block. The caller must ensure that the data is formatted according to the block layout specification.

func (*BlockDecoder) Bitmap

func (d *BlockDecoder) Bitmap(col int) Bitmap

Bitmap retrieves the col'th column as a bitmap. The column must be of type DataTypeBool.

func (*BlockDecoder) DataType

func (d *BlockDecoder) DataType(col int) DataType

DataType returns the data type of the col'th column. Every column's data type is encoded within the block header.

func (*BlockDecoder) FormattedString

func (d *BlockDecoder) FormattedString() string

FormattedString returns a formatted representation of the block's binary data.

func (*BlockDecoder) Init

func (d *BlockDecoder) Init(data []byte, customHeaderSize uint32)

Init initializes a BlockDecoder with the data contained in the provided block.

func (*BlockDecoder) PrefixBytes

func (d *BlockDecoder) PrefixBytes(col int) PrefixBytes

PrefixBytes retrieves the col'th column as a prefix-compressed byte slice column. The column must be of type DataTypePrefixBytes.

func (*BlockDecoder) RawBytes

func (d *BlockDecoder) RawBytes(col int) RawBytes

RawBytes retrieves the col'th column as a column of byte slices. The column must be of type DataTypeBytes.

func (*BlockDecoder) Rows

func (d *BlockDecoder) Rows() int

Rows returns the number of rows in the block, as indicated by the block header.

func (*BlockDecoder) Uints

func (d *BlockDecoder) Uints(col int) UnsafeUints

Uints retrieves the col'th column as a column of uints. The column must be of type DataTypeUint.

type ColumnWriter

type ColumnWriter interface {
	Encoder
	// NumColumns returns the number of columns the ColumnWriter will encode.
	NumColumns() int
	// DataType returns the data type of the col'th column.
	DataType(col int) DataType
	// Finish serializes the column at the specified index, writing the column's
	// data to buf at offset, and returning the offset at which the next column
	// should be encoded.
	//
	// The supplied buf must have enough space at the provided offset to fit the
	// column. The caller may use Size() to calculate the exact size required.
	// The caller passes the number of rows they want to serialize. All
	// implementations of Finish must support cases where rows is the number of
	// rows the caller has set, or one less. Some implementations may be more
	// permissive.
	//
	// The provided column index must be less than NumColumns(). Finish is
	// called for each index < NumColumns() in order.
	//
	// The provided buf must be word-aligned (at offset 0). If a column writer
	// requires a particular alignment, it's responsible for padding the offset
	// appropriately first.
	Finish(col, rows int, offset uint32, buf []byte) (nextOffset uint32)
}

ColumnWriter is an interface implemented by column encoders that accumulate a column's values and then serialize them.

type DataBlockDecoder

type DataBlockDecoder struct {
	// contains filtered or unexported fields
}

A DataBlockDecoder holds state for interpreting a columnar data block. It may be shared among multiple DataBlockIters.

func (*DataBlockDecoder) BlockDecoder

func (d *DataBlockDecoder) BlockDecoder() *BlockDecoder

BlockDecoder returns a pointer to the underlying BlockDecoder.

func (*DataBlockDecoder) Describe

func (d *DataBlockDecoder) Describe(f *binfmt.Formatter, tp treeprinter.Node)

Describe describes the binary format of the data block, assuming f.Offset() is positioned at the beginning of the same data block described by d.

func (*DataBlockDecoder) Init

func (d *DataBlockDecoder) Init(schema *KeySchema, data []byte)

Init initializes the data block reader with the given serialized data block.

func (*DataBlockDecoder) KeySchemaHeader

func (d *DataBlockDecoder) KeySchemaHeader() []byte

KeySchemaHeader returns the KeySchema-specific header.

func (*DataBlockDecoder) PrefixChanged

func (d *DataBlockDecoder) PrefixChanged() Bitmap

PrefixChanged returns the prefix-changed bitmap.

func (*DataBlockDecoder) Validate

func (d *DataBlockDecoder) Validate(comparer *base.Comparer, keySchema *KeySchema) error

Validate validates invariants that should hold across all data blocks.

type DataBlockEncoder

type DataBlockEncoder struct {
	Schema    *KeySchema
	KeyWriter KeyWriter
	// contains filtered or unexported fields
}

DataBlockEncoder encodes columnar data blocks using a user-defined schema.

func (*DataBlockEncoder) Add

func (w *DataBlockEncoder) Add(
	ikey base.InternalKey,
	value []byte,
	valuePrefix block.ValuePrefix,
	kcmp KeyComparison,
	isObsolete bool,
)

Add adds the provided key to the data block. Keys must be added in order. The caller must supply a KeyComparison containing the comparison of the key to the previously-added key, obtainable through

KeyWriter.ComparePrev(ikey.UserKey)

The caller is required to pass this in because in expected use cases, the caller will also require the same information.
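
A hedged sketch of the expected write loop. Here comparer is a *base.Comparer, the bundle size of 16 is arbitrary, and kvs, its fields, and valuePrefix are illustrative placeholders (constructing a block.ValuePrefix is out of scope here):

schema := colblk.DefaultKeySchema(comparer, 16)
var enc colblk.DataBlockEncoder
enc.Init(&schema)
for _, kv := range kvs {
	// ComparePrev supplies the KeyComparison that Add requires.
	kcmp := enc.KeyWriter.ComparePrev(kv.Key.UserKey)
	enc.Add(kv.Key, kv.Value, valuePrefix, kcmp, false /* isObsolete */)
}
finished, lastKey := enc.Finish(enc.Rows(), enc.Size())
_, _ = finished, lastKey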

func (*DataBlockEncoder) Finish

func (w *DataBlockEncoder) Finish(rows, size int) (finished []byte, lastKey base.InternalKey)

Finish serializes the pending data block, including the first [rows] rows. The value of [rows] must be Rows() or Rows()-1. The provided size must be the size of the data block with the provided row count (i.e., the return value of [Size] when DataBlockEncoder.Rows() = [rows]).

Finish returns the serialized, uncompressed data block and the InternalKey of the last key contained within the data block. The memory of the lastKey's UserKey is owned by the DataBlockEncoder. The caller must copy it if they require it to outlive a Reset of the writer.

func (*DataBlockEncoder) Init

func (w *DataBlockEncoder) Init(schema *KeySchema)

Init initializes the data block writer.

func (*DataBlockEncoder) MaterializeLastUserKey

func (w *DataBlockEncoder) MaterializeLastUserKey(appendTo []byte) []byte

MaterializeLastUserKey materializes the last added user key.

func (*DataBlockEncoder) Reset

func (w *DataBlockEncoder) Reset()

Reset resets the data block writer to its initial state, retaining buffers.

func (*DataBlockEncoder) Rows

func (w *DataBlockEncoder) Rows() int

Rows returns the number of rows in the current pending data block.

func (*DataBlockEncoder) Size

func (w *DataBlockEncoder) Size() int

Size returns the size of the current pending data block.

func (*DataBlockEncoder) String

func (w *DataBlockEncoder) String() string

String outputs a human-readable summary of internal DataBlockEncoder state.

type DataBlockIter

type DataBlockIter struct {
	// contains filtered or unexported fields
}

DataBlockIter iterates over a columnar data block.

func (*DataBlockIter) Close

func (i *DataBlockIter) Close() error

Close implements the base.InternalIterator interface.

func (*DataBlockIter) DebugTree

func (i *DataBlockIter) DebugTree(tp treeprinter.Node)

DebugTree is part of the InternalIterator interface.

func (*DataBlockIter) Error

func (i *DataBlockIter) Error() error

Error implements the base.InternalIterator interface. A DataBlockIter is infallible and always returns a nil error.

func (*DataBlockIter) First

func (i *DataBlockIter) First() *base.InternalKV

First implements the base.InternalIterator interface.

func (*DataBlockIter) Handle

func (i *DataBlockIter) Handle() block.BufferHandle

Handle returns the handle to the block.

func (*DataBlockIter) Init

func (i *DataBlockIter) Init(d *DataBlockDecoder, transforms block.IterTransforms) error

Init initializes the data block iterator, configuring it to read from the provided decoder.

func (*DataBlockIter) InitHandle

func (i *DataBlockIter) InitHandle(
	comparer *base.Comparer, h block.BufferHandle, transforms block.IterTransforms,
) error

InitHandle initializes the block from the provided buffer handle. InitHandle assumes that the block's metadata was initialized using InitDataBlockMetadata().

func (*DataBlockIter) InitOnce

func (i *DataBlockIter) InitOnce(
	keySchema *KeySchema,
	comparer *base.Comparer,
	getLazyValuer block.GetLazyValueForPrefixAndValueHandler,
)

InitOnce configures the data block iterator's key schema and lazy value handler. The iterator must be initialized with a block before it can be used. It may be reinitialized with new blocks without calling InitOnce again.

func (*DataBlockIter) Invalidate

func (i *DataBlockIter) Invalidate()

Invalidate invalidates the block iterator, removing references to the block it was initialized with. The iterator may continue to be used after a call to Invalidate, but all positioning methods should return false. Valid() must also return false.

func (*DataBlockIter) IsDataInvalidated

func (i *DataBlockIter) IsDataInvalidated() bool

IsDataInvalidated returns true when the iterator has been invalidated using an Invalidate call.

func (*DataBlockIter) IsLowerBound

func (i *DataBlockIter) IsLowerBound(k []byte) bool

IsLowerBound implements the block.DataBlockIterator interface.

func (*DataBlockIter) KV

func (i *DataBlockIter) KV() *base.InternalKV

KV returns the key-value pair at the current iterator position. The iterator must be positioned over a valid KV.

func (*DataBlockIter) Last

func (i *DataBlockIter) Last() *base.InternalKV

Last implements the base.InternalIterator interface.

func (*DataBlockIter) Next

func (i *DataBlockIter) Next() *base.InternalKV

Next advances to the next KV pair in the block.

func (*DataBlockIter) NextPrefix

func (i *DataBlockIter) NextPrefix(_ []byte) *base.InternalKV

NextPrefix moves the iterator to the next row with a different prefix than the key at the current iterator position.

The columnar block implementation uses a newPrefix bitmap to identify the next row with a differing prefix from the current row's key. If newPrefix[i] is set then row i's key prefix is different from row i-1's. The bitmap is organized as a slice of 64-bit words. If a 64-bit word in the bitmap is zero then all of the rows corresponding to the bits in that word have the same prefix and we can skip ahead. If a word is non-zero, a small amount of bit shifting and masking combined with bits.TrailingZeros64 can identify the next bit that is set after the current row. The bitmap uses 1 bit/row (0.125 bytes/row). A 32KB block containing 1300 rows (25 bytes/row) would need a bitmap of 21 64-bit words. Even in the worst case where every word is 0, this bitmap can be scanned in ~20 ns (1 ns/word), leading to a total NextPrefix time of ~30 ns if a row is found and decodeRow is called. In more normal cases, NextPrefix takes ~15% longer than a single Next call.
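
A simplified sketch of that word-skipping scan (ignoring the summary bitmap; bits is math/bits, and words is the raw []uint64 backing the bitmap): it returns the index of the first set bit at or after position i, or -1.

func nextSetBit(words []uint64, i int) int {
	w := i >> 6
	if w >= len(words) {
		return -1
	}
	// Clear bits below i within its word, then skip over all-zero words.
	word := words[w] &^ (uint64(1)<<(uint(i)&63) - 1)
	for word == 0 {
		w++
		if w >= len(words) {
			return -1
		}
		word = words[w]
	}
	return w<<6 + bits.TrailingZeros64(word)
}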

For comparison, the rowblk nextPrefixV3 optimizations work by setting a bit in the value prefix byte that indicates that the current key has the same prefix as the previous key. Additionally, a bit is stolen from the restart table entries indicating whether a restart table entry has the same key prefix as the previous entry. Checking the value prefix byte bit requires locating that byte which requires decoding 3 varints per key/value pair.

func (*DataBlockIter) Prev

func (i *DataBlockIter) Prev() *base.InternalKV

Prev moves the iterator to the previous KV pair in the block.

func (*DataBlockIter) SeekGE

func (i *DataBlockIter) SeekGE(key []byte, flags base.SeekGEFlags) *base.InternalKV

SeekGE implements the base.InternalIterator interface.

func (*DataBlockIter) SeekLT

func (i *DataBlockIter) SeekLT(key []byte, _ base.SeekLTFlags) *base.InternalKV

SeekLT implements the base.InternalIterator interface.

func (*DataBlockIter) SeekPrefixGE

func (i *DataBlockIter) SeekPrefixGE(prefix, key []byte, flags base.SeekGEFlags) *base.InternalKV

SeekPrefixGE implements the base.InternalIterator interface.

func (*DataBlockIter) SetBounds

func (i *DataBlockIter) SetBounds(lower, upper []byte)

SetBounds implements the base.InternalIterator interface.

func (*DataBlockIter) SetContext

func (i *DataBlockIter) SetContext(_ context.Context)

SetContext implements the base.InternalIterator interface.

func (*DataBlockIter) String

func (i *DataBlockIter) String() string

String implements the base.InternalIterator interface.

func (*DataBlockIter) Valid

func (i *DataBlockIter) Valid() bool

Valid returns true if the iterator is currently positioned at a valid KV.

type DataBlockRewriter

type DataBlockRewriter struct {
	KeySchema *KeySchema
	// contains filtered or unexported fields
}

DataBlockRewriter rewrites data blocks. See RewriteSuffixes.

func NewDataBlockRewriter

func NewDataBlockRewriter(keySchema *KeySchema, comparer *base.Comparer) *DataBlockRewriter

NewDataBlockRewriter creates a block rewriter.

func (*DataBlockRewriter) RewriteSuffixes

func (rw *DataBlockRewriter) RewriteSuffixes(
	input []byte, from []byte, to []byte,
) (start, end base.InternalKey, rewritten []byte, err error)

RewriteSuffixes rewrites the input block. It expects the input block to only contain keys with the suffix `from`. It rewrites the block to contain the same keys with the suffix `to`.

RewriteSuffixes returns the start and end keys of the rewritten block, and the finished rewritten block. The returned start and end keys have indefinite lifetimes. The returned rewritten block is owned by the DataBlockRewriter. If it must be retained beyond the next call to RewriteSuffixes, the caller should make a copy.

Note that the input slice must be 8-byte aligned.
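
A brief usage sketch (keySchema, comparer, input, from and to are assumed to be supplied by the caller):

rw := colblk.NewDataBlockRewriter(keySchema, comparer)
start, end, rewritten, err := rw.RewriteSuffixes(input, from, to)
if err != nil {
	// handle the error
}
// rewritten is owned by rw; copy it if it must outlive the next call.
_, _, _ = start, end, rewritten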

type DataType

type DataType uint8

DataType describes the logical type of a column's values. Some data types have multiple possible physical representations. Encoding a column may choose between possible physical representations depending on the distribution of values and the size of the resulting physical representation.

const (
	// DataTypeInvalid represents an unset or invalid data type.
	DataTypeInvalid DataType = 0
	// DataTypeBool is a data type encoding a bool per row.
	DataTypeBool DataType = 1
	// DataTypeUint is a data type encoding a fixed 64-bit unsigned integer per row.
	DataTypeUint DataType = 2
	// DataTypeBytes is a data type encoding a variable-length byte string per
	// row.
	DataTypeBytes DataType = 3
	// DataTypePrefixBytes is a data type encoding variable-length,
	// lexicographically-sorted byte strings, with prefix compression.
	DataTypePrefixBytes DataType = 4
)

func (DataType) String

func (t DataType) String() string

String returns a human-readable string representation of the data type.

type DecodeFunc

type DecodeFunc[T any] func(buf []byte, offset uint32, rows int) (decoded T, nextOffset uint32)

A DecodeFunc decodes a data structure from a byte slice, returning an accessor for the data and the offset of the first byte after the structure. The rows argument must be the number of logical rows encoded within the data structure.

type Encoder

type Encoder interface {
	// Reset clears the ColumnWriter's internal state, preparing it for reuse.
	Reset()
	// Size returns the size required to encode the column's current values.
	//
	// The `rows` argument must be the current number of logical rows in the
	// column. Some implementations support defaults, and these implementations
	// rely on the caller to inform them of the current number of logical rows.
	// The provided `rows` must be greater than or equal to the largest row set
	// + 1. In other words, Size does not support determining the size a column
	// had before additional rows were added.
	Size(rows int, offset uint32) uint32
	// WriteDebug writes a human-readable description of the current column
	// state to the provided writer.
	WriteDebug(w io.Writer, rows int)
}

Encoder is an interface implemented by column encoders.

type Header

type Header struct {
	Version Version
	// Columns holds the number of columns encoded within the block.
	Columns uint16
	// Rows holds the number of rows encoded within the block.
	Rows uint32
}

Header holds the metadata extracted from a columnar block's header.

func DecodeHeader

func DecodeHeader(data []byte) Header

DecodeHeader reads the block header from the provided serialized columnar block.

func (Header) Encode

func (h Header) Encode(buf []byte)

Encode encodes the header to the provided buf. The length of buf must be at least 7 bytes.

func (Header) String

func (h Header) String() string

String implements the fmt.Stringer interface, returning a human-readable representation of the block header.

type IndexBlockDecoder

type IndexBlockDecoder struct {
	// contains filtered or unexported fields
}

An IndexBlockDecoder reads columnar index blocks.

func (*IndexBlockDecoder) DebugString

func (r *IndexBlockDecoder) DebugString() string

DebugString prints a human-readable explanation of the index block's binary representation.

func (*IndexBlockDecoder) Describe

func (r *IndexBlockDecoder) Describe(f *binfmt.Formatter, tp treeprinter.Node)

Describe describes the binary format of the index block, assuming f.Offset() is positioned at the beginning of the same index block described by r.

func (*IndexBlockDecoder) Init

func (r *IndexBlockDecoder) Init(data []byte)

Init initializes the index block decoder with the given serialized index block.

type IndexBlockWriter

type IndexBlockWriter struct {
	// contains filtered or unexported fields
}

IndexBlockWriter writes columnar index blocks. The writer is used for both first-level and second-level index blocks. The index block schema consists of four primary columns:

  • Separators: a user key that is ≥ the largest user key in the corresponding entry, and ≤ the smallest user key in the next entry. Note that this allows consecutive separators to be equal. This is possible when snapshots require us to preserve duplicate user keys at different sequence numbers.
  • Offsets: the offset of the end of the corresponding block.
  • Lengths: the length in bytes of the corresponding block.
  • Block properties: a slice encoding arbitrary user-defined block properties.

TODO(jackson): Consider splitting separators into prefixes and suffixes (even without user-defined columns). This would allow us to use prefix compression for the prefix. Separators should typically be suffixless unless two KVs with the same prefix straddle a block boundary. We would need to use a buffer to materialize the separator key when we need to use it outside the context of seeking within the block.

func (*IndexBlockWriter) AddBlockHandle

func (w *IndexBlockWriter) AddBlockHandle(
	separator []byte, handle block.Handle, blockProperties []byte,
) int

AddBlockHandle adds a new separator and the handle of a data block to the index block. It returns the index of the row.

AddBlockHandle should only be used for first-level index blocks.
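
A hedged sketch of building a first-level index block (entries and its fields are illustrative placeholders):

var w colblk.IndexBlockWriter
w.Init()
for _, e := range entries {
	// e.Separator is ≥ the largest key in the block identified by e.Handle.
	w.AddBlockHandle(e.Separator, e.Handle, nil /* blockProperties */)
}
indexBlock := w.Finish(w.Rows())
_ = indexBlock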

func (*IndexBlockWriter) Finish

func (w *IndexBlockWriter) Finish(rows int) []byte

Finish serializes the pending index block, including the first [rows] rows. The value of [rows] must be Rows() or Rows()-1.

func (*IndexBlockWriter) Init

func (w *IndexBlockWriter) Init()

Init initializes the index block writer.

func (*IndexBlockWriter) Reset

func (w *IndexBlockWriter) Reset()

Reset resets the index block writer to its initial state, retaining buffers.

func (*IndexBlockWriter) Rows

func (w *IndexBlockWriter) Rows() int

Rows returns the number of entries in the index block so far.

func (*IndexBlockWriter) Size

func (w *IndexBlockWriter) Size() int

Size returns the size of the pending index block.

func (*IndexBlockWriter) UnsafeSeparator

func (w *IndexBlockWriter) UnsafeSeparator(i int) []byte

UnsafeSeparator returns the separator of the i'th entry.

type IndexIter

type IndexIter struct {
	// contains filtered or unexported fields
}

IndexIter is an iterator over the block entries in an index block.

func (*IndexIter) BlockHandleWithProperties

func (i *IndexIter) BlockHandleWithProperties() (block.HandleWithProperties, error)

BlockHandleWithProperties decodes the block handle with any encoded properties at the iterator's current position.

func (*IndexIter) Close

func (i *IndexIter) Close() error

Close closes the iterator, releasing any resources it holds.

func (*IndexIter) First

func (i *IndexIter) First() bool

First seeks index iterator to the first block entry. It returns false if the index block is empty.

func (*IndexIter) Handle

func (i *IndexIter) Handle() block.BufferHandle

Handle returns the underlying block buffer handle, if the iterator was initialized with one.

func (*IndexIter) Init

func (i *IndexIter) Init(
	comparer *base.Comparer, blk []byte, transforms block.IterTransforms,
) error

Init initializes an iterator from the provided block data slice.
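
A hedged iteration sketch over a serialized index block blk (comparer and blk are assumed; the zero-value block.IterTransforms is assumed here to request no transforms):

var it colblk.IndexIter
if err := it.Init(comparer, blk, block.IterTransforms{}); err != nil {
	// handle the error
}
for valid := it.First(); valid; valid = it.Next() {
	bhp, err := it.BlockHandleWithProperties()
	// handle err; bhp describes the referenced block and its encoded properties.
	_, _ = bhp, err
}
_ = it.Close()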

func (*IndexIter) InitHandle

func (i *IndexIter) InitHandle(
	comparer *base.Comparer, blk block.BufferHandle, transforms block.IterTransforms,
) error

InitHandle initializes an iterator from the provided block handle.

func (*IndexIter) InitWithDecoder

func (i *IndexIter) InitWithDecoder(
	comparer *base.Comparer, d *IndexBlockDecoder, transforms block.IterTransforms,
)

InitWithDecoder initializes an index iterator from the provided decoder.

func (*IndexIter) Invalidate

func (i *IndexIter) Invalidate()

Invalidate invalidates the block iterator, removing references to the block it was initialized with.

func (*IndexIter) IsDataInvalidated

func (i *IndexIter) IsDataInvalidated() bool

IsDataInvalidated returns true when the iterator has been invalidated using an Invalidate call. NB: this is different from Valid.

func (*IndexIter) Last

func (i *IndexIter) Last() bool

Last seeks index iterator to the last block entry. It returns false if the index block is empty.

func (*IndexIter) Next

func (i *IndexIter) Next() bool

Next steps the index iterator to the next block entry. It returns false if the index block is exhausted in the forward direction. A call to Next while already exhausted in the forward direction is a no-op.

func (*IndexIter) Prev

func (i *IndexIter) Prev() bool

Prev steps the index iterator to the previous block entry. It returns false if the index block is exhausted in the reverse direction. A call to Prev while already exhausted in the reverse direction is a no-op.

func (*IndexIter) RowIndex

func (i *IndexIter) RowIndex() int

RowIndex returns the index of the block entry at the iterator's current position.

func (*IndexIter) SeekGE

func (i *IndexIter) SeekGE(key []byte) bool

SeekGE seeks the index iterator to the first block entry with a separator key greater than or equal to the given key. It returns false if the seek key is greater than all index block separators.

func (*IndexIter) Separator

func (i *IndexIter) Separator() []byte

Separator returns the separator at the iterator's current position. The iterator must be positioned at a valid row.

func (*IndexIter) SeparatorGT

func (i *IndexIter) SeparatorGT(key []byte, inclusively bool) bool

SeparatorGT returns true if the separator at the iterator's current position is strictly greater than (or equal, if inclusively=true) the provided key.

func (*IndexIter) SeparatorLT

func (i *IndexIter) SeparatorLT(key []byte) bool

SeparatorLT returns true if the separator at the iterator's current position is strictly less than the provided key.

func (*IndexIter) Valid

func (i *IndexIter) Valid() bool

Valid returns true if the iterator is currently positioned at a valid block handle.

type KeyComparison

type KeyComparison struct {
	// PrefixLen is the length of the prefix of the key. It's the outcome of
	// calling base.Split on the key.
	PrefixLen int32
	// CommonPrefixLen is the length of the physical (byte-wise) prefix of the
	// logical prefix that is shared with the other key. For example, for
	// "apple@1" and "applied@3" the value is 4 (the length of "appl"). For
	// "apple@1" and "apple@10" the value is 5 (the length of "apple"), because
	// the shared bytes within the suffix are not included.
	CommonPrefixLen int32
	// UserKeyComparison is the comparison of the user keys of the two keys.
	// Should be equivalent to
	//
	//   Compare(key, otherKey)
	UserKeyComparison int32
}

KeyComparison holds information about a key and its comparison to another key.
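
An illustrative computation of these fields for two user keys, given the comparer's Compare and Split functions. In practice KeyWriter.ComparePrev produces this value; the helper below is a sketch and not part of the package.

func makeKeyComparison(cmp base.Compare, split base.Split, key, otherKey []byte) colblk.KeyComparison {
	prefixLen := split(key)
	// CommonPrefixLen: bytes of key's logical prefix shared with otherKey.
	shared := 0
	for shared < prefixLen && shared < len(otherKey) && key[shared] == otherKey[shared] {
		shared++
	}
	return colblk.KeyComparison{
		PrefixLen:         int32(prefixLen),
		CommonPrefixLen:   int32(shared),
		UserKeyComparison: int32(cmp(key, otherKey)),
	}
}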

func (KeyComparison) PrefixEqual

func (kcmp KeyComparison) PrefixEqual() bool

PrefixEqual returns true if the key comparison determined that the keys have equal prefixes.

func (KeyComparison) String

func (kcmp KeyComparison) String() string

String returns a string representation of the KeyComparison.

type KeySchema

type KeySchema struct {
	Name string
	// KeySchema implementations can optionally make use of a fixed-size custom
	// header inside each block.
	HeaderSize   uint32
	ColumnTypes  []DataType
	NewKeyWriter func() KeyWriter

	// InitKeySeekerMetadata initializes the provided KeySeekerMetadata. This
	// happens once when a block enters the block cache and can be used to save
	// computation in NewKeySeeker.
	InitKeySeekerMetadata func(meta *KeySeekerMetadata, d *DataBlockDecoder)

	// KeySeeker returns a KeySeeker using metadata that was previously
	// initialized with InitKeySeekerMetadata. The returned key seeker can be an
	// unsafe cast of the metadata itself.
	KeySeeker func(meta *KeySeekerMetadata) KeySeeker
}

KeySchema defines the schema of a user key, as defined by the user's application.

TODO(jackson): Consider making this KVSchema. It feels like there's an opportunity to generalize the ShortAttribute so that when a value is stored out-of-band, the DataBlockEncoder calls user-provided code to store the short attributes inlined within the data block. For inlined-values, the user-defined value columns would be implicitly null.

func DefaultKeySchema

func DefaultKeySchema(comparer *base.Comparer, prefixBundleSize int) KeySchema

DefaultKeySchema returns the default key schema that decomposes a user key into its prefix and suffix. Prefixes are sorted in lexicographical order.

type KeySeeker

type KeySeeker interface {
	// IsLowerBound returns true if all keys in the data block (after suffix
	// replacement if syntheticSuffix is not empty) are >= the given key. If the
	// data block contains no keys, returns true.
	IsLowerBound(k []byte, syntheticSuffix []byte) bool
	// SeekGE returns the index of the first row with a key greater than or equal
	// to [key], and whether that row has the same prefix as [key].
	//
	// If the caller externally knows a bound on where the key is located, it
	// may indicate it through [boundRow] and [searchDir]. A [searchDir] value
	// of -1 indicates that the sought row must be at an index ≤ [boundRow]. A
	// [searchDir] value of +1 indicates that the sought row must be at an index
	// ≥ [boundRow]. Implementations may use this information to constrain the
	// search. See (base.SeekGEFlags).TrySeekUsingNext for context on when this
	// may be set in practice.
	SeekGE(key []byte, boundRow int, searchDir int8) (row int, equalPrefix bool)
	// MaterializeUserKey materializes the user key of the specified row,
	// returning a slice of the materialized user key.
	//
	// The provided keyIter must have a buffer large enough to hold the key.
	//
	// The prevRow parameter is the row MaterializeUserKey was last invoked with
	// (or a negative number if not applicable). Implementations may take
	// advantage of that knowledge to reduce work.
	MaterializeUserKey(keyIter *PrefixBytesIter, prevRow, row int) []byte
	// MaterializeUserKeyWithSyntheticSuffix is a variant of MaterializeUserKey
	// where the suffix is replaced.
	//
	// The provided keyIter must have a buffer large enough to hold the key after
	// suffix replacement.
	//
	// The prevRow parameter is the row MaterializeUserKeyWithSyntheticSuffix was
	// last invoked with (or a negative number if not applicable). Implementations
	// may take advantage of that knowledge to reduce work.
	MaterializeUserKeyWithSyntheticSuffix(
		keyIter *PrefixBytesIter, syntheticSuffix []byte, prevRow, row int,
	) []byte
}

KeySeeker iterates over the keys in a columnar data block.

Users of Pebble who define their own key schema must implement KeySeeker to seek over their decomposed keys.

KeySeeker implementations must be safe for concurrent use by multiple goroutines. In practice, multiple DataBlockIterators may use the same KeySeeker.

type KeySeekerMetadata

type KeySeekerMetadata [KeySeekerMetadataSize]byte

KeySeekerMetadata is an in-memory buffer that stores metadata for a block. It is allocated together with the buffer storing the block and is initialized once when the block is read from disk. It is always 8-byte aligned.

Portions of this buffer can be cast to the structures we need (through unsafe.Pointer), but note that any pointers in these structures will be invisible to the GC. Pointers to the block's data buffer are ok, since the metadata and the data have the same lifetime (sharing the underlying allocation).

KeySeekerMetadata is stored inside block.Metadata.

type KeyWriter

type KeyWriter interface {
	ColumnWriter
	// ComparePrev compares the provided user key to the previously-written user
	// key. The returned KeyComparison's UserKeyComparison field is equivalent
	// to Compare(key, prevKey) where prevKey is the last key passed to
	// WriteKey.
	//
	// If no key has been written yet, ComparePrev returns a KeyComparison with
	// PrefixLen set and UserKeyComparison=1.
	ComparePrev(key []byte) KeyComparison
	// WriteKey writes a user key into the KeyWriter's columns. The
	// keyPrefixLenSharedWithPrev parameter is the number of leading bytes of
	// the key's logical prefix (as defined by (base.Comparer).Split) that are
	// shared with the previously-written key's prefix.
	//
	// WriteKey is guaranteed to be called sequentially with increasing row
	// indexes, beginning at zero.
	WriteKey(row int, key []byte, keyPrefixLen, keyPrefixLenSharedWithPrev int32)
	// MaterializeKey appends the zero-indexed row'th key written to dst,
	// returning the result.
	MaterializeKey(dst []byte, row int) []byte
	// FinishHeader serializes an internal header of exactly KeySchema.HeaderSize bytes.
	FinishHeader(dst []byte)
}

A KeyWriter maintains ColumnWriters for a data block for writing user keys into the database-specific key schema. Users may define their own key schema and implement KeyWriter to encode keys into custom columns that are aware of the structure of user keys.

type KeyspanBlockWriter

type KeyspanBlockWriter struct {
	// contains filtered or unexported fields
}

A KeyspanBlockWriter writes keyspan blocks. See the colblk package documentation for more details on the schema.

func (*KeyspanBlockWriter) AddSpan

func (w *KeyspanBlockWriter) AddSpan(s keyspan.Span)

AddSpan appends a new Span to the pending block. Spans must already be fragmented (non-overlapping) and added in sorted order.
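
A hedged usage sketch (spans is a sorted, fragmented []keyspan.Span and comparer is a *base.Comparer, both assumed):

var w colblk.KeyspanBlockWriter
w.Init(comparer.Equal)
for _, s := range spans {
	w.AddSpan(s)
}
blk := w.Finish()
_ = blk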

func (*KeyspanBlockWriter) Finish

func (w *KeyspanBlockWriter) Finish() []byte

Finish finalizes the pending block and returns the encoded block.

func (*KeyspanBlockWriter) Init

func (w *KeyspanBlockWriter) Init(equal base.Equal)

Init initializes a keyspan block writer.

func (*KeyspanBlockWriter) KeyCount

func (w *KeyspanBlockWriter) KeyCount() int

KeyCount returns the count of keyspan.Keys written to the writer.

func (*KeyspanBlockWriter) Reset

func (w *KeyspanBlockWriter) Reset()

Reset resets the keyspan block writer to an empty state, retaining memory for reuse.

func (*KeyspanBlockWriter) Size

func (w *KeyspanBlockWriter) Size() int

Size returns the size of the pending block.

func (*KeyspanBlockWriter) String

func (w *KeyspanBlockWriter) String() string

String returns a string representation of the pending block's state.

func (*KeyspanBlockWriter) UnsafeBoundaryKeys

func (w *KeyspanBlockWriter) UnsafeBoundaryKeys() (smallest, largest base.InternalKey)

UnsafeBoundaryKeys returns the smallest and largest keys written to the keyspan block so far. The returned internal keys have user keys that point directly into the block writer's memory and must not be mutated.

func (*KeyspanBlockWriter) UnsafeLastSpan

func (w *KeyspanBlockWriter) UnsafeLastSpan() (
	start, end []byte,
	largestTrailer base.InternalKeyTrailer,
)

UnsafeLastSpan returns the start and end user keys of the last span written to the block and the trailer of its largest key. The returned keys point directly into the block writer's memory and must not be mutated.

type KeyspanDecoder

type KeyspanDecoder struct {
	// contains filtered or unexported fields
}

A KeyspanDecoder exposes facilities for decoding a keyspan block. A KeyspanDecoder is safe for concurrent use after initialization.

func (*KeyspanDecoder) DebugString

func (d *KeyspanDecoder) DebugString() string

DebugString prints a human-readable explanation of the keyspan block's binary representation.

func (*KeyspanDecoder) Describe

func (d *KeyspanDecoder) Describe(f *binfmt.Formatter, tp treeprinter.Node)

Describe describes the binary format of the keyspan block, assuming f.Offset() is positioned at the beginning of the same keyspan block described by d.

func (*KeyspanDecoder) Init

func (d *KeyspanDecoder) Init(data []byte)

Init initializes the keyspan decoder with the given block data.

type KeyspanIter

type KeyspanIter struct {
	// contains filtered or unexported fields
}

A KeyspanIter is an iterator over a columnar keyspan block. It implements the keyspan.FragmentIterator interface.

func NewKeyspanIter

func NewKeyspanIter(
	cmp base.Compare, h block.BufferHandle, transforms block.FragmentIterTransforms,
) *KeyspanIter

NewKeyspanIter constructs a new iterator over a keyspan columnar block.

func (*KeyspanIter) Close

func (i *KeyspanIter) Close()

Close closes the iterator.

func (*KeyspanIter) DebugTree

func (i *KeyspanIter) DebugTree(tp treeprinter.Node)

DebugTree is part of the FragmentIterator interface.

func (*KeyspanIter) First

func (i *KeyspanIter) First() (*keyspan.Span, error)

First moves the iterator to the first span.

func (*KeyspanIter) Last

func (i *KeyspanIter) Last() (*keyspan.Span, error)

Last moves the iterator to the last span.

func (*KeyspanIter) Next

func (i *KeyspanIter) Next() (*keyspan.Span, error)

Next moves the iterator to the next span.

func (*KeyspanIter) Prev

func (i *KeyspanIter) Prev() (*keyspan.Span, error)

Prev moves the iterator to the previous span.

func (*KeyspanIter) SeekGE

func (i *KeyspanIter) SeekGE(key []byte) (*keyspan.Span, error)

SeekGE moves the iterator to the first span covering a key greater than or equal to the given key. This is equivalent to seeking to the first span with an end key greater than the given key.

func (*KeyspanIter) SeekLT

func (i *KeyspanIter) SeekLT(key []byte) (*keyspan.Span, error)

SeekLT moves the iterator to the last span covering a key less than the given key. This is equivalent to seeking to the last span with a start key less than the given key.

func (*KeyspanIter) SetContext

func (i *KeyspanIter) SetContext(context.Context)

SetContext implements keyspan.FragmentIterator.

func (*KeyspanIter) WrapChildren

func (i *KeyspanIter) WrapChildren(keyspan.WrapFn)

WrapChildren implements keyspan.FragmentIterator.

type PrefixBytes

type PrefixBytes struct {
	// contains filtered or unexported fields
}

PrefixBytes holds an array of lexicographically ordered byte slices. It provides prefix compression. Prefix compression applies strongly to two cases in CockroachDB: removal of the "[/tenantID]/tableID/indexID" prefix that is present on all table data keys, and multiple versions of a key that are distinguished only by different timestamp suffixes. With columnar blocks enabling the timestamp to be placed in a separate column, the multiple version problem becomes one of efficiently handling exact duplicate keys. PrefixBytes builds off of the RawBytes encoding, introducing additional slices for encoding (n+bundleSize-1)/bundleSize bundle prefixes and 1 block-level shared prefix for the column.

Unlike the original prefix compression performed by rowblk (inherited from LevelDB and RocksDB), PrefixBytes does not perform all prefix compression relative to the previous key. Rather it performs prefix compression relative to the first key of a key's bundle. This can result in less compression, but simplifies reverse iteration and allows iteration to be largely stateless.

To understand the PrefixBytes layout, we'll work through an example using these 15 keys:

   0123456789
 0 aaabbbc
 1 aaabbbcc
 2 aaabbbcde
 3 aaabbbce
 4 aaabbbdee
 5 aaabbbdee
 6 aaabbbdee
 7 aaabbbeff
 8 aaabbe
 9 aaabbeef
10 aaabbeef
11 aaabc
12 aabcceef
13 aabcceef
14 aabcceef

The total length of these keys is 119 bytes. There are 3 keys which occur multiple times (rows 4-6, 9-10, 12-14) which models multiple versions of the same MVCC key in CockroachDB. There is a shared prefix to all of the keys which models the "[/tenantID]/tableID/indexID" present on CockroachDB table data keys. There are other shared prefixes which model identical values in table key columns.

The table below shows the components of the PrefixBytes encoding for these 15 keys when using a bundle size of 4 which results in 4 bundles. The 15 keys are encoded into 20 slices: 1 block prefix, 4 bundle prefixes, and 15 suffixes. The first slice in the table is the block prefix that is shared by all keys in the block. The first slice in each bundle is the bundle prefix which is shared by all keys in the bundle.

 idx   | row   | end offset | data
-------+-------+------------+----------
     0 |       |          2 | aa
     1 |       |          7 | ..abbbc
     2 |     0 |          7 | .......
     3 |     1 |          8 | .......c
     4 |     2 |         10 | .......de
     5 |     3 |         11 | .......e
     6 |       |         15 | ..abbb
     7 |     4 |         18 | ......dee
     8 |     5 |         18 | .........
     9 |     6 |         18 | .........
    10 |     7 |         21 | ......eff
    11 |       |         23 | ..ab
    12 |     8 |         25 | ....be
    13 |     9 |         29 | ....beef
    14 |    10 |         29 | ........
    15 |    11 |         30 | ....c
    16 |       |         36 | ..bcceef
    17 |    12 |         36 | ........
    18 |    13 |         36 | ........
    19 |    14 |         36 | ........

The 'end offset' column in the table encodes the exclusive offset within the string data section where each of the slices end. Each slice starts at the previous slice's end offset. The first slice (the block prefix)'s start offset is implicitly zero. Note that this differs from the plain RawBytes encoding which always stores a zero offset at the beginning of the offsets array to avoid special-casing the first slice. The block prefix already requires special-casing, so materializing the zero start offset is not needed.

The table above defines 20 slices: the 1 block key prefix, the 4 bundle key prefixes and the 15 key suffixes. Offset[0] is the length of the first slice which is always anchored at data[0]. The data columns display the portion of the data array the slice covers. For row slices, an empty suffix column indicates that the slice is identical to the slice at the previous index which is indicated by the slice's offset being equal to the previous slice's offset. Due to the lexicographic sorting, the key at row i can't be a prefix of the key at row i-1 or it would have sorted before the key at row i-1. And if the key differs then only the differing bytes will be part of the suffix and not contained in the bundle prefix.

The end result of this encoding is that we can store the 119 bytes of the 15 keys plus their start and end offsets (which would naively consume 15*4=60 bytes for at least the key lengths) in 61 bytes (36 bytes of data + 4 bytes of offset constant + 20 bytes of offset delta data + 1 byte of bundle size).

Physical representation

+==================================================================+
|                        Bundle size (1 byte)                      |
|                                                                  |
| The bundle size indicates how many keys prefix compression may   |
| apply across. Every bundleSize keys, prefix compression restarts.|
| The bundleSize is required to be a power of two,  and this 1-    |
| byte prefix stores log2(bundleSize).                             |
+==================================================================+
|                            RawBytes                              |
|                                                                  |
| A modified RawBytes encoding is used to store the data slices. A |
| PrefixBytes column storing n keys will encode                    |
|                                                                  |
|                        1 block prefix                            |
|                               +                                  |
|          (n + bundleSize-1)/bundleSize bundle prefixes           |
|                               +                                  |
|                         n row suffixes                           |
|                                                                  |
| slices. Unlike the RawBytes encoding, the first offset encoded   |
| is not guaranteed to be zero. In the PrefixBytes encoding, the   |
| first offset encodes the length of the column-wide prefix. The   |
| column-wide prefix is stored in slice(0, offset(0)).             |
|                                                                  |
|  +------------------------------------------------------------+  |
|  |                       Offset table                         |  |
|  |                                                            |  |
|  | A Uint32 column encoding offsets into the string data,     |  |
|  | possibly delta8 or delta16 encoded. When a delta encoding  |  |
|  | is used, the base constant is always zero.                 |  |
|  +------------------------------------------------------------+  |
|  | offsetDelta[0] | offsetDelta[1] | ... | offsetDelta[m]     |  |
|  +------------------------------------------------------------+  |
|  | prefix-compressed string data                              |  |
|  | ...                                                        |  |
|  +------------------------------------------------------------+  |
+==================================================================+

TODO(jackson): Consider stealing the low bit of the offset for a flag indicating that a key is a duplicate and then using the remaining bits to encode the relative index of the duplicated key's end offset. This would avoid the O(bundle size) scan in the case of duplicate keys, but at the cost of complicating logic to look up a bundle prefix (which may need to follow a duplicate key's relative index to uncover the start offset of the bundle prefix).

Reads

This encoding provides O(1) access to any row by calculating the bundle for the row (see bundleOffsetIndexForRow), then the per-row's suffix (see rowSuffixIndex). If the per-row suffix's end offset equals the previous offset, then the row is a duplicate key and we need to step backward until we find a non-empty slice or the start of the bundle (a variable number of steps, but bounded by the bundle size).

Forward iteration can easily reuse the previous row's key with a check on whether the row's slice is empty. Reverse iteration within a run of equal keys can reuse the next row's key. When reverse iteration steps backward from a non-empty slice onto an empty slice, it must continue backward until a non-empty slice is found (just as in absolute positioning) to discover the row suffix that is duplicated.

The Seek{GE,LT} routines first binary search on the first key of each bundle which can be retrieved without data movement because the bundle prefix is immediately adjacent to it in the data array. We can slightly optimize the binary search by skipping over all of the keys in the bundle on prefix mismatches.

func DecodePrefixBytes

func DecodePrefixBytes(
	b []byte, offset uint32, count int,
) (prefixBytes PrefixBytes, endOffset uint32)

DecodePrefixBytes decodes the structure of a PrefixBytes, constructing an accessor for an array of lexicographically sorted byte slices constructed by PrefixBytesBuilder. Count must be the number of logical slices within the array.

func (PrefixBytes) At

func (b PrefixBytes) At(i int) []byte

At returns the i'th []byte slice in the PrefixBytes. At must allocate, so callers should prefer accessing a slice's constituent components through SharedPrefix, BundlePrefix and RowSuffix.

func (*PrefixBytes) BundleCount

func (b *PrefixBytes) BundleCount() int

BundleCount returns the count of bundles within the PrefixBytes.

func (*PrefixBytes) BundlePrefix

func (b *PrefixBytes) BundlePrefix(i int) []byte

BundlePrefix returns the prefix of the i-th bundle in the column. The provided i must be in the range [0, BundleCount()). The returned slice should not be mutated.

func (*PrefixBytes) RowBundlePrefix

func (b *PrefixBytes) RowBundlePrefix(row int) []byte

RowBundlePrefix takes a row index and returns a []byte of the prefix shared among all the keys in the row's bundle, but without the block-level shared prefix for the column. The returned slice should not be mutated.

func (*PrefixBytes) RowSuffix

func (b *PrefixBytes) RowSuffix(row int) []byte

RowSuffix returns a []byte of the suffix unique to the row. A row's full key is the result of concatenating SharedPrefix(), BundlePrefix() and RowSuffix().

The returned slice should not be mutated.
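As a rough illustration of how the three accessors compose, the helper below (not part of the package; the helper name and import path are assumptions) materializes a row's full key without calling At:

package example

import "github.com/cockroachdb/pebble/sstable/colblk" // import path assumed

// fullKey materializes the row'th key by concatenating the column-wide shared
// prefix, the prefix of the row's bundle, and the row's unique suffix.
func fullKey(pb *colblk.PrefixBytes, row int) []byte {
	k := append([]byte(nil), pb.SharedPrefix()...) // shared by every key in the column
	k = append(k, pb.RowBundlePrefix(row)...)      // shared by every key in the row's bundle
	k = append(k, pb.RowSuffix(row)...)            // unique remainder for this row
	return k
}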

func (*PrefixBytes) Rows

func (b *PrefixBytes) Rows() int

Rows returns the count of rows whose keys are encoded within the PrefixBytes.

func (*PrefixBytes) Search

func (b *PrefixBytes) Search(k []byte) (rowIndex int, isEqual bool)

Search searches for the first key in the PrefixBytes that is greater than or equal to k, returning the index of the key and whether an equal key was found. If multiple keys are equal, the index of the first such key is returned. If all keys are < k, Search returns Rows() for the row index.
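For example, a SeekGE-style lookup can be sketched directly on top of Search and Rows (the wrapper below is illustrative, not part of the package; import path assumed):

package example

import "github.com/cockroachdb/pebble/sstable/colblk" // import path assumed

// firstGE returns the index of the first key >= target, and ok=false when
// every key in the column is < target (i.e. Search returned Rows()).
func firstGE(pb *colblk.PrefixBytes, target []byte) (row int, ok bool) {
	row, _ = pb.Search(target)
	return row, row < pb.Rows()
}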

func (*PrefixBytes) SetAt

func (b *PrefixBytes) SetAt(it *PrefixBytesIter, i int)

SetAt updates the provided PrefixBytesIter to hold the i'th []byte slice in the PrefixBytes. The PrefixBytesIter's buffer must be sufficiently large to hold the i'th []byte slice, and the caller is required to statically ensure this.

func (*PrefixBytes) SetNext

func (b *PrefixBytes) SetNext(it *PrefixBytesIter)

SetNext updates the provided PrefixBytesIter to hold the next []byte slice in the PrefixBytes. SetNext requires the provided iter to currently hold a slice and for a subsequent slice to exist within the PrefixBytes. The PrefixBytesIter's buffer must be sufficiently large to hold the next []byte slice, and the caller is required to statically ensure this.

func (*PrefixBytes) SharedPrefix

func (b *PrefixBytes) SharedPrefix() []byte

SharedPrefix returns a []byte of the shared prefix that was extracted from all of the values in the Bytes vector. The returned slice should not be mutated.

func (*PrefixBytes) UnsafeFirstSlice

func (b *PrefixBytes) UnsafeFirstSlice() []byte

UnsafeFirstSlice returns the first slice in the PrefixBytes. The returned slice points directly into the PrefixBytes buffer and must not be mutated.

type PrefixBytesBuilder

type PrefixBytesBuilder struct {
	// contains filtered or unexported fields
}

PrefixBytesBuilder encodes a column of lexicographically-sorted byte slices, applying prefix compression to reduce the encoded size.

func (*PrefixBytesBuilder) DataType

func (b *PrefixBytesBuilder) DataType(int) DataType

DataType implements ColumnWriter.

func (*PrefixBytesBuilder) Finish

func (b *PrefixBytesBuilder) Finish(
	col int, rows int, offset uint32, buf []byte,
) (endOffset uint32)

Finish writes the serialized byte slices to buf starting at offset. The buf slice must be sufficiently large to store the serialized output. The caller should use [Size] to size buf appropriately before calling Finish.

Finish only supports values of [rows] equal to the number of keys set on the builder, or one less.

func (*PrefixBytesBuilder) Init

func (b *PrefixBytesBuilder) Init(bundleSize int)

Init initializes the PrefixBytesBuilder with the specified bundle size. The builder will produce a prefix-compressed column of data type DataTypePrefixBytes. The [bundleSize] indicates the number of keys that form a "bundle," across which prefix-compression is applied. All keys in the column will share a column-wide prefix if there is one.

func (*PrefixBytesBuilder) NumColumns

func (b *PrefixBytesBuilder) NumColumns() int

NumColumns implements ColumnWriter.

func (*PrefixBytesBuilder) Put

func (b *PrefixBytesBuilder) Put(key []byte, bytesSharedWithPrev int)

Put adds the provided key to the column. The provided key must be lexicographically greater than or equal to the previous key added to the builder.

The provided bytesSharedWithPrev must be the length of the byte prefix the provided key shares with the previous key. The caller is required to provide this because in the primary expected use, the caller will already need to compute it for the purpose of determining whether successive keys share the same prefix.
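The sketch below shows the expected call pattern (helper names, the bundle size of 16, and the import path are ours): the caller computes each key's shared-prefix length with the previous key and hands it to Put, then sizes the buffer with Size before calling Finish.

package example

import "github.com/cockroachdb/pebble/sstable/colblk" // import path assumed

// buildPrefixBytes encodes the already-sorted keys into a PrefixBytes column
// starting at offset 0 of the returned buffer.
func buildPrefixBytes(keys [][]byte) []byte {
	var b colblk.PrefixBytesBuilder
	b.Init(16) // bundle size chosen only for illustration
	var prev []byte
	for _, k := range keys {
		b.Put(k, sharedPrefixLen(prev, k))
		prev = k
	}
	rows := b.Rows()
	// Size returns the end offset when encoding begins at offset 0; the extra
	// byte mirrors the block format's trailing padding byte.
	buf := make([]byte, b.Size(rows, 0)+1)
	b.Finish(0 /* col */, rows, 0 /* offset */, buf)
	return buf
}

// sharedPrefixLen returns the length of the byte prefix shared by a and b.
func sharedPrefixLen(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	for i < n && a[i] == b[i] {
		i++
	}
	return i
}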

func (*PrefixBytesBuilder) Reset

func (b *PrefixBytesBuilder) Reset()

Reset resets the builder to an empty state, preserving the existing bundle size.

func (*PrefixBytesBuilder) Rows

func (b *PrefixBytesBuilder) Rows() int

Rows returns the number of keys added to the builder.

func (*PrefixBytesBuilder) Size

func (b *PrefixBytesBuilder) Size(rows int, offset uint32) uint32

Size computes the size required to encode the byte slices beginning at the provided offset. The offset is required to ensure proper alignment. The returned uint32 is the offset of the first byte after the end of the encoded data. To compute the size in bytes, subtract the [offset] passed into Size from the returned offset.

func (*PrefixBytesBuilder) UnsafeGet

func (b *PrefixBytesBuilder) UnsafeGet(i int) []byte

UnsafeGet returns the zero-indexed i'th key added to the builder through Put. UnsafeGet may only be used to retrieve the Rows()-1'th or Rows()-2'th keys. If called with a different i value, UnsafeGet panics. The keys returned by UnsafeGet are guaranteed to be stable until Finish or Reset is called. The caller must not mutate the returned slice.

func (*PrefixBytesBuilder) WriteDebug

func (b *PrefixBytesBuilder) WriteDebug(w io.Writer, rows int)

WriteDebug implements the Encoder interface.

type PrefixBytesIter

type PrefixBytesIter struct {
	// Buf is used for materializing a user key. It is preallocated to the maximum
	// key length in the data block.
	Buf []byte
	// contains filtered or unexported fields
}

PrefixBytesIter is an iterator and associated buffers for PrefixBytes. It provides a means for efficiently iterating over the []byte slices contained within a PrefixBytes, avoiding unnecessary copying when portions of slices are shared.

func (*PrefixBytesIter) Init

func (i *PrefixBytesIter) Init(maxKeyLength int, syntheticPrefix block.SyntheticPrefix)

Init initializes the prefix bytes iterator; maxKeyLength must be large enough to fit any key in the block after applying any synthetic prefix and/or suffix.
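A sketch of forward iteration with the iterator, assuming (per the Buf field's documentation) that it.Buf holds the materialized key after SetAt/SetNext and that passing a nil synthetic prefix means none is applied; the import path is assumed:

package example

import "github.com/cockroachdb/pebble/sstable/colblk" // import path assumed

// forEachKey visits every key in pb in order, reusing a single iterator buffer.
func forEachKey(pb *colblk.PrefixBytes, maxKeyLen int, visit func(key []byte)) {
	var it colblk.PrefixBytesIter
	it.Init(maxKeyLen, nil /* assumed: nil means no synthetic prefix */)
	for i := 0; i < pb.Rows(); i++ {
		if i == 0 {
			pb.SetAt(&it, 0)
		} else {
			pb.SetNext(&it) // reuses bytes already materialized in it.Buf
		}
		visit(it.Buf)
	}
}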

type RawBytes

type RawBytes struct {
	// contains filtered or unexported fields
}

RawBytes holds an array of byte slices, stored as a concatenated data section and a series of offsets for each slice. Byte slices within RawBytes are stored in their entirety without any compression, ensuring stability without copying.

Representation

An array of N byte slices encodes N+1 offsets. The beginning of the data representation holds an offsets table, in the same encoding as a DataTypeUint32 column. The integer offsets may be encoded using smaller width integers to save space if all offsets fit within an 8-bit or 16-bit uint. Each offset is relative to the beginning of the string data section (after the offset table).

The use of UintEncoding conserves space in the common case. In the context of CockroachDB, which uses 32 KiB blocks, the vast majority of offsets fit within 16 bits. However, a single value larger than 65535 bytes requires an offset too large to fit within 16 bits, in which case offsets will be encoded as 32-bit integers.

+-------------------------------------------------------------------+
|        a uint offsets table, usually encoded with 16-bits,        |
|                possibly padded for alignment                      |
|                      (see UintEncoding)                           |
+-------------------------------------------------------------------+
|                           String Data                             |
|  abcabcada....                                                    |
+-------------------------------------------------------------------+

The UintEncoding bits of the ColumnEncoding for a RawBytes column describe the encoding of the offset table.
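Conceptually, slice i is the data between consecutive offsets. The function below restates that relationship; it is an illustration of the layout, not the package's implementation:

// sliceAt restates RawBytes.At in terms of the decoded layout: offsets has
// N+1 entries, each relative to the start of the string data section.
func sliceAt(data []byte, offsets []uint32, i int) []byte {
	return data[offsets[i]:offsets[i+1]]
}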

func DecodeRawBytes

func DecodeRawBytes(b []byte, offset uint32, count int) (rawBytes RawBytes, endOffset uint32)

DecodeRawBytes decodes the structure of a RawBytes, constructing an accessor for an array of byte slices constructed by RawBytesBuilder. Count must be the number of byte slices within the array.

func (RawBytes) At

func (b RawBytes) At(i int) []byte

At returns the []byte at index i. The returned slice should not be mutated.

func (*RawBytes) Slices

func (b *RawBytes) Slices() int

Slices returns the number of []byte slices encoded within the RawBytes.

type RawBytesBuilder

type RawBytesBuilder struct {
	// contains filtered or unexported fields
}

RawBytesBuilder encodes a column of byte slices.

func (*RawBytesBuilder) DataType

func (b *RawBytesBuilder) DataType(int) DataType

DataType implements ColumnWriter.

func (*RawBytesBuilder) Finish

func (b *RawBytesBuilder) Finish(col, rows int, offset uint32, buf []byte) uint32

Finish writes the serialized byte slices to buf starting at offset. The buf slice must be sufficiently large to store the serialized output. The caller should use [Size] to size buf appropriately before calling Finish.

func (*RawBytesBuilder) Init

func (b *RawBytesBuilder) Init()

Init initializes the builder for first-time use.

func (*RawBytesBuilder) NumColumns

func (b *RawBytesBuilder) NumColumns() int

NumColumns implements ColumnWriter.

func (*RawBytesBuilder) Put

func (b *RawBytesBuilder) Put(s []byte)

Put appends the provided byte slice to the builder.
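An end-to-end sketch of building a RawBytes column and decoding it back (import path assumed; the buffer is assumed to start word-aligned and gets one extra byte mirroring the block's trailing padding):

package example

import "github.com/cockroachdb/pebble/sstable/colblk" // import path assumed

// rawBytesRoundTrip encodes values into a RawBytes column at offset 0 and
// reads each slice back out.
func rawBytesRoundTrip(values [][]byte) [][]byte {
	var b colblk.RawBytesBuilder
	b.Init()
	for _, v := range values {
		b.Put(v)
	}
	rows := b.Rows()
	buf := make([]byte, b.Size(rows, 0)+1) // +1 mirrors the trailing padding byte
	b.Finish(0 /* col */, rows, 0 /* offset */, buf)

	rb, _ := colblk.DecodeRawBytes(buf, 0 /* offset */, rows)
	out := make([][]byte, rb.Slices())
	for i := range out {
		out[i] = rb.At(i)
	}
	return out
}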

func (*RawBytesBuilder) PutConcat

func (b *RawBytesBuilder) PutConcat(s1, s2 []byte)

PutConcat appends a single byte slice formed by the concatenation of the two byte slice arguments.

func (*RawBytesBuilder) Reset

func (b *RawBytesBuilder) Reset()

Reset resets the builder to an empty state.

func (*RawBytesBuilder) Rows

func (b *RawBytesBuilder) Rows() int

Rows returns the count of slices that have been added to the builder.

func (*RawBytesBuilder) Size

func (b *RawBytesBuilder) Size(rows int, offset uint32) uint32

Size computes the size required to encode the byte slices beginning in a buffer at the provided offset. The offset is required to ensure proper alignment. The returned uint32 is the offset of the first byte after the end of the encoded data. To compute the size in bytes, subtract the [offset] passed into Size from the returned offset.

func (*RawBytesBuilder) UnsafeGet

func (b *RawBytesBuilder) UnsafeGet(i int) []byte

UnsafeGet returns the i'th slice added to the builder. The returned slice is owned by the builder and must not be mutated.

func (*RawBytesBuilder) WriteDebug

func (b *RawBytesBuilder) WriteDebug(w io.Writer, rows int)

WriteDebug implements Encoder.

type Uint

type Uint interface {
	~uint8 | ~uint16 | ~uint32 | ~uint64
}

Uint is a constraint that permits any unsigned integer type with an explicit size.

type UintBuilder

type UintBuilder struct {
	// contains filtered or unexported fields
}

UintBuilder builds a column of unsigned integers. It uses the smallest possible UintEncoding, depending on the values.

func (*UintBuilder) DataType

func (b *UintBuilder) DataType(int) DataType

DataType implements ColumnWriter.

func (*UintBuilder) Finish

func (b *UintBuilder) Finish(col, rows int, offset uint32, buf []byte) uint32

Finish implements ColumnWriter, serializing the column into offset [offset] of [buf].

func (*UintBuilder) Get

func (b *UintBuilder) Get(row int) uint64

Get gets the value of the provided row index. The provided row must have been Set or the builder must have been initialized with InitWithDefault.

func (*UintBuilder) Init

func (b *UintBuilder) Init()

Init initializes the UintBuilder.

func (*UintBuilder) InitWithDefault

func (b *UintBuilder) InitWithDefault()

InitWithDefault initializes the UintBuilder. Any rows that are not explicitly set are assumed to be zero. For the purpose of determining whether a delta encoding is possible, the column is assumed to contain at least 1 default value.

InitWithDefault may be preferable when a nonzero value is uncommon, and the caller can avoid explicitly Set-ing every zero value.
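For example, a mostly-zero column can be built by setting only the nonzero rows (a sketch; helper names and import path are ours):

package example

import "github.com/cockroachdb/pebble/sstable/colblk" // import path assumed

// sparseUintColumn encodes a column of `rows` values where all unset rows are
// implicitly zero.
func sparseUintColumn(nonzero map[int]uint64, rows int) []byte {
	var b colblk.UintBuilder
	b.InitWithDefault() // rows never Set default to zero
	for row, v := range nonzero {
		b.Set(row, v)
	}
	buf := make([]byte, b.Size(rows, 0)+1) // +1 mirrors the trailing padding byte
	b.Finish(0 /* col */, rows, 0 /* offset */, buf)
	return buf
}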

func (*UintBuilder) NumColumns

func (b *UintBuilder) NumColumns() int

NumColumns implements ColumnWriter.

func (*UintBuilder) Reset

func (b *UintBuilder) Reset()

Reset implements ColumnWriter and resets the builder, reusing existing allocated memory.

func (*UintBuilder) Set

func (b *UintBuilder) Set(row int, v uint64)

Set sets the value of the provided row index to v.

func (*UintBuilder) Size

func (b *UintBuilder) Size(rows int, offset uint32) uint32

Size implements ColumnWriter and returns the size of the column if its first [rows] rows were serialized starting at offset [offset].

func (*UintBuilder) WriteDebug

func (b *UintBuilder) WriteDebug(w io.Writer, rows int)

WriteDebug implements Encoder.

type UintEncoding

type UintEncoding uint8

UintEncoding indicates how unsigned integers (of at most 64 bits) are encoded. It has two components:

  • the low bits indicate how many bytes are used per integer, with allowed values 0, 1, 2, 4, or 8;
  • a flag indicates whether a delta encoding is used, meaning that a base (64-bit) value is encoded separately and each encoded value is a delta from that base. Delta encoding is never necessary when 8 bytes per integer are used.

Note that a 0-byte encoding implies that all values are equal (to the base value if a delta encoding is used, or to 0 otherwise).

The UintEncoding byte is serialized to the uint column before the column data.

func DetermineUintEncoding

func DetermineUintEncoding(minValue, maxValue uint64, numRows int) UintEncoding

DetermineUintEncoding returns the best valid encoding that can be used to represent numRows integers in the range [minValue, maxValue].

DetermineUintEncoding returns the same result for any value of numRows >= UintEncodingRowThreshold.
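A small sketch of inspecting the chosen encoding (the helper name and import path are ours; the concrete result depends on the inputs and on UintEncodingRowThreshold):

package example

import (
	"fmt"

	"github.com/cockroachdb/pebble/sstable/colblk" // import path assumed
)

// describeEncoding reports the width and delta flag chosen for a value range.
func describeEncoding(minV, maxV uint64, numRows int) string {
	enc := colblk.DetermineUintEncoding(minV, maxV, numRows)
	return fmt.Sprintf("width=%dB delta=%t (%s)", enc.Width(), enc.IsDelta(), enc)
}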

func (UintEncoding) IsDelta

func (e UintEncoding) IsDelta() bool

IsDelta returns true if it is a delta encoding.

func (UintEncoding) IsValid

func (e UintEncoding) IsValid() bool

IsValid returns true if the encoding is valid.

func (UintEncoding) String

func (e UintEncoding) String() string

String implements fmt.Stringer.

func (UintEncoding) Width

func (e UintEncoding) Width() int

Width returns the number of bytes used per integer. It can be 0, 1, 2, 4, or 8.

type UnsafeBuf

type UnsafeBuf struct {
	// contains filtered or unexported fields
}

UnsafeBuf provides a buffer without bounds checking. Every buffer has a length and a capacity.

func (*UnsafeBuf) Alloc

func (b *UnsafeBuf) Alloc(n int)

Alloc allocates a buffer of size n, without zeroing its contents or copying previous buffer contents.

func (*UnsafeBuf) UnsafeSlice

func (b *UnsafeBuf) UnsafeSlice() []byte

UnsafeSlice returns the current contents of the buf.

type UnsafeOffsets

type UnsafeOffsets struct {
	// contains filtered or unexported fields
}

UnsafeOffsets is a specialization of UnsafeInts (providing the same functionality) that is optimized for the case where the integers are offsets inside a columnar block. It can only be used with 0-, 1-, 2-, or 4-byte encodings without delta.

func DecodeUnsafeOffsets

func DecodeUnsafeOffsets(b []byte, off uint32, rows int) (_ UnsafeOffsets, endOffset uint32)

DecodeUnsafeOffsets decodes the structure of a slice of offsets from a byte slice.

func (UnsafeOffsets) At

func (s UnsafeOffsets) At(i int) uint32

At returns the `i`-th offset.

func (UnsafeOffsets) At2

func (s UnsafeOffsets) At2(i int) (uint32, uint32)

At2 returns the `i`-th and `i+1`-th offsets.
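At2 is convenient when the offsets bound variable-width values in a concatenated data section, as in the RawBytes layout above; a sketch (helper name and import path assumed):

package example

import "github.com/cockroachdb/pebble/sstable/colblk" // import path assumed

// boundedSlice returns the i'th value, where offsets[i] and offsets[i+1]
// bound it within the concatenated data section.
func boundedSlice(offsets colblk.UnsafeOffsets, data []byte, i int) []byte {
	start, end := offsets.At2(i)
	return data[start:end]
}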

type UnsafeRawSlice

type UnsafeRawSlice[T constraints.Integer] struct {
	// contains filtered or unexported fields
}

UnsafeRawSlice maintains a pointer to a slice of elements of type T. UnsafeRawSlice provides no bounds checking.

func (UnsafeRawSlice[T]) At

func (s UnsafeRawSlice[T]) At(i int) T

At returns the `i`-th element of the slice.

func (UnsafeRawSlice[T]) Slice

func (s UnsafeRawSlice[T]) Slice(len int) []T

Slice returns a Go []T slice containing the first `len` elements of the unsafe slice.

type UnsafeUints

type UnsafeUints struct {
	// contains filtered or unexported fields
}

UnsafeUints exposes a read-only view of integers from a column, transparently decoding data based on the UintEncoding.

See UintEncoding and UintBuilder.

func DecodeUnsafeUints

func DecodeUnsafeUints(b []byte, off uint32, rows int) (_ UnsafeUints, endOffset uint32)

DecodeUnsafeUints decodes the structure of a slice of uints from a byte slice.

func (UnsafeUints) At

func (s UnsafeUints) At(i int) uint64

At returns the `i`-th element.
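A sketch of decoding a column previously written by UintBuilder at offset 0 and reading every row (helper name and import path assumed):

package example

import "github.com/cockroachdb/pebble/sstable/colblk" // import path assumed

// sumColumn decodes a uint column of `rows` values at offset 0 of buf and
// returns the sum of its values.
func sumColumn(buf []byte, rows int) uint64 {
	col, _ := colblk.DecodeUnsafeUints(buf, 0 /* offset */, rows)
	var sum uint64
	for i := 0; i < rows; i++ {
		sum += col.At(i)
	}
	return sum
}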

type Version

type Version uint8

Version indicates the version of the columnar block format encoded. The version byte is always the first byte within the block. This ensures that readers can switch on the version before reading the rest of the block.

const (
	// Version1 is the first version of the columnar block format.
	Version1 Version = 0x01
)
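Readers can switch on the version byte before decoding anything else; a minimal sketch (helper name and import path assumed):

package example

import (
	"fmt"

	"github.com/cockroachdb/pebble/sstable/colblk" // import path assumed
)

// checkVersion verifies that the block's leading version byte is one this
// reader understands.
func checkVersion(blockData []byte) error {
	if len(blockData) == 0 {
		return fmt.Errorf("empty block")
	}
	if v := colblk.Version(blockData[0]); v != colblk.Version1 {
		return fmt.Errorf("unsupported columnar block version %d", v)
	}
	return nil
}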
