sstable

package
v0.0.0-...-417737b
Published: Jun 21, 2022 License: BSD-3-Clause Imports: 29 Imported by: 0

Documentation

Overview

Package sstable implements readers and writers of pebble tables.

Tables are either opened for reading or created for writing but not both.

A reader can create iterators, which allow seeking and next/prev iteration. There may be multiple key/value pairs that have the same key and different sequence numbers.

A reader can be used concurrently. Multiple goroutines can call NewIter concurrently, and each iterator can run concurrently with other iterators. However, any particular iterator should not be used concurrently, and iterators should not be used once a reader is closed.

A writer writes key/value pairs in increasing key order, and cannot be used concurrently. A table cannot be read until the writer has finished.

Readers and writers can be created with various options. Passing a nil Options pointer is valid and means to use the default values.

One such option is to define the 'less than' ordering for keys. The default Comparer uses the natural ordering consistent with bytes.Compare. The same ordering should be used for reading and writing a table.

To return the value for a key:

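r, err := sstable.NewReader(file, options)
if err != nil {
	return err
}
defer r.Close()
i, err := r.NewIter(nil, nil)
if err != nil {
	return err
}
defer i.Close()
// A nil ikey means the table has no entry at or after key.
ikey, value := i.SeekGE(key)
if ikey == nil || options.Comparer.Compare(ikey.UserKey, key) != 0 {
	// not found
} else {
	// value is the first record containing key
}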

To count the number of entries in a table:

i, err := r.NewIter(nil, nil)
if err != nil {
	return 0, err
}
n := 0
for key, _ := i.First(); key != nil; key, _ = i.Next() {
	n++
}
if err := i.Close(); err != nil {
	return 0, err
}
return n, nil

To write a table with three entries:

w := sstable.NewWriter(file, options)
if err := w.Set([]byte("apple"), []byte("red")); err != nil {
	w.Close()
	return err
}
if err := w.Set([]byte("banana"), []byte("yellow")); err != nil {
	w.Close()
	return err
}
if err := w.Set([]byte("cherry"), []byte("red")); err != nil {
	w.Close()
	return err
}
return w.Close()

Constants

const (
	InternalKeyKindDelete          = base.InternalKeyKindDelete
	InternalKeyKindSet             = base.InternalKeyKindSet
	InternalKeyKindMerge           = base.InternalKeyKindMerge
	InternalKeyKindLogData         = base.InternalKeyKindLogData
	InternalKeyKindRangeDelete     = base.InternalKeyKindRangeDelete
	InternalKeyKindMax             = base.InternalKeyKindMax
	InternalKeyKindInvalid         = base.InternalKeyKindInvalid
	InternalKeySeqNumBatch         = base.InternalKeySeqNumBatch
	InternalKeySeqNumMax           = base.InternalKeySeqNumMax
	InternalKeyRangeDeleteSentinel = base.InternalKeyRangeDeleteSentinel
)

These constants are part of the file format, and should not be changed.

const (
	TableFilter = base.TableFilter
)

Exported TableFilter constants.

Variables

var DefaultComparer = base.DefaultComparer

DefaultComparer exports the base.DefaultComparer variable.

Functions

This section is empty.

Types

type AbbreviatedKey

type AbbreviatedKey = base.AbbreviatedKey

AbbreviatedKey exports the base.AbbreviatedKey type.

type BlockHandle

type BlockHandle struct {
	Offset, Length uint64
}

BlockHandle is the file offset and length of a block.

type BlockHandleWithProperties

type BlockHandleWithProperties struct {
	BlockHandle
	Props []byte
}

BlockHandleWithProperties is used for data blocks and first/lower level index blocks, since they can be annotated using BlockPropertyCollectors.

type BlockIntervalCollector

type BlockIntervalCollector struct {
	// contains filtered or unexported fields
}

BlockIntervalCollector is a helper implementation of BlockPropertyCollector for users who want to represent a set of the form [lower,upper) where both lower and upper are uint64, and lower <= upper.

The set is encoded as:

  • Two varint integers, (lower, upper-lower), when upper-lower > 0
  • Nil, when upper-lower = 0

Users must not expect this to preserve differences between empty sets -- they will all get turned into the semantically equivalent [0,0).

A BlockIntervalCollector that collects over point and range keys needs to have both the point and range DataBlockIntervalCollector specified, since point and range keys are fed to the BlockIntervalCollector in an interleaved fashion, independently of one another. This also implies that the DataBlockIntervalCollectors for point and range keys should be references to independent instances, rather than references to the same collector, as point and range keys are tracked independently.
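The encoding rule described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the helper names encodeInterval, decodeInterval, and appendUvarint are hypothetical, and the sketch assumes "varint" means the unsigned varint format of encoding/binary.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// appendUvarint appends v to buf as an unsigned varint.
func appendUvarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// encodeInterval (hypothetical helper) encodes [lower, upper) as the pair
// (lower, upper-lower). An empty interval encodes as nil, so every empty
// interval becomes indistinguishable from [0, 0).
func encodeInterval(buf []byte, lower, upper uint64) ([]byte, error) {
	if lower > upper {
		return nil, errors.New("lower must not exceed upper")
	}
	if lower == upper {
		return nil, nil
	}
	buf = appendUvarint(buf, lower)
	return appendUvarint(buf, upper-lower), nil
}

// decodeInterval (hypothetical helper) reverses encodeInterval.
func decodeInterval(b []byte) (uint64, uint64, error) {
	if len(b) == 0 {
		return 0, 0, nil // the empty set
	}
	lower, n := binary.Uvarint(b)
	if n <= 0 {
		return 0, 0, errors.New("cannot decode lower bound")
	}
	delta, m := binary.Uvarint(b[n:])
	if m <= 0 || delta == 0 {
		return 0, 0, errors.New("cannot decode interval length")
	}
	return lower, lower + delta, nil
}

func main() {
	enc, err := encodeInterval(nil, 10, 25)
	if err != nil {
		panic(err)
	}
	lower, upper, err := decodeInterval(enc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes -> [%d, %d)\n", len(enc), lower, upper)
}
```

Storing the length rather than the upper bound keeps the second varint small when intervals are narrow relative to their position.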

func (*BlockIntervalCollector) Add

func (b *BlockIntervalCollector) Add(key InternalKey, value []byte) error

Add implements the BlockPropertyCollector interface.

func (*BlockIntervalCollector) AddPrevDataBlockToIndexBlock

func (b *BlockIntervalCollector) AddPrevDataBlockToIndexBlock()

AddPrevDataBlockToIndexBlock implements the BlockPropertyCollector interface.

func (*BlockIntervalCollector) FinishDataBlock

func (b *BlockIntervalCollector) FinishDataBlock(buf []byte) ([]byte, error)

FinishDataBlock implements the BlockPropertyCollector interface.

func (*BlockIntervalCollector) FinishIndexBlock

func (b *BlockIntervalCollector) FinishIndexBlock(buf []byte) ([]byte, error)

FinishIndexBlock implements the BlockPropertyCollector interface.

func (*BlockIntervalCollector) FinishTable

func (b *BlockIntervalCollector) FinishTable(buf []byte) ([]byte, error)

FinishTable implements the BlockPropertyCollector interface.

func (*BlockIntervalCollector) Name

func (b *BlockIntervalCollector) Name() string

Name implements the BlockPropertyCollector interface.

type BlockPropertiesFilterer

type BlockPropertiesFilterer struct {
	// contains filtered or unexported fields
}

BlockPropertiesFilterer provides filtering support when reading an sstable in the context of an iterator that has a slice of BlockPropertyFilters. After the call to NewBlockPropertiesFilterer, the caller must call IntersectsUserPropsAndFinishInit to check if the sstable intersects with the filters. If it does intersect, this function also finishes initializing the BlockPropertiesFilterer using the shortIDs for the relevant filters. Subsequent checks for relevance of a block should use the intersects method.

func NewBlockPropertiesFilterer

func NewBlockPropertiesFilterer(filters []BlockPropertyFilter) *BlockPropertiesFilterer

NewBlockPropertiesFilterer returns a partially initialized filterer. To complete initialization, call IntersectsUserPropsAndFinishInit.

func (*BlockPropertiesFilterer) IntersectsUserPropsAndFinishInit

func (f *BlockPropertiesFilterer) IntersectsUserPropsAndFinishInit(
	userProperties map[string]string,
) (bool, error)

IntersectsUserPropsAndFinishInit is called with the user properties map for the sstable and returns whether the sstable intersects the filters. It additionally initializes the shortIDToFiltersIndex for the filters that are relevant to this sstable.

type BlockPropertyCollector

type BlockPropertyCollector interface {
	// Name returns the name of the block property collector.
	Name() string
	// Add is called with each new entry added to a data block in the sstable.
	// The callee can assume that these are in sorted order.
	Add(key InternalKey, value []byte) error
	// FinishDataBlock is called when all the entries have been added to a
	// data block. Subsequent Add calls will be for the next data block. It
	// returns the property value for the finished block.
	FinishDataBlock(buf []byte) ([]byte, error)
	// AddPrevDataBlockToIndexBlock adds the entry corresponding to the
	// previous FinishDataBlock to the current index block.
	AddPrevDataBlockToIndexBlock()
	// FinishIndexBlock is called when an index block, containing all the
	// key-value pairs since the last FinishIndexBlock, will no longer see new
	// entries. It returns the property value for the index block.
	FinishIndexBlock(buf []byte) ([]byte, error)
	// FinishTable is called when the sstable is finished, and returns the
	// property value for the sstable.
	FinishTable(buf []byte) ([]byte, error)
}

BlockPropertyCollector is used when writing a sstable.

  • All calls to Add are included in the next FinishDataBlock, after which the next data block is expected to start.
  • The index entry generated for the data block, which contains the return value from FinishDataBlock, is not immediately included in the current index block. It is included when AddPrevDataBlockToIndexBlock is called. An alternative would be to return an opaque handle from FinishDataBlock and pass it to a new AddToIndexBlock method, which requires more plumbing, and passing of an interface{} results in an undesirable heap allocation. AddPrevDataBlockToIndexBlock must be called before keys are added to the new data block.
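The call order in the list above can be sketched as a small simulation. Everything here is illustrative: InternalKey is a hypothetical stand-in for base.InternalKey, countingCollector is an invented collector that counts entries, and writeTable is a hypothetical driver that assumes all data blocks fall under a single index block.

```go
package main

import (
	"fmt"
	"strconv"
)

// InternalKey is a hypothetical stand-in for base.InternalKey.
type InternalKey struct{ UserKey []byte }

// countingCollector counts entries per data block, index block, and table.
// Its methods mirror the BlockPropertyCollector interface shown above.
type countingCollector struct {
	block, prevBlock, index, table uint64
}

func (c *countingCollector) Name() string { return "example.entry-count" }

func (c *countingCollector) Add(key InternalKey, value []byte) error {
	c.block++
	return nil
}

// FinishDataBlock returns the block's property and holds the count until
// AddPrevDataBlockToIndexBlock folds it into the current index block.
func (c *countingCollector) FinishDataBlock(buf []byte) ([]byte, error) {
	c.prevBlock, c.block = c.block, 0
	return strconv.AppendUint(buf, c.prevBlock, 10), nil
}

func (c *countingCollector) AddPrevDataBlockToIndexBlock() {
	c.index += c.prevBlock
	c.prevBlock = 0
}

func (c *countingCollector) FinishIndexBlock(buf []byte) ([]byte, error) {
	n := c.index
	c.table += n
	c.index = 0
	return strconv.AppendUint(buf, n, 10), nil
}

func (c *countingCollector) FinishTable(buf []byte) ([]byte, error) {
	return strconv.AppendUint(buf, c.table, 10), nil
}

// writeTable (hypothetical driver) issues calls in the documented order.
func writeTable(c *countingCollector, blocks [][]string) ([]string, string, error) {
	var blockProps []string
	for _, keys := range blocks {
		for _, k := range keys {
			if err := c.Add(InternalKey{UserKey: []byte(k)}, nil); err != nil {
				return nil, "", err
			}
		}
		prop, err := c.FinishDataBlock(nil)
		if err != nil {
			return nil, "", err
		}
		blockProps = append(blockProps, string(prop))
		// Must happen before any key of the next data block is added.
		c.AddPrevDataBlockToIndexBlock()
	}
	if _, err := c.FinishIndexBlock(nil); err != nil {
		return nil, "", err
	}
	tableProp, err := c.FinishTable(nil)
	if err != nil {
		return nil, "", err
	}
	return blockProps, string(tableProp), nil
}

func main() {
	props, table, err := writeTable(&countingCollector{}, [][]string{{"a", "b"}, {"c"}})
	if err != nil {
		panic(err)
	}
	fmt.Println("block properties:", props, "table property:", table)
}
```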

func NewBlockIntervalCollector

func NewBlockIntervalCollector(
	name string, pointCollector, rangeCollector DataBlockIntervalCollector,
) BlockPropertyCollector

NewBlockIntervalCollector constructs a BlockIntervalCollector with the given name. The BlockIntervalCollector makes use of the given point and range key DataBlockIntervalCollectors when encountering point and range keys, respectively.

The caller may pass a nil DataBlockIntervalCollector for one of the point or range key collectors, in which case keys of those types will be ignored. This allows for flexible construction of BlockIntervalCollectors that operate on just point keys, just range keys, or both point and range keys.

If both point and range keys are to be tracked, two independent collectors should be provided, rather than the same collector passed in twice (see the comment on BlockIntervalCollector for more detail).

type BlockPropertyFilter

type BlockPropertyFilter = base.BlockPropertyFilter

BlockPropertyFilter is used in an Iterator to filter sstables and blocks within the sstable. It should not maintain any per-sstable state, and must be thread-safe.

func NewBlockIntervalFilter

func NewBlockIntervalFilter(name string, lower uint64, upper uint64) BlockPropertyFilter

NewBlockIntervalFilter constructs a BlockPropertyFilter that filters blocks based on an interval property collected by BlockIntervalCollector and the given [lower, upper) bounds. The given name specifies the BlockIntervalCollector's properties to read.
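The text above does not spell out how a block's interval is compared with the filter bounds. A plausible reading, shown here purely as an assumption, is a half-open interval intersection test in which an empty interval never matches. The interval type and its intersects method are hypothetical names.

```go
package main

import "fmt"

// interval is a hypothetical representation of [lower, upper).
type interval struct{ lower, upper uint64 }

// intersects reports whether two half-open intervals share any value.
// An empty interval (lower >= upper) intersects nothing.
func (a interval) intersects(b interval) bool {
	if a.lower >= a.upper || b.lower >= b.upper {
		return false
	}
	return a.lower < b.upper && b.lower < a.upper
}

func main() {
	filter := interval{lower: 10, upper: 20}
	for _, block := range []interval{{5, 10}, {5, 11}, {19, 30}, {20, 30}, {0, 0}} {
		fmt.Printf("block [%d, %d) relevant: %t\n",
			block.lower, block.upper, filter.intersects(block))
	}
}
```

Because upper is exclusive, a block whose interval ends exactly at the filter's lower bound is treated as irrelevant under this reading.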

type ChecksumType

type ChecksumType byte

ChecksumType specifies the checksum used for blocks.

const (
	ChecksumTypeNone     ChecksumType = 0
	ChecksumTypeCRC32c   ChecksumType = 1
	ChecksumTypeXXHash   ChecksumType = 2
	ChecksumTypeXXHash64 ChecksumType = 3
)

The available checksum types.

func (ChecksumType) String

func (t ChecksumType) String() string

String implements fmt.Stringer.

type Compare

type Compare = base.Compare

Compare exports the base.Compare type.

type Comparer

type Comparer = base.Comparer

Comparer exports the base.Comparer type.

type Comparers

type Comparers map[string]*Comparer

Comparers is a map from comparer name to comparer. It is used for debugging tools which may be used on multiple databases configured with different comparers. Comparers implements the OpenOption interface and can be passed as a parameter to NewReader.

type Compression

type Compression int

Compression is the per-block compression algorithm to use.

const (
	DefaultCompression Compression = iota
	NoCompression
	SnappyCompression
	ZstdCompression
	NCompression
)

The available compression types.

func (Compression) String

func (c Compression) String() string

type DataBlockIntervalCollector

type DataBlockIntervalCollector interface {
	// Add is called with each new entry added to a data block in the sstable.
	// The callee can assume that these are in sorted order.
	Add(key InternalKey, value []byte) error
	// FinishDataBlock is called when all the entries have been added to a
	// data block. Subsequent Add calls will be for the next data block. It
	// returns the [lower, upper) for the finished block.
	FinishDataBlock() (lower uint64, upper uint64, err error)
}

DataBlockIntervalCollector is the interface used by BlockIntervalCollector that contains the actual logic pertaining to the property. It only maintains state for the current data block, and resets that state in FinishDataBlock. This interface can be used to reduce parsing costs.
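As an illustration of the per-block state described above, the following sketch implements the two methods for a made-up key layout. It assumes, hypothetically, that each user key ends in an 8-byte big-endian timestamp; InternalKey is a stand-in for base.InternalKey, and tsCollector and makeKey are invented names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// InternalKey is a hypothetical stand-in for base.InternalKey.
type InternalKey struct{ UserKey []byte }

// tsCollector tracks [lower, upper) over the timestamps of the current
// data block only. The interval is empty while lower == upper.
type tsCollector struct{ lower, upper uint64 }

// Add widens the interval to cover the key's timestamp suffix.
// Overflow of ts+1 at the maximum uint64 is ignored in this sketch.
func (c *tsCollector) Add(key InternalKey, value []byte) error {
	if len(key.UserKey) < 8 {
		return nil // no timestamp suffix; nothing to record
	}
	ts := binary.BigEndian.Uint64(key.UserKey[len(key.UserKey)-8:])
	if c.lower == c.upper {
		c.lower, c.upper = ts, ts+1
		return nil
	}
	if ts < c.lower {
		c.lower = ts
	}
	if ts >= c.upper {
		c.upper = ts + 1
	}
	return nil
}

// FinishDataBlock returns the interval for the finished block and resets
// the state so the next Add starts a fresh block.
func (c *tsCollector) FinishDataBlock() (lower, upper uint64, err error) {
	lower, upper = c.lower, c.upper
	c.lower, c.upper = 0, 0
	return lower, upper, nil
}

// makeKey builds a key as prefix followed by the big-endian timestamp.
func makeKey(prefix string, ts uint64) InternalKey {
	k := make([]byte, len(prefix)+8)
	copy(k, prefix)
	binary.BigEndian.PutUint64(k[len(prefix):], ts)
	return InternalKey{UserKey: k}
}

func main() {
	c := &tsCollector{}
	for _, ts := range []uint64{30, 10, 20} {
		if err := c.Add(makeKey("k", ts), nil); err != nil {
			panic(err)
		}
	}
	lower, upper, _ := c.FinishDataBlock()
	fmt.Printf("block interval: [%d, %d)\n", lower, upper)
}
```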

type Equal

type Equal = base.Equal

Equal exports the base.Equal type.

type FileReopenOpt

type FileReopenOpt struct {
	FS       vfs.FS
	Filename string
}

FileReopenOpt is specified if this reader is allowed to reopen additional file descriptors for this file. Used to take advantage of OS-level readahead.

type FilterMetrics

type FilterMetrics struct {
	// The number of hits for the filter policy. This is the
	// number of times the filter policy was successfully used to avoid access
	// of a data block.
	Hits int64
	// The number of misses for the filter policy. This is the number of times
	// the filter policy was checked but was unable to filter an access of a data
	// block.
	Misses int64
}

FilterMetrics holds metrics for the filter policy.

type FilterPolicy

type FilterPolicy = base.FilterPolicy

FilterPolicy exports the base.FilterPolicy type.

type FilterType

type FilterType = base.FilterType

FilterType exports the base.FilterType type.

type FilterWriter

type FilterWriter = base.FilterWriter

FilterWriter exports the base.FilterWriter type.

type InternalKey

type InternalKey = base.InternalKey

InternalKey exports the base.InternalKey type.

type InternalKeyKind

type InternalKeyKind = base.InternalKeyKind

InternalKeyKind exports the base.InternalKeyKind type.

type Iterator

type Iterator interface {
	base.InternalIterator

	SetCloseHook(fn func(i Iterator) error)
}

Iterator iterates over an entire table of data.

type Layout

type Layout struct {
	Data       []BlockHandleWithProperties
	Index      []BlockHandle
	TopIndex   BlockHandle
	Filter     BlockHandle
	RangeDel   BlockHandle
	RangeKey   BlockHandle
	Properties BlockHandle
	MetaIndex  BlockHandle
	Footer     BlockHandle
}

Layout describes the block organization of an sstable.

func (*Layout) Describe

func (l *Layout) Describe(
	w io.Writer, verbose bool, r *Reader, fmtRecord func(key *base.InternalKey, value []byte),
)

Describe writes a description of the layout to w. If the verbose parameter is true, details of the structure of each block are written as well.

type Merger

type Merger = base.Merger

Merger exports the base.Merger type.

type Mergers

type Mergers map[string]*Merger

Mergers is a map from merger name to merger. It is used for debugging tools which may be used on multiple databases configured with different mergers. Mergers implements the OpenOption interface and can be passed as a parameter to NewReader.

type PreviousPointKeyOpt

type PreviousPointKeyOpt struct {
	// contains filtered or unexported fields
}

PreviousPointKeyOpt is a WriterOption that provides access to the last point key written to the writer while building a sstable.

func (PreviousPointKeyOpt) UnsafeKey

func (o PreviousPointKeyOpt) UnsafeKey() base.InternalKey

UnsafeKey returns the last point key written to the writer to which this option was passed during creation. The returned key points directly into a buffer belonging to the Writer. The value's lifetime ends the next time a point key is added to the Writer. Invariant: UnsafeKey isn't and shouldn't be called after the Writer is closed.

type Properties

type Properties struct {
	// ID of column family for this SST file, corresponding to the CF identified
	// by column_family_name.
	ColumnFamilyID uint64 `prop:"rocksdb.column.family.id"`
	// Name of the column family with which this SST file is associated. Empty if
	// the column family is unknown.
	ColumnFamilyName string `prop:"rocksdb.column.family.name"`
	// The name of the comparer used in this table.
	ComparerName string `prop:"rocksdb.comparator"`
	// The compression algorithm used to compress blocks.
	CompressionName string `prop:"rocksdb.compression"`
	// The compression options used to compress blocks.
	CompressionOptions string `prop:"rocksdb.compression_options"`
	// The time when the SST file was created. Since SST files are immutable,
	// this is equivalent to last modified time.
	CreationTime uint64 `prop:"rocksdb.creation.time"`
	// The total size of all data blocks.
	DataSize uint64 `prop:"rocksdb.data.size"`
	// The external sstable version format. Version 2 is the one RocksDB has been
	// using since 5.13. RocksDB only uses the global sequence number for an
	// sstable if this property has been set.
	ExternalFormatVersion uint32 `prop:"rocksdb.external_sst_file.version"`
	// Actual SST file creation time. 0 means unknown.
	FileCreationTime uint64 `prop:"rocksdb.file.creation.time"`
	// The name of the filter policy used in this table. Empty if no filter
	// policy is used.
	FilterPolicyName string `prop:"rocksdb.filter.policy"`
	// The size of filter block.
	FilterSize uint64 `prop:"rocksdb.filter.size"`
	// If 0, key is variable length. Otherwise number of bytes for each key.
	FixedKeyLen uint64 `prop:"rocksdb.fixed.key.length"`
	// Format version, reserved for backward compatibility.
	FormatVersion uint64 `prop:"rocksdb.format.version"`
	// The global sequence number to use for all entries in the table. Present if
	// the table was created externally and ingested whole.
	GlobalSeqNum uint64 `prop:"rocksdb.external_sst_file.global_seqno"`
	// Whether the index key is user key or an internal key.
	IndexKeyIsUserKey uint64 `prop:"rocksdb.index.key.is.user.key"`
	// Total number of index partitions if kTwoLevelIndexSearch is used.
	IndexPartitions uint64 `prop:"rocksdb.index.partitions"`
	// The size of index block.
	IndexSize uint64 `prop:"rocksdb.index.size"`
	// The index type. TODO(peter): add a more detailed description.
	IndexType uint32 `prop:"rocksdb.block.based.table.index.type"`
	// Whether delta encoding is used to encode the index values.
	IndexValueIsDeltaEncoded uint64 `prop:"rocksdb.index.value.is.delta.encoded"`
	// The name of the merger used in this table. Empty if no merger is used.
	MergerName string `prop:"rocksdb.merge.operator"`
	// The number of blocks in this table.
	NumDataBlocks uint64 `prop:"rocksdb.num.data.blocks"`
	// The number of deletion entries in this table, including both point and
	// range deletions.
	NumDeletions uint64 `prop:"rocksdb.deleted.keys"`
	// The number of entries in this table.
	NumEntries uint64 `prop:"rocksdb.num.entries"`
	// The number of merge operands in the table.
	NumMergeOperands uint64 `prop:"rocksdb.merge.operands"`
	// The number of range deletions in this table.
	NumRangeDeletions uint64 `prop:"rocksdb.num.range-deletions"`
	// The number of RANGEKEYDELs in this table.
	NumRangeKeyDels uint64 `prop:"pebble.num.range-key-dels"`
	// The number of RANGEKEYSETs in this table.
	NumRangeKeySets uint64 `prop:"pebble.num.range-key-sets"`
	// The number of RANGEKEYUNSETs in this table.
	NumRangeKeyUnsets uint64 `prop:"pebble.num.range-key-unsets"`
	// Timestamp of the earliest key. 0 if unknown.
	OldestKeyTime uint64 `prop:"rocksdb.oldest.key.time"`
	// The name of the prefix extractor used in this table. Empty if no prefix
	// extractor is used.
	PrefixExtractorName string `prop:"rocksdb.prefix.extractor.name"`
	// If filtering is enabled, was the filter created on the key prefix.
	PrefixFiltering bool `prop:"rocksdb.block.based.table.prefix.filtering"`
	// A comma separated list of names of the property collectors used in this
	// table.
	PropertyCollectorNames string `prop:"rocksdb.property.collectors"`
	// Total raw key size.
	RawKeySize uint64 `prop:"rocksdb.raw.key.size"`
	// Total raw rangekey key size.
	RawRangeKeyKeySize uint64 `prop:"pebble.raw.range-key.key.size"`
	// Total raw rangekey value size.
	RawRangeKeyValueSize uint64 `prop:"pebble.raw.range-key.value.size"`
	// Total raw value size.
	RawValueSize uint64 `prop:"rocksdb.raw.value.size"`
	// Size of the top-level index if kTwoLevelIndexSearch is used.
	TopLevelIndexSize uint64 `prop:"rocksdb.top-level.index.size"`
	// User collected properties.
	UserProperties map[string]string
	// If filtering is enabled, was the filter created on the whole key.
	WholeKeyFiltering bool `prop:"rocksdb.block.based.table.whole.key.filtering"`

	// Loaded set indicating which fields have been loaded from disk. Indexed by
	// the field's byte offset within the struct
	// (reflect.StructField.Offset). Only set if the properties have been loaded
	// from a file. Only exported for testing purposes.
	Loaded map[uintptr]struct{}
}

Properties holds the sstable property values. The properties are automatically populated during sstable creation and load from the properties meta block when an sstable is opened.
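The struct above relies on two standard-library mechanisms: `prop` struct tags and field byte offsets (reflect.StructField.Offset), which index the Loaded set. The sketch below shows only those mechanics on a cut-down, hypothetical struct; exampleProps and propOffsets are invented names, and nothing here reflects how the package actually parses the properties block.

```go
package main

import (
	"fmt"
	"reflect"
)

// exampleProps is a hypothetical, cut-down version of Properties.
type exampleProps struct {
	NumEntries     uint64 `prop:"rocksdb.num.entries"`
	ComparerName   string `prop:"rocksdb.comparator"`
	UserProperties map[string]string // no tag, so it is skipped below
}

// propOffsets maps each `prop` tag on a struct value to the byte offset
// of the tagged field within the struct.
func propOffsets(v interface{}) map[string]uintptr {
	t := reflect.TypeOf(v)
	out := make(map[string]uintptr)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if tag, ok := f.Tag.Lookup("prop"); ok {
			out[tag] = f.Offset
		}
	}
	return out
}

func main() {
	offsets := propOffsets(exampleProps{})
	// A "loaded" set keyed by field offset, as the Loaded comment describes.
	loaded := map[uintptr]struct{}{offsets["rocksdb.num.entries"]: {}}
	for name, off := range offsets {
		_, ok := loaded[off]
		fmt.Printf("%s: offset=%d loaded=%t\n", name, off, ok)
	}
}
```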

func (*Properties) NumPointDeletions

func (p *Properties) NumPointDeletions() uint64

NumPointDeletions returns the number of point deletions in this table.

func (*Properties) NumRangeKeys

func (p *Properties) NumRangeKeys() uint64

NumRangeKeys returns a count of the number of range keys in this table.

func (*Properties) String

func (p *Properties) String() string

type ReadableFile

type ReadableFile interface {
	io.ReaderAt
	io.Closer
	Stat() (os.FileInfo, error)
}

ReadableFile describes the subset of vfs.File required for reading SSTs.

type Reader

type Reader struct {
	Compare   Compare
	FormatKey base.FormatKey
	Split     Split

	Properties Properties
	// contains filtered or unexported fields
}

Reader is a table reader.

func NewMemReader

func NewMemReader(sst []byte, o ReaderOptions) (*Reader, error)

NewMemReader opens a reader over the SST stored in the passed []byte.

func NewReader

func NewReader(f ReadableFile, o ReaderOptions, extraOpts ...ReaderOption) (*Reader, error)

NewReader returns a new table reader for the file. Closing the reader will close the file.

func (*Reader) Close

func (r *Reader) Close() error

Close implements DB.Close, as documented in the pebble package.

func (*Reader) EstimateDiskUsage

func (r *Reader) EstimateDiskUsage(start, end []byte) (uint64, error)

EstimateDiskUsage returns the total size of data blocks overlapping the range `[start, end]`. Even if a data block partially overlaps, or we cannot determine overlap due to abbreviated index keys, the full data block size is included in the estimation. This function does not account for any metablock space usage. Assumes there is at least partial overlap, i.e., `[start, end]` falls neither completely before nor completely after the file's range.
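The estimation rule above (count a data block in full whenever it may overlap the range) can be sketched as follows. This is an assumption-laden simplification: the block type with explicit smallest and largest keys and the estimate function are hypothetical, keys are compared with bytes.Compare, and the linear scan stands in for whatever index lookup the Reader really performs.

```go
package main

import (
	"bytes"
	"fmt"
)

// block is a hypothetical summary of one data block.
type block struct {
	smallest, largest []byte // inclusive key bounds of the block
	size              uint64 // size of the block in bytes
}

// estimate sums the full size of every block that overlaps the inclusive
// range [start, end]. Partial overlap still counts the whole block.
func estimate(blocks []block, start, end []byte) uint64 {
	var total uint64
	for _, b := range blocks {
		if bytes.Compare(b.largest, start) >= 0 && bytes.Compare(b.smallest, end) <= 0 {
			total += b.size
		}
	}
	return total
}

func main() {
	blocks := []block{
		{smallest: []byte("a"), largest: []byte("c"), size: 100},
		{smallest: []byte("d"), largest: []byte("f"), size: 200},
		{smallest: []byte("g"), largest: []byte("i"), size: 300},
	}
	fmt.Println(estimate(blocks, []byte("e"), []byte("h")))
}
```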

TODO(ajkr): account for metablock space usage. Perhaps look at the fraction of data blocks overlapped and add that same fraction of the metadata blocks to the estimate.

func (*Reader) Layout

func (r *Reader) Layout() (*Layout, error)

Layout returns the layout (block organization) for an sstable.

func (*Reader) NewCompactionIter

func (r *Reader) NewCompactionIter(bytesIterated *uint64) (Iterator, error)

NewCompactionIter returns an iterator similar to NewIter but it also increments the number of bytes iterated. If an error occurs, NewCompactionIter cleans up after itself and returns a nil iterator.

func (*Reader) NewIter

func (r *Reader) NewIter(lower, upper []byte) (Iterator, error)

NewIter returns an iterator for the contents of the table. If an error occurs, NewIter cleans up after itself and returns a nil iterator.

func (*Reader) NewIterWithBlockPropertyFilters

func (r *Reader) NewIterWithBlockPropertyFilters(
	lower, upper []byte, filterer *BlockPropertiesFilterer, useFilterBlock bool,
) (Iterator, error)

NewIterWithBlockPropertyFilters returns an iterator for the contents of the table. If an error occurs, NewIterWithBlockPropertyFilters cleans up after itself and returns a nil iterator.

func (*Reader) NewRawRangeDelIter

func (r *Reader) NewRawRangeDelIter() (keyspan.FragmentIterator, error)

NewRawRangeDelIter returns an internal iterator for the contents of the range-del block for the table. Returns nil if the table does not contain any range deletions.

func (*Reader) NewRawRangeKeyIter

func (r *Reader) NewRawRangeKeyIter() (keyspan.FragmentIterator, error)

NewRawRangeKeyIter returns an internal iterator for the contents of the range-key block for the table. Returns nil if the table does not contain any range keys.

func (*Reader) TableFormat

func (r *Reader) TableFormat() (TableFormat, error)

TableFormat returns the format version for the table.

func (*Reader) ValidateBlockChecksums

func (r *Reader) ValidateBlockChecksums() error

ValidateBlockChecksums validates the checksums for each block in the SSTable.

type ReaderOption

type ReaderOption interface {
	// contains filtered or unexported methods
}

ReaderOption provides an interface to do work on Reader while it is being opened.

type ReaderOptions

type ReaderOptions struct {
	// Cache is used to cache uncompressed blocks from sstables.
	//
	// The default cache size is a zero-size cache.
	Cache *cache.Cache

	// Comparer defines a total ordering over the space of []byte keys: a 'less
	// than' relationship. The same comparison algorithm must be used for reads
	// and writes over the lifetime of the DB.
	//
	// The default value uses the same ordering as bytes.Compare.
	Comparer *Comparer

	// Filters is a map from filter policy name to filter policy. It is used for
	// debugging tools which may be used on multiple databases configured with
	// different filter policies. It is not necessary to populate this filters
	// map during normal usage of a DB.
	Filters map[string]FilterPolicy

	// Merger defines the associative merge operation to use for merging values
	// written with {Batch,DB}.Merge. The MergerName is checked for consistency
	// with the value stored in the sstable when it was written.
	MergerName string
}

ReaderOptions holds the parameters needed for reading an sstable.

type Separator

type Separator = base.Separator

Separator exports the base.Separator type.

type Split

type Split = base.Split

Split exports the base.Split type.

type Successor

type Successor = base.Successor

Successor exports the base.Successor type.

type SuffixReplaceableBlockCollector

type SuffixReplaceableBlockCollector interface {
	// UpdateKeySuffixes is called when a block is updated to change the suffix of
	// all keys in the block, and is passed the old value for that prop, if any,
	// for that block as well as the old and new suffix.
	UpdateKeySuffixes(oldProp []byte, oldSuffix, newSuffix []byte) error
}

SuffixReplaceableBlockCollector is an extension to the BlockPropertyCollector interface that allows a block property collector to indicate that it supports being *updated* during suffix replacement, i.e. when an existing SST in which all keys have the same key suffix is updated to have a new suffix.

A collector which supports being updated in such cases must be able to derive its updated value from its old value and the change being made to the suffix, without needing to be passed each updated K/V.

For example, a collector that only inspects values can simply copy its previously computed property as-is, since key-suffix replacement does not change values, while a collector that depends only on key suffixes, like one which collected mvcc-timestamp bounds from timestamp-suffixed keys, can just set its new bounds from the new suffix, as it is common to all keys, without needing to recompute it from every key.

An implementation of DataBlockIntervalCollector can also implement this interface, in which case the BlockPropertyCollector returned by passing it to NewBlockIntervalCollector will also implement this interface automatically.
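The suffix-only case described above can be sketched for a collector that tracks timestamp bounds. The layout is an assumption made for illustration: the suffix is taken to be exactly an 8-byte big-endian timestamp, and tsSuffixCollector and encodeTS are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// tsSuffixCollector holds [lower, upper) timestamp bounds for a block.
type tsSuffixCollector struct{ lower, upper uint64 }

// UpdateKeySuffixes matches the method signature shown above. Since every
// key shares newSuffix after replacement, the new bounds follow from the
// suffix alone; oldProp and oldSuffix are not needed in this sketch.
func (c *tsSuffixCollector) UpdateKeySuffixes(oldProp []byte, oldSuffix, newSuffix []byte) error {
	if len(newSuffix) != 8 {
		return errors.New("expected an 8-byte timestamp suffix")
	}
	ts := binary.BigEndian.Uint64(newSuffix)
	c.lower, c.upper = ts, ts+1
	return nil
}

// encodeTS builds the assumed 8-byte big-endian suffix for ts.
func encodeTS(ts uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, ts)
	return b
}

func main() {
	c := &tsSuffixCollector{lower: 5, upper: 6}
	if err := c.UpdateKeySuffixes(nil, encodeTS(5), encodeTS(42)); err != nil {
		panic(err)
	}
	fmt.Printf("new bounds: [%d, %d)\n", c.lower, c.upper)
}
```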

type SuffixReplaceableTableCollector

type SuffixReplaceableTableCollector interface {
	// UpdateKeySuffixes is called when a table is updated to change the suffix of
	// all keys in the table, and is passed the old value for that prop, if any,
	// for that table as well as the old and new suffix.
	UpdateKeySuffixes(oldProps map[string]string, oldSuffix, newSuffix []byte) error
}

SuffixReplaceableTableCollector is an extension to the TablePropertyCollector interface that allows a table property collector to indicate that it supports being *updated* during suffix replacement, i.e. when an existing SST in which all keys have the same key suffix is updated to have a new suffix.

A collector which supports being updated in such cases must be able to derive its updated value from its old value and the change being made to the suffix, without needing to be passed each updated K/V.

For example, a collector that only inspects values can simply copy its previously computed property as-is, since key-suffix replacement does not change values, while a collector that depends only on key suffixes, like one which collected mvcc-timestamp bounds from timestamp-suffixed keys, can just set its new bounds from the new suffix, as it is common to all keys, without needing to recompute it from every key.

type TableFormat

type TableFormat uint32

TableFormat specifies the format version for sstables. The legacy LevelDB format is format version 1.

const (
	TableFormatUnspecified TableFormat = iota
	TableFormatLevelDB
	TableFormatRocksDBv2
	TableFormatPebblev1 // Block properties.
	TableFormatPebblev2 // Range keys.

	TableFormatMax = TableFormatPebblev2
)

The available table formats, representing the tuple (magic number, version number). Note that these values are not (and should not) be serialized to disk. The ordering should follow the order the versions were introduced to Pebble (i.e. the history is linear).
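Because the constants follow the order in which formats were introduced, feature checks can be written as simple comparisons. The sketch below copies the constant block from above into a standalone program. The helpers supportsBlockProperties and supportsRangeKeys are hypothetical, and they assume that a feature noted on one format remains available in every later format.

```go
package main

import "fmt"

// TableFormat and its constants are copied from the declaration above.
type TableFormat uint32

const (
	TableFormatUnspecified TableFormat = iota
	TableFormatLevelDB
	TableFormatRocksDBv2
	TableFormatPebblev1 // Block properties.
	TableFormatPebblev2 // Range keys.

	TableFormatMax = TableFormatPebblev2
)

// supportsBlockProperties is a hypothetical helper relying on the linear
// ordering of formats.
func supportsBlockProperties(f TableFormat) bool { return f >= TableFormatPebblev1 }

// supportsRangeKeys is a hypothetical helper relying on the same ordering.
func supportsRangeKeys(f TableFormat) bool { return f >= TableFormatPebblev2 }

func main() {
	for f := TableFormatLevelDB; f <= TableFormatMax; f++ {
		fmt.Printf("format %d: block properties=%t, range keys=%t\n",
			f, supportsBlockProperties(f), supportsRangeKeys(f))
	}
}
```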

func ParseTableFormat

func ParseTableFormat(magic []byte, version uint32) (TableFormat, error)

ParseTableFormat parses the given magic bytes and version into its corresponding internal TableFormat.

func (TableFormat) AsTuple

func (f TableFormat) AsTuple() (string, uint32)

AsTuple returns the TableFormat's (Magic String, Version) tuple.

func (TableFormat) String

func (f TableFormat) String() string

String returns the TableFormat (Magic String,Version) tuple.

type TablePropertyCollector

type TablePropertyCollector interface {
	// Add is called with each new entry added to the sstable. While the sstable
	// is itself sorted by key, do not assume that the entries are added in any
	// order. In particular, the ordering of point entries and range tombstones
	// is unspecified.
	Add(key InternalKey, value []byte) error

	// Finish is called when all entries have been added to the sstable. The
	// collected properties (if any) should be added to the specified map. Note
	// that in case of an error during sstable construction, Finish may not be
	// called.
	Finish(userProps map[string]string) error

	// The name of the property collector.
	Name() string
}

TablePropertyCollector provides a hook for collecting user-defined properties based on the keys and values stored in an sstable. A new TablePropertyCollector is created for an sstable when the sstable is being written.

type Writer

type Writer struct {
	// contains filtered or unexported fields
}

Writer is a table writer.

func NewWriter

func NewWriter(f writeCloseSyncer, o WriterOptions, extraOpts ...WriterOption) *Writer

NewWriter returns a new table writer for the file. Closing the writer will close the file.

func (*Writer) Add

func (w *Writer) Add(key InternalKey, value []byte) error

Add adds a key/value pair to the table being written. For a given Writer, the keys passed to Add must be in increasing order. The exception to this rule is range deletion tombstones. Range deletion tombstones need to be added ordered by their start key, but they can be added out of order from point entries. Additionally, range deletion tombstones must be fragmented (i.e. by keyspan.Fragmenter).

func (*Writer) AddRangeKey

func (w *Writer) AddRangeKey(key InternalKey, value []byte) error

AddRangeKey adds a range key set, unset, or delete key/value pair to the table being written.

Range keys must be supplied in strictly ascending order of start key (i.e. user key ascending, sequence number descending, and key type descending). Ranges added must also be supplied in fragmented span order - i.e. other than spans that are perfectly aligned (same start and end keys), spans may not overlap. Range keys may be added out of order relative to point keys and range deletions.
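The ordering rule above can be sketched as a small validator. This is only an illustration under simplifying assumptions: `span` is a hypothetical type (not the package's internal key representation), user keys are compared with bytes.Compare, and `kind` is an arbitrary numeric key type. It checks that start keys are strictly ascending (user key ascending, sequence number descending, key type descending) and that adjacent spans are either perfectly aligned or do not overlap.

```go
package main

import (
	"bytes"
	"fmt"
)

// span is a hypothetical, simplified model of a range key entry.
type span struct {
	start, end []byte
	seqNum     uint64
	kind       uint8
}

// compareStart orders by user key ascending, then sequence number
// descending, then key kind descending.
func compareStart(a, b span) int {
	if c := bytes.Compare(a.start, b.start); c != 0 {
		return c
	}
	if a.seqNum != b.seqNum {
		if a.seqNum > b.seqNum {
			return -1
		}
		return 1
	}
	if a.kind != b.kind {
		if a.kind > b.kind {
			return -1
		}
		return 1
	}
	return 0
}

// aligned reports whether two spans share the same start and end keys.
func aligned(a, b span) bool {
	return bytes.Equal(a.start, b.start) && bytes.Equal(a.end, b.end)
}

// validOrder reports whether spans are in strictly ascending start-key order
// and fragmented: adjacent spans are either aligned or non-overlapping.
func validOrder(spans []span) bool {
	for i := 1; i < len(spans); i++ {
		prev, cur := spans[i-1], spans[i]
		if compareStart(prev, cur) >= 0 {
			return false
		}
		if !aligned(prev, cur) && bytes.Compare(prev.end, cur.start) > 0 {
			return false
		}
	}
	return true
}

func main() {
	spans := []span{
		{start: []byte("a"), end: []byte("c"), seqNum: 5},
		{start: []byte("a"), end: []byte("c"), seqNum: 3},
		{start: []byte("c"), end: []byte("e"), seqNum: 9},
	}
	fmt.Println("valid:", validOrder(spans))
}
```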

func (*Writer) Close

func (w *Writer) Close() (err error)

Close finishes writing the table and closes the underlying file that the table was written to.

func (*Writer) Delete

func (w *Writer) Delete(key []byte) error

Delete deletes the value for the given key. The sequence number is set to 0. Intended for use to externally construct an sstable before ingestion into a DB.

TODO(peter): untested

func (*Writer) DeleteRange

func (w *Writer) DeleteRange(start, end []byte) error

DeleteRange deletes all of the keys (and values) in the range [start,end) (inclusive on start, exclusive on end). The sequence number is set to 0. Intended for use to externally construct an sstable before ingestion into a DB.
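To make the half-open interval concrete, here is a minimal sketch of the membership test implied by [start,end). It assumes the default byte-wise ordering (bytes.Compare); `inRange` is a hypothetical helper for illustration, not part of the package.

```go
package main

import (
	"bytes"
	"fmt"
)

// inRange reports whether key falls in [start, end): inclusive on start,
// exclusive on end, under byte-wise ordering.
func inRange(key, start, end []byte) bool {
	return bytes.Compare(key, start) >= 0 && bytes.Compare(key, end) < 0
}

func main() {
	start, end := []byte("b"), []byte("d")
	for _, k := range []string{"a", "b", "c", "d"} {
		fmt.Printf("%s covered: %v\n", k, inRange([]byte(k), start, end))
	}
}
```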

TODO(peter): untested

func (*Writer) EstimatedSize

func (w *Writer) EstimatedSize() uint64

EstimatedSize returns the estimated size of the sstable being written if a call to Finish() was made without adding additional keys.

func (*Writer) Merge

func (w *Writer) Merge(key, value []byte) error

Merge adds an action to the DB that merges the value at key with the new value. The details of the merge are dependent upon the configured merge operator. The sequence number is set to 0. Intended for use to externally construct an sstable before ingestion into a DB.

TODO(peter): untested

func (*Writer) Metadata

func (w *Writer) Metadata() (*WriterMetadata, error)

Metadata returns the metadata for the finished sstable. Only valid to call after the sstable has been finished.

func (*Writer) RangeKeyDelete

func (w *Writer) RangeKeyDelete(start, end []byte) error

RangeKeyDelete deletes a range between start (inclusive) and end (exclusive).

Keys must be added to the table in increasing order of start key. Spans are not required to be fragmented.

func (*Writer) RangeKeySet

func (w *Writer) RangeKeySet(start, end, suffix, value []byte) error

RangeKeySet sets a range between start (inclusive) and end (exclusive) with the given suffix to the given value.

Keys must be added to the table in increasing order of start key. Spans are not required to be fragmented.

func (*Writer) RangeKeyUnset

func (w *Writer) RangeKeyUnset(start, end, suffix []byte) error

RangeKeyUnset un-sets a range between start (inclusive) and end (exclusive) with the given suffix.

Keys must be added to the table in increasing order of start key. Spans are not required to be fragmented.

func (*Writer) Set

func (w *Writer) Set(key, value []byte) error

Set sets the value for the given key. The sequence number is set to 0. Intended for use to externally construct an sstable before ingestion into a DB.

TODO(peter): untested

type WriterMetadata

type WriterMetadata struct {
	Size          uint64
	SmallestPoint InternalKey
	// LargestPoint, LargestRangeKey, LargestRangeDel should not be accessed
	// before Writer.Close is called, because they may only be set on
	// Writer.Close.
	LargestPoint     InternalKey
	SmallestRangeDel InternalKey
	LargestRangeDel  InternalKey
	SmallestRangeKey InternalKey
	LargestRangeKey  InternalKey
	HasPointKeys     bool
	HasRangeDelKeys  bool
	HasRangeKeys     bool
	SmallestSeqNum   uint64
	LargestSeqNum    uint64
	Properties       Properties
}

WriterMetadata holds info about a finished sstable.

func RewriteKeySuffixes

func RewriteKeySuffixes(
	sst []byte,
	rOpts ReaderOptions,
	out writeCloseSyncer,
	o WriterOptions,
	from, to []byte,
	concurrency int,
) (*WriterMetadata, error)

RewriteKeySuffixes copies the content of the passed SSTable bytes to a new sstable, written to `out`, in which the suffix `from` is replaced with `to` in every key. The input sstable must consist of only Sets, and every key must have `from` as its suffix, as determined by the Split function of the Comparer in the passed WriterOptions.

Data blocks are rewritten in parallel by `concurrency` workers and then assembled into a final SST. Filters are copied from the original SST without modification as they are not affected by the suffix, while block and table properties are only minimally recomputed.

Any block and table property collectors configured in the WriterOptions must implement SuffixReplaceableTableCollector/SuffixReplaceableBlockCollector.
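The per-key transformation described above can be sketched as follows. This is an illustration only: `split` is a hypothetical Split function that treats everything from the last '@' onward as the suffix and returns the prefix length, and `replaceSuffix` is a made-up helper, not the package's implementation. The sketch shows the requirement that every key carry `from` as its suffix.

```go
package main

import (
	"bytes"
	"fmt"
)

// split is a hypothetical Split function: it returns the length of the key
// prefix, treating everything from the last '@' onward as the suffix.
func split(key []byte) int {
	if i := bytes.LastIndexByte(key, '@'); i >= 0 {
		return i
	}
	return len(key)
}

// replaceSuffix returns key with its suffix swapped from `from` to `to`.
// It fails if the key's suffix is not exactly `from`.
func replaceSuffix(key, from, to []byte) ([]byte, error) {
	n := split(key)
	if !bytes.Equal(key[n:], from) {
		return nil, fmt.Errorf("key %q has suffix %q, want %q", key, key[n:], from)
	}
	out := make([]byte, 0, n+len(to))
	out = append(out, key[:n]...)
	out = append(out, to...)
	return out, nil
}

func main() {
	out, err := replaceSuffix([]byte("apple@5"), []byte("@5"), []byte("@9"))
	fmt.Printf("%s %v\n", out, err)
}
```

Since only the suffix changes and the prefix is untouched, a filter built over key prefixes would remain valid, which is consistent with the note that filters are copied without modification.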

func RewriteKeySuffixesViaWriter

func RewriteKeySuffixesViaWriter(
	r *Reader, out writeCloseSyncer, o WriterOptions, from, to []byte,
) (*WriterMetadata, error)

RewriteKeySuffixesViaWriter is similar to RewriteKeySuffixes but uses just a single loop over the Reader that writes each key to the Writer with the new suffix. This is significantly slower than the parallelized rewriter and does more work to rederive filters, props, etc.; however, redoing that work makes it less restrictive -- props no longer need to

func (*WriterMetadata) SetLargestPointKey

func (m *WriterMetadata) SetLargestPointKey(k InternalKey)

SetLargestPointKey sets the largest point key to the given key. NB: this method sets the "absolute" largest point key. Any existing key is overridden.

func (*WriterMetadata) SetLargestRangeDelKey

func (m *WriterMetadata) SetLargestRangeDelKey(k InternalKey)

SetLargestRangeDelKey sets the largest rangedel key to the given key. NB: this method sets the "absolute" largest rangedel key. Any existing key is overridden.

func (*WriterMetadata) SetLargestRangeKey

func (m *WriterMetadata) SetLargestRangeKey(k InternalKey)

SetLargestRangeKey sets the largest range key to the given key. NB: this method sets the "absolute" largest range key. Any existing key is overridden.

func (*WriterMetadata) SetSmallestPointKey

func (m *WriterMetadata) SetSmallestPointKey(k InternalKey)

SetSmallestPointKey sets the smallest point key to the given key. NB: this method sets the "absolute" smallest point key. Any existing key is overridden.

func (*WriterMetadata) SetSmallestRangeDelKey

func (m *WriterMetadata) SetSmallestRangeDelKey(k InternalKey)

SetSmallestRangeDelKey sets the smallest rangedel key to the given key. NB: this method sets the "absolute" smallest rangedel key. Any existing key is overridden.

func (*WriterMetadata) SetSmallestRangeKey

func (m *WriterMetadata) SetSmallestRangeKey(k InternalKey)

SetSmallestRangeKey sets the smallest range key to the given key. NB: this method sets the "absolute" smallest range key. Any existing key is overridden.

type WriterOption

type WriterOption interface {
	// contains filtered or unexported methods
}

WriterOption provides an interface to do work on a Writer while it is being opened.

type WriterOptions

type WriterOptions struct {
	// BlockRestartInterval is the number of keys between restart points
	// for delta encoding of keys.
	//
	// The default value is 16.
	BlockRestartInterval int

	// BlockSize is the target uncompressed size in bytes of each table block.
	//
	// The default value is 4096.
	BlockSize int

	// BlockSizeThreshold finishes a block if the block size is larger than the
	// specified percentage of the target block size and adding the next entry
	// would cause the block to be larger than the target block size.
	//
	// The default value is 90.
	BlockSizeThreshold int

	// Cache is used to cache uncompressed blocks from sstables.
	//
	// The default is a nil cache.
	Cache *cache.Cache

	// Comparer defines a total ordering over the space of []byte keys: a 'less
	// than' relationship. The same comparison algorithm must be used for reads
	// and writes over the lifetime of the DB.
	//
	// The default value uses the same ordering as bytes.Compare.
	Comparer *Comparer

	// Compression defines the per-block compression to use.
	//
	// The default value (DefaultCompression) uses snappy compression.
	Compression Compression

	// FilterPolicy defines a filter algorithm (such as a Bloom filter) that can
	// reduce disk reads for Get calls.
	//
	// One such implementation is bloom.FilterPolicy(10) from the pebble/bloom
	// package.
	//
	// The default value means to use no filter.
	FilterPolicy FilterPolicy

	// FilterType defines whether an existing filter policy is applied at a
	// block-level or table-level. Block-level filters use less memory to create,
	// but are slower to access as a check for the key in the index must first be
	// performed to locate the filter block. A table-level filter will require
	// memory proportional to the number of keys in an sstable to create, but
	// avoids the index lookup when determining if a key is present. Table-level
	// filters should be preferred except under constrained memory situations.
	FilterType FilterType

	// IndexBlockSize is the target uncompressed size in bytes of each index
	// block. When the index block size is larger than this target, two-level
	// indexes are automatically enabled. Setting this option to a large value
	// (such as math.MaxInt32) disables the automatic creation of two-level
	// indexes.
	//
	// The default value is the value of BlockSize.
	IndexBlockSize int

	// Merger defines the associative merge operation to use for merging values
	// written with {Batch,DB}.Merge. The MergerName is checked for consistency
	// with the value stored in the sstable when it was written.
	MergerName string

	// TableFormat specifies the format version for writing sstables. The default
	// is TableFormatRocksDBv2, which creates RocksDB-compatible sstables. Use
	// TableFormatLevelDB to create LevelDB-compatible sstables, which can be used
	// by a wider range of tools and libraries.
	TableFormat TableFormat

	// TablePropertyCollectors is a list of TablePropertyCollector creation
	// functions. A new TablePropertyCollector is created for each sstable built
	// and lives for the lifetime of the table.
	TablePropertyCollectors []func() TablePropertyCollector

	// BlockPropertyCollectors is a list of BlockPropertyCollector creation
	// functions. A new BlockPropertyCollector is created for each sstable
	// built and lives for the lifetime of writing that table.
	BlockPropertyCollectors []func() BlockPropertyCollector

	// Checksum specifies which checksum to use.
	Checksum ChecksumType

	// Parallelism is used to indicate that the sstable Writer is allowed to
	// compress data blocks and write datablocks to disk in parallel with the
	// Writer client goroutine.
	Parallelism bool
}

WriterOptions holds the parameters used to control building an sstable.
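The BlockSizeThreshold rule described in the struct comments can be sketched as a small decision function. This follows the comment literally and is not the writer's actual implementation, which may also account for encoding overhead and other conditions; `shouldFinishBlock` and its parameters are hypothetical names chosen for the example.

```go
package main

import "fmt"

// shouldFinishBlock applies the documented rule: finish the block if its
// current size exceeds thresholdPct percent of targetSize and adding the
// next entry would push it past targetSize.
func shouldFinishBlock(curSize, nextEntrySize, targetSize, thresholdPct int) bool {
	thresholdSize := targetSize * thresholdPct / 100
	if curSize <= thresholdSize {
		return false
	}
	return curSize+nextEntrySize > targetSize
}

func main() {
	const target, pct = 4096, 90 // the documented defaults
	fmt.Println(shouldFinishBlock(3800, 400, target, pct))
	fmt.Println(shouldFinishBlock(3000, 2000, target, pct))
	fmt.Println(shouldFinishBlock(3800, 100, target, pct))
}
```

With the defaults, the threshold is 3686 bytes, so a block below that size keeps accepting entries even when the next one would overshoot the 4096-byte target.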
