engine

package
v0.0.0-...-024101a
Warning

This package is not in the latest version of its module.

Published: Mar 25, 2015 License: Apache-2.0 Imports: 19 Imported by: 0

Documentation

Overview

Package engine provides low-level storage. It interacts with storage backends (e.g. LevelDB, RocksDB, etc.) via the Engine interface. At one level higher, MVCC provides multi-version concurrency control capability on top of an Engine instance.

The Engine interface provides an API for key-value stores. InMem implements an in-memory engine using a sorted map. RocksDB implements an engine for data stored to local disk using RocksDB, a variant of LevelDB.

MVCC provides a multi-version concurrency control system on top of an engine. MVCC is the basis for Cockroach's support for distributed transactions. It is intended for direct use from storage.Range objects.

Notes on MVCC architecture

Each MVCC value contains a metadata key/value pair and one or more version key/value pairs. The MVCC metadata key is the actual key for the value, binary encoded using the SQL binary encoding scheme which contains a sentinel byte of 0x25, followed by a 7-bit encoding of the key data with 1s in the high bit and terminated by a nil byte. The MVCC metadata value is of type proto.MVCCMetadata and contains the most recent version timestamp and an optional proto.Transaction message. If set, the most recent version of the MVCC value is a transactional "intent". It also contains some information on the size of the most recent version's key and value for efficient stat counter computations.

Each MVCC version key/value pair has a key which is also binary-encoded, but is suffixed with a decreasing, big-endian encoding of the timestamp (8 bytes for the nanosecond wall time, followed by 4 bytes for the logical time). The MVCC version value is a message of type proto.MVCCValue which indicates whether the version is a deletion timestamp and if not, contains a proto.Value object which holds the actual value. The decreasing encoding on the timestamp sorts the most recent version directly after the metadata key. This increases the likelihood that an Engine.Get() of the MVCC metadata will get the same block containing the most recent version, even if there are many versions. We rely on getting the MVCC metadata key/value and then using it to directly get the MVCC version using the metadata's most recent version timestamp. This avoids using an expensive merge iterator to scan the most recent version. It also allows us to leverage RocksDB's bloom filters.

The 7-bit binary encoding used on the MVCC keys allows arbitrary keys to be stored in the map (no restrictions on intermediate nil-bytes, for example), while still sorting lexicographically and guaranteeing that all timestamp-suffixed MVCC version keys sort consecutively with the metadata key. The 7-bit binary encoding is admittedly distasteful, and we'd like to replace it with something that preserves at least 7-bit ASCII visibility but has the same sort properties. We considered using RocksDB's custom key comparator functionality, but the attendant risks seemed too great. What risks? Mostly that RocksDB is unlikely to have tested custom key comparators with their more advanced (and ever-growing) functionality. Further, bugs in our code (both C++ and Go) related to the custom comparator seemed more likely to be painful than just dealing with the 7-bit binary encoding.

We considered inlining the most recent MVCC version in the MVCCMetadata. This would reduce the storage overhead of storing the same key twice (which is small due to block compression), and the runtime overhead of two separate DB lookups. On the other hand, all writes that create a new version of an existing key would incur a double write as the previous value is moved out of the MVCCMetadata into its versioned key. Preliminary benchmarks have not shown enough performance improvement to justify this change, although we may revisit this decision if it turns out that multiple versions of the same key are rare in practice.

However, we do allow inlining in order to use the MVCC interface to store non-versioned values. It turns out that not everything which Cockroach needs to store would be efficient or possible using MVCC. Examples include transaction records, response cache entries, stats counters, time series data, and system-local config values. However, supporting a mix of encodings is problematic in terms of resulting complexity. So Cockroach treats an MVCC timestamp of zero to mean an inlined, non-versioned value. These values are replaced if they exist on a Put operation and are cleared from the engine on a delete. Importantly, zero-timestamped MVCC values may be merged, as is necessary for stats counters and time series data.

Index

Constants

This section is empty.

Variables

var (
	// StatLiveBytes counts how many bytes are "live", including bytes
	// from both keys and values. Live rows include only non-deleted
	// keys and only the most recent value.
	StatLiveBytes = proto.Key("live-bytes")
	// StatKeyBytes counts how many bytes are used to store all keys,
	// including bytes from deleted keys. Key bytes are re-counted for
	// each versioned value.
	StatKeyBytes = proto.Key("key-bytes")
	// StatValBytes counts how many bytes are used to store all values,
	// including all historical versions and deleted tombstones.
	StatValBytes = proto.Key("val-bytes")
	// StatIntentBytes counts how many bytes are used to store values
	// which are unresolved intents. Includes bytes used for both intent
	// keys and values.
	StatIntentBytes = proto.Key("intent-bytes")
	// StatLiveCount counts how many keys are "live". This includes only
	// non-deleted keys.
	StatLiveCount = proto.Key("live-count")
	// StatKeyCount counts the total number of keys, including both live
	// and deleted keys.
	StatKeyCount = proto.Key("key-count")
	// StatValCount counts the total number of values, including all
	// historical versions and deleted tombstones.
	StatValCount = proto.Key("val-count")
	// StatIntentCount counts the number of unresolved intents.
	StatIntentCount = proto.Key("intent-count")
	// StatIntentAge counts the total age of unresolved intents.
	StatIntentAge = proto.Key("intent-age")
	// StatGCBytesAge counts the total age of gc'able bytes.
	StatGCBytesAge = proto.Key("gc-age")
	// StatLastUpdateNanos counts nanoseconds since the unix epoch for
	// the last update to the intent / GC'able bytes ages. This really
	// is tracking the wall time as at last update, but is a merged
	// stat, with successive counts of elapsed nanos being added at each
	// stat computation.
	StatLastUpdateNanos = proto.Key("update-nanos")
)

Constants for stat key construction.

var (
	// KeyMaxLength is the maximum key length in bytes. This value is
	// somewhat arbitrary. It is chosen high enough to allow most
	// conceivable use cases while also still being comfortably short of
	// a limit which would affect the performance of the system, both
	// from performance of key comparisons and from memory usage for
	// things like the timestamp cache, lookup cache, and command queue.
	KeyMaxLength = proto.KeyMaxLength

	// KeyMin is a minimum key value which sorts before all other keys.
	KeyMin = proto.KeyMin
	// KeyMax is a maximum key value which sorts after all other keys.
	KeyMax = proto.KeyMax

	// MVCCKeyMax is a maximum mvcc-encoded key value which sorts after
	// all other keys.
	MVCCKeyMax = MVCCEncodeKey(KeyMax)

	// KeyLocalPrefix is the prefix for keys which hold data local to a
	// RocksDB instance, such as store and range-specific metadata which
	// must not pollute the user key space, but must be collocated with
	// the store and/or ranges which they refer to. Storing this
	// information in the normal system keyspace would place the data on
	// an arbitrary set of stores, with no guarantee of collocation.
	// Local data includes store metadata, range metadata, response
	// cache values, transaction records, range-spanning binary tree
	// node pointers, and message queues.
	//
	// The local key prefix has been deliberately chosen to sort before
	// the KeySystemPrefix, because these local keys are not addressable
	// via the meta range addressing indexes.
	//
	// Some local data are not replicated, such as the store's 'ident'
	// record. Most local data are replicated, such as response cache
	// entries and transaction rows, but are not addressable as normal
	// MVCC values as part of transactions. Finally, some local data are
	// stored as MVCC values and are addressable as part of distributed
	// transactions, such as range metadata, range-spanning binary tree
	// node pointers, and message queues.
	KeyLocalPrefix = proto.Key("\x00\x00\x00")

	// KeyLocalSuffixLength specifies the length in bytes of all local
	// key suffixes.
	KeyLocalSuffixLength = 4

	// KeyLocalStorePrefix is the prefix identifying per-store data.
	KeyLocalStorePrefix = MakeKey(KeyLocalPrefix, proto.Key("s"))
	// KeyLocalStoreIdentSuffix stores an immutable identifier for this
	// store, created when the store is first bootstrapped.
	KeyLocalStoreIdentSuffix = proto.Key("iden")
	// KeyLocalStoreStatSuffix is the suffix for store statistics.
	KeyLocalStoreStatSuffix = proto.Key("sst-")

	// KeyLocalRangeIDPrefix is the prefix identifying per-range data
	// indexed by Raft ID. The Raft ID is appended to this prefix,
	// encoded using EncodeUvarint. The specific sort of per-range
	// metadata is identified by one of the suffixes listed below, along
	// with potentially additional encoded key info, such as a command
	// ID in the case of response cache entry.
	//
	// NOTE: KeyLocalRangeIDPrefix must be kept in sync with the value
	// in storage/engine/db.cc.
	KeyLocalRangeIDPrefix = MakeKey(KeyLocalPrefix, proto.Key("i"))
	// KeyLocalRaftLogSuffix is the suffix for the raft log.
	KeyLocalRaftLogSuffix = proto.Key("rftl")
	// KeyLocalRaftHardStateSuffix is the suffix for the raft HardState.
	KeyLocalRaftHardStateSuffix = proto.Key("rfth")
	// KeyLocalRaftTruncatedStateSuffix is the suffix for the RaftTruncatedState.
	KeyLocalRaftTruncatedStateSuffix = proto.Key("rftt")
	// KeyLocalRaftAppliedIndexSuffix is the suffix for the raft applied index.
	KeyLocalRaftAppliedIndexSuffix = proto.Key("rfta")
	// KeyLocalRangeGCMetadataSuffix is the suffix for a range's GC metadata.
	KeyLocalRangeGCMetadataSuffix = proto.Key("rgcm")
	// KeyLocalRangeLastVerificationTimestampSuffix is the suffix for a range's
	// last verification timestamp (for checking integrity of on-disk data).
	KeyLocalRangeLastVerificationTimestampSuffix = proto.Key("rlvt")
	// KeyLocalRangeStatSuffix is the suffix for range statistics.
	KeyLocalRangeStatSuffix = proto.Key("rst-")
	// KeyLocalResponseCacheSuffix is the suffix for keys storing
	// command responses used to guarantee idempotency (see
	// ResponseCache).
	// NOTE: if this value changes, it must be updated in C++
	// (storage/engine/db.cc).
	KeyLocalResponseCacheSuffix = proto.Key("res-")

	// KeyLocalRangeKeyPrefix is the prefix identifying per-range data
	// indexed by range key (either start key, or some key in the
	// range). The key is appended to this prefix, encoded using
	// EncodeBytes. The specific sort of per-range metadata is
	// identified by one of the suffixes listed below, along with
	// potentially additional encoded key info, such as the txn UUID in
	// the case of a transaction record.
	//
	// NOTE: KeyLocalRangeKeyPrefix must be kept in sync with the value
	// in storage/engine/db.cc.
	KeyLocalRangeKeyPrefix = MakeKey(KeyLocalPrefix, proto.Key("k"))
	// KeyLocalRangeDescriptorSuffix is the suffix for keys storing
	// range descriptors. The value is a struct of type RangeDescriptor.
	KeyLocalRangeDescriptorSuffix = proto.Key("rdsc")
	// KeyLocalRangeTreeNodeSuffix is the suffix for keys storing
	// range tree nodes. The value is a struct of type RangeTreeNode.
	KeyLocalRangeTreeNodeSuffix = proto.Key("rtn-")
	// KeyLocalTransactionSuffix specifies the key suffix for
	// transaction records. The additional detail is the transaction id.
	// NOTE: if this value changes, it must be updated in C++
	// (storage/engine/db.cc).
	KeyLocalTransactionSuffix = proto.Key("txn-")

	// KeyLocalMax is the end of the local key range.
	KeyLocalMax = KeyLocalPrefix.PrefixEnd()

	// KeySystemPrefix indicates the beginning of the key range for
	// global, system data which are replicated across the cluster.
	KeySystemPrefix = proto.Key("\x00")
	KeySystemMax    = proto.Key("\x01")

	// KeyMetaPrefix is the prefix for range metadata keys. Notice that
	// an extra null character in the prefix causes all range addressing
	// records to sort before any system tables which they might describe.
	KeyMetaPrefix = MakeKey(KeySystemPrefix, proto.Key("\x00meta"))
	// KeyMeta1Prefix is the first level of key addressing. The value is a
	// RangeDescriptor struct.
	KeyMeta1Prefix = MakeKey(KeyMetaPrefix, proto.Key("1"))
	// KeyMeta2Prefix is the second level of key addressing. The value is a
	// RangeDescriptor struct.
	KeyMeta2Prefix = MakeKey(KeyMetaPrefix, proto.Key("2"))

	// KeyMetaMax is the end of the range of addressing keys.
	KeyMetaMax = MakeKey(KeySystemPrefix, proto.Key("\x01"))

	// KeyConfigAccountingPrefix specifies the key prefix for accounting
	// configurations. The suffix is the affected key prefix.
	KeyConfigAccountingPrefix = MakeKey(KeySystemPrefix, proto.Key("acct"))
	// KeyConfigPermissionPrefix specifies the key prefix for permission
	// configurations. The suffix is the affected key prefix.
	KeyConfigPermissionPrefix = MakeKey(KeySystemPrefix, proto.Key("perm"))
	// KeyConfigZonePrefix specifies the key prefix for zone
	// configurations. The suffix is the affected key prefix.
	KeyConfigZonePrefix = MakeKey(KeySystemPrefix, proto.Key("zone"))
	// KeyNodeIDGenerator is the global node ID generator sequence.
	KeyNodeIDGenerator = MakeKey(KeySystemPrefix, proto.Key("node-idgen"))
	// KeyRaftIDGenerator is the global Raft consensus group ID generator sequence.
	KeyRaftIDGenerator = MakeKey(KeySystemPrefix, proto.Key("raft-idgen"))
	// KeySchemaPrefix specifies key prefixes for schema definitions.
	KeySchemaPrefix = MakeKey(KeySystemPrefix, proto.Key("schema"))
	// KeyStoreIDGeneratorPrefix specifies key prefixes for sequence
	// generators, one per node, for store IDs.
	KeyStoreIDGeneratorPrefix = MakeKey(KeySystemPrefix, proto.Key("store-idgen-"))
	// KeyRangeTreeRoot specifies the root range in the range tree.
	KeyRangeTreeRoot = MakeKey(KeySystemPrefix, proto.Key("range-tree-root"))
)

Constants for system-reserved keys in the KV map.

Functions

func ClearRange

func ClearRange(engine Engine, start, end proto.EncodedKey) (int, error)

ClearRange removes a set of entries, from start (inclusive) to end (exclusive). This function returns the number of entries removed. Either all entries within the range will be deleted, or none, and an error will be returned. Note that this function actually removes entries from the storage engine, rather than inserting tombstones, as with deletion through the MVCC.

func DecodeRaftStateKey

func DecodeRaftStateKey(key proto.Key) int64

DecodeRaftStateKey extracts the Raft ID from a RaftStateKey.

func DecodeRangeKey

func DecodeRangeKey(key proto.Key) (startKey, suffix, detail proto.Key)

DecodeRangeKey decodes the range key into range start key, suffix and optional detail (may be nil).

func Increment

func Increment(engine Engine, key proto.EncodedKey, inc int64) (int64, error)

Increment fetches the varint encoded int64 value specified by key and adds "inc" to it then re-encodes as varint. The newly incremented value is returned.
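A minimal sketch of the read-modify-write cycle, using the standard library's varint routines. The memStore type and the increment function are stand-ins invented for this example; treating a missing key as zero and skipping overflow checks are assumptions, not documented behavior.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// memStore is a hypothetical stand-in for an Engine.
type memStore map[string][]byte

// increment decodes the varint stored at key (a missing key counts as
// zero in this sketch), adds inc, re-encodes the result as a varint,
// and returns the new value.
func increment(s memStore, key string, inc int64) (int64, error) {
	var cur int64
	if raw, ok := s[key]; ok {
		v, n := binary.Varint(raw)
		if n <= 0 {
			return 0, errors.New("stored value is not a valid varint")
		}
		cur = v
	}
	cur += inc
	buf := make([]byte, binary.MaxVarintLen64)
	s[key] = buf[:binary.PutVarint(buf, cur)]
	return cur, nil
}

func main() {
	s := memStore{}
	v, _ := increment(s, "counter", 5)
	fmt.Println(v)
	v, _ = increment(s, "counter", -2)
	fmt.Println(v)
}
```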

func IsValidSplitKey

func IsValidSplitKey(key proto.Key) bool

IsValidSplitKey returns whether the key is a valid split key. Certain key ranges cannot be split; split keys chosen within any of these ranges are considered invalid.

  • \x00\x00meta1 < SplitKey < \x00\x00meta2
  • \x00acct < SplitKey < \x00accu
  • \x00perm < SplitKey < \x00pern
  • \x00zone < SplitKey < \x00zonf
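The check can be sketched as a scan over the excluded ranges listed above. The names keyRange, noSplit, and isValidSplitKey are hypothetical, and the bounds are treated as exclusive on both sides because the listing uses strict inequalities.

```go
package main

import (
	"bytes"
	"fmt"
)

// keyRange is a hypothetical pair of exclusive bounds.
type keyRange struct{ start, end []byte }

// noSplit mirrors the four ranges from the listing above.
var noSplit = []keyRange{
	{[]byte("\x00\x00meta1"), []byte("\x00\x00meta2")},
	{[]byte("\x00acct"), []byte("\x00accu")},
	{[]byte("\x00perm"), []byte("\x00pern")},
	{[]byte("\x00zone"), []byte("\x00zonf")},
}

// isValidSplitKey reports false when key lies strictly inside any of
// the excluded ranges.
func isValidSplitKey(key []byte) bool {
	for _, r := range noSplit {
		if bytes.Compare(r.start, key) < 0 && bytes.Compare(key, r.end) < 0 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isValidSplitKey([]byte("user-key")))
	fmt.Println(isValidSplitKey([]byte("\x00zone/db1")))
}
```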

func KeyAddress

func KeyAddress(k proto.Key) proto.Key

KeyAddress returns the address for the key, used to look up the range containing the key. In the normal case, this is simply the key's value. However, for local keys, such as transaction records, range-spanning binary tree node pointers, and message queues, the address is the trailing suffix of the key, with the local key prefix removed. In this way, local keys address to the same range as non-local keys, but are stored separately so that they don't collide with user-space or global system keys.

However, not all local keys are addressable in the global map. Only range local keys incorporating a range key (start key or transaction key) are addressable (e.g. range metadata and txn records). Range local keys incorporating the Raft ID are not (e.g. response cache entries, and range stats).

func MVCCComputeGCBytesAge

func MVCCComputeGCBytesAge(bytes, ageSeconds int64) int64

MVCCComputeGCBytesAge computes the value to assign to the specified number of bytes, at the given age (in seconds).

func MVCCConditionalPut

func MVCCConditionalPut(engine Engine, ms *MVCCStats, key proto.Key, timestamp proto.Timestamp, value proto.Value,
	expValue *proto.Value, txn *proto.Transaction) error

MVCCConditionalPut sets the value for a specified key only if the expected value matches. If not, it returns a ConditionFailedError containing the actual value.

func MVCCDecodeKey

func MVCCDecodeKey(encodedKey proto.EncodedKey) (proto.Key, proto.Timestamp, bool)

MVCCDecodeKey decodes encodedKey by binary decoding the leading bytes of encodedKey. If there are no remaining bytes, returns the decoded key, an empty timestamp, and false, to indicate the key is for an MVCC metadata or a raw value. Otherwise, there must be exactly 12 trailing bytes and they're decoded into a timestamp. The decoded key, timestamp and true are returned to indicate the key is for an MVCC versioned value.

func MVCCDelete

func MVCCDelete(engine Engine, ms *MVCCStats, key proto.Key, timestamp proto.Timestamp,
	txn *proto.Transaction) error

MVCCDelete marks the key deleted so that it will not be returned in future get responses.

func MVCCDeleteRange

func MVCCDeleteRange(engine Engine, ms *MVCCStats, key, endKey proto.Key, max int64, timestamp proto.Timestamp, txn *proto.Transaction) (int64, error)

MVCCDeleteRange deletes the range of key/value pairs specified by start and end keys. Specify max=0 for unbounded deletes.

func MVCCEncodeKey

func MVCCEncodeKey(key proto.Key) proto.EncodedKey

MVCCEncodeKey makes an MVCC key for storing MVCC metadata or for storing raw values directly. Use MVCCEncodeVersionKey to build keys for timestamped version values.

func MVCCEncodeVersionKey

func MVCCEncodeVersionKey(key proto.Key, timestamp proto.Timestamp) proto.EncodedKey

MVCCEncodeVersionKey makes an MVCC version key, which consists of a binary-encoding of key, followed by a decreasing encoding of the timestamp, so that more recent versions sort first.

func MVCCFindSplitKey

func MVCCFindSplitKey(engine Engine, raftID int64, key, endKey proto.Key) (proto.Key, error)

MVCCFindSplitKey suggests a split key from the given user-space key range that aims to roughly cut into half the total number of bytes used (in raw key and value byte strings) in both subranges. Specify a snapshot engine to safely invoke this method in a goroutine.

The split key will never be chosen from the key ranges listed in illegalSplitKeyRanges.

func MVCCGarbageCollect

func MVCCGarbageCollect(engine Engine, ms *MVCCStats, keys []proto.InternalGCRequest_GCKey, timestamp proto.Timestamp) error

MVCCGarbageCollect creates an iterator on the engine and, alongside it, walks the keys listed for garbage collection in the keys slice. The engine iterator is seeked in turn to each listed key, clearing all values with timestamps less than or equal to the expiration timestamp.

func MVCCGet

func MVCCGet(engine Engine, key proto.Key, timestamp proto.Timestamp,
	txn *proto.Transaction) (*proto.Value, error)

MVCCGet returns the value for the key specified in the request, while satisfying the given timestamp condition. The key may contain arbitrary bytes. If no value for the key exists, or it has been deleted, returns nil for value.

The values of multiple versions for the given key should be organized as follows:

  ...
  keyA : MVCCMetadata of keyA
  keyA_Timestamp_n : value of version_n
  keyA_Timestamp_n-1 : value of version_n-1
  ...
  keyA_Timestamp_0 : value of version_0
  keyB : MVCCMetadata of keyB
  ...
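Given that newest-first layout, a timestamped read amounts to finding the first version at or below the read timestamp. The sketch below uses a simplified integer timestamp and hypothetical names (version, get); it ignores intents and transactions entirely.

```go
package main

import "fmt"

// version is a hypothetical, simplified MVCC version record.
type version struct {
	ts      int64 // simplified timestamp
	deleted bool  // true if this version is a deletion tombstone
	value   string
}

// get returns the value of the newest version whose timestamp is at or
// below readTS. The versions slice must be ordered newest first, as in
// the layout above. A tombstone or a missing version yields ok=false.
func get(versions []version, readTS int64) (value string, ok bool) {
	for _, v := range versions {
		if v.ts <= readTS {
			if v.deleted {
				return "", false
			}
			return v.value, true
		}
	}
	return "", false
}

func main() {
	keyA := []version{
		{ts: 30, value: "c"},
		{ts: 20, deleted: true},
		{ts: 10, value: "a"},
	}
	for _, ts := range []int64{35, 25, 15, 5} {
		v, ok := get(keyA, ts)
		fmt.Println(ts, v, ok)
	}
}
```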

func MVCCGetProto

func MVCCGetProto(engine Engine, key proto.Key, timestamp proto.Timestamp, txn *proto.Transaction, msg gogoproto.Message) (bool, error)

MVCCGetProto fetches the value at the specified key and unmarshals it using a protobuf decoder. Returns true on success or false if the key was not found.

func MVCCGetRangeSize

func MVCCGetRangeSize(engine Engine, raftID int64) (int64, error)

MVCCGetRangeSize returns the size of the range, equal to the sum of the key and value stats.

func MVCCGetRangeStat

func MVCCGetRangeStat(engine Engine, raftID int64, stat proto.Key) (int64, error)

MVCCGetRangeStat returns the value for the specified range stat, by Raft ID and stat name.

func MVCCGetRangeStats

func MVCCGetRangeStats(engine Engine, raftID int64, ms *MVCCStats) error

MVCCGetRangeStats reads stat counters for the specified range and sets the values in the supplied MVCCStats struct.

func MVCCIncrement

func MVCCIncrement(engine Engine, ms *MVCCStats, key proto.Key, timestamp proto.Timestamp, txn *proto.Transaction, inc int64) (int64, error)

MVCCIncrement fetches the value for key, and assuming the value is an "integer" type, increments it by inc and stores the new value. The newly incremented value is returned.

func MVCCIterateCommitted

func MVCCIterateCommitted(engine Engine, key, endKey proto.Key, f func(proto.KeyValue) (bool, error)) error

MVCCIterateCommitted iterates over the key range specified by start and end keys, returning only the most recently committed version of each key/value pair. Intents are ignored. If a key has an intent but no earlier, committed version, nothing is returned. At each step of the iteration, f() is invoked with the current key/value pair. If f returns true (done) or an error, the iteration stops and the error is propagated.

func MVCCMerge

func MVCCMerge(engine Engine, ms *MVCCStats, key proto.Key, value proto.Value) error

MVCCMerge implements a merge operation. Merge adds integer values, concatenates undifferentiated byte slice values, and efficiently combines time series observations if the proto.Value tag value indicates the value byte slice is of type _CR_TS (the internal cockroach time series data tag).

func MVCCMergeRangeStat

func MVCCMergeRangeStat(engine Engine, raftID int64, stat proto.Key, statVal int64) error

MVCCMergeRangeStat flushes the specified stat to merge counters via the provided engine instance.

func MVCCPut

func MVCCPut(engine Engine, ms *MVCCStats, key proto.Key, timestamp proto.Timestamp,
	value proto.Value, txn *proto.Transaction) error

MVCCPut sets the value for a specified key. It will save the value with different versions according to its timestamp and update the key metadata. We assume the range will check for an existing write intent before executing any Put action at the MVCC level.

If the timestamp is specified as proto.ZeroTimestamp, the value is inlined instead of being written as a timestamp-versioned value. A zero timestamp write to a key precludes a subsequent write using a non-zero timestamp and vice versa. Inlined values require only a single row and never accumulate more than a single value. Successive zero timestamp writes to a key replace the value and deletes clear the value. In addition, zero timestamp values may be merged.
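The rule that inline and versioned writes cannot be mixed on one key can be sketched as follows. The entry and store types and the put function are hypothetical; the error returned is an assumption about how a violation might surface, and deletes and merges are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// entry is a hypothetical record for one key: either a single inline
// value (zero timestamp) or a set of timestamped versions.
type entry struct {
	inline   bool
	value    string           // used when inline
	versions map[int64]string // used when versioned
}

type store map[string]*entry

// put writes val at ts. A zero ts means an inline write, which replaces
// any previous inline value. Mixing inline and versioned writes on the
// same key is rejected in this sketch.
func put(s store, key string, ts int64, val string) error {
	e, ok := s[key]
	if !ok {
		e = &entry{inline: ts == 0, versions: map[int64]string{}}
		s[key] = e
	}
	if e.inline != (ts == 0) {
		return errors.New("cannot mix inline and versioned writes on one key")
	}
	if e.inline {
		e.value = val
	} else {
		e.versions[ts] = val
	}
	return nil
}

func main() {
	s := store{}
	fmt.Println(put(s, "stat", 0, "1"))  // inline write
	fmt.Println(put(s, "stat", 0, "2"))  // replaces the inline value
	fmt.Println(put(s, "stat", 10, "3")) // rejected: key is inline
}
```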

func MVCCPutProto

func MVCCPutProto(engine Engine, ms *MVCCStats, key proto.Key, timestamp proto.Timestamp, txn *proto.Transaction, msg gogoproto.Message) error

MVCCPutProto sets the given key to the protobuf-serialized byte string of msg and the provided timestamp.

func MVCCResolveWriteIntent

func MVCCResolveWriteIntent(engine Engine, ms *MVCCStats, key proto.Key, timestamp proto.Timestamp, txn *proto.Transaction) error

MVCCResolveWriteIntent either commits or aborts (rolls back) an extant write intent for a given txn according to commit parameter. ResolveWriteIntent will skip write intents of other txns.

Transaction epochs deserve a bit of explanation. The epoch for a transaction is incremented on transaction retry. Transaction retry is different from abort. Retries occur in SSI transactions when the commit timestamp is not equal to the proposed transaction timestamp. This might be because writes to different keys had to use higher timestamps than expected because of existing, committed value, or because reads pushed the transaction's commit timestamp forward. Retries also occur in the event that the txn tries to push another txn in order to write an intent but fails (i.e. it has lower priority).

Because successive retries of a transaction may end up writing to different keys, the epochs serve to classify which intents get committed in the event the transaction succeeds (all those with epoch matching the commit epoch), and which intents get aborted, even if the transaction succeeds.
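The epoch rule can be illustrated with a small sketch that partitions a transaction's intents by epoch. The intent type and resolveByEpoch are hypothetical names; the real resolution path operates on engine keys and MVCC metadata rather than an in-memory slice.

```go
package main

import "fmt"

// intent is a hypothetical record of a write intent and the transaction
// epoch in which it was written.
type intent struct {
	key   string
	epoch int32
}

// resolveByEpoch splits intents into those written in the commit epoch
// (to be committed) and those left over from earlier epochs (to be
// aborted even though the transaction itself succeeded).
func resolveByEpoch(intents []intent, commitEpoch int32) (committed, aborted []string) {
	for _, in := range intents {
		if in.epoch == commitEpoch {
			committed = append(committed, in.key)
		} else {
			aborted = append(aborted, in.key)
		}
	}
	return committed, aborted
}

func main() {
	intents := []intent{{"a", 0}, {"b", 1}, {"c", 1}}
	committed, aborted := resolveByEpoch(intents, 1)
	fmt.Println(committed, aborted)
}
```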

func MVCCResolveWriteIntentRange

func MVCCResolveWriteIntentRange(engine Engine, ms *MVCCStats, key, endKey proto.Key, max int64, timestamp proto.Timestamp, txn *proto.Transaction) (int64, error)

MVCCResolveWriteIntentRange commits or aborts (rolls back) the range of write intents specified by start and end keys for a given txn. ResolveWriteIntentRange will skip write intents of other txns. Specify max=0 for unbounded resolves.

func MVCCScan

func MVCCScan(engine Engine, key, endKey proto.Key, max int64, timestamp proto.Timestamp, txn *proto.Transaction) ([]proto.KeyValue, error)

MVCCScan scans the key range specified by start key through end key up to some maximum number of results. Specify max=0 for unbounded scans.

func MVCCSetRangeStat

func MVCCSetRangeStat(engine Engine, raftID int64, stat proto.Key, statVal int64) error

MVCCSetRangeStat sets the value for the specified range stat, by Raft ID and stat name.

func MakeKey

func MakeKey(keys ...proto.Key) proto.Key

MakeKey makes a new key which is the concatenation of the given inputs, in order.

func MakeRangeIDKey

func MakeRangeIDKey(raftID int64, suffix, detail proto.Key) proto.Key

MakeRangeIDKey creates a range-local key based on the range's Raft ID, metadata key suffix, and optional detail (e.g. the encoded command ID for a response cache entry, etc.).

func MakeRangeKey

func MakeRangeKey(key, suffix, detail proto.Key) proto.Key

MakeRangeKey creates a range-local key based on the range start key, metadata key suffix, and optional detail (e.g. the transaction UUID for a txn record, etc.).

func MakeStoreKey

func MakeStoreKey(suffix, detail proto.Key) proto.Key

MakeStoreKey creates a store-local key based on the metadata key suffix, and optional detail.

func PutProto

func PutProto(engine Engine, key proto.EncodedKey, msg gogoproto.Message) (keyBytes, valBytes int64, err error)

PutProto sets the given key to the protobuf-serialized byte string of msg. Returns the length in bytes of the key and the value.

func RaftAppliedIndexKey

func RaftAppliedIndexKey(raftID int64) proto.Key

RaftAppliedIndexKey returns a system-local key for a raft applied index.

func RaftHardStateKey

func RaftHardStateKey(raftID int64) proto.Key

RaftHardStateKey returns a system-local key for a Raft HardState.

func RaftLogKey

func RaftLogKey(raftID int64, logIndex uint64) proto.Key

RaftLogKey returns a system-local key for a Raft log entry.

func RaftLogPrefix

func RaftLogPrefix(raftID int64) proto.Key

RaftLogPrefix returns the system-local prefix shared by all entries in a Raft log.

func RaftTruncatedStateKey

func RaftTruncatedStateKey(raftID int64) proto.Key

RaftTruncatedStateKey returns a system-local key for a RaftTruncatedState.

func RangeDescriptorKey

func RangeDescriptorKey(key proto.Key) proto.Key

RangeDescriptorKey returns a range-local key for the descriptor for the range with specified key.

func RangeGCMetadataKey

func RangeGCMetadataKey(raftID int64) proto.Key

RangeGCMetadataKey returns a range-local key for range garbage collection metadata.

func RangeLastVerificationTimestampKey

func RangeLastVerificationTimestampKey(raftID int64) proto.Key

RangeLastVerificationTimestampKey returns a range-local key for the range's last verification timestamp.

func RangeMetaKey

func RangeMetaKey(key proto.Key) proto.Key

RangeMetaKey returns a range metadata (meta1, meta2) indexing key for the given key. For ordinary keys this returns a level 2 metadata key - for level 2 keys, it returns a level 1 key. For level 1 keys and local keys, KeyMin is returned.
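The mapping described above can be sketched directly from the prefix constants. This is an illustration that follows the prose, not the package's code: rangeMetaKey and concat are hypothetical, and representing KeyMin as the empty key is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	keyMin      = []byte{} // assumption: KeyMin is the empty key
	localPrefix = []byte("\x00\x00\x00")
	meta1Prefix = []byte("\x00\x00meta1")
	meta2Prefix = []byte("\x00\x00meta2")
)

// concat returns a fresh slice holding a followed by b.
func concat(a, b []byte) []byte {
	return append(append([]byte{}, a...), b...)
}

// rangeMetaKey maps an ordinary key to its meta2 key, a meta2 key to
// the corresponding meta1 key, and meta1 or local keys to keyMin.
func rangeMetaKey(key []byte) []byte {
	switch {
	case bytes.HasPrefix(key, localPrefix), bytes.HasPrefix(key, meta1Prefix):
		return keyMin
	case bytes.HasPrefix(key, meta2Prefix):
		return concat(meta1Prefix, key[len(meta2Prefix):])
	default:
		return concat(meta2Prefix, key)
	}
}

func main() {
	m2 := rangeMetaKey([]byte("foo"))
	m1 := rangeMetaKey(m2)
	fmt.Printf("%q %q %q\n", m2, m1, rangeMetaKey(m1))
}
```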

func RangeMetaLookupKey

func RangeMetaLookupKey(r *proto.RangeDescriptor) proto.Key

RangeMetaLookupKey returns the metadata key at which this range descriptor should be stored as a value.

func RangeStatKey

func RangeStatKey(raftID int64, stat proto.Key) proto.Key

RangeStatKey returns the key for accessing the named stat for the specified Raft ID.

func RangeTreeNodeKey

func RangeTreeNodeKey(raftID int64) proto.Key

RangeTreeNodeKey returns a range-local key for the range's node in the range tree.

func ResponseCacheKey

func ResponseCacheKey(raftID int64, cmdID *proto.ClientCmdID) proto.Key

ResponseCacheKey returns a range-local key by Raft ID for a response cache entry, with detail specified by encoding the supplied client command ID.

func Scan

func Scan(engine Engine, start, end proto.EncodedKey, max int64) ([]proto.RawKeyValue, error)

Scan returns up to max key/value objects starting from start (inclusive) and ending at end (non-inclusive). Specify max=0 for unbounded scans.
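The bounds and the max=0 convention can be sketched over a sorted in-memory slice. The kv type and scan function are stand-ins for this example; the real function reads from an Engine.

```go
package main

import (
	"fmt"
	"sort"
)

// kv is a hypothetical key/value pair.
type kv struct{ key, value string }

// scan returns pairs with start <= key < end from a slice sorted by
// key. A max of 0 means no limit on the number of results.
func scan(sorted []kv, start, end string, max int64) []kv {
	var out []kv
	// Binary search for the first key at or after start.
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i].key >= start })
	for ; i < len(sorted) && sorted[i].key < end; i++ {
		if max > 0 && int64(len(out)) >= max {
			break
		}
		out = append(out, sorted[i])
	}
	return out
}

func main() {
	data := []kv{{"a", "1"}, {"b", "2"}, {"c", "3"}, {"d", "4"}}
	fmt.Println(scan(data, "b", "d", 0)) // unbounded: b and c
	fmt.Println(scan(data, "a", "z", 2)) // limited to two results
}
```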

func StoreIdentKey

func StoreIdentKey() proto.Key

StoreIdentKey returns a store-local key for the store metadata.

func StoreStatKey

func StoreStatKey(storeID int32, stat proto.Key) proto.Key

StoreStatKey returns the key for accessing the named stat for the specified store ID.

func TransactionKey

func TransactionKey(key proto.Key, id []byte) proto.Key

TransactionKey returns a transaction key based on the provided transaction key and ID. The base key is encoded in order to guarantee that all transaction records for a range sort together.

func ValidateRangeMetaKey

func ValidateRangeMetaKey(key proto.Key) error

ValidateRangeMetaKey validates that the given key is a valid Range Metadata key. It must have an appropriate metadata range prefix, and the original key value must be less than KeyMax. As a special case, KeyMin is considered a valid Range Metadata Key.

Types

type Batch

type Batch struct {
	// contains filtered or unexported fields
}

Batch wraps an instance of Engine and provides a limited subset of Engine functionality. Mutations are added to a write batch transparently and only applied to the wrapped engine on invocation of Commit(). Reads are passed through to the wrapped engine. In the event that reads access keys for which there are already-batched updates, reads from the wrapped engine are combined on the fly with pending write, delete, and merge updates.

This struct is not thread safe.

func NewBatch

func NewBatch(engine Engine) *Batch

NewBatch returns a new instance of Batch which wraps engine.

func (*Batch) ApproximateSize

func (b *Batch) ApproximateSize(start, end proto.EncodedKey) (uint64, error)

ApproximateSize returns an error if called on a Batch.

func (*Batch) Attrs

func (b *Batch) Attrs() proto.Attributes

Attrs is a noop for Batch.

func (*Batch) Capacity

func (b *Batch) Capacity() (StoreCapacity, error)

Capacity returns an error if called on a Batch.

func (*Batch) Clear

func (b *Batch) Clear(key proto.EncodedKey) error

Clear stores the key as a BatchDelete in the updates tree.

func (*Batch) Commit

func (b *Batch) Commit() error

Commit writes all pending updates to the underlying engine in an atomic write batch.

func (*Batch) Flush

func (b *Batch) Flush() error

Flush returns an error if called on a Batch.

func (*Batch) Get

func (b *Batch) Get(key proto.EncodedKey) ([]byte, error)

Get reads first from the updates tree. If the key is found there and is deleted, then a nil value is returned. If the key is found, and is a Put, returns the value from the tree. If a merge, then merge is performed on the fly to combine with the value from the underlying engine. Otherwise, the Get is simply passed through to the wrapped engine.

func (*Batch) GetProto

func (b *Batch) GetProto(key proto.EncodedKey, msg gogoproto.Message) (
	ok bool, keyBytes, valBytes int64, err error)

GetProto fetches the value at the specified key and unmarshals it.

func (*Batch) Iterate

func (b *Batch) Iterate(start, end proto.EncodedKey, f func(proto.RawKeyValue) (bool, error)) error

Iterate invokes f on key/value pairs merged from the underlying engine and pending batch updates. If f returns done or an error, the iteration ends and propagates the error.

func (*Batch) Merge

func (b *Batch) Merge(key proto.EncodedKey, value []byte) error

Merge stores the key / value as a BatchMerge in the updates tree. If the updates tree already contains a BatchPut, then this value is merged with the Put and kept as a BatchPut. If the updates tree already contains a BatchMerge, then this value is merged with the existing BatchMerge and kept as a BatchMerge. If the updates tree contains a BatchDelete, then this value is merged with a nil byte slice and stored as a BatchPut.
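The cases above amount to a small state transition on the pending entry for a key. The sketch below models them with hypothetical types (kind, update) and uses plain byte concatenation as a stand-in for the engine's real merge operator, which is an assumption made only to keep the example runnable.

```go
package main

import "fmt"

type kind int

const (
	none  kind = iota // no pending entry for the key
	put               // pending BatchPut
	merge             // pending BatchMerge
	del               // pending BatchDelete
)

// update is a hypothetical pending entry in the batch's updates tree.
type update struct {
	kind  kind
	value []byte
}

// concat stands in for the merge operator: it appends v to a copy of
// existing. The real operator is type-aware; this is only a placeholder.
func concat(existing, v []byte) []byte {
	return append(append([]byte{}, existing...), v...)
}

// applyMerge returns the pending entry that results from merging v into
// the current pending entry for a key.
func applyMerge(cur update, v []byte) update {
	switch cur.kind {
	case put:
		return update{put, concat(cur.value, v)} // Put + Merge stays a Put
	case merge:
		return update{merge, concat(cur.value, v)} // Merge + Merge stays a Merge
	case del:
		return update{put, concat(nil, v)} // Delete + Merge becomes a Put
	default:
		return update{merge, concat(nil, v)} // first update for the key
	}
}

func main() {
	u := applyMerge(update{kind: put, value: []byte("a")}, []byte("b"))
	fmt.Println(u.kind == put, string(u.value))

	u = applyMerge(update{kind: del}, []byte("x"))
	fmt.Println(u.kind == put, string(u.value))
}
```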

func (*Batch) NewBatch

func (b *Batch) NewBatch() Engine

NewBatch returns a new Batch instance wrapping the same underlying engine.

func (*Batch) NewIterator

func (b *Batch) NewIterator() Iterator

NewIterator returns an iterator over Batch. Batch iterators are not thread safe.

func (*Batch) NewSnapshot

func (b *Batch) NewSnapshot() Engine

NewSnapshot returns nil if called on a Batch.

func (*Batch) Put

func (b *Batch) Put(key proto.EncodedKey, value []byte) error

Put stores the key / value as a BatchPut in the updates tree.

func (*Batch) Scan

func (b *Batch) Scan(start, end proto.EncodedKey, max int64) ([]proto.RawKeyValue, error)

Scan scans from both the updates tree and the underlying engine and combines the results, up to max.

func (*Batch) SetGCTimeouts

func (b *Batch) SetGCTimeouts(minTxnTS, minRCacheTS int64)

SetGCTimeouts is a noop for Batch.

func (*Batch) Start

func (b *Batch) Start() error

Start returns an error if called on a Batch.

func (*Batch) Stop

func (b *Batch) Stop()

Stop is a noop for Batch.

func (*Batch) WriteBatch

func (b *Batch) WriteBatch([]interface{}) error

WriteBatch returns an error if called on a Batch.

type BatchDelete

type BatchDelete struct {
	proto.RawKeyValue
}

A BatchDelete is a delete operation executed as part of an atomic batch.

type BatchMerge

type BatchMerge struct {
	proto.RawKeyValue
}

A BatchMerge is a merge operation executed as part of an atomic batch.

type BatchPut

type BatchPut struct {
	proto.RawKeyValue
}

A BatchPut is a put operation executed as part of an atomic batch.

type Engine

type Engine interface {
	// Start initializes and starts the engine.
	Start() error
	// Stop closes the engine, freeing up any outstanding resources.
	Stop()
	// Attrs returns the engine/store attributes.
	Attrs() proto.Attributes
	// Put sets the given key to the value provided.
	Put(key proto.EncodedKey, value []byte) error
	// Get returns the value for the given key, nil otherwise.
	Get(key proto.EncodedKey) ([]byte, error)
	// GetProto fetches the value at the specified key and unmarshals it
	// using a protobuf decoder. Returns true on success or false if the
	// key was not found. On success, returns the length in bytes of the
	// key and the value.
	GetProto(key proto.EncodedKey, msg gogoproto.Message) (ok bool, keyBytes, valBytes int64, err error)
	// Iterate scans from start to end keys, invoking the function f on
	// each key/value pair. If f returns an error or if the scan itself
	// encounters an error, the iteration will stop and return the error.
	// If the first result of f is true, the iteration stops.
	Iterate(start, end proto.EncodedKey, f func(proto.RawKeyValue) (bool, error)) error
	// Clear removes the item from the db with the given key.
	// Note that clear actually removes entries from the storage
	// engine, rather than inserting tombstones.
	Clear(key proto.EncodedKey) error
	// WriteBatch atomically applies the specified writes, deletions and
	// merges. The list passed to WriteBatch must only contain elements
	// of type Batch{Put,Merge,Delete}.
	WriteBatch([]interface{}) error
	// Merge is a high-performance write operation used for values which are
	// accumulated over several writes. Multiple values can be merged
	// sequentially into a single key; a subsequent read will return a "merged"
	// value which is computed from the original merged values.
	//
	// Merge currently provides specialized behavior for three data types:
	// integers, byte slices, and time series observations. Merged integers are
	// summed, acting as a high-performance accumulator.  Byte slices are simply
	// concatenated in the order they are merged. Time series observations
	// (stored as byte slices with a special tag on the proto.Value) are
	// combined with specialized logic beyond that of simple byte slices.
	//
	// The logic for merges is written in db.cc in order to be compatible with RocksDB.
	Merge(key proto.EncodedKey, value []byte) error
	// Capacity returns capacity details for the engine's available storage.
	Capacity() (StoreCapacity, error)
	// SetGCTimeouts sets a function which yields timeout values for GC
	// compaction of transaction and response cache entries. The return
	// values are in unix nanoseconds for the minimum transaction row
	// timestamp and the minimum response cache row timestamp respectively.
	// Rows with timestamps less than the associated value will be GC'd
	// during compaction.
	SetGCTimeouts(minTxnTS, minRCacheTS int64)
	// ApproximateSize returns the approximate number of bytes the engine is
	// using to store data for the given range of keys.
	ApproximateSize(start, end proto.EncodedKey) (uint64, error)
	// Flush causes the engine to write all in-memory data to disk
	// immediately.
	Flush() error
	// NewIterator returns a new instance of an Iterator over this
	// engine. The caller must invoke Iterator.Close() when finished with
	// the iterator to free resources.
	NewIterator() Iterator
	// NewSnapshot returns a new instance of a read-only snapshot
	// engine. Snapshots are instantaneous and, as long as they're
	// released relatively quickly, inexpensive. Snapshots are released
	// by invoking Stop(). Note that snapshots must not be used after the
	// original engine has been stopped.
	NewSnapshot() Engine
	// NewBatch returns a new instance of a batched engine which wraps
	// this engine. Batched engines accumulate all mutations and apply
	// them atomically on a call to Commit().
	NewBatch() Engine
	// Commit atomically applies any batched updates to the underlying
	// engine. This is a noop unless the engine was created via NewBatch().
	Commit() error
}

Engine is the interface that wraps the core operations of a key/value store.
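To make the Merge comment in the interface concrete, the following sketch models two of the documented behaviors: integers are summed and byte slices are concatenated in merge order. The functions mergeInts and mergeBytes are hypothetical and operate on plain Go values; the real operator works on encoded values and lives in db.cc, and time series merging is not modeled here.

```go
package main

import (
	"bytes"
	"fmt"
)

// mergeInts models integer merges: the operands are summed, so the key
// behaves like an accumulator.
func mergeInts(operands ...int64) int64 {
	var sum int64
	for _, v := range operands {
		sum += v
	}
	return sum
}

// mergeBytes models byte-slice merges: the operands are concatenated in
// the order they were merged.
func mergeBytes(operands ...[]byte) []byte {
	return bytes.Join(operands, nil)
}

func main() {
	fmt.Println(mergeInts(1, 2, 3))
	fmt.Println(string(mergeBytes([]byte("ab"), []byte("cd"))))
}
```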

type GarbageCollector

type GarbageCollector struct {
	// contains filtered or unexported fields
}

GarbageCollector GCs MVCC key/values using a zone-specific GC policy, which allows either the union or intersection of maximum # of versions and maximum age.

func NewGarbageCollector

func NewGarbageCollector(now proto.Timestamp, policy proto.GCPolicy) *GarbageCollector

NewGarbageCollector allocates and returns a new GC, with expiration computed based on current time and policy.TTLSeconds.

func (*GarbageCollector) Filter

func (gc *GarbageCollector) Filter(keys []proto.EncodedKey, values [][]byte) proto.Timestamp

Filter makes decisions about garbage collection based on the garbage collection policy for batches of values for the same key. Returns the timestamp including, and after which, all values should be garbage collected. If no values should be GC'd, returns proto.ZeroTimestamp.

type InMem

type InMem struct {
	*RocksDB
}

InMem wraps RocksDB and configures it for in-memory only storage.

func NewInMem

func NewInMem(attrs proto.Attributes, cacheSize int64) *InMem

NewInMem allocates and returns a new InMem object.

type InvalidRangeMetaKeyError

type InvalidRangeMetaKeyError struct {
	Msg string
	Key proto.Key
}

InvalidRangeMetaKeyError indicates that a Range Metadata key is somehow invalid.

func NewInvalidRangeMetaKeyError

func NewInvalidRangeMetaKeyError(msg string, k proto.Key) *InvalidRangeMetaKeyError

NewInvalidRangeMetaKeyError returns a new InvalidRangeMetaKeyError.

func (*InvalidRangeMetaKeyError) Error

func (i *InvalidRangeMetaKeyError) Error() string

Error formats error string.

type Iterator

type Iterator interface {
	// Close frees up resources held by the iterator.
	Close()
	// Seek advances the iterator to the first key in the engine which
	// is >= the provided key.
	Seek(key []byte)
	// Valid returns true if the iterator is currently valid. An
	// iterator which hasn't been seeked or has gone past the end of the
	// key range is invalid.
	Valid() bool
	// Next advances the iterator to the next key/value in the
	// iteration. After this call, Valid() will be true if the
	// iterator was not positioned at the last key.
	Next()
	// Key returns the current key as a byte slice.
	Key() proto.EncodedKey
	// Value returns the current value as a byte slice.
	Value() []byte
	// ValueProto unmarshals the value the iterator is currently
	// pointing to using a protobuf decoder.
	ValueProto(msg gogoproto.Message) error
	// Error returns the error, if any, which the iterator encountered.
	Error() error
}

Iterator is an interface for iterating over key/value pairs in an engine. Iterator implementations are thread safe unless otherwise noted.
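The usual way to drive such an iterator is a Seek/Valid/Next loop followed by Close. The sketch below uses a hypothetical slice-backed sliceIter with string keys to show that pattern; it implements only the methods the loop needs and is not one of the package's iterators.

```go
package main

import (
	"fmt"
	"sort"
)

// sliceIter is a hypothetical iterator over a sorted slice of keys.
type sliceIter struct {
	keys []string
	pos  int // index of the current key; -1 until Seek is called
}

func newSliceIter(keys []string) *sliceIter { return &sliceIter{keys: keys, pos: -1} }

// Seek positions the iterator at the first key >= key.
func (it *sliceIter) Seek(key string) { it.pos = sort.SearchStrings(it.keys, key) }

// Valid reports whether the iterator currently points at a key.
func (it *sliceIter) Valid() bool { return it.pos >= 0 && it.pos < len(it.keys) }

// Next advances to the following key.
func (it *sliceIter) Next() { it.pos++ }

// Key returns the current key; it must only be called while Valid is true.
func (it *sliceIter) Key() string { return it.keys[it.pos] }

// Close would release resources in a real iterator; here it does nothing.
func (it *sliceIter) Close() {}

// collect gathers the keys in [start, end) using the Seek/Valid/Next loop.
func collect(it *sliceIter, start, end string) []string {
	var out []string
	for it.Seek(start); it.Valid() && it.Key() < end; it.Next() {
		out = append(out, it.Key())
	}
	return out
}

func main() {
	it := newSliceIter([]string{"a", "c", "e", "g"})
	defer it.Close()
	fmt.Println(collect(it, "b", "g"))
}
```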

type MVCCStats

type MVCCStats struct {
	LiveBytes, KeyBytes, ValBytes, IntentBytes int64
	LiveCount, KeyCount, ValCount, IntentCount int64
	IntentAge, GCBytesAge, LastUpdateNanos     int64
}

MVCCStats tracks byte and instance counts for:

  • Live key/values (i.e. what a scan at current time will reveal; note that this includes intent keys and values, but not keys and values with most recent value deleted)
  • Key bytes (includes all keys, even those with most recent value deleted)
  • Value bytes (includes all versions)
  • Key count (count of all keys, including keys with deleted tombstones)
  • Value count (all versions, including deleted tombstones)
  • Intents (provisional values written during txns)

func MVCCComputeStats

func MVCCComputeStats(engine Engine, key, endKey proto.Key, nowNanos int64) (MVCCStats, error)

MVCCComputeStats scans the underlying engine from start to end keys and computes stats counters based on the values. This method is used after a range is split to recompute stats for each subrange. The start key is always adjusted to avoid counting local keys in the event stats are being recomputed for the first range (i.e. the one with start key == KeyMin). The nowNanos arg specifies the wall time in nanoseconds since the epoch and is used to compute the total age of all intents.

func (*MVCCStats) Accumulate

func (ms *MVCCStats) Accumulate(oms MVCCStats)

Accumulate adds values from oms to ms.

func (*MVCCStats) MergeStats

func (ms *MVCCStats) MergeStats(engine Engine, raftID int64)

MergeStats merges accumulated stats to stat counters for specified range.

func (*MVCCStats) SetStats

func (ms *MVCCStats) SetStats(engine Engine, raftID int64)

SetStats sets stat counters for specified range.

type RocksDB

type RocksDB struct {
	// contains filtered or unexported fields
}

RocksDB is a wrapper around a RocksDB database instance.

func NewRocksDB

func NewRocksDB(attrs proto.Attributes, dir string, cacheSize int64) *RocksDB

NewRocksDB allocates and returns a new RocksDB object.

func (*RocksDB) ApproximateSize

func (r *RocksDB) ApproximateSize(start, end proto.EncodedKey) (uint64, error)

ApproximateSize returns the approximate number of bytes on disk that RocksDB is using to store data for the given range of keys.

func (*RocksDB) Attrs

func (r *RocksDB) Attrs() proto.Attributes

Attrs returns the list of attributes describing this engine. This may include a specification of disk type (e.g. hdd, ssd, fio, etc.) and potentially other labels to identify important attributes of the engine.

func (*RocksDB) Capacity

func (r *RocksDB) Capacity() (StoreCapacity, error)

Capacity queries the underlying file system for disk capacity information.

func (*RocksDB) Clear

func (r *RocksDB) Clear(key proto.EncodedKey) error

Clear removes the item from the db with the given key.

func (*RocksDB) Commit

func (r *RocksDB) Commit() error

Commit is a noop for RocksDB engine.

func (*RocksDB) CompactRange

func (r *RocksDB) CompactRange(start, end proto.EncodedKey)

CompactRange compacts the specified key range. Specifying nil for the start key starts the compaction from the start of the database. Similarly, specifying nil for the end key will compact through the last key. Note that the use of the word "Range" here does not refer to Cockroach ranges, just to a generalized key range.

func (*RocksDB) Destroy

func (r *RocksDB) Destroy() error

Destroy destroys the underlying filesystem data associated with the database.

func (*RocksDB) Flush

func (r *RocksDB) Flush() error

Flush causes RocksDB to write all in-memory data to disk immediately.

func (*RocksDB) Get

func (r *RocksDB) Get(key proto.EncodedKey) ([]byte, error)

Get returns the value for the given key.

func (*RocksDB) GetProto

func (r *RocksDB) GetProto(key proto.EncodedKey, msg gogoproto.Message) (
	ok bool, keyBytes, valBytes int64, err error)

GetProto fetches the value at the specified key and unmarshals it.

func (*RocksDB) Iterate

func (r *RocksDB) Iterate(start, end proto.EncodedKey, f func(proto.RawKeyValue) (bool, error)) error

Iterate iterates from start to end keys, invoking f on each key/value pair. See Engine.Iterate for details.

func (*RocksDB) Merge

func (r *RocksDB) Merge(key proto.EncodedKey, value []byte) error

Merge implements the RocksDB merge operator using the function goMergeInit to initialize missing values and goMerge to merge the old and the given value into a new value, which is then stored under key. Currently 64-bit counter logic is implemented. See the documentation of goMerge and goMergeInit for details.

The key and value byte slices may be reused safely. Merge takes a copy of them before returning.

func (*RocksDB) NewBatch

func (r *RocksDB) NewBatch() Engine

NewBatch returns a new Batch wrapping this rocksdb engine.

func (*RocksDB) NewIterator

func (r *RocksDB) NewIterator() Iterator

NewIterator returns an iterator over this rocksdb engine.

func (*RocksDB) NewSnapshot

func (r *RocksDB) NewSnapshot() Engine

NewSnapshot creates a snapshot handle from engine and returns a read-only rocksDBSnapshot engine.

func (*RocksDB) Put

func (r *RocksDB) Put(key proto.EncodedKey, value []byte) error

Put sets the given key to the value provided.

The key and value byte slices may be reused safely. Put takes a copy of them before returning.

func (*RocksDB) SetGCTimeouts

func (r *RocksDB) SetGCTimeouts(minTxnTS, minRCacheTS int64)

SetGCTimeouts calls through to the DBEngine's SetGCTimeouts method.

func (*RocksDB) Start

func (r *RocksDB) Start() error

Start creates options and opens the database. If the database doesn't yet exist at the specified directory, one is initialized from scratch. Subsequent calls to this method on an open DB are no-ops.

func (*RocksDB) Stop

func (r *RocksDB) Stop()

Stop closes the database by deallocating the underlying handle.

func (*RocksDB) String

func (r *RocksDB) String() string

String formatter.

func (*RocksDB) WriteBatch

func (r *RocksDB) WriteBatch(cmds []interface{}) error

WriteBatch applies the puts, merges and deletes atomically via the RocksDB write batch facility. The list must only contain elements of type Batch{Put,Merge,Delete}.

type StoreCapacity

type StoreCapacity struct {
	Capacity  int64
	Available int64
}

StoreCapacity contains capacity information for a storage device.

func (StoreCapacity) PercentAvail

func (sc StoreCapacity) PercentAvail() float64

PercentAvail computes the percentage of disk space that is available.
