Documentation ¶
Overview ¶
Package engine provides low-level storage. It interacts with storage backends (e.g. LevelDB, RocksDB, etc.) via the Engine interface. At one level higher, MVCC provides multi-version concurrency control capability on top of an Engine instance.
The Engine interface provides an API for key-value stores. InMem implements an in-memory engine using a sorted map. RocksDB implements an engine for data stored to local disk using RocksDB, a variant of LevelDB.
MVCC provides a multi-version concurrency control system on top of an engine. MVCC is the basis for Cockroach's support for distributed transactions. It is intended for direct use from storage.Range objects.
Notes on MVCC architecture ¶
Each MVCC value contains a metadata key/value pair and one or more version key/value pairs. The MVCC metadata key is the actual key for the value, using the util/encoding.EncodeBytes scheme. The MVCC metadata value is of type MVCCMetadata and contains the most recent version timestamp and an optional roachpb.Transaction message. If set, the most recent version of the MVCC value is a transactional "intent". It also contains some information on the size of the most recent version's key and value for efficient stat counter computations. Note that it is not necessary to explicitly store the MVCC metadata as its contents can be reconstructed from the most recent versioned value as long as an intent is not present. The implementation takes advantage of this and deletes the MVCC metadata when possible.
Each MVCC version key/value pair has a key which is also binary-encoded, but is suffixed with a decreasing, big-endian encoding of the timestamp (eight bytes for the nanosecond wall time, followed by four bytes for the logical time except for meta key value pairs, for which the timestamp is implicit). The MVCC version value is a message of type roachpb.Value. A deletion is indicated by an empty value. Note that an empty roachpb.Value will encode to a non-empty byte slice. The decreasing encoding on the timestamp sorts the most recent version directly after the metadata key, which is treated specially by the RocksDB comparator (by making the zero timestamp sort first). This increases the likelihood that an Engine.Get() of the MVCC metadata will get the same block containing the most recent version, even if there are many versions. We rely on getting the MVCC metadata key/value and then using it to directly get the MVCC version using the metadata's most recent version timestamp. This avoids using an expensive merge iterator to scan the most recent version. It also allows us to leverage RocksDB's bloom filters.
The binary encoding used on the MVCC keys allows arbitrary keys to be stored in the map (no restrictions on intermediate nil-bytes, for example), while still sorting lexicographically and guaranteeing that all timestamp-suffixed MVCC version keys sort consecutively with the metadata key. We use an escape-based encoding which transforms all nul ("\x00") characters in the key and is terminated with the sequence "\x00\x01", which is guaranteed to not occur elsewhere in the encoded value. See util/encoding/encoding.go for more details.
We considered inlining the most recent MVCC version in the MVCCMetadata. This would reduce the storage overhead of storing the same key twice (which is small due to block compression), and the runtime overhead of two separate DB lookups. On the other hand, all writes that create a new version of an existing key would incur a double write as the previous value is moved out of the MVCCMetadata into its versioned key. Preliminary benchmarks have not shown enough performance improvement to justify this change, although we may revisit this decision if it turns out that multiple versions of the same key are rare in practice.
However, we do allow inlining in order to use the MVCC interface to store non-versioned values. It turns out that not everything which Cockroach needs to store would be efficient or possible using MVCC. Examples include transaction records, response cache entries, stats counters, time series data, and system-local config values. However, supporting a mix of encodings is problematic in terms of resulting complexity. So Cockroach treats an MVCC timestamp of zero to mean an inlined, non-versioned value. These values are replaced if they exist on a Put operation and are cleared from the engine on a delete. Importantly, zero-timestamped MVCC values may be merged, as is necessary for stats counters and time series data.
Index ¶
- Constants
- Variables
- func AccountForSelf(ms *enginepb.MVCCStats, rangeID roachpb.RangeID) error
- func ComputeStatsGo(iter SimpleIterator, start, end MVCCKey, nowNanos int64) (enginepb.MVCCStats, error)
- func EncodeKey(key MVCCKey) []byte
- func IsIntentOf(meta enginepb.MVCCMetadata, txn *roachpb.Transaction) bool
- func IsValidSplitKey(key roachpb.Key) bool
- func MVCCBlindConditionalPut(ctx context.Context, engine Writer, ms *enginepb.MVCCStats, key roachpb.Key, ...) error
- func MVCCBlindInitPut(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCBlindPut(ctx context.Context, engine Writer, ms *enginepb.MVCCStats, key roachpb.Key, ...) error
- func MVCCConditionalPut(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCDelete(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCDeleteRange(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) ([]roachpb.Key, *roachpb.Span, int64, error)
- func MVCCFindSplitKey(ctx context.Context, engine Reader, rangeID roachpb.RangeID, ...) (roachpb.Key, error)
- func MVCCGarbageCollect(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCGet(ctx context.Context, engine Reader, key roachpb.Key, timestamp hlc.Timestamp, ...) (*roachpb.Value, []roachpb.Intent, error)
- func MVCCGetAsTxn(ctx context.Context, engine Reader, key roachpb.Key, timestamp hlc.Timestamp, ...) (*roachpb.Value, []roachpb.Intent, error)
- func MVCCGetProto(ctx context.Context, engine Reader, key roachpb.Key, timestamp hlc.Timestamp, ...) (bool, error)
- func MVCCGetRangeStats(ctx context.Context, engine Reader, rangeID roachpb.RangeID) (enginepb.MVCCStats, error)
- func MVCCIncrement(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) (int64, error)
- func MVCCInitPut(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCIterate(ctx context.Context, engine Reader, startKey, endKey roachpb.Key, ...) ([]roachpb.Intent, error)
- func MVCCMerge(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCPut(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCPutProto(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCResolveWriteIntent(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) error
- func MVCCResolveWriteIntentRange(ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, ...) (int64, error)
- func MVCCResolveWriteIntentRangeUsingIter(ctx context.Context, engine ReadWriter, iterAndBuf IterAndBuf, ...) (int64, error)
- func MVCCResolveWriteIntentUsingIter(ctx context.Context, engine ReadWriter, iterAndBuf IterAndBuf, ...) error
- func MVCCReverseScan(ctx context.Context, engine Reader, key, endKey roachpb.Key, max int64, ...) ([]roachpb.KeyValue, *roachpb.Span, []roachpb.Intent, error)
- func MVCCScan(ctx context.Context, engine Reader, key, endKey roachpb.Key, max int64, ...) ([]roachpb.KeyValue, *roachpb.Span, []roachpb.Intent, error)
- func MVCCSetRangeStats(ctx context.Context, engine ReadWriter, rangeID roachpb.RangeID, ...) error
- func MakeValue(meta enginepb.MVCCMetadata) roachpb.Value
- func MergeInternalTimeSeriesData(sources ...roachpb.InternalTimeSeriesData) (roachpb.InternalTimeSeriesData, error)
- func PutProto(engine Writer, key MVCCKey, msg proto.Message) (keyBytes, valBytes int64, err error)
- func RocksDBBatchCount(repr []byte) (int, error)
- func RunLDB(args []string)
- func SplitMVCCKey(mvccKey []byte) (key []byte, ts []byte, ok bool)
- type Batch
- type BatchType
- type Engine
- type GarbageCollector
- type InMem
- type IterAndBuf
- type Iterator
- type MVCCKey
- type MVCCKeyValue
- type ReadWriter
- type Reader
- type RocksDB
- func (r *RocksDB) ApplyBatchRepr(repr []byte, sync bool) error
- func (r *RocksDB) Attrs() roachpb.Attributes
- func (r *RocksDB) Capacity() (roachpb.StoreCapacity, error)
- func (r *RocksDB) Clear(key MVCCKey) error
- func (r *RocksDB) ClearIterRange(iter Iterator, start, end MVCCKey) error
- func (r *RocksDB) ClearRange(start, end MVCCKey) error
- func (r *RocksDB) Close()
- func (r *RocksDB) Closed() bool
- func (r *RocksDB) Compact() error
- func (r *RocksDB) Destroy() error
- func (r *RocksDB) Flush() error
- func (r *RocksDB) Get(key MVCCKey) ([]byte, error)
- func (r *RocksDB) GetAuxiliaryDir() string
- func (r *RocksDB) GetCompactionStats() string
- func (r *RocksDB) GetProto(key MVCCKey, msg proto.Message) (ok bool, keyBytes, valBytes int64, err error)
- func (r *RocksDB) GetSSTables() SSTableInfos
- func (r *RocksDB) GetStats() (*Stats, error)
- func (r *RocksDB) IngestExternalFile(ctx context.Context, path string, move bool) error
- func (r *RocksDB) Iterate(start, end MVCCKey, f func(MVCCKeyValue) (bool, error)) error
- func (r *RocksDB) Merge(key MVCCKey, value []byte) error
- func (r *RocksDB) NewBatch() Batch
- func (r *RocksDB) NewIterator(prefix bool) Iterator
- func (r *RocksDB) NewReadOnly() ReadWriter
- func (r *RocksDB) NewSnapshot() Reader
- func (r *RocksDB) NewTimeBoundIterator(start, end hlc.Timestamp) Iterator
- func (r *RocksDB) NewWriteOnlyBatch() Batch
- func (r *RocksDB) Put(key MVCCKey, value []byte) error
- func (r *RocksDB) String() string
- func (r *RocksDB) WriteFile(filename string, data []byte) error
- type RocksDBBatchBuilder
- func (b *RocksDBBatchBuilder) ApplyRepr(repr []byte) error
- func (b *RocksDBBatchBuilder) Clear(key MVCCKey)
- func (b *RocksDBBatchBuilder) Finish() []byte
- func (b *RocksDBBatchBuilder) Len() int
- func (b *RocksDBBatchBuilder) Merge(key MVCCKey, value []byte)
- func (b *RocksDBBatchBuilder) Put(key MVCCKey, value []byte)
- type RocksDBBatchReader
- type RocksDBCache
- type RocksDBConfig
- type RocksDBError
- type RocksDBMap
- func (r *RocksDBMap) Close(ctx context.Context)
- func (r *RocksDBMap) Get(k []byte) ([]byte, error)
- func (r *RocksDBMap) NewBatchWriter() SortedDiskMapBatchWriter
- func (r *RocksDBMap) NewBatchWriterCapacity(capacityBytes int) SortedDiskMapBatchWriter
- func (r *RocksDBMap) NewIterator() SortedDiskMapIterator
- func (r *RocksDBMap) Put(k []byte, v []byte) error
- type RocksDBMapBatchWriter
- type RocksDBMapIterator
- func (i *RocksDBMapIterator) Close()
- func (i *RocksDBMapIterator) Key() []byte
- func (i *RocksDBMapIterator) Next()
- func (i *RocksDBMapIterator) Rewind()
- func (i *RocksDBMapIterator) Seek(k []byte)
- func (i *RocksDBMapIterator) UnsafeKey() []byte
- func (i *RocksDBMapIterator) UnsafeValue() []byte
- func (i *RocksDBMapIterator) Valid() (bool, error)
- func (i *RocksDBMapIterator) Value() []byte
- type RocksDBSstFileReader
- type RocksDBSstFileWriter
- type SSTableInfo
- type SSTableInfos
- type SimpleIterator
- type SortedDiskMap
- type SortedDiskMapBatchWriter
- type SortedDiskMapIterator
- type Stats
- type Version
- type Writer
Constants ¶
const ( BatchTypeDeletion BatchType = 0x0 BatchTypeValue = 0x1 BatchTypeMerge = 0x2 )
These constants come from rocksdb/db/dbformat.h.
const ( // RecommendedMaxOpenFiles is the recommended value for RocksDB's // max_open_files option. RecommendedMaxOpenFiles = 10000 // MinimumMaxOpenFiles is the minimum value that RocksDB's max_open_files // option can be set to. While this should be set as high as possible, the // minimum total for a single store node must be under 2048 for Windows // compatibility. See: // https://wpdev.uservoice.com/forums/266908-command-prompt-console-bash-on-ubuntu-on-windo/suggestions/17310124-add-ability-to-change-max-number-of-open-files-for MinimumMaxOpenFiles = 1700 )
Variables ¶
var ( // MVCCKeyMax is a maximum mvcc-encoded key value which sorts after // all other keys.` MVCCKeyMax = MakeMVCCMetadataKey(roachpb.KeyMax) // NilKey is the nil MVCCKey. NilKey = MVCCKey{} )
Functions ¶
func AccountForSelf ¶
AccountForSelf adjusts ms to account for the predicted impact it will have on the values that it records when the structure is initially stored. Specifically, MVCCStats is stored on the RangeStats key, which means that its creation will have an impact on system-local data size and key count.
func ComputeStatsGo ¶ added in v1.1.0
func ComputeStatsGo( iter SimpleIterator, start, end MVCCKey, nowNanos int64, ) (enginepb.MVCCStats, error)
ComputeStatsGo scans the underlying engine from start to end keys and computes stats counters based on the values. This method is used after a range is split to recompute stats for each subrange. The start key is always adjusted to avoid counting local keys in the event stats are being recomputed for the first range (i.e. the one with start key == KeyMin). The nowNanos arg specifies the wall time in nanoseconds since the epoch and is used to compute the total age of all intents.
Most codepaths will be computing stats on a RocksDB iterator, which is implemented in c++, so iter.ComputeStats will save several cgo calls per kv processed. (Plus, on equal footing, the c++ implementation is slightly faster.) ComputeStatsGo is here for codepaths that have a pure-go implementation of SimpleIterator.
This implementation must match engine/db.cc:MVCCComputeStatsInternal.
func EncodeKey ¶ added in v1.1.0
EncodeKey encodes an engine.MVCC key into the RocksDB representation. This encoding must match with the encoding in engine/db.cc:EncodeKey().
func IsIntentOf ¶
func IsIntentOf(meta enginepb.MVCCMetadata, txn *roachpb.Transaction) bool
IsIntentOf returns true if the meta record is an intent of the supplied transaction.
func IsValidSplitKey ¶
IsValidSplitKey returns whether the key is a valid split key. Certain key ranges cannot be split (the meta1 span and the system DB span); split keys chosen within any of these ranges are considered invalid. And a split key equal to Meta2KeyMax (\x03\xff\xff) is considered invalid.
func MVCCBlindConditionalPut ¶
func MVCCBlindConditionalPut( ctx context.Context, engine Writer, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, value roachpb.Value, expVal *roachpb.Value, txn *roachpb.Transaction, ) error
MVCCBlindConditionalPut is a fast-path of MVCCConditionalPut. See the MVCCConditionalPut comments for details of the semantics. MVCCBlindConditionalPut skips retrieving the existing metadata for the key requiring the caller to guarantee no versions for the key currently exist.
func MVCCBlindInitPut ¶
func MVCCBlindInitPut( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, value roachpb.Value, failOnTombstones bool, txn *roachpb.Transaction, ) error
MVCCBlindInitPut is a fast-path of MVCCInitPut. See the MVCCInitPut comments for details of the semantics. MVCCBlindInitPut skips retrieving the existing metadata for the key requiring the caller to guarantee no version for the key currently exist.
func MVCCBlindPut ¶
func MVCCBlindPut( ctx context.Context, engine Writer, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, value roachpb.Value, txn *roachpb.Transaction, ) error
MVCCBlindPut is a fast-path of MVCCPut. See the MVCCPut comments for details of the semantics. MVCCBlindPut skips retrieving the existing metadata for the key requiring the caller to guarantee no versions for the key currently exist in order for stats to be updated properly. If a previous version of the key does exist it is up to the caller to properly account for their existence in updating the stats.
func MVCCConditionalPut ¶
func MVCCConditionalPut( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, value roachpb.Value, expVal *roachpb.Value, txn *roachpb.Transaction, ) error
MVCCConditionalPut sets the value for a specified key only if the expected value matches. If not, the return a ConditionFailedError containing the actual value.
The condition check reads a value from the key using the same operational timestamp as we use to write a value.
func MVCCDelete ¶
func MVCCDelete( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, txn *roachpb.Transaction, ) error
MVCCDelete marks the key deleted so that it will not be returned in future get responses.
func MVCCDeleteRange ¶
func MVCCDeleteRange( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key, endKey roachpb.Key, max int64, timestamp hlc.Timestamp, txn *roachpb.Transaction, returnKeys bool, ) ([]roachpb.Key, *roachpb.Span, int64, error)
MVCCDeleteRange deletes the range of key/value pairs specified by start and end keys. It returns the range of keys deleted when returnedKeys is set, the next span to resume from, and the number of keys deleted.
func MVCCFindSplitKey ¶
func MVCCFindSplitKey( ctx context.Context, engine Reader, rangeID roachpb.RangeID, key, endKey roachpb.RKey, targetSize int64, ) (roachpb.Key, error)
MVCCFindSplitKey suggests a split key from the given user-space key range that aims to roughly cut into half the total number of bytes used (in raw key and value byte strings) in both subranges. Specify a snapshot engine to safely invoke this method in a goroutine.
The split key will never be chosen from the key ranges listed in illegalSplitKeySpans.
debugFn, if not nil, is used to print informational log messages about the key finding process.
func MVCCGarbageCollect ¶
func MVCCGarbageCollect( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, keys []roachpb.GCRequest_GCKey, timestamp hlc.Timestamp, maxClears int64, ) error
MVCCGarbageCollect creates an iterator on the engine. In parallel it iterates through the keys listed for garbage collection by the keys slice. The engine iterator is seeked in turn to each listed key, clearing all values with timestamps <= to expiration. The timestamp parameter is used to compute the intent age on GC. Garbage collection stops after clearing maxClears values (to limit the size of the WriteBatch produced).
func MVCCGet ¶
func MVCCGet( ctx context.Context, engine Reader, key roachpb.Key, timestamp hlc.Timestamp, consistent bool, txn *roachpb.Transaction, ) (*roachpb.Value, []roachpb.Intent, error)
MVCCGet returns the value for the key specified in the request, while satisfying the given timestamp condition. The key may contain arbitrary bytes. If no value for the key exists, or it has been deleted, returns nil for value.
The values of multiple versions for the given key should be organized as follows: ... keyA : MVCCMetadata of keyA keyA_Timestamp_n : value of version_n keyA_Timestamp_n-1 : value of version_n-1 ... keyA_Timestamp_0 : value of version_0 keyB : MVCCMetadata of keyB ...
The consistent parameter indicates that intents should cause WriteIntentErrors. If set to false, a possible intent on the key will be ignored for reading the value (but returned via the roachpb.Intent slice); the previous value (if any) is read instead.
func MVCCGetAsTxn ¶
func MVCCGetAsTxn( ctx context.Context, engine Reader, key roachpb.Key, timestamp hlc.Timestamp, txnMeta enginepb.TxnMeta, ) (*roachpb.Value, []roachpb.Intent, error)
MVCCGetAsTxn constructs a temporary Transaction from the given txn metadata and calls MVCCGet as that transaction. This method is required only for reading intents of a transaction when only its metadata is known and should rarely be used. The read is carried out without the chance of uncertainty restarts.
func MVCCGetProto ¶
func MVCCGetProto( ctx context.Context, engine Reader, key roachpb.Key, timestamp hlc.Timestamp, consistent bool, txn *roachpb.Transaction, msg proto.Message, ) (bool, error)
MVCCGetProto fetches the value at the specified key and unmarshals it into msg if msg is non-nil. Returns true on success or false if the key was not found. The semantics of consistent are the same as in MVCCGet.
func MVCCGetRangeStats ¶
func MVCCGetRangeStats( ctx context.Context, engine Reader, rangeID roachpb.RangeID, ) (enginepb.MVCCStats, error)
MVCCGetRangeStats reads stat counters for the specified range and sets the values in the enginepb.MVCCStats struct.
func MVCCIncrement ¶
func MVCCIncrement( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, txn *roachpb.Transaction, inc int64, ) (int64, error)
MVCCIncrement fetches the value for key, and assuming the value is an "integer" type, increments it by inc and stores the new value. The newly incremented value is returned.
An initial value is read from the key using the same operational timestamp as we use to write a value.
func MVCCInitPut ¶
func MVCCInitPut( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, value roachpb.Value, failOnTombstones bool, txn *roachpb.Transaction, ) error
MVCCInitPut sets the value for a specified key if the key doesn't exist. It returns a ConditionFailedError when the write fails or if the key exists with an existing value that is different from the supplied value. If failOnTombstones is set to true, tombstones count as mismatched values and will cause a ConditionFailedError.
func MVCCIterate ¶
func MVCCIterate( ctx context.Context, engine Reader, startKey, endKey roachpb.Key, timestamp hlc.Timestamp, consistent bool, txn *roachpb.Transaction, reverse bool, f func(roachpb.KeyValue) (bool, error), ) ([]roachpb.Intent, error)
MVCCIterate iterates over the key range [start,end). At each step of the iteration, f() is invoked with the current key/value pair. If f returns true (done) or an error, the iteration stops and the error is propagated. If the reverse is flag set the iterator will be moved in reverse order.
func MVCCMerge ¶
func MVCCMerge( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, value roachpb.Value, ) error
MVCCMerge implements a merge operation. Merge adds integer values, concatenates undifferentiated byte slice values, and efficiently combines time series observations if the roachpb.Value tag value indicates the value byte slice is of type TIMESERIES.
func MVCCPut ¶
func MVCCPut( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, value roachpb.Value, txn *roachpb.Transaction, ) error
MVCCPut sets the value for a specified key. It will save the value with different versions according to its timestamp and update the key metadata. The timestamp must be passed as a parameter; using the Timestamp field on the value results in an error.
If the timestamp is specified as hlc.Timestamp{}, the value is inlined instead of being written as a timestamp-versioned value. A zero timestamp write to a key precludes a subsequent write using a non-zero timestamp and vice versa. Inlined values require only a single row and never accumulate more than a single value. Successive zero timestamp writes to a key replace the value and deletes clear the value. In addition, zero timestamp values may be merged.
func MVCCPutProto ¶
func MVCCPutProto( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, key roachpb.Key, timestamp hlc.Timestamp, txn *roachpb.Transaction, msg proto.Message, ) error
MVCCPutProto sets the given key to the protobuf-serialized byte string of msg and the provided timestamp.
func MVCCResolveWriteIntent ¶
func MVCCResolveWriteIntent( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, intent roachpb.Intent, ) error
MVCCResolveWriteIntent either commits or aborts (rolls back) an extant write intent for a given txn according to commit parameter. ResolveWriteIntent will skip write intents of other txns.
Transaction epochs deserve a bit of explanation. The epoch for a transaction is incremented on transaction retries. A transaction retry is different from an abort. Retries can occur in SSI transactions when the commit timestamp is not equal to the proposed transaction timestamp. On a retry, the epoch is incremented instead of creating an entirely new transaction. This allows the intents that were written on previous runs to serve as locks which prevent concurrent reads from further incrementing the timestamp cache, making further transaction retries less likely.
Because successive retries of a transaction may end up writing to different keys, the epochs serve to classify which intents get committed in the event the transaction succeeds (all those with epoch matching the commit epoch), and which intents get aborted, even if the transaction succeeds.
TODO(tschottdorf): encountered a bug in which a Txn committed with its original timestamp after laying down intents at higher timestamps. Doesn't look like this code here caught that. Shouldn't resolve intents when they're not at the timestamp the Txn mandates them to be.
func MVCCResolveWriteIntentRange ¶
func MVCCResolveWriteIntentRange( ctx context.Context, engine ReadWriter, ms *enginepb.MVCCStats, intent roachpb.Intent, max int64, ) (int64, error)
MVCCResolveWriteIntentRange commits or aborts (rolls back) the range of write intents specified by start and end keys for a given txn. ResolveWriteIntentRange will skip write intents of other txns.
func MVCCResolveWriteIntentRangeUsingIter ¶
func MVCCResolveWriteIntentRangeUsingIter( ctx context.Context, engine ReadWriter, iterAndBuf IterAndBuf, ms *enginepb.MVCCStats, intent roachpb.Intent, max int64, ) (int64, error)
MVCCResolveWriteIntentRangeUsingIter commits or aborts (rolls back) the range of write intents specified by start and end keys for a given txn. ResolveWriteIntentRange will skip write intents of other txns.
func MVCCResolveWriteIntentUsingIter ¶
func MVCCResolveWriteIntentUsingIter( ctx context.Context, engine ReadWriter, iterAndBuf IterAndBuf, ms *enginepb.MVCCStats, intent roachpb.Intent, ) error
MVCCResolveWriteIntentUsingIter is a variant of MVCCResolveWriteIntent that uses iterator and buffer passed as parameters (e.g. when used in a loop).
func MVCCReverseScan ¶
func MVCCReverseScan( ctx context.Context, engine Reader, key, endKey roachpb.Key, max int64, timestamp hlc.Timestamp, consistent bool, txn *roachpb.Transaction, ) ([]roachpb.KeyValue, *roachpb.Span, []roachpb.Intent, error)
MVCCReverseScan scans the key range [start,end) key up to some maximum number of results in descending order. If it hits max, it returns a span to be used in the next call to this function.
func MVCCScan ¶
func MVCCScan( ctx context.Context, engine Reader, key, endKey roachpb.Key, max int64, timestamp hlc.Timestamp, consistent bool, txn *roachpb.Transaction, ) ([]roachpb.KeyValue, *roachpb.Span, []roachpb.Intent, error)
MVCCScan scans the key range [start,end) key up to some maximum number of results in ascending order. If it hits max, it returns a span to be used in the next call to this function.
func MVCCSetRangeStats ¶
func MVCCSetRangeStats( ctx context.Context, engine ReadWriter, rangeID roachpb.RangeID, ms *enginepb.MVCCStats, ) error
MVCCSetRangeStats sets stat counters for specified range.
func MakeValue ¶
func MakeValue(meta enginepb.MVCCMetadata) roachpb.Value
MakeValue returns the inline value.
func MergeInternalTimeSeriesData ¶
func MergeInternalTimeSeriesData( sources ...roachpb.InternalTimeSeriesData, ) (roachpb.InternalTimeSeriesData, error)
MergeInternalTimeSeriesData exports the engine's C++ merge logic for InternalTimeSeriesData to higher level packages. This is intended primarily for consumption by high level testing of time series functionality.
func PutProto ¶
PutProto sets the given key to the protobuf-serialized byte string of msg and the provided timestamp. Returns the length in bytes of key and the value.
func RocksDBBatchCount ¶ added in v1.1.0
RocksDBBatchCount provides an efficient way to get the count of mutations in a RocksDB Batch representation.
Types ¶
type Batch ¶
type Batch interface { ReadWriter // Commit atomically applies any batched updates to the underlying // engine. This is a noop unless the engine was created via NewBatch(). If // sync is true, the batch is synchronously committed to disk. Commit(sync bool) error // Distinct returns a view of the existing batch which only sees writes that // were performed before the Distinct batch was created. That is, the // returned batch will not read its own writes, but it will read writes to // the parent batch performed before the call to Distinct(). The returned // batch needs to be closed before using the parent batch again. This is used // as an optimization to avoid flushing mutations buffered by the batch in // situations where we know all of the batched operations are for distinct // keys. Distinct() ReadWriter // Repr returns the underlying representation of the batch and can be used to // reconstitute the batch on a remote node using Writer.ApplyBatchRepr(). Repr() []byte }
Batch is the interface for batch specific operations.
type BatchType ¶
type BatchType byte
BatchType represents the type of an entry in an encoded RocksDB batch.
type Engine ¶
type Engine interface { ReadWriter // Attrs returns the engine/store attributes. Attrs() roachpb.Attributes // Capacity returns capacity details for the engine's available storage. Capacity() (roachpb.StoreCapacity, error) // Flush causes the engine to write all in-memory data to disk // immediately. Flush() error // GetStats retrieves stats from the engine. GetStats() (*Stats, error) // GetAuxiliaryDir returns a path under which files can be stored // persistently, and from which data can be ingested by the engine. // // Not thread safe. GetAuxiliaryDir() string // NewBatch returns a new instance of a batched engine which wraps // this engine. Batched engines accumulate all mutations and apply // them atomically on a call to Commit(). NewBatch() Batch // NewReadOnly returns a new instance of a ReadWriter that wraps // this engine. This wrapper panics when unexpected operations (e.g., write // operations) are executed on it and caches iterators to avoid the overhead // of creating multiple iterators for batched reads. NewReadOnly() ReadWriter // NewWriteOnlyBatch returns a new instance of a batched engine which wraps // this engine. A write-only batch accumulates all mutations and applies them // atomically on a call to Commit(). Read operations return an error. // // TODO(peter): This should return a WriteBatch interface, but there are mild // complications in both defining that interface and implementing it. In // particular, Batch.Close would no longer come from Reader and we'd need to // refactor a bunch of code in rocksDBBatch. NewWriteOnlyBatch() Batch // NewSnapshot returns a new instance of a read-only snapshot // engine. Snapshots are instantaneous and, as long as they're // released relatively quickly, inexpensive. Snapshots are released // by invoking Close(). Note that snapshots must not be used after the // original engine has been stopped. NewSnapshot() Reader // IngestExternalFile links a file into the RocksDB log-structured // merge-tree. IngestExternalFile(ctx context.Context, path string, move bool) error }
Engine is the interface that wraps the core operations of a key/value store.
func NewTempEngine ¶ added in v1.1.0
NewTempEngine creates a new engine for DistSQL processors to use when the working set is larger than can be stored in memory. It returns nil if it could not set up a temporary Engine. When closed, it destroys the underlying data.
type GarbageCollector ¶
GarbageCollector GCs MVCC key/values using a zone-specific GC policy allows either the union or intersection of maximum # of versions and maximum age.
func MakeGarbageCollector ¶
func MakeGarbageCollector(now hlc.Timestamp, policy config.GCPolicy) GarbageCollector
MakeGarbageCollector allocates and returns a new GC, with expiration computed based on current time and policy.TTLSeconds.
func (GarbageCollector) Filter ¶
func (gc GarbageCollector) Filter(keys []MVCCKey, values [][]byte) hlc.Timestamp
Filter makes decisions about garbage collection based on the garbage collection policy for batches of values for the same key. Returns the timestamp including, and after which, all values should be garbage collected. If no values should be GC'd, returns the zero timestamp. keys must be in descending time order. Values deleted at or before the returned timestamp can be deleted without invalidating any reads in the time interval (gc.expiration, \infinity).
The GC keeps all values (including deletes) above the expiration time, plus the first value before or at the expiration time. This allows reads to be guaranteed as described above. However if this were the only rule, then if the most recent write was a delete, it would never be removed. Thus, when a deleted value is the most recent before expiration, it can be deleted. This would still allow for the tombstone bugs in #6227, so in the future we will add checks that disallow writes before the last GC expiration time.
type InMem ¶
type InMem struct {
*RocksDB
}
InMem wraps RocksDB and configures it for in-memory only storage.
type IterAndBuf ¶
type IterAndBuf struct {
// contains filtered or unexported fields
}
IterAndBuf used to pass iterators and buffers between MVCC* calls, allowing reuse without the callers needing to know the particulars.
func GetIterAndBuf ¶
func GetIterAndBuf(engine Reader) IterAndBuf
GetIterAndBuf returns a IterAndBuf for passing into various MVCC* methods.
func (IterAndBuf) Cleanup ¶
func (b IterAndBuf) Cleanup()
Cleanup must be called to release the resources when done.
type Iterator ¶
type Iterator interface { SimpleIterator // SeekReverse advances the iterator to the first key in the engine which // is <= the provided key. SeekReverse(key MVCCKey) // Prev moves the iterator backward to the previous key/value // in the iteration. After this call, Valid() will be true if the // iterator was not positioned at the first key. Prev() // PrevKey moves the iterator backward to the previous MVCC key. This // operation is distinct from Prev which moves the iterator backward to the // prev version of the current key or the prev key if the iterator is // currently located at the first version for a key. PrevKey() // Key returns the current key. Key() MVCCKey // Value returns the current value as a byte slice. Value() []byte // ValueProto unmarshals the value the iterator is currently // pointing to using a protobuf decoder. ValueProto(msg proto.Message) error // Less returns true if the key the iterator is currently positioned at is // less than the specified key. Less(key MVCCKey) bool // ComputeStats scans the underlying engine from start to end keys and // computes stats counters based on the values. This method is used after a // range is split to recompute stats for each subrange. The start key is // always adjusted to avoid counting local keys in the event stats are being // recomputed for the first range (i.e. the one with start key == KeyMin). // The nowNanos arg specifies the wall time in nanoseconds since the // epoch and is used to compute the total age of all intents. ComputeStats(start, end MVCCKey, nowNanos int64) (enginepb.MVCCStats, error) }
Iterator is an interface for iterating over key/value pairs in an engine. Iterator implementations are thread safe unless otherwise noted.
type MVCCKey ¶
MVCCKey is a versioned key, distinguished from roachpb.Key with the addition of a timestamp.
func AllocIterKeyValue ¶
func AllocIterKeyValue( a bufalloc.ByteAllocator, iter Iterator, ) (bufalloc.ByteAllocator, MVCCKey, []byte)
AllocIterKeyValue returns iter.Key() and iter.Value() with the underlying storage allocated from the passed ChunkAllocator.
func DecodeKey ¶
DecodeKey decodes an engine.MVCCKey from its serialized representation. This decoding must match engine/db.cc:DecodeKey().
func MakeMVCCMetadataKey ¶
MakeMVCCMetadataKey creates an MVCCKey from a roachpb.Key.
func (MVCCKey) EncodedSize ¶
EncodedSize returns the size of the MVCCKey when encoded.
type MVCCKeyValue ¶
MVCCKeyValue contains the raw bytes of the value for a key.
type ReadWriter ¶
ReadWriter is the read/write interface to an engine's data.
type Reader ¶
type Reader interface { // Close closes the reader, freeing up any outstanding resources. Note that // various implementations have slightly different behaviors. In particular, // Distinct() batches release their parent batch for future use while // Engines, Snapshots and Batches free the associated C++ resources. Close() // Closed returns true if the reader has been closed or is not usable. // Objects backed by this reader (e.g. Iterators) can check this to ensure // that they are not using a closed engine. Intended for use within package // engine; exported to enable wrappers to exist in other packages. Closed() bool // Get returns the value for the given key, nil otherwise. Get(key MVCCKey) ([]byte, error) // GetProto fetches the value at the specified key and unmarshals it // using a protobuf decoder. Returns true on success or false if the // key was not found. On success, returns the length in bytes of the // key and the value. GetProto(key MVCCKey, msg proto.Message) (ok bool, keyBytes, valBytes int64, err error) // Iterate scans from start to end keys, visiting at most max // key/value pairs. On each key value pair, the function f is // invoked. If f returns an error or if the scan itself encounters // an error, the iteration will stop and return the error. // If the first result of f is true, the iteration stops. Iterate(start, end MVCCKey, f func(MVCCKeyValue) (bool, error)) error // NewIterator returns a new instance of an Iterator over this engine. When // prefix is true, Seek will use the user-key prefix of the supplied MVCC key // to restrict which sstables are searched, but iteration (using Next) over // keys without the same user-key prefix will not work correctly (keys may be // skipped). The caller must invoke Iterator.Close() when finished with the // iterator to free resources. NewIterator(prefix bool) Iterator // NewTimeBoundIterator is like NewIterator, but the underlying iterator will // efficiently skip over SSTs that contain no MVCC keys in the time range // (start, end]. NewTimeBoundIterator(start, end hlc.Timestamp) Iterator }
Reader is the read interface to an engine's data.
type RocksDB ¶
type RocksDB struct {
// contains filtered or unexported fields
}
RocksDB is a wrapper around a RocksDB database instance.
func NewRocksDB ¶
func NewRocksDB(cfg RocksDBConfig, cache RocksDBCache) (*RocksDB, error)
NewRocksDB allocates and returns a new RocksDB object. This creates options and opens the database. If the database doesn't yet exist at the specified directory, one is initialized from scratch. The caller must call the engine's Close method when the engine is no longer needed.
func (*RocksDB) ApplyBatchRepr ¶
ApplyBatchRepr atomically applies a set of batched updates. Created by calling Repr() on a batch. Using this method is equivalent to constructing and committing a batch whose Repr() equals repr.
func (*RocksDB) Attrs ¶
func (r *RocksDB) Attrs() roachpb.Attributes
Attrs returns the list of attributes describing this engine. This may include a specification of disk type (e.g. hdd, ssd, fio, etc.) and potentially other labels to identify important attributes of the engine.
func (*RocksDB) Capacity ¶
func (r *RocksDB) Capacity() (roachpb.StoreCapacity, error)
Capacity queries the underlying file system for disk capacity information.
func (*RocksDB) ClearIterRange ¶
ClearIterRange removes a set of entries, from start (inclusive) to end (exclusive).
func (*RocksDB) ClearRange ¶
ClearRange removes a set of entries, from start (inclusive) to end (exclusive).
func (*RocksDB) Close ¶
func (r *RocksDB) Close()
Close closes the database by deallocating the underlying handle.
func (*RocksDB) Destroy ¶
Destroy destroys the underlying filesystem data associated with the database.
func (*RocksDB) GetAuxiliaryDir ¶ added in v1.1.0
GetAuxiliaryDir returns the auxiliary storage path for this engine.
func (*RocksDB) GetCompactionStats ¶ added in v1.1.0
GetCompactionStats returns the internal RocksDB compaction stats. See https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#rocksdb-statistics.
func (*RocksDB) GetProto ¶
func (r *RocksDB) GetProto( key MVCCKey, msg proto.Message, ) (ok bool, keyBytes, valBytes int64, err error)
GetProto fetches the value at the specified key and unmarshals it.
func (*RocksDB) GetSSTables ¶
func (r *RocksDB) GetSSTables() SSTableInfos
GetSSTables retrieves metadata about this engine's live sstables.
func (*RocksDB) GetStats ¶
GetStats retrieves stats from this engine's RocksDB instance and returns it in a new instance of Stats.
func (*RocksDB) IngestExternalFile ¶ added in v1.1.0
IngestExternalFile links a file into the RocksDB log-structured merge-tree.
func (*RocksDB) Iterate ¶
Iterate iterates from start to end keys, invoking f on each key/value pair. See engine.Iterate for details.
func (*RocksDB) Merge ¶
Merge implements the RocksDB merge operator using the function goMergeInit to initialize missing values and goMerge to merge the old and the given value into a new value, which is then stored under key. Currently 64-bit counter logic is implemented. See the documentation of goMerge and goMergeInit for details.
The key and value byte slices may be reused safely. merge takes a copy of them before returning.
func (*RocksDB) NewIterator ¶
NewIterator returns an iterator over this rocksdb engine.
func (*RocksDB) NewReadOnly ¶ added in v1.1.0
func (r *RocksDB) NewReadOnly() ReadWriter
NewReadOnly returns a new ReadWriter wrapping this rocksdb engine.
func (*RocksDB) NewSnapshot ¶
NewSnapshot creates a snapshot handle from engine and returns a read-only rocksDBSnapshot engine.
func (*RocksDB) NewTimeBoundIterator ¶ added in v1.1.0
NewTimeBoundIterator is like NewIterator, but returns a time-bound iterator.
func (*RocksDB) NewWriteOnlyBatch ¶
NewWriteOnlyBatch returns a new write-only batch wrapping this rocksdb engine.
type RocksDBBatchBuilder ¶
type RocksDBBatchBuilder struct {
// contains filtered or unexported fields
}
RocksDBBatchBuilder is used to construct the RocksDB batch representation. From the RocksDB code, the representation of a batch is:
WriteBatch::rep_ := sequence: fixed64 count: fixed32 data: record[count] record := kTypeValue varstring varstring kTypeDeletion varstring kTypeSingleDeletion varstring kTypeMerge varstring varstring kTypeColumnFamilyValue varint32 varstring varstring kTypeColumnFamilyDeletion varint32 varstring varstring kTypeColumnFamilySingleDeletion varint32 varstring varstring kTypeColumnFamilyMerge varint32 varstring varstring varstring := len: varint32 data: uint8[len]
The RocksDBBatchBuilder code currently only supports kTypeValue (BatchTypeValue), kTypeDeletion (BatchTypeDeletion)and kTypeMerge (BatchTypeMerge) operations. Before a batch is written to the RocksDB write-ahead-log, the sequence number is 0. The "fixed32" format is little endian.
The keys encoded into the batch are MVCC keys: a string key with a timestamp suffix. MVCC keys are encoded as:
<key>[<wall_time>[<logical>]]<#timestamp-bytes>
The <wall_time> and <logical> portions of the key are encoded as 64 and 32-bit big-endian integers. A custom RocksDB comparator is used to maintain the desired ordering as these keys do not sort lexicographically correctly. Note that the encoding of these keys needs to match up with the encoding in rocksdb/db.cc:EncodeKey().
func (*RocksDBBatchBuilder) ApplyRepr ¶ added in v1.1.0
func (b *RocksDBBatchBuilder) ApplyRepr(repr []byte) error
ApplyRepr applies the mutations in repr to the current batch.
func (*RocksDBBatchBuilder) Clear ¶
func (b *RocksDBBatchBuilder) Clear(key MVCCKey)
Clear removes the item from the db with the given key.
func (*RocksDBBatchBuilder) Finish ¶
func (b *RocksDBBatchBuilder) Finish() []byte
Finish returns the constructed batch representation. After calling Finish, the builder may be used to construct another batch, but the returned []byte is only valid until the next builder method is called.
func (*RocksDBBatchBuilder) Len ¶
func (b *RocksDBBatchBuilder) Len() int
Len returns the number of bytes currently in the under construction repr.
func (*RocksDBBatchBuilder) Merge ¶
func (b *RocksDBBatchBuilder) Merge(key MVCCKey, value []byte)
Merge is a high-performance write operation used for values which are accumulated over several writes. Multiple values can be merged sequentially into a single key; a subsequent read will return a "merged" value which is computed from the original merged values.
func (*RocksDBBatchBuilder) Put ¶
func (b *RocksDBBatchBuilder) Put(key MVCCKey, value []byte)
Put sets the given key to the value provided.
type RocksDBBatchReader ¶
type RocksDBBatchReader struct {
// contains filtered or unexported fields
}
RocksDBBatchReader is used to iterate the entries in a RocksDB batch representation.
Example: r, err := NewRocksDBBatchReader(...)
if err != nil { return err }
for r.Next() { switch r.BatchType() { case BatchTypeDeletion: fmt.Printf("delete(%x)", r.UnsafeKey()) case BatchTypeValue: fmt.Printf("put(%x,%x)", r.UnsafeKey(), r.UnsafeValue()) case BatchTypeMerge: fmt.Printf("merge(%x,%x)", r.UnsafeKey(), r.UnsafeValue()) } }
if err != nil { return nil }
func NewRocksDBBatchReader ¶
func NewRocksDBBatchReader(repr []byte) (*RocksDBBatchReader, error)
NewRocksDBBatchReader creates a RocksDBBatchReader from the given repr and verifies the header.
func (*RocksDBBatchReader) BatchType ¶
func (r *RocksDBBatchReader) BatchType() BatchType
BatchType returns the type of the current batch entry.
func (*RocksDBBatchReader) Count ¶
func (r *RocksDBBatchReader) Count() int
Count returns the declared number of entries in the batch.
func (*RocksDBBatchReader) Error ¶
func (r *RocksDBBatchReader) Error() error
Error returns the error, if any, which the iterator encountered.
func (*RocksDBBatchReader) Next ¶
func (r *RocksDBBatchReader) Next() bool
Next advances to the next entry in the batch, returning false when the batch is empty.
func (*RocksDBBatchReader) UnsafeKey ¶
func (r *RocksDBBatchReader) UnsafeKey() []byte
UnsafeKey returns the key of the current batch entry. The memory is invalidated on the next call to Next.
func (*RocksDBBatchReader) UnsafeValue ¶
func (r *RocksDBBatchReader) UnsafeValue() []byte
UnsafeValue returns the value of the current batch entry. The memory is invalidated on the next call to Next. UnsafeValue panics if the BatchType is BatchTypeDeleted.
type RocksDBCache ¶
type RocksDBCache struct {
// contains filtered or unexported fields
}
RocksDBCache is a wrapper around C.DBCache
func NewRocksDBCache ¶
func NewRocksDBCache(cacheSize int64) RocksDBCache
NewRocksDBCache creates a new cache of the specified size. Note that the cache is refcounted internally and starts out with a refcount of one (i.e. Release() should be called after having used the cache).
func (RocksDBCache) Release ¶
func (c RocksDBCache) Release()
Release releases the cache. Note that the cache will continue to be used until all of the RocksDB engines it was attached to have been closed, and that RocksDB engines which use it auto-release when they close.
type RocksDBConfig ¶ added in v1.1.0
type RocksDBConfig struct { Attrs roachpb.Attributes // Dir is the data directory for this store. Dir string // If true, creating the instance fails if the target directory does not hold // an initialized RocksDB instance. // // Makes no sense for in-memory instances. MustExist bool // MaxSizeBytes is used for calculating free space and making rebalancing // decisions. Zero indicates that there is no maximum size. MaxSizeBytes int64 // MaxOpenFiles controls the maximum number of file descriptors RocksDB // creates. If MaxOpenFiles is zero, this is set to DefaultMaxOpenFiles. MaxOpenFiles uint64 // WarnLargeBatchThreshold controls if a log message is printed when a // WriteBatch takes longer than WarnLargeBatchThreshold. If it is set to // zero, no log messages are ever printed. WarnLargeBatchThreshold time.Duration // Settings instance for cluster-wide knobs. Settings *cluster.Settings }
RocksDBConfig holds all configuration parameters and knobs used in setting up a new RocksDB instance.
type RocksDBError ¶ added in v1.1.3
type RocksDBError struct {
// contains filtered or unexported fields
}
A RocksDBError wraps an error returned from a RocksDB operation.
func (*RocksDBError) Error ¶ added in v1.1.3
func (err *RocksDBError) Error() string
Error implements the error interface.
func (*RocksDBError) SafeMessage ¶ added in v1.1.3
func (err *RocksDBError) SafeMessage() string
SafeMessage implements log.SafeMessager.
type RocksDBMap ¶ added in v1.1.0
type RocksDBMap struct {
// contains filtered or unexported fields
}
RocksDBMap is a SortedDiskMap that uses RocksDB as its underlying storage engine.
func NewRocksDBMap ¶ added in v1.1.0
func NewRocksDBMap(e Engine) *RocksDBMap
NewRocksDBMap creates a new RocksDBMap with the passed in Engine as the underlying store. The RocksDBMap instance will have a keyspace prefixed by a unique prefix.
func NewRocksDBMultiMap ¶ added in v1.1.0
func NewRocksDBMultiMap(e Engine) *RocksDBMap
NewRocksDBMultiMap creates a new RocksDBMap with the passed in Engine as the underlying store. The RocksDBMap instance will have a keyspace prefixed by a unique prefix. Unlike NewRocksDBMap, Puts with identical keys will write multiple entries (instead of overwriting previous entries) that will be returned during iteration.
func (*RocksDBMap) Close ¶ added in v1.1.0
func (r *RocksDBMap) Close(ctx context.Context)
Close implements the SortedDiskMap interface.
func (*RocksDBMap) Get ¶ added in v1.1.0
func (r *RocksDBMap) Get(k []byte) ([]byte, error)
Get implements the SortedDiskMap interface.
func (*RocksDBMap) NewBatchWriter ¶ added in v1.1.0
func (r *RocksDBMap) NewBatchWriter() SortedDiskMapBatchWriter
NewBatchWriter implements the SortedDiskMap interface.
func (*RocksDBMap) NewBatchWriterCapacity ¶ added in v1.1.0
func (r *RocksDBMap) NewBatchWriterCapacity(capacityBytes int) SortedDiskMapBatchWriter
NewBatchWriterCapacity implements the SortedDiskMap interface.
func (*RocksDBMap) NewIterator ¶ added in v1.1.0
func (r *RocksDBMap) NewIterator() SortedDiskMapIterator
NewIterator implements the SortedDiskMap interface.
type RocksDBMapBatchWriter ¶ added in v1.1.0
type RocksDBMapBatchWriter struct {
// contains filtered or unexported fields
}
RocksDBMapBatchWriter batches writes to a RocksDBMap.
func (*RocksDBMapBatchWriter) Close ¶ added in v1.1.0
func (b *RocksDBMapBatchWriter) Close(ctx context.Context) error
Close implements the SortedDiskMapBatchWriter interface.
func (*RocksDBMapBatchWriter) Flush ¶ added in v1.1.0
func (b *RocksDBMapBatchWriter) Flush() error
Flush implements the SortedDiskMapBatchWriter interface.
type RocksDBMapIterator ¶ added in v1.1.0
type RocksDBMapIterator struct {
// contains filtered or unexported fields
}
RocksDBMapIterator iterates over the keys of a RocksDBMap in sorted order.
func (*RocksDBMapIterator) Close ¶ added in v1.1.0
func (i *RocksDBMapIterator) Close()
Close implements the SortedDiskMapIterator interface.
func (*RocksDBMapIterator) Key ¶ added in v1.1.0
func (i *RocksDBMapIterator) Key() []byte
Key implements the SortedDiskMapIterator interface.
func (*RocksDBMapIterator) Next ¶ added in v1.1.0
func (i *RocksDBMapIterator) Next()
Next implements the SortedDiskMapIterator interface.
func (*RocksDBMapIterator) Rewind ¶ added in v1.1.0
func (i *RocksDBMapIterator) Rewind()
Rewind implements the SortedDiskMapIterator interface.
func (*RocksDBMapIterator) Seek ¶ added in v1.1.0
func (i *RocksDBMapIterator) Seek(k []byte)
Seek implements the SortedDiskMapIterator interface.
func (*RocksDBMapIterator) UnsafeKey ¶ added in v1.1.0
func (i *RocksDBMapIterator) UnsafeKey() []byte
UnsafeKey implements the SortedDiskMapIterator interface.
func (*RocksDBMapIterator) UnsafeValue ¶ added in v1.1.0
func (i *RocksDBMapIterator) UnsafeValue() []byte
UnsafeValue implements the SortedDiskMapIterator interface.
func (*RocksDBMapIterator) Valid ¶ added in v1.1.0
func (i *RocksDBMapIterator) Valid() (bool, error)
Valid implements the SortedDiskMapIterator interface.
func (*RocksDBMapIterator) Value ¶ added in v1.1.0
func (i *RocksDBMapIterator) Value() []byte
Value implements the SortedDiskMapIterator interface.
type RocksDBSstFileReader ¶
type RocksDBSstFileReader struct {
// contains filtered or unexported fields
}
RocksDBSstFileReader allows iteration over a number of non-overlapping sstables exported by `RocksDBSstFileWriter`.
func MakeRocksDBSstFileReader ¶
func MakeRocksDBSstFileReader() RocksDBSstFileReader
MakeRocksDBSstFileReader creates a RocksDBSstFileReader backed by an in-memory RocksDB instance.
func (*RocksDBSstFileReader) Close ¶
func (fr *RocksDBSstFileReader) Close()
Close finishes the reader.
func (*RocksDBSstFileReader) IngestExternalFile ¶ added in v1.1.0
func (fr *RocksDBSstFileReader) IngestExternalFile(data []byte) error
IngestExternalFile links a file with the given contents into a database. See the RocksDB documentation on `IngestExternalFile` for the various restrictions on what can be added.
func (*RocksDBSstFileReader) Iterate ¶
func (fr *RocksDBSstFileReader) Iterate( start, end MVCCKey, f func(MVCCKeyValue) (bool, error), ) error
Iterate iterates over the keys between start inclusive and end exclusive, invoking f() on each key/value pair.
func (*RocksDBSstFileReader) NewIterator ¶
func (fr *RocksDBSstFileReader) NewIterator(prefix bool) Iterator
NewIterator returns an iterator over this sst reader.
type RocksDBSstFileWriter ¶
type RocksDBSstFileWriter struct { // DataSize tracks the total key and value bytes added so far. DataSize int64 // contains filtered or unexported fields }
RocksDBSstFileWriter creates a file suitable for importing with RocksDBSstFileReader.
func MakeRocksDBSstFileWriter ¶
func MakeRocksDBSstFileWriter() (RocksDBSstFileWriter, error)
MakeRocksDBSstFileWriter creates a new RocksDBSstFileWriter with the default configuration.
func (*RocksDBSstFileWriter) Add ¶
func (fw *RocksDBSstFileWriter) Add(kv MVCCKeyValue) error
Add puts a kv entry into the sstable being built. An error is returned if it is not greater than any previously added entry (according to the comparator configured during writer creation). `Close` cannot have been called.
func (*RocksDBSstFileWriter) Close ¶
func (fw *RocksDBSstFileWriter) Close()
Close finishes and frees memory and other resources. Close is idempotent.
func (*RocksDBSstFileWriter) Finish ¶ added in v1.1.0
func (fw *RocksDBSstFileWriter) Finish() ([]byte, error)
Finish finalizes the writer and returns the constructed file's contents. At least one kv entry must have been added.
type SSTableInfo ¶
SSTableInfo contains metadata about a single RocksDB sstable. This mirrors the C.DBSSTable struct contents.
type SSTableInfos ¶
type SSTableInfos []SSTableInfo
SSTableInfos is a slice of SSTableInfo structures.
func (SSTableInfos) Len ¶
func (s SSTableInfos) Len() int
func (SSTableInfos) Less ¶
func (s SSTableInfos) Less(i, j int) bool
func (SSTableInfos) ReadAmplification ¶
func (s SSTableInfos) ReadAmplification() int
ReadAmplification returns RocksDB's worst case read amplification, which is the number of level-0 sstables plus the number of levels, other than level 0, with at least one sstable.
This definition comes from here: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#level-style-compaction
func (SSTableInfos) String ¶
func (s SSTableInfos) String() string
func (SSTableInfos) Swap ¶
func (s SSTableInfos) Swap(i, j int)
type SimpleIterator ¶ added in v1.1.0
type SimpleIterator interface { // Close frees up resources held by the iterator. Close() // Seek advances the iterator to the first key in the engine which // is >= the provided key. Seek(key MVCCKey) // Valid must be called after any call to Seek(), Next(), Prev(), or // similar methods. It returns (true, nil) if the iterator points to // a valid key (it is undefined to call Key(), Value(), or similar // methods unless Valid() has returned (true, nil)). It returns // (false, nil) if the iterator has moved past the end of the valid // range, or (false, err) if an error has occurred. Valid() will // never return true with a non-nil error. Valid() (bool, error) // Next advances the iterator to the next key/value in the // iteration. After this call, Valid() will be true if the // iterator was not positioned at the last key. Next() // NextKey advances the iterator to the next MVCC key. This operation is // distinct from Next which advances to the next version of the current key // or the next key if the iterator is currently located at the last version // for a key. NextKey() // UnsafeKey returns the same value as Key, but the memory is invalidated on // the next call to {Next,Prev,Seek,SeekReverse,Close}. UnsafeKey() MVCCKey // UnsafeValue returns the same value as Value, but the memory is // invalidated on the next call to {Next,Prev,Seek,SeekReverse,Close}. UnsafeValue() []byte }
SimpleIterator is an interface for iterating over key/value pairs in an engine. SimpleIterator implementations are thread safe unless otherwise noted. SimpleIterator is a subset of the functionality offered by Iterator.
type SortedDiskMap ¶ added in v1.1.0
type SortedDiskMap interface { // Put writes the given key/value pair. Put(k []byte, v []byte) error // Get reads the value for the given key. Get(k []byte) ([]byte, error) // NewIterator returns a SortedDiskMapIterator that can be used to iterate // over key/value pairs in sorted order. NewIterator() SortedDiskMapIterator // NewBatchWriter returns a SortedDiskMapBatchWriter that can be used to // batch writes to this map for performance improvements. NewBatchWriter() SortedDiskMapBatchWriter // NewBatchWriterCapacity is identical to NewBatchWriter, but overrides the // SortedDiskMapBatchWriter's default capacity with capacityBytes. NewBatchWriterCapacity(capacityBytes int) SortedDiskMapBatchWriter // Close frees up resources held by the map. Close(context.Context) }
SortedDiskMap is an on-disk map. Keys are iterated over in sorted order.
type SortedDiskMapBatchWriter ¶ added in v1.1.0
type SortedDiskMapBatchWriter interface { // Put writes the given key/value pair to the batch. The write to the // underlying store happens on Flush(), Close(), or when the batch writer // reaches its capacity. Put(k []byte, v []byte) error // Flush flushes all writes to the underlying store. The batch can be reused // after a call to Flush(). Flush() error // Close flushes all writes to the underlying store and frees up resources // held by the batch writer. Close(context.Context) error }
SortedDiskMapBatchWriter batches writes to a SortedDiskMap.
type SortedDiskMapIterator ¶ added in v1.1.0
type SortedDiskMapIterator interface { // Seek sets the iterator's position to the first key greater than or equal // to the provided key. Seek(key []byte) // Rewind seeks to the start key. Rewind() // Valid must be called after any call to Seek(), Rewind(), or Next(). It // returns (true, nil) if the iterator points to a valid key and // (false, nil) if the iterator has moved past the end of the valid range. // If an error has occurred, the returned bool is invalid. Valid() (bool, error) // Next advances the iterator to the next key in the iteration. Next() // Key returns the current key. The resulting byte slice is still valid // after the next call to Seek(), Rewind(), or Next(). Key() []byte // Value returns the current value. The resulting byte slice is still valid // after the next call to Seek(), Rewind(), or Next(). Value() []byte // UnsafeKey returns the same value as Key, but the memory is invalidated on // the next call to {Next,Rewind,Seek,Close}. UnsafeKey() []byte // UnsafeValue returns the same value as Value, but the memory is // invalidated on the next call to {Next,Rewind,Seek,Close}. UnsafeValue() []byte // Close frees up resources held by the iterator. Close() }
SortedDiskMapIterator is a simple iterator used to iterate over keys and/or values. Example use of iterating over all keys:
var i SortedDiskMapIterator for i.Rewind(); ; i.Next() { if ok, err := i.Valid(); err != nil { // Handle error. } else if !ok { break } key := i.Key() // Do something. }
type Stats ¶
type Stats struct { BlockCacheHits int64 BlockCacheMisses int64 BlockCacheUsage int64 BlockCachePinnedUsage int64 BloomFilterPrefixChecked int64 BloomFilterPrefixUseful int64 MemtableHits int64 MemtableMisses int64 MemtableTotalSize int64 Flushes int64 Compactions int64 TableReadersMemEstimate int64 }
Stats is a set of RocksDB stats. These are all described in RocksDB
Currently, we collect stats from the following sources:
- RocksDB's internal "tickers" (i.e. counters). They're defined in rocksdb/statistics.h
- DBEventListener, which implements RocksDB's EventListener interface.
- rocksdb::DB::GetProperty().
This is a good resource describing RocksDB's memory-related stats: https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
type Version ¶
type Version struct {
Version storageVersion
}
Version stores all the version information for all stores and is used as the format for the version file.
type Writer ¶
type Writer interface { // ApplyBatchRepr atomically applies a set of batched updates. Created by // calling Repr() on a batch. Using this method is equivalent to constructing // and committing a batch whose Repr() equals repr. If sync is true, the // batch is synchronously written to disk. It is an error to specify // sync=true if the Writer is a Batch. ApplyBatchRepr(repr []byte, sync bool) error // Clear removes the item from the db with the given key. // Note that clear actually removes entries from the storage // engine, rather than inserting tombstones. Clear(key MVCCKey) error // ClearRange removes a set of entries, from start (inclusive) to end // (exclusive). Similar to Clear, this method actually removes entries from // the storage engine. ClearRange(start, end MVCCKey) error // ClearIterRange removes a set of entries, from start (inclusive) to end // (exclusive). Similar to Clear and ClearRange, this method actually removes // entries from the storage engine. Unlike ClearRange, the entries to remove // are determined by iterating over iter and per-key tombstones are // generated. ClearIterRange(iter Iterator, start, end MVCCKey) error // Merge is a high-performance write operation used for values which are // accumulated over several writes. Multiple values can be merged // sequentially into a single key; a subsequent read will return a "merged" // value which is computed from the original merged values. // // Merge currently provides specialized behavior for three data types: // integers, byte slices, and time series observations. Merged integers are // summed, acting as a high-performance accumulator. Byte slices are simply // concatenated in the order they are merged. Time series observations // (stored as byte slices with a special tag on the roachpb.Value) are // combined with specialized logic beyond that of simple byte slices. // // The logic for merges is written in db.cc in order to be compatible with RocksDB. Merge(key MVCCKey, value []byte) error // Put sets the given key to the value provided. Put(key MVCCKey, value []byte) error }
Writer is the write interface to an engine's data.