storage

package
v0.0.0-...-0bec3fc Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 26, 2024 License: BSD-3-Clause Imports: 19 Imported by: 0

Documentation

Overview

Package storage defines the storage abstractions needed for Oscar: DB, a basic key-value store, and VectorDB, a vector database. The storage needs are intentionally minimal (avoiding, for example, a requirement on SQL), to admit as many implementations as possible.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Fmt

func Fmt(data []byte) string

Fmt formats data for printing, first trying ordered.DecodeFmt in case data is an [ordered encoding], then trying a backquoted string if possible (handling simple JSON data), and finally resorting to strconv.QuoteToASCII.

func JSON

func JSON(x any) []byte

JSON converts x to JSON and returns the result. It panics if there is any error converting x to JSON. Since whether x can be converted to JSON depends almost entirely on its type, a marshaling error indicates a bug at the call site.

(The exception is certain malformed UTF-8 and floating-point infinity and NaN. Code must be careful not to use JSON with those.)

func Panic

func Panic(msg string, args ...any)

Panic panics with the text formatting of its arguments. It is meant to be called for database errors or corruption, which have been defined to be impossible. (See the DB documentation.)

Panic is expected to be used by DB implementations. DB clients should use the [DB.Panic] method instead.

func TestDB

func TestDB(t *testing.T, db DB)

TestDB runs basic tests on db. It should be empty when TestDB is called. To run tests on Lock and Unlock, also call TestDBLock.

func TestDBLock

func TestDBLock(t *testing.T, db locker)

TestDBLock verifies that Lock behaves correctly. It is separate from TestDB because it can't be used with a recorder/replayer, thanks to its sensitivity to time.

func TestVectorDB

func TestVectorDB(t *testing.T, opendb func() VectorDB)

TestVectorDB verifies that implementations of VectorDB conform to its specification. The opendb function should create a new connection to the same underlying storage.

Types

type Batch

type Batch interface {
	// Delete deletes any value associated with key.
	// Delete of an unset key is a no-op.
	//
	// Delete does not retain any reference to key after returning.
	Delete(key []byte)

	// DeleteRange deletes all key-value pairs with start ≤ key ≤ end.
	//
	// DeleteRange does not retain any reference to start or end after returning.
	DeleteRange(start, end []byte)

	// Set sets the value associated with key to val.
	//
	// Set does not retain any reference to key or val after returning.
	Set(key, val []byte)

	// MaybeApply calls Apply if the batch is getting close to full.
	// Every Batch has a limit to how many operations can be batched,
	// so in a bulk operation where atomicity of the entire batch is not a concern,
	// calling MaybeApply gives the Batch implementation
	// permission to flush the batch at specific “safe points”.
	// A typical limit for a batch is about 100MB worth of logged operations.
	// MaybeApply reports whether it called Apply.
	MaybeApply() bool

	// Apply applies all the batched operations to the underlying DB
	// as a single atomic unit.
	// When Apply returns, the Batch is an empty batch ready for
	// more operations.
	Apply()
}

A Batch accumulates database mutations that are applied to a DB as a single atomic operation. Applying bulk operations in a batch is also more efficient than making individual DB method calls. The batched operations apply in the order they are made. For example, Set("a", "b") followed by Delete("a") is the same as Delete("a"), while Delete("a") followed by Set("a", "b") is the same as Set("a", "b").

type DB

type DB interface {
	// Lock acquires a lock on the given name, which need not exist in the database.
	// After a successful Lock(name),
	// any other call to Lock(name) from any other client of the database
	// (including in another process, for shared databases)
	// must block until Unlock(name) has been called.
	// In a shared database, a lock may also unlock
	// when the client disconnects or times out.
	Lock(name string)

	// Unlock releases the lock with the given name,
	// which the caller must have locked.
	Unlock(name string)

	// Set sets the value associated with key to val.
	// The key must not be of length zero.
	Set(key, val []byte)

	// Get looks up the value associated with key.
	// If there is no entry for key in the database, Get returns nil, false.
	// Otherwise it returns val, true.
	Get(key []byte) (val []byte, ok bool)

	// Scan returns an iterator over all key-value pairs with start ≤ key ≤ end.
	// The second value in each iteration pair is a function returning the value,
	// not the value itself:
	//
	//	for key, getVal := range db.Scan([]byte("aaa"), []byte("zzz")) {
	//		val := getVal()
	//		fmt.Printf("%q: %q\n", key, val)
	//	}
	//
	// In iterations that only need the keys or only need the values for a subset of keys,
	// some DB implementations may avoid work when the value function is not called.
	//
	// A Scan may or may not observe concurrent modifications made
	// using Set, Delete, and DeleteRange.
	Scan(start, end []byte) iter.Seq2[[]byte, func() []byte]

	// Delete deletes any value associated with key.
	// Delete of an unset key is a no-op.
	Delete(key []byte)

	// DeleteRange deletes all key-value pairs with start ≤ key ≤ end.
	DeleteRange(start, end []byte)

	// Batch returns a new [Batch] that accumulates database mutations
	// to apply in an atomic operation. In addition to the atomicity, using a
	// Batch for bulk operations is more efficient than making each
	// change using repeated calls to DB's Set, Delete, and DeleteRange methods.
	Batch() Batch

	// Flush flushes DB changes to permanent storage.
	// Flush must be called before the process crashes or exits,
	// or else any changes since the previous Flush may be lost.
	Flush()

	// Close flushes and then closes the database.
	// Like the other routines, it panics if an error happens,
	// so there is no error result.
	Close()

	// Panic logs the error message and args using the database's slog.Logger
	// and then panics with the text formatting of its arguments.
	// It is meant to be called when database corruption or other
	// database-related “can't happen” conditions have been detected.
	Panic(msg string, args ...any)
}

A DB is a key-value database.

DB operations are assumed not to fail. They panic, intending to take down the program, if there is an error accessing the database. The assumption is that the program cannot possibly continue without the database, since that's where all the state is stored. Similarly, clients of DB conventionally panic using [DB.Panic] if the database returns corrupted data. Code using multiple parallel database operations can recover at the outermost calls.

func MemDB

func MemDB() DB

MemDB returns an in-memory DB implementation.

func NewOverlayDB

func NewOverlayDB(overlay, base DB) DB

NewOverlayDB returns a DB that combines overlay with base. Reads happen from the overlay first, then the base. All writes go to the overlay.

An overlay DB should only be used for testing. It can violate the specification of DB when a process is writing to the base concurrently. Locks held in the overlay will not be observed by the base, so changes from other processes can occur even while the process with the overlay has the lock.

The overlay DB assumes that all keys are encoded with rsc.io/ordered. The part of the key space beginning with ordered.Encode(overlayPrefix) in the overlay DB is reserved for use by the implementation.

type MemLocker

type MemLocker struct {
	// contains filtered or unexported fields
}

A MemLocker is a single-process implementation of the database Lock and Unlock methods, suitable if there is only one process accessing the database at a time.

The zero value for a MemLocker is a valid MemLocker with no locks held. It must not be copied after first use.

func (*MemLocker) Lock

func (l *MemLocker) Lock(name string)

Lock locks the mutex with the given name.

func (*MemLocker) Unlock

func (l *MemLocker) Unlock(name string)

Unlock unlocks the mutex with the given name.

type VectorBatch

type VectorBatch interface {
	// Set sets the vector associated with the given document ID to vec.
	Set(id string, vec llm.Vector)

	// Delete deletes any vector associated with document ID key.
	// Delete of an unset key is a no-op.
	Delete(id string)

	// MaybeApply calls Apply if the VectorBatch is getting close to full.
	// Every VectorBatch has a limit to how many operations can be batched,
	// so in a bulk operation where atomicity of the entire batch is not a concern,
	// calling MaybeApply gives the VectorBatch implementation
	// permission to flush the batch at specific “safe points”.
	// A typical limit for a batch is about 100MB worth of logged operations.
	//
	// MaybeApply reports whether it called Apply.
	MaybeApply() bool

	// Apply applies all the batched operations to the underlying VectorDB
	// as a single atomic unit.
	// When Apply returns, the VectorBatch is an empty batch ready for
	// more operations.
	Apply()
}

A VectorBatch accumulates vector database mutations that are applied to a VectorDB in a single atomic operation. Applying bulk operations in a batch is also more efficient than making individual VectorDB method calls. The batched operations apply in the order they are made.

type VectorDB

type VectorDB interface {
	// Set sets the vector associated with the given document ID to vec.
	// The id argument must not be empty.
	Set(id string, vec llm.Vector)

	// Delete deletes any vector associated with document ID key.
	// Delete of an unset key is a no-op.
	Delete(id string)

	// Get gets the vector associated with the given document ID.
	// If no such document exists, Get returns nil, false.
	// If a document exists, Get returns vec, true.
	Get(id string) (llm.Vector, bool)

	// All returns an iterator over all ID-vector pairs in the vector db.
	// The second value in each iteration pair is a function returning a
	// vector, not the vector itself:
	//
	//	for key, getVec := range vecdb.All() {
	//		vec := getVec()
	//		fmt.Printf("%q: %q\n", key, vec)
	//	}
	//
	// The pairs are ordered in lexicographic order of IDs.
	// In iterations that only need the keys or only need the vectors for a subset of keys,
	// some VectorDB implementations may avoid work when the value function is not called.
	All() iter.Seq2[string, func() llm.Vector]

	// Batch returns a new [VectorBatch] that accumulates
	// vector database mutations to apply in an atomic operation.
	// It is more efficient than repeated calls to Set.
	Batch() VectorBatch

	// Search searches the database for the n vectors
	// most similar to vec, returning the document IDs
	// and similarity scores.
	//
	// Normally a VectorDB is used entirely with vectors of a single length.
	// Search ignores stored vectors with a different length than vec.
	Search(vec llm.Vector, n int) []VectorResult

	// Flush flushes storage to disk.
	Flush()
}

A VectorDB is a vector database that implements nearest-neighbor search over embedding vectors corresponding to documents.

func MemVectorDB

func MemVectorDB(db DB, lg *slog.Logger, namespace string) VectorDB

MemVectorDB returns a VectorDB that stores its vectors in db but uses a cached, in-memory copy to implement Search using a brute-force scan.

The namespace is incorporated into the keys used in the underlying db, to allow multiple vector databases to be stored in a single DB.

When MemVectorDB is called, it reads all previously stored vectors from db; after that, changes must be made using the MemVectorDB Set method.

A MemVectorDB requires approximately 3kB of memory per stored vector.

The db keys used by a MemVectorDB have the form

ordered.Encode("llm.Vector", namespace, id)

where id is the document ID passed to Set.

type VectorResult

type VectorResult struct {
	ID    string  // document ID
	Score float64 // similarity score in range [0, 1]; 1 is exact match
}

A VectorResult is a single document returned by a VectorDB search.

Directories

Path Synopsis
Package timed implements modification-time-indexed storage that can be accessed in modification-time order as well as key order.
Package timed implements modification-time-indexed storage that can be accessed in modification-time order as well as key order.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL