billy

package module

v0.0.0-...-7527a25 Latest Latest Go to latest Published: Oct 12, 2024 License: BSD-3-Clause Imports: 10 Imported by: 81

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/holiman/billy

Links

Open Source Insights

README ¶

Billy

Billy (previously BagDB) is a super simple datastore. It can't quite be called a database, because it avoids implementing some of the most complex parts of an actual database. It's intended to be used in very particular circumstances.

billy

It is named after the bookcase, because it's very much like a bookcase, with N shelves of various heights.

Cheats

A 'proper' database is very complex, and has to solve several difficult problems:

Maintain an index. This serves several purposes:
- It decouples the key, which is the 'external' identifier, from the actual internal storage representation. This gives the database freedom to store the data 'wherever' it fits best. Having this decoupling means that when elements are deleted, other pieces of data can be moved to overwrite the freed element, and the database can be made more compact.
- The index allows for the external caller to 'forget' the existence of a piece of data, and later on, when it discovers that it needs the data represented by key, it can query the database for it, and the database can look it up from disk and hand it back to the process.

What if?

But what if we don't need to maintain an index? There are two obvious cavats here:

Without an index, we cannot move the data around after it's written. Compaction will not be possible. This can lead to fragmentation; where inserts/deletes fragments the buffer space, caused by differently-sized data-chunks, eventually deteriorating performance due to many small gaps spread out across the entire storage space.
Without an index, the external caller can no longer 'forget' about the data, and later query for a specific key. This means that the usability is limited to datasets which can be / will be backed by an in-memory reference map.

What are the upsides?

Without an index, we don't need to maintain a complex index-implementation, but can be very low on resource consumption. No extra allocated memory for index maintenance, no background threads for compaction work.

In practice

For the proposal by @karalabe about implementing a disk-backed transaction pool for geth, we have very special circumstances:

The disk-backed storage is indeed backed by an in-memory structure of metadata.
The payloads will roughly equally heavy on write, delete and read operations.
The data is somewhat transient, meaning that it's expected that the mean-storage time for a piece of data is measured in minutes rather than weeks.

Implementation

The bagdb uses has the following API:

Put(data []byte) uint64. This operation stores the given data, and returns a 'direct' reference to where the data is stored. By 'direct', it means that there is no indirection involved, the returned key is a direct reference to the shelf and slot where the data can later be found.
- The billy uses a set of shelves. Each shelf has a dynamic number of slots, where each slot within a shelf is a fixed size. This design is meant to alleviate the fragmentation problem: if a piece of data is 168 bytes, and our shelf sizes are 100, 200, 400, 800, 1600 .... , then billy will choose the shelf with 200 byte size. The 168 bytes of data will be placed into shelf 1, at the first free slot.
Delete(key uint64). The delete operation will simply look up the shelf, and tell the shelf that the identified slot now is free for re-use. It will be overwritten during a later Put operation.
Get(key uint64). Again, this is a very trivial operation: find the shelf, load the identified slot, and return.

shelves

Compaction

Saying that we can't do compaction is not strictly true: there are two things that can be (and are) done to minimize the disk usage.

Truncate-on-delete
- Truncate-on-delete is what it sounds like: when we delete items at the end of the file, we truncate the file. This has a slight performance hit: a normal delete never touches the disk, but only appends to the in-memory gap slice. In order to increase the chance for an opportunity to delete, the gaps are kept sorted, and we always prefer writing to lower gaps, leaving the higher gaps for later.
Compact-on-open
- Compact-on-open uses the fact that before the external calles is notified about the data content, we have the freedom to reorder the data, and uses this period overwrite any gaps and truncate the underlying file.

Data format

The identifer for accessing an item, a uint64 is composed as follows:

bit	range	usage
0-23	24 bits, `16M`	reserved for future use
23-35	12 bits, `4K`	`shelf id` - Shelf identifier
35-63	28 bits, `256M`	`slotkey` - slot identifier

The items themselves are stored with size as a 32-bit big-endian encoded integer, followed by the item itself. The 'slack-space' after size is not cleared, so might contain old data.

uint32: size | <data>

Documentation ¶

Index ¶

Variables
type Database
- func Open(opts Options, slotSizeFn SlotSizeFn, onData OnDataFn) (Database, error)
type Infos
type OnDataFn
type Options
type ShelfInfos
type SlotSizeFn
- func SlotSizeLinear(size, count int) SlotSizeFn
- func SlotSizePowerOfTwo(min, max uint32) SlotSizeFn

Constants ¶

This section is empty.

Variables ¶

View Source

var (
	ErrClosed      = errors.New("shelf closed")
	ErrOversized   = errors.New("data too large for shelf")
	ErrBadIndex    = errors.New("bad index")
	ErrEmptyData   = errors.New("empty data")
	ErrReadonly    = errors.New("read-only mode")
	ErrCorruptData = errors.New("corrupt data")
)

View Source

var (
	Magic           = [5]byte{'b', 'i', 'l', 'l', 'y'}
	ShelfHeaderSize = binary.Size(shelfHeader{})
)

Functions ¶

This section is empty.

Types ¶

type Database ¶

type Database interface {
	io.Closer

	// Put stores the data to the underlying database, and returns the key needed
	// for later accessing the data.
	// The data is copied by the database, and is safe to modify after the method returns
	Put(data []byte) (uint64, error)

	// Get retrieves the data stored at the given key.
	Get(key uint64) ([]byte, error)

	// Delete marks the data for deletion, which means it will (eventually) be
	// overwritten by other data. After calling Delete with a given key, the results
	// from doing Get(key) is undefined -- it may return the same data, or some other
	// data, or fail with an error.
	Delete(key uint64) error

	// Size returns the storage size of the value belonging to the given key.
	Size(key uint64) uint32

	// Limits returns the smallest and largest slot size.
	Limits() (uint32, uint32)

	// Infos retrieves various internal statistics about the database.
	Infos() *Infos

	// Iterate iterates through all the data in the database, and invokes the
	// given onData method for every element
	Iterate(onData OnDataFn) error
}

Database represents a `billy` storage.

func Open ¶

func Open(opts Options, slotSizeFn SlotSizeFn, onData OnDataFn) (Database, error)

Open opens a (new or existing) database, with configurable limits. The given slotSizeFn will be used to determine both the shelf sizes and the number of shelves. The function must yield values in increasing order.

If shelf already exists, they are opened and read, in order to populate the internal gap-list. While doing so, it's a good opportunity for the caller to read the data out, (which is probably desirable), which can be done using the optional onData callback.

type Infos ¶

type Infos struct {
	Shelves []*ShelfInfos
}

Infos contains a set of statistics about the underlying datastore.

type OnDataFn ¶

type OnDataFn func(key uint64, size uint32, data []byte)

OnDataFn is used to iterate the entire dataset in the database. After the method returns, the content of 'data' will be modified by the iterator, so it needs to be copied if it is to be used later.

type Options ¶

type Options struct {
	Path     string
	Readonly bool
	Repair   bool
	Snappy   bool // unused for now
}

type ShelfInfos ¶

type ShelfInfos struct {
	SlotSize    uint32
	FilledSlots uint64
	GappedSlots uint64
}

ShelfInfos contains some statistics about the data stored in a single shelf.

type SlotSizeFn ¶

type SlotSizeFn func() (size uint32, done bool)

SlotSizeFn is a method that acts as a "generator": a closure which, at each invocation, should spit out the next slot-size. In order to create a database with three shelves invocation of the method should return e.g.

10, false
20, false
30, true

OBS! The slot size must take item header size (4 bytes) into account. So if you plan to store 120 bytes, then the slot needs to be at least 124 bytes large.

func SlotSizeLinear ¶

func SlotSizeLinear(size, count int) SlotSizeFn

SlotSizeLinear is a SlotSizeFn which arranges the slots in shelves which increase linearly.

func SlotSizePowerOfTwo ¶

func SlotSizePowerOfTwo(min, max uint32) SlotSizeFn

SlotSizePowerOfTwo is a SlotSizeFn which arranges the slots in shelves which double in size for each level.

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
cmd
billy
billyfuzz

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL