sorted_array

package module
v0.0.0-...-e5f8025
Published: Jul 1, 2023 License: MIT Imports: 12 Imported by: 0

README

Sorted Array Lib

This package lets you work with huge sorted arrays efficiently by splitting them into chunks under the hood. The basic idea is simple: avoid loading all the numbers into memory and work only with smaller parts of the original array (chunking).

The motivation for this package is building inverted indexes. The main requirement for that use case is being memory- and space-efficient. I chose uint32 as the element type to support indexing of Unix timestamps.

Usage Sample

storage := NewInMemoryChunkStorage()
maxChunkSize := uint32(1000)
arr := NewSortedArray(maxChunkSize, storage)

arr.Add([]uint32{10, 20, 30, 40, 50})
arr.Delete([]uint32{10, 30, 50})
arr.Flush() // write to the storage

require.EqualValues(t, []uint32{20, 40}, arr.ToSlice())

Storage

To store chunks, one needs to implement this interface:

type ChunkStorage interface {
	Read(chunkIds []uint32) (map[uint32]*Chunk, error)
	Save(chunks map[uint32]*Chunk) error
	Remove(chunkIds []uint32) error

	ReadMeta() (*Meta, error)
	SaveMeta(*Meta) error
}

Two implementations are present in this package:

  • in-memory (used for testing purposes)
  • sqlite
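For a custom backend, an implementation can stay fairly small. The following is only a hypothetical sketch of a byte-map-backed storage (the type byteMapStorage and its behaviour for missing chunks and empty meta are assumptions, not part of this package); it reuses the package's Serialize, UnserializeChunk, UnserializeMeta and NewMeta helpers and needs the "fmt" import:

type byteMapStorage struct {
	chunks map[uint32][]byte // serialized chunks keyed by chunk id
	meta   []byte            // serialized Meta, nil until the first SaveMeta
}

func newByteMapStorage() *byteMapStorage {
	return &byteMapStorage{chunks: make(map[uint32][]byte)}
}

func (s *byteMapStorage) Read(chunkIds []uint32) (map[uint32]*Chunk, error) {
	out := make(map[uint32]*Chunk, len(chunkIds))
	for _, id := range chunkIds {
		data, ok := s.chunks[id]
		if !ok {
			return nil, fmt.Errorf("chunk %d not found", id) // assumption: a missing chunk is an error
		}
		c, err := UnserializeChunk(data)
		if err != nil {
			return nil, err
		}
		out[id] = c
	}
	return out, nil
}

func (s *byteMapStorage) Save(chunks map[uint32]*Chunk) error {
	for id, c := range chunks {
		data, err := c.Serialize()
		if err != nil {
			return err
		}
		s.chunks[id] = data
	}
	return nil
}

func (s *byteMapStorage) Remove(chunkIds []uint32) error {
	for _, id := range chunkIds {
		delete(s.chunks, id)
	}
	return nil
}

func (s *byteMapStorage) ReadMeta() (*Meta, error) {
	if s.meta == nil {
		return NewMeta(), nil // assumption: an empty storage yields a fresh Meta
	}
	return UnserializeMeta(s.meta)
}

func (s *byteMapStorage) SaveMeta(m *Meta) error {
	data, err := m.Serialize()
	if err != nil {
		return err
	}
	s.meta = data
	return nil
}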

This package does not assume transactions. I had one particular use case in mind: SQLite as the storage. You would start an SQLite transaction, load this index within it, and work with the numbers; SQLite handles concurrent access. Within SQLite I use blobs to keep this array's chunks.
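A rough sketch of that flow (db, the surrounding function, and the key name are assumptions; only NewSqliteTxSortedArrayStorage, NewSortedArray, Add and Flush come from this package):

tx, err := db.Begin()
if err != nil {
	return err
}
defer tx.Rollback() // effectively a no-op after a successful Commit

storage := NewSqliteTxSortedArrayStorage(tx, []byte("my-index")) // key identifies this index's blobs
arr := NewSortedArray(1000, storage)

if err := arr.Add([]uint32{10, 20, 30}); err != nil {
	return err
}
if err := arr.Flush(); err != nil { // persist chunks and meta within the transaction
	return err
}
return tx.Commit()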

Compression

A sorted array is a perfect candidate for compression. It looks like the best algorithms were designed by Lemire: https://lemire.me/blog/2012/09/12/fast-integer-compression-decoding-billions-of-integers-per-second/
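To illustrate the underlying idea only (this snippet is not from this package or from Lemire's libraries): since the values are sorted, you can store the gaps between consecutive values instead of the values themselves, and small gaps encode into one or two bytes as varints. The sketch below uses only "encoding/binary" and "errors" from the standard library:

// deltaCompress stores the gaps between consecutive sorted values as uvarints.
// For dense sorted data the gaps are small, so most values take 1-2 bytes.
func deltaCompress(sorted []uint32) []byte {
	buf := make([]byte, 0, len(sorted))
	tmp := make([]byte, binary.MaxVarintLen32)
	prev := uint32(0)
	for _, v := range sorted {
		n := binary.PutUvarint(tmp, uint64(v-prev))
		buf = append(buf, tmp[:n]...)
		prev = v
	}
	return buf
}

// deltaDecompress reverses deltaCompress by summing the gaps back up.
func deltaDecompress(data []byte) ([]uint32, error) {
	var out []uint32
	prev := uint32(0)
	for len(data) > 0 {
		gap, n := binary.Uvarint(data)
		if n <= 0 {
			return nil, errors.New("corrupted varint stream")
		}
		prev += uint32(gap)
		out = append(out, prev)
		data = data[n:]
	}
	return out, nil
}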

Some Go implementations of these algorithms are available as well.

Concurrent Access And Garbage Collector

This lib loads chunks into memory. Which chunks to load is determined from the meta description and the input given to Add/Delete/GetInRange. Loaded chunks are released (so the GC can reclaim them) upon these events:

  • Flush() called
  • GetInRange() loads and frees one chunk at a time while iterating

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Chunk

type Chunk struct {
	Items []uint32
}

Chunk represents an ascending sorted array of numbers grouped together for faster processing. The element type was chosen to be uint32 so it can hold Unix timestamps in seconds, which is enough for general index purposes when there are not many events per second.

func NewChunk

func NewChunk(items []uint32) *Chunk

func UnserializeChunk

func UnserializeChunk(data []byte) (*Chunk, error)

func (*Chunk) Add

func (c *Chunk) Add(items []uint32) (added int)

Add inserts new values into the sorted array with just one allocation and returns the number of NEW elements added to the array.
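For example (a hypothetical snippet, values chosen arbitrarily):

c := NewChunk([]uint32{10, 20, 30})
added := c.Add([]uint32{20, 25}) // added == 1: 20 was already present, only 25 is new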

func (*Chunk) Contains

func (c *Chunk) Contains(item uint32) bool

func (*Chunk) GetInRange

func (c *Chunk) GetInRange(from, to uint32) []uint32

func (*Chunk) Remove

func (c *Chunk) Remove(itemsToRemove []uint32) (removed int)

func (*Chunk) Serialize

func (c *Chunk) Serialize() ([]byte, error)

type ChunkMeta

type ChunkMeta struct {
	// contains filtered or unexported fields
}

ChunkMeta is a lightweight description of a chunk, used to select relevant chunks before fetching them.

type ChunkStorage

type ChunkStorage interface {
	// Read, if err is nil, always returns a map of size len(chunkIds)
	Read(chunkIds []uint32) (map[uint32]*Chunk, error)
	Save(chunks map[uint32]*Chunk) error
	Remove(chunkIds []uint32) error

	ReadMeta() (*Meta, error)
	SaveMeta(*Meta) error
}

ChunkStorage does simple CRUD operations on persistent storage. Serialization (+compression) must be implemented at this level.

type InMemoryChunkStorage

type InMemoryChunkStorage struct {
	// contains filtered or unexported fields
}

func NewInMemoryChunkStorage

func NewInMemoryChunkStorage() *InMemoryChunkStorage

func (*InMemoryChunkStorage) Read

func (s *InMemoryChunkStorage) Read(chunkIds []uint32) (map[uint32]*Chunk, error)

func (*InMemoryChunkStorage) ReadMeta

func (s *InMemoryChunkStorage) ReadMeta() (*Meta, error)

func (*InMemoryChunkStorage) Remove

func (s *InMemoryChunkStorage) Remove(chunkIds []uint32) error

func (*InMemoryChunkStorage) Save

func (s *InMemoryChunkStorage) Save(chunks map[uint32]*Chunk) error

func (*InMemoryChunkStorage) SaveMeta

func (s *InMemoryChunkStorage) SaveMeta(meta *Meta) error

type Meta

type Meta struct {
	// contains filtered or unexported fields
}

Meta contains a list of SORTED chunk descriptions. No overlapping is allowed.

func NewMeta

func NewMeta() *Meta

func UnserializeMeta

func UnserializeMeta(data []byte) (*Meta, error)

func (*Meta) Add

func (m *Meta) Add(metas []*ChunkMeta)

func (*Meta) FindRelevantForInsert

func (m *Meta) FindRelevantForInsert(item uint32) []*ChunkMeta

FindRelevantForInsert returns chunks that could be used for insertion: ones whose [min,max] range includes the item, or ones that surround the item (chunk + item + chunk).

func (*Meta) FindRelevantForRead

func (m *Meta) FindRelevantForRead(item uint32) *ChunkMeta

FindRelevantForRead returns a pointer to a chunk description that CAN contain the item. nil means that no chunk CAN contain this item (used in Search).

func (*Meta) FindRelevantForReadRange

func (m *Meta) FindRelevantForReadRange(min, max uint32) []*ChunkMeta

FindRelevantForReadRange does the same as FindRelevantForRead but for a range.

func (*Meta) GetChunkById

func (m *Meta) GetChunkById(id uint32) *ChunkMeta

func (*Meta) Remove

func (m *Meta) Remove(meta *ChunkMeta)

func (*Meta) Serialize

func (m *Meta) Serialize() ([]byte, error)

func (*Meta) TakeNextId

func (m *Meta) TakeNextId() (id uint32)

TakeNextId starts from 0 and returns the NEXT available id

type SortedArray

type SortedArray struct {
	// contains filtered or unexported fields
}

SortedArray manages an ASC sorted array in chunks for better performance. Chunks contain up to maxChunkSize items and may not intersect with each other.

func NewSortedArray

func NewSortedArray(maxChunkSize uint32, s ChunkStorage) *SortedArray

func (*SortedArray) Add

func (a *SortedArray) Add(items []uint32) error

Add puts new items into the array.

func (*SortedArray) Delete

func (a *SortedArray) Delete(items []uint32) error

func (*SortedArray) Flush

func (a *SortedArray) Flush() error

func (*SortedArray) GetInRange

GetInRange returns a stream of items (min,max are INCLUDED)

func (*SortedArray) ToSlice

func (a *SortedArray) ToSlice() []uint32

ToSlice dumps the whole index into a single slice (for debugging/testing).

type SortedArraySqlTxStorage

type SortedArraySqlTxStorage struct {
	// contains filtered or unexported fields
}

SortedArraySqlTxStorage implements sorted array storage for SQLite. It uses blobs to store chunks and meta. The key is used to produce unique ids for the blobs in a shared table.

func NewSqliteTxSortedArrayStorage

func NewSqliteTxSortedArrayStorage(tx *sql.Tx, key []byte) *SortedArraySqlTxStorage

func (*SortedArraySqlTxStorage) Read

func (s *SortedArraySqlTxStorage) Read(chunkIds []uint32) (map[uint32]*Chunk, error)

func (*SortedArraySqlTxStorage) ReadMeta

func (s *SortedArraySqlTxStorage) ReadMeta() (*Meta, error)

func (*SortedArraySqlTxStorage) Remove

func (s *SortedArraySqlTxStorage) Remove(chunkIds []uint32) error

func (*SortedArraySqlTxStorage) Save

func (s *SortedArraySqlTxStorage) Save(chunks map[uint32]*Chunk) error

func (*SortedArraySqlTxStorage) SaveMeta

func (s *SortedArraySqlTxStorage) SaveMeta(meta *Meta) error
