roaring

package
v1.5.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 20, 2024 License: AGPL-3.0, Apache-2.0 Imports: 15 Imported by: 0

README

sroar: Serialized Roaring Bitmaps

sroar is a re-written version of Roaring Bitmaps in Go, with the aim to have equality between in-memory representation and on-disk representation. An sroar.Bitmap does not need to be marshalled or unmarshalled, as the underlying represetation is a byte slice. Therefore, it can be written to disk, brought to memory, or shipped over the network immediately. This is needed in Dgraph, where we need to deal with lots of bitmaps.

sroar only implements array and bitmap containers. It does NOT implement run containers, which is an optimization that RoaringBitmaps has. Despite that, it outperforms RoaringBitmaps as shown in the Benchmarks section.

The code borrows concepts and code from RoaringBitmaps.

Benchmarks

The benchmarks were run:

  • Using real data set as described in RoaringBitmaps.
  • Only on the 64-bit version of roaring bitmaps (roaring64).
  • Only on FastOr, which is the more expensive operation than And or equivalent.
  • On AMD Ryzen Threadripper 2950X 16-Core Processor.
  • Using Go benchmarks serially.

Based on the benchmarks, sroar is:

  • 6.5x faster (-85% p50) for benchmarks >1ms, uses
  • 15x (-93.5% p50) less memory for allocations >1MB.
  • 25x fewer allocations.

The benchmark command and the results are:

$ go test -bench BenchmarkRealDataFastOr --run=XXX --count=5 --benchmem

name CPU                                    old time/op    new time/op    delta
RealDataFastOr/census1881-32                 302ms ± 2%       2ms ± 3%   -99.29%  (p=0.008 n=5+5)
RealDataFastOr/wikileaks-noquotes-32        76.5ms ± 1%     0.9ms ± 1%   -98.83%  (p=0.008 n=5+5)
RealDataFastOr/wikileaks-noquotes_srt-32    34.8ms ± 5%     1.0ms ± 2%   -97.07%  (p=0.008 n=5+5)
RealDataFastOr/dimension_033-32             55.0ms ± 3%     2.7ms ± 0%   -95.16%  (p=0.016 n=5+4)
RealDataFastOr/census1881_srt-32            36.8ms ± 3%     2.9ms ± 1%   -92.13%  (p=0.008 n=5+5)
RealDataFastOr/dimension_003-32             50.4ms ± 1%    11.6ms ± 4%   -77.06%  (p=0.008 n=5+5)
RealDataFastOr/dimension_008-32             10.0ms ± 2%     3.7ms ± 2%   -62.69%  (p=0.008 n=5+5)
RealDataFastOr/weather_sept_85_srt-32       6.13ms ± 3%    2.72ms ± 2%   -55.66%  (p=0.008 n=5+5)
RealDataFastOr/census-income-32             1.70ms ± 3%    1.05ms ± 1%   -38.53%  (p=0.008 n=5+5)
RealDataFastOr/weather_sept_85-32           2.28ms ± 2%    4.07ms ± 2%   +78.52%  (p=0.008 n=5+5)

RealDataFastOr/uscensus2000-32               556µs ± 2%     791µs ± 1%   +42.17%  (p=0.008 n=5+5)
RealDataFastOr/census-income_srt-32          260µs ± 4%     986µs ± 2%  +279.09%  (p=0.008 n=5+5)

name MEM_BYTES                             old alloc/op   new alloc/op   delta
RealDataFastOr/census1881-32                 585MB ± 0%       1MB ± 0%   -99.75%  (p=0.008 n=5+5)
RealDataFastOr/wikileaks-noquotes-32        76.3MB ± 0%     0.6MB ± 0%   -99.24%  (p=0.008 n=5+5)
RealDataFastOr/wikileaks-noquotes_srt-32    22.8MB ± 0%     0.6MB ± 0%   -97.46%  (p=0.008 n=5+5)
RealDataFastOr/census1881_srt-32            15.3MB ± 0%     1.4MB ± 0%   -90.58%  (p=0.008 n=5+5)
RealDataFastOr/dimension_003-32             7.78MB ± 0%    1.44MB ± 0%   -81.49%  (p=0.008 n=5+5)
RealDataFastOr/dimension_033-32             1.10MB ± 0%    1.44MB ± 0%   +30.92%  (p=0.008 n=5+5)

RealDataFastOr/dimension_008-32              537kB ± 0%      97kB ± 0%   -81.94%  (p=0.008 n=5+5)
RealDataFastOr/census-income-32              187kB ± 0%      70kB ± 0%   -62.86%  (p=0.008 n=5+5)
RealDataFastOr/census-income_srt-32         99.1kB ± 0%    69.6kB ± 0%   -29.81%  (p=0.008 n=5+5)
RealDataFastOr/weather_sept_85_srt-32        375kB ± 0%     292kB ± 0%   -21.95%  (p=0.008 n=5+5)
RealDataFastOr/uscensus2000-32               169kB ± 0%     231kB ± 0%   +36.97%  (p=0.008 n=5+5)
RealDataFastOr/weather_sept_85-32            169kB ± 0%     292kB ± 0%   +72.93%  (p=0.008 n=5+5)

name MEM_ALLOCS                           old allocs/op  new allocs/op  delta
RealDataFastOr/census1881_srt-32             29.7k ± 0%      0.0k ± 0%   -99.91%  (p=0.008 n=5+5)
RealDataFastOr/wikileaks-noquotes_srt-32     6.06k ± 0%     0.02k ± 0%   -99.74%  (p=0.008 n=5+5)
RealDataFastOr/dimension_003-32              4.57k ± 0%     0.03k ± 2%   -99.42%  (p=0.008 n=5+5)
RealDataFastOr/dimension_033-32              4.33k ± 0%     0.03k ± 0%   -99.38%  (p=0.000 n=5+4)
RealDataFastOr/uscensus2000-32               1.75k ± 0%     0.06k ± 0%   -96.85%  (p=0.008 n=5+5)
RealDataFastOr/dimension_008-32                704 ± 0%        23 ± 3%   -96.79%  (p=0.008 n=5+5)
RealDataFastOr/census-income-32                271 ± 0%         9 ± 0%   -96.68%  (p=0.008 n=5+5)
RealDataFastOr/weather_sept_85_srt-32          248 ± 0%        14 ± 0%   -94.35%  (p=0.008 n=5+5)
RealDataFastOr/weather_sept_85-32             81.0 ± 0%      14.0 ± 0%   -82.72%  (p=0.008 n=5+5)
RealDataFastOr/census-income_srt-32           40.0 ± 0%       9.0 ± 0%   -77.50%  (p=0.008 n=5+5)
RealDataFastOr/census1881-32                 54.5k ± 0%      0.0k ± 0%      ~     (p=0.079 n=4+5)
RealDataFastOr/wikileaks-noquotes-32         39.2k ± 0%      0.0k ± 0%      ~     (p=0.079 n=4+5)

Documentation

Index

Constants

View Source
const (
	// Min64BitSigned - Minimum 64 bit value
	Min64BitSigned = -9223372036854775808
	// Max64BitSigned - Maximum 64 bit value
	Max64BitSigned = 9223372036854775807
)

Variables

This section is empty.

Functions

func AndCardinality

func AndCardinality(a, b *Bitmap) (answer uint64)

func Memclr

func Memclr(b []uint16)

Types

type BSI

type BSI struct {
	MaxValue int64
	MinValue int64
	// contains filtered or unexported fields
}

BSI is at its simplest is an array of bitmaps that represent an encoded binary value. The advantage of a BSI is that comparisons can be made across ranges of values whereas a bitmap can only represent the existence of a single value for a given column ID. Another usage scenario involves storage of high cardinality values.

It depends upon the bitmap libraries. It is not thread safe, so upstream concurrency guards must be provided.

func NewBSI

func NewBSI(maxValue int64, minValue int64) *BSI

NewBSI constructs a new BSI. Note that it is your responsibility to ensure that the min/max values are set correctly. Queries CompareValue, MinMax, etc. will not work correctly if the min/max values are not set correctly.

func NewBSIFromBuffer

func NewBSIFromBuffer(data []byte) *BSI

func NewDefaultBSI

func NewDefaultBSI() *BSI

NewDefaultBSI constructs an auto-sized BSI

func (*BSI) Append added in v1.4.1

func (b *BSI) Append(offsets []uint32, data []byte) ([]uint32, []byte)

func (*BSI) BitCount

func (b *BSI) BitCount() int

BitCount returns the number of bits needed to represent values.

func (*BSI) CompareValue

func (b *BSI) CompareValue(parallelism int, op Operation, valueOrStart, end int64,
	foundSet *Bitmap) *Bitmap

CompareValue compares value. Values should be in the range of the BSI (max, min). If the value is outside the range, the result might erroneous. The operation parameter indicates the type of comparison to be made. For all operations with the exception of RANGE, the value to be compared is specified by valueOrStart. For the RANGE parameter the comparison criteria is >= valueOrStart and <= end. The parallelism parameter indicates the number of CPU threads to be applied for processing. A value of zero indicates that all available CPU resources will be potentially utilized.

func (*BSI) Each added in v1.4.1

func (b *BSI) Each(f func(idx byte, bs *Bitmap) error) error

func (*BSI) Extract

func (b *BSI) Extract(foundSet *Bitmap) map[uint64]int64

func (*BSI) GetCardinality

func (b *BSI) GetCardinality() uint64

GetCardinality returns a count of unique column IDs for which a value has been set.

func (*BSI) GetExistenceBitmap

func (b *BSI) GetExistenceBitmap() *Bitmap

GetExistenceBitmap returns a pointer to the underlying existence bitmap of the BSI

func (*BSI) GetSizeInBytes

func (b *BSI) GetSizeInBytes() int

GetSizeInBytes - the size in bytes of the data structure

func (*BSI) GetValue

func (b *BSI) GetValue(columnID uint64) (value int64, exists bool)

GetValue gets the value at the column ID. Second param will be false for non-existent values.

func (*BSI) MinMax

func (b *BSI) MinMax(parallelism int, op Operation, foundSet *Bitmap) int64

MinMax - Find minimum or maximum value.

func (*BSI) Or

func (a *BSI) Or(b *BSI) *BSI

We only perform Or on a and b. we don't want to modify a or b because there is a posibility a is read from buffer which may corrupt the backing slice..

func (*BSI) Reset

func (b *BSI) Reset()

func (*BSI) SetValue

func (b *BSI) SetValue(columnID uint64, value int64)

SetValue sets a value for a given columnID.

func (*BSI) String

func (b *BSI) String() string

func (*BSI) Sum

func (b *BSI) Sum(foundSet *Bitmap) (sum int64, count uint64)

Sum all values contained within the foundSet. As a convenience, the cardinality of the foundSet is also returned (for calculating the average).

func (*BSI) ToBuffer

func (b *BSI) ToBuffer() []byte

func (*BSI) ToBufferWith

func (b *BSI) ToBufferWith(offsets []uint32, data []byte) ([]uint32, []byte)

func (*BSI) ValueExists

func (b *BSI) ValueExists(columnID uint64) bool

ValueExists tests whether the value exists.

type Batch added in v1.5.1

type Batch [][]uint64

func (Batch) Run added in v1.5.1

func (b Batch) Run(ra *Bitmap, action BatchAction)

type BatchAction added in v1.5.1

type BatchAction func(idx int, next Next)

type Bitmap

type Bitmap struct {
	// contains filtered or unexported fields
}

func And

func And(a, b *Bitmap) *Bitmap

func FastAnd

func FastAnd(bitmaps ...*Bitmap) *Bitmap

func FastOr

func FastOr(bitmaps ...*Bitmap) *Bitmap

FastOr would merge given Bitmaps into one Bitmap. This is faster than doing an OR over the bitmaps iteratively.

func FastParOr

func FastParOr(numGo int, bitmaps ...*Bitmap) *Bitmap

FastParOr would group up bitmaps and call FastOr on them concurrently. It would then merge the groups into final Bitmap. This approach is simpler and faster than operating at a container level, because we can't operate on array containers belonging to the same Bitmap concurrently because array containers can expand, leaving no clear boundaries.

If FastParOr is called with numGo=1, it just calls FastOr.

Experiments with numGo=4 shows that FastParOr would be 2x the speed of FastOr, but 4x the memory usage, even under 50% CPU usage. So, use wisely.

func FromBuffer

func FromBuffer(data []byte) *Bitmap

FromBuffer returns a pointer to bitmap corresponding to the given buffer. This bitmap shouldn't be modified because it might corrupt the given buffer.

func FromBufferWithCopy

func FromBufferWithCopy(src []byte) *Bitmap

FromBufferWithCopy creates a copy of the given buffer and returns a bitmap based on the copied buffer. This bitmap is safe for both read and write operations.

func FromSortedList

func FromSortedList(vals []uint64) *Bitmap

func NewBitmap

func NewBitmap() *Bitmap

func NewBitmapWith

func NewBitmapWith(numKeys int) *Bitmap

func Or

func Or(a, b *Bitmap) *Bitmap

func (*Bitmap) And

func (ra *Bitmap) And(bm *Bitmap)

func (*Bitmap) AndCardinality

func (b *Bitmap) AndCardinality(a *Bitmap) (answer uint64)

func (*Bitmap) AndNot

func (ra *Bitmap) AndNot(bm *Bitmap)

func (*Bitmap) BSI added in v1.5.1

func (ra *Bitmap) BSI(id uint64, svalue int64)

func (*Bitmap) BSISum added in v1.5.1

func (ra *Bitmap) BSISum(shard uint64, match *Bitmap) (sum int64)

func (*Bitmap) Batch added in v1.5.1

func (ra *Bitmap) Batch() Batch

func (*Bitmap) Bool added in v1.5.1

func (ra *Bitmap) Bool(id uint64, value bool)

func (*Bitmap) Cleanup

func (ra *Bitmap) Cleanup()

func (*Bitmap) Clone

func (ra *Bitmap) Clone() *Bitmap

func (*Bitmap) Contains

func (ra *Bitmap) Contains(x uint64) bool

func (*Bitmap) Debug

func (ra *Bitmap) Debug(x uint64) string

func (*Bitmap) Each

func (ra *Bitmap) Each(f func(value uint64))

func (*Bitmap) ExtractBSI added in v1.5.1

func (ra *Bitmap) ExtractBSI(shard uint64, match *Bitmap, f func(id uint64, value int64))

func (*Bitmap) ExtractMutex added in v1.5.1

func (ra *Bitmap) ExtractMutex(match *Bitmap, f func(row uint64, columns *Bitmap) error) error

func (*Bitmap) False added in v1.5.1

func (ra *Bitmap) False(shard uint64, match *Bitmap) *Bitmap

func (*Bitmap) GetCardinality

func (ra *Bitmap) GetCardinality() int

func (*Bitmap) GetSizeInBytes

func (ra *Bitmap) GetSizeInBytes() int

func (*Bitmap) IsEmpty

func (ra *Bitmap) IsEmpty() bool

func (*Bitmap) ManyIterator

func (r *Bitmap) ManyIterator() *ManyItr

TODO: See if this is needed, we should remove this

func (*Bitmap) Maximum

func (ra *Bitmap) Maximum() uint64

func (*Bitmap) Minimum

func (ra *Bitmap) Minimum() uint64

func (*Bitmap) Mutex added in v1.5.1

func (ra *Bitmap) Mutex(id uint64, value uint64)

func (*Bitmap) NewIterator

func (bm *Bitmap) NewIterator() *Iterator

func (*Bitmap) NewRangeIterators

func (bm *Bitmap) NewRangeIterators(numRanges int) []*Iterator

func (*Bitmap) OffsetRange added in v1.5.1

func (ra *Bitmap) OffsetRange(offset, start, end uint64) *Bitmap

func (*Bitmap) Or

func (dst *Bitmap) Or(src *Bitmap)

TODO: Check if we want to use lazyMode

func (*Bitmap) Rank

func (ra *Bitmap) Rank(x uint64) int

func (*Bitmap) Remove

func (ra *Bitmap) Remove(x uint64) bool

func (*Bitmap) RemoveRange

func (ra *Bitmap) RemoveRange(lo, hi uint64)

Remove range removes [lo, hi) from the bitmap.

func (*Bitmap) Reset

func (ra *Bitmap) Reset()

func (*Bitmap) Row added in v1.5.1

func (ra *Bitmap) Row(shard, rowID uint64) *Bitmap

func (*Bitmap) Select

func (ra *Bitmap) Select(x uint64) (uint64, error)

Select returns the element at the xth index. (0-indexed)

func (*Bitmap) Set

func (ra *Bitmap) Set(x uint64) bool

func (*Bitmap) SetMany

func (ra *Bitmap) SetMany(vals []uint64)

TODO: Potentially this can be optimized.

func (*Bitmap) Split

func (bm *Bitmap) Split(externalSize func(start, end uint64) uint64, maxSz uint64) []*Bitmap

Split splits the bitmap based on maxSz and the externalSize function. It splits the bitmap such that size of each split bitmap + external size corresponding to its elements approximately equal to maxSz (it can be greater than maxSz sometimes). The splits are returned in sorted order. externalSize is a function that should return the external size corresponding to elements in range [start, end]. External size is used to calculate the split boundaries.

func (*Bitmap) String

func (ra *Bitmap) String() string

func (*Bitmap) ToArray

func (ra *Bitmap) ToArray() []uint64

func (*Bitmap) ToBuffer

func (ra *Bitmap) ToBuffer() []byte

func (*Bitmap) ToBufferWithCopy

func (ra *Bitmap) ToBufferWithCopy() []byte

func (*Bitmap) True added in v1.5.1

func (ra *Bitmap) True(shard uint64, match *Bitmap) *Bitmap

type Iterator

type Iterator struct {
	// contains filtered or unexported fields
}

func (*Iterator) Next

func (it *Iterator) Next() uint64

type ManyItr

type ManyItr struct {
	// contains filtered or unexported fields
}

func (*ManyItr) NextMany

func (itr *ManyItr) NextMany(buf []uint64) int

type Next added in v1.5.1

type Next func() (uint64, bool)

type Operation

type Operation int

Operation identifier

const (
	// LT less than
	LT Operation = 1 + iota
	// LE less than or equal
	LE
	// EQ equal
	EQ
	// GE greater than or equal
	GE
	// GT greater than
	GT
	// RANGE range
	RANGE
	// MIN find minimum
	MIN
	// MAX find maximum
	MAX
)

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL