roaring

v0.4.2
Published: Feb 6, 2018 | License: Apache-2.0

README


This is a Go port of the Roaring bitmap data structure.

Roaring bitmaps are used by several major systems such as Apache Lucene and derivative systems such as Solr and Elasticsearch, Metamarkets' Druid, LinkedIn Pinot, Netflix Atlas, Apache Spark, OpenSearchServer, Cloud Torrent, Whoosh, Pilosa, Microsoft Visual Studio Team Services (VSTS), and eBay's Apache Kylin.

Roaring bitmaps are found to work well in many important applications:

Use Roaring for bitmap compression whenever possible. Do not use other bitmap compression methods (Wang et al., SIGMOD 2017)

The roaring Go library is used by

There are also Java and C/C++ versions. The Java, C, C++ and Go versions are binary compatible: e.g., you can save bitmaps from a Java program and load them back in Go, and vice versa. We have a format specification.

This code is licensed under Apache License, Version 2.0 (ASL2.0).

Copyright 2016 by the authors.

References
  • Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai, Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience (to appear) arXiv:1709.07821
  • Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin, Better bitmap performance with Roaring bitmaps, Software: Practice and Experience 46 (5), 2016. http://arxiv.org/abs/1402.6407 This paper used data from http://lemire.me/data/realroaring2014.html
  • Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser, Consistently faster and smaller compressed bitmaps with Roaring, Software: Practice and Experience 46 (11), 2016. http://arxiv.org/abs/1603.06549
Dependencies

Dependencies are fetched automatically when you pass the -t flag to go get.

They include:

  • github.com/smartystreets/goconvey/convey
  • github.com/willf/bitset
  • github.com/mschoch/smat
  • github.com/glycerine/go-unsnap-stream
  • github.com/philhofer/fwd
  • github.com/jtolds/gls

Note that the smat library requires Go 1.6 or better.

Installation
  • go get -t github.com/RoaringBitmap/roaring
Example

Here is a simplified but complete example:

package main

import (
    "fmt"
    "github.com/RoaringBitmap/roaring"
    "bytes"
)


func main() {
    // example inspired by https://github.com/fzandona/goroar
    fmt.Println("==roaring==")
    rb1 := roaring.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
    fmt.Println(rb1.String())

    rb2 := roaring.BitmapOf(3, 4, 1000)
    fmt.Println(rb2.String())

    rb3 := roaring.NewBitmap()
    fmt.Println(rb3.String())

    fmt.Println("Cardinality: ", rb1.GetCardinality())

    fmt.Println("Contains 3? ", rb1.Contains(3))

    rb1.And(rb2)

    rb3.Add(1)
    rb3.Add(5)

    rb3.Or(rb1)

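    // computes the union of the three bitmaps in parallel using 4 workers;
    // the result is returned (and discarded here)
    roaring.ParOr(4, rb1, rb2, rb3)
    // computes the intersection of the three bitmaps in parallel using 4 workers;
    // the result is returned (and discarded here)
    roaring.ParAnd(4, rb1, rb2, rb3)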


    // prints 1, 3, 4, 5, 1000
    i := rb3.Iterator()
    for i.HasNext() {
        fmt.Println(i.Next())
    }
    fmt.Println()

    // next we include an example of serialization
    buf := new(bytes.Buffer)
    rb1.WriteTo(buf) // we omit error handling
    newrb := roaring.NewBitmap()
    newrb.ReadFrom(buf)
    if rb1.Equals(newrb) {
        fmt.Println("I wrote the content to a byte stream and read it back.")
    }
}

If you wish to use serialization and handle errors, consider the following sample. It is written as it would appear inside a test in the roaring package, so BitmapOf and NewBitmap are unqualified and t is a *testing.T:

	rb := BitmapOf(1, 2, 3, 4, 5, 100, 1000)
	buf := new(bytes.Buffer)
	// WriteTo also returns the number of bytes written; it is not needed here
	_, err := rb.WriteTo(buf)
	if err != nil {
		t.Errorf("Failed writing")
	}
	newrb := NewBitmap()
	_, err = newrb.ReadFrom(buf)
	if err != nil {
		t.Errorf("Failed reading")
	}
	if !rb.Equals(newrb) {
		t.Errorf("Cannot retrieve serialized version")
	}

Given N integers in [0,x), the serialized size in bytes of a Roaring bitmap should never exceed this bound:

8 + 9 * ((long)x+65535)/65536 + 2 * N

That is, given a fixed overhead for the universe size (x), Roaring bitmaps never use more than 2 bytes per integer. You can call BoundSerializedSizeInBytes for a more precise estimate.
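
As a quick sanity check, the bound can be evaluated directly. The sketch below is only an illustration: readmeBound is a hypothetical helper (it is not part of the library), and it assumes the expression is read with C operator precedence, so the multiplication by 9 happens before the integer division.

```go
package main

import "fmt"

// readmeBound is a hypothetical helper, not part of the roaring package.
// It evaluates 8 + 9*(x+65535)/65536 + 2*n left to right, as C would,
// where n is the number of integers and x is the universe size.
func readmeBound(n, x uint64) uint64 {
	return 8 + 9*(x+65535)/65536 + 2*n
}

func main() {
	// 1000 integers drawn from [0, 1000000).
	fmt.Println(readmeBound(1000, 1000000)) // prints 2154
}
```

For a tighter figure on real data, BoundSerializedSizeInBytes and GetSerializedSizeInBytes remain the functions to call.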

Documentation

Current documentation is available at http://godoc.org/github.com/RoaringBitmap/roaring

Goroutine safety

In general, it is not safe to access the same bitmap from different goroutines: bitmaps are left unsynchronized for performance. Should you want to access a Bitmap from more than one goroutine, you should provide synchronization. Typically this is done by using channels to pass the *Bitmap around (in Go style, so there is only ever one owner), or by using sync.Mutex to serialize operations on Bitmaps.
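
The sync.Mutex approach might look roughly like the sketch below. To keep it free of external dependencies, intSet is a stand-in for *roaring.Bitmap, and safeSet and fillConcurrently are hypothetical names; the same locking pattern would apply to a wrapper holding a real *Bitmap.

```go
package main

import (
	"fmt"
	"sync"
)

// intSet is a stand-in for *roaring.Bitmap so that this sketch has no
// external dependency; like a Bitmap, it is not safe for concurrent use.
type intSet map[uint32]struct{}

// safeSet is a hypothetical wrapper that serializes all access with a mutex.
type safeSet struct {
	mu  sync.Mutex
	set intSet
}

func newSafeSet() *safeSet { return &safeSet{set: make(intSet)} }

// Add inserts x while holding the lock.
func (s *safeSet) Add(x uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.set[x] = struct{}{}
}

// Cardinality reads the size while holding the lock.
func (s *safeSet) Cardinality() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.set)
}

// fillConcurrently starts `workers` goroutines that each add `perWorker`
// distinct values, then returns the final cardinality.
func fillConcurrently(workers, perWorker int) int {
	s := newSafeSet()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				s.Add(uint32(w*perWorker + i))
			}
		}(w)
	}
	wg.Wait()
	return s.Cardinality()
}

func main() {
	fmt.Println(fillConcurrently(8, 1000)) // prints 8000
}
```

Read-only methods take the lock too, since an unsynchronized read racing with a write is still a data race.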

Coverage

We test our software. For a report on our test coverage, see

https://coveralls.io/github/RoaringBitmap/roaring?branch=master

Benchmark

Type

     go test -bench Benchmark -run -

To run benchmarks on Real Roaring Datasets run the following:

go get github.com/RoaringBitmap/real-roaring-datasets
BENCH_REAL_DATA=1 go test -bench BenchmarkRealData -run -
Interactive use

You can use roaring with gore:

  • go get -u github.com/motemen/gore
  • Make sure that $GOPATH/bin is in your $PATH.
  • go get github.com/RoaringBitmap/roaring
$ gore
gore version 0.2.6  :help for help
gore> :import github.com/RoaringBitmap/roaring
gore> x:=roaring.New()
gore> x.Add(1)
gore> x.String()
"{1}"
Fuzz testing

You can help us test the library further with fuzz testing:

     go get github.com/dvyukov/go-fuzz/go-fuzz
     go get github.com/dvyukov/go-fuzz/go-fuzz-build
     go test -tags=gofuzz -run=TestGenerateSmatCorpus
     go-fuzz-build github.com/RoaringBitmap/roaring
     go-fuzz -bin=./roaring-fuzz.zip -workdir=workdir/ -timeout=200

Let it run, and if the # of crashers is > 0, check out the reports in the workdir where you should be able to find the panic goroutine stack traces.

Alternative in Go

There is a Go version wrapping the C/C++ implementation: https://github.com/RoaringBitmap/gocroaring

For an alternative implementation in Go, see https://github.com/fzandona/goroar. The two versions were written independently.

Mailing list/discussion group

https://groups.google.com/forum/#!forum/roaring-bitmaps

Documentation

Overview

Package roaring is an implementation of Roaring Bitmaps in Go. They provide fast compressed bitmap data structures (also called bitset). They are ideally suited to represent sets of integers over relatively small ranges. See http://roaringbitmap.org for details.

Constants

const MaxUint16 = 65535

MaxUint16 is the largest 16 bit unsigned int. This is the largest value an interval16 can store.

const MaxUint32 = 4294967295

MaxUint32 is the largest uint32 value.

Variables

This section is empty.

Functions

func BoundSerializedSizeInBytes added in v0.2.1

func BoundSerializedSizeInBytes(cardinality uint64, universeSize uint64) uint64

BoundSerializedSizeInBytes returns an upper bound on the serialized size in bytes, assuming that one wants to store "cardinality" integers in [0, universeSize).

Types

type Bitmap added in v0.2.0

type Bitmap struct {
	// contains filtered or unexported fields
}

Bitmap represents a compressed bitmap where you can add integers.

func And

func And(x1, x2 *Bitmap) *Bitmap

And computes the intersection between two bitmaps and returns the result

func AndNot

func AndNot(x1, x2 *Bitmap) *Bitmap

AndNot computes the difference between two bitmaps and returns the result

func BitmapOf

func BitmapOf(dat ...uint32) *Bitmap

BitmapOf generates a new bitmap filled with the specified integers

func FastAnd

func FastAnd(bitmaps ...*Bitmap) *Bitmap

FastAnd computes the intersection between many bitmaps quickly. Compared to the And function, it can take many bitmaps as input, thus saving the trouble of manually calling "And" many times.

func FastOr

func FastOr(bitmaps ...*Bitmap) *Bitmap

FastOr computes the union between many bitmaps quickly, as opposed to having to call Or repeatedly. It might also be faster than calling Or repeatedly.

func Flip

func Flip(bm *Bitmap, rangeStart, rangeEnd uint64) *Bitmap

Flip negates the bits in the given range, [rangeStart,rangeEnd): any integer present in both the range and the bitmap is removed, and any integer in the range but not in the bitmap is added. A new bitmap is returned, leaving the input bitmap unchanged. The function uses 64-bit parameters even though a Bitmap stores 32-bit values because it is allowed and meaningful to use [0,uint64(0x100000000)) as a range, while uint64(0x100000000) cannot be represented as a 32-bit value.
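
To make the reason for the 64-bit bounds concrete: the half-open range that covers every uint32 value ends at 2^32, which wraps to 0 if it is forced into 32 bits. The following sketch uses only the standard library; rangeLen and truncate32 are hypothetical helpers introduced for illustration.

```go
package main

import "fmt"

// rangeLen is a hypothetical helper returning the number of integers in the
// half-open range [start, end); it returns 0 when the range is empty.
func rangeLen(start, end uint64) uint64 {
	if end <= start {
		return 0
	}
	return end - start
}

// truncate32 shows what happens to a bound when it is squeezed into 32 bits.
func truncate32(v uint64) uint32 {
	return uint32(v) // keeps only the low 32 bits
}

func main() {
	end := uint64(0x100000000)    // one past the largest uint32
	fmt.Println(rangeLen(0, end)) // prints 4294967296
	fmt.Println(truncate32(end))  // prints 0
}
```

With 32-bit bounds, the full range would be indistinguishable from the empty range [0,0).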

func FlipInt

func FlipInt(bm *Bitmap, rangeStart, rangeEnd int) *Bitmap

FlipInt calls Flip after casting the parameters (convenience method)

func HeapOr added in v0.1.1

func HeapOr(bitmaps ...*Bitmap) *Bitmap

HeapOr computes the union between many bitmaps quickly using a heap. It might be faster than calling Or repeatedly.

func HeapXor added in v0.1.1

func HeapXor(bitmaps ...*Bitmap) *Bitmap

HeapXor computes the symmetric difference between many bitmaps quickly (as opposed to calling Xor repeatedly). Internally, this function uses a heap. It might be faster than calling Xor repeatedly.

func New added in v0.2.8

func New() *Bitmap

New creates a new empty Bitmap (same as NewBitmap)

func NewBitmap added in v0.2.0

func NewBitmap() *Bitmap

NewBitmap creates a new empty Bitmap (see also New)

func Or

func Or(x1, x2 *Bitmap) *Bitmap

Or computes the union between two bitmaps and returns the result

func ParAnd added in v0.3.13

func ParAnd(parallelism int, bitmaps ...*Bitmap) *Bitmap

ParAnd computes the intersection (AND) of all provided bitmaps in parallel, where the parameter "parallelism" determines how many workers are to be used (if it is set to 0, a default number of workers is chosen)

func ParOr added in v0.3.13

func ParOr(parallelism int, bitmaps ...*Bitmap) *Bitmap

ParOr computes the union (OR) of all provided bitmaps in parallel, where the parameter "parallelism" determines how many workers are to be used (if it is set to 0, a default number of workers is chosen)

func Xor

func Xor(x1, x2 *Bitmap) *Bitmap

Xor computes the symmetric difference between two bitmaps and returns the result
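
The distinction between the difference (AndNot) and the symmetric difference (Xor) is easy to mix up. The sketch below illustrates the set semantics only, using plain Go maps instead of the library; set, of, andNot, xor and sorted are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// set is a plain map-based stand-in used only to illustrate the semantics.
type set map[uint32]bool

// of builds a set from the given values.
func of(vals ...uint32) set {
	s := set{}
	for _, v := range vals {
		s[v] = true
	}
	return s
}

// andNot returns the values of a that are not in b (difference).
func andNot(a, b set) set {
	out := set{}
	for v := range a {
		if !b[v] {
			out[v] = true
		}
	}
	return out
}

// xor returns the values present in exactly one of a and b
// (symmetric difference).
func xor(a, b set) set {
	out := andNot(a, b)
	for v := range andNot(b, a) {
		out[v] = true
	}
	return out
}

// sorted lists the members in increasing order.
func sorted(s set) []uint32 {
	out := make([]uint32, 0, len(s))
	for v := range s {
		out = append(out, v)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	a, b := of(1, 2, 3), of(3, 4)
	fmt.Println(sorted(andNot(a, b))) // prints [1 2]
	fmt.Println(sorted(xor(a, b)))    // prints [1 2 4]
}
```

Note that the difference is not symmetric: swapping the arguments of andNot here would yield [4] instead.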

func (*Bitmap) Add added in v0.2.0

func (rb *Bitmap) Add(x uint32)

Add the integer x to the bitmap

func (*Bitmap) AddInt added in v0.2.0

func (rb *Bitmap) AddInt(x int)

AddInt adds the integer x to the bitmap (convenience method: the parameter is cast to uint32 and we call Add)

func (*Bitmap) AddMany added in v0.2.8

func (rb *Bitmap) AddMany(dat []uint32)

AddMany adds all of the values in dat

func (*Bitmap) AddRange added in v0.2.0

func (rb *Bitmap) AddRange(rangeStart, rangeEnd uint64)

AddRange adds the integers in [rangeStart, rangeEnd) to the bitmap. The function uses 64-bit parameters even though a Bitmap stores 32-bit values because it is allowed and meaningful to use [0,uint64(0x100000000)) as a range while uint64(0x100000000) cannot be represented as a 32-bit value.

func (*Bitmap) And added in v0.2.0

func (rb *Bitmap) And(x2 *Bitmap)

And computes the intersection between two bitmaps and stores the result in the current bitmap

func (*Bitmap) AndCardinality added in v0.2.0

func (rb *Bitmap) AndCardinality(x2 *Bitmap) uint64

AndCardinality returns the cardinality of the intersection between two bitmaps, bitmaps are not modified

func (*Bitmap) AndNot added in v0.2.0

func (rb *Bitmap) AndNot(x2 *Bitmap)

AndNot computes the difference between two bitmaps and stores the result in the current bitmap

func (*Bitmap) CheckedAdd added in v0.2.0

func (rb *Bitmap) CheckedAdd(x uint32) bool

CheckedAdd adds the integer x to the bitmap and returns true if it was added (false if the integer was already present)

func (*Bitmap) CheckedRemove added in v0.2.0

func (rb *Bitmap) CheckedRemove(x uint32) bool

CheckedRemove removes the integer x from the bitmap and returns true if the integer was actually removed (and false if the integer was not present)

func (*Bitmap) Clear added in v0.2.0

func (rb *Bitmap) Clear()

Clear removes all content from the Bitmap and frees the memory

func (*Bitmap) Clone added in v0.2.0

func (rb *Bitmap) Clone() *Bitmap

Clone creates a copy of the Bitmap

func (*Bitmap) Contains added in v0.2.0

func (rb *Bitmap) Contains(x uint32) bool

Contains returns true if the integer is contained in the bitmap

func (*Bitmap) ContainsInt added in v0.2.0

func (rb *Bitmap) ContainsInt(x int) bool

ContainsInt returns true if the integer is contained in the bitmap (this is a convenience method: the parameter is cast to uint32 and Contains is called)

func (*Bitmap) Equals added in v0.2.0

func (rb *Bitmap) Equals(o interface{}) bool

Equals returns true if the two bitmaps contain the same integers

func (*Bitmap) Flip added in v0.2.0

func (rb *Bitmap) Flip(rangeStart, rangeEnd uint64)

Flip negates the bits in the given range, [rangeStart,rangeEnd): any integer present in both the range and the bitmap is removed, and any integer in the range but not in the bitmap is added. The function uses 64-bit parameters even though a Bitmap stores 32-bit values because it is allowed and meaningful to use [0,uint64(0x100000000)) as a range, while uint64(0x100000000) cannot be represented as a 32-bit value.

func (*Bitmap) FlipInt added in v0.2.0

func (rb *Bitmap) FlipInt(rangeStart, rangeEnd int)

FlipInt calls Flip after casting the parameters (convenience method)

func (*Bitmap) FromBase64 added in v0.2.0

func (rb *Bitmap) FromBase64(str string) (int64, error)

FromBase64 deserializes a bitmap from Base64

func (*Bitmap) FromBuffer added in v0.3.15

func (rb *Bitmap) FromBuffer(buf []byte) (int64, error)

FromBuffer creates a bitmap from its serialized version stored in buffer

The format specification is available here: https://github.com/RoaringBitmap/RoaringFormatSpec

The provided byte array (buf) is expected to be a constant. The function makes a best-effort attempt not to copy data. You should take care not to modify buf, as doing so will likely result in unexpected program behavior.

Resulting bitmaps are effectively immutable in the following sense: a copy-on-write marker is used so that when you modify the resulting bitmap, copies of selected data (containers) are made. You should *not* change the copy-on-write status of the resulting bitmaps (SetCopyOnWrite).

func (*Bitmap) GetCardinality added in v0.2.0

func (rb *Bitmap) GetCardinality() uint64

GetCardinality returns the number of integers contained in the bitmap

func (*Bitmap) GetCopyOnWrite added in v0.2.4

func (rb *Bitmap) GetCopyOnWrite() (val bool)

GetCopyOnWrite gets this bitmap's copy-on-write property

func (*Bitmap) GetSerializedSizeInBytes added in v0.2.0

func (rb *Bitmap) GetSerializedSizeInBytes() uint64

GetSerializedSizeInBytes computes the serialized size in bytes of the Bitmap. It should correspond to the number of bytes written when invoking WriteTo. You can expect that this function is much cheaper computationally than WriteTo.

func (*Bitmap) GetSizeInBytes added in v0.2.0

func (rb *Bitmap) GetSizeInBytes() uint64

GetSizeInBytes estimates the memory usage of the Bitmap. Note that this might differ slightly from the amount of bytes required for persistent storage

func (*Bitmap) HasRunCompression added in v0.3.1

func (rb *Bitmap) HasRunCompression() bool

HasRunCompression returns true if the bitmap benefits from run compression

func (*Bitmap) Intersects added in v0.2.0

func (rb *Bitmap) Intersects(x2 *Bitmap) bool

Intersects checks whether two bitmaps intersect; the bitmaps are not modified

func (*Bitmap) IsEmpty added in v0.2.0

func (rb *Bitmap) IsEmpty() bool

IsEmpty returns true if the Bitmap is empty (it is faster than doing (GetCardinality() == 0))

func (*Bitmap) Iterator added in v0.2.0

func (rb *Bitmap) Iterator() IntIterable

Iterator creates a new IntIterable to iterate over the integers contained in the bitmap, in sorted order

func (*Bitmap) MarshalBinary added in v0.2.6

func (rb *Bitmap) MarshalBinary() ([]byte, error)

MarshalBinary implements the encoding.BinaryMarshaler interface for the bitmap

func (*Bitmap) Maximum added in v0.3.6

func (rb *Bitmap) Maximum() uint32

Maximum gets the largest value stored in this roaring bitmap; it assumes that the bitmap is not empty

func (*Bitmap) Minimum added in v0.3.6

func (rb *Bitmap) Minimum() uint32

Minimum gets the smallest value stored in this roaring bitmap; it assumes that the bitmap is not empty

func (*Bitmap) Or added in v0.2.0

func (rb *Bitmap) Or(x2 *Bitmap)

Or computes the union between two bitmaps and stores the result in the current bitmap

func (*Bitmap) OrCardinality added in v0.2.0

func (rb *Bitmap) OrCardinality(x2 *Bitmap) uint64

OrCardinality returns the cardinality of the union between two bitmaps, bitmaps are not modified

func (*Bitmap) Rank added in v0.2.0

func (rb *Bitmap) Rank(x uint32) uint64

Rank returns the number of integers that are smaller than or equal to x (Rank(infinity) would be GetCardinality())

func (*Bitmap) ReadFrom added in v0.2.0

func (rb *Bitmap) ReadFrom(stream io.Reader) (int64, error)

ReadFrom reads a serialized version of this bitmap from stream. The format is compatible with other RoaringBitmap implementations (Java, C) and is documented here: https://github.com/RoaringBitmap/RoaringFormatSpec

func (*Bitmap) ReadFromMsgpack added in v0.3.0

func (rb *Bitmap) ReadFromMsgpack(stream io.Reader) (int64, error)

ReadFromMsgpack reads a msgpack2/snappy-streaming serialized version of this bitmap from stream. The expected format is the one written by the WriteToMsgpack() call; see additional notes there.

func (*Bitmap) Remove added in v0.2.0

func (rb *Bitmap) Remove(x uint32)

Remove the integer x from the bitmap

func (*Bitmap) RemoveRange added in v0.2.0

func (rb *Bitmap) RemoveRange(rangeStart, rangeEnd uint64)

RemoveRange removes the integers in [rangeStart, rangeEnd) from the bitmap. The function uses 64-bit parameters even though a Bitmap stores 32-bit values because it is allowed and meaningful to use [0,uint64(0x100000000)) as a range while uint64(0x100000000) cannot be represented as a 32-bit value.

func (*Bitmap) RunOptimize added in v0.3.0

func (rb *Bitmap) RunOptimize()

RunOptimize attempts to further compress the runs of consecutive values found in the bitmap

func (*Bitmap) Select added in v0.2.0

func (rb *Bitmap) Select(x uint32) (uint32, error)

Select returns the xth integer in the bitmap
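
Rank and Select are close to inverses of each other, and a sorted slice is a convenient mental model for both. The sketch below does not call the library; rank and selectAt are hypothetical helpers, and selectAt assumes zero-based indexing (so index 0 is the smallest value), which the text above does not state explicitly.

```go
package main

import (
	"fmt"
	"sort"
)

// rank counts the values in the sorted slice that are <= x,
// mirroring the documented meaning of Rank.
func rank(sorted []uint32, x uint32) int {
	// sort.Search returns the first index whose value is > x,
	// which equals the number of values <= x.
	return sort.Search(len(sorted), func(i int) bool { return sorted[i] > x })
}

// selectAt returns the i-th smallest value, ASSUMING zero-based indexing.
func selectAt(sorted []uint32, i int) (uint32, error) {
	if i < 0 || i >= len(sorted) {
		return 0, fmt.Errorf("index %d out of range for %d values", i, len(sorted))
	}
	return sorted[i], nil
}

func main() {
	vals := []uint32{1, 2, 3, 4, 5, 100, 1000}
	fmt.Println(rank(vals, 5))  // prints 5
	fmt.Println(rank(vals, 99)) // prints 5
	v, _ := selectAt(vals, 5)
	fmt.Println(v) // prints 100
}
```

Under that zero-based assumption, rank applied to the value returned by selectAt at index i gives i+1.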

func (*Bitmap) SetCopyOnWrite added in v0.2.4

func (rb *Bitmap) SetCopyOnWrite(val bool)

SetCopyOnWrite sets this bitmap to use copy-on-write so that copies are fast and memory conscious if the parameter is true, otherwise we leave the default where hard copies are made (copy-on-write requires extra care in a threaded context). Calling SetCopyOnWrite(true) on a bitmap created with FromBuffer is unsafe.

func (*Bitmap) Stats added in v0.3.0

func (rb *Bitmap) Stats() Statistics

Stats returns details on container type usage in a Statistics struct.

func (*Bitmap) String added in v0.2.0

func (rb *Bitmap) String() string

String creates a string representation of the Bitmap

func (*Bitmap) ToArray added in v0.2.0

func (rb *Bitmap) ToArray() []uint32

ToArray creates a new slice containing all of the integers stored in the Bitmap in sorted order

func (*Bitmap) ToBase64 added in v0.2.0

func (rb *Bitmap) ToBase64() (string, error)

ToBase64 serializes a bitmap as Base64

func (*Bitmap) ToBytes added in v0.3.3

func (rb *Bitmap) ToBytes() ([]byte, error)

ToBytes returns an array of bytes corresponding to what is written when calling WriteTo

func (*Bitmap) UnmarshalBinary added in v0.2.6

func (rb *Bitmap) UnmarshalBinary(data []byte) error

UnmarshalBinary implements the encoding.BinaryUnmarshaler interface for the bitmap

func (*Bitmap) WriteTo added in v0.2.0

func (rb *Bitmap) WriteTo(stream io.Writer) (int64, error)

WriteTo writes a serialized version of this bitmap to stream. The format is compatible with other RoaringBitmap implementations (Java, C) and is documented here: https://github.com/RoaringBitmap/RoaringFormatSpec

func (*Bitmap) WriteToMsgpack added in v0.3.0

func (rb *Bitmap) WriteToMsgpack(stream io.Writer) (int64, error)

WriteToMsgpack writes a msgpack2/snappy-streaming compressed serialized version of this bitmap to stream. The format is not compatible with the WriteTo() format, and is experimental: it may produce smaller on disk footprint and/or be faster to read, depending on your content. Currently only the Go roaring implementation supports this format.

func (*Bitmap) Xor added in v0.2.0

func (rb *Bitmap) Xor(x2 *Bitmap)

Xor computes the symmetric difference between two bitmaps and stores the result in the current bitmap

type IntIterable

type IntIterable interface {
	HasNext() bool
	Next() uint32
}

IntIterable allows you to iterate over the values in a Bitmap
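
The HasNext/Next loop works the same way for any implementation of this interface. As a dependency-free sketch, the interface is repeated locally and backed by a slice; sliceIterator and collect are hypothetical names, not part of the library.

```go
package main

import "fmt"

// IntIterable is copied from the documentation above so the sketch compiles
// without importing the roaring package.
type IntIterable interface {
	HasNext() bool
	Next() uint32
}

// sliceIterator is a hypothetical implementation backed by a slice.
type sliceIterator struct {
	vals []uint32
	pos  int
}

// HasNext reports whether any values remain.
func (it *sliceIterator) HasNext() bool { return it.pos < len(it.vals) }

// Next returns the current value and advances; call HasNext first.
func (it *sliceIterator) Next() uint32 {
	v := it.vals[it.pos]
	it.pos++
	return v
}

// collect drains any IntIterable into a slice.
func collect(it IntIterable) []uint32 {
	var out []uint32
	for it.HasNext() {
		out = append(out, it.Next())
	}
	return out
}

func main() {
	it := &sliceIterator{vals: []uint32{1, 3, 4, 5, 1000}}
	fmt.Println(collect(it)) // prints [1 3 4 5 1000]
}
```

A function written against the interface, such as collect here, would accept the value returned by a Bitmap's Iterator method as well.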

type Statistics added in v0.3.0

type Statistics struct {
	Cardinality uint64
	Containers  uint64

	ArrayContainers      uint64
	ArrayContainerBytes  uint64
	ArrayContainerValues uint64

	BitmapContainers      uint64
	BitmapContainerBytes  uint64
	BitmapContainerValues uint64

	RunContainers      uint64
	RunContainerBytes  uint64
	RunContainerValues uint64
}

Statistics provides details on the container types in use.
