timeseries

Version: v1.7.1
Published: Oct 30, 2024 License: AGPL-3.0 Imports: 29 Imported by: 0

Documentation

Overview

Package timeseries implements storage for web analytics events that is backed by a key/value store and based on roaring bitmaps. We use pebble as the embedded key/value store and rely heavily on its value merge feature, so a different underlying key/value store cannot be supported.

A web analytics event comprises the following fundamental properties. NOTE: The structure is taken from an earlier version of vince. We no longer use protocol buffers, but the names and data types stay the same, except for bounce, which is now represented as an int8.

int64 timestamp = 1;
int64 id = 2;
optional bool bounce = 3;
bool session = 4;
bool view = 5;
double duration = 6;
string browser = 19;
string browser_version = 20;
string city = 26;
string country = 23;
string device = 18;
string domain = 25;
string entry_page = 9;
string event = 7;
string exit_page = 10;
string host = 27;
string os = 21;
string os_version = 22;
string page = 8;
string referrer = 12;
string region = 24;
string source = 11;
string utm_campaign = 15;
string utm_content = 16;
string utm_medium = 14;
string utm_source = 13;
string utm_term = 17;
string tenant_id = 28;

This is the only data structure we need to store and query efficiently. All string properties are used for search and aggregation.

Timeseries

All queries going through this package are time based. Computation of time ranges and resolutions is handled by the internal/compute package.

We have six time resolutions that are used for search:

  • Minute
  • Hour
  • Day
  • Week
  • Month
  • Year

Time in unix milliseconds, truncated to one of the above resolutions, is stored as part of each key, so that queries over similar timestamps load similar blocks, which speeds up data retrieval. Details of the timestamp encoding are discussed in the Keys section.

Keys

A key is broken into the following components:

[ byte(prefix) ][ byte(resolution) ][ uint64(timestamp) ][ byte(field) ][ uint64(shard) ]

prefix: encodes a unique global prefix assigned for timeseries data. This value is subject to change, however it is the sole indicator that the key holds time series data.

resolution: ensures we only process blocks relevant to a query. All queries must specify their resolution, e.g. by minute, hour, day, etc.

shard: We store events in partitions of 1 << 20 (1,048,576, roughly one million) events. Each event is assigned a unique ID from an auto-incrementing uint64 value. The assigned shard is computed as:

shard = id / ( shard_width ) # shard_width = 1 << 20
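As a small worked example of the formula above, the sketch below computes the shard for a few IDs. The shardOf and columnOf helpers are hypothetical names introduced for illustration; only ShardWidth comes from the package.

```go
package main

import "fmt"

// ShardWidth matches the exported constant of this package.
const ShardWidth = 1 << 20

// shardOf returns the shard an event ID falls into (hypothetical helper).
func shardOf(id uint64) uint64 { return id / ShardWidth }

// columnOf returns the offset of an event ID within its shard
// (hypothetical helper).
func columnOf(id uint64) uint64 { return id % ShardWidth }

func main() {
	for _, id := range []uint64{0, ShardWidth - 1, ShardWidth, 3*ShardWidth + 7} {
		fmt.Printf("id=%d shard=%d column=%d\n", id, shardOf(id), columnOf(id))
	}
}
```

IDs 0 through 1,048,575 land in shard 0, and ID 1,048,576 is the first event of shard 1.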

field: we assign a unique number to each property.

Field_unknown           Field = 0
Field_timestamp         Field = 1
Field_id                Field = 2
Field_bounce            Field = 3
Field_duration          Field = 4
Field_city              Field = 5
Field_view              Field = 6
Field_session           Field = 7
Field_browser           Field = 8
Field_browser_version   Field = 9
Field_country           Field = 10
Field_device            Field = 11
Field_domain            Field = 12
Field_entry_page        Field = 13
Field_event             Field = 14
Field_exit_page         Field = 15
Field_host              Field = 16
Field_os                Field = 17
Field_os_version        Field = 18
Field_page              Field = 19
Field_referrer          Field = 20
Field_source            Field = 21
Field_utm_campaign      Field = 22
Field_utm_content       Field = 23
Field_utm_medium        Field = 24
Field_utm_source        Field = 25
Field_utm_term          Field = 26
Field_subdivision1_code Field = 27
Field_subdivision2_code Field = 28

Values

All values are stored as serialized roaring bitmaps. This ensures that decoding happens only once, at the pebble level; values are then loaded directly without further decoding.

We use different schemes depending on the data type. All string fields are stored using mutex encoding, and the rest are stored as a bit-sliced index (BSI).

Bitmap values contain both row and column values. Details of how row and column are combined to derive positions in the bitmap are documented in the respective (*Bitmap)Mutex and (*Bitmap)BSI methods.
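To make the two encodings concrete, here is a toy sketch. It assumes the common convention position = row * ShardWidth + column % ShardWidth, with BSI using row 0 for "exists", row 1 for the sign, and rows 2 and up for the value bits. That convention, the map-based bitmap type, and every helper name below are assumptions for illustration; the authoritative layout is whatever (*Bitmap)Mutex and (*Bitmap)BSI document.

```go
package main

import "fmt"

const ShardWidth = 1 << 20

// bitmap is a toy stand-in for a roaring bitmap: a set of bit positions.
type bitmap map[uint64]struct{}

// pos combines a row and a column into one bit position within a shard
// (assumed convention, see the lead-in).
func pos(row, column uint64) uint64 { return row*ShardWidth + column%ShardWidth }

// setMutex stores one value per column: the row is the (translated) value.
func (b bitmap) setMutex(column, value uint64) { b[pos(value, column)] = struct{}{} }

// Assumed BSI row layout.
const (
	bsiExistsRow = 0
	bsiSignRow   = 1
	bsiOffsetRow = 2
)

// setBSI stores value bit by bit: one row per binary digit.
func (b bitmap) setBSI(column uint64, value int64) {
	b[pos(bsiExistsRow, column)] = struct{}{}
	mag := uint64(value)
	if value < 0 {
		b[pos(bsiSignRow, column)] = struct{}{}
		mag = uint64(-value)
	}
	for i := uint64(0); mag != 0; i, mag = i+1, mag>>1 {
		if mag&1 == 1 {
			b[pos(bsiOffsetRow+i, column)] = struct{}{}
		}
	}
}

// getBSI reassembles the value stored for column, if any.
func (b bitmap) getBSI(column uint64) (int64, bool) {
	if _, ok := b[pos(bsiExistsRow, column)]; !ok {
		return 0, false
	}
	var mag uint64
	for i := uint64(0); i < 64; i++ {
		if _, ok := b[pos(bsiOffsetRow+i, column)]; ok {
			mag |= 1 << i
		}
	}
	v := int64(mag)
	if _, ok := b[pos(bsiSignRow, column)]; ok {
		v = -v
	}
	return v, true
}

func main() {
	country := bitmap{} // a string field: mutex encoded
	country.setMutex(5, 3)
	_, ok := country[pos(3, 5)]
	fmt.Println("event 5 has value id 3:", ok)

	ts := bitmap{} // a numeric field: bit sliced
	ts.setBSI(5, 1730293200000)
	fmt.Println(ts.getBSI(5))
}
```

The practical difference: a mutex field answers "which events have value X" by reading a single row, while a BSI field can represent any integer with at most 66 rows regardless of how many distinct values occur.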

When saving key/value pairs we use (*pebble.Batch)Merge together with a custom value merger that only performs an inlined (*Bitmap)Or. This design keeps batch flushes fast and efficient.
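The reason a bitwise OR works well as a merge operation can be shown with a toy example. The sketch below does not use pebble or roaring; the words type and the or, merge and equal helpers are hypothetical stand-ins. It demonstrates only the property that matters: OR is commutative and associative, so operands for the same key can be combined in any order and produce the same value.

```go
package main

import "fmt"

// words is a toy uncompressed bitmap. The real package stores serialized
// roaring bitmaps; this type exists only for illustration.
type words []uint64

// or merges src into dst, growing dst when src is longer.
func or(dst, src words) words {
	if len(src) > len(dst) {
		dst = append(dst, make(words, len(src)-len(dst))...)
	}
	for i, w := range src {
		dst[i] |= w
	}
	return dst
}

// merge folds all operands written for one key into a single value,
// which is the role a value merger plays during reads and compactions.
func merge(operands ...words) words {
	var out words
	for _, op := range operands {
		out = or(out, op)
	}
	return out
}

// equal reports whether two bitmaps hold the same words.
func equal(a, b words) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	first := words{0b0011}       // bits written by an earlier batch
	second := words{0b0100, 0b1} // bits written by a later batch
	fmt.Println(equal(merge(first, second), merge(second, first)))
}
```

Because the writer never needs to read the existing value before adding bits, a flush is a sequence of blind writes, and the combination work is deferred to the storage engine.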

Index

Constants

const ShardWidth = 1 << 20

Variables

This section is empty.

Functions

This section is empty.

Types

type Cond added in v1.6.0

type Cond struct {
	Yes []uint64
	No  []uint64
}

func (*Cond) Apply added in v1.6.0

func (f *Cond) Apply(shard uint64, ra *roaring.Bitmap) *roaring.Bitmap

func (*Cond) IsEmpty added in v1.6.0

func (f *Cond) IsEmpty() bool

type FilterData added in v1.6.0

type FilterData [models.SearchFieldSize]*roaring.Bitmap

type FilterSet added in v1.6.0

type FilterSet [models.SearchFieldSize]Cond

func (*FilterSet) ScanFields added in v1.6.0

func (fs *FilterSet) ScanFields() (set models.BitSet)

ScanFields returns a set of all fields to scan for this filter.

func (*FilterSet) Set added in v1.6.0

func (fs *FilterSet) Set(yes bool, f models.Field, values ...uint64)

type ScanConfig added in v1.6.0

type ScanConfig struct {
	All, Data, Filter struct {
		Set      models.BitSet
		Min, Max models.Field
	}
}

type Timeseries added in v1.5.1

type Timeseries struct {
	// contains filtered or unexported fields
}

func New added in v1.5.1

func New(db *pebble.DB) *Timeseries

func (*Timeseries) Add added in v1.5.1

func (ts *Timeseries) Add(m *models.Model) error

Add processes m and batches it. It must be called in the same goroutine as (*Timeseries)Save.

When a shard boundary is reached, the existing batch is saved before m is added. The []byte fields of m must not be modified, because references to them are used during translation. A safe pattern is to release m immediately after calling this method and reset it with:

*m = models.Model{}

func (*Timeseries) Close added in v1.5.1

func (ts *Timeseries) Close() error

Close releases resources and removes buffers used.

func (*Timeseries) Find added in v1.5.1

func (ts *Timeseries) Find(ctx context.Context, field models.Field, id uint64) (value string)

func (*Timeseries) Get added in v1.5.1

func (ts *Timeseries) Get() *pebble.DB

func (*Timeseries) Save added in v1.5.1

func (ts *Timeseries) Save() error

Save persists all buffered events into the pebble key/value store. This method is not safe for concurrent use. It is intended to be called in the same goroutine that calls (*Timeseries)Add.

The goal is to ensure an almost lock-free ingestion path (with the exception of translation, which uses an RWMutex).

func (*Timeseries) Scan added in v1.6.0

func (ts *Timeseries) Scan(
	res encoding.Resolution,
	start, end time.Time,
	filterSet FilterSet,
	valueSet models.BitSet,
	cb func(field models.Field, view, shard uint64, columns, ra *roaring.Bitmap) bool,
)

func (*Timeseries) Search added in v1.5.1

func (ts *Timeseries) Search(field models.Field, prefix []byte, f func(key []byte, value uint64))

func (*Timeseries) Select added in v1.5.1

func (ts *Timeseries) Select(
	ctx context.Context,
	values models.BitSet,
	domain string, start,
	end time.Time,
	intrerval query.Interval,
	filters query.Filters,
	cb func(field models.Field, view, shard uint64, columns, ra *roaring.Bitmap) bool)

func (*Timeseries) Translate added in v1.5.1

func (ts *Timeseries) Translate(field models.Field, value []byte) uint64

func (*Timeseries) Visitors added in v1.7.0

func (ts *Timeseries) Visitors(start, end time.Time, resolution encoding.Resolution, domain string) (visitors uint64)

Realtime computes total visitors in the last 5 minutes.
