timeseries

package
v1.6.0 Latest
Warning

This package is not in the latest version of its module.

Published: Oct 27, 2024 License: AGPL-3.0 Imports: 28 Imported by: 0

Documentation

Overview

Package timeseries implements key/value store backed, roaring bitmap based storage for web analytics events. We use pebble as the embedded key/value store and rely heavily on its value merge feature, so it is not possible to support a different underlying key/value store.

A web analytics event comprises the following fundamental properties. NOTE: The structure is taken from an earlier version of vince. We no longer use protocol buffers, but the names and data types stay the same except for bounce, which is now represented as an int8.

int64 timestamp = 1;
int64 id = 2;
optional bool bounce = 3;
bool session = 4;
bool view = 5;
double duration = 6;
string browser = 19;
string browser_version = 20;
string city = 26;
string country = 23;
string device = 18;
string domain = 25;
string entry_page = 9;
string event = 7;
string exit_page = 10;
string host = 27;
string os = 21;
string os_version = 22;
string page = 8;
string referrer = 12;
string region = 24;
string source = 11;
string utm_campaign = 15;
string utm_content = 16;
string utm_medium = 14;
string utm_source = 13;
string utm_term = 17;
string tenant_id = 28;

This is the only data structure we need to store and query efficiently. All string properties are used for search and aggregation.

Timeseries

All queries going through this package are time based. Computation of time ranges and resolutions is handled by the internal/compute package.

We have six time resolutions that are used for search:

  • Minute
  • Hour
  • Day
  • Week
  • Month
  • Year

Time in unix milliseconds, truncated to one of the above resolutions, is stored as part of each key, so that queries for similar timestamps load similar blocks, which speeds up data retrieval. Details of the timestamp encoding are discussed in the Keys section.

Keys

A key is broken into the following components:

[ byte(prefix) ][ uint64(shard) ][ uint64(timestamp) ][ byte(field) ]

prefix: encodes a unique global prefix assigned to time series data. This value is subject to change; however, it is the sole indicator that the key holds time series data.

shard: events are stored in partitions of about 1 million events each. Each event is assigned a unique ID from an auto-incrementing uint64 counter. The shard for an event is derived from its ID:

shard = id / ( shard_width ) # shard_width = 1 << 20
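As a small worked example of the formula above: ShardWidth is the package constant, while shardOf and columnOf are hypothetical helper names used only for illustration. Treating id % ShardWidth as the offset within the shard is an assumption of this sketch.

```go
package main

import "fmt"

// ShardWidth matches the package constant: 1 << 20 = 1,048,576 events per shard.
const ShardWidth = 1 << 20

// shardOf returns the partition an event ID falls into.
func shardOf(id uint64) uint64 { return id / ShardWidth }

// columnOf returns the offset of the ID inside its shard (assumed, for illustration).
func columnOf(id uint64) uint64 { return id % ShardWidth }

func main() {
	for _, id := range []uint64{0, ShardWidth - 1, ShardWidth, 3*ShardWidth + 7} {
		fmt.Printf("id=%d shard=%d column=%d\n", id, shardOf(id), columnOf(id))
	}
}
```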

field: we assign a unique number to each property.

Field_unknown           Field = 0
Field_timestamp         Field = 1
Field_id                Field = 2
Field_bounce            Field = 3
Field_duration          Field = 4
Field_city              Field = 5
Field_view              Field = 6
Field_session           Field = 7
Field_browser           Field = 8
Field_browser_version   Field = 9
Field_country           Field = 10
Field_device            Field = 11
Field_domain            Field = 12
Field_entry_page        Field = 13
Field_event             Field = 14
Field_exit_page         Field = 15
Field_host              Field = 16
Field_os                Field = 17
Field_os_version        Field = 18
Field_page              Field = 19
Field_referrer          Field = 20
Field_source            Field = 21
Field_utm_campaign      Field = 22
Field_utm_content       Field = 23
Field_utm_medium        Field = 24
Field_utm_source        Field = 25
Field_utm_term          Field = 26
Field_subdivision1_code Field = 27
Field_subdivision2_code Field = 28

The shard and timestamp components are encoded with binary.AppendUvarint. This scheme ensures efficient time range queries: most of the time we can iterate efficiently over co-located data.
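A minimal sketch of the key layout described above, using only encoding/binary from the standard library. The prefix value 0x01 and the helper names encodeKey and decodeKey are assumptions made for the example; the real prefix is internal to the package and may change.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// prefixTimeseries is a hypothetical value chosen for this sketch.
const prefixTimeseries byte = 0x01

// encodeKey lays out [prefix][uvarint shard][uvarint timestamp][field].
func encodeKey(shard, timestamp uint64, field byte) []byte {
	key := make([]byte, 0, 2+2*binary.MaxVarintLen64)
	key = append(key, prefixTimeseries)
	key = binary.AppendUvarint(key, shard)
	key = binary.AppendUvarint(key, timestamp)
	return append(key, field)
}

// decodeKey reverses encodeKey. The final result is false when the key
// does not carry the time series prefix or is malformed.
func decodeKey(key []byte) (uint64, uint64, byte, bool) {
	if len(key) < 4 || key[0] != prefixTimeseries {
		return 0, 0, 0, false
	}
	rest := key[1:]
	shard, n := binary.Uvarint(rest)
	if n <= 0 {
		return 0, 0, 0, false
	}
	rest = rest[n:]
	timestamp, m := binary.Uvarint(rest)
	if m <= 0 || len(rest)-m != 1 {
		return 0, 0, 0, false
	}
	return shard, timestamp, rest[m], true
}

func main() {
	const fieldPage = 19 // Field_page in the table above
	// 1730037600000 is an hour-aligned timestamp in unix milliseconds.
	key := encodeKey(3, 1730037600000, fieldPage)
	shard, ts, field, ok := decodeKey(key)
	fmt.Printf("% x\n", key)
	fmt.Println(shard, ts, field, ok)
}
```

Because the shard and the truncated timestamp lead the key, all fields for the same shard and time bucket sit next to each other in the key space.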

Values

All values are stored as serialized roaring bitmaps. This ensures that we only decode once at the pebble level; values are loaded directly without decoding.

We use a different scheme depending on the data type. All string fields are stored in a mutex encoding and the rest are stored as a bit sliced index (BSI).

Bitmap values contain both row and column values. Details on how row and column are combined to derive positions in the bitmap are documented in the respective (*Bitmap).Mutex and (*Bitmap).BSI methods.
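To make the row/column idea concrete, here is a toy sketch. It assumes the layout commonly used by Pilosa-style roaring indexes, position = row * ShardWidth + column % ShardWidth, with BSI row 0 as an existence bit and row i+1 holding bit i of the value. The package's actual layout is the one documented on the methods named above and may differ. The bitmap type is a plain Go map standing in for a roaring bitmap, and every name below is hypothetical.

```go
package main

import "fmt"

const ShardWidth = 1 << 20

// pos maps a (row, column) pair onto a single bit position (assumed layout).
func pos(row, column uint64) uint64 {
	return row*ShardWidth + column%ShardWidth
}

// bitmap is a toy stand-in for a roaring bitmap: the set of positions that are set.
type bitmap map[uint64]struct{}

// setMutex marks column as holding the value identified by row.
// A real mutex encoding also clears any previous row for the column; omitted here.
func (b bitmap) setMutex(column, row uint64) {
	b[pos(row, column)] = struct{}{}
}

// setBSI stores an unsigned value bit-sliced across rows:
// row 0 marks existence and row i+1 holds bit i of the value.
func (b bitmap) setBSI(column, value uint64) {
	b[pos(0, column)] = struct{}{}
	for i := uint64(0); value != 0; i, value = i+1, value>>1 {
		if value&1 == 1 {
			b[pos(i+1, column)] = struct{}{}
		}
	}
}

// getBSI reassembles the value stored for column, if any.
func (b bitmap) getBSI(column uint64) (value uint64, ok bool) {
	if _, ok = b[pos(0, column)]; !ok {
		return 0, false
	}
	for i := uint64(0); i < 64; i++ {
		if _, set := b[pos(i+1, column)]; set {
			value |= 1 << i
		}
	}
	return value, true
}

func main() {
	b := bitmap{}
	b.setMutex(5, 3)  // column 5 holds the string whose translated ID is 3
	b.setBSI(5, 300)  // column 5 holds the numeric value 300
	v, ok := b.getBSI(5)
	fmt.Println(v, ok)
}
```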

When saving key/value pairs we use (*pebble.Batch).Merge with a custom value merger that only performs an inlined (*Bitmap).Or. This design ensures that batch flushes are very fast and efficient.
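The reason an OR-only merger works well can be sketched without pebble. The orMerger type below borrows the MergeNewer and MergeOlder method names from pebble's ValueMerger, but it is not wired to pebble, its Finish is simplified, and it uses little-endian uint64 words as a toy bitmap format instead of serialized roaring bitmaps. All of these are assumptions for illustration only.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// orMerger accumulates operands by bitwise OR over 64-bit words.
type orMerger struct{ words []uint64 }

func (m *orMerger) or(value []byte) error {
	if len(value)%8 != 0 {
		return fmt.Errorf("value length %d is not a multiple of 8", len(value))
	}
	for i := 0; i*8 < len(value); i++ {
		if i == len(m.words) {
			m.words = append(m.words, 0)
		}
		m.words[i] |= binary.LittleEndian.Uint64(value[i*8:])
	}
	return nil
}

// MergeNewer and MergeOlder are identical: OR is commutative and associative,
// so the order in which operands arrive does not change the result.
func (m *orMerger) MergeNewer(value []byte) error { return m.or(value) }
func (m *orMerger) MergeOlder(value []byte) error { return m.or(value) }

// Finish serializes the accumulated words.
func (m *orMerger) Finish() []byte {
	out := make([]byte, 0, len(m.words)*8)
	for _, w := range m.words {
		out = binary.LittleEndian.AppendUint64(out, w)
	}
	return out
}

// encodeWords and decodeWords convert between words and the toy byte format.
func encodeWords(ws ...uint64) []byte {
	out := make([]byte, 0, len(ws)*8)
	for _, w := range ws {
		out = binary.LittleEndian.AppendUint64(out, w)
	}
	return out
}

func decodeWords(b []byte) []uint64 {
	var ws []uint64
	for i := 0; i+8 <= len(b); i += 8 {
		ws = append(ws, binary.LittleEndian.Uint64(b[i:]))
	}
	return ws
}

// mergeAll folds operands into a single value in the order given.
func mergeAll(operands ...[]byte) []byte {
	m := &orMerger{}
	for _, op := range operands {
		if err := m.MergeNewer(op); err != nil {
			panic(err)
		}
	}
	return m.Finish()
}

func main() {
	a := encodeWords(0b0101)
	b := encodeWords(0b0011, 1)
	fmt.Println(decodeWords(mergeAll(a, b)), decodeWords(mergeAll(b, a)))
}
```

Since a write only appends a merge operand and never has to read the existing value first, batches can be flushed without a read-modify-write cycle.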

Constants

const ShardWidth = 1 << 20

Variables

This section is empty.

Functions

This section is empty.

Types

type Cond added in v1.6.0

type Cond struct {
	Yes []uint64
	No  []uint64
}

func (*Cond) Apply added in v1.6.0

func (f *Cond) Apply(shard uint64, ra *roaring.Bitmap) *roaring.Bitmap

func (*Cond) IsEmpty added in v1.6.0

func (f *Cond) IsEmpty() bool

type FieldsData added in v1.6.0

type FieldsData [models.AllFields]*roaring.Bitmap

type FilterData added in v1.6.0

type FilterData [models.SearchFieldSize]*roaring.Bitmap

type FilterSet added in v1.6.0

type FilterSet [models.SearchFieldSize]Cond

func (*FilterSet) ScanFields added in v1.6.0

func (fs *FilterSet) ScanFields() *bitset.BitSet

ScanFields returns a set of all fields to scan for this filter.

func (*FilterSet) Set added in v1.6.0

func (fs *FilterSet) Set(yes bool, f models.Field, values ...uint64)

type ScanConfig added in v1.6.0

type ScanConfig struct {
	NotEmpty bool
	Shards   struct {
		Min, Max uint64
	}
	Views struct {
		Min, Max uint64
	}

	All, Data, Filter struct {
		Set      *bitset.BitSet
		Min, Max int
	}
}

type ScanData added in v1.6.0

type ScanData struct {
	Views   map[uint64]*FieldsData
	Columns *roaring.Bitmap
}

func (*ScanData) Set added in v1.6.0

func (s *ScanData) Set(view uint64, field int, ra *roaring.Bitmap)

type Timeseries added in v1.5.1

type Timeseries struct {
	// contains filtered or unexported fields
}

func New added in v1.5.1

func New(db *pebble.DB) *Timeseries

func (*Timeseries) Add added in v1.5.1

func (ts *Timeseries) Add(m *models.Model) error

func (*Timeseries) Close added in v1.5.1

func (ts *Timeseries) Close() error

func (*Timeseries) Find added in v1.5.1

func (ts *Timeseries) Find(ctx context.Context, field models.Field, id uint64) (value string)

func (*Timeseries) Get added in v1.5.1

func (ts *Timeseries) Get() *pebble.DB

func (*Timeseries) Realtime added in v1.6.0

func (ts *Timeseries) Realtime(domain string) (visitors uint64)

Realtime computes total visitors in the last 5 minutes. We make a few assumptions to ensure this call is very fast and efficient.

  • Only the current shard is evaluated: a shard comprises about 1 million events. We assume that a site will have fewer unique visitors than this in a 5 minute span.

This is called periodically for as long as a user has the website dashboard open. Covering only one shard strikes a balance between UI responsiveness and useful insight.

We can always adjust the number of shards we evaluate if we need to.

func (*Timeseries) Save added in v1.5.1

func (ts *Timeseries) Save() error

func (*Timeseries) Scan added in v1.6.0

func (ts *Timeseries) Scan(
	views []*roaring.Bitmap,
	filterSet FilterSet,
	valueSet *bitset.BitSet,
) (data []ScanData)

func (*Timeseries) ScanGlobal added in v1.6.0

func (ts *Timeseries) ScanGlobal(field models.Field, domain string, f func(shard uint64, columns, ra *roaring.Bitmap))

ScanGlobal reads bitmaps for field that belong to domain in the global key space. The global key space has a time resolution of 0.

This is useful for computing aggregates across all shards, such as total visitors to a website since the first event was received.

f is called with the shard, columns and ra. ra must not be modified: its memory is borrowed and is invalidated by the next call to f. If you want to own ra, call ra.Clone().
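The borrowing rule can be illustrated with a stdlib-only sketch. Here scan is a hypothetical stand-in for ScanGlobal that reuses one buffer across callbacks, and a []uint64 slice plays the role of the bitmap; copying the slice corresponds to calling ra.Clone().

```go
package main

import "fmt"

// scan calls f once per shard, reusing a single buffer for every call.
// It is a hypothetical stand-in for ScanGlobal, not the real method.
func scan(rows [][]uint64, f func(shard uint64, ra []uint64)) {
	var buf []uint64
	for shard, row := range rows {
		buf = append(buf[:0], row...) // overwrites what the previous call saw
		f(uint64(shard), buf)
	}
}

// collect keeps each callback argument twice: once as-is (borrowed)
// and once as a copy (owned).
func collect(rows [][]uint64) (borrowed, owned [][]uint64) {
	scan(rows, func(_ uint64, ra []uint64) {
		borrowed = append(borrowed, ra)
		owned = append(owned, append([]uint64(nil), ra...))
	})
	return borrowed, owned
}

func main() {
	borrowed, owned := collect([][]uint64{{1, 2}, {9, 8}})
	// The borrowed slice for shard 0 now shows shard 1's data; the copy is intact.
	fmt.Println(borrowed[0], owned[0])
}
```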

We store global bitmaps for all fields. Right now we only use this to display a website's visitors on the sites home dashboard.

func (*Timeseries) Search added in v1.5.1

func (ts *Timeseries) Search(field models.Field, prefix []byte, f func(key []byte, value uint64))

func (*Timeseries) Select added in v1.5.1

func (ts *Timeseries) Select(
	ctx context.Context,
	values *bitset.BitSet,
	domain string, start,
	end time.Time,
	intrerval query.Interval,
	filters query.Filters,
	cb func(shard, view uint64, columns *roaring.Bitmap, data FieldsData))

func (*Timeseries) Shards added in v1.5.1

func (ts *Timeseries) Shards(views iter.Seq[time.Time]) []*roaring.Bitmap

func (*Timeseries) Translate added in v1.5.1

func (ts *Timeseries) Translate(field models.Field, value []byte) uint64
