Documentation ¶
Overview ¶
Package timeseries implements key/value-store-backed, roaring-bitmap-based storage for web analytics events. We use pebble as the embedded key/value store and rely heavily on its value merge feature, so supporting a different underlying key/value store is not possible.
A web analytics event comprises the following fundamental properties. NOTE: the structure is taken from an earlier version of vince. We no longer use protocol buffers, but the names and data types stay the same, except for bounce, which is now represented as an int8.
int64 timestamp = 1;
int64 id = 2;
optional bool bounce = 3;
bool session = 4;
bool view = 5;
double duration = 6;
string browser = 19;
string browser_version = 20;
string city = 26;
string country = 23;
string device = 18;
string domain = 25;
string entry_page = 9;
string event = 7;
string exit_page = 10;
string host = 27;
string os = 21;
string os_version = 22;
string page = 8;
string referrer = 12;
string region = 24;
string source = 11;
string utm_campaign = 15;
string utm_content = 16;
string utm_medium = 14;
string utm_source = 13;
string utm_term = 17;
string tenant_id = 28;
This is the only data structure we need to store and query efficiently. All string properties are used for search and aggregation.
Timeseries ¶
All queries going through this package are time based. Computation of time ranges and resolutions is handled by the internal/compute package.
We have six time resolutions that are used for search:
- Minute
- Hour
- Day
- Week
- Month
- Year
Time in unix milliseconds, truncated to one of the resolutions above, is stored as part of each key, so that queries for similar timestamps load similar blocks, which speeds up data retrieval. Details about timestamp encoding are discussed in the Keys section.
Keys ¶
A key is broken into the following components ¶
[ byte(prefix) ][ byte(resolution) ][ uint64(timestamp) ][ byte(field) ][ uint64(shard) ]
prefix: encodes a unique global prefix assigned to time series data. This value is subject to change; however, it is the sole indicator that the key holds time series data.
resolution: ensures we only process blocks relevant to queries. All queries must specify their resolution, e.g. by minute, hour, day, etc.
shard: events are stored in partitions of ShardWidth (1 << 20, roughly one million) events. Each event is assigned a unique ID, which is an auto-incrementing uint64 value. The assigned shard is computed as:
shard = id / ( shard_width ) # shard_width = 1 << 20
field: we assign a unique number to each property.
Field_unknown Field = 0
Field_timestamp Field = 1
Field_id Field = 2
Field_bounce Field = 3
Field_duration Field = 4
Field_city Field = 5
Field_view Field = 6
Field_session Field = 7
Field_browser Field = 8
Field_browser_version Field = 9
Field_country Field = 10
Field_device Field = 11
Field_domain Field = 12
Field_entry_page Field = 13
Field_event Field = 14
Field_exit_page Field = 15
Field_host Field = 16
Field_os Field = 17
Field_os_version Field = 18
Field_page Field = 19
Field_referrer Field = 20
Field_source Field = 21
Field_utm_campaign Field = 22
Field_utm_content Field = 23
Field_utm_medium Field = 24
Field_utm_source Field = 25
Field_utm_term Field = 26
Field_subdivision1_code Field = 27
Field_subdivision2_code Field = 28
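The key layout and shard computation described in this section can be sketched as follows. This is an illustrative example, not the package's encoder: the encodeKey and shardOf helpers, the prefix and resolution byte values, and the big-endian integer encoding are assumptions. Big-endian is chosen here because it makes byte-wise key order agree with numeric order, which suits an ordered store.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const shardWidth = 1 << 20 // same value as the ShardWidth constant

// keySize is 1 (prefix) + 1 (resolution) + 8 (timestamp) + 1 (field) + 8 (shard).
const keySize = 19

// encodeKey lays the components out in the documented order.
// Assumption: both integers are written big-endian.
func encodeKey(prefix, resolution byte, timestamp uint64, field byte, shard uint64) []byte {
	key := make([]byte, keySize)
	key[0] = prefix
	key[1] = resolution
	binary.BigEndian.PutUint64(key[2:10], timestamp)
	key[10] = field
	binary.BigEndian.PutUint64(key[11:19], shard)
	return key
}

// shardOf returns the shard an event ID belongs to.
func shardOf(id uint64) uint64 { return id / shardWidth }

func main() {
	const (
		prefix     = 0x01 // hypothetical prefix byte
		resolution = 0x02 // hypothetical resolution byte
		fieldPage  = 19   // Field_page from the list above
	)
	id := uint64(3*shardWidth + 42)
	key := encodeKey(prefix, resolution, 1710428940000, fieldPage, shardOf(id))
	fmt.Printf("shard=%d key=%x\n", shardOf(id), key)
}
```

With this ordering, all keys for one resolution and truncated timestamp are adjacent, and within them the keys for one field are grouped by shard.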
Values ¶
All values are stored as serialized roaring bitmaps. This ensures that we only decode once, at the pebble level; values are loaded directly without further decoding.
We use different schemes depending on the data type. All string fields are stored with mutex encoding, and the rest are stored as a bit-sliced index (BSI).
Bitmap values contain both row and column values. Details on how row and column are combined to derive positions in the bitmap are documented in the respective (*Bitmap)Mutex and (*Bitmap)BSI methods.
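To make the row/column idea concrete, here is a sketch of one common convention for row-oriented bitmap indexes, where a bit position is derived as row * ShardWidth + column % ShardWidth. Everything in it is an assumption for illustration: the pos, mutexBit, and bsiBits helpers are hypothetical, the BSI row layout (row 0 as an existence bit, row i+1 for bit i of the value) is only one possible choice, and the authoritative layout is whatever (*Bitmap)Mutex and (*Bitmap)BSI document.

```go
package main

import "fmt"

const shardWidth = 1 << 20

// pos maps a (row, column) pair to a single bit position inside one shard.
func pos(row, column uint64) uint64 {
	return row*shardWidth + column%shardWidth
}

// mutexBit returns the bit for a mutex-encoded field: the row is the
// translated value ID and the column is the event ID, so each column has
// exactly one row set.
func mutexBit(valueID, eventID uint64) uint64 {
	return pos(valueID, eventID)
}

// bsiBits returns the bits for a non-negative integer stored as a bit-sliced
// index. Assumed layout: row 0 marks that the column has a value, and row
// i+1 holds bit i of the value.
func bsiBits(value, eventID uint64) []uint64 {
	bits := []uint64{pos(0, eventID)}
	for i := uint64(0); value != 0; i, value = i+1, value>>1 {
		if value&1 == 1 {
			bits = append(bits, pos(i+1, eventID))
		}
	}
	return bits
}

func main() {
	fmt.Println(mutexBit(3, 7)) // value ID 3 for event 7
	fmt.Println(bsiBits(5, 7))  // integer 5 for event 7
}
```

Under this convention, counting the events with a given string value is a count over one row, and a range comparison on an integer field is a sequence of row-wise bitmap operations.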
When saving key/value pairs we use (*pebble.Batch)Merge together with a custom value merger that only performs an inlined (*Bitmap)Or. This design ensures that batch flushes are fast and efficient.
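The OR-only merge can be illustrated with a small standalone sketch. The orMerger type below has the same method set as pebble's ValueMerger interface (MergeNewer, MergeOlder, Finish), but it is not the merger used by this package: it is hypothetical, it does not import pebble, and it operates on raw byte bitsets instead of serialized roaring bitmaps to stay self-contained.

```go
package main

import (
	"fmt"
	"io"
)

// orMerger accumulates merge operands with a bitwise OR. Because OR is
// commutative and associative, older and newer operands are handled the same
// way and the order of arrival does not matter.
type orMerger struct{ acc []byte }

// or folds value into the accumulator, growing it when value is longer.
func (m *orMerger) or(value []byte) {
	if len(value) > len(m.acc) {
		grown := make([]byte, len(value))
		copy(grown, m.acc)
		m.acc = grown
	}
	for i, b := range value {
		m.acc[i] |= b
	}
}

func (m *orMerger) MergeNewer(value []byte) error { m.or(value); return nil }
func (m *orMerger) MergeOlder(value []byte) error { m.or(value); return nil }

// Finish returns the merged value; no closer is needed in this sketch.
func (m *orMerger) Finish(includesBase bool) ([]byte, io.Closer, error) {
	return m.acc, nil, nil
}

// newORMerger starts a merge for key with its first operand. The operand is
// copied, so the caller may reuse its buffer.
func newORMerger(key, value []byte) *orMerger {
	m := &orMerger{}
	m.or(value)
	return m
}

func main() {
	m := newORMerger([]byte("k"), []byte{0b0001})
	_ = m.MergeNewer([]byte{0b0100})
	_ = m.MergeOlder([]byte{0b0000, 0b1000})
	merged, _, _ := m.Finish(true)
	fmt.Printf("%08b\n", merged)
}
```

Since writers only emit operands and never read the existing value first, a flush is a sequence of blind writes, which is where the speed claimed above would come from.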
Index ¶
- Constants
- type Cond
- type FilterSet
- type ScanCall
- type ScanConfig
- type Timeseries
- func (ts *Timeseries) Add(m *models.Model) error
- func (ts *Timeseries) Close() error
- func (ts *Timeseries) Find(ctx context.Context, field models.Field, id uint64) (value string)
- func (ts *Timeseries) Get() *shards.DB
- func (ts *Timeseries) Location() *location.Location
- func (ts *Timeseries) Save() error
- func (ts *Timeseries) Scan(res encoding.Resolution, start, end time.Time, filterSet FilterSet, ...) error
- func (ts *Timeseries) Search(field models.Field, prefix []byte, f func(key []byte, value uint64))
- func (ts *Timeseries) Select(ctx context.Context, values models.BitSet, domain string, start, end time.Time, ...) error
- func (ts *Timeseries) Translate(field models.Field, value []byte) uint64
- func (ts *Timeseries) Visitors(start, end time.Time, resolution encoding.Resolution, domain string) (visitors uint64)
Constants ¶
const ShardWidth = 1 << 20
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Cond ¶ added in v1.6.0
Cond defines exact-match rows and non-exact-match rows. It applies a union of all columns satisfying the conditions.
This assumes Cond is for a mutex field. vince only supports filtering on mutex fields.
type FilterSet ¶ added in v1.6.0
type FilterSet [models.SearchFieldSize]Cond
func (*FilterSet) ScanFields ¶ added in v1.6.0
ScanFields returns a set of all fields to scan for this filter.
type ScanConfig ¶ added in v1.6.0
type Timeseries ¶ added in v1.5.1
type Timeseries struct {
// contains filtered or unexported fields
}
func (*Timeseries) Add ¶ added in v1.5.1
func (ts *Timeseries) Add(m *models.Model) error
Add processes m and batches it. It must be called in the same goroutine as (*Timeseries)Save.
When we reach a shard boundary, the existing batch is saved before m is added. The []byte fields of m must not be modified, because we hold references to them during translation. A safe usage is to release m immediately after calling this method and reset it by calling
*m = models.Model{}
func (*Timeseries) Close ¶ added in v1.5.1
func (ts *Timeseries) Close() error
Close releases resources and removes buffers used.
func (*Timeseries) Get ¶ added in v1.5.1
func (ts *Timeseries) Get() *shards.DB
func (*Timeseries) Location ¶ added in v1.8.0
func (ts *Timeseries) Location() *location.Location
func (*Timeseries) Save ¶ added in v1.5.1
func (ts *Timeseries) Save() error
Save persists all buffered events into the pebble key/value store. This method is not safe for concurrent use. It is intended to be called in the same goroutine that calls (*Timeseries)Add.
The goal is to ensure an almost lock-free ingestion path (with the exception of translation, which uses an RWMutex).
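One way to honor the same-goroutine requirement for Add and Save is to give a single goroutine ownership of the store and feed it through a channel. The sketch below shows that pattern with stand-in types: model and store are hypothetical placeholders for models.Model and *Timeseries that reproduce only the Add/Save shape, and the flush-every-N policy is an assumption chosen for brevity.

```go
package main

import "fmt"

// model stands in for models.Model; only one field is shown.
type model struct{ Page []byte }

// store stands in for *Timeseries and only counts events.
type store struct{ buffered, saved int }

func (s *store) Add(m *model) error { s.buffered++; return nil }

func (s *store) Save() error {
	s.saved += s.buffered
	s.buffered = 0
	return nil
}

// ingest owns s: it is the only goroutine that calls Add and Save.
func ingest(s *store, events <-chan *model, flushEvery int, done chan<- error) {
	n := 0
	for m := range events {
		if err := s.Add(m); err != nil {
			done <- err
			return
		}
		*m = model{} // reset m right after Add, as the Add documentation suggests
		n++
		if n%flushEvery == 0 {
			if err := s.Save(); err != nil {
				done <- err
				return
			}
		}
	}
	done <- s.Save() // flush whatever is still buffered
}

func main() {
	s := &store{}
	events := make(chan *model, 5)
	done := make(chan error, 1)
	go ingest(s, events, 2, done)
	for i := 0; i < 5; i++ {
		events <- &model{Page: []byte("/")}
	}
	close(events)
	if err := <-done; err != nil {
		fmt.Println("ingest failed:", err)
		return
	}
	fmt.Println("saved events:", s.saved)
}
```

Producers never touch the store directly, so the ingestion path needs no lock of its own; the channel is the only synchronization point.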
func (*Timeseries) Scan ¶ added in v1.6.0
func (ts *Timeseries) Scan(
	res encoding.Resolution,
	start, end time.Time,
	filterSet FilterSet,
	valueSet models.BitSet,
	cb ScanCall,
) error
func (*Timeseries) Translate ¶ added in v1.5.1
func (ts *Timeseries) Translate(field models.Field, value []byte) uint64
func (*Timeseries) Visitors ¶ added in v1.7.0
func (ts *Timeseries) Visitors(start, end time.Time, resolution encoding.Resolution, domain string) (visitors uint64)
Realtime computes total visitors in the last 5 minutes.