Documentation ¶
Overview ¶
Package timeseries implements a key/value-store-backed, roaring-bitmap-based storage engine for web analytics events. We use pebble as the embedded key/value store and rely heavily on its value-merge feature, so supporting a different underlying key/value store is not possible.
A web analytics event comprises the following fundamental properties. NOTE: The structure is taken from an earlier version of vince; we no longer use protocol buffers, but the names and data types stay the same, except for bounce, which is now represented as an int8.
int64 timestamp = 1;
int64 id = 2;
optional bool bounce = 3;
bool session = 4;
bool view = 5;
double duration = 6;
string browser = 19;
string browser_version = 20;
string city = 26;
string country = 23;
string device = 18;
string domain = 25;
string entry_page = 9;
string event = 7;
string exit_page = 10;
string host = 27;
string os = 21;
string os_version = 22;
string page = 8;
string referrer = 12;
string region = 24;
string source = 11;
string utm_campaign = 15;
string utm_content = 16;
string utm_medium = 14;
string utm_source = 13;
string utm_term = 17;
string tenant_id = 28;
This is the only data structure we need to store and query efficiently. All string properties are used for search and aggregation.
Timeseries ¶
All queries going through this package are time based. Computation of time ranges and resolutions is handled by the internal/compute package.
We have six time resolutions that are used for search:
- Minute
- Hour
- Day
- Week
- Month
- Year
Time, in Unix milliseconds truncated to the resolutions above, is stored as part of the keys so that queries over similar timestamps load similar blocks, speeding up data retrieval. Details of the timestamp encoding are discussed in the Keys section.
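The truncation step can be sketched as follows. This is a minimal, standard-library-only illustration; the actual logic lives in the internal/compute package and may differ in details (for example, which weekday starts a week):

```go
package main

import (
	"fmt"
	"time"
)

// truncate returns ts (Unix milliseconds) truncated to the given resolution.
// Minute and Hour are fixed-width, so time.Truncate works; Day, Week, Month
// and Year are calendar-aware and are rebuilt with time.Date instead.
func truncate(ms int64, res string) int64 {
	t := time.UnixMilli(ms).UTC()
	switch res {
	case "minute":
		t = t.Truncate(time.Minute)
	case "hour":
		t = t.Truncate(time.Hour)
	case "day":
		t = time.Date(t.Year(), t.Month(), t.Day(), 0, 0, 0, 0, time.UTC)
	case "week":
		// Roll back to Monday (an assumed week start).
		d := (int(t.Weekday()) + 6) % 7
		t = time.Date(t.Year(), t.Month(), t.Day()-d, 0, 0, 0, 0, time.UTC)
	case "month":
		t = time.Date(t.Year(), t.Month(), 1, 0, 0, 0, 0, time.UTC)
	case "year":
		t = time.Date(t.Year(), 1, 1, 0, 0, 0, 0, time.UTC)
	}
	return t.UnixMilli()
}

func main() {
	ts := time.Date(2024, 3, 15, 10, 37, 42, 0, time.UTC).UnixMilli()
	fmt.Println(truncate(ts, "hour"))  // start of 10:00 UTC
	fmt.Println(truncate(ts, "month")) // start of March 2024
}
```

Because the truncated timestamp is part of the key, all events in the same bucket share a key prefix and end up physically co-located.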
Keys ¶
A key is broken into the following components ¶
[ byte(prefix) ][ uint64(shard) ][ uint64(timestamp) ][ byte(field) ]
prefix: encodes a unique global prefix assigned for timeseries data. This value is subject to change, however it is the sole indicator that the key holds time series data.
shard: We store events in partitions of 1 million. Each event is assigned a unique ID, an auto-incrementing uint64 value. The assigned shard is computed as:
shard = id / ( shard_width ) # shard_width = 1 << 20
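In Go this is a single integer division (shard_width = 1 << 20 = 1,048,576, roughly 1 million events per shard):

```go
package main

import "fmt"

// ShardWidth matches the package constant: each shard covers 1 << 20 event IDs.
const ShardWidth = 1 << 20

// shardOf returns the shard an event ID belongs to.
func shardOf(id uint64) uint64 {
	return id / ShardWidth
}

func main() {
	fmt.Println(shardOf(0))       // first shard
	fmt.Println(shardOf(1048575)) // last ID in shard 0
	fmt.Println(shardOf(1048576)) // first ID in shard 1
}
```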
field: we assign a unique number to each property.
Field_unknown           Field = 0
Field_timestamp         Field = 1
Field_id                Field = 2
Field_bounce            Field = 3
Field_duration          Field = 4
Field_city              Field = 5
Field_view              Field = 6
Field_session           Field = 7
Field_browser           Field = 8
Field_browser_version   Field = 9
Field_country           Field = 10
Field_device            Field = 11
Field_domain            Field = 12
Field_entry_page        Field = 13
Field_event             Field = 14
Field_exit_page         Field = 15
Field_host              Field = 16
Field_os                Field = 17
Field_os_version        Field = 18
Field_page              Field = 19
Field_referrer          Field = 20
Field_source            Field = 21
Field_utm_campaign      Field = 22
Field_utm_content       Field = 23
Field_utm_medium        Field = 24
Field_utm_source        Field = 25
Field_utm_term          Field = 26
Field_subdivision1_code Field = 27
Field_subdivision2_code Field = 28
The shard and timestamp components are encoded with binary.AppendUvarint. This scheme ensures efficient time-range queries: most of the time we can iterate efficiently over co-located data.
Values ¶
All values are stored as serialized roaring bitmaps. This ensures that we decode only once, at the pebble level; values are loaded directly without additional decoding.
We use different schemes depending on the data type. All string fields are stored with mutex encoding; the rest are stored as bit-sliced indexes (BSI).
Bitmap values contain both row and column values. Details on how row and column are combined to derive positions in the bitmap are documented in the respective (*Bitmap)Mutex and (*Bitmap)BSI methods.
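As an illustration of folding a (row, column) pair into one bit position: the formula below is the convention used by Pilosa-style bitmap stores and is an assumption on our part; the authoritative logic is in the (*Bitmap)Mutex and (*Bitmap)BSI methods.

```go
package main

import "fmt"

// ShardWidth matches the package constant.
const ShardWidth = 1 << 20

// pos folds a (row, column) pair into a single bit position within a shard,
// Pilosa-style: each row occupies a contiguous ShardWidth-wide band, and the
// column selects a bit inside that band.
func pos(row, column uint64) uint64 {
	return row*ShardWidth + column%ShardWidth
}

func main() {
	fmt.Println(pos(0, 5)) // bit 5 in row 0's band
	fmt.Println(pos(2, 5)) // bit 5 in row 2's band
}
```

Under this scheme a mutex-encoded string field sets exactly one row per column (the translated value ID), while a BSI field sets one row per bit of the numeric value.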
When saving key/value pairs we use (*pebble.Batch).Merge and a custom value merger that only performs an inlined (*Bitmap).Or. This design ensures that batch flushes are very fast and efficient.
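The merge semantics can be illustrated without pebble: merging two bitmap values is just a bitwise OR, so repeated merges of the same key are associative and order-independent, which is exactly what a key/value merge operator needs. A minimal sketch over raw uint64 words (the real merger operates on serialized roaring bitmaps):

```go
package main

import "fmt"

// orMerge merges two bitmap payloads word by word with bitwise OR,
// mirroring what the custom pebble value merger does with roaring bitmaps.
func orMerge(a, b []uint64) []uint64 {
	if len(b) > len(a) {
		a, b = b, a // make a the longer slice
	}
	out := make([]uint64, len(a))
	copy(out, a)
	for i, w := range b {
		out[i] |= w
	}
	return out
}

func main() {
	merged := orMerge([]uint64{0b0101}, []uint64{0b0011})
	fmt.Printf("%04b\n", merged[0])
}
```

Because OR is associative and commutative, pebble is free to apply merges in any order during compaction, and writers never need to read the existing value before appending, which is why batch flushes stay fast.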
Index ¶
- Constants
- type Cond
- type FieldsData
- type FilterData
- type FilterSet
- type ScanConfig
- type ScanData
- type Timeseries
- func (ts *Timeseries) Add(m *models.Model) error
- func (ts *Timeseries) Close() error
- func (ts *Timeseries) Find(ctx context.Context, field models.Field, id uint64) (value string)
- func (ts *Timeseries) Get() *pebble.DB
- func (ts *Timeseries) Realtime(domain string) (visitors uint64)
- func (ts *Timeseries) Save() error
- func (ts *Timeseries) Scan(views []*roaring.Bitmap, filterSet FilterSet, valueSet *bitset.BitSet) (data []ScanData)
- func (ts *Timeseries) ScanGlobal(field models.Field, domain string, ...)
- func (ts *Timeseries) Search(field models.Field, prefix []byte, f func(key []byte, value uint64))
- func (ts *Timeseries) Select(ctx context.Context, values *bitset.BitSet, domain string, ...)
- func (ts *Timeseries) Shards(views iter.Seq[time.Time]) []*roaring.Bitmap
- func (ts *Timeseries) Translate(field models.Field, value []byte) uint64
Constants ¶
const ShardWidth = 1 << 20
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type FilterData ¶ added in v1.6.0
type FilterData [models.SearchFieldSize]*roaring.Bitmap
type FilterSet ¶ added in v1.6.0
type FilterSet [models.SearchFieldSize]Cond
func (*FilterSet) ScanFields ¶ added in v1.6.0
ScanFields returns a set of all fields to scan for this filter.
type ScanConfig ¶ added in v1.6.0
type ScanData ¶ added in v1.6.0
type ScanData struct { Views map[uint64]*FieldsData Columns *roaring.Bitmap }
type Timeseries ¶ added in v1.5.1
type Timeseries struct {
// contains filtered or unexported fields
}
func New ¶ added in v1.5.1
func New(db *pebble.DB) *Timeseries
func (*Timeseries) Close ¶ added in v1.5.1
func (ts *Timeseries) Close() error
func (*Timeseries) Get ¶ added in v1.5.1
func (ts *Timeseries) Get() *pebble.DB
func (*Timeseries) Realtime ¶ added in v1.6.0
func (ts *Timeseries) Realtime(domain string) (visitors uint64)
Realtime computes total visitors in the last 5 minutes. We make a few assumptions to ensure this call is very fast and efficient.
- Only the current shard is evaluated: a shard comprises about 1 million events. We assume that a site will have fewer unique visitors than this in a 5-minute span.
We call this periodically, but continuously, while a user is on the website dashboard. Covering only one shard strikes a balance between UI responsiveness and useful insight.
We can always adjust the number of shards we evaluate if we need to.
func (*Timeseries) Save ¶ added in v1.5.1
func (ts *Timeseries) Save() error
func (*Timeseries) ScanGlobal ¶ added in v1.6.0
func (ts *Timeseries) ScanGlobal(field models.Field, domain string, f func(shard uint64, columns, ra *roaring.Bitmap))
ScanGlobal reads bitmaps for the field that belongs to domain in a global key space. The global key space has a time resolution of 0.
Useful for computing aggregates across all shards, such as the total visitors to a website since we received the first event.
f is called with the shard and ra. ra must not be modified, as its memory is borrowed and will be invalidated on the next call to f. If you want to own ra, call ra.Clone().
We store global bitmaps for all fields. Right now we only use this to display a website's visitors on the site's home dashboard.