quamina

package module
v0.0.0-...-a217623 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 22, 2024 License: Apache-2.0 Imports: 17 Imported by: 0

README

Quamina

ALERTWest Fork

This repo is a fork of the original Quamina repo, with the goal of improving pattern compilation performance. The main difference is in trie.go, which uses a trie to first assemble the structure, then converts that to a coreMatcher.

WithPatterns vs AddPattern

The standard Quamina library uses (Quamina).AddPattern to add patterns to the matcher. The handling only adds a single pattern at a time, and does so via creating a new matching structure and merging it with the existing one.

This fork adds a WithPatterns option to New, which allows for bulk pattern addition. This is done by first creating a map[X]string and passing that to New via WithPatterns. The New function will then create a coreMatcher directly, avoiding the need to merge multiple structures. This is very fast for a large number of patterns/matching values, on a low number of fields, but has lots of opportunities to be improved.

[!IMPORTANT] WithPatterns currently only supports exact string pattern types.

Fast pattern-matching library


Quamina implements a data type with APIs to create an instance and add multiple Patterns to it, and then query data objects called Events to discover which of the Patterns match the fields in the Event. In typical cases, Quamina can match millions of Events per second, even with many Patterns added to the instance.

Quamina has no run-time dependencies beyond built-in Go libraries.

Quamina welcomes contributions.

Status

This is version 1.5.1 of Quamina. In future, the API will be changed only additively.

Note that we have documented more APIs than are actually fully implemented, with the intent of showing direction.

Patterns

Consider the following JSON Event, taken from the example in RFC 8259:

{
  "Image": {
    "Width":  800,
    "Height": 600,
    "Title":  "View from 15th Floor",
    "Thumbnail": {
      "Url":    "http://www.example.com/image/481989943",
      "Height": 125,
      "Width":  100
    },
    "Animated" : false,
    "IDs": [116, 943, 234, 38793]
  }
}

The following Patterns would match it:

{"Image": {"Width": [800]}}
{
  "Image": {
    "Animated": [ false ],
    "Thumbnail": {
      "Height": [ 125 ]
    },
    "IDs": [ 943 ]
  }
}
{"Image": { "Title": [ { "exists": true } ] } }
{
  "Image":  {
    "Width": [800],
    "Title": [ { "exists": true } ],
    "Animated": [ false ]
  }
}
{"Image": { "Width": [800], "IDs": [ { "exists": true } ] } }
{"Foo": [ { "exists": false } ] }
{
  "Image": {
    "Thumbnail": { "Url": [ { "wildcard": "*9943" } ] }
  }
}
{
  "Image": {
    "Thumbnail": { "Url":
      [ { "wildcard": "http://www.example.com/*" } ] }
  }
}
{
  "Image": {
    "Thumbnail": { "Url":
      [ { "wildcard": "http://www.example.*/*9943" } ] }
  }
}
{
  "Image": {
    "Title": [ {"anything-but":  ["Pikachu", "Eevee"] } ]
  }
}
{
  "Image": {
    "Thumbnail": {
      "Url": [ "a", { "prefix": "https:" } ] }
  }
} 
{
  "Image": {
    "Title": [ { "equals-ignore-case": "VIEW FROM 15th FLOOR" } ] 
  }
}

The syntax and semantics of Patterns are fully specified in Patterns in Quamina.

The structure of a Pattern, in terms of field names and nesting, must be the same as the structure of the Event to be matched. The field values are always given as an array; if any element of the array matches the value in the Event, the match is good. If the field in the Event is array-valued, matching is true if the intersection of the arrays is non-empty.

Fields which are not mentioned in the Pattern will be assumed to match, but all fields mentioned must match. So the semantics are effectively an OR on each field's values, but an AND on the field names.

The "exists":true and "exists":false patterns have corner cases; details are covered in Patterns in Quamina.

Flattening and Matching

The first step in finding matches for an Event is “flattening” it, which is to say turning it into a list of pathname/value pairs called Fields. Quamina defines a Flattener interface type and has a built-in Flattener for JSON.

Note that should you wish to process Events in a format other than JSON, you can implement the Flattener interface yourself.

Data Errors

Note: Both Patterns and Events are required to be RFC 8259-conforming JSON. In particular, field names and values in both Patterns and Events must be valid UTF-8. Unescaped characters smaller than 0x1F (illegal per JSON), and bytes with value greater than 0XF4 (can't occur in correctly composed UTF-8) are rejected by the APIs.

In some cases, JSON errors in Events may not be caught by the Matching APIs. The Flattener works hard to avoid processing areas of the Event that cannot match any of the provided Patterns and, in skipping over them, may miss certain errors.

APIs

Control APIs
func New(opts ...Option) (*Quamina, error)

func WithMediaType(mediaType string) Option
func WithFlattener(f Flattener) Option
func WithPatternDeletion(b bool) Option
func WithPatternStorage(ps LivePatternsState) Option

For example:

q, err := quamina.New(quamina.WithMediaType("application/json"))

The meanings of the Option functions are:

WithMediaType: In the futue, Quamina will support Events not just in JSON but in other formats such as Avro, Protobufs, and so on. This option will make sure to invoke the correct Flattener. At the moment, the only supported value is application/json, the default.

WithFlattener: Requests that Quamina flatten Events with the provided (presumably user-written) Flattener.

WithPatternDeletion: If true, arranges that Quamina allows Patterns to be deleted from an instance. This is not free; it can incur extra costs in memory and occasional stop-the-world Quamina rebuilds. (We plan to improve this.)

WithPatternStorage: If you provide an argument that supports the LivePatternStorage API, Quamina will use it to maintain a list of which Patterns have currently been added but not deleted. This could be useful if you wanted to rebuild Quamina instances for sharded processing or after a system failure. Note: Not yet implemented.

Data APIs
func (q *Quamina) AddPattern(x X, patternJSON string) error

The first argument identifies the Pattern and will be returned by Quamina when asked to match against Events. X is defined as any.

The Pattern is provided in the second argument string which must be a JSON object as exemplified above in this document.

The error return is used to signal invalid Pattern structure, which could be bad UTF-8 or malformed JSON or leaf values which are not provided as arrays.

As many Patterns as desired can be added to a Quamina instance. More than one Pattern can be added with the same X identifier.

The AddPattern call is single-threaded; if multiple threads call it, they will block and execute sequentially.

func (q *Quamina) DeletePatterns(x X) error

After calling this API, no list of matches from AddPattern will include the X value specified in the argument.

The error return value is nil unless there was an internal failure of Quamina’s storage system.

func (q *Quamina) MatchesForEvent(event []byte) ([]X, error)

The error return value is nil unless there was an error in the encoding of the Event.

The []X return slice may be empty if none of the Patterns match the provided Event.

Concurrency

A single Quamina instance can not safely be used by multiple goroutines at the same time. However, the underlying data structure is designed for concurrent access and the Copy API is provided to support this.

func (q *Quamina) Copy() *Quamina

This generates a copy of the target instance. Such copies may safely run in parallel in different goroutines executing any combination of MatchesForEvent(), AddPattern(), and DeletePattern() calls. There is a significant performance penalty if a high proportion of these calls are AddPattern().

Note that the Copy() API is somewhat expensive, and that a Quamina instance exhibits “warm-up” behavior, i.e. the performance of MatchesForEvent() improves slightly upon repeated calls, especially over the first few calls. The conclusion is that, for maximum efficiency, once you’ve created a Quamina instance, whether through New() or Copy(), keep it around and run as many Events through it as is practical.

AddPattern() Performance

Tens of thousands of Patterns per second can be added to a Quamina instance; the in-memory data structure will become larger, but not unreasonably so. The amount of available memory is the only significant limit to the number of patterns an instance can carry.

MatchesForEvent() Performance

I used to say that the performance of MatchesForEvent was O(1) in the number of Patterns. That’s probably a reasonable way to think about it, because it’s almost right, except in the case where a very large number of wildcard patterns have been added; this is discussed in the next section.

To be correct, the performance is a little worse than O(N) where N is the average number of unique fields in an event that are used in one or more Patterns that have been added to the Quamina instance.

For example, suppose you have a list of 50,000 words, and you add a Pattern for each, of the form:

{"word": ["one of the words"]}

The performance in matching events should be about the same for one word or 50,000, with some marginal loss following on growth in the size of the necessary data structures.

However, adding another pattern that looks like the following, in the case that the Events have a top-level number field, would decrease the performance by a factor of roughly 2:

{"number": [11, 22, 33]}

After that, adding a few thousand more "number" patterns shouldn’t decrease the performance observably.

A word of explanation: Quamina compiles the Patterns into a somewhat-decorated automaton and uses that to find matches in Events. For Quamina to work, the incoming Events must be flattened into a list of pathname/value pairs and sorted. This process exceeds 50% of execution time, and is optimized by discarding any fields that do not appear in one or more of the Patterns added to Quamina. This code is optimized and in many cases can avoid processing every byte of the event. It’s complicated.

Once the list of fields which might match a Pattern is extracted, the cost of traversing the automaton is at most N, the number of fields left after discarding.

So, adding a new Pattern that only mentions fields which are already mentioned in previous Patterns is effectively free, i.e. O(1) in terms of run-time performance.

Quamina instances with large numbers of wildcard Patterns

A study of the theory of finite automata reveals that processing regular-expression constructs such as * increases the complexity of the automata necessary to match them. It develops that when a large number of such automata are compiled together, the merged output can contain a high degree of nondeterminism which can result in a drastic slowdown.

A fuzz test which adds a pattern for each of 12,959 5-letter words with one * embedded in each at a random offset slows matching speed down to below 10,000/second, in stark contrast to most Quamina instances, which can achieve millions of matches/second.

This slowdown is under active investigation and it is possible that the situation will improve.

Further documentation

There is a series of blog posts entitled Quamina Diary that provides a detailed discussion of the design decisions at a length unsuitable for in-code comments.

Name

From Wikipedia: Quamina Gladstone (1778 – 16 September 1823), most often referred to simply as Quamina, was a Guyanese slave from Africa and father of Jack Gladstone. He and his son were involved in the Demerara rebellion of 1823, one of the largest slave revolts in the British colonies before slavery was abolished.

Credits

@timbray: v0.0 and patches.

@jsmorph: Pruner and concurrency testing.

@embano1: CI/CD and project structure.

@yosiat: Flattening optimization.

@arnehormann: compact high-precision number representation.

Documentation

Overview

Package quamina instances support adding Patterns and then presenting Events, generating a report of which Patterns match the Event. Patterns and Events are both represented as JSON objects, although there is a provided Flattener interface by which structured objects in formats other than JSON can be matched by quamina. Quamina instances match Events quickly and with a latency that is not strongly affected by the number of Patterns which have been added.

Index

Examples

Constants

View Source
const MaxBytesInEncoding = 10
View Source
const SegmentSeparator = "\n"

Variables

This section is empty.

Functions

func MatcherFromPatterns

func MatcherFromPatterns(patterns map[X]string) (*coreMatcher, error)

MatcherFromPatterns is the main function for building a coreMatcher from a set of patterns. It essentially just builds a trie from the patterns and then converts it to a coreMatcher.

Types

type ArrayPos

type ArrayPos struct {
	Array int32
	Pos   int32
}

ArrayPos represents a Field's position in an Event's structure. Each array in the Event should get an integer which identifies it - in flattenJSON this is accomplished by keeping a counter and giving arrays numbers starting from 0. ArrayPos exists to ensure that Quamina MatchesForEvent will not return a match where two of the matching fields are in separate elements of the same array. Array uniquely identifies an array in an Event. Pos is the Field's index in the Array.

type Field

type Field struct {
	Path       []byte
	Val        []byte
	ArrayTrail []ArrayPos
	IsNumber   bool
}

Field represents a pathname/value combination, one of the data items which is matched against Patterns by the MatchesForEvent API. Path is the \n-separated path from the event root to this field value. Val is the value, a []byte forming a textual representation of the type ArrayTrail, for each array in the Path, identifies the array and the index in it.

type Flattener

type Flattener interface {
	Flatten(event []byte, tracker SegmentsTreeTracker) ([]Field, error)
	Copy() Flattener
}

nolint:goimports,gofmt Flattener is an interface which provides methods to turn a data structure into a list of path-names and values. The following example illustrates how it works for a JSON object: { "a": 1, "b": "two", "c": true", "d": nil, "e": { "e1": 2, "e2":, 3.02e-5} "f": [33, "x"]} } should produce

"a", "1"
"b", "\"two\"",
"c", "true"
"d", "nil",
"e\ne1", "2"
"e\ne2", "3.02e-5"
"f", "33"
"f", "\"x\""

Let's call the first column, eg "d" and "e\ne1", the path. For each step i the path, e.g. "d" and "e1", the Flattener should utilize SegmentsTreeTracker to traverse the hierarchy and select only the needed fields.

type LivePatternsState

type LivePatternsState interface {
	// Add adds a new pattern or updates an old pattern.
	//
	// Note that multiple patterns can be associated with the same X.
	Add(x X, pattern string) error

	// Delete removes all patterns associated with the given X and returns the
	// number of removed patterns.
	Delete(x X) (int, error)

	// Iterate calls the given function for every stored pattern.
	Iterate(func(x X, pattern string) error) error

	// Contains returns true if x is in the live set; false otherwise.
	Contains(x X) (bool, error)
}

LivePatternsState represents the required capabilities for maintaining the set of live patterns.

type Option

type Option func(q *Quamina) error

Option is an interface type used in Quamina's New API to pass in options. By convention, Option names have a prefix of "With".

func WithFlattener

func WithFlattener(f Flattener) Option

WithFlattener allows the specification of a caller-provided Flattener instance to use on incoming Events. This option call may not be provided more than once, nor can it be combined on the same invocation of quamina.New() with the WithMediaType() option.

func WithMediaType

func WithMediaType(mediaType string) Option

WithMediaType provides a media-type to support the selection of an appropriate Flattener. This option call may not be provided more than once, nor can it be combined on the same invocation of quamina.New() with the WithFlattener() option.

func WithPatternDeletion

func WithPatternDeletion(b bool) Option

WithPatternDeletion arranges, if the argument is true, that this Quamina instance will support the DeletePatterns() method. This option call may not be provided more than once.

func WithPatternStorage

func WithPatternStorage(ps LivePatternsState) Option

WithPatternStorage supplies the Quamina instance with a LivePatternState instance to be used to store the active patterns, i.e. those that have been added with AddPattern but not deleted with DeletePattern. This option call may not be provided more than once.

func WithPatterns

func WithPatterns(patterns map[X]string) Option

WithPatterns provides a map of patterns to be added to the Quamina instance upon creation. This option allows for bulk pattern addition during initialization.

type Quamina

type Quamina struct {
	// contains filtered or unexported fields
}

Quamina instances provide the public APIs of this pattern-matching library. A single Quamina instance is not thread-safe in that it cannot safely be used simultaneously in multiple goroutines. To re-use a Quamina instance concurrently in multiple goroutines, create copies using the Copy API.

func New

func New(opts ...Option) (*Quamina, error)

New returns a new Quamina instance. Consult the APIs beginning with "With" for the options that may be used to configure the new instance.

Example
package main

import (
	"fmt"
	"log"

	"github.com/DigitalPath-Inc/quamina"
)

const userRegisteredEvent = `{
  "id": "1c0e1ce4-3d88-4786-a09d-7133c170d02a",
  "type": "UserRegistered",
  "user": {
    "name": "Doe, John",
    "premiumAccount": true
  }
}
`

const premiumUserPattern = `{
  "type":["UserRegistered"],
  "user": {"premiumAccount": [true]}
}`

func main() {
	q, err := quamina.New()
	if err != nil {
		log.Fatalf("could not create quamina instance: %v", err)
	}

	const patternName = "premium user"
	err = q.AddPattern(patternName, premiumUserPattern)
	if err != nil {
		log.Fatalf("could not add pattern: %v", err)
	}

	matches, err := q.MatchesForEvent([]byte(userRegisteredEvent))
	if err != nil {
		log.Fatalf("could not match for event: %v", err)
	}

	for _, m := range matches {
		if m == patternName {
			fmt.Printf("pattern matched for event: %q", patternName)
			return
		}
	}

	// you would typically handle no matches cases here, but in this example no
	// match is a bug, hence panic :)
	panic("no pattern match")

}
Output:

pattern matched for event: "premium user"

func (*Quamina) AddPattern

func (q *Quamina) AddPattern(x X, patternJSON string) error

AddPattern adds a pattern, identified by the x argument, to a Quamina instance. patternJSON is a JSON object. error is returned in the case that the PatternJSON is invalid JSON or has a leaf which is not provided as an array. AddPattern is single-threaded; if it is invoked concurrently from multiple goroutines (in instances created using the Copy method) calls will block until any other AddPattern call in progress succeeds.

func (*Quamina) Copy

func (q *Quamina) Copy() *Quamina

Copy produces a new Quamina instance designed to be used safely in parallel with existing instances on different goroutines. Copy'ed instances share the same underlying data structures, so a pattern added to any instance with AddPattern will be visible in all of them.

func (*Quamina) DeletePatterns

func (q *Quamina) DeletePatterns(x X) error

DeletePatterns removes patterns identified by the x argument from the Quamina instance; the effect is that return values from future calls to MatchesForEvent will not include this x value.

func (*Quamina) MatchesForEvent

func (q *Quamina) MatchesForEvent(event []byte) ([]X, error)

MatchesForEvent returns a slice of X values which identify patterns that have previously been added to this Quamina instance and which "match" the event in the sense described in README. The matches slice may be empty if no patterns match. error can be returned in case that the event is not a valid JSON object or contains invalid UTF-8 byte sequences.

type SegmentsTreeTracker

type SegmentsTreeTracker interface {
	// Get returns another level of the hierarchy, referred as "Node"
	// If a node is returned we will need to traverse into (in JSON/CBOR/ProtoBuf/etc..)
	Get(segment []byte) (SegmentsTreeTracker, bool)

	// IsRoot - are we root node?
	// NOTE: need for early exit, can be solved differently maybe.
	IsRoot() bool

	// Called by the Flattener looking at a member name in a JSON object to ascertain
	// whether this particular member of the object is mentioned in any Patterns added
	// to the Quamina instance.
	IsSegmentUsed(segment []byte) bool

	// When a Flattener reaches the last (leaf) step of a path, this returns the full
	// path-name for that Field.  This is an optimization; since these need to be calculated
	// while executing `ddPattern, we might as wewll remember them for use during Flattening.
	PathForSegment(name []byte) []byte

	// Called by the Flattener to return the number of nodes (non-leaf children) and fields
	// (field values) contained in any node.  When processing through the node, once we've
	// hit the right number of nodes and fields we can terminate the Flattening process.
	NodesCount() int
	FieldsCount() int

	// String is used only for debugging.
	String() string
}

SegmentsTreeTracker is an interface used by Flattener to represents all the paths mentioned Patterns added to a Quamina instance in AddPattern() calls. It allows a Flattener to determine which Event fields may safely be ignored, and also caches the runtime form of the Field.Path value.

Consider this JSON example:

{ "a": {"b": 1, "c": 2}}

The tree will look like that:

[ root ]
   |
 [ "a" ] -> as node
   |-> with fields of: "b" and "c"

This allow us to traverse the hierarchial data together with the segments tree, fetch a node and answer:

  • Is the current segment is used? (JSON - is the current property needs to be selected)
  • Do we need to traverse into this Node as well? (JSON - do we need traverse this object?)
  • How much fields & nodes we have to traverse in the current hierarchy until we are finished? for example: in the current level, in the tree node we have 1 node and 2 fields we finishded selecting them, can we finish traversing this node?

type X

type X any

X is used in the AddPattern and MatchesForEvent APIs to identify the patterns that are added to a Quamina instance and are reported by that instance as matching an event. Commonly, X is a string used to name the pattern.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL