earley

README

PromQL Code Completion

promq offers code completion in the interactive console. In order to do this, we've chosen to re-implement the promql parser.

This wasn't a trivial decision, but we made it quite deliberately: the original promql parser is written using yacc, which generates an LALR parser. LALR parsers are bottom-up parsers, which means that the syntax tree isn't actually assembled until the very end of the parse, which is a rather undesirable property to have when you want to offer suggestions on what may be valid syntax in the middle of a partially parsed string.

Significant Considerations

There were a few significant aspects that factored into our choice of parser:

  • Our choice should enable us to maintain parity with promql with relative ease, since we are writing a separate parser.
  • Grammar rules should be definable top-down, since this makes it possible to anticipate possible valid syntactical structures during the parse.
  • Since promq is a command-line tool, we expressly do not need to optimize for full-blown IDE level code-completion; our implicit expectation is that we will be interacting with smaller statements for the most part and that we would be re-parsing our tree incrementally.

Earley Chart Parser

Given the factors at play, we converged on using an Earley chart parser. The more obvious downsides to this decision were:

  • The run-time can be really poor (O(n^3)) if the grammar is ambiguous.
  • Choosing this (or any other parser besides the one provided in prometheus/prometheus) means that we are bifurcating promql for this tool.

However, there are some counter considerations at play:

  • Despite the poor runtime:
    • Even in the worst case, we are dealing with input fed in from a command line, which means there is some practical upper bound on text length.
    • In the case of our parser, n is not directly determined by string length, since we feed our Earley parser a stream of lexed tokens.
  • I'm not actually sure it is possible to avoid bifurcating promql for this kind of use case, since promql uses an LALR parser. LR parsers in general support a restricted set of production rules which are not ideal for code completion (specifically, such rules will typically match many more things than are actually valid), since LR parsers don't construct the syntax tree until the very end of the parse.

Earley chart parsers have a couple of nice properties which make them a particularly good candidate for this use case:

  1. Earley parsers support all context-free languages without restriction, which means they can support both LR and LL production rules.
  2. Earley parsers naturally build and preserve state incrementally, which makes it trivial to convert one into a more fully incremental parser (this is a desired optimization/property for our parser; i.e. batch parsing isn't really on the table as an optimization).
  3. Earley parsers are easy to implement, and this also enables a setup in which it is easy to maintain parity with promql in the main prometheus/prometheus codebase.

Earley Internals

Earley algorithm

There are some key terms in the Earley algorithm:

  1. Terminal and non-terminal symbols: In promql.go, we define a number of terminal and non-terminal symbols. These are the lexical elements used in specifying the grammar rules (production rules). Terminal symbols are the elementary symbols of the grammar; non-terminal symbols can be replaced by groups of terminal symbols according to the grammar rules.
  2. Grammar rule: In promql.go, we also define the grammar rules. The left-hand side of a rule can be replaced by its right-hand side. There is one root rule, which is the starting point for any input string.
  3. Input position: The parser parses the input from left to right, step by step. It starts from the position prior to the input, which is position 0. The last position is the position after the last token.
  4. State set: For every input position, the parser generates a state set. The state set at input position k is called S(k). A state set is composed of a list of states; EarleyItem in item.go represents a single state. Each state is a tuple (X -> a • b, i), as in the sketch after this list:
    • The dot marks how much of the rule's right-hand side has been consumed so far; here it is at position 1.
    • The state applies rule X -> ab.
    • i is the origin position where the matching of the production began.
  5. Chart: The chart records the state set at each input position. The length of the Earley chart is the number of input tokens + 1.
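
To make the tuple notation concrete, here is a minimal, self-contained sketch of such a state in Go. The names are illustrative only; the package's actual EarleyItem (documented below) differs in its details.

package main

import "fmt"

// rule is a production X -> symbols.
type rule struct {
	lhs string
	rhs []string
}

// state is the tuple (X -> a • b, i): a rule, a dot position marking
// how much of the right-hand side has been consumed, and the origin
// position where matching of this rule began.
type state struct {
	r      rule
	dot    int
	origin int
}

func (s state) String() string {
	out := s.r.lhs + " ->"
	for i, sym := range s.r.rhs {
		if i == s.dot {
			out += " •"
		}
		out += " " + sym
	}
	if s.dot == len(s.r.rhs) {
		out += " •"
	}
	return fmt.Sprintf("(%s, %d)", out, s.origin)
}

func main() {
	// Prints "(X -> a • b, 0)": one symbol consumed, predicted at 0.
	fmt.Println(state{rule{"X", []string{"a", "b"}}, 1, 0})
}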

The Earley parser starts at position 0 with the root grammar rule and then repeatedly executes three operations: prediction, scanning, and completion:

  • Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).
  • Scanning: If a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k+1).
  • Completion: For every state in S(k) of the form (Y → γ •, j), find all states in S(j) of the form (X → α • Y β, i) and add (X → α Y • β, i) to S(k).

Only new items are added to a state set: two states with the same rule, dot position, and origin position are treated as duplicates.
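
The following self-contained sketch shows the three operations plus the dedup rule working over a toy grammar. It is an illustrative Go rendering of the textbook algorithm, not this package's implementation (in particular, nullable rules need extra care that is omitted here).

package main

import "fmt"

type rule struct {
	lhs string
	rhs []string
}

type state struct {
	r      rule
	dot    int
	origin int
}

// next returns the symbol after the dot, if any.
func (s state) next() (string, bool) {
	if s.dot < len(s.r.rhs) {
		return s.r.rhs[s.dot], true
	}
	return "", false
}

// add puts st into state set k unless an identical (rule, dot,
// origin) state is already there: the dedup rule described above.
func add(chart [][]state, k int, st state) {
	for _, s := range chart[k] {
		if s.r.lhs == st.r.lhs && fmt.Sprint(s.r.rhs) == fmt.Sprint(st.r.rhs) &&
			s.dot == st.dot && s.origin == st.origin {
			return
		}
	}
	chart[k] = append(chart[k], st)
}

func isNonTerminal(g []rule, sym string) bool {
	for _, r := range g {
		if r.lhs == sym {
			return true
		}
	}
	return false
}

func recognize(g []rule, root string, tokens []string) bool {
	chart := make([][]state, len(tokens)+1)
	for _, r := range g {
		if r.lhs == root {
			add(chart, 0, state{r, 0, 0}) // seed S(0) with the root rule
		}
	}
	for k := 0; k <= len(tokens); k++ {
		// S(k) grows while we process it, so iterate by index.
		for i := 0; i < len(chart[k]); i++ {
			st := chart[k][i]
			sym, ok := st.next()
			switch {
			case !ok: // Completion: (Y -> γ •, j)
				for _, parent := range chart[st.origin] {
					if psym, pok := parent.next(); pok && psym == st.r.lhs {
						add(chart, k, state{parent.r, parent.dot + 1, parent.origin})
					}
				}
			case isNonTerminal(g, sym): // Prediction
				for _, r := range g {
					if r.lhs == sym {
						add(chart, k, state{r, 0, k})
					}
				}
			case k < len(tokens) && tokens[k] == sym: // Scanning
				add(chart, k+1, state{st.r, st.dot + 1, st.origin})
			}
		}
	}
	// Accept if a finished root rule spans the whole input.
	for _, st := range chart[len(tokens)] {
		if _, ok := st.next(); !ok && st.r.lhs == root && st.origin == 0 {
			return true
		}
	}
	return false
}

func main() {
	g := []rule{
		{"S", []string{"S", "+", "n"}},
		{"S", []string{"n"}},
	}
	fmt.Println(recognize(g, "S", []string{"n", "+", "n"})) // true
}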

We use the final state set (the one where the dot positions reflect the end of the input) to generate completion suggestions. For each unfinished state in that set (one with symbols remaining after the dot), if the symbol after the dot is a terminal, it is a potential suggestion. The suggestion generated from the Earley algorithm is a suggested token type; a mapping in completer.go matches each type to the real suggested strings.
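
Continuing the recognizer sketch above (same rule and state types), suggestion generation amounts to collecting the terminals that sit immediately after a dot in the final state set; the real package then maps the resulting token types to concrete strings via completer.go.

// suggest collects the terminal symbols that appear immediately
// after a dot in the final state set. Each one is a token type that
// could validly come next. Illustrative only.
func suggest(g []rule, lastSet []state) []string {
	seen := map[string]bool{}
	var out []string
	for _, st := range lastSet {
		sym, ok := st.next()
		if ok && !isNonTerminal(g, sym) && !seen[sym] {
			seen[sym] = true
			out = append(out, sym)
		}
	}
	return out
}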

Incremental parsing

The Earley parser can sometimes have poor runtime. To mitigate this, we do not re-parse the input string from the beginning every time. For example, if the previous input was sum(metric_name_one{ and the current input is sum(metric_name_one), the parser keeps the previous states that cover sum(metric_name_one and starts parsing from the next token. Even if the previous and new inputs differ starting at the very first token, the first state set can still be reused, because the first input position always precedes the input string.

This incremental parsing is quite practical, because the user keeps typing token by token, and later state sets can always build on top of earlier ones.
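
As a hedged sketch of how this could look against the exported API below, assuming (as the PartialParse signature and its doc suggest) that chartIndex is the position from which parsing resumes. The commonPrefix and reparse helpers are hypothetical, and the code is written as if inside the earley package, since earleyChart is unexported.

// commonPrefix counts the leading tokens shared by two token
// streams (hypothetical helper, not part of this package).
func commonPrefix(prev, cur Tokens) int {
	n := 0
	for n < len(prev) && n < len(cur) &&
		prev[n].Type == cur[n].Type && prev[n].Val == cur[n].Val {
		n++
	}
	return n
}

// reparse reuses state sets 0..k, which the edit cannot have
// affected, and resumes parsing at chart index k. State set 0
// always survives, because position 0 precedes the input entirely.
func reparse(p *Earley, prev, cur Tokens) *earleyChart {
	return p.PartialParse(cur, commonPrefix(prev, cur))
}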

Considered Alternatives

We considered a number of options for a parser:

Hand-rolled Parser Heuristic

Our initial prototype hand-rolled the parser completely, using a novel (albeit somewhat hacky) heuristic for determining grammar rules. The high-level idea was that we would lex the token stream and then move backward from the cursor position to infer grammatical context, and use that as a basis for suggesting auto-completions.

For instance, given a string:

sum(metric{label=

We would read the "=" sign first, then the token "label", followed by the "{" and then "metric". Our heuristic would allow us to match grammar rules by the first "=" sign, then filter this list to rules which permissibly accept tokens of the form 'label', then to rules which permissibly accept a left brace, and so on.

This heuristic was nice in that it ran in linear time, but it abstracted poorly to higher-level grammar rules (i.e. of the form 'A = A + A | A + B').

Using an LL(*) parser generator (ANTLR)

There are a couple downsides to using ANTLR:

  • It uses Java, which would mean we would need two languages (i.e. Go and Java) in order to generate our support for a third language (promql). I personally find this aesthetically unsettling.
  • While it supports LL-style production rules, it doesn't handle ambiguity well, which generally means it needs to choose a single parse path (though you can handle this during traversal).

Documentation

Index

Constants

const Cursor = "\u25EC"

const (
	PromQLTokenSeparators = " []{}()+-*/%^=!><~,"
)

Variables

var (
	// non-terminals
	Root = NewNonTerminal("root", true)

	Expression         = NewNonTerminal("expression", false)
	AggrExpression     = NewNonTerminal("aggr-expression", false)
	SubqueryExpression = NewNonTerminal("subquery-expression", false)
	UnaryExpression    = NewNonTerminal("unary-expression", false)
	// binary expressions
	ScalarBinaryExpression = NewNonTerminal("scalar-binary-expression", false)
	VectorBinaryExpression = NewNonTerminal("vector-binary-expression", false)
	// function expression
	VectorFuncExpression = NewNonTerminal("vector-function-expression", false)
	ScalarFuncExpression = NewNonTerminal("scalar-function-expression", false)
	// metrics expression
	MatrixSelector = NewNonTerminal("matrix-selector", false)
	VectorSelector = NewNonTerminal("vector-selector", false)

	// Expression types
	ScalarTypeExpression = NewNonTerminal("scalar-type-expression", false)
	VectorTypeExpression = NewNonTerminal("vector-type-expression", false)
	MatrixTypeExpression = NewNonTerminal("matrix-type-expression", false)

	LabelsExpression      = NewNonTerminal("labels-expression", false)
	LabelsMatchExpression = NewNonTerminal("labels-match-expression", false)
	LabelValueExpression  = NewNonTerminal("label-value-expression", false)
	AggrCallExpression    = NewNonTerminal("aggr-call-expression", false)
	MetricLabelArgs       = NewNonTerminal("label-args", false)

	OffsetModifier = NewNonTerminal("offset-modifier", false)

	FunctionCallBody = NewNonTerminal("function-call-body", false)
	FunctionCallArgs = NewNonTerminal("function-call-args", false)

	// Binary expressions related non-terminals:
	BinaryOperator      = NewNonTerminal("scalar-binary-operator", false)
	BinaryGroupModifier = NewNonTerminal("binary-group-modifier", false)

	// terminals
	Identifier               = NewTerminal(ID)                                  // this one is ambiguous
	MetricIdentifier         = NewTerminalWithSubType(ID, METRIC_ID)            // this one is ambiguous
	MetricLabelIdentifier    = NewTerminalWithSubType(ID, METRIC_LABEL_SUBTYPE) // this one is ambiguous
	ScalarFunctionIdentifier = NewTerminal(FUNCTION_SCALAR_ID)
	VectorFunctionIdentifier = NewTerminal(FUNCTION_VECTOR_ID)

	AggregatorOp     = NewTerminal(AGGR_OP)
	AggregateKeyword = NewTerminal(AGGR_KW)
	BoolKeyword      = NewTerminalWithSubType(KEYWORD, BOOL_KW)
	OffsetKeyword    = NewTerminalWithSubType(KEYWORD, OFFSET_KW)
	GroupKeyword     = NewTerminal(GROUP_KW)
	GroupSide        = NewTerminal(GROUP_SIDE)

	Operator           = NewTerminal(OPERATOR)
	Arithmetic         = NewTerminal(ARITHMETIC)
	UnaryOperator      = NewTerminalWithSubType(ARITHMETIC, UNARY_OP)
	SetOperator        = NewTerminal(SET)
	LabelMatchOperator = NewTerminalWithSubType(OPERATOR, LABELMATCH)
	Comparision        = NewTerminalWithSubType(OPERATOR, COMPARISION)

	LBrace   = NewTerminal(LEFT_BRACE)
	RBrace   = NewTerminal(RIGHT_BRACE)
	LBracket = NewTerminal(LEFT_BRACKET)
	RBracket = NewTerminal(RIGHT_BRACKET)
	Comma    = NewTerminal(COMMA)
	Colon    = NewTerminal(COLON)
	LParen   = NewTerminal(LEFT_PAREN)
	RParen   = NewTerminal(RIGHT_PAREN)
	Str      = NewTerminal(STRING)
	Num      = NewTerminal(NUM)
	Duration = NewTerminal(DURATION)
	Eof      = NewTerminal(EOF)

	PromQLParser = NewEarleyParser(*promQLGrammar)
)
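
As a hedged illustration of how these constructors compose: the rules below are invented for illustration and are not necessarily part of the package's actual PromQL grammar, and the sketch assumes the terminal and non-terminal nodes above satisfy Symbol, as their use in grammar rules implies.

// Hypothetical rules, for illustration only:
//
//	vector-selector -> metric-identifier
//	vector-selector -> metric-identifier labels-expression
var exampleGrammar = NewGrammar(
	NewRule(VectorSelector, MetricIdentifier),
	NewRule(VectorSelector, MetricIdentifier, LabelsExpression),
)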

Functions

func Copy

func Copy(c *completionContext) *completionContext

Copy deep-copies a completionContext and returns a pointer to the copy.

func NewPartialMatch

func NewPartialMatch(name, kind, detail string) autocomplete.Match

Types

type CompletionContext

type CompletionContext interface {
	HasMetric() bool
	GetMetric() string
	HasMetricLabel() bool
	GetMetricLabel() string
	GetUsedMetricLabelValues() sets.String
}

type ContextualToken

type ContextualToken struct {
	TokenType
	// contains filtered or unexported fields
}

type Earley

type Earley struct {
	// contains filtered or unexported fields
}

func NewEarleyParser

func NewEarleyParser(g Grammar) *Earley

func (*Earley) GetSuggestedTokenType

func (p *Earley) GetSuggestedTokenType(tokens Tokens) (types []ContextualToken)

func (*Earley) Parse

func (p *Earley) Parse(input string) *earleyChart

Parse parses the full input string. It first tokenizes the input string and then uses those tokens as atomic units in our grammar, which simplifies our parsing logic considerably.

func (*Earley) ParseTokens

func (p *Earley) ParseTokens(tokens Tokens) *earleyChart

ParseTokens parses the full input tokens from the beginning.

func (*Earley) PartialParse

func (p *Earley) PartialParse(tokens Tokens, chartIndex int) *earleyChart

This is the incremental bit of our parser. We can basically feed in a list of words (i.e. lexed tokens) and parse at a specific word index.
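
A hedged usage sketch, written as if inside the earley package (Parse returns the unexported *earleyChart). The tokenize helper stands in for whatever lexer produces Tokens and is hypothetical.

func example() {
	// Full parse of an in-progress query; the resulting chart holds
	// one state set per input position.
	chart := PromQLParser.Parse("sum(metric_name_one{")
	fmt.Println(chart.String())

	// Suggested token types at the cursor; completer.go maps these
	// to concrete suggestion strings.
	tokens := tokenize("sum(metric_name_one{") // hypothetical helper
	for _, ct := range PromQLParser.GetSuggestedTokenType(tokens) {
		fmt.Println(ct.TokenType)
	}
}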

type EarleyChart

type EarleyChart interface {
	States() []*StateSet
	GetState(insertionOrderZeroIndexed int) *StateSet
	String() string
}

Earley items are simple data structures meant to contain the following bits:

(1) a rule in the grammar which has so far been valid
(2) the position in the rule that denotes how much of that rule
    has been consumed by our parser
(3) a reference to the index in our symbol list that we've consumed
    (this probably isn't fully necessary; I could imagine that we could
     restructure the earley parser such that it takes an earleyStateSet
     representing the past state, and construct the next stateSet with an
     arbitrary symbol passed in).


type EarleyItem

type EarleyItem struct {
	Rule    *GrammarRule
	RulePos int // dot position
	// contains filtered or unexported fields
}

EarleyItem represents A SINGLE possible parse path. More abstractly, this represents a potential grammar rule which we can validly apply. It is the basic unit of a state set.

func (*EarleyItem) DoesTokenTypeMatch

func (item *EarleyItem) DoesTokenTypeMatch(tkn Tokhan) bool

DoesTokenTypeMatch checks whether the item's next expected symbol is a terminal and whether it matches the given token.

func (*EarleyItem) GetRightSymbolByIndex

func (item *EarleyItem) GetRightSymbolByIndex(i int) Symbol

func (*EarleyItem) GetRightSymbolByRulePos

func (item *EarleyItem) GetRightSymbolByRulePos() Symbol

GetRightSymbolByRulePos gets the next symbol after the dot.

func (*EarleyItem) GetRightSymbolTypeByRulePos

func (item *EarleyItem) GetRightSymbolTypeByRulePos() *TokenType

GetRightSymbolTypeByRulePos returns the token type of the next symbol after the dot, if that symbol is a terminal.

func (*EarleyItem) String

func (item *EarleyItem) String() string

I like this bit from gearley, so I am leaving it the way it was

type EarleyNode

type EarleyNode interface {
	String() string
	// contains filtered or unexported methods
}

func NewTerminal

func NewTerminal(name TokenType) EarleyNode

func NewTerminalWithSubType

func NewTerminalWithSubType(name TokenType, subtype TokenType) EarleyNode

type Grammar

type Grammar struct {
	// contains filtered or unexported fields
}

func NewGrammar

func NewGrammar(rules ...*GrammarRule) *Grammar

type GrammarRule

type GrammarRule struct {
	// contains filtered or unexported fields
}

func NewRule

func NewRule(t NonTerminalNode, symbols ...Symbol) *GrammarRule

func (*GrammarRule) String

func (r *GrammarRule) String() string

type ItemId

type ItemId struct {
	StateSetIndex int
	ItemIndex     int
}

type NonTerminalNode

type NonTerminalNode interface {
	EarleyNode
	GetName() string
	// contains filtered or unexported methods
}

func NewNonTerminal

func NewNonTerminal(name string, root bool) NonTerminalNode

type StateSet

type StateSet struct {
	// contains filtered or unexported fields
}

An Earley StateSet represents all possible rules at a given index in an Earley chart.

func NewStateSet

func NewStateSet() *StateSet

func (*StateSet) Add

func (s *StateSet) Add(item *EarleyItem) bool

Add is an idempotent put operation.

func (*StateSet) GetStates

func (s *StateSet) GetStates() (states []EarleyItem)

func (*StateSet) Length

func (s *StateSet) Length() int

func (*StateSet) String

func (s *StateSet) String() string

type StateType

type StateType string
const (
	PREDICT_STATE  StateType = "predict"
	SCAN_STATE     StateType = "scan"
	COMPLETE_STATE StateType = "complete"
)

type Symbol

type Symbol interface {
	String() string
	// contains filtered or unexported methods
}

Earley parsers parse symbols in a bottom-up manner.

type TokenType

type TokenType string
const (
	ID                   TokenType = "identifier"
	METRIC_ID            TokenType = "metric-identifier"
	METRIC_LABEL_SUBTYPE TokenType = "metric-label-identifier"
	FUNCTION_SCALAR_ID   TokenType = "function-scalar-identifier"
	FUNCTION_VECTOR_ID   TokenType = "function-vector-identifier"

	OPERATOR TokenType = "operator"
	//binary operators
	ARITHMETIC  TokenType = "arithmetic"
	COMPARISION TokenType = "comparision"
	SET         TokenType = "set"
	//label match operator
	LABELMATCH TokenType = "label-match"
	// unary operators
	UNARY_OP TokenType = "unary-op"

	AGGR_OP TokenType = "aggregator_operation"

	//keywords
	KEYWORD    TokenType = "keyword"
	AGGR_KW    TokenType = "aggregator_keyword"
	BOOL_KW    TokenType = "bool-keyword"
	OFFSET_KW  TokenType = "offset-keyword"
	GROUP_SIDE TokenType = "group-side"
	GROUP_KW   TokenType = "group-keyword"

	LEFT_BRACE    TokenType = "leftbrace"
	RIGHT_BRACE   TokenType = "rightbrace"
	LEFT_PAREN    TokenType = "leftparen"
	RIGHT_PAREN   TokenType = "rightparen"
	LEFT_BRACKET  TokenType = "leftbracket"
	RIGHT_BRACKET TokenType = "rightbracket"
	COMMA         TokenType = "comma"
	COLON         TokenType = "colon"
	STRING        TokenType = "string"
	NUM           TokenType = "number"
	DURATION      TokenType = "duration"
	EOF           TokenType = "EOF"
	UNKNOWN       TokenType = "unknown"
)

type Tokens

type Tokens []Tokhan

func (Tokens) Compare

func (ws Tokens) Compare(tks2 Tokens) int

func (Tokens) Last

func (ws Tokens) Last() Tokhan

func (Tokens) Print

func (ws Tokens) Print()

func (Tokens) PrintVals

func (ws Tokens) PrintVals()

func (Tokens) Types

func (ws Tokens) Types() []string

func (Tokens) Vals

func (ws Tokens) Vals() []string

type Tokhan

type Tokhan struct {
	StartPos int
	EndPos   int
	Type     TokenType
	ItemType parser.ItemType
	Val      string
	// contains filtered or unexported fields
}

Tokhan contains the essential bits of data we need for processing a single lexical unit.

func (Tokhan) String

func (t Tokhan) String() string

type TypedToken

type TypedToken interface {
	GetTokenType() TokenType
}
