earley

README

PromQL Code Completion

promq offers code completion in the interactive console. In order to do this, we've chosen to re-implement the promql parser.

This wasn't a trivial decision, but we made it quite deliberately: the original promql parser is written using yacc, which generates an LALR parser. LALR parsers are bottom-up parsers, which means that the syntax tree isn't actually assembled until the very end of the parse, which is a rather undesirable property to have when you want to offer suggestions on what may be valid syntax in the middle of a partially parsed string.

Significant Considerations

There were a few significant aspects that factored into our choice of parser:

  • Our choice should enable us to maintain parity with promql with relative ease, since we are writing a separate parser.
  • Grammar rules should be definable top-down, since this makes it possible to anticipate possible valid syntactical structures during the parse.
  • Since promq is a command-line tool, we expressly do not need to optimize for full-blown IDE level code-completion; our implicit expectation is that we will be interacting with smaller statements for the most part and that we would be re-parsing our tree incrementally.

Earley Chart Parser

Given the factors at play, we converged on using an Earley chart parser. The more obvious downsides to this decision were:

  • The run-time can be really poor (O(n^3)) if the grammar is ambiguous.
  • Choosing this (or any other parser besides the one provided in prometheus/prometheus) means that we are bifurcating promql for this tool.

However, there are some counter considerations at play:

  • Despite the poor runtime:
    • Even in the worst case, we are dealing with input fed in from a command line, which means there is some practical upper bound on text length.
    • In the case of our parser, n is not directly determined by string length, since we feed our Earley parser a stream of lexed tokens.
  • I'm not actually sure it is possible to avoid bifurcating promql for this kind of use case, since promql uses an LALR parser. LR parsers in general support a restricted set of production rules which are not ideal for code completion (specifically, such rules will typically match many more things than are actually valid), since LR parsers don't construct the syntax tree until the very end of the parse.

Earley chart parsers have a couple of nice properties which make them a particularly good candidate for this use case:

  1. Earley parsers support all context-free languages without restriction, which means they can support both LR and LL production rules.
  2. Earley parsers naturally build and preserve state incrementally, which makes it trivial to convert one into a more fully incremental parser (this is a desired optimization/property for our parser; i.e. batch parsing isn't really on the table as an optimization).
  3. Earley parsers are easy to implement, and this also enables a setup in which it is easy to maintain parity with promql in the main prometheus/prometheus codebase.

Earley Internals

Earley algorithm

There are some key terms in the Earley algorithm:

  1. Terminal and non-terminal symbols: In promql.go, we define a number of terminal and non-terminal symbols. These are the lexical elements used in specifying the grammar rules (production rules). Terminal symbols are the elementary symbols of the grammar; non-terminal symbols can be replaced by groups of terminal symbols according to the grammar rules.
  2. Grammar rule: In promql.go, we also define the grammar rules. The left-hand side of a rule can be replaced by its right-hand side. There is one root rule, which is the starting point for any input string.
  3. Input position: The parser parses the input from left to right, step by step. It starts from the position prior to the input, which is position 0. The last position is the position after the last token.
  4. State set: For every input position, the parser generates a state set. The state set at input position k is called S(k). A state set is composed of a list of states; EarleyItem in item.go represents a single state. Each state is a tuple (X -> a • b, i), as in the sketch after this list:
    • The dot marks how much of the rule's right-hand side has been consumed so far; here it is at position 1.
    • The state applies rule X -> ab.
    • i is the origin position where the matching of the production began.
  5. Chart: The chart records the state set at each input position. The length of the Earley chart is the number of input tokens + 1.
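
To make the tuple notation concrete, here is a minimal, self-contained sketch of such a state in Go. The names are illustrative only; the package's actual EarleyItem (documented below) differs in its details.

package main

import "fmt"

// rule is a production X -> symbols.
type rule struct {
	lhs string
	rhs []string
}

// state is the tuple (X -> a • b, i): a rule, a dot position marking
// how much of the right-hand side has been consumed, and the origin
// position where matching of this rule began.
type state struct {
	r      rule
	dot    int
	origin int
}

func (s state) String() string {
	out := s.r.lhs + " ->"
	for i, sym := range s.r.rhs {
		if i == s.dot {
			out += " •"
		}
		out += " " + sym
	}
	if s.dot == len(s.r.rhs) {
		out += " •"
	}
	return fmt.Sprintf("(%s, %d)", out, s.origin)
}

func main() {
	// Prints "(X -> a • b, 0)": one symbol consumed, predicted at 0.
	fmt.Println(state{rule{"X", []string{"a", "b"}}, 1, 0})
}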

The Earley parser starts at position 0 with the root grammar rule and then repeatedly executes three operations: prediction, scanning, and completion:

  • Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).
  • Scanning: If a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k+1).
  • Completion: For every state in S(k) of the form (Y → γ •, j), find all states in S(j) of the form (X → α • Y β, i) and add (X → α Y • β, i) to S(k).

Only new items are added to a state set: two states with the same rule, dot position, and origin position are treated as duplicates.
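
The following self-contained sketch shows the three operations plus the dedup rule working over a toy grammar. It is an illustrative Go rendering of the textbook algorithm, not this package's implementation (in particular, nullable rules need extra care that is omitted here).

package main

import "fmt"

type rule struct {
	lhs string
	rhs []string
}

type state struct {
	r      rule
	dot    int
	origin int
}

// next returns the symbol after the dot, if any.
func (s state) next() (string, bool) {
	if s.dot < len(s.r.rhs) {
		return s.r.rhs[s.dot], true
	}
	return "", false
}

// add puts st into state set k unless an identical (rule, dot,
// origin) state is already there: the dedup rule described above.
func add(chart [][]state, k int, st state) {
	for _, s := range chart[k] {
		if s.r.lhs == st.r.lhs && fmt.Sprint(s.r.rhs) == fmt.Sprint(st.r.rhs) &&
			s.dot == st.dot && s.origin == st.origin {
			return
		}
	}
	chart[k] = append(chart[k], st)
}

func isNonTerminal(g []rule, sym string) bool {
	for _, r := range g {
		if r.lhs == sym {
			return true
		}
	}
	return false
}

func recognize(g []rule, root string, tokens []string) bool {
	chart := make([][]state, len(tokens)+1)
	for _, r := range g {
		if r.lhs == root {
			add(chart, 0, state{r, 0, 0}) // seed S(0) with the root rule
		}
	}
	for k := 0; k <= len(tokens); k++ {
		// S(k) grows while we process it, so iterate by index.
		for i := 0; i < len(chart[k]); i++ {
			st := chart[k][i]
			sym, ok := st.next()
			switch {
			case !ok: // Completion: (Y -> γ •, j)
				for _, parent := range chart[st.origin] {
					if psym, pok := parent.next(); pok && psym == st.r.lhs {
						add(chart, k, state{parent.r, parent.dot + 1, parent.origin})
					}
				}
			case isNonTerminal(g, sym): // Prediction
				for _, r := range g {
					if r.lhs == sym {
						add(chart, k, state{r, 0, k})
					}
				}
			case k < len(tokens) && tokens[k] == sym: // Scanning
				add(chart, k+1, state{st.r, st.dot + 1, st.origin})
			}
		}
	}
	// Accept if a finished root rule spans the whole input.
	for _, st := range chart[len(tokens)] {
		if _, ok := st.next(); !ok && st.r.lhs == root && st.origin == 0 {
			return true
		}
	}
	return false
}

func main() {
	g := []rule{
		{"S", []string{"S", "+", "n"}},
		{"S", []string{"n"}},
	}
	fmt.Println(recognize(g, "S", []string{"n", "+", "n"})) // true
}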

We use the final state set (the one where the dot positions reflect the end of the input) to generate completion suggestions. For each unfinished state in that set (one with symbols remaining after the dot), if the symbol after the dot is a terminal, it is a potential suggestion. The suggestion generated from the Earley algorithm is a suggested token type; a mapping in completer.go matches each type to the real suggested strings.
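
Continuing the recognizer sketch above (same rule and state types), suggestion generation amounts to collecting the terminals that sit immediately after a dot in the final state set; the real package then maps the resulting token types to concrete strings via completer.go.

// suggest collects the terminal symbols that appear immediately
// after a dot in the final state set. Each one is a token type that
// could validly come next. Illustrative only.
func suggest(g []rule, lastSet []state) []string {
	seen := map[string]bool{}
	var out []string
	for _, st := range lastSet {
		sym, ok := st.next()
		if ok && !isNonTerminal(g, sym) && !seen[sym] {
			seen[sym] = true
			out = append(out, sym)
		}
	}
	return out
}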

Incremental parsing

The Earley parser can sometimes have poor runtime. To mitigate this, we do not re-parse the input string from the beginning every time. For example, if the previous input was sum(metric_name_one{ and the current input is sum(metric_name_one), the parser keeps the previous states that cover sum(metric_name_one and starts parsing from the next token. Even if the previous and new inputs differ starting at the very first token, the first state set can still be reused, because the first input position always precedes the input string.

This incremental parsing is quite practical, because the user keeps typing token by token, and later state sets can always build on top of earlier ones.
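
As a hedged sketch of how this could look against the exported API below, assuming (as the PartialParse signature and its doc suggest) that chartIndex is the position from which parsing resumes. The commonPrefix and reparse helpers are hypothetical, and the code is written as if inside the earley package, since earleyChart is unexported.

// commonPrefix counts the leading tokens shared by two token
// streams (hypothetical helper, not part of this package).
func commonPrefix(prev, cur Tokens) int {
	n := 0
	for n < len(prev) && n < len(cur) &&
		prev[n].Type == cur[n].Type && prev[n].Val == cur[n].Val {
		n++
	}
	return n
}

// reparse reuses state sets 0..k, which the edit cannot have
// affected, and resumes parsing at chart index k. State set 0
// always survives, because position 0 precedes the input entirely.
func reparse(p *Earley, prev, cur Tokens) *earleyChart {
	return p.PartialParse(cur, commonPrefix(prev, cur))
}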

Considered Alternatives

We considered a number of options for a parser:

Hand-rolled Parser Heuristic

Our initial prototype hand-rolled the parser completely, using a novel (albeit somewhat hacky) heuristic for determining grammar rules. The high-level idea was that we would lex the token stream and then move backward from the cursor position to infer grammatical context, and use that as a basis for suggesting auto-completions.

For instance, given a string:

sum(metric{label=

We would read the "=" sign first, then the token "label", followed by the "{" and then "metric". Our heuristic would allow us to match grammar rules by the first "=" sign, then filter this list to rules which permissibly accept tokens of the form 'label', then to rules which permissibly accept a left brace, and so on.

This heuristic was nice in that it ran in linear time, but it abstracted poorly to higher-level grammar rules (i.e. of the form 'A = A + A | A + B').

Using an LL(*) parser generator (ANTLR)

There are a couple downsides to using ANTLR:

  • It uses Java, which would mean we would need two languages (i.e. Go and Java) in order to generate our support for a third language (promql). I personally find this aesthetically unsettling.
  • While it supports LL-style production rules, it doesn't handle ambiguity well, which generally means it needs to choose a single parse path (though you can handle this during traversal).

Documentation

Index

Constants

const Cursor = "\u25EC"

const (
	PromQLTokenSeparators = " []{}()+-*/%^=!><~,"
)

Variables

var (
	// non-terminals
	Root = NewNonTerminal("root", true)

	Expression         = NewNonTerminal("expression", false)
	AggrExpression     = NewNonTerminal("aggr-expression", false)
	SubqueryExpression = NewNonTerminal("subquery-expression", false)
	UnaryExpression    = NewNonTerminal("unary-expression", false)
	// binary expressions
	ScalarBinaryExpression = NewNonTerminal("scalar-binary-expression", false)
	VectorBinaryExpression = NewNonTerminal("vector-binary-expression", false)
	// function expression
	VectorFuncExpression = NewNonTerminal("vector-function-expression", false)
	ScalarFuncExpression = NewNonTerminal("scalar-function-expression", false)
	// metrics expression
	MatrixSelector = NewNonTerminal("matrix-selector", false)
	VectorSelector = NewNonTerminal("vector-selector", false)

	// Expression types
	ScalarTypeExpression = NewNonTerminal("scalar-type-expression", false)
	VectorTypeExpression = NewNonTerminal("vector-type-expression", false)
	MatrixTypeExpression = NewNonTerminal("matrix-type-expression", false)

	LabelsExpression      = NewNonTerminal("labels-expression", false)
	LabelsMatchExpression = NewNonTerminal("labels-match-expression", false)
	LabelValueExpression  = NewNonTerminal("label-value-expression", false)
	AggrCallExpression    = NewNonTerminal("aggr-call-expression", false)
	MetricLabelArgs       = NewNonTerminal("label-args", false)

	OffsetModifier = NewNonTerminal("offset-modifier", false)

	FunctionCallBody = NewNonTerminal("function-call-body", false)
	FunctionCallArgs = NewNonTerminal("function-call-args", false)

	// Binary expressions related non-terminals:
	BinaryOperator      = NewNonTerminal("scalar-binary-operator", false)
	BinaryGroupModifier = NewNonTerminal("binary-group-modifier", false)

	// terminals
	Identifier               = NewTerminal(ID)                                  // this one is ambiguous
	MetricIdentifier         = NewTerminalWithSubType(ID, METRIC_ID)            // this one is ambiguous
	MetricLabelIdentifier    = NewTerminalWithSubType(ID, METRIC_LABEL_SUBTYPE) // this one is ambiguous
	ScalarFunctionIdentifier = NewTerminal(FUNCTION_SCALAR_ID)
	VectorFunctionIdentifier = NewTerminal(FUNCTION_VECTOR_ID)

	AggregatorOp     = NewTerminal(AGGR_OP)
	AggregateKeyword = NewTerminal(AGGR_KW)
	BoolKeyword      = NewTerminalWithSubType(KEYWORD, BOOL_KW)
	OffsetKeyword    = NewTerminalWithSubType(KEYWORD, OFFSET_KW)
	GroupKeyword     = NewTerminal(GROUP_KW)
	GroupSide        = NewTerminal(GROUP_SIDE)

	Operator           = NewTerminal(OPERATOR)
	Arithmetic         = NewTerminal(ARITHMETIC)
	UnaryOperator      = NewTerminalWithSubType(ARITHMETIC, UNARY_OP)
	SetOperator        = NewTerminal(SET)
	LabelMatchOperator = NewTerminalWithSubType(OPERATOR, LABELMATCH)
	Comparision        = NewTerminalWithSubType(OPERATOR, COMPARISION)

	LBrace   = NewTerminal(LEFT_BRACE)
	RBrace   = NewTerminal(RIGHT_BRACE)
	LBracket = NewTerminal(LEFT_BRACKET)
	RBracket = NewTerminal(RIGHT_BRACKET)
	Comma    = NewTerminal(COMMA)
	Colon    = NewTerminal(COLON)
	LParen   = NewTerminal(LEFT_PAREN)
	RParen   = NewTerminal(RIGHT_PAREN)
	Str      = NewTerminal(STRING)
	Num      = NewTerminal(NUM)
	Duration = NewTerminal(DURATION)
	Eof      = NewTerminal(EOF)

	PromQLParser = NewEarleyParser(*promQLGrammar)
)
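
As a hedged illustration of how these constructors compose: the rules below are invented for illustration and are not necessarily part of the package's actual PromQL grammar, and the sketch assumes the terminal and non-terminal nodes above satisfy Symbol, as their use in grammar rules implies.

// Hypothetical rules, for illustration only:
//
//	vector-selector -> metric-identifier
//	vector-selector -> metric-identifier labels-expression
var exampleGrammar = NewGrammar(
	NewRule(VectorSelector, MetricIdentifier),
	NewRule(VectorSelector, MetricIdentifier, LabelsExpression),
)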

Functions

func Copy

func Copy(c *completionContext) *completionContext

Copy deep-copies a completionContext and returns a pointer to the copy.

func NewPartialMatch

func NewPartialMatch(name, kind, detail string) autocomplete.Match

Types

type CompletionContext

type CompletionContext interface {
	HasMetric() bool
	GetMetric() string
	HasMetricLabel() bool
	GetMetricLabel() string
	GetUsedMetricLabelValues() sets.String
}

type ContextualToken

type ContextualToken struct {
	TokenType
	// contains filtered or unexported fields
}

type Earley

type Earley struct {
	// contains filtered or unexported fields
}

func NewEarleyParser

func NewEarleyParser(g Grammar) *Earley

func (*Earley) GetSuggestedTokenType

func (p *Earley) GetSuggestedTokenType(tokens Tokens) (types []ContextualToken)

func (*Earley) Parse

func (p *Earley) Parse(input string) *earleyChart

Parse parses the full input string. It first tokenizes the input string and then uses those tokens as atomic units in our grammar, which simplifies our parsing logic considerably.

func (*Earley) ParseTokens

func (p *Earley) ParseTokens(tokens Tokens) *earleyChart

ParseTokens parses the full input tokens from the beginning.

func (*Earley) PartialParse

func (p *Earley) PartialParse(tokens Tokens, chartIndex int) *earleyChart

This is the incremental bit of our parser. We can basically feed in a list of words (i.e. lexed tokens) and parse at a specific word index.
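
A hedged usage sketch, written as if inside the earley package (Parse returns the unexported *earleyChart). The tokenize helper stands in for whatever lexer produces Tokens and is hypothetical.

func example() {
	// Full parse of an in-progress query; the resulting chart holds
	// one state set per input position.
	chart := PromQLParser.Parse("sum(metric_name_one{")
	fmt.Println(chart.String())

	// Suggested token types at the cursor; completer.go maps these
	// to concrete suggestion strings.
	tokens := tokenize("sum(metric_name_one{") // hypothetical helper
	for _, ct := range PromQLParser.GetSuggestedTokenType(tokens) {
		fmt.Println(ct.TokenType)
	}
}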

type EarleyChart

type EarleyChart interface {
	States() []*StateSet
	GetState(insertionOrderZeroIndexed int) *StateSet
	String() string
}

Earley items are simple data structures meant to contain the following bits:

(1) a rule in the grammar which has so far been valid
(2) the position in the rule that denotes how much of that rule
    has been consumed by our parser
(3) a reference to the index in our symbol list that we've consumed
    (this probably isn't fully necessary; I could imagine that we could
     restructure the earley parser such that it takes an earleyStateSet
     representing the past state, and construct the next stateSet with an
     arbitrary symbol passed in).


type EarleyItem

type EarleyItem struct {
	Rule    *GrammarRule
	RulePos int // dot position
	// contains filtered or unexported fields
}

EarleyItem represents A SINGLE possible parse path. More abstractly, this represents a potential grammar rule which we can validly apply. It is the basic unit of a state set.

func (*EarleyItem) DoesTokenTypeMatch

func (item *EarleyItem) DoesTokenTypeMatch(tkn Tokhan) bool

DoesTokenTypeMatch checks whether the item's next expected symbol is a terminal and whether it matches the given token.

func (*EarleyItem) GetRightSymbolByIndex

func (item *EarleyItem) GetRightSymbolByIndex(i int) Symbol

func (*EarleyItem) GetRightSymbolByRulePos

func (item *EarleyItem) GetRightSymbolByRulePos() Symbol

GetRightSymbolByRulePos gets the next symbol after the dot.

func (*EarleyItem) GetRightSymbolTypeByRulePos

func (item *EarleyItem) GetRightSymbolTypeByRulePos() *TokenType

GetRightSymbolTypeByRulePos returns the token type of the next symbol after the dot, if that symbol is a terminal.

func (*EarleyItem) String

func (item *EarleyItem) String() string

I like this bit from gearley, so I am leaving it the way it was

type EarleyNode

type EarleyNode interface {
	String() string
	// contains filtered or unexported methods
}

func NewTerminal

func NewTerminal(name TokenType) EarleyNode

func NewTerminalWithSubType

func NewTerminalWithSubType(name TokenType, subtype TokenType) EarleyNode

type Grammar

type Grammar struct {
	// contains filtered or unexported fields
}

func NewGrammar

func NewGrammar(rules ...*GrammarRule) *Grammar

type GrammarRule

type GrammarRule struct {
	// contains filtered or unexported fields
}

func NewRule

func NewRule(t NonTerminalNode, symbols ...Symbol) *GrammarRule

func (*GrammarRule) String

func (r *GrammarRule) String() string

type ItemId

type ItemId struct {
	StateSetIndex int
	ItemIndex     int
}

type NonTerminalNode

type NonTerminalNode interface {
	EarleyNode
	GetName() string
	// contains filtered or unexported methods
}

func NewNonTerminal

func NewNonTerminal(name string, root bool) NonTerminalNode

type StateSet

type StateSet struct {
	// contains filtered or unexported fields
}

An Earley StateSet represents all possible rules at a given index in an Earley chart.

func NewStateSet

func NewStateSet() *StateSet

func (*StateSet) Add

func (s *StateSet) Add(item *EarleyItem) bool

Add is an idempotent put operation.

func (*StateSet) GetStates

func (s *StateSet) GetStates() (states []EarleyItem)

func (*StateSet) Length

func (s *StateSet) Length() int

func (*StateSet) String

func (s *StateSet) String() string

type StateType

type StateType string
const (
	PREDICT_STATE  StateType = "predict"
	SCAN_STATE     StateType = "scan"
	COMPLETE_STATE StateType = "complete"
)

type Symbol

type Symbol interface {
	String() string
	// contains filtered or unexported methods
}

Earley parsers parse symbols in a bottom-up manner.

type TokenType

type TokenType string
const (
	ID                   TokenType = "identifier"
	METRIC_ID            TokenType = "metric-identifier"
	METRIC_LABEL_SUBTYPE TokenType = "metric-label-identifier"
	FUNCTION_SCALAR_ID   TokenType = "function-scalar-identifier"
	FUNCTION_VECTOR_ID   TokenType = "function-vector-identifier"

	OPERATOR TokenType = "operator"
	//binary operators
	ARITHMETIC  TokenType = "arithmetic"
	COMPARISION TokenType = "comparision"
	SET         TokenType = "set"
	//label match operator
	LABELMATCH TokenType = "label-match"
	// unary operators
	UNARY_OP TokenType = "unary-op"

	AGGR_OP TokenType = "aggregator_operation"

	//keywords
	KEYWORD    TokenType = "keyword"
	AGGR_KW    TokenType = "aggregator_keyword"
	BOOL_KW    TokenType = "bool-keyword"
	OFFSET_KW  TokenType = "offset-keyword"
	GROUP_SIDE TokenType = "group-side"
	GROUP_KW   TokenType = "group-keyword"

	LEFT_BRACE    TokenType = "leftbrace"
	RIGHT_BRACE   TokenType = "rightbrace"
	LEFT_PAREN    TokenType = "leftparen"
	RIGHT_PAREN   TokenType = "rightparen"
	LEFT_BRACKET  TokenType = "leftbracket"
	RIGHT_BRACKET TokenType = "rightbracket"
	COMMA         TokenType = "comma"
	COLON         TokenType = "colon"
	STRING        TokenType = "string"
	NUM           TokenType = "number"
	DURATION      TokenType = "duration"
	EOF           TokenType = "EOF"
	UNKNOWN       TokenType = "unknown"
)

type Tokens

type Tokens []Tokhan

func (Tokens) Compare

func (ws Tokens) Compare(tks2 Tokens) int

func (Tokens) Last

func (ws Tokens) Last() Tokhan

func (Tokens) Print

func (ws Tokens) Print()

func (Tokens) PrintVals

func (ws Tokens) PrintVals()

func (Tokens) Types

func (ws Tokens) Types() []string

func (Tokens) Vals

func (ws Tokens) Vals() []string

type Tokhan

type Tokhan struct {
	StartPos int
	EndPos   int
	Type     TokenType
	ItemType parser.ItemType
	Val      string
	// contains filtered or unexported fields
}

Tokhan contains the essential bits of data we need for processing a single lexical unit.

func (Tokhan) String

func (t Tokhan) String() string

type TypedToken

type TypedToken interface {
	GetTokenType() TokenType
}
