participle

package module
v2.0.0-alpha1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 26, 2020 License: MIT Imports: 11 Imported by: 305

README

A dead simple parser package for Go

Godoc CircleCI Go Report Card Slack chat

Introduction

The goal of this package is to provide a simple, idiomatic and elegant way of defining parsers in Go.

Participle's method of defining grammars should be familiar to any Go programmer who has used the encoding/json package: struct field tags define what and how input is mapped to those same fields. This is not unusual for Go encoders, but is unusual for a parser.

Changes

See the Change Log for details.

Tutorial

A tutorial is available, walking through the creation of an .ini parser.

Overview

A grammar is an annotated Go structure used to both define the parser grammar, and be the AST output by the parser. As an example, following is the final INI parser from the tutorial.

type INI struct {
  Properties []*Property `{ @@ }`
  Sections   []*Section  `{ @@ }`
}

type Section struct {
  Identifier string      `"[" @Ident "]"`
  Properties []*Property `{ @@ }`
}

type Property struct {
  Key   string `@Ident "="`
  Value *Value `@@`
}

type Value struct {
  String *string  `  @String`
  Number *float64 `| @Float`
}

Note: Participle also supports named struct tags (eg. Hello string `parser:"@Ident"`).

A parser is constructed from a grammar and a lexer:

parser, err := participle.Build(&INI{})

Once constructed, the parser is applied to input to produce an AST:

ast := &INI{}
err := parser.ParseString("", "size = 10", ast)
// ast == &INI{
//   Properties: []*Property{
//     {Key: "size", Value: &Value{Number: &10}},
//   },
// }

Annotation syntax

  • @<expr> Capture expression into the field.
  • @@ Recursively capture using the fields own type.
  • <identifier> Match named lexer token.
  • ( ... ) Group.
  • "..." Match the literal (note that the lexer must emit tokens matching this literal exactly).
  • "...":<identifier> Match the literal, specifying the exact lexer token type to match.
  • <expr> <expr> ... Match expressions.
  • <expr> | <expr> Match one of the alternatives.
  • !<expr> Match any token that is not the start of the expression (eg: @!";" matches anything but the ; character into the field).

The following modifiers can be used after any expression:

  • * Expression can match zero or more times.
  • + Expression must match one or more times.
  • ? Expression can match zero or once.
  • ! Require a non-empty match (this is useful with a sequence of optional matches eg. ("a"? "b"? "c"?)!).

Supported but deprecated:

  • { ... } Match 0 or more times (DEPRECATED - prefer ( ... )*).
  • [ ... ] Optional (DEPRECATED - prefer ( ... )?).

Notes:

  • Each struct is a single production, with each field applied in sequence.
  • @<expr> is the mechanism for capturing matches into the field.
  • if a struct field is not keyed with "parser", the entire struct tag will be used as the grammar fragment. This allows the grammar syntax to remain clear and simple to maintain.

Capturing

Prefixing any expression in the grammar with @ will capture matching values for that expression into the corresponding field.

For example:

// The grammar definition.
type Grammar struct {
  Hello string `@Ident`
}

// The source text to parse.
source := "world"

// After parsing, the resulting AST.
result == &Grammar{
  Hello: "world",
}

For slice and string fields, each instance of @ will accumulate into the field (including repeated patterns). Accumulation into other types is not supported.

A successful capture match into a boolean field will set the field to true.

For integer and floating point types, a successful capture will be parsed with strconv.ParseInt() and strconv.ParseBool() respectively.

Tokens can also be captured directly into fields of type lexer.Token and []lexer.Token.

Custom control of how values are captured into fields can be achieved by a field type implementing the Capture interface (Capture(values []string) error).

Additionally, any field implementing the encoding.TextUnmarshaler interface will be capturable too. One caveat is that UnmarshalText() will be called once for each captured token, so eg. @(Ident Ident Ident) will be called three times.

Streaming

Participle supports streaming parsing. Simply pass a channel of your grammar into Parse*(). The grammar will be repeatedly parsed and sent to the channel. Note that the Parse*() call will not return until parsing completes, so it should generally be started in a goroutine.

type token struct {
  Str string `  @Ident`
  Num int    `| @Int`
}

parser, err := participle.Build(&token{})

tokens := make(chan *token, 128)
err := parser.ParseString("", `hello 10 11 12 world`, tokens)
for token := range tokens {
  fmt.Printf("%#v\n", token)
}

Lexing

Participle relies on distinct lexing and parsing phases. The lexer takes raw bytes and produces tokens which the parser consumes. The parser transforms these tokens into Go values.

The default lexer is based on the Go text/scanner package and thus produces tokens for Go-like source code. This is surprisingly useful, but if you do require more control over lexing the builtin participle/lexer/stateful lexer should cover most other cases. If that in turn is not flexible enough, you can implement your own lexer.

Configure your parser with a lexer via participle.Lexer().

To use your own Lexer you will need to implement two interfaces: Definition and Lexer.

Experimental - code generation

Participle v1 now has experimental support for generating code to perform lexing. Use participle/experimental/codegen.GenerateLexer() to compile a stateful lexer to Go code.

This will generally provide around a 10x improvement in lexing performance while producing O(1) garbage.

Options

The Parser's behaviour can be configured via Options.

Examples

There are several examples included:

Example Description
BASIC A lexer, parser and interpreter for a rudimentary dialect of BASIC.
EBNF Parser for the form of EBNF used by Go.
Expr A basic mathematical expression parser and evaluator.
GraphQL Lexer+parser for GraphQL schemas
HCL A parser for the HashiCorp Configuration Language.
INI An INI file parser.
Protobuf A full Protobuf version 2 and 3 parser.
SQL A very rudimentary SQL SELECT parser.
Thrift A full Thrift parser.
TOML A TOML parser.

Included below is a full GraphQL lexer and parser:

package main

import (
	"fmt"
	"os"

	"github.com/alecthomas/kong"
	"github.com/alecthomas/repr"

	"github.com/alecthomas/participle/v2"
	"github.com/alecthomas/participle/v2/lexer"
	"github.com/alecthomas/participle/v2/lexer/stateful"
)

type File struct {
	Entries []*Entry `@@*`
}

type Entry struct {
	Type   *Type   `  @@`
	Schema *Schema `| @@`
	Enum   *Enum   `| @@`
	Scalar string  `| "scalar" @Ident`
}

type Enum struct {
	Name  string   `"enum" @Ident`
	Cases []string `"{" { @Ident } "}"`
}

type Schema struct {
	Fields []*Field `"schema" "{" { @@ } "}"`
}

type Type struct {
	Name       string   `"type" @Ident`
	Implements string   `[ "implements" @Ident ]`
	Fields     []*Field `"{" { @@ } "}"`
}

type Field struct {
	Name       string      `@Ident`
	Arguments  []*Argument `[ "(" [ @@ { "," @@ } ] ")" ]`
	Type       *TypeRef    `":" @@`
	Annotation string      `[ "@" @Ident ]`
}

type Argument struct {
	Name    string   `@Ident`
	Type    *TypeRef `":" @@`
	Default *Value   `[ "=" @@ ]`
}

type TypeRef struct {
	Array       *TypeRef `(   "[" @@ "]"`
	Type        string   `  | @Ident )`
	NonNullable bool     `[ @"!" ]`
}

type Value struct {
	Symbol string `@Ident`
}

var (
	graphQLLexer = lexer.Must(stateful.NewSimple([]stateful.Rule{
		{"Comment", `(?:#|//)[^\n]*\n?`, nil},
		{"Ident", `[a-zA-Z]\w*`, nil},
		{"Number", `(?:\d*\.)?\d+`, nil},
		{"Punct", `[-[!@#$%^&*()+_={}\|:;"'<,>.?/]|]`, nil},
		{"Whitespace", `[ \t\n\r]+`, nil},
	}))
	parser = participle.MustBuild(&File{},
		participle.Lexer(graphQLLexer),
		participle.Elide("Comment", "Whitespace"),
		participle.UseLookahead(2),
	)
)

var cli struct {
	EBNF  bool     `help"Dump EBNF."`
	Files []string `arg:"" optional:"" type:"existingfile" help:"GraphQL schema files to parse."`
}

func main() {
	ctx := kong.Parse(&cli)
	if cli.EBNF {
		fmt.Println(parser.String())
		ctx.Exit(0)
	}
	for _, file := range cli.Files {
		ast := &File{}
		r, err := os.Open(file)
		ctx.FatalIfErrorf(err)
		err = parser.Parse(file, r, ast)
		r.Close()
		repr.Println(ast)
		ctx.FatalIfErrorf(err)
	}
}

Performance

One of the included examples is a complete Thrift parser (shell-style comments are not supported). This gives a convenient baseline for comparing to the PEG based pigeon, which is the parser used by go-thrift. Additionally, the pigeon parser is utilising a generated parser, while the participle parser is built at run time.

You can run the benchmarks yourself, but here's the output on my machine:

BenchmarkParticipleThrift-12    	   5941	   201242 ns/op	 178088 B/op	   2390 allocs/op
BenchmarkGoThriftParser-12      	   3196	   379226 ns/op	 157560 B/op	   2644 allocs/op

On a real life codebase of 47K lines of Thrift, Participle takes 200ms and go- thrift takes 630ms, which aligns quite closely with the benchmarks.

Concurrency

A compiled Parser instance can be used concurrently. A LexerDefinition can be used concurrently. A Lexer instance cannot be used concurrently.

Error reporting

There are a few areas where Participle can provide useful feedback to users of your parser.

  1. Errors returned by Parser.Parse*() will be of type Error. This will contain positional information where available.
  2. Participle will make a best effort to return as much of the AST up to the error location as possible.
  3. Any node in the AST containing a field Pos lexer.Position will be automatically populated from the nearest matching token.
  4. Any node in the AST containing a field EndPos lexer.Position will be automatically populated from the token at the end of the node.
  5. Any node in the AST containing a field Tokens []lexer.Token will be automatically populated with all tokens captured by the node, including elided tokens.

These related pieces of information can be combined to provide fairly comprehensive error reporting.

Limitations

Internally, Participle is a recursive descent parser with backtracking (see UseLookahead(K)).

Among other things, this means that they do not support left recursion. Left recursion must be eliminated by restructuring your grammar.

EBNF

Participle supports outputting an EBNF grammar from a Participle parser. Once the parser is constructed simply call String().

eg. The GraphQL example gives in the following EBNF:

File = Entry* .
Entry = Type | Schema | Enum | "scalar" ident .
Type = "type" ident ("implements" ident)? "{" Field* "}" .
Field = ident ("(" (Argument ("," Argument)*)? ")")? ":" TypeRef ("@" ident)? .
Argument = ident ":" TypeRef ("=" Value)? .
TypeRef = "[" TypeRef "]" | ident "!"? .
Value = ident .
Schema = "schema" "{" Field* "}" .
Enum = "enum" ident "{" ident* "}" .

Documentation

Overview

Package participle constructs parsers from definitions in struct tags and parses directly into those structs. The approach is philosophically similar to how other marshallers work in Go, "unmarshalling" an instance of a grammar into a struct.

The supported annotation syntax is:

  • `@<expr>` Capture expression into the field.
  • `@@` Recursively capture using the fields own type.
  • `<identifier>` Match named lexer token.
  • `( ... )` Group.
  • `"..."` Match the literal (note that the lexer must emit tokens matching this literal exactly).
  • `"...":<identifier>` Match the literal, specifying the exact lexer token type to match.
  • `<expr> <expr> ...` Match expressions.
  • `<expr> | <expr>` Match one of the alternatives.

The following modifiers can be used after any expression:

  • `*` Expression can match zero or more times.
  • `+` Expression must match one or more times.
  • `?` Expression can match zero or once.
  • `!` Require a non-empty match (this is useful with a sequence of optional matches eg. `("a"? "b"? "c"?)!`).

Supported but deprecated:

  • `{ ... }` Match 0 or more times (**DEPRECATED** - prefer `( ... )*`).
  • `[ ... ]` Optional (**DEPRECATED** - prefer `( ... )?`).

Here's an example of an EBNF grammar.

type Group struct {
    Expression *Expression `"(" @@ ")"`
}

type Option struct {
    Expression *Expression `"[" @@ "]"`
}

type Repetition struct {
    Expression *Expression `"{" @@ "}"`
}

type Literal struct {
    Start string `@String` // lexer.Lexer token "String"
    End   string `("…" @String)?`
}

type Term struct {
    Name       string      `  @Ident`
    Literal    *Literal    `| @@`
    Group      *Group      `| @@`
    Option     *Option     `| @@`
    Repetition *Repetition `| @@`
}

type Sequence struct {
    Terms []*Term `@@+`
}

type Expression struct {
    Alternatives []*Sequence `@@ ("|" @@)*`
}

type Expressions []*Expression

type Production struct {
    Name        string      `@Ident "="`
    Expressions Expressions `@@+ "."`
}

type EBNF struct {
    Productions []*Production `@@*`
}

Index

Constants

This section is empty.

Variables

View Source
var (
	// MaxIterations limits the number of elements capturable by {}.
	MaxIterations = 1000000

	// NextMatch should be returned by Parseable.Parse() method implementations to indicate
	// that the node did not match and that other matches should be attempted, if appropriate.
	NextMatch = errors.New("no match") // nolint: golint
)

Functions

func FormatError

func FormatError(err Error) string

FormatError formats an error in the form "[<filename>:][<line>:<pos>:] <message>"

Types

type Capture

type Capture interface {
	Capture(values []string) error
}

Capture can be implemented by fields in order to transform captured tokens into field values.

type Error

type Error interface {
	error
	// Unadorned message.
	Message() string
	// Closest position to error location.
	Position() lexer.Position
}

Error represents an error while parsing.

The format of an Error is in the form "[<filename>:][<line>:<pos>:] <message>".

The error will contain positional information if available.

func Errorf

func Errorf(pos lexer.Position, format string, args ...interface{}) Error

Errorf creats a new Error at the given position.

func Wrapf

func Wrapf(pos lexer.Position, err error, format string, args ...interface{}) Error

Wrapf attempts to wrap an existing Error in a new message.

If "err" is a participle.Error, its positional information will be uesd.

type Mapper

type Mapper func(token lexer.Token) (lexer.Token, error)

Mapper function for mutating tokens before being applied to the AST.

type Option

type Option func(p *Parser) error

An Option to modify the behaviour of the Parser.

func CaseInsensitive

func CaseInsensitive(tokens ...string) Option

CaseInsensitive allows the specified token types to be matched case-insensitively.

func Elide

func Elide(types ...string) Option

Elide drops tokens of the specified types.

func Lexer

func Lexer(def lexer.Definition) Option

Lexer is an Option that sets the lexer to use with the given grammar.

func Map

func Map(mapper Mapper, symbols ...string) Option

Map is an Option that configures the Parser to apply a mapping function to each Token from the lexer.

This can be useful to eg. upper-case all tokens of a certain type, or dequote strings.

"symbols" specifies the token symbols that the Mapper will be applied to. If empty, all tokens will be mapped.

func Unquote

func Unquote(types ...string) Option

Unquote applies strconv.Unquote() to tokens of the given types.

Tokens of type "String" will be unquoted if no other types are provided.

func Upper

func Upper(types ...string) Option

Upper is an Option that upper-cases all tokens of the given type. Useful for case normalisation.

func UseLookahead

func UseLookahead(n int) Option

UseLookahead allows branch lookahead up to "n" tokens.

If parsing cannot be disambiguated before "n" tokens of lookahead, parsing will fail.

Note that increasing lookahead has a minor performance impact, but also reduces the accuracy of error reporting.

type ParseOption

type ParseOption func(p *parseContext)

ParseOption modifies how an individual parse is applied.

func AllowTrailing

func AllowTrailing(ok bool) ParseOption

AllowTrailing tokens without erroring.

That is, do not error if a full parse completes but additional tokens remain.

type Parseable

type Parseable interface {
	// Parse into the receiver.
	//
	// Should return NextMatch if no tokens matched and parsing should continue.
	// Nil should be returned if parsing was successful.
	Parse(lex *lexer.PeekingLexer) error
}

The Parseable interface can be implemented by any element in the grammar to provide custom parsing.

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

A Parser for a particular grammar and lexer.

func Build

func Build(grammar interface{}, options ...Option) (parser *Parser, err error)

Build constructs a parser for the given grammar.

If "Lexer()" is not provided as an option, a default lexer based on text/scanner will be used. This scans typical Go- like tokens.

See documentation for details

func MustBuild

func MustBuild(grammar interface{}, options ...Option) *Parser

MustBuild calls Build(grammar, options...) and panics if an error occurs.

func (*Parser) Lex

func (p *Parser) Lex(filename string, r io.Reader) ([]lexer.Token, error)

Lex uses the parser's lexer to tokenise input.

func (*Parser) Lexer

func (p *Parser) Lexer() lexer.Definition

Lexer returns the parser's builtin lexer.

func (*Parser) Parse

func (p *Parser) Parse(filename string, r io.Reader, v interface{}, options ...ParseOption) (err error)

Parse from r into grammar v which must be of the same type as the grammar passed to Build().

This may return an Error.

func (*Parser) ParseBytes

func (p *Parser) ParseBytes(filename string, b []byte, v interface{}, options ...ParseOption) (err error)

ParseBytes from b into grammar v which must be of the same type as the grammar passed to Build().

This may return an Error.

func (*Parser) ParseFromLexer

func (p *Parser) ParseFromLexer(lex *lexer.PeekingLexer, v interface{}, options ...ParseOption) error

ParseFromLexer into grammar v which must be of the same type as the grammar passed to Build().

This may return a Error.

func (*Parser) ParseString

func (p *Parser) ParseString(filename string, s string, v interface{}, options ...ParseOption) (err error)

ParseString from s into grammar v which must be of the same type as the grammar passed to Build().

This may return an Error.

func (*Parser) String

func (p *Parser) String() string

String returns the EBNF for the grammar.

Productions are always upper case. Lexer tokens are always lower case.

type UnexpectedTokenError

type UnexpectedTokenError struct {
	Unexpected lexer.Token
	Expected   string
}

UnexpectedTokenError is returned by Parse when an unexpected token is encountered.

This is useful for composing parsers in order to detect when a sub-parser has terminated.

func (UnexpectedTokenError) Error

func (u UnexpectedTokenError) Error() string

func (UnexpectedTokenError) Message

func (u UnexpectedTokenError) Message() string

func (UnexpectedTokenError) Position

func (u UnexpectedTokenError) Position() lexer.Position

Directories

Path Synopsis
cmd
Package ebnf contains the AST and parser for parsing the form of EBNF produced by Participle.
Package ebnf contains the AST and parser for parsing the form of EBNF produced by Participle.
experimental
codegen
Package codegen generates Go code for stateful lexers.
Package codegen generates Go code for stateful lexers.
Package lexer defines interfaces and implementations used by Participle to perform lexing.
Package lexer defines interfaces and implementations used by Participle to perform lexing.
stateful
Package stateful defines a nested stateful lexer.
Package stateful defines a nested stateful lexer.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL