pi

package module
v0.5.1
Published: Dec 9, 2018 License: BSD-3-Clause Imports: 11 Imported by: 0

README

GoPi = Interactive Parser in GoKi / GoGi Framework

The pi package allows users to create parsers using the GoGi graphical interface system.

We call it Pi (or GoPi) because Ip is not as easy to pronounce, and also because it makes parsing as easy as pi! You can also think of it as a French acronym -- French acronyms are typically the reverse of English ones -- "parseur interactif". Also, it matches GoKi and GoGi.

Pi uses a robust, top-down Recursive Descent (RD) parsing technique (see Wikipedia), which is the approach used by most hand-coded parsers, which in turn are by far the most widely used in practice (e.g., for gcc, clang, and Go), for various reasons -- see this Stack Overflow thread as well. As far as we can tell (e.g., from this list on Wikipedia), there are not many recursive-descent parser generators, and none that use the same robust, fast techniques that we employ in GoPi.

It seems that most parsing is dominated by a very strong sequentiality assumption -- that you must parse everything in a strictly sequential, incremental, left-to-right, one-token-at-a-time manner. If you step outside of that box (or break with the herd, if you will) by loading the entire source into RAM and processing the entire thing as a whole structured entity (which is entirely trivial these days -- even the biggest source code is typically tiny relative to RAM capacity), then much simpler, more robust solutions are possible. In other words, instead of using a "1D" solution, we actually use a "3D" solution to parsing (line, char, and nesting depth). Specifically, we carve the whole source into statement-level chunks, look for distinctive Key Tokens anywhere in each statement to determine what kind of statement it is, and then proceed recursively to carve it up into its respective parts, using the same approach. There is never any backtracking, and there are no shift-reduce conflicts or any of those other annoying issues that plague other approaches. There is also no need for complicated lookup tables etc. (e.g., as needed for the LL(k) approach used in the popular ANTLR parser). The whole process is very fast, simple, and transparent.

Specifically, there are three distinct passes through the source file, each creating a solid foundation upon which the next step operates, producing significantly faster, simpler, and more error-tolerant (robust) results overall.

  • Lexer -- takes the raw text and turns it into lexical Tokens that categorize a sequence of characters as things like a Name (identifier -- a string of letters and numbers without quotes) or a Literal of some type, for example a LitStr string that has some kind of quotes around it, or a LitNum which is a number. There is a nice category (Cat) and subcategory (SubCat) level of organization to these tokens (see token/token.go). Comments are absorbed in this step as well. The key advantage for subsequent steps is that any ambiguity about e.g., syntactic elements showing up in comments or strings is completely eliminated right at the start. Furthermore, the tokenized version of the file is much, much more compact and contains only the essential information for parsing.

  • StepTwo -- this is a critical second pass through the lexical tokens, performing two important things:

    • Nesting Depth -- all programming languages use some form of parens ( ) brackets [ ] and braces { } to group elements, and parsing must be sensitive to these. Instead of dealing with these issues locally at every step, we do a single pass through the entire tokenized version of the source and compute the depth of every token. Then, the token matching in parsing only needs to compare relative depth values, without having to constantly re-compute that.

    • EOS Detection -- this step detects end-of-statement tokens, which provide an essential first-pass rough-cut chunking of the source into statements. In C / C++ / Go these are the semicolons ; (in Go, semicolons are mostly inferred automatically from the tokens that appear at the ends of lines -- Pi supports this as well). In Python, the end of the line itself ends the statement, unless the line ends at a different nesting depth than where it started (i.e., inside open parens or brackets).

  • Parsing -- finally, we parse the tokenized source using rules that match patterns of tokens. Instead of using bottom-up, local token-sequence-driven parsing, as is done in tools like yacc and bison, we use the much more robust recursive descent technique, which starts in our case with the rough-cut statement chunks produced in StepTwo, recognizes the type of each statement, and then carves each one into its appropriate sub-parts, again in a top-down, progressively-narrowing fashion, until everything has been categorized to the lowest level. At each step, nodes in an Abstract Syntax Tree (AST) are created, representing this same top-down, broad-to-narrow parsing of the source. Thus, the highest nodes are statement-level nodes, each of which contains the organized elements of its statement. These nodes are all in the natural functional hierarchical ordering, not in the raw left-to-right order of the source, and directly correspond to the way the parsing proceeds. Building the AST at the same time as parsing is thus very natural in the top-down RD framework, unlike in traditional bottom-up approaches, and is a major reason that hand-coded parsers use this technique. A minimal sketch of driving these passes follows below.
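Here is that sketch, in terms of the pi API documented below (OpenStdParsers, TheLangSupport, FileState, LexAll, ParserInit, ParseNext); the import path and the way the source is prepared are assumptions for illustration, not a canonical recipe.

package main

import (
	"strings"

	"github.com/goki/pi" // assumed import path for this module
)

func main() {
	// load the standard compiled-in parsers from the langs/ directory
	if err := pi.OpenStdParsers(); err != nil {
		panic(err)
	}
	// look up the language by name (case-insensitive fallback)
	lp, err := pi.TheLangSupport.LangProps("Go")
	if err != nil {
		panic(err)
	}
	pr := lp.Parser

	// SetSrc takes the source as lines of runes
	src := "package main\n\nfunc main() {\n\tx := 1 + 2*3\n\t_ = x\n}\n"
	lines := strings.Split(src, "\n")
	runes := make([][]rune, len(lines))
	for i, ln := range lines {
		runes[i] = []rune(ln)
	}

	fs := pi.NewFileState()
	fs.SetSrc(&runes, "example.go")

	pr.InitAll(fs) // set up lexer and parser state for this file
	pr.LexAll(fs)  // passes 1 and 2: lexing, nesting depth, EOS detection
	if pr.ParserInit(fs) {
		for pr.ParseNext(fs) != nil { // pass 3: top-down RD parsing, rule by rule
		}
	}
	// fs.Ast now holds the resulting abstract syntax tree
}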

RD Parsing Advantages and Issues

Our top-down approach is generally much more robust: instead of depending on precise matches at every step along the way, it starts with the "big picture" by looking for the Key Token that unambiguously indicates what kind of thing we're looking at in a given region, and it then proceeds to process that region as such. If some parts of that region happen to be malformed, it can issue appropriate errors, but that will not spill over into processing of the next region. Thus, errors are automatically "sandboxed" in these regions, and do not accumulate. By contrast, in bottom-up parsers, you need to add extensive error-matching rules at every step to achieve this kind of robustness, and that is often a tricky trial-and-error process and is not inherently robust.

Solving the Associativity problem with RD parsing: Put it in Reverse!

One major problem with RD parsing is that it gets the associativity of mathematical operators backwards. To solve this problem, we simply run those rules in reverse: they scan their region from right to left instead of left to right. This is much simpler than other approaches and works perfectly -- and is again something that you wouldn't even consider from the standard sequential mindset. You just have to add a '-' minus sign at the start of the Rule to set the rule to run in reverse -- this must be set for all binary mathematical operators (e.g., BinaryExpr in the standard grammar).
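The effect of running a rule in reverse can be seen in a tiny self-contained sketch (plain Go, not the pi rule engine): carving an expression at its rightmost top-level operator first, and recursing on each side, yields the conventional left-associative grouping.

package main

import (
	"fmt"
	"strings"
)

// Node is a minimal binary-expression AST node, just for illustration.
type Node struct {
	Op          string
	Left, Right *Node
	Lit         string
}

func (n *Node) String() string {
	if n.Op == "" {
		return n.Lit
	}
	return "(" + n.Left.String() + " " + n.Op + " " + n.Right.String() + ")"
}

// parseSub carves a flat token list at a '-' operator. Scanning from the
// right (reverse mode) makes the resulting tree left-associative, which is
// the conventional meaning of a - b - c.
func parseSub(toks []string) *Node {
	for i := len(toks) - 1; i > 0; i-- { // rightmost '-' found first
		if toks[i] == "-" {
			return &Node{Op: "-", Left: parseSub(toks[:i]), Right: parseSub(toks[i+1:])}
		}
	}
	return &Node{Lit: strings.Join(toks, "")}
}

func main() {
	fmt.Println(parseSub([]string{"a", "-", "b", "-", "c"})) // ((a - b) - c)
}

Scanning left-to-right instead (splitting at the leftmost '-') would produce (a - (b - c)), the wrong, right-associative reading.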

Also, for RD parsing, to deal properly with the order of operations, you have to order the rules in the reverse order of precedence. Thus, the parser matches the lowest-priority items first, and those become the "outer branches" of the AST, which then fill in progressively so that the highest-priority items end up in the "leaves" of the tree, which are what gets processed first.

Lexing and Parsing Rules

Principle of Preemptive Specificity

A common principle across lexing and parsing rules is the principle of preemptive specificity -- all of the different rule options are arranged in order, and the first to match preempts consideration of any of the remaining rules. This is how a switch statement works in Go or C. It is a particularly simple way of dealing with many potential rules and the conflicts among them. The overall strategy as a user is to put the most specific items first so that they will be considered, with the general "default" cases down at the bottom. This is hopefully very intuitive and easy to use.
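As an analogy, here is a minimal Go sketch of that first-match-wins behavior: the specific test has to come before the general one, just as specific rules must precede general defaults in the rule lists.

package main

import (
	"fmt"
	"unicode"
)

// classify mimics preemptive specificity: the first matching case wins, so
// the more specific rule (the "func" keyword) precedes the general one (any name).
func classify(word string) string {
	switch {
	case word == "func":
		return "Keyword"
	case unicode.IsLetter([]rune(word)[0]):
		return "Name"
	default:
		return "Other"
	}
}

func main() {
	fmt.Println(classify("func"), classify("foo"), classify("42")) // Keyword Name Other
}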

In the Lexer, this is particularly important for the State elements: when you enter a different context that continues across multiple chars or lines, you push that context onto the State Stack, and then it is critical that all the rules matching those different states are at the top of the list, so they preempt any non-state-specific alternatives.

Lexing Rules

Lexing is all about the ordering -- follow the above principle of preemptive specificity.

Another common situation is resolving the ambiguity about multi-character constructs, such as := vs just : or =. In this case, you have an outer rule that matches :, and then within that, child rules that do a one-step lookahead for a = at Off=1 -- if that matches, then it is := -- you need to put that rule first, and then the default is just the plain :.
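A minimal sketch of that one-step lookahead, written as plain Go rather than as pi lexer rules (the token names here are illustrative, not the actual token types):

package main

import "fmt"

// lexColon assumes src[pos] == ':' and decides between ":=" and ":" with a
// one-step lookahead -- the more specific child rule (matching '=' at Off=1)
// is tried first, and the plain ':' is the default.
func lexColon(src []rune, pos int) (token string, width int) {
	if pos+1 < len(src) && src[pos+1] == '=' {
		return "Define", 2 // ':='
	}
	return "Colon", 1 // plain ':'
}

func main() {
	src := []rune("x := y[3:5]")
	tok, w := lexColon(src, 2) // the ':' in ':='
	fmt.Println(tok, w)        // Define 2
	tok, w = lexColon(src, 8)  // the ':' in the slice expression
	fmt.Println(tok, w)        // Colon 1
}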

Parsing Rules

  • order matters!

Generative Expression Subdomains

There are certain subdomains that have very open-ended combinatorial "generative" expressive power. These are particular challenges for any parser, and there are a few critical issues and tips for the Pi parser.

Arithmetic with Binary and Unary Operators

You can create arbitrarily long expressions by stringing together sequences of binary and unary mathematical / logical operators. From the top-down parser's perspective, here are the key points:

  1. Each operator must be uniquely recognizable from the soup of tokens, and this critically includes distinguishing unary from binary: e.g., correctly recognizing the binary and unary - signs here: a - b * -c

  2. The operators must be organized in reverse order of priority, so that the lowest priority operations are factored out first, creating the highest-level, broadest splits of the overall expression (in the Ast tree), and then progressively finer, tighter, inner steps are parsed out. Thus, for example in this expression:

if a + b * 2 / 7 - 42 > c * d + e / 72

The broadest, first split is into the two sides of the > operator, and then each of those sides is progressively organized, first around the addition and subtraction operators, and then the * and / (see the grouping sketch after this list).

  3. The binary operators provide the recursive generativity for the expression. E.g., Addition is specified as:
AddExpr: Expr '+' Expr

so it just finds the + token and then descends recursively to unpack each of those Expr chunks on either side, until there are no more tokens left there.
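For example, with the rules ordered from lowest to highest priority (> first, then + and -, then * and /), the if-expression shown above ends up carved top-down into the conventional grouping:

((a + ((b * 2) / 7)) - 42) > ((c * d) + (e / 72))

The > rule makes the broadest split, the + and - rules split each side next, and the * and / groupings end up deepest in the Ast, so they are processed first.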

One particularly vexing situation arises if you have the possibility of mixing multiplication with de-pointering, both of which are indicated by the * symbol. In Go, this is particularly challenging because of the frequent use of type literals, including those with pointer types, in general expressions. We had to add the ability to specify an "exclusion" expression that detects this alternate use and excludes it from the general multiplication case (because multiplication is broader than de-pointering and thus comes first in the matching hierarchy).

Path-wise Operators

Another generative domain is the path-wise operators, principally the "selector" operator . and the slice operator '[' SliceExpr ']', which can be combined with method calls and other kinds of primary expressions in a very open-ended way, e.g.:

ps.Errs[len(ps.Errs)-1].Error()[0][1].String()

In the top-down parser, it is essential to support this open-ended scenario by including pre- and post-expressions surrounding the Slice and Selector operators, which then act like the Expr groups surrounding the AddExpr operator to support recursive chaining. For Selector, the two Exprs are required, but for Slice they are optional, which works fine:

Slice: ?PrimaryExpr '[' SliceExpr ']' ?PrimaryExpr

Without those optional exprs on either side, the top-down parser would just stop after getting either side of that expression.

As with the arithmetic case, order matters and in the same inverse way, where you want to match the broader items first. Also, those cases that are more specific and would otherwise be absorbed by a more general expression (e.g.,

Documentation

Overview

Package pi provides the main interactive parser structure for running the parse. The piv sub-package provides the GUI for constructing and testing a parser.


Constants

const (
	Version     = "v0.5.1"
	GitCommit   = "f99e938"          // the commit JUST BEFORE the release
	VersionDate = "2018-12-09 10:34" // UTC
)

Variables

var KiT_LangFlags = kit.Enums.AddEnum(LangFlagsN, false, nil)
var StdLangProps = map[filecat.Supported]LangProps{
	filecat.Ada:        {filecat.Ada, "--", "", "", nil, nil},
	filecat.Bash:       {filecat.Bash, "# ", "", "", nil, nil},
	filecat.Csh:        {filecat.Csh, "# ", "", "", nil, nil},
	filecat.C:          {filecat.C, "// ", "/* ", " */", nil, nil},
	filecat.CSharp:     {filecat.CSharp, "// ", "/* ", " */", nil, nil},
	filecat.D:          {filecat.D, "// ", "/* ", " */", nil, nil},
	filecat.ObjC:       {filecat.ObjC, "// ", "/* ", " */", nil, nil},
	filecat.Go:         {filecat.Go, "// ", "/* ", " */", []LangFlags{IndentTab}, nil},
	filecat.Java:       {filecat.Java, "// ", "/* ", " */", nil, nil},
	filecat.JavaScript: {filecat.JavaScript, "// ", "/* ", " */", nil, nil},
	filecat.Eiffel:     {filecat.Eiffel, "--", "", "", nil, nil},
	filecat.Haskell:    {filecat.Haskell, "--", "{- ", "-}", nil, nil},
	filecat.Lisp:       {filecat.Lisp, "; ", "", "", nil, nil},
	filecat.Lua:        {filecat.Lua, "--", "---[[ ", "--]]", nil, nil},
	filecat.Makefile:   {filecat.Makefile, "# ", "", "", []LangFlags{IndentTab}, nil},
	filecat.Matlab:     {filecat.Matlab, "% ", "%{ ", " %}", nil, nil},
	filecat.OCaml:      {filecat.OCaml, "", "(* ", " *)", nil, nil},
	filecat.Pascal:     {filecat.Pascal, "// ", " ", " }", nil, nil},
	filecat.Perl:       {filecat.Perl, "# ", "", "", nil, nil},
	filecat.Python:     {filecat.Python, "# ", "", "", []LangFlags{IndentSpace}, nil},
	filecat.Php:        {filecat.Php, "// ", "/* ", " */", nil, nil},
	filecat.R:          {filecat.R, "# ", "", "", nil, nil},
	filecat.Ruby:       {filecat.Ruby, "# ", "", "", nil, nil},
	filecat.Rust:       {filecat.Rust, "// ", "/* ", " */", nil, nil},
	filecat.Scala:      {filecat.Scala, "// ", "/* ", " */", nil, nil},
	filecat.Html:       {filecat.Html, "", "<!-- ", " -->", nil, nil},
	filecat.TeX:        {filecat.TeX, "% ", "", "", nil, nil},
	filecat.Markdown:   {filecat.Markdown, "", "<!--- ", " -->", []LangFlags{IndentSpace}, nil},
}

StdLangProps is the standard compiled-in set of language properties
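For example, a quick sketch of looking up the compiled-in comment syntax for a language (the import paths are assumptions for illustration):

package main

import (
	"fmt"

	"github.com/goki/gi/filecat" // assumed import path for the filecat package
	"github.com/goki/pi"         // assumed import path for this module
)

func main() {
	props := pi.StdLangProps[filecat.Go]
	fmt.Printf("line comment starts with: %q\n", props.CommentLn) // "// "
}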

var TheLangSupport = LangSupport{}

Functions

func OpenStdParsers

func OpenStdParsers() error

OpenStdParsers opens all the standard parsers for languages, from the langs/ directory

func VersionInfo

func VersionInfo() string

VersionInfo returns Pi version information

Types

type FileState

type FileState struct {
	Src        lex.File     `json:"-" xml:"-" desc:"the source to be parsed -- also holds the full lexed tokens"`
	LexState   lex.State    `json:"-" xml:"-" desc:"state for lexing"`
	TwoState   lex.TwoState `json:"-" xml:"-" desc:"state for second pass nesting depth and EOS matching"`
	ParseState parse.State  `json:"-" xml:"-" desc:"state for parsing"`
	Ast        parse.Ast    `json:"-" xml:"-" desc:"ast output tree from parsing"`
}

FileState is the parsing state information for a given file

func NewFileState

func NewFileState() *FileState

NewFileState returns a new initialized file state

func (*FileState) Init

func (fs *FileState) Init()

Init initializes the file state

func (*FileState) LexAtEnd

func (fs *FileState) LexAtEnd() bool

LexAtEnd returns true if lexing state is now at end of source

func (*FileState) LexErrString

func (fs *FileState) LexErrString() string

LexErrString returns all the lexing errors as a string

func (*FileState) LexHasErrs

func (fs *FileState) LexHasErrs() bool

LexHasErrs returns true if there were errors from lexing

func (*FileState) LexLine

func (fs *FileState) LexLine(ln int) lex.Line

LexLine returns the lexing output for given line, combining comments and all other tokens and allocating new memory using clone

func (*FileState) LexLineString

func (fs *FileState) LexLineString() string

LexLineString returns a string rep of the current lexing output for the current line

func (*FileState) LexNextSrcLine

func (fs *FileState) LexNextSrcLine() string

LexNextSrcLine returns the next line of source that the lexer is currently at

func (*FileState) ParseAtEnd

func (fs *FileState) ParseAtEnd() bool

ParseAtEnd returns true if parsing state is now at end of source

func (*FileState) ParseErrString

func (fs *FileState) ParseErrString() string

ParseErrString returns all the parsing errors as a string

func (*FileState) ParseHasErrs

func (fs *FileState) ParseHasErrs() bool

ParseHasErrs returns true if there were errors from parsing

func (*FileState) ParseNextSrcLine

func (fs *FileState) ParseNextSrcLine() string

ParseNextSrcLine returns the next line of source that the parser is currently at

func (*FileState) ParseRuleString

func (fs *FileState) ParseRuleString(full bool) string

ParseRuleString returns the rule info for the entire source -- if full is set, it includes the full stack at each point; otherwise just the top of the stack

func (*FileState) PassTwoErrString

func (fs *FileState) PassTwoErrString() string

PassTwoErrString returns all the pass two errors as a string

func (*FileState) PassTwoHasErrs

func (fs *FileState) PassTwoHasErrs() bool

PassTwoHasErrs returns true if there were errors from pass two processing

func (*FileState) SetSrc

func (fs *FileState) SetSrc(src *[][]rune, fname string)

SetSrc sets source to be parsed, and filename it came from

type LangFlags

type LangFlags int

LangFlags are special properties of a given language

const (
	// NoFlags = nothing special
	NoFlags LangFlags = iota

	// IndentSpace means that spaces must be used for this language
	IndentSpace

	// IndentTab means that tabs must be used for this language
	IndentTab

	LangFlagsN
)

LangFlags

func (LangFlags) MarshalJSON

func (ev LangFlags) MarshalJSON() ([]byte, error)

func (*LangFlags) UnmarshalJSON

func (ev *LangFlags) UnmarshalJSON(b []byte) error

type LangProps

type LangProps struct {
	Lang      filecat.Supported `desc:"language -- must be a supported one from Supported list"`
	CommentLn string            `desc:"character(s) that start a single-line comment -- if empty then multi-line comment syntax will be used"`
	CommentSt string            `desc:"character(s) that start a multi-line comment or one that requires both start and end"`
	CommentEd string            `desc:"character(s) that end a multi-line comment or one that requires both start and end"`
	Flags     []LangFlags       `desc:"special properties for this language"`
	Parser    *Parser           `json:"-" xml:"-" desc:"parser for this language"`
}

LangProps contains properties of languages supported by the Pi parser framework

type LangSupport

type LangSupport struct {
}

LangSupport provides general support for supported languages. e.g., looking up lexers and parsers by name. Implements the lex.LangLexer interface to provide access to other Guest Lexers

func (*LangSupport) LangProps

func (ll *LangSupport) LangProps(lang string) (*LangProps, error)

LangProps looks up language properties by the string name of the language (with case-insensitive fallback). Returns an error if the language is not supported.

func (*LangSupport) Lexer

func (ll *LangSupport) Lexer(lang string) *lex.Rule

Lexer looks up the Lexer for the given language, using a robust case-sensitive and then case-insensitive search

type Parser

type Parser struct {
	Lexer    lex.Rule    `desc:"lexer rules for first pass of lexing file"`
	PassTwo  lex.PassTwo `desc:"second pass after lexing -- computes nesting depth and EOS finding"`
	Parser   parse.Rule  `desc:"parser rules for parsing lexed tokens"`
	Filename string      `desc:"file name for overall parser"`
}

Parser is the overall parser for managing the parsing

func NewParser

func NewParser() *Parser

NewParser returns a new initialized parser

func (*Parser) DoPassTwo

func (pr *Parser) DoPassTwo(fs *FileState)

DoPassTwo does the second pass after lexing

func (*Parser) Init

func (pr *Parser) Init()

Init initializes the parser -- must be called after creation

func (*Parser) InitAll

func (pr *Parser) InitAll(fs *FileState)

InitAll initializes everything about the parser -- call this when setting up a new parser after it has been loaded etc

func (*Parser) LexAll

func (pr *Parser) LexAll(fs *FileState)

LexAll runs a complete pass of the lexer and pass two, on current state

func (*Parser) LexInit

func (pr *Parser) LexInit(fs *FileState)

LexInit gets the lexer ready to start lexing

func (*Parser) LexLine

func (pr *Parser) LexLine(fs *FileState, ln int) lex.Line

LexLine runs lexer for given single line of source, returns merged regular and token comment lines, cloned and ready for use

func (*Parser) LexNext

func (pr *Parser) LexNext(fs *FileState) *lex.Rule

LexNext does the next step of lexing -- returns the lowest-level rule that matched, and nil upon a no-match error or at the end of the source input

func (*Parser) LexNextLine

func (pr *Parser) LexNextLine(fs *FileState) *lex.Rule

LexNextLine does the next line of lexing -- returns the lowest-level rule that matched at the end, and nil upon a no-match error or at the end of the source input

func (*Parser) LexRun

func (pr *Parser) LexRun(fs *FileState)

LexRun keeps running LexNext until it stops

func (*Parser) OpenJSON

func (pr *Parser) OpenJSON(filename string) error

OpenJSON opens lexer and parser rules from the given filename, in a standard JSON-formatted file

func (*Parser) ParseNext

func (pr *Parser) ParseNext(fs *FileState) *parse.Rule

ParseNext does the next step of parsing -- returns the lowest-level rule that matched, or nil upon a no-match error or at the end

func (*Parser) ParserInit

func (pr *Parser) ParserInit(fs *FileState) bool

ParserInit initializes the parser prior to running

func (*Parser) SaveGrammar

func (pr *Parser) SaveGrammar(filename string) error

SaveGrammar saves the lexer and parser grammar rules to a BNF-like .pig file

func (*Parser) SaveJSON

func (pr *Parser) SaveJSON(filename string) error

SaveJSON saves lexer and parser rules, in a standard JSON-formatted file

Directories

Path	Synopsis
cmd
pie
lex	Package lex provides all the lexing functions that transform text into lexical tokens, using token types defined in the pi/token package.
parse	Package parse does the parsing stage after lexing, using a top-down recursive-descent (TDRD) strategy, with a special reverse mode to deal with left-associative binary expressions which otherwise end up being right-associative for TDRD parsing.
piv	Package piv provides the PiView object for the full GUI view of the interactive parser (pi) system.
token	Package token defines a complete set of all lexical tokens for any kind of language! It is based on the alecthomas/chroma / pygments lexical tokens plus all the more detailed tokens needed for actually parsing languages.
