parse

package
v0.3.2
Warning

This package is not in the latest version of its module.

Published: Aug 17, 2024 License: BSD-3-Clause Imports: 18 Imported by: 1

README

Interactive Parsing in Cogent Core

The parse package supports a simple and robust form of lexing and parsing based on top-down recursive descent, and allows users to create parsers using the Cogent Core graphical interface system. It is used for syntax highlighting, completion, and more advanced language-structure specific functionality in Cogent Core and in the Cogent Code IDE / editor (where we need to support multiple different languages, and can't just rely on the excellent builtin Go parser).

The parse directory is also home to various other packages including:

  • complete: basic completion / lookup infrastructure.

  • lsp: Language Server Protocol (LSP) interface (incomplete).

Overview of language support

parse/language.go defines the Language interface, which each supported language implements (at least a nil stub) -- at a minimum the Parser, ParseFile (which includes just lexing if that is all that is needed), and HighlightLine methods should be implemented, to drive syntax highlighting / coloring / tagging. Optionally, completion, lookup, etc. can be implemented. See languages/golang for a full implementation, and languages/tex for a more minimal lex-only case.

parse/languagesupport.go has tables of supported languages and their properties, in LanguageProperties.

parse in general has overall management methods for coordinating the lex (lexing) and parse (parsing) steps.

lex also has a variety of random manual and indent functions that are useful for special-case manual parsing cases.

Parsing Strategy

Parse uses a robust, top-down Recursive Descent (RD) parsing technique (see Wikipedia), which is the approach used by most hand-coded parsers, which are by far the most widely used in practice (e.g., for gcc, clang, and Go) for various reasons -- see also this Stack Overflow thread (https://stackoverflow.com/questions/6319086/are-gcc-and-clang-parsers-really-handwritten). As far as we can tell (e.g., from this list on Wikipedia) there are not many recursive-descent parser generators, and none that use the same robust, simple techniques that we employ in parse.

Most parsing algorithms are dominated by a strong sequentiality assumption -- that you must parse everything in a strictly sequential, incremental, left-to-right, one-token-at-a-time manner. If you step outside of that box (or break with the herd if you will), by loading the entire source into RAM and processing the entire thing as a whole structured entity (which is entirely trivial these days -- even the biggest source code is typically tiny relative to RAM capacity), then much simpler, more robust solutions are possible. In other words, instead of using a "1D" solution with a tiny pinhole window onto the code, we use a 3D solution to parsing (line, char, and nesting depth). This is not good for huge data files (where an optimized, easily parsed encoding format is appropriate), but it is great for programs, which is what parse is for.

Specifically, we carve the whole source into statement-level chunks and then proceed to break each chunk apart into smaller pieces by looking for distinctive lexical tokens anywhere in the statement to determine what kind of statement it is, and then proceed recursively to carve that up into its respective parts, using the same approach. There are never any backtracking or shift-reduce conflicts or any of those annoying issues that plague other approaches -- the grammar you write is very directly the grammar of the language, and doesn't require a lot of random tweaks and special cases to get it to work.

For example, here are the rules for standard binary expressions (in Go or most other languages):

        SubExpr:         -Expr '-' Expr
        AddExpr:         -Expr '+' Expr
        RemExpr:         -Expr '%' Expr
        DivExpr:         -Expr '/' Expr
        MultExpr:        -Expr '*' Expr

and here are some of the more complicated statements (in Go):

    IfStmt {
        IfStmtExpr:  'key:if' Expr '{' ?BlockList '}' ?Elses 'EOS'
        IfStmtInit:  'key:if' SimpleStmt 'EOS' Expr '{' ?BlockList '}' ?Elses 'EOS'
    }
    ForStmt {
        ForRangeExisting:  'key:for' ExprList '=' 'key:range' Expr '{' ?BlockList -'}' 'EOS'
        ForRangeNewLit:  'key:for' NameList ':=' 'key:range' @CompositeLit '{' ?BlockList -'}' 'EOS'
        ...
    }

See the complete grammar for Go for everything, including the lexer rules (at the top).
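As a rough illustration of the statement-level carving described above, here is a minimal Go sketch. It is not the actual parse implementation: the SplitStatements name and the character-level approach are assumptions made for illustration only. A newline or semicolon counts as an end of statement only at nesting depth zero, so bracketed groups that span several lines stay in one chunk.

```go
package main

import (
	"fmt"
	"strings"
)

// SplitStatements is a hypothetical helper that cuts src into statement
// chunks. A newline or ';' ends a statement only at nesting depth zero.
// This toy version works on raw characters and ignores strings and
// comments, which a real lexer would have dealt with beforehand.
func SplitStatements(src string) []string {
	var stmts []string
	var cur strings.Builder
	depth := 0
	// flush stores the current chunk (if non-empty) and starts a new one.
	flush := func() {
		if s := strings.TrimSpace(cur.String()); s != "" {
			stmts = append(stmts, s)
		}
		cur.Reset()
	}
	for _, r := range src {
		switch r {
		case '(', '[', '{':
			depth++
		case ')', ']', '}':
			if depth > 0 {
				depth--
			}
		}
		if (r == '\n' || r == ';') && depth == 0 {
			flush()
			continue
		}
		cur.WriteRune(r)
	}
	flush()
	return stmts
}

func main() {
	src := "x = f(a,\n      b)\ny = 2; z = [1,\n 2]\n"
	for i, s := range SplitStatements(src) {
		fmt.Printf("%d: %q\n", i, s)
	}
}
```

Each resulting chunk can then be examined as a whole to decide what kind of statement it is, which is the starting point for the recursive carving.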

While parse is likely to be a lot easier to use than yacc and bison, the latest version 4 of ANTLR with its ALL(*) algorithm sounds like it offers similar abilities to robustly handle intuitive grammars, and is likely more generalizable to a wider range of languages, and is probably faster overall than parse. But parse is much simpler and more transparent in terms of how it actually works (disclaimer: I have no idea whatsoever how ANTLR V4 actually works! And that's kind of the point...). Anyone should be able to understand how parse works, and tweak it as needed, etc. And it operates directly in AST-order, creating the corresponding AST on the fly as it parses, so you can interactively understand what it is doing as it goes along, making it relatively easy to create your grammar (although this process is, in truth, always a bit complicated and never as easy as one might hope). And parse is fast enough for most uses, taking just a few hundred msec for even relatively large and complex source code, and it processes the entire Go standard library in around 40 sec (on a 2016 MacBook laptop).

Three Steps of Processing

Parse does three distinct passes through the source file, each creating a solid foundation upon which the next step operates.

  • Lexer -- takes the raw text and turns it into lexical Tokens that categorize a sequence of characters as things like a Name (identifier -- a string of letters and numbers without quotes) or a Literal of some type, for example a LitStr string that has some kind of quotes around it, or a LitNum which is a number. There is a nice category (Cat) and subcategory (SubCat) level of organization to these tokens (see token/token.go). Comments are absorbed in this step as well, and stored in a separate lex output so you can always access them (e.g., for docs), without having to deal with them in the parsing process. The key advantage for subsequent steps is that any ambiguity about e.g., syntactic elements showing up in comments or strings is completely eliminated right at the start. Furthermore, the tokenized version of the file is much more compact and contains only the essential information for parsing.

  • StepTwo: this is a critical second pass through the lexical tokens, performing two important things:

    • Nesting Depth: all programming languages use some form of parentheses ( ) brackets [ ] and braces { } to group elements, and parsing must be sensitive to these. Instead of dealing with these issues locally at every step, we do a single pass through the entire tokenized version of the source and compute the depth of every token. Then, the token matching in parsing only needs to compare relative depth values, without having to constantly re-compute that. As an extra bonus, you can use this depth information in syntax highlighting (as we do in Cogent Code).

    • EOS Detection: This step detects end of statement tokens, which provide an essential first-pass rough-cut chunking of the source into statements. In C / C++ / Go and related languages, these are the semicolons ; (in Go, semicolons are mostly automatically computed from tokens that appear at the end of lines; parse supports this as well). In Python, this is the end of line itself, unless it is not at the same nesting depth as at the start of the line.

  • Parsing -- finally we parse the tokenized source using rules that match patterns of tokens, using the top-down recursive descent technique as described above, starting with those rough-cut statement chunks produced in StepTwo. At each step, nodes in an Abstract Syntax Tree (AST) are created, representing this same top-down broad-to-narrow parsing of the source. Thus, the highest nodes are statement-level nodes, each of which then contain the organized elements of each statement. These nodes are all in the natural functional hierarchical ordering, not in the raw left-to-right order of the source, and directly correspond to the way that the parsing proceeds. Thus, building the AST at the same time as parsing is very natural in the top-down RD framework, unlike traditional bottom-up approaches, and is a major reason that hand-coded parsers use this technique.
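The nesting-depth pass described under StepTwo above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the Token type, AssignDepths, and FindAtDepth are hypothetical names, and the convention that a bracket sits at the same depth as its partner is a choice made here, not necessarily what parse does internally.

```go
package main

import "fmt"

// Token is a hypothetical, simplified token: its text and nesting depth.
type Token struct {
	Text  string
	Depth int
}

// AssignDepths makes a single pass over the token texts and records the
// nesting depth of each one. An opening bracket is recorded at the outer
// depth and then increases it; a closing bracket decreases the depth first,
// so it ends up at the same depth as its matching opener.
func AssignDepths(texts []string) []Token {
	toks := make([]Token, 0, len(texts))
	depth := 0
	for _, t := range texts {
		switch t {
		case "(", "[", "{":
			toks = append(toks, Token{t, depth})
			depth++
		case ")", "]", "}":
			if depth > 0 {
				depth--
			}
			toks = append(toks, Token{t, depth})
		default:
			toks = append(toks, Token{t, depth})
		}
	}
	return toks
}

// FindAtDepth returns the index of the first token with the given text at
// the given depth, or -1 if there is none. Only stored depth values are
// compared; nothing is re-counted.
func FindAtDepth(toks []Token, text string, depth int) int {
	for i, t := range toks {
		if t.Depth == depth && t.Text == text {
			return i
		}
	}
	return -1
}

func main() {
	// Tokens for: f ( a , g ( b , c ) ) , d
	toks := AssignDepths([]string{"f", "(", "a", ",", "g", "(", "b", ",", "c", ")", ")", ",", "d"})
	for _, t := range toks {
		fmt.Printf("%s:%d ", t.Text, t.Depth)
	}
	fmt.Println()
	fmt.Println(FindAtDepth(toks, ",", 0)) // index of the only top-level comma
}
```

With depths precomputed, a rule looking for a top-level comma skips the commas inside the nested calls simply by comparing depth values.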

Once you have the AST, it contains the full logical structure of the program and it could be further processed in any number of ways. The full availability of the AST-level parse of a Go program is what has enabled so many useful meta-level coding tools to be developed for this language (e.g., gofmt, go doc, go fix, etc), and likewise for all the tools that leverage the clang parser for C-based languages.

In addition, parse has Actions that are applied during parsing to create lists of Symbols and Types in the source. These are useful for IDE completion lookup etc, and generally can be at least initially created during the parse step -- we currently create most of the symbols during parsing and then fill in the detailed type information in a subsequent language-specific pass through the AST.

RD Parsing Advantages and Issues

The top-down approach is generally much more robust: instead of depending on precise matches at every step along the way, which can easily get derailed by errant code at any point, it starts with the "big picture" and keeps any errors from overflowing those EOS statement boundaries (and within more specific scopes within statements as well). Thus, errors are automatically "sandboxed" in these regions, and do not accumulate. By contrast, in bottom-up parsers, you need to add extensive error-matching rules at every step to achieve this kind of robustness, and that is often a tricky trial-and-error process and is not inherently robust.

Solving the Associativity problem with RD parsing: Put it in Reverse!

One major problem with RD parsing is that it gets the associativity of mathematical operators backwards. To solve this problem, we simply run those rules in reverse: they scan their region from right to left instead of left to right. This is much simpler than other approaches and works perfectly -- and is again something that you wouldn't even consider from the standard sequential mindset. You just have to add a - minus sign at the start of the Rule to set the rule to run in reverse -- this must be set for all binary mathematical operators (e.g., BinaryExpr in the standard grammar, as you can see in the examples above).

Also, for RD parsing, to deal properly with the order of operations, you have to order the rules in the reverse order of precedence. Thus, it matches the lowest priority items first, and those become the "outer branch" of the AST, which then proceeds to fill in so that the highest-priority items are in the "leaves" of the tree, which are what gets processed last. Again, using the parseview GUI and watching the AST fill in as things are parsed gives you a better sense of how this works.
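To make the two points above concrete, here is a small Go sketch of a top-down splitter. It is a toy under stated assumptions, not the parse rule engine: Split and opGroups are hypothetical names, tokens are plain strings, and parentheses and unary operators are not handled. Operator groups are tried from lowest to highest precedence, and the reverse flag controls whether each group scans right to left.

```go
package main

import (
	"fmt"
	"strings"
)

// opGroups lists operator groups from lowest to highest precedence, so the
// lowest-precedence operators are matched first and become the outer nodes.
var opGroups = [][]string{{"+", "-"}, {"*", "/"}}

// isOp reports whether tok is one of ops.
func isOp(tok string, ops []string) bool {
	for _, o := range ops {
		if tok == o {
			return true
		}
	}
	return false
}

// Split returns a fully parenthesized rendering of the tree obtained by
// splitting toks at the first matching operator and recursing on both
// sides. With reverse set, the scan runs right to left.
func Split(toks []string, reverse bool) string {
	if len(toks) == 1 {
		return toks[0]
	}
	for _, ops := range opGroups {
		// Skip the first and last positions: an operator needs both sides.
		for k := 1; k < len(toks)-1; k++ {
			i := k
			if reverse {
				i = len(toks) - 1 - k
			}
			if isOp(toks[i], ops) {
				return "(" + Split(toks[:i], reverse) + " " + toks[i] + " " +
					Split(toks[i+1:], reverse) + ")"
			}
		}
	}
	return strings.Join(toks, " ") // no operator found: leave as-is
}

func main() {
	toks := strings.Fields("a - b - c")
	fmt.Println(Split(toks, false)) // forward scan: (a - (b - c))
	fmt.Println(Split(toks, true))  // reverse scan: ((a - b) - c)
	fmt.Println(Split(strings.Fields("a + b * c - d"), true))
}
```

The forward scan groups a - b - c as a - (b - c), which is the wrong associativity for subtraction; the reverse scan yields (a - b) - c. In the last example the multiplication ends up innermost because its group is only tried after no + or - remains in the region.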

Principle of Preemptive Specificity

A common principle across lexing and parsing rules is the principle of preemptive specificity -- all of the different rule options are arranged in order, and the first to match preempts consideration of any of the remaining rules. This is how a switch statement works in Go or C. This is a particularly simple way of dealing with many potential rules and conflicts therefrom. The overall strategy as a user is to put the most specific items first so that they will get considered, and then the general "default" cases are down at the bottom. This is hopefully very intuitive and easy to use.

In the Lexer, this is particularly important for the State elements: when you enter a different context that continues across multiple chars or lines, you push that context onto the State Stack, and then it is critical that all the rules matching those different states are at the top of the list, so they preempt any non-state-specific alternatives. State is also available in the parser but is less widely used.
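The ordering principle can be illustrated with a minimal Go sketch. The Rule type, the rule list, and FirstMatch are hypothetical and are not the parse rule format; the point is only that rules are tried in order and the first match wins, so more specific patterns must come first.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical lexer rule: a literal prefix to match and the name
// of the token to emit when it matches.
type Rule struct {
	Match string
	Token string
}

// rules are tried in order, so the more specific ":=" and "==" must come
// before the more general ":" and "=".
var rules = []Rule{
	{":=", "Define"},
	{"==", "Equal"},
	{"=", "Assign"},
	{":", "Colon"},
}

// FirstMatch returns the first rule whose Match is a prefix of src.
// Later rules are never considered once an earlier one has matched.
func FirstMatch(rules []Rule, src string) (Rule, bool) {
	for _, r := range rules {
		if strings.HasPrefix(src, r.Match) {
			return r, true
		}
	}
	return Rule{}, false
}

func main() {
	r, _ := FirstMatch(rules, ":= 1")
	fmt.Println(r.Token) // Define

	// With the general rule first, the specific one can never match.
	bad := []Rule{{":", "Colon"}, {":=", "Define"}}
	r, _ = FirstMatch(bad, ":= 1")
	fmt.Println(r.Token) // Colon
}
```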

Generative Expression Subdomains

There are certain subdomains that have very open-ended combinatorial "generative" expressive power. These are particular challenges for any parser, and there are a few critical issues and tips for the parser.

Arithmetic with Binary and Unary Operators

You can create arbitrarily long expressions by stringing together sequences of binary and unary mathematical / logical operators. From the top-down parser's perspective, here are the key points:

  1. Each operator must be uniquely recognizable from the soup of tokens, and this critically includes distinguishing unary from binary: e.g., correctly recognizing the binary and unary - signs here: a - b * -c

  2. The operators must be organized in reverse order of priority, so that the lowest priority operations are factored out first, creating the highest-level, broadest splits of the overall expression (in the AST tree), and then progressively finer, tighter, inner steps are parsed out. Thus, for example in this expression:

if a + b * 2 / 7 - 42 > c * d + e / 72

The broadest, first split is into the two sides of the > operator, and then each of those sides is progressively organized first into an addition operator, then the * and /.

  3. The binary operators provide the recursive generativity for the expression. E.g., Addition is specified as:
AddExpr: Expr '+' Expr

so it just finds the + token and then descends recursively to unpack each of those Expr chunks on either side, until there are no more tokens left there.
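For the first point, distinguishing unary from binary operators, one common heuristic is to look at the preceding token: a - that follows an operand is binary, and otherwise it is unary. The Go sketch below illustrates that heuristic only; ClassifyMinus and endsOperand are hypothetical names, and this is not necessarily how the parse rules make the distinction.

```go
package main

import "fmt"

// endsOperand reports whether tok can be the last token of an operand.
// Operators, opening brackets, commas, and assignment cannot.
func endsOperand(tok string) bool {
	switch tok {
	case "+", "-", "*", "/", "%", "(", "[", "{", ",", "=":
		return false
	}
	return true
}

// ClassifyMinus returns "unary" or "binary" for each "-" in toks, in order
// of appearance, based only on the token that precedes it.
func ClassifyMinus(toks []string) []string {
	var kinds []string
	for i, t := range toks {
		if t != "-" {
			continue
		}
		if i == 0 || !endsOperand(toks[i-1]) {
			kinds = append(kinds, "unary")
		} else {
			kinds = append(kinds, "binary")
		}
	}
	return kinds
}

func main() {
	// Tokens for: a - b * -c
	fmt.Println(ClassifyMinus([]string{"a", "-", "b", "*", "-", "c"})) // [binary unary]
}
```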

One particularly vexing situation arises if you have the possibility of mixing multiplication with de-pointering, both of which are indicated by the * symbol. In Go, this is particularly challenging because of the frequent use of type literals, including those with pointer types, in general expressions -- at a purely syntactic, local level it is ambiguous:

var MultSlice = p[2]*Rule // this is multiplication
var SliceAry = [2]*Rule{}  // this is an array literal

We resolved this by putting the literal case ahead of the general expression case, because it matches the braces {} and resolves the ambiguity, but this does cause a few other residual cases of ambiguity that are very low frequency.

Path-wise Operators

Another generative domain is the path-wise operators, principally the "selector" operator . and the slice operator '[' SliceExpr ']', which can be combined with method calls and other kinds of primary expressions in a very open-ended way, e.g.:

ps.Errs[len(ps.Errs)-1].Error()[0][1].String()

In the top-down parser, it is essential to create this open-ended scenario by including pre-and-post expressions surrounding the Slice and Selector operators, which then act like the Expr groups surrounding the AddExpr operator to support recursive chaining. For Selector, the two Expr's are required, but for Slice, they are optional - that works fine:

Slice: ?PrimaryExpr '[' SliceExpr ']' ?PrimaryExpr

Without those optional exprs on either side, the top-down parser would just stop after getting either side of that expression.

As with the arithmetic case, order matters and in the same inverse way, where you want to match the broader items first.

Overall, processing these kinds of expressions takes most of the time in the parser, due to the very high branching factor for what kinds of things could be there, and a more optimized and language-specific strategy would undoubtedly work a lot better. We will go back and figure out how the Go parser deals with all this stuff at some point, and see what kinds of tricks we might be able to incorporate in a general way in parse.

There remain a few low-frequency expressions that the current Go parsing rules in parse don't handle (e.g., see the make test target in cmd/parse directory for the results of parsing the entire Go std library). One potential approach would be to do a further level of more bottom-up, lexer-level chunking of expressions at the same depth level, e.g., the a.b selector pattern, and the []slice vs. array[ab] and func(params) kinds of patterns, and then the parser can operate on top of those. Thus, the purely top-down approach seems to struggle a bit with some of these kinds of complex path-level expressions. By contrast, it really easily deals with standard arithmetic expressions, which are much more regular and have a clear precedence order.

Documentation

Overview

Package parse is the top-level package for the Cogent Core parsing system.

The code is organized into the various sub-packages, dealing with the different stages of parsing etc.

Sub-package languages has the parsers for specific languages, including Go (of course), markdown and tex (the latter two are lexer-only).

Note that the GUI editor framework for creating and testing parsers is currently in the piv subpackage in Cogent Code: https://github.com/cogentcore/cogent/tree/main/code/piv

Index

Constants

This section is empty.

Variables

var LanguageSupport = LanguageSupporter{}

LanguageSupport is the main language support hub for accessing parse support interfaces for each supported language

var StandardLanguageProperties = map[fileinfo.Known]*LanguageProperties{
	fileinfo.Ada:        {fileinfo.Ada, "--", "", "", nil, nil, nil},
	fileinfo.Bash:       {fileinfo.Bash, "# ", "", "", nil, nil, nil},
	fileinfo.Csh:        {fileinfo.Csh, "# ", "", "", nil, nil, nil},
	fileinfo.C:          {fileinfo.C, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.CSharp:     {fileinfo.CSharp, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.D:          {fileinfo.D, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.ObjC:       {fileinfo.ObjC, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.Go:         {fileinfo.Go, "// ", "/* ", " */", []LanguageFlags{IndentTab}, nil, nil},
	fileinfo.Java:       {fileinfo.Java, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.JavaScript: {fileinfo.JavaScript, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.Eiffel:     {fileinfo.Eiffel, "--", "", "", nil, nil, nil},
	fileinfo.Haskell:    {fileinfo.Haskell, "--", "{- ", "-}", nil, nil, nil},
	fileinfo.Lisp:       {fileinfo.Lisp, "; ", "", "", nil, nil, nil},
	fileinfo.Lua:        {fileinfo.Lua, "--", "---[[ ", "--]]", nil, nil, nil},
	fileinfo.Makefile:   {fileinfo.Makefile, "# ", "", "", []LanguageFlags{IndentTab}, nil, nil},
	fileinfo.Matlab:     {fileinfo.Matlab, "% ", "%{ ", " %}", nil, nil, nil},
	fileinfo.OCaml:      {fileinfo.OCaml, "", "(* ", " *)", nil, nil, nil},
	fileinfo.Pascal:     {fileinfo.Pascal, "// ", " ", " }", nil, nil, nil},
	fileinfo.Perl:       {fileinfo.Perl, "# ", "", "", nil, nil, nil},
	fileinfo.Python:     {fileinfo.Python, "# ", "", "", []LanguageFlags{IndentSpace}, nil, nil},
	fileinfo.Php:        {fileinfo.Php, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.R:          {fileinfo.R, "# ", "", "", nil, nil, nil},
	fileinfo.Ruby:       {fileinfo.Ruby, "# ", "", "", nil, nil, nil},
	fileinfo.Rust:       {fileinfo.Rust, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.Scala:      {fileinfo.Scala, "// ", "/* ", " */", nil, nil, nil},
	fileinfo.Html:       {fileinfo.Html, "", "<!-- ", " -->", nil, nil, nil},
	fileinfo.TeX:        {fileinfo.TeX, "% ", "", "", nil, nil, nil},
	fileinfo.Markdown:   {fileinfo.Markdown, "", "<!--- ", " -->", []LanguageFlags{IndentSpace}, nil, nil},
	fileinfo.Yaml:       {fileinfo.Yaml, "#", "", "", []LanguageFlags{IndentSpace}, nil, nil},
}

StandardLanguageProperties is the standard compiled-in set of language properties

Functions

This section is empty.

Types

type FileState

type FileState struct {

	// the source to be parsed -- also holds the full lexed tokens
	Src lexer.File `json:"-" xml:"-"`

	// state for lexing
	LexState lexer.State `json:"_" xml:"-"`

	// state for second pass nesting depth and EOS matching
	TwoState lexer.TwoState `json:"-" xml:"-"`

	// state for parsing
	ParseState parser.State `json:"-" xml:"-"`

	// ast output tree from parsing
	AST *parser.AST `json:"-" xml:"-"`

	// symbols contained within this file -- initialized at start of parsing and created by AddSymbol or PushNewScope actions.  These are then processed after parsing by the language-specific code, via Lang interface.
	Syms syms.SymMap `json:"-" xml:"-"`

	// External symbols that are entirely maintained in a language-specific way by the Lang interface code.  These are only here as a convenience and are not accessed in any way by the language-general parse code.
	ExtSyms syms.SymMap `json:"-" xml:"-"`

	// mutex protecting updates / reading of Syms symbols
	SymsMu sync.RWMutex `display:"-" json:"-" xml:"-"`

	// waitgroup for coordinating processing of other items
	WaitGp sync.WaitGroup `display:"-" json:"-" xml:"-"`

	// anonymous counter -- counts up
	AnonCtr int `display:"-" json:"-" xml:"-"`

	// path mapping cache -- for other files referred to by this file, this stores the full path associated with a logical path (e.g., in go, the logical import path -> local path with actual files) -- protected for access from any thread
	PathMap sync.Map `display:"-" json:"-" xml:"-"`
}

FileState contains the full lexing and parsing state information for a given file. It is the master state record for everything that happens in parse. One of these should be maintained for each file; texteditor.Buf has one as ParseState field.

Separate State structs are maintained for each stage (Lexing, PassTwo, Parsing) and the final output of Parsing goes into the AST and Syms fields.

The Src lexer.File field maintains all the info about the source file, and the basic tokenized version of the source produced initially by lexing and updated by the remaining passes. It has everything that is maintained at a line-by-line level.

func NewFileState

func NewFileState() *FileState

NewFileState returns a new initialized file state

func (*FileState) ClearAST added in v0.2.3

func (fs *FileState) ClearAST()

func (*FileState) Destroy

func (fs *FileState) Destroy()

func (*FileState) FindAnyChildren

func (fs *FileState) FindAnyChildren(sym *syms.Symbol, seed string, scope syms.SymMap, kids *syms.SymMap) bool

FindAnyChildren fills out map with either direct children of given symbol or those of the type of this symbol -- useful for completion. If seed is non-empty it is used as a prefix for filtering children names. Returns false if no children were found.

func (*FileState) FindChildren

func (fs *FileState) FindChildren(sym *syms.Symbol, seed string, scope syms.SymMap, kids *syms.SymMap) bool

FindChildren fills out map with direct children of given symbol. If seed is non-empty it is used as a prefix for filtering children names. Returns false if no children were found.

func (*FileState) FindNamePrefixScoped

func (fs *FileState) FindNamePrefixScoped(seed string, scope syms.SymMap, matches *syms.SymMap)

FindNamePrefixScoped looks for given symbol name prefix within given map first (if non nil) and then in fs.Syms and ExtSyms maps, and any children on those global maps that are of subcategory token.NameScope (i.e., namespace, module, package, library). Matches are added to the given matches map (which can be nil), for more efficient recursive use.

func (*FileState) FindNameScoped

func (fs *FileState) FindNameScoped(nm string, scope syms.SymMap) (*syms.Symbol, bool)

FindNameScoped looks for given symbol name within given map first (if non nil) and then in fs.Syms and ExtSyms maps, and any children on those global maps that are of subcategory token.NameScope (i.e., namespace, module, package, library)

func (*FileState) Init

func (fs *FileState) Init()

Init initializes the file state

func (*FileState) LexAtEnd

func (fs *FileState) LexAtEnd() bool

LexAtEnd returns true if lexing state is now at end of source

func (*FileState) LexErrReport

func (fs *FileState) LexErrReport() string

LexErrReport returns a report of all the lexing errors -- these should only occur during development of lexer so we use a detailed report format

func (*FileState) LexHasErrs

func (fs *FileState) LexHasErrs() bool

LexHasErrs returns true if there were errors from lexing

func (*FileState) LexLine

func (fs *FileState) LexLine(ln int) lexer.Line

LexLine returns the lexing output for given line, combining comments and all other tokens and allocating new memory using clone

func (*FileState) LexLineString

func (fs *FileState) LexLineString() string

LexLineString returns a string rep of the current lexing output for the current line

func (*FileState) LexNextSrcLine

func (fs *FileState) LexNextSrcLine() string

LexNextSrcLine returns the next line of source that the lexer is currently at

func (*FileState) NextAnonName

func (fs *FileState) NextAnonName(ctxt string) string

NextAnonName returns the next anonymous name for this file, using counter here and given context name (e.g., package name)

func (*FileState) ParseAtEnd

func (fs *FileState) ParseAtEnd() bool

ParseAtEnd returns true if parsing state is now at end of source

func (*FileState) ParseErrReport

func (fs *FileState) ParseErrReport() string

ParseErrReport returns at most 10 parsing errors in end-user format, sorted

func (*FileState) ParseErrReportAll

func (fs *FileState) ParseErrReportAll() string

ParseErrReportAll returns all parsing errors in end-user format, sorted

func (*FileState) ParseErrReportDetailed

func (fs *FileState) ParseErrReportDetailed() string

ParseErrReportDetailed returns at most 10 parsing errors in detailed format, sorted

func (*FileState) ParseHasErrs

func (fs *FileState) ParseHasErrs() bool

ParseHasErrs returns true if there were errors from parsing

func (*FileState) ParseNextSrcLine

func (fs *FileState) ParseNextSrcLine() string

ParseNextSrcLine returns the next line of source that the parser is currently at

func (*FileState) ParseRuleString

func (fs *FileState) ParseRuleString(full bool) string

ParseRuleString returns the rule info for entire source -- if full then it includes the full stack at each point -- otherwise just the top of stack

func (*FileState) PassTwoErrReport

func (fs *FileState) PassTwoErrReport() string

PassTwoErrReport returns all the pass two errors as a string -- these should only occur during development so we use a detailed report format

func (*FileState) PassTwoHasErrs

func (fs *FileState) PassTwoHasErrs() bool

PassTwoHasErrs returns true if there were errors from pass two processing

func (*FileState) PathMapLoad

func (fs *FileState) PathMapLoad(path string) (string, bool)

PathMapLoad does a mutex-protected load of PathMap for given string, returning value and true if found

func (*FileState) PathMapStore

func (fs *FileState) PathMapStore(path, abs string)

PathMapStore does a mutex-protected store of abs path for given path key
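Since PathMap is declared as a sync.Map, a load / store pair along these lines would be safe to call from any goroutine. The sketch below is a stand-alone illustration using only the standard library; PathCache and its methods are hypothetical names and do not reproduce the FileState code.

```go
package main

import (
	"fmt"
	"sync"
)

// PathCache is a hypothetical wrapper around sync.Map that maps a logical
// path (e.g., an import path) to an absolute path on disk.
type PathCache struct {
	m sync.Map
}

// Store records abs as the full path for the logical path key.
func (pc *PathCache) Store(path, abs string) {
	pc.m.Store(path, abs)
}

// Load returns the stored full path for the given key, and true if found.
func (pc *PathCache) Load(path string) (string, bool) {
	v, ok := pc.m.Load(path)
	if !ok {
		return "", false
	}
	abs, ok := v.(string) // values are only ever stored as strings here
	return abs, ok
}

func main() {
	var pc PathCache
	pc.Store("example.com/mod/pkg", "/home/user/go/src/example.com/mod/pkg")
	abs, ok := pc.Load("example.com/mod/pkg")
	fmt.Println(abs, ok)
	_, ok = pc.Load("missing")
	fmt.Println(ok)
}
```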

func (*FileState) SetSrc

func (fs *FileState) SetSrc(src [][]rune, fname, basepath string, sup fileinfo.Known)

SetSrc sets source to be parsed, and filename it came from, and also the base path for project for reporting filenames relative to (if empty, path to filename is used)
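SetSrc takes the source as lines of runes ([][]rune). As an illustration of that shape, the following Go sketch converts raw text into it using only the standard library. RuneLines is a hypothetical helper written for this example (the package may well provide its own conversion), and the SetSrc call is shown only as a comment, using the signature listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// RuneLines is a hypothetical helper that splits text into lines and
// converts each line to a slice of runes, dropping any trailing carriage
// return left over from Windows-style line endings.
func RuneLines(txt string) [][]rune {
	lines := strings.Split(txt, "\n")
	out := make([][]rune, len(lines))
	for i, ln := range lines {
		out[i] = []rune(strings.TrimSuffix(ln, "\r"))
	}
	return out
}

func main() {
	src := RuneLines("package main\n\nvar π = 3.14\n")
	// Number of lines (including the empty one after the final newline),
	// and the length in runes, not bytes, of the third line.
	fmt.Println(len(src), len(src[2]))

	// The result has the shape expected by SetSrc, e.g. (not run here):
	//   fs.SetSrc(src, "main.go", "", fileinfo.Go)
}
```

Working in runes rather than bytes means that character positions within a line count each character once, even for multi-byte characters such as π.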

type FileStates

type FileStates struct {

	// the filename
	Filename string

	// the known file type, if known (typically only known files are processed)
	Known fileinfo.Known

	// base path for reporting file names -- this must be set externally e.g., by gide for the project root path
	BasePath string

	// index of the state that is done
	DoneIndex int

	// one filestate
	FsA FileState

	// one filestate
	FsB FileState

	// mutex locking the switching of Done vs. Proc states
	SwitchMu sync.Mutex

	// mutex locking the parsing of Proc state -- reading states can happen fine with this locked, but no switching
	ProcMu sync.Mutex

	// extra meta data associated with this FileStates
	Meta map[string]string
}

FileStates contains two FileState's: one is being processed while the other is being used externally. The FileStates maintains a common set of file information set in each of the FileState items when they are used.
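The two-state arrangement is a form of double buffering. The Go sketch below shows the general pattern with only the standard library; State and DoubleState are hypothetical stand-ins, and the locking is a simplified reading of the field comments above rather than a copy of the FileStates code.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a hypothetical stand-in for the per-file parse results.
type State struct {
	Src string
}

// DoubleState holds two States: one is "done" and can be read externally,
// while the other is being processed.
type DoubleState struct {
	doneIndex int        // index of the state that is done
	a, b      State      // the two states
	switchMu  sync.Mutex // guards doneIndex
	procMu    sync.Mutex // held for the duration of processing
}

// Done returns the state that is ready for external use.
func (ds *DoubleState) Done() *State {
	ds.switchMu.Lock()
	defer ds.switchMu.Unlock()
	if ds.doneIndex == 0 {
		return &ds.a
	}
	return &ds.b
}

// proc returns the state that is currently being processed.
func (ds *DoubleState) proc() *State {
	ds.switchMu.Lock()
	defer ds.switchMu.Unlock()
	if ds.doneIndex == 0 {
		return &ds.b
	}
	return &ds.a
}

// StartProc locks processing, sets the new source, and returns the state
// to work on. Readers of Done are unaffected until EndProc is called.
func (ds *DoubleState) StartProc(src string) *State {
	ds.procMu.Lock()
	st := ds.proc()
	st.Src = src
	return st
}

// EndProc swaps the roles of the two states and releases the processing lock.
func (ds *DoubleState) EndProc() {
	ds.switchMu.Lock()
	ds.doneIndex = 1 - ds.doneIndex
	ds.switchMu.Unlock()
	ds.procMu.Unlock()
}

func main() {
	ds := &DoubleState{}
	ds.StartProc("version 1")
	fmt.Printf("%q\n", ds.Done().Src) // still the old (empty) state
	ds.EndProc()
	fmt.Printf("%q\n", ds.Done().Src) // now the newly processed state
}
```

The design choice is that readers never wait for a parse to finish: they keep using the previous results until the new ones are complete and the switch is made.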

func NewFileStates

func NewFileStates(fname, basepath string, sup fileinfo.Known) *FileStates

NewFileStates returns a new FileStates for given filename, basepath, and known file type.

func (*FileStates) DeleteMetaData

func (fs *FileStates) DeleteMetaData(key string)

DeleteMetaData deletes given meta data record

func (*FileStates) Done

func (fs *FileStates) Done() *FileState

Done returns the filestate that is done being updated, and is ready for use by external clients etc. Proc is the other one which is currently being processed by the parser and is not ready to be used externally. The state is accessed under a lock, and as long as any use of state is fast enough, it should be usable over next two switches (typically true).

func (*FileStates) DoneNoLock

func (fs *FileStates) DoneNoLock() *FileState

DoneNoLock returns the filestate that is done being updated, and is ready for use by external clients etc. Proc is the other one which is currently being processed by the parser and is not ready to be used externally. Unlike Done, this variant does not take the lock itself (hence NoLock), so the caller is responsible for ensuring that no switch happens concurrently.

func (*FileStates) EndProc

func (fs *FileStates) EndProc()

EndProc is called when primary processing (parsing) has been completed -- there still may be ongoing updating of symbols after this point but parse is done. This calls Switch to move Proc over to done, under cover of ProcMu Lock

func (*FileStates) MetaData

func (fs *FileStates) MetaData(key string) (string, bool)

MetaData returns given meta data string for given key, returns true if present, false if not

func (*FileStates) Proc

func (fs *FileStates) Proc() *FileState

Proc returns the filestate that is currently being processed by the parser etc and is not ready for external use. Access is protected by a lock so it will wait if currently switching. The state is accessed under a lock, and as long as any use of state is fast enough, it should be usable over next two switches (typically true).

func (*FileStates) ProcNoLock

func (fs *FileStates) ProcNoLock() *FileState

ProcNoLock returns the filestate that is currently being processed by the parser etc and is not ready for external use. Unlike Proc, this variant does not take the lock itself (hence NoLock), so the caller is responsible for ensuring that no switch happens concurrently.

func (*FileStates) SetMetaData

func (fs *FileStates) SetMetaData(key, value string)

SetMetaData sets given meta data record

func (*FileStates) SetSrc

func (fs *FileStates) SetSrc(fname, basepath string, sup fileinfo.Known)

SetSrc sets the source that is processed by this FileStates. If basepath is empty then it is set to the path for the filename.

func (*FileStates) StartProc

func (fs *FileStates) StartProc(txt []byte) *FileState

StartProc should be called when starting to process the file, and returns the FileState to use for processing. It locks the Proc state, sets the current source code, and returns the filestate for subsequent processing.

func (*FileStates) Switch

func (fs *FileStates) Switch()

Switch switches so that the current Proc() filestate is now the Done(). It is assumed to be called under cover of the ProcMu lock, and also does the Switch locking.

type Language added in v0.2.3

type Language interface {
	// Parser returns the [Parser] for this language
	Parser() *Parser

	// ParseFile does the complete processing of a given single file, given by txt bytes,
	// as appropriate for the language -- e.g., runs the lexer followed by the parser, and
	// manages any symbol output from parsing as appropriate for the language / format.
	// This is to be used for files of "primary interest" -- it does full type inference
	// and symbol resolution etc.  The Proc() FileState is locked during parsing,
	// and Switch is called after, so Done() will contain the processed info after this call.
	// If txt is nil then any existing source in fs is used.
	ParseFile(fs *FileStates, txt []byte)

	// HighlightLine does the lexing and potentially parsing of a given line of the file,
	// for purposes of syntax highlighting -- uses Done() FileState of existing context
	// if available from prior lexing / parsing. Line is in 0-indexed "internal" line indexes,
	// and provides relevant context for the overall parsing, which is performed
	// on the given line of text runes, and also updates corresponding source in FileState
	// (via a copy).  If txt is nil then any existing source in fs is used.
	HighlightLine(fs *FileStates, line int, txt []rune) lexer.Line

	// CompleteLine provides the list of relevant completions for given text
	// which is at given position within the file.
	// Typically the language will call ParseLine on that line, and use the AST
	// to guide the selection of relevant symbols that can complete the code at
	// the given point.
	CompleteLine(fs *FileStates, text string, pos lexer.Pos) complete.Matches

	// CompleteEdit returns the completion edit data for integrating the
	// selected completion into the source
	CompleteEdit(fs *FileStates, text string, cp int, comp complete.Completion, seed string) (ed complete.Edit)

	// Lookup returns lookup results for given text which is at given position
	// within the file.  This can either be a file and position in file to
	// open and view, or direct text to show.
	Lookup(fs *FileStates, text string, pos lexer.Pos) complete.Lookup

	// IndentLine returns the indentation level for given line based on
	// previous line's indentation level, and any delta change based on
	// e.g., brackets starting or ending the previous or current line, or
	// other language-specific keywords.  See lexer.BracketIndentLine for example.
	// Indent level is in increments of tabSz for spaces, and tabs for tabs.
	// Operates on rune source with markup lex tags per line.
	IndentLine(fs *FileStates, src [][]rune, tags []lexer.Line, ln int, tabSz int) (pInd, delInd, pLn int, ichr indent.Character)

	// AutoBracket returns what to do when a user types a starting bracket character
	// (bracket, brace, paren) while typing.
	// pos = position where bra will be inserted, and curLn is the current line
	// match = insert the matching ket, and newLine = insert a new line.
	AutoBracket(fs *FileStates, bra rune, pos lexer.Pos, curLn []rune) (match, newLine bool)

	// ParseDir does the complete processing of a given directory, optionally including
	// subdirectories, and optionally forcing the re-processing of the directory(s),
	// instead of using cached symbols.  Typically the cache will be used unless files
	// have a more recent modification date than the cache file.  This returns the
	// language-appropriate set of symbols for the directory(s), which could then provide
	// the symbols for a given package, library, or module at that path.
	ParseDir(fs *FileState, path string, opts LanguageDirOptions) *syms.Symbol

	// LexLine is a lower-level call (mostly used internally to the language) that
	// does just the lexing of a given line of the file, using existing context
	// if available from prior lexing / parsing.
	// Line is in 0-indexed "internal" line indexes.
	// The rune source is updated from the given text if non-nil.
	LexLine(fs *FileState, line int, txt []rune) lexer.Line

	// ParseLine is a lower-level call (mostly used internally to the language) that
	// does complete parser processing of a single line from given file, and returns
	// the FileState for just that line.  Line is in 0-indexed "internal" line indexes.
	// The rune source information is assumed to have already been updated in FileState.
	// Existing context information from full-file parsing is used as appropriate, but
	// the results will NOT be used to update any existing full-file AST representation --
	// should call ParseFile to update that as appropriate.
	ParseLine(fs *FileState, line int) *FileState
}

Language provides a general interface for language-specific management of the lexing, parsing, and symbol lookup process. The parse lexer and parser machinery is entirely language-general but specific languages may need specific ways of managing these processes, and processing their outputs, to best support the features of those languages. That is what this interface provides.

Each language defines a type supporting this interface, which is in turn registered with the StdLangProperties map. Each supported language has its own .go file in this parse package that defines its own implementation of the interface and any other associated functionality.

The Language is responsible for accessing the appropriate Parser for this language (initialized and managed via LanguageSupporter.OpenStandard() etc), and the FileState structure contains all the input and output state information for a given file.

This interface is likely to evolve as we expand the range of supported languages.
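As a rough illustration of the "one type per language, registered in a table" arrangement described above, here is a minimal sketch. Everything in it is hypothetical: `miniLanguage` keeps only a single highlighting-style method with a simplified signature, `toyLang` is an invented language, and `registry` merely plays the role of the properties table.

```go
package main

import (
	"fmt"
	"strings"
)

// miniLanguage is a hypothetical, heavily trimmed stand-in for the Language
// interface: only a highlighting-style method is kept.
type miniLanguage interface {
	// HighlightLine returns one tag per whitespace-separated word.
	HighlightLine(line string) []string
}

// toyLang tags words found in its keyword set as "Keyword" and all other
// words as "Name".
type toyLang struct {
	keywords map[string]bool
}

func (tl *toyLang) HighlightLine(line string) []string {
	var tags []string
	for _, w := range strings.Fields(line) {
		if tl.keywords[w] {
			tags = append(tags, "Keyword")
		} else {
			tags = append(tags, "Name")
		}
	}
	return tags
}

// registry maps a language name to its implementation.
var registry = map[string]miniLanguage{}

// Each language registers itself at init time, so that importing the file
// (or package) that defines it is enough to make it available.
func init() {
	registry["toy"] = &toyLang{keywords: map[string]bool{"func": true, "return": true}}
}

func main() {
	lang := registry["toy"]
	fmt.Println(lang.HighlightLine("func add return")) // [Keyword Name Keyword]
}
```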

type LanguageDirOptions added in v0.2.3

type LanguageDirOptions struct {

	// process subdirectories -- otherwise not
	Subdirs bool

	// rebuild the symbols by reprocessing from scratch instead of using cache
	Rebuild bool

	// do not update the cache with results from processing
	Nocache bool
}

LanguageDirOptions provides options for the [Language.ParseDir] method
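The ParseDir comment says the cache is typically used unless files have a more recent modification date than the cache file, and Rebuild forces reprocessing. A minimal sketch of that decision follows; `dirOptions`, `useCache`, and the `at` helper are hypothetical names, and the real logic may consider more than modification times.

```go
package main

import (
	"fmt"
	"time"
)

// dirOptions mirrors only the Rebuild field of LanguageDirOptions.
type dirOptions struct {
	Rebuild bool
}

// at is a small helper that builds a time from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// useCache reports whether cached symbols can be reused: not when a rebuild
// is forced, not when there is no cache (zero cacheMod), and not when any
// source file is newer than the cache.
func useCache(opts dirOptions, cacheMod time.Time, fileMods ...time.Time) bool {
	if opts.Rebuild || cacheMod.IsZero() {
		return false
	}
	for _, fm := range fileMods {
		if fm.After(cacheMod) {
			return false
		}
	}
	return true
}

func main() {
	cache := at(200)
	fmt.Println(useCache(dirOptions{}, cache, at(100), at(150)))     // true
	fmt.Println(useCache(dirOptions{}, cache, at(100), at(250)))     // false
	fmt.Println(useCache(dirOptions{Rebuild: true}, cache, at(100))) // false
}
```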

type LanguageFlags added in v0.2.3

type LanguageFlags int32 //enums:enum

LanguageFlags are special properties of a given language

const (
	// NoFlags = nothing special
	NoFlags LanguageFlags = iota

	// IndentSpace means that spaces must be used for this language
	IndentSpace

	// IndentTab means that tabs must be used for this language
	IndentTab

	// ReAutoIndent causes current line to be re-indented during AutoIndent for Enter
	// (newline) -- this should only be set for strongly indented languages where
	// the previous + current line can tell you exactly what indent the current line
	// should be at.
	ReAutoIndent
)

The defined LanguageFlags values.

const LanguageFlagsN LanguageFlags = 4

LanguageFlagsN is the highest valid value for type LanguageFlags, plus one.

func LanguageFlagsValues added in v0.2.3

func LanguageFlagsValues() []LanguageFlags

LanguageFlagsValues returns all possible values for the type LanguageFlags.

func (LanguageFlags) Desc added in v0.2.3

func (i LanguageFlags) Desc() string

Desc returns the description of the LanguageFlags value.

func (LanguageFlags) Int64 added in v0.2.3

func (i LanguageFlags) Int64() int64

Int64 returns the LanguageFlags value as an int64.

func (LanguageFlags) MarshalText added in v0.2.3

func (i LanguageFlags) MarshalText() ([]byte, error)

MarshalText implements the encoding.TextMarshaler interface.

func (*LanguageFlags) SetInt64 added in v0.2.3

func (i *LanguageFlags) SetInt64(in int64)

SetInt64 sets the LanguageFlags value from an int64.

func (*LanguageFlags) SetString added in v0.2.3

func (i *LanguageFlags) SetString(s string) error

SetString sets the LanguageFlags value from its string representation, and returns an error if the string is invalid.

func (LanguageFlags) String added in v0.2.3

func (i LanguageFlags) String() string

String returns the string representation of this LanguageFlags value.

func (*LanguageFlags) UnmarshalText added in v0.2.3

func (i *LanguageFlags) UnmarshalText(text []byte) error

UnmarshalText implements the encoding.TextUnmarshaler interface.

func (LanguageFlags) Values added in v0.2.3

func (i LanguageFlags) Values() []enums.Enum

Values returns all possible values for the type LanguageFlags.
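The String, SetString, MarshalText, and UnmarshalText methods listed above follow the usual pattern for enums that serialize by name. The sketch below reimplements that pattern by hand on a local `flag` type to show how it interacts with encoding/json. The type, its constants, and the name table are hypothetical; in particular, the exact strings produced by the generated methods are an assumption here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// flag is a hypothetical local copy of an int32-based enum.
type flag int32

const (
	noFlags flag = iota
	indentSpace
	indentTab
	reAutoIndent
	flagN // one past the highest valid value
)

// flagNames are assumed names; the generated code may differ.
var flagNames = [...]string{"NoFlags", "IndentSpace", "IndentTab", "ReAutoIndent"}

// String returns the name of the value, or a numeric fallback if out of range.
func (f flag) String() string {
	if f < 0 || f >= flagN {
		return fmt.Sprintf("flag(%d)", int32(f))
	}
	return flagNames[f]
}

// SetString sets the value from its name and errors on an unknown name.
func (f *flag) SetString(s string) error {
	for i, n := range flagNames {
		if n == s {
			*f = flag(i)
			return nil
		}
	}
	return errors.New("invalid flag name: " + s)
}

// MarshalText and UnmarshalText make the enum serialize by name.
func (f flag) MarshalText() ([]byte, error)  { return []byte(f.String()), nil }
func (f *flag) UnmarshalText(b []byte) error { return f.SetString(string(b)) }

func main() {
	b, err := json.Marshal([]flag{indentTab, reAutoIndent})
	fmt.Println(string(b), err) // ["IndentTab","ReAutoIndent"] <nil>

	var back []flag
	err = json.Unmarshal(b, &back)
	fmt.Println(back, err) // [IndentTab ReAutoIndent] <nil>
}
```

Serializing by name rather than by number keeps saved files readable and stable if the constants are ever reordered.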

type LanguageProperties added in v0.2.3

type LanguageProperties struct {

	// known language -- must be a supported one from Known list
	Known fileinfo.Known

	// character(s) that start a single-line comment -- if empty then multi-line comment syntax will be used
	CommentLn string

	// character(s) that start a multi-line comment or one that requires both start and end
	CommentSt string

	// character(s) that end a multi-line comment or one that requires both start and end
	CommentEd string

	// special properties for this language -- as an explicit list of options to make them easier to see and set in defaults
	Flags []LanguageFlags

	// Lang interface for this language
	Lang Language `json:"-" xml:"-"`

	// parser for this language -- initialized in OpenStandard
	Parser *Parser `json:"-" xml:"-"`
}

LanguageProperties contains properties of languages supported by the parser framework

func (*LanguageProperties) HasFlag added in v0.2.3

func (lp *LanguageProperties) HasFlag(flg LanguageFlags) bool

HasFlag returns true if given flag is set in Flags
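Because Flags is declared as an explicit list rather than a bitmask, checking a flag amounts to a membership test. A minimal sketch under that reading, with hypothetical local types `langFlag` and `langProps` standing in for LanguageFlags and LanguageProperties:

```go
package main

import "fmt"

// langFlag and its constants are hypothetical local copies of the enum.
type langFlag int32

const (
	noFlags langFlag = iota
	indentSpace
	indentTab
	reAutoIndent
)

// langProps keeps only the Flags field of the properties struct.
type langProps struct {
	Flags []langFlag
}

// hasFlag reports whether flg appears in the Flags list.
func (lp *langProps) hasFlag(flg langFlag) bool {
	for _, f := range lp.Flags {
		if f == flg {
			return true
		}
	}
	return false
}

func main() {
	lp := &langProps{Flags: []langFlag{indentTab, reAutoIndent}}
	fmt.Println(lp.hasFlag(indentTab))   // true
	fmt.Println(lp.hasFlag(indentSpace)) // false
}
```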

type LanguageSupporter added in v0.2.3

type LanguageSupporter struct{}

LanguageSupporter provides general support for supported languages, e.g., looking up lexers and parsers by name. It also implements the lexer.LangLexer interface to provide access to other Guest Lexers.

func (*LanguageSupporter) LexerByName added in v0.2.3

func (ll *LanguageSupporter) LexerByName(lang string) *lexer.Rule

LexerByName looks up Lexer for given language by name (with case-insensitive fallback). Returns nil if not supported.

func (*LanguageSupporter) OpenStandard added in v0.2.3

func (ll *LanguageSupporter) OpenStandard() error

OpenStandard opens all the standard parsers for languages, from the langs/ directory

func (*LanguageSupporter) Properties added in v0.2.3

func (ll *LanguageSupporter) Properties(sup fileinfo.Known) (*LanguageProperties, error)

Properties looks up language properties by fileinfo.Known const int type

func (*LanguageSupporter) PropertiesByName added in v0.2.3

func (ll *LanguageSupporter) PropertiesByName(lang string) (*LanguageProperties, error)

PropertiesByName looks up language properties by string name of language (with case-insensitive fallback). Returns error if not supported.
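A lookup "with case-insensitive fallback" can be sketched as an exact map lookup followed by a case-folded scan. The table contents, the `props` type, and the `propertiesByName` function below are hypothetical; the package's own lookup may be organized differently.

```go
package main

import (
	"fmt"
	"strings"
)

// props is a hypothetical stand-in that keeps a single property.
type props struct {
	CommentLn string
}

// byName is an illustrative table keyed by canonical language name.
var byName = map[string]*props{
	"Go":  {CommentLn: "//"},
	"TeX": {CommentLn: "%"},
}

// propertiesByName tries an exact match first, then falls back to a
// case-insensitive comparison, and returns an error if nothing matches.
func propertiesByName(name string) (*props, error) {
	if p, ok := byName[name]; ok {
		return p, nil
	}
	for k, p := range byName {
		if strings.EqualFold(k, name) {
			return p, nil
		}
	}
	return nil, fmt.Errorf("language %q is not supported", name)
}

func main() {
	p, err := propertiesByName("tex")
	fmt.Println(p.CommentLn, err) // % <nil>

	_, err = propertiesByName("cobol")
	fmt.Println(err) // language "cobol" is not supported
}
```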

type Parser

type Parser struct {

	// lexer rules for first pass of lexing file
	Lexer *lexer.Rule

	// second pass after lexing -- computes nesting depth and EOS finding
	PassTwo lexer.PassTwo

	// parser rules for parsing lexed tokens
	Parser *parser.Rule

	// file name for overall parser (not file being parsed!)
	Filename string

	// if true, reports errors after parsing, to stdout
	ReportErrs bool

	// when loaded from file, this is the modification time of the parser -- re-processes cache if parser is newer than cached files
	ModTime time.Time `json:"-" xml:"-"`
}

Parser is the overall parser for managing the parsing

func NewParser

func NewParser() *Parser

NewParser returns a new initialized parser

func (*Parser) DoPassTwo

func (pr *Parser) DoPassTwo(fs *FileState)

DoPassTwo does the second pass after lexing
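The PassTwo field comment says the second pass computes nesting depth. As a toy illustration of what such a pass involves, the sketch below assigns a bracket nesting depth to every rune of a string using a stack. It works on raw runes rather than lexed tokens, does not attempt EOS finding, and the `depths` function is hypothetical rather than part of the package.

```go
package main

import "fmt"

// depths returns the bracket nesting depth at each rune of src, plus a flag
// that is false if any bracket is unmatched. An opening bracket and its
// closing partner are both reported at the depth outside the pair.
func depths(src string) ([]int, bool) {
	open := map[rune]rune{')': '(', ']': '[', '}': '{'}
	var stack []rune
	out := make([]int, 0, len(src))
	ok := true
	for _, r := range src {
		switch r {
		case '(', '[', '{':
			out = append(out, len(stack))
			stack = append(stack, r)
		case ')', ']', '}':
			if n := len(stack); n > 0 && stack[n-1] == open[r] {
				stack = stack[:n-1]
			} else {
				ok = false // closing bracket with no matching opener
			}
			out = append(out, len(stack))
		default:
			out = append(out, len(stack))
		}
	}
	return out, ok && len(stack) == 0
}

func main() {
	d, ok := depths("a(b[c])")
	fmt.Println(d, ok) // [0 0 1 1 2 1 0] true

	_, ok = depths(")(")
	fmt.Println(ok) // false
}
```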

func (*Parser) Init

func (pr *Parser) Init()

Init initializes the parser -- must be called after creation

func (*Parser) InitAll

func (pr *Parser) InitAll()

InitAll initializes everything about the parser -- call this when setting up a new parser after it has been loaded etc

func (*Parser) LexAll

func (pr *Parser) LexAll(fs *FileState)

LexAll runs a complete pass of the lexer and pass two, on current state

func (*Parser) LexInit

func (pr *Parser) LexInit(fs *FileState)

LexInit gets the lexer ready to start lexing

func (*Parser) LexLine

func (pr *Parser) LexLine(fs *FileState, ln int, txt []rune) lexer.Line

LexLine runs lexer for given single line of source, which is updated from the given text (if non-nil). Returns merged regular and token comment lines, cloned and ready for use.

func (*Parser) LexNext

func (pr *Parser) LexNext(fs *FileState) *lexer.Rule

LexNext does next step of lexing -- returns lowest-level rule that matched, and nil when nomatch err or at end of source input

func (*Parser) LexNextLine

func (pr *Parser) LexNextLine(fs *FileState) *lexer.Rule

LexNextLine does next line of lexing -- returns lowest-level rule that matched at end, and nil when nomatch err or at end of source input

func (*Parser) LexRun

func (pr *Parser) LexRun(fs *FileState)

LexRun keeps running LexNext until it stops

func (*Parser) OpenJSON

func (pr *Parser) OpenJSON(filename string) error

OpenJSON opens lexer and parser rules from the given filename, in a standard JSON-formatted file

func (*Parser) ParseAll

func (pr *Parser) ParseAll(fs *FileState)

ParseAll does full parsing, including ParserInit and ParseRun, assuming LexAll has been done already

func (*Parser) ParseLine

func (pr *Parser) ParseLine(fs *FileState, ln int) *FileState

ParseLine runs parser for given single line of source. It does the parsing in a separate FileState and returns that with AST etc (or nil if nothing). Assumes LexLine has already been run on given line.

func (*Parser) ParseNext

func (pr *Parser) ParseNext(fs *FileState) *parser.Rule

ParseNext does next step of parsing -- returns lowest-level rule that matched or nil if no match error or at end

func (*Parser) ParseRun

func (pr *Parser) ParseRun(fs *FileState)

ParseRun continues running the parser until the end of the file

func (*Parser) ParseString

func (pr *Parser) ParseString(str string, fname string, sup fileinfo.Known) *FileState

ParseString runs lexer and parser on given string of text, returning FileState of results (can be nil if string is empty or no lexical tokens). Also takes supporting contextual info for file / language that this string is associated with (only for reference)

func (*Parser) ParserInit

func (pr *Parser) ParserInit(fs *FileState) bool

ParserInit initializes the parser prior to running

func (*Parser) ReadJSON

func (pr *Parser) ReadJSON(b []byte) error

ReadJSON reads lexer and parser rules from the given bytes, in standard JSON format

func (*Parser) SaveGrammar

func (pr *Parser) SaveGrammar(filename string) error

SaveGrammar saves lexer and parser grammar rules to BNF-like .parsegrammar file

func (*Parser) SaveJSON

func (pr *Parser) SaveJSON(filename string) error

SaveJSON saves lexer and parser rules, in a standard JSON-formatted file
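The Parser struct tags ModTime with `json:"-"`, which means it is left out of the saved JSON and comes back as the zero value on load. The sketch below shows that effect with encoding/json on a hypothetical `rules` struct that copies only three of the Parser fields. The `roundTrip` helper and the `toy.parse` filename are illustrative, and this does not reproduce what SaveJSON and OpenJSON actually do with files.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// rules is a hypothetical stand-in that copies three Parser fields,
// including the json:"-" tag on ModTime.
type rules struct {
	Filename   string
	ReportErrs bool
	ModTime    time.Time `json:"-"`
}

// roundTrip encodes in to JSON and decodes it again, returning both the
// JSON text and the decoded value.
func roundTrip(in rules) (string, rules, error) {
	b, err := json.Marshal(in)
	if err != nil {
		return "", rules{}, err
	}
	var out rules
	err = json.Unmarshal(b, &out)
	return string(b), out, err
}

func main() {
	in := rules{Filename: "toy.parse", ReportErrs: true, ModTime: time.Now()}
	js, out, err := roundTrip(in)
	fmt.Println(js, err)              // {"Filename":"toy.parse","ReportErrs":true} <nil>
	fmt.Println(out.ModTime.IsZero()) // true: ModTime is not stored in the JSON
}
```

A consequence is that any modification time has to be re-derived after loading, for example from the file the rules were read from.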

Directories

Path Synopsis
cmd
update
Command update updates all of the .parse files within or beneath the current directory by opening and saving them.
tex
Package lexer provides all the lexing functions that transform text into lexical tokens, using token types defined in the token package.
Package lsp contains types for the Language Server Protocol LSP: https://microsoft.github.io/language-server-protocol/specification and mappings from these elements into the token.Tokens types which are used internally in parse.
Package parse does the parsing stage after lexing
Package supportedlanguages includes all the supported languages for parse -- need to import this package to get those all included in a given target
Package syms defines the symbols and their properties that are accumulated from a parsed file, and are then used for e.g., completion lookup, etc.
Package token defines a complete set of all lexical tokens for any kind of language! It is based on the alecthomas/chroma / pygments lexical tokens plus all the more detailed tokens needed for actually parsing languages
