lexer

package module

v0.0.0-...-e884d4b Published: Nov 22, 2018 License: BSD-3-Clause Imports: 15 Imported by: 2

Repository

github.com/cznic/lexer

README ¶

github.com/cznic/lexer has moved to modernc.org/lexer (vcs).

Please update your import paths to modernc.org/lexer.

This repo is now archived.

Documentation ¶

Overview ¶

Package lexer generates actionless scanners (lexeme recognizers) at run time.

Scanners are defined by regular expressions and/or lexical grammars, a mapping between those definitions and numeric token identifiers, and an optional set of starting id sets, which provide functionality similar to switching start states in *nix LEX. The generated FSMs are Unicode rune based, and all unicode.Categories and unicode.Scripts are supported by the regexp syntax using the \p{name} construct.
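The \p{name} lookup can be pictured with the two standard library tables named above. This is a minimal sketch, not this package's code: classTable and inClass are hypothetical helpers, and the order of lookup (categories first, then scripts) is an assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

// classTable is a hypothetical helper, not part of package lexer. It resolves
// a \p{name} class name against the standard unicode tables.
func classTable(name string) (*unicode.RangeTable, bool) {
	if t, ok := unicode.Categories[name]; ok {
		return t, true
	}
	t, ok := unicode.Scripts[name]
	return t, ok
}

// inClass reports whether r belongs to the named class. Setting negate
// mimics \P{name}. Unknown names never match.
func inClass(name string, r rune, negate bool) bool {
	t, ok := classTable(name)
	if !ok {
		return false
	}
	return unicode.Is(t, r) != negate
}

func main() {
	fmt.Println(inClass("Greek", 'α', false)) // true
	fmt.Println(inClass("Lu", 'a', false))    // false
	fmt.Println(inClass("Lu", 'a', true))     // true
}
```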

Syntax supported by ParseRE (currently a very basic subset of RE2; the docs below are adapted from http://code.google.com/p/re2/wiki/Syntax, original docs license unclear):

Single characters:

.            any character, excluding newline
[xyz]        character class
[^xyz]       negated character class
\p{Greek}    Unicode character class
\P{Greek}    negated Unicode character class

Composites:

xy           x followed by y
x|y          x or y

Repetitions:

x*           zero or more x
x+           one or more x
x?           zero or one x

Grouping:

(re)         group

Empty strings:

^            at beginning of text or line
$            at end of text or line
\A           at beginning of text
\z           at end of text

Escape sequences:

\a           bell (≡ \007)
\b           backspace (≡ \010)
\f           form feed (≡ \014)
\n           newline (≡ \012)
\r           carriage return (≡ \015)
\t           horizontal tab (≡ \011)
\v           vertical tab character (≡ \013)
\M           M is one of metachars \.+*?()|[]^$
\xhh         rune \u00hh, h is a hex digit

Character class elements:

x            single Unicode character
A-Z          Unicode character range (inclusive)

Unicode character class names--general category:

Cc           control
Cf           format
Co           private use
Cs           surrogate
letter       Lu, Ll, Lt, Lm, or Lo
Ll           lowercase letter
Lm           modifier letter
Lo           other letter
Lt           titlecase letter
Lu           uppercase letter
Mc           spacing mark
Me           enclosing mark
Mn           non-spacing mark
Nd           decimal number
Nl           letter number
No           other number
Pc           connector punctuation
Pd           dash punctuation
Pe           close punctuation
Pf           final punctuation
Pi           initial punctuation
Po           other punctuation
Ps           open punctuation
Sc           currency symbol
Sk           modifier symbol
Sm           math symbol
So           other symbol
Zl           line separator
Zp           paragraph separator
Zs           space separator

Unicode character class names--scripts:

Arabic                 Arabic
Armenian               Armenian
Avestan                Avestan
Balinese               Balinese
Bamum                  Bamum
Bengali                Bengali
Bopomofo               Bopomofo
Braille                Braille
Buginese               Buginese
Buhid                  Buhid
Canadian_Aboriginal    Canadian Aboriginal
Carian                 Carian
Common                 Common
Coptic                 Coptic
Cuneiform              Cuneiform
Cypriot                Cypriot
Cyrillic               Cyrillic
Deseret                Deseret
Devanagari             Devanagari
Egyptian_Hieroglyphs   Egyptian Hieroglyphs
Ethiopic               Ethiopic
Georgian               Georgian
Glagolitic             Glagolitic
Gothic                 Gothic
Greek                  Greek
Gujarati               Gujarati
Gurmukhi               Gurmukhi
Hangul                 Hangul
Han                    Han
Hanunoo                Hanunoo
Hebrew                 Hebrew
Hiragana               Hiragana
Cham                   Cham
Cherokee               Cherokee
Imperial_Aramaic       Imperial Aramaic
Inherited              Inherited
Inscriptional_Pahlavi  Inscriptional Pahlavi
Inscriptional_Parthian Inscriptional Parthian
Javanese               Javanese
Kaithi                 Kaithi
Kannada                Kannada
Katakana               Katakana
Kayah_Li               Kayah Li
Kharoshthi             Kharoshthi
Khmer                  Khmer
Lao                    Lao
Latin                  Latin
Lepcha                 Lepcha
Limbu                  Limbu
Linear_B               Linear B
Lisu                   Lisu
Lycian                 Lycian
Lydian                 Lydian
Malayalam              Malayalam
Meetei_Mayek           Meetei Mayek
Mongolian              Mongolian
Myanmar                Myanmar
New_Tai_Lue            New Tai Lue
Nko                    Nko
Ogham                  Ogham
Old_Italic             Old Italic
Old_Persian            Old Persian
Old_South_Arabian      Old South Arabian
Old_Turkic             Old Turkic
Ol_Chiki               Ol Chiki
Oriya                  Oriya
Osmanya                Osmanya
Phags_Pa               Phags Pa
Phoenician             Phoenician
Rejang                 Rejang
Runic                  Runic
Samaritan              Samaritan
Saurashtra             Saurashtra
Shavian                Shavian
Sinhala                Sinhala
Sundanese              Sundanese
Syloti_Nagri           Syloti Nagri
Syriac                 Syriac
Tagalog                Tagalog
Tagbanwa               Tagbanwa
Tai_Le                 Tai Le
Tai_Tham               Tai Tham
Tai_Viet               Tai Viet
Tamil                  Tamil
Telugu                 Telugu
Thaana                 Thaana
Thai                   Thai
Tibetan                Tibetan
Tifinagh               Tifinagh
Ugaritic               Ugaritic
Vai                    Vai
Yi                     Yi

Index ¶

type AssertEdge
- func NewAssertEdge(target *NfaState, asserts EdgeAssert) *AssertEdge
- func (e *AssertEdge) Accepts(s *ScannerSource) bool
- func (e *AssertEdge) String() (s string)
type EOFReader
- func (r EOFReader) ReadRune() (arune rune, size int, err error)
type EdgeAssert
type Edger
type EpsilonEdge
- func (e *EpsilonEdge) Accepts(s *ScannerSource) bool
- func (e *EpsilonEdge) Priority() int
- func (e *EpsilonEdge) SetTarget(s *NfaState) (old *NfaState)
- func (e *EpsilonEdge) String() (s string)
- func (e *EpsilonEdge) Target() *NfaState
type Lexer
- func CompileLexer(starts [][]int, tokdefs map[string]int, grammar, start string) (lexer *Lexer, err error)
- func MustCompileLexer(starts [][]int, tokdefs map[string]int, grammar, start string) (lexer *Lexer)
- func (lx *Lexer) Scanner(fname string, r io.RuneReader) *Scanner
- func (lx *Lexer) String() (s string)
type Nfa
- func (n *Nfa) AddState(s *NfaState) *NfaState
- func (n *Nfa) NewState() (s *NfaState)
- func (n *Nfa) OneOrMore(in, out *NfaState) (from, to *NfaState)
- func (n *Nfa) ParseRE(name, re string) (in, out *NfaState, err error)
- func (n Nfa) String() (s string)
- func (n *Nfa) ZeroOrMore(in, out *NfaState) (from, to *NfaState)
- func (n *Nfa) ZeroOrOne(in, out *NfaState) (from, to *NfaState)
type NfaState
- func (n *NfaState) AddConsuming(edge Edger) Edger
- func (n *NfaState) AddNonConsuming(edge Edger) Edger
- func (n *NfaState) String() (s string)
type RangesEdge
- func NewRangesEdge(target *NfaState, invert bool, ranges *unicode.RangeTable) *RangesEdge
- func (e *RangesEdge) Accepts(s *ScannerSource) bool
- func (e *RangesEdge) String() (s string)
type RuneEdge
- func NewRuneEdge(target *NfaState, arune rune) *RuneEdge
- func (e *RuneEdge) Accepts(s *ScannerSource) bool
- func (e *RuneEdge) String() string
type Scanner
- func (s *Scanner) Begin(state StartSetID)
- func (s *Scanner) Include(fname string, r io.RuneReader)
- func (s *Scanner) PopState()
- func (s *Scanner) Position() token.Position
- func (s *Scanner) PushState(newState StartSetID)
- func (s *Scanner) Scan() (arune rune, ok bool)
- func (s *Scanner) Token() []rune
- func (s *Scanner) TokenStart() token.Position
- func (s *Scanner) TopState() StartSetID
type ScannerRune
type ScannerSource
- func NewScannerSource(fname string, r io.RuneReader) *ScannerSource
- func (s *ScannerSource) Accept(arune rune) bool
- func (s *ScannerSource) Collect() (arunes []rune)
- func (s *ScannerSource) CollectString() string
- func (s *ScannerSource) Current() rune
- func (s *ScannerSource) CurrentRune() ScannerRune
- func (s *ScannerSource) Include(fname string, r io.RuneReader)
- func (s *ScannerSource) Move()
- func (s *ScannerSource) Next() rune
- func (s *ScannerSource) NextRune() ScannerRune
- func (s *ScannerSource) Position() token.Position
- func (s *ScannerSource) Prev() rune
- func (s *ScannerSource) PrevRune() ScannerRune
type Source
- func NewSource(fname string, r io.RuneReader) *Source
- func (s *Source) Include(fname string, r io.RuneReader)
- func (s *Source) Position() token.Position
- func (s *Source) Read() (r ScannerRune)
type StartSetID

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type AssertEdge ¶

type AssertEdge struct {
	EpsilonEdge
	Asserts EdgeAssert
}

AssertEdge is a non consuming edge which asserts line/text start/end.

func NewAssertEdge ¶

func NewAssertEdge(target *NfaState, asserts EdgeAssert) *AssertEdge

NewAssertEdge returns a new AssertEdge pointing to target, asserting asserts.

func (*AssertEdge) Accepts ¶

func (e *AssertEdge) Accepts(s *ScannerSource) bool

Accepts is the AssertEdge implementation of the Edger interface.

func (*AssertEdge) String ¶

func (e *AssertEdge) String() (s string)

type EOFReader ¶

type EOFReader int

EOFReader implements a RuneReader always returning 0 (EOF).

func (EOFReader) ReadRune ¶

func (r EOFReader) ReadRune() (arune rune, size int, err error)
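A reader of this kind is only a few lines long. The sketch below uses a hypothetical eofReader type; returning io.EOF as the error is an assumption about how the real EOFReader behaves.

```go
package main

import (
	"fmt"
	"io"
)

// eofReader is a hypothetical stand-in for EOFReader: an io.RuneReader
// with no data. Returning io.EOF is an assumption about the real type.
type eofReader int

// ReadRune never yields a rune; every call reports end of input.
func (eofReader) ReadRune() (r rune, size int, err error) {
	return 0, 0, io.EOF
}

func main() {
	var rr io.RuneReader = eofReader(0)
	r, size, err := rr.ReadRune()
	fmt.Println(r, size, err == io.EOF) // 0 0 true
}
```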

type EdgeAssert ¶

type EdgeAssert int

const (
	TextStart EdgeAssert = iota
	TextEnd
	LineStart
	LineEnd
)
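One plausible reading of the four assertions, matching the ^, $, \A and \z entries in the syntax table, is sketched here. The names are hypothetical, and the rule that a zero rune marks "nothing read yet" or EOF is an assumption rather than documented behavior.

```go
package main

import "fmt"

// edgeAssert mirrors the four EdgeAssert constants for illustration only.
type edgeAssert int

const (
	textStart edgeAssert = iota
	textEnd
	lineStart
	lineEnd
)

// holds reports whether the assertion is satisfied between prev and cur.
// Assumption: prev == 0 means nothing has been read yet, cur == 0 means EOF.
func holds(a edgeAssert, prev, cur rune) bool {
	switch a {
	case textStart:
		return prev == 0
	case textEnd:
		return cur == 0
	case lineStart:
		return prev == 0 || prev == '\n'
	case lineEnd:
		return cur == 0 || cur == '\n'
	}
	return false
}

func main() {
	fmt.Println(holds(lineStart, '\n', 'x')) // true
	fmt.Println(holds(textStart, '\n', 'x')) // false
	fmt.Println(holds(lineEnd, 'x', 0))      // true
}
```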

type Edger ¶

type Edger interface {
	Accepts(s *ScannerSource) bool // Accepts() returns whether an edge accepts the ScannerSource's present state.
	Priority() int                 // Priority returns the priority tag of an edge (lower value wins).
	Target() *NfaState             // Target() returns the edge's target NFA state.
	String() string
	SetTarget(s *NfaState) *NfaState // SetTarget() assigns s as a new target and returns the original Target
}

Edger interface defines the method set for all NFA edge types.

type EpsilonEdge ¶

type EpsilonEdge struct {
	Prio int
	Targ *NfaState
}

EpsilonEdge is a non consuming, always accepting NFA edge.

func (*EpsilonEdge) Accepts ¶

func (e *EpsilonEdge) Accepts(s *ScannerSource) bool

Accepts is the EpsilonEdge implementation of the Edger interface.

func (*EpsilonEdge) Priority ¶

func (e *EpsilonEdge) Priority() int

Priority is the EpsilonEdge implementation of the Edger interface.

func (*EpsilonEdge) SetTarget ¶

func (e *EpsilonEdge) SetTarget(s *NfaState) (old *NfaState)

func (*EpsilonEdge) String ¶

func (e *EpsilonEdge) String() (s string)

func (*EpsilonEdge) Target ¶

func (e *EpsilonEdge) Target() *NfaState

Target is the EpsilonEdge implementation of the Edger interface.

type Lexer ¶

type Lexer struct {
	// contains filtered or unexported fields
}

func CompileLexer ¶

func CompileLexer(starts [][]int, tokdefs map[string]int, grammar, start string) (lexer *Lexer, err error)

TODO:full docs

func MustCompileLexer ¶

func MustCompileLexer(starts [][]int, tokdefs map[string]int, grammar, start string) (lexer *Lexer)

MustCompileLexer is like CompileLexer but panics if the definitions cannot be compiled. It simplifies safe initialization of global variables holding compiled Lexers.
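The Must pattern is the same one regexp.MustCompile uses. The sketch below shows it with hypothetical stand-ins (table, compile, mustCompile); it does not call this package, whose grammar format is not documented here.

```go
package main

import (
	"errors"
	"fmt"
)

// table stands in for a compiled lexer; compile and mustCompile are
// hypothetical names used only to show the Must pattern.
type table struct{ def string }

// compile rejects an empty definition and otherwise wraps it.
func compile(def string) (*table, error) {
	if def == "" {
		return nil, errors.New("empty definition")
	}
	return &table{def: def}, nil
}

// mustCompile panics instead of returning an error, so it can be used in
// a package-level var initializer.
func mustCompile(def string) *table {
	t, err := compile(def)
	if err != nil {
		panic(err)
	}
	return t
}

// A definition error here surfaces at program start, not at first use.
var words = mustCompile("[a-z]+")

func main() {
	fmt.Println(words.def) // [a-z]+
}
```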

func (*Lexer) Scanner ¶

func (lx *Lexer) Scanner(fname string, r io.RuneReader) *Scanner

Scanner returns a new Scanner which can run the Lexer FSM. A Scanner is not safe for concurrent access but many Scanners can safely share the same Lexer.

The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.

func (*Lexer) String ¶

func (lx *Lexer) String() (s string)

type Nfa ¶

type Nfa []*NfaState

Nfa is a set of NfaStates.

func (*Nfa) AddState ¶

func (n *Nfa) AddState(s *NfaState) *NfaState

AddState adds an existing NfaState to Nfa. One NfaState should not appear in more than one Nfa because the NfaState Index property should always reflect its position in the owner Nfa.

func (*Nfa) NewState ¶

func (n *Nfa) NewState() (s *NfaState)

NewState returns a newly created NfaState and adds it to the Nfa.

func (*Nfa) OneOrMore ¶

func (n *Nfa) OneOrMore(in, out *NfaState) (from, to *NfaState)

OneOrMore converts a Nfa component C to C+

func (*Nfa) ParseRE ¶

func (n *Nfa) ParseRE(name, re string) (in, out *NfaState, err error)

ParseRE compiles a regular expression re into Nfa and returns the starting and accepting states of the resulting component, or an error if any.

func (Nfa) String ¶

func (n Nfa) String() (s string)

func (*Nfa) ZeroOrMore ¶

func (n *Nfa) ZeroOrMore(in, out *NfaState) (from, to *NfaState)

ZeroOrMore converts a Nfa component C to C*

func (*Nfa) ZeroOrOne ¶

func (n *Nfa) ZeroOrOne(in, out *NfaState) (from, to *NfaState)

ZeroOrOne converts a Nfa component C to C?

type NfaState ¶

type NfaState struct {
	Index        uint    // Index of this state in its owning NFA.
	Consuming    []Edger // The NFA state consuming edge set.
	NonConsuming []Edger // The NFA state non consuming edge set.
}

NfaState describes a single NFA state.

func (*NfaState) AddConsuming ¶

func (n *NfaState) AddConsuming(edge Edger) Edger

AddConsuming adds an Edger to the state's consuming edge set and returns the Edger. No checks are made if the edge really is a consuming edge.

func (*NfaState) AddNonConsuming ¶

func (n *NfaState) AddNonConsuming(edge Edger) Edger

AddNonConsuming adds an Edger to the state's non consuming edge set and returns the Edger. No checks are made if the edge really is a non consuming edge.

func (*NfaState) String ¶

func (n *NfaState) String() (s string)

type RangesEdge ¶

type RangesEdge struct {
	EpsilonEdge
	Invert bool                // Accepts all but Ranges as in [^exp]
	Ranges *unicode.RangeTable // Accepted rune set
}

RangesEdge is a consuming edge which accepts rune ranges except U+0000.

func NewRangesEdge ¶

func NewRangesEdge(target *NfaState, invert bool, ranges *unicode.RangeTable) *RangesEdge

NewRangesEdge returns a new RangesEdge pointing to target which accepts ranges.
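The acceptance test for such an edge can be modeled with unicode.Is. In this sketch accepts and abcx are hypothetical names, and the assumption that U+0000 is rejected even when the set is inverted follows from the RangesEdge description rather than from the source code.

```go
package main

import (
	"fmt"
	"unicode"
)

// abcx is a hand-built table for [a-cx]. R16 entries must be sorted and
// must not overlap.
var abcx = &unicode.RangeTable{R16: []unicode.Range16{
	{Lo: 'a', Hi: 'c', Stride: 1},
	{Lo: 'x', Hi: 'x', Stride: 1},
}}

// accepts is a hypothetical model of a ranges edge test: r must be in
// ranges (or outside it when invert is set), and U+0000 never matches.
func accepts(ranges *unicode.RangeTable, invert bool, r rune) bool {
	if r == 0 {
		return false
	}
	return unicode.Is(ranges, r) != invert
}

func main() {
	fmt.Println(accepts(abcx, false, 'b')) // true
	fmt.Println(accepts(abcx, true, 'b'))  // false
	fmt.Println(accepts(abcx, true, 0))    // false
}
```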

func (*RangesEdge) Accepts ¶

func (e *RangesEdge) Accepts(s *ScannerSource) bool

Accepts is the RangesEdge implementation of the Edger interface.

func (*RangesEdge) String ¶

func (e *RangesEdge) String() (s string)

type RuneEdge ¶

type RuneEdge struct {
	EpsilonEdge
	Rune rune
}

RuneEdge is a consuming edge which accepts a single rune.

func NewRuneEdge ¶

func NewRuneEdge(target *NfaState, arune rune) *RuneEdge

NewRuneEdge returns a new RuneEdge pointing to target which accepts arune.

func (*RuneEdge) Accepts ¶

func (e *RuneEdge) Accepts(s *ScannerSource) bool

Accepts is the RuneEdge implementation of the Edger interface.

func (*RuneEdge) String ¶

func (e *RuneEdge) String() string

type Scanner ¶

type Scanner struct {
	// contains filtered or unexported fields
}

func (*Scanner) Begin ¶

func (s *Scanner) Begin(state StartSetID)

Begin switches the Scanner's start state (start set).

func (*Scanner) Include ¶

func (s *Scanner) Include(fname string, r io.RuneReader)

Include includes a RuneReader having fname. Recursive including is not checked. Include discards the one-rune lookahead data, if any. Lookahead data exists iff Next() has been called and Move() has not yet been called afterwards.

func (*Scanner) PopState ¶

func (s *Scanner) PopState()

PopState pops the top of the stack and switches to it via Begin().

func (*Scanner) Position ¶

func (s *Scanner) Position() token.Position

Position returns the current Scanner position, i.e. after a Scan() it returns the position after the current token.

func (*Scanner) PushState ¶

func (s *Scanner) PushState(newState StartSetID)

PushState pushes the current start condition onto the top of the start condition stack and switches to newState as though you had used Begin(newState).

func (*Scanner) Scan ¶

func (s *Scanner) Scan() (arune rune, ok bool)

Scan scans the Scanner source, consuming runes as long as there is a chance to recognize a token (i.e. until the Scanner FSM stops).

If the scanner is starting a Scan at EOF:
    Return 0, false.

If a valid token was recognized:
    If the token's numeric id is >= 0:
        Return id, true.
    If the id is < 0:
        If the Scan has consumed at least one rune:
            Scan restarts discarding any consumed runes.
        If the Scan has not consumed any rune:
            Scanner is stalled¹. Move on by one rune, return unicode.ReplacementChar, false.

If a valid token was not recognized:
    If the Scanner has not consumed any rune:
        Return the current rune, false.² Move on by one rune.
    If the Scanner has moved by exactly one rune:
        Return that rune, false.²
    If the Scanner has consumed more than one rune:
        Return unicode.ReplacementChar, false.

The actual runes consumed by the last Scan can be retrieved by Token.

If the assigned token ids do not overlap with the otherwise expected runes, e.g. because the ids are in the Unicode private use area, then it is possible to ignore the returned ok value and drive a parser only by the rune/token id value, because any other unsuccessful scan returns either zero (EOF) or unicode.ReplacementChar. This is presumably the easier way for e.g. goyacc.

¹The FSM has stopped in an accepting state without consuming any runes. Caused by using (re)* or (re)? for negative numeric id (i.e. ignored) tokens. Better avoid that.

²Intended for processing single rune tokens (e.g. a semicolon) without defining the regexp and token id for it. Examples of such usage can be found in many .y files.
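Driving a parser by the returned value alone could look like the sketch below. The token ids tokIdent and tokNumber and the describe function are hypothetical; the only facts taken from the text are that zero means EOF, unicode.ReplacementChar signals a failed scan, and single runes pass through as themselves.

```go
package main

import (
	"fmt"
	"unicode"
)

// Hypothetical token ids placed in the Unicode private use area so they
// cannot collide with runes expected in the input.
const (
	tokIdent rune = 0xE000 + iota
	tokNumber
)

// describe classifies a value as Scan might return it, ignoring ok.
func describe(r rune) string {
	switch {
	case r == 0:
		return "EOF"
	case r == unicode.ReplacementChar:
		return "error"
	case r == tokIdent:
		return "identifier"
	case r == tokNumber:
		return "number"
	default:
		return fmt.Sprintf("literal %q", r)
	}
}

func main() {
	for _, r := range []rune{tokIdent, ';', unicode.ReplacementChar, 0} {
		fmt.Println(describe(r))
	}
}
```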

func (*Scanner) Token ¶

func (s *Scanner) Token() []rune

Token returns the runes consumed by the last Scan. Repeated Scans for ignored tokens (id < 0) are discarded.

func (*Scanner) TokenStart ¶

func (s *Scanner) TokenStart() token.Position

TokenStart returns the starting position of the token returned by last Scan.

func (*Scanner) TopState ¶

func (s *Scanner) TopState() StartSetID

TopState returns the top of the stack without altering the stack's contents.
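The interplay of Begin, PushState, PopState and TopState can be modeled with a slice used as a stack. This is a sketch with hypothetical lowercase names; the real Scanner's fields are unexported, and its behavior on an empty stack is not documented here (this model would panic).

```go
package main

import "fmt"

type startSetID int

// scanner models only the start-set bookkeeping; the real Scanner's
// fields are unexported, so this layout is an assumption.
type scanner struct {
	state startSetID
	stack []startSetID
}

// begin switches the current start set.
func (s *scanner) begin(id startSetID) { s.state = id }

// pushState saves the current start set, then switches to id.
func (s *scanner) pushState(id startSetID) {
	s.stack = append(s.stack, s.state)
	s.begin(id)
}

// topState peeks at the saved start set without removing it.
func (s *scanner) topState() startSetID { return s.stack[len(s.stack)-1] }

// popState removes the saved start set and switches back to it.
func (s *scanner) popState() {
	top := s.topState()
	s.stack = s.stack[:len(s.stack)-1]
	s.begin(top)
}

func main() {
	const (
		initial startSetID = iota
		comment
	)
	s := &scanner{state: initial}
	s.pushState(comment)
	fmt.Println(s.state, s.topState()) // 1 0
	s.popState()
	fmt.Println(s.state, len(s.stack)) // 0 0
}
```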

type ScannerRune ¶

type ScannerRune struct {
	Position token.Position // Starting position of Rune
	Rune     rune           // Rune value
	Size     int            // Rune size
	Err      error          // io.EOF or nil. Any other value invalidates all other fields of a ScannerRune.
}

ScannerRune is a struct holding info about a rune and its origin.

type ScannerSource ¶

type ScannerSource struct {
	// contains filtered or unexported fields
}

ScannerSource is a Source with one ScannerRune look behind and an on demand one ScannerRune lookahead.

func NewScannerSource ¶

func NewScannerSource(fname string, r io.RuneReader) *ScannerSource

NewScannerSource returns a new ScannerSource from a RuneReader having fname. The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.

func (*ScannerSource) Accept ¶

func (s *ScannerSource) Accept(arune rune) bool

Accept checks if arune matches Current. If true then does Move.

func (*ScannerSource) Collect ¶

func (s *ScannerSource) Collect() (arunes []rune)

Collect returns all runes seen by the ScannerSource since the last Collect or CollectString. Use either Collect or CollectString, not both, because each call clears the collector.

func (*ScannerSource) CollectString ¶

func (s *ScannerSource) CollectString() string

CollectString returns all runes seen by the ScannerSource since the last CollectString or Collect as a string. Use either Collect or CollectString, not both, because each call clears the collector.

func (*ScannerSource) Current ¶

func (s *ScannerSource) Current() rune

Current returns the current ScannerSource rune. At EOF it's zero.

func (*ScannerSource) CurrentRune ¶

func (s *ScannerSource) CurrentRune() ScannerRune

CurrentRune returns the current ScannerSource ScannerRune.

func (*ScannerSource) Include ¶

func (s *ScannerSource) Include(fname string, r io.RuneReader)

Include includes a RuneReader having fname. Recursive including is not checked. Include discards the one-rune lookahead data, if any. Lookahead data exists iff Next() has been called and Move() has not yet been called afterwards.

func (*ScannerSource) Move ¶

func (s *ScannerSource) Move()

Move moves ScannerSource one rune ahead.

func (*ScannerSource) Next ¶

func (s *ScannerSource) Next() rune

Next returns the ScannerSource's next (lookahead) rune. It is zero if next is EOF.

func (*ScannerSource) NextRune ¶

func (s *ScannerSource) NextRune() ScannerRune

NextRune returns the ScannerSource's next (lookahead) ScannerRune. Its Rune is zero if next is EOF.

func (*ScannerSource) Position ¶

func (s *ScannerSource) Position() token.Position

Position returns the current ScannerSource position, i.e. after a Move() it returns the position after CurrentRune.

func (*ScannerSource) Prev ¶

func (s *ScannerSource) Prev() rune

Prev returns the previous (look behind) rune. Before the first Move() it is zero.

func (*ScannerSource) PrevRune ¶

func (s *ScannerSource) PrevRune() ScannerRune

PrevRune returns the previous (look behind) ScannerRune. Before the first Move() its Rune is zero and Position.IsValid() == false.
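The look behind, current and lookahead runes form a three-slot window over the input. The sketch below models that window with a hypothetical window type over an io.RuneReader; it tracks rune values only (no positions) and treats any read error as EOF, both simplifications.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// window is a hypothetical model of a source with one rune of look behind
// and an on-demand one rune lookahead. Zero stands for "none" or EOF.
type window struct {
	r        io.RuneReader
	prev     rune
	cur      rune
	next     rune
	haveNext bool
}

// read returns the next rune from the reader, or 0 on any error.
func (w *window) read() rune {
	r, _, err := w.r.ReadRune()
	if err != nil {
		return 0
	}
	return r
}

// newWindow loads the first rune as the current one.
func newWindow(r io.RuneReader) *window {
	w := &window{r: r}
	w.cur = w.read()
	return w
}

// fromString is a convenience constructor for the examples.
func fromString(s string) *window { return newWindow(strings.NewReader(s)) }

// peek fetches the lookahead rune on first use and caches it until move.
func (w *window) peek() rune {
	if !w.haveNext {
		w.next, w.haveNext = w.read(), true
	}
	return w.next
}

// move shifts the window one rune ahead, reusing cached lookahead.
func (w *window) move() {
	w.prev = w.cur
	if w.haveNext {
		w.cur, w.haveNext = w.next, false
		return
	}
	w.cur = w.read()
}

func main() {
	w := fromString("ab")
	fmt.Printf("%q %q %q\n", w.prev, w.cur, w.peek())
	w.move()
	fmt.Printf("%q %q %q\n", w.prev, w.cur, w.peek())
}
```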

type Source ¶

type Source struct {
	// contains filtered or unexported fields
}

Source provides a stack of rune streams with position information.

func NewSource ¶

func NewSource(fname string, r io.RuneReader) *Source

NewSource returns a new Source from a RuneReader having fname. The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.

func (*Source) Include ¶

func (s *Source) Include(fname string, r io.RuneReader)

Include includes a RuneReader having fname. Recursive including is not checked.

func (*Source) Position ¶

func (s *Source) Position() token.Position

Position returns the position of the next Read.

func (*Source) Read ¶

func (s *Source) Read() (r ScannerRune)

Read returns the next Source ScannerRune.
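A stack of rune streams behaves much like nested #include files: reading continues from the most recently included stream and resumes the outer one when the inner one ends. The sketch below uses a hypothetical source type without file names or positions, and it treats any read error as the end of a stream, which is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// source is a hypothetical model of a stack of rune streams. Reading
// always uses the most recently included stream and falls back to the
// one below when it is exhausted.
type source struct {
	stack []io.RuneReader
}

// include pushes a new stream on top of the stack.
func (s *source) include(r io.RuneReader) { s.stack = append(s.stack, r) }

// includeString is a convenience wrapper for the examples.
func (s *source) includeString(text string) { s.include(strings.NewReader(text)) }

// read returns the next rune, or 0 once every stream has reached EOF.
func (s *source) read() rune {
	for len(s.stack) > 0 {
		r, _, err := s.stack[len(s.stack)-1].ReadRune()
		if err == nil {
			return r
		}
		s.stack = s.stack[:len(s.stack)-1] // stream finished, resume the outer one
	}
	return 0
}

// readAll drains the source into a string.
func (s *source) readAll() string {
	var b strings.Builder
	for r := s.read(); r != 0; r = s.read() {
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	s := &source{}
	s.includeString("ad")
	fmt.Printf("%c", s.read()) // reads 'a' from the outer stream
	s.includeString("bc")      // included in the middle of the outer stream
	fmt.Println(s.readAll())   // the line printed in total is abcd
}
```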

type StartSetID ¶

type StartSetID int

StartSetID is the type of a lexer start set identifier. It is used by Begin and PushState.
