Documentation ¶
Overview ¶
Package lexer generates actionless scanners (lexeme recognizers) at run time.
Scanners are defined by regular expressions and/or lexical grammars, a mapping between those definitions and numeric token identifiers, and an optional set of starting id sets, providing functionality similar to switching start states in *nix LEX. The generated FSMs are Unicode rune based, and all unicode.Categories and unicode.Scripts are supported by the regexp syntax using the \p{name} construct.
Syntax supported by ParseRE (ATM a very basic subset of RE2; the docs below are a modification of http://code.google.com/p/re2/wiki/Syntax, original docs license unclear):
Single characters:
    .           any character, excluding newline
    [xyz]       character class
    [^xyz]      negated character class
    \p{Greek}   Unicode character class
    \P{Greek}   negated Unicode character class
Composites:
    xy      x followed by y
    x|y     x or y
Repetitions:
    x*      zero or more x
    x+      one or more x
    x?      zero or one x
Grouping:
    (re)    group
Empty strings:
    ^       at beginning of text or line
    $       at end of text or line
    \A      at beginning of text
    \z      at end of text
Escape sequences:
    \a      bell (≡ \007)
    \b      backspace (≡ \010)
    \f      form feed (≡ \014)
    \n      newline (≡ \012)
    \r      carriage return (≡ \015)
    \t      horizontal tab (≡ \011)
    \v      vertical tab character (≡ \013)
    \M      M is one of metachars \.+*?()|[]^$
    \xhh    rune \u00hh, h is a hex digit
Character class elements:
    x       single Unicode character
    A-Z     Unicode character range (inclusive)
Unicode character class names--general category:
    Cc      control
    Cf      format
    Co      private use
    Cs      surrogate
    letter  Lu, Ll, Lt, Lm, or Lo
    Ll      lowercase letter
    Lm      modifier letter
    Lo      other letter
    Lt      titlecase letter
    Lu      uppercase letter
    Mc      spacing mark
    Me      enclosing mark
    Mn      non-spacing mark
    Nd      decimal number
    Nl      letter number
    No      other number
    Pc      connector punctuation
    Pd      dash punctuation
    Pe      close punctuation
    Pf      final punctuation
    Pi      initial punctuation
    Po      other punctuation
    Ps      open punctuation
    Sc      currency symbol
    Sk      modifier symbol
    Sm      math symbol
    So      other symbol
    Zl      line separator
    Zp      paragraph separator
    Zs      space separator
Unicode character class names--scripts:
    Arabic                  Arabic
    Armenian                Armenian
    Avestan                 Avestan
    Balinese                Balinese
    Bamum                   Bamum
    Bengali                 Bengali
    Bopomofo                Bopomofo
    Braille                 Braille
    Buginese                Buginese
    Buhid                   Buhid
    Canadian_Aboriginal     Canadian Aboriginal
    Carian                  Carian
    Common                  Common
    Coptic                  Coptic
    Cuneiform               Cuneiform
    Cypriot                 Cypriot
    Cyrillic                Cyrillic
    Deseret                 Deseret
    Devanagari              Devanagari
    Egyptian_Hieroglyphs    Egyptian Hieroglyphs
    Ethiopic                Ethiopic
    Georgian                Georgian
    Glagolitic              Glagolitic
    Gothic                  Gothic
    Greek                   Greek
    Gujarati                Gujarati
    Gurmukhi                Gurmukhi
    Hangul                  Hangul
    Han                     Han
    Hanunoo                 Hanunoo
    Hebrew                  Hebrew
    Hiragana                Hiragana
    Cham                    Cham
    Cherokee                Cherokee
    Imperial_Aramaic        Imperial Aramaic
    Inherited               Inherited
    Inscriptional_Pahlavi   Inscriptional Pahlavi
    Inscriptional_Parthian  Inscriptional Parthian
    Javanese                Javanese
    Kaithi                  Kaithi
    Kannada                 Kannada
    Katakana                Katakana
    Kayah_Li                Kayah Li
    Kharoshthi              Kharoshthi
    Khmer                   Khmer
    Lao                     Lao
    Latin                   Latin
    Lepcha                  Lepcha
    Limbu                   Limbu
    Linear_B                Linear B
    Lisu                    Lisu
    Lycian                  Lycian
    Lydian                  Lydian
    Malayalam               Malayalam
    Meetei_Mayek            Meetei Mayek
    Mongolian               Mongolian
    Myanmar                 Myanmar
    New_Tai_Lue             New Tai Lue
    Nko                     Nko
    Ogham                   Ogham
    Old_Italic              Old Italic
    Old_Persian             Old Persian
    Old_South_Arabian       Old South Arabian
    Old_Turkic              Old Turkic
    Ol_Chiki                Ol Chiki
    Oriya                   Oriya
    Osmanya                 Osmanya
    Phags_Pa                Phags Pa
    Phoenician              Phoenician
    Rejang                  Rejang
    Runic                   Runic
    Samaritan               Samaritan
    Saurashtra              Saurashtra
    Shavian                 Shavian
    Sinhala                 Sinhala
    Sundanese               Sundanese
    Syloti_Nagri            Syloti Nagri
    Syriac                  Syriac
    Tagalog                 Tagalog
    Tagbanwa                Tagbanwa
    Tai_Le                  Tai Le
    Tai_Tham                Tai Tham
    Tai_Viet                Tai Viet
    Tamil                   Tamil
    Telugu                  Telugu
    Thaana                  Thaana
    Thai                    Thai
    Tibetan                 Tibetan
    Tifinagh                Tifinagh
    Ugaritic                Ugaritic
    Vai                     Vai
    Yi                      Yi
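The \p{name} and \P{name} constructs listed above can be tried out without this package. The following is only a sketch using the standard library: package regexp (which also follows RE2 syntax) stands in for ParseRE, and inClass is a hypothetical helper that looks a class name up in unicode.Categories and unicode.Scripts. It does not show how lexer itself resolves class names.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
)

// inClass reports whether r belongs to the named Unicode general category
// or script, looked up in the standard unicode.Categories and
// unicode.Scripts tables. Hypothetical helper, not part of package lexer.
func inClass(name string, r rune) bool {
	if t, ok := unicode.Categories[name]; ok {
		return unicode.Is(t, r)
	}
	if t, ok := unicode.Scripts[name]; ok {
		return unicode.Is(t, r)
	}
	return false // unknown class name
}

func main() {
	// The same \p{name} / \P{name} notation, exercised via package regexp.
	greek := regexp.MustCompile(`\p{Greek}+`)
	notGreek := regexp.MustCompile(`\P{Greek}+`)

	fmt.Printf("%q\n", greek.FindString("abc αβγ def"))    // "αβγ"
	fmt.Printf("%q\n", notGreek.FindString("abc αβγ def")) // "abc "

	fmt.Println(inClass("Lu", 'A'), inClass("Lu", 'a'))          // true false
	fmt.Println(inClass("Greek", 'α'), inClass("Cyrillic", 'α')) // true false
}
```

Class names are case sensitive in both lookup tables, so "greek" would not be found by this helper.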
Index ¶
- type AssertEdge
- type EOFReader
- type EdgeAssert
- type Edger
- type EpsilonEdge
- type Lexer
- type Nfa
- func (n *Nfa) AddState(s *NfaState) *NfaState
- func (n *Nfa) NewState() (s *NfaState)
- func (n *Nfa) OneOrMore(in, out *NfaState) (from, to *NfaState)
- func (n *Nfa) ParseRE(name, re string) (in, out *NfaState, err error)
- func (n Nfa) String() (s string)
- func (n *Nfa) ZeroOrMore(in, out *NfaState) (from, to *NfaState)
- func (n *Nfa) ZeroOrOne(in, out *NfaState) (from, to *NfaState)
- type NfaState
- type RangesEdge
- type RuneEdge
- type Scanner
- func (s *Scanner) Begin(state StartSetID)
- func (s *Scanner) Include(fname string, r io.RuneReader)
- func (s *Scanner) PopState()
- func (s *Scanner) Position() token.Position
- func (s *Scanner) PushState(newState StartSetID)
- func (s *Scanner) Scan() (arune rune, ok bool)
- func (s *Scanner) Token() []rune
- func (s *Scanner) TokenStart() token.Position
- func (s *Scanner) TopState() StartSetID
- type ScannerRune
- type ScannerSource
- func (s *ScannerSource) Accept(arune rune) bool
- func (s *ScannerSource) Collect() (arunes []rune)
- func (s *ScannerSource) CollectString() string
- func (s *ScannerSource) Current() rune
- func (s *ScannerSource) CurrentRune() ScannerRune
- func (s *ScannerSource) Include(fname string, r io.RuneReader)
- func (s *ScannerSource) Move()
- func (s *ScannerSource) Next() rune
- func (s *ScannerSource) NextRune() ScannerRune
- func (s *ScannerSource) Position() token.Position
- func (s *ScannerSource) Prev() rune
- func (s *ScannerSource) PrevRune() ScannerRune
- type Source
- type StartSetID
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type AssertEdge ¶
type AssertEdge struct {
	EpsilonEdge
	Asserts EdgeAssert
}
AssertEdge is a non consuming edge which asserts line/text start/end.
func NewAssertEdge ¶
func NewAssertEdge(target *NfaState, asserts EdgeAssert) *AssertEdge
NewAssertEdge returns a new AssertEdge pointing to target, asserting asserts.
func (*AssertEdge) Accepts ¶
func (e *AssertEdge) Accepts(s *ScannerSource) bool
Accepts is the AssertEdge implementation of the Edger interface.
func (*AssertEdge) String ¶
func (e *AssertEdge) String() (s string)
type EdgeAssert ¶
type EdgeAssert int
const (
	TextStart EdgeAssert = iota
	TextEnd
	LineStart
	LineEnd
)
type Edger ¶
type Edger interface {
	Accepts(s *ScannerSource) bool   // Accepts reports whether an edge accepts the ScannerSource present state.
	Priority() int                   // Priority returns the priority tag of an edge (lower value wins).
	Target() *NfaState               // Target returns the edge's target NFA state.
	String() string
	SetTarget(s *NfaState) *NfaState // SetTarget assigns s as a new target and returns the original target.
}
Edger interface defines the method set for all NFA edge types.
type EpsilonEdge ¶
EpsilonEdge is a non consuming, always accepting NFA edge.
func (*EpsilonEdge) Accepts ¶
func (e *EpsilonEdge) Accepts(s *ScannerSource) bool
Accepts is the EpsilonEdge implementation of the Edger interface.
func (*EpsilonEdge) Priority ¶
func (e *EpsilonEdge) Priority() int
Priority is the EpsilonEdge implementation of the Edger interface.
func (*EpsilonEdge) SetTarget ¶
func (e *EpsilonEdge) SetTarget(s *NfaState) (old *NfaState)
func (*EpsilonEdge) String ¶
func (e *EpsilonEdge) String() (s string)
func (*EpsilonEdge) Target ¶
func (e *EpsilonEdge) Target() *NfaState
Target is the EpsilonEdge implementation of the Edger interface.
type Lexer ¶
type Lexer struct {
// contains filtered or unexported fields
}
func CompileLexer ¶
func CompileLexer(starts [][]int, tokdefs map[string]int, grammar, start string) (lexer *Lexer, err error)
TODO:full docs
func MustCompileLexer ¶
MustCompileLexer is like CompileLexer but panics if the definitions cannot be compiled. It simplifies safe initialization of global variables holding compiled Lexers.
func (*Lexer) Scanner ¶
func (lx *Lexer) Scanner(fname string, r io.RuneReader) *Scanner
Scanner returns a new Scanner which can run the Lexer FSM. A Scanner is not safe for concurrent access, but many Scanners can safely share the same Lexer.
The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.
type Nfa ¶
type Nfa []*NfaState
Nfa is a set of NfaStates.
func (*Nfa) AddState ¶
AddState adds an existing NfaState to Nfa. One NfaState should not appear in more than one Nfa because the NfaState Index property should always reflect its position in the owner Nfa.
func (*Nfa) ParseRE ¶
ParseRE compiles a regular expression re into Nfa and returns the starting and accepting states of the re component, or an error if any.
func (*Nfa) ZeroOrMore ¶
ZeroOrMore converts a Nfa component C to C*
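As an illustration of what converting a component C to C* involves, here is a minimal sketch of the textbook Thompson construction. The state type, zeroOrMore, closure, matches and star are all hypothetical names invented for this example. They only mirror the (in, out) to (from, to) shape of the method signature and are not the package's implementation.

```go
package main

import "fmt"

// state is a hypothetical, minimal NFA state used only for this sketch.
type state struct {
	eps []*state        // non consuming (epsilon) edges
	on  map[rune]*state // consuming edges, one target per rune
}

func newState() *state { return &state{on: map[rune]*state{}} }

// zeroOrMore wraps the component delimited by in and out so that it
// matches zero or more repetitions (Thompson's construction for C*).
func zeroOrMore(in, out *state) (from, to *state) {
	from, to = newState(), newState()
	from.eps = append(from.eps, in, to) // enter C, or skip it entirely
	out.eps = append(out.eps, in, to)   // repeat C, or leave
	return from, to
}

// closure adds s and everything reachable over epsilon edges to set.
func closure(s *state, set map[*state]bool) {
	if set[s] {
		return
	}
	set[s] = true
	for _, t := range s.eps {
		closure(t, set)
	}
}

// matches reports whether the component from..to accepts the whole input.
func matches(from, to *state, input string) bool {
	cur := map[*state]bool{}
	closure(from, cur)
	for _, r := range input {
		next := map[*state]bool{}
		for s := range cur {
			if t, ok := s.on[r]; ok {
				closure(t, next)
			}
		}
		cur = next
	}
	return cur[to]
}

// star builds the component for the single rune r and converts it to r*.
func star(r rune) (from, to *state) {
	in, out := newState(), newState()
	in.on[r] = out
	return zeroOrMore(in, out)
}

func main() {
	from, to := star('a')
	for _, s := range []string{"", "a", "aaa", "ab"} {
		fmt.Printf("%q -> %v\n", s, matches(from, to, s))
	}
}
```

The simulation tracks a set of active states, which avoids backtracking at the cost of recomputing epsilon closures on every input rune.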
type NfaState ¶
type NfaState struct {
	Index        uint    // Index of this state in its owning NFA.
	Consuming    []Edger // The NFA state consuming edge set.
	NonConsuming []Edger // The NFA state non consuming edge set.
}
NfaState describes a single NFA state.
func (*NfaState) AddConsuming ¶
AddConsuming adds an Edger to the state's consuming edge set and returns the Edger. No checks are made if the edge really is a consuming edge.
func (*NfaState) AddNonConsuming ¶
AddNonConsuming adds an Edger to the state's non consuming edge set and returns the Edger. No checks are made if the edge really is a non consuming edge.
type RangesEdge ¶
type RangesEdge struct {
	EpsilonEdge
	Invert bool                // Accepts all but Ranges as in [^exp]
	Ranges *unicode.RangeTable // Accepted rune set
}
RangesEdge is a consuming edge which accepts rune ranges except U+0000.
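The interplay of Invert, Ranges and the U+0000 exclusion can be sketched as follows. This is an assumption about the behaviour based on the field comments, not the actual Accepts code. acceptsRanges and acceptsScript are hypothetical helpers, and whether an inverted edge also rejects U+0000 is a guess.

```go
package main

import (
	"fmt"
	"unicode"
)

// acceptsRanges sketches the test a ranges edge might perform: the rune
// must be in the table (or outside it when invert is set), and U+0000 is
// never accepted. Assumed behaviour, not the package's actual code.
func acceptsRanges(r rune, invert bool, ranges *unicode.RangeTable) bool {
	if r == 0 {
		return false
	}
	return unicode.Is(ranges, r) != invert
}

// acceptsScript is a convenience wrapper for the demo. It looks the table
// up in unicode.Scripts and reports false for unknown script names.
func acceptsScript(r rune, invert bool, script string) bool {
	t, ok := unicode.Scripts[script]
	return ok && acceptsRanges(r, invert, t)
}

func main() {
	fmt.Println(acceptsScript('α', false, "Greek")) // true
	fmt.Println(acceptsScript('a', false, "Greek")) // false
	fmt.Println(acceptsScript('a', true, "Greek"))  // true, as in [^...]
	fmt.Println(acceptsScript(0, true, "Greek"))    // false, U+0000 excluded
}
```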
func NewRangesEdge ¶
func NewRangesEdge(target *NfaState, invert bool, ranges *unicode.RangeTable) *RangesEdge
NewRangesEdge returns a new RangesEdge pointing to target which accepts ranges.
func (*RangesEdge) Accepts ¶
func (e *RangesEdge) Accepts(s *ScannerSource) bool
Accepts is the RangesEdge implementation of the Edger interface.
func (*RangesEdge) String ¶
func (e *RangesEdge) String() (s string)
type RuneEdge ¶
type RuneEdge struct {
	EpsilonEdge
	Rune rune
}
RuneEdge is a consuming edge which accepts a single rune.
func NewRuneEdge ¶
NewRuneEdge returns a new RuneEdge pointing to target which accepts arune.
func (*RuneEdge) Accepts ¶
func (e *RuneEdge) Accepts(s *ScannerSource) bool
Accepts is the RuneEdge implementation of the Edger interface.
type Scanner ¶
type Scanner struct {
// contains filtered or unexported fields
}
func (*Scanner) Begin ¶
func (s *Scanner) Begin(state StartSetID)
Begin switches the Scanner's start state (start set).
func (*Scanner) Include ¶
func (s *Scanner) Include(fname string, r io.RuneReader)
Include includes a RuneReader having fname. Recursive including is not checked. Include discards the one rune lookahead data, if any. Lookahead data exists iff Next() has been called and Move() has not yet been called afterwards.
func (*Scanner) PopState ¶
func (s *Scanner) PopState()
PopState pops the top of the stack and switches to it via Begin().
func (*Scanner) Position ¶
Position returns the current Scanner position, i.e. after a Scan() it returns the position after the current token.
func (*Scanner) PushState ¶
func (s *Scanner) PushState(newState StartSetID)
PushState pushes the current start condition onto the top of the start condition stack and switches to newState as though you had used Begin(newState).
func (*Scanner) Scan ¶
Scan scans the Scanner source and consumes runes as long as there is a chance to recognize a token (i.e. until the Scanner FSM stops).
    If the scanner is starting a Scan at EOF:
        Return 0, false.
    If a valid token was recognized:
        If the token's numeric id is >= 0:
            Return id, true.
        If the id is < 0:
            If the Scan has consumed at least one rune:
                Scan restarts discarding any consumed runes.
            If the Scan has not consumed any rune:
                Scanner is stalled¹. Move on by one rune, return unicode.ReplacementChar, false.
    If a valid token was not recognized:
        If the Scanner has not consumed any rune:
            Return the current rune, false.² Move on by one rune.
        If the Scanner has moved by exactly one rune:
            Return that rune, false.²
        If the Scanner has consumed more than one rune:
            Return unicode.ReplacementChar, false.
The actual runes consumed by the last Scan can be retrieved by Token.
If the assigned token ids do not overlap with the otherwise expected runes, e.g. because they lie in the Unicode private use area, then the returned ok value can be ignored and a parser can be driven by the rune/token id value alone, because any other unsuccessful scan returns either zero (EOF) or unicode.ReplacementChar. This is presumably the easier way for e.g. goyacc.
¹The FSM has stopped in an accepting state without consuming any runes. Caused by using (re)* or (re)? for negative numeric id (i.e. ignored) tokens. Better avoid that.
²Intended for processing single rune tokens (e.g. a semicolon) without defining the regexp and token id for it. Examples of such usage can be found in many .y files.
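The convention of ignoring ok and switching on the returned value alone might look like the sketch below. Everything in it is hypothetical: the token ids tokIdent and tokNumber, the runeScanner interface (a one method subset of Scanner) and fakeScanner, which merely replays canned results in place of a real Scanner.

```go
package main

import (
	"fmt"
	"unicode"
)

// Hypothetical token ids placed in the Unicode private use area so they
// cannot collide with ordinary input runes.
const (
	tokIdent rune = 0xE000 + iota
	tokNumber
)

// runeScanner is the subset of the Scanner API this sketch relies on.
type runeScanner interface {
	Scan() (rune, bool)
}

// fakeScanner replays canned Scan results; it stands in for a real Scanner.
type fakeScanner struct{ results []rune }

func (f *fakeScanner) Scan() (rune, bool) {
	if len(f.results) == 0 {
		return 0, false // EOF
	}
	r := f.results[0]
	f.results = f.results[1:]
	return r, r >= 0xE000 && r <= 0xF8FF // ok only for token ids
}

// classify drains the scanner, ignoring ok and switching on the value only.
func classify(s runeScanner) []string {
	var out []string
	for {
		switch r, _ := s.Scan(); r {
		case 0:
			return out // EOF
		case unicode.ReplacementChar:
			out = append(out, "error")
		case tokIdent:
			out = append(out, "ident")
		case tokNumber:
			out = append(out, "number")
		default:
			// A single rune token such as ';' with no regexp of its own.
			out = append(out, fmt.Sprintf("char %q", r))
		}
	}
}

func main() {
	s := &fakeScanner{results: []rune{tokIdent, ';', tokNumber, unicode.ReplacementChar}}
	fmt.Println(classify(s)) // [ident char ';' number error]
}
```

A goyacc style Lex method could return the same value converted to int, which is why keeping token ids clear of ordinary input runes is convenient.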
func (*Scanner) Token ¶
Token returns the runes consumed by the last Scan. Runes consumed by repeated Scans for ignored tokens (id < 0) are discarded.
func (*Scanner) TokenStart ¶
TokenStart returns the starting position of the token returned by the last Scan.
func (*Scanner) TopState ¶
func (s *Scanner) TopState() StartSetID
TopState returns the top of the stack without altering the stack's contents.
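The relationship between Begin, PushState, PopState and TopState can be modelled with a small stack. The sketch below is an assumption drawn from the descriptions above. startStack and its fields are hypothetical, StartSetID is redeclared locally so the example is self-contained, and the behaviour on an empty stack is not documented (this sketch simply panics with an index error).

```go
package main

import "fmt"

// StartSetID is redeclared here so the sketch compiles on its own.
type StartSetID int

// startStack is a hypothetical model of the Scanner's start set stack.
type startStack struct {
	current StartSetID   // active start set
	stack   []StartSetID // saved start sets, top is the last element
}

// Begin switches the active start set.
func (s *startStack) Begin(state StartSetID) { s.current = state }

// PushState saves the active start set and switches to newState.
func (s *startStack) PushState(newState StartSetID) {
	s.stack = append(s.stack, s.current)
	s.Begin(newState)
}

// PopState removes the top of the stack and switches to it via Begin.
func (s *startStack) PopState() {
	top := s.stack[len(s.stack)-1]
	s.stack = s.stack[:len(s.stack)-1]
	s.Begin(top)
}

// TopState returns the top of the stack without altering it.
func (s *startStack) TopState() StartSetID { return s.stack[len(s.stack)-1] }

func main() {
	s := &startStack{}
	s.PushState(2)            // current 2, stack [0]
	s.PushState(5)            // current 5, stack [0 2]
	fmt.Println(s.TopState()) // 2
	s.PopState()              // current 2, stack [0]
	fmt.Println(s.current)    // 2
}
```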
type ScannerRune ¶
type ScannerRune struct {
	Position token.Position // Starting position of Rune
	Rune     rune           // Rune value
	Size     int            // Rune size
	Err      error          // io.EOF or nil. Any other value invalidates all other fields of a ScannerRune.
}
ScannerRune is a struct holding info about a rune and its origin.
type ScannerSource ¶
type ScannerSource struct {
// contains filtered or unexported fields
}
ScannerSource is a Source with one ScannerRune look behind and an on demand one ScannerRune lookahead.
func NewScannerSource ¶
func NewScannerSource(fname string, r io.RuneReader) *ScannerSource
NewScannerSource returns a new ScannerSource from a RuneReader having fname. The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.
func (*ScannerSource) Accept ¶
func (s *ScannerSource) Accept(arune rune) bool
Accept checks whether arune matches Current and, if so, calls Move.
func (*ScannerSource) Collect ¶
func (s *ScannerSource) Collect() (arunes []rune)
Collect returns all runes seen by the ScannerSource since the last Collect or CollectString. Either Collect or CollectString can be called, but only one of them, as both clear the collector.
func (*ScannerSource) CollectString ¶
func (s *ScannerSource) CollectString() string
CollectString returns all runes seen by the ScannerSource since the last CollectString or Collect as a string. Either Collect or CollectString can be called, but only one of them, as both clear the collector.
func (*ScannerSource) Current ¶
func (s *ScannerSource) Current() rune
Current returns the current ScannerSource rune. At EOF it is zero.
func (*ScannerSource) CurrentRune ¶
func (s *ScannerSource) CurrentRune() ScannerRune
CurrentRune returns the current ScannerSource ScannerRune.
func (*ScannerSource) Include ¶
func (s *ScannerSource) Include(fname string, r io.RuneReader)
Include includes a RuneReader having fname. Recursive including is not checked. Include discards the one rune lookahead data, if any. Lookahead data exists iff Next() has been called and Move() has not yet been called afterwards.
func (*ScannerSource) Move ¶
func (s *ScannerSource) Move()
Move moves the ScannerSource one rune ahead.
func (*ScannerSource) Next ¶
func (s *ScannerSource) Next() rune
Next returns the ScannerSource next (lookahead) rune. It is zero if next is EOF.
func (*ScannerSource) NextRune ¶
func (s *ScannerSource) NextRune() ScannerRune
NextRune returns the ScannerSource next (lookahead) ScannerRune. Its Rune is zero if next is EOF.
func (*ScannerSource) Position ¶
func (s *ScannerSource) Position() token.Position
Position returns the current ScannerSource position, i.e. after a Move() it returns the position after CurrentRune.
func (*ScannerSource) Prev ¶
func (s *ScannerSource) Prev() rune
Prev returns the previous (look behind) rune. Before the first Move() it is zero.
func (*ScannerSource) PrevRune ¶
func (s *ScannerSource) PrevRune() ScannerRune
PrevRune returns the previous (look behind) ScannerRune. Before the first Move() its Rune is zero and Position.IsValid() == false.
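The look behind, current and lookahead accessors described above can be pictured as a three rune window sliding over the input. The sketch below is a simplification: window is a hypothetical type working on an in-memory slice, whereas the real ScannerSource reads from RuneReaders on demand and also tracks positions.

```go
package main

import "fmt"

// window models one look behind, a current rune and one lookahead over an
// in-memory slice. Hypothetical; not the package's ScannerSource.
type window struct {
	runes []rune
	pos   int
}

// at returns the rune at index i, or zero outside the input.
func (w *window) at(i int) rune {
	if i < 0 || i >= len(w.runes) {
		return 0 // before the start or at EOF
	}
	return w.runes[i]
}

func (w *window) Prev() rune    { return w.at(w.pos - 1) }
func (w *window) Current() rune { return w.at(w.pos) }
func (w *window) Next() rune    { return w.at(w.pos + 1) }

// Move advances by one rune, stopping at EOF.
func (w *window) Move() {
	if w.pos < len(w.runes) {
		w.pos++
	}
}

// Accept moves ahead only when r matches the current rune.
func (w *window) Accept(r rune) bool {
	if w.Current() != r {
		return false
	}
	w.Move()
	return true
}

func main() {
	w := &window{runes: []rune("a=b")}
	fmt.Println(w.Accept('x'), w.Accept('a'))                 // false true
	fmt.Printf("%q %q %q\n", w.Prev(), w.Current(), w.Next()) // 'a' '=' 'b'
}
```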
type Source ¶
type Source struct {
// contains filtered or unexported fields
}
Source provides a stack of rune streams with position information.
func NewSource ¶
func NewSource(fname string, r io.RuneReader) *Source
NewSource returns a new Source from a RuneReader having fname. The RuneReader can be nil. Then an EOFReader is supplied and the real RuneReader(s) can be Included anytime afterwards.
func (*Source) Include ¶
func (s *Source) Include(fname string, r io.RuneReader)
Include includes a RuneReader having fname. Recursive including is not checked.
func (*Source) Read ¶
func (s *Source) Read() (r ScannerRune)
Read returns the next Source ScannerRune.
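One way to picture a stack of rune streams is sketched below: Include pushes a reader, and Read drains the topmost reader before resuming the one beneath it, much like a C style #include. This ordering is an assumption, readerStack and readAll are hypothetical names, and position tracking is left out.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readerStack is a hypothetical stand-in for Source: a stack of rune readers.
type readerStack struct{ readers []io.RuneReader }

// Include pushes r; subsequent reads come from r until it is exhausted.
func (s *readerStack) Include(r io.RuneReader) { s.readers = append(s.readers, r) }

// Read returns the next rune from the topmost reader, popping exhausted
// readers. It returns io.EOF once the stack is empty.
func (s *readerStack) Read() (rune, error) {
	for len(s.readers) > 0 {
		top := s.readers[len(s.readers)-1]
		r, _, err := top.ReadRune()
		if err == nil {
			return r, nil
		}
		if err != io.EOF {
			return 0, err
		}
		s.readers = s.readers[:len(s.readers)-1] // resume the including reader
	}
	return 0, io.EOF
}

// readAll reads outer to the end, including inner just before the read
// numbered includeAt (counting from zero).
func readAll(outer, inner string, includeAt int) string {
	s := &readerStack{}
	s.Include(strings.NewReader(outer))
	var b strings.Builder
	for n := 0; ; n++ {
		if n == includeAt {
			s.Include(strings.NewReader(inner))
		}
		r, err := s.Read()
		if err != nil {
			return b.String()
		}
		b.WriteRune(r)
	}
}

func main() {
	fmt.Println(readAll("ab", "XY", 1)) // aXYb
}
```

Because nothing checks for recursive includes, a reader that keeps including itself would grow the stack without bound.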
type StartSetID ¶
type StartSetID int
StartSetID is the type of a lexer start set identifier. It is used by Begin and PushState.