Documentation ¶
Overview ¶
Package lex is a Unicode-friendly run time library for golex[0] generated lexical analyzers[1].
Changelog ¶
2015-04-08: Initial release.
Character classes ¶
Golex internally handles only 8 bit "characters". Many Unicode-aware tokenizers do not actually need to recognize every Unicode rune, only particular partitions or subsets of runes, for example a single Unicode category such as upper case letters (Lu).
The idea is to map all runes in a particular set to a single 8 bit character allocated outside the ASCII range of codes. The token value, a string of runes and their exact positions, is collected as usual (see the Token and TokenBytes methods), but the tokenizer DFA is simpler (and thus smaller and perhaps also faster) when this technique is used. In the example program (see below), recognizing (and skipping) white space, integer literals, one keyword and Go identifiers requires only an 8 state DFA[5].
To provide the conversion from runes to character classes, "install" your converting function using the RuneClass option.
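A minimal sketch of such a converting function follows. It assumes the RuneClass option accepts a function with the same shape as DefaultRuneClass (rune in, int out). The class names and their numeric values are hypothetical, chosen only to sit outside the ASCII range.

```go
package main

import (
	"fmt"
	"unicode"
)

// Hypothetical class codes allocated outside the ASCII range (0x00-0x7F).
const (
	classUpper  = 0x80 + iota // any non-ASCII rune in category Lu
	classLetter               // any other non-ASCII letter
	classDigit                // any non-ASCII decimal digit (category Nd)
	classOther                // every other non-ASCII rune
)

// runeClass maps a rune to an 8 bit character class. ASCII runes keep
// their own code, all other runes collapse into one of four classes.
func runeClass(r rune) int {
	if r >= 0 && r < 0x80 {
		return int(r)
	}
	switch {
	case unicode.IsUpper(r):
		return classUpper
	case unicode.IsLetter(r):
		return classLetter
	case unicode.IsDigit(r):
		return classDigit
	}
	return classOther
}

func main() {
	for _, r := range "aÉλ٣€" {
		fmt.Printf("%q -> %#x\n", r, runeClass(r))
	}
}
```

With a mapping like this, a rule for identifiers only has to mention two or three class codes instead of every letter range in Unicode.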
References ¶
[0]: http://godoc.org/modernc.org/golex
[1]: http://en.wikipedia.org/wiki/Lexical_analysis
[2]: http://golang.org/cmd/yacc/
[3]: https://modernc.org/golex/blob/master/lex/example.l
[4]: http://golang.org/pkg/io/#RuneReader
[5]: https://modernc.org/golex/blob/master/lex/dfa
Index ¶
- Constants
- func DefaultRuneClass(r rune) int
- type Char
- type CharReader
- type Lexer
- func (l *Lexer) Abort() (int, bool)
- func (l *Lexer) Enter() int
- func (l *Lexer) Error(msg string)
- func (l *Lexer) Lookahead() Char
- func (l *Lexer) Mark()
- func (l *Lexer) Next() int
- func (l *Lexer) Offset() int
- func (l *Lexer) Rule0() int
- func (l *Lexer) Token() []Char
- func (l *Lexer) TokenBytes(builder func(*bytes.Buffer)) []byte
- func (l *Lexer) Unget(c ...Char)
- type Option
Constants ¶
const (
	BOMError       = iota // BOM is an error anywhere.
	BOMIgnoreFirst        // Skip BOM if at beginning, report as error if anywhere else.
	BOMPassAll            // No special handling of BOM.
	BOMPassFirst          // No special handling of BOM if at beginning, report as error if anywhere else.
)
BOM handling modes which can be set by the BOMMode Option. Default is BOMIgnoreFirst.
const (
	NonASCII = 0x80 // DefaultRuneClass returns NonASCII for non ASCII runes.
	RuneEOF  = -1   // Distinct from any valid Unicode rune value.
)
Functions ¶
func DefaultRuneClass ¶
DefaultRuneClass returns the character class of r. If r is an ASCII code then its class equals the ASCII code. Any other rune is of class NonASCII.
DefaultRuneClass is the default implementation Lexer will use to convert runes (21 bit entities) to scanner classes (8 bit entities).
Lexical analyzers that are aware of non-ASCII input will typically use their own categorization function. To assign such a custom function, use the RuneClass option.
Types ¶
type Char ¶
type Char struct {
	Rune rune
	// contains filtered or unexported fields
}
Char represents a rune and its position.
type CharReader ¶
CharReader is a RuneReader that additionally provides explicit position information by returning a Char instead of a rune as its first result.
type Lexer ¶
type Lexer struct {
	File  *token.File // The *token.File passed to New.
	First Char        // First remembers the lookahead char when Rule0 was invoked.
	Last  Char        // Last remembers the last Char returned by Next.
	Prev  Char        // Prev remembers the Char previous to Last.
	// contains filtered or unexported fields
}
Lexer supports golex[0] generated lexical analyzers.
func New ¶
New returns a new *Lexer. The result can be amended using opts.
Non Unicode Input ¶
To consume sources in other encodings and still have exact position information, pass an io.RuneReader that returns the next input character re-encoded as a Unicode rune but reports the size (the number of bytes used to encode it) of the original character, not the size of its UTF-8 representation after conversion. Size is the second value returned by the io.RuneReader.ReadRune method[4].
When src also implements CharReader, its ReadChar method is used instead of io.RuneReader.ReadRune.
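As an illustration of the size rule, here is a sketch of an io.RuneReader for ISO 8859-1 input. The type latin1Reader and the helper decodeLatin1 are hypothetical names, not part of this package. Every input byte becomes the rune with the same value, and the reported size is always 1, the width of the original encoding.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// latin1Reader is a hypothetical io.RuneReader for ISO 8859-1 input.
type latin1Reader struct {
	r io.ByteReader
}

// ReadRune returns the next byte as a rune. The size is 1 even for
// runes >= 0x80, which would occupy 2 bytes once encoded as UTF-8, so
// that offsets computed from it refer to the original source bytes.
func (l latin1Reader) ReadRune() (rune, int, error) {
	b, err := l.r.ReadByte()
	if err != nil {
		return 0, 0, err
	}
	return rune(b), 1, nil
}

// decodeLatin1 drains a latin1Reader over s and returns the decoded
// runes together with the sum of the reported sizes.
func decodeLatin1(s string) ([]rune, int) {
	var src io.RuneReader = latin1Reader{strings.NewReader(s)}
	var runes []rune
	total := 0
	for {
		r, size, err := src.ReadRune()
		if err != nil { // io.EOF ends the input
			return runes, total
		}
		runes = append(runes, r)
		total += size
	}
}

func main() {
	runes, total := decodeLatin1("caf\xe9") // 0xE9 is é in ISO 8859-1
	fmt.Printf("%q, %d source bytes\n", string(runes), total)
}
```

The decoded text would need 5 bytes as UTF-8, but the sizes add up to the 4 bytes of the original file, which is what keeps positions exact.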
func (*Lexer) Abort ¶
Abort handles the situation when the scanner does not successfully recognize any token, or when an attempt to find the longest match "overruns" from an accepting state only to never reach an accepting state again. In the first case the scanner was never in an accepting state since the last call to Rule0, and Abort returns (previous lookahead rune, true), effectively consuming a single Char token and avoiding a scanner stall. Otherwise there was at least one accepting scanner state marked using Mark. In this case Abort rolls the lexer state back to the marked state and returns (0, false). The scanner must then execute a prescribed goto statement. For example:
%yyc c
%yyn c = l.Next()
%yym l.Mark()

%{
package foo

import (...)

type lexer struct {
	*lex.Lexer
	...
}

func newLexer(...) *lexer {
	return &lexer{
		lex.New(...),
		...
	}
}

func (l *lexer) scan() int {
	c := l.Enter()
%}

... more lex definitions

%%

	c = l.Rule0()

... lex rules

%%

	if c, ok := l.Abort(); ok {
		return c
	}

	goto yyAction
}
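The two situations Abort distinguishes can be illustrated without golex. The following is a hand-written sketch, not the generated code and not this package's implementation. The scanner type, its fields and the helper scanAll are hypothetical. It recognizes [0-9]+ and [0-9]+\.[0-9]+ with longest-match semantics, remembers the last accepting position in mark, and either rolls back to it or consumes a single byte when nothing was ever accepted.

```go
package main

import "fmt"

// scanner is a hypothetical stand-in for a generated DFA.
type scanner struct {
	src  string
	pos  int // offset of the lookahead byte
	mark int // offset just past the last accepted match, -1 if none
}

// next scans one token starting at s.pos, which must be < len(s.src).
// ok is false when no rule matched and a single byte was consumed.
func (s *scanner) next() (tok string, ok bool) {
	start := s.pos
	s.mark = -1
	state := 0 // 0: start, 1: integer digits, 2: after '.', 3: fraction digits
loop:
	for s.pos < len(s.src) {
		c := s.src[s.pos]
		digit := c >= '0' && c <= '9'
		switch {
		case digit && state <= 1:
			state = 1
		case digit: // state 2 or 3
			state = 3
		case c == '.' && state == 1:
			state = 2
		default:
			break loop // no transition on c
		}
		s.pos++
		if state == 1 || state == 3 {
			s.mark = s.pos // accepting state: remember it
		}
	}
	if s.mark < 0 {
		// Never accepted: consume one byte so the scanner cannot stall.
		s.pos = start + 1
		return s.src[start:s.pos], false
	}
	// Possibly overran the last accepting state: roll back to it.
	s.pos = s.mark
	return s.src[start:s.pos], true
}

// scanAll splits src into matched tokens and unmatched single bytes.
func scanAll(src string) []string {
	s := &scanner{src: src}
	var out []string
	for s.pos < len(s.src) {
		tok, _ := s.next()
		out = append(out, tok)
	}
	return out
}

func main() {
	fmt.Printf("%q\n", scanAll("12.x"))
	fmt.Printf("%q\n", scanAll("3.14"))
}
```

On the input 12.x the scanner accepts 12, reads the dot hoping for a fraction, finds none and rolls back, so the dot is scanned again as the start of the next token.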
func (*Lexer) Enter ¶
Enter ensures the lexer has a valid lookahead Char and returns its class. Typical use in an .l file:
func (l *lexer) scan() lex.Char {
	c := l.Enter()
	...
func (*Lexer) Mark ¶
func (l *Lexer) Mark()
Mark records the current state of scanner as accepting. It implements the golex macro %yym. Typical usage in an .l file:
%yym l.Mark()
func (*Lexer) Next ¶
Next advances the scanner by one rune and returns the character class of the new lookahead. Typical usage in an .l file:
%yyn c = l.Next()
func (*Lexer) Rule0 ¶
Rule0 initializes the scanner state before the attempt to recognize a token starts. The token collecting buffer is cleared. Rule0 records the current lookahead in l.First and returns its class. Typical usage in an .l file:
... lex definitions

%%

	c := l.Rule0()

first-pattern-regexp
func (*Lexer) TokenBytes ¶
TokenBytes returns the UTF-8 encoding of Token. If builder is not nil, it is called instead to build the encoded token bytes into the buffer passed to it.
The result is read-only.
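The role of the builder argument can be sketched with a standalone analogue. The functions tokenBytes and upperBytes below are hypothetical and operate on a plain []rune rather than on a Lexer. They only mirror the documented contract: with a nil builder the runes are encoded as UTF-8, otherwise the builder alone decides what goes into the buffer.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode"
)

// tokenBytes encodes tok as UTF-8, or lets builder fill the buffer
// instead when builder is not nil.
func tokenBytes(tok []rune, builder func(*bytes.Buffer)) []byte {
	var buf bytes.Buffer
	if builder != nil {
		builder(&buf)
		return buf.Bytes()
	}
	for _, r := range tok {
		buf.WriteRune(r)
	}
	return buf.Bytes()
}

// upperBytes uses a builder to produce an upper-cased encoding of tok,
// the kind of normalization a case-insensitive keyword rule might want.
func upperBytes(tok []rune) []byte {
	return tokenBytes(tok, func(buf *bytes.Buffer) {
		for _, r := range tok {
			buf.WriteRune(unicode.ToUpper(r))
		}
	})
}

func main() {
	tok := []rune("héllo")
	fmt.Printf("%s %s\n", tokenBytes(tok, nil), upperBytes(tok))
}
```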
type Option ¶
Option is a function which can be passed as an optional argument to New.
func BOMMode ¶
BOMMode option selects how the lexer handles BOMs. See the BOM* constants for details.
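Options of this kind follow the functional options pattern. The sketch below shows the pattern in isolation. The names scannerConfig, option, withBOMMode and newConfig are hypothetical, and having an option return an error is an assumption of the sketch, not something stated above. The lower-case constants mirror the documented BOM* values.

```go
package main

import (
	"errors"
	"fmt"
)

// BOM handling modes, mirroring the documented constants.
const (
	bomError = iota
	bomIgnoreFirst
	bomPassAll
	bomPassFirst
)

// scannerConfig is a hypothetical stand-in for the state a lexer keeps.
type scannerConfig struct {
	bomMode int
}

// option adjusts a configuration under construction. Returning an
// error is an assumption of this sketch.
type option func(*scannerConfig) error

// withBOMMode returns an option selecting one of the BOM modes.
func withBOMMode(mode int) option {
	return func(c *scannerConfig) error {
		if mode < bomError || mode > bomPassFirst {
			return errors.New("unknown BOM mode")
		}
		c.bomMode = mode
		return nil
	}
}

// newConfig starts from the documented default (ignore a leading BOM)
// and applies opts in order, stopping at the first error.
func newConfig(opts ...option) (*scannerConfig, error) {
	c := &scannerConfig{bomMode: bomIgnoreFirst}
	for _, o := range opts {
		if err := o(c); err != nil {
			return nil, err
		}
	}
	return c, nil
}

func main() {
	c, err := newConfig(withBOMMode(bomPassAll))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("BOM mode:", c.bomMode)
}
```

The appeal of the pattern is that a constructor can grow new settings without changing its signature, and callers who pass no options get the defaults.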