Documentation ¶
Overview ¶
Package lexer implements a simple lexing toolkit.
Index ¶
- Constants
- type Channel
- type Iterator
- type LexInner
- func (l *LexInner) Accept(valid string) bool
- func (l *LexInner) AcceptRun(valid string) (acceptnum int)
- func (l *LexInner) Back()
- func (l *LexInner) Bytes(number int) bool
- func (l *LexInner) Emit(typ TokenType)
- func (l *LexInner) EmitEof() StateFn
- func (l *LexInner) EmitString(typ TokenType, str string)
- func (l *LexInner) Eof() bool
- func (l *LexInner) Errorf(format string, args ...interface{}) StateFn
- func (l *LexInner) Except(valid string) bool
- func (l *LexInner) ExceptRun(valid string) (acceptnum int)
- func (l *LexInner) Find(valid string) bool
- func (l *LexInner) Get() string
- func (l *LexInner) Ignore()
- func (l *LexInner) Last() rune
- func (l *LexInner) Len() int
- func (l *LexInner) Mark() Mark
- func (l *LexInner) Next() (char rune)
- func (l *LexInner) One(f func(rune) bool) bool
- func (l *LexInner) Peek() rune
- func (l *LexInner) Replace(start Mark, with string)
- func (l *LexInner) ReplaceGet() string
- func (l *LexInner) Retry()
- func (l *LexInner) Run(f func(rune) bool) (acceptnum int)
- func (l *LexInner) Skip(n int) int
- func (l *LexInner) String(valid string) bool
- func (l *LexInner) Unmark(mark Mark)
- func (l *LexInner) Warningf(format string, args ...interface{})
- func (l *LexInner) Whitespace(except string) (acceptnum int)
- type Lexer
- type Mark
- type Replacer
- type StateFn
- type Token
- type TokenType
Examples ¶
Constants ¶
const Eof rune = -1
This is returned by Next when there are no more characters to read.
const Err rune = utf8.RuneError
This is returned when a bad rune is encountered.
const MaxEmitsInFunction = 10
The maximum number of emits in a single state function when using Token. If this number has been reached, Token returns a StateError. If you wish to emit more than this, use the Go method to read tokens off the channel directly.
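A minimal sketch of reading tokens off the channel directly; the input string and the state function stateStart are placeholders supplied by user code, not part of the package:

// Sketch only: stateStart is a user-written StateFn, input is the text to lex.
l := lexer.New("myfile", input, stateStart)
for tok := range l.Go() {
	fmt.Printf("%d %q\n", tok.Typ, tok.Val)
}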
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Iterator ¶
type Iterator struct {
// contains filtered or unexported fields
}
Generates tokens synchronously. See Lexer.Iterate.
type LexInner ¶
type LexInner struct {
// contains filtered or unexported fields
}
LexInner is the inner type which is used within StateFn to do the actual lexing.
func (*LexInner) Accept ¶
Read one character, but only if it is one of the characters in the given string.
func (*LexInner) AcceptRun ¶
Read as many characters as possible, but only characters that exist in the given string.
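As a rough sketch of how Accept and AcceptRun combine, a hypothetical state function (tokenNumber and stateBase are user-defined names, not part of the package) could lex an optionally signed integer:

// Sketch only: tokenNumber and stateBase are assumed to be defined by user code.
func stateNumber(l *lexer.LexInner) lexer.StateFn {
	l.Accept("+-") // optionally accept a single sign character
	if l.AcceptRun("0123456789") == 0 {
		return l.Errorf("expected at least one digit")
	}
	l.Emit(tokenNumber)
	return stateBase
}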
func (*LexInner) Back ¶
func (l *LexInner) Back()
Undo the last Next. This probably won't work after calling any other lexer functions. If you need to undo more, use Mark and Unmark.
func (*LexInner) Bytes ¶ added in v0.2.5
Consume the given number of bytes. Returns true if successful, false if there are not enough bytes.
func (*LexInner) Emit ¶
Emit the gathered token, given its type. Emits the result of ReplaceGet, then calls Ignore.
func (*LexInner) EmitString ¶
Emit a token with the given type and string.
func (*LexInner) Except ¶
Read one character, but only if it is NOT one of the characters in the given string. If Eof or Err is reached, Except fails regardless of what the given string is.
func (*LexInner) ExceptRun ¶
Read as many characters as possible, but only characters that do NOT exist in the given string. If Eof is reached, ExceptRun stops as though it found a successful character. Thus, ExceptRun("") accepts everything until Eof or Err.
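For illustration, a sketch of gathering everything up to a delimiter with ExceptRun (tokenWord and stateBase are hypothetical user-defined names):

// Sketch only: tokenWord and stateBase are assumed to be defined by user code.
func stateWord(l *lexer.LexInner) lexer.StateFn {
	if l.ExceptRun(" \t\n") == 0 { // read up to the next whitespace character
		return l.Errorf("expected a word")
	}
	l.Emit(tokenWord)
	return stateBase
}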
func (*LexInner) Find ¶
Accepts characters until the first occurrence of the given string. The string itself is not accepted.
func (*LexInner) Ignore ¶
func (l *LexInner) Ignore()
Ignore everything gathered about the token so far. Also removes any Replaces.
func (*LexInner) Next ¶
Read a single character. If there are no more characters, it will return Eof. If a non-utf8 character is read, it will return Err.
func (*LexInner) One ¶
Accept a single character and return true if f returns true. Otherwise, do nothing and return false.
func (*LexInner) Replace ¶ added in v0.2.2
Replace the text from the start Mark to the current position with the given string. The replacement string with may be a different length than the text being replaced, but this change will not be reflected by functions like Len and Get. Call ReplaceGet to get the token including its replaces; this is how it will be sent by Emit. The replace is part of the current Mark, so Unmarking to a point before a replace was done will remove the replace.
func (*LexInner) ReplaceGet ¶ added in v0.2.2
Get the current token with all replaces included. This can be expensive if you have many replaces. Without any replaces, it is identical to Get.
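A rough sketch of how Replace might be used to normalize an escape sequence while lexing; stateString is a placeholder state and only "\n" is handled:

// Sketch only: stateString is assumed to be defined by user code.
func stateEscape(l *lexer.LexInner) lexer.StateFn {
	start := l.Mark() // remember where the escape sequence begins
	if l.Accept("\\") && l.Accept("n") {
		l.Replace(start, "\n") // the emitted token will contain a real newline
	}
	return stateString
}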
func (*LexInner) Run ¶
Reads characters and feeds them to the given function, and keeps reading until it returns false.
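For instance, Run pairs naturally with the unicode predicates; in this sketch tokenIdent and stateBase are hypothetical user-defined names:

// Sketch only: tokenIdent and stateBase are assumed to be defined by user code.
func stateIdent(l *lexer.LexInner) lexer.StateFn {
	if l.Run(unicode.IsLetter) == 0 {
		return l.Errorf("expected an identifier")
	}
	l.Emit(tokenIdent)
	return stateBase
}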
func (*LexInner) Skip ¶
Read n characters. Returns the number of characters read. If it returns less than n, it will have reached Eof.
func (*LexInner) String ¶
Attempt to read a string. Only if the entire string is successfully accepted does it return true. If only a part of the string was matched, none of it is consumed.
func (*LexInner) Whitespace ¶
Accepts any whitespace (unicode.IsSpace), except for whitespace in except. For instance, Whitespace("\n") will accept all whitespace except newlines. Returns the number of runes read.
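As a sketch, a line-oriented grammar might skip blanks and tabs but keep newlines significant; tokenNewline and stateBase are hypothetical user-defined names:

// Sketch only: tokenNewline and stateBase are assumed to be defined by user code.
func stateSpace(l *lexer.LexInner) lexer.StateFn {
	l.Whitespace("\n") // accept all whitespace except newlines
	l.Ignore()         // discard the skipped whitespace
	if l.Accept("\n") {
		l.Emit(tokenNewline) // newlines are significant in this hypothetical grammar
	}
	return stateBase
}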
type Lexer ¶
type Lexer struct {
// contains filtered or unexported fields
}
Lexer is the external type which emits tokens.
Example ¶
package main

import (
	"fmt"
	"unicode"

	"github.com/PieterD/lexer"
)

const (
	tokenComment lexer.TokenType = 1 + iota
	tokenVariable
	tokenAssign
	tokenNumber
	tokenString
)

func main() {
	text := `
/* comment */
pie=314
// comment
string = "Hello world!"
`
	l := lexer.New("filename", text, state_base)
	tokenchan := l.Go()
	for token := range tokenchan {
		fmt.Printf("%s:%d [%d]\"%s\"\n", token.File, token.Line, token.Typ, token.Val)
	}
}

// Start parsing with this.
func state_base(l *lexer.LexInner) lexer.StateFn {
	// Ignore all whitespace.
	l.Run(unicode.IsSpace)
	l.Ignore()
	if l.String("//") {
		// We're remembering the '//' here so it gets included in the Emit
		// contained in state_comment_line.
		return state_comment_line
	}
	if l.String("/*") {
		return state_comment_block(state_base)
	}
	if l.Eof() {
		return l.EmitEof()
	}
	// It's not a comment or Eof, so it must be a variable name.
	return state_variable
}

// Parse a line comment.
func state_comment_line(l *lexer.LexInner) lexer.StateFn {
	// Eat up everything until end of line (or Eof)
	l.ExceptRun("\n")
	l.Emit(tokenComment)
	// Consume the end of line. If we reached Eof, this does nothing.
	l.Accept("\n")
	// Ignore that last newline
	l.Ignore()
	return state_base
}

// Parse a block comment.
// Since block comments may appear in different states,
// instead of defining the usual StateFn we define a function that
// returns a statefn, which in turn will return the parent state
// after its parsing is done.
func state_comment_block(parent lexer.StateFn) lexer.StateFn {
	return func(l *lexer.LexInner) lexer.StateFn {
		if !l.Find("*/") {
			// If closing statement couldn't be found, emit an error.
			// Errorf always returns nil, so parsing is done after this.
			return l.Errorf("Couldn't find end of block comment")
		}
		l.String("*/")
		l.Emit(tokenComment)
		return parent
	}
}

// Parse a variable name
func state_variable(l *lexer.LexInner) lexer.StateFn {
	if l.AcceptRun("abcdefghijklmnopqrstuvwxyz") == 0 {
		return l.Errorf("Invalid variable name")
	}
	l.Emit(tokenVariable)
	return state_operator
}

// Parse an assignment operator
func state_operator(l *lexer.LexInner) lexer.StateFn {
	l.Run(unicode.IsSpace)
	l.Ignore()
	if l.Accept("=") {
		l.Emit(tokenAssign)
		return state_value
	}
	return l.Errorf("Only '=' is a valid operator")
}

// Parse a value
func state_value(l *lexer.LexInner) lexer.StateFn {
	l.Run(unicode.IsSpace)
	l.Ignore()
	if l.AcceptRun("0123456789") > 0 {
		l.Emit(tokenNumber)
		return state_base
	}
	if l.Accept("\"") {
		return state_string
	}
	return l.Errorf("Unidentified value")
}

// Parse a string
func state_string(l *lexer.LexInner) lexer.StateFn {
	for {
		l.ExceptRun("\"\\")
		// Now we're either at a ", a \, or Eof.
		if l.Accept("\"") {
			l.Emit(tokenString)
			return state_base
		}
		if l.Accept("\\") {
			if !l.Accept("nrt\"'\\") {
				return l.Errorf("Invalid escape sequence: \"\\%c\"", l.Last())
			}
		}
		if l.Eof() {
			return l.Errorf("No closing '\"' found")
		}
	}
}
Output:

filename:2 [1]"/* comment */"
filename:3 [2]"pie"
filename:3 [3]"="
filename:3 [4]"314"
filename:4 [1]"// comment"
filename:5 [2]"string"
filename:5 [3]"="
filename:5 [5]""Hello world!""
filename:5 [-3]"EOF"
func (*Lexer) Go ¶
Spawn a goroutine which keeps sending tokens on the returned channel, until TokenEmpty would be encountered. If Go or Iterate has already been called, it will return nil.
type Mark ¶
type Mark struct {
// contains filtered or unexported fields
}
The Mark type (used by Mark and Unmark) can be used to save the current state of the lexer, and restore it later.
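A sketch of backtracking with Mark and Unmark; tokenKeyword, stateBase and stateIdent are made-up names for illustration:

// Sketch only: tokenKeyword, stateBase and stateIdent are assumed to be defined by user code.
func stateMaybeKeyword(l *lexer.LexInner) lexer.StateFn {
	m := l.Mark()
	if l.String("func") && !unicode.IsLetter(l.Peek()) {
		l.Emit(tokenKeyword) // it really was the keyword
		return stateBase
	}
	l.Unmark(m) // just the start of a longer identifier; restore the saved state
	return stateIdent
}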
type Token ¶
Tokens are emitted by the lexer. They contain a (usually) user-defined Typ, the Val of the token, and the File name and Line number where the token was generated.
type TokenType ¶
type TokenType int
TokenType is an integer representing the type of token that has been emitted. Most TokenTypes will be user-defined, and user-defined types must be greater than 0. Other than TokenEmpty, which is read when there is absolutely nothing left to read or when the channel is closed, the package-defined Error, Warning and EOF tokens are only generated by emitting them manually, or by invoking their corresponding Emit* functions.
const (
	// TokenEmpty is the TokenType with value 0.
	// Any zero-valued token will have this as its Typ.
	// It is also returned when the lexer has stopped (by an error, or Eof)
	TokenEmpty TokenType = -iota
	// TokenError is the Typ for errors reported by, for example, Lexer.Errorf.
	TokenError
	// TokenWarning is the Typ for warnings.
	TokenWarning
	// TokenEOF should be returned once per file, when the end of file has been reached.
	// This is not done automatically!
	TokenEOF
)
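A sketch of distinguishing these token types while ranging over the channel returned by Go; the lexer l and its state function are assumed to have been set up by user code as in the example above:

// Sketch only: l is a *lexer.Lexer created with a user-supplied state function.
for tok := range l.Go() {
	switch tok.Typ {
	case lexer.TokenError:
		fmt.Printf("%s:%d: error: %s\n", tok.File, tok.Line, tok.Val)
	case lexer.TokenWarning:
		fmt.Printf("%s:%d: warning: %s\n", tok.File, tok.Line, tok.Val)
	case lexer.TokenEOF:
		fmt.Println("end of input")
	default:
		fmt.Printf("token %d: %q\n", tok.Typ, tok.Val) // a user-defined type (> 0)
	}
}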