Documentation ¶
Overview ¶
Package tokenizer is a Go port of https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb
In their words:
# Generic programming language tokenizer.
#
# Tokens are designed for use in the language bayes classifier.
# It strips any data strings or comments and preserves significant
# language symbols.
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
    // ByteLimit is the maximum input length for Tokenize().
    ByteLimit = 100000

    // StartLineComments is converted into its regexp slice counterpart
    // by this package's init() function.
    StartLineComments = []string{
        "\"",
        "%",
    }

    // SingleLineComments is converted into its regexp slice counterpart
    // by this package's init() function.
    SingleLineComments = []string{
        "//",
        "--",
        "#",
    }

    // MultiLineComments is converted into its regexp slice counterpart
    // by this package's init() function.
    MultiLineComments = [][]string{
        {"/*", "*/"},
        {"<!--", "-->"},
        {"{-", "-}"},
        {"(*", "*)"},
        {`"""`, `"""`},
        {"'''", "'''"},
        {"#`(", ")"},
    }
)
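The generated regexp slices are unexported and not shown on this page. A minimal sketch of how such an init() might build them; the variable names and the exact patterns (e.g. anchoring at start of line) are assumptions, not the package's actual code:

package tokenizer

import "regexp"

// Hypothetical compiled forms of the exported delimiter lists above.
var (
    startLineRes  []*regexp.Regexp
    singleLineRes []*regexp.Regexp
    multiLineRes  [][]*regexp.Regexp
)

func init() {
    for _, s := range StartLineComments {
        // Match a comment marker at the start of a line, e.g. "%" -> ^\s*%
        startLineRes = append(startLineRes, regexp.MustCompile(`^\s*`+regexp.QuoteMeta(s)))
    }
    for _, s := range SingleLineComments {
        singleLineRes = append(singleLineRes, regexp.MustCompile(`^\s*`+regexp.QuoteMeta(s)))
    }
    for _, pair := range MultiLineComments {
        // Keep begin/end delimiters paired so the closing regexp can be
        // handed back when the opening regexp matches.
        multiLineRes = append(multiLineRes, []*regexp.Regexp{
            regexp.MustCompile(regexp.QuoteMeta(pair[0])),
            regexp.MustCompile(regexp.QuoteMeta(pair[1])),
        })
    }
}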
Functions ¶
func FindMultiLineComment ¶
FindMultiLineComment compares a given token against the opening delimiters of the known multiline comments. On a match it returns true together with a regexp that matches the corresponding closing delimiter; otherwise it returns false and nil.
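The signature is not reproduced on this page; assuming it is roughly FindMultiLineComment(token []byte) (bool, *regexp.Regexp), a caller might use it like this sketch to detect a block-comment opener and then watch for its terminator:

package main

import (
    "fmt"

    "path/to/tokenizer" // import path not shown on this page
)

func main() {
    // Assumed signature and return order; only the behavior described
    // above (a bool plus a terminator regexp) is documented.
    ok, closer := tokenizer.FindMultiLineComment([]byte("/*"))
    if !ok {
        fmt.Println("token does not open a multiline comment")
        return
    }
    // closer should match the paired terminator, "*/" for this opener.
    fmt.Println(closer.MatchString("end of a comment */")) // expected: true
}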
func Tokenize ¶
Tokenize is a simple tokenizer that uses bufio.Scanner to process lines and individual words, matching them against regular expressions to filter out comments, strings, and numerals, in a manner very similar to GitHub's linguist (see https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb).
The intention is merely to retrieve significant tokens from a piece of source code in order to identify its programming language using statistical analysis, and NOT to serve as any part of a compilation process whatsoever.
NOTE(tso): The tokens produced by this function may be of dubious quality due to the approach taken. Feedback and alternate implementations welcome :)
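A minimal usage sketch; the signature Tokenize(input []byte) []string and the import path are assumptions, since neither is reproduced on this page:

package main

import (
    "fmt"
    "os"

    "path/to/tokenizer" // import path not shown on this page
)

func main() {
    src, err := os.ReadFile("main.go")
    if err != nil {
        panic(err)
    }
    // Assumed signature: Tokenize(input []byte) []string.
    // ByteLimit (see Variables) caps the input length Tokenize will process.
    for _, tok := range tokenizer.Tokenize(src) {
        fmt.Println(tok)
    }
}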
Types ¶
This section is empty.