Documentation ¶
Overview ¶
Go port of https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb
In their words:
# Generic programming language tokenizer.
#
# Tokens are designed for use in the language bayes classifier.
# It strips any data strings or comments and preserves significant
# language symbols.
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
	// Maximum input length for Tokenize()
	ByteLimit = 100000

	// NOTE(tso): these string slices are turned into their regexp slice counterparts
	// by this package's init() function.
	StartLineComments = []string{
		"\"",
		"%",
	}
	SingleLineComments = []string{
		"//",
		"--",
		"#",
	}
	MultiLineComments = [][]string{
		[]string{"/*", "*/"},
		[]string{"<!--", "-->"},
		[]string{"{-", "-}"},
		[]string{"(*", "*)"},
		[]string{`"""`, `"""`},
		[]string{"'''", "'''"},
		[]string{"#`(", ")"},
	}
	StartLineComment       []*regexp.Regexp
	BeginSingleLineComment []*regexp.Regexp
	BeginMultiLineComment  []*regexp.Regexp
	EndMultiLineComment    []*regexp.Regexp
	String                 = regexp.MustCompile(`[^\\]*(["'` + "`])")
	Shebang                = regexp.MustCompile(`#!.*$`)
	Number                 = regexp.MustCompile(`(0x[0-9a-f]([0-9a-f]|\.)*|\d(\d|\.)*)([uU][lL]{0,2}|([eE][-+]\d*)?[fFlL]*)`)
)
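The NOTE above says the string slices are compiled into regexp slices by init(), but the page does not show how. The following is only a sketch of one plausible approach, not the package's actual init(): compileMarkers and matchesAny are hypothetical helpers, and anchoring each quoted marker at the start of a token is an assumption. SingleLineComments, Number, and Shebang are copied from the variable block.

```go
package main

import (
	"fmt"
	"regexp"
)

// Copied verbatim from the package's variable block.
var (
	SingleLineComments = []string{"//", "--", "#"}
	Shebang            = regexp.MustCompile(`#!.*$`)
	Number             = regexp.MustCompile(`(0x[0-9a-f]([0-9a-f]|\.)*|\d(\d|\.)*)([uU][lL]{0,2}|([eE][-+]\d*)?[fFlL]*)`)
)

// compileMarkers is a hypothetical helper (an assumption, not the real init()):
// it escapes each literal marker and anchors it at the start of the token.
func compileMarkers(markers []string) []*regexp.Regexp {
	out := make([]*regexp.Regexp, 0, len(markers))
	for _, m := range markers {
		out = append(out, regexp.MustCompile(`^`+regexp.QuoteMeta(m)))
	}
	return out
}

// matchesAny reports whether at least one pattern matches the token.
func matchesAny(patterns []*regexp.Regexp, token string) bool {
	for _, re := range patterns {
		if re.MatchString(token) {
			return true
		}
	}
	return false
}

func main() {
	begin := compileMarkers(SingleLineComments)
	fmt.Println(matchesAny(begin, "//comment")) // token starts with a comment marker
	fmt.Println(matchesAny(begin, "x := 1"))    // no marker at the start
	fmt.Println(Number.MatchString("0x1f"))
	fmt.Println(Shebang.FindString("#!/usr/bin/env python"))
}
```

regexp.QuoteMeta matters here because markers such as "/*" and "(*" contain regexp metacharacters. Note also that Number is unanchored as declared, so it matches any input that merely contains a digit.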
Functions ¶
func FindMultiLineComment ¶
If the given token matches the start of a multi-line comment, FindMultiLineComment returns true and a regexp that matches the corresponding closing token; otherwise it returns false and nil.
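The described behavior can be illustrated with a small stand-in. This is a sketch under assumptions, not the package's implementation: findMultiLineComment is a hypothetical lowercase name, its string parameter and return order are guesses (the page does not show the real signature), and it matches opening markers by plain prefix rather than through the precompiled BeginMultiLineComment slice.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Opening/closing pairs copied from the package's MultiLineComments variable.
var MultiLineComments = [][]string{
	{"/*", "*/"},
	{"<!--", "-->"},
	{"{-", "-}"},
	{"(*", "*)"},
	{`"""`, `"""`},
	{"'''", "'''"},
	{"#`(", ")"},
}

// findMultiLineComment is a hypothetical stand-in: if token begins with an
// opening marker, it returns true and a regexp for the matching closer;
// otherwise it returns false and nil.
func findMultiLineComment(token string) (bool, *regexp.Regexp) {
	for _, pair := range MultiLineComments {
		if strings.HasPrefix(token, pair[0]) {
			return true, regexp.MustCompile(regexp.QuoteMeta(pair[1]))
		}
	}
	return false, nil
}

func main() {
	ok, end := findMultiLineComment("/*")
	fmt.Println(ok, end.MatchString("done */"))

	ok, end = findMultiLineComment("foo")
	fmt.Println(ok, end == nil)
}
```

Returning the closing pattern alongside the boolean lets a caller skip tokens until that pattern matches, without having to look the pair up a second time.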
func Tokenize ¶
Tokenize is a simple tokenizer that uses bufio.Scanner to process lines and individual words, matching them against regular expressions to filter out comments, strings, and numerals in a manner very similar to GitHub's linguist (see https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb).
The intention is merely to retrieve significant tokens from a piece of source code in order to identify the programming language using statistical analysis. It is NOT meant to be used as any part of a compilation process.
NOTE(tso): The tokens produced by this function may be of a dubious quality due to the approach taken. Feedback and alternate implementations welcome :)
Types ¶
This section is empty.