Documentation ¶
Overview ¶
Package tokenizer is a Go port of https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb
In their words:
# Generic programming language tokenizer.
#
# Tokens are designed for use in the language bayes classifier.
# It strips any data strings or comments and preserves significant
# language symbols.
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
    // ByteLimit is the maximum input length for Tokenize().
    ByteLimit = 100000

    // StartLineComments is converted into its regexp slice counterpart
    // by this package's init() function.
    StartLineComments = []string{
        "\"",
        "%",
    }

    // SingleLineComments is converted into its regexp slice counterpart
    // by this package's init() function.
    SingleLineComments = []string{
        "//",
        "--",
        "#",
    }

    // MultiLineComments is converted into its regexp slice counterpart
    // by this package's init() function.
    MultiLineComments = [][]string{
        {"/*", "*/"},
        {"<!--", "-->"},
        {"{-", "-}"},
        {"(*", "*)"},
        {`"""`, `"""`},
        {"'''", "'''"},
        {"#`(", ")"},
    }
)
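The generated regexp slices are unexported and not shown on this page. A minimal sketch of how such an init() might build them; the variable names and the exact patterns (e.g. anchoring at start of line) are assumptions, not the package's actual code:

package tokenizer

import "regexp"

// Hypothetical compiled forms of the exported delimiter lists above.
var (
    startLineRes  []*regexp.Regexp
    singleLineRes []*regexp.Regexp
    multiLineRes  [][]*regexp.Regexp
)

func init() {
    for _, s := range StartLineComments {
        // Match a comment marker at the start of a line, e.g. "%" -> ^\s*%
        startLineRes = append(startLineRes, regexp.MustCompile(`^\s*`+regexp.QuoteMeta(s)))
    }
    for _, s := range SingleLineComments {
        singleLineRes = append(singleLineRes, regexp.MustCompile(`^\s*`+regexp.QuoteMeta(s)))
    }
    for _, pair := range MultiLineComments {
        // Keep begin/end delimiters paired so the closing regexp can be
        // handed back when the opening regexp matches.
        multiLineRes = append(multiLineRes, []*regexp.Regexp{
            regexp.MustCompile(regexp.QuoteMeta(pair[0])),
            regexp.MustCompile(regexp.QuoteMeta(pair[1])),
        })
    }
}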
Functions ¶
func FindMultiLineComment ¶
FindMultiLineComment compares a given token against the opening delimiters of the known multiline comments. On a match it returns true together with a regexp that matches the corresponding closing delimiter; otherwise it returns false and nil.
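The signature is not reproduced on this page; assuming it is roughly FindMultiLineComment(token []byte) (bool, *regexp.Regexp), a caller might use it like this sketch to detect a block-comment opener and then watch for its terminator:

package main

import (
    "fmt"

    "path/to/tokenizer" // import path not shown on this page
)

func main() {
    // Assumed signature and return order; only the behavior described
    // above (a bool plus a terminator regexp) is documented.
    ok, closer := tokenizer.FindMultiLineComment([]byte("/*"))
    if !ok {
        fmt.Println("token does not open a multiline comment")
        return
    }
    // closer should match the paired terminator, "*/" for this opener.
    fmt.Println(closer.MatchString("end of a comment */")) // expected: true
}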
func Tokenize ¶
Tokenize is a simple tokenizer that uses bufio.Scanner to process lines and individual words, matching them against regular expressions to filter out comments, strings, and numerals, in a manner very similar to GitHub's linguist (see https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb).
The intention is merely to retrieve significant tokens from a piece of source code in order to identify its programming language using statistical analysis, and NOT to serve as any part of a compilation process whatsoever.
NOTE(tso): The tokens produced by this function may be of dubious quality due to the approach taken. Feedback and alternate implementations welcome :)
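A minimal usage sketch; the signature Tokenize(input []byte) []string and the import path are assumptions, since neither is reproduced on this page:

package main

import (
    "fmt"
    "os"

    "path/to/tokenizer" // import path not shown on this page
)

func main() {
    src, err := os.ReadFile("main.go")
    if err != nil {
        panic(err)
    }
    // Assumed signature: Tokenize(input []byte) []string.
    // ByteLimit (see Variables) caps the input length Tokenize will process.
    for _, tok := range tokenizer.Tokenize(src) {
        fmt.Println(tok)
    }
}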
Types ¶
This section is empty.