tokenizer

package
v0.13.0
Published: Apr 11, 2018 License: MIT Imports: 3 Imported by: 10

Documentation

Overview

Package tokenizer is a Go port of https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb

In their words:

# Generic programming language tokenizer.
#
# Tokens are designed for use in the language bayes classifier.
# It strips any data strings or comments and preserves significant
# language symbols.

Index

Constants

This section is empty.

Variables

var (
	// ByteLimit is the maximum input length for Tokenize()
	ByteLimit = 100000

	// StartLineComments lists start-of-line comment markers; this package's
	// init() function turns it into a matching slice of regexps.
	StartLineComments = []string{
		"\"",
		"%",
	}
	// SingleLineComments lists single-line comment markers; this package's
	// init() function turns it into a matching slice of regexps.
	SingleLineComments = []string{
		"//",
		"--",
		"#",
	}
	// MultiLineComments lists multiline comment start/end delimiter pairs; this
	// package's init() function turns it into matching pairs of regexps.
	MultiLineComments = [][]string{
		{"/*", "*/"},
		{"<!--", "-->"},
		{"{-", "-}"},
		{"(*", "*)"},
		{`"""`, `"""`},
		{"'''", "'''"},
		{"#`(", ")"},
	}
)
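
These variables hold only the raw string markers. As a rough illustration of the init-time conversion the comments describe (a standalone sketch, not this package's actual code; the ^\s* anchoring is an assumption), the markers could be compiled into regexps like so:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The same markers as StartLineComments above, compiled one regexp per
	// string so that whole comment lines can be recognized and discarded.
	markers := []string{`"`, "%"}
	var res []*regexp.Regexp
	for _, m := range markers {
		res = append(res, regexp.MustCompile(`^\s*`+regexp.QuoteMeta(m)))
	}
	fmt.Println(res[1].MatchString("% a TeX-style comment line")) // true
}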

Functions

func FindMultiLineComment

func FindMultiLineComment(token []byte) (matched bool, terminator *regexp.Regexp)

FindMultiLineComment reports whether the given token begins a multiline comment. If it does, it returns true along with a regexp that matches the corresponding terminator; otherwise it returns false and nil.
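
For illustration, a hedged usage sketch; the import path github.com/generaltso/linguist/tokenizer is an assumption and may differ for your module:

package main

import (
	"fmt"

	// Assumed import path; substitute wherever this package actually lives.
	"github.com/generaltso/linguist/tokenizer"
)

func main() {
	if matched, terminator := tokenizer.FindMultiLineComment([]byte("/*")); matched {
		// terminator should match the corresponding closing delimiter, "*/".
		fmt.Println(terminator.MatchString("end of a comment */"))
	}
}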

func Tokenize

func Tokenize(input []byte) (tokens []string)

Tokenize is a simple tokenizer that uses a bufio.Scanner to process lines and individual words, matching them against regular expressions to filter out comments, strings, and numerals, in a manner very similar to GitHub's Linguist (see https://github.com/github/linguist/blob/master/lib/linguist/tokenizer.rb).

The intention is merely to retrieve significant tokens from a piece of source code so that the programming language can be identified through statistical analysis; it is NOT meant to be used in any part of the compilation process.

NOTE(tso): The tokens produced by this function may be of a dubious quality due to the approach taken. Feedback and alternate implementations welcome :)
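
A minimal end-to-end sketch, under the same import-path assumption as the example above; the exact tokens printed depend on the package's filtering rules:

package main

import (
	"fmt"

	// Assumed import path; substitute wherever this package actually lives.
	"github.com/generaltso/linguist/tokenizer"
)

func main() {
	src := []byte(`// add returns the sum of two ints.
func add(a, b int) int {
	return a + b
}`)

	// Comments, string data, and numerals are stripped; the surviving tokens
	// are the kind of input a naive Bayes language classifier trains on.
	for _, tok := range tokenizer.Tokenize(src) {
		fmt.Println(tok)
	}
}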

Types

This section is empty.
