Documentation
Index ¶
- func Annotation(bs []byte) int
- func CleanupStream(in <-chan string, out chan<- *CleanupResult, wn int)
- func IsVirus(data []byte) bool
- func NoParse(data []byte) bool
- func NormalizeHybridChar(bs []byte) []byte
- func StripTags(s string) string
- func UnderscoreToSpace(bs []byte) (bool, error)
- func VirusLikeName(name string) bool
- type CleanupResult
- type Preprocessor
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Annotation ¶
Annotation returns the index where the unparsed part starts. If the full string can be parsed, it returns the index of the end of the input.
func CleanupStream ¶
func CleanupStream(in <-chan string, out chan<- *CleanupResult, wn int)
CleanupStream takes an input channel of name strings and an output channel of *CleanupResult values. For each name received from in, it sends a result to out that pairs the original name (Input) with the cleaned-up name (Output).
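The channel-based flow described above can be sketched as a small worker pool. This is a minimal illustration, not the package's implementation: cleanupStream, cleanupResult, runCleanup, and the clean parameter are hypothetical names, and the sketch assumes that wn is the number of worker goroutines and that the output channel is closed once the input is drained.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// cleanupResult mirrors the two documented fields of CleanupResult.
type cleanupResult struct {
	Input  string
	Output string
}

// cleanupStream is a hypothetical stand-in for CleanupStream.
// Assumptions: wn is the number of workers, and out is closed when in is drained.
// clean is a placeholder for the real tag-removal step.
func cleanupStream(in <-chan string, out chan<- *cleanupResult, wn int, clean func(string) string) {
	var wg sync.WaitGroup
	wg.Add(wn)
	for i := 0; i < wn; i++ {
		go func() {
			defer wg.Done()
			for name := range in {
				out <- &cleanupResult{Input: name, Output: clean(name)}
			}
		}()
	}
	wg.Wait()
	close(out)
}

// runCleanup feeds names through cleanupStream and collects the results keyed by input.
func runCleanup(names []string, wn int) map[string]string {
	in := make(chan string)
	out := make(chan *cleanupResult)
	// A toy cleaner that only removes italic tags.
	clean := func(s string) string {
		return strings.NewReplacer("<i>", "", "</i>", "").Replace(s)
	}
	go cleanupStream(in, out, wn, clean)
	go func() {
		for _, n := range names {
			in <- n
		}
		close(in)
	}()
	res := make(map[string]string)
	for r := range out {
		res[r.Input] = r.Output
	}
	return res
}

func main() {
	res := runCleanup([]string{"<i>Homo sapiens</i> L.", "Pinus alba"}, 2)
	for input, output := range res {
		fmt.Printf("%s|%s\n", input, output)
	}
}
```

With several workers the order of results is not guaranteed, which is why the sketch keys the collected results by the original input.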
func NormalizeHybridChar ¶
NormalizeHybridChar substitutes hybrid chars 'X' or 'x' with the multiplication sign char.
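As a rough illustration of this substitution, the sketch below replaces a lone 'x' or 'X' with '×' when it is delimited by whitespace or by the start of the input. The function normalizeHybrid and the pattern hybridRe are hypothetical; the exact positions the package treats as hybrid characters are an assumption here.

```go
package main

import (
	"fmt"
	"regexp"
)

// hybridRe matches a lone 'x' or 'X' preceded by the start of the input or
// whitespace and followed by whitespace. The pattern is an assumption made
// for illustration only.
var hybridRe = regexp.MustCompile(`(^|\s)[Xx](\s)`)

// normalizeHybrid is a hypothetical stand-in for NormalizeHybridChar.
// It returns a new slice with matching hybrid chars replaced by '×'.
func normalizeHybrid(bs []byte) []byte {
	return hybridRe.ReplaceAll(bs, []byte("${1}×${2}"))
}

func main() {
	fmt.Println(string(normalizeHybrid([]byte("Salvia x sylvestris"))))
	fmt.Println(string(normalizeHybrid([]byte("X Agropogon littoralis"))))
	fmt.Println(string(normalizeHybrid([]byte("Xanthium strumarium"))))
}
```

Requiring whitespace after the character keeps names that merely begin with the letter, such as Xanthium, untouched.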
func StripTags ¶
StripTags takes a string and returns it with common tags removed and HTML entities escaped. Uncommon tags are kept intact so that the parser can deal with them.
func UnderscoreToSpace ¶
UnderscoreToSpace takes a slice of bytes. If the slice contains underscores but no spaces, it replaces the underscores with spaces in place. If any spaces are present, the slice is left unmodified.
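The rule above can be sketched as follows. The function underscoreToSpace is a hypothetical stand-in: it assumes the boolean reports whether a substitution happened, and it omits the error value that the documented signature also returns.

```go
package main

import (
	"bytes"
	"fmt"
)

// underscoreToSpace is a hypothetical stand-in for UnderscoreToSpace.
// It rewrites bs in place and reports whether anything changed (an assumed
// meaning of the boolean). The documented error return is omitted.
func underscoreToSpace(bs []byte) bool {
	// Do nothing unless there are underscores and no spaces at all.
	if !bytes.ContainsRune(bs, '_') || bytes.ContainsRune(bs, ' ') {
		return false
	}
	for i, b := range bs {
		if b == '_' {
			bs[i] = ' '
		}
	}
	return true
}

func main() {
	name := []byte("Homo_sapiens_Linnaeus")
	changed := underscoreToSpace(name)
	fmt.Println(changed, string(name))

	mixed := []byte("Homo sapiens_var")
	changed = underscoreToSpace(mixed)
	fmt.Println(changed, string(mixed))
}
```

Because the slice is modified in place, callers that need the original bytes would have to copy them before the call.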
func VirusLikeName ¶
VirusLikeName takes a string and checks it against known species names that can easily be mistaken for viruses. It returns true if the string belongs to one of these species. The following names are covered:
Aspilota vector Belokobylskij, 2007
Ceylonesmus vector Chamberlin, 1941
Cryptops (Cryptops) vector Chamberlin, 1939
Culex vector Dyar & Knab, 1906
Dasyproctus cevirus Leclercq, 1963
Desmoxytes vector (Chamberlin, 1941)
Dicathais vector Thornley, 1952
Euragallia prion Kramer, 1976
Exochus virus Gauld & Sithole, 2002
Hilara vector Miller, 1923
Microgoneplax prion Castro, 2007
Neoaemula vector Mackinnon, Hiller, Long & Marshall, 2008
Ophion virus Gauld & Mitchell, 1981
Psenulus trevirus Leclercq, 1961
Tidabius vector Chamberlin, 1931
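A check of this kind can be sketched as a lookup against the canonical parts of a few of the names listed above. The function virusLikeName and the virusLike slice are hypothetical, and matching on the leading genus and epithet is an assumption; the package may compare names differently.

```go
package main

import (
	"fmt"
	"strings"
)

// virusLike holds the genus and epithet of a few names from the list.
// The selection is partial and serves only as an illustration.
var virusLike = []string{
	"Culex vector",
	"Euragallia prion",
	"Exochus virus",
	"Ophion virus",
}

// virusLikeName is a hypothetical stand-in for VirusLikeName. It reports
// whether name equals a listed pair or starts with one followed by a space.
func virusLikeName(name string) bool {
	for _, p := range virusLike {
		if name == p || strings.HasPrefix(name, p+" ") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(virusLikeName("Culex vector Dyar & Knab, 1906"))
	fmt.Println(virusLikeName("Tobacco mosaic virus"))
}
```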
Types ¶
type CleanupResult ¶
type CleanupResult struct {
	// Input is the original name.
	Input string
	// Output is the name after the tag removal.
	Output string
}
CleanupResult holds the result of removing some HTML tags from a name.
type Preprocessor ¶
type Preprocessor struct {
	Virus       bool
	Underscore  bool
	NoParse     bool
	Approximate bool
	Annotation  bool
	Body        []byte
	Tail        []byte
}
Preprocessor keeps the state of the preprocessing results.
func Preprocess ¶
func Preprocess(bs []byte) *Preprocessor
Preprocess runs a series of regular expressions over the input to determine features of the input before parsing.