preprocess

package
v1.0.11
Warning

This package is not in the latest version of its module.

Published: Feb 21, 2021 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package preprocess filters and modifies a scientific name before it is parsed.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Annotation

func Annotation(bs []byte) int

Annotation returns the index where the unparsed part starts. If the full string can be parsed, it returns the index of the end of the input.
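The index contract described above can be illustrated with a minimal sketch. It is not the package's implementation: the function name annotationIndex and the marker list are hypothetical and chosen only for this example. The sketch returns the position of the first marker, or the length of the input when no marker is found, so the input can always be split into a parseable part and a remainder.

```go
package main

import (
	"bytes"
	"fmt"
)

// markers is a hypothetical list of annotation markers used only in this
// sketch; the real package may detect annotations differently.
var markers = []string{" sensu ", " auct. ", " nec "}

// annotationIndex returns the index of the earliest marker in bs,
// or len(bs) when no marker is present.
func annotationIndex(bs []byte) int {
	idx := len(bs)
	for _, m := range markers {
		if i := bytes.Index(bs, []byte(m)); i >= 0 && i < idx {
			idx = i
		}
	}
	return idx
}

func main() {
	name := []byte("Aus bus sensu Smith 1900")
	i := annotationIndex(name)
	// Everything before i would be parsed, everything from i on is left alone.
	fmt.Printf("%q %q\n", name[:i], name[i:])
}
```

Returning the input length for the "no annotation" case means the caller can slice at the index without a special branch.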

func CleanupStream

func CleanupStream(in <-chan string, out chan<- *CleanupResult, wn int)

CleanupStream reads names from the in channel and sends a *CleanupResult to the out channel for each of them, pairing the original name (Input) with the cleaned-up name (Output). Judging by the signature, wn sets the number of workers.
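A channel-based function of this shape is commonly built as a worker pool. The sketch below shows that pattern under stated assumptions: cleanupResult is a local stand-in for CleanupResult, cleanup is a hypothetical placeholder for the real tag removal, wn is treated as the worker count, and closing the output channel after the workers finish is a choice made for this sketch, not documented behavior.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// cleanupResult is a local stand-in with the same fields as CleanupResult.
type cleanupResult struct {
	Input  string
	Output string
}

// cleanup is a hypothetical placeholder for the real clean-up step.
func cleanup(s string) string {
	return strings.NewReplacer("<i>", "", "</i>", "").Replace(s)
}

// cleanupStream starts wn workers that read names from in and send
// results to out. It closes out after in is closed and drained.
func cleanupStream(in <-chan string, out chan<- *cleanupResult, wn int) {
	var wg sync.WaitGroup
	wg.Add(wn)
	for i := 0; i < wn; i++ {
		go func() {
			defer wg.Done()
			for name := range in {
				out <- &cleanupResult{Input: name, Output: cleanup(name)}
			}
		}()
	}
	wg.Wait()
	close(out)
}

// run feeds names through the pipeline and collects outputs keyed by input.
func run(names []string, wn int) map[string]string {
	in := make(chan string)
	out := make(chan *cleanupResult)
	go func() {
		for _, n := range names {
			in <- n
		}
		close(in)
	}()
	go cleanupStream(in, out, wn)
	res := make(map[string]string)
	for r := range out {
		res[r.Input] = r.Output
	}
	return res
}

func main() {
	res := run([]string{"<i>Homo sapiens</i> L.", "Pinus alba"}, 2)
	fmt.Println(res["<i>Homo sapiens</i> L."])
}
```

Because several workers run concurrently, results may arrive in a different order than the inputs, which is why each result carries its original name.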

func IsVirus

func IsVirus(data []byte) bool

func NoParse

func NoParse(data []byte) bool

func NormalizeHybridChar

func NormalizeHybridChar(bs []byte) []byte

NormalizeHybridChar substitutes hybrid chars 'X' or 'x' with the multiplication sign char.
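As an illustration, a simplified version of this substitution could look like the sketch below. It is an assumption-laden stand-in, not the package's code: the name normalizeHybrid is hypothetical, and only a standalone 'x' or 'X' separated by single spaces is replaced, whereas the real function may recognize other hybrid notations.

```go
package main

import (
	"bytes"
	"fmt"
)

// normalizeHybrid replaces standalone "x" or "X" words with the
// multiplication sign. Words are delimited by single spaces.
func normalizeHybrid(bs []byte) []byte {
	words := bytes.Split(bs, []byte(" "))
	for i, w := range words {
		if len(w) == 1 && (w[0] == 'x' || w[0] == 'X') {
			words[i] = []byte("×")
		}
	}
	return bytes.Join(words, []byte(" "))
}

func main() {
	fmt.Println(string(normalizeHybrid([]byte("Salix x capreola"))))
	// prints "Salix × capreola"
}
```

Matching whole words keeps names that merely start with the letter, such as Xanthium, untouched.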

func StripTags

func StripTags(s string) string

StripTags takes a string and returns a string with common tags removed and HTML entities escaped. It keeps all uncommon tags intact to let the parser deal with them.
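The idea of removing only common tags can be sketched with a regular expression. Everything here is an assumption for illustration: the name stripCommonTags is hypothetical, the set of tags treated as common is a guess, and HTML entity handling is deliberately left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// commonTags matches a hypothetical set of formatting tags, opening or
// closing, in any letter case. The real list of tags may differ.
var commonTags = regexp.MustCompile(`(?i)</?(i|em|b|strong|u|small)>`)

// stripCommonTags removes the tags matched by commonTags and leaves
// every other tag in place.
func stripCommonTags(s string) string {
	return commonTags.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(stripCommonTags("<i>Homo sapiens</i> <taxon>L.</taxon>"))
	// prints "Homo sapiens <taxon>L.</taxon>"
}
```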

func UnderscoreToSpace

func UnderscoreToSpace(bs []byte) (bool, error)

UnderscoreToSpace takes a slice of bytes. If the string contains underscores but no spaces, it replaces the underscores with spaces in the slice. If any spaces are present, the slice is left unmodified.
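The rule above can be sketched as an in-place byte substitution. This is a simplified stand-in with a hypothetical name: it returns a single bool reporting whether the slice was changed, which is an assumption, since the meaning of the real function's (bool, error) results is not documented here.

```go
package main

import (
	"bytes"
	"fmt"
)

// underscoresToSpaces rewrites '_' to ' ' in place, but only when bs
// contains underscores and no spaces. It reports whether bs was changed.
func underscoresToSpaces(bs []byte) bool {
	if bytes.IndexByte(bs, ' ') >= 0 || bytes.IndexByte(bs, '_') < 0 {
		return false
	}
	for i, b := range bs {
		if b == '_' {
			bs[i] = ' '
		}
	}
	return true
}

func main() {
	name := []byte("Homo_sapiens_sapiens")
	changed := underscoresToSpaces(name)
	fmt.Println(changed, string(name))
}
```

Since an underscore and a space are both one byte, the substitution never changes the length of the slice and needs no allocation.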

func VirusLikeName

func VirusLikeName(name string) bool

VirusLikeName takes a string and checks it against known species names that can easily be mistaken for viruses. It returns true if the string belongs to one of these species. The following names are covered:

Aspilota vector Belokobylskij, 2007
Ceylonesmus vector Chamberlin, 1941
Cryptops (Cryptops) vector Chamberlin, 1939
Culex vector Dyar & Knab, 1906
Dasyproctus cevirus Leclercq, 1963
Desmoxytes vector (Chamberlin, 1941)
Dicathais vector Thornley, 1952
Euragallia prion Kramer, 1976
Exochus virus Gauld & Sithole, 2002
Hilara vector Miller, 1923
Microgoneplax prion Castro, 2007
Neoaemula vector Mackinnon, Hiller, Long & Marshall, 2008
Ophion virus Gauld & Mitchell, 1981
Psenulus trevirus Leclercq, 1961
Tidabius vector Chamberlin, 1931
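A check of this kind can be sketched as a prefix lookup over the names listed above, with authorship removed. The function name virusLike and the matching rule (exact prefix followed by a space or the end of the string) are assumptions made for this sketch; the package may match differently.

```go
package main

import (
	"fmt"
	"strings"
)

// virusLikeNames holds the listed species without authorship.
var virusLikeNames = []string{
	"Aspilota vector", "Ceylonesmus vector", "Cryptops (Cryptops) vector",
	"Culex vector", "Dasyproctus cevirus", "Desmoxytes vector",
	"Dicathais vector", "Euragallia prion", "Exochus virus",
	"Hilara vector", "Microgoneplax prion", "Neoaemula vector",
	"Ophion virus", "Psenulus trevirus", "Tidabius vector",
}

// virusLike reports whether name starts with one of the listed species,
// followed either by the end of the string or by a space.
func virusLike(name string) bool {
	for _, p := range virusLikeNames {
		if strings.HasPrefix(name, p) && (len(name) == len(p) || name[len(p)] == ' ') {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(virusLike("Ophion virus Gauld & Mitchell, 1981"))
	fmt.Println(virusLike("Tobacco mosaic virus"))
}
```

Requiring a word boundary after the prefix avoids false matches on longer epithets that merely begin with a listed one.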

Types

type CleanupResult

type CleanupResult struct {
	// Input is the original name.
	Input string
	// Output is the name after the tag removal.
	Output string
}

CleanupResult holds the result of removing some HTML tags from a name.

type Preprocessor

type Preprocessor struct {
	Virus       bool
	Underscore  bool
	NoParse     bool
	Approximate bool
	Annotation  bool
	Body        []byte
	Tail        []byte
}

Preprocessor holds the results of preprocessing.

func Preprocess

func Preprocess(bs []byte) *Preprocessor

Preprocess runs a series of regular expressions over the input to determine features of the input before parsing.
