Documentation ¶
Index ¶
- Constants
- func GetCharmapByName(name string) (*charmap.Charmap, error)
- func ParseVerticalFile(conf *ParserConf, lproc LineProcessor) error
- func ParseVerticalFileNoGoRo(conf *ParserConf, lproc LineProcessor)
- func SupportedCharsets() []string
- type LineProcessor
- type ParserConf
- type Structure
- type StructureClose
- type Token
Constants ¶
const (
	LineTypeToken = "token"
	LineTypeStruct = "struct"
	LineTypeIgnored = "ignored"
	AccumulatorTypeStack = "stack"
	AccumulatorTypeComb = "comb"
	AccumulatorTypeNil = "nil"
	CharsetISO8859_1 = "iso-8859-1"
	CharsetISO8859_2 = "iso-8859-2"
	CharsetISO8859_3 = "iso-8859-3"
	CharsetISO8859_4 = "iso-8859-4"
	CharsetISO8859_5 = "iso-8859-5"
	CharsetISO8859_6 = "iso-8859-6"
	CharsetISO8859_7 = "iso-8859-7"
	CharsetISO8859_8 = "iso-8859-8"
	CharsetWindows1250 = "windows-1250"
	CharsetWindows1251 = "windows-1251"
	CharsetWindows1252 = "windows-1252"
	CharsetWindows1253 = "windows-1253"
	CharsetWindows1254 = "windows-1254"
	CharsetWindows1255 = "windows-1255"
	CharsetWindows1256 = "windows-1256"
	CharsetWindows1257 = "windows-1257"
	CharsetWindows1258 = "windows-1258"
	CharsetUTF_8 = "utf-8"
)
Variables ¶
This section is empty.
Functions ¶
func GetCharmapByName ¶
func GetCharmapByName(name string) (*charmap.Charmap, error)
GetCharmapByName returns a Charmap instance for the provided encoding name. Name matching is case-insensitive (e.g. utf-8 is the same as UTF-8). The number of supported charsets is
func ParseVerticalFile ¶
func ParseVerticalFile(conf *ParserConf, lproc LineProcessor) error
ParseVerticalFile processes a corpus vertical file line by line and applies a custom LineProcessor to each line. Processing is parallelized in the sense that reading the file into lines and processing those lines run in different goroutines. To reduce overhead, data is passed between the goroutines in chunks.
func ParseVerticalFileNoGoRo ¶
func ParseVerticalFileNoGoRo(conf *ParserConf, lproc LineProcessor)
ParseVerticalFileNoGoRo is intended for benchmarking purposes only.
func SupportedCharsets ¶
func SupportedCharsets() []string
SupportedCharsets returns the names of the supported character sets.
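Case-insensitive name matching of the kind GetCharmapByName describes can be sketched against a list of charset names. This is only an illustration: normalizeCharset and the supported slice are hypothetical, the slice holds just a subset of the constants listed above, and the sketch returns a canonical name instead of a *charmap.Charmap so that it needs only the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// supported is a hypothetical subset of the charset names defined by the
// package's constants.
var supported = []string{"iso-8859-1", "iso-8859-2", "windows-1250", "utf-8"}

// normalizeCharset lowercases name and returns the matching canonical
// charset name, or an error if the name is not in the supported list.
func normalizeCharset(name string) (string, error) {
	lower := strings.ToLower(name)
	for _, s := range supported {
		if s == lower {
			return s, nil
		}
	}
	return "", fmt.Errorf("unsupported charset: %q", name)
}

func main() {
	for _, name := range []string{"UTF-8", "Windows-1250", "koi8-r"} {
		canonical, err := normalizeCharset(name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(name, "->", canonical)
	}
}
```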
Types ¶
type LineProcessor ¶
type LineProcessor interface {
	ProcToken(token *Token, line int, err error)
	ProcStruct(strc *Structure, line int, err error)
	ProcStructClose(strc *StructureClose, line int, err error)
	StopChannel() chan struct{}
}
LineProcessor describes an object able to handle Vertigo's parsing events.
type ParserConf ¶
type ParserConf struct {
	// Source vertical file (either a plain text file or a gzip one)
	InputFilePath string `json:"inputFilePath"`
	Encoding string `json:"encoding"`
	FilterArgs [][][]string `json:"filterArgs"`
	StructAttrAccumulator string `json:"structAttrAccumulator"`
	LogProgressEachNth int `json:"logProgressEachNth"`
}
ParserConf contains configuration parameters for the vertical file parser.
func LoadConfig ¶
func LoadConfig(path string) *ParserConf
LoadConfig loads the configuration from a JSON file. On error, it panics.
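The JSON layout implied by the struct tags can be sketched as follows. The struct is a local copy of the documented ParserConf; parseConf, the sample values (including the file name corpus.vert.gz), and the choice to return an error instead of panicking are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ParserConf is a local copy of the documented struct.
type ParserConf struct {
	InputFilePath         string       `json:"inputFilePath"`
	Encoding              string       `json:"encoding"`
	FilterArgs            [][][]string `json:"filterArgs"`
	StructAttrAccumulator string       `json:"structAttrAccumulator"`
	LogProgressEachNth    int          `json:"logProgressEachNth"`
}

// sampleConf holds hypothetical values; only the key names come from the
// struct tags.
const sampleConf = `{
  "inputFilePath": "corpus.vert.gz",
  "encoding": "utf-8",
  "filterArgs": [
    [["div.author", "John Doe"]],
    [["div.title", "Unknown"], ["div.title", "Superunknown"]]
  ],
  "structAttrAccumulator": "stack",
  "logProgressEachNth": 1000000
}`

// parseConf decodes JSON data into a ParserConf. Unlike the documented
// LoadConfig, this hypothetical helper returns the error to the caller.
func parseConf(data []byte) (*ParserConf, error) {
	var conf ParserConf
	if err := json.Unmarshal(data, &conf); err != nil {
		return nil, err
	}
	return &conf, nil
}

func main() {
	conf, err := parseConf([]byte(sampleConf))
	if err != nil {
		panic(err)
	}
	fmt.Println(conf.Encoding, conf.StructAttrAccumulator, len(conf.FilterArgs))
}
```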
type StructureClose ¶
type StructureClose struct {
Name string
}
StructureClose represents a structure closing tag.
type Token ¶
Token is a representation of a parsed line. It combines the line's positional attributes with the currently accumulated structural attributes.
func (*Token) MatchesFilter ¶
MatchesFilter tests whether the token matches a filter in conjunctive normal form encoded as a 3-d list. For example,
div.author = 'John Doe' AND (div.title = 'Unknown' OR div.title = 'Superunknown')
encodes as:
{
	{{"div.author" "John Doe"}}
	{{"div.title" "Unknown"} {"div.title" "Superunknown"}}
}
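The evaluation of such a filter can be sketched as below: every inner list (an OR group) must contain at least one matching name/value pair. This is an illustrative sketch, not the method's actual code; matchesCNF is a hypothetical function, and a plain map stands in for the token's accumulated structural attributes.

```go
package main

import "fmt"

// matchesCNF reports whether attrs satisfies filter, a conjunction of
// clauses where each clause is a disjunction of [name, value] pairs.
// An empty filter matches everything; an empty clause matches nothing.
func matchesCNF(attrs map[string]string, filter [][][]string) bool {
	for _, clause := range filter {
		satisfied := false
		for _, pair := range clause {
			if len(pair) != 2 {
				continue // skip malformed pairs
			}
			if v, ok := attrs[pair[0]]; ok && v == pair[1] {
				satisfied = true
				break
			}
		}
		if !satisfied {
			return false
		}
	}
	return true
}

func main() {
	filter := [][][]string{
		{{"div.author", "John Doe"}},
		{{"div.title", "Unknown"}, {"div.title", "Superunknown"}},
	}
	a := map[string]string{"div.author": "John Doe", "div.title": "Superunknown"}
	b := map[string]string{"div.author": "John Doe", "div.title": "Other"}
	fmt.Println(matchesCNF(a, filter)) // true
	fmt.Println(matchesCNF(b, filter)) // false
}
```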
func (*Token) PosAttrByIndex ¶
PosAttrByIndex returns a positional attribute based on its original index in the vertical file.