Documentation ¶
Index ¶
- Constants
- func GetCharmapByName(name string) (*charmap.Charmap, error)
- func ParseVerticalFile(ctx context.Context, conf *ParserConf, lproc LineProcessor) error
- func ParseVerticalFileNoGoRo(conf *ParserConf, lproc LineProcessor)
- func SupportedCharsets() []string
- type LineProcessor
- type ParserConf
- type Structure
- type StructureClose
- type Token
Constants ¶
const (
	LineTypeToken   = "token"
	LineTypeStruct  = "struct"
	LineTypeIgnored = "ignored"

	AccumulatorTypeStack = "stack"
	AccumulatorTypeComb  = "comb"
	AccumulatorTypeNil   = "nil"

	CharsetISO8859_1   = "iso-8859-1"
	CharsetISO8859_2   = "iso-8859-2"
	CharsetISO8859_3   = "iso-8859-3"
	CharsetISO8859_4   = "iso-8859-4"
	CharsetISO8859_5   = "iso-8859-5"
	CharsetISO8859_6   = "iso-8859-6"
	CharsetISO8859_7   = "iso-8859-7"
	CharsetISO8859_8   = "iso-8859-8"
	CharsetWindows1250 = "windows-1250"
	CharsetWindows1251 = "windows-1251"
	CharsetWindows1252 = "windows-1252"
	CharsetWindows1253 = "windows-1253"
	CharsetWindows1254 = "windows-1254"
	CharsetWindows1255 = "windows-1255"
	CharsetWindows1256 = "windows-1256"
	CharsetWindows1257 = "windows-1257"
	CharsetWindows1258 = "windows-1258"
	CharsetUTF_8       = "utf-8"
)
Variables ¶
This section is empty.
Functions ¶
func GetCharmapByName ¶
func GetCharmapByName(name string) (*charmap.Charmap, error)
GetCharmapByName returns the Charmap instance matching the provided encoding name. Name matching is case-insensitive (e.g. utf-8 is the same as UTF-8). The number of supported charsets is
func ParseVerticalFile ¶
func ParseVerticalFile(ctx context.Context, conf *ParserConf, lproc LineProcessor) error
ParseVerticalFile processes a corpus vertical file line by line and applies a custom LineProcessor to each line. Processing is parallelized in the sense that reading the file into lines and processing those lines run in separate goroutines. The function as a whole nevertheless behaves synchronously: once it returns, the processing is finished.
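The reader/processor split described above can be illustrated with a minimal sketch. It is not the package's implementation: processLines and countLines are hypothetical helpers, the channel buffer size is an arbitrary choice, and the input is read from a string instead of a vertical file. One goroutine splits the input into lines while the calling goroutine handles them, and the function returns only after all work is finished.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"strings"
)

// processLines is a hypothetical helper, not part of the package API.
// A separate goroutine reads lines from r; the calling goroutine passes
// each line to handle. The function blocks until every line is handled,
// the handler returns an error, or reading fails.
func processLines(ctx context.Context, r io.Reader, handle func(line string, num int) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // releases the reader goroutine on an early return

	lines := make(chan string, 64) // buffer size chosen arbitrarily
	readErr := make(chan error, 1)

	go func() {
		defer close(lines)
		sc := bufio.NewScanner(r)
		for sc.Scan() {
			select {
			case lines <- sc.Text():
			case <-ctx.Done():
				readErr <- ctx.Err()
				return
			}
		}
		readErr <- sc.Err()
	}()

	num := 0
	for line := range lines {
		num++
		if err := handle(line, num); err != nil {
			return err // a handler error stops the whole run
		}
	}
	return <-readErr
}

// countLines runs processLines over a string and reports how many
// lines the handler has seen.
func countLines(input string) (int, error) {
	count := 0
	err := processLines(context.Background(), strings.NewReader(input),
		func(line string, num int) error {
			count = num
			return nil
		})
	return count, err
}

func main() {
	n, err := countLines("<doc id=\"1\">\nHello\tNN\nworld\tNN\n</doc>\n")
	fmt.Println(n, err)
}
```

Because the caller blocks on the lines channel until the reader closes it, the concurrency stays an internal detail and the call site looks like an ordinary synchronous function call.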
func ParseVerticalFileNoGoRo ¶
func ParseVerticalFileNoGoRo(conf *ParserConf, lproc LineProcessor)
ParseVerticalFileNoGoRo is intended for benchmarking purposes only.
func SupportedCharsets ¶
func SupportedCharsets() []string
SupportedCharsets returns a list of names of character sets.
Types ¶
type LineProcessor ¶
type LineProcessor interface {
	// ProcToken is called each time the parser encounters a positional
	// attribute. In case parsing produces an error, it is passed to the
	// function without stopping the whole process.
	// In case the function returns an error, the parser stops
	// (in the simplest case it can even be the error it receives).
	ProcToken(token *Token, line int, err error) error

	// ProcStruct is called each time the parser encounters a structure
	// opening element (e.g. <doc>). In case parsing produces an error,
	// it is passed to the function without stopping the whole process.
	// In case the function returns an error, the parser stops.
	ProcStruct(strc *Structure, line int, err error) error

	// ProcStructClose is called each time the parser encounters a structure
	// closing element (e.g. </doc>). In case parsing produces an error,
	// it is passed to the function without stopping the whole process.
	// In case the function returns an error, the parser stops.
	ProcStructClose(strc *StructureClose, line int, err error) error
}
LineProcessor describes an object able to handle Vertigo's parsing events.
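As an illustration of the error contract described in the interface comments, the following sketch implements a LineProcessor that tolerates token parsing errors but stops on structure errors. The type declarations are local stand-ins that mirror the documented API so that the example compiles on its own; Token is reduced to an empty struct because its fields are not listed here. The statsCollector type and its fields are hypothetical.

```go
package main

import "fmt"

// Local stand-ins for the documented types. In real code they would
// come from the package itself. Token's fields are not shown in this
// documentation, so it is left empty here.
type Token struct{}

type Structure struct {
	Name    string
	Attrs   map[string]string
	IsEmpty bool
}

type StructureClose struct {
	Name string
}

type LineProcessor interface {
	ProcToken(token *Token, line int, err error) error
	ProcStruct(strc *Structure, line int, err error) error
	ProcStructClose(strc *StructureClose, line int, err error) error
}

// statsCollector (hypothetical) counts tokens and tracks how many
// structures are currently open.
type statsCollector struct {
	tokens  int
	depth   int
	skipped []int // line numbers of tokens that failed to parse
}

// ProcToken records a parsing error and returns nil, so parsing continues.
func (s *statsCollector) ProcToken(token *Token, line int, err error) error {
	if err != nil {
		s.skipped = append(s.skipped, line)
		return nil
	}
	s.tokens++
	return nil
}

// ProcStruct returns the received error, which stops the parser.
func (s *statsCollector) ProcStruct(strc *Structure, line int, err error) error {
	if err != nil {
		return err
	}
	if !strc.IsEmpty { // self-closing tags produce no close event
		s.depth++
	}
	return nil
}

// ProcStructClose stops the parser on an unbalanced closing tag.
func (s *statsCollector) ProcStructClose(strc *StructureClose, line int, err error) error {
	if err != nil {
		return err
	}
	if s.depth == 0 {
		return fmt.Errorf("line %d: unexpected </%s>", line, strc.Name)
	}
	s.depth--
	return nil
}

// Compile-time check that statsCollector satisfies the interface.
var _ LineProcessor = (*statsCollector)(nil)

func main() {
	sc := &statsCollector{}
	// The events are fed by hand here; normally the parser calls these methods.
	_ = sc.ProcStruct(&Structure{Name: "doc", Attrs: map[string]string{"id": "foo"}}, 1, nil)
	_ = sc.ProcToken(&Token{}, 2, nil)
	_ = sc.ProcToken(nil, 3, fmt.Errorf("malformed line"))
	_ = sc.ProcStructClose(&StructureClose{Name: "doc"}, 4, nil)
	fmt.Println(sc.tokens, sc.depth, sc.skipped)
}
```

Whether a given error should stop the run is left to the implementation: returning nil keeps the parser going, and returning any error (including the one received) stops it.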
type ParserConf ¶
type ParserConf struct {
	// Source vertical file (either a plain text file or a gzip one)
	InputFilePath string `json:"inputFilePath"`

	Encoding string `json:"encoding"`

	FilterArgs [][][]string `json:"filterArgs"`

	StructAttrAccumulator string `json:"structAttrAccumulator"`

	LogProgressEachNth int `json:"logProgressEachNth"`
}
ParserConf contains configuration parameters for the vertical file parser.
func LoadConfig ¶
func LoadConfig(path string) *ParserConf
LoadConfig loads the configuration from a JSON file. If an error occurs, the function panics.
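To show what such a JSON configuration might look like, the sketch below decodes a sample document into a local copy of the ParserConf struct using the JSON keys from its field tags. The file path and the numeric value are made up for illustration, and decodeConf is a hypothetical helper: unlike LoadConfig, it reads from a string and returns an error instead of panicking.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ParserConf is a local copy of the documented struct so that the
// example is self-contained.
type ParserConf struct {
	InputFilePath         string       `json:"inputFilePath"`
	Encoding              string       `json:"encoding"`
	FilterArgs            [][][]string `json:"filterArgs"`
	StructAttrAccumulator string       `json:"structAttrAccumulator"`
	LogProgressEachNth    int          `json:"logProgressEachNth"`
}

// sampleConf is an illustrative configuration. The path and the
// progress interval are invented values; "utf-8" and "stack" match
// the CharsetUTF_8 and AccumulatorTypeStack constants.
const sampleConf = `{
  "inputFilePath": "/path/to/corpus.vert.gz",
  "encoding": "utf-8",
  "filterArgs": [
    [["div.author", "John Doe"]],
    [["div.title", "Unknown"], ["div.title", "Superunknown"]]
  ],
  "structAttrAccumulator": "stack",
  "logProgressEachNth": 100000
}`

// decodeConf (hypothetical) parses a JSON document into ParserConf.
func decodeConf(data string) (*ParserConf, error) {
	var conf ParserConf
	if err := json.Unmarshal([]byte(data), &conf); err != nil {
		return nil, err
	}
	return &conf, nil
}

func main() {
	conf, err := decodeConf(sampleConf)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(conf.Encoding, conf.StructAttrAccumulator, len(conf.FilterArgs))
}
```

The filterArgs value here reuses the conjunctive normal form example from the MatchesFilter description: two AND-ed clauses, the second holding two OR-ed alternatives.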
type Structure ¶
type Structure struct {
	// Name defines a name of a structure tag (e.g. 'doc' for <doc> element)
	Name string

	// Attrs store structural attributes of the tag
	// (e.g. <doc id="foo"> produces map with a single key 'id' and value 'foo')
	Attrs map[string]string

	// IsEmpty defines a possible self-closing tag
	// if true then the structure is self-closing
	// (i.e. there is no 'close element' event following)
	IsEmpty bool
}
Structure represents a structure opening tag.
type StructureClose ¶
type StructureClose struct {
Name string
}
StructureClose represents a structure closing tag.
type Token ¶
Token represents a parsed line. It combines the line's positional attributes with the structural attributes accumulated so far.
func (*Token) MatchesFilter ¶
MatchesFilter tests whether the token matches a filter in conjunctive normal form, encoded as a three-dimensional list. For example, the expression

	div.author = 'John Doe' AND (div.title = 'Unknown' OR div.title = 'Superunknown')

is encoded as:

	{
		{{"div.author" "John Doe"}}
		{{"div.title" "Unknown"} {"div.title" "Superunknown"}}
	}
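The evaluation of such a filter can be sketched as follows. This is not the method's actual implementation: matchesCNF is a hypothetical function that checks the filter against a plain map of attribute values rather than a Token. The outer list is treated as AND-ed clauses, and each clause as OR-ed [attribute, value] pairs.

```go
package main

import "fmt"

// matchesCNF (hypothetical) reports whether attrs satisfies a filter
// in conjunctive normal form: every clause must contain at least one
// [name, value] pair that matches an attribute exactly.
func matchesCNF(attrs map[string]string, filter [][][]string) bool {
	for _, clause := range filter {
		satisfied := false
		for _, pair := range clause {
			if len(pair) != 2 {
				continue // malformed pairs never match in this sketch
			}
			if v, ok := attrs[pair[0]]; ok && v == pair[1] {
				satisfied = true
				break
			}
		}
		if !satisfied {
			return false // one failed clause fails the whole conjunction
		}
	}
	return true
}

func main() {
	filter := [][][]string{
		{{"div.author", "John Doe"}},
		{{"div.title", "Unknown"}, {"div.title", "Superunknown"}},
	}
	attrs := map[string]string{
		"div.author": "John Doe",
		"div.title":  "Superunknown",
	}
	fmt.Println(matchesCNF(attrs, filter))
}
```

In this sketch an empty filter matches everything and an empty clause matches nothing, which follows from the usual semantics of empty conjunctions and disjunctions; whether the package behaves the same way in those edge cases is an assumption.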
func (*Token) PosAttrByIndex ¶
PosAttrByIndex returns a positional attribute based on its original index in the vertical file.