Documentation ¶
Overview ¶
Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de The Finc Authors, http://finc.info Martin Czygan, <martin.czygan@uni-leipzig.de>
This file is part of some open source application.
Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with Foobar. If not, see <http://www.gnu.org/licenses/>.
@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>
Index ¶
- Constants
- Variables
- func ByteSink(w io.Writer, out chan []byte, done chan bool)
- func DetectLang3(text string) (string, error)
- func FromLines(r io.Reader, f ImporterFunc) (chan []Importer, error)
- func FromLinesSize(r io.Reader, f ImporterFunc, size int) (chan []Importer, error)
- func FromXML(r io.Reader, name string, decoderFunc XMLDecoderFunc) (chan []Importer, error)
- func FromXMLSize(r io.Reader, name string, decoderFunc XMLDecoderFunc, size int) (chan []Importer, error)
- func UnescapeTrim(s string) string
- type Importer
- type ImporterFunc
- type Skip
- type Source
- type XMLDecoderFunc
Constants ¶
const (
	// AppVersion of span package. Commandline tools will show this on -v.
	AppVersion = "0.1.137"

	// KeyLengthLimit is a limit imposed by memcached protocol, which is used
	// for blob storage as of June 2015. If we change the key value store,
	// this limit might become obsolete.
	KeyLengthLimit = 250
)
Variables ¶
var ISSNPattern = regexp.MustCompile(`[0-9]{4,4}-[0-9]{3,3}[0-9X]`)
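As a minimal sketch of how this pattern behaves, the program below copies the expression into a local variable and extracts candidates from free text. The helper name `findISSNs` is hypothetical and not part of the package. Note that the pattern is unanchored, matches only an uppercase `X`, and does not validate the check digit.

```go
package main

import (
	"fmt"
	"regexp"
)

// issnPattern is a local copy of the ISSNPattern expression documented above.
var issnPattern = regexp.MustCompile(`[0-9]{4,4}-[0-9]{3,3}[0-9X]`)

// findISSNs (hypothetical helper) returns every substring of s that has the
// shape of an ISSN. It does not verify the check digit.
func findISSNs(s string) []string {
	return issnPattern.FindAllString(s, -1)
}

func main() {
	fmt.Println(findISSNs("print 0028-0836, online 1476-4687, other 2049-363X"))
}
```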
Functions ¶
func ByteSink ¶
ByteSink is a fan-in writer for a byte channel. A newline is appended after each object.
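The following is a sketch of the fan-in pattern described here, not the package's actual implementation. The function `byteSink` is a hypothetical stand-in with the same signature; it assumes the sink stops when the channel is closed and then signals on `done`. Write errors are ignored for brevity.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// byteSink is a hypothetical stand-in with the same signature as ByteSink.
// It drains out, writes each item to w followed by a newline and signals on
// done once out has been closed. Write errors are ignored in this sketch.
func byteSink(w io.Writer, out chan []byte, done chan bool) {
	for b := range out {
		w.Write(b)
		w.Write([]byte("\n"))
	}
	done <- true
}

// sinkToString sends all items through byteSink and returns what was written.
func sinkToString(items [][]byte) string {
	var buf bytes.Buffer
	out := make(chan []byte)
	done := make(chan bool)
	go byteSink(&buf, out, done)
	for _, item := range items {
		out <- item
	}
	close(out)
	<-done // wait until the sink has written everything
	return buf.String()
}

func main() {
	fmt.Print(sinkToString([][]byte{[]byte(`{"id": 1}`), []byte(`{"id": 2}`)}))
}
```

In a real pipeline several worker goroutines would send to `out` concurrently, which is what makes the single writer a fan-in point.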
func DetectLang3 ¶
DetectLang3 returns the best-guess three-letter language code for a given text.
func FromLines ¶ added in v0.1.56
func FromLines(r io.Reader, f ImporterFunc) (chan []Importer, error)
FromLines returns a channel of slices of importable objects with a default batch size of 20000 docs.
func FromLinesSize ¶ added in v0.1.56
FromLinesSize returns a channel of slices of importable values, given a reader, a function f (which converts a single value) and the number of documents per batch. Important: because of fan-out, the output order is not guaranteed to match the input order.
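To illustrate the batching idea only, here is a simplified sketch. The function `batchLines` is hypothetical: it groups plain strings rather than Importer values, assumes `size` is greater than zero, ignores scanner errors, and runs in a single goroutine, so unlike the fan-out version it keeps input order.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// batchLines (hypothetical) groups the lines of r into slices of at most
// size lines and sends them over the returned channel. size must be > 0.
func batchLines(r io.Reader, size int) chan []string {
	ch := make(chan []string)
	go func() {
		defer close(ch)
		scanner := bufio.NewScanner(r)
		batch := make([]string, 0, size)
		for scanner.Scan() {
			batch = append(batch, scanner.Text())
			if len(batch) == size {
				ch <- batch
				batch = make([]string, 0, size)
			}
		}
		if len(batch) > 0 {
			ch <- batch // last, possibly shorter batch
		}
	}()
	return ch
}

// batchSizes collects the length of each batch, for demonstration.
func batchSizes(input string, size int) []int {
	var sizes []int
	for batch := range batchLines(strings.NewReader(input), size) {
		sizes = append(sizes, len(batch))
	}
	return sizes
}

func main() {
	fmt.Println(batchSizes("a\nb\nc\nd\ne\n", 2))
}
```

Sending slices instead of single values reduces the number of channel operations, which is the motivation for batching in the first place.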
func FromXMLSize ¶
func FromXMLSize(r io.Reader, name string, decoderFunc XMLDecoderFunc, size int) (chan []Importer, error)
FromXMLSize returns a channel of importable document slices given a reader over XML, the name of the XML start element, an XMLDecoderFunc callback that deserializes an XML snippet and a batch size. TODO(miku): more idiomatic error handling, e.g. over error channel.
func UnescapeTrim ¶
UnescapeTrim unescapes HTML character references and trims the space of a given string.
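A function with this behaviour can be sketched with the standard library alone. The name `unescapeTrim` is a hypothetical stand-in, and the order of operations (unescape first, then trim) is an assumption; the package's implementation may differ.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// unescapeTrim is a hypothetical equivalent of UnescapeTrim: it resolves
// HTML character references and then removes surrounding whitespace.
func unescapeTrim(s string) string {
	return strings.TrimSpace(html.UnescapeString(s))
}

func main() {
	fmt.Printf("%q\n", unescapeTrim("  Sch&ouml;ne Gr&uuml;&szlig;e &amp; mehr \n"))
}
```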
Types ¶
type Importer ¶
type Importer interface {
ToIntermediateSchema() (*finc.IntermediateSchema, error)
}
Importer objects can be converted into an intermediate schema.
type ImporterFunc ¶ added in v0.1.56
ImporterFunc turns a byte slice into a single importable object.
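How the two types might fit together is sketched below. Everything here is an assumption made for illustration: `intermediateSchema` stands in for finc.IntermediateSchema (whose fields are not shown in this documentation), the `importerFunc` signature is a guess based on the description above, and `doc`, `fromJSON` and `titleOf` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// intermediateSchema is a hypothetical stand-in for finc.IntermediateSchema.
type intermediateSchema struct {
	Title string
}

// importer mirrors the Importer interface, using the stand-in type.
type importer interface {
	ToIntermediateSchema() (*intermediateSchema, error)
}

// importerFunc is an assumed shape for ImporterFunc: bytes in, importer out.
type importerFunc func([]byte) (importer, error)

// doc is a hypothetical source record.
type doc struct {
	Name string `json:"name"`
}

// ToIntermediateSchema converts the source record into the stand-in schema.
func (d doc) ToIntermediateSchema() (*intermediateSchema, error) {
	if d.Name == "" {
		return nil, errors.New("missing name")
	}
	return &intermediateSchema{Title: d.Name}, nil
}

// fromJSON parses one JSON document into an importer.
func fromJSON(b []byte) (importer, error) {
	var d doc
	if err := json.Unmarshal(b, &d); err != nil {
		return nil, err
	}
	return d, nil
}

// titleOf runs both steps for a single line: parse, then convert.
func titleOf(f importerFunc, line string) (string, error) {
	imp, err := f([]byte(line))
	if err != nil {
		return "", err
	}
	is, err := imp.ToIntermediateSchema()
	if err != nil {
		return "", err
	}
	return is.Title, nil
}

func main() {
	title, err := titleOf(fromJSON, `{"name": "An example title"}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(title)
}
```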
type Source ¶
Source can emit records given a reader. The channel is of type []Importer, to allow the source to send objects over the channel in batches for performance (1000 x 1000 docs vs 1000000 x 1 doc).
type XMLDecoderFunc ¶
XMLDecoderFunc returns an importable document, given an XML decoder and a start element.
Directories ¶
Path | Synopsis |
---|---|
cmd | |
span-check | span-check runs quality checks on input data |
span-deduplicate | deduplicate an intermediate schema with respect to licensing information |
span-export | span-export creates various destination formats, mostly for SOLR. |
span-import | Converts various input formats into an intermediate schema. |
span-redact | redact intermediate schema |
span-tag | span-tag takes an intermediate schema file and a configuration forest of filters for various tags and runs all filters on every record of the input to produce a stream of tagged records. |
 | Package sets implements basic set types. |
encoding | |
formeta | Package formeta implements marshaling for formeta (metafacture internal format). |
 | Package finc holds finc SolrSchema (SOLR) and intermediate schema related types and methods. |
 | Package qa implements quality assurance helpers. |
sources | |
crossref | Package crossref implements crossref related structs and transformations. |