Documentation ¶
Overview ¶
Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de The Finc Authors, http://finc.info Martin Czygan, <martin.czygan@uni-leipzig.de>
This file is part of some open source application.
Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with some open source application. If not, see <http://www.gnu.org/licenses/>.
@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>
Index ¶
- Constants
- Variables
- func ByteSink(w io.Writer, out chan []byte, done chan bool)
- func DetectLang3(text string) (string, error)
- func FromLines(r io.Reader, f ImporterFunc) (chan []Importer, error)
- func FromLinesSize(r io.Reader, f ImporterFunc, size int) (chan []Importer, error)
- func FromXML(r io.Reader, name string, decoderFunc XMLDecoderFunc) (chan []Importer, error)
- func FromXMLSize(r io.Reader, name string, decoderFunc XMLDecoderFunc, size int) (chan []Importer, error)
- func ReadLines(filename string) (lines []string, err error)
- func UnescapeTrim(s string) string
- type FileReader
- type Importer
- type ImporterFunc
- type LinkReader
- type SavedLink
- type SavedReaders
- type Skip
- type SkipReader
- type Source
- type WriteCounter
- type XMLDecoderFunc
- type ZipContentReader
- type ZipOrPlainLinkReader
Constants ¶
const (
	// AppVersion of span package. Commandline tools will show this on -v.
	AppVersion = "0.1.156"

	// KeyLengthLimit is a limit imposed by memcached protocol, which is used
	// for blob storage as of June 2015. If we change the key value store,
	// this limit might become obsolete.
	KeyLengthLimit = 250
)
Variables ¶
var ISSNPattern = regexp.MustCompile(`[0-9]{4,4}-[0-9]{3,3}[0-9X]`)
ISSNPattern is a regular expression matching standard ISSN.
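As a small illustration, the sketch below applies the same expression to free text. The variable `issnPattern` and the helper `findISSNs` are hypothetical names introduced here, not part of the package. Note that the expression is unanchored and does not verify the ISSN check digit.

```go
package main

import (
	"fmt"
	"regexp"
)

// issnPattern repeats the expression documented for ISSNPattern.
var issnPattern = regexp.MustCompile(`[0-9]{4,4}-[0-9]{3,3}[0-9X]`)

// findISSNs (hypothetical helper) returns all ISSN-shaped substrings in s.
// It only checks the shape, not the check digit.
func findISSNs(s string) []string {
	return issnPattern.FindAllString(s, -1)
}

func main() {
	fmt.Println(findISSNs("print 0028-0836, online 1476-4687, other 2434-561X"))
	// prints: [0028-0836 1476-4687 2434-561X]
}
```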
Functions ¶
func ByteSink ¶
func ByteSink(w io.Writer, out chan []byte, done chan bool)
ByteSink is a fan-in writer for a []byte channel. A newline is appended after each object.
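The fan-in pattern described here could look roughly like the following. This is a sketch under assumptions, not the package's implementation: `byteSink` and `collect` are hypothetical names, write errors are ignored, and the `done` channel is assumed to be signalled once the input channel is closed and drained.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// byteSink drains out, writes every item followed by a newline to w and
// signals on done once out has been closed. Write errors are ignored in
// this sketch.
func byteSink(w io.Writer, out chan []byte, done chan bool) {
	for b := range out {
		w.Write(b)
		io.WriteString(w, "\n")
	}
	done <- true
}

// collect (hypothetical helper) pushes items through byteSink and returns
// everything that was written.
func collect(items ...string) string {
	var buf bytes.Buffer
	out := make(chan []byte)
	done := make(chan bool)
	go byteSink(&buf, out, done)
	for _, s := range items {
		out <- []byte(s)
	}
	close(out)
	<-done // wait until the sink has written everything
	return buf.String()
}

func main() {
	fmt.Print(collect("first", "second"))
}
```

In a real pipeline several worker goroutines would typically send on `out`, with a single sink serializing access to the writer.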
func DetectLang3 ¶
func DetectLang3(text string) (string, error)
DetectLang3 returns the best guess 3-letter language code for a given text.
func FromLines ¶ added in v0.1.56
func FromLines(r io.Reader, f ImporterFunc) (chan []Importer, error)
FromLines returns a channel of slices of importable objects with a default batch size of 20000 docs.
func FromLinesSize ¶ added in v0.1.56
func FromLinesSize(r io.Reader, f ImporterFunc, size int) (chan []Importer, error)
FromLinesSize returns a channel of slices of importable values, given a reader, f (for a single value) and number of documents to batch. Important: Due to fan-out, input and output order will not be preserved.
func FromXMLSize ¶
func FromXMLSize(r io.Reader, name string, decoderFunc XMLDecoderFunc, size int) (chan []Importer, error)
FromXMLSize returns a channel of importable document slices given a reader over XML, a name of the XML start element, an XMLDecoderFunc callback that deserializes an XML snippet and a batch size. TODO(miku): more idiomatic error handling, e.g. over error channel.
func ReadLines ¶ added in v0.1.130
func ReadLines(filename string) (lines []string, err error)
ReadLines returns a list of trimmed lines in a file. Empty lines are skipped.
func UnescapeTrim ¶
func UnescapeTrim(s string) string
UnescapeTrim unescapes HTML character references and trims the space of a given string.
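One plausible way to get this behavior with the standard library is to combine html.UnescapeString and strings.TrimSpace. The function `unescapeTrim` below is a sketch under that assumption, not necessarily how the package implements it.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// unescapeTrim first resolves HTML character references such as &amp;
// and then strips leading and trailing whitespace.
func unescapeTrim(s string) string {
	return strings.TrimSpace(html.UnescapeString(s))
}

func main() {
	fmt.Printf("%q\n", unescapeTrim("  Fish &amp; Chips \n"))
	// prints: "Fish & Chips"
}
```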
Types ¶
type FileReader ¶ added in v0.1.130
type FileReader struct {
	Filename string
	// contains filtered or unexported fields
}
FileReader creates a ReadCloser from a filename. It postpones error handling up until the first read. TODO: Throw this out.
func (*FileReader) Close ¶ added in v0.1.130
func (r *FileReader) Close() (err error)
Close closes the file.
type Importer ¶
type Importer interface {
ToIntermediateSchema() (*finc.IntermediateSchema, error)
}
Importer objects can be converted into an intermediate schema.
type ImporterFunc ¶ added in v0.1.56
ImporterFunc turns a byte slice into a single importable object.
type LinkReader ¶ added in v0.1.130
type LinkReader struct {
	Link string
	// contains filtered or unexported fields
}
LinkReader implements io.Reader for a URL.
type SavedLink ¶ added in v0.1.130
type SavedLink struct {
	Link string
	// contains filtered or unexported fields
}
SavedLink saves the content of a URL to a file.
type SavedReaders ¶ added in v0.1.130
SavedReaders takes a list of readers and persists their content in a temporary file.
func (*SavedReaders) Remove ¶ added in v0.1.130
func (r *SavedReaders) Remove()
Remove removes any leftover temporary file.
func (*SavedReaders) Save ¶ added in v0.1.130
func (r *SavedReaders) Save() (filename string, err error)
Save saves all readers to a temporary file and returns the filename.
type SkipReader ¶ added in v0.1.130
type SkipReader struct {
	CommentPrefixes []string
	// contains filtered or unexported fields
}
SkipReader skips empty lines and lines with comments.
func NewSkipReader ¶ added in v0.1.130
func NewSkipReader(r *bufio.Reader) *SkipReader
NewSkipReader creates a new SkipReader.
func (SkipReader) ReadString ¶ added in v0.1.130
func (r SkipReader) ReadString(delim byte) (s string, err error)
ReadString will return only non-empty lines and lines not starting with a comment prefix.
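The skipping logic might be implemented along these lines. This is a sketch, not the package's code: the lowercase `skipReader` type, the `readAll` helper and the choice of "#" as comment prefix are assumptions. Whitespace-only lines are treated as empty here.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// skipReader wraps a bufio.Reader and hides empty and comment lines.
type skipReader struct {
	r               *bufio.Reader
	commentPrefixes []string
}

// isComment reports whether line starts with one of the comment prefixes.
func (s skipReader) isComment(line string) bool {
	for _, p := range s.commentPrefixes {
		if strings.HasPrefix(line, p) {
			return true
		}
	}
	return false
}

// ReadString returns the next line that is neither empty nor a comment.
// The error from the underlying reader (for example io.EOF) is passed on.
func (s skipReader) ReadString(delim byte) (string, error) {
	for {
		line, err := s.r.ReadString(delim)
		if strings.TrimSpace(line) != "" && !s.isComment(line) {
			return line, err
		}
		if err != nil {
			return "", err
		}
	}
}

// readAll (hypothetical helper) collects all lines the reader lets through.
func readAll(input string) []string {
	s := skipReader{r: bufio.NewReader(strings.NewReader(input)), commentPrefixes: []string{"#"}}
	var lines []string
	for {
		line, err := s.ReadString('\n')
		if line != "" {
			lines = append(lines, strings.TrimSpace(line))
		}
		if err != nil {
			return lines
		}
	}
}

func main() {
	fmt.Println(readAll("# comment\n\nfirst\n   \nsecond"))
	// prints: [first second]
}
```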
type Source ¶
Source can emit records given a reader. The channel is of type []Importer, to allow the source to send objects over the channel in batches for performance (1000 x 1000 docs vs 1000000 x 1 doc).
type WriteCounter ¶ added in v0.1.130
type WriteCounter struct {
// contains filtered or unexported fields
}
WriteCounter counts the number of bytes written through it.
func (*WriteCounter) Count ¶ added in v0.1.130
func (w *WriteCounter) Count() uint64
Count returns the number of bytes written.
type XMLDecoderFunc ¶
XMLDecoderFunc returns an importable document, given an XML decoder and a start element.
type ZipContentReader ¶ added in v0.1.130
type ZipContentReader struct {
	Filename string
	// contains filtered or unexported fields
}
ZipContentReader returns the concatenated content of all files in a zip archive given by its filename. All content is temporarily stored in memory, so this type should only be used with smaller archives.
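Concatenating all members of a zip archive in memory can be sketched with archive/zip as shown below. The functions `concatZip` and `makeZip` are hypothetical. To stay self-contained the example builds the archive in memory instead of opening a file by name.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// concatZip returns the concatenated content of all files in the zip
// archive held in data, in archive order. Everything is kept in memory.
func concatZip(data []byte) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	for _, f := range zr.File {
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		_, err = io.Copy(&buf, rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// makeZip builds a small archive from (name, content) pairs for the demo.
func makeZip(files [][2]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, f := range files {
		w, err := zw.Create(f[0])
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte(f[1])); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	archive, err := makeZip([][2]string{{"a.txt", "one\n"}, {"b.txt", "two\n"}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	content, err := concatZip(archive)
	fmt.Print(string(content), err, "\n")
}
```

Because the whole content is buffered, memory use grows with the uncompressed size of the archive, which matches the caveat about smaller archives above.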
type ZipOrPlainLinkReader ¶ added in v0.1.130
type ZipOrPlainLinkReader struct {
	Link string
	// contains filtered or unexported fields
}
ZipOrPlainLinkReader is a reader that transparently handles zipped and uncompressed content, given a URL as string.
Directories ¶
| Path | Synopsis |
|---|---|
| cmd | |
| span-check | span-check runs quality checks on input data |
| span-deduplicate | deduplicate an intermediate schema with respect to licensing information |
| span-export | span-export creates various destination formats, mostly for SOLR. |
| span-import | Converts various input formats into an intermediate schema. |
| span-redact | redact intermediate schema |
| span-tag | span-tag takes an intermediate schema file and a configuration forest of filters for various tags and runs all filters on every record of the input to produce a stream of tagged records. |
| | Package sets implements basic set types. |
| encoding | |
| csv | Package csv implements a decoder that supports CSV decoding. |
| formeta | Package formeta implements marshaling for formeta (metafacture internal format). |
| tsv | Package tsv implements a decoder for tab separated data. |
| | Package filter implements flexible ISIL attachments with expression trees[1], serialized as JSON. |
| | Package finc holds finc SolrSchema (SOLR) and intermediate schema related types and methods. |
| | Package licensing implements support for KBART and ISIL attachments. |
| kbart | Package kbart implements support for KBART (Knowledge Bases And Related Tools working group, http://www.uksg.org/kbart/) holding files (http://www.uksg.org/kbart/s5/guidelines/data_format). |
| | Package qa implements quality assurance helpers. |
| sources | |
| crossref | Package crossref implements crossref related structs and transformations. |