span

package module
v0.1.156 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 12, 2017 License: GPL-3.0 Imports: 19 Imported by: 3

README

Span

Install with

$ go get github.com/miku/span/cmd/...

or via deb or rpm packages.

Formats

Also:

TODO

  • Decouple format from source. Things like SourceID and MegaCollection are per source, not format.

Jsoniter testdrive

  • encoding/json
$ time taskcat GeniosIntermediateSchema | span-tag -c $(taskoutput AMSLFilterConfig) > /dev/null
...
real    11m48.803s
user    40m15.980s
sys      0m32.880s
  • jsoniter/go
$ time taskcat GeniosIntermediateSchema | span-tag -c $(taskoutput AMSLFilterConfig) > /dev/null
...

real    9m25.871s
user    31m29.240s
sys      0m32.572s

Licence

  • GPLv3
  • This project uses the Compact Language Detector 2 - CLD2, Apache License Version 2.0

Documentation

Overview

Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de
                  The Finc Authors, http://finc.info
                  Martin Czygan, <martin.czygan@uni-leipzig.de>

This file is part of some open source application.

Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Foobar. If not, see <http://www.gnu.org/licenses/>.

@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>

Index

Constants

const (
	// AppVersion of span package. Commandline tools will show this on -v.
	AppVersion = "0.1.156"
	// KeyLengthLimit is a limit imposed by memcached protocol, which is used
	// for blob storage as of June 2015. If we change the key value store,
	// this limit might become obsolete.
	KeyLengthLimit = 250
)

Variables

var ISSNPattern = regexp.MustCompile(`[0-9]{4,4}-[0-9]{3,3}[0-9X]`)

ISSNPattern is a regular expression matching standard ISSN.
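As a rough illustration of how the pattern behaves, the sketch below repeats the expression in a local variable so that it runs without the span package; `issnPattern` and `findISSNs` are hypothetical names introduced only for this example. Note that the expression is unanchored and checks the shape of an ISSN only, not its check digit.

```go
package main

import (
	"fmt"
	"regexp"
)

// issnPattern repeats the documented expression so the example is
// self-contained; real code would use span.ISSNPattern instead.
var issnPattern = regexp.MustCompile(`[0-9]{4,4}-[0-9]{3,3}[0-9X]`)

// findISSNs (hypothetical helper) returns all substrings of text that
// have the shape of an ISSN. The check digit is not validated.
func findISSNs(text string) []string {
	return issnPattern.FindAllString(text, -1)
}

func main() {
	fmt.Println(findISSNs("Print 0028-0836, online 1476-4687, other 2049-363X."))
}
```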

Functions

func ByteSink

func ByteSink(w io.Writer, out chan []byte, done chan bool)

ByteSink is a fan-in writer for a []byte channel. A newline is appended after each object.
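The fan-in idea can be sketched as follows. This is not the package's implementation: `byteSink` is a local stand-in with the same signature, and it assumes that the sink signals on `done` once the `out` channel has been closed and drained. `collect` is a hypothetical helper showing several workers feeding one writer.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// byteSink is a hypothetical stand-in with the same signature as ByteSink.
// It drains out, writes each item followed by a newline, then signals done.
// Write errors are ignored to keep the sketch short.
func byteSink(w io.Writer, out chan []byte, done chan bool) {
	for b := range out {
		w.Write(b)
		w.Write([]byte("\n"))
	}
	done <- true
}

// collect starts n workers that each emit one record and returns
// everything the single sink goroutine wrote.
func collect(n int) string {
	var buf bytes.Buffer
	out := make(chan []byte)
	done := make(chan bool)
	go byteSink(&buf, out, done)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			out <- []byte(fmt.Sprintf(`{"worker": %d}`, id))
		}(i)
	}
	wg.Wait()  // all workers have sent their record
	close(out) // lets the sink's range loop finish
	<-done     // wait until the sink has written everything
	return buf.String()
}

func main() {
	fmt.Print(collect(3)) // three lines, in no particular order
}
```

Only the sink goroutine touches the writer, so the workers need no locking around it.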

func DetectLang3

func DetectLang3(text string) (string, error)

DetectLang3 returns the best-guess three-letter language code for a given text.

func FromLines added in v0.1.56

func FromLines(r io.Reader, f ImporterFunc) (chan []Importer, error)

FromLines returns a channel of slices of importable objects with a default batch size of 20000 docs.

func FromLinesSize added in v0.1.56

func FromLinesSize(r io.Reader, f ImporterFunc, size int) (chan []Importer, error)

FromLinesSize returns a channel of slices of importable values, given a reader, f (for a single value) and the number of documents to batch. Important: Due to fan-out, the output order will not match the input order.
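A minimal sketch of the batching idea, under simplifying assumptions: `batchLines` and `batchSizes` are hypothetical helpers, lines are kept as plain strings instead of being converted with an ImporterFunc, and a single goroutine is used, so unlike FromLinesSize this version preserves order.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// batchLines is a hypothetical, simplified stand-in for the batching idea.
// It reads r line by line and sends slices of at most size lines.
// Scanner errors are ignored to keep the sketch short.
func batchLines(r io.Reader, size int) <-chan []string {
	out := make(chan []string)
	go func() {
		defer close(out)
		batch := make([]string, 0, size)
		sc := bufio.NewScanner(r) // default limit: 64 KiB per line
		for sc.Scan() {
			batch = append(batch, sc.Text())
			if len(batch) == size {
				out <- batch
				batch = make([]string, 0, size)
			}
		}
		if len(batch) > 0 {
			out <- batch // final, possibly shorter batch
		}
	}()
	return out
}

// batchSizes reports the length of each batch produced for input.
func batchSizes(input string, size int) []int {
	var sizes []int
	for b := range batchLines(strings.NewReader(input), size) {
		sizes = append(sizes, len(b))
	}
	return sizes
}

func main() {
	fmt.Println(batchSizes("a\nb\nc\nd\ne\n", 2))
}
```

Sending slices rather than single values reduces the number of channel operations, which is the motivation for batching given in the Source documentation.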

func FromXML

func FromXML(r io.Reader, name string, decoderFunc XMLDecoderFunc) (chan []Importer, error)

FromXML is like FromXMLSize, with a default batch size of 2000 XML documents.

func FromXMLSize

func FromXMLSize(r io.Reader, name string, decoderFunc XMLDecoderFunc, size int) (chan []Importer, error)

FromXMLSize returns a channel of importable document slices given a reader over XML, the name of the XML start element, an XMLDecoderFunc callback that deserializes an XML snippet and a batch size. TODO(miku): more idiomatic error handling, e.g. over error channel.
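The streaming approach described here (scan for a named start element, then hand the decoder and that element to a callback) can be sketched with encoding/xml alone. The `article` type, the element names and the helpers `decodeAll` and `countArticles` are assumptions made for this example; batching and channels are left out.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// article is a hypothetical record type used only in this sketch.
type article struct {
	Title string `xml:"title"`
}

// decodeAll walks the token stream of r and decodes every element whose
// local name equals name, without loading the whole document into memory.
func decodeAll(r io.Reader, name string) ([]article, error) {
	dec := xml.NewDecoder(r)
	var docs []article
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return docs, nil
		}
		if err != nil {
			return nil, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == name {
			var a article
			// DecodeElement consumes the element that starts at se.
			if err := dec.DecodeElement(&a, &se); err != nil {
				return nil, err
			}
			docs = append(docs, a)
		}
	}
}

// countArticles returns the number of <article> elements in s, or -1 on error.
func countArticles(s string) int {
	docs, err := decodeAll(strings.NewReader(s), "article")
	if err != nil {
		return -1
	}
	return len(docs)
}

func main() {
	in := `<records><article><title>A</title></article><article><title>B</title></article></records>`
	fmt.Println(countArticles(in))
}
```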

func ReadLines added in v0.1.130

func ReadLines(filename string) (lines []string, err error)

ReadLines returns a list of trimmed lines in a file. Empty lines are skipped.

func UnescapeTrim

func UnescapeTrim(s string) string

UnescapeTrim unescapes HTML character references and trims the space of a given string.
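The described behaviour can be approximated with two standard library calls. `unescapeTrim` below is a local sketch, not the package's code, and the real function may handle edge cases differently.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// unescapeTrim is a hypothetical stand-in for UnescapeTrim. It first
// resolves HTML character references, then trims surrounding whitespace.
func unescapeTrim(s string) string {
	return strings.TrimSpace(html.UnescapeString(s))
}

func main() {
	fmt.Printf("%q\n", unescapeTrim("  Fish &amp; Chips \n"))
}
```

Unescaping before trimming means that whitespace produced by a character reference at either end of the string is removed as well.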

Types

type FileReader added in v0.1.130

type FileReader struct {
	Filename string
	// contains filtered or unexported fields
}

FileReader creates a ReadCloser from a filename. It postpones error handling until the first read. TODO: Throw this out.

func (*FileReader) Close added in v0.1.130

func (r *FileReader) Close() (err error)

Close closes the file.

func (*FileReader) Read added in v0.1.130

func (r *FileReader) Read(p []byte) (n int, err error)

Read reads from the file.

type Importer

type Importer interface {
	ToIntermediateSchema() (*finc.IntermediateSchema, error)
}

Importer objects can be converted into an intermediate schema.

type ImporterFunc added in v0.1.56

type ImporterFunc func(b []byte) (Importer, error)

ImporterFunc turns a byte slice into a single importable object.

type LinkReader added in v0.1.130

type LinkReader struct {
	Link string
	// contains filtered or unexported fields
}

LinkReader implements io.Reader for a URL.

func (*LinkReader) Read added in v0.1.130

func (r *LinkReader) Read(p []byte) (int, error)

type SavedLink

type SavedLink struct {
	Link string
	// contains filtered or unexported fields
}

SavedLink saves the content of a URL to a file.

func (*SavedLink) Remove added in v0.1.130

func (s *SavedLink) Remove()

Remove removes any leftover temporary file.

func (*SavedLink) Save added in v0.1.130

func (s *SavedLink) Save() (filename string, err error)

Save saves the link to a temporary file and returns the filename.

type SavedReaders added in v0.1.130

type SavedReaders struct {
	Readers []io.Reader
	// contains filtered or unexported fields
}

SavedReaders takes a list of readers and persists their content in a temporary file.

func (*SavedReaders) Remove added in v0.1.130

func (r *SavedReaders) Remove()

Remove removes any leftover temporary file.

func (*SavedReaders) Save added in v0.1.130

func (r *SavedReaders) Save() (filename string, err error)

Save saves all readers to a temporary file and returns the filename.

type Skip

type Skip struct {
	Reason string
}

Skip marks records to skip.

func (Skip) Error

func (s Skip) Error() string

Error returns the reason for skipping.
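One way a caller might tell a skipped record apart from a real failure is errors.As. In the sketch below the `Skip` type is redeclared locally so the example runs on its own; the body of `Error` is an assumption, and `convert` and `process` are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// Skip mirrors the documented type; the Error body is an assumption.
type Skip struct {
	Reason string
}

func (s Skip) Error() string { return s.Reason }

// convert is a hypothetical conversion step that skips empty titles.
func convert(title string) (string, error) {
	if title == "" {
		return "", Skip{Reason: "empty title"}
	}
	return "doc:" + title, nil
}

// process counts converted and skipped records and stops on any other error.
func process(titles []string) (kept, skipped int, err error) {
	for _, t := range titles {
		_, cerr := convert(t)
		var skip Skip
		switch {
		case errors.As(cerr, &skip):
			skipped++ // not a failure, just move on
		case cerr != nil:
			return kept, skipped, cerr
		default:
			kept++
		}
	}
	return kept, skipped, nil
}

func main() {
	fmt.Println(process([]string{"A", "", "B"}))
}
```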

type SkipReader added in v0.1.130

type SkipReader struct {
	CommentPrefixes []string
	// contains filtered or unexported fields
}

SkipReader skips empty lines and lines with comments.

func NewSkipReader added in v0.1.130

func NewSkipReader(r *bufio.Reader) *SkipReader

NewSkipReader creates a new SkipReader.

func (SkipReader) ReadString added in v0.1.130

func (r SkipReader) ReadString(delim byte) (s string, err error)

ReadString will return only non-empty lines and lines not starting with a comment prefix.

type Source

type Source interface {
	Iterate(io.Reader) (<-chan []Importer, error)
}

Source can emit records given a reader. The channel is of type []Importer, to allow the source to send objects over the channel in batches for performance (1000 x 1000 docs vs 1000000 x 1 doc).

type WriteCounter added in v0.1.130

type WriteCounter struct {
	// contains filtered or unexported fields
}

WriteCounter counts the number of bytes written through it.

func (*WriteCounter) Count added in v0.1.130

func (w *WriteCounter) Count() uint64

Count returns the number of bytes written.

func (*WriteCounter) Write added in v0.1.130

func (w *WriteCounter) Write(p []byte) (int, error)

Write increments the total byte count.
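A counting writer of this kind is typically placed next to the real destination with io.MultiWriter. The sketch below uses a local `writeCounter` type as a stand-in; its fields, the use of sync/atomic and the helper `countBytes` are assumptions, since the package's fields are unexported and not shown here.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync/atomic"
)

// writeCounter is a hypothetical stand-in for WriteCounter. It discards
// the data and only keeps track of how many bytes passed through.
type writeCounter struct {
	count uint64
}

// Write adds len(p) to the running total and reports all bytes as written.
func (w *writeCounter) Write(p []byte) (int, error) {
	atomic.AddUint64(&w.count, uint64(len(p)))
	return len(p), nil
}

// Count returns the number of bytes seen so far.
func (w *writeCounter) Count() uint64 {
	return atomic.LoadUint64(&w.count)
}

// countBytes copies s to a destination (io.Discard here) and returns
// the number of bytes the counter observed on the way.
func countBytes(s string) uint64 {
	var wc writeCounter
	dst := io.MultiWriter(io.Discard, &wc)
	io.Copy(dst, strings.NewReader(s))
	return wc.Count()
}

func main() {
	fmt.Println(countBytes("hello\n"))
}
```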

type XMLDecoderFunc

type XMLDecoderFunc func(*xml.Decoder, xml.StartElement) (Importer, error)

XMLDecoderFunc returns an importable document, given an XML decoder and a start element.

type ZipContentReader added in v0.1.130

type ZipContentReader struct {
	Filename string
	// contains filtered or unexported fields
}

ZipContentReader returns the concatenated content of all files in a zip archive given by its filename. All content is temporarily stored in memory, so this type should only be used with smaller archives.

func (*ZipContentReader) Read added in v0.1.130

func (r *ZipContentReader) Read(p []byte) (int, error)

Read returns the content of all archive members.

type ZipOrPlainLinkReader added in v0.1.130

type ZipOrPlainLinkReader struct {
	Link string
	// contains filtered or unexported fields
}

ZipOrPlainLinkReader is a reader that transparently handles zipped and uncompressed content, given a URL as string.

func (*ZipOrPlainLinkReader) Read added in v0.1.130

func (r *ZipOrPlainLinkReader) Read(p []byte) (int, error)

Read implements the reader interface.

Directories

Path Synopsis
cmd
span-check
span-check runs quality checks on input data
span-deduplicate
deduplicate an intermediate schema with respect to licensing information
span-export
span-export creates various destination formats, mostly for SOLR.
span-import
Converts various input formats into an intermediate schema.
span-redact
redact intermediate schema
span-tag
span-tag takes an intermediate schema file and a configuration forest of filters for various tags and runs all filters on every record of the input to produce a stream of tagged records.
Package sets implements basic set types.
encoding
csv
Package csv implements a decoder, that supports CSV decoding.
formeta
Package formeta implements marshaling for formeta (metafacture internal format).
tsv
Package tsv implements a decoder for tab separated data.
Package filter implements flexible ISIL attachments with expression trees[1], serialized as JSON.
Package finc holds finc SolrSchema (SOLR) and intermediate schema related types and methods.
Package licensing implements support for KBART and ISIL attachments.
kbart
Package kbart implements support for KBART (Knowledge Bases And Related Tools working group, http://www.uksg.org/kbart/) holding files (http://www.uksg.org/kbart/s5/guidelines/data_format).
Package qa implements quality assurance helpers.
sources
crossref
Package crossref implements crossref related structs and transformations.
oai
