span

v0.1.354
Published: Nov 9, 2023 License: GPL-3.0 Imports: 12 Imported by: 3

README

Span

Span started as a single tool to convert Crossref API data into a VuFind/SOLR format as used in finc. An intermediate representation for article metadata is used for normalizing various input formats. Go was chosen as the implementation language because it is easy to deploy and has concurrency support built into the language. A basic scatter-gather design made it possible to process millions of records quickly.

While span has a few independent tools (like fetching or compacting crossref feeds), it is mostly used inside siskin, a set of tasks to build an aggregated index.

Installation

$ go install github.com/miku/span/cmd/...@latest

Span has frequent releases, although not all versions will be packaged as deb or rpm.

Background

The initial import dates to Tue Feb 3 19:11:08 2015 and consisted of a single span command. In March 2015, span-import and span-export appeared, along with some rudimentary commands for dealing with holding files of various formats. In early 2016, a licensing tool was briefly named span-label before becoming span-tag. In Summer 2016, span-check, span-deduplicate and span-redact were added; a first man page followed later.

In Summer 2017, span-deduplicate was removed and the DOI-based deduplication was split between the blunt but fast groupcover and the generic span-update-labels. A new span-oa-filter helped to mark open-access records. In Winter 2017, span-freeze was added to allow for fixed configuration across dozens of files. The span-crossref-snapshot tool replaced a sequence of luigi tasks responsible for creating a snapshot of crossref data (the process has been summarized in a comment).

In Summer 2018, three new tools were added: span-compare for generating index diffs for index update tickets, span-review for generating reports based on SOLR queries and span-webhookd for triggering index reviews and ticket updates through GitLab.

Over the course of development, new input and output formats have been added. The parallel processing of records has been streamlined with the help of a small library called parallel. Since Winter 2017, the zek struct generator has taken care of the initial screening of sources serialized as XML, making the process of mapping new data sources easier.

Since about 2018, the span tools have seen mostly small fixes and additions. Notably, since 2021, the scripts previously used to fetch daily metadata updates from crossref have been replaced by a standalone tool, span-crossref-sync, which merely adds some retry logic and consistent file naming to the API harvest.

Documentation

See: manual source.

Performance

Ideally, a complete processing run should take no more than two hours and run no slower than 20000 records/s. The most expensive part currently seems to be the JSON serialization, but we keep JSON for now for the sake of readability. Experiments with faster JSON serializers and msgpack have been encouraging; a faster serialization should be the next measure to improve performance.

Most tools that work on lines will try to use as many workers as CPU cores. Except for span-tag - which needs to keep all holdings data in memory - all tools work well in a low-memory environment.

More cores can help (but returns may diminish): On a 64 core 2021 Xeon, we find that e.g. span-export can process (decompression, deserialization, conversion, serialization, compression) on average 130000 JSON documents/s. The final pipeline stage (from normalized data to deduplicated and indexable data) seems to take about three hours.

Integration

The span tools are used in various tasks in siskin (which contains all orchestration code). All span tools work fine standalone, and most will accept input from stdin as well, allowing for one-off things like:

$ metha-cat http://oai.web | span-import -i name | span-tag -c amsl | span-export | solrbulk

TODO

There is an open issue regarding more flexible license labelling. While this would be useful, it would probably be even more useful to separate content conversions from licensing issues altogether. A lot of work has been done in prototypes, which explore how fast and how reliably we can rewrite documents in a production server.

Ideally, a cron job or trigger regularly checks and ensures compliance.

Documentation

Overview

Package span implements common functions.

Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de
                  The Finc Authors, http://finc.info
                  Martin Czygan, <martin.czygan@uni-leipzig.de>

This file is part of some open source application.

Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Foobar. If not, see <http://www.gnu.org/licenses/>.

@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>

Index

Constants

const (
	// AppVersion of span package. Commandline tools will show this on -v.
	AppVersion = "0.1.353"
	// KeyLengthLimit was a limit imposed by the memcached protocol, which
	// was used for blob storage until Q1 2017. We switched the key-value
	// store, so this limit is somewhat obsolete.
	KeyLengthLimit = 250
)

Variables

var ISO639BibliographicToThree = map[string]string{

	"alb": "sqi",
	"arm": "hye",
	"baq": "eus",
	"bur": "mya",
	"chi": "zho",
	"cze": "ces",
	"dut": "nld",
	"fre": "fra",
	"geo": "kat",
	"ger": "deu",
	"gre": "ell",
	"ice": "isl",
	"mac": "mkd",
	"mao": "mri",
	"may": "msa",
	"per": "fas",
	"rum": "ron",
	"slo": "slk",
	"tib": "bod",
	"wel": "cym",
}

ISO639BibliographicToThree maps ISO 639-2 bibliographic identifiers to three-letter ISO 639-3 identifiers.
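
A small sketch of how such a table might be used to normalize language codes. The map below is a three-entry excerpt of the table above, and normalizeCode is a hypothetical helper that is not part of the package; codes without a bibliographic variant are passed through unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// bibliographicToThree is a short excerpt of ISO639BibliographicToThree.
var bibliographicToThree = map[string]string{
	"dut": "nld",
	"fre": "fra",
	"ger": "deu",
}

// normalizeCode (hypothetical helper) trims and lowercases a three-letter
// code and replaces a bibliographic 639-2 code with its 639-3 counterpart.
// Any other value is returned as is.
func normalizeCode(code string) string {
	code = strings.ToLower(strings.TrimSpace(code))
	if v, ok := bibliographicToThree[code]; ok {
		return v
	}
	return code
}

func main() {
	for _, c := range []string{"ger", "FRE", "eng"} {
		fmt.Printf("%s -> %s\n", c, normalizeCode(c))
	}
}
```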

var ISO639NameToThree = map[string]string{}/* 7849 elements not displayed */

ISO639NameToThree maps a language name to three letter identifier.

var ISO639NameToThreeLower = map[string]string{}/* 7849 elements not displayed */

ISO639NameToThreeLower converts lowercase ISO language name to ISO639-3.

var ISO639OneToThree = map[string]string{}/* 184 elements not displayed */

ISO639OneToThree maps a two-letter ISO 639-1 identifier (where one exists) to a three-letter ISO 639-3 identifier.

var Static embed.FS

Functions

func DetectLang3

func DetectLang3(text string) (string, error)

DetectLang3 returns the best guess 3-letter language code for a given text.

func GenFincID added in v0.1.317

func GenFincID(sid, rid string) string

GenFincID returns a finc.id string consisting of an arbitrary prefix (e.g. "ai"), source id and URL safe record id. No additional checks are performed; sid and rid should not be empty.

func LanguageIdentifier added in v0.1.130

func LanguageIdentifier(s string) string

LanguageIdentifier returns the three letter identifier from a variety of language name notations. Returns the empty string, if nothing matches. All data from http://www-01.sil.org/iso639-3/codes.asp.

func UnfreezeFilterConfig added in v0.1.130

func UnfreezeFilterConfig(frozenfile string) (dir, blob string, err error)

UnfreezeFilterConfig takes the name of a zipfile (from span-freeze) and returns the path of the thawed filterconfig (along with the temporary directory and an error). When this function returns, all URLs in the filterconfig have been replaced by absolute paths on the file system. Cleanup of the temporary directory is the responsibility of the caller.
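
Because the caller owns the temporary directory, a typical call site would remove it once the configuration has been read. The sketch below shows that pattern with a stub in place of the real function, so that it runs without a frozen file: unfreezeStub, loadFilterConfig and the file name filterconfig.json are all hypothetical, and the stub does not open the zip file at all.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// unfreezeStub is a hypothetical stand-in with the same signature as
// span.UnfreezeFilterConfig. It ignores frozenfile and only creates a
// temporary directory containing a placeholder configuration.
func unfreezeStub(frozenfile string) (dir, blob string, err error) {
	dir, err = os.MkdirTemp("", "span-unfreeze-")
	if err != nil {
		return "", "", err
	}
	blob = filepath.Join(dir, "filterconfig.json") // assumed file name
	if err = os.WriteFile(blob, []byte("{}"), 0o644); err != nil {
		os.RemoveAll(dir)
		return "", "", err
	}
	return dir, blob, nil
}

// loadFilterConfig reads the thawed configuration and removes the
// temporary directory afterwards, as the caller is expected to do.
func loadFilterConfig(frozenfile string) ([]byte, error) {
	dir, blob, err := unfreezeStub(frozenfile)
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir) // cleanup is the caller's responsibility
	return os.ReadFile(blob)
}

func main() {
	b, err := loadFilterConfig("frozen.zip")
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d bytes of filter configuration\n", len(b))
}
```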

Types

type Skip

type Skip struct {
	Reason string
}

Skip marks records to skip.

func (Skip) Error

func (s Skip) Error() string

Error returns the reason for skipping.
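
Since Skip implements the error interface, a conversion loop can tell a skipped record apart from a real failure. The sketch below redeclares the type locally so that it is self-contained; the body of Error (returning the bare reason), the convert function and its empty-title rule are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Skip mirrors the type documented above.
type Skip struct {
	Reason string
}

// Error returns the reason for skipping. The exact string format used by
// the package is an assumption here.
func (s Skip) Error() string { return s.Reason }

// convert is a hypothetical conversion step that skips empty titles.
func convert(title string) (string, error) {
	if title == "" {
		return "", Skip{Reason: "empty title"}
	}
	return "converted: " + title, nil
}

// convertAll converts all titles, counts skipped records and stops at the
// first error that is not a Skip.
func convertAll(titles []string) ([]string, int, error) {
	var out []string
	skipped := 0
	for _, t := range titles {
		doc, err := convert(t)
		var skip Skip
		if errors.As(err, &skip) {
			skipped++
			continue
		}
		if err != nil {
			return nil, skipped, err
		}
		out = append(out, doc)
	}
	return out, skipped, nil
}

func main() {
	out, skipped, err := convertAll([]string{"A", "", "B"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(out), "converted,", skipped, "skipped")
}
```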

Directories

Path Synopsis
cmd
span-amsl-discovery
The span-amsl-discovery tool will create a discovery (now defunct) like API response from available AMSL endpoints, refs #14456, #14415.
span-check
span-check runs quality checks on input data
span-compare
span-compare renders a table with ISIL/SID counts of two indices side by side.
span-crossref-members
span-crossref-members fetches crossref members api.
span-crossref-snapshot
Given a single file with crossref works API messages, create a potentially smaller file, which contains only the most recent version of each document.
span-crossref-sync
span-crossref-sync downloads and caches raw crossref messages from the crossref works API.
span-crossref-table
Create a tabular representation of crossref data.
span-doisniffer
Sniff out DOI from JSON document, optionally update docs with found DOI.
span-export
span-export creates various destination formats, mostly for SOLR.
span-folio
WIP: span-folio talks to FOLIO API to fetch ISIL, collections and other information relevant to attachments.
span-freeze
Freeze file containing urls along with the content of all urls.
span-hcov
The span-hcov tool will generate a simple coverage report given a holding file in KBART format.
span-import
span-reshape is a dumbed down span-import.
span-join-assets
span-join-assets combines a directory of json or single column TSV configurations into a single file.
span-local-data
The span-local-data extracts data from a JSON file - something `jq` can do just as well, albeit a bit slower.
span-oa-filter
span-oa-filter will set x.oa to true, if the given KBART file validates a record.
span-redact
Redact intermediate schema, that is set fulltext field to the empty string.
span-report
span-report creates data subsets from an index for reporting.
span-review
span-review runs plausibility queries against a SOLR server, mostly facet queries, refs #12756.
span-tag
span-tag takes an intermediate schema file and a configuration forest of filters for various tags and runs all filters on every record of the input to produce a stream of tagged records.
span-tagger
WIP: span-tagger will be a replacement of span-tag, with improvements:
span-update-labels
span-update-labels takes a TSV of IDs and ISILs and updates an intermediate schema record x.labels field accordingly.
span-webhookd
span-webhookd can serve as a webhook receiver[1] for gitlab, refs #13499.
Package configutil handles application configuration and location and loading of various mapping files.
Package sets implements basic set types.
Package dateutil provides interval handling.
Package doi helps to find DOI in JSON documents.
encoding
csv
Package csv implements a decoder, that supports CSV decoding.
formeta
Package formeta implements marshaling for formeta (metafacture internal format).
tsv
Package tsv implements a decoder for tab separated data.
Package filter implements flexible ISIL attachments with expression trees[1], serialized as JSON.
Package folio adds support for a minimal subset of the FOLIO library platform API.
formats
doaj
Package doaj maps DOAJ metadata to intermediate schema.
dummy
Package dummy is just a minimal example.
elsevier
TODO.
genderopen
Package genderopen, refs #13024.
jstor
TODO.
Package gitlab contains support types for gitlab interaction.
Package licensing implements support for KBART and ISIL attachments.
kbart
Package kbart implements support for KBART (Knowledge Bases And Related Tools working group, http://www.uksg.org/kbart/) holding files (http://www.uksg.org/kbart/s5/guidelines/data_format).
Package parallel implements helpers for fast processing of line oriented inputs.
Package quality implements quality checks.
Package solrutil implements helpers to access a SOLR index.
Package tagging implements helper functions for attaching ISIL to records.
Package xflag adds an additional flag type Array for repeated string flags.
