ckit

package module
v0.0.0-...-24bf5da Latest
Warning

This package is not in the latest version of its module.

Published: Jul 28, 2024 License: MIT Imports: 26 Imported by: 0

README

ckit

Citation graph kit for the LABE project at SLUB Dresden. This subproject contains a few standalone command line programs and servers. The task orchestration part lives under labe/python.

  • doisniffer, filter to find DOI by patterns in Solr VuFind JSON documents
  • labed, an HTTP server serving Open Citations data fused with catalog metadata
  • tabjson, turn JSON into TSV
  • makta, turn TSV files into sqlite3 databases

To build all binaries, run:

$ make -j

To build a debian package, run:

$ make -j deb

To clean up all artifacts, run:

$ make clean

doisniffer

Tool to turn a VuFind Solr schema without a DOI field into one with such a field (doi_str_mv) by sniffing out potential DOIs from other fields.

$ cat index.ndj | doisniffer > augmented.ndj

By default, only documents that actually contain a DOI are passed through.

Usage of doisniffer:
  -K string
        ignore keys (regexp), comma separated (default "barcode,dewey")
  -S    do not skip unmatched documents
  -b int
        batch size (default 5000)
  -i string
        identifier key (default "id")
  -k string
        update key (default "doi_str_mv")
  -version
        show version and exit
  -w int
        number of workers (default 8)

labed

HTTP API server, takes requests for a given id and returns a result fused from OCI citations and index data.

It currently works with three types of sqlite3 databases:

  • id-to-doi mapping
  • OCI citations
  • index metadata
Usage
usage: labed [OPTION]

labed is a HTTP web server fusing Open Citation (https://opencitations.net/)
and library catalog data at SLUB Dresden (https://www.slub-dresden.de/) and
other libraries (https://finc.info/); it requires three types of databases:

(1) [-i] an sqlite3 catalog-id-to-doi translation database (~10GB+)
(2) [-o] an sqlite3 version of OCI/COCI (~150GB+)
(3) [-m] an sqlite3 mapping from catalog ids to (json) metadata; this can be repeated
         (size depends on index size and on how much metadata is included) (~40-350GB)

Each database may be updated separately, with separate processes.

Examples

  $ labed -c -z -addr localhost:1234 -i i.db -o o.db -m d.db

  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA3My9wbmFzLjg1LjguMjQ0NA
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTE3Ny8xMDQ5NzMyMzA1Mjc2Njg3
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxMC9qYy4yMDExLTAzODU
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxNC9hb3MvMTE3NjM0Nzk2Mw
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMjMwNy8yMDk1NTIx

Bulk requests

  $ curl -sL https://is.gd/xGqzsg | zstd -dc -T0 |
    parallel -j 40 "curl -s http://localhost:8000/id/{}" |
    jq -rc '[.id, .doi, .extra.citing_count, .extra.cited_count, .extra.took] | @tsv'

Flags

  -a string
        path to access log file (off, if empty)
  -addr string
        host and port to listen on (default "localhost:8000")
  -c    enable caching of expensive responses
  -ct duration
        cache trigger duration (default 250ms)
  -cx int
        maximum filesize cache in bytes (default 68719476736)
  -i string
        identifier database path (id-doi mapping)
  -logfile string
        application log file (stderr if empty)
  -m value
        index metadata cache sqlite3 path (repeatable)
  -o string
        oci as a database path (citations)
  -q    no application logging at all
  -stopwatch
        enable stopwatch (debug)
  -version
        show version and exit
  -z    enable gzip compression middleware
Using a stopwatch

Experimental -stopwatch flag to trace duration of various operations.

$ labed -stopwatch -c -z -i i.db -o o.db -m index.db
2022/01/13 12:26:30 setup group fetcher over 1 databases: [index.db]

    ___       ___       ___       ___       ___
   /\__\     /\  \     /\  \     /\  \     /\  \
  /:/  /    /::\  \   /::\  \   /::\  \   /::\  \
 /:/__/    /::\:\__\ /::\:\__\ /::\:\__\ /:/\:\__\
 \:\  \    \/\::/  / \:\::/  / \:\:\/  / \:\/:/  /
  \:\__\     /:/  /   \::/  /   \:\/  /   \::/  /
   \/__/     \/__/     \/__/     \/__/     \/__/

Examples:

  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA3My9wbmFzLjg1LjguMjQ0NA
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTAwMS9qYW1hLjI4Mi4xNi4xNTE5
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTAwNi9qbXJlLjE5OTkuMTcxNQ
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTE3Ny8xMDQ5NzMyMzA1Mjc2Njg3
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxMC9qYy4yMDExLTAzODU
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxNC9hb3MvMTE3NjM0Nzk2Mw
  http://localhost:8000/id/ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMjMwNy8yMDk1NTIx

2022/01/13 12:26:30 labed starting 4a89db4 2022-01-13T11:23:31Z http://localhost:8000

2021/09/29 17:35:20 timings for XVlB

> XVlB    0    0s             0.00    started query for: ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTA5OC9yc3BhLjE5OTguMDE2NA
> XVlB    1    397.191µs      0.00    found doi for id: 10.1098/rspa.1998.0164
> XVlB    2    481.676µs      0.01    found 8 citing items
> XVlB    3    18.984627ms    0.23    found 456 cited items
> XVlB    4    13.421306ms    0.16    mapped 464 dois back to ids
> XVlB    5    494.163µs      0.01    recorded unmatched ids
> XVlB    6    44.093361ms    0.52    fetched 302 blob from index data store
> XVlB    7    6.422462ms     0.08    encoded JSON
> XVlB    -    -              -       -
> XVlB    S    84.294786ms    1.0     total
Live Stats

The server collects a few metrics internally and exposes them via URL:

$ curl -s localhost:8000/stats | jq .
{
  "pid": 82132,
  "hostname": "",
  "uptime": "39m35.668250841s",
  "uptime_sec": 2375.668250841,
  "time": "2022-01-26 14:39:20.662511775 +0100 CET m=+2375.669845497",
  "unixtime": 1643204360,
  "status_code_count": {},
  "total_status_code_count": {
    "200": 1579,
    "307": 411,
    "404": 249,
    "500": 8
  },
  "count": 0,
  "total_count": 2247,
  "total_response_time": "1h16m48.617898515s",
  "total_response_time_sec": 4608.617898515,
  "total_response_size": 4333804080,
  "average_response_size": 1928706,
  "average_response_time": "2.0510093s",
  "average_response_time_sec": 2.0510093,
  "total_metrics_counts": {
    "cache_hit": 213,
    "cached": 182,
    "index_data_fetch": 2613454,
    "sql_query": 7524
  },
  "average_metrics_timers": {
    "cache_hit": 0.3651285,
    "cached": 0.17130907,
    "index_data_fetch": 0.000639452,
    "sql_query": 0.364669177
  }
}
TODO

tabjson

A non-generic, quick JSON to TSV converter

Turns jsonlines with an id field into (id, doc) TSV. We want to take an index snapshot, extract the id, create a TSV and then put it into sqlite3 so we can serve queries.

This is nothing jq could not do, e.g. via the following (might need escaping).

$ jq -rc '[.id, (. | tojson)] | @tsv'

TODO: remove or use compression.

$ tabjson -h
Usage of tabjson:
  -C    compress value; gz+b64
  -T    emit table showing possible savings through compression

Examples.

$ head -1 ../../data/index.data | tabjson
ai-49-aHR0...uMi4xNTU       {"access_facet":"Electronic Resourc ... }
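
The core of the conversion can be sketched in a few lines. This is an illustration under assumptions, not the tabjson source: the toRow helper is hypothetical, and compacting the document (so that no raw tab or newline can appear in the value column) is a choice made for this sketch.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"os"
)

// toRow extracts the "id" field from a JSON document and returns a single
// "id<TAB>doc" row, with the document compacted to one line.
func toRow(line []byte) (string, error) {
	var doc struct {
		ID string `json:"id"`
	}
	if err := json.Unmarshal(line, &doc); err != nil {
		return "", err
	}
	if doc.ID == "" {
		return "", fmt.Errorf("document has no id")
	}
	var buf bytes.Buffer
	if err := json.Compact(&buf, line); err != nil {
		return "", err
	}
	return doc.ID + "\t" + buf.String(), nil
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<26) // allow long lines
	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	for sc.Scan() {
		row, err := toRow(sc.Bytes())
		if err != nil {
			fmt.Fprintln(os.Stderr, "skipping line:", err)
			continue
		}
		fmt.Fprintln(w, row)
	}
}
```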

makta

make a database from tabular data

Turn tabular data into a lookup table using sqlite3. This is a working PROTOTYPE with limitations, e.g. no customizations, the table definition is fixed, etc.

CREATE TABLE IF NOT EXISTS map (k TEXT, v TEXT)

As a performance data point, an example dataset with 1B+ rows can be inserted and indexed in less than two hours (on a recent CPU and an nvme drive; database file size: 400G).

Installation

https://github.com/miku/labe/releases (wip)

$ go install github.com/miku/labe/go/ckit/cmd/makta@latest
How it works

Data is chopped up into smaller chunks (defaults to about 64MB) and imported with the .import command. Indexes are created only after all data has been imported.

Example
$ cat fixtures/sample-xs.tsv | column -t
10.1001/10-v4n2-hsf10003                    10.1177/003335490912400218
10.1001/10-v4n2-hsf10003                    10.1097/01.bcr.0000155527.76205.a2
10.1001/amaguidesnewsletters.1996.novdec01  10.1056/nejm199312303292707
10.1001/amaguidesnewsletters.1996.novdec01  10.1016/s0363-5023(05)80265-5
10.1001/amaguidesnewsletters.1996.novdec01  10.1001/jama.1994.03510440069036
10.1001/amaguidesnewsletters.1997.julaug01  10.1097/00007632-199612150-00003
10.1001/amaguidesnewsletters.1997.mayjun01  10.1164/ajrccm/147.4.1056
10.1001/amaguidesnewsletters.1997.mayjun01  10.1136/thx.38.10.760
10.1001/amaguidesnewsletters.1997.mayjun01  10.1056/nejm199507133330207
10.1001/amaguidesnewsletters.1997.mayjun01  10.1378/chest.88.3.376

$ makta -o xs.db < fixtures/sample-xs.tsv
2021/10/04 16:13:06 [ok] initialized database · xs.db
2021/10/04 16:13:06 [io] written 679B · 361.3K/s
2021/10/04 16:13:06 [ok] 1/2 created index · xs.db
2021/10/04 16:13:06 [ok] 2/2 created index · xs.db

$ sqlite3 xs.db 'select * from map'
10.1001/10-v4n2-hsf10003|10.1177/003335490912400218
10.1001/10-v4n2-hsf10003|10.1097/01.bcr.0000155527.76205.a2
10.1001/amaguidesnewsletters.1996.novdec01|10.1056/nejm199312303292707
10.1001/amaguidesnewsletters.1996.novdec01|10.1016/s0363-5023(05)80265-5
10.1001/amaguidesnewsletters.1996.novdec01|10.1001/jama.1994.03510440069036
10.1001/amaguidesnewsletters.1997.julaug01|10.1097/00007632-199612150-00003
10.1001/amaguidesnewsletters.1997.mayjun01|10.1164/ajrccm/147.4.1056
10.1001/amaguidesnewsletters.1997.mayjun01|10.1136/thx.38.10.760
10.1001/amaguidesnewsletters.1997.mayjun01|10.1056/nejm199507133330207
10.1001/amaguidesnewsletters.1997.mayjun01|10.1378/chest.88.3.376

$ sqlite3 xs.db 'select * from map where k = "10.1001/amaguidesnewsletters.1997.mayjun01" '
10.1001/amaguidesnewsletters.1997.mayjun01|10.1164/ajrccm/147.4.1056
10.1001/amaguidesnewsletters.1997.mayjun01|10.1136/thx.38.10.760
10.1001/amaguidesnewsletters.1997.mayjun01|10.1056/nejm199507133330207
10.1001/amaguidesnewsletters.1997.mayjun01|10.1378/chest.88.3.376
Motivation

SQLite is likely used more than all other database engines combined. Billions and billions of copies of SQLite exist in the wild. -- https://www.sqlite.org/mostdeployed.html

Sometimes, programs need lookup tables to map values between two domains. A dictionary is a perfect data structure as long as the data fits in memory. For larger sets (hundreds of millions of entries), a dictionary may not work.

The makta tool currently takes a two-column TSV and turns it into an sqlite3 database, which you can query in your program. Depending on a couple of factors, you may be able to query the lookup database with about 1-50K queries per second.

Finally, sqlite3 is just an awesome database and a recommended storage format.

Usage
$ makta -h
Usage of makta:
  -B int
        buffer size (default 67108864)
  -C int
        sqlite3 cache size, needs memory = C x page size (default 1000000)
  -I int
        index mode: 0=none, 1=k, 2=v, 3=kv (default 3)
  -o string
        output filename (default "data.db")
  -version
        show version and exit
Performance
$ wc -l fixtures/sample-10m.tsv
10000000 fixtures/sample-10m.tsv

$ stat --format "%s" fixtures/sample-10m.tsv
548384897

$ time makta < fixtures/sample-10m.tsv
2021/09/30 16:58:07 [ok] initialized database -- data.db
2021/09/30 16:58:17 [io] written 523M · 56.6M/s
2021/09/30 16:58:21 [ok] 1/2 created index -- data.db
2021/09/30 16:58:34 [ok] 2/2 created index -- data.db

real    0m26.267s
user    0m24.122s
sys     0m3.224s
  • 10M rows stored, with indexed keys and values in 27s, 370370 rows/s
TODO
  • allow tab-importing to be done programmatically, for any number of columns
  • a better name: mktabdb, mktabs, dbize - go with makta for now
  • could write a tool for burst queries, e.g. split data into N shards, create N databases and distribute queries across files - e.g. dbize db.json with the same repl, etc. -- if we've seen 300K inserts per db, we may see 0.X x CPU x 300K, maybe millions/s.

As a shortcut, we could add commands that turn an index into a "sqlite3" database in one command, e.g.

$ mkindexdb -server ... -key-field id -o index.db
Design ideas

A design that works with 50M rows per database, e.g. 20 files for 1B rows; grouped under a single directory. Every interaction only involves the directory, not the individual files.


Clip art from ClipArt ETC.

Documentation

Constants

This section is empty.

Variables

var (
	// ErrBlobNotFound can be used for unfetchable blobs.
	ErrBlobNotFound   = errors.New("blob not found")
	ErrBackendsFailed = errors.New("all backends failed")
)

Functions

func OpenDatabase

func OpenDatabase(filename string) (*sqlx.DB, error)

OpenDatabase first ensures the file actually exists, then creates a read-only connection.

func SliceContains

func SliceContains(ss []string, v string) bool

SliceContains returns true, if a string slice contains a given value.

Types

type Entry

type Entry struct {
	T       time.Time
	Message string
}

Entry is a stopwatch entry.

type ErrorMessage

type ErrorMessage struct {
	Status int   `json:"status,omitempty"`
	Err    error `json:"err,omitempty"`
}

ErrorMessage from failed requests.

type FetchGroup

type FetchGroup struct {
	Backends []Fetcher
}

FetchGroup runs an index data fetch operation in a cascade over several backends. The result from the first database that contains a value for a given id is returned. Currently sequential, but could be made parallel, maybe.

func (*FetchGroup) Fetch

func (g *FetchGroup) Fetch(id string) ([]byte, error)

Fetch returns the blob for the given id from the first backend that contains a value for it.

func (*FetchGroup) FromFiles

func (g *FetchGroup) FromFiles(files ...string) error

FromFiles sets up a fetch group from a list of sqlite3 database filenames.

func (*FetchGroup) Ping

func (g *FetchGroup) Ping() error

Ping is a healthcheck.

type Fetcher

type Fetcher interface {
	Fetch(id string) ([]byte, error)
}

Fetcher fetches one or more blobs given their identifiers.

type Map

type Map struct {
	Key   string `db:"k"`
	Value string `db:"v"`
}

Map is a generic lookup table. We use it together with sqlite3. This corresponds to the format generated by the makta command line tool: https://github.com/miku/labe/tree/main/go/ckit#makta.

type Pinger

type Pinger interface {
	Ping() error
}

Pinger performs a simple health check.

type Response

type Response struct {
	ID        string            `json:"id,omitempty"`
	DOI       string            `json:"doi,omitempty"`
	Citing    []json.RawMessage `json:"citing,omitempty"`
	Cited     []json.RawMessage `json:"cited,omitempty"`
	Unmatched struct {
		Citing []json.RawMessage `json:"citing,omitempty"`
		Cited  []json.RawMessage `json:"cited,omitempty"`
	} `json:"unmatched,omitempty"`
	Extra struct {
		UnmatchedCitingCount int     `json:"unmatched_citing_count"`
		UnmatchedCitedCount  int     `json:"unmatched_cited_count"`
		CitingCount          int     `json:"citing_count"`
		CitedCount           int     `json:"cited_count"`
		Cached               bool    `json:"cached"`
		Took                 float64 `json:"took"` // seconds
		// Institution is set optionally (e.g. to "DE-14"), if the response has
		// been tailored towards the holdings of a given institution.
		Institution string `json:"institution,omitempty"`
	} `json:"extra,omitempty"`
}

Response contains a subset of index data fused with citation data. Citing and cited documents are kept unparsed for flexibility and performance; we expect JSON. For unmatched docs, we may only transmit the DOI, e.g. {"doi_str_mv": "10.12/34"}.

type Server

type Server struct {
	// IdentifierDatabase maps local ids to DOI. The expected schema is
	// documented here: https://github.com/miku/labe/tree/main/go/ckit#makta
	//
	// 0-025152688     10.1007/978-3-476-03951-4
	// 0-025351737     10.13109/9783666551536
	// 0-024312134     10.1007/978-1-4612-1116-7
	// 0-025217100     10.1007/978-3-322-96667-4
	// ...
	IdentifierDatabase *sqlx.DB
	// OciDatabase contains DOI to DOI mappings representing a citation
	// relationship. The expected schema is documented here:
	// https://github.com/miku/labe/tree/main/go/ckit#makta
	//
	// 10.1002/9781119393351.ch1       10.1109/icelmach.2012.6350005
	// 10.1002/9781119393351.ch1       10.1115/detc2011-48151
	// 10.1002/9781119393351.ch1       10.1109/ical.2009.5262972
	// 10.1002/9781119393351.ch1       10.1109/cdc.2013.6760196
	// ...
	OciDatabase *sqlx.DB
	// IndexData fetches a metadata blob for an identifier. This is
	// an interface that in the past has been implemented by types wrapping
	// microblob, SOLR and sqlite3, as well as a FetchGroup, that allows to
	// query multiple backends. We settled on sqlite3 and FetchGroup, the other
	// implementations are now gone.
	//
	// dswarm-126-ZnR0aG9zdHdlc3RsaX...   {"id":"dswarm-126-ZnR0aG9zdHdlc3RsaXBwZ...
	// dswarm-126-ZnR0aG9zdHdlc3RsaX...   {"id":"dswarm-126-ZnR0aG9zdHdlc3RsaXBwZ...
	// dswarm-126-ZnR0dW11ZW5jaGVuOm...   {"id":"dswarm-126-ZnR0dW11ZW5jaGVuOm9ha...
	// dswarm-126-ZnR0dW11ZW5jaGVuOm...   {"id":"dswarm-126-ZnR0dW11ZW5jaGVuOm9ha...
	// ...
	IndexData Fetcher
	// Router to register routes on.
	Router *mux.Router
	// StopWatchEnabled enables the stopwatch, a builtin, simplistic request tracer.
	StopWatchEnabled bool
	// Cache for expensive items.
	Cache *cache.Cache
	// CacheTriggerDuration determines which items to cache.
	CacheTriggerDuration time.Duration
	// Stats, like request counts and status codes.
	Stats *stats.Stats
}

Server wraps three data sources required for index and citation data fusion. The IdentifierDatabase maps a local identifier (e.g. 0-1238201) to a DOI, the OciDatabase contains citing and cited relationships from the OCI/COCI citation corpus, and IndexData fetches a metadata blob from a backing store.

A performance data point: On an 8-core machine with 16GB RAM, the server sustains a load of about 12K SQL qps and 150MB/s reads off disk. Total size of databases involved is about 220GB plus 10GB cache (i.e. at most 6% of the data could be held in memory at any given time).

Requesting the most costly (and large) 150K docs under load, the server will hover at around 10% (of 16GB) RAM.

func (*Server) Ping

func (s *Server) Ping() error

Ping returns an error, if any of the datastores is not available.

func (*Server) Routes

func (s *Server) Routes()

Routes sets up routes.

func (*Server) ServeHTTP

func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request)

ServeHTTP turns the server into an HTTP handler.

type Snippet

type Snippet struct {
	Institutions []string `json:"institution"`
}

Snippet is a small piece of index metadata used for institution filtering.

type SqliteFetcher

type SqliteFetcher struct {
	DB *sqlx.DB
}

SqliteFetcher serves index documents from an sqlite3 database with a fixed schema, as generated by the makta tool.

func (*SqliteFetcher) Fetch

func (b *SqliteFetcher) Fetch(id string) (p []byte, err error)

Fetch document.

func (*SqliteFetcher) Ping

func (b *SqliteFetcher) Ping() error

Ping pings the database.

type StopWatch

type StopWatch struct {
	sync.Mutex
	// contains filtered or unexported fields
}

StopWatch records events over time and renders them in a pretty table; it is thread-safe. Example log output (via stopwatch.LogTable()):

2021/09/29 17:22:40 timings for XVlB

> XVlB    0    0s              0.00    started query for: ai-49-aHR0cDovL2R4LmRvaS5vcmcvMTAuMTIxMC9qYy4yMDExLTAzODU
> XVlB    1    134.532µs       0.00    found doi for id: 10.1210/jc.2011-0385
> XVlB    2    67.918529ms     0.24    found 0 outbound and 4628 inbound edges
> XVlB    3    32.293723ms     0.12    mapped 4628 dois back to ids
> XVlB    4    3.358704ms      0.01    recorded unmatched ids
> XVlB    5    68.636671ms     0.25    fetched 2567 blob from index data store
> XVlB    6    105.771005ms    0.38    encoded JSON
> XVlB    -    -               -       -
> XVlB    S    278.113164ms    1.00    total

func (*StopWatch) Elapsed

func (s *StopWatch) Elapsed() time.Duration

Elapsed returns the total elapsed time.

func (*StopWatch) Entries

func (s *StopWatch) Entries() []*Entry

Entries returns the accumulated messages for this stopwatch.

func (*StopWatch) LogTable

func (s *StopWatch) LogTable()

LogTable writes a table using standard library log facilities.

func (*StopWatch) Record

func (s *StopWatch) Record(msg string)

Record records a message.

func (*StopWatch) Recordf

func (s *StopWatch) Recordf(msg string, vs ...interface{})

Recordf records a formatted message.

func (*StopWatch) Reset

func (s *StopWatch) Reset()

Reset resets the stopwatch.

func (*StopWatch) SetEnabled

func (s *StopWatch) SetEnabled(enabled bool)

SetEnabled enables or disables the stopwatch. If disabled, any call will be a noop.

func (*StopWatch) Table

func (s *StopWatch) Table() string

Table formats the timings as a table.

Directories

Path Synopsis
Package cache implements caching helpers, e.g.
cmd
doisniffer
Sniff out DOI from JSON document, optionally update docs with found DOI.
makta
makta takes a two column TSV file and turns it into an indexed sqlite3 database.
Package doi helps to find DOI in JSON documents.
Package xflag add an additional flag type Array for repeated string flags.
