scholkit module v0.2.0
Published: Nov 18, 2024 · License: MIT
scholkit

   ,   ,
  /////|
 ///// |
|~~~|  |
|===|  |
|j  |  |
| g |  |
|  s| /
|===|/
'---'

Sketch project, assorted utilities around scholarly metadata.

status: unstable, wip

Try

$ git clone https://github.com/miku/scholkit.git
$ cd scholkit
$ make

This builds a couple of executables:

  • catshape (data format conversions)
  • urlstream (stream data from many urls)
  • cdxlookup (ad-hoc cdx api lookup)
  • strnorm (quick string normalization)

An example dataset to work with, e.g. converting an arxiv dump to fatcat releases:

$ curl -sL https://archive.org/download/arxiv-2024-02-15/arxiv-2024-02-15.xml.zst | \
    zstd -dc | \
    catshape -f arxiv
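
Under the hood, any such conversion has to walk a large XML dump record by record. A minimal sketch in Go (not scholkit code; it only counts records, and the "record" element name is an assumption about the dump layout) that reads the decompressed XML from stdin:

package main

import (
	"bufio"
	"encoding/xml"
	"fmt"
	"io"
	"os"
)

func main() {
	// Stream tokens instead of loading the whole dump into memory.
	dec := xml.NewDecoder(bufio.NewReader(os.Stdin))
	var n int
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// Count every opening <record> element (assumed record delimiter).
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "record" {
			n++
		}
	}
	fmt.Println(n)
}

Run it on the decompressed stream, e.g. pipe zstd -dc output into go run on this file.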

Tools

Conversions

We want conversions from various formats into a single target format (e.g. release entities). Source formats include:

  • crossref
  • datacite
  • pubmed
  • arxiv
  • oaiscrape
  • openalex
  • dblp

Target:

  • fatcat entities (release, work, container, file, contrib, abstract)

For each format, try to find the smallest conversion unit, e.g. one record. Then add convenience layers on top, e.g. for streams.
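
A minimal sketch of that layering in Go; every name here (ArxivRecord, Release, ConvertArxivRecord, ConvertArxivStream) is a placeholder rather than the actual scholkit API, and the stream layer assumes newline-delimited JSON on both sides:

// Package convert sketches the "smallest conversion unit first" idea.
package convert

import (
	"bufio"
	"encoding/json"
	"io"
)

// Release is a placeholder for a fatcat release entity.
type Release struct {
	Title string `json:"title"`
	DOI   string `json:"doi,omitempty"`
}

// ArxivRecord is a placeholder for one parsed arxiv record.
type ArxivRecord struct {
	Title string `json:"title"`
	DOI   string `json:"doi"`
}

// ConvertArxivRecord is the smallest unit: one record in, one entity out.
func ConvertArxivRecord(r ArxivRecord) Release {
	return Release{Title: r.Title, DOI: r.DOI}
}

// ConvertArxivStream is a convenience layer on top of the per-record function:
// it reads newline-delimited JSON records and writes newline-delimited entities.
func ConvertArxivStream(r io.Reader, w io.Writer) error {
	scanner := bufio.NewScanner(r)
	enc := json.NewEncoder(w)
	for scanner.Scan() {
		var rec ArxivRecord
		if err := json.Unmarshal(scanner.Bytes(), &rec); err != nil {
			return err
		}
		if err := enc.Encode(ConvertArxivRecord(rec)); err != nil {
			return err
		}
	}
	return scanner.Err()
}

Keeping the per-record function pure makes it easy to test in isolation and to parallelize the stream layer later.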

No bulk conversion should take longer than roughly one hour (the slowest currently is openalex, with 250M records, which takes about 45 minutes).

Clustering

Create a "works" view from releases.
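
One plausible minimal approach, purely as an illustration and not necessarily what clowder implements: bucket releases that share a normalized title key into one work.

// Package cluster sketches a crude "works" grouping over releases.
package cluster

import "strings"

type Release struct {
	Ident string
	Title string
}

type Work struct {
	Key      string
	Releases []Release
}

// key lowercases a title and collapses whitespace into a crude clustering key.
func key(title string) string {
	return strings.Join(strings.Fields(strings.ToLower(title)), " ")
}

// GroupByTitle buckets releases with the same normalized title into one work.
func GroupByTitle(releases []Release) []Work {
	byKey := make(map[string]*Work)
	var order []string
	for _, r := range releases {
		k := key(r.Title)
		w, ok := byKey[k]
		if !ok {
			w = &Work{Key: k}
			byKey[k] = w
			order = append(order, k)
		}
		w.Releases = append(w.Releases, r)
	}
	works := make([]Work, 0, len(order))
	for _, k := range order {
		works = append(works, *byKey[k])
	}
	return works
}

A real clusterer would also consider identifiers (DOI, arxiv id) and fuzzier title matching; exact normalized-title equality is only the simplest starting point.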

Misc

The urlstream utility streams content from multiple URLs to stdout. It can help create single-file versions of larger datasets like pubmed, openalex, etc.

$ curl -s "https://www.gutenberg.org/browse/scores/top" | \
    grep -Eo "/ebooks/[0-9]+" | \
    awk '{print "https://gutenberg.org"$0".txt.utf-8"}' > top100.txt

$ urlstream < top100.txt > top100books.txt
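
In essence, such a tool fetches each URL in turn and copies the response body to stdout. A minimal sketch of that core loop (not the actual urlstream implementation; the real tool also handles compressed files, which this sketch skips):

// Read URLs from stdin, fetch each one, and concatenate the bodies on stdout.
package main

import (
	"bufio"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		url := scanner.Text()
		if url == "" {
			continue
		}
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("fetch %s: %v", url, err)
			continue
		}
		// Stream the body straight to stdout without buffering it in memory.
		if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
			log.Printf("copy %s: %v", url, err)
		}
		resp.Body.Close()
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}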

Notes

TODO

  • map basic fields to fatcat release entities
  • map all fields to fatcat release entities
  • basic clustering algorithm

Directories

Path Synopsis
cmd
  cdxlookup   Lookup CDX records at the Internet Archive.
  clowder     clowder is a release entity clusterer.
  mdconv      CLI to convert various metadata formats, mostly to fatcat entities.
  urlstream   urlstream takes one or more links to (compressed) files and will stream their content to stdout.
notes
parallel      Package parallel implements helpers for fast processing of line oriented inputs.
record
scan          Package scan accepts a bufio.SplitFunc and generalizes batches to non-line oriented input, e.g.
schema
xmlstream     Package xmlstream implements a lightweight XML scanner on top of encoding/xml.
