scholkit module v0.2.0
Published: Nov 18, 2024 · License: MIT
scholkit

   ,   ,
  /////|
 ///// |
|~~~|  |
|===|  |
|j  |  |
| g |  |
|  s| /
|===|/
'---'

Sketch project, assorted utilities around scholarly metadata.

status: unstable, wip

Try

$ git clone https://github.com/miku/scholkit.git
$ cd scholkit
$ make

This builds a couple of executables:

  • catshape (data format conversions)
  • urlstream (stream data from many urls)
  • cdxlookup (ad-hoc cdx api lookup)
  • strnorm (quick string normalization)

An example dataset to work with, e.g. converting an arxiv dump to fatcat releases:

$ curl -sL https://archive.org/download/arxiv-2024-02-15/arxiv-2024-02-15.xml.zst | \
    zstd -dc | \
    catshape -f arxiv
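
Under the hood, any such conversion has to walk a large XML dump record by record. A minimal sketch in Go (not scholkit code; it only counts records, and the "record" element name is an assumption about the dump layout) that reads the decompressed XML from stdin:

package main

import (
	"bufio"
	"encoding/xml"
	"fmt"
	"io"
	"os"
)

func main() {
	// Stream tokens instead of loading the whole dump into memory.
	dec := xml.NewDecoder(bufio.NewReader(os.Stdin))
	var n int
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// Count every opening <record> element (assumed record delimiter).
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "record" {
			n++
		}
	}
	fmt.Println(n)
}

Run it on the decompressed stream, e.g. pipe zstd -dc output into go run on this file.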

Tools

Conversions

We want conversions from various formats into a single target format (e.g. release entities). Source formats include:

  • crossref
  • datacite
  • pubmed
  • arxiv
  • oaiscrape
  • openalex
  • dblp

Target:

  • fatcat entities (release, work, container, file, contrib, abstract)

For each format, try to find the smallest conversion unit, e.g. one record. Then add convenience layers on top, e.g. for streams.
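
A minimal sketch of that layering in Go; every name here (ArxivRecord, Release, ConvertArxivRecord, ConvertArxivStream) is a placeholder rather than the actual scholkit API, and the stream layer assumes newline-delimited JSON on both sides:

// Package convert sketches the "smallest conversion unit first" idea.
package convert

import (
	"bufio"
	"encoding/json"
	"io"
)

// Release is a placeholder for a fatcat release entity.
type Release struct {
	Title string `json:"title"`
	DOI   string `json:"doi,omitempty"`
}

// ArxivRecord is a placeholder for one parsed arxiv record.
type ArxivRecord struct {
	Title string `json:"title"`
	DOI   string `json:"doi"`
}

// ConvertArxivRecord is the smallest unit: one record in, one entity out.
func ConvertArxivRecord(r ArxivRecord) Release {
	return Release{Title: r.Title, DOI: r.DOI}
}

// ConvertArxivStream is a convenience layer on top of the per-record function:
// it reads newline-delimited JSON records and writes newline-delimited entities.
func ConvertArxivStream(r io.Reader, w io.Writer) error {
	scanner := bufio.NewScanner(r)
	enc := json.NewEncoder(w)
	for scanner.Scan() {
		var rec ArxivRecord
		if err := json.Unmarshal(scanner.Bytes(), &rec); err != nil {
			return err
		}
		if err := enc.Encode(ConvertArxivRecord(rec)); err != nil {
			return err
		}
	}
	return scanner.Err()
}

Keeping the per-record function pure makes it easy to test in isolation and to parallelize the stream layer later.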

No bulk conversion should take longer than roughly one hour (the slowest currently is openalex, with 250M records, which takes about 45 minutes).

Clustering

Create a "works" view from releases.
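
One plausible minimal approach, purely as an illustration and not necessarily what clowder implements: bucket releases that share a normalized title key into one work.

// Package cluster sketches a crude "works" grouping over releases.
package cluster

import "strings"

type Release struct {
	Ident string
	Title string
}

type Work struct {
	Key      string
	Releases []Release
}

// key lowercases a title and collapses whitespace into a crude clustering key.
func key(title string) string {
	return strings.Join(strings.Fields(strings.ToLower(title)), " ")
}

// GroupByTitle buckets releases with the same normalized title into one work.
func GroupByTitle(releases []Release) []Work {
	byKey := make(map[string]*Work)
	var order []string
	for _, r := range releases {
		k := key(r.Title)
		w, ok := byKey[k]
		if !ok {
			w = &Work{Key: k}
			byKey[k] = w
			order = append(order, k)
		}
		w.Releases = append(w.Releases, r)
	}
	works := make([]Work, 0, len(order))
	for _, k := range order {
		works = append(works, *byKey[k])
	}
	return works
}

A real clusterer would also consider identifiers (DOI, arxiv id) and fuzzier title matching; exact normalized-title equality is only the simplest starting point.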

Misc

The urlstream utility streams content from multiple URLs to stdout. It can help create single-file versions of larger datasets like pubmed, openalex, etc.

$ curl -s "https://www.gutenberg.org/browse/scores/top" | \
    grep -Eo "/ebooks/[0-9]+" | \
    awk '{print "https://gutenberg.org"$0".txt.utf-8"}' > top100.txt

$ urlstream < top100.txt > top100books.txt
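
In essence, such a tool fetches each URL in turn and copies the response body to stdout. A minimal sketch of that core loop (not the actual urlstream implementation; the real tool also handles compressed files, which this sketch skips):

// Read URLs from stdin, fetch each one, and concatenate the bodies on stdout.
package main

import (
	"bufio"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		url := scanner.Text()
		if url == "" {
			continue
		}
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("fetch %s: %v", url, err)
			continue
		}
		// Stream the body straight to stdout without buffering it in memory.
		if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
			log.Printf("copy %s: %v", url, err)
		}
		resp.Body.Close()
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}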

Notes

TODO

  • map basic fields to fatcat release entities
  • map all fields to fatcat release entities
  • basic clustering algorithm

Directories

Path Synopsis
cmd
  cdxlookup   Lookup CDX records at the Internet Archive.
  clowder     clowder is a release entity clusterer.
  mdconv      CLI to convert various metadata formats, mostly to fatcat entities.
  urlstream   urlstream takes one or more links to (compressed) files and will stream their content to stdout.
notes
parallel      Package parallel implements helpers for fast processing of line oriented inputs.
record
scan          Package scan accepts a bufio.SplitFunc and generalizes batches to non-line oriented input, e.g.
schema
xmlstream     Package xmlstream implements a lightweight XML scanner on top of encoding/xml.
