scholkit

v0.2.1
Published: Feb 21, 2025 | License: MIT | Imports: 0 | Imported by: 0

README

scholkit

   ,   ,
  /////|
 ///// |
|~~~|  |
|===|  |
|j  |  |
| g |  |
|  s| /
|===|/
'---'

Scratch project, assorted utilities around scholarly metadata formats and tasks.

status: wip, api and cli not stable yet

Try

$ git clone https://github.com/miku/scholkit.git
$ cd scholkit
$ make

This builds a couple of executables:

  • sk-convert (data format conversions)
  • sk-cat (stream data from many urls)
  • sk-cdx (ad-hoc cdx api lookup)
  • sk-norm (quick string normalization)

An example dataset to work with, e.g. to convert arxiv metadata to fatcat release entities:

$ curl -sL https://archive.org/download/arxiv-2024-02-15/arxiv-2024-02-15.xml.zst | \
    zstd -dc | \
    sk-convert -f arxiv

Tools

Conversions

We want conversions from various formats to one single format (e.g. release entities). Source formats include:

  • crossref
  • datacite
  • pubmed
  • arxiv
  • oaiscrape
  • openalex
  • dblp

Target:

  • fatcat entities (release, work, container, file, contrib, abstract)

For each format, try to find the smallest conversion unit, e.g. one record. Then add convenience layers on top, e.g. for streams.

No bulk conversion should take longer than roughly 1 hour (the slowest currently is openalex, with 250M records, which takes about 45 min).

Clustering

Create a "works" view from releases.
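One simple way to derive such a view is to group releases by a normalized title key. The sketch below only illustrates the idea and is not the algorithm used by sk-cluster; the `Release` type and the `normalize` and `cluster` functions are hypothetical, and a real clusterer would likely also compare authors, years and identifiers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Release is a hypothetical, minimal release with an id and a title.
type Release struct {
	ID    string
	Title string
}

// normalize lowercases s and keeps only letters and digits.
func normalize(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// cluster groups release ids by normalized title; each map entry stands for
// one candidate "work".
func cluster(releases []Release) map[string][]string {
	works := make(map[string][]string)
	for _, r := range releases {
		k := normalize(r.Title)
		works[k] = append(works[k], r.ID)
	}
	return works
}

func main() {
	works := cluster([]Release{
		{ID: "r1", Title: "Deep Learning"},
		{ID: "r2", Title: "deep  learning."},
		{ID: "r3", Title: "Shallow Learning"},
	})
	// Sort keys for a stable output order.
	keys := make([]string, 0, len(works))
	for k := range works {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, works[k])
	}
}
```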

Misc

The sk-cat utility streams content from multiple URLs to stdout. It can help to create single-file versions of larger datasets like pubmed, openalex, etc.

$ curl -s "https://www.gutenberg.org/browse/scores/top" | \
    grep -Eo "/ebooks/[0-9]+" | \
    awk '{print "https://gutenberg.org"$0".txt.utf-8"}' > top100.txt

$ sk-cat < top100.txt > top100books.txt

Notes

TODO

  • implement schema conversions and tests
  • add layer for daily harvests and capturing data on disk
  • cli to interact with the current files on disk
  • cli for basic stats
  • some simplistic index/query structure, e.g. to quickly find a record by id or the like

More:

  • map basic fields to fatcat release entities
  • map all fields to fatcat release entities
  • basic clustering algorithm

Documentation

Index

Constants

This section is empty.

Variables

var Version = "0.2.0"

Functions

This section is empty.

Types

This section is empty.

Directories

Path Synopsis
To write to files in a robust way we should:
cmd
sk-cat
sk-cat takes one or more links to (compressed) files and will stream their content to stdout.
sk-cluster
sk-cluster is a release entity clusterer.
sk-convert
CLI to convert various metadata formats, mostly to fatcat entities.
sk-feed
sk-feed retrieves various upstream data sources.
sk-norm
TODO: string normalization cli tool
sk-oai-dctojsonl
sk-oai-dctojsonl converts a stream of XML records, where each record is separated by a record separator "1E".
sk-oai-records
sk-oai-records was used as a first step to go from concatenated metha OAI XML file to more valid XML.
Package dateutil provides interval handling.
notes
Package parallel implements helpers for fast processing of line oriented inputs.
record
Package scan accepts a bufio.SplitFunc and generalizes batches to non-line oriented input, e.g.
schema
Package xflag adds an additional flag type Array for repeated string flags.
Package xmlstream implements a lightweight XML scanner on top of encoding/xml.
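
The sk-oai-dctojsonl synopsis above mentions records separated by "1E", and the scan package is built around bufio.SplitFunc. Assuming "1E" refers to the ASCII record separator byte 0x1E, splitting such a stream with the standard library could look like the sketch below. The names `splitRS` and `records` are hypothetical and not part of scholkit.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// rs is the ASCII record separator (assumed meaning of "1E").
const rs = 0x1E

// splitRS is a bufio.SplitFunc that yields tokens delimited by rs.
func splitRS(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexByte(data, rs); i >= 0 {
		// Consume the separator, return the bytes before it.
		return i + 1, data[:i], nil
	}
	if atEOF {
		// Final record without a trailing separator.
		return len(data), data, nil
	}
	return 0, nil, nil // request more data
}

// records returns all non-empty, whitespace-trimmed records found in s.
func records(s string) ([]string, error) {
	sc := bufio.NewScanner(strings.NewReader(s))
	sc.Split(splitRS)
	var out []string
	for sc.Scan() {
		if t := strings.TrimSpace(sc.Text()); t != "" {
			out = append(out, t)
		}
	}
	return out, sc.Err()
}

func main() {
	recs, err := records("<record>a</record>\x1e<record>b</record>\x1e")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range recs {
		fmt.Println(r)
	}
}
```

Each token could then be passed to an XML decoder and serialized as one JSON line. For records larger than the default 64 KiB scanner limit, the buffer would need to be enlarged with `Scanner.Buffer`.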
