Documentation
Overview
sk-cat takes one or more links to (compressed) files and streams their content to stdout. It uses curl and external compression programs; nothing bash and curl could not do, but it is a bit shorter to type.
Datasets often come split across many files, and for ad-hoc data inspection a single file can be more convenient.
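For comparison, here is a rough plain bash and curl equivalent of the first example below (a minimal sketch, assuming every input is gzip-compressed; sk-cat wraps this up and calls the external decompressor for you):

# hypothetical bash+curl equivalent of `cat urls.txt | sk-cat`, gzip inputs assumed
$ while read -r url; do curl -sL "$url" | gunzip -c; done < urls.txt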
$ cat urls.txt
https://archive.org/download/openalex_snapshot_2023-07-11/data/works/updated_date=2023-05-09/part_000.gz
https://archive.org/download/openalex_snapshot_2023-07-11/data/works/updated_date=2023-05-01/part_000.gz
https://archive.org/download/openalex_snapshot_2023-07-11/data/works/updated_date=2023-04-16/part_000.gz
...
$ cat urls.txt | sk-cat | zstd -c > data.zst
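The combined file can then be inspected as a single stream, e.g.:

$ zstd -dc data.zst | head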
Another example: turn the PubMed baseline into a single zst file (the concatenated result is most likely not valid XML):
$ curl -sL "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/" | \
    pup 'a[href] text{}' | \
    grep -o 'pubmed.*[.]xml[.]gz' | \
    awk '{print "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"$0}' | \
    sk-cat -v | \
    zstd -c > pubmed.xml.zst
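If pup is not installed, the link extraction can be approximated with grep alone (a sketch assuming the baseline filenames match the usual pubmed*.xml.gz pattern; sort -u drops the duplicate matches that appear in the raw HTML):

$ curl -sL "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/" | \
    grep -oE 'pubmed[0-9a-z]+\.xml\.gz' | \
    sort -u | \
    awk '{print "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"$0}' | \
    sk-cat -v | \
    zstd -c > pubmed.xml.zst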
Other datasets that come scattered across many files include Wikipedia and OpenAlex, among others.