sk-cat

command
v0.2.1
Published: Feb 21, 2025 License: MIT Imports: 12 Imported by: 0

Documentation

Overview

sk-cat takes one or more links to (compressed) files and streams their content to stdout. It uses curl and external compression programs under the hood. Nothing bash and curl could not do, but a bit shorter to type.
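The "bash and curl" equivalent can be sketched as a short loop (a sketch only; unlike sk-cat, it assumes every link points to a gzip-compressed file and does not detect the compression format per URL):

```shell
# Stream the decompressed content of each URL listed in urls.txt to stdout.
while read -r url; do
    curl -sL "$url" | gunzip -c
done < urls.txt
```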

Often, datasets come split over many files; for ad-hoc data inspection, a single file can be more convenient.

$ cat urls.txt
https://archive.org/download/openalex_snapshot_2023-07-11/data/works/updated_date=2023-05-09/part_000.gz
https://archive.org/download/openalex_snapshot_2023-07-11/data/works/updated_date=2023-05-01/part_000.gz
https://archive.org/download/openalex_snapshot_2023-07-11/data/works/updated_date=2023-04-16/part_000.gz
...

$ cat urls.txt | sk-cat | zstd -c > data.zst

Another example:

Turn PubMed baseline into a single zst file (concatenated, most likely invalid XML):

	$ curl -sL "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/" | \
	    pup 'a[href] text{}' | \
	    grep -o 'pubmed.*[.]xml[.]gz' | \
	    awk '{print "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"$0}' | \
	    sk-cat -v | \
	    zstd -c > pubmed.xml.zst
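If pup is not installed, the link-extraction step can be approximated with grep alone (a sketch; it assumes the .xml.gz filenames appear verbatim in the listing HTML):

```shell
# Pull the .xml.gz filenames straight out of the HTML listing,
# deduplicate them, and prefix the base URL before streaming.
curl -sL "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/" |
    grep -oE 'pubmed[^"<>]*[.]xml[.]gz' | sort -u |
    awk '{print "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"$0}' |
    sk-cat -v |
    zstd -c > pubmed.xml.zst
```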

Other datasets that come scattered across many files include Wikipedia, OpenAlex, ...
