chronam-ocr-debatcher

command module

v0.0.1 | Published: Jan 25, 2019 | License: MIT | Imports: 10 | Imported by: 0

Repository

github.com/lmullen/chronam-ocr-debatcher

README

Chronicling America OCR debatcher

This program takes paths to .tar.bz2 batches of OCR files from the Chronicling America bulk data downloads and converts each batch into a CSV file, which you can load into a database or process however you like. Batches are processed concurrently.

Usage:

./chronam-ocr-debatcher [--processes=8] <path/to/a/batch.tar.bz2 ...>

You can download binaries from the releases page.

Documentation

Overview

This utility converts Chronicling America OCR batches into CSVs of the OCR text. Its arguments are paths to Chronicling America OCR batches, which are stored as .tar.bz2 files. Each archive contains directories of text files (which we care about) and XML files (which we don't). The path to each file, with some modification, serves as the ID for that page on Chronicling America. The utility reads each batch, extracts the page text, and writes one CSV file per batch with columns for the batch ID, page ID, and text. Batches are processed in parallel.
