This program takes paths to .tar.bz2 batches of OCR files from the
Chronicling America bulk data downloads. It converts
each batch into a CSV file, which you can load into a database or do whatever
you like with. It will process the batches concurrently.

This utility converts Chronicling America OCR batches into CSVs of the OCR
text. It takes as its arguments paths to Chronicling America OCR batches,
which are stored as .tar.bz2 files. Each archive contains directories of text
files (which we care about) and XML files (which we don't). The path to each
text file forms (with modification) an ID for that page on Chronicling
America. This utility reads in each batch, extracts the page text, and writes
each batch as a CSV file with columns for the batch ID, page ID, and text.
It processes the batches in parallel.
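
The workflow described above can be sketched roughly as follows. This is an
illustrative Python sketch, not this utility's actual code. The function
names, the CSV header names, the rule for deriving the batch ID from the
archive file name, and the rule for deriving the page ID from the directory
part of each text file's path are all assumptions made for the example; the
real modification applied to the path is not spelled out here. A thread pool
is used for simplicity, and a process pool could be swapped in.

```python
import csv
import posixpath
import tarfile
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat
from pathlib import Path

ARCHIVE_SUFFIX = ".tar.bz2"


def batch_id_from_path(batch_path):
    """Assumed rule: the batch ID is the archive file name minus .tar.bz2."""
    name = Path(batch_path).name
    if name.endswith(ARCHIVE_SUFFIX):
        name = name[: -len(ARCHIVE_SUFFIX)]
    return name


def page_id_from_member(member_name):
    """Assumed rule: the page ID is the directory holding the text file."""
    return posixpath.dirname(member_name)


def batch_to_csv(batch_path, out_dir):
    """Write one CSV (batch_id, page_id, text) for a single batch archive."""
    batch_id = batch_id_from_path(batch_path)
    out_path = Path(out_dir) / (batch_id + ".csv")
    with tarfile.open(batch_path, "r:bz2") as tar, \
            open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["batch_id", "page_id", "text"])
        for member in tar:
            # Keep only the plain-text OCR files; skip XML and directories.
            if not (member.isfile() and member.name.endswith(".txt")):
                continue
            raw = tar.extractfile(member).read()
            text = raw.decode("utf-8", errors="replace")
            writer.writerow([batch_id, page_id_from_member(member.name), text])
    return out_path


def convert_batches(batch_paths, out_dir, max_workers=None):
    """Convert several batches at once, one worker task per archive."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(batch_to_csv, batch_paths, repeat(out_dir)))
```

Calling convert_batches(["batch_a.tar.bz2", "batch_b.tar.bz2"], "out") with
two hypothetical archive names would write out/batch_a.csv and
out/batch_b.csv, provided the out directory already exists.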