bhlindex

command module
v1.0.0-RC1
Published: Oct 30, 2022 License: MIT Imports: 4 Imported by: 0

Biodiversity Heritage Library Scientific Names Index (BHLindex)

Creates an index of scientific names occurring in the literature collection of the Biodiversity Heritage Library.

Performance

This application can traverse the entire digitized corpus of the Biodiversity Heritage Library in a matter of hours. On a modern high-end laptop, we observed the following results:

  • name-finding in 275,000 volumes, 60 million pages: 2.5 hours.
  • name-verification of 23 million unique name-strings: 3 hours.
  • preparing a CSV file with 250 million name occurrence/verification records: 40 minutes.
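
To read these figures as average throughput, a small helper can convert a total count and a duration into items per second. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and the totals and timings above are approximate observations, not guarantees.

```go
package main

// perSecond converts a total item count and an elapsed time in hours
// into an average rate in items per second.
func perSecond(items, hours float64) float64 {
	return items / (hours * 3600)
}
```

Called from your own code, perSecond(60e6, 2.5) gives roughly 6,700 pages per second for name-finding, perSecond(23e6, 3) roughly 2,100 name-strings per second for verification, and perSecond(250e6, 40.0/60) roughly 104,000 records per second for the CSV dump.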

Installation on Linux

The BHL corpus of OCRed text is available as a compressed file larger than 50GB.

Configuration

When you run the app for the first time, it creates a configuration file and reports where the file is located (usually $HOME/.config/bhlnames.yaml).

Edit the file to provide credentials for the PostgreSQL database.

Change the Jobs setting according to the amount of memory and the number of CPUs. With 32GB of memory, Jobs: 7 works well. This parameter sets the number of concurrent name-finding jobs.

Set the BHLdir parameter to point to the root directory where the BHL texts are located (several hundred gigabytes of text).

Other parameters are optional.

Environment Variables

It is possible to use environment variables instead of the configuration file. Environment variables override the configuration file settings. The following variables can be used:

Config          Environment variable
BHLdir          BHLI_BHL_DIR
OutputFormat    BHLI_OUTPUT_FORMAT
PgHost          BHLI_PG_HOST
PgPort          BHLI_PG_PORT
PgUser          BHLI_PG_USER
PgPass          BHLI_PG_PASS
PgDatabase      BHLI_PG_DATABASE
Jobs            BHLI_JOBS
VerifierURL     BHLI_VERIFIER_URL
WithWebLogs     BHLI_WITH_WEB_LOGS
WithoutConfirm  BHLI_WITHOUT_CONFIRM
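
The precedence rule (an environment variable wins over the configuration file) can be sketched as follows. This is an illustration only, not how bhlindex itself loads its settings: the Config struct, the applyEnv function, and the rule that empty or non-numeric values are ignored are all assumptions made for this example. Only three of the settings from the table are shown.

```go
package main

import "strconv"

// Config holds a subset of the settings from the table above.
// The struct itself is hypothetical; only the names come from the table.
type Config struct {
	BHLdir string
	PgHost string
	Jobs   int
}

// applyEnv returns a copy of cfg in which every setting that has a usable
// environment value replaces the value read from the configuration file.
// lookup has the same signature as os.LookupEnv, so os.LookupEnv can be
// passed directly.
func applyEnv(cfg Config, lookup func(string) (string, bool)) Config {
	if v, ok := lookup("BHLI_BHL_DIR"); ok && v != "" {
		cfg.BHLdir = v
	}
	if v, ok := lookup("BHLI_PG_HOST"); ok && v != "" {
		cfg.PgHost = v
	}
	if v, ok := lookup("BHLI_JOBS"); ok {
		// Assumption of this sketch: values that are not positive
		// integers are ignored and the file setting is kept.
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			cfg.Jobs = n
		}
	}
	return cfg
}
```

In a real program the call would look like cfg = applyEnv(cfg, os.LookupEnv). Passing the lookup function as a parameter keeps the override logic easy to check without touching the process environment.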

Usage

Preparations

Log in to the PostgreSQL server and create a database with the same name as the PgDatabase parameter in the configuration file (the default name is bhlindex).

This database stores the found names. Its final size upon completion should be around 50GB.
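
Before starting a long run, it can help to confirm that the credentials and the database name actually work. One way is to assemble a standard PostgreSQL connection URI from the same values used for PgHost, PgPort, PgUser, PgPass and PgDatabase and open it with psql. The helper below is a hypothetical sketch, not part of bhlindex.

```go
package main

import (
	"net"
	"net/url"
)

// pgURL builds a PostgreSQL connection URI from values that correspond to
// the PgHost, PgPort, PgUser, PgPass and PgDatabase settings.
// Special characters in the user name and password are percent-encoded.
func pgURL(host, port, user, pass, database string) string {
	u := url.URL{
		Scheme: "postgres",
		User:   url.UserPassword(user, pass),
		Host:   net.JoinHostPort(host, port),
		Path:   "/" + database,
	}
	return u.String()
}
```

With placeholder values, pgURL("localhost", "5432", "postgres", "secret", "bhlindex") returns postgres://postgres:secret@localhost:5432/bhlindex, which psql accepts as its connection argument.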

Commands

Get BHLindex version

bhlindex -V

Find names in BHL

bhlindex find
# to avoid confirmation dialog (-y overrides configuration file)
bhlindex find -y

Verify detected names using the GNverifier service

bhlindex verify
# to avoid confirmation dialog (-y overrides configuration file)
bhlindex verify -y

Dump data into tab-separated files

Three files will be created: pages, names, occurrences. Their extension depends on the selected output format (CSV is the default). To filter verified results by data source, look up the sources and their IDs on the [gnverifier sources page].

Uncompressed dump files take more than 30GB of space.

# Dump files to a designated directory.
bhlindex dump -d ~/bhlindex-dump
# or
bhlindex dump --dir ~/bhlindex-dump

# Dump records verified to particular data-sources of `gnverifier`.
# In this case verified names are filtered by `The Catalogue of Life` (ID=1)
# and `The Encyclopedia of Life` (ID=12).
bhlindex dump -d ~/bhlindex-dump -s 1,12
# or
bhlindex dump --dir ~/bhlindex-dump --sources 1,12

# Dump using JSON or TSV formats.
bhlindex dump -f tsv -d ~/bhlindex-dump
bhlindex dump -f json -d ~/bhlindex-dump
# or
bhlindex dump --format tsv --dir ~/bhlindex-dump
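
After a dump, a quick sanity check is to stream one of the CSV or TSV files and count its rows. The sketch below uses Go's standard encoding/csv package. It assumes the first row is a header, which this README does not state. The function names are made up for illustration, and no particular file or column names are assumed.

```go
package main

import (
	"encoding/csv"
	"errors"
	"io"
	"strings"
)

// summarize streams delimited text from r and returns the first row
// (assumed here to be a header) and the number of rows that follow it.
// Pass ',' for CSV dumps and '\t' for TSV dumps.
func summarize(r io.Reader, comma rune) (header []string, rows int, err error) {
	cr := csv.NewReader(r)
	cr.Comma = comma
	header, err = cr.Read()
	if errors.Is(err, io.EOF) {
		return nil, 0, nil // empty input
	}
	if err != nil {
		return nil, 0, err
	}
	for {
		_, err = cr.Read()
		if errors.Is(err, io.EOF) {
			return header, rows, nil
		}
		if err != nil {
			return header, rows, err
		}
		rows++
	}
}

// summarizeString is a convenience wrapper for small in-memory samples.
func summarizeString(s string, comma rune) ([]string, int, error) {
	return summarize(strings.NewReader(s), comma)
}
```

For a real dump, open the file with os.Open and pass it to summarize. Rows are read one at a time, so memory use stays flat even for files with hundreds of millions of records.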

To run all commands together

bhlindex find -y && \
  bhlindex verify -y && \
  bhlindex dump -d output-dir
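
The same chain can be driven from a Go program, for example when it has to run unattended and report which step failed. This is a minimal sketch using os/exec. The function names are hypothetical, and it uses only the subcommands and flags shown in this README.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// steps returns the argument lists for the three commands, in order.
func steps(dumpDir string) [][]string {
	return [][]string{
		{"find", "-y"},
		{"verify", "-y"},
		{"dump", "-d", dumpDir},
	}
}

// runAll runs each step with the given binary and stops at the first
// failure, mirroring the behaviour of && in the shell.
func runAll(binary, dumpDir string) error {
	for _, args := range steps(dumpDir) {
		cmd := exec.Command(binary, args...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("%s %s: %w", binary, args[0], err)
		}
	}
	return nil
}
```

Calling runAll("bhlindex", "output-dir") from your own main function is equivalent to the shell chain above. The returned error names the subcommand that failed.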

Testing

Testing requires a PostgreSQL database named bhlindex_test. Running the tests deletes all data from the test database.

go test

Documentation

Overview

Copyright © 2022 Dmitry Mozzherin <dmozzherin@gmail.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
