bhlindex

Version: v0.12.5
Published: Apr 27, 2022 License: MIT Imports: 14 Imported by: 0

README

Biodiversity Heritage Library Scientific Names Index

Creates an index of scientific names occurring in the literature collection of the Biodiversity Heritage Library.

Performance

This application can traverse the entire digitized corpus of the Biodiversity Heritage Library in a matter of hours. On a modern high-end laptop we observed the following results:

  • name-finding in 275,000 volumes, 60 million pages: 2.5 hours.
  • name-verification of 23 million unique name-strings: 3 hours.
  • preparing a CSV file with 250 million name occurrence/verification records: 40 minutes.
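
To put these timings in perspective, the following small Go sketch converts each figure into an average per-second rate. The counts and durations are taken from the list above; the helper name `perSecond` is made up for this illustration, and the resulting rates are simple averages, not measured throughput.

```go
package main

import "fmt"

// perSecond converts a total count processed in the given number of
// minutes into an average rate per second.
func perSecond(count, minutes float64) float64 {
	return count / (minutes * 60)
}

func main() {
	// Figures from the Performance list: 2.5 h = 150 min, 3 h = 180 min.
	fmt.Printf("name-finding: %.0f pages/s\n", perSecond(60e6, 150))
	fmt.Printf("verification: %.0f name-strings/s\n", perSecond(23e6, 180))
	fmt.Printf("dump:         %.0f records/s\n", perSecond(250e6, 40))
}
```

On these numbers, name-finding averages roughly 6,700 pages per second and verification roughly 2,100 name-strings per second.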

Installation on Linux

The BHL corpus of OCRed data is available as a compressed file larger than 50 GB.

Configuration

When you run the app for the first time, it creates a configuration file and reports where the file is located (usually $HOME/.config/bhlnames.yaml).

Edit the file to provide credentials for the PostgreSQL database.

Change the Jobs setting according to the amount of memory and the number of CPUs available. With 32 GB of memory, Jobs: 7 works well. This parameter sets the number of concurrent jobs running for name-finding.
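
As an illustration of what a setting like Jobs typically controls, here is a minimal worker-pool sketch in Go. It is not BHLindex's actual implementation: the function `processAll` and the sample inputs are hypothetical, and the sketch only shows the general pattern of limiting concurrent work to a fixed number of goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll applies work to every input using at most `jobs` goroutines
// and returns the sum of the results. Hypothetical helper for illustration.
func processAll(jobs int, inputs []string, work func(string) int) int {
	in := make(chan string)
	out := make(chan int)

	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range in {
				out <- work(s)
			}
		}()
	}

	// Feed the inputs, then signal the workers that no more are coming.
	go func() {
		for _, s := range inputs {
			in <- s
		}
		close(in)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	total := 0
	for n := range out {
		total += n
	}
	return total
}

func main() {
	pages := []string{"page one", "page two text", "p3"}
	// Seven workers, mirroring the Jobs: 7 example above.
	total := processAll(7, pages, func(s string) int { return len(s) })
	fmt.Println(total)
}
```

In a pattern like this, each worker holds its own input in memory while it runs, which is one reason a job count is usually tuned against available memory as well as CPU count.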

Set the BHLdir parameter to point to the root directory where the BHL texts are located (several hundred gigabytes of texts).

Other parameters are optional.

Environment Variables

It is possible to use environment variables instead of the configuration file. Environment variables override the configuration file settings. The following variables can be used:

Config            Env. Variable
BHLdir            BHLI_BHL_DIR
OutputFormat      BHLI_OUTPUT_FORMAT
PgHost            BHLI_PG_HOST
PgPort            BHLI_PG_PORT
PgUser            BHLI_PG_USER
PgPass            BHLI_PG_PASS
PgDatabase        BHLI_PG_DATABASE
Jobs              BHLI_JOBS
VerifierURL       BHLI_VERIFIER_URL
WithWebLogs       BHLI_WITH_WEB_LOGS
WithoutConfirm    BHLI_WITHOUT_CONFIRM

Usage

Preparations

Log in to the PostgreSQL server and create a database with the same name as the PgDatabase parameter in the configuration file (the default name is bhlindex).

This database is used to store the names that are found. Its final size upon completion should be in the vicinity of 50 GB.

Commands

Get BHLindex version

bhlindex -V

Find names in BHL

bhlindex find
# to avoid confirmation dialog (-y overrides configuration file)
bhlindex find -y

Verify detected names using the GNverifier service

bhlindex verify
# to avoid confirmation dialog (-y overrides configuration file)
bhlindex verify -y

Dump data into CSV, TSV, or JSON files

bhlindex dump
# to compress and save on disk
bhlindex dump | gzip > bhlindex-dump.csv.gz

# -f overrides configuration file settings for output format
bhlindex dump -f tsv | gzip > bhlindex-dump.tsv.gz
bhlindex dump -f json | gzip > bhlindex-dump.json.gz

To run all commands together

bhlindex find -y && \
  bhlindex verify -y && \
  bhlindex dump | gzip > bhlindex-dump.csv.gz

Serve detected items, pages, verified names, and name occurrences via a RESTful interface (the default port is 8080).

bhlindex rest
# using different port
bhlindex rest -p 8000

RESTful API endpoints

  • /api/v0/items
  • /api/v0/pages
  • /api/v0/names
  • /api/v0/occurrences

Query                                            Usage
items?offset_id=11&limit=100                     get items with ids 11-110
pages?offset_id=11&limit=10                      get pages of items with ids 11-20
names?offset_id=1&limit=10                       get verified names with ids 1-10
names?offset_id=1&limit=10&data_sources=1        get verified names with ids 1-10 verified to the "Catalogue of Life"
occurrences?offset=21&limit=10                   get detected names with ids 21-30
occurrences?offset=21&limit=10&data_sources=1    get detected names with ids 21-30 verified to the "Catalogue of Life"

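
A client could assemble such queries programmatically. The Go sketch below builds a URL for the names endpoint from the parameters shown above. The function `namesURL` is hypothetical, and the base address http://localhost:8080 is an assumption that combines the default port with a locally running server.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// namesURL builds a query URL for the /api/v0/names endpoint.
// A dataSource of 0 means "do not filter by data source".
// Hypothetical helper; base is the address of a running bhlindex REST server.
func namesURL(base string, offsetID, limit, dataSource int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v0/names"

	q := url.Values{}
	q.Set("offset_id", strconv.Itoa(offsetID))
	q.Set("limit", strconv.Itoa(limit))
	if dataSource > 0 {
		q.Set("data_sources", strconv.Itoa(dataSource))
	}
	// Encode sorts the parameters by key.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Verified names with ids 1-10, restricted to data source 1.
	s, err := namesURL("http://localhost:8080", 1, 10, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

To walk through all names, a client would request successive windows by advancing offset_id by limit after each call.
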
Testing

Testing requires a PostgreSQL database named bhlindex_test. Running the tests deletes all data from that database.

go test

Documentation

Index

Constants

This section is empty.

Variables

var (
	Version = "v0.12.5+"
	Build   string
)

Functions

This section is empty.

Types

type BHLindex added in v0.11.0

type BHLindex interface {
	// FindNames traverses BHL corpus directory structure, assembling texts,
	// detecting names, saving data to storage.
	FindNames(loader.Loader, finder.Finder) error

	// VerifyNames runs verification on unique detected names and saves the
	// results to local storage.
	VerifyNames(verif.VerifierBHL) error

	// DumpNames creates output with detected and verified names in CSV,
	// TSV, or JSON formats.
	DumpNames(output.Dumper) error

	// GetVersion outputs the version of BHLindex.
	GetVersion() gnvers.Version

	// GetConfig returns an instance of configuration fields.
	GetConfig() config.Config
}

BHLindex is the main use-case interface that defines the functionality of BHLindex.

func New added in v0.11.0

func New(
	cfg config.Config,
) BHLindex

New sets up the BHLindex interface using a bhlindex instance.

Directories

Path Synopsis
cmd
ent
io
