bhlindex

Version: v0.12.5 | Published: Apr 27, 2022 | License: MIT


Repository

github.com/gnames/bhlindex


README ¶

Biodiversity Heritage Library Scientific Names Index

Creates an index of scientific names occurring in the collection of literature in the Biodiversity Heritage Library (BHL).

Performance

This application can traverse the entire digitized corpus of the Biodiversity Heritage Library in a matter of hours. On a modern high-end laptop we observed the following results:

name-finding in 275,000 volumes (60 million pages): 2.5 hours.
name-verification of 23 million unique name-strings: 3 hours.
preparing a CSV file with 250 million name occurrence/verification records: 40 minutes.

Installation on Linux

Download the latest bhlindex release for Linux.
Untar the file and copy it to /usr/local/bin or another directory in the PATH.
Use the bhl testdata for testing.

The BHL corpus of OCRed data is available as a compressed file larger than 50GB.

Configuration

When you run the app for the first time, it creates a configuration file and reports where the file is located (usually $HOME/.config/bhlnames.yaml).

Edit the file to provide credentials for the PostgreSQL database.

Change the Jobs setting according to the amount of memory and the number of CPUs. For 32GB of memory, Jobs: 7 works well. This parameter sets the number of concurrent jobs running for name-finding.
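
To illustrate what a bounded number of concurrent jobs means in practice, here is a minimal Go sketch of a worker pool. It is not bhlindex's actual implementation: processAll and the work function are hypothetical names, and the "name-finding" step is replaced by a trivial stand-in.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll is a hypothetical helper: it applies work to every item using
// at most jobs goroutines at a time and keeps results in input order.
func processAll(items []string, jobs int, work func(string) int) []int {
	if jobs < 1 {
		jobs = 1
	}
	results := make([]int, len(items))
	idx := make(chan int)
	var wg sync.WaitGroup

	// Start a fixed number of workers; this is the role a Jobs-like
	// setting plays.
	for w := 0; w < jobs; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range idx {
				// Each index is handled by exactly one worker,
				// so writes to results do not overlap.
				results[i] = work(items[i])
			}
		}()
	}

	for i := range items {
		idx <- i
	}
	close(idx)
	wg.Wait()
	return results
}

func main() {
	pages := []string{"Homo sapiens", "Pardosa moesta Banks", ""}
	// Stand-in for name-finding: measure the text length.
	lengths := processAll(pages, 7, func(s string) int { return len(s) })
	fmt.Println(lengths)
}
```

More workers keep more texts in memory at once, which is why the setting has to be balanced against available RAM as well as CPU count.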

Set the BHLdir parameter to point to the root directory where the BHL texts are located (several hundred gigabytes of texts).

Other parameters are optional.

Environment Variables

Environment variables can be used instead of the configuration file. Environment variables override the configuration file settings. The following variables can be used:

Config	Env. Variable
BHLdir	BHLI_BHL_DIR
OutputFormat	BHLI_OUTPUT_FORMAT
PgHost	BHLI_PG_HOST
PgPort	BHLI_PG_PORT
PgUser	BHLI_PG_USER
PgPass	BHLI_PG_PASS
PgDatabase	BHLI_PG_DATABASE
Jobs	BHLI_JOBS
VerifierURL	BHLI_VERIFIER_URL
WithWebLogs	BHLI_WITH_WEB_LOGS
WithoutConfirm	BHLI_WITHOUT_CONFIRM

Usage

Preparations

Log in to the PostgreSQL server and create a database that has the same name as the PgDatabase parameter in the configuration file (the default name is bhlindex).

This database is used to keep the found names. Its final size upon completion should be in the vicinity of 50GB.

Commands

Get BHLindex version

bhlindex -V

Find names in BHL

bhlindex find
# to avoid confirmation dialog (-y overrides configuration file)
bhlindex find -y

Verify detected names using the GNverifier service

bhlindex verify
# to avoid confirmation dialog (-y overrides configuration file)
bhlindex verify -y

Dump data into CSV, TSV, or JSON files

bhlindex dump
# to compress and save on disk
bhlindex dump | gzip > bhlindex-dump.csv.gz

# -f overrides configuration file settings for output format
bhlindex dump -f tsv | gzip > bhlindex-dump.tsv.gz
bhlindex dump -f json | gzip > bhlindex-dump.json.gz

To run all commands together

bhlindex find -y && \
  bhlindex verify -y && \
  bhlindex dump | gzip > bhlindex-dump.csv.gz

Serve detected items, pages, verified names, and name occurrences via a RESTful interface (the default port is 8080).

bhlindex rest
# using different port
bhlindex rest -p 8000

RESTful API endpoints

/api/v0/items
/api/v0/pages
/api/v0/names
/api/v0/occurrences

Query	Usage
items?offset_id=11&limit=100	get items with ids 11-110
pages?offset_id=11&limit=10	get pages of items with ids 11-20
names?offset_id=1&limit=10	get verified names with ids 1-10
names?offset_id=1&limit=10&data_sources=1	get verified names with ids 1-10 verified to the "Catalogue of Life"
occurrences?offset=21&limit=10	get detected names with ids 21-30
occurrences?offset=21&limit=10&data_sources=1	get detected names with ids 21-30 verified to the "Catalogue of Life"
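
As a sketch of how a client might assemble these queries, the Go snippet below builds a URL for the endpoints that take offset_id. The endpointURL helper and the base address http://localhost:8080 are assumptions for illustration (a local `bhlindex rest` on the default port); the snippet only builds the URL and does not contact a server.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// endpointURL is a hypothetical helper that builds a paginated query for an
// /api/v0 endpoint. A dataSource of 0 means "no data_sources filter".
func endpointURL(base, endpoint string, offsetID, limit, dataSource int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v0/" + endpoint

	q := url.Values{}
	q.Set("offset_id", strconv.Itoa(offsetID))
	q.Set("limit", strconv.Itoa(limit))
	if dataSource > 0 {
		q.Set("data_sources", strconv.Itoa(dataSource))
	}
	// Encode sorts parameters by key, so their order differs from the
	// table above; the meaning of the query is unchanged.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Verified names with ids 1-10 limited to data source 1.
	s, err := endpointURL("http://localhost:8080", "names", 1, 10, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

To walk through a whole table, a client would presumably repeat the request while advancing offset_id by limit until an empty result comes back; the response format is not described here, so that loop is left out.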

Testing

Testing requires a PostgreSQL database named bhlindex_test. Running the tests deletes all data from that database.

go test

Documentation ¶

Constants ¶

This section is empty.

Variables ¶


var (
	Version = "v0.12.5+"
	Build   string
)

Functions ¶

This section is empty.

Types ¶

type BHLindex ¶ added in v0.11.0

type BHLindex interface {
	// FindNames traverses BHL corpus directory structure, assembling texts,
	// detecting names, saving data to storage.
	FindNames(loader.Loader, finder.Finder) error

	// VerifyNames runs verification on unique detected names and saves the
	// results to local storage.
	VerifyNames(verif.VerifierBHL) error

	// DumpNames creates output with detected and verified names in CSV,
	// TSV, or JSON formats.
	DumpNames(output.Dumper) error

	// GetVersion outputs the version of BHLindex.
	GetVersion() gnvers.Version

	// GetConfig returns an instance of configuration fields.
	GetConfig() config.Config
}

BHLindex is the main use-case interface that defines the functionality of BHLindex.

func New ¶ added in v0.11.0

func New(
	cfg config.Config,
) BHLindex

New sets up BHLindex interface using bhlindex instance.
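
The find, verify, dump order of the interface methods can be sketched with stand-in types. The stages interface below is a deliberately simplified, hypothetical version of BHLindex: the real methods take loader, finder, verifier and dumper arguments, which are omitted here, and fakeStages exists only to make the example runnable.

```go
package main

import (
	"errors"
	"fmt"
)

// stages is a hypothetical, argument-free stand-in for the BHLindex interface.
type stages interface {
	FindNames() error
	VerifyNames() error
	DumpNames() error
}

// runAll runs the three stages in order and stops at the first error,
// similar in spirit to chaining the CLI commands with &&.
func runAll(s stages) error {
	steps := []struct {
		name string
		run  func() error
	}{
		{"find", s.FindNames},
		{"verify", s.VerifyNames},
		{"dump", s.DumpNames},
	}
	for _, st := range steps {
		if err := st.run(); err != nil {
			return fmt.Errorf("%s: %w", st.name, err)
		}
	}
	return nil
}

// fakeStages records which stages ran and fails at the stage named failAt.
type fakeStages struct {
	failAt string
	calls  []string
}

func (f *fakeStages) step(name string) error {
	f.calls = append(f.calls, name)
	if name == f.failAt {
		return errors.New("simulated failure")
	}
	return nil
}

func (f *fakeStages) FindNames() error   { return f.step("find") }
func (f *fakeStages) VerifyNames() error { return f.step("verify") }
func (f *fakeStages) DumpNames() error   { return f.step("dump") }

func main() {
	f := &fakeStages{failAt: "verify"}
	err := runAll(f)
	fmt.Println(err, f.calls)
}
```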


Directories ¶

Path	Synopsis
bhlindex
cmd
config
ent
finder
item
loader
name
output
page
rest
verif
io
dbio
dumpio
finderio
loaderio
restio
verifio
