Global Names Finder (GNfinder)
GNfinder is a very fast finder of scientific names that combines dictionary and
NLP approaches. On a modern multiprocessor laptop it can process 15 million
pages per hour. It works with many file formats and can verify names against
many biological databases. Full functionality requires an Internet connection.
Citing
The Zenodo DOI can be used to cite GNfinder.
Features
- Multiplatform app (supports Linux, Windows, Mac OS X).
- Self-contained, with no external dependencies: only the gnfinder or
gnfinder.exe binary (~15 MB) is needed. However, an internet connection is
required for name verification.
- Includes REST API and web-based User Interface.
- Takes UTF8-encoded text and returns CSV, TSV or JSON-formatted output
that contains detected scientific names.
- Extracts text from PDF, MS Word, MS Excel, HTML, XML, RTF, JPG, TIFF, GIF
and other files for name detection.
- Downloads a web page from a given URL for name detection.
- Optionally detects the language of the text automatically and adjusts the
Bayes algorithm for that language. English and German are currently
supported.
- Uses complementary heuristic and natural language processing algorithms.
- Optionally verifies found names against multiple biodiversity databases using
the gnindex service.
- Detection of nomenclatural annotations like sp. nov., comb. nov., ssp. nov.
and their variants.
- Ability to see words that surround detected name-strings.
- The library can be used concurrently to significantly improve speed.
On a server with 40 threads it is able to detect names on 50 million pages
in approximately 3 hours using both heuristic and Bayes algorithms. Check
bhlindex project for an example.
Install as a command line app
Install with Homebrew
Homebrew is a packaging system originally made for Mac OS X. You can now use it
on Mac, Linux, or Windows with WSL (Windows Subsystem for Linux).
- Install Homebrew according to their instructions.
- Install GNfinder with:
brew tap gnames/gn
brew install gnfinder
Install by hand
GNfinder consists of just one executable file, so it is easy to install by
hand. To do that, download the binary executable for your operating system
from the latest release.
Linux or OS X
Move the gnfinder executable somewhere in your PATH (for example
/usr/local/bin)
sudo mv path_to/gnfinder /usr/local/bin
Windows
One possible way is to create a default folder for executables and place
gnfinder there.
Press the Windows+R key combination and type "cmd". In the terminal window
that appears, type:
mkdir C:\bin
copy path_to\gnfinder.exe C:\bin
Add the C:\bin directory to your PATH environment variable.
Go
Install Go v1.17 or higher.
git clone git@github.com:/gnames/gnfinder
cd gnfinder
make tools
make install
Configuration
When you run the gnfinder command for the first time, it creates a
gnfinder.yml configuration file.
Depending on the operating system, the file is located at:
MS Windows: C:\Users\AppData\Roaming\gnfinder.yml
Mac OS: $HOME/.config/gnfinder.yml
Linux: $HOME/.config/gnfinder.yml
This file lets you set options that modify the behaviour of GNfinder
according to your needs. It spares you from entering the same command line
flags again and again.
Command line flags override the settings in the configuration file.
It is also possible to set up environment variables. They override both the
settings in the configuration file and the flags.
| Settings | Environment variables |
|---|---|
| BayesOddsThreshold | GNF_BAYES_ODDS_THRESHOLD |
| Format | GNF_FORMAT |
| IncludeInputText | GNF_INCLUDE_INPUT_TEXT |
| Language | GNF_LANGUAGE |
| PreferredSources | GNF_PREFERRED_SOURCES |
| TikaURL | GNF_TIKA_URL |
| TokensAround | GNF_TOKENS_AROUND |
| VerifierURL | GNF_VERIFIER_URL |
| WithBayesOddsDetails | GNF_WITH_BAYES_ODDS_DETAILS |
| WithOddsAdjustment | GNF_WITH_ODDS_ADJUSTMENT |
| WithPlainInput | GNF_WITH_PLAIN_INPUT |
| WithUniqueNames | GNF_WITH_UNIQUE_NAMES |
| WithVerification | GNF_WITH_VERIFICATION |
| WithoutBayes | GNF_WITHOUT_BAYES |
Usage
Usage as a command line app
To see flags and usage:
gnfinder --help
# or just
gnfinder
To see the version of its binary:
gnfinder -V
Examples:
Starting as a web-application and an API server on port 8080
gnfinder -p 8080
Getting names from a UTF8-encoded file in CSV format
# -U flag prevents use of remote Apache Tika service for file conversion to
# UTF8-encoded plain text
# -U flag is optional, but it removes unnecessary remote call to Tika.
gnfinder file_with_names.txt -U
Getting names from a UTF8-encoded file in tab-separated values (TSV) format
gnfinder file_with_names.txt -U -f tsv
Getting names from a file that is not a plain UTF8-encoded text
gnfinder file.pdf
Getting names from a URL
gnfinder https://en.wikipedia.org/wiki/Raccoon
Getting unique names from a file in JSON format. This disables the -w flag.
gnfinder file_with_names.txt -u -f pretty
Getting names from a file in JSON format, and using jq to process JSON
gnfinder file_with_names.txt -f compact | jq
Getting data from a pipe forcing English language and verification
echo "Pomatomus saltator and Parus major" | gnfinder -v -l eng
echo "Pomatomus saltator and Parus major" | gnfinder --verify --lang eng
Displaying matches from NCBI and Encyclopedia of Life, if they exist. For
the list of data source IDs, go to gnverifier's data sources page.
echo "Pomatomus saltator and Parus major" | gnfinder -v -l eng -s "4,12"
echo "Pomatomus saltator and Parus major" | gnfinder --verify --lang eng --sources "4,12"
Adjusting Prior Odds using information about found names. They are calculated
as "found names number / (capitalized words number - found names number)".
Such adjustment will decrease Odds for texts with very few names, and increase
odds for texts with a lot of found names.
gnfinder -a -d -f pretty file_with_names.txt
Returning 5 words before and after a found name-candidate. This flag is
ignored if unique names are returned.
gnfinder -w 5 file_with_names.txt
gnfinder --words-around 5 file_with_names.txt
Getting data from a file and redirecting result to another file
gnfinder file1.txt > file2.json
Detection of nomenclatural annotations
echo "Parus major sp. n." | gnfinder
Returning the positions of found names as the number of bytes from the
beginning of the text instead of the number of UTF-8 characters
echo "Это Parus major" | gnfinder -b
There is also a tutorial about processing many PDF files in parallel.
Usage as a library
import (
	"fmt"

	"github.com/gnames/gnfinder"
	"github.com/gnames/gnfinder/ent/nlp"
	"github.com/gnames/gnfinder/io/dict"
)

// Example finds names in a short text and prints the first detected name
// together with its start and end offsets.
func Example() {
	txt := `Blue Adussel (Mytilus edulis) grows to about two
inches the first year,Pardosa moesta Banks, 1892`
	cfg := gnfinder.NewConfig()
	// Load the dictionary and the Bayes weights used for detection.
	dictionary := dict.LoadDictionary()
	weights := nlp.BayesWeights()
	gnf := gnfinder.New(cfg, dictionary, weights)
	res := gnf.Find(txt)
	name := res.Names[0]
	fmt.Printf(
		"Name: %s, start: %d, end: %d",
		name.Name,
		name.OffsetStart,
		name.OffsetEnd,
	)
	// Output:
	// Name: Mytilus edulis, start: 13, end: 29
}
Usage as a docker container
docker pull gnames/gnfinder
# run GNfinder server, and map it to port 8888 on the host machine
docker run -d -p 8888:8778 --name gnfinder gnames/gnfinder
Usage of API
The best source for API usage is its documentation.
If you want to start your own API endpoint (for example on localhost, port
8080) use:
gnfinder -p 8080
curl localhost:8080/api/v1/ping
To upload a file and detect names from its content:
curl -v -F verification=true -F file=@/path/to/test.txt https://gnfinder.globalnames.org/api/v1/find
Projects based on GNfinder
gnfinder-plus makes it possible to work with MS Docs and PDF files without
remote services (requires a local install of the poppler package).
bhlindex creates an index of scientific names for Biodiversity Heritage
Library (BHL).
bhlnames adds synonymy and currently accepted names to searches
in BHL, connects publications to pages in BHL.
Development
To install the latest GNfinder:
git clone git@github.com:/gnames/gnfinder
cd gnfinder
make tools
make install
Modify OpenAPI documentation
docker run -d -p 80:8080 swaggerapi/swagger-editor
Testing
From the root of the project:
make tools
# run make install for CLI testing
make install
To run tests, go to the root directory of the project and run:
go test ./...
# or
make test