glossterm

Version: v0.0.0-...-2c36af7 Published: Jan 23, 2023 License: MIT

Repository

github.com/vthommeret/glossterm

README

glossterm 0.5

glossterm is a pipeline that extracts, lexes, and parses Wiktionary data.

Pipeline

To generate the files for the web app, download an English Wiktionary dump, put it in data/, and run the following commands.

You can run each command directly, e.g. go run cmd/gtdump/main.go, or run make to install the commands globally so they can be invoked by name, e.g. gtdump.

gtdump downloads the Wiktionary dump to en.xml.bz2.
gtsplit splits the Wiktionary dump into N files so it can be parsed in parallel. N is set to the current number of cores.
gtparse parses the split files into words.gob and descendants.gob. Use --no-backup after the initial change to the index to edit the index in place and compare it to the previously committed index.
gtresolve reads words.gob, looks up DescendantTrees references in descendants.gob, and inlines them.
gtquads generates quads for each word to power graph lookups, e.g. finding all descendants of the Latin roots of a given word.
gtbeam fetches cognates in parallel using the Apache Beam local runner.
gtcognates inlines the cognates from gtbeam into words.gob.
gtcompare compares the new index to the old index. Always use it to manually verify parsing changes.
gtindex incrementally indexes words (additions, deletions, updates) in Firestore.
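
The split, parse, and serialize steps above can be pictured with a small Go sketch. It is not the glossterm implementation: the Word type, the function names, and the sample page titles are hypothetical stand-ins, and the "parsing" is a placeholder. It only illustrates the general shape of sharding work across runtime.NumCPU() shards, processing the shards in parallel, and round-tripping the result through encoding/gob.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"runtime"
	"sort"
	"sync"
)

// Word is a hypothetical stand-in for a parsed entry; the real
// glossterm types are not shown in this README.
type Word struct {
	Name string
	Lang string
}

// splitPages distributes pages round-robin across n shards,
// loosely analogous to what gtsplit does with the dump.
func splitPages(pages []string, n int) [][]string {
	if n < 1 {
		n = 1
	}
	shards := make([][]string, n)
	for i, p := range pages {
		shards[i%n] = append(shards[i%n], p)
	}
	return shards
}

// parseShards handles each shard in its own goroutine and merges the
// results. The per-page "parse" is a placeholder that wraps the title.
func parseShards(shards [][]string) []Word {
	results := make([][]Word, len(shards))
	var wg sync.WaitGroup
	for i, shard := range shards {
		wg.Add(1)
		go func(i int, shard []string) {
			defer wg.Done()
			for _, p := range shard {
				// Each goroutine writes only to its own slot.
				results[i] = append(results[i], Word{Name: p, Lang: "en"})
			}
		}(i, shard)
	}
	wg.Wait()

	var words []Word
	for _, r := range results {
		words = append(words, r...)
	}
	// Sort so the merged output does not depend on shard layout.
	sort.Slice(words, func(a, b int) bool { return words[a].Name < words[b].Name })
	return words
}

// encodeWords serializes words with encoding/gob.
func encodeWords(words []Word) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(words); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeWords reverses encodeWords.
func decodeWords(data []byte) ([]Word, error) {
	var words []Word
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&words)
	return words, err
}

func main() {
	pages := []string{"helado", "hombre", "horno", "nariz"}
	shards := splitPages(pages, runtime.NumCPU())
	words := parseShards(shards)

	data, err := encodeWords(words)
	if err != nil {
		panic(err)
	}
	decoded, err := decodeWords(data)
	if err != nil {
		panic(err)
	}
	fmt.Println("shards:", len(shards), "words:", len(decoded))
}
```

Giving each goroutine its own result slot avoids a mutex, and sorting after the merge keeps the output stable regardless of how many cores the machine has.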

Debugging a single word

gtpage <word> extracts a single XML page for a given word. Example: gtpage helado
gtlex <word.xml> lexes a single XML page for a given word. Example: gtpage hombre | gtlex
gtparseword <word.xml> parses a single XML word. Example: gtpage horno | gtparseword
gtparseetymtree <word.xml> parses a single etymtree XML page. Example: gtpage Template:etymtree/la/germanus | gtparseetymtree
gtdescend <word> shows the descendants from any words mentioned for a given word.
gtread <word> reads a word from words.gob. Example: gtread pt/nariz
gtsearch <query> searches the index for a given word.

Directories

Path
cmd/gtbeam
cmd/gtcognates
cmd/gtcompare
cmd/gtdescend
cmd/gtdump
cmd/gtindex
cmd/gtlex
cmd/gtpage
cmd/gtparse
cmd/gtparseetymtree
cmd/gtparseword
cmd/gtquads
cmd/gtread
cmd/gtresolve
cmd/gtsearch
cmd/gtsplit
lib/gt
lib/lang
lib/mobile
lib/radix
lib/tpl
