stemmer

package

Version: v1.11.2 | Published: Feb 21, 2025 | License: MIT | Imports: 2 | Imported by: 1

Documentation ¶

Overview ¶

http://snowballstem.org/otherapps/schinke/
http://caio.ueberalles.net/a_stemming_algorithm_for_latin_text_databases-schinke_et_al.pdf

The Schinke Latin stemming algorithm is described in: Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text databases. Journal of Documentation, 52: 172-187.

It stems each word to two forms: a noun-based stem and a verb-based stem. For example:

            NOUN        VERB
            ----        ----
aquila      aquil       aquila
portat      portat      porta
portis      port        por

Here (slightly reformatted) are the rules of the stemmer:

1. (start)

2. Convert all occurrences of the letters 'j' or 'v' to 'i' or 'u', respectively.
3. If the word ends in '-que' then
   if the word is on the list shown in Figure 4, then write the original word to both the noun-based and verb-based stem dictionaries and go to 8,
   else remove '-que'.
   [Figure 4 was
   atque quoque neque itaque absque apsque abusque adaeque adusque denique deque susque oblique peraeque plenisque quandoque quisque quaeque cuiusque cuique quemque quamque quaque quique quorumque quarumque quibusque quosque quasque quotusquisque quousque ubique undique usque uterque utique utroque utribique torque coque concoque contorque detorque decoque excoque extorque obtorque optorque retorque recoque attorque incoque intorque praetorque]
4. Match the end of the word against the suffix list shown in Figure 6(a), removing the longest matching suffix (if any).
   [Figure 6(a) was
   -ibus -ius -ae -am -as -em -es -ia -is -nt -os -ud -um -us -a -e -i -o -u]
5. If the resulting stem contains at least two characters then write this stem to the noun-based stem dictionary.
6. Match the end of the word against the suffix list shown in Figure 6(b), identifying the longest matching suffix (if any).
   [Figure 6(b) was
   -iuntur -beris -erunt -untur -iunt -mini -ntur -stis -bor -ero -mur -mus -ris -sti -tis -tur -unt -bo -ns -nt -ri -m -r -s -t]
   If any of the following suffixes are found then convert them as shown:
   '-iuntur', '-erunt', '-untur', '-iunt', and '-unt' to '-i'; '-beris', '-bor', and '-bo' to '-bi'; '-ero' to '-eri';
   else remove the suffix in the normal way.
7. If the resulting stem contains at least two characters then write this stem to the verb-based stem dictionary.

8. (end)

Addendum: the suffix '-ii' is added to the suffix list used in Step 4.
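
The rules above can be read as a short program. The following is a minimal sketch, not the source of this package: the names schinkeStems, longestSuffix, atLeastTwo and toSet are hypothetical, the input is assumed to be a lowercase ASCII word, and returning an empty string when a stem is shorter than two characters is an assumption standing in for "nothing is written to the dictionary".

```go
package main

import (
	"fmt"
	"strings"
)

// queWords holds the Figure 4 words that keep their '-que' ending.
var queWords = toSet(`atque quoque neque itaque absque apsque abusque adaeque
adusque denique deque susque oblique peraeque plenisque quandoque quisque
quaeque cuiusque cuique quemque quamque quaque quique quorumque quarumque
quibusque quosque quasque quotusquisque quousque ubique undique usque uterque
utique utroque utribique torque coque concoque contorque detorque decoque
excoque extorque obtorque optorque retorque recoque attorque incoque intorque
praetorque`)

// nounSuffixes is Figure 6(a) plus the '-ii' addendum, listed longest first.
var nounSuffixes = strings.Fields(
	"ibus ius ae am as em es ia ii is nt os ud um us a e i o u")

// verbSuffixes is Figure 6(b), listed longest first.
var verbSuffixes = strings.Fields(
	"iuntur beris erunt untur iunt mini ntur stis bor ero mur mus ris sti " +
		"tis tur unt bo ns nt ri m r s t")

// verbRewrites maps the verb suffixes that are converted instead of removed.
var verbRewrites = map[string]string{
	"iuntur": "i", "erunt": "i", "untur": "i", "iunt": "i", "unt": "i",
	"beris": "bi", "bor": "bi", "bo": "bi",
	"ero": "eri",
}

// toSet turns a whitespace-separated word list into a lookup set.
func toSet(words string) map[string]bool {
	set := map[string]bool{}
	for _, w := range strings.Fields(words) {
		set[w] = true
	}
	return set
}

// longestSuffix returns the first entry of a longest-first suffix list that
// ends the word, or "" when none matches.
func longestSuffix(word string, suffixes []string) string {
	for _, s := range suffixes {
		if strings.HasSuffix(word, s) {
			return s
		}
	}
	return ""
}

// atLeastTwo returns the stem if it has two or more characters, else ""
// (assumption: "" means that no stem is written).
func atLeastTwo(stem string) string {
	if len(stem) >= 2 {
		return stem
	}
	return ""
}

// schinkeStems returns the noun-based and verb-based stems of a word.
func schinkeStems(word string) (noun, verb string) {
	// Step 2: normalize j to i and v to u.
	word = strings.NewReplacer("j", "i", "v", "u").Replace(word)
	// Step 3: keep listed '-que' words whole, otherwise drop '-que'.
	if strings.HasSuffix(word, "que") {
		if queWords[word] {
			return word, word
		}
		word = strings.TrimSuffix(word, "que")
	}
	// Steps 4 and 5: remove the longest noun suffix.
	noun = strings.TrimSuffix(word, longestSuffix(word, nounSuffixes))
	// Steps 6 and 7: convert or remove the longest verb suffix.
	suffix := longestSuffix(word, verbSuffixes)
	verb = strings.TrimSuffix(word, suffix) + verbRewrites[suffix]
	return atLeastTwo(noun), atLeastTwo(verb)
}

func main() {
	for _, w := range []string{"aquila", "portat", "portis"} {
		noun, verb := schinkeStems(w)
		fmt.Printf("%s: noun=%s verb=%s\n", w, noun, verb)
	}
}
```

For the three words in the overview table, this sketch yields the same noun and verb stems as the table (aquil/aquila, portat/porta, port/por).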

Functions ¶

func StemCanonical ¶

func StemCanonical(c string) string

StemCanonical takes the short form of a canonical name and returns it with the specific and infraspecific epithets stemmed and the cultivar epithet left unstemmed. It assumes the following properties of the string:

The string has no leading or trailing spaces.
All spaces within the string are single.
All characters in the string are ASCII, with the exception of the hybrid sign.
The string always starts with a capitalized word.
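
Because StemCanonical assumes these properties rather than checking them, a caller may want to verify its input first. The following is a minimal sketch of such a check. The helper checkCanonicalInput is hypothetical and not part of this package, the hybrid sign is assumed to be U+00D7 (the multiplication sign), and "capitalized" is taken to mean that the first byte is an ASCII uppercase letter.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
)

// hybridSign is assumed to be U+00D7 (multiplication sign).
const hybridSign = '×'

// checkCanonicalInput is a hypothetical helper that reports the first
// violated input assumption, or nil when all of them hold.
func checkCanonicalInput(s string) error {
	switch {
	case s == "":
		return errors.New("empty string")
	case strings.TrimSpace(s) != s:
		return errors.New("leading or trailing space")
	case strings.Contains(s, "  "):
		return errors.New("repeated space")
	case s[0] < 'A' || s[0] > 'Z':
		return errors.New("first word is not capitalized")
	}
	// Every character must be ASCII, except for the hybrid sign.
	for _, r := range s {
		if r > unicode.MaxASCII && r != hybridSign {
			return fmt.Errorf("non-ASCII character %q", r)
		}
	}
	return nil
}

func main() {
	inputs := []string{"Betula alba", "Betula  alba", "betula alba", "Salix × capreola"}
	for _, s := range inputs {
		fmt.Printf("%q: %v\n", s, checkCanonicalInput(s))
	}
}
```

The names in main are illustrative inputs only. A string that passes this check satisfies the four listed assumptions as interpreted here, which says nothing about whether it is a valid canonical name.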

Types ¶

type StemmedWord ¶

type StemmedWord struct {
	// Orig is the original word (input).
	Orig string
	// Stem is the stemmed version of the original word.
	Stem string
	// Suffix is the 'tail' left after stemming.
	Suffix string
}

StemmedWord is the output of the stemming algorithm applied to a word.

func Stem ¶

func Stem(wrd string) StemmedWord

Stem takes a word and, assuming the word is a noun, removes its Latin suffix if one is detected.
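
To illustrate how the three fields of StemmedWord relate, here is a minimal noun-only sketch. It is a stand-in written for this example, not the package's Stem: the type stemmedWord and the function stemNoun are hypothetical, the j/v normalization is omitted, and leaving the word unchanged when the stem would be shorter than two characters is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// stemmedWord mirrors the documented StemmedWord fields for this sketch.
type stemmedWord struct {
	Orig   string // the input word
	Stem   string // the word with its noun suffix removed
	Suffix string // the removed suffix, empty if nothing was removed
}

// nounSuffixes is Figure 6(a) plus the '-ii' addendum, listed longest first.
var nounSuffixes = strings.Fields(
	"ibus ius ae am as em es ia ii is nt os ud um us a e i o u")

// stemNoun is a hypothetical stand-in for Stem that strips the longest
// matching noun suffix from a lowercase ASCII word.
func stemNoun(word string) stemmedWord {
	res := stemmedWord{Orig: word, Stem: word}
	for _, s := range nounSuffixes {
		if strings.HasSuffix(word, s) {
			// Assumption: keep the word whole if the stem would be
			// shorter than two characters.
			if stem := strings.TrimSuffix(word, s); len(stem) >= 2 {
				res.Stem, res.Suffix = stem, s
			}
			break
		}
	}
	return res
}

func main() {
	fmt.Printf("%+v\n", stemNoun("aquila")) // {Orig:aquila Stem:aquil Suffix:a}
	fmt.Printf("%+v\n", stemNoun("portis")) // {Orig:portis Stem:port Suffix:is}
}
```

In this sketch Stem and Suffix always concatenate back to Orig. That property may not hold for the real Stem if it normalizes 'j' and 'v' before stripping the suffix.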

Source Files ¶

stemmer.go