stemmer

package
v1.11.2
Published: Feb 21, 2025 License: MIT Imports: 2 Imported by: 1

Documentation

Overview

http://snowballstem.org/otherapps/schinke/

http://caio.ueberalles.net/a_stemming_algorithm_for_latin_text_databases-schinke_et_al.pdf

The Schinke Latin stemming algorithm is described in Schinke R, Greengrass M, Robertson AM and Willett P (1996) A stemming algorithm for Latin text databases. Journal of Documentation, 52: 172-187.

Its distinguishing feature is that it stems each word to two forms, a noun form and a verb form. For example:

            NOUN        VERB
            ----        ----
aquila      aquil       aquila
portat      portat      porta
portis      port        por

Here (slightly reformatted) are the rules of the stemmer:

1. (start)

2. Convert all occurrences of the letters 'j' or 'v' to 'i' or 'u', respectively.

3. If the word ends in '-que', then: if the word is on the list shown in Figure 4, write the original word to both the noun-based and verb-based stem dictionaries and go to 8; otherwise remove '-que'.

    [Figure 4 was

    atque quoque neque itaque absque apsque abusque adaeque adusque denique deque susque oblique peraeque plenisque quandoque quisque quaeque cuiusque cuique quemque quamque quaque quique quorumque quarumque quibusque quosque quasque quotusquisque quousque ubique undique usque uterque utique utroque utribique torque coque concoque contorque detorque decoque excoque extorque obtorque optorque retorque recoque attorque incoque intorque praetorque]

4. Match the end of the word against the suffix list shown in Figure 6(a), removing the longest matching suffix (if any).

    [Figure 6(a) was

    -ibus -ius -ae -am -as -em -es -ia -is -nt -os -ud -um -us -a -e -i -o -u]

5. If the resulting stem contains at least two characters, then write this stem to the noun-based stem dictionary.

6. Match the end of the word against the suffix list shown in Figure 6(b), identifying the longest matching suffix (if any).

    [Figure 6(b) was

    -iuntur -beris -erunt -untur -iunt -mini -ntur -stis -bor -ero -mur -mus -ris -sti -tis -tur -unt -bo -ns -nt -ri -m -r -s -t]

    If any of the following suffixes are found, then convert them as shown:

    '-iuntur', '-erunt', '-untur', '-iunt', and '-unt' to '-i'; '-beris', '-bor', and '-bo' to '-bi'; '-ero' to '-eri';

    otherwise remove the suffix in the normal way.

7. If the resulting stem contains at least two characters, then write this stem to the verb-based stem dictionary.

8. (end)

Addendum: this package adds '-ii' to the noun suffix list (Figure 6(a)).
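
The rules above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not this package's implementation: the function name schinke and the variable names are hypothetical, the '-que' list is abbreviated to a few entries of Figure 4, the suffix lists follow the paper (without the '-ii' addendum), input is assumed to be lowercase ASCII, and an empty string is returned for a form whose stem would be shorter than two characters.

```go
package main

import (
	"fmt"
	"strings"
)

// queWords is an abbreviated, illustrative subset of Figure 4;
// a faithful implementation would list every entry.
var queWords = map[string]bool{
	"atque": true, "quoque": true, "neque": true, "itaque": true,
	"quisque": true, "ubique": true, "usque": true,
}

// nounSuffixes is Figure 6(a), ordered longest first.
var nounSuffixes = []string{
	"ibus", "ius", "ae", "am", "as", "em", "es", "ia", "is", "nt",
	"os", "ud", "um", "us", "a", "e", "i", "o", "u",
}

// verbSuffixes is Figure 6(b), ordered longest first.
var verbSuffixes = []string{
	"iuntur", "beris", "erunt", "untur", "iunt", "mini", "ntur", "stis",
	"bor", "ero", "mur", "mus", "ris", "sti", "tis", "tur", "unt",
	"bo", "ns", "nt", "ri", "m", "r", "s", "t",
}

// verbRewrites holds the verb suffixes that are converted instead of removed.
var verbRewrites = map[string]string{
	"iuntur": "i", "erunt": "i", "untur": "i", "iunt": "i", "unt": "i",
	"beris": "bi", "bor": "bi", "bo": "bi",
	"ero": "eri",
}

// longestSuffix returns the first suffix in list that ends word. Because the
// lists are ordered longest first, this is the longest match ("" if none).
func longestSuffix(word string, list []string) string {
	for _, s := range list {
		if strings.HasSuffix(word, s) {
			return s
		}
	}
	return ""
}

// schinke returns the noun-based and verb-based stems of a lowercase word.
// An empty result means no stem would be written for that form.
func schinke(word string) (noun, verb string) {
	// Convert 'j' to 'i' and 'v' to 'u'.
	word = strings.NewReplacer("j", "i", "v", "u").Replace(word)

	// Handle the '-que' enclitic: listed words are kept whole for both forms.
	if strings.HasSuffix(word, "que") {
		if queWords[word] {
			return word, word
		}
		word = strings.TrimSuffix(word, "que")
	}

	// Noun form: remove the longest suffix from Figure 6(a).
	n := strings.TrimSuffix(word, longestSuffix(word, nounSuffixes))
	if len(n) >= 2 {
		noun = n
	}

	// Verb form: remove or convert the longest suffix from Figure 6(b).
	suf := longestSuffix(word, verbSuffixes)
	v := strings.TrimSuffix(word, suf) + verbRewrites[suf]
	if len(v) >= 2 {
		verb = v
	}
	return noun, verb
}

func main() {
	for _, w := range []string{"aquila", "portat", "portis"} {
		n, v := schinke(w)
		fmt.Printf("%-10s %-10s %s\n", w, n, v)
	}
}
```

Run as written, the program reproduces the three rows of the noun/verb example table. Ordering each suffix list longest first lets a simple linear scan stand in for an explicit longest-match search.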

Functions

func StemCanonical

func StemCanonical(c string) string

StemCanonical takes the short form of a canonical name and returns its stemmed specific and infraspecific epithets, along with an unstemmed cultivar epithet. It assumes the following about the input string:

  1. The string has no leading or trailing spaces.
  2. All spaces within the string are single.
  3. All characters in the string are ASCII, with the exception of the hybrid sign.
  4. The string always starts with a capitalized word.
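
Because StemCanonical assumes these properties instead of checking them, a caller may want to validate input first. The sketch below shows one way to do that. The helper looksCanonical is hypothetical and not part of this package; it also assumes that the hybrid sign is the multiplication sign '×' (U+00D7) and that "capitalized" means an initial ASCII uppercase letter.

```go
package main

import (
	"fmt"
	"strings"
)

// looksCanonical is a hypothetical helper. It reports whether s satisfies
// the four input assumptions listed for StemCanonical.
func looksCanonical(s string) bool {
	// 1. No leading or trailing spaces (and the string is not empty).
	if s == "" || s != strings.TrimSpace(s) {
		return false
	}
	// 2. All internal spaces are single.
	if strings.Contains(s, "  ") {
		return false
	}
	// 3. Only ASCII characters, except the hybrid sign (assumed to be '×').
	for _, r := range s {
		if r > 0x7F && r != '×' {
			return false
		}
	}
	// 4. The first word is capitalized.
	return s[0] >= 'A' && s[0] <= 'Z'
}

func main() {
	for _, s := range []string{"Aus bus cus", " Aus bus", "Aus  bus", "aus bus"} {
		fmt.Printf("%q: %v\n", s, looksCanonical(s))
	}
}
```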

Types

type StemmedWord

type StemmedWord struct {
	// Orig is the original word (input).
	Orig string
	// Stem is the stemmed version of the original word.
	Stem string
	// Suffix is the 'tail' left after stemming.
	Suffix string
}

StemmedWord is the output of the stemming algorithm applied to a word.

func Stem

func Stem(wrd string) StemmedWord

Stem takes a word and, assuming the word is a noun, removes its Latin suffix if one is detected.
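
To illustrate how the three fields of a stemmed word relate, here is a minimal noun-only sketch. It is not this package's code: stemmedWord and stemNoun are hypothetical local stand-ins, the suffix list is Figure 6(a) plus the '-ii' addendum, and the sketch skips the 'j'/'v' conversion and '-que' handling, so the real Stem may return different values for some inputs.

```go
package main

import (
	"fmt"
	"strings"
)

// stemmedWord is a local stand-in that mirrors the documented fields.
type stemmedWord struct {
	Orig   string // the input word
	Stem   string // the word with its suffix removed
	Suffix string // the removed suffix, empty if none was removed
}

// nounSuffixes is Figure 6(a) plus '-ii', ordered longest first.
var nounSuffixes = []string{
	"ibus", "ius", "ae", "am", "as", "em", "es", "ia", "ii", "is", "nt",
	"os", "ud", "um", "us", "a", "e", "i", "o", "u",
}

// stemNoun removes the longest matching noun suffix, provided the
// remaining stem keeps at least two characters.
func stemNoun(word string) stemmedWord {
	for _, suf := range nounSuffixes {
		if strings.HasSuffix(word, suf) {
			if stem := strings.TrimSuffix(word, suf); len(stem) >= 2 {
				return stemmedWord{Orig: word, Stem: stem, Suffix: suf}
			}
			break // the longest match leaves too short a stem
		}
	}
	return stemmedWord{Orig: word, Stem: word}
}

func main() {
	for _, w := range []string{"aquila", "smithii", "us"} {
		fmt.Printf("%+v\n", stemNoun(w))
	}
}
```

In this sketch Stem and Suffix always concatenate back to Orig, which would not hold for an implementation that normalizes letters before stemming.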
