token

package
v0.2.2
Published: Jul 4, 2022 License: Apache-2.0 Imports: 7 Imported by: 65

Documentation

Overview

Package lowercase implements a TokenFilter which converts tokens to lower case according to Unicode rules.

Package stop implements a TokenFilter removing tokens found in a TokenMap.

Its constructor takes the following arguments:

"stop_token_map" (string): the name of the token map identifying tokens to remove.

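The two behaviours described in the overview (Unicode lower-casing and stop-token removal) can be illustrated with a minimal, self-contained sketch. It works on plain strings rather than on the package's token streams, and `tokenSet`, `lowercase`, and `removeStopTokens` are hypothetical names introduced only for this example; they are not part of the package API.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenSet is a hypothetical stand-in for a token map: a set of terms.
type tokenSet map[string]bool

// lowercase returns a copy of terms with each term lower-cased using
// Unicode-aware rules (strings.ToLower).
func lowercase(terms []string) []string {
	out := make([]string, len(terms))
	for i, t := range terms {
		out[i] = strings.ToLower(t)
	}
	return out
}

// removeStopTokens keeps only the terms that are absent from stop,
// preserving their original order.
func removeStopTokens(terms []string, stop tokenSet) []string {
	kept := make([]string, 0, len(terms))
	for _, t := range terms {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	stop := tokenSet{"the": true, "of": true}
	terms := lowercase([]string{"The", "Art", "of", "Search"})
	fmt.Println(removeStopTokens(terms, stop)) // [art search]
}
```

Lower-casing runs first in this sketch so that "The" matches the lower-case stop entry; whether a real analyzer orders its filters this way is a configuration choice.
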
Index

Constants

const Apostrophe = '\''
const RightSingleQuotationMark = '’'

Variables

This section is empty.

Functions

This section is empty.

Types

type ApostropheFilter

type ApostropheFilter struct{}

func NewApostropheFilter

func NewApostropheFilter() *ApostropheFilter

func (*ApostropheFilter) Filter

type CamelCaseFilter

type CamelCaseFilter struct{}

CamelCaseFilter splits a given token into a set of tokens where each resulting token falls into one of the following classes:

  1. Upper case followed by lower case letters. Terminated by a number, an upper case letter, or a non alpha-numeric symbol.
  2. Upper case followed by upper case letters. Terminated by a number, an upper case followed by a lower case letter, or a non alpha-numeric symbol.
  3. Lower case followed by lower case letters. Terminated by a number, an upper case letter, or a non alpha-numeric symbol.
  4. Number followed by numbers. Terminated by a letter or a non alpha-numeric symbol.
  5. Non alpha-numeric symbol followed by non alpha-numeric symbols. Terminated by a number or a letter.

It does a one-time sequential pass over an input token, from left to right. The scan is greedy and generates the longest substring that fits into one of the classes.

See the test file for examples of classes and their parsings.
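
The five classes and the greedy left-to-right scan can be sketched in a few lines. This is an illustrative reimplementation over plain strings, not the package's code: `splitCamelCase` and `isOther` are hypothetical names, and treating every rune that is neither upper case, lower case, nor a number as class 5 is a simplifying assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

// isOther reports whether r belongs to none of the letter or number
// classes (assumption: caseless letters are lumped in with symbols).
func isOther(r rune) bool {
	return !unicode.IsUpper(r) && !unicode.IsLower(r) && !unicode.IsNumber(r)
}

// splitCamelCase scans s once from left to right and greedily emits the
// longest substring that fits one of the five classes.
func splitCamelCase(s string) []string {
	runes := []rune(s)
	var out []string
	for i := 0; i < len(runes); {
		start := i
		r := runes[i]
		i++
		switch {
		case unicode.IsUpper(r) && i < len(runes) && unicode.IsLower(runes[i]):
			// Class 1: one upper case letter followed by lower case letters.
			for i < len(runes) && unicode.IsLower(runes[i]) {
				i++
			}
		case unicode.IsUpper(r):
			// Class 2: a run of upper case letters, stopping before an
			// upper case letter that is followed by a lower case letter.
			for i < len(runes) && unicode.IsUpper(runes[i]) {
				if i+1 < len(runes) && unicode.IsLower(runes[i+1]) {
					break
				}
				i++
			}
		case unicode.IsLower(r):
			// Class 3: a run of lower case letters.
			for i < len(runes) && unicode.IsLower(runes[i]) {
				i++
			}
		case unicode.IsNumber(r):
			// Class 4: a run of numbers.
			for i < len(runes) && unicode.IsNumber(runes[i]) {
				i++
			}
		default:
			// Class 5: a run of non alpha-numeric symbols.
			for i < len(runes) && isOther(runes[i]) {
				i++
			}
		}
		out = append(out, string(runes[start:i]))
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitCamelCase("getHTTPResponse2xx-ok"))
	// ["get" "HTTP" "Response" "2" "xx" "-" "ok"]
}
```
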

func NewCamelCaseFilter

func NewCamelCaseFilter() *CamelCaseFilter

func (*CamelCaseFilter) Filter

type DictionaryCompoundFilter

type DictionaryCompoundFilter struct {
	// contains filtered or unexported fields
}

func NewDictionaryCompoundFilter

func NewDictionaryCompoundFilter(dict analysis.TokenMap, minWordSize, minSubWordSize, maxSubWordSize int,
	onlyLongestMatch bool) *DictionaryCompoundFilter

func (*DictionaryCompoundFilter) Filter

type EdgeNgramFilter

type EdgeNgramFilter struct {
	// contains filtered or unexported fields
}

func NewEdgeNgramFilter

func NewEdgeNgramFilter(side Side, minLength, maxLength int) *EdgeNgramFilter

func (*EdgeNgramFilter) Filter

type ElisionFilter

type ElisionFilter struct {
	// contains filtered or unexported fields
}

func NewElisionFilter

func NewElisionFilter(articles analysis.TokenMap) *ElisionFilter

func (*ElisionFilter) Filter

type KeyWordMarkerFilter

type KeyWordMarkerFilter struct {
	// contains filtered or unexported fields
}

func NewKeyWordMarkerFilter

func NewKeyWordMarkerFilter(keyWords analysis.TokenMap) *KeyWordMarkerFilter

func (*KeyWordMarkerFilter) Filter

type LengthFilter

type LengthFilter struct {
	// contains filtered or unexported fields
}

func NewLengthFilter

func NewLengthFilter(min, max int) *LengthFilter

func (*LengthFilter) Filter

type LowerCaseFilter

type LowerCaseFilter struct{}

func NewLowerCaseFilter

func NewLowerCaseFilter() *LowerCaseFilter

func (*LowerCaseFilter) Filter

type LowerCaseState

type LowerCaseState struct{}

func (*LowerCaseState) Member

func (s *LowerCaseState) Member(sym rune, peek *rune) bool

func (*LowerCaseState) StartSym

func (s *LowerCaseState) StartSym(sym rune) bool

type NgramFilter

type NgramFilter struct {
	// contains filtered or unexported fields
}

func NewNgramFilter

func NewNgramFilter(minLength, maxLength int) *NgramFilter

func (*NgramFilter) Filter

type NonAlphaNumericCaseState

type NonAlphaNumericCaseState struct{}

func (*NonAlphaNumericCaseState) Member

func (s *NonAlphaNumericCaseState) Member(sym rune, peek *rune) bool

func (*NonAlphaNumericCaseState) StartSym

func (s *NonAlphaNumericCaseState) StartSym(sym rune) bool

type NumberCaseState

type NumberCaseState struct{}

func (*NumberCaseState) Member

func (s *NumberCaseState) Member(sym rune, peek *rune) bool

func (*NumberCaseState) StartSym

func (s *NumberCaseState) StartSym(sym rune) bool

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser accepts a symbol and passes it to the current state (representing a class). The state can accept it (and accumulate it). Otherwise, the parser creates a new state that starts with the pushed symbol.

Parser accumulates a new resulting token every time it switches state. Use FlushTokens() to get the results after the last symbol was pushed.
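
The push-and-flush behaviour described above can be sketched with a small state machine. Everything here is a simplified, hypothetical analogue: `miniParser` produces strings instead of `*analysis.Token` values, and `letterState` and `otherState` are made-up classes, far coarser than the package's own states.

```go
package main

import (
	"fmt"
	"unicode"
)

// state mirrors the two methods of the package's State interface.
type state interface {
	StartSym(sym rune) bool
	Member(sym rune, peek *rune) bool
}

// letterState is a hypothetical class: runs of letters.
type letterState struct{}

func (letterState) StartSym(sym rune) bool        { return unicode.IsLetter(sym) }
func (letterState) Member(sym rune, _ *rune) bool { return unicode.IsLetter(sym) }

// otherState is a hypothetical class: runs of anything that is not a letter.
type otherState struct{}

func (otherState) StartSym(sym rune) bool        { return !unicode.IsLetter(sym) }
func (otherState) Member(sym rune, _ *rune) bool { return !unicode.IsLetter(sym) }

// miniParser is a string-based sketch of the Parser idea.
type miniParser struct {
	current state
	buffer  []rune
	tokens  []string
}

// newState picks the state whose start symbol matches sym. The classes
// have disjoint start symbols, so exactly one of them applies.
func (p *miniParser) newState(sym rune) state {
	if (letterState{}).StartSym(sym) {
		return letterState{}
	}
	return otherState{}
}

// emit turns the accumulated symbols into a token, if there are any.
func (p *miniParser) emit() {
	if len(p.buffer) > 0 {
		p.tokens = append(p.tokens, string(p.buffer))
		p.buffer = p.buffer[:0]
	}
}

// push offers sym to the current state; if the state rejects it, the
// pending token is emitted and a new state starts with sym.
func (p *miniParser) push(sym rune, peek *rune) {
	if p.current == nil || !p.current.Member(sym, peek) {
		p.emit()
		p.current = p.newState(sym)
	}
	p.buffer = append(p.buffer, sym)
}

// flushTokens emits the last pending token and returns all tokens.
func (p *miniParser) flushTokens() []string {
	p.emit()
	return p.tokens
}

// parse feeds every rune of s to the parser, passing the next rune as peek.
func parse(s string) []string {
	runes := []rune(s)
	p := &miniParser{}
	for i, r := range runes {
		var peek *rune
		if i+1 < len(runes) {
			peek = &runes[i+1]
		}
		p.push(r, peek)
	}
	return p.flushTokens()
}

func main() {
	fmt.Printf("%q\n", parse("abc123def")) // ["abc" "123" "def"]
}
```

Without the final flush, the last run ("def" here) would stay in the buffer, which is why the results are read only after the last symbol has been pushed.
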

func NewParser

func NewParser(length, index int) *Parser

func (*Parser) FlushTokens

func (p *Parser) FlushTokens() []*analysis.Token

func (*Parser) NewState

func (p *Parser) NewState(sym rune) State

Note: states must have different starting symbols.

func (*Parser) Push

func (p *Parser) Push(sym rune, peek *rune)

type PorterStemmer

type PorterStemmer struct{}

func NewPorterStemmer

func NewPorterStemmer() *PorterStemmer

func (*PorterStemmer) Filter

type ReverseFilter

type ReverseFilter struct{}

func NewReverseFilter

func NewReverseFilter() *ReverseFilter

func (*ReverseFilter) Filter

type ShingleFilter

type ShingleFilter struct {
	// contains filtered or unexported fields
}

func NewShingleFilter

func NewShingleFilter(min, max int, outputOriginal bool, sep, fill string) *ShingleFilter

func (*ShingleFilter) Filter

type Side

type Side bool
const BACK Side = true
const FRONT Side = false

type State

type State interface {
	// is _sym_ the start character
	StartSym(sym rune) bool

	// is _sym_ a member of a class.
	// peek, the next sym on the tape, can also be used to determine a class.
	Member(sym rune, peek *rune) bool
}

States codify the classes that the parser recognizes.
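
As an illustration of how `peek` can influence class membership, here is a hedged sketch of a custom state. `acronymState` and `leadingRun` are hypothetical names invented for this example; the `State` interface is redeclared locally so the snippet compiles on its own, and nothing here is claimed to match the package's `UpperCaseState`.

```go
package main

import (
	"fmt"
	"unicode"
)

// State is redeclared here with the same method set as the interface above.
type State interface {
	StartSym(sym rune) bool
	Member(sym rune, peek *rune) bool
}

// acronymState is a hypothetical class: a run of upper case letters that
// ends before an upper case letter followed by a lower case letter.
type acronymState struct{}

// StartSym accepts any upper case letter as the first symbol.
func (acronymState) StartSym(sym rune) bool { return unicode.IsUpper(sym) }

// Member accepts an upper case sym unless the next symbol is lower case,
// in which case sym is assumed to start the following word.
func (acronymState) Member(sym rune, peek *rune) bool {
	if !unicode.IsUpper(sym) {
		return false
	}
	return peek == nil || !unicode.IsLower(*peek)
}

// leadingRun returns the longest prefix of s that st accepts.
func leadingRun(st State, s string) string {
	runes := []rune(s)
	if len(runes) == 0 || !st.StartSym(runes[0]) {
		return ""
	}
	n := 1
	for n < len(runes) {
		var peek *rune
		if n+1 < len(runes) {
			peek = &runes[n+1]
		}
		if !st.Member(runes[n], peek) {
			break
		}
		n++
	}
	return string(runes[:n])
}

func main() {
	fmt.Println(leadingRun(acronymState{}, "HTTPServer")) // HTTP
}
```
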

type StopTokensFilter

type StopTokensFilter struct {
	// contains filtered or unexported fields
}

func NewStopTokensFilter

func NewStopTokensFilter(stopTokens analysis.TokenMap) *StopTokensFilter

func (*StopTokensFilter) Filter

type TruncateTokenFilter

type TruncateTokenFilter struct {
	// contains filtered or unexported fields
}

func NewTruncateTokenFilter

func NewTruncateTokenFilter(length int) *TruncateTokenFilter

func (*TruncateTokenFilter) Filter

type UnicodeNormalizeFilter

type UnicodeNormalizeFilter struct {
	// contains filtered or unexported fields
}

func NewUnicodeNormalizeFilter

func NewUnicodeNormalizeFilter(form norm.Form) *UnicodeNormalizeFilter

func (*UnicodeNormalizeFilter) Filter

type UniqueTermFilter

type UniqueTermFilter struct{}

UniqueTermFilter retains only the tokens which mark the first occurrence of a term. Tokens whose term appears in a preceding token are dropped.
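
The first-occurrence rule can be sketched as a single pass with a set of seen terms. This works on plain strings and `uniqueTerms` is a hypothetical helper for illustration, not the package's implementation.

```go
package main

import "fmt"

// uniqueTerms keeps the first occurrence of each term and drops every
// later repetition, preserving the original order.
func uniqueTerms(terms []string) []string {
	seen := make(map[string]struct{}, len(terms))
	out := make([]string, 0, len(terms))
	for _, t := range terms {
		if _, dup := seen[t]; dup {
			continue // term already appeared in a preceding token
		}
		seen[t] = struct{}{}
		out = append(out, t)
	}
	return out
}

func main() {
	fmt.Println(uniqueTerms([]string{"to", "be", "or", "not", "to", "be"}))
	// [to be or not]
}
```
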

func NewUniqueTermFilter

func NewUniqueTermFilter() *UniqueTermFilter

func (*UniqueTermFilter) Filter

type UpperCaseState

type UpperCaseState struct {
	// contains filtered or unexported fields
}

func (*UpperCaseState) Member

func (s *UpperCaseState) Member(sym rune, peek *rune) bool

func (*UpperCaseState) StartSym

func (s *UpperCaseState) StartSym(sym rune) bool
