analysis

package
v0.2.0
Warning

This package is not in the latest version of its module.

Published: Apr 8, 2016 | License: Apache-2.0 | Imports: 8 | Imported by: 8,934

Documentation

Constants

This section is empty.

Variables

var ErrInvalidDateTime = fmt.Errorf("unable to parse datetime with any of the layouts")

Functions

func BuildTermFromRunes

func BuildTermFromRunes(runes []rune) []byte

func DeleteRune

func DeleteRune(in []rune, pos int) []rune

func InsertRune

func InsertRune(in []rune, pos int, r rune) []rune

func RunesEndsWith

func RunesEndsWith(input []rune, suffix string) bool

func TruncateRunes

func TruncateRunes(input []byte, num int) []byte

Types

type Analyzer

type Analyzer struct {
	CharFilters  []CharFilter
	Tokenizer    Tokenizer
	TokenFilters []TokenFilter
}

func (*Analyzer) Analyze

func (a *Analyzer) Analyze(input []byte) TokenStream
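
The fields of Analyzer suggest a three-stage pipeline. The following is a minimal sketch, assuming the stages run in the order char filters, tokenizer, token filters; the source page does not show the actual implementation. All types are simplified local stand-ins for the ones documented here, and dashToSpace, spaceTokenizer and lowerCase are hypothetical helpers invented for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// Simplified local stand-ins for the documented types.
type Token struct {
	Start, End int
	Term       []byte
	Position   int
}

type TokenStream []*Token

type CharFilter interface{ Filter([]byte) []byte }
type Tokenizer interface{ Tokenize([]byte) TokenStream }
type TokenFilter interface{ Filter(TokenStream) TokenStream }

type Analyzer struct {
	CharFilters  []CharFilter
	Tokenizer    Tokenizer
	TokenFilters []TokenFilter
}

// Analyze is an assumed implementation: filter the raw bytes, tokenize
// the result, then pass the stream through each token filter in order.
func (a *Analyzer) Analyze(input []byte) TokenStream {
	for _, cf := range a.CharFilters {
		input = cf.Filter(input)
	}
	tokens := a.Tokenizer.Tokenize(input)
	for _, tf := range a.TokenFilters {
		tokens = tf.Filter(tokens)
	}
	return tokens
}

// dashToSpace is a hypothetical CharFilter replacing '-' with ' '.
type dashToSpace struct{}

func (dashToSpace) Filter(in []byte) []byte {
	return bytes.ReplaceAll(in, []byte("-"), []byte(" "))
}

// spaceTokenizer is a hypothetical Tokenizer splitting on ASCII spaces.
type spaceTokenizer struct{}

func (spaceTokenizer) Tokenize(in []byte) TokenStream {
	var out TokenStream
	start, pos := 0, 1
	for i := 0; i <= len(in); i++ {
		if i == len(in) || in[i] == ' ' {
			if i > start {
				out = append(out, &Token{Start: start, End: i, Term: in[start:i], Position: pos})
				pos++
			}
			start = i + 1
		}
	}
	return out
}

// lowerCase is a hypothetical TokenFilter lower-casing every term.
type lowerCase struct{}

func (lowerCase) Filter(ts TokenStream) TokenStream {
	for _, tok := range ts {
		tok.Term = bytes.ToLower(tok.Term)
	}
	return ts
}

func main() {
	a := &Analyzer{
		CharFilters:  []CharFilter{dashToSpace{}},
		Tokenizer:    spaceTokenizer{},
		TokenFilters: []TokenFilter{lowerCase{}},
	}
	for _, tok := range a.Analyze([]byte("Full-Text Search")) {
		fmt.Printf("%d %s [%d,%d)\n", tok.Position, tok.Term, tok.Start, tok.End)
	}
}
```

Keeping each stage behind a small interface means a stage can be swapped without touching the others, which is presumably why the struct holds slices of filters rather than a single fixed function.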

type ByteArrayConverter

type ByteArrayConverter interface {
	Convert([]byte) (interface{}, error)
}

type CharFilter

type CharFilter interface {
	Filter([]byte) []byte
}

type DateTimeParser

type DateTimeParser interface {
	ParseDateTime(string) (time.Time, error)
}
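
The message of ErrInvalidDateTime hints that a parser may try several layouts in turn. The sketch below is one way to satisfy the DateTimeParser interface under that assumption; layoutParser is a hypothetical type, not part of this package, and the error variable is redeclared locally with the same message so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrInvalidDateTime mirrors the message of the package variable of the same name.
var ErrInvalidDateTime = errors.New("unable to parse datetime with any of the layouts")

// DateTimeParser is a local copy of the documented interface.
type DateTimeParser interface {
	ParseDateTime(string) (time.Time, error)
}

// layoutParser is a hypothetical implementation that tries each layout
// in order and returns the first successful parse.
type layoutParser struct {
	layouts []string
}

func (p layoutParser) ParseDateTime(input string) (time.Time, error) {
	for _, layout := range p.layouts {
		if parsed, err := time.Parse(layout, input); err == nil {
			return parsed, nil
		}
	}
	return time.Time{}, ErrInvalidDateTime
}

func main() {
	var p DateTimeParser = layoutParser{layouts: []string{time.RFC3339, "2006-01-02"}}

	parsed, err := p.ParseDateTime("2016-04-08")
	fmt.Println(parsed.Year(), err)

	_, err = p.ParseDateTime("not a date")
	fmt.Println(err)
}
```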

type Token

type Token struct {
	// Start specifies the byte offset of the beginning of the term in the
	// field.
	Start int `json:"start"`

	// End specifies the byte offset of the end of the term in the field.
	End  int    `json:"end"`
	Term []byte `json:"term"`

	// Position specifies the 1-based index of the token in the sequence of
	// occurrences of its term in the field.
	Position int       `json:"position"`
	Type     TokenType `json:"type"`
	KeyWord  bool      `json:"keyword"`
}

Token represents one occurrence of a term at a particular location in a field.

func (*Token) String

func (t *Token) String() string

type TokenFilter

type TokenFilter interface {
	Filter(TokenStream) TokenStream
}

A TokenFilter adds, transforms or removes tokens from a token stream.
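
As an illustration of the "removes tokens" case, here is a minimal sketch of a filter that drops terms found in a TokenMap. It assumes simplified local copies of the types; stopFilter and terms are hypothetical names and are not the stop_tokens_filter implementation.

```go
package main

import "fmt"

// Simplified local stand-ins for the documented types.
type Token struct {
	Term     []byte
	Position int
}

type TokenStream []*Token

type TokenFilter interface {
	Filter(TokenStream) TokenStream
}

type TokenMap map[string]bool

// stopFilter is a hypothetical TokenFilter that removes every token
// whose term is present in the stop map.
type stopFilter struct {
	stop TokenMap
}

func (f stopFilter) Filter(in TokenStream) TokenStream {
	out := make(TokenStream, 0, len(in))
	for _, tok := range in {
		if !f.stop[string(tok.Term)] {
			out = append(out, tok)
		}
	}
	return out
}

// terms collects the terms of a stream as strings, for printing.
func terms(ts TokenStream) []string {
	out := make([]string, 0, len(ts))
	for _, tok := range ts {
		out = append(out, string(tok.Term))
	}
	return out
}

func main() {
	var f TokenFilter = stopFilter{stop: TokenMap{"the": true}}
	in := TokenStream{
		{Term: []byte("the"), Position: 1},
		{Term: []byte("quick"), Position: 2},
		{Term: []byte("fox"), Position: 3},
	}
	fmt.Println(terms(f.Filter(in)))
}
```

The surviving tokens keep their original Position values in this sketch, so gaps in the position sequence record where tokens were removed.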

type TokenFreq

type TokenFreq struct {
	Term      []byte
	Locations []*TokenLocation
	// contains filtered or unexported fields
}

TokenFreq represents all the occurrences of a term in all fields of a document.

func (*TokenFreq) Frequency

func (tf *TokenFreq) Frequency() int

type TokenFrequencies

type TokenFrequencies map[string]*TokenFreq

TokenFrequencies maps document terms to their combined frequencies from all fields.

func TokenFrequency

func TokenFrequency(tokens TokenStream, arrayPositions []uint64, includeTermVectors bool) TokenFrequencies
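
To make the relationship between TokenStream, TokenFreq and TokenFrequencies concrete, the sketch below groups tokens by term and records one location per occurrence. It is a simplified analogue, not the library function: countTerms is a hypothetical name, the arrayPositions and includeTermVectors parameters are omitted, and Frequency is assumed here to equal the number of recorded locations.

```go
package main

import "fmt"

// Simplified local stand-ins for the documented types.
type Token struct {
	Start, End, Position int
	Term                 []byte
}

type TokenStream []*Token

type TokenLocation struct {
	Start, End, Position int
}

type TokenFreq struct {
	Term      []byte
	Locations []*TokenLocation
}

// Frequency is assumed to be the number of recorded locations.
func (tf *TokenFreq) Frequency() int { return len(tf.Locations) }

type TokenFrequencies map[string]*TokenFreq

// countTerms is a hypothetical, simplified analogue of TokenFrequency:
// it groups tokens by term and appends one location per occurrence.
func countTerms(tokens TokenStream) TokenFrequencies {
	out := make(TokenFrequencies)
	for _, tok := range tokens {
		key := string(tok.Term)
		tf, ok := out[key]
		if !ok {
			tf = &TokenFreq{Term: tok.Term}
			out[key] = tf
		}
		tf.Locations = append(tf.Locations, &TokenLocation{
			Start:    tok.Start,
			End:      tok.End,
			Position: tok.Position,
		})
	}
	return out
}

func main() {
	tokens := TokenStream{
		{Start: 0, End: 2, Position: 1, Term: []byte("to")},
		{Start: 3, End: 5, Position: 2, Term: []byte("be")},
		{Start: 6, End: 8, Position: 3, Term: []byte("or")},
		{Start: 9, End: 12, Position: 4, Term: []byte("not")},
		{Start: 13, End: 15, Position: 5, Term: []byte("to")},
		{Start: 16, End: 18, Position: 6, Term: []byte("be")},
	}
	freqs := countTerms(tokens)
	fmt.Println(len(freqs), freqs["to"].Frequency(), freqs["not"].Frequency())
}
```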

func (TokenFrequencies) MergeAll

func (tfs TokenFrequencies) MergeAll(remoteField string, other TokenFrequencies)

type TokenLocation

type TokenLocation struct {
	Field          string
	ArrayPositions []uint64
	Start          int
	End            int
	Position       int
}

TokenLocation represents one occurrence of a term at a particular location in a field. Start, End and Position have the same meaning as in analysis.Token. Field and ArrayPositions identify the field value in the source document. See document.Field for details.

type TokenMap

type TokenMap map[string]bool

func NewTokenMap

func NewTokenMap() TokenMap

func (TokenMap) AddToken

func (t TokenMap) AddToken(token string)

func (TokenMap) LoadBytes

func (t TokenMap) LoadBytes(data []byte) error

LoadBytes reads in a list of tokens from memory, one per line. Comments are supported using `#` or `|`.
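
The line format described above can be sketched as follows. This is not the library's code: loadBytes is a hypothetical stand-in, and it assumes that a comment marker discards the rest of its line, that surrounding whitespace is trimmed, and that blank lines are skipped.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// TokenMap is a local copy of the documented type.
type TokenMap map[string]bool

// loadBytes is a hypothetical stand-in for TokenMap.LoadBytes. It reads
// one token per line and treats '#' and '|' as comment markers
// (assumption: everything from the marker to end of line is ignored).
func loadBytes(m TokenMap, data []byte) error {
	sc := bufio.NewScanner(bytes.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexAny(line, "#|"); i >= 0 {
			line = line[:i]
		}
		if tok := strings.TrimSpace(line); tok != "" {
			m[tok] = true
		}
	}
	return sc.Err()
}

func main() {
	data := []byte("# English stop words\nthe\nand | conjunction\n\nof\n")
	m := TokenMap{}
	if err := loadBytes(m, data); err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Println(len(m), m["the"], m["and"], m["of"])
}
```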

func (TokenMap) LoadFile

func (t TokenMap) LoadFile(filename string) error

LoadFile reads in a list of tokens from a text file, one per line. Comments are supported using `#` or `|`.

func (TokenMap) LoadLine

func (t TokenMap) LoadLine(line string)

type TokenStream

type TokenStream []*Token

type TokenType

type TokenType int
const (
	AlphaNumeric TokenType = iota
	Ideographic
	Numeric
	DateTime
	Shingle
	Single
	Double
	Boolean
)

type Tokenizer

type Tokenizer interface {
	Tokenize([]byte) TokenStream
}

A Tokenizer splits an input string into tokens, the usual behaviour being to map words to tokens.
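
A minimal sketch of the word-per-token behaviour, assuming whitespace is the only separator. whitespaceTokenizer is a hypothetical type, the Token here is a simplified local copy, and Position is taken to be the 1-based index of the token in the stream, which is an assumption about the field's meaning. Start and End are byte offsets, so multi-byte characters widen the range.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Simplified local stand-ins for the documented types.
type Token struct {
	Start, End int
	Term       []byte
	Position   int
}

type TokenStream []*Token

type Tokenizer interface {
	Tokenize([]byte) TokenStream
}

// whitespaceTokenizer is a hypothetical Tokenizer that emits one token
// per run of non-space runes.
type whitespaceTokenizer struct{}

func (whitespaceTokenizer) Tokenize(input []byte) TokenStream {
	var out TokenStream
	start := -1 // byte offset of the current word, or -1 between words

	// emit closes the current word, if any, at byte offset end.
	emit := func(end int) {
		if start >= 0 {
			out = append(out, &Token{
				Start:    start,
				End:      end,
				Term:     input[start:end],
				Position: len(out) + 1,
			})
			start = -1
		}
	}

	for i := 0; i < len(input); {
		r, size := utf8.DecodeRune(input[i:])
		if unicode.IsSpace(r) {
			emit(i)
		} else if start < 0 {
			start = i
		}
		i += size
	}
	emit(len(input))
	return out
}

func main() {
	var tk Tokenizer = whitespaceTokenizer{}
	for _, tok := range tk.Tokenize([]byte("héllo  wörld")) {
		fmt.Printf("%d %s [%d,%d)\n", tok.Position, tok.Term, tok.Start, tok.End)
	}
}
```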

Directories

Path Synopsis
analyzers
web
byte_array_converters
char_filters
datetime_parsers
language
ar
bg
ca
cjk
ckb
cs
el
en
Package en implements an analyzer with reasonable defaults for processing English text.
eu
fa
fr
ga
gl
hi
hy
id
in
it
pt
token_filters
lower_case_filter
Package lower_case_filter implements a TokenFilter which converts tokens to lower case according to unicode rules.
stop_tokens_filter
Package stop_tokens_filter implements a TokenFilter that removes tokens found in a TokenMap.
token_map
Package token_map implements a generic TokenMap, often used in conjunction with filters to remove or process specific tokens.
tokenizers
exception
Package exception implements a Tokenizer which extracts pieces matched by a regular expression from the input data, delegates the rest to another tokenizer, then inserts the extracted parts back into the token stream.
web
