tsearch

package
v0.23.2 Latest
Published: Feb 12, 2024 License: Apache-2.0 Imports: 33 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func EncodeInvertedIndexKey

func EncodeInvertedIndexKey(inKey []byte, lexeme string) []byte

EncodeInvertedIndexKey returns the inverted index key for the input lexeme.

func EncodeInvertedIndexKeys

func EncodeInvertedIndexKeys(inKey []byte, vector TSVector) ([][]byte, error)

EncodeInvertedIndexKeys returns a slice of byte slices, one per inverted index key for the terms in this tsvector.
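
As a rough illustration of "one key per term", the sketch below builds one byte slice per lexeme by copying a shared prefix and appending the lexeme. The keysFor helper and the plain concatenation are assumptions made for this example only; the package's actual key encoding is not described here and very likely differs.

```go
package main

import "fmt"

// keysFor is a hypothetical helper: it returns one key per lexeme, each made
// of the shared prefix followed by the lexeme bytes. Every key gets its own
// backing array so that later appends to one key cannot clobber another.
func keysFor(prefix []byte, lexemes []string) [][]byte {
	keys := make([][]byte, 0, len(lexemes))
	for _, l := range lexemes {
		key := make([]byte, 0, len(prefix)+len(l))
		key = append(key, prefix...)
		key = append(key, l...)
		keys = append(keys, key)
	}
	return keys
}

func main() {
	for _, k := range keysFor([]byte("idx/"), []string{"cat", "mat"}) {
		fmt.Printf("%q\n", k)
	}
}
```

Copying the prefix for each key, rather than appending to it directly, avoids aliasing when the prefix slice has spare capacity.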

func EncodeTSQuery

func EncodeTSQuery(appendTo []byte, query TSQuery) ([]byte, error)

EncodeTSQuery encodes a tsquery into a serialized representation for on-disk storage.

func EncodeTSQueryPGBinary

func EncodeTSQueryPGBinary(appendTo []byte, query TSQuery) []byte

EncodeTSQueryPGBinary encodes a tsquery into a serialized representation that matches the Postgres wire protocol representation.

The comment below describes the wire protocol representation. It is taken from this page: https://www.npgsql.org/dev/types.html

tsquery:

The tree is written in prefix notation:
First the number of tokens (a token is an operand or an operator).
For each token:
  UInt8 type (1 = val, 2 = oper), followed by
  for val: UInt8 weight + UInt8 prefix (1 = yes, 0 = no) + null-terminated string;
  for oper: UInt8 oper (1 = not, 2 = and, 3 = or, 4 = phrase).
  For the phrase oper code, an additional UInt16 field is sent: the distance value of the operator, which is 1 for <-> and otherwise the n value in '<n>'.
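
The layout above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the token type and encodeTokens function are hypothetical and not part of this package, the token count is written as a big-endian UInt32 (the comment does not state its width), integers use network byte order, and the order of operands within the prefix layout is not meant to match what Postgres emits.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Type and operator codes taken from the wire format description.
const (
	typeVal    = 1
	typeOper   = 2
	operAnd    = 2
	operPhrase = 4
)

// token is a hypothetical flattened query node, listed in prefix order.
type token struct {
	isOper   bool
	weight   uint8  // val only
	prefix   bool   // val only
	lexeme   string // val only
	oper     uint8  // oper only
	distance uint16 // phrase oper only
}

// encodeTokens appends the token count and then each token to appendTo.
// Assumption: the count is a big-endian UInt32.
func encodeTokens(appendTo []byte, toks []token) []byte {
	appendTo = binary.BigEndian.AppendUint32(appendTo, uint32(len(toks)))
	for _, t := range toks {
		if t.isOper {
			appendTo = append(appendTo, typeOper, t.oper)
			if t.oper == operPhrase {
				appendTo = binary.BigEndian.AppendUint16(appendTo, t.distance)
			}
			continue
		}
		var p uint8
		if t.prefix {
			p = 1
		}
		appendTo = append(appendTo, typeVal, t.weight, p)
		appendTo = append(appendTo, t.lexeme...)
		appendTo = append(appendTo, 0) // null terminator
	}
	return appendTo
}

func main() {
	// 'a' & 'b' in prefix notation: and, a, b.
	toks := []token{{isOper: true, oper: operAnd}, {lexeme: "a"}, {lexeme: "b"}}
	fmt.Printf("% x\n", encodeTokens(nil, toks))
	// prints: 00 00 00 03 02 02 01 00 00 61 00 01 00 00 62 00
}
```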

func EncodeTSVector

func EncodeTSVector(appendTo []byte, vector TSVector) ([]byte, error)

EncodeTSVector encodes a tsvector into a serialized representation for on-disk storage.

func EncodeTSVectorPGBinary

func EncodeTSVectorPGBinary(appendTo []byte, vector TSVector) ([]byte, error)

EncodeTSVectorPGBinary encodes a tsvector into a serialized representation that's identical to Postgres's wire protocol representation.

The comment below describes the wire protocol representation. It is taken from this page: https://www.npgsql.org/dev/types.html

tsvector:

UInt32 number of lexemes
for each lexeme:
    lexeme text in client encoding, null-terminated
    UInt16 number of positions
    for each position:
        UInt16 WordEntryPos, where the 2 most significant bits are the weight and the 14 least significant bits are pos (can't be 0). Weights 3, 2, 1, 0 represent A, B, C, D.
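
A minimal sketch of this layout follows, mainly to show how a weight and position pack into one UInt16. The lexeme and position types and the encodeVector function are hypothetical names for this example, not this package's API, and integers are assumed to be written in network (big-endian) byte order.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// position is a hypothetical (weight, pos) pair. Weights 3, 2, 1, 0 mean A, B, C, D.
type position struct {
	weight uint16
	pos    uint16
}

// lexeme is a hypothetical term with its positions.
type lexeme struct {
	text      string
	positions []position
}

// packPos puts the weight in the top 2 bits and pos in the low 14 bits.
func packPos(p position) (uint16, error) {
	if p.weight > 3 {
		return 0, fmt.Errorf("weight %d out of range", p.weight)
	}
	if p.pos == 0 || p.pos > 0x3FFF {
		return 0, fmt.Errorf("position %d out of range", p.pos)
	}
	return p.weight<<14 | p.pos, nil
}

// encodeVector appends the lexeme count, then each lexeme and its positions.
func encodeVector(appendTo []byte, lexemes []lexeme) ([]byte, error) {
	appendTo = binary.BigEndian.AppendUint32(appendTo, uint32(len(lexemes)))
	for _, l := range lexemes {
		appendTo = append(appendTo, l.text...)
		appendTo = append(appendTo, 0) // null terminator
		appendTo = binary.BigEndian.AppendUint16(appendTo, uint16(len(l.positions)))
		for _, p := range l.positions {
			packed, err := packPos(p)
			if err != nil {
				return nil, err
			}
			appendTo = binary.BigEndian.AppendUint16(appendTo, packed)
		}
	}
	return appendTo, nil
}

func main() {
	// 'cat':1A has weight A (3) at position 1, which packs to 0xC001.
	b, err := encodeVector(nil, []lexeme{{text: "cat", positions: []position{{weight: 3, pos: 1}}}})
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", b)
	// prints: 00 00 00 01 63 61 74 00 00 01 c0 01
}
```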

func EvalTSQuery

func EvalTSQuery(q TSQuery, v TSVector) (bool, error)

EvalTSQuery runs the provided TSQuery against the provided TSVector, returning whether or not the query matches the vector.
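
To make the idea of running a query tree against a document concrete, here is a simplified sketch that evaluates the boolean operators against a set of lexemes. The node type and eval function are hypothetical and are not this package's internals; the sketch ignores weights, prefix matching, and the phrase operator, which needs position information.

```go
package main

import "fmt"

// node is a hypothetical query tree node. op is 0 for a lexeme leaf,
// otherwise one of '!', '&', '|'. The '!' operator uses only left.
type node struct {
	op          byte
	lexeme      string
	left, right *node
}

// eval reports whether the query rooted at n matches the set of lexemes in doc.
func eval(n *node, doc map[string]bool) bool {
	switch n.op {
	case '!':
		return !eval(n.left, doc)
	case '&':
		return eval(n.left, doc) && eval(n.right, doc)
	case '|':
		return eval(n.left, doc) || eval(n.right, doc)
	default:
		return doc[n.lexeme]
	}
}

func main() {
	// Query: cat & !dog
	q := &node{op: '&', left: &node{lexeme: "cat"}, right: &node{op: '!', left: &node{lexeme: "dog"}}}
	fmt.Println(eval(q, map[string]bool{"cat": true, "mat": true})) // prints: true
	fmt.Println(eval(q, map[string]bool{"cat": true, "dog": true})) // prints: false
}
```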

func GetConfigKey

func GetConfigKey(config string) string

GetConfigKey returns a key that can be used to look up stemmers and stopwords for an input config value. Postgres supports customizable dictionaries and user-defined text search configurations, so config names there can carry schema prefixes. This package does not (yet?) support user-defined configurations, so GetConfigKey simply trims any `pg_catalog.` prefix from the input.
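
The prefix trimming described above could look roughly like the following. The configKey function is a hypothetical stand-in written for this example; the real GetConfigKey may do more than this.

```go
package main

import (
	"fmt"
	"strings"
)

// configKey is a hypothetical stand-in: it drops a leading "pg_catalog."
// schema prefix if present and otherwise returns the input unchanged.
func configKey(config string) string {
	return strings.TrimPrefix(config, "pg_catalog.")
}

func main() {
	fmt.Println(configKey("pg_catalog.english")) // prints: english
	fmt.Println(configKey("english"))            // prints: english
}
```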

func Rank

func Rank(weights []float32, v TSVector, q TSQuery, method int) (float32, error)

Rank implements the ts_rank functionality, which ranks a tsvector against a tsquery. The weights parameter is a list of weights corresponding to the tsvector lexeme weights D, C, B, and A, in that order. The method parameter is a bitmask that selects ranking behaviors; its bits are defined by the unexported rankBehavior type in the package source. The default method is 0, which performs no normalization based on document length.

N.B.: this function is directly translated from the calc_rank function in tsrank.c, which contains almost no comments. As of this time, I am unable to sufficiently explain how this ranker works, but I'm confident that the implementation is at least compatible with Postgres. https://github.com/postgres/postgres/blob/765f5df726918bcdcfd16bcc5418e48663d1dd59/src/backend/utils/adt/tsrank.c#L357
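
The D, C, B, A ordering of the weights parameter is easy to get backwards, so a small sketch may help. The defaultWeights values are the defaults documented for ts_rank in Postgres; the weightIndex helper is hypothetical and exists only to show which slice index each label maps to.

```go
package main

import "fmt"

// defaultWeights holds the ts_rank defaults documented by Postgres,
// in D, C, B, A order.
var defaultWeights = []float32{0.1, 0.2, 0.4, 1.0}

// weightIndex is a hypothetical helper mapping a weight label to its
// index in a D, C, B, A ordered weights slice.
func weightIndex(label byte) (int, error) {
	switch label {
	case 'D':
		return 0, nil
	case 'C':
		return 1, nil
	case 'B':
		return 2, nil
	case 'A':
		return 3, nil
	}
	return 0, fmt.Errorf("unknown weight label %q", label)
}

func main() {
	i, err := weightIndex('A')
	if err != nil {
		panic(err)
	}
	fmt.Println("index for A:", i, "weight:", defaultWeights[i])
}
```

The indexes line up with the weight codes in the wire format, where 3, 2, 1, 0 represent A, B, C, D.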

func TSLexize

func TSLexize(config string, token string) (lexeme string, stopWord bool, err error)

TSLexize implements the "dictionary" construct that's exposed via ts_lexize. It is invoked once per input token to produce an output lexeme during routines like to_tsvector and to_tsquery. The second return value, stopWord, is true when the token is a stopword.

func TSParse

func TSParse(input string) []string

TSParse splits an input text into a list of tokens. For now, the parser is very simple: it lowercases the input and splits it into tokens, treating every non-letter, non-number character as whitespace.

The Postgres text search parser is much, much more sophisticated. The documentation (https://www.postgresql.org/docs/current/textsearch-parsers.html) gives more information, but roughly, each token is categorized into one of about 20 different buckets, such as asciiword, url, email, host, float, int, version, tag, etc. It uses very specific rules to produce these outputs. Another interesting transformation is returning multiple tokens for a hyphenated word, including a token that represents the entire hyphenated word, as well as one for each of the hyphenated components.

It's not clear whether we need to exactly mimic this functionality. Likely, we will eventually want to do this.
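
The simple lowercase-and-split behavior described for TSParse can be approximated as below. The tokenize function is a sketch written for this example, and the use of unicode.IsLetter and unicode.IsNumber as the letter and number tests is an assumption about how those classes are defined.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the input and splits it wherever a character is
// neither a letter nor a number, dropping empty tokens.
func tokenize(input string) []string {
	return strings.FieldsFunc(strings.ToLower(input), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

func main() {
	fmt.Printf("%q\n", tokenize("Hello, World! 42")) // prints: ["hello" "world" "42"]
	fmt.Printf("%q\n", tokenize("up-to-date"))       // prints: ["up" "to" "date"]
}
```

Unlike the Postgres parser, this approach yields only the components of a hyphenated word and never a token for the whole word.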

func ValidConfig

func ValidConfig(input string) error

ValidConfig returns an error if the input string is not a supported and valid text search config.

Types

type TSQuery

type TSQuery struct {
	// contains filtered or unexported fields
}

TSQuery represents a tsNode AST root. A TSQuery is a tree of text search operators that can be evaluated against a TSVector to determine whether the query matches.

func DecodeTSQuery

func DecodeTSQuery(b []byte) (ret TSQuery, err error)

DecodeTSQuery deserializes a serialized TSQuery in on-disk format.

func DecodeTSQueryPGBinary

func DecodeTSQueryPGBinary(b []byte) (ret TSQuery, err error)

DecodeTSQueryPGBinary deserializes a serialized TSQuery in pgwire format.

func ParseTSQuery

func ParseTSQuery(input string) (TSQuery, error)

ParseTSQuery produces a TSQuery from an input string.

func PhraseToTSQuery

func PhraseToTSQuery(config string, input string) (TSQuery, error)

PhraseToTSQuery implements the phraseto_tsquery builtin, which lexes an input, performs stopwording and normalization on the tokens, and returns a parsed query, interposing the <-> operator between each token.

func PlainToTSQuery

func PlainToTSQuery(config string, input string) (TSQuery, error)

PlainToTSQuery implements the plainto_tsquery builtin, which lexes an input, performs stopwording and normalization on the tokens, and returns a parsed query, interposing the & operator between each token.
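
The "interposing" step shared by PhraseToTSQuery and PlainToTSQuery amounts to joining the surviving tokens with an operator. The interpose function below is a hypothetical sketch of only that step; it assumes lexing, stopword removal, and normalization have already happened, and it returns query text rather than a parsed TSQuery.

```go
package main

import (
	"fmt"
	"strings"
)

// interpose joins already-normalized tokens with the given operator,
// producing tsquery text such as "fat & cat".
func interpose(tokens []string, op string) string {
	return strings.Join(tokens, " "+op+" ")
}

func main() {
	tokens := []string{"fat", "cat"}
	fmt.Println(interpose(tokens, "&"))   // prints: fat & cat
	fmt.Println(interpose(tokens, "<->")) // prints: fat <-> cat
}
```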

func RandomTSQuery

func RandomTSQuery(rng *rand.Rand) TSQuery

RandomTSQuery returns a random TSQuery for testing.

func ToTSQuery

func ToTSQuery(config string, input string) (TSQuery, error)

ToTSQuery implements the to_tsquery builtin, which lexes an input, performs stopwording and normalization on the tokens, and returns a parsed query.

func (TSQuery) GetInvertedExpr

func (q TSQuery) GetInvertedExpr() (expr inverted.Expression, err error)

GetInvertedExpr returns the inverted expression that can be used to search an index.

func (TSQuery) String

func (q TSQuery) String() string

type TSVector

type TSVector []tsTerm

TSVector is a sorted list of terms, each of which is a lexeme that may have one or more associated positions within an original document.

func DecodeTSVector

func DecodeTSVector(b []byte) (ret TSVector, err error)

DecodeTSVector decodes a tsvector in disk-storage representation from the input byte slice.

func DecodeTSVectorPGBinary

func DecodeTSVectorPGBinary(b []byte) (ret TSVector, err error)

DecodeTSVectorPGBinary decodes a tsvector from the input byte slice, which is formatted in the Postgres binary protocol.

func DocumentToTSVector

func DocumentToTSVector(config string, input string) (TSVector, error)

DocumentToTSVector parses an input document into lexemes, removes stop words, stems and normalizes the lexemes, and returns a TSVector annotated with lexeme positions according to a text search configuration passed by name.

func ParseTSVector

func ParseTSVector(input string) (TSVector, error)

ParseTSVector produces a TSVector from an input string. The terms parsed from the input are sorted by lexeme, but are not automatically stemmed or stop-worded.

func RandomTSVector

func RandomTSVector(rng *rand.Rand) TSVector

RandomTSVector returns a random TSVector for testing.

func (TSVector) String

func (t TSVector) String() string

func (TSVector) StringSize

func (t TSVector) StringSize() int

StringSize returns the length of the string that a call to String() would return, without actually constructing that string.
