query

package module
v0.0.0-...-8e45692 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 22, 2022 License: MIT Imports: 6 Imported by: 7

README

go-query: simple []int32 query library

Blazingly fast query engine

Used to build and execute queries such as:

n := 10 // total docs in index

And(
        Term(n, "name:hello", []int32{4, 5}),
        Term(n, "name:world", []int32{4, 100}),
        Or(
                Term(n, "country:nl", []int32{20,30}),
                Term(n, "country:uk", []int32{4,30}),
        ),
)
  • scoring: only idf score (for now)
  • supported queries: or, and, and_not, dis_max, constant, term
  • normalizers: space_between_digits, lowercase, trim, cleanup, etc
  • tokenizers: left edge, custom, charngram, unique, soundex etc
  • go-query-index: a useful example of how to build a more complex search engine with the library

query

Usually, when you have an inverted index, you end up with something like:

data := []*Document{}
index := map[string][]int32{}
for docId, d := range data {
     for _, token := range tokenize(normalize(d.Name)) {
        // docId is an int when ranging over a slice, so convert it to int32
        index[token] = append(index[token], int32(docId))
     }
}

Then, from documents like {hello world}, {hello}, {new york}, {new world}, you get an inverted index of the form:

{
    "hello": [0,1],
    "world": [0,3],
    "new": [2,3],
    "york": [2]
}
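
As a minimal runnable sketch of the loop above: the Document type, the tokenize helper (lowercase plus whitespace split) and buildIndex are illustrative assumptions, not part of this package. It reproduces the index shown for the four sample documents.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical record type; only Name is indexed here.
type Document struct {
	Name string
}

// tokenize is a stand-in for a real normalizer and tokenizer:
// it lowercases the input and splits it on whitespace.
func tokenize(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// buildIndex maps every token to the list of document ids containing it.
// Ids are appended in increasing order, so each postings list stays sorted.
func buildIndex(docs []*Document) map[string][]int32 {
	index := map[string][]int32{}
	for docID, d := range docs {
		for _, token := range tokenize(d.Name) {
			postings := index[token]
			// Skip duplicates when a token repeats inside one document.
			if n := len(postings); n > 0 && postings[n-1] == int32(docID) {
				continue
			}
			index[token] = append(postings, int32(docID))
		}
	}
	return index
}

func main() {
	docs := []*Document{
		{Name: "hello world"},
		{Name: "hello"},
		{Name: "new york"},
		{Name: "new world"},
	}
	index := buildIndex(docs)
	for _, token := range []string{"hello", "world", "new", "york"} {
		fmt.Println(token, index[token])
	}
}
```

Because document ids are visited in increasing order, the postings lists come out sorted without an extra sort step.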

If you want to read more about inverted indexes, check out the IR-book.

This package helps you query indexes of this form in a fairly efficient way. Keep in mind that it expects the []int32 slices to be sorted. Example:

package main

import (
    "fmt"

    "github.com/rekki/go-query"
)

func main() {
    totalDocumentsInIndex := 10
    q := query.And(
        query.Or(
            query.Term(totalDocumentsInIndex, "a", []int32{1, 2, 8, 9}),
            query.Term(totalDocumentsInIndex, "b", []int32{3, 8, 9}), // postings must be sorted
        ),
        query.AndNot(
            query.Or(
                query.Term(totalDocumentsInIndex, "c", []int32{4, 5}),
                query.Term(totalDocumentsInIndex, "x", []int32{4, 100}),
            ),
            query.Or(
                query.Term(totalDocumentsInIndex, "d", []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}),
                query.Term(totalDocumentsInIndex, "e", []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}),
            ),
        ),
    )

    // q.String() is {{a OR b} AND {{d OR e} -({c OR x})}}

    for q.Next() != query.NO_MORE {
        did := q.GetDocId()
        score := q.Score()
        fmt.Printf("matching %d, score: %f\n", did, score)
    }
}

will print:

matching 1, score: 2.639057
matching 2, score: 2.639057
matching 3, score: 2.852632
matching 8, score: 4.105394
matching 9, score: 4.105394

Documentation

Overview

Package query provides a simple query DSL on top of sorted arrays of integers.

Constants

const (
	NO_MORE   = int32(math.MaxInt32)
	NOT_READY = int32(-1)
)

Variables

var ByteOrder = binary.LittleEndian
var TERM_CHUNK_SIZE = 4096

TERM_CHUNK_SIZE is the size of the chunks the postings list is split into. The chunks are binary searched, and inside the selected chunk the next advance() target is found by linear search.
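
The chunked search can be sketched as follows. This is an illustration of the idea under stated assumptions, not the package's actual code: chunkSize is a tiny stand-in for TERM_CHUNK_SIZE so the example is easy to trace, and the advance helper and its signature are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// chunkSize plays the role of TERM_CHUNK_SIZE; it is tiny here for readability.
const chunkSize = 4

// advance returns the index of the first posting >= target, looking only at
// positions >= from. It returns len(postings) when no such posting exists.
// postings must be sorted in increasing order.
func advance(postings []int32, from int, target int32) int {
	if from >= len(postings) {
		return len(postings)
	}
	firstChunk := from / chunkSize
	numChunks := (len(postings) + chunkSize - 1) / chunkSize

	// Binary search over chunks: find the first chunk whose last element
	// is >= target, so the answer must lie inside that chunk.
	c := firstChunk + sort.Search(numChunks-firstChunk, func(i int) bool {
		end := (firstChunk+i+1)*chunkSize - 1
		if end >= len(postings) {
			end = len(postings) - 1
		}
		return postings[end] >= target
	})
	if c == numChunks {
		return len(postings)
	}

	// Linear scan inside the selected chunk.
	start := c * chunkSize
	if start < from {
		start = from
	}
	for i := start; i < len(postings); i++ {
		if postings[i] >= target {
			return i
		}
	}
	return len(postings)
}

func main() {
	postings := []int32{1, 3, 5, 7, 9, 11, 13, 15, 17, 19}
	fmt.Println(advance(postings, 0, 10)) // index of 11
	fmt.Println(advance(postings, 5, 19)) // index of 19
	fmt.Println(advance(postings, 0, 20)) // no match: len(postings)
}
```

A larger chunk size means fewer binary search steps but a longer linear scan, which tends to be cache friendly for contiguous int32 data.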

Functions

func AppendFileNameTerm

func AppendFileNameTerm(fn string, docs []int32) error

func AppendFilePayload

func AppendFilePayload(f *os.File, size int64, b []byte) error

func AppendFileTerm

func AppendFileTerm(f *os.File, docs []int32) error

Types

type AndQuery

type AndQuery struct {
	// contains filtered or unexported fields
}

func And

func And(queries ...Query) *AndQuery

Creates AND query

func AndNot

func AndNot(not Query, queries ...Query) *AndQuery

Creates AND NOT query

func (*AndQuery) AddSubQuery

func (q *AndQuery) AddSubQuery(sub Query) Query

func (*AndQuery) Advance

func (q *AndQuery) Advance(target int32) int32

func (*AndQuery) Cost

func (q *AndQuery) Cost() int

func (*AndQuery) GetDocId

func (q *AndQuery) GetDocId() int32

func (*AndQuery) Next

func (q *AndQuery) Next() int32

func (*AndQuery) PayloadDecode

func (q *AndQuery) PayloadDecode(p Payload)

func (*AndQuery) Score

func (q *AndQuery) Score() float32

func (*AndQuery) SetBoost

func (q *AndQuery) SetBoost(b float32) Query

func (*AndQuery) SetNot

func (q *AndQuery) SetNot(not Query) *AndQuery

func (*AndQuery) String

func (q *AndQuery) String() string

type ConstantQuery

type ConstantQuery struct {
	// contains filtered or unexported fields
}

func Constant

func Constant(boost float32, q Query) *ConstantQuery

func (*ConstantQuery) AddSubQuery

func (q *ConstantQuery) AddSubQuery(Query) Query

func (*ConstantQuery) Advance

func (q *ConstantQuery) Advance(target int32) int32

func (*ConstantQuery) Cost

func (q *ConstantQuery) Cost() int

func (*ConstantQuery) GetDocId

func (q *ConstantQuery) GetDocId() int32

func (*ConstantQuery) Next

func (q *ConstantQuery) Next() int32

func (*ConstantQuery) PayloadDecode

func (q *ConstantQuery) PayloadDecode(p Payload)

func (*ConstantQuery) Score

func (q *ConstantQuery) Score() float32

func (*ConstantQuery) SetBoost

func (q *ConstantQuery) SetBoost(b float32) Query

func (*ConstantQuery) String

func (q *ConstantQuery) String() string

type DisMaxQuery

type DisMaxQuery struct {
	// contains filtered or unexported fields
}

func DisMax

func DisMax(tieBreaker float32, queries ...Query) *DisMaxQuery

Creates a DisMax query. For example, if the query is:

DisMax(0.5, "name:amsterdam","name:university","name:free")

and the index has the following idf values: amsterdam: 1.3, free: 0.2, university: 2.1, then the score is computed as:

max(score(amsterdam),score(university), score(free)) = 2.1 (university)
+ score(free) * tiebreaker = 0.1
+ score(amsterdam) * tiebreaker = 0.65
= 2.85
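
The arithmetic above can be checked with a small sketch. The disMaxScore helper is hypothetical and only mirrors the formula as described (the maximum sub-score plus the tie breaker times every other sub-score); it is not the package's implementation.

```go
package main

import "fmt"

// disMaxScore returns the highest sub-score plus tieBreaker times each of
// the remaining sub-scores.
func disMaxScore(tieBreaker float32, scores ...float32) float32 {
	if len(scores) == 0 {
		return 0
	}
	// Find the position of the best-scoring sub-query.
	maxIdx := 0
	for i, s := range scores {
		if s > scores[maxIdx] {
			maxIdx = i
		}
	}
	// Every other sub-query contributes only a fraction of its score.
	total := scores[maxIdx]
	for i, s := range scores {
		if i != maxIdx {
			total += s * tieBreaker
		}
	}
	return total
}

func main() {
	// amsterdam: 1.3, university: 2.1, free: 0.2, tie breaker 0.5
	fmt.Printf("%.2f\n", disMaxScore(0.5, 1.3, 2.1, 0.2))
}
```

With a tie breaker of 0 only the best sub-query counts; with a tie breaker of 1 the result is the plain sum of the sub-scores.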

func (*DisMaxQuery) AddSubQuery

func (q *DisMaxQuery) AddSubQuery(sub Query) Query

func (*DisMaxQuery) Advance

func (q *DisMaxQuery) Advance(target int32) int32

func (*DisMaxQuery) Cost

func (q *DisMaxQuery) Cost() int

func (*DisMaxQuery) GetDocId

func (q *DisMaxQuery) GetDocId() int32

func (*DisMaxQuery) Next

func (q *DisMaxQuery) Next() int32

func (*DisMaxQuery) PayloadDecode

func (q *DisMaxQuery) PayloadDecode(p Payload)

func (*DisMaxQuery) Score

func (q *DisMaxQuery) Score() float32

func (*DisMaxQuery) SetBoost

func (q *DisMaxQuery) SetBoost(b float32) Query

func (*DisMaxQuery) String

func (q *DisMaxQuery) String() string

type FileTermData

type FileTermData struct {
	// contains filtered or unexported fields
}

func FileTerm

func FileTerm(totalDocumentsInIndex int, fn string) *FileTermData

Creates a new lazy term from a stored array of integers encoded with ByteOrder (little endian by default).

The file is closed automatically when the query is exhausted (reaches the end).

WARNING: you must exhaust the query, otherwise you will leak file descriptors.
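
To make the encoding concrete, here is a sketch of little-endian []int32 serialization using only encoding/binary. It assumes the stored form is a flat sequence of 4-byte integers, which the documentation does not spell out; encodePostings and decodePostings are hypothetical helpers, not functions of this package.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodePostings writes each document id as 4 little-endian bytes.
// Assumption: this flat layout is only a guess at the stored format.
func encodePostings(docs []int32) []byte {
	b := make([]byte, 4*len(docs))
	for i, d := range docs {
		binary.LittleEndian.PutUint32(b[i*4:], uint32(d))
	}
	return b
}

// decodePostings reverses encodePostings; trailing bytes that do not form
// a full 4-byte integer are ignored.
func decodePostings(b []byte) []int32 {
	docs := make([]int32, len(b)/4)
	for i := range docs {
		docs[i] = int32(binary.LittleEndian.Uint32(b[i*4:]))
	}
	return docs
}

func main() {
	encoded := encodePostings([]int32{1, 2, 256})
	fmt.Println(encoded)
	fmt.Println(decodePostings(encoded))
}
```

A fixed-width layout like this lets a reader seek to the n-th posting at byte offset 4*n, which is what makes lazy, chunk-wise reading from a file practical.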

func (*FileTermData) AddSubQuery

func (t *FileTermData) AddSubQuery(Query) Query

func (*FileTermData) Advance

func (t *FileTermData) Advance(target int32) int32

func (*FileTermData) Close

func (t *FileTermData) Close()

func (*FileTermData) Cost

func (t *FileTermData) Cost() int

func (*FileTermData) GetDocId

func (t *FileTermData) GetDocId() int32

func (*FileTermData) Next

func (t *FileTermData) Next() int32

func (*FileTermData) PayloadDecode

func (t *FileTermData) PayloadDecode(p Payload)

func (*FileTermData) Score

func (t *FileTermData) Score() float32

func (*FileTermData) SetBoost

func (t *FileTermData) SetBoost(b float32) Query

func (*FileTermData) String

func (t *FileTermData) String() string

type OrQuery

type OrQuery struct {
	// contains filtered or unexported fields
}

func Or

func Or(queries ...Query) *OrQuery

Creates OR query

func (*OrQuery) AddSubQuery

func (q *OrQuery) AddSubQuery(sub Query) Query

func (*OrQuery) Advance

func (q *OrQuery) Advance(target int32) int32

func (*OrQuery) Cost

func (q *OrQuery) Cost() int

func (*OrQuery) GetDocId

func (q *OrQuery) GetDocId() int32

func (*OrQuery) Next

func (q *OrQuery) Next() int32

func (*OrQuery) PayloadDecode

func (q *OrQuery) PayloadDecode(p Payload)

func (*OrQuery) Score

func (q *OrQuery) Score() float32

func (*OrQuery) SetBoost

func (q *OrQuery) SetBoost(b float32) Query

func (*OrQuery) String

func (q *OrQuery) String() string

type Payload

type Payload interface {
	Push()
	Pop()
	Consume(int32, int, []byte)
	Score() float32
}

type PayloadTermQuery

type PayloadTermQuery struct {
	// contains filtered or unexported fields
}

func PayloadTerm

func PayloadTerm(totalDocumentsInIndex int, t string, postings []int32, payload []byte) *PayloadTermQuery

func (*PayloadTermQuery) AddSubQuery

func (t *PayloadTermQuery) AddSubQuery(Query) Query

func (*PayloadTermQuery) Advance

func (t *PayloadTermQuery) Advance(target int32) int32

func (*PayloadTermQuery) Cost

func (t *PayloadTermQuery) Cost() int

func (*PayloadTermQuery) GetDocId

func (t *PayloadTermQuery) GetDocId() int32

func (*PayloadTermQuery) Next

func (t *PayloadTermQuery) Next() int32

func (*PayloadTermQuery) PayloadDecode

func (t *PayloadTermQuery) PayloadDecode(p Payload)

func (*PayloadTermQuery) Score

func (t *PayloadTermQuery) Score() float32

func (*PayloadTermQuery) SetBoost

func (t *PayloadTermQuery) SetBoost(b float32) Query

func (*PayloadTermQuery) String

func (t *PayloadTermQuery) String() string

type Query

type Query interface {
	Advance(int32) int32
	Next() int32
	GetDocId() int32
	Score() float32
	SetBoost(float32) Query
	Cost() int
	String() string
	AddSubQuery(Query) Query

	PayloadDecode(p Payload)
}

Reuse/Concurrency: none of the queries are safe to reuse. WARNING: a query cannot be reused. WARNING: a query is not thread safe.

Example Iteration:

q := query.Term(10, "a", []int32{1, 2, 3}) // 10 is totalDocumentsInIndex
for q.Next() != query.NO_MORE {
	did := q.GetDocId()
	score := q.Score()
	fmt.Printf("matching %d, score: %f\n", did, score)
}
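
To illustrate how Next and Advance cooperate, here is a self-contained sketch of a two-way intersection over sorted postings. The cursor type, newCursor and intersect are hypothetical and are not the package's AndQuery; only the sentinel values mirror NO_MORE and NOT_READY.

```go
package main

import (
	"fmt"
	"math"
)

const (
	noMore   = int32(math.MaxInt32) // mirrors NO_MORE
	notReady = int32(-1)            // mirrors NOT_READY
)

// cursor is a hypothetical, minimal iterator over a sorted postings list.
type cursor struct {
	postings []int32
	pos      int
	docID    int32
}

func newCursor(postings []int32) *cursor {
	return &cursor{postings: postings, pos: -1, docID: notReady}
}

// current refreshes docID from pos and returns it.
func (c *cursor) current() int32 {
	if c.pos >= len(c.postings) {
		c.docID = noMore
	} else {
		c.docID = c.postings[c.pos]
	}
	return c.docID
}

// Next moves to the following document id, or noMore at the end.
func (c *cursor) Next() int32 {
	c.pos++
	return c.current()
}

// Advance moves to the first document id >= target, or noMore.
func (c *cursor) Advance(target int32) int32 {
	if c.pos < 0 {
		c.pos = 0
	}
	for c.pos < len(c.postings) && c.postings[c.pos] < target {
		c.pos++
	}
	return c.current()
}

// intersect collects the ids present in both cursors by repeatedly
// advancing the cursor that is behind to the position of the other one.
func intersect(a, b *cursor) []int32 {
	var out []int32
	x, y := a.Next(), b.Next()
	for x != noMore && y != noMore {
		switch {
		case x < y:
			x = a.Advance(y)
		case y < x:
			y = b.Advance(x)
		default:
			out = append(out, x)
			x, y = a.Next(), b.Next()
		}
	}
	return out
}

func main() {
	a := newCursor([]int32{1, 2, 8, 9})
	b := newCursor([]int32{3, 8, 9})
	fmt.Println(intersect(a, b))
}
```

The cursors are consumed by the iteration, which is one reason an iterator of this style cannot be reused after it has been exhausted.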

type TermQuery

type TermQuery struct {
	// contains filtered or unexported fields
}

func Term

func Term(totalDocumentsInIndex int, t string, postings []int32) *TermQuery

Term is the basic query over a []int32 postings list that the whole interface is built on. The score is IDF (not tf*idf, just 1*idf, since the term frequency is not stored for now). If you don't know totalDocumentsInIndex, which can sometimes be the case, pass any constant > 0.

WARNING: the query cannot be reused. WARNING: the query is not thread safe.
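
The documentation does not state the exact IDF formula. The scores printed in the README example are consistent with ln(1 + totalDocumentsInIndex / documentFrequency), so the sketch below uses that as an assumption; the idf helper is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// idf returns ln(1 + total/df).
// Assumption: this formula is inferred from the README output, not taken
// from the package source.
func idf(totalDocumentsInIndex, documentFrequency int) float32 {
	if documentFrequency <= 0 {
		return 0
	}
	ratio := float64(totalDocumentsInIndex) / float64(documentFrequency)
	return float32(math.Log1p(ratio))
}

func main() {
	// In the README example, document 1 matches the terms "a" (4 postings),
	// "d" (10 postings) and "e" (10 postings) out of 10 documents.
	score := idf(10, 4) + idf(10, 10) + idf(10, 10)
	fmt.Printf("%.3f\n", score) // close to the 2.639057 shown in the README
}
```

Rare terms get a higher score under this formula, which is why the documents matching "b" (3 postings) score higher than those matching only "a" (4 postings).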

func (*TermQuery) AddSubQuery

func (t *TermQuery) AddSubQuery(Query) Query

func (*TermQuery) Advance

func (t *TermQuery) Advance(target int32) int32

func (*TermQuery) Cost

func (t *TermQuery) Cost() int

func (*TermQuery) GetDocId

func (t *TermQuery) GetDocId() int32

func (*TermQuery) Next

func (t *TermQuery) Next() int32

func (*TermQuery) PayloadDecode

func (t *TermQuery) PayloadDecode(p Payload)

func (*TermQuery) Score

func (t *TermQuery) Score() float32

func (*TermQuery) SetBoost

func (t *TermQuery) SetBoost(b float32) Query

func (*TermQuery) String

func (t *TermQuery) String() string

type TermTFQuery

type TermTFQuery struct {
	// contains filtered or unexported fields
}

func TermTF

func TermTF(totalDocumentsInIndex int, freqBits int32, t string, postings []int32) *TermTFQuery

TermTF is a basic query over a []int32 postings list in which every entry carries a term frequency. The postings list is split into chunks that are binary searched, and inside each chunk the next advance() target is found by linear search. The score is TF*IDF.

You have to specify how many bits of each docID are actually the term frequency. For example, to store the frequency in 4 bits, document id 999 with term frequency 2 for this specific term could be stored as (999 << 4) | 2. Usually you just store the floored sqrt(frequency), so 3-4 bits are enough. The stored value is zero based, so 0 means frequency 1.

If you don't know totalDocumentsInIndex, which can sometimes be the case, pass any constant > 0.

WARNING: the query cannot be reused. WARNING: the query is not thread safe.
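
The bit packing described above can be sketched like this. The pack and unpack helpers are hypothetical names for illustration; only the (docID << freqBits) | tf layout comes from the description.

```go
package main

import "fmt"

// pack stores tf in the low freqBits bits and docID in the remaining high
// bits. tf is masked so it cannot spill into the document id.
func pack(docID, tf, freqBits int32) int32 {
	mask := int32(1)<<freqBits - 1
	return docID<<freqBits | tf&mask
}

// unpack splits a packed posting back into document id and stored tf.
func unpack(v, freqBits int32) (docID, tf int32) {
	mask := int32(1)<<freqBits - 1
	return v >> freqBits, v & mask
}

func main() {
	v := pack(999, 2, 4)
	docID, tf := unpack(v, 4)
	fmt.Println(v, docID, tf)
}
```

Since the document id occupies the high bits, packed postings that are sorted by document id stay sorted as plain integers. The trade-off is that freqBits are taken away from the range available for document ids in an int32.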

func (*TermTFQuery) AddSubQuery

func (t *TermTFQuery) AddSubQuery(Query) Query

func (*TermTFQuery) Advance

func (t *TermTFQuery) Advance(target int32) int32

func (*TermTFQuery) Cost

func (t *TermTFQuery) Cost() int

func (*TermTFQuery) GetDocId

func (t *TermTFQuery) GetDocId() int32

func (*TermTFQuery) Next

func (t *TermTFQuery) Next() int32

func (*TermTFQuery) PayloadDecode

func (t *TermTFQuery) PayloadDecode(p Payload)

func (*TermTFQuery) Score

func (t *TermTFQuery) Score() float32

func (*TermTFQuery) SetBoost

func (t *TermTFQuery) SetBoost(b float32) Query

func (*TermTFQuery) String

func (t *TermTFQuery) String() string
