indexer

package module
v0.0.0-...-1eb67fd Latest
Warning

This package is not in the latest version of its module.

Published: May 31, 2024 License: AGPL-3.0, BSD-3-Clause Imports: 18 Imported by: 0

README

go-indexer

Bloom filter based search index with support for persistent archives.

Motivation

This is a refactoring of Ben Boyter's indexer code to do two things:

  1. To be able to index and search multiple gocloud.dev/blob instances. By default that just means multiple directories on the same filesystem, but technically anything that supports the gocloud.dev/blob.Bucket interface could be indexed.
  2. To be able to export and import search "archives" derived from earlier indexings.

That's it. All the "hard" stuff is still Ben's original code.

This is meant to be a simple tool for indexing arbitrary text, like free-form notes or a directory full of Who's On First documents, and providing a good-enough-is-good-enough interface for querying those files.

Tools

$> make cli
go build -mod vendor -ldflags="-s -w" -o bin/index cmd/index/main.go
go build -mod vendor -ldflags="-s -w" -o bin/search cmd/search/main.go
index
$> ./bin/index -h
Usage of ./bin/index:
  -bucket-uri value
    	One or more valid gocloud.dev/blob bucket URIs to index. The URI 'cwd://` will be interpreted as the current working directory on the local disk.
  -index-uri string
    	A valid gocloud.dev/blob bucket URIs containing the filename of the index to archive. (default "cwd:///indexer.idx")

For example:

$> ./bin/index -bucket-uri cwd:// -index-uri cwd:///index.idx
$> du -h index.idx 
1.2M	index.idx
search
$> ./bin/search -h
Usage of ./bin/search:
  -bucket-uri value
    	One or more valid gocloud.dev/blob bucket URIs to index. The URI 'cwd://` will be interpreted as the current working directory on the local disk.
  -index-uri string
    	An optional valid gocloud.dev/blob bucket URIs containing the filename of the index (archive) to load (instead of indexing things from scratch). The URI scheme 'cwd://' will be interpreted as the current working directory on the local disk.

For example:

$> ./bin/search -bucket-uri cwd:// 
enter search term: 
aaronland
--------------
9 index result(s)

&{.git/config 0}
9. 	url = git@github.com:aaronland/go-indexer.git

&{bucket.go 0}
9. 	"github.com/aaronland/gocloud-blob/bucket"
13. // START OF put me in aaronland/gocloud-blob
47. // END OF put me in aaronland/gocloud-blob

&{cmd/index/main.go 0}
8. 	"github.com/aaronland/go-indexer"

&{cmd/search/main.go 0}
10. 	"github.com/aaronland/go-indexer"

&{go.mod 0}
1. module github.com/aaronland/go-indexer
8. 	github.com/aaronland/gocloud-blob v0.0.17

&{go.sum 0}
13. github.com/aaronland/gocloud-blob v0.0.17 h1:TjsM6uT+XQ8SejlFNDgyxOXKEc90gZlPI0ov2EcMUHI=
14. github.com/aaronland/gocloud-blob v0.0.17/go.mod h1:Mk/2NKSaWsLTTwdqE3AEVms4W5v+Wv1WS1Z5HyZmhHA=

&{index.go 0}
16. 	"github.com/aaronland/gocloud-blob/bucket"
17. 	"github.com/aaronland/gocloud-blob/walk"

&{vendor/github.com/whosonfirst/go-ioutil/readseekcloser.go 0}
4. // (20210217/thisisaaronland)

&{vendor/modules.txt 0}
1. # github.com/aaronland/gocloud-blob v0.0.17
3. github.com/aaronland/gocloud-blob/bucket
4. github.com/aaronland/gocloud-blob/walk

enter search term: 

It is also possible to load an existing index to query. For example:

$> ./bin/search -index-uri cwd:///index.idx
enter search term: 
sfomuseum
--------------
7 index result(s)

&{cmd/index/main.go 0}
9. 	"github.com/sfomuseum/go-flags/multi"

&{cmd/search/main.go 0}
11. 	"github.com/sfomuseum/go-flags/multi"

&{go.mod 0}
9. 	github.com/sfomuseum/go-flags v0.10.0

&{vendor/modules.txt 0}
18. # github.com/sfomuseum/go-flags v0.10.0
20. github.com/sfomuseum/go-flags/multi

enter search term:

Note: In the example above results from indexing the .git folder were excluded.

Things this package doesn't do (yet)

  • There is no way to exclude certain files from being indexed. This is on the "to do" list but has not happened yet, so be mindful of what you choose to index.
  • It does not do incremental updates to existing indices.
  • It does not remove individual items from existing indices.
  • Probably none of the other things you'd like it to do.

gocloud.dev/blob bucket support

Currently only local files on disk are supported, using the file:// scheme. Shortly the "guts" of the code for the tools in the cmd folder will be moved into library code, allowing for the creation of custom search and index tools targeting other bucket schemes.

Documentation

Constants

const (
	BloomSize         = 4096
	DocumentsPerBlock = 64
)

Functions

func FindMatchingLines

func FindMatchingLines(r io.Reader, query string, limit int) []string

FindMatchingLines, given a reader for a file and a query, looks through the file's lines and returns those that match something from the query, up to a limit. Note that this will return partial matches: if any term matches, the line is considered a match, and there is no accounting for better matches. In other words, it's a very dumb way of doing this and probably has horrible runtime performance to match.

func GetFill

func GetFill(doc []bool) float64

GetFill returns, as a percentage, how much this doc was filled, which helps determine whether the index will be overfilled for this document.
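
As a rough illustration of the fill calculation, the sketch below counts the set entries in a []bool and reports them as a percentage. The name fillPercent is hypothetical, and whether the package computes the value exactly this way is an assumption.

```go
package main

import "fmt"

// fillPercent is a hypothetical stand-in for GetFill: it returns the
// percentage of entries in doc that are true.
func fillPercent(doc []bool) float64 {
	if len(doc) == 0 {
		return 0
	}
	set := 0
	for _, b := range doc {
		if b {
			set++
		}
	}
	return float64(set) / float64(len(doc)) * 100
}

func main() {
	doc := make([]bool, 8)
	doc[1], doc[5] = true, true
	fmt.Printf("%.1f%% filled\n", fillPercent(doc)) // prints "25.0% filled"
}
```

A bloom filter with a high fill percentage produces more false positives, so a value like this is a useful warning sign for large documents.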

func HashBloom

func HashBloom(word []byte) []uint64

HashBloom hashes a single token/word 3 times to give us the entry locations we need for our bloom filter

func Itemise

func Itemise(tokens []string) []bool

Itemise, given some tokens, uses them to create the bit positions we need to set for our bloom filter index
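
The relationship between HashBloom and Itemise can be sketched as follows. This is an illustration only: the names hashPositions and itemise are hypothetical, and the use of salted FNV-1a hashes is an assumption, since the page does not say which hash functions the package uses. The bloomSize constant mirrors the documented BloomSize value.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomSize mirrors the package's documented BloomSize constant.
const bloomSize = 4096

// hashPositions is a hypothetical stand-in for HashBloom. It hashes word
// three times, each time with a different salt byte, and maps every hash
// to a position inside the filter. Salted FNV-1a is an assumption.
func hashPositions(word []byte) []uint64 {
	positions := make([]uint64, 0, 3)
	for salt := byte(0); salt < 3; salt++ {
		h := fnv.New64a()
		h.Write([]byte{salt})
		h.Write(word)
		positions = append(positions, h.Sum64()%bloomSize)
	}
	return positions
}

// itemise is a hypothetical stand-in for Itemise. It marks the positions of
// every token in a bloomSize-wide slice of booleans.
func itemise(tokens []string) []bool {
	doc := make([]bool, bloomSize)
	for _, tok := range tokens {
		for _, pos := range hashPositions([]byte(tok)) {
			doc[pos] = true
		}
	}
	return doc
}

func main() {
	doc := itemise([]string{"aar", "aro", "ron"})
	set := 0
	for _, b := range doc {
		if b {
			set++
		}
	}
	fmt.Println("positions for \"aar\":", hashPositions([]byte("aar")))
	fmt.Println("bits set:", set)
}
```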

func Ngrams

func Ngrams(text string, size int) []string

Ngrams, given input, splits it according to the requested size, such that you can get trigrams or whatever else is required
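
A sliding-window n-gram split can be sketched like this. The name ngrams is hypothetical, and splitting on runes rather than bytes is an assumption made so that multi-byte characters stay intact.

```go
package main

import "fmt"

// ngrams is a hypothetical stand-in for Ngrams. It slides a window of
// size runes across text and returns each window as a string.
func ngrams(text string, size int) []string {
	runes := []rune(text)
	if size <= 0 || len(runes) < size {
		return nil
	}
	out := make([]string, 0, len(runes)-size+1)
	for i := 0; i+size <= len(runes); i++ {
		out = append(out, string(runes[i:i+size]))
	}
	return out
}

func main() {
	fmt.Println(ngrams("hello", 3)) // prints "[hel ell llo]"
}
```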

func RemoveUInt64Duplicates

func RemoveUInt64Duplicates(s []uint64) []uint64

RemoveUInt64Duplicates removes duplicate values from uint64 slice
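
One common way to remove duplicates from a slice is to track seen values in a map. The sketch below does that; the name dedupe is hypothetical, and keeping the first occurrence of each value in its original order is an assumption about the desired behaviour.

```go
package main

import "fmt"

// dedupe is a hypothetical stand-in for RemoveUInt64Duplicates. It keeps the
// first occurrence of each value and preserves the original order.
func dedupe(s []uint64) []uint64 {
	seen := make(map[uint64]struct{}, len(s))
	out := make([]uint64, 0, len(s))
	for _, v := range s {
		if _, ok := seen[v]; ok {
			continue
		}
		seen[v] = struct{}{}
		out = append(out, v)
	}
	return out
}

func main() {
	fmt.Println(dedupe([]uint64{3, 1, 3, 2, 1})) // prints "[3 1 2]"
}
```

Deduplicating hashed query positions before a search avoids reading the same filter row more than once.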

func Trigrams

func Trigrams(text string) []string

Trigrams takes in text and returns its trigrams. Attempts to be as efficient as possible.

func TrigramsDancantos

func TrigramsDancantos(text string) []string

TrigramsDancantos takes in text and returns its trigrams.

func TrigramsFfmiruz

func TrigramsFfmiruz(text string) []string

func TrigramsMerovius

func TrigramsMerovius(text string) []string

Types

type Archive

type Archive struct {
	BloomFilter []uint64          `json:"bloom_filter"`
	IdToFile    []*File           `json:"id_to_file"`
	BucketURIs  map[string]uint32 `json:"bucket_uris"`
}

Archive implements a struct containing data for serializing and deserializing `Index` instances

type File

type File struct {
	Path     string `json:"path"`
	BucketId uint32 `json:"bucket_id"`
}

type Index

type Index struct {
	// contains filtered or unexported fields
}

Index implements a bloom filter based search index

func NewIndex

func NewIndex() *Index

NewIndex returns a new (and empty) `Index` instance

func NewIndexWithOptions

func NewIndexWithOptions(opts *IndexOptions) *Index

func (*Index) Add

func (idx *Index) Add(item []bool) error

Add adds items into the internal bloom filter used later for pre-screening documents. Note that it fills the filter from right to left, which might not be what you expect.

func (*Index) Archive

func (idx *Index) Archive() *Archive

func (*Index) Close

func (idx *Index) Close() error

func (*Index) ExportArchive

func (idx *Index) ExportArchive(ctx context.Context, wr io.Writer) error

func (*Index) ExportArchiveWithURI

func (idx *Index) ExportArchiveWithURI(ctx context.Context, archive_uri string) error

func (*Index) IdToFile

func (idx *Index) IdToFile(id uint32) *File

func (*Index) ImportArchive

func (idx *Index) ImportArchive(ctx context.Context, r io.Reader) error

func (*Index) ImportArchiveWithURI

func (idx *Index) ImportArchiveWithURI(ctx context.Context, archive_uri string) error

func (*Index) IndexBuckets

func (idx *Index) IndexBuckets(ctx context.Context, bucket_uris ...string) error

func (*Index) IndexObject

func (idx *Index) IndexObject(ctx context.Context, b *blob.Bucket, bucket_id uint32, obj *blob.ListObject) error

func (*Index) OpenFile

func (idx *Index) OpenFile(ctx context.Context, id uint32) (io.ReadCloser, error)

func (*Index) PrintIndex

func (idx *Index) PrintIndex()

PrintIndex prints out the index which can be useful from time to time to ensure that bits are being set correctly.

func (*Index) Queryise

func (idx *Index) Queryise(query string) []uint64

Queryise, given some content, will turn it into tokens, hash them, and store the resulting values in a slice which we can use to query the bloom filter

func (*Index) Search

func (idx *Index) Search(queryBits []uint64) []uint32

Search finds the results we need to look at very quickly, using only bit operations; it is mostly limited by memory access
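
One way to pre-screen documents "using only bit operations" is a bit-sliced bloom filter, where each uint64 row holds one bit per document in a block of 64. The sketch below illustrates that general technique under the documented BloomSize and DocumentsPerBlock values. It is not the package's implementation: the sliced type and its add and search methods are hypothetical, and the actual memory layout of Index is not shown on this page.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	bloomSize    = 4096 // mirrors BloomSize
	docsPerBlock = 64   // mirrors DocumentsPerBlock
)

// sliced is a hypothetical bit-sliced filter: every block of 64 documents
// owns bloomSize rows, and bit n of a row belongs to document n of the block.
type sliced struct {
	rows []uint64
	docs int
}

// add records the filter positions of one document and returns its id.
func (s *sliced) add(positions []uint64) uint32 {
	slot := s.docs % docsPerBlock
	if slot == 0 {
		// Start a new block of rows for the next 64 documents.
		s.rows = append(s.rows, make([]uint64, bloomSize)...)
	}
	base := (s.docs / docsPerBlock) * bloomSize
	for _, p := range positions {
		s.rows[base+int(p%bloomSize)] |= uint64(1) << uint(slot)
	}
	id := uint32(s.docs)
	s.docs++
	return id
}

// search ANDs the rows for every query position. Bits that survive mark
// candidate documents, which may still be false positives.
func (s *sliced) search(queryBits []uint64) []uint32 {
	var ids []uint32
	for base := 0; base < len(s.rows); base += bloomSize {
		candidates := ^uint64(0)
		for _, p := range queryBits {
			candidates &= s.rows[base+int(p%bloomSize)]
		}
		blockStart := base / bloomSize * docsPerBlock
		for candidates != 0 {
			slot := bits.TrailingZeros64(candidates)
			candidates &= candidates - 1 // clear the lowest set bit
			if id := blockStart + slot; id < s.docs {
				ids = append(ids, uint32(id))
			}
		}
	}
	return ids
}

func main() {
	var s sliced
	s.add([]uint64{1, 2, 3})
	s.add([]uint64{2, 3, 4})
	fmt.Println(s.search([]uint64{2, 3})) // prints "[0 1]"
	fmt.Println(s.search([]uint64{4}))    // prints "[1]"
}
```

Each query position costs one row read and one AND per block of 64 documents, which is why an approach like this tends to be bound by memory access rather than computation.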

func (*Index) Tokenize

func (idx *Index) Tokenize(text string) []string

Tokenize returns a slice of tokens for the given text.

type IndexOptions

type IndexOptions struct {
	Method   string
	MaxBytes int64
}

func DefaultIndexOptions

func DefaultIndexOptions() *IndexOptions

type Trigram

type Trigram [3]rune

func TrigramsJamesrom

func TrigramsJamesrom(text string) []Trigram

TrigramsJamesrom takes in text and returns its trigrams.

func (Trigram) Bytes

func (t Trigram) Bytes() []byte

Bytes is the simplest way to turn an array of runes into a slice of bytes. There is a faster way to do this, but not needed for this demo. See: https://stackoverflow.com/questions/29255746/how-encode-rune-into-byte-using-utf8
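
For illustration, the sketch below builds rune-based trigrams and converts each one to UTF-8 bytes with utf8.EncodeRune. The lowercase trigram type, the trigramsOf function and the bytes method are hypothetical local names, and this is not necessarily how the package's Bytes method is written.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trigram mirrors the documented Trigram type ([3]rune).
type trigram [3]rune

// trigramsOf is a hypothetical helper returning every run of three
// consecutive runes in text.
func trigramsOf(text string) []trigram {
	runes := []rune(text)
	if len(runes) < 3 {
		return nil
	}
	out := make([]trigram, 0, len(runes)-2)
	for i := 0; i+3 <= len(runes); i++ {
		out = append(out, trigram{runes[i], runes[i+1], runes[i+2]})
	}
	return out
}

// bytes encodes the three runes as UTF-8 into a fixed-size buffer and
// returns a copy of the bytes actually written.
func (t trigram) bytes() []byte {
	var buf [3 * utf8.UTFMax]byte
	n := 0
	for _, r := range t {
		n += utf8.EncodeRune(buf[n:], r)
	}
	return append([]byte(nil), buf[:n]...)
}

func main() {
	for _, t := range trigramsOf("héllo") {
		fmt.Printf("%q\n", t.bytes())
	}
}
```

A shorter alternative is the conversion []byte(string(t[:])), which produces the same bytes at the cost of an intermediate string.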
