indexer

v0.0.0-...-1eb67fd Published: May 31, 2024 License: AGPL-3.0, BSD-3-Clause Imports: 18 Imported by: 0

Repository

github.com/aaronland/go-indexer

README ¶

go-indexer

Bloom filter based search index with support for persistent archives.

Motivation

This is a refactoring of Ben Boyter's indexer code to do two things:

To be able to index and search multiple gocloud.dev/blob instances. By default that just means multiple directories on the same filesystem, but technically it means that anything which supports the gocloud.dev/blob.Bucket interface could be indexed.
To be able to export and import search "archives" derived from earlier indexings.

That's it. All the "hard" stuff is still Ben's original code.

This is meant to be a simple tool for indexing arbitrary text, like free-form notes or a directory full of Who's On First documents, and for providing a good-enough-is-good-enough interface for querying those files.

Tools

$> make cli
go build -mod vendor -ldflags="-s -w" -o bin/index cmd/index/main.go
go build -mod vendor -ldflags="-s -w" -o bin/search cmd/search/main.go

index

$> ./bin/index -h
Usage of ./bin/index:
  -bucket-uri value
    	One or more valid gocloud.dev/blob bucket URIs to index. The URI 'cwd://` will be interpreted as the current working directory on the local disk.
  -index-uri string
    	A valid gocloud.dev/blob bucket URIs containing the filename of the index to archive. (default "cwd:///indexer.idx")

For example:

$> ./bin/index -bucket-uri cwd:// -index-uri cwd:///index.idx
$> du -h index.idx 
1.2M	index.idx

search

$> ./bin/search -h
Usage of ./bin/search:
  -bucket-uri value
    	One or more valid gocloud.dev/blob bucket URIs to index. The URI 'cwd://` will be interpreted as the current working directory on the local disk.
  -index-uri string
    	An optional valid gocloud.dev/blob bucket URIs containing the filename of the index (archive) to load (instead of indexing things from scratch). The URI scheme 'cwd://' will be interpreted as the current working directory on the local disk.

For example:

$> ./bin/search -bucket-uri cwd:// 
enter search term: 
aaronland
--------------
9 index result(s)

&{.git/config 0}
9. 	url = git@github.com:aaronland/go-indexer.git

&{bucket.go 0}
9. 	"github.com/aaronland/gocloud-blob/bucket"
13. // START OF put me in aaronland/gocloud-blob
47. // END OF put me in aaronland/gocloud-blob

&{cmd/index/main.go 0}
8. 	"github.com/aaronland/go-indexer"

&{cmd/search/main.go 0}
10. 	"github.com/aaronland/go-indexer"

&{go.mod 0}
1. module github.com/aaronland/go-indexer
8. 	github.com/aaronland/gocloud-blob v0.0.17

&{go.sum 0}
13. github.com/aaronland/gocloud-blob v0.0.17 h1:TjsM6uT+XQ8SejlFNDgyxOXKEc90gZlPI0ov2EcMUHI=
14. github.com/aaronland/gocloud-blob v0.0.17/go.mod h1:Mk/2NKSaWsLTTwdqE3AEVms4W5v+Wv1WS1Z5HyZmhHA=

&{index.go 0}
16. 	"github.com/aaronland/gocloud-blob/bucket"
17. 	"github.com/aaronland/gocloud-blob/walk"

&{vendor/github.com/whosonfirst/go-ioutil/readseekcloser.go 0}
4. // (20210217/thisisaaronland)

&{vendor/modules.txt 0}
1. # github.com/aaronland/gocloud-blob v0.0.17
3. github.com/aaronland/gocloud-blob/bucket
4. github.com/aaronland/gocloud-blob/walk

enter search term:

It is also possible to load an existing index to query. For example:

$> ./bin/search -index-uri cwd:///index.idx
enter search term: 
sfomuseum
--------------
7 index result(s)

&{cmd/index/main.go 0}
9. 	"github.com/sfomuseum/go-flags/multi"

&{cmd/search/main.go 0}
11. 	"github.com/sfomuseum/go-flags/multi"

&{go.mod 0}
9. 	github.com/sfomuseum/go-flags v0.10.0

&{vendor/modules.txt 0}
18. # github.com/sfomuseum/go-flags v0.10.0
20. github.com/sfomuseum/go-flags/multi

enter search term:

Note: In the example above, results from indexing the .git folder were excluded.

Things this package doesn't do (yet)

There is no way to exclude certain files from being indexed. This is on the "to do" list but has not happened yet, so be mindful of what you choose to index.
It does not do incremental updates to existing indices.
It does not remove individual items from existing indices.
Probably none of the other things you'd like it to do.

gocloud.dev/blob bucket support

Currently only local files on disk are supported, using the file:// scheme. Shortly, the "guts" of the code for the tools in the cmd folder will be moved into library code, allowing for the creation of custom search and index tools targeting other bucket schemes.

Documentation ¶

Index ¶

Constants
func FindMatchingLines(r io.Reader, query string, limit int) []string
func GetFill(doc []bool) float64
func HashBloom(word []byte) []uint64
func Itemise(tokens []string) []bool
func Ngrams(text string, size int) []string
func RemoveUInt64Duplicates(s []uint64) []uint64
func Trigrams(text string) []string
func TrigramsDancantos(text string) []string
func TrigramsFfmiruz(text string) []string
func TrigramsMerovius(text string) []string
type Archive
type File
type Index
- func NewIndex() *Index
- func NewIndexWithOptions(opts *IndexOptions) *Index
- func (idx *Index) Add(item []bool) error
- func (idx *Index) Archive() *Archive
- func (idx *Index) Close() error
- func (idx *Index) ExportArchive(ctx context.Context, wr io.Writer) error
- func (idx *Index) ExportArchiveWithURI(ctx context.Context, archive_uri string) error
- func (idx *Index) IdToFile(id uint32) *File
- func (idx *Index) ImportArchive(ctx context.Context, r io.Reader) error
- func (idx *Index) ImportArchiveWithURI(ctx context.Context, archive_uri string) error
- func (idx *Index) IndexBuckets(ctx context.Context, bucket_uris ...string) error
- func (idx *Index) IndexObject(ctx context.Context, b *blob.Bucket, bucket_id uint32, obj *blob.ListObject) error
- func (idx *Index) OpenFile(ctx context.Context, id uint32) (io.ReadCloser, error)
- func (idx *Index) PrintIndex()
- func (idx *Index) Queryise(query string) []uint64
- func (idx *Index) Search(queryBits []uint64) []uint32
- func (idx *Index) Tokenize(text string) []string
type IndexOptions
- func DefaultIndexOptions() *IndexOptions
type Trigram
- func TrigramsJamesrom(text string) []Trigram
- func (t Trigram) Bytes() []byte

Constants ¶

const (
	BloomSize         = 4096
	DocumentsPerBlock = 64
)

Functions ¶

func FindMatchingLines ¶

func FindMatchingLines(r io.Reader, query string, limit int) []string

Given a reader and a query, FindMatchingLines looks through the reader's lines and returns those that match something from the query, up to a limit. Note that this will return partial matches: if any term matches, the line is considered a match, and there is no accounting for better matches. In other words, it's a very dumb way of doing this and probably has horrible runtime performance to match.

func GetFill ¶

func GetFill(doc []bool) float64

GetFill returns the % value of how much this doc was filled, allowing for determining if the index will be overfilled for this document
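As a rough illustration of what a fill percentage means here, the sketch below counts the set bits in a document's bit slice. The name fillPercent is hypothetical, and the treatment of an empty slice is an assumption; the source does not say how GetFill handles it.

```go
package main

import "fmt"

// fillPercent is a hypothetical stand-in for GetFill: it reports what
// percentage of the positions in doc are set to true.
func fillPercent(doc []bool) float64 {
	if len(doc) == 0 {
		return 0 // assumption: an empty document counts as 0% filled
	}
	set := 0
	for _, bit := range doc {
		if bit {
			set++
		}
	}
	return float64(set) / float64(len(doc)) * 100
}

func main() {
	doc := []bool{true, false, false, true}
	fmt.Printf("%.1f%% filled\n", fillPercent(doc))
}
```

A value close to 100 would suggest that the filter for that document is saturated and will match almost any query.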

func HashBloom ¶

func HashBloom(word []byte) []uint64

HashBloom hashes a single token/word 3 times to give us the entry locations we need for our bloom filter
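One way to derive three positions from a single token is to hash it with three different salts and reduce each result modulo the filter size. The sketch below does this with FNV-1a from the standard library. The hash functions actually used by HashBloom are not stated in this documentation, so the choice of FNV-1a, the salting scheme, and the name hashBloom are all assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomSize mirrors the BloomSize constant documented for the package.
const bloomSize = 4096

// hashBloom is a hypothetical stand-in for HashBloom. It returns three
// positions in [0, bloomSize) computed from salted FNV-1a hashes.
func hashBloom(word []byte) []uint64 {
	positions := make([]uint64, 0, 3)
	for salt := byte(0); salt < 3; salt++ {
		h := fnv.New64a()
		// Writes to a hash.Hash never return an error.
		h.Write([]byte{salt})
		h.Write(word)
		positions = append(positions, h.Sum64()%bloomSize)
	}
	return positions
}

func main() {
	fmt.Println(hashBloom([]byte("aar")))
}
```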

func Itemise ¶

func Itemise(tokens []string) []bool

Itemise takes a slice of tokens and uses them to create the bit positions we need to set for our bloom filter index

func Ngrams ¶

func Ngrams(text string, size int) []string

Ngrams, given input, splits it according to the requested size such that you can get trigrams or whatever else is required
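A sliding window over the runes of the input is one straightforward way to produce n-grams. The sketch below is only an illustration: the name ngrams is hypothetical, and how the real Ngrams treats whitespace, case, or input shorter than the requested size is not documented here, so returning nil for short input is an assumption.

```go
package main

import "fmt"

// ngrams is a hypothetical stand-in for Ngrams. It slides a window of
// size runes across text and returns every window as a string.
func ngrams(text string, size int) []string {
	runes := []rune(text) // work on runes so multi-byte characters stay intact
	if size <= 0 || len(runes) < size {
		return nil // assumption: too-short input yields no n-grams
	}
	out := make([]string, 0, len(runes)-size+1)
	for i := 0; i+size <= len(runes); i++ {
		out = append(out, string(runes[i:i+size]))
	}
	return out
}

func main() {
	fmt.Println(ngrams("indexer", 3))
}
```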

func RemoveUInt64Duplicates ¶

func RemoveUInt64Duplicates(s []uint64) []uint64

RemoveUInt64Duplicates removes duplicate values from uint64 slice
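A common way to remove duplicates from a slice is to track the values already seen in a map. The sketch below takes that approach. The name removeDuplicates is hypothetical, and keeping the first occurrence of each value in its original order is an assumption; the source does not say whether RemoveUInt64Duplicates preserves order.

```go
package main

import "fmt"

// removeDuplicates is a hypothetical stand-in for RemoveUInt64Duplicates.
// It keeps the first occurrence of each value, in input order.
func removeDuplicates(s []uint64) []uint64 {
	seen := make(map[uint64]struct{}, len(s))
	out := make([]uint64, 0, len(s))
	for _, v := range s {
		if _, ok := seen[v]; ok {
			continue
		}
		seen[v] = struct{}{}
		out = append(out, v)
	}
	return out
}

func main() {
	fmt.Println(removeDuplicates([]uint64{3, 1, 3, 2, 1}))
}
```

Deduplicating the hashed query positions avoids reading the same filter word more than once during a search.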

func Trigrams ¶

func Trigrams(text string) []string

Trigrams takes in text and returns its trigrams. Attempts to be as efficient as possible.

func TrigramsDancantos ¶

func TrigramsDancantos(text string) []string

TrigramsDancantos takes in text and returns its trigrams.

func TrigramsFfmiruz ¶

func TrigramsFfmiruz(text string) []string

func TrigramsMerovius ¶

func TrigramsMerovius(text string) []string

Types ¶

type Archive ¶

type Archive struct {
	BloomFilter []uint64          `json:"bloom_filter"`
	IdToFile    []*File           `json:"id_to_file"`
	BucketURIs  map[string]uint32 `json:"bucket_uris"`
}

Archive implements a struct containing data for serializing and deserializing `Index` instances

type File ¶

type File struct {
	Path     string `json:"path"`
	BucketId uint32 `json:"bucket_id"`
}

type Index ¶

type Index struct {
	// contains filtered or unexported fields
}

Index implements a bloom filter based search index

func NewIndex ¶

func NewIndex() *Index

NewIndex returns a new (and empty) `Index` instance

func NewIndexWithOptions ¶

func NewIndexWithOptions(opts *IndexOptions) *Index

func (*Index) Add ¶

func (idx *Index) Add(item []bool) error

Add adds items into the internal bloomFilter used later for pre-screening documents. Note that it fills the filter from right to left, which might not be what you expect.

func (*Index) Archive ¶

func (idx *Index) Archive() *Archive

func (*Index) Close ¶

func (idx *Index) Close() error

func (*Index) ExportArchive ¶

func (idx *Index) ExportArchive(ctx context.Context, wr io.Writer) error

func (*Index) ExportArchiveWithURI ¶

func (idx *Index) ExportArchiveWithURI(ctx context.Context, archive_uri string) error

func (*Index) IdToFile ¶

func (idx *Index) IdToFile(id uint32) *File

func (*Index) ImportArchive ¶

func (idx *Index) ImportArchive(ctx context.Context, r io.Reader) error

func (*Index) ImportArchiveWithURI ¶

func (idx *Index) ImportArchiveWithURI(ctx context.Context, archive_uri string) error

func (*Index) IndexBuckets ¶

func (idx *Index) IndexBuckets(ctx context.Context, bucket_uris ...string) error

func (*Index) IndexObject ¶

func (idx *Index) IndexObject(ctx context.Context, b *blob.Bucket, bucket_id uint32, obj *blob.ListObject) error

func (*Index) OpenFile ¶

func (idx *Index) OpenFile(ctx context.Context, id uint32) (io.ReadCloser, error)

func (*Index) PrintIndex ¶

func (idx *Index) PrintIndex()

PrintIndex prints out the index which can be useful from time to time to ensure that bits are being set correctly.

func (*Index) Queryise ¶

func (idx *Index) Queryise(query string) []uint64

Queryise, given some content, will turn it into tokens, hash them, and store the resulting values in a slice which we can use to query the bloom filter

func (*Index) Search ¶

func (idx *Index) Search(queryBits []uint64) []uint32

Search quickly finds the results we need to look at, using only bit operations; it is mostly limited by memory access.
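To make the "only bit operations" idea concrete, here is a minimal sketch of a bit-sliced bloom filter, assuming a layout in which each block of DocumentsPerBlock (64) documents owns BloomSize (4096) uint64 words and bit j of word i records whether document j has filter position i set. The type sketchIndex, its methods, and the bit ordering are assumptions made for illustration; the package's internal layout is unexported and may differ.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	bloomSize         = 4096 // mirrors BloomSize
	documentsPerBlock = 64   // mirrors DocumentsPerBlock
)

// sketchIndex is a hypothetical, simplified index.
type sketchIndex struct {
	filter []uint64 // bloomSize words for every block of 64 documents
	docs   int      // number of documents added so far
}

// add stores one document's bit positions in the next free document slot.
func (s *sketchIndex) add(item []bool) {
	if s.docs%documentsPerBlock == 0 {
		// Start a new, zeroed block.
		s.filter = append(s.filter, make([]uint64, bloomSize)...)
	}
	block := s.docs / documentsPerBlock
	slot := uint(s.docs % documentsPerBlock)
	for i := 0; i < len(item) && i < bloomSize; i++ {
		if item[i] {
			s.filter[block*bloomSize+i] |= uint64(1) << slot
		}
	}
	s.docs++
}

// search returns the ids of documents whose filters have every query
// position set. These are candidates only: bloom filters can produce
// false positives, so callers still need to check the actual content.
func (s *sketchIndex) search(queryBits []uint64) []uint32 {
	if len(queryBits) == 0 {
		return nil
	}
	var ids []uint32
	for base := 0; base < len(s.filter); base += bloomSize {
		candidates := ^uint64(0)
		for _, q := range queryBits {
			// One AND per query position tests 64 documents at once.
			candidates &= s.filter[base+int(q%bloomSize)]
		}
		blockStart := uint32(base / bloomSize * documentsPerBlock)
		for candidates != 0 {
			slot := bits.TrailingZeros64(candidates)
			ids = append(ids, blockStart+uint32(slot))
			candidates &= candidates - 1 // clear the lowest set bit
		}
	}
	return ids
}

func main() {
	idx := &sketchIndex{}
	first := make([]bool, bloomSize)
	first[1], first[5] = true, true
	second := make([]bool, bloomSize)
	second[5], second[9] = true, true
	idx.add(first)
	idx.add(second)
	fmt.Println(idx.search([]uint64{5}))
}
```

The design point is that the inner loop touches one word per query position per block, so the cost is dominated by memory reads rather than by per-document work.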

func (*Index) Tokenize ¶

func (idx *Index) Tokenize(text string) []string

Tokenize returns a slice of tokens for the given text.

type IndexOptions ¶

type IndexOptions struct {
	Method   string
	MaxBytes int64
}

func DefaultIndexOptions ¶

func DefaultIndexOptions() *IndexOptions

type Trigram ¶

type Trigram [3]rune

func TrigramsJamesrom ¶

func TrigramsJamesrom(text string) []Trigram

TrigramsJamesrom takes in text and returns its trigrams.

func (Trigram) Bytes ¶

func (t Trigram) Bytes() []byte

Bytes is the simplest way to turn an array of runes into a slice of bytes. There is a faster way to do this, but not needed for this demo. See: https://stackoverflow.com/questions/29255746/how-encode-rune-into-byte-using-utf8
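The two approaches mentioned above can be sketched as follows: a simple conversion through string, and a version that encodes each rune directly with unicode/utf8. The type trigram and the method names are hypothetical, and which of these (if either) matches the package's Bytes method is an assumption.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trigram mirrors the documented Trigram type ([3]rune).
type trigram [3]rune

// bytesSimple converts the runes to a string and then to bytes.
func (t trigram) bytesSimple() []byte {
	return []byte(string(t[:]))
}

// bytesEncoded encodes each rune into a reusable scratch buffer,
// avoiding the intermediate string.
func (t trigram) bytesEncoded() []byte {
	out := make([]byte, 0, len(t)*utf8.UTFMax)
	var scratch [utf8.UTFMax]byte
	for _, r := range t {
		n := utf8.EncodeRune(scratch[:], r)
		out = append(out, scratch[:n]...)
	}
	return out
}

func main() {
	tri := trigram{'a', 'é', 'c'}
	fmt.Printf("%q %q\n", tri.bytesSimple(), tri.bytesEncoded())
}
```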

Directories ¶

cmd/index
cmd/search
