fulltext

Published: May 10, 2015 · License: MIT · Imports: 12 · Imported by: 3

README

Overview

This is a simple, pure-Go, full text indexing and search library.

I made it for use on small to medium websites, although there is nothing web-specific about its API or operation.

Cdb (http://github.com/jbarham/go-cdb) is used to perform the indexing and lookups.

Status

This project is more or less stable.

Notes on Building

fulltext requires CDB:

go get github.com/jbarham/go-cdb

Usage

First, you must create an index. Like this:

import "github.com/bradleypeabody/fulltext"

// create new index with temp dir (usually "" is fine)
idx, err := fulltext.NewIndexer(""); if err != nil { panic(err) }
defer idx.Close()

// provide stop words if desired
idx.StopWordCheck = fulltext.EnglishStopWordChecker

// for each document you want to add, you do something like this:
doc := fulltext.IndexDoc{
	Id: []byte(uuid), // unique identifier (the path to a webpage works...)
	StoreValue: []byte(title), // bytes you want to be able to retrieve from search results
	IndexValue: []byte(data), // bytes you want to be split into words and indexed
}
idx.AddDoc(doc) // add it

// when done, write out to final index
err = idx.FinalizeAndWrite(f); if err != nil { panic(err) }

Once you have an index file, you can search it like this:

s, err := fulltext.NewSearcher("/path/to/index/file")
if err != nil {
	panic(err)
}
defer s.Close()
sr, err := s.SimpleSearch("Horatio", 20)
if err != nil {
	panic(err)
}
for k, v := range sr.Items {
	fmt.Printf("----------- #:%d\n", k)
	fmt.Printf("Id: %s\n", v.Id)
	fmt.Printf("Score: %d\n", v.Score)
	fmt.Printf("StoreValue: %s\n", v.StoreValue)
}

It's rather simplistic. But it's fast and it works.

Thoughts in Comparison to blevesearch

I wrote this project before blevesearch was released. I've since built a number of website search engines using fulltext and a number of others using blevesearch. My general experience has been that blevesearch is better suited for projects where you are doing significant development on your search results and need the ability to customize things for various locales, etc. fulltext, on the other hand, is much simpler and is better for projects that either a) have simpler search requirements or b) prefer speed of indexing over quality of results.

Adding a fulltext search engine to a website with a few hundred pages is a simple task, and the indexing is fast enough that you can just run it as part of your pre-publish build process. So while there is a lot more development happening on blevesearch - and hats off to them, it's a great product - fulltext still has its place in these simpler scenarios.
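Such a build step can be a few dozen lines. The sketch below is one way to wire it up using the helpers documented further down; the pages/ directory, the .html filter and the search.idx file name are placeholder choices of mine, not anything the library prescribes:

package main

import (
	"io/ioutil"
	"os"
	"path/filepath"

	"github.com/bradleypeabody/fulltext"
)

func main() {
	idx, err := fulltext.NewIndexer("")
	if err != nil {
		panic(err)
	}
	defer idx.Close()
	idx.StopWordCheck = fulltext.EnglishStopWordChecker

	// index every HTML page under the (hypothetical) pages/ directory
	err = filepath.Walk("pages", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || filepath.Ext(path) != ".html" {
			return err
		}
		b, err := ioutil.ReadFile(path)
		if err != nil {
			return err
		}
		html := string(b)
		return idx.AddDoc(fulltext.IndexDoc{
			Id:         []byte(path),                            // page path as unique id
			StoreValue: []byte(fulltext.HTMLExtractTitle(html)), // title shown in results
			IndexValue: []byte(fulltext.HTMLStripTags(html)),    // visible text gets indexed
		})
	})
	if err != nil {
		panic(err)
	}

	// write the single index file the site will ship with
	f, err := os.Create("search.idx")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := idx.FinalizeAndWrite(f); err != nil {
		panic(err)
	}
}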

TODOs

  • Basic "stop word" functionality now exists (see StopWordCheck and EnglishStopWordChecker); checkers for languages other than English would be useful.

  • Wordize() and IndexizeWord() are already overridable via the WordSplit and WordClean callbacks; the scoring aggregation logic should be extracted the same way, with the existing functionality as the default.

  • The search logic is currently very naive. Ideally this project would have something as sophisticated as Lucene's query parser, but in reality what I'll likely do is a simple survey of which common features are actually used by any on-site search engines I can get my hands on. Quoting ("black cat") and logical operators (Jim OR James) would likely be at the top of the list, and implementing that sort of thing would be a higher priority than trying to duplicate Lucene.

  • I've considered using boltdb for storage as an alternative to CDB, but I haven't found the time to work on it. This approach would make the index updatable, reduce memory consumption during index building, and potentially allow for wildcard suffixes.

Implementation Notes

I originally tried doing this on top of SQLite. It was dreadfully slow. Cdb is orders of magnitude faster.

The two main disadvantages of going the Cdb route are that the index cannot be edited once it is built (you have to recreate it in full), and that, since it's hash-based, it will not support any sort of fuzzy matching unless those variations are included in the index (which they are not, in the current implementation). For my purposes these two disadvantages are overshadowed by the fact that it's blindingly fast, easy to use, portable (pure-Go), and its interface allowed me to build the indexes I needed into a single file.

The test suite includes a copy of the complete works of William Shakespeare (thanks to Jeremy Hylton's http://shakespeare.mit.edu/), and this library is used to create a simple search engine on top of that corpus. By default it only runs for 10 seconds, but you can run it for longer by doing something like:

SEARCHER_WEB_TIMEOUT_SECONDS=120 go test fulltext -v

Works on Windows.

Future Work

It might be feasible to supplant this project with something using suffix arrays (http://golang.org/pkg/index/suffixarray/). The main downside would be the requirement of a lot more storage space (and memory to load and search it). Retooling the index/suffixarray package so it can work against the disk is an idea, but not necessarily a simple one. The upside of such an approach would be full regex support for searches with decent performance - which would rock. The index could potentially be sharded by the first character or two of the search term - but that's still not as good as something with sensible caching, where the whole set can be kept on disk and the "hot" parts cached in memory, etc.
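For a concrete feel of what that would buy, the sketch below is plain index/suffixarray usage from the standard library (nothing here is part of fulltext): it builds an in-memory suffix array and runs a regex over the corpus.

package main

import (
	"fmt"
	"index/suffixarray"
	"regexp"
)

func main() {
	data := []byte("to be or not to be, that is the question")
	sa := suffixarray.New(data) // the suffix array lives entirely in memory

	// full regex search - exactly what a hash-based cdb index cannot do
	re := regexp.MustCompile(`\bq\w+`)
	for _, m := range sa.FindAllIndex(re, -1) {
		fmt.Printf("match %q at byte %d\n", data[m[0]:m[1]], m[0])
	}
}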

Documentation

Overview

A simple cross-platform, full-text search engine, backed by cdb. Intended for use on small- to medium-sized websites.

See README.md for usage.

Index

Constants

const HEADER_SIZE = 4096

Size of header block to prepend - make it 4k to align disk reads

Variables

var EnglishStopWordChecker = func(s string) bool {
	return STOPWORDS_EN[s]
}
var STOPWORDS_EN = map[string]bool{}/* 173 elements not displayed */

English stop words
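Because StopWordCheck is just a StopWordChecker (func(string) bool), plugging in another language is trivial. For example (the German word list here is a made-up stand-in; only the English list ships with the package):

// hypothetical stop word list for another language
var stopwordsDE = map[string]bool{"der": true, "die": true, "das": true, "und": true}

idx.StopWordCheck = func(s string) bool {
	return stopwordsDE[s]
}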

Functions

func HTMLExtractDescription

func HTMLExtractDescription(html string) string

Helper to extract an HTML description from the meta[name=description] tag

func HTMLExtractTitle

func HTMLExtractTitle(html string) string

Helper to extract an HTML title from the title tag

func HTMLStripTags

func HTMLStripTags(s string) (output string)

This function was copied from https://github.com/kennygrant/sanitize/blob/master/sanitize.go (license: https://github.com/kennygrant/sanitize/blob/master/License-BSD.txt). Strips HTML tags, replaces common entities, and escapes <>&;'" in the result. Note the returned text may contain entities, as it is escaped by HTMLEscapeString and most entities are not translated.
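Together these three helpers cover the common case of turning a web page into an IndexDoc. A rough example - the expected outputs in the comments are my reading of the doc comments above, not verified behavior:

html := `<html><head><title>Hamlet</title>
<meta name="description" content="The Prince of Denmark"></head>
<body><p>To be, or not to be</p></body></html>`

fmt.Println(fulltext.HTMLExtractTitle(html))       // "Hamlet"
fmt.Println(fulltext.HTMLExtractDescription(html)) // "The Prince of Denmark"
fmt.Println(fulltext.HTMLStripTags(html))          // the text with tags stripped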

func IndexizeWord

func IndexizeWord(w string) string

Make word appropriate for indexing

func Wordize

func Wordize(t string) []string

Split a string up into words
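These two are the defaults applied to IndexValue during indexing; called directly they look like this (the sample string is mine):

// split, then normalize each word the same way the indexer would
for _, w := range fulltext.Wordize("Alas, poor Yorick!") {
	fmt.Println(fulltext.IndexizeWord(w))
}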

Types

type IndexDoc

type IndexDoc struct {
	Id         []byte // the id, this is usually the path to the document
	IndexValue []byte // index this data
	StoreValue []byte // store this data
}

Contents of a single document to be indexed

type Indexer

type Indexer struct {
	WordSplit     WordSplitter
	WordClean     WordCleaner
	StopWordCheck StopWordChecker
	// contains filtered or unexported fields
}

Produces a set of cdb files from a series of AddDoc() calls

func NewIndexer

func NewIndexer(tempDir string) (*Indexer, error)

Creates a new indexer, using the given temp dir while building the index.

func (*Indexer) AddDoc

func (idx *Indexer) AddDoc(idoc IndexDoc) error

Add a document to the index - writes to temporary files and stores some data in memory while building the index.

func (*Indexer) Close

func (idx *Indexer) Close()

Close and remove all resources

func (*Indexer) DumpStatus

func (idx *Indexer) DumpStatus(w io.Writer)

Dump some human readable status information
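For example, while debugging an index build (writing to os.Stderr is just one choice of io.Writer):

idx.DumpStatus(os.Stderr) // print human readable status for the in-progress index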

func (*Indexer) FinalizeAndWrite

func (idx *Indexer) FinalizeAndWrite(w io.Writer) error

Builds a final single index file, which consists of some simple header info, followed by the cdb binary files that comprise the full index.

type SearchResultItem

type SearchResultItem struct {
	Id         []byte // id of this item (document)
	StoreValue []byte // the stored value of this document
	Score      int64  // the total score
}

A single item in a search result

type SearchResultItems

type SearchResultItems []SearchResultItem

Implement sort.Interface

func (SearchResultItems) Len

func (s SearchResultItems) Len() int

func (SearchResultItems) Less

func (s SearchResultItems) Less(i, j int) bool

func (SearchResultItems) Swap

func (s SearchResultItems) Swap(i, j int)
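Because these methods satisfy sort.Interface, a result slice can be re-sorted after a search; the ordering is whatever the package's Less defines (presumably by Score):

import "sort"

sr, err := s.SimpleSearch("Horatio", 20)
if err != nil {
	panic(err)
}
sort.Sort(sr.Items) // reorder in place using the package's Less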

type SearchResults

type SearchResults struct {
	Items SearchResultItems
}

What happened during the search

type Searcher

type Searcher struct {
	// contains filtered or unexported fields
}

Interface for search. Not thread-safe, but low overhead so having a separate one per thread should be workable.
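In practice that means one Searcher per goroutine rather than a shared instance. A minimal sketch (the queries and index path are placeholders; uses fmt, log, sync and fulltext):

var wg sync.WaitGroup
for _, q := range []string{"Horatio", "Yorick"} {
	wg.Add(1)
	go func(query string) {
		defer wg.Done()
		// each goroutine opens its own Searcher; they are cheap but not thread-safe
		s, err := fulltext.NewSearcher("/path/to/index/file")
		if err != nil {
			log.Println(err)
			return
		}
		defer s.Close()
		sr, err := s.SimpleSearch(query, 20)
		if err != nil {
			log.Println(err)
			return
		}
		fmt.Printf("%s: %d results\n", query, len(sr.Items))
	}(q)
}
wg.Wait()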

func NewSearcher

func NewSearcher(fpath string) (*Searcher, error)

Make a new searcher using the file at the specified path. TODO: Make a variation that accepts a ReaderAt.

func (*Searcher) Close

func (s *Searcher) Close() error

Close and release resources

func (*Searcher) SimpleSearch

func (s *Searcher) SimpleSearch(search string, maxn int) (SearchResults, error)

Perform a search

type StopWordChecker

type StopWordChecker func(string) bool

type WordCleaner

type WordCleaner func(string) string

type WordSplitter

type WordSplitter func(string) []string
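These callback types correspond to the Indexer fields WordSplit, WordClean and StopWordCheck shown above, so the whole tokenization pipeline can be swapped out. A sketch - the strings-based splitter and cleaner are stand-ins of mine, not the package defaults:

idx, err := fulltext.NewIndexer("")
if err != nil {
	panic(err)
}
defer idx.Close()

idx.WordSplit = strings.Fields                      // naive whitespace splitting instead of Wordize
idx.WordClean = strings.ToLower                     // lowercasing only, instead of IndexizeWord
idx.StopWordCheck = fulltext.EnglishStopWordChecker // the built-in English stop word list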
