fulltext

Published: May 10, 2015 · License: MIT · Imports: 12 · Imported by: 3

README

Overview

This is a simple, pure-Go, full text indexing and search library.

I made it for use on small to medium websites, although there is nothing web-specific about its API or operation.

Cdb (http://github.com/jbarham/go-cdb) is used to perform the indexing and lookups.

Status

This project is more or less stable.

Notes on Building

fulltext requires CDB:

go get github.com/jbarham/go-cdb

Usage

First, you must create an index. Like this:

import "github.com/bradleypeabody/fulltext"

// create new index with temp dir (usually "" is fine)
idx, err := fulltext.NewIndexer(""); if err != nil { panic(err) }
defer idx.Close()

// provide stop words if desired
idx.StopWordCheck = fulltext.EnglishStopWordChecker

// for each document you want to add, you do something like this:
doc := fulltext.IndexDoc{
	Id: []byte(uuid), // unique identifier (the path to a webpage works...)
	StoreValue: []byte(title), // bytes you want to be able to retrieve from search results
	IndexValue: []byte(data), // bytes you want to be split into words and indexed
}
idx.AddDoc(doc) // add it

// when done, write out to final index
err = idx.FinalizeAndWrite(f); if err != nil { panic(err) }

Once you have an index file, you can search it like this:

s, err := fulltext.NewSearcher("/path/to/index/file")
if err != nil {
	panic(err)
}
defer s.Close()
sr, err := s.SimpleSearch("Horatio", 20)
if err != nil {
	panic(err)
}
for k, v := range sr.Items {
	fmt.Printf("----------- #:%d\n", k)
	fmt.Printf("Id: %s\n", v.Id)
	fmt.Printf("Score: %d\n", v.Score)
	fmt.Printf("StoreValue: %s\n", v.StoreValue)
}

It's rather simplistic. But it's fast and it works.

Thoughts in Comparison to blevesearch

I wrote this project before blevesearch was released. I've since built a number of website search engines using fulltext and a number of others using blevesearch. My general experience has been that blevesearch is better suited for projects where you are doing significant development on your search results and need the ability to customize things for various locales, etc. fulltext, on the other hand, is much simpler and is better for projects that either a) have simpler search requirements or b) prefer speed of indexing over quality of results.

Adding a fulltext search engine to a website with a few hundred pages is a simple task, and the indexing is fast enough that you can just run it as part of your pre-publish build process. So while there is a lot more development happening on blevesearch - and hats off to them, it's a great product - fulltext still has its place in these simpler scenarios.
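Such a build step can be a few dozen lines. The sketch below is one way to wire it up using the helpers documented further down; the pages/ directory, the .html filter and the search.idx file name are placeholder choices of mine, not anything the library prescribes:

package main

import (
	"io/ioutil"
	"os"
	"path/filepath"

	"github.com/bradleypeabody/fulltext"
)

func main() {
	idx, err := fulltext.NewIndexer("")
	if err != nil {
		panic(err)
	}
	defer idx.Close()
	idx.StopWordCheck = fulltext.EnglishStopWordChecker

	// index every HTML page under the (hypothetical) pages/ directory
	err = filepath.Walk("pages", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || filepath.Ext(path) != ".html" {
			return err
		}
		b, err := ioutil.ReadFile(path)
		if err != nil {
			return err
		}
		html := string(b)
		return idx.AddDoc(fulltext.IndexDoc{
			Id:         []byte(path),                            // page path as unique id
			StoreValue: []byte(fulltext.HTMLExtractTitle(html)), // title shown in results
			IndexValue: []byte(fulltext.HTMLStripTags(html)),    // visible text gets indexed
		})
	})
	if err != nil {
		panic(err)
	}

	// write the single index file the site will ship with
	f, err := os.Create("search.idx")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := idx.FinalizeAndWrite(f); err != nil {
		panic(err)
	}
}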

TODOs

  • Basic "stop word" functionality now exists (see StopWordCheck and EnglishStopWordChecker); checkers for languages other than English would be useful.

  • Wordize() and IndexizeWord() are already overridable via the WordSplit and WordClean callbacks; the scoring aggregation logic should be extracted the same way, with the existing functionality as the default.

  • The search logic is currently very naive. Ideally this project would have something as sophisticated as Lucene's query parser, but in reality what I'll likely do is a simple survey of which common features are actually used by any on-site search engines I can get my hands on. Quoting ("black cat") and logical operators (Jim OR James) would likely be at the top of the list, and implementing that sort of thing would be a higher priority than trying to duplicate Lucene.

  • I've considered using boltdb for storage as an alternative to CDB, but I haven't found the time to work on it. This approach would make the index updatable, reduce memory consumption during index building, and potentially allow for wildcard suffixes.

Implementation Notes

I originally tried doing this on top of SQLite. It was dreadfully slow. Cdb is orders of magnitude faster.

The two main disadvantages of going the Cdb route are that the index cannot be edited once it is built (you have to recreate it in full), and that, since it's hash-based, it will not support any sort of fuzzy matching unless those variations are included in the index (which they are not, in the current implementation). For my purposes these two disadvantages are overshadowed by the fact that it's blindingly fast, easy to use, portable (pure-Go), and its interface allowed me to build the indexes I needed into a single file.

The test suite includes a copy of the complete works of William Shakespeare (thanks to Jeremy Hylton's http://shakespeare.mit.edu/), and this library is used to create a simple search engine on top of that corpus. By default it only runs for 10 seconds, but you can run it for longer by doing something like:

SEARCHER_WEB_TIMEOUT_SECONDS=120 go test fulltext -v

Works on Windows.

Future Work

It might be feasible to supplant this project with something using suffix arrays (http://golang.org/pkg/index/suffixarray/). The main downside would be the requirement of a lot more storage space (and memory to load and search it). Retooling the index/suffixarray package so it can work against the disk is an idea, but not necessarily a simple one. The upside of such an approach would be full regex support for searches with decent performance - which would rock. The index could potentially be sharded by the first character or two of the search term - but that's still not as good as something with sensible caching, where the whole set can be kept on disk and the "hot" parts cached in memory, etc.
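For a concrete feel of what that would buy, the sketch below is plain index/suffixarray usage from the standard library (nothing here is part of fulltext): it builds an in-memory suffix array and runs a regex over the corpus.

package main

import (
	"fmt"
	"index/suffixarray"
	"regexp"
)

func main() {
	data := []byte("to be or not to be, that is the question")
	sa := suffixarray.New(data) // the suffix array lives entirely in memory

	// full regex search - exactly what a hash-based cdb index cannot do
	re := regexp.MustCompile(`\bq\w+`)
	for _, m := range sa.FindAllIndex(re, -1) {
		fmt.Printf("match %q at byte %d\n", data[m[0]:m[1]], m[0])
	}
}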

Documentation

Overview

A simple cross-platform, full-text search engine, backed by cdb. Intended for use on small- to medium-sized websites.

See README.md for usage.

Index

Constants

const HEADER_SIZE = 4096

Size of header block to prepend - make it 4k to align disk reads

Variables

var EnglishStopWordChecker = func(s string) bool {
	return STOPWORDS_EN[s]
}
var STOPWORDS_EN = map[string]bool{}/* 173 elements not displayed */

English stop words
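Because StopWordCheck is just a StopWordChecker (func(string) bool), plugging in another language is trivial. For example (the German word list here is a made-up stand-in; only the English list ships with the package):

// hypothetical stop word list for another language
var stopwordsDE = map[string]bool{"der": true, "die": true, "das": true, "und": true}

idx.StopWordCheck = func(s string) bool {
	return stopwordsDE[s]
}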

Functions

func HTMLExtractDescription

func HTMLExtractDescription(html string) string

Helper to extract an HTML description from the meta[name=description] tag

func HTMLExtractTitle

func HTMLExtractTitle(html string) string

Helper to extract an HTML title from the title tag

func HTMLStripTags

func HTMLStripTags(s string) (output string)

This function was copied from https://github.com/kennygrant/sanitize/blob/master/sanitize.go (license: https://github.com/kennygrant/sanitize/blob/master/License-BSD.txt). Strips HTML tags, replaces common entities, and escapes <>&;'" in the result. Note the returned text may contain entities, as it is escaped by HTMLEscapeString and most entities are not translated.
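Together these three helpers cover the common case of turning a web page into an IndexDoc. A rough example - the expected outputs in the comments are my reading of the doc comments above, not verified behavior:

html := `<html><head><title>Hamlet</title>
<meta name="description" content="The Prince of Denmark"></head>
<body><p>To be, or not to be</p></body></html>`

fmt.Println(fulltext.HTMLExtractTitle(html))       // "Hamlet"
fmt.Println(fulltext.HTMLExtractDescription(html)) // "The Prince of Denmark"
fmt.Println(fulltext.HTMLStripTags(html))          // the text with tags stripped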

func IndexizeWord

func IndexizeWord(w string) string

Make word appropriate for indexing

func Wordize

func Wordize(t string) []string

Split a string up into words
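These two are the defaults applied to IndexValue during indexing; called directly they look like this (the sample string is mine):

// split, then normalize each word the same way the indexer would
for _, w := range fulltext.Wordize("Alas, poor Yorick!") {
	fmt.Println(fulltext.IndexizeWord(w))
}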

Types

type IndexDoc

type IndexDoc struct {
	Id         []byte // the id, this is usually the path to the document
	IndexValue []byte // index this data
	StoreValue []byte // store this data
}

Contents of a single document to be indexed

type Indexer

type Indexer struct {
	WordSplit     WordSplitter
	WordClean     WordCleaner
	StopWordCheck StopWordChecker
	// contains filtered or unexported fields
}

Produces a set of cdb files from a series of AddDoc() calls

func NewIndexer

func NewIndexer(tempDir string) (*Indexer, error)

Creates a new indexer, using the given temp dir while building the index.

func (*Indexer) AddDoc

func (idx *Indexer) AddDoc(idoc IndexDoc) error

Add a document to the index - writes to temporary files and stores some data in memory while building the index.

func (*Indexer) Close

func (idx *Indexer) Close()

Close and remove all resources

func (*Indexer) DumpStatus

func (idx *Indexer) DumpStatus(w io.Writer)

Dump some human readable status information
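For example, while debugging an index build (writing to os.Stderr is just one choice of io.Writer):

idx.DumpStatus(os.Stderr) // print human readable status for the in-progress index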

func (*Indexer) FinalizeAndWrite

func (idx *Indexer) FinalizeAndWrite(w io.Writer) error

Builds a final single index file, which consists of some simple header info, followed by the cdb binary files that comprise the full index.

type SearchResultItem

type SearchResultItem struct {
	Id         []byte // id of this item (document)
	StoreValue []byte // the stored value of this document
	Score      int64  // the total score
}

A single item in a search result

type SearchResultItems

type SearchResultItems []SearchResultItem

Implement sort.Interface

func (SearchResultItems) Len

func (s SearchResultItems) Len() int

func (SearchResultItems) Less

func (s SearchResultItems) Less(i, j int) bool

func (SearchResultItems) Swap

func (s SearchResultItems) Swap(i, j int)
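Because these methods satisfy sort.Interface, a result slice can be re-sorted after a search; the ordering is whatever the package's Less defines (presumably by Score):

import "sort"

sr, err := s.SimpleSearch("Horatio", 20)
if err != nil {
	panic(err)
}
sort.Sort(sr.Items) // reorder in place using the package's Less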

type SearchResults

type SearchResults struct {
	Items SearchResultItems
}

What happened during the search

type Searcher

type Searcher struct {
	// contains filtered or unexported fields
}

Interface for search. Not thread-safe, but low overhead so having a separate one per thread should be workable.
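In practice that means one Searcher per goroutine rather than a shared instance. A minimal sketch (the queries and index path are placeholders; uses fmt, log, sync and fulltext):

var wg sync.WaitGroup
for _, q := range []string{"Horatio", "Yorick"} {
	wg.Add(1)
	go func(query string) {
		defer wg.Done()
		// each goroutine opens its own Searcher; they are cheap but not thread-safe
		s, err := fulltext.NewSearcher("/path/to/index/file")
		if err != nil {
			log.Println(err)
			return
		}
		defer s.Close()
		sr, err := s.SimpleSearch(query, 20)
		if err != nil {
			log.Println(err)
			return
		}
		fmt.Printf("%s: %d results\n", query, len(sr.Items))
	}(q)
}
wg.Wait()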

func NewSearcher

func NewSearcher(fpath string) (*Searcher, error)

Make a new searcher using the file at the specified path. TODO: Make a variation that accepts a ReaderAt.

func (*Searcher) Close

func (s *Searcher) Close() error

Close and release resources

func (*Searcher) SimpleSearch

func (s *Searcher) SimpleSearch(search string, maxn int) (SearchResults, error)

Perform a search

type StopWordChecker

type StopWordChecker func(string) bool

type WordCleaner

type WordCleaner func(string) string

type WordSplitter

type WordSplitter func(string) []string
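These callback types correspond to the Indexer fields WordSplit, WordClean and StopWordCheck shown above, so the whole tokenization pipeline can be swapped out. A sketch - the strings-based splitter and cleaner are stand-ins of mine, not the package defaults:

idx, err := fulltext.NewIndexer("")
if err != nil {
	panic(err)
}
defer idx.Close()

idx.WordSplit = strings.Fields                      // naive whitespace splitting instead of Wordize
idx.WordClean = strings.ToLower                     // lowercasing only, instead of IndexizeWord
idx.StopWordCheck = fulltext.EnglishStopWordChecker // the built-in English stop word list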
