Documentation ¶
Overview ¶
Package dupi provides a library for exploring duplicate data in large sets of documents.
Index ¶
- Variables
- func RemoveIndex(idx *Index) error
- func RemoveIndexer(idx *Indexer) error
- type Blot
- type Config
- type Doc
- type Index
- func (x *Index) BlotDoc(dst []uint32, doc *Doc) []uint32
- func (x *Index) Blotter() blotter.T
- func (x *Index) Close() error
- func (x *Index) FindBlot(theBlot uint32, doc *Doc) (start, end uint32, err error)
- func (x *Index) JoinBlot(shard uint32, sblot uint16) uint32
- func (x *Index) NumShards() int
- func (x *Index) NumShatters() int
- func (x *Index) Root() string
- func (x *Index) SeqLen() int
- func (x *Index) SplitBlot(b uint32) (shard uint32, sblot uint16)
- func (x *Index) StartQuery(s QueryStrategy) *Query
- func (x *Index) TokenFunc() token.TokenizerFunc
- type Indexer
- type Query
- type QueryStrategy
Constants ¶
This section is empty.
Variables ¶
var ErrInvalidQueryState = errors.New("query state invalid")
Functions ¶
func RemoveIndex ¶
func RemoveIndexer ¶
Types ¶
type Blot ¶
Blot represents a piece of a query or extraction. The field Blot gives the blot that was witnessed in the documents listed in the field Docs.
The caller of Query.Next supplies a slice of Blots, which tells the index/query implementation how many blots to return results for.
For each Blot in that slice, the field Docs can either be nil, indicating to show all documents, or non-nil, in which case up to cap(Docs) - len(Docs) doc records are returned, each associated with Blot.
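The Docs convention can be illustrated with a small, self-contained sketch. It does not use the dupi package itself: Doc and Blot below are local stand-ins (the Path field is invented for illustration), and fill is a hypothetical helper that imitates how a query might populate one Blot. It assumes the limit is the spare capacity of the slice, cap(Docs) - len(Docs).

```go
package main

import "fmt"

// Doc is a hypothetical stand-in for dupi.Doc; the Path field is
// an assumption made only for this sketch.
type Doc struct {
	Path string
}

// Blot mirrors the two fields named in the description: Blot and Docs.
type Blot struct {
	Blot uint32
	Docs []Doc
}

// fill is a hypothetical helper, not part of dupi. It appends witnessed
// documents to b.Docs. A nil Docs slice means "no limit"; otherwise at
// most cap(b.Docs)-len(b.Docs) records are appended.
func fill(b *Blot, witnessed []Doc) {
	if b.Docs == nil {
		b.Docs = append(b.Docs, witnessed...)
		return
	}
	room := cap(b.Docs) - len(b.Docs)
	if room > len(witnessed) {
		room = len(witnessed)
	}
	b.Docs = append(b.Docs, witnessed[:room]...)
}

func main() {
	witnessed := []Doc{{"a.txt"}, {"b.txt"}, {"c.txt"}}

	all := Blot{Blot: 7}                          // nil Docs: ask for every document
	two := Blot{Blot: 7, Docs: make([]Doc, 0, 2)} // spare capacity for two records

	fill(&all, witnessed)
	fill(&two, witnessed)
	fmt.Println(len(all.Docs), len(two.Docs))
}
```

Under this reading, a caller bounds the per-blot result size by choosing the capacity of Docs up front, without any extra parameter.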
type Config ¶
type Config struct {
	IndexRoot   string
	SeqLen      int
	NumShards   int
	NumShatters int
	// How frequently buckets write document
	// data to disk. Higher = less memory,
	// more frequent i/o.
	// Frequency in terms of number of documents.
	DocFlushRate int
	TokenConfig  token.Config
	BlotConfig   blotter.Config
}
func DefaultConfig ¶
func ReadConfig ¶
func (*Config) FnamesPath ¶
type Index ¶
type Index struct {
// contains filtered or unexported fields
}
func (*Index) NumShatters ¶ added in v0.0.3
func (*Index) StartQuery ¶
func (x *Index) StartQuery(s QueryStrategy) *Query
func (*Index) TokenFunc ¶ added in v0.0.3
func (x *Index) TokenFunc() token.TokenizerFunc
type Indexer ¶
type Indexer struct {
// contains filtered or unexported fields
}
Indexer is a struct for duplicate indexing.
func CreateIndexer ¶
CreateIndexer attempts to create a new dupi index.
root is the directory root of the dupi index.
nbuckets states the number of buckets.
docCap should be a conservative estimate of the number of documents.
toksPerDoc should indicate roughly how many tokens are expected per document.
func IndexerFromConfig ¶
func OpenIndexer ¶
type QueryStrategy ¶
type QueryStrategy int
const (
	QueryMaxBlot QueryStrategy = iota
	QueryMaxDoc
	QueryRandom
)
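Because the constants are declared with iota, their integer values follow declaration order. The sketch below redeclares the type locally so it runs without the dupi package; the name helper is hypothetical and only there to make the values printable.

```go
package main

import "fmt"

// QueryStrategy and its constants are copied from the declaration above
// so that this sketch is self-contained.
type QueryStrategy int

const (
	QueryMaxBlot QueryStrategy = iota // 0
	QueryMaxDoc                       // 1
	QueryRandom                       // 2
)

// name is a hypothetical helper, not part of dupi. It maps a strategy
// to its identifier and falls back to the numeric value.
func name(s QueryStrategy) string {
	switch s {
	case QueryMaxBlot:
		return "QueryMaxBlot"
	case QueryMaxDoc:
		return "QueryMaxDoc"
	case QueryRandom:
		return "QueryRandom"
	default:
		return fmt.Sprintf("QueryStrategy(%d)", int(s))
	}
}

func main() {
	for _, s := range []QueryStrategy{QueryMaxBlot, QueryMaxDoc, QueryRandom} {
		fmt.Printf("%d %s\n", int(s), name(s))
	}
}
```

One of these values is what a caller passes to (*Index).StartQuery to obtain a *Query.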
Directories ¶
Path | Synopsis |
---|---|
attic | Package attic contains interesting dead ends. |
ibloom | Package ibloom implements a bloom filter on integer (uint32) sets. |
trigram | Package trigram supports a trigram alphabet for dupi. |
blotter | Package blotter provides fingerprinting for dupi docs. |
cmd | |
dupi | Command dupi is the dupi command line. |
dmd | Package dmd maps document, offset pairs to internal document ids. |
internal | |
shard | Package shard implements sharded posting indices. |
lock | Package lock provides file-based cooperative locking. |
post | Package post provides a data structure coupling dupi blots with dupi internal document ids. |
token | Package token tokenizes data for dupi. |