dupi

v0.0.2 (latest) Published: Sep 15, 2021 License: Apache-2.0 Imports: 19 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/go-air/dupi

README ¶

⊧ dupi

Dupi is an engine for identifying and exploring duplicative text in sets of documents.

Status

Dupi is in alpha/early beta development stage. Expect bugs.

Input

Throw hundreds of thousands of textual documents at it. Or extract text from other documents and send that to dupi.

Output

Find and query for repeated chunks of text.
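
To make "repeated chunks of text" concrete, here is a toy sketch that hashes every window of a few consecutive tokens and reports the hashes seen in more than one document. It is not dupi's algorithm or API: the names `shingleHashes` and `repeated`, the whitespace tokenizer, and the FNV hash are all assumptions chosen for illustration only.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// shingleHashes hashes every window of seqLen consecutive
// whitespace-separated tokens in text. Hypothetical helper.
func shingleHashes(text string, seqLen int) []uint32 {
	toks := strings.Fields(text)
	if seqLen <= 0 || len(toks) < seqLen {
		return nil
	}
	out := make([]uint32, 0, len(toks)-seqLen+1)
	for i := 0; i+seqLen <= len(toks); i++ {
		h := fnv.New32a()
		h.Write([]byte(strings.Join(toks[i:i+seqLen], " ")))
		out = append(out, h.Sum32())
	}
	return out
}

// repeated maps each window hash to the sorted paths of the documents
// containing it, keeping only hashes seen in two or more documents.
func repeated(docs map[string]string, seqLen int) map[uint32][]string {
	seen := map[uint32]map[string]bool{}
	for path, body := range docs {
		for _, h := range shingleHashes(body, seqLen) {
			if seen[h] == nil {
				seen[h] = map[string]bool{}
			}
			seen[h][path] = true
		}
	}
	out := map[uint32][]string{}
	for h, paths := range seen {
		if len(paths) < 2 {
			continue
		}
		for p := range paths {
			out[h] = append(out[h], p)
		}
		sort.Strings(out[h])
	}
	return out
}

func main() {
	docs := map[string]string{
		"a.txt": "the quick brown fox jumps",
		"b.txt": "a quick brown fox sleeps",
	}
	for h, paths := range repeated(docs, 3) {
		fmt.Printf("%08x %v\n", h, paths)
	}
}
```

A real engine has to cope with hash collisions, tokenization choices, and data that does not fit in memory; this sketch ignores all three.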

Tutorial

Design

Design Document

Documentation ¶

Overview ¶

Package dupi provides a library for exploring duplicate data in large sets of documents.

Index ¶

Variables
func RemoveIndex(idx *Index) error
func RemoveIndexer(idx *Indexer) error
type Blot
type Config
type Doc
- func NewDoc(path, body string) *Doc
type Index
- func OpenIndex(root string) (*Index, error)
type Indexer
type Query
- func (q *Query) Next(dst []Blot) (n int, err error)
type QueryStrategy

Constants ¶

This section is empty.

Variables ¶

var ErrInvalidQueryState = errors.New("query state invalid")
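
Because ErrInvalidQueryState is an exported sentinel, callers would normally test for it with `errors.Is` rather than by comparing strings. The sketch below uses a local stand-in variable with the same message so that it compiles on its own; the `step` function is hypothetical, and which dupi calls actually return this error is not stated in this documentation.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidQueryState is a local stand-in for the documented variable.
var ErrInvalidQueryState = errors.New("query state invalid")

// step is a hypothetical operation that fails by wrapping the sentinel
// with extra context.
func step(ok bool) error {
	if ok {
		return nil
	}
	return fmt.Errorf("advancing query: %w", ErrInvalidQueryState)
}

// isInvalidState reports whether err is, or wraps, the sentinel.
func isInvalidState(err error) bool {
	return errors.Is(err, ErrInvalidQueryState)
}

func main() {
	err := step(false)
	fmt.Println(err, isInvalidState(err))
}
```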

Functions ¶

func RemoveIndex ¶

func RemoveIndex(idx *Index) error

func RemoveIndexer ¶

func RemoveIndexer(idx *Indexer) error

Types ¶

type Blot ¶

type Blot struct {
	Blot uint32
	Docs []Doc
}

Blot represents a piece of a query or extraction. The field Blot gives the blot which was witnessed in the docs specified in the field Docs.

The caller of Query.Next supplies a slice of Blots; its length tells the index/query implementation how many blots to return results for.

For each Blot in that slice, the field Docs can either be nil, indicating to show all documents, or non-nil, in which case up to cap(Docs) - len(Docs) doc records are returned, each associated with Blot.
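
A minimal sketch of preparing such a slice, assuming the spare capacity of Docs (capacity minus length) is what bounds the number of records returned per blot. The `Doc` and `Blot` types here are local stand-ins that copy the documented exported fields so the example compiles by itself; `newResultBuffer` and `room` are hypothetical helpers, not part of dupi.

```go
package main

import "fmt"

// Doc and Blot are local stand-ins mirroring the documented fields.
type Doc struct {
	Path  string
	Start uint32
	End   uint32
}

type Blot struct {
	Blot uint32
	Docs []Doc
}

// newResultBuffer returns nBlots entries, each with an empty, non-nil
// Docs slice that has room for docsPerBlot records.
func newResultBuffer(nBlots, docsPerBlot int) []Blot {
	dst := make([]Blot, nBlots)
	for i := range dst {
		dst[i].Docs = make([]Doc, 0, docsPerBlot)
	}
	return dst
}

// room reports how many more Doc records fit in b.Docs without
// growing the slice.
func room(b *Blot) int {
	return cap(b.Docs) - len(b.Docs)
}

func main() {
	dst := newResultBuffer(4, 8)
	fmt.Println(len(dst), room(&dst[0]))
}
```

Leaving Docs nil instead would, per the text above, ask for all documents for that blot.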

func (*Blot) Cap ¶

func (b *Blot) Cap() int

func (*Blot) Doc ¶

func (b *Blot) Doc(i int) *Doc

func (*Blot) Len ¶

func (b *Blot) Len() int

func (*Blot) Next ¶

func (b *Blot) Next() *Doc

type Config ¶

type Config struct {
	IndexRoot   string
	SeqLen      int
	NumBuckets  int
	NumShatters int

	// How frequently buckets write document
	// data to disk.  Higher= less memory,
	// more frequent i/o.
	// Frequency in terms of number of documents.
	DocFlushRate int

	TokenConfig token.Config
	BlotConfig  blotter.Config
}

func DefaultConfig ¶

func DefaultConfig(root string) (*Config, error)

func NewConfig ¶

func NewConfig(root string, nbuckets, seqLen int) (*Config, error)

func ReadConfig ¶

func ReadConfig(root string) (*Config, error)

func (*Config) DmdPath ¶

func (cfg *Config) DmdPath() string

func (*Config) FnamesPath ¶

func (cfg *Config) FnamesPath() string

func (*Config) IixPath ¶

func (cfg *Config) IixPath(i int) string

func (*Config) LockPath ¶

func (cfg *Config) LockPath() string

func (*Config) Path ¶

func (cfg *Config) Path() string

func (*Config) PostPath ¶

func (cfg *Config) PostPath(i int) string

func (*Config) Write ¶

func (cfg *Config) Write() error

type Doc ¶

type Doc struct {
	Path  string
	Start uint32
	End   uint32
	Dat   []byte `json:"-"`
}

func NewDoc ¶

func NewDoc(path, body string) *Doc

type Index ¶

type Index struct {
	// contains filtered or unexported fields
}

func OpenIndex ¶

func OpenIndex(root string) (*Index, error)

func (*Index) Close ¶

func (x *Index) Close() error

func (*Index) Root ¶

func (x *Index) Root() string

func (*Index) StartQuery ¶

func (x *Index) StartQuery(s QueryStrategy) *Query

type Indexer ¶

type Indexer struct {
	// contains filtered or unexported fields
}

Indexer is a struct for duplicate indexing.

func CreateIndexer ¶

func CreateIndexer(root string, nbuckets, seqLen int) (*Indexer, error)

CreateIndexer attempts to create a new dupi index. root is the directory root of the index, and nbuckets states how many buckets to use. The comment also mentions docCap (a conservative estimate of the number of documents) and toksPerDoc (about how many tokens are expected per document); neither appears in this signature, which takes seqLen instead.

func IndexerFromConfig ¶

func IndexerFromConfig(cfg *Config) (*Indexer, error)

func OpenIndexer ¶

func OpenIndexer(root string) (*Indexer, error)

func (*Indexer) Add ¶

func (x *Indexer) Add(doc *Doc) error

Add adds 'doc' to the index.

func (*Indexer) Close ¶

func (x *Indexer) Close() error

Close attempts to flush all data associated with the index to disk.

func (*Indexer) Root ¶

func (x *Indexer) Root() string

Root returns the path to the root of the index 'x'. The returned root is an absolute path.

type Query ¶

type Query struct {
	// contains filtered or unexported fields
}

func (*Query) Next ¶

func (q *Query) Next(dst []Blot) (n int, err error)
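
The signature suggests a batch-style loop: hand Next a reusable slice, consume the first n entries, repeat. The sketch below shows that pattern against a fake source. How the real Query signals exhaustion is not documented here, so the fake returning n == 0 with a nil error is an assumption, as are the `blotSource` interface, `fakeQuery`, and `drain`.

```go
package main

import (
	"errors"
	"fmt"
)

// Blot is a local stand-in carrying only the documented Blot field.
type Blot struct {
	Blot uint32
}

// blotSource is a hypothetical interface matching the Next signature.
type blotSource interface {
	Next(dst []Blot) (n int, err error)
}

// fakeQuery hands out a fixed list of blots, len(dst) at a time.
type fakeQuery struct {
	blots []uint32
	pos   int
}

func (q *fakeQuery) Next(dst []Blot) (int, error) {
	n := 0
	for n < len(dst) && q.pos < len(q.blots) {
		dst[n].Blot = q.blots[q.pos]
		q.pos++
		n++
	}
	return n, nil
}

// drain reads batches until the source reports an error or returns
// zero results (the assumed end-of-results signal).
func drain(src blotSource, batch int) ([]uint32, error) {
	if batch <= 0 {
		return nil, errors.New("batch must be positive")
	}
	dst := make([]Blot, batch)
	var all []uint32
	for {
		n, err := src.Next(dst)
		for i := 0; i < n; i++ {
			all = append(all, dst[i].Blot)
		}
		if err != nil {
			return all, err
		}
		if n == 0 {
			return all, nil
		}
	}
}

func main() {
	all, err := drain(&fakeQuery{blots: []uint32{1, 2, 3, 4, 5}}, 2)
	fmt.Println(all, err)
}
```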

type QueryStrategy ¶

type QueryStrategy int

const (
	QueryMaxBlot QueryStrategy = iota
	QueryMaxDoc
	QueryRandom
)
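
When a strategy comes from a flag or a log line, a name is easier to work with than a bare integer. The sketch below redeclares the type and constants locally so it compiles on its own, then adds a `String` method and a `parseStrategy` helper. Both are hypothetical: this page does not show whether dupi itself provides either.

```go
package main

import "fmt"

// QueryStrategy and its constants are local copies of the documented
// declarations.
type QueryStrategy int

const (
	QueryMaxBlot QueryStrategy = iota
	QueryMaxDoc
	QueryRandom
)

// String returns the constant's name, or a numeric form for unknown values.
func (s QueryStrategy) String() string {
	switch s {
	case QueryMaxBlot:
		return "QueryMaxBlot"
	case QueryMaxDoc:
		return "QueryMaxDoc"
	case QueryRandom:
		return "QueryRandom"
	default:
		return fmt.Sprintf("QueryStrategy(%d)", int(s))
	}
}

// parseStrategy is the inverse of String for the three known names.
func parseStrategy(name string) (QueryStrategy, error) {
	for _, s := range []QueryStrategy{QueryMaxBlot, QueryMaxDoc, QueryRandom} {
		if s.String() == name {
			return s, nil
		}
	}
	return 0, fmt.Errorf("unknown query strategy %q", name)
}

func main() {
	s, err := parseStrategy("QueryMaxDoc")
	fmt.Println(s, int(s), err)
}
```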

Directories ¶

Path	Synopsis
attic	Package attic contains interesting dead ends.
ibloom	Package ibloom implements a bloom filter on integer (uint32) sets.
trigram	Package trigram supports a trigram alphabet for dupi.
blotter	Package blotter provides fingerprinting for dupi docs.
cmd
dupenron
dupi	Command dupi is the dupi command line.
dmd	Package dmd maps document, offset pairs to internal document ids.
internal
shard	Package shard implements sharded posting indices.
lock	Package lock provides file based cooperative locking.
post	Package post provides a data structure coupling dupi blots with dupi internal document ids.
token	Package token tokenizes data for dupi.