dupi

v0.0.8
Published: Sep 17, 2021 License: Apache-2.0 Imports: 20 Imported by: 0

README

⊧ dupi

Dupi is an engine for identifying and exploring duplicative text in sets of documents.

Status

Dupi is in alpha/early beta development stage. Please feel free to give it a try (and file issues). We have run it on several document sets successfully, but it definitely needs more testing to be deployed for commercial purposes.

Input

Throw hundreds of thousands of textual documents at it. Or extract text from other documents and send that to dupi.

Output

Find and query for repeated chunks of text.

Tutorial

Tutorial

Design

Design Document

Documentation

Overview

Package dupi provides a library for exploring duplicate data in large sets of documents.
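To make "duplicate data" concrete, the following is a minimal sketch of one generic way to detect repeated text: hash every window of k consecutive tokens and look for hashes that occur in more than one document. This is not taken from dupi's source. The names shingles and shared, the whitespace tokenizer, and the FNV hash are all assumptions chosen for illustration; dupi's own tokenizing and fingerprinting live in its token and blotter packages and may work differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// shingles returns one 32-bit hash per window of k consecutive tokens.
// Hypothetical helper for illustration; not part of dupi.
func shingles(text string, k int) []uint32 {
	toks := strings.Fields(strings.ToLower(text))
	if k <= 0 || len(toks) < k {
		return nil
	}
	out := make([]uint32, 0, len(toks)-k+1)
	for i := 0; i+k <= len(toks); i++ {
		h := fnv.New32a()
		for _, t := range toks[i : i+k] {
			h.Write([]byte(t))
			h.Write([]byte{0}) // separator, so ("ab","c") differs from ("a","bc")
		}
		out = append(out, h.Sum32())
	}
	return out
}

// shared counts the distinct window hashes that occur in both texts.
func shared(a, b string, k int) int {
	seen := make(map[uint32]bool)
	for _, h := range shingles(a, k) {
		seen[h] = true
	}
	n := 0
	for _, h := range shingles(b, k) {
		if seen[h] {
			n++
			seen[h] = false // count each shared hash only once
		}
	}
	return n
}

func main() {
	a := "the quick brown fox jumps"
	b := "a quick brown fox sleeps"
	// Both texts contain the window "quick brown fox".
	fmt.Println(shared(a, b, 3))
}
```

A larger k gives fewer accidental matches but misses shorter repeated passages; the SeqLen setting in Config presumably plays a comparable role, though the source does not spell that out.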

Index

Constants

This section is empty.

Variables

var ErrInvalidQueryState = errors.New("query state invalid")
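Because ErrInvalidQueryState is an exported sentinel value, callers can test for it with errors.Is, which also matches wrapped errors. The sketch below redeclares the variable locally so it compiles without the dupi module. The helper nextBatch is hypothetical, and the source does not say which calls actually return this error.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidQueryState is a local stand-in for dupi.ErrInvalidQueryState,
// redeclared here only so the example is self-contained.
var ErrInvalidQueryState = errors.New("query state invalid")

// nextBatch is a hypothetical function that fails with the sentinel,
// wrapped in extra context, when the query is not in a valid state.
func nextBatch(valid bool) error {
	if !valid {
		return fmt.Errorf("next batch: %w", ErrInvalidQueryState)
	}
	return nil
}

// isInvalidState reports whether err is, or wraps, the sentinel.
func isInvalidState(err error) bool {
	return errors.Is(err, ErrInvalidQueryState)
}

func main() {
	fmt.Println(isInvalidState(nextBatch(false))) // true
	fmt.Println(isInvalidState(nextBatch(true)))  // false
}
```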

Functions

func RemoveIndex

func RemoveIndex(idx *Index) error

func RemoveIndexer

func RemoveIndexer(idx *Indexer) error

Types

type Blot

type Blot struct {
	Blot uint32
	Docs []Doc
}

Blot represents a piece of a query or extraction. The field Blot gives the blot which was witnessed in the docs specified in the field Docs.

The caller of Query.Next supplies a slice of Blots, which tells the index/query implementation how many blots we would like results for.

For each Blot in that slice, the field Docs can either be nil, indicating that all documents should be shown, or non-nil, in which case up to cap(Docs) - len(Docs) doc records are returned, each associated with Blot.
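The Docs convention can be sketched with plain slices. The example assumes the bound is the spare capacity of the slice, cap(Docs) - len(Docs), since a slice's length never exceeds its capacity. The types below mirror the documented fields, and fill is a hypothetical stand-in for what a query might do; it is not dupi's implementation.

```go
package main

import "fmt"

// Doc and Blot mirror the exported fields documented for dupi
// (Doc.Dat is left out for brevity).
type Doc struct {
	Path  string
	Start uint32
	End   uint32
}

type Blot struct {
	Blot uint32
	Docs []Doc
}

// fill is a hypothetical helper: it appends matches to b.Docs.
// A nil Docs accepts every match; a non-nil Docs accepts at most
// cap(Docs) - len(Docs) of them. It returns how many were added.
func fill(b *Blot, matches []Doc) int {
	if b.Docs == nil {
		b.Docs = append(b.Docs, matches...)
		return len(matches)
	}
	room := cap(b.Docs) - len(b.Docs)
	if room > len(matches) {
		room = len(matches)
	}
	b.Docs = append(b.Docs, matches[:room]...)
	return room
}

func main() {
	matches := []Doc{{Path: "a.txt"}, {Path: "b.txt"}, {Path: "c.txt"}}

	limited := Blot{Blot: 42, Docs: make([]Doc, 0, 2)} // room for two docs
	fmt.Println(fill(&limited, matches), len(limited.Docs)) // 2 2

	all := Blot{Blot: 42} // nil Docs: take everything
	fmt.Println(fill(&all, matches), len(all.Docs)) // 3 3
}
```

Pre-sizing Docs with make([]Doc, 0, n) lets the caller cap memory use per blot without an extra parameter.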

func (*Blot) Cap

func (b *Blot) Cap() int

func (*Blot) Doc

func (b *Blot) Doc(i int) *Doc

func (*Blot) Len

func (b *Blot) Len() int

func (*Blot) Next

func (b *Blot) Next(lim bool) *Doc

type Config

type Config struct {
	IndexRoot   string
	SeqLen      int
	NumShards   int
	NumShatters int

	// How frequently buckets write document
	// data to disk.  Higher= less memory,
	// more frequent i/o.
	// Frequency in terms of number of documents.
	DocFlushRate int

	TokenConfig token.Config
	BlotConfig  blotter.Config
}

func DefaultConfig

func DefaultConfig(root string) (*Config, error)

func NewConfig

func NewConfig(root string, nbuckets, seqLen int) (*Config, error)

func ReadConfig

func ReadConfig(root string) (*Config, error)

func (*Config) DmdPath

func (cfg *Config) DmdPath() string

func (*Config) FnamesPath

func (cfg *Config) FnamesPath() string

func (*Config) IixPath

func (cfg *Config) IixPath(i int) string

func (*Config) LockPath

func (cfg *Config) LockPath() string

func (*Config) Path

func (cfg *Config) Path() string

func (*Config) PostPath

func (cfg *Config) PostPath(i int) string

func (*Config) Write

func (cfg *Config) Write() error

type Doc

type Doc struct {
	Path  string
	Start uint32
	End   uint32
	Dat   []byte `json:"-"`
}
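The `json:"-"` tag on Dat means the raw bytes are skipped when a Doc is encoded with encoding/json, so only the path and offsets are serialized. A small sketch with a local copy of the struct (redeclared so the example runs without the dupi module; encode is a hypothetical helper):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Doc is a local copy of the documented struct.
type Doc struct {
	Path  string
	Start uint32
	End   uint32
	Dat   []byte `json:"-"`
}

// encode marshals d to a JSON string.
func encode(d Doc) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	s, err := encode(Doc{Path: "a.txt", Start: 0, End: 5, Dat: []byte("hello")})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(s) // {"Path":"a.txt","Start":0,"End":5}
}
```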

func NewDoc

func NewDoc(path, body string) *Doc

func (*Doc) Load added in v0.0.4

func (doc *Doc) Load() error

type Index

type Index struct {
	// contains filtered or unexported fields
}

func OpenIndex

func OpenIndex(root string) (*Index, error)

func (*Index) BlotDoc added in v0.0.4

func (x *Index) BlotDoc(dst []uint32, doc *Doc) []uint32

func (*Index) Blotter added in v0.0.3

func (x *Index) Blotter() blotter.T

func (*Index) Close

func (x *Index) Close() error

func (*Index) FindBlot added in v0.0.4

func (x *Index) FindBlot(theBlot uint32, doc *Doc) (start, end uint32, err error)

func (*Index) JoinBlot added in v0.0.4

func (x *Index) JoinBlot(shard uint32, sblot uint16) uint32

func (*Index) NumShards added in v0.0.3

func (x *Index) NumShards() int

func (*Index) NumShatters added in v0.0.3

func (x *Index) NumShatters() int

func (*Index) Root

func (x *Index) Root() string

func (*Index) SeqLen added in v0.0.3

func (x *Index) SeqLen() int

func (*Index) SplitBlot added in v0.0.4

func (x *Index) SplitBlot(b uint32) (shard uint32, sblot uint16)

func (*Index) StartQuery

func (x *Index) StartQuery(s QueryStrategy) *Query

func (*Index) Stats added in v0.0.5

func (x *Index) Stats() (*Stats, error)

func (*Index) TokenFunc added in v0.0.3

func (x *Index) TokenFunc() token.TokenizerFunc

type Indexer

type Indexer struct {
	// contains filtered or unexported fields
}

Indexer is a struct for duplicate indexing.

func CreateIndexer

func CreateIndexer(root string, nbuckets, seqLen int) (*Indexer, error)

CreateIndexer attempts to create a new dupi index. root is the directory root of the dupi index. nbuckets states how many buckets to use. The comment also describes docCap (a conservative estimate of the number of documents) and toksPerDoc (about how many tokens are expected per document), neither of which appears in the signature above.

func IndexerFromConfig

func IndexerFromConfig(cfg *Config) (*Indexer, error)

func OpenIndexer

func OpenIndexer(root string) (*Indexer, error)

func (*Indexer) Add

func (x *Indexer) Add(doc *Doc) error

Add adds 'doc' to the index.

func (*Indexer) Close

func (x *Indexer) Close() error

Close attempts to flush all data associated with the index to disk.

func (*Indexer) Root

func (x *Indexer) Root() string

Root returns the path to the root of the index 'x'. The returned root is an absolute path.
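In practice this means a relative directory passed at creation time would come back from Root in absolute form. The sketch below shows that kind of normalization with the standard library. absRoot is a hypothetical helper, and whether dupi resolves the path with filepath.Abs specifically is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// absRoot is a hypothetical helper that resolves a possibly relative
// index root against the current working directory.
func absRoot(root string) (string, error) {
	return filepath.Abs(root)
}

func main() {
	r, err := absRoot("my-index")
	if err != nil {
		fmt.Println("cannot resolve root:", err)
		return
	}
	fmt.Println(filepath.IsAbs(r)) // true
}
```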

type Query

type Query struct {
	// contains filtered or unexported fields
}

func (*Query) Get added in v0.0.3

func (q *Query) Get(blot *Blot) error

func (*Query) Next

func (q *Query) Next(dst []Blot) (n int, err error)

type QueryStrategy

type QueryStrategy int
const (
	QueryMaxBlot QueryStrategy = iota
	QueryMaxDoc
	QueryRandom
)
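Since the constants are declared with iota, QueryMaxBlot is 0, QueryMaxDoc is 1, and QueryRandom is 2. The documentation lists no String method for QueryStrategy, so a caller who wants readable names could use a small helper like the hypothetical name function below (the type is redeclared locally so the example is self-contained).

```go
package main

import "fmt"

// QueryStrategy and its constants are copied from the declaration above.
type QueryStrategy int

const (
	QueryMaxBlot QueryStrategy = iota
	QueryMaxDoc
	QueryRandom
)

// name is a hypothetical helper that maps a strategy to its identifier.
func name(s QueryStrategy) string {
	switch s {
	case QueryMaxBlot:
		return "QueryMaxBlot"
	case QueryMaxDoc:
		return "QueryMaxDoc"
	case QueryRandom:
		return "QueryRandom"
	default:
		return fmt.Sprintf("QueryStrategy(%d)", int(s))
	}
}

func main() {
	for _, s := range []QueryStrategy{QueryMaxBlot, QueryMaxDoc, QueryRandom} {
		fmt.Println(int(s), name(s))
	}
}
```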

type Stats added in v0.0.5

type Stats struct {
	Root      string
	NumDocs   uint64
	NumPaths  uint64
	NumPosts  uint64
	NumBlots  uint64
	BlotMean  float64
	BlotSigma float64
}

func (*Stats) String added in v0.0.5

func (st *Stats) String() string

Directories

Path        Synopsis
attic       Package attic contains interesting dead ends.
  ibloom    Package ibloom implements a bloom filter on integer (uint32) sets.
  trigram   Package trigram supports a trigram alphabet for dupi.
blotter     Package blotter provides fingerprinting for dupi docs.
cmd
  dupi      Command dupi is the dupi command line.
dmd         Package dmd maps document, offset pairs to internal document ids.
internal
  shard     Package shard implements sharded posting indices.
lock        Package lock provides file based cooperative locking.
post        Package post provides a data structure coupling dupi blots with dupi internal document ids.
token       Package token tokenizes data for dupi.
