duplicatedetector

package
v1.0.0-alpha.2
Warning

This package is not in the latest version of its module.

Published: Oct 1, 2021 License: BSD-2-Clause Imports: 13 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func DefaultSelectTopHit

func DefaultSelectTopHit(_ *gorm.DB, wa *models.WebArticle, hits hnswclient.Hits) (*hnswclient.Hit, error)

DefaultSelectTopHit is the default implementation of DuplicateDetector.SelectTopHit.

It returns the first element of "hits", if any, after skipping the hit whose ID equals the ID of the WebArticle itself and any hit whose ID is larger than the ID of "wa". The second rule prevents two WebArticles from being marked as duplicates of each other.
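The filtering rule described above can be sketched as follows. This is an illustration only, not the package's source: Hit and Hits are simplified stand-ins for hnswclient.Hit and hnswclient.Hits (their real field types are assumptions here), and selectTopHit is a hypothetical helper that takes the article ID directly instead of a *models.WebArticle.

```go
package main

import "fmt"

// Hit and Hits are simplified stand-ins for hnswclient.Hit and
// hnswclient.Hits. The field types are assumptions made for this sketch.
type Hit struct {
	ID       uint
	Distance float32
}

type Hits []*Hit

// selectTopHit is a hypothetical helper mirroring the documented rule:
// return the first hit whose ID is strictly smaller than the article's
// own ID, or nil when no hit qualifies.
func selectTopHit(articleID uint, hits Hits) *Hit {
	for _, h := range hits {
		// h.ID == articleID is the article itself; h.ID > articleID is
		// skipped so that two articles can never point at each other.
		if h == nil || h.ID >= articleID {
			continue
		}
		return h
	}
	return nil
}

func main() {
	hits := Hits{
		{ID: 42, Distance: 0.00}, // the article itself
		{ID: 57, Distance: 0.08}, // newer article, ignored
		{ID: 17, Distance: 0.11}, // older article, selected
	}
	if top := selectTopHit(42, hits); top != nil {
		fmt.Printf("parent=%d distance=%.2f\n", top.ID, top.Distance)
	}
}
```

Because only hits with a smaller ID are eligible, an article can only become a duplicate of one that was stored before it, assuming IDs grow over time.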

Types

type DuplicateDetector

type DuplicateDetector struct {
	// A custom function can be assigned for selecting the topmost similar
	// hit among all HNSW KNN search results.
	//
	// The default value is DefaultSelectTopHit.
	SelectTopHit SelectTopHitFn
	basemodelworker.Worker
	// contains filtered or unexported fields
}

DuplicateDetector implements a Faktory worker for performing near-duplicate detection over existing WebArticles.

func New

func New(
	conf config.DuplicateDetector,
	db *gorm.DB,
	hnswClient *hnswclient.Client,
	fk *faktory_worker.Manager,
) *DuplicateDetector

New creates a new DuplicateDetector.

type SelectTopHitFn

type SelectTopHitFn func(tx *gorm.DB, wa *models.WebArticle, hits hnswclient.Hits) (*hnswclient.Hit, error)

SelectTopHitFn is a function type for selecting the top similar entry among all KNN search hits.

Arguments:

tx: the Gorm transaction instance created for the current job.
    It can be used for getting data from the DB in order to implement
    specific filtering criteria; otherwise it can be ignored.
wa: the WebArticle whose Vector (already preloaded) was used
    for HNSW KNN Search, obtaining the "hits".
    This value MUST NOT be modified.
hits: the value returned from hnswclient.Client.SearchKNN().
      Please note that, given the default implementation of the other
      workers, it is expected to include the ID of the WebArticle "wa" itself.
      This value MUST NOT be modified.

Returned values:

  • If a non-nil *Hit is returned with no error, "wa" will be considered a duplicate of the "parent" WebArticle identified by Hit.ID. The Hit's ID and Distance will be stored in the new SimilarityInfo model associated with "wa", as SimilarityInfo.ParentID and SimilarityInfo.Distance respectively.
  • If a nil *Hit is returned with no error, "wa" is not considered a duplicate of another WebArticle (it has no parent). The new SimilarityInfo model associated with "wa" will have neither ParentID nor Distance set.
  • If the returned error is not nil, the *Hit value will be ignored and the whole job will be aborted.
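As an illustration of the contract above, here is a minimal sketch of a custom selector that also rejects hits beyond a distance limit. Everything in it is a stand-in: DB, WebArticle, Hit, Hits and SelectTopHitFn are local simplifications of the gorm, models and hnswclient types, withMaxDistance is a hypothetical name, and the sketch assumes that a smaller Distance means a closer match.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for *gorm.DB, models.WebArticle, hnswclient.Hit and
// hnswclient.Hits. Field names and types are assumptions for this sketch.
type DB struct{}

type WebArticle struct{ ID uint }

type Hit struct {
	ID       uint
	Distance float32
}

type Hits []*Hit

// SelectTopHitFn has the same shape as the documented function type.
type SelectTopHitFn func(tx *DB, wa *WebArticle, hits Hits) (*Hit, error)

// withMaxDistance is a hypothetical factory: it applies the same ID rule
// as the default selector and additionally rejects hits farther than limit.
func withMaxDistance(limit float32) SelectTopHitFn {
	return func(_ *DB, wa *WebArticle, hits Hits) (*Hit, error) {
		if wa == nil {
			// A non-nil error aborts the whole job.
			return nil, errors.New("nil WebArticle")
		}
		// Neither wa nor hits is modified, as the contract requires.
		for _, h := range hits {
			if h == nil || h.ID >= wa.ID || h.Distance > limit {
				continue
			}
			return h, nil // wa is a duplicate of h.ID
		}
		return nil, nil // wa has no parent
	}
}

func main() {
	selectFn := withMaxDistance(0.10)
	wa := &WebArticle{ID: 42}
	hits := Hits{{ID: 42, Distance: 0}, {ID: 17, Distance: 0.25}, {ID: 9, Distance: 0.05}}

	hit, err := selectFn(nil, wa, hits)
	switch {
	case err != nil:
		fmt.Println("job aborted:", err)
	case hit == nil:
		fmt.Println("no parent")
	default:
		fmt.Println("parent:", hit.ID)
	}
}
```

With the real package, a function of this shape would be assigned to the SelectTopHit field of the DuplicateDetector returned by New, replacing DefaultSelectTopHit.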
