Documentation
Functions ¶
func DefaultSelectTopHit ¶
func DefaultSelectTopHit(_ *gorm.DB, wa *models.WebArticle, hits hnswclient.Hits) (*hnswclient.Hit, error)
DefaultSelectTopHit is the default implementation of DuplicateDetector.SelectTopHit.
It returns the first element of "hits", if any, after skipping the hit whose ID equals the ID of "wa" itself and any hit whose ID is larger than the ID of "wa". Ignoring larger IDs prevents mutual similarity, where two WebArticles would each be marked as a duplicate of the other.
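The filtering rule described above can be sketched as follows. This is a minimal illustration, not the package's actual source: Hit and Hits are simplified stand-ins for hnswclient.Hit and hnswclient.Hits (the uint ID type and the slice-of-pointers layout are assumptions), and selectTopHit is a hypothetical name that takes the article ID directly instead of a *models.WebArticle.

```go
package main

import "fmt"

// Hit and Hits are simplified stand-ins for hnswclient.Hit and
// hnswclient.Hits. The field types are assumptions made for this sketch.
type Hit struct {
	ID       uint
	Distance float32
}

type Hits []*Hit

// selectTopHit (hypothetical name) returns the first hit whose ID is
// strictly smaller than articleID. This skips the article itself and
// every article with a larger ID. It returns nil when no hit qualifies.
func selectTopHit(articleID uint, hits Hits) *Hit {
	for _, h := range hits {
		if h == nil || h.ID >= articleID {
			continue
		}
		return h
	}
	return nil
}

func main() {
	hits := Hits{
		{ID: 42, Distance: 0},   // the article itself: skipped
		{ID: 57, Distance: 0.1}, // larger ID: ignored
		{ID: 17, Distance: 0.2}, // first eligible hit
		{ID: 3, Distance: 0.4},
	}
	if top := selectTopHit(42, hits); top != nil {
		fmt.Printf("parent ID %d at distance %.1f\n", top.ID, top.Distance)
	}
}
```

Because only hits with a smaller ID are eligible, article 57 could later select article 42 as its parent, but 42 can never select 57, so the parent relation cannot form a two-way cycle.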
Types ¶
type DuplicateDetector ¶
type DuplicateDetector struct {
	// A custom function can be assigned for selecting the topmost similar
	// hit among all HNSW KNN search results.
	//
	// The default value is DefaultSelectTopHit.
	SelectTopHit SelectTopHitFn

	basemodelworker.Worker
	// contains filtered or unexported fields
}
DuplicateDetector implements a Faktory worker for performing near-duplicate detection over existing WebArticles.
func New ¶
func New(
	conf config.DuplicateDetector,
	db *gorm.DB,
	hnswClient *hnswclient.Client,
	fk *faktory_worker.Manager,
) *DuplicateDetector
New creates a new DuplicateDetector.
type SelectTopHitFn ¶
type SelectTopHitFn func(tx *gorm.DB, wa *models.WebArticle, hits hnswclient.Hits) (*hnswclient.Hit, error)
SelectTopHitFn is a function type for selecting the top similar entry among all KNN search hits.
Arguments:
- tx: the Gorm transaction instance created for the current job. It can be used for getting data from the DB in order to implement specific filtering criteria; otherwise it can be ignored.
- wa: the WebArticle whose Vector (already preloaded) was used for HNSW KNN Search, obtaining the "hits". This value MUST NOT be modified.
- hits: the value returned from hnswclient.Client.SearchKNN(). Please note that, according to the default implementation of other workers, it might always include the ID of the WebArticle "wa" itself. This value MUST NOT be modified.
Returned values:
- If a non-nil *Hit is returned with no error, "wa" will be considered a duplicate of the "parent" WebArticle identified by Hit.ID. The Hit's ID and Distance will be stored in the new SimilarityInfo model associated with "wa", as SimilarityInfo.ParentID and SimilarityInfo.Distance respectively.
- If a nil *Hit is returned with no error, "wa" is not considered a duplicate of another WebArticle (it has no parent). The new SimilarityInfo model associated with "wa" will have neither ParentID nor Distance.
- If the returned error is not nil, the *Hit value will be ignored and the whole job will be aborted.
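As an illustration of the contract above, here is a sketch of a custom selector that accepts a hit only when its distance is within a threshold. Everything in it is an assumption made so that the example compiles on its own: DB, WebArticle, Hit and Hits are stand-ins for gorm.DB, models.WebArticle, hnswclient.Hit and hnswclient.Hits, and selectWithinDistance is a hypothetical helper that is not part of the package.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in types so the sketch is self-contained. In real code these
// would be gorm.DB, models.WebArticle, hnswclient.Hit and hnswclient.Hits;
// the field types shown here are assumptions.
type DB struct{}

type WebArticle struct{ ID uint }

type Hit struct {
	ID       uint
	Distance float32
}

type Hits []*Hit

// SelectTopHitFn mirrors the shape of the documented function type.
type SelectTopHitFn func(tx *DB, wa *WebArticle, hits Hits) (*Hit, error)

// selectWithinDistance (hypothetical) builds a selector that returns the
// first hit with a smaller ID than "wa" and a distance of at most
// maxDistance. It never modifies "wa" or "hits".
func selectWithinDistance(maxDistance float32) SelectTopHitFn {
	return func(_ *DB, wa *WebArticle, hits Hits) (*Hit, error) {
		if wa == nil {
			// A non-nil error aborts the whole job.
			return nil, errors.New("nil WebArticle")
		}
		for _, h := range hits {
			if h == nil || h.ID >= wa.ID {
				continue // skip "wa" itself and later articles
			}
			if h.Distance <= maxDistance {
				return h, nil // "wa" is a duplicate of h.ID
			}
		}
		return nil, nil // no parent: "wa" is not a duplicate
	}
}

func main() {
	selectFn := selectWithinDistance(0.25)
	wa := &WebArticle{ID: 42}
	hits := Hits{
		{ID: 42, Distance: 0},   // the article itself
		{ID: 17, Distance: 0.5}, // earlier article, but too far away
		{ID: 3, Distance: 0.2},  // earlier article, close enough
	}

	hit, err := selectFn(nil, wa, hits)
	switch {
	case err != nil:
		fmt.Println("job aborted:", err)
	case hit == nil:
		fmt.Println("no parent: not a duplicate")
	default:
		fmt.Printf("duplicate of WebArticle %d\n", hit.ID)
	}
}
```

With the real package, a function of this shape would be assigned to the SelectTopHit field of the DuplicateDetector returned by New, replacing DefaultSelectTopHit.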