Documentation ¶
Constants ¶
const (
	// TODO: this should be lowered to a reasonable amount (eg: 1024-2048-4096)
	MaxChromeURLLength = 2097152
	// TODO: fine tune the number
	MinSequenceLength = 10
	MaxSequenceCount  = 10
)
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Filter ¶
type Filter interface {
	// Close closes the filter and releases associated resources
	Close()
	// UniqueURL specifies whether a URL is unique
	UniqueURL(url string) bool
	// UniqueContent specifies whether a content is unique
	// Deduplication is done by hashing of the response data.
	//
	// TODO: Consider levenshtein length / keyword based hashing
	// to account for dynamic response content.
	UniqueContent(content []byte) bool
	// IsCycle attempts to detect if the current URL is a cycle
	// until graph navigation is implemented, the only ways to discard a potential
	// loop cycle are
	// - implementing upper hard limit to the URL length (https://bugs.chromium.org/p/chromium/issues/detail?id=69227 => 2Mb)
	// - Heuristically find the longest repeating substring and set a max threshold of how many max times it should repeat (eg. 10)
	// TODO: This should be replaced with graph cycle detection => https://github.com/wangsir01/katana/pull/174
	IsCycle(url string) bool
}
Filter is the interface implemented by deduplication mechanisms.
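The two IsCycle heuristics described in the interface comments (a hard limit on URL length, and a cap on how often a repeated substring may occur) might be sketched as follows. This is an illustrative sketch, not the package's implementation: isCycleSketch and the lowercase constants are hypothetical names, the constant values are copied from the Constants section, and the sketch makes two simplifying assumptions. It counts repeats of fixed-length windows rather than searching for the longest repeating substring, and it flags a URL once a window has occurred maxSequenceCount times without overlap.

```go
package main

// Values copied from the documented constants; the names are local to this sketch.
const (
	maxURLLength      = 2097152
	minSequenceLength = 10
	maxSequenceCount  = 10
)

// isCycleSketch is a hypothetical helper that approximates the documented
// heuristics. It is not the code behind Filter.IsCycle.
func isCycleSketch(url string) bool {
	// Heuristic 1: hard upper limit on the URL length.
	if len(url) > maxURLLength {
		return true
	}

	// Heuristic 2 (simplified): count non-overlapping repeats of every
	// window of minSequenceLength bytes.
	type occurrence struct {
		count   int
		nextPos int // first index at which a non-overlapping repeat may start
	}
	windows := make(map[string]*occurrence)

	for i := 0; i+minSequenceLength <= len(url); i++ {
		w := url[i : i+minSequenceLength]
		o, ok := windows[w]
		if !ok {
			o = &occurrence{}
			windows[w] = o
		}
		if i < o.nextPos {
			continue // overlaps the previously counted occurrence
		}
		o.count++
		o.nextPos = i + minSequenceLength
		// Assumption: reaching the threshold is enough to flag a cycle.
		if o.count >= maxSequenceCount {
			return true
		}
	}
	return false
}
```

With these values, a URL whose path repeats a 10-byte segment such as "/page/next" ten times is reported as a cycle, while nine repetitions are not. Whether the real threshold is inclusive is not stated in the documentation.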
type Simple ¶
type Simple struct {
// contains filtered or unexported fields
}
Simple is a simple unique URL filter.
The URLs are maintained in a global sync.Map for deduplication and no normalization is performed.
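A filter of this kind might look like the sketch below. It is an assumption-laden illustration rather than the source of Simple: urlSetSketch is a hypothetical type, the sync.Map is held in a struct field instead of a global, and Close is assumed to release resources by deleting the stored keys. Because no normalization is performed, URLs that differ only by a trailing slash count as distinct.

```go
package main

import "sync"

// urlSetSketch is a hypothetical stand-in for a simple unique URL filter.
type urlSetSketch struct {
	seen sync.Map // URL string -> struct{}
}

// UniqueURL reports whether url has not been seen before and records it.
// LoadOrStore performs the check and the insert atomically, so concurrent
// callers cannot both receive true for the same URL.
func (f *urlSetSketch) UniqueURL(url string) bool {
	_, loaded := f.seen.LoadOrStore(url, struct{}{})
	return !loaded
}

// Close drops every recorded URL (an assumed interpretation of
// "releases associated resources").
func (f *urlSetSketch) Close() {
	f.seen.Range(func(key, _ any) bool {
		f.seen.Delete(key)
		return true
	})
}
```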
func (*Simple) Close ¶
func (s *Simple) Close()
Close closes the filter and releases associated resources.
func (*Simple) UniqueContent ¶
func (s *Simple) UniqueContent(content []byte) bool
UniqueContent returns true if the content is unique
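The Filter interface notes that content deduplication is done by hashing the response data. A minimal sketch of that idea follows, under stated assumptions: contentSetSketch and newContentSetSketch are hypothetical names, SHA-256 is chosen only for illustration (the documentation does not say which hash is used), and a mutex-guarded map stands in for whatever storage the package uses.

```go
package main

import (
	"crypto/sha256"
	"sync"
)

// contentSetSketch is a hypothetical content deduplicator keyed by digest.
type contentSetSketch struct {
	mu   sync.Mutex
	seen map[[sha256.Size]byte]struct{}
}

// newContentSetSketch returns an empty deduplicator.
func newContentSetSketch() *contentSetSketch {
	return &contentSetSketch{seen: make(map[[sha256.Size]byte]struct{})}
}

// UniqueContent reports whether content has not been seen before and
// records its digest. Only the 32-byte digest is stored, not the body.
func (c *contentSetSketch) UniqueContent(content []byte) bool {
	sum := sha256.Sum256(content)

	c.mu.Lock()
	defer c.mu.Unlock()

	if _, ok := c.seen[sum]; ok {
		return false
	}
	c.seen[sum] = struct{}{}
	return true
}
```

Exact hashing treats two responses as duplicates only when they are byte-for-byte identical, which is why the interface's TODO mentions Levenshtein or keyword-based hashing for dynamic content.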