filters

package
v0.0.0-...-24f6000
Warning

This package is not in the latest version of its module.

Published: Aug 20, 2023 License: MIT Imports: 4 Imported by: 0

Documentation

Constants

const (
	// TODO: this should be lowered to a reasonable amount (e.g. 1024, 2048 or 4096)
	MaxChromeURLLength = 2097152
	// TODO: fine-tune these values
	MinSequenceLength = 10
	MaxSequenceCount  = 10
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Filter

type Filter interface {
	// Close closes the filter and releases associated resources
	Close()
	// UniqueURL reports whether the URL has not been seen before
	UniqueURL(url string) bool
	// UniqueContent reports whether the content has not been seen before.
	// Deduplication is done by hashing the response data.
	//
	// TODO: Consider Levenshtein distance / keyword-based hashing
	// to account for dynamic response content.
	UniqueContent(content []byte) bool
	// IsCycle attempts to detect whether the current URL is part of a cycle.
	// Until graph navigation is implemented, the only ways to discard a potential
	// loop are:
	// - enforcing a hard upper limit on the URL length (https://bugs.chromium.org/p/chromium/issues/detail?id=69227 => 2 MB)
	// - heuristically finding the longest repeating substring and capping how many times it may repeat (e.g. 10)
	// TODO: This should be replaced with graph cycle detection => https://github.com/wangsir01/katana/pull/174
	IsCycle(url string) bool
}

Filter is the interface implemented by deduplication mechanisms.
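As a rough illustration of how a caller might consult a Filter before fetching a URL, here is a minimal sketch. The interface is copied locally so the example compiles on its own. `mapFilter`, `newMapFilter`, `shouldVisit`, and the 2048-byte length cut-off are hypothetical names and values chosen for this example, not part of the package.

```go
package main

import "fmt"

// Filter mirrors the documented interface so this sketch is self-contained.
type Filter interface {
	Close()
	UniqueURL(url string) bool
	UniqueContent(content []byte) bool
	IsCycle(url string) bool
}

// mapFilter is a hypothetical, single-goroutine stand-in implementation.
type mapFilter struct {
	urls   map[string]struct{}
	bodies map[string]struct{}
}

func newMapFilter() *mapFilter {
	return &mapFilter{
		urls:   make(map[string]struct{}),
		bodies: make(map[string]struct{}),
	}
}

// Close has nothing to release in this in-memory stand-in.
func (m *mapFilter) Close() {}

// UniqueURL returns true only the first time a given URL string is seen.
func (m *mapFilter) UniqueURL(url string) bool {
	if _, seen := m.urls[url]; seen {
		return false
	}
	m.urls[url] = struct{}{}
	return true
}

// UniqueContent returns true only the first time a given body is seen.
func (m *mapFilter) UniqueContent(content []byte) bool {
	key := string(content)
	if _, seen := m.bodies[key]; seen {
		return false
	}
	m.bodies[key] = struct{}{}
	return true
}

// IsCycle uses an assumed 2048-byte cut-off purely for illustration.
func (m *mapFilter) IsCycle(url string) bool {
	return len(url) > 2048
}

// shouldVisit is a hypothetical helper: skip suspected loops first,
// then skip URLs that were already recorded.
func shouldVisit(f Filter, url string) bool {
	return !f.IsCycle(url) && f.UniqueURL(url)
}

func main() {
	var f Filter = newMapFilter()
	defer f.Close()

	for _, u := range []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/a", // duplicate, rejected on the second pass
	} {
		fmt.Println(u, shouldVisit(f, u))
	}
}
```

Checking IsCycle before UniqueURL keeps suspected loop URLs out of the seen set. Whether that ordering is desirable depends on the crawler.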

type Simple

type Simple struct {
	// contains filtered or unexported fields
}

Simple is a simple unique URL filter.

The URLs are maintained in a global sync.Map for deduplication and no normalization is performed.
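The sync.Map approach described above can be sketched as follows. This is an assumption-laden illustration rather than the package's code: `urlSet` is a hypothetical type, and the map is held in a struct field instead of a package-level variable to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// urlSet is a hypothetical wrapper around sync.Map used only for this sketch.
type urlSet struct {
	seen sync.Map
}

// UniqueURL atomically records the URL and reports whether it was new.
// LoadOrStore returns loaded == true when the key was already present.
func (u *urlSet) UniqueURL(url string) bool {
	_, loaded := u.seen.LoadOrStore(url, struct{}{})
	return !loaded
}

func main() {
	var set urlSet
	fmt.Println(set.UniqueURL("https://example.com/a"))  // true
	fmt.Println(set.UniqueURL("https://example.com/a"))  // false
	fmt.Println(set.UniqueURL("https://example.com/a/")) // true
}
```

Because no normalization is performed, the trailing-slash variant counts as a different URL, as does any difference in case, query-parameter order, or fragment.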

func NewSimple

func NewSimple() (*Simple, error)

NewSimple returns a new Simple filter.

func (*Simple) Close

func (s *Simple) Close()

Close closes the filter and releases associated resources.

func (*Simple) IsCycle

func (s *Simple) IsCycle(url string) bool

IsCycle attempts to determine whether the URL is part of a navigation cycle.
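The page does not show how Simple implements this check, so the following is only a sketch of the two heuristics named in the Filter comments. It deliberately swaps the longest-repeating-substring search for a simpler stand-in: it counts how often each fixed-width window of MinSequenceLength bytes recurs and flags the URL once any window appears more than MaxSequenceCount times. The constants are local copies of the documented values. `isCycle` is a hypothetical function name.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented constants.
const (
	MaxChromeURLLength = 2097152
	MinSequenceLength  = 10
	MaxSequenceCount   = 10
)

// isCycle is a hypothetical heuristic, not the package's implementation.
func isCycle(url string) bool {
	// Heuristic 1: hard upper limit on the URL length.
	if len(url) > MaxChromeURLLength {
		return true
	}
	// Heuristic 2 (simplified): count every window of MinSequenceLength
	// bytes and stop as soon as one recurs more than MaxSequenceCount times.
	counts := make(map[string]int)
	for i := 0; i+MinSequenceLength <= len(url); i++ {
		window := url[i : i+MinSequenceLength]
		counts[window]++
		if counts[window] > MaxSequenceCount {
			return true
		}
	}
	return false
}

func main() {
	normal := "https://example.com/blog/2023/08/20/post"
	looped := "https://example.com/" + strings.Repeat("calendar/next/", 12)

	fmt.Println(isCycle(normal)) // false
	fmt.Println(isCycle(looped)) // true
}
```

Window counting runs in time linear in the URL length, but it can also flag URLs that merely contain long runs of one character, so the thresholds would need tuning against real traffic.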

func (*Simple) UniqueContent

func (s *Simple) UniqueContent(data []byte) bool

UniqueContent returns true if the content is unique
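The Filter comments say that content deduplication is done by hashing the response data, but they do not name the hash function or the storage. A minimal sketch, assuming SHA-256 and a mutex-guarded map (`contentSet` and `newContentSet` are hypothetical names), might look like this:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// contentSet is a hypothetical store of response-body digests.
type contentSet struct {
	mu   sync.Mutex
	seen map[[sha256.Size]byte]struct{}
}

func newContentSet() *contentSet {
	return &contentSet{seen: make(map[[sha256.Size]byte]struct{})}
}

// UniqueContent hashes the body and reports whether the digest is new.
// SHA-256 is an assumption; the documentation does not specify the hash.
func (c *contentSet) UniqueContent(data []byte) bool {
	sum := sha256.Sum256(data)

	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[sum]; ok {
		return false
	}
	c.seen[sum] = struct{}{}
	return true
}

func main() {
	set := newContentSet()
	fmt.Println(set.UniqueContent([]byte("<html>hello</html>")))  // true
	fmt.Println(set.UniqueContent([]byte("<html>hello</html>")))  // false
	fmt.Println(set.UniqueContent([]byte("<html>hello!</html>"))) // true
}
```

Exact hashing treats any byte-level difference as new content, which is the limitation the interface's TODO about dynamic response content points at.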

func (*Simple) UniqueURL

func (s *Simple) UniqueURL(url string) bool

UniqueURL returns true if the URL is unique
