crawl

package
v1.0.78
Warning: This package is not in the latest version of its module.
Published: Sep 12, 2024 License: AGPL-3.0 Imports: 52 Imported by: 0

Documentation

Overview

Package crawl handles all the crawling logic for Zeno

Index

Constants

const (
	// B represents a Byte
	B = 1
	// KB represents a Kilobyte
	KB = 1024 * B
	// MB represents a Megabyte
	MB = 1024 * KB
	// GB represents a Gigabyte
	GB = 1024 * MB
)
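
These constants compose by powers of 1024, so larger sizes can be written as simple multiples. A minimal sketch (the threshold value below is made up for illustration):

package crawl

import "fmt"

func ExampleSizeConstants() {
	// A hypothetical minimum free-space threshold expressed in bytes.
	var minSpace int64 = 20 * GB
	fmt.Printf("%d bytes = %d MB\n", minSpace, minSpace/MB)
}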

Variables

This section is empty.

Functions

func ClosingPipedTeeReader

func ClosingPipedTeeReader(r io.Reader, pw *io.PipeWriter) io.Reader

ClosingPipedTeeReader is like a classic io.TeeReader, but it explicitly takes an io.PipeWriter and makes sure to close it.
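
A minimal sketch of how this might be used from within the package: the primary consumer reads the returned tee reader while a second goroutine drains the pipe, relying on the pipe writer being closed once the source is exhausted (the example name and values are illustrative):

package crawl

import (
	"fmt"
	"io"
	"strings"
)

func ExampleClosingPipedTeeReader() {
	src := strings.NewReader("hello from the crawler")
	pr, pw := io.Pipe()

	// Everything read from tee is also written into pw.
	tee := ClosingPipedTeeReader(src, pw)

	done := make(chan int)
	go func() {
		// This ReadAll returns only because the pipe writer is closed
		// once the source reader is exhausted.
		b, _ := io.ReadAll(pr)
		done <- len(b)
	}()

	primary, _ := io.ReadAll(tee)
	fmt.Println(len(primary), <-done)
}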

func ParseAttr added in v1.0.65

func ParseAttr(attrs string) (key, value string)

ParseAttr parses a single attribute key-value pair and returns it.
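
For illustration, a hedged sketch of parsing one Link-header style parameter; the exact accepted syntax (quoting, whitespace) is an assumption:

package crawl

import "fmt"

func ExampleParseAttr() {
	key, value := ParseAttr(`rel="next"`)
	fmt.Println(key, value)
}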

Types

type APIWorkerState added in v1.0.61

type APIWorkerState struct {
	WorkerID   string `json:"worker_id"`
	Status     string `json:"status"`
	LastError  string `json:"last_error"`
	LastSeen   string `json:"last_seen"`
	LastAction string `json:"last_action"`
	URL        string `json:"url"`
	Locked     bool   `json:"locked"`
}

APIWorkerState represents the state of an API worker.

type APIWorkersState added in v1.0.61

type APIWorkersState struct {
	Workers []*APIWorkerState `json:"workers"`
}

APIWorkersState represents the state of all API workers.

type Crawl

type Crawl struct {
	*sync.Mutex
	StartTime time.Time
	SeedList  []queue.Item
	Paused    *utils.TAtomBool
	Finished  *utils.TAtomBool
	LiveStats bool

	// Logger
	Log *log.Logger

	// Queue (ex-frontier)
	Queue        *queue.PersistentGroupedQueue
	Seencheck    *seencheck.Seencheck
	UseSeencheck bool
	UseHandover  bool
	UseCommit    bool

	// Worker pool
	Workers *WorkerPool

	// Crawl settings
	MaxConcurrentAssets            int
	Client                         *warc.CustomHTTPClient
	ClientProxied                  *warc.CustomHTTPClient
	DisabledHTMLTags               []string
	ExcludedHosts                  []string
	IncludedHosts                  []string
	ExcludedStrings                []string
	UserAgent                      string
	Job                            string
	JobPath                        string
	MaxHops                        uint8
	MaxRetry                       uint8
	MaxRedirect                    uint8
	HTTPTimeout                    int
	MaxConcurrentRequestsPerDomain int
	RateLimitDelay                 int
	CrawlTimeLimit                 int
	MaxCrawlTimeLimit              int
	DisableAssetsCapture           bool
	CaptureAlternatePages          bool
	DomainsCrawl                   bool
	Headless                       bool
	RandomLocalIP                  bool
	MinSpaceRequired               int

	// Cookie-related settings
	CookieFile  string
	KeepCookies bool
	CookieJar   http.CookieJar

	// proxy settings
	Proxy       string
	BypassProxy []string

	// API settings
	API               bool
	APIPort           string
	Prometheus        bool
	PrometheusMetrics *PrometheusMetrics

	// Real time statistics
	URIsPerSecond *ratecounter.RateCounter
	ActiveWorkers *ratecounter.Counter
	CrawledSeeds  *ratecounter.Counter
	CrawledAssets *ratecounter.Counter

	// WARC settings
	WARCPrefix         string
	WARCOperator       string
	WARCWriter         chan *warc.RecordBatch
	WARCWriterFinish   chan bool
	WARCTempDir        string
	CDXDedupeServer    string
	WARCFullOnDisk     bool
	WARCPoolSize       int
	WARCDedupeSize     int
	DisableLocalDedupe bool
	CertValidation     bool
	WARCCustomCookie   string

	// Crawl HQ settings
	UseHQ                  bool
	HQAddress              string
	HQProject              string
	HQKey                  string
	HQSecret               string
	HQStrategy             string
	HQBatchSize            int
	HQContinuousPull       bool
	HQClient               *gocrawlhq.Client
	HQConsumerState        string
	HQFinishedChannel      chan *queue.Item
	HQProducerChannel      chan *queue.Item
	HQChannelsWg           *sync.WaitGroup
	HQRateLimitingSendBack bool

	// Dependencies
	NoYTDLP   bool
	YTDLPPath string
}

Crawl defines the parameters of a crawl process.

func GenerateCrawlConfig added in v1.0.65

func GenerateCrawlConfig(config *config.Config) (*Crawl, error)

func (*Crawl) Capture

func (c *Crawl) Capture(item *queue.Item) error

Capture captures the URL and returns the outlinks.

func (*Crawl) HQConsumer

func (c *Crawl) HQConsumer()

func (*Crawl) HQFinisher

func (c *Crawl) HQFinisher()

func (*Crawl) HQProducer

func (c *Crawl) HQProducer()

func (*Crawl) HQSeencheckURL

func (c *Crawl) HQSeencheckURL(URL *url.URL) (bool, error)

returns:

  • bool: true if the URL is new, false if it has been seen before
  • error: if there's an error sending the payload to crawl HQ

NOTE: if there's an error, the URL is considered new
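
A hypothetical caller might use it like this, falling back to treating the URL as new when crawl HQ cannot be reached (the helper name is made up for illustration):

package crawl

import "net/url"

// shouldArchive is a hypothetical helper illustrating how the return values
// are interpreted: on error the URL is still treated as new.
func (c *Crawl) shouldArchive(u *url.URL) bool {
	isNew, err := c.HQSeencheckURL(u)
	if err != nil {
		// Could not reach crawl HQ: archive the URL rather than drop it.
		return true
	}
	return isNew
}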

func (*Crawl) HQSeencheckURLs

func (c *Crawl) HQSeencheckURLs(URLs []*url.URL) (seencheckedBatch []*url.URL, err error)

func (*Crawl) HQWebsocket

func (c *Crawl) HQWebsocket()

This function connects to HQ's websocket and listens for messages. It also sends an "identify" message to HQ to let it know that Zeno is connected. This "identify" message is sent every second and contains the crawler's stats and details.

func (*Crawl) Start

func (c *Crawl) Start() (err error)

Start fires up the crawling process.
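
Putting GenerateCrawlConfig and Start together, a hedged sketch of the intended flow; the config import path is an assumption and error handling is kept minimal:

package crawl

import "github.com/internetarchive/Zeno/config"

// runFromConfig is a hypothetical wrapper: translate a parsed configuration
// into a Crawl, then fire up the crawling process.
func runFromConfig(cfg *config.Config) error {
	c, err := GenerateCrawlConfig(cfg)
	if err != nil {
		return err
	}
	return c.Start()
}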

type Link

type Link struct {
	URL string
	Rel string
}

Link represents a parsed link, containing the URL to which it links and a Rel defining the relation.

func Parse added in v1.0.65

func Parse(link string) []Link

Parse parses a raw Link header in the form:

<url1>; rel="what", <url2>; rel="any"; another="yes", <url3>; rel="thing"

returning a slice of Link structs. Entries are separated by `, ` and their parts in turn by `; `, with the first part always being the URL and the remaining parts key-value pairs. See: https://simon-frey.com/blog/link-header/ and https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Link
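
A minimal sketch of parsing a typical pagination header into Link structs:

package crawl

import "fmt"

func ExampleParse() {
	header := `<https://example.com/page2>; rel="next", <https://example.com/page10>; rel="last"`
	for _, link := range Parse(header) {
		fmt.Println(link.Rel, link.URL)
	}
}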

type PrometheusMetrics

type PrometheusMetrics struct {
	Prefix        string
	DownloadedURI prometheus.Counter
}

PrometheusMetrics defines all the metrics exposed by the Prometheus exporter.

type Worker added in v1.0.61

type Worker struct {
	sync.Mutex
	ID uuid.UUID
	// contains filtered or unexported fields
}

func (*Worker) Run added in v1.0.61

func (w *Worker) Run()

Run is the key component of a crawl: it is a background process dispatched when the crawl starts. It listens on a channel for new URLs to archive and eventually pushes newly discovered URLs back into the queue.

func (*Worker) Stop added in v1.0.61

func (w *Worker) Stop()

func (*Worker) WatchHang added in v1.0.65

func (w *Worker) WatchHang()

WatchHang checks whether a worker is hanging, based on the last time it was seen.

type WorkerPool added in v1.0.65

type WorkerPool struct {
	Crawl            *Crawl
	Count            uint
	Workers          sync.Map
	StopSignal       chan bool
	StopTimeout      time.Duration
	GarbageCollector chan uuid.UUID
}

func NewPool added in v1.0.65

func NewPool(count uint, stopTimeout time.Duration, crawl *Crawl) *WorkerPool

func (*WorkerPool) EnsureFinished added in v1.0.65

func (wp *WorkerPool) EnsureFinished() bool

EnsureFinished waits for all workers to finish

func (*WorkerPool) GetWorkerStateFromPool added in v1.0.65

func (wp *WorkerPool) GetWorkerStateFromPool(UUID string) interface{}

GetWorkerStateFromPool returns the state of a worker given its index in the worker pool. If the provided index is -1, the state of all workers is returned.
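
Since the return type is interface{}, a caller presumably type-switches on the result; the concrete types below are assumed to be the APIWorkerState/APIWorkersState structs documented above, and the helper is hypothetical:

package crawl

import "fmt"

// dumpWorkerState is a hypothetical helper: pass a worker UUID for a single
// worker's state, or "-1" to get the state of all workers.
func dumpWorkerState(wp *WorkerPool, uuid string) {
	switch s := wp.GetWorkerStateFromPool(uuid).(type) {
	case *APIWorkerState:
		fmt.Println("worker:", s.WorkerID, s.Status, s.URL)
	case *APIWorkersState:
		fmt.Println("workers:", len(s.Workers))
	default:
		fmt.Println("unexpected state type")
	}
}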

func (*WorkerPool) NewWorker added in v1.0.65

func (wp *WorkerPool) NewWorker(crawlParameters *Crawl) *Worker

func (*WorkerPool) Start added in v1.0.65

func (wp *WorkerPool) Start()

func (*WorkerPool) WorkerWatcher added in v1.0.65

func (wp *WorkerPool) WorkerWatcher()

WorkerWatcher is a background process that watches over the workers and removes them from the pool when they are done.
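
Taken together, a hedged sketch of a pool's lifecycle; the worker count and timeout are made up, and whether Start already launches WorkerWatcher is an assumption:

package crawl

import "time"

// poolLifecycle is a hypothetical sketch: create the pool, start the workers
// and the watcher, then block until every worker has stopped.
func poolLifecycle(c *Crawl) {
	wp := NewPool(8, 30*time.Second, c)
	c.Workers = wp

	wp.Start()
	go wp.WorkerWatcher()

	// ... workers consume the queue while the crawl runs ...

	// EnsureFinished waits for all workers to finish.
	wp.EnsureFinished()
}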

Directories

Path Synopsis
dependencies
sitespecific
vk
