Documentation ¶
Overview ¶
Package crawl handles all the crawling logic for Zeno
Index ¶
- Constants
- func ClosingPipedTeeReader(r io.Reader, pw *io.PipeWriter) io.Reader
- func ParseAttr(attrs string) (key, value string)
- type APIWorkerState
- type APIWorkersState
- type Crawl
- func (c *Crawl) Capture(item *queue.Item) error
- func (c *Crawl) HQConsumer()
- func (c *Crawl) HQFinisher()
- func (c *Crawl) HQProducer()
- func (c *Crawl) HQSeencheckURL(URL *url.URL) (bool, error)
- func (c *Crawl) HQSeencheckURLs(URLs []*url.URL) (seencheckedBatch []*url.URL, err error)
- func (c *Crawl) HQWebsocket()
- func (c *Crawl) Start() (err error)
- type Link
- type PrometheusMetrics
- type Worker
- type WorkerPool
Constants ¶
const (
	// B represents a Byte
	B = 1
	// KB represents a Kilobyte
	KB = 1024 * B
	// MB represents a Megabyte
	MB = 1024 * KB
	// GB represents a Gigabyte
	GB = 1024 * MB
)
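These constants are plain integer byte counts, so size thresholds compose arithmetically. A minimal sketch, with the constants restated locally so it compiles on its own:

package main

import "fmt"

const (
	B  = 1
	KB = 1024 * B
	MB = 1024 * KB
	GB = 1024 * MB
)

func main() {
	// Thresholds such as a minimum free-space requirement can be
	// expressed readably in bytes.
	minFree := 20 * GB
	fmt.Println(minFree) // 21474836480
}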
Variables ¶
This section is empty.
Functions ¶
func ClosingPipedTeeReader ¶
func ClosingPipedTeeReader(r io.Reader, pw *io.PipeWriter) io.Reader
ClosingPipedTeeReader is like a classic io.TeeReader, but it explicitly takes an io.PipeWriter and makes sure to close it.
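A minimal sketch of the documented behavior, restated locally so it compiles on its own; Zeno's actual implementation may differ:

package main

import (
	"io"
	"os"
	"strings"
)

// closingTee tees reads into pw and, per the documented behavior,
// closes pw once the source reader is exhausted.
type closingTee struct {
	tee io.Reader
	pw  *io.PipeWriter
}

func (c *closingTee) Read(p []byte) (int, error) {
	n, err := c.tee.Read(p)
	if err == io.EOF {
		c.pw.Close() // unblocks the pipe's read side with EOF
	}
	return n, err
}

func main() {
	pr, pw := io.Pipe()
	tee := &closingTee{tee: io.TeeReader(strings.NewReader("hello"), pw), pw: pw}

	// Drain the pipe concurrently while the tee is consumed.
	done := make(chan struct{})
	go func() {
		io.Copy(os.Stdout, pr) // prints "hello", returns once pw is closed
		close(done)
	}()

	io.Copy(io.Discard, tee)
	<-done
}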
Types ¶
type APIWorkerState ¶ added in v1.0.61
type APIWorkerState struct {
	WorkerID   string `json:"worker_id"`
	Status     string `json:"status"`
	LastError  string `json:"last_error"`
	LastSeen   string `json:"last_seen"`
	LastAction string `json:"last_action"`
	URL        string `json:"url"`
	Locked     bool   `json:"locked"`
}
APIWorkerState represents the state of an API worker.
type APIWorkersState ¶ added in v1.0.61
type APIWorkersState struct {
Workers []*APIWorkerState `json:"workers"`
}
APIWorkersState represents the state of all API workers.
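Given the JSON tags above, a worker-state response marshals as in this sketch; the structs are restated locally so it is self-contained, and all field values are illustrative:

package main

import (
	"encoding/json"
	"fmt"
)

// Structs restated from the definitions above.
type APIWorkerState struct {
	WorkerID   string `json:"worker_id"`
	Status     string `json:"status"`
	LastError  string `json:"last_error"`
	LastSeen   string `json:"last_seen"`
	LastAction string `json:"last_action"`
	URL        string `json:"url"`
	Locked     bool   `json:"locked"`
}

type APIWorkersState struct {
	Workers []*APIWorkerState `json:"workers"`
}

func main() {
	state := APIWorkersState{Workers: []*APIWorkerState{{
		WorkerID:   "worker-1", // illustrative value
		Status:     "idle",
		LastSeen:   "2024-01-01T00:00:00Z",
		LastAction: "seencheck",
	}}}
	out, _ := json.MarshalIndent(state, "", "  ")
	fmt.Println(string(out))
}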
type Crawl ¶
type Crawl struct {
	*sync.Mutex
	StartTime time.Time
	SeedList  []queue.Item
	Paused    *utils.TAtomBool
	Finished  *utils.TAtomBool
	LiveStats bool

	// Logger
	Log *log.Logger

	// Queue (ex-frontier)
	Queue        *queue.PersistentGroupedQueue
	Seencheck    *seencheck.Seencheck
	UseSeencheck bool
	UseHandover  bool
	UseCommit    bool

	// Worker pool
	Workers *WorkerPool

	// Crawl settings
	MaxConcurrentAssets            int
	Client                         *warc.CustomHTTPClient
	ClientProxied                  *warc.CustomHTTPClient
	DisabledHTMLTags               []string
	ExcludedHosts                  []string
	IncludedHosts                  []string
	ExcludedStrings                []string
	UserAgent                      string
	Job                            string
	JobPath                        string
	MaxHops                        uint8
	MaxRetry                       uint8
	MaxRedirect                    uint8
	HTTPTimeout                    int
	MaxConcurrentRequestsPerDomain int
	RateLimitDelay                 int
	CrawlTimeLimit                 int
	MaxCrawlTimeLimit              int
	DisableAssetsCapture           bool
	CaptureAlternatePages          bool
	DomainsCrawl                   bool
	Headless                       bool
	RandomLocalIP                  bool
	MinSpaceRequired               int

	// Cookie-related settings
	CookieFile  string
	KeepCookies bool
	CookieJar   http.CookieJar

	// Proxy settings
	Proxy       string
	BypassProxy []string

	// API settings
	API               bool
	APIPort           string
	Prometheus        bool
	PrometheusMetrics *PrometheusMetrics

	// Real time statistics
	URIsPerSecond *ratecounter.RateCounter
	ActiveWorkers *ratecounter.Counter
	CrawledSeeds  *ratecounter.Counter
	CrawledAssets *ratecounter.Counter

	// WARC settings
	WARCPrefix         string
	WARCOperator       string
	WARCWriter         chan *warc.RecordBatch
	WARCWriterFinish   chan bool
	WARCTempDir        string
	CDXDedupeServer    string
	WARCFullOnDisk     bool
	WARCPoolSize       int
	WARCDedupeSize     int
	DisableLocalDedupe bool
	CertValidation     bool
	WARCCustomCookie   string

	// Crawl HQ settings
	UseHQ                  bool
	HQAddress              string
	HQProject              string
	HQKey                  string
	HQSecret               string
	HQStrategy             string
	HQBatchSize            int
	HQContinuousPull       bool
	HQClient               *gocrawlhq.Client
	HQConsumerState        string
	HQFinishedChannel      chan *queue.Item
	HQProducerChannel      chan *queue.Item
	HQChannelsWg           *sync.WaitGroup
	HQRateLimitingSendBack bool

	// Dependencies
	NoYTDLP   bool
	YTDLPPath string
}
Crawl defines the parameters of a crawl process.
func GenerateCrawlConfig ¶ added in v1.0.65
func (*Crawl) HQConsumer ¶
func (c *Crawl) HQConsumer()
func (*Crawl) HQFinisher ¶
func (c *Crawl) HQFinisher()
func (*Crawl) HQProducer ¶
func (c *Crawl) HQProducer()
func (*Crawl) HQSeencheckURL ¶
func (c *Crawl) HQSeencheckURL(URL *url.URL) (bool, error)
HQSeencheckURL returns:
- bool: true if the URL is new, false if it has been seen before
- error: if there's an error sending the payload to crawl HQ
NOTE: if there's an error, the URL is considered new.
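A hedged usage sketch showing how a caller might honor the NOTE above; shouldCapture is a hypothetical helper, and the commented call assumes an initialized *Crawl named c:

package main

import (
	"log"
	"net/url"
)

// shouldCapture wraps a seencheck call with the documented fail-open
// semantics: on error the URL is treated as new, so nothing is dropped.
func shouldCapture(u *url.URL, seencheck func(*url.URL) (bool, error)) bool {
	isNew, err := seencheck(u)
	if err != nil {
		log.Printf("HQ seencheck failed, treating %s as new: %v", u, err)
		return true
	}
	return isNew
}

func main() {
	u, _ := url.Parse("https://example.com/page")
	// With an initialized *Crawl named c, the call would be:
	//   if shouldCapture(u, c.HQSeencheckURL) { /* enqueue for capture */ }
	_ = u
}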
func (*Crawl) HQSeencheckURLs ¶
func (c *Crawl) HQSeencheckURLs(URLs []*url.URL) (seencheckedBatch []*url.URL, err error)
func (*Crawl) HQWebsocket ¶
func (c *Crawl) HQWebsocket()
This function connects to HQ's websocket and listens for messages. It also sends an "identify" message to HQ to let it know that Zeno is connected. This "identify" message is sent every second and contains the crawler's stats and details.
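A hedged sketch of the once-per-second "identify" heartbeat described above; sendIdentify is hypothetical and stands in for the actual websocket write performed through the gocrawlhq client:

package main

import "time"

// identifyLoop ticks once per second and pushes an identify message
// until the stop channel is closed.
func identifyLoop(stop <-chan bool, sendIdentify func() error) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if err := sendIdentify(); err != nil {
				return // real code would log and attempt to reconnect
			}
		}
	}
}

func main() {
	stop := make(chan bool)
	go identifyLoop(stop, func() error { return nil })
	time.Sleep(3 * time.Second)
	close(stop)
}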
type Link ¶ added in v1.0.65
Link represents a parsed Link header entry, containing a URL to which it links and a Rel defining the relation.
func Parse ¶ added in v1.0.65
Parse parses a raw Link header in the form:
<url1>; rel="what", <url2>; rel="any"; another="yes", <url3>; rel="thing"
returning a slice of Link structs. Entries are separated by `, ` and their fields in turn by `; `, with the first field always being the URL and the remaining fields being key-value pairs. See: https://simon-frey.com/blog/link-header/, https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Link
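The format described above is regular enough that a short sketch can show the parse shape; the Link struct is restated locally and the helper name is hypothetical, so this is an illustration rather than Zeno's implementation:

package main

import (
	"fmt"
	"strings"
)

// Link mirrors the documented fields: a URL and its Rel.
type Link struct {
	URL string
	Rel string
}

// parseLinkHeader follows the format described above: entries split on
// `, `, fields on `; `, the first field is the <url>, the rest are
// key="value" pairs.
func parseLinkHeader(raw string) []Link {
	var links []Link
	for _, entry := range strings.Split(raw, ", ") {
		fields := strings.Split(entry, "; ")
		l := Link{URL: strings.Trim(fields[0], "<>")}
		for _, kv := range fields[1:] {
			if k, v, ok := strings.Cut(kv, "="); ok && k == "rel" {
				l.Rel = strings.Trim(v, `"`)
			}
		}
		links = append(links, l)
	}
	return links
}

func main() {
	fmt.Println(parseLinkHeader(`<url1>; rel="what", <url2>; rel="any"; another="yes"`))
}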
type PrometheusMetrics ¶
type PrometheusMetrics struct {
	Prefix        string
	DownloadedURI prometheus.Counter
}
PrometheusMetrics defines all the metrics exposed by the Prometheus exporter.
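A hedged sketch of how such a counter might be built with the standard Prometheus client; the struct is restated locally, and the metric name zeno_downloaded_uri is an assumption for illustration only:

package main

import "github.com/prometheus/client_golang/prometheus"

// PrometheusMetrics restated from the definition above.
type PrometheusMetrics struct {
	Prefix        string
	DownloadedURI prometheus.Counter
}

func main() {
	m := &PrometheusMetrics{Prefix: "zeno_"}
	// The metric name below is an assumption for illustration only.
	m.DownloadedURI = prometheus.NewCounter(prometheus.CounterOpts{
		Name: m.Prefix + "downloaded_uri",
		Help: "Total number of URIs downloaded by the crawl.",
	})
	prometheus.MustRegister(m.DownloadedURI)
	m.DownloadedURI.Inc()
}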
type Worker ¶ added in v1.0.61
type WorkerPool ¶ added in v1.0.65
type WorkerPool struct {
	Crawl            *Crawl
	Count            uint
	Workers          sync.Map
	StopSignal       chan bool
	StopTimeout      time.Duration
	GarbageCollector chan uuid.UUID
}
func NewPool ¶ added in v1.0.65
func NewPool(count uint, stopTimeout time.Duration, crawl *Crawl) *WorkerPool
func (*WorkerPool) EnsureFinished ¶ added in v1.0.65
func (wp *WorkerPool) EnsureFinished() bool
EnsureFinished waits for all workers to finish.
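Taken together with NewPool and Start, a hedged lifecycle sketch using only the signatures documented here; c is assumed to be an initialized *Crawl, the pool size and stop timeout are arbitrary illustrative values, and the sketch needs the "log" and "time" imports:

// runPool shows the assumed lifecycle: build a pool, start it, then
// wait for the workers to drain at shutdown.
func runPool(c *Crawl) {
	wp := NewPool(8, 30*time.Second, c)
	wp.Start()
	// ... the crawl runs ...
	if !wp.EnsureFinished() {
		log.Println("some workers did not finish before the stop timeout")
	}
}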
func (*WorkerPool) GetWorkerStateFromPool ¶ added in v1.0.65
func (wp *WorkerPool) GetWorkerStateFromPool(UUID string) interface{}
GetWorkerStateFromPool returns the state of a worker given its index in the worker pool. If the provided index is -1, the state of all workers is returned.
func (*WorkerPool) NewWorker ¶ added in v1.0.65
func (wp *WorkerPool) NewWorker(crawlParameters *Crawl) *Worker
func (*WorkerPool) Start ¶ added in v1.0.65
func (wp *WorkerPool) Start()
func (*WorkerPool) WorkerWatcher ¶ added in v1.0.65
func (wp *WorkerPool) WorkerWatcher()
WorkerWatcher is a background process that watches over the workers and removes them from the pool when they are done.
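A hedged sketch consistent with the WorkerPool fields above: the watcher drains GarbageCollector and drops finished workers from the pool's sync.Map. Zeno's actual WorkerWatcher may differ, and the string key is an assumption based on GetWorkerStateFromPool(UUID string):

// workerWatcherSketch removes workers from the pool as their UUIDs
// arrive on the garbage-collection channel; it exits when the channel
// is closed.
func (wp *WorkerPool) workerWatcherSketch() {
	for id := range wp.GarbageCollector {
		wp.Workers.Delete(id.String())
	}
}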