crawlcmd

package v0.0.0-...-bd556f4
Published: Dec 15, 2024 License: Apache-2.0 Imports: 19 Imported by: 8

README

Package cloudeng.io/file/crawl/crawlcmd

import cloudeng.io/file/crawl/crawlcmd

Package crawlcmd provides support for building command line tools for crawling. In particular it provides support for managing the configuration of a crawl via yaml.

Types

Type Config
type Config struct {
	Name          string           `yaml:"name"`
	Depth         int              `yaml:"depth"`
	Seeds         []string         `yaml:"seeds"`
	NoFollowRules []string         `yaml:"nofollow"`
	FollowRules   []string         `yaml:"follow"`
	RewriteRules  []string         `yaml:"rewrite"`
	Download      DownloadConfig   `yaml:"download"`
	NumExtractors int              `yaml:"num_extractors"`
	Extractors    []content.Type   `yaml:"extractors"`
	Cache         CrawlCacheConfig `yaml:"cache"`
}

Config represents the configuration for a single crawl.

Methods
func (c Config) CreateSeedCrawlRequests(ctx context.Context, factories map[string]file.FSFactory, seeds map[string][]cloudpath.Match) ([]download.Request, error)

CreateSeedCrawlRequests creates a set of crawl requests for the supplied seeds. It uses the factories to create a file.FS for the URI scheme of each seed.

func (c Config) ExtractorRegistry(avail map[content.Type]outlinks.Extractor) (*content.Registry[outlinks.Extractor], error)

ExtractorRegistry returns a content.Registry of outlinks.Extractor that can be used with outlinks.Extract.

func (c Config) NewLinkProcessor() (*outlinks.RegexpProcessor, error)

NewLinkProcessor creates an outlinks.RegexpProcessor using the nofollow, follow and rewrite specifications in the configuration.

func (c Config) SeedsByScheme(matchers cloudpath.MatcherSpec) (map[string][]cloudpath.Match, []string)

SeedsByScheme returns the crawl seeds grouped by their scheme and any seeds that are not recognised by the supplied cloudpath.MatcherSpec.

Type CrawlCacheConfig
type CrawlCacheConfig struct {
	Prefix            string `yaml:"cache_prefix"`
	ClearBeforeCrawl  bool   `yaml:"cache_clear_before_crawl"`
	Checkpoint        string `yaml:"cache_checkpoint"`
	ShardingPrefixLen int    `yaml:"cache_sharding_prefix_len"`
}

Each crawl may specify its own cache directory and configuration. This will be used to store the results of the crawl. The cache is intended to be relative to the root directory supplied to Initialize.

Methods
func (c CrawlCacheConfig) Initialize(root string) (cachePath, checkpointPath string, err error)

Initialize creates the cache and checkpoint directories relative to the specified root, and optionally clears them before the crawl (if Cache.ClearBeforeCrawl is true). Any environment variables in the root or Cache.Prefix will be expanded.
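
A minimal sketch of calling Initialize, assuming cache is a populated CrawlCacheConfig; the root path here is illustrative.

cachePath, checkpointPath, err := cache.Initialize("$HOME/crawls/example")
if err != nil {
	// handle the error
}
// cachePath and checkpointPath are now ready for use by the crawl.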

Type Crawler
type Crawler struct {
	Config
	Extractors func() map[content.Type]outlinks.Extractor
	// contains filtered or unexported fields
}

Crawler represents a crawler instance and contains global configuration information.

Methods
func (c *Crawler) Run(ctx context.Context, fsMap map[string]file.FSFactory, cacheRoot string, displayOutlinks, displayProgress bool) error

Run runs the crawler.

Type DownloadConfig
type DownloadConfig struct {
	DownloadFactoryConfig `yaml:",inline"`
	RateControlConfig     RateControl `yaml:",inline"`
}
Type DownloadFactoryConfig
type DownloadFactoryConfig struct {
	DefaultConcurrency       int   `yaml:"default_concurrency"`
	DefaultRequestChanSize   int   `yaml:"default_request_chan_size"`
	DefaultCrawledChanSize   int   `yaml:"default_crawled_chan_size"`
	PerDepthConcurrency      []int `yaml:"per_depth_concurrency"`
	PerDepthRequestChanSizes []int `yaml:"per_depth_request_chan_sizes"`
	PerDepthCrawledChanSizes []int `yaml:"per_depth_crawled_chan_sizes"`
}

DownloadFactoryConfig is the configuration for a crawl.DownloaderFactory.

Methods
func (df DownloadFactoryConfig) Depth0Chans() (chan download.Request, chan crawl.Crawled)

Depth0Chans creates the channels required to start the crawl, with their capacities set to the values specified in the DownloadFactoryConfig for a depth 0 crawl, or to the default values if none are specified.

func (df DownloadFactoryConfig) NewFactory(ch chan<- download.Progress) crawl.DownloaderFactory

NewFactory returns a new instance of a crawl.DownloaderFactory which is parameterized via its DownloadFactoryConfig receiver.

Type ExponentialBackoff
type ExponentialBackoff struct {
	InitialDelay time.Duration `yaml:"initial_delay"`
	Steps        int           `yaml:"steps"`
	StatusCodes  []int         `yaml:"status_codes,flow"`
}

ExponentialBackoff is the configuration for an exponential backoff retry strategy for downloads.

Type Rate
type Rate struct {
	Tick            time.Duration `yaml:"tick"`
	RequestsPerTick int           `yaml:"requests_per_tick"`
	BytesPerTick    int           `yaml:"bytes_per_tick"`
}

Rate specifies a rate in one of several forms; only one should be used.

Type RateControl
type RateControl struct {
	Rate               Rate               `yaml:"rate_control"`
	ExponentialBackoff ExponentialBackoff `yaml:"exponential_backoff"`
}

RateControl is the configuration for rate based control of download requests.

Methods
func (c RateControl) NewRateController() (*ratecontrol.Controller, error)

NewRateController creates a new rate controller based on the values contained in RateControl.

Documentation

Overview

Package crawlcmd provides support for building command line tools for crawling. In particular it provides support for managing the configuration of a crawl via yaml.

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	Name          string           `yaml:"name" cmd:"the name of the crawl"`
	Depth         int              `yaml:"depth" cmd:"the maximum depth to crawl"`
	Seeds         []string         `yaml:"seeds" cmd:"the initial set of URIs to crawl"`
	NoFollowRules []string         `` /* 161-byte string literal not displayed */
	FollowRules   []string         `` /* 155-byte string literal not displayed */
	RewriteRules  []string         `` /* 138-byte string literal not displayed */
	Download      DownloadConfig   `yaml:"download" cmd:"the configuration for downloading documents"`
	NumExtractors int              `yaml:"num_extractors" cmd:"the number of concurrent link extractors to use"`
	Extractors    []content.Type   `yaml:"extractors" cmd:"the content types to extract links from"`
	Cache         CrawlCacheConfig `yaml:"cache" cmd:"the configuration for the cache of downloaded documents"`
}

Config represents the configuration for a single crawl.
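
As a sketch of configuring a crawl via yaml, the following parses a minimal specification into a Config using cmdyaml.ParseConfig (the same helper used in the CrawlCacheConfig example below); the name, depth and seed values are illustrative.

package main

import (
	"fmt"

	"cloudeng.io/cmdutil/cmdyaml"
	"cloudeng.io/file/crawl/crawlcmd"
)

func main() {
	var cfg crawlcmd.Config
	// Key names follow the yaml struct tags shown above.
	err := cmdyaml.ParseConfig([]byte(`
name: example-crawl
depth: 2
seeds:
  - https://www.example.com
num_extractors: 4
`), &cfg)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Println(cfg.Name, cfg.Depth, cfg.Seeds)
}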

func (Config) CreateSeedCrawlRequests

func (c Config) CreateSeedCrawlRequests(
	ctx context.Context,
	factories map[string]FSFactory,
	seeds map[string][]cloudpath.Match,
) ([]download.Request, error)

CreateSeedCrawlRequests creates a set of crawl requests for the supplied seeds. It uses the factories to create a file.FS for the URI scheme of each seed.

func (Config) ExtractorRegistry

func (c Config) ExtractorRegistry(avail map[content.Type]outlinks.Extractor) (*content.Registry[outlinks.Extractor], error)

ExtractorRegistry returns a content.Registry containing the outlinks.Extractor that can be used with outlinks.Extract.

func (Config) NewLinkProcessor

func (c Config) NewLinkProcessor() (*outlinks.RegexpProcessor, error)

NewLinkProcessor creates an outlinks.RegexpProcessor using the nofollow, follow and rewrite specifications in the configuration.
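
A sketch combining NewLinkProcessor and ExtractorRegistry, assuming cfg is a Config and htmlExtractor is a hypothetical, caller-supplied outlinks.Extractor for HTML documents.

// Build the follow/nofollow/rewrite link processor from the configuration.
processor, err := cfg.NewLinkProcessor()
if err != nil {
	// handle the error
}
// Build a registry of the available extractors for use with outlinks.Extract.
registry, err := cfg.ExtractorRegistry(map[content.Type]outlinks.Extractor{
	"text/html": htmlExtractor, // hypothetical extractor implementation
})
if err != nil {
	// handle the error
}
_, _ = processor, registry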

func (Config) SeedsByScheme

func (c Config) SeedsByScheme(matchers cloudpath.MatcherSpec) (map[string][]cloudpath.Match, []string)

SeedsByScheme returns the crawl seeds grouped by their scheme and any seeds that are not recognised by the supplied cloudpath.MatcherSpec.
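
A sketch of turning the configured seeds into download requests, assuming ctx is a context.Context, factories maps URI schemes to FSFactory values, and (as an assumption) that cloudpath.DefaultMatchers recognises the seed URIs.

// Group the seeds by their URI scheme.
byScheme, unrecognised := cfg.SeedsByScheme(cloudpath.DefaultMatchers)
if len(unrecognised) > 0 {
	fmt.Printf("ignoring unrecognised seeds: %v\n", unrecognised)
}
// Create the initial crawl requests for the recognised seeds.
requests, err := cfg.CreateSeedCrawlRequests(ctx, factories, byScheme)
if err != nil {
	// handle the error
}
fmt.Printf("created %v seed requests\n", len(requests))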

type CrawlCacheConfig

type CrawlCacheConfig struct {
	Downloads         string    `` /* 147-byte string literal not displayed */
	ClearBeforeCrawl  bool      `yaml:"clear_before_crawl" cmd:"if true, the cache and checkpoint will be cleared before the crawl starts."`
	Checkpoint        string    `yaml:"checkpoint" cmd:"the location of any checkpoint data used to resume a crawl, this is an absolute path."`
	ShardingPrefixLen int       `` /* 187-byte string literal not displayed */
	Concurrency       int       `yaml:"concurrency" cmd:"the number of concurrent operations to use when reading/writing to the cache."`
	ServiceConfig     yaml.Node `yaml:"service_config,omitempty" cmd:"cache service specific configuration, eg. AWS specific configuration"`
}

Each crawl may specify its own cache directory and configuration. This will be used to store the results of the crawl. The ServiceConfig field is intended to carry service specific configuration for cache services that require it, such as AWS S3. This is deliberately left to client packages to avoid dependency bloat in core packages such as this one. The type of the ServiceConfig field is generally determined using the scheme of the Downloads path (e.g. s3://... would imply an AWS specific configuration).

Example
package main

import (
	"fmt"

	"cloudeng.io/cmdutil/cmdyaml"
	"cloudeng.io/file/crawl/crawlcmd"
)

func main() {
	type cloudConfig struct {
		Region string `yaml:"region"`
	}
	var cfg crawlcmd.CrawlCacheConfig
	var service cloudConfig

	err := cmdyaml.ParseConfig([]byte(`
downloads: cloud-service://bucket/downloads
service_config:
  region: us-west-2
`), &cfg)
	if err != nil {
		fmt.Printf("error: %v\n", err)
	}
	if err := cfg.ServiceConfig.Decode(&service); err != nil {
		fmt.Printf("error: %v\n", err)
	}
	fmt.Println(cfg.Downloads)
	fmt.Println(service.Region)
}
Output:

cloud-service://bucket/downloads
us-west-2

func (CrawlCacheConfig) CheckpointPath

func (c CrawlCacheConfig) CheckpointPath() string

CheckpointPath returns the expanded checkpoint path.

func (CrawlCacheConfig) DownloadPath

func (c CrawlCacheConfig) DownloadPath() string

DownloadPath returns the expanded downloads path.
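
A small sketch of both methods; that variables such as ${HOME} are expanded is an assumption based on the "expanded" wording above.

cache := crawlcmd.CrawlCacheConfig{
	Downloads:  "${HOME}/crawl-cache/downloads",
	Checkpoint: "${HOME}/crawl-cache/checkpoint",
}
fmt.Println(cache.DownloadPath())   // downloads path with variables expanded
fmt.Println(cache.CheckpointPath()) // checkpoint path with variables expanded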

func (CrawlCacheConfig) PrepareCheckpoint

func (c CrawlCacheConfig) PrepareCheckpoint(ctx context.Context, op checkpoint.Operation) error

PrepareCheckpoint initializes the checkpoint operation (i.e. calls op.Init(ctx, checkpointPath)) and optionally clears the checkpoint if ClearBeforeCrawl is true. It returns an error if the checkpoint cannot be initialized or cleared.

func (CrawlCacheConfig) PrepareDownloads

func (c CrawlCacheConfig) PrepareDownloads(ctx context.Context, fs content.FS) error

PrepareDownloads ensures that the cache directory exists and is empty if ClearBeforeCrawl is true. It returns an error if the directory cannot be created or cleared.

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler represents a crawler instance and contains global configuration information.

func NewCrawler

func NewCrawler(cfg Config, resources Resources) *Crawler

NewCrawler creates a new crawler instance using the supplied configuration and resources.

func (*Crawler) Run

func (c *Crawler) Run(ctx context.Context,
	displayOutlinks, displayProgress bool) error

Run runs the crawler.

type DownloadConfig

type DownloadConfig struct {
	DownloadFactoryConfig `yaml:",inline"`
	RateControlConfig     RateControl `yaml:",inline"`
}

type DownloadFactoryConfig

type DownloadFactoryConfig struct {
	DefaultConcurrency       int   `` /* 174-byte string literal not displayed */
	DefaultRequestChanSize   int   `` /* 282-byte string literal not displayed */
	DefaultCrawledChanSize   int   `` /* 274-byte string literal not displayed */
	PerDepthConcurrency      []int `yaml:"per_depth_concurrency" cmd:"per crawl depth values for the number of concurrent downloads"`
	PerDepthRequestChanSizes []int `yaml:"per_depth_request_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue download requests"`
	PerDepthCrawledChanSizes []int `yaml:"per_depth_crawled_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue downloaded items"`
}

DownloadFactoryConfig is the configuration for a crawl.DownloaderFactory.

func (DownloadFactoryConfig) Depth0Chans

func (df DownloadFactoryConfig) Depth0Chans() (chan download.Request, chan crawl.Crawled)

Depth0Chans creates the channels required to start the crawl, with their capacities set to the values specified in the DownloadFactoryConfig for a depth 0 crawl, or to the default values if none are specified.

func (DownloadFactoryConfig) NewFactory

func (df DownloadFactoryConfig) NewFactory(ch chan<- download.Progress) crawl.DownloaderFactory

NewFactory returns a new instance of a crawl.DownloaderFactory which is parameterized via its DownloadFactoryConfig receiver.
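
A sketch, assuming df is the DownloadFactoryConfig taken from a crawl's download configuration.

// Channel on which the downloader factory reports progress.
progressCh := make(chan download.Progress, 64)
factory := df.NewFactory(progressCh)
// Channels for the depth 0 requests and crawled results, sized per the
// configuration or the defaults.
requests, crawled := df.Depth0Chans()
_, _, _ = factory, requests, crawled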

type ExponentialBackoff

type ExponentialBackoff struct {
	InitialDelay time.Duration `yaml:"initial_delay" cmd:"the initial delay between retries for exponential backoff"`
	Steps        int           `yaml:"steps" cmd:"the number of steps of exponential backoff before giving up"`
	StatusCodes  []int         `yaml:"status_codes,flow" cmd:"the status codes that trigger a retry"`
}

ExponentialBackoff is the configuration for an exponential backoff retry strategy for downloads.

type FSFactory

type FSFactory func(context.Context) (file.FS, error)

FSFactory is a function that returns a file.FS used to crawl a given FS.
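
A minimal sketch of an FSFactory for local files; it assumes that cloudeng.io/file/localfs provides a New constructor returning a file.FS (consult that package for the actual API).

var localFactory crawlcmd.FSFactory = func(ctx context.Context) (file.FS, error) {
	// Assumed constructor for a local filesystem file.FS.
	return localfs.New(), nil
}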

type Rate

type Rate struct {
	Tick            time.Duration `yaml:"tick" cmd:"the duration of a tick"`
	RequestsPerTick int           `yaml:"requests_per_tick" cmd:"the number of requests per tick"`
	BytesPerTick    int           `yaml:"bytes_per_tick" cmd:"the number of bytes per tick"`
}

Rate specifies a rate in one of several forms; only one should be used.

type RateControl

type RateControl struct {
	Rate               Rate               `yaml:"rate_control" cmd:"the rate control parameters"`
	ExponentialBackoff ExponentialBackoff `yaml:"exponential_backoff" cmd:"the exponential backoff parameters"`
}

RateControl is the configuration for rate based control of download requests.

func (RateControl) NewRateController

func (c RateControl) NewRateController() (*ratecontrol.Controller, error)

NewRateController creates a new rate controller based on the values contained in RateControl.
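
A sketch of constructing a rate controller directly from Go values; the specific limits and status codes are illustrative.

rc := crawlcmd.RateControl{
	Rate: crawlcmd.Rate{
		Tick:            time.Second,
		RequestsPerTick: 10,
	},
	ExponentialBackoff: crawlcmd.ExponentialBackoff{
		InitialDelay: 500 * time.Millisecond,
		Steps:        5,
		StatusCodes:  []int{429, 503},
	},
}
controller, err := rc.NewRateController()
if err != nil {
	// handle the error
}
_ = controller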

type Resources

type Resources struct {
	// Extractors are used to extract outlinks from crawled documents
	// based on their content type.
	Extractors map[content.Type]outlinks.Extractor
	// CrawlStoreFactories are used to create file.FS instances for
	// the files being crawled based on their scheme.
	CrawlStoreFactories map[string]FSFactory
	// ContentStoreFactory is a function that returns a content.FS used to store
	// the downloaded content.
	NewContentFS func(context.Context, CrawlCacheConfig) (content.FS, error)
}

Resources contains the resources required by the crawler.
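
A sketch of wiring a crawl together, assuming cfg is a parsed Config, ctx is a context.Context, and that htmlExtractor, httpsFactory and newContentFS are hypothetical, caller-supplied implementations.

resources := crawlcmd.Resources{
	Extractors: map[content.Type]outlinks.Extractor{
		"text/html": htmlExtractor, // hypothetical HTML outlink extractor
	},
	CrawlStoreFactories: map[string]crawlcmd.FSFactory{
		"https": httpsFactory, // hypothetical FSFactory for https seeds
	},
	NewContentFS: newContentFS, // hypothetical content.FS constructor for downloaded content
}
crawler := crawlcmd.NewCrawler(cfg, resources)
// Run the crawl, displaying progress but not outlinks.
if err := crawler.Run(ctx, false, true); err != nil {
	// handle the error
}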
