Documentation ¶
Overview ¶
Package crawlcmd provides support for building command line tools for crawling. In particular, it provides support for managing the configuration of a crawl via YAML.
Index ¶
- type Config
- func (c Config) CreateSeedCrawlRequests(ctx context.Context, factories map[string]FSFactory, ...) ([]download.Request, error)
- func (c Config) ExtractorRegistry(avail map[content.Type]outlinks.Extractor) (*content.Registry[outlinks.Extractor], error)
- func (c Config) NewLinkProcessor() (*outlinks.RegexpProcessor, error)
- func (c Config) SeedsByScheme(matchers cloudpath.MatcherSpec) (map[string][]cloudpath.Match, []string)
- type CrawlCacheConfig
- type Crawler
- type DownloadConfig
- type DownloadFactoryConfig
- type ExponentialBackoff
- type FSFactory
- type Rate
- type RateControl
- type Resources
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Config ¶
type Config struct {
	Name          string           `yaml:"name" cmd:"the name of the crawl"`
	Depth         int              `yaml:"depth" cmd:"the maximum depth to crawl"`
	Seeds         []string         `yaml:"seeds" cmd:"the initial set of URIs to crawl"`
	NoFollowRules []string         `` /* 161-byte string literal not displayed */
	FollowRules   []string         `` /* 155-byte string literal not displayed */
	RewriteRules  []string         `` /* 138-byte string literal not displayed */
	Download      DownloadConfig   `yaml:"download" cmd:"the configuration for downloading documents"`
	NumExtractors int              `yaml:"num_extractors" cmd:"the number of concurrent link extractors to use"`
	Extractors    []content.Type   `yaml:"extractors" cmd:"the content types to extract links from"`
	Cache         CrawlCacheConfig `yaml:"cache" cmd:"the configuration for the cache of downloaded documents"`
}
Config represents the configuration for a single crawl.
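As a hedged illustration (not one of the package's own examples), a Config can be populated from YAML whose keys mirror the yaml tags shown above; cmdyaml.ParseConfig is the same helper used in the CrawlCacheConfig example further down, and the field values here are purely representative.

package main

import (
	"fmt"

	"cloudeng.io/cmdutil/cmdyaml"
	"cloudeng.io/file/crawl/crawlcmd"
)

func main() {
	// Minimal sketch of a crawl configuration; the keys follow the
	// yaml tags on crawlcmd.Config and the values are placeholders.
	spec := []byte(`
name: example-crawl
depth: 2
seeds:
  - https://example.com
num_extractors: 4
`)
	var cfg crawlcmd.Config
	if err := cmdyaml.ParseConfig(spec, &cfg); err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Println(cfg.Name, cfg.Depth, len(cfg.Seeds))
}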
func (Config) CreateSeedCrawlRequests ¶
func (c Config) CreateSeedCrawlRequests(
	ctx context.Context,
	factories map[string]FSFactory,
	seeds map[string][]cloudpath.Match,
) ([]download.Request, error)
CreateSeedCrawlRequests creates a set of crawl requests for the supplied seeds. It uses the factories to create a file.FS for the URI scheme of each seed.
func (Config) ExtractorRegistry ¶
func (c Config) ExtractorRegistry(avail map[content.Type]outlinks.Extractor) (*content.Registry[outlinks.Extractor], error)
ExtractorRegistry returns a content.Registry containing the supplied outlinks.Extractor implementations, which can be used with outlinks.Extract.
func (Config) NewLinkProcessor ¶
func (c Config) NewLinkProcessor() (*outlinks.RegexpProcessor, error)
NewLinkProcessor creates an outlinks.RegexpProcessor using the nofollow, follow and rewrite specifications in the configuration.
func (Config) SeedsByScheme ¶
func (c Config) SeedsByScheme(matchers cloudpath.MatcherSpec) (map[string][]cloudpath.Match, []string)
SeedsByScheme returns the crawl seeds grouped by their scheme and any seeds that are not recognised by the supplied cloudpath.MatcherSpec.
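The following sketch shows one way SeedsByScheme and CreateSeedCrawlRequests might be combined; it is assumed wiring rather than an official example. The config, matcher spec and per-scheme factories are placeholders that a real crawler would populate, and the cloudpath import path is assumed to be cloudeng.io/path/cloudpath.

package main

import (
	"context"
	"fmt"

	"cloudeng.io/file/crawl/crawlcmd"
	"cloudeng.io/path/cloudpath"
)

func main() {
	ctx := context.Background()
	// Placeholders: a real crawler would populate the configuration,
	// the matcher spec and the per-scheme file system factories.
	var (
		cfg       crawlcmd.Config
		matchers  cloudpath.MatcherSpec
		factories map[string]crawlcmd.FSFactory
	)
	// Group the configured seeds by URI scheme; seeds not recognised by
	// the matcher spec are returned separately.
	seeds, unmatched := cfg.SeedsByScheme(matchers)
	if len(unmatched) > 0 {
		fmt.Printf("unrecognised seeds: %v\n", unmatched)
	}
	// Turn each group of seeds into download requests using the factory
	// registered for its scheme.
	requests, err := cfg.CreateSeedCrawlRequests(ctx, factories, seeds)
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	fmt.Printf("%v seed requests\n", len(requests))
}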
type CrawlCacheConfig ¶
type CrawlCacheConfig struct {
	Downloads         string    `` /* 147-byte string literal not displayed */
	ClearBeforeCrawl  bool      `yaml:"clear_before_crawl" cmd:"if true, the cache and checkpoint will be cleared before the crawl starts."`
	Checkpoint        string    `yaml:"checkpoint" cmd:"the location of any checkpoint data used to resume a crawl, this is an absolute path."`
	ShardingPrefixLen int       `` /* 187-byte string literal not displayed */
	Concurrency       int       `yaml:"concurrency" cmd:"the number of concurrent operations to use when reading/writing to the cache."`
	ServiceConfig     yaml.Node `yaml:"service_config,omitempty" cmd:"cache service specific configuration, eg. AWS specific configuration"`
}
Each crawl may specify its own cache directory and configuration, which will be used to store the results of the crawl. The ServiceConfig field is intended to be parametrized with service specific configuration for cache services that require it, such as AWS S3. This is deliberately left to client packages to avoid dependency bloat in core packages such as this one. The type of the ServiceConfig field is generally determined using the scheme of the Downloads path (e.g. s3://... would imply an AWS specific configuration).
Example ¶
package main

import (
	"fmt"

	"cloudeng.io/cmdutil/cmdyaml"
	"cloudeng.io/file/crawl/crawlcmd"
)

func main() {
	type cloudConfig struct {
		Region string `yaml:"region"`
	}
	var cfg crawlcmd.CrawlCacheConfig
	var service cloudConfig
	err := cmdyaml.ParseConfig([]byte(`
downloads: cloud-service://bucket/downloads
service_config:
  region: us-west-2
`), &cfg)
	if err != nil {
		fmt.Printf("error: %v\n", err)
	}
	if err := cfg.ServiceConfig.Decode(&service); err != nil {
		fmt.Printf("error: %v\n", err)
	}
	fmt.Println(cfg.Downloads)
	fmt.Println(service.Region)
}
Output:

cloud-service://bucket/downloads
us-west-2
func (CrawlCacheConfig) CheckpointPath ¶
func (c CrawlCacheConfig) CheckpointPath() string
CheckpointPath returns the expanded checkpoint path.
func (CrawlCacheConfig) DownloadPath ¶
func (c CrawlCacheConfig) DownloadPath() string
DownloadPath returns the expanded downloads path.
func (CrawlCacheConfig) PrepareCheckpoint ¶
func (c CrawlCacheConfig) PrepareCheckpoint(ctx context.Context, op checkpoint.Operation) error
PrepareCheckpoint initializes the checkpoint operation (ie. calls op.Init(ctx, checkpointPath)) and optionally clears the checkpoint if ClearBeforeCrawl is true. It returns an error if the checkpoint cannot be initialized or cleared.
func (CrawlCacheConfig) PrepareDownloads ¶
PrepareDownloads ensures that the cache directory exists and is empty if ClearBeforeCrawl is true. It returns an error if the directory cannot be created or cleared.
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
Crawler represents a crawler instance and contains global configuration information.
func NewCrawler ¶
NewCrawler creates a new crawler instance using the supplied configuration and resources.
type DownloadConfig ¶
type DownloadConfig struct {
	DownloadFactoryConfig `yaml:",inline"`
	RateControlConfig     RateControl `yaml:",inline"`
}
type DownloadFactoryConfig ¶
type DownloadFactoryConfig struct {
	DefaultConcurrency       int   `` /* 174-byte string literal not displayed */
	DefaultRequestChanSize   int   `` /* 282-byte string literal not displayed */
	DefaultCrawledChanSize   int   `` /* 274-byte string literal not displayed */
	PerDepthConcurrency      []int `yaml:"per_depth_concurrency" cmd:"per crawl depth values for the number of concurrent downloads"`
	PerDepthRequestChanSizes []int `yaml:"per_depth_request_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue download requests"`
	PerDepthCrawledChanSizes []int `yaml:"per_depth_crawled_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue downloaded items"`
}
DownloadFactoryConfig is the configuration for a crawl.DownloaderFactory.
func (DownloadFactoryConfig) Depth0Chans ¶
func (df DownloadFactoryConfig) Depth0Chans() (chan download.Request, chan crawl.Crawled)
Depth0Chans creates the channels required to start the crawl, with their capacities set to the values specified in the DownloadFactoryConfig for a depth 0 crawl, or to the default values if none are specified.
func (DownloadFactoryConfig) NewFactory ¶
func (df DownloadFactoryConfig) NewFactory(ch chan<- download.Progress) crawl.DownloaderFactory
NewFactory returns a new instance of a crawl.DownloaderFactory which is parametrized via its DownloadFactoryConfig receiver.
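A rough usage sketch (assumed wiring, not prescribed by the package) showing NewFactory together with Depth0Chans; the buffer size of the progress channel is an arbitrary choice.

package main

import (
	"cloudeng.io/file/crawl/crawlcmd"
	"cloudeng.io/file/download"
)

func main() {
	var df crawlcmd.DownloadFactoryConfig
	// Channel for download progress updates; the buffer size is arbitrary.
	progress := make(chan download.Progress, 100)
	factory := df.NewFactory(progress)
	// Channels used to seed and drain a depth 0 crawl, sized according to
	// the DownloadFactoryConfig (or its defaults).
	reqs, crawled := df.Depth0Chans()
	_, _, _ = factory, reqs, crawled
}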
type ExponentialBackoff ¶
type ExponentialBackoff struct {
	InitialDelay time.Duration `yaml:"initial_delay" cmd:"the initial delay between retries for exponential backoff"`
	Steps        int           `yaml:"steps" cmd:"the number of steps of exponential backoff before giving up"`
	StatusCodes  []int         `yaml:"status_codes,flow" cmd:"the status codes that trigger a retry"`
}
ExponentialBackoff is the configuration for an exponential backoff retry strategy for downloads.
type Rate ¶
type Rate struct {
	Tick            time.Duration `yaml:"tick" cmd:"the duration of a tick"`
	RequestsPerTick int           `yaml:"requests_per_tick" cmd:"the number of requests per tick"`
	BytesPerTick    int           `yaml:"bytes_per_tick" cmd:"the number of bytes per tick"`
}
Rate specifies a rate in one of several forms; only one should be used.
type RateControl ¶
type RateControl struct {
	Rate               Rate               `yaml:"rate_control" cmd:"the rate control parameters"`
	ExponentialBackoff ExponentialBackoff `yaml:"exponential_backoff" cmd:"the exponential backoff parameters"`
}
RateControl is the configuration for rate based control of download requests.
func (RateControl) NewRateController ¶
func (c RateControl) NewRateController() (*ratecontrol.Controller, error)
NewRateController creates a new rate controller based on the values contained in RateControl.
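A minimal sketch, using representative values rather than recommended ones, of building a RateControl and creating a controller from it:

package main

import (
	"fmt"
	"time"

	"cloudeng.io/file/crawl/crawlcmd"
)

func main() {
	// Representative values only; tune these for the target service.
	rc := crawlcmd.RateControl{
		Rate: crawlcmd.Rate{
			Tick:            time.Minute,
			RequestsPerTick: 60,
		},
		ExponentialBackoff: crawlcmd.ExponentialBackoff{
			InitialDelay: time.Second,
			Steps:        5,
			StatusCodes:  []int{429, 503},
		},
	}
	ctrl, err := rc.NewRateController()
	if err != nil {
		fmt.Printf("error: %v\n", err)
		return
	}
	_ = ctrl
}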
type Resources ¶
type Resources struct {
	// Extractors are used to extract outlinks from crawled documents
	// based on their content type.
	Extractors map[content.Type]outlinks.Extractor

	// CrawlStoreFactories are used to create file.FS instances for
	// the files being crawled based on their scheme.
	CrawlStoreFactories map[string]FSFactory

	// NewContentFS is a function that returns a content.FS used to store
	// the downloaded content.
	NewContentFS func(context.Context, CrawlCacheConfig) (content.FS, error)
}
Resources contains the resources required by the crawler.
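A sketch of populating Resources with placeholder values; the empty maps and the nil-returning content.FS constructor stand in for real extractors and file system factories, and the content and outlinks import paths are assumed to match the package's siblings.

package main

import (
	"context"
	"fmt"

	"cloudeng.io/file/content"
	"cloudeng.io/file/crawl/crawlcmd"
	"cloudeng.io/file/crawl/outlinks"
)

func main() {
	// Sketch only: the maps are left empty and the content.FS constructor
	// returns nil values, since the concrete implementations depend on the
	// services being crawled and the store being written to.
	resources := crawlcmd.Resources{
		Extractors:          map[content.Type]outlinks.Extractor{},
		CrawlStoreFactories: map[string]crawlcmd.FSFactory{},
		NewContentFS: func(ctx context.Context, cfg crawlcmd.CrawlCacheConfig) (content.FS, error) {
			return nil, nil
		},
	}
	fmt.Println(len(resources.Extractors))
}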