crawlcmd

package

v0.0.0-...-8a46e9b Latest Latest Go to latest Published: Jan 19, 2025 License: Apache-2.0 Imports: 19 Imported by: 8

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

README ¶

Package cloudeng.io/file/crawl/crawlcmd

import cloudeng.io/file/crawl/crawlcmd

Package crawlcmd provides support for building command line tools for crawling. In particular it provides support for managing the configuration of a crawl via yaml.

Types

Type Config

type Config struct {
	Name          string           `yaml:"name" cmd:"the name of the crawl"`
	Depth         int              `yaml:"depth" cmd:"the maximum depth to crawl"`
	Seeds         []string         `yaml:"seeds" cmd:"the initial set of URIs to crawl"`
	NoFollowRules []string         `yaml:"nofollow" cmd:"a set of regular expressions that will be used to determine which links to not follow. The regular expressions are applied to the full URL."`
	FollowRules   []string         `yaml:"follow" cmd:"a set of regular expressions that will be used to determine which links to follow. The regular expressions are applied to the full URL."`
	RewriteRules  []string         `yaml:"rewrite" cmd:"a set of regular expressions that will be used to rewrite links. The regular expressions are applied to the full URL."`
	Download      DownloadConfig   `yaml:"download" cmd:"the configuration for downloading documents"`
	NumExtractors int              `yaml:"num_extractors" cmd:"the number of concurrent link extractors to use"`
	Extractors    []content.Type   `yaml:"extractors" cmd:"the content types to extract links from"`
	Cache         CrawlCacheConfig `yaml:"cache" cmd:"the configuration for the cache of downloaded documents"`
}

Config represents the configuration for a single crawl.

Methods

func (c Config) CreateSeedCrawlRequests(
	ctx context.Context,
	factories map[string]FSFactory,
	seeds map[string][]cloudpath.Match,
) ([]download.Request, error)

CreateSeedCrawlRequests creates a set of crawl requests for the supplied seeds. It use the factories to create a file.FS for the URI scheme of each seed.

func (c Config) ExtractorRegistry(avail map[content.Type]outlinks.Extractor) (*content.Registry[outlinks.Extractor], error)

ExtractorRegistry returns a content.Registry containing the outlinks.Extractor that can be used with outlinks.Extract.

func (c Config) NewLinkProcessor() (*outlinks.RegexpProcessor, error)

NewLinkProcessor creates a outlinks.RegexpProcessor using the nofollow, follow and reqwrite specifications in the configuration.

func (c Config) SeedsByScheme(matchers cloudpath.MatcherSpec) (map[string][]cloudpath.Match, []string)

SeedsByScheme returns the crawl seeds grouped by their scheme and any seeds that are not recognised by the supplied cloudpath.MatcherSpec.

Type CrawlCacheConfig

type CrawlCacheConfig struct {
	Downloads         string    `yaml:"downloads" cmd:"the prefix/directory to use for the cache of downloaded documents. This is an absolute path the root directory of the crawl."`
	ClearBeforeCrawl  bool      `yaml:"clear_before_crawl" cmd:"if true, the cache and checkpoint will be cleared before the crawl starts."`
	Checkpoint        string    `yaml:"checkpoint" cmd:"the location of any checkpoint data used to resume a crawl, this is an absolute path."`
	ShardingPrefixLen int       `yaml:"sharding_prefix_len" cmd:"the number of characters of the filename to use for sharding the cache. This is intended to avoid filesystem limits on the number of files in a directory."`
	Concurrency       int       `yaml:"concurrency" cmd:"the number of concurrent operations to use when reading/writing to the cache."`
	ServiceConfig     yaml.Node `yaml:"service_config,omitempty" cmd:"cache service specific configuration, eg. AWS specific configuration"`
}

Each crawl may specify its own cache directory and configuration. This will be used to store the results of the crawl. The ServiceSpecific field is intended to be parametized to some service specific configuration for cache services that require it, such as AWS S3. This is deliberately left to client packages to avoid depenedency bloat in core packages such as this. The type of the ServiceConfig file is generally determined using the scheme of the Downloads path (e.g s3://... would imply an AWS specific configuration).

Methods

func (c CrawlCacheConfig) CheckpointPath() string

CheckpointPath returns the expanded checkpoint path.

func (c CrawlCacheConfig) DownloadPath() string

DownloadPath returns the expanded downloads path.

func (c CrawlCacheConfig) PrepareCheckpoint(ctx context.Context, op checkpoint.Operation) error

PrepareCheckpoint initializes the checkpoint operation (ie. calls op.Init(ctx, checkpointPath)) and optionally clears the checkpoint if ClearBeforeCrawl is true. It returns an error if the checkpoint cannot be initialized or cleared.

func (c CrawlCacheConfig) PrepareDownloads(ctx context.Context, fs content.FS) error

PrepareDownloads ensures that the cache directory exists and is empty if ClearBeforeCrawl is true. It returns an error if the directory cannot be created or cleared.

Type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler represents a crawler instance and contains global configuration information.

Functions

func NewCrawler(cfg Config, resources Resources) *Crawler

NewCrawler creates a new crawler instance using the supplied configuration and resources.

Methods

func (c *Crawler) Run(ctx context.Context,
	displayOutlinks, displayProgress bool) error

Run runs the crawler.

Type DownloadConfig

type DownloadConfig struct {
	DownloadFactoryConfig `yaml:",inline"`
	RateControlConfig     RateControl `yaml:",inline"`
}

Type DownloadFactoryConfig

type DownloadFactoryConfig struct {
	DefaultConcurrency       int   `yaml:"default_concurrency" cmd:"the number of concurrent downloads (defaults to GOMAXPROCS(0)), used when a per crawl depth value is not specified via per_depth_concurrency."`
	DefaultRequestChanSize   int   `yaml:"default_request_chan_size" cmd:"the size of the channel used to queue download requests, used when a per crawl depth value is not specified via per_depth_request_chan_sizes. Increased values allow for more concurrency between discovering new items to crawl and crawling them."`
	DefaultCrawledChanSize   int   `yaml:"default_crawled_chan_size" cmd:"the size of the channel used to queue downloaded items, used when a per crawl depth value is not specified via per_depth_crawled_chan_sizes. Increased values allow for more concurrency between downloading documents and processing them."`
	PerDepthConcurrency      []int `yaml:"per_depth_concurrency" cmd:"per crawl depth values for the number of concurrent downloads"`
	PerDepthRequestChanSizes []int `yaml:"per_depth_request_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue download requests"`
	PerDepthCrawledChanSizes []int `yaml:"per_depth_crawled_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue downloaded items"`
}

DownloadFactoryConfig is the configuration for a crawl.DownloaderFactory.

Methods

func (df DownloadFactoryConfig) Depth0Chans() (chan download.Request, chan crawl.Crawled)

Depth0Chans creates the chanels required to start the crawl with their capacities set to the values specified in the DownloadFactoryConfig for a depth0 crawl, or the default values if none are specified.

func (df DownloadFactoryConfig) NewFactory(ch chan<- download.Progress) crawl.DownloaderFactory

NewFactory returns a new instance of a crawl.DownloaderFactory which is parametised via its DownloadFactoryConfig receiver.

Type ExponentialBackoff

type ExponentialBackoff struct {
	InitialDelay time.Duration `yaml:"initial_delay" cmd:"the initial delay between retries for exponential backoff"`
	Steps        int           `yaml:"steps" cmd:"the number of steps of exponential backoff before giving up"`
	StatusCodes  []int         `yaml:"status_codes,flow" cmd:"the status codes that trigger a retry"`
}

ExponentialBackoffConfig is the configuration for an exponential backoff retry strategy for downloads.

Type FSFactory

type FSFactory func(context.Context) (file.FS, error)

FSFactory is a function that returns a file.FS used to crawl a given FS.

Type Rate

type Rate struct {
	Tick            time.Duration `yaml:"tick" cmd:"the duration of a tick"`
	RequestsPerTick int           `yaml:"requests_per_tick" cmd:"the number of requests per tick"`
	BytesPerTick    int           `yaml:"bytes_per_tick" cmd:"the number of bytes per tick"`
}

Rate specifies a rate in one of several forms, only one should be used.

Type RateControl

type RateControl struct {
	Rate               Rate               `yaml:"rate_control" cmd:"the rate control parameters"`
	ExponentialBackoff ExponentialBackoff `yaml:"exponential_backoff" cmd:"the exponential backoff parameters"`
}

RateControl is the configuration for rate based control of download requests.

Methods

func (c RateControl) NewRateController() (*ratecontrol.Controller, error)

NewRateController creates a new rate controller based on the values contained in RateControl.

Type Resources

type Resources struct {
	// Extractors are used to extract outlinks from crawled documents
	// based on their content type.
	Extractors map[content.Type]outlinks.Extractor
	// CrawlStoreFactories are used to create file.FS instances for
	// the files being crawled based on their scheme.
	CrawlStoreFactories map[string]FSFactory
	// ContentStoreFactory is a function that returns a content.FS used to store
	// the downloaded content.
	NewContentFS func(context.Context, CrawlCacheConfig) (content.FS, error)
}

Resources contains the resources required by the crawler.

Examples

ExampleCrawlCacheConfig

Documentation ¶

Overview ¶

Package crawlcmd provides support for building command line tools for crawling. In particular it provides support for managing the configuration of a crawl via yaml.

Index ¶

type Config
type CrawlCacheConfig
type Crawler
- func NewCrawler(cfg Config, resources Resources) *Crawler
- func (c *Crawler) Run(ctx context.Context, displayOutlinks, displayProgress bool) error
type DownloadConfig
type DownloadFactoryConfig
- func (df DownloadFactoryConfig) Depth0Chans() (chan download.Request, chan crawl.Crawled)
- func (df DownloadFactoryConfig) NewFactory(ch chan<- download.Progress) crawl.DownloaderFactory
type ExponentialBackoff
type FSFactory
type Rate
type RateControl
- func (c RateControl) NewRateController() (*ratecontrol.Controller, error)
type Resources

Examples ¶

CrawlCacheConfig

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Config ¶

type Config struct {
	Name          string           `yaml:"name" cmd:"the name of the crawl"`
	Depth         int              `yaml:"depth" cmd:"the maximum depth to crawl"`
	Seeds         []string         `yaml:"seeds" cmd:"the initial set of URIs to crawl"`
	NoFollowRules []string         `` /* 161-byte string literal not displayed */
	FollowRules   []string         `` /* 155-byte string literal not displayed */
	RewriteRules  []string         `` /* 138-byte string literal not displayed */
	Download      DownloadConfig   `yaml:"download" cmd:"the configuration for downloading documents"`
	NumExtractors int              `yaml:"num_extractors" cmd:"the number of concurrent link extractors to use"`
	Extractors    []content.Type   `yaml:"extractors" cmd:"the content types to extract links from"`
	Cache         CrawlCacheConfig `yaml:"cache" cmd:"the configuration for the cache of downloaded documents"`
}

Config represents the configuration for a single crawl.

func (Config) CreateSeedCrawlRequests ¶

func (c Config) CreateSeedCrawlRequests(
	ctx context.Context,
	factories map[string]FSFactory,
	seeds map[string][]cloudpath.Match,
) ([]download.Request, error)

CreateSeedCrawlRequests creates a set of crawl requests for the supplied seeds. It use the factories to create a file.FS for the URI scheme of each seed.

func (Config) ExtractorRegistry ¶

func (c Config) ExtractorRegistry(avail map[content.Type]outlinks.Extractor) (*content.Registry[outlinks.Extractor], error)

ExtractorRegistry returns a content.Registry containing the outlinks.Extractor that can be used with outlinks.Extract.

func (Config) NewLinkProcessor ¶

func (c Config) NewLinkProcessor() (*outlinks.RegexpProcessor, error)

NewLinkProcessor creates a outlinks.RegexpProcessor using the nofollow, follow and reqwrite specifications in the configuration.

func (Config) SeedsByScheme ¶

func (c Config) SeedsByScheme(matchers cloudpath.MatcherSpec) (map[string][]cloudpath.Match, []string)

SeedsByScheme returns the crawl seeds grouped by their scheme and any seeds that are not recognised by the supplied cloudpath.MatcherSpec.

type CrawlCacheConfig ¶

type CrawlCacheConfig struct {
	Downloads         string    `` /* 147-byte string literal not displayed */
	ClearBeforeCrawl  bool      `yaml:"clear_before_crawl" cmd:"if true, the cache and checkpoint will be cleared before the crawl starts."`
	Checkpoint        string    `yaml:"checkpoint" cmd:"the location of any checkpoint data used to resume a crawl, this is an absolute path."`
	ShardingPrefixLen int       `` /* 187-byte string literal not displayed */
	Concurrency       int       `yaml:"concurrency" cmd:"the number of concurrent operations to use when reading/writing to the cache."`
	ServiceConfig     yaml.Node `yaml:"service_config,omitempty" cmd:"cache service specific configuration, eg. AWS specific configuration"`
}

Each crawl may specify its own cache directory and configuration. This will be used to store the results of the crawl. The ServiceSpecific field is intended to be parametized to some service specific configuration for cache services that require it, such as AWS S3. This is deliberately left to client packages to avoid depenedency bloat in core packages such as this. The type of the ServiceConfig file is generally determined using the scheme of the Downloads path (e.g s3://... would imply an AWS specific configuration).

Example ¶

package main

import (
	"fmt"

	"cloudeng.io/cmdutil/cmdyaml"
	"cloudeng.io/file/crawl/crawlcmd"
)

func main() {
	type cloudConfig struct {
		Region string `yaml:"region"`
	}
	var cfg crawlcmd.CrawlCacheConfig
	var service cloudConfig

	err := cmdyaml.ParseConfig([]byte(`
downloads: cloud-service://bucket/downloads
service_config:
  region: us-west-2
`), &cfg)
	if err != nil {
		fmt.Printf("error: %v\n", err)
	}
	if err := cfg.ServiceConfig.Decode(&service); err != nil {
		fmt.Printf("error: %v\n", err)
	}
	fmt.Println(cfg.Downloads)
	fmt.Println(service.Region)
}

Output:

cloud-service://bucket/downloads
us-west-2

func (CrawlCacheConfig) CheckpointPath ¶

func (c CrawlCacheConfig) CheckpointPath() string

CheckpointPath returns the expanded checkpoint path.

func (CrawlCacheConfig) DownloadPath ¶

func (c CrawlCacheConfig) DownloadPath() string

DownloadPath returns the expanded downloads path.

func (CrawlCacheConfig) PrepareCheckpoint ¶

func (c CrawlCacheConfig) PrepareCheckpoint(ctx context.Context, op checkpoint.Operation) error

PrepareCheckpoint initializes the checkpoint operation (ie. calls op.Init(ctx, checkpointPath)) and optionally clears the checkpoint if ClearBeforeCrawl is true. It returns an error if the checkpoint cannot be initialized or cleared.

func (CrawlCacheConfig) PrepareDownloads ¶

func (c CrawlCacheConfig) PrepareDownloads(ctx context.Context, fs content.FS) error

PrepareDownloads ensures that the cache directory exists and is empty if ClearBeforeCrawl is true. It returns an error if the directory cannot be created or cleared.

type Crawler ¶

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler represents a crawler instance and contains global configuration information.

func NewCrawler ¶

func NewCrawler(cfg Config, resources Resources) *Crawler

NewCrawler creates a new crawler instance using the supplied configuration and resources.

func (*Crawler) Run ¶

func (c *Crawler) Run(ctx context.Context,
	displayOutlinks, displayProgress bool) error

Run runs the crawler.

type DownloadConfig ¶

type DownloadConfig struct {
	DownloadFactoryConfig `yaml:",inline"`
	RateControlConfig     RateControl `yaml:",inline"`
}

type DownloadFactoryConfig ¶

type DownloadFactoryConfig struct {
	DefaultConcurrency       int   `` /* 174-byte string literal not displayed */
	DefaultRequestChanSize   int   `` /* 282-byte string literal not displayed */
	DefaultCrawledChanSize   int   `` /* 274-byte string literal not displayed */
	PerDepthConcurrency      []int `yaml:"per_depth_concurrency" cmd:"per crawl depth values for the number of concurrent downloads"`
	PerDepthRequestChanSizes []int `yaml:"per_depth_request_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue download requests"`
	PerDepthCrawledChanSizes []int `yaml:"per_depth_crawled_chan_sizes" cmd:"per crawl depth values for the size of the channel used to queue downloaded items"`
}

DownloadFactoryConfig is the configuration for a crawl.DownloaderFactory.

func (DownloadFactoryConfig) Depth0Chans ¶

func (df DownloadFactoryConfig) Depth0Chans() (chan download.Request, chan crawl.Crawled)

Depth0Chans creates the chanels required to start the crawl with their capacities set to the values specified in the DownloadFactoryConfig for a depth0 crawl, or the default values if none are specified.

func (DownloadFactoryConfig) NewFactory ¶

func (df DownloadFactoryConfig) NewFactory(ch chan<- download.Progress) crawl.DownloaderFactory

NewFactory returns a new instance of a crawl.DownloaderFactory which is parametised via its DownloadFactoryConfig receiver.

type ExponentialBackoff ¶

type ExponentialBackoff struct {
	InitialDelay time.Duration `yaml:"initial_delay" cmd:"the initial delay between retries for exponential backoff"`
	Steps        int           `yaml:"steps" cmd:"the number of steps of exponential backoff before giving up"`
	StatusCodes  []int         `yaml:"status_codes,flow" cmd:"the status codes that trigger a retry"`
}

ExponentialBackoffConfig is the configuration for an exponential backoff retry strategy for downloads.

type FSFactory ¶

type FSFactory func(context.Context) (file.FS, error)

FSFactory is a function that returns a file.FS used to crawl a given FS.

type Rate ¶

type Rate struct {
	Tick            time.Duration `yaml:"tick" cmd:"the duration of a tick"`
	RequestsPerTick int           `yaml:"requests_per_tick" cmd:"the number of requests per tick"`
	BytesPerTick    int           `yaml:"bytes_per_tick" cmd:"the number of bytes per tick"`
}

Rate specifies a rate in one of several forms, only one should be used.

type RateControl ¶

type RateControl struct {
	Rate               Rate               `yaml:"rate_control" cmd:"the rate control parameters"`
	ExponentialBackoff ExponentialBackoff `yaml:"exponential_backoff" cmd:"the exponential backoff parameters"`
}

RateControl is the configuration for rate based control of download requests.

func (RateControl) NewRateController ¶

func (c RateControl) NewRateController() (*ratecontrol.Controller, error)

NewRateController creates a new rate controller based on the values contained in RateControl.

type Resources ¶

type Resources struct {
	// Extractors are used to extract outlinks from crawled documents
	// based on their content type.
	Extractors map[content.Type]outlinks.Extractor
	// CrawlStoreFactories are used to create file.FS instances for
	// the files being crawled based on their scheme.
	CrawlStoreFactories map[string]FSFactory
	// ContentStoreFactory is a function that returns a content.FS used to store
	// the downloaded content.
	NewContentFS func(context.Context, CrawlCacheConfig) (content.FS, error)
}

Resources contains the resources required by the crawler.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL