crawl

Version: v0.0.0-...-d85efee
Published: Jan 25, 2023
License: MIT
Imports: 25
Imported by: 0

Repository

git.jordan.im/crawl

README

A very simple crawler

This is a fork of crawl with changes that make it better suited as a drop-in replacement for wpull/grab-site. Notable changes include:

- dramatically reduced memory usage: responses are (temporarily) written to the filesystem rather than passed around in memory buffers
- --bind: make outbound requests from a particular interface
- --resume: directory containing the crawl state to continue from
- infinite recursion depth by default
- User-Agent fingerprint set to Firefox on Windows, to look more like a browser
- crawl contents stored in a dated directory
- ignore regex set updated to match ArchiveBot
- default maximum WARC size raised from 100 MB to 5 GB
- assembled seed URLs recorded to a seed_urls file

This tool can crawl a bunch of URLs for HTML content, and save the results in a nice WARC file. It has little control over its traffic, save for a limit on concurrent outbound requests. An external tool like trickle can be used to limit bandwidth.

Its main purpose is to quickly and efficiently save websites for archival purposes.

The crawl tool saves its state in a database, so it can be safely interrupted and restarted without issues.

Installation

Assuming you have a proper Go environment set up, you can install this package by running:

$ go install git.jordan.im/crawl/cmd/crawl@latest

This should install the crawl binary in your $GOPATH/bin directory.

Usage

Just run crawl by passing the URLs of the websites you want to crawl as arguments on the command line:

$ crawl http://example.com/

By default, the tool will store the output WARC file and its own temporary crawl database in a newly-created directory.

The crawling scope is controlled with a set of overlapping checks:

- URL scheme must be one of http or https
- URL must have one of the seeds as a prefix (a leading www. is implicitly ignored)
- maximum crawling depth can be controlled with the --depth option
- resources related to a page (CSS, JS, etc.) will always be fetched, even if on external domains, unless the --exclude-related option is specified

If the program is interrupted, running it again from the same directory with the same command line, plus --resume pointing at the previous crawl state directory, resumes crawling from where it stopped. At the end of a successful crawl, the temporary crawl database is removed (unless you specify the --keep option, for debugging purposes).

You can tell the crawler to exclude URLs matching specific regex patterns with the --exclude or --exclude-from-file options. These options may be repeated multiple times. The crawler comes with its own built-in set of URI regular expressions meant to avoid calendars, admin panels of common CMS applications, and other well-known pitfalls. This list is sourced from the ArchiveBot project.
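The exclusion logic described above can be sketched with the standard regexp package. This is only an illustration: the function names, the sample patterns, and the file format (one pattern per line, with blank lines and # comments skipped) are assumptions, not the tool's documented behavior.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// parseExcludePatterns compiles one regular expression per line.
// Assumption: blank lines and lines starting with '#' are skipped;
// the real --exclude-from-file format may differ.
func parseExcludePatterns(text string) ([]*regexp.Regexp, error) {
	var out []*regexp.Regexp
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		rx, err := regexp.Compile(line)
		if err != nil {
			return nil, fmt.Errorf("bad pattern %q: %w", line, err)
		}
		out = append(out, rx)
	}
	return out, sc.Err()
}

// excluded reports whether any pattern matches the URL string.
func excluded(patterns []*regexp.Regexp, rawURL string) bool {
	for _, rx := range patterns {
		if rx.MatchString(rawURL) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical patterns, in the spirit of "admin panels and calendars".
	patterns, err := parseExcludePatterns("# sample\n/wp-admin/\n[?&]month=\\d+\n")
	if err != nil {
		panic(err)
	}
	for _, u := range []string{
		"http://example.com/wp-admin/options.php",
		"http://example.com/calendar?month=12",
		"http://example.com/about",
	} {
		fmt.Println(u, excluded(patterns, u))
	}
}
```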

If you're running a larger crawl, the tool can be told to rotate the output WARC files when they reach a certain size (controlled by the --output-max-size flag; the list of changes above gives this fork's default as 5 GB, raised from 100 MB).

Limitations

Like most crawlers, this one has a number of limitations:

- it completely ignores robots.txt. You can make such policy decisions yourself by turning the robots.txt into a list of patterns to be used with --exclude-from-file.
- it does not embed a JavaScript engine, so JavaScript-rendered elements will not be detected.
- CSS parsing is limited (it uses regular expressions), so some url() resources might not be detected.
- it expects reasonably well-formed HTML, so it may fail to extract links from particularly broken pages.
- support for <object> and <video> tags is limited.

Contact

Send bugs and patches to me@jordan.im.

Documentation

Index

Constants
Variables
func Must(err error)
func MustParseURLs(urls []string) []*url.URL
func NewHTTPClient() *http.Client
func NewHTTPClientWithOverrides(dnsMap map[string]string, localAddr *net.IPAddr) *http.Client
type Crawler
- func NewCrawler(path string, seeds []*url.URL, scope Scope, f Fetcher, h Handler) (*Crawler, error)
- func (c *Crawler) Close()
- func (c *Crawler) Enqueue(link Outlink, depth int) error
- func (c *Crawler) Run(concurrency int)
- func (c *Crawler) Stop()
type Fetcher
type FetcherFunc
- func (f FetcherFunc) Fetch(u string) (*http.Response, error)
type Handler
- func FilterErrors(wrap Handler) Handler
- func FollowRedirects(wrap Handler) Handler
- func HandleRetries(wrap Handler) Handler
type HandlerFunc
- func (f HandlerFunc) Handle(p Publisher, u string, tag, depth int, resp *http.Response, body *os.File, ...) error
type Outlink
type Publisher
type Scope
- func AND(elems ...Scope) Scope
- func NewDepthScope(maxDepth int) Scope
- func NewIncludeRelatedScope() Scope
- func NewRegexpIgnoreScope(ignores []*regexp.Regexp) Scope
- func NewSchemeScope(schemes []string) Scope
- func NewSeedScope(seeds []*url.URL) Scope
- func NewURLPrefixScope(prefixes URLPrefixMap) Scope
- func OR(elems ...Scope) Scope
type URLPrefixMap
- func (m URLPrefixMap) Add(uri *url.URL)
- func (m URLPrefixMap) Contains(uri *url.URL) bool

Constants

const (
	// TagPrimary is a primary reference (another web page).
	TagPrimary = iota

	// TagRelated is a secondary resource, related to a page.
	TagRelated
)

Variables

var DefaultClient *http.Client

DefaultClient points at a shared http.Client suitable for crawling: does not follow redirects, accepts invalid TLS certificates, sets a reasonable timeout for requests.

var ErrRetryRequest = errors.New("retry_request")

ErrRetryRequest is returned by a Handler when the request should be retried after some time.

Functions

func Must

func Must(err error)

Must will abort the program with a message when we encounter an error that we can't recover from.

func MustParseURLs

func MustParseURLs(urls []string) []*url.URL

MustParseURLs parses a list of URLs and aborts on failure.

func NewHTTPClient

func NewHTTPClient() *http.Client

NewHTTPClient returns an http.Client suitable for crawling.

func NewHTTPClientWithOverrides

func NewHTTPClientWithOverrides(dnsMap map[string]string, localAddr *net.IPAddr) *http.Client

NewHTTPClientWithOverrides returns an http.Client suitable for crawling, with additional (optional) DNS and LocalAddr overrides.
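How such overrides might be wired with the standard library is sketched below. The interpretation of dnsMap as hostname to IP address is an assumption, as are the helper names overrideAddr and newClientWithOverrides; the package's real implementation may differ.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// overrideAddr rewrites "host:port" when host has an entry in dnsMap.
// Assumption: dnsMap maps hostnames to IP address strings.
func overrideAddr(dnsMap map[string]string, addr string) string {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return addr
	}
	if ip, ok := dnsMap[host]; ok {
		return net.JoinHostPort(ip, port)
	}
	return addr
}

// newClientWithOverrides dials through the override table and, when
// localAddr is set, binds outbound connections to that address.
func newClientWithOverrides(dnsMap map[string]string, localAddr *net.IPAddr) *http.Client {
	dialer := &net.Dialer{Timeout: 30 * time.Second}
	if localAddr != nil {
		// TCP dials need a *net.TCPAddr; port 0 lets the OS choose.
		dialer.LocalAddr = &net.TCPAddr{IP: localAddr.IP, Zone: localAddr.Zone}
	}
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				return dialer.DialContext(ctx, network, overrideAddr(dnsMap, addr))
			},
		},
	}
}

func main() {
	dns := map[string]string{"example.com": "192.0.2.10"} // hypothetical override
	fmt.Println(overrideAddr(dns, "example.com:443"))
	fmt.Println(overrideAddr(dns, "example.org:80"))
	client := newClientWithOverrides(dns, &net.IPAddr{IP: net.ParseIP("127.0.0.1")})
	fmt.Println(client.Transport != nil)
}
```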

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

The Crawler object contains the crawler state.

func NewCrawler

func NewCrawler(path string, seeds []*url.URL, scope Scope, f Fetcher, h Handler) (*Crawler, error)

NewCrawler creates a new Crawler object with the specified behavior.

func (*Crawler) Close

func (c *Crawler) Close()

Close the database and release resources associated with the crawler state.

func (*Crawler) Enqueue

func (c *Crawler) Enqueue(link Outlink, depth int) error

Enqueue a (possibly new) URL for processing.

func (*Crawler) Run

func (c *Crawler) Run(concurrency int)

Run the crawl with the specified number of workers. This function does not return until all work is done (no URLs left in the queue).

func (*Crawler) Stop

func (c *Crawler) Stop()

Stop a running crawl. This will cause a running Run function to return.

type Fetcher

type Fetcher interface {
	// Fetch retrieves a URL and returns the response.
	Fetch(string) (*http.Response, error)
}

A Fetcher retrieves contents from remote URLs.

type FetcherFunc

type FetcherFunc func(string) (*http.Response, error)

FetcherFunc wraps a simple function into the Fetcher interface.

func (FetcherFunc) Fetch

func (f FetcherFunc) Fetch(u string) (*http.Response, error)

Fetch retrieves a URL and returns the response.

type Handler

type Handler interface {
	// Handle the response from a URL.
	Handle(Publisher, string, int, int, *http.Response, *os.File, error) error
}

A Handler processes crawled contents. Any errors returned by public implementations of this interface are considered fatal and will cause the crawl to abort. The URL will be removed from the queue unless the handler returns the special error ErrRetryRequest.

func FilterErrors

func FilterErrors(wrap Handler) Handler

FilterErrors returns a Handler that forwards only requests with a "successful" HTTP status code (anything < 400). When using this wrapper, subsequent Handle calls will always have err set to nil.

func FollowRedirects

func FollowRedirects(wrap Handler) Handler

FollowRedirects returns a Handler that follows HTTP redirects and adds them to the queue for crawling. It will call the wrapped handler on all requests regardless.

func HandleRetries

func HandleRetries(wrap Handler) Handler

HandleRetries returns a Handler that will retry requests on temporary errors (all transport-level errors are considered temporary, as well as any HTTP status code >= 500).

type HandlerFunc

type HandlerFunc func(Publisher, string, int, int, *http.Response, *os.File, error) error

HandlerFunc wraps a function into the Handler interface.

func (HandlerFunc) Handle

func (f HandlerFunc) Handle(p Publisher, u string, tag, depth int, resp *http.Response, body *os.File, err error) error

Handle the response from a URL.

type Outlink

type Outlink struct {
	URL *url.URL
	Tag int
}

Outlink is a tagged outbound link.

type Publisher

type Publisher interface {
	Enqueue(Outlink, int) error
}

Publisher is an interface to something with an Enqueue() method to add new potential URLs to crawl.

type Scope

type Scope interface {
	// Check a URL to see if it's in scope for crawling.
	Check(Outlink, int) bool
}

Scope defines the crawling scope.

func AND

func AND(elems ...Scope) Scope

AND returns a Scope that accepts a URL only if every element of elems accepts it (a boolean AND).

func NewDepthScope

func NewDepthScope(maxDepth int) Scope

NewDepthScope returns a Scope that will limit crawls to a maximum link depth with respect to the crawl seeds.

func NewIncludeRelatedScope

func NewIncludeRelatedScope() Scope

NewIncludeRelatedScope always includes resources with TagRelated.

func NewRegexpIgnoreScope

func NewRegexpIgnoreScope(ignores []*regexp.Regexp) Scope

NewRegexpIgnoreScope returns a Scope that filters out URLs according to a list of regular expressions.

func NewSchemeScope

func NewSchemeScope(schemes []string) Scope

NewSchemeScope limits the crawl to the specified URL schemes.

func NewSeedScope

func NewSeedScope(seeds []*url.URL) Scope

NewSeedScope returns a Scope that will only allow crawling the seed prefixes.

func NewURLPrefixScope

func NewURLPrefixScope(prefixes URLPrefixMap) Scope

NewURLPrefixScope returns a Scope that limits the crawl to a set of allowed URL prefixes.

func OR

func OR(elems ...Scope) Scope

OR returns a Scope that accepts a URL if at least one element of elems accepts it (a boolean OR).
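Scope combinators of this kind can be sketched in a few lines. Everything below is a local stand-in written from the documented signatures: scopeFunc and the lowercase constructors are hypothetical, the depth boundary (depth < maxDepth) is an assumption, and the way the scopes are composed in main is one plausible arrangement rather than the one the crawl command uses.

```go
package main

import (
	"fmt"
	"net/url"
)

// Local stand-ins mirroring the documented declarations.
const (
	TagPrimary = iota
	TagRelated
)

type Outlink struct {
	URL *url.URL
	Tag int
}

type Scope interface {
	Check(Outlink, int) bool
}

// scopeFunc lets a plain function act as a Scope (hypothetical helper).
type scopeFunc func(Outlink, int) bool

func (f scopeFunc) Check(l Outlink, depth int) bool { return f(l, depth) }

// and accepts a link only if every element accepts it.
func and(elems ...Scope) Scope {
	return scopeFunc(func(l Outlink, depth int) bool {
		for _, e := range elems {
			if !e.Check(l, depth) {
				return false
			}
		}
		return true
	})
}

// or accepts a link if at least one element accepts it.
func or(elems ...Scope) Scope {
	return scopeFunc(func(l Outlink, depth int) bool {
		for _, e := range elems {
			if e.Check(l, depth) {
				return true
			}
		}
		return false
	})
}

func schemeScope(schemes ...string) Scope {
	allowed := make(map[string]bool, len(schemes))
	for _, s := range schemes {
		allowed[s] = true
	}
	return scopeFunc(func(l Outlink, _ int) bool { return allowed[l.URL.Scheme] })
}

// depthScope assumes an exclusive upper bound; the real boundary may differ.
func depthScope(maxDepth int) Scope {
	return scopeFunc(func(_ Outlink, depth int) bool { return depth < maxDepth })
}

func relatedScope() Scope {
	return scopeFunc(func(l Outlink, _ int) bool { return l.Tag == TagRelated })
}

func link(raw string, tag int) Outlink {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return Outlink{URL: u, Tag: tag}
}

func main() {
	scope := or(and(schemeScope("http", "https"), depthScope(2)), relatedScope())
	fmt.Println(scope.Check(link("http://example.com/a", TagPrimary), 1))         // true
	fmt.Println(scope.Check(link("http://example.com/a", TagPrimary), 5))         // false
	fmt.Println(scope.Check(link("http://cdn.example.net/s.css", TagRelated), 5)) // true
}
```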

type URLPrefixMap

type URLPrefixMap map[string]struct{}

A URLPrefixMap makes it easy to check for URL prefixes (even for very large lists). The URL scheme is ignored, along with any leading "www." in the host name.

func (URLPrefixMap) Add

func (m URLPrefixMap) Add(uri *url.URL)

Add a URL to the prefix map.

func (URLPrefixMap) Contains

func (m URLPrefixMap) Contains(uri *url.URL) bool

Contains returns true if the given URL matches the prefix map.
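One way a map-based prefix check can stay cheap for large lists is to look up each successively longer path prefix, so the cost depends on the URL's path depth and not on the number of stored prefixes. The sketch below shows that idea with a local prefixMap type; it is an illustration under that assumption, not necessarily how URLPrefixMap is implemented.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// prefixMap is a local stand-in with the same underlying type.
type prefixMap map[string]struct{}

// normalize drops the scheme, a leading "www." and a trailing slash.
func normalize(u *url.URL) string {
	return strings.TrimPrefix(u.Host, "www.") + strings.TrimSuffix(u.Path, "/")
}

func (m prefixMap) add(u *url.URL) { m[normalize(u)] = struct{}{} }

// contains checks the host alone, then host plus each longer path
// prefix, one path segment at a time.
func (m prefixMap) contains(u *url.URL) bool {
	key := strings.TrimPrefix(u.Host, "www.")
	if _, ok := m[key]; ok {
		return true
	}
	for _, seg := range strings.Split(u.Path, "/") {
		if seg == "" {
			continue
		}
		key += "/" + seg
		if _, ok := m[key]; ok {
			return true
		}
	}
	return false
}

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	m := prefixMap{}
	m.add(mustParse("http://www.example.com/blog/"))
	fmt.Println(m.contains(mustParse("https://example.com/blog/post/1")))
	fmt.Println(m.contains(mustParse("http://example.com/about")))
}
```

Matching whole path segments means a stored prefix of /blog does not accidentally admit /blogger.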

Directories

Path	Synopsis
analysis
cmd
crawl
links
warc
