Documentation
Constants
const (
	DefaultUserAgent   = "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
	DefaultRobotsAgent = "Googlebot (crawlbot v1)"
	DefaultTimeToLive  = 3 * DefaultDelay
	DefaultDelay       = 3 * time.Second
)
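Since DefaultTimeToLive is defined in terms of DefaultDelay, it resolves to nine seconds. A small standalone sketch that checks this (the constants are copied here only for illustration):

package main

import (
	"fmt"
	"time"
)

// Copies of the package constants above, for illustration only.
const (
	DefaultDelay      = 3 * time.Second
	DefaultTimeToLive = 3 * DefaultDelay
)

func main() {
	fmt.Println(DefaultDelay)      // 3s
	fmt.Println(DefaultTimeToLive) // 9s
}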
Variables
This section is empty.
Functions
This section is empty.
Types
type Worker
type Worker struct {
	// GetFunc issues a GET request to the specified URL and returns the
	// response body and an error, if any.
	GetFunc func(*url.URL) (io.ReadCloser, error)

	// IsAcceptedFunc can be used to control the crawler.
	IsAcceptedFunc func(*url.URL) bool

	// ProcessFunc can be used to scrape data.
	ProcessFunc func(*url.URL, *html.Node, []byte)

	// Host defines the hostname to crawl. Worker is a single-host crawler.
	Host *url.URL

	// UserAgent defines the user-agent string to use for URL fetching.
	UserAgent string

	Accept []*regexp.Regexp
	Reject []*regexp.Regexp

	// Delay to use between requests to the same host if there is no
	// robots.txt crawl delay. The delay starts as soon as the response
	// is received from the host.
	Delay time.Duration

	// MaxEnqueue is the maximum number of pages visited before stopping
	// the crawl. Note that the Crawler will send its stop signal once
	// this number of visits is reached, but workers may be in the
	// process of visiting other pages, so when the crawling stops, the
	// number of pages visited will be at least MaxEnqueue, possibly more.
	MaxEnqueue int64

	Robots Robots

	Concurrent int
}
Worker represents a crawler worker implementation.
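A minimal configuration sketch follows. The package's import path and the method used to start a configured Worker are not shown in this section, so the crawler import alias below is a placeholder and the final step is left open; only the documented fields and constants are used. GetFunc here uses net/http directly so the request can carry the configured user-agent string.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"regexp"

	"golang.org/x/net/html"

	crawler "example.com/crawler" // placeholder: substitute this package's real import path
)

func main() {
	host, err := url.Parse("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}

	w := &crawler.Worker{
		// Fetch each page with net/http, sending the default user-agent.
		GetFunc: func(u *url.URL) (io.ReadCloser, error) {
			req, err := http.NewRequest("GET", u.String(), nil)
			if err != nil {
				return nil, err
			}
			req.Header.Set("User-Agent", crawler.DefaultUserAgent)
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return nil, err
			}
			return resp.Body, nil
		},
		// Stay on the configured host.
		IsAcceptedFunc: func(u *url.URL) bool {
			return u.Host == host.Host
		},
		// Scrape step: here we only report what was visited.
		ProcessFunc: func(u *url.URL, doc *html.Node, body []byte) {
			fmt.Printf("visited %s (%d bytes)\n", u, len(body))
		},
		Host:       host,
		UserAgent:  crawler.DefaultUserAgent,
		Reject:     []*regexp.Regexp{regexp.MustCompile(`\.(png|jpg|css|js)$`)},
		Delay:      crawler.DefaultDelay,
		MaxEnqueue: 100,
		Concurrent: 4,
	}

	_ = w // starting the crawl (e.g. a Run/Start method) is not covered by this section
}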