Documentation ¶
Overview ¶
Package gocrawl is a polite, slim and concurrent web crawler written in Go.
Index ¶
- Constants
- Variables
- type CrawlError
- type CrawlErrorKind
- type Crawler
- type DefaultExtender
- func (de *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration
- func (de *DefaultExtender) Disallowed(ctx *URLContext)
- func (de *DefaultExtender) End(err error)
- func (de *DefaultExtender) Enqueued(ctx *URLContext)
- func (de *DefaultExtender) Error(err *CrawlError)
- func (de *DefaultExtender) Fetch(ctx *URLContext, userAgent string, headRequest bool) (*http.Response, error)
- func (de *DefaultExtender) FetchedRobots(ctx *URLContext, res *http.Response)
- func (de *DefaultExtender) Filter(ctx *URLContext, isVisited bool) bool
- func (de *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)
- func (de *DefaultExtender) RequestGet(ctx *URLContext, headRes *http.Response) bool
- func (de *DefaultExtender) RequestRobots(ctx *URLContext, robotAgent string) (data []byte, doRequest bool)
- func (de *DefaultExtender) Start(seeds interface{}) interface{}
- func (de *DefaultExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (harvested interface{}, findLinks bool)
- func (de *DefaultExtender) Visited(ctx *URLContext, harvested interface{})
- type DelayInfo
- type Extender
- type FetchInfo
- type LogFlags
- type Options
- type S
- type U
- type URLContext
Constants ¶
const (
    DefaultUserAgent          string                    = `Mozilla/5.0 (Windows NT 6.1; rv:15.0) gocrawl/0.4 Gecko/20120716 Firefox/15.0a2`
    DefaultRobotUserAgent     string                    = `Googlebot (gocrawl v0.4)`
    DefaultEnqueueChanBuffer  int                       = 100
    DefaultHostBufferFactor   int                       = 10
    DefaultCrawlDelay         time.Duration             = 5 * time.Second
    DefaultIdleTTL            time.Duration             = 10 * time.Second
    DefaultNormalizationFlags purell.NormalizationFlags = purell.FlagsAllGreedy
)
Default options
Variables ¶
var (
    // ErrEnqueueRedirect is returned when a redirection is requested, so that the
    // worker knows that this is not an actual Fetch error, but a request to
    // enqueue the redirect-to URL.
    ErrEnqueueRedirect = errors.New("redirection not followed")

    // ErrMaxVisits is returned when the maximum number of visits, as specified by the
    // Options field MaxVisits, is reached.
    ErrMaxVisits = errors.New("the maximum number of visits is reached")

    // ErrInterrupted is returned when the crawler is manually stopped
    // (via a call to Stop).
    ErrInterrupted = errors.New("interrupted")
)
var HttpClient = &http.Client{CheckRedirect: func(req *http.Request, via []*http.Request) error {
    if isRobotsURL(req.URL) {
        if len(via) >= 10 {
            return errors.New("stopped after 10 redirects")
        }
        if len(via) > 0 {
            req.Header.Set("User-Agent", via[0].Header.Get("User-Agent"))
        }
        return nil
    }
    return ErrEnqueueRedirect
}}
HttpClient is the default HTTP client used by DefaultExtender's fetch requests (this is thread-safe). The client's fields can be customized (e.g. for a different redirection strategy, a different Transport object, ...). This customization should be done prior to starting the crawler.
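As a sketch only (the timeout and transport values are arbitrary, and the net/http, time and gocrawl imports are assumed), the client could be adjusted like this before the crawler is started:

    // Tune the shared client before starting any crawler; leave CheckRedirect
    // untouched to keep the default redirect-enqueuing behaviour.
    gocrawl.HttpClient.Timeout = 30 * time.Second
    gocrawl.HttpClient.Transport = &http.Transport{MaxIdleConnsPerHost: 2}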
Functions ¶
This section is empty.
Types ¶
type CrawlError ¶
type CrawlError struct {
    // The URL Context where the error occurred.
    Ctx *URLContext

    // The underlying error.
    Err error

    // The error kind.
    Kind CrawlErrorKind
    // contains filtered or unexported fields
}
CrawlError contains information about the crawling error.
func (CrawlError) Error ¶
func (ce CrawlError) Error() string
Error implements the error interface for CrawlError.
type CrawlErrorKind ¶
type CrawlErrorKind uint8
CrawlErrorKind indicates the kind of crawling error.
const (
    CekFetch CrawlErrorKind = iota
    CekParseRobots
    CekHttpStatusCode
    CekReadBody
    CekParseBody
    CekParseURL
    CekProcessLinks
    CekParseRedirectURL
)
The various kinds of crawling errors.
func (CrawlErrorKind) String ¶ added in v0.4.0
func (cek CrawlErrorKind) String() string
String returns the string representation of the error kind.
type Crawler ¶
type Crawler struct {
    // Options configures the Crawler, refer to the Options type for documentation.
    Options *Options
    // contains filtered or unexported fields
}
Crawler is the web crawler that processes URLs and manages the workers.
func NewCrawler ¶
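func NewCrawler(ext Extender) *Crawler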
NewCrawler returns a Crawler initialized with the default Options' values and the provided Extender. It is highly recommended to set at least the Options.RobotUserAgent to the custom name of your crawler before using the returned Crawler. Refer to the Options type documentation for details.
func NewCrawlerWithOptions ¶
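func NewCrawlerWithOptions(opts *Options) *Crawler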
NewCrawlerWithOptions returns a Crawler initialized with the provided Options.
func (*Crawler) Run ¶
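func (c *Crawler) Run(seeds interface{}) error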
Run starts the crawling process, based on the given seeds and the current Options settings. Execution stops either when MaxVisits is reached (if specified) or when no more URLs need visiting. If an error occurs, it is returned (if MaxVisits is reached, the error ErrMaxVisits is returned).
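A minimal sketch of a complete run, using the bare DefaultExtender to keep it short (the seed URL and user-agent name are placeholders):

    package main

    import (
        "log"

        "github.com/PuerkitoBio/gocrawl"
    )

    func main() {
        // The bare DefaultExtender keeps this sketch short; a real crawler
        // would embed it in a custom type (see DefaultExtender below).
        c := gocrawl.NewCrawler(new(gocrawl.DefaultExtender))
        c.Options.RobotUserAgent = "ExampleBot" // placeholder crawler name
        c.Options.MaxVisits = 5

        // ErrMaxVisits is the expected stop condition here, not a failure.
        if err := c.Run("http://example.com/"); err != nil && err != gocrawl.ErrMaxVisits {
            log.Fatal(err)
        }
    }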
type DefaultExtender ¶
type DefaultExtender struct {
EnqueueChan chan<- interface{}
}
DefaultExtender is a default working implementation of an extender. It is possible to nest such a value in a custom struct so that only the Extender methods that require custom behaviour have to be implemented.
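The sketch below (the type name, user agent and seed URL are hypothetical) shows the typical pattern: embed DefaultExtender in a custom struct and override only the hooks that need custom behaviour, here Visit:

    package main

    import (
        "net/http"
        "time"

        "github.com/PuerkitoBio/gocrawl"
        "github.com/PuerkitoBio/goquery"
    )

    // ExampleExtender is a hypothetical type that embeds DefaultExtender, so
    // only the hooks needing custom behaviour have to be implemented.
    type ExampleExtender struct {
        gocrawl.DefaultExtender // default implementation for all other Extender methods
    }

    // Visit overrides the default hook; returning true asks gocrawl to find
    // and enqueue the links of the document.
    func (x *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
        // Use doc (a goquery document) to scrape data here.
        return nil, true
    }

    func main() {
        opts := gocrawl.NewOptions(new(ExampleExtender))
        opts.RobotUserAgent = "ExampleBot" // placeholder; use your crawler's name
        opts.CrawlDelay = 1 * time.Second
        opts.MaxVisits = 10

        c := gocrawl.NewCrawlerWithOptions(opts)
        c.Run("http://example.com/")
    }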
func (*DefaultExtender) ComputeDelay ¶
func (de *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration
ComputeDelay returns the delay specified in the Crawler's Options, unless a crawl-delay is specified in the robots.txt file, which has precedence.
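For illustration, ComputeDelay can be overridden on a struct that embeds DefaultExtender. The sketch below (the type name, host and delay are hypothetical, and the time and gocrawl imports are assumed) throttles one host more aggressively and defers to the default behaviour for all others:

    // ThrottledExtender is a hypothetical extender that slows down one host.
    type ThrottledExtender struct {
        gocrawl.DefaultExtender
    }

    func (x *ThrottledExtender) ComputeDelay(host string, di *gocrawl.DelayInfo, lastFetch *gocrawl.FetchInfo) time.Duration {
        if host == "slow.example.com" { // placeholder host
            return 30 * time.Second
        }
        // Fall back to the default delay (Options or robots.txt crawl-delay).
        return x.DefaultExtender.ComputeDelay(host, di, lastFetch)
    }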
func (*DefaultExtender) Disallowed ¶
func (de *DefaultExtender) Disallowed(ctx *URLContext)
Disallowed is a no-op.
func (*DefaultExtender) Enqueued ¶
func (de *DefaultExtender) Enqueued(ctx *URLContext)
Enqueued is a no-op.
func (*DefaultExtender) Error ¶
func (de *DefaultExtender) Error(err *CrawlError)
Error is a no-op (logging is done automatically, regardless of the implementation of the Error hook).
func (*DefaultExtender) Fetch ¶
func (de *DefaultExtender) Fetch(ctx *URLContext, userAgent string, headRequest bool) (*http.Response, error)
Fetch requests the specified URL using the given user agent string. It uses a custom http Client instance that doesn't follow redirections. Instead, the redirected-to URL is enqueued so that it goes through the same Filter and Fetch process as any other URL.
Two options were considered for the default Fetch implementation:
1- Not following any redirections, and enqueuing the redirect-to URL, failing the current call with the 3xx status code.
2- Following all redirections, enqueuing only the last one (where redirection stops), and returning the response of the next-to-last request.
Ultimately, 1) was implemented, as it is the most generic solution that makes sense as the default for the library. It involves no "magic" and gives full control over what can happen, with the disadvantage that Filter must be aware of all possible intermediary URLs before reaching the final destination of a redirection (i.e. if A redirects to B, which redirects to C, Filter has to allow A, B, and C to be fetched, while solution 2 would only have required Filter to allow A and C).
Solution 2) also has the disadvantage of fetching twice the final URL (once while processing the original URL, so that it knows that there is no more redirection HTTP code, and another time when the actual destination URL is fetched to be visited).
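Because intermediary redirect URLs must pass Filter under the implemented behaviour, a custom Filter may need to accept them explicitly. A minimal sketch (the extender type and host are hypothetical; the gocrawl import is assumed):

    // RedirectFriendlyExtender is a hypothetical extender whose Filter accepts
    // any non-visited URL on the crawled host, so intermediary redirect URLs
    // (A -> B -> C) are not rejected.
    type RedirectFriendlyExtender struct {
        gocrawl.DefaultExtender
    }

    func (x *RedirectFriendlyExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
        return !isVisited && ctx.NormalizedURL().Host == "example.com" // placeholder host
    }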
func (*DefaultExtender) FetchedRobots ¶
func (de *DefaultExtender) FetchedRobots(ctx *URLContext, res *http.Response)
FetchedRobots is a no-op.
func (*DefaultExtender) Filter ¶
func (de *DefaultExtender) Filter(ctx *URLContext, isVisited bool) bool
Filter enqueues the URL if it hasn't been visited yet.
func (*DefaultExtender) Log ¶
func (de *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)
Log prints to the standard error by default, based on the requested log verbosity.
func (*DefaultExtender) RequestGet ¶
func (de *DefaultExtender) RequestGet(ctx *URLContext, headRes *http.Response) bool
RequestGet asks the worker to actually request the URL's body (issue a GET), unless the status code is not 2xx.
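As a hedged sketch (the type name is hypothetical; the net/http, strings and gocrawl imports are assumed), RequestGet can be overridden to issue the GET only for 2xx HEAD responses that declare an HTML content type, which is only useful when Options.HeadBeforeGet is enabled:

    // HTMLOnlyExtender is a hypothetical extender that downloads a body only
    // when the HEAD response declares an HTML content type and a 2xx status.
    type HTMLOnlyExtender struct {
        gocrawl.DefaultExtender
    }

    func (x *HTMLOnlyExtender) RequestGet(ctx *gocrawl.URLContext, headRes *http.Response) bool {
        ct := headRes.Header.Get("Content-Type")
        return headRes.StatusCode >= 200 && headRes.StatusCode < 300 &&
            strings.Contains(ct, "text/html")
    }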
func (*DefaultExtender) RequestRobots ¶
func (de *DefaultExtender) RequestRobots(ctx *URLContext, robotAgent string) (data []byte, doRequest bool)
RequestRobots asks the worker to actually request (fetch) the robots.txt.
func (*DefaultExtender) Start ¶
func (de *DefaultExtender) Start(seeds interface{}) interface{}
Start returns the same seeds as those received (those that were passed to Run initially).
func (*DefaultExtender) Visit ¶
func (de *DefaultExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (harvested interface{}, findLinks bool)
Visit asks the worker to harvest the links in this page.
func (*DefaultExtender) Visited ¶
func (de *DefaultExtender) Visited(ctx *URLContext, harvested interface{})
Visited is a no-op.
type DelayInfo ¶
DelayInfo contains the delay configuration: the Options delay, the Robots.txt delay, and the last delay used.
type Extender ¶
type Extender interface {
    // Start, End, Error and Log are not related to a specific URL, so they don't
    // receive a URLContext struct.
    Start(interface{}) interface{}
    End(error)
    Error(*CrawlError)
    Log(LogFlags, LogFlags, string)

    // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
    // is related to a URLContext (holds a ctx field).
    ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

    // All other extender methods are executed in the context of an URL, and thus
    // receive an URLContext struct as first argument.
    Fetch(*URLContext, string, bool) (*http.Response, error)
    RequestGet(*URLContext, *http.Response) bool
    RequestRobots(*URLContext, string) ([]byte, bool)
    FetchedRobots(*URLContext, *http.Response)
    Filter(*URLContext, bool) bool
    Enqueued(*URLContext)
    Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
    Visited(*URLContext, interface{})
    Disallowed(*URLContext)
}
Extender defines the extension methods required by the crawler.
type FetchInfo ¶
type FetchInfo struct {
    Ctx           *URLContext
    Duration      time.Duration
    StatusCode    int
    IsHeadRequest bool
}
FetchInfo contains the fetch information: the duration of the fetch, the returned status code, whether or not it was a HEAD request, and whether or not it was a robots.txt request.
type LogFlags ¶
type LogFlags uint
LogFlags is a set of flags that control the logging of the Crawler.
const (
    LogError LogFlags = 1 << iota
    LogInfo
    LogEnqueued
    LogIgnored
    LogTrace
    LogNone LogFlags = 0
    LogAll  LogFlags = LogError | LogInfo | LogEnqueued | LogIgnored | LogTrace
)
Log levels for the library's logger
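For example, an arbitrary combination of flags can be OR'ed together when configuring Options (opts is assumed to be a *Options):

    opts.LogFlags = gocrawl.LogError | gocrawl.LogIgnored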
type Options ¶
type Options struct {
    // UserAgent is the user-agent value used to make requests to the host.
    UserAgent string

    // RobotUserAgent is the user-agent value of the robot, used to find
    // a matching policy in the robots.txt file of a host. It is not used
    // to make the robots.txt request, only to match a policy.
    // It should always be set to the name of your crawler application so
    // that site owners can configure the robots.txt accordingly.
    RobotUserAgent string

    // MaxVisits is the maximum number of pages visited before
    // automatically stopping the crawler.
    MaxVisits int

    // EnqueueChanBuffer is the size of the buffer for the enqueue channel.
    EnqueueChanBuffer int

    // HostBufferFactor controls the size of the map and channel used
    // internally to manage hosts. If there are 5 different hosts in
    // the initial seeds, and HostBufferFactor is 10, it will create
    // a buffered channel of 5 * 10 (50) (and a map of hosts with that
    // initial capacity, though the map will grow as needed).
    HostBufferFactor int

    // CrawlDelay is the default time to wait between requests to a given
    // host. If a specific delay is specified in the relevant robots.txt,
    // then this delay is used instead. Crawl delay can be customized
    // further by implementing the ComputeDelay extender function.
    CrawlDelay time.Duration

    // WorkerIdleTTL is the idle time-to-live allowed for a worker
    // before it is cleared (its goroutine terminated). The crawl
    // delay is not part of idle time, this is specifically the time
    // when the worker is available, but there are no URLs to process.
    WorkerIdleTTL time.Duration

    // SameHostOnly limits the URLs to enqueue only to those targeting
    // the same hosts as the ones from the seed URLs.
    SameHostOnly bool

    // HeadBeforeGet asks the crawler to make a HEAD request before
    // making an eventual GET request. If set to true, the extender
    // method RequestGet is called after the HEAD to control if the
    // GET should be issued.
    HeadBeforeGet bool

    // URLNormalizationFlags controls the normalization of URLs.
    // See the purell package for details.
    URLNormalizationFlags purell.NormalizationFlags

    // LogFlags controls the verbosity of the logger.
    LogFlags LogFlags

    // Extender is the implementation of hooks to use by the crawler.
    Extender Extender
}
Options contains the configuration for a Crawler to customize the crawling process.
func NewOptions ¶
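func NewOptions(ext Extender) *Options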
NewOptions creates a new set of Options with default values using the provided Extender. The RobotUserAgent option should be set to the name of your crawler; it is used to find the matching entry in the robots.txt file.
type S ¶ added in v0.4.0
type S map[string]interface{}
S is a convenience type definition; it is a map[string]interface{} that can be used to enqueue URLs (the string keys) with state information.
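A minimal sketch of passing S as the seeds to Run (the URLs and state values are placeholders, and c is assumed to be an already configured Crawler); each state value is then available to the extender through URLContext.State:

    seeds := gocrawl.S{
        "http://example.com/":      "some state",
        "http://example.com/page2": 42,
    }
    c.Run(seeds)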
type U ¶ added in v0.4.0
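type U map[*url.URL]interface{}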
U is a convenience type definition; it is a map[*url.URL]interface{} that can be used to enqueue URLs with state information.
type URLContext ¶ added in v0.4.0
type URLContext struct {
    HeadBeforeGet bool
    State         interface{}
    // contains filtered or unexported fields
}
URLContext contains all information related to an URL to process.
func (*URLContext) IsRobotsURL ¶ added in v0.4.0
func (uc *URLContext) IsRobotsURL() bool
IsRobotsURL indicates if the URL is a robots.txt URL.
func (*URLContext) NormalizedSourceURL ¶ added in v0.4.0
func (uc *URLContext) NormalizedSourceURL() *url.URL
NormalizedSourceURL returns the normalized form of the source URL, if any (using Options.URLNormalizationFlags).
func (*URLContext) NormalizedURL ¶ added in v0.4.0
func (uc *URLContext) NormalizedURL() *url.URL
NormalizedURL returns the normalized URL (using Options.URLNormalizationFlags) of the URL.
func (*URLContext) SourceURL ¶ added in v0.4.0
func (uc *URLContext) SourceURL() *url.URL
SourceURL returns the source URL, if any (the URL that enqueued this URL).