Documentation ¶
Overview ¶
gocrawl is a polite, slim and concurrent web crawler written in Go.
Index ¶
- Constants
- Variables
- type CrawlError
- type CrawlErrorKind
- type Crawler
- type CrawlerCommand
- type DefaultExtender
- func (this *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration
- func (this *DefaultExtender) Disallowed(u *url.URL)
- func (this *DefaultExtender) End(reason EndReason)
- func (this *DefaultExtender) Enqueued(u *url.URL, from *url.URL)
- func (this *DefaultExtender) Error(err *CrawlError)
- func (this *DefaultExtender) Fetch(u *url.URL, userAgent string, headRequest bool) (*http.Response, error)
- func (this *DefaultExtender) FetchedRobots(res *http.Response)
- func (this *DefaultExtender) Filter(u *url.URL, from *url.URL, isVisited bool, origin EnqueueOrigin) (enqueue bool, priority int, headRequest HeadRequestMode)
- func (this *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)
- func (this *DefaultExtender) RequestGet(headRes *http.Response) bool
- func (this *DefaultExtender) RequestRobots(u *url.URL, robotAgent string) (request bool, data []byte)
- func (this *DefaultExtender) Start(seeds []string) []string
- func (this *DefaultExtender) Visit(res *http.Response, doc *goquery.Document) (harvested []*url.URL, findLinks bool)
- func (this *DefaultExtender) Visited(u *url.URL, harvested []*url.URL)
- type DelayInfo
- type EndReason
- type EnqueueOrigin
- type EnqueueRedirectError
- type Extender
- type FetchInfo
- type HeadRequestMode
- type LogFlags
- type Options
Constants ¶
const (
	DefaultUserAgent          string                    = `Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2`
	DefaultRobotUserAgent     string                    = `Googlebot (gocrawl v0.3)`
	DefaultEnqueueChanBuffer  int                       = 100
	DefaultHostBufferFactor   int                       = 10
	DefaultCrawlDelay         time.Duration             = 5 * time.Second
	DefaultIdleTTL            time.Duration             = 10 * time.Second
	DefaultNormalizationFlags purell.NormalizationFlags = purell.FlagsAllGreedy
)
Default options
Variables ¶
var HttpClient = &http.Client{CheckRedirect: func(req *http.Request, via []*http.Request) error {
	if isRobotsTxtUrl(req.URL) {
		if len(via) >= 10 {
			return errors.New("stopped after 10 redirects")
		}
		return nil
	}
	return &EnqueueRedirectError{"redirection not followed"}
}}
The default HTTP client used by DefaultExtender's fetch requests (it is safe for concurrent use). The client's fields can be customized (e.g. for a different redirection strategy, a different Transport object, etc.), but any customization should be done prior to starting the crawler.
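For instance, a different Transport can be installed on the shared client before the crawler's Run method is called. A minimal sketch, assuming the github.com/PuerkitoBio/gocrawl import path; the Transport settings are illustrative choices, not library defaults:

package example

import (
	"net/http"

	"github.com/PuerkitoBio/gocrawl"
)

// configureClient replaces the Transport of the shared client before the
// crawler is started. The Transport values are illustrative assumptions,
// not gocrawl defaults.
func configureClient() {
	gocrawl.HttpClient.Transport = &http.Transport{
		DisableKeepAlives:   true,
		MaxIdleConnsPerHost: 2,
	}
}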
Functions ¶
This section is empty.
Types ¶
type CrawlError ¶
type CrawlError struct {
	Err  error
	Kind CrawlErrorKind
	URL  *url.URL
	// contains filtered or unexported fields
}
Crawl error information.
func (CrawlError) Error ¶
func (this CrawlError) Error() string
Implementation of the error interface.
type CrawlErrorKind ¶
type CrawlErrorKind uint8
Flag indicating the source of the crawl error.
const (
	CekFetch CrawlErrorKind = iota
	CekParseRobots
	CekHttpStatusCode
	CekReadBody
	CekParseBody
	CekParseSeed
	CekParseNormalizedSeed
	CekProcessLinks
	CekParseRedirectUrl
)
type Crawler ¶
type Crawler struct {
	Options *Options
	// contains filtered or unexported fields
}
The crawler itself, the master of the whole process
func NewCrawler ¶
func NewCrawler(ext Extender) *Crawler
Crawler constructor with the specified extender object.
func NewCrawlerWithOptions ¶
func NewCrawlerWithOptions(opts *Options) *Crawler
Crawler constructor with a pre-initialized Options object.
type CrawlerCommand ¶
type CrawlerCommand struct {
	URL    *url.URL
	Origin EnqueueOrigin
}
Communication from the extender to the crawler about a URL to enqueue.
type DefaultExtender ¶
type DefaultExtender struct {
EnqueueChan chan<- *CrawlerCommand
}
Default working implementation of an extender.
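A custom extender typically embeds DefaultExtender and overrides only the hooks it cares about; every other method falls back to the default implementations documented below. A minimal sketch, assuming the PuerkitoBio import paths; the ExampleExtender name and the title lookup are illustrative:

package example

import (
	"fmt"
	"net/http"
	"net/url"

	"github.com/PuerkitoBio/gocrawl"
	"github.com/PuerkitoBio/goquery"
)

// ExampleExtender is a hypothetical extender: by embedding DefaultExtender,
// every Extender method it does not override uses the default implementation.
type ExampleExtender struct {
	gocrawl.DefaultExtender
}

// Visit overrides only the Visit hook: it inspects the parsed document,
// returns no harvested URLs, and lets gocrawl find the links (findLinks = true).
func (x *ExampleExtender) Visit(res *http.Response, doc *goquery.Document) ([]*url.URL, bool) {
	fmt.Println(res.Request.URL, "->", doc.Find("title").Text())
	return nil, true
}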
func (*DefaultExtender) ComputeDelay ¶
func (this *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration
ComputeDelay returns the delay specified in the Crawler's Options, unless a crawl-delay is specified in the robots.txt file, which has precedence.
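To apply a different policy, ComputeDelay can be overridden on an embedding type. The sketch below delegates to the default behaviour and then enforces a minimum delay; the PoliteExtender name and the two-second floor are assumptions for illustration:

package example

import (
	"time"

	"github.com/PuerkitoBio/gocrawl"
)

// PoliteExtender is a hypothetical extender that never crawls a host faster
// than once every two seconds.
type PoliteExtender struct {
	gocrawl.DefaultExtender
}

func (x *PoliteExtender) ComputeDelay(host string, di *gocrawl.DelayInfo, lastFetch *gocrawl.FetchInfo) time.Duration {
	// Start from the default policy (Options delay, overridden by the
	// robots.txt crawl-delay)...
	d := x.DefaultExtender.ComputeDelay(host, di, lastFetch)
	// ...then enforce a floor (the 2-second value is an illustrative choice).
	if d < 2*time.Second {
		return 2 * time.Second
	}
	return d
}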
func (*DefaultExtender) Disallowed ¶
func (this *DefaultExtender) Disallowed(u *url.URL)
Disallowed is a no-op.
func (*DefaultExtender) Enqueued ¶
func (this *DefaultExtender) Enqueued(u *url.URL, from *url.URL)
Enqueued is a no-op.
func (*DefaultExtender) Error ¶
func (this *DefaultExtender) Error(err *CrawlError)
Error is a no-op (logging is done automatically, regardless of the implementation of the Error() hook).
func (*DefaultExtender) Fetch ¶
func (this *DefaultExtender) Fetch(u *url.URL, userAgent string, headRequest bool) (*http.Response, error)
Fetch requests the specified URL using the given user agent string. It uses a custom http Client instance that doesn't follow redirections. Instead, the redirected-to URL is enqueued so that it goes through the same Filter() and Fetch() process as any other URL.
Two options were considered for the default Fetch() implementation:

1. Not following any redirections, enqueuing the redirect-to URL, and failing the current call with the 3xx status code.

2. Following all redirections, enqueuing only the last one (where redirection stops), and returning the response of the next-to-last request.
Ultimately, option 1 was implemented, as it is the most generic solution and the one that makes sense as a default for the library. It involves no "magic" and gives full control over what can happen, with the disadvantage that Filter() must be aware of all possible intermediary URLs before the final destination of a redirection is reached (i.e. if A redirects to B, which redirects to C, Filter has to allow A, B and C to be fetched, while option 2 would only have required Filter to allow A and C).

Option 2 also has the disadvantage of fetching the final URL twice: once while processing the original URL, to learn that there is no further redirection status code, and a second time when the actual destination URL is fetched to be visited.
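A practical consequence is that a restrictive Filter should still accept URLs enqueued with the EoRedirect origin, otherwise redirect chains are cut short. A sketch under that assumption; the RedirectAwareExtender name and the host restriction are illustrative:

package example

import (
	"net/url"

	"github.com/PuerkitoBio/gocrawl"
)

type RedirectAwareExtender struct {
	gocrawl.DefaultExtender
}

func (x *RedirectAwareExtender) Filter(u *url.URL, from *url.URL, isVisited bool, origin gocrawl.EnqueueOrigin) (bool, int, gocrawl.HeadRequestMode) {
	if isVisited {
		return false, 0, gocrawl.HrmDefault
	}
	// Always let redirect targets through, so a chain A -> B -> C can be followed.
	if origin == gocrawl.EoRedirect {
		return true, 0, gocrawl.HrmDefault
	}
	// Otherwise restrict to a single host (illustrative policy).
	return u.Host == "example.com", 0, gocrawl.HrmDefault
}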
func (*DefaultExtender) FetchedRobots ¶
func (this *DefaultExtender) FetchedRobots(res *http.Response)
FetchedRobots is a no-op.
func (*DefaultExtender) Filter ¶
func (this *DefaultExtender) Filter(u *url.URL, from *url.URL, isVisited bool, origin EnqueueOrigin) (enqueue bool, priority int, headRequest HeadRequestMode)
Enqueue the URL if it hasn't been visited yet.
func (*DefaultExtender) Log ¶
func (this *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)
Log prints to the standard error by default, based on the requested log verbosity.
func (*DefaultExtender) RequestGet ¶
func (this *DefaultExtender) RequestGet(headRes *http.Response) bool
RequestGet asks the worker to request the URL's body (issue a GET), unless the HEAD response's status code is outside the 2xx range.
func (*DefaultExtender) RequestRobots ¶
func (this *DefaultExtender) RequestRobots(u *url.URL, robotAgent string) (request bool, data []byte)
Ask the worker to actually request (fetch) the Robots.txt document.
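An override can also skip the network fetch when a cached copy of the file is already available. The sketch below assumes (it is not stated on this page) that returning request == false together with non-nil data makes the worker use the supplied bytes as the robots.txt content:

package example

import (
	"net/url"

	"github.com/PuerkitoBio/gocrawl"
)

// CachedRobotsExtender is hypothetical: it keeps robots.txt bodies per host.
type CachedRobotsExtender struct {
	gocrawl.DefaultExtender
	cache map[string][]byte
}

func (x *CachedRobotsExtender) RequestRobots(u *url.URL, robotAgent string) (bool, []byte) {
	if data, ok := x.cache[u.Host]; ok {
		// Assumed semantics: false + data means "don't fetch, use these bytes".
		return false, data
	}
	// No cached copy: ask the worker to fetch robots.txt as usual.
	return true, nil
}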
func (*DefaultExtender) Start ¶
func (this *DefaultExtender) Start(seeds []string) []string
Return the same seeds as those received (those that were passed to Run() initially).
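Start can be overridden to adjust the seed list before the crawl begins; a small sketch, where the extra seed URL is a hypothetical example:

package example

import "github.com/PuerkitoBio/gocrawl"

type SeedingExtender struct {
	gocrawl.DefaultExtender
}

func (x *SeedingExtender) Start(seeds []string) []string {
	// Return the seeds passed to Run, plus one extra, illustrative entry.
	return append(seeds, "http://example.com/extra-seed")
}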
type DelayInfo ¶
Delay information: the Options delay, the Robots.txt delay, and the last delay used.
type EnqueueOrigin ¶
type EnqueueOrigin int
EnqueueOrigin indicates to the crawler and the Filter extender function the origin of this URL.
const (
	EoSeed        EnqueueOrigin = iota // Seed URLs have this source
	EoHarvest                          // URLs harvested from a visit to a page have this source
	EoRedirect                         // URLs enqueued from a fetch redirection have this source by default
	EoError                            // URLs enqueued after an error
	EoCustomStart                      // Custom EnqueueOrigins should start at this value instead of iota
)
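Custom origins pair naturally with DefaultExtender's EnqueueChan: a hypothetical EoSitemap origin, declared from EoCustomStart as recommended, can tag URLs pushed through a CrawlerCommand so that a Filter override can recognize where they came from. This sketch assumes EnqueueChan has already been set by the crawler:

package example

import (
	"net/url"

	"github.com/PuerkitoBio/gocrawl"
)

// Hypothetical custom origins, started at EoCustomStart as recommended above.
const (
	EoSitemap gocrawl.EnqueueOrigin = gocrawl.EoCustomStart + iota
	EoAPIFeed
)

type SitemapExtender struct {
	gocrawl.DefaultExtender
}

// enqueueFromSitemap pushes a URL to the crawler, tagged with the custom origin.
func (x *SitemapExtender) enqueueFromSitemap(rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	x.EnqueueChan <- &gocrawl.CrawlerCommand{URL: u, Origin: EoSitemap}
	return nil
}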
type EnqueueRedirectError ¶
type EnqueueRedirectError struct {
// contains filtered or unexported fields
}
The error type returned when a redirection is requested, so that the worker knows that this is not an actual Fetch error, but a request to enqueue the redirect-to URL.
func (*EnqueueRedirectError) Error ¶
func (this *EnqueueRedirectError) Error() string
Implement the error interface
type Extender ¶
type Extender interface {
	Start([]string) []string
	End(EndReason)
	Error(*CrawlError)
	Log(LogFlags, LogFlags, string)
	ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration
	Fetch(*url.URL, string, bool) (*http.Response, error)
	RequestGet(*http.Response) bool
	RequestRobots(*url.URL, string) (bool, []byte)
	FetchedRobots(*http.Response)
	Filter(*url.URL, *url.URL, bool, EnqueueOrigin) (bool, int, HeadRequestMode)
	Enqueued(*url.URL, *url.URL)
	Visit(*http.Response, *goquery.Document) ([]*url.URL, bool)
	Visited(*url.URL, []*url.URL)
	Disallowed(*url.URL)
}
Extension methods required to provide an extender instance.
type FetchInfo ¶
Fetch information: the duration of the fetch, the returned status code, whether or not it was a HEAD request, and whether or not it was a robots.txt request.
type HeadRequestMode ¶
type HeadRequestMode uint8
Flag indicating the head request override mode
const (
	HrmDefault HeadRequestMode = iota
	HrmRequest
	HrmIgnore
)
type LogFlags ¶
type LogFlags uint
const (
	LogError LogFlags = 1 << iota
	LogInfo
	LogEnqueued
	LogIgnored
	LogTrace
	LogNone LogFlags = 0
	LogAll  LogFlags = LogError | LogInfo | LogEnqueued | LogIgnored | LogTrace
)
Log levels for the library's logger
type Options ¶
type Options struct {
	UserAgent             string
	RobotUserAgent        string
	MaxVisits             int
	EnqueueChanBuffer     int
	HostBufferFactor      int
	CrawlDelay            time.Duration // Applied per host
	WorkerIdleTTL         time.Duration
	SameHostOnly          bool
	HeadBeforeGet         bool
	URLNormalizationFlags purell.NormalizationFlags
	LogFlags              LogFlags
	Extender              Extender
}
The Options available to control and customize the crawling process.
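Putting it together, a minimal crawl could look like the sketch below. The extender, the seed URL, the RobotUserAgent string, and the exact signature of the crawler's Run method (assumed here to accept seed URLs as plain strings) are assumptions for illustration:

package main

import (
	"net/http"
	"net/url"
	"time"

	"github.com/PuerkitoBio/gocrawl"
	"github.com/PuerkitoBio/goquery"
)

type MyExtender struct {
	gocrawl.DefaultExtender
}

func (x *MyExtender) Visit(res *http.Response, doc *goquery.Document) ([]*url.URL, bool) {
	// Harvest nothing explicitly; let gocrawl find the links on each page.
	return nil, true
}

func main() {
	opts := &gocrawl.Options{
		UserAgent:             gocrawl.DefaultUserAgent,
		RobotUserAgent:        "Example crawler (contact: example@example.com)",
		MaxVisits:             10,
		EnqueueChanBuffer:     gocrawl.DefaultEnqueueChanBuffer,
		HostBufferFactor:      gocrawl.DefaultHostBufferFactor,
		CrawlDelay:            1 * time.Second,
		WorkerIdleTTL:         gocrawl.DefaultIdleTTL,
		SameHostOnly:          true,
		URLNormalizationFlags: gocrawl.DefaultNormalizationFlags,
		LogFlags:              gocrawl.LogError | gocrawl.LogInfo,
		Extender:              new(MyExtender),
	}
	c := gocrawl.NewCrawlerWithOptions(opts)
	// Assumption: in this version Run accepts seed URLs as plain strings.
	c.Run("http://example.com")
}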