Documentation ¶
Overview ¶
gocrawl is a polite, slim and concurrent web crawler written in Go.
Index ¶
- Constants
- Variables
- type CrawlError
- type CrawlErrorKind
- type Crawler
- type DefaultExtender
- func (this *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration
- func (this *DefaultExtender) Disallowed(ctx *URLContext)
- func (this *DefaultExtender) End(err error)
- func (this *DefaultExtender) Enqueued(ctx *URLContext)
- func (this *DefaultExtender) Error(err *CrawlError)
- func (this *DefaultExtender) Fetch(ctx *URLContext, userAgent string, headRequest bool) (*http.Response, error)
- func (this *DefaultExtender) FetchedRobots(ctx *URLContext, res *http.Response)
- func (this *DefaultExtender) Filter(ctx *URLContext, isVisited bool) bool
- func (this *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)
- func (this *DefaultExtender) RequestGet(ctx *URLContext, headRes *http.Response) bool
- func (this *DefaultExtender) RequestRobots(ctx *URLContext, robotAgent string) (data []byte, doRequest bool)
- func (this *DefaultExtender) Start(seeds interface{}) interface{}
- func (this *DefaultExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (harvested interface{}, findLinks bool)
- func (this *DefaultExtender) Visited(ctx *URLContext, harvested interface{})
- type DelayInfo
- type Extender
- type FetchInfo
- type LogFlags
- type Options
- type S
- type U
- type URLContext
Constants ¶
const (
    DefaultUserAgent          string                    = `Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2`
    DefaultRobotUserAgent     string                    = `Googlebot (gocrawl v0.4)`
    DefaultEnqueueChanBuffer  int                       = 100
    DefaultHostBufferFactor   int                       = 10
    DefaultCrawlDelay         time.Duration             = 5 * time.Second
    DefaultIdleTTL            time.Duration             = 10 * time.Second
    DefaultNormalizationFlags purell.NormalizationFlags = purell.FlagsAllGreedy
)
Default options
Variables ¶
var (
    // The error returned when a redirection is requested, so that the
    // worker knows that this is not an actual Fetch error, but a request to
    // enqueue the redirect-to URL.
    ErrEnqueueRedirect = errors.New("redirection not followed")

    // The error returned when the maximum number of visits, as specified by the
    // Options field MaxVisits, is reached.
    ErrMaxVisits = errors.New("the maximum number of visits is reached")

    ErrInterrupted = errors.New("interrupted")
)
var HttpClient = &http.Client{CheckRedirect: func(req *http.Request, via []*http.Request) error {
    if isRobotsURL(req.URL) {
        if len(via) >= 10 {
            return errors.New("stopped after 10 redirects")
        }
        return nil
    }
    return ErrEnqueueRedirect
}}
The default HTTP client used by DefaultExtender's fetch requests (this is thread-safe). The client's fields can be customized (e.g. for a different redirection strategy, a different Transport object, ...). Any such customization should be done prior to starting the crawler.
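For illustration, a minimal sketch of such a customization (the Timeout value and Transport settings are arbitrary; the CheckRedirect function installed by the package is left untouched so the redirect-enqueuing behaviour described below keeps working):

package main

import (
    "net/http"
    "time"

    "github.com/PuerkitoBio/gocrawl"
)

func init() {
    // Customize the shared client before any crawler is started.
    gocrawl.HttpClient.Timeout = 30 * time.Second
    gocrawl.HttpClient.Transport = &http.Transport{
        MaxIdleConnsPerHost: 4,
    }
}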
Functions ¶
This section is empty.
Types ¶
type CrawlError ¶
type CrawlError struct {
    Ctx  *URLContext
    Err  error
    Kind CrawlErrorKind
    // contains filtered or unexported fields
}
Crawl error information.
func (CrawlError) Error ¶
func (this CrawlError) Error() string
Implementation of the error interface.
type CrawlErrorKind ¶
type CrawlErrorKind uint8
Enum indicating the kind of crawling error.
const (
    CekFetch CrawlErrorKind = iota
    CekParseRobots
    CekHttpStatusCode
    CekReadBody
    CekParseBody
    CekParseURL
    CekProcessLinks
    CekParseRedirectURL
)
func (CrawlErrorKind) String ¶
func (this CrawlErrorKind) String() string
type Crawler ¶
type Crawler struct {
    Options *Options
    // contains filtered or unexported fields
}
The crawler itself, the master of the whole process.
func NewCrawler ¶
Crawler constructor with the specified extender object.
func NewCrawlerWithOptions ¶
Crawler constructor with a pre-initialized Options object.
func (*Crawler) Run ¶
Run starts the crawling process, based on the given seeds and the current Options settings. Execution stops either when MaxVisits is reached (if specified) or when no more URLs need visiting. If an error occurs, it is returned (if MaxVisits is reached, the error ErrMaxVisits is returned).
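As a minimal sketch (the seed URL and MaxVisits value are placeholders), a crawler can be built with the default extender and run like this:

package main

import (
    "log"

    "github.com/PuerkitoBio/gocrawl"
)

func main() {
    // Crawl with the default extender; stop after a handful of visits.
    c := gocrawl.NewCrawler(new(gocrawl.DefaultExtender))
    c.Options.MaxVisits = 10

    // Run blocks until MaxVisits is reached or no URLs are left to visit.
    if err := c.Run("https://example.com/"); err != nil && err != gocrawl.ErrMaxVisits {
        log.Fatal(err)
    }
}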
type DefaultExtender ¶
type DefaultExtender struct {
    EnqueueChan chan<- interface{}
}
Default working implementation of an extender.
func (*DefaultExtender) ComputeDelay ¶
func (this *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration
ComputeDelay returns the delay specified in the Crawler's Options, unless a crawl-delay is specified in the robots.txt file, which has precedence.
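A hypothetical override that keeps this behaviour but adds random jitter on top of whatever delay the embedded DefaultExtender computes (the 500ms bound is arbitrary):

package extenders

import (
    "math/rand"
    "time"

    "github.com/PuerkitoBio/gocrawl"
)

// jitterExtender is a hypothetical extender that reuses the default delay
// computation and adds up to 500ms of random jitter per request.
type jitterExtender struct {
    gocrawl.DefaultExtender
}

func (x *jitterExtender) ComputeDelay(host string, di *gocrawl.DelayInfo, last *gocrawl.FetchInfo) time.Duration {
    base := x.DefaultExtender.ComputeDelay(host, di, last)
    return base + time.Duration(rand.Int63n(int64(500*time.Millisecond)))
}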
func (*DefaultExtender) Disallowed ¶
func (this *DefaultExtender) Disallowed(ctx *URLContext)
Disallowed is a no-op.
func (*DefaultExtender) Enqueued ¶
func (this *DefaultExtender) Enqueued(ctx *URLContext)
Enqueued is a no-op.
func (*DefaultExtender) Error ¶
func (this *DefaultExtender) Error(err *CrawlError)
Error is a no-op (logging is done automatically, regardless of the implementation of the Error() hook).
func (*DefaultExtender) Fetch ¶
func (this *DefaultExtender) Fetch(ctx *URLContext, userAgent string, headRequest bool) (*http.Response, error)
Fetch requests the specified URL using the given user agent string. It uses a custom http Client instance that doesn't follow redirections. Instead, the redirected-to URL is enqueued so that it goes through the same Filter() and Fetch() process as any other URL.
Two options were considered for the default Fetch() implementation:

1. Not following any redirections, enqueuing the redirect-to URL, and failing the current call with the 3xx status code.
2. Following all redirections, enqueuing only the last one (where redirection stops), and returning the response of the next-to-last request.
Ultimately, option 1 was implemented, as it is the most generic solution that makes sense as a default for the library. It involves no "magic" and gives full control over what happens, with the disadvantage that Filter() must be aware of all possible intermediary URLs before the final destination of a redirection is reached (i.e. if A redirects to B, which redirects to C, Filter has to allow A, B and C to be fetched, whereas solution 2 would only have required Filter to allow A and C).
Solution 2 also has the disadvantage of fetching the final URL twice (once while processing the original URL, to learn that there is no further redirection status code, and again when the actual destination URL is fetched to be visited).
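For example, under option 1 a host allow-list in Filter has to cover every hop of a redirect chain, not only the final destination. A hypothetical sketch (host names are placeholders):

package extenders

import (
    "github.com/PuerkitoBio/gocrawl"
)

// redirectAwareExtender is a hypothetical extender whose Filter accepts every
// not-yet-visited URL on an allow-listed host. Because redirects are enqueued
// rather than followed, intermediary hops (B in A -> B -> C) must also pass
// this check or the chain is cut short.
type redirectAwareExtender struct {
    gocrawl.DefaultExtender
}

var allowedHosts = map[string]bool{
    "example.com":     true, // placeholder hosts
    "www.example.com": true,
}

func (x *redirectAwareExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
    return !isVisited && allowedHosts[ctx.URL().Host]
}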
func (*DefaultExtender) FetchedRobots ¶
func (this *DefaultExtender) FetchedRobots(ctx *URLContext, res *http.Response)
FetchedRobots is a no-op.
func (*DefaultExtender) Filter ¶
func (this *DefaultExtender) Filter(ctx *URLContext, isVisited bool) bool
Enqueue the URL if it hasn't been visited yet.
func (*DefaultExtender) Log ¶
func (this *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)
Log prints to the standard error by default, based on the requested log verbosity.
func (*DefaultExtender) RequestGet ¶
func (this *DefaultExtender) RequestGet(ctx *URLContext, headRes *http.Response) bool
Ask the worker to actually request the URL's body (issue a GET), unless the HEAD response's status code is outside the 2xx range.
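When Options.HeadBeforeGet is enabled, this hook can be overridden to inspect the HEAD response before committing to a GET. A hypothetical sketch that only downloads HTML bodies:

package extenders

import (
    "net/http"
    "strings"

    "github.com/PuerkitoBio/gocrawl"
)

// htmlOnlyExtender is a hypothetical extender meant to be used with
// Options.HeadBeforeGet = true: the GET is only issued for 2xx HTML responses.
type htmlOnlyExtender struct {
    gocrawl.DefaultExtender
}

func (x *htmlOnlyExtender) RequestGet(ctx *gocrawl.URLContext, headRes *http.Response) bool {
    ct := headRes.Header.Get("Content-Type")
    return headRes.StatusCode >= 200 && headRes.StatusCode < 300 &&
        strings.HasPrefix(ct, "text/html")
}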
func (*DefaultExtender) RequestRobots ¶
func (this *DefaultExtender) RequestRobots(ctx *URLContext, robotAgent string) (data []byte, doRequest bool)
Ask the worker to actually request (fetch) the Robots.txt document.
func (*DefaultExtender) Start ¶
func (this *DefaultExtender) Start(seeds interface{}) interface{}
Return the same seeds as those received (those that were passed to Run() initially).
func (*DefaultExtender) Visit ¶
func (this *DefaultExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (harvested interface{}, findLinks bool)
Ask the worker to harvest the links in this page.
func (*DefaultExtender) Visited ¶
func (this *DefaultExtender) Visited(ctx *URLContext, harvested interface{})
Visited is a no-op.
type DelayInfo ¶
Delay information: the Options delay, the Robots.txt delay, and the last delay used.
type Extender ¶
type Extender interface {
    // Start, End, Error and Log are not related to a specific URL, so they don't
    // receive a URLContext struct.
    Start(interface{}) interface{}
    End(error)
    Error(*CrawlError)
    Log(LogFlags, LogFlags, string)

    // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
    // is related to a URLContext (holds a ctx field).
    ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

    // All other extender methods are executed in the context of an URL, and thus
    // receive an URLContext struct as first argument.
    Fetch(*URLContext, string, bool) (*http.Response, error)
    RequestGet(*URLContext, *http.Response) bool
    RequestRobots(*URLContext, string) ([]byte, bool)
    FetchedRobots(*URLContext, *http.Response)
    Filter(*URLContext, bool) bool
    Enqueued(*URLContext)
    Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
    Visited(*URLContext, interface{})
    Disallowed(*URLContext)
}
Extension methods required to provide an extender instance.
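Implementations rarely need to provide all of these methods from scratch: embedding DefaultExtender satisfies the interface, and only the hooks of interest are overridden. A sketch (the title extraction is illustrative):

package extenders

import (
    "fmt"
    "net/http"

    "github.com/PuerkitoBio/gocrawl"
    "github.com/PuerkitoBio/goquery"
)

// titleExtender is a hypothetical extender: DefaultExtender provides every hook,
// and only Visit is overridden to print the <title> of each visited page.
type titleExtender struct {
    gocrawl.DefaultExtender
}

func (x *titleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    if doc != nil {
        fmt.Printf("%s -> %q\n", ctx.URL(), doc.Find("title").Text())
    }
    // Return true so gocrawl harvests the page's links itself.
    return nil, true
}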
type FetchInfo ¶
type FetchInfo struct {
    Ctx           *URLContext
    Duration      time.Duration
    StatusCode    int
    IsHeadRequest bool
}
Fetch information: the duration of the fetch, the returned status code, whether or not it was a HEAD request, and whether or not it was a robots.txt request.
type LogFlags ¶
type LogFlags uint
const (
    LogError LogFlags = 1 << iota
    LogInfo
    LogEnqueued
    LogIgnored
    LogTrace
    LogNone LogFlags = 0
    LogAll  LogFlags = LogError | LogInfo | LogEnqueued | LogIgnored | LogTrace
)
Log levels for the library's logger.
type Options ¶
type Options struct {
    UserAgent             string
    RobotUserAgent        string
    MaxVisits             int
    EnqueueChanBuffer     int
    HostBufferFactor      int
    CrawlDelay            time.Duration // Applied per host
    WorkerIdleTTL         time.Duration
    SameHostOnly          bool
    HeadBeforeGet         bool
    URLNormalizationFlags purell.NormalizationFlags
    LogFlags              LogFlags
    Extender              Extender
}
The Options available to control and customize the crawling process.
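A sketch of a fully configured run, assuming (as in the library's own examples) that NewOptions takes the Extender to use; every value below is arbitrary and the user-agent strings are placeholders:

package main

import (
    "time"

    "github.com/PuerkitoBio/gocrawl"
)

func main() {
    opts := gocrawl.NewOptions(new(gocrawl.DefaultExtender))
    opts.RobotUserAgent = "MyBot"                          // token matched against robots.txt rules
    opts.UserAgent = "Mozilla/5.0 (compatible; MyBot/1.0)" // sent with every request
    opts.CrawlDelay = 2 * time.Second
    opts.SameHostOnly = true
    opts.MaxVisits = 100
    opts.LogFlags = gocrawl.LogError | gocrawl.LogInfo

    c := gocrawl.NewCrawlerWithOptions(opts)
    _ = c.Run("https://example.com/")
}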
func NewOptions ¶
type URLContext ¶
type URLContext struct {
    HeadBeforeGet bool
    State         interface{}
    // contains filtered or unexported fields
}
func (*URLContext) IsRobotsURL ¶
func (this *URLContext) IsRobotsURL() bool
func (*URLContext) NormalizedSourceURL ¶
func (this *URLContext) NormalizedSourceURL() *url.URL
func (*URLContext) NormalizedURL ¶
func (this *URLContext) NormalizedURL() *url.URL
func (*URLContext) SourceURL ¶
func (this *URLContext) SourceURL() *url.URL
func (*URLContext) URL ¶
func (this *URLContext) URL() *url.URL
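A hypothetical extender that uses these accessors to log, for each enqueued URL, where it was found (seed URLs have no source URL, hence the nil check):

package extenders

import (
    "log"

    "github.com/PuerkitoBio/gocrawl"
)

// traceExtender is a hypothetical extender that logs every enqueued URL along
// with the page it was harvested from.
type traceExtender struct {
    gocrawl.DefaultExtender
}

func (x *traceExtender) Enqueued(ctx *gocrawl.URLContext) {
    src := "<seed>"
    if ctx.SourceURL() != nil {
        src = ctx.SourceURL().String()
    }
    log.Printf("enqueued %s (found on %s)", ctx.NormalizedURL(), src)
}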