Documentation ¶
Overview ¶
Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays.
It is very much a rewrite of gocrawl (https://github.com/PuerkitoBio/gocrawl) with a simpler API and fewer built-in features, but at the same time more flexibility. As for Go itself, sometimes less is more!
Installation ¶
To install, simply run in a terminal:
go get github.com/PuerkitoBio/fetchbot
The package has a single external dependency, robotstxt (https://github.com/temoto/robotstxt-go). It also integrates code from the iq package (https://github.com/kylelemons/iq).
The API documentation is available on godoc.org (http://godoc.org/github.com/PuerkitoBio/fetchbot).
Usage ¶
The following example (taken from /example/short/main.go) shows how to create and start a Fetcher, one way to send commands, and how to stop the fetcher once all commands have been handled.
	package main

	import (
		"fmt"
		"net/http"

		"github.com/PuerkitoBio/fetchbot"
	)

	func main() {
		f := fetchbot.New(fetchbot.HandlerFunc(handler))
		queue := f.Start()
		queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
		queue.Close()
	}

	func handler(ctx *fetchbot.Context, res *http.Response, err error) {
		if err != nil {
			fmt.Printf("error: %s\n", err)
			return
		}
		fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
	}
A more complex and complete example can be found in the repository, at /example/full/.
Fetcher ¶
A Fetcher is an instance of a web crawler, independent of other Fetchers. It receives Commands via the Queue, executes the requests, and calls a Handler to process the responses. A Command is an interface that tells the Fetcher which URL to fetch, and which HTTP method to use (e.g. "GET", "HEAD", ...).
A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the thread-safe object that can be used to send commands, or to stop the crawler.
Both the Command and the Handler are interfaces, and may be implemented in various ways. They are defined like so:
	type Command interface {
		URL() *url.URL
		Method() string
	}

	type Handler interface {
		Handle(*Context, *http.Response, error)
	}
A Context is a struct that holds the Command and the Queue, so that the Handler always knows which Command initiated this call, and has a handle to the Queue.
A Handler is similar to the net/http Handler, and middleware-style combinations can be built on top of it. A HandlerFunc type is provided so that simple functions with the right signature can be used as Handlers (like net/http.HandlerFunc), and there is also a multiplexer Mux that can be used to dispatch calls to different Handlers based on some criteria.
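The adapter and middleware ideas described above can be sketched with the standard library alone. This is a minimal illustration, not the package's source: Context is a reduced stand-in with a hypothetical URL field, and logErrors and demo are made-up names used only here.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Context is a reduced stand-in for fetchbot.Context.
// The URL field is hypothetical and exists only for this sketch.
type Context struct {
	URL string
}

// Handler mirrors the interface shown above.
type Handler interface {
	Handle(*Context, *http.Response, error)
}

// HandlerFunc adapts a plain function to the Handler interface,
// in the same spirit as net/http.HandlerFunc.
type HandlerFunc func(*Context, *http.Response, error)

// Handle calls the underlying function.
func (h HandlerFunc) Handle(ctx *Context, res *http.Response, err error) {
	h(ctx, res, err)
}

// logErrors is a middleware-style wrapper: it records errors and
// forwards only successful calls to the wrapped Handler.
func logErrors(log *[]string, next Handler) Handler {
	return HandlerFunc(func(ctx *Context, res *http.Response, err error) {
		if err != nil {
			*log = append(*log, ctx.URL+": "+err.Error())
			return
		}
		next.Handle(ctx, res, nil)
	})
}

// demo runs one successful and one failed call through the chain.
func demo() (handled int, log []string) {
	h := logErrors(&log, HandlerFunc(func(*Context, *http.Response, error) {
		handled++
	}))
	h.Handle(&Context{URL: "http://example.com/a"}, &http.Response{StatusCode: 200}, nil)
	h.Handle(&Context{URL: "http://example.com/b"}, nil, errors.New("timeout"))
	return handled, log
}

func main() {
	fmt.Println(demo())
}
```

Because the wrapper is itself a Handler, wrappers of this kind can be stacked in any order.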
Command-related Interfaces ¶
The Fetcher recognizes a number of interfaces that the Command may implement, for more advanced needs.
* BasicAuthProvider: Implement this interface to specify the basic authentication credentials to set on the request.
* CookiesProvider: If the Command implements this interface, the provided Cookies will be set on the request.
* HeaderProvider: Implement this interface to specify the headers to set on the request.
* ReaderProvider: Implement this interface to set the body of the request, via an io.Reader.
* ValuesProvider: Implement this interface to set the body of the request, as form-encoded values. If the Content-Type is not specifically set via a HeaderProvider, it is set to "application/x-www-form-urlencoded". ReaderProvider and ValuesProvider should be mutually exclusive as they both set the body of the request. If both are implemented, the ReaderProvider interface is used.
* Handler: Implement this interface if the Command's response should be handled by a specific callback function. By default, the response is handled by the Fetcher's Handler, but if the Command implements this, this handler function takes precedence and the Fetcher's Handler is ignored.
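As a rough sketch of how such optional interfaces can be detected, the snippet below uses type assertions on a Command while building a request. The method sets of HeaderProvider and BasicAuthProvider are assumptions for illustration, and authCmd, buildRequest and demo are hypothetical names; this is not the Fetcher's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// Command mirrors the interface shown in the Fetcher section.
type Command interface {
	URL() *url.URL
	Method() string
}

// HeaderProvider and BasicAuthProvider use assumed method sets.
type HeaderProvider interface {
	Header() http.Header
}

type BasicAuthProvider interface {
	BasicAuth() (user, pwd string)
}

// authCmd is a hypothetical Command that opts in to both interfaces.
type authCmd struct {
	u *url.URL
}

func (c *authCmd) URL() *url.URL  { return c.u }
func (c *authCmd) Method() string { return "GET" }
func (c *authCmd) Header() http.Header {
	return http.Header{"Accept": {"text/html"}, "Authorization": {"Bearer stale"}}
}
func (c *authCmd) BasicAuth() (string, string) { return "user", "secret" }

// buildRequest applies whichever optional interfaces cmd implements.
func buildRequest(cmd Command) (*http.Request, error) {
	req, err := http.NewRequest(cmd.Method(), cmd.URL().String(), nil)
	if err != nil {
		return nil, err
	}
	if hp, ok := cmd.(HeaderProvider); ok {
		for k, vals := range hp.Header() {
			for _, v := range vals {
				req.Header.Add(k, v)
			}
		}
	}
	// Applied last, so it replaces any Authorization header set above.
	if ba, ok := cmd.(BasicAuthProvider); ok {
		req.SetBasicAuth(ba.BasicAuth())
	}
	return req, nil
}

// demo reports the Accept header and the credentials found on the request.
func demo() (accept, user string, ok bool) {
	u, _ := url.Parse("http://example.com/private")
	req, err := buildRequest(&authCmd{u: u})
	if err != nil {
		return "", "", false
	}
	user, _, ok = req.BasicAuth()
	return req.Header.Get("Accept"), user, ok
}

func main() {
	fmt.Println(demo())
}
```

A Command that implements none of the optional interfaces simply produces a plain request.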
Since the Command is an interface, it can be a custom struct that holds additional information, such as an ID for the URL (e.g. from a database), or a depth counter so that the crawling stops at a certain depth, etc. For basic commands that don't require additional information, the package provides the Cmd struct that implements the Command interface. This is the Command implementation used when using the various Queue.SendString* methods.
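A depth-limited Command could look like the sketch below. It redeclares the Command interface locally so that it compiles on its own; depthCmd, maxDepth, follow and demo are hypothetical names, and a real handler would send the resulting commands on the Queue.

```go
package main

import (
	"fmt"
	"net/url"
)

// Command mirrors the interface shown in the Fetcher section.
type Command interface {
	URL() *url.URL
	Method() string
}

// depthCmd is a hypothetical custom Command carrying a depth counter.
type depthCmd struct {
	u     *url.URL
	depth int
}

func (c *depthCmd) URL() *url.URL  { return c.u }
func (c *depthCmd) Method() string { return "GET" }

// maxDepth is an arbitrary limit chosen for this sketch.
const maxDepth = 2

// follow returns one Command per link found in the parent's response,
// or nil once the parent already sits at the depth limit.
func follow(parent Command, links []string) []Command {
	d := 0
	if dc, ok := parent.(*depthCmd); ok {
		d = dc.depth
	}
	if d >= maxDepth {
		return nil
	}
	var out []Command
	for _, l := range links {
		ref, err := url.Parse(l)
		if err != nil {
			continue // skip links that cannot be parsed
		}
		out = append(out, &depthCmd{u: parent.URL().ResolveReference(ref), depth: d + 1})
	}
	return out
}

// demo follows links from a root page and from a page at the limit.
func demo() (children int, first string, tooDeep int) {
	root, _ := url.Parse("http://example.com/dir/")
	kids := follow(&depthCmd{u: root}, []string{"/a", "b"})
	deep := follow(&depthCmd{u: root, depth: maxDepth}, []string{"/c"})
	return len(kids), kids[0].URL().String(), len(deep)
}

func main() {
	fmt.Println(demo())
}
```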
There is also a convenience HandlerCmd struct for the commands that should be handled by a specific callback function. It is a Command with a Handler interface implementation.
Fetcher Options ¶
The Fetcher has a number of fields that provide further customization:
* HttpClient : By default, the Fetcher uses the net/http default Client to make requests. A different client can be set on the Fetcher.HttpClient field.
* CrawlDelay : The delay to use between requests to the same host. It is used only if there is no delay specified by the robots.txt of a given host.
* UserAgent : Sets the user agent string to use for the requests and to validate against the robots.txt entries.
* WorkerIdleTTL : Sets the duration that a worker goroutine can wait without receiving new commands to fetch. If the idle time-to-live is reached, the worker goroutine is stopped and its resources are released. This can be especially useful for long-running crawlers.
* AutoClose : If true, closes the queue automatically once the number of active hosts reaches 0.
* DisablePoliteness : If true, ignores the robots.txt policies of the hosts.
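The interplay between CrawlDelay and DisablePoliteness can be summarized in a few lines. This is only a reading of the documented behavior, not the package's implementation; effectiveDelay, defaultCrawlDelay and demo are hypothetical names, and a zero robotsDelay stands for "no delay specified".

```go
package main

import (
	"fmt"
	"time"
)

// defaultCrawlDelay copies the documented default of 5 seconds.
const defaultCrawlDelay = 5 * time.Second

// effectiveDelay picks the delay between two requests to the same host.
// The robots.txt delay wins when politeness is enabled and a delay was
// specified; otherwise the configured fallback is used.
func effectiveDelay(robotsDelay time.Duration, disablePoliteness bool, fallback time.Duration) time.Duration {
	if !disablePoliteness && robotsDelay > 0 {
		return robotsDelay
	}
	return fallback
}

// demo returns the chosen delay, in seconds, for three situations.
func demo() (withRobots, withoutRobots, impolite int) {
	sec := func(d time.Duration) int { return int(d / time.Second) }
	withRobots = sec(effectiveDelay(10*time.Second, false, defaultCrawlDelay))
	withoutRobots = sec(effectiveDelay(0, false, defaultCrawlDelay))
	impolite = sec(effectiveDelay(10*time.Second, true, defaultCrawlDelay))
	return withRobots, withoutRobots, impolite
}

func main() {
	fmt.Println(demo())
}
```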
Unlike gocrawl, fetchbot does not keep track of already visited URLs, and it does not normalize URLs. Both are outside the scope of this package: every command sent on the Queue is fetched. Normalization can easily be done (e.g. using https://github.com/PuerkitoBio/purell) before sending the Command to the Fetcher. How to keep track of visited URLs depends on the use case of the specific crawler; for an example, see /example/full/main.go.
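One possible way to handle both concerns before sending a Command is sketched below. The normalize function is a deliberately small stand-in for a real normalizer such as purell, and the visited type, firstVisit and demo are hypothetical names; the approach in /example/full/main.go may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// normalize lowercases the scheme and host, drops the fragment and
// makes an empty path explicit. Real normalizers do much more.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	if u.Path == "" {
		u.Path = "/"
	}
	return u.String(), nil
}

// visited is a set of normalized URLs, safe for concurrent use.
type visited struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisited() *visited {
	return &visited{seen: make(map[string]bool)}
}

// firstVisit reports whether raw has not been seen before and marks it
// as seen. URLs that cannot be parsed are reported as not worth sending.
func (v *visited) firstVisit(raw string) bool {
	key, err := normalize(raw)
	if err != nil {
		return false
	}
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[key] {
		return false
	}
	v.seen[key] = true
	return true
}

// demo checks three URLs, two of which normalize to the same key.
func demo() []bool {
	v := newVisited()
	return []bool{
		v.firstVisit("http://Example.com#top"),
		v.firstVisit("http://example.com/"),
		v.firstVisit("http://example.com/doc"),
	}
}

func main() {
	fmt.Println(demo())
}
```

A handler would call firstVisit on each discovered link and send only those for which it returns true.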
License ¶
The BSD 3-Clause license (http://opensource.org/licenses/BSD-3-Clause), the same as the Go language. The iq_slice.go file is under the CDDL-1.0 license (details in the source file).
Index ¶
- Constants
- Variables
- type BasicAuthProvider
- type Cmd
- type Command
- type Context
- type CookiesProvider
- type DebugInfo
- type Doer
- type Fetcher
- type Handler
- type HandlerCmd
- type HandlerFunc
- type HeaderProvider
- type Mux
- type Queue
- func (q *Queue) Block()
- func (q *Queue) Cancel() error
- func (q *Queue) Close() error
- func (q *Queue) Send(c Command) error
- func (q *Queue) SendString(method string, rawurl ...string) (int, error)
- func (q *Queue) SendStringGet(rawurl ...string) (int, error)
- func (q *Queue) SendStringHead(rawurl ...string) (int, error)
- type ReaderProvider
- type ResponseMatcher
- func (r *ResponseMatcher) ContentType(ct string) *ResponseMatcher
- func (r *ResponseMatcher) Custom(predicate func(*http.Response) bool) *ResponseMatcher
- func (r *ResponseMatcher) Handler(h Handler) *ResponseMatcher
- func (r *ResponseMatcher) Host(host string) *ResponseMatcher
- func (r *ResponseMatcher) Method(m string) *ResponseMatcher
- func (r *ResponseMatcher) Path(p string) *ResponseMatcher
- func (r *ResponseMatcher) Scheme(scheme string) *ResponseMatcher
- func (r *ResponseMatcher) Status(code int) *ResponseMatcher
- func (r *ResponseMatcher) StatusRange(min, max int) *ResponseMatcher
- type ValuesProvider
Constants ¶
	const (
		// DefaultCrawlDelay represents the delay to use if there is no robots.txt
		// specified delay.
		DefaultCrawlDelay = 5 * time.Second

		// DefaultUserAgent is the default user agent string.
		DefaultUserAgent = "Fetchbot (https://github.com/PuerkitoBio/fetchbot)"

		// DefaultWorkerIdleTTL is the default time-to-live of an idle host worker goroutine.
		// If no URL is sent for a given host within this duration, this host's goroutine
		// is disposed of.
		DefaultWorkerIdleTTL = 30 * time.Second
	)
Variables ¶
	var (
		// ErrEmptyHost is returned if a command to be enqueued has an URL with an empty host.
		ErrEmptyHost = errors.New("fetchbot: invalid empty host")

		// ErrDisallowed is returned when the requested URL is disallowed by the robots.txt
		// policy.
		ErrDisallowed = errors.New("fetchbot: disallowed by robots.txt")

		// ErrQueueClosed is returned when a Send call is made on a closed Queue.
		ErrQueueClosed = errors.New("fetchbot: send on a closed queue")
	)
Functions ¶
This section is empty.
Types ¶
type BasicAuthProvider ¶
BasicAuthProvider interface gets the credentials to use to perform the request with Basic Authentication.
type Cmd ¶
Cmd defines a basic Command implementation.
type Context ¶
Context is a Command's fetch context, passed to the Handler. It gives access to the original Command and the associated Queue.
type CookiesProvider ¶
CookiesProvider interface gets the cookies to send with the request.
type DebugInfo ¶
type DebugInfo struct {
NumHosts int
}
The DebugInfo holds information to introspect the Fetcher's state.
type Doer ¶
Doer defines the method required to use a type as HttpClient. The *http.Client type from net/http satisfies this interface.
type Fetcher ¶
	type Fetcher struct {
		// The Handler to be called for each request. All successfully enqueued requests
		// produce a Handler call.
		Handler Handler

		// DisablePoliteness disables fetching and using the robots.txt policies of
		// hosts.
		DisablePoliteness bool

		// Default delay to use between requests to a same host if there is no robots.txt
		// crawl delay or if DisablePoliteness is true.
		CrawlDelay time.Duration

		// The *http.Client to use for the requests. If nil, defaults to the net/http
		// package's default client. Should be HTTPClient to comply with go lint, but
		// this is a breaking change, won't fix.
		HttpClient Doer

		// The user-agent string to use for robots.txt validation and URL fetching.
		UserAgent string

		// The time a host-dedicated worker goroutine can stay idle, with no Command to enqueue,
		// before it is stopped and cleared from memory.
		WorkerIdleTTL time.Duration

		// AutoClose makes the fetcher close its queue automatically once the number
		// of hosts reach 0. A host is removed once it has been idle for WorkerIdleTTL
		// duration.
		AutoClose bool

		// contains filtered or unexported fields
	}
A Fetcher defines the parameters for running a web crawler.
type Handler ¶
The Handler interface is used to process the Fetcher's requests. It is similar to the net/http.Handler interface.
type HandlerCmd ¶
	type HandlerCmd struct {
		*Cmd
		HandlerFunc
	}
HandlerCmd is a basic Command with its own Handler function that is called to handle the HTTP response.
func NewHandlerCmd ¶
func NewHandlerCmd(method, rawURL string, fn func(*Context, *http.Response, error)) (*HandlerCmd, error)
NewHandlerCmd creates a HandlerCmd for the provided request and callback handler function.
type HandlerFunc ¶
A HandlerFunc is a function signature that implements the Handler interface. A function with this signature can thus be used as a Handler.
type HeaderProvider ¶
HeaderProvider interface gets the headers to set on the request. If an Authorization header is set, it will be overridden by the BasicAuthProvider, if implemented.
type Mux ¶
	type Mux struct {
		DefaultHandler Handler
		// contains filtered or unexported fields
	}
Mux is a simple multiplexer for the Handler interface, similar to net/http.ServeMux. It is itself a Handler, and dispatches the calls to the matching Handlers.
For error Handlers, if there is a Handler registered for the same error value, it will be called. Otherwise, if there is a Handler registered for any error, this Handler will be called.
For Response Handlers, a match with a path criterion has higher priority than other matches, and the Handler with the longest matching path gets called.
If multiple Response handlers with the same path length (or no path criteria) match a response, the actual handler called is undefined, but one and only one will be called.
In any case, if no Handler matches, the DefaultHandler is called, and it defaults to a no-op.
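The path precedence rule can be pictured with a small standalone sketch. It only models the "longest matching path prefix wins, otherwise fall back to the default" part of the description; route, match and demo are hypothetical names and the Mux's internal matching is not shown in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// route pairs a path prefix criterion with the name of its handler.
// An empty path means the route has no path criterion.
type route struct {
	path string
	name string
}

// match returns the name of the route with the longest path prefix that
// matches, or fallback if none does. On equal lengths the first route
// wins here; the documentation leaves that case undefined.
func match(routes []route, path, fallback string) string {
	best := -1
	name := fallback
	for _, r := range routes {
		if !strings.HasPrefix(path, r.path) {
			continue
		}
		if len(r.path) > best {
			best = len(r.path)
			name = r.name
		}
	}
	return name
}

// demo resolves four paths against three registered prefixes.
func demo() []string {
	routes := []route{
		{path: "/", name: "root"},
		{path: "/doc", name: "doc"},
		{path: "/doc/api", name: "api"},
	}
	return []string{
		match(routes, "/doc/api/x", "default"),
		match(routes, "/doc/faq", "default"),
		match(routes, "/", "default"),
		match(routes, "other", "default"),
	}
}

func main() {
	fmt.Println(demo())
}
```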
func (*Mux) Handle ¶
Handle is the Handler interface implementation for Mux. It dispatches the calls to the matching Handler.
func (*Mux) HandleError ¶
HandleError registers a Handler for a specific error value. Multiple calls with the same error value override previous calls. As a special case, a nil error value registers a Handler for any error that doesn't have a specific Handler.
func (*Mux) HandleErrors ¶
HandleErrors registers a Handler for any error that doesn't have a specific Handler.
func (*Mux) Response ¶
func (mux *Mux) Response() *ResponseMatcher
Response initializes an entry for a Response Handler based on various criteria. The Response Handler is not registered until Handle is called.
type Queue ¶
type Queue struct {
// contains filtered or unexported fields
}
Queue offers methods to send Commands to the Fetcher, and to stop the crawling process. It is safe to use from concurrent goroutines.
func (*Queue) Block ¶
func (q *Queue) Block()
Block blocks the current goroutine until the Queue is closed and all pending commands are drained.
func (*Queue) Cancel ¶
Cancel closes the Queue and drains the pending commands without processing them, allowing for a fast "stop immediately"-ish operation.
func (*Queue) Close ¶
Close closes the Queue so that no more Commands can be sent. It blocks until the Fetcher drains all pending commands. After the call, the Fetcher is stopped. Attempts to enqueue new URLs after Close has been called will always result in an ErrQueueClosed error.
func (*Queue) Send ¶
Send enqueues a Command into the Fetcher. If the Queue has been closed, it returns ErrQueueClosed. The Command's URL must have a Host.
func (*Queue) SendString ¶
SendString enqueues a method and some URL strings into the Fetcher. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.
func (*Queue) SendStringGet ¶
SendStringGet enqueues the URL strings to be fetched with a GET method. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.
func (*Queue) SendStringHead ¶
SendStringHead enqueues the URL strings to be fetched with a HEAD method. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.
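The return values of the SendString family can be illustrated with a stand-in queue. The sketch assumes that processing stops at the first URL that fails, which the descriptions above suggest but do not spell out; the queue type, its methods and demo are hypothetical, and unlike the real Queue this stand-in is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// Local stand-ins for ErrEmptyHost and ErrQueueClosed.
var (
	errEmptyHost   = errors.New("invalid empty host")
	errQueueClosed = errors.New("send on a closed queue")
)

// queue is a single-goroutine stand-in that only records commands.
type queue struct {
	closed bool
	cmds   []string
}

// send rejects URLs without a host and any send on a closed queue.
func (q *queue) send(method string, u *url.URL) error {
	if u.Host == "" {
		return errEmptyHost
	}
	if q.closed {
		return errQueueClosed
	}
	q.cmds = append(q.cmds, method+" "+u.String())
	return nil
}

// sendString enqueues each URL in order. It stops at the first failure
// and returns the number of URLs enqueued before it.
func (q *queue) sendString(method string, rawurl ...string) (int, error) {
	for i, raw := range rawurl {
		u, err := url.Parse(raw)
		if err != nil {
			return i, err
		}
		if err := q.send(method, u); err != nil {
			return i, err
		}
	}
	return len(rawurl), nil
}

// demo sends a batch containing a host-less URL, then sends after close.
func demo() (sent int, emptyHost bool, sentAfterClose int, closedErr bool) {
	q := &queue{}
	n1, err1 := q.sendString("HEAD", "http://golang.org", "/relative", "http://golang.org/doc")
	q.closed = true
	n2, err2 := q.sendString("GET", "http://golang.org")
	return n1, errors.Is(err1, errEmptyHost), n2, errors.Is(err2, errQueueClosed)
}

func main() {
	fmt.Println(demo())
}
```

Checking both return values lets a caller resume from the first URL that was not enqueued.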
type ReaderProvider ¶
ReaderProvider interface gets the Reader to use as the Body of the request. It has higher priority than the ValuesProvider interface, so that if both interfaces are implemented, the ReaderProvider is used.
type ResponseMatcher ¶
type ResponseMatcher struct {
// contains filtered or unexported fields
}
A ResponseMatcher holds the criteria for a response Handler.
func (*ResponseMatcher) ContentType ¶
func (r *ResponseMatcher) ContentType(ct string) *ResponseMatcher
ContentType sets a criterion based on the Content-Type header for the Response Handler. Its Handler will only be called if the response has this content type, ignoring any additional parameter on the header value (the part following the semicolon, e.g. "text/html; charset=utf-8").
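Matching a content type while ignoring its parameters can be done with the standard mime package, as in the sketch below. This shows one way to get the described behavior and is not necessarily how fetchbot implements it; contentTypeMatches and demo are hypothetical names.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// contentTypeMatches reports whether the media type of a Content-Type
// header value equals want, ignoring parameters such as charset.
func contentTypeMatches(header, want string) bool {
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil {
		return false // empty or malformed header value
	}
	// ParseMediaType returns the media type in lower case.
	return mediaType == strings.ToLower(want)
}

// demo checks four header values against "text/html".
func demo() []bool {
	return []bool{
		contentTypeMatches("text/html; charset=utf-8", "text/html"),
		contentTypeMatches("Text/HTML", "text/html"),
		contentTypeMatches("application/json", "text/html"),
		contentTypeMatches("", "text/html"),
	}
}

func main() {
	fmt.Println(demo())
}
```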
func (*ResponseMatcher) Custom ¶
func (r *ResponseMatcher) Custom(predicate func(*http.Response) bool) *ResponseMatcher
Custom sets a criteria based on a function that receives the HTTP response and returns true if the matcher should be used to handle this response, false otherwise.
func (*ResponseMatcher) Handler ¶
func (r *ResponseMatcher) Handler(h Handler) *ResponseMatcher
Handler sets the Handler to be called when this Response Handler is the match for a given response. It registers the Response Handler in its parent Mux.
func (*ResponseMatcher) Host ¶
func (r *ResponseMatcher) Host(host string) *ResponseMatcher
Host sets a criteria based on the host of the URL for the Response Handler. Its Handler will only be called if the host of the URL matches exactly the specified host.
func (*ResponseMatcher) Method ¶
func (r *ResponseMatcher) Method(m string) *ResponseMatcher
Method sets a method criterion for the Response Handler. Its Handler will only be called if the request used this HTTP method (e.g. "GET", "HEAD", ...).
func (*ResponseMatcher) Path ¶
func (r *ResponseMatcher) Path(p string) *ResponseMatcher
Path sets a criteria based on the path of the URL for the Response Handler. Its Handler will only be called if the path of the URL starts with this path. Longer matches have priority over shorter ones.
func (*ResponseMatcher) Scheme ¶
func (r *ResponseMatcher) Scheme(scheme string) *ResponseMatcher
Scheme sets a criteria based on the scheme of the URL for the Response Handler. Its Handler will only be called if the scheme of the URL matches exactly the specified scheme.
func (*ResponseMatcher) Status ¶
func (r *ResponseMatcher) Status(code int) *ResponseMatcher
Status sets a criteria based on the Status code of the response for the Response Handler. Its Handler will only be called if the response has this status code.
func (*ResponseMatcher) StatusRange ¶
func (r *ResponseMatcher) StatusRange(min, max int) *ResponseMatcher
StatusRange sets a criteria based on the Status code of the response for the Response Handler. Its Handler will only be called if the response has a status code between the min and max. If min is greater than max, the values are switched.
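The swap of reversed bounds can be sketched as follows. The sketch assumes that both bounds are inclusive, which the description does not state explicitly; statusRange, newStatusRange, contains and demo are hypothetical names.

```go
package main

import "fmt"

// statusRange holds normalized bounds (assumed inclusive) for a status code match.
type statusRange struct {
	lo, hi int
}

// newStatusRange swaps the bounds if they are given in reverse order.
func newStatusRange(lo, hi int) statusRange {
	if lo > hi {
		lo, hi = hi, lo
	}
	return statusRange{lo: lo, hi: hi}
}

// contains reports whether code falls within the range.
func (r statusRange) contains(code int) bool {
	return code >= r.lo && code <= r.hi
}

// demo builds a range from reversed bounds and tests three codes.
func demo() []bool {
	r := newStatusRange(299, 200)
	return []bool{r.contains(200), r.contains(204), r.contains(404)}
}

func main() {
	fmt.Println(demo())
}
```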
type ValuesProvider ¶
ValuesProvider interface gets the values to send as the Body of the request. It has lower priority than the ReaderProvider interface, so that if both interfaces are implemented, the ReaderProvider is used. If the request has no explicit Content-Type set, it will be automatically set to "application/x-www-form-urlencoded".
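The precedence between the two body providers can be sketched as below. The method sets Reader() io.Reader and Values() url.Values are assumptions for illustration, and formCmd, bothCmd, requestBody and demo are hypothetical names; the sketch returns the default Content-Type instead of checking whether a header was already set.

```go
package main

import (
	"fmt"
	"io"
	"net/url"
	"strings"
)

// Assumed method sets for the two body-related interfaces.
type ReaderProvider interface {
	Reader() io.Reader
}

type ValuesProvider interface {
	Values() url.Values
}

// formCmd is a hypothetical command that provides form values only.
type formCmd struct {
	vals url.Values
}

func (c formCmd) Values() url.Values { return c.vals }

// bothCmd is a hypothetical command that provides values and a reader.
type bothCmd struct {
	formCmd
	raw string
}

func (c bothCmd) Reader() io.Reader { return strings.NewReader(c.raw) }

// requestBody picks the request body and the Content-Type to use when
// none was set explicitly. The ReaderProvider takes precedence.
func requestBody(cmd interface{}) (body io.Reader, defaultContentType string) {
	if rp, ok := cmd.(ReaderProvider); ok {
		return rp.Reader(), ""
	}
	if vp, ok := cmd.(ValuesProvider); ok {
		return strings.NewReader(vp.Values().Encode()), "application/x-www-form-urlencoded"
	}
	return nil, ""
}

// readAll drains r into a string.
func readAll(r io.Reader) string {
	var sb strings.Builder
	io.Copy(&sb, r)
	return sb.String()
}

// demo builds a body from a values-only command and from one with both.
func demo() (formBody, formType, bothBody string) {
	form := formCmd{vals: url.Values{"q": {"go"}, "page": {"2"}}}
	b1, ct := requestBody(form)
	b2, _ := requestBody(bothCmd{formCmd: form, raw: `{"q":"go"}`})
	return readAll(b1), ct, readAll(b2)
}

func main() {
	fmt.Println(demo())
}
```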