Documentation ¶
Overview ¶
Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays.
It is very much a rewrite of gocrawl (https://github.com/PuerkitoBio/gocrawl) with a simpler API and fewer built-in features, but at the same time more flexibility. As for Go itself, sometimes less is more!
Installation ¶
To install, simply run in a terminal:
go get github.com/PuerkitoBio/fetchbot
The package has a single external dependency, robotstxt (https://github.com/temoto/robotstxt-go). It also integrates code from the iq package (https://github.com/kylelemons/iq).
The API documentation is available on godoc.org (http://godoc.org/github.com/PuerkitoBio/fetchbot).
Usage ¶
The following example (taken from /example/short/main.go) shows how to create and start a Fetcher, one way to send commands, and how to stop the fetcher once all commands have been handled.
package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(handler))
	queue := f.Start()
	queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
	queue.Close()
}

func handler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
}
A more complex and complete example can be found in the repository, at /example/full/.
Basically, a Fetcher is an instance of a web crawler, independent of other Fetchers. It receives Commands via the Queue, executes the requests, and calls a Handler to process the responses. A Command is an interface that tells the Fetcher which URL to fetch, and which HTTP method to use (e.g. "GET", "HEAD", ...).
A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the thread-safe object that can be used to send commands, or to stop the crawler.
Both the Command and the Handler are interfaces, and may be implemented in various ways. They are defined like so:
type Command interface {
	URL() *url.URL
	Method() string
}

type Handler interface {
	Handle(*Context, *http.Response, error)
}
A Context is a struct that holds the Command and the Queue, so that the Handler always knows which Command initiated this call, and has a handle to the Queue.
A Handler is similar to the net/http Handler, and middleware-style combinations can be built on top of it. A HandlerFunc type is provided so that simple functions with the right signature can be used as Handlers (like net/http.HandlerFunc), and there is also a multiplexer Mux that can be used to dispatch calls to different Handlers based on some criteria.
The Fetcher recognizes a number of interfaces that the Command may implement, for more advanced needs. If the Command implements the BasicAuthProvider interface, a Basic Authentication header will be put in place with the given credentials to fetch the URL.
Similarly, the CookiesProvider and HeaderProvider interfaces offer the expected features (setting cookies and header values on the request). The ReaderProvider and ValuesProvider interfaces are also supported, although they should be mutually exclusive as they both set the body of the request. If both are supported, the ReaderProvider interface is used. It sets the body of the request (e.g. for a "POST") using the given io.Reader instance. The ValuesProvider does the same, but using the given url.Values instance, and sets the Content-Type of the body to "application/x-www-form-urlencoded" (unless it is explicitly set by a HeaderProvider).
Since the Command is an interface, it can be a custom struct that holds additional information, such as an ID for the URL (e.g. from a database), or a depth counter so that the crawling stops at a certain depth, etc. For basic commands that don't require additional information, the package provides the Cmd struct that implements the Command interface. This is the Command implementation used when using the various Queue.SendString* methods.
The Fetcher has a number of fields that provide further customization:
- HttpClient : By default, the Fetcher uses the net/http default Client to make requests. A different client can be set on the Fetcher.HttpClient field.
- CrawlDelay : The delay to use between requests to the same host. This value is used only if the robots.txt of a given host does not specify a delay.
- UserAgent : Sets the user agent string to use for the requests and to validate against the robots.txt entries.
- WorkerIdleTTL : Sets the duration that a worker goroutine can wait without receiving new commands to fetch. If the idle time-to-live is reached, the worker goroutine is stopped and its resources are released. This can be especially useful for long-running crawlers.
Unlike gocrawl, fetchbot does not keep track of already visited URLs, and it does not normalize URLs. Both are outside the scope of this package: every command sent on the Queue is fetched. Normalization can easily be done (e.g. using https://github.com/PuerkitoBio/purell) before sending the Command to the Fetcher. How to keep track of visited URLs depends on the use case of the specific crawler, but for an example, see /example/full/main.go.
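A minimal sketch of both tasks, using only the standard library, might look like the following. The normalize function is a deliberately small stand-in for a library such as purell (it only lowercases the scheme and host and drops the fragment), and the visited type and its firstVisit method are hypothetical names. This is not the approach taken in /example/full/main.go.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// normalize lowercases the scheme and host and removes the fragment.
// Real crawlers usually need more rules than this.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	return u.String(), nil
}

// visited is a set of normalized URLs, safe for concurrent use.
type visited struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisited() *visited {
	return &visited{seen: make(map[string]bool)}
}

// firstVisit marks raw as seen and reports whether it was new.
// URLs that cannot be parsed are reported as not new, so they get skipped.
func (v *visited) firstVisit(raw string) bool {
	key, err := normalize(raw)
	if err != nil {
		return false
	}
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[key] {
		return false
	}
	v.seen[key] = true
	return true
}

func main() {
	v := newVisited()
	urls := []string{
		"http://Example.com/a",
		"http://example.com/a#top",
		"http://example.com/b",
	}
	for _, raw := range urls {
		// Only URLs reported as new would be sent on the Queue.
		fmt.Println(raw, v.firstVisit(raw))
	}
}
```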
License ¶
The BSD 3-Clause license (http://opensource.org/licenses/BSD-3-Clause), the same as the Go language. The iq_slice.go file is under the CDDL-1.0 license (details in the source file).
Index ¶
- Constants
- Variables
- type BasicAuthProvider
- type Cmd
- type CmdHandler
- type CmdHandlerFunc
- type Command
- type Context
- type CookiesProvider
- type CrawlConfig
- type DebugInfo
- type Fetcher
- type Handler
- type HandlerFunc
- type HeaderProvider
- type HostFetcher
- type Mux
- type Queue
- type RateThrottler
- type ReaderProvider
- type ResponseMatcher
- func (r *ResponseMatcher) ContentType(ct string) *ResponseMatcher
- func (r *ResponseMatcher) Handler(h Handler) *ResponseMatcher
- func (r *ResponseMatcher) Host(host string) *ResponseMatcher
- func (r *ResponseMatcher) Method(m string) *ResponseMatcher
- func (r *ResponseMatcher) Path(p string) *ResponseMatcher
- func (r *ResponseMatcher) Status(code int) *ResponseMatcher
- func (r *ResponseMatcher) StatusRange(min, max int) *ResponseMatcher
- type UnsafeHostFetcher
- type ValuesProvider
Constants ¶
const (
	// The default crawl delay to use if there is no robots.txt specified delay.
	DefaultCrawlDelay = 5 * time.Second

	// The default user agent string.
	DefaultUserAgent = "Fetchbot (https://github.com/PuerkitoBio/fetchbot)"

	// The default time-to-live of an idle host worker goroutine. If no URL is sent
	// for a given host within this duration, this host's goroutine is disposed of.
	DefaultWorkerIdleTTL = 30 * time.Second
)
Variables ¶
var (
	// A Command cannot be enqueued if it has a URL with an empty host.
	ErrEmptyHost = errors.New("fetchbot: invalid empty host")

	// Error when the requested URL is disallowed by the robots.txt policy.
	ErrDisallowed = errors.New("fetchbot: disallowed by robots.txt")

	// Error when a Send call is made on a closed Queue.
	ErrQueueClosed = errors.New("fetchbot: send on a closed queue")
)

var DefaultCrawlConfig = CrawlConfig{
	CrawlDelay: DefaultCrawlDelay,
	HttpClient: http.DefaultClient,
	UserAgent:  DefaultUserAgent,
}
Functions ¶
This section is empty.
Types ¶
type BasicAuthProvider ¶
The BasicAuthProvider interface gets the credentials to use to perform the request with Basic Authentication.
type Cmd ¶
The Cmd struct defines a basic Command implementation.
type CmdHandler ¶
The CmdHandler interface is used to process the HostFetcher's requests. It is similar to the net/http.Handler interface.
type CmdHandlerFunc ¶
A CmdHandlerFunc is a function signature that implements the CmdHandler interface. A function with this signature can thus be used as a CmdHandler.
type Command ¶
The Command interface defines the methods required by the Fetcher to request a resource.
type Context ¶
Context is a Command's fetch context, passed to the Handler. It gives access to the original Command and the associated Queue.
type CookiesProvider ¶
The CookiesProvider interface gets the cookies to send with the request.
type CrawlConfig ¶
type CrawlConfig struct {
	// Default delay to use between requests to a same host if there is no robots.txt
	// crawl delay.
	CrawlDelay time.Duration

	// The *http.Client to use for the requests. If nil, defaults to the net/http
	// package's default client.
	HttpClient *http.Client

	// The user-agent string to use for robots.txt validation and URL fetching.
	UserAgent string
}
type DebugInfo ¶
type DebugInfo struct {
NumHosts int
}
The DebugInfo holds information to introspect the Fetcher's state.
type Fetcher ¶
type Fetcher struct {
	CrawlConfig

	// The Handler to be called for each request. All successfully enqueued requests
	// produce a Handler call.
	Handler Handler

	// The time a host-dedicated worker goroutine can stay idle, with no Command to enqueue,
	// before it is stopped and cleared from memory.
	WorkerIdleTTL time.Duration
	// contains filtered or unexported fields
}
A Fetcher defines the parameters for running a web crawler.
type Handler ¶
The Handler interface is used to process the Fetcher's requests. It is similar to the net/http.Handler interface.
type HandlerFunc ¶
A HandlerFunc is a function signature that implements the Handler interface. A function with this signature can thus be used as a Handler.
type HeaderProvider ¶
The HeaderProvider interface gets the headers to set on the request. If an Authorization header is set, it will be overridden by the BasicAuthProvider, if implemented.
type HostFetcher ¶
type HostFetcher struct {
	*UnsafeHostFetcher
	// contains filtered or unexported fields
}
HostFetcher is an UnsafeHostFetcher that supports robots.txt.
func NewHostFetcher ¶
func NewHostFetcher(c CrawlConfig, baseurl *url.URL, chand CmdHandler, cmd chan Command) *HostFetcher
NewHostFetcher creates a new HostFetcher.
func (*HostFetcher) Run ¶
func (hf *HostFetcher) Run()
Run fetches robots.txt, then runs continuously until the command channel is closed, executing the commands sent to it.
type Mux ¶
type Mux struct {
	DefaultHandler Handler
	// contains filtered or unexported fields
}
Mux is a simple multiplexer for the Handler interface, similar to net/http.ServeMux. It is itself a Handler, and dispatches the calls to the matching Handlers.
For error Handlers, if there is a Handler registered for the same error value, it will be called. Otherwise, if there is a Handler registered for any error, this Handler will be called.
For Response Handlers, a match with a path criterion has higher priority than other matches, and the longest path match gets called.
If multiple Response handlers with the same path length (or no path criteria) match a response, the actual handler called is undefined, but one and only one will be called.
In any case, if no Handler matches, the DefaultHandler is called, and it defaults to a no-op.
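The path priority rule can be illustrated in isolation. The sketch below is not the Mux implementation: longestPrefix is a hypothetical helper that only shows how the longest registered path prefix wins, and how the absence of a match would lead to the fallback.

```go
package main

import (
	"fmt"
	"strings"
)

// longestPrefix returns the registered path that is the longest prefix of p.
// The boolean is false when no registered path matches.
func longestPrefix(registered []string, p string) (string, bool) {
	best, found := "", false
	for _, r := range registered {
		if strings.HasPrefix(p, r) && (!found || len(r) > len(best)) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	paths := []string{"/docs", "/docs/api", "/blog"}
	for _, p := range []string{"/docs/api/v1", "/docs/intro", "/about"} {
		if match, ok := longestPrefix(paths, p); ok {
			fmt.Println(p, "is handled by", match)
		} else {
			fmt.Println(p, "falls through to the default handler")
		}
	}
}
```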
func (*Mux) Handle ¶
Handle is the Handler interface implementation for Mux. It dispatches the calls to the matching Handler.
func (*Mux) HandleError ¶
HandleError registers a Handler for a specific error value. Multiple calls with the same error value override previous calls. As a special case, a nil error value registers a Handler for any error that doesn't have a specific Handler.
func (*Mux) HandleErrors ¶
HandleErrors registers a Handler for any error that doesn't have a specific Handler.
func (*Mux) Response ¶
func (mux *Mux) Response() *ResponseMatcher
Response initializes an entry for a Response Handler based on various criteria. The Response Handler is not registered until Handler is called on the returned ResponseMatcher.
type Queue ¶
type Queue struct {
// contains filtered or unexported fields
}
Queue offers methods to send Commands to the Fetcher, and to stop the crawling process. It is safe to use from concurrent goroutines.
func (*Queue) Block ¶
func (q *Queue) Block()
Block blocks the current goroutine until the Queue is closed.
func (*Queue) Close ¶
Close closes the Queue so that no more Commands can be sent. It blocks until the Fetcher drains all pending commands. After the call, the Fetcher is stopped.
func (*Queue) Send ¶
Send enqueues a Command into the Fetcher. If the Queue has been closed, it returns ErrQueueClosed.
func (*Queue) SendString ¶
SendString enqueues a method and some URL strings into the Fetcher. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.
func (*Queue) SendStringGet ¶
SendStringGet enqueues the URL strings to be fetched with a GET method. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.
func (*Queue) SendStringHead ¶
SendStringHead enqueues the URL strings to be fetched with a HEAD method. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.
type RateThrottler ¶
RateThrottler records when an action is done, and how frequently it is allowed.
func (*RateThrottler) ActMaybeWait ¶
func (t *RateThrottler) ActMaybeWait()
ActMaybeWait sleeps until the next action is allowed, then notes the time at which it is performed.
type ReaderProvider ¶
The ReaderProvider interface gets the Reader to use as the Body of the request. It has higher priority than the ValuesProvider interface, so that if both interfaces are implemented, the ReaderProvider is used.
type ResponseMatcher ¶
type ResponseMatcher struct {
// contains filtered or unexported fields
}
A ResponseMatcher holds the criteria for a response Handler.
func (*ResponseMatcher) ContentType ¶
func (r *ResponseMatcher) ContentType(ct string) *ResponseMatcher
ContentType sets a criterion based on the Content-Type header for the Response Handler. Its Handler will only be called if the response has this content type, ignoring any additional parameter on the header value (following the semicolon, e.g. "text/html; charset=utf-8").
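Matching a content type while ignoring its parameters can be done with the standard mime package. The helper below, sameContentType, is a hypothetical illustration of that comparison and makes no claim about how ResponseMatcher implements it.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// sameContentType reports whether the media type in a Content-Type header
// value equals want, ignoring parameters such as charset.
func sameContentType(header, want string) bool {
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil {
		return false
	}
	// ParseMediaType returns the media type in lower case.
	return mediaType == strings.ToLower(want)
}

func main() {
	fmt.Println(sameContentType("text/html; charset=utf-8", "text/html"))
	fmt.Println(sameContentType("application/json", "text/html"))
}
```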
func (*ResponseMatcher) Handler ¶
func (r *ResponseMatcher) Handler(h Handler) *ResponseMatcher
Handler sets the Handler to be called when this Response Handler is the match for a given response. It registers the Response Handler in its parent Mux.
func (*ResponseMatcher) Host ¶
func (r *ResponseMatcher) Host(host string) *ResponseMatcher
Host sets a criterion based on the host of the URL for the Response Handler. Its Handler will only be called if the host of the URL matches exactly the specified host.
func (*ResponseMatcher) Method ¶
func (r *ResponseMatcher) Method(m string) *ResponseMatcher
Method sets a method criterion for the Response Handler. Its Handler will only be called if the request used this HTTP method (e.g. "GET", "HEAD", ...).
func (*ResponseMatcher) Path ¶
func (r *ResponseMatcher) Path(p string) *ResponseMatcher
Path sets a criterion based on the path of the URL for the Response Handler. Its Handler will only be called if the path of the URL starts with this path. Longer matches have priority over shorter ones.
func (*ResponseMatcher) Status ¶
func (r *ResponseMatcher) Status(code int) *ResponseMatcher
Status sets a criterion based on the Status code of the response for the Response Handler. Its Handler will only be called if the response has this status code.
func (*ResponseMatcher) StatusRange ¶
func (r *ResponseMatcher) StatusRange(min, max int) *ResponseMatcher
StatusRange sets a criterion based on the Status code of the response for the Response Handler. Its Handler will only be called if the response has a status code between min and max. If min is greater than max, the values are switched.
type UnsafeHostFetcher ¶
type UnsafeHostFetcher struct {
	BaseURL *url.URL
	CmdHandler
	CmdIn      chan Command
	HttpClient *http.Client
	UserAgent  string
	RateThrottler
}
UnsafeHostFetcher receives commands on CmdIn that all pertain to the host in BaseURL. It executes them using HttpClient, and submits errors and responses to CmdHandler. Use HostFetcher instead of UnsafeHostFetcher unless you know it is safe to disregard robots.txt.
func NewUnsafeHostFetcher ¶
func NewUnsafeHostFetcher(c CrawlConfig, baseurl *url.URL, chand CmdHandler, cmd chan Command) *UnsafeHostFetcher
NewUnsafeHostFetcher creates a new UnsafeHostFetcher.
func (*UnsafeHostFetcher) DoCommand ¶
func (uhf *UnsafeHostFetcher) DoCommand(cmd Command)
DoCommand executes a single command. It is normally only used by Run, but nothing prevents users of UnsafeHostFetcher from giving a nil CmdIn and invoking DoCommand themselves. DoCommand will respect the throttling specified in the UnsafeHostFetcher's RateThrottler, which is initialized based on the CrawlDelay used in building it.
func (*UnsafeHostFetcher) Run ¶
func (uhf *UnsafeHostFetcher) Run()
Run reads and executes commands from CmdIn until it is closed.
type ValuesProvider ¶
The ValuesProvider interface gets the values to send as the Body of the request. It has lower priority than the ReaderProvider interface, so that if both interfaces are implemented, the ReaderProvider is used. If the request has no explicit Content-Type set, it will be automatically set to "application/x-www-form-urlencoded".