fetchbot

package module
v0.0.0-...-71664b4 Latest Latest
Published: May 11, 2014 License: BSD-3-Clause Imports: 12 Imported by: 0

README

fetchbot

Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays.

It is very much a rewrite of gocrawl (https://github.com/PuerkitoBio/gocrawl) with a simpler API and fewer built-in features, but at the same time more flexibility. As with Go itself, sometimes less is more!

Installation

To install, simply run in a terminal:

go get github.com/PuerkitoBio/fetchbot

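The package has a single external dependency, robotstxt (https://github.com/temoto/robotstxt-go). It also integrates code from the iq package (https://github.com/kylelemons/iq).

The API documentation is available on godoc.org (http://godoc.org/github.com/PuerkitoBio/fetchbot).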

Usage

The following example (taken from /example/short/main.go) shows how to create and start a Fetcher, one way to send commands, and how to stop the fetcher once all commands have been handled.

package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	f := fetchbot.New(fetchbot.HandlerFunc(handler))
	queue := f.Start()
	queue.SendStringHead("http://google.com", "http://golang.org", "http://golang.org/doc")
	queue.Close()
}

func handler(ctx *fetchbot.Context, res *http.Response, err error) {
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
}

A more complex and complete example can be found in the repository, at /example/full/.

A Fetcher is an instance of a web crawler, independent of other Fetchers. It receives Commands via the Queue, executes the requests, and calls a Handler to process the responses. A Command is an interface that tells the Fetcher which URL to fetch, and which HTTP method to use (e.g. "GET", "HEAD", ...).

A call to Fetcher.Start() returns the Queue associated with this Fetcher. This is the thread-safe object that can be used to send commands, or to stop the crawler.

Both the Command and the Handler are interfaces, and may be implemented in various ways. They are defined like so:

type Command interface {
	URL() *url.URL
	Method() string
}
type Handler interface {
	Handle(*Context, *http.Response, error)
}

A Context is a struct that holds the Command and the Queue, so that the Handler always knows which Command initiated this call, and has a handle to the Queue.

A Handler is similar to the net/http Handler, and middleware-style combinations can be built on top of it. A HandlerFunc type is provided so that simple functions with the right signature can be used as Handlers (like net/http.HandlerFunc), and there is also a multiplexer Mux that can be used to dispatch calls to different Handlers based on some criteria.

The Fetcher recognizes a number of interfaces that the Command may implement, for more advanced needs. If the Command implements the BasicAuthProvider interface, a Basic Authentication header will be put in place with the given credentials to fetch the URL.

Similarly, the CookiesProvider and HeaderProvider interfaces offer the expected features (setting cookies and header values on the request). The ReaderProvider and ValuesProvider interfaces are also supported, although they should be treated as mutually exclusive because both set the body of the request. If a Command implements both, the ReaderProvider interface is used. It sets the body of the request (e.g. for a "POST") using the given io.Reader instance. The ValuesProvider does the same using the given url.Values instance, and sets the Content-Type of the body to "application/x-www-form-urlencoded" (unless it is explicitly set by a HeaderProvider).

Since the Command is an interface, it can be a custom struct that holds additional information, such as an ID for the URL (e.g. from a database), or a depth counter so that the crawling stops at a certain depth. For basic commands that don't require additional information, the package provides the Cmd struct that implements the Command interface. This is the Command implementation used by the various Queue.SendString* methods.

The Fetcher has a number of fields that provide further customization:

  • HttpClient : By default, the Fetcher uses the net/http default Client to make requests. A different client can be set on the Fetcher.HttpClient field.

  • CrawlDelay : The delay to use between requests to the same host. This value is used only if the robots.txt of a given host does not specify a delay.

  • UserAgent : Sets the user agent string to use for the requests and to validate against the robots.txt entries.

  • WorkerIdleTTL : Sets the duration that a worker goroutine can wait without receiving new commands to fetch. If the idle time-to-live is reached, the worker goroutine is stopped and its resources are released. This can be especially useful for long-running crawlers.

Unlike gocrawl, fetchbot does not keep track of already visited URLs, and it does not normalize URLs. Both are outside the scope of this package: every command sent on the Queue is fetched. Normalization can be done before sending the Command to the Fetcher (e.g. using purell, https://github.com/PuerkitoBio/purell). How to keep track of visited URLs depends on the use case of the specific crawler; for an example, see /example/full/main.go.
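One possible way to deduplicate URLs before sending them is sketched below. It is not taken from /example/full/main.go: normalize is a deliberately tiny stand-in for a real normalization library, and the visited type and its firstVisit method are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"sync"
)

// normalize lowercases the scheme and host and drops the fragment. It is a
// small stand-in for a full normalization library.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	return u.String(), nil
}

// visited is a set of normalized URLs that is safe for concurrent use.
type visited struct {
	mu   sync.Mutex
	seen map[string]bool
}

// firstVisit reports whether raw has not been seen before, and marks it seen.
func (v *visited) firstVisit(raw string) (bool, error) {
	key, err := normalize(raw)
	if err != nil {
		return false, err
	}
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[key] {
		return false, nil
	}
	v.seen[key] = true
	return true, nil
}

func main() {
	v := &visited{seen: make(map[string]bool)}
	for _, raw := range []string{"http://Example.com/a", "http://example.com/a#top"} {
		first, err := v.firstVisit(raw)
		fmt.Println(raw, first, err)
	}
}
```

A crawler would call firstVisit before each Send and skip the URL when it returns false.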

License

The BSD 3-Clause license (http://opensource.org/licenses/BSD-3-Clause), the same as the Go language. The iq package source code (the iq_slice.go file) is under the CDDL-1.0 license (details in the source file).

Documentation

Index

Constants

const (
	// The default crawl delay to use if there is no robots.txt specified delay.
	DefaultCrawlDelay = 5 * time.Second
	// The default user agent string.
	DefaultUserAgent = "Fetchbot (https://github.com/PuerkitoBio/fetchbot)"
	// The default time-to-live of an idle host worker goroutine. If no URL is sent
	// for a given host within this duration, this host's goroutine is disposed of.
	DefaultWorkerIdleTTL = 30 * time.Second
)

Variables

var (
	// A Command cannot be enqueued if it has a URL with an empty host.
	ErrEmptyHost = errors.New("fetchbot: invalid empty host")

	// Error when the requested URL is disallowed by the robots.txt policy.
	ErrDisallowed = errors.New("fetchbot: disallowed by robots.txt")

	// Error when a Send call is made on a closed Queue.
	ErrQueueClosed = errors.New("fetchbot: send on a closed queue")
)
var DefaultCrawlConfig = CrawlConfig{
	CrawlDelay: DefaultCrawlDelay,
	HttpClient: http.DefaultClient,
	UserAgent:  DefaultUserAgent,
}

Functions

This section is empty.

Types

type BasicAuthProvider

type BasicAuthProvider interface {
	BasicAuth() (user string, pwd string)
}

The BasicAuthProvider interface gets the credentials to use to perform the request with Basic Authentication.

type Cmd

type Cmd struct {
	U *url.URL
	M string
}

The Cmd struct defines a basic Command implementation.

func (*Cmd) Method

func (c *Cmd) Method() string

Method returns the HTTP verb to use to process this command (e.g. "GET", "HEAD", etc.).

func (*Cmd) URL

func (c *Cmd) URL() *url.URL

URL returns the resource targeted by this command.

type CmdHandler

type CmdHandler interface {
	HandleCmd(Command, *http.Response, error)
}

The CmdHandler interface is used to process the HostFetcher's requests. It is similar to the net/http.Handler interface.

type CmdHandlerFunc

type CmdHandlerFunc func(Command, *http.Response, error)

A CmdHandlerFunc is a function signature that implements the CmdHandler interface. A function with this signature can thus be used as a CmdHandler.

func (CmdHandlerFunc) HandleCmd

func (h CmdHandlerFunc) HandleCmd(cmd Command, res *http.Response, err error)

HandleCmd is the CmdHandler interface implementation for the CmdHandlerFunc type.

type Command

type Command interface {
	URL() *url.URL
	Method() string
}

The Command interface defines the methods required by the Fetcher to request a resource.

type Context

type Context struct {
	Cmd Command
	Q   *Queue
}

Context is a Command's fetch context, passed to the Handler. It gives access to the original Command and the associated Queue.

type CookiesProvider

type CookiesProvider interface {
	Cookies() []*http.Cookie
}

The CookiesProvider interface gets the cookies to send with the request.

type CrawlConfig

type CrawlConfig struct {
	// Default delay to use between requests to a same host if there is no robots.txt
	// crawl delay.
	CrawlDelay time.Duration

	// The *http.Client to use for the requests. If nil, defaults to the net/http
	// package's default client.
	HttpClient *http.Client

	// The user-agent string to use for robots.txt validation and URL fetching.
	UserAgent string
}

type DebugInfo

type DebugInfo struct {
	NumHosts int
}

DebugInfo holds information to introspect the Fetcher's state.

type Fetcher

type Fetcher struct {
	CrawlConfig

	// The Handler to be called for each request. All successfully enqueued requests
	// produce a Handler call.
	Handler Handler

	// The time a host-dedicated worker goroutine can stay idle, with no Command to enqueue,
	// before it is stopped and cleared from memory.
	WorkerIdleTTL time.Duration
	// contains filtered or unexported fields
}

A Fetcher defines the parameters for running a web crawler.

func New

func New(h Handler) *Fetcher

New returns an initialized Fetcher.

func (*Fetcher) Debug

func (f *Fetcher) Debug() <-chan *DebugInfo

Debug returns the channel to use to receive the debugging information. It is not intended to be used by package users.

func (*Fetcher) Start

func (f *Fetcher) Start() *Queue

Start starts the Fetcher and returns the Queue to use to send Commands to be fetched.

type Handler

type Handler interface {
	Handle(*Context, *http.Response, error)
}

The Handler interface is used to process the Fetcher's requests. It is similar to the net/http.Handler interface.

type HandlerFunc

type HandlerFunc func(*Context, *http.Response, error)

A HandlerFunc is a function signature that implements the Handler interface. A function with this signature can thus be used as a Handler.

func (HandlerFunc) Handle

func (h HandlerFunc) Handle(ctx *Context, res *http.Response, err error)

Handle is the Handler interface implementation for the HandlerFunc type.

type HeaderProvider

type HeaderProvider interface {
	Header() http.Header
}

The HeaderProvider interface gets the headers to set on the request. If an Authorization header is set, it will be overridden by the BasicAuthProvider, if implemented.

type HostFetcher

type HostFetcher struct {
	*UnsafeHostFetcher
	// contains filtered or unexported fields
}

HostFetcher is an UnsafeHostFetcher that supports robots.txt.

func NewHostFetcher

func NewHostFetcher(c CrawlConfig, baseurl *url.URL, chand CmdHandler, cmd chan Command) *HostFetcher

NewHostFetcher creates a new HostFetcher.

func (*HostFetcher) Run

func (hf *HostFetcher) Run()

Run fetches robots.txt, then runs continuously until the command channel is closed, executing the commands sent to it.

type Mux

type Mux struct {
	DefaultHandler Handler
	// contains filtered or unexported fields
}

Mux is a simple multiplexer for the Handler interface, similar to net/http.ServeMux. It is itself a Handler, and dispatches the calls to the matching Handlers.

For error Handlers, if there is a Handler registered for the same error value, it will be called. Otherwise, if there is a Handler registered for any error, this Handler will be called.

For Response Handlers, a match with a path criterion has higher priority than other matches, and the Handler with the longest matching path gets called.

If multiple Response Handlers with the same path length (or no path criterion) match a response, the actual Handler called is undefined, but one and only one will be called.

In any case, if no Handler matches, the DefaultHandler is called, and it defaults to a no-op.
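The path precedence rule described above amounts to a longest-prefix selection with a fallback. The following sketch shows only that selection step; the route type and the pick function are hypothetical, and handlers are represented by plain names rather than real Handler values.

```go
package main

import (
	"fmt"
	"strings"
)

// route pairs a path prefix with the name of the handler registered for it.
type route struct {
	prefix  string
	handler string
}

// pick returns the handler with the longest prefix that matches path, or
// fallback when no route matches (the role played by DefaultHandler).
func pick(routes []route, path, fallback string) string {
	best, chosen := -1, fallback
	for _, r := range routes {
		if strings.HasPrefix(path, r.prefix) && len(r.prefix) > best {
			best, chosen = len(r.prefix), r.handler
		}
	}
	return chosen
}

func main() {
	routes := []route{{"/doc", "docs"}, {"/doc/api", "api"}}
	for _, p := range []string{"/doc/api/x", "/doc/faq", "/blog"} {
		fmt.Println(p, "->", pick(routes, p, "default"))
	}
}
```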

func NewMux

func NewMux() *Mux

NewMux returns an initialized Mux.

func (*Mux) Handle

func (mux *Mux) Handle(ctx *Context, res *http.Response, err error)

Handle is the Handler interface implementation for Mux. It dispatches the calls to the matching Handler.

func (*Mux) HandleError

func (mux *Mux) HandleError(err error, h Handler)

HandleError registers a Handler for a specific error value. Multiple calls with the same error value override previous calls. As a special case, a nil error value registers a Handler for any error that doesn't have a specific Handler.

func (*Mux) HandleErrors

func (mux *Mux) HandleErrors(h Handler)

HandleErrors registers a Handler for any error that doesn't have a specific Handler.

func (*Mux) Response

func (mux *Mux) Response() *ResponseMatcher

Response initializes an entry for a Response Handler based on various criteria. The Response Handler is not registered until the ResponseMatcher's Handler method is called.

type Queue

type Queue struct {
	// contains filtered or unexported fields
}

Queue offers methods to send Commands to the Fetcher, and to stop the crawling process. It is safe to use from concurrent goroutines.

func (*Queue) Block

func (q *Queue) Block()

Block blocks the current goroutine until the Queue is closed.

func (*Queue) Close

func (q *Queue) Close() error

Close closes the Queue so that no more Commands can be sent. It blocks until the Fetcher drains all pending commands. After the call, the Fetcher is stopped.

func (*Queue) Send

func (q *Queue) Send(c Command) error

Send enqueues a Command into the Fetcher. If the Queue has been closed, it returns ErrQueueClosed.

func (*Queue) SendString

func (q *Queue) SendString(method string, rawurl ...string) (int, error)

SendString enqueues a method and some URL strings into the Fetcher. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.

func (*Queue) SendStringGet

func (q *Queue) SendStringGet(rawurl ...string) (int, error)

SendStringGet enqueues the URL strings to be fetched with a GET method. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.

func (*Queue) SendStringHead

func (q *Queue) SendStringHead(rawurl ...string) (int, error)

SendStringHead enqueues the URL strings to be fetched with a HEAD method. It returns an error if the URL string cannot be parsed, or if the Queue has been closed. The first return value is the number of URLs successfully enqueued.

type RateThrottler

type RateThrottler struct {
	Rate    time.Duration
	LastAct time.Time
}

RateThrottler records when an action is done, and how frequently it is allowed.

func (*RateThrottler) ActMaybeWait

func (t *RateThrottler) ActMaybeWait()

ActMaybeWait sleeps until the next action is allowed, then records the time at which the action is performed.

type ReaderProvider

type ReaderProvider interface {
	Reader() io.Reader
}

The ReaderProvider interface gets the Reader to use as the Body of the request. It has higher priority than the ValuesProvider interface, so that if both interfaces are implemented, the ReaderProvider is used.

type ResponseMatcher

type ResponseMatcher struct {
	// contains filtered or unexported fields
}

A ResponseMatcher holds the criteria for a response Handler.

func (*ResponseMatcher) ContentType

func (r *ResponseMatcher) ContentType(ct string) *ResponseMatcher

ContentType sets a criterion based on the Content-Type header for the Response Handler. Its Handler will only be called if the response has this content type, ignoring any additional parameter on the header value (the part following the semicolon, e.g. "text/html; charset=utf-8").
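Ignoring the parameters of a Content-Type header can be done with the standard library's mime.ParseMediaType, as in this sketch. The sameMediaType helper is hypothetical, and the case-insensitive comparison is an assumption of the sketch rather than documented behaviour.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// sameMediaType reports whether the media type of a Content-Type header
// value equals want, ignoring parameters such as charset.
func sameMediaType(header, want string) bool {
	// ParseMediaType returns the media type in lower case, without parameters.
	mt, _, err := mime.ParseMediaType(header)
	if err != nil {
		return false
	}
	return mt == strings.ToLower(want)
}

func main() {
	fmt.Println(sameMediaType("text/html; charset=utf-8", "text/html"))
	fmt.Println(sameMediaType("application/json", "text/html"))
}
```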

func (*ResponseMatcher) Handler

func (r *ResponseMatcher) Handler(h Handler) *ResponseMatcher

Handler sets the Handler to be called when this Response Handler is the match for a given response. It registers the Response Handler in its parent Mux.

func (*ResponseMatcher) Host

func (r *ResponseMatcher) Host(host string) *ResponseMatcher

Host sets a criterion based on the host of the URL for the Response Handler. Its Handler will only be called if the host of the URL matches the specified host exactly.

func (*ResponseMatcher) Method

func (r *ResponseMatcher) Method(m string) *ResponseMatcher

Method sets a method criterion for the Response Handler. Its Handler will only be called if the request used this HTTP method (e.g. "GET", "HEAD", ...).

func (*ResponseMatcher) Path

func (r *ResponseMatcher) Path(p string) *ResponseMatcher

Path sets a criterion based on the path of the URL for the Response Handler. Its Handler will only be called if the path of the URL starts with this path. Longer matches have priority over shorter ones.

func (*ResponseMatcher) Status

func (r *ResponseMatcher) Status(code int) *ResponseMatcher

Status sets a criterion based on the status code of the response for the Response Handler. Its Handler will only be called if the response has this status code.

func (*ResponseMatcher) StatusRange

func (r *ResponseMatcher) StatusRange(min, max int) *ResponseMatcher

StatusRange sets a criterion based on the status code of the response for the Response Handler. Its Handler will only be called if the response has a status code between min and max. If min is greater than max, the values are switched.
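The swap of reversed bounds can be sketched as follows. The statusRange type and its functions are hypothetical, and treating both bounds as inclusive is an assumption of this sketch.

```go
package main

import "fmt"

// statusRange holds normalized bounds, with lo <= hi.
type statusRange struct{ lo, hi int }

// newStatusRange switches the bounds when they are given in reverse order.
func newStatusRange(lo, hi int) statusRange {
	if lo > hi {
		lo, hi = hi, lo
	}
	return statusRange{lo: lo, hi: hi}
}

// contains reports whether code falls within the range (bounds inclusive,
// which is an assumption of this sketch).
func (r statusRange) contains(code int) bool {
	return code >= r.lo && code <= r.hi
}

func main() {
	r := newStatusRange(299, 200) // reversed on purpose
	fmt.Println(r.contains(204), r.contains(404))
}
```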

type UnsafeHostFetcher

type UnsafeHostFetcher struct {
	BaseURL *url.URL
	CmdHandler
	CmdIn      chan Command
	HttpClient *http.Client
	UserAgent  string
	RateThrottler
}

UnsafeHostFetcher receives commands on CmdIn that all pertain to the host in BaseURL. It executes them using HttpClient, and submits errors and responses to CmdHandler. Use HostFetcher instead of UnsafeHostFetcher unless you know it is safe to disregard robots.txt.

func NewUnsafeHostFetcher

func NewUnsafeHostFetcher(c CrawlConfig, baseurl *url.URL, chand CmdHandler, cmd chan Command) *UnsafeHostFetcher

NewUnsafeHostFetcher creates a new UnsafeHostFetcher.

func (*UnsafeHostFetcher) DoCommand

func (uhf *UnsafeHostFetcher) DoCommand(cmd Command)

DoCommand executes a single command. It is normally only used by Run, but nothing prevents users of UnsafeHostFetcher from giving a nil CmdIn and invoking DoCommand themselves. DoCommand respects the throttling specified in the UnsafeHostFetcher's RateThrottler, which is initialized based on the CrawlDelay used in building it.

func (*UnsafeHostFetcher) Run

func (uhf *UnsafeHostFetcher) Run()

Run reads and executes commands from CmdIn until it is closed.

type ValuesProvider

type ValuesProvider interface {
	Values() url.Values
}

The ValuesProvider interface gets the values to send as the Body of the request. It has lower priority than the ReaderProvider interface, so that if both interfaces are implemented, the ReaderProvider is used. If the request has no explicit Content-Type set, it will be automatically set to "application/x-www-form-urlencoded".

Directories

Path Synopsis
example