Documentation ¶
Index ¶
- Constants
- Variables
- func Must(err error)
- func MustParseURLs(urls []string) []*url.URL
- func NewHTTPClient() *http.Client
- func NewHTTPClientWithOverrides(dnsMap map[string]string, localAddr *net.IPAddr) *http.Client
- type Crawler
- type Fetcher
- type FetcherFunc
- type Handler
- type HandlerFunc
- type Outlink
- type Publisher
- type Scope
- func AND(elems ...Scope) Scope
- func NewDepthScope(maxDepth int) Scope
- func NewIncludeRelatedScope() Scope
- func NewRegexpIgnoreScope(ignores []*regexp.Regexp) Scope
- func NewSchemeScope(schemes []string) Scope
- func NewSeedScope(seeds []*url.URL) Scope
- func NewURLPrefixScope(prefixes URLPrefixMap) Scope
- func OR(elems ...Scope) Scope
- type URLPrefixMap
Constants ¶
const (
	// TagPrimary is a primary reference (another web page).
	TagPrimary = iota

	// TagRelated is a secondary resource, related to a page.
	TagRelated
)
Variables ¶
DefaultClient points at a shared http.Client suitable for crawling: it does not follow redirects, accepts invalid TLS certificates, and sets a reasonable timeout on requests.
ErrRetryRequest is returned by a Handler when the request should be retried after some time.
Functions ¶
func Must ¶
func Must(err error)
Must will abort the program with a message when we encounter an error that we can't recover from.
func MustParseURLs ¶
func MustParseURLs(urls []string) []*url.URL
MustParseURLs parses a list of URLs and aborts on failure.
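Both helpers are convenient when turning command-line arguments into crawl seeds, where a malformed URL should stop the program right away. A minimal sketch (the import path below is a placeholder; substitute wherever this package actually lives):

	package main

	import (
		"fmt"

		crawl "example.com/crawl" // placeholder import path for this package
	)

	func main() {
		// MustParseURLs aborts the program if any of the seeds fail to parse.
		seeds := crawl.MustParseURLs([]string{
			"https://example.com/",
			"https://example.org/docs/",
		})
		fmt.Println(len(seeds), "seed URLs")
	}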
func NewHTTPClient ¶
func NewHTTPClient() *http.Client
NewHTTPClient returns an http.Client suitable for crawling.
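The returned value is an ordinary *http.Client, so a one-off request is enough to exercise it (this snippet omits the package/import boilerplate shown in the Must example; the redirect remark is an inference from the DefaultClient description above):

	client := crawl.NewHTTPClient()
	resp, err := client.Get("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	// A client tuned for crawling may hand a 3xx response straight back to
	// the caller instead of following the redirect.
	fmt.Println(resp.Status)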
Types ¶
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
The Crawler object contains the crawler state.
func NewCrawler ¶
NewCrawler creates a new Crawler object with the specified behavior.
func (*Crawler) Close ¶
func (c *Crawler) Close()
Close the database and release resources associated with the crawler state.
func (*Crawler) Enqueue ¶
Enqueue a (possibly new) URL for processing.
func (*Crawler) Run ¶
Run the crawl with the specified number of workers. This function does not exit until all work is done (no URLs left in the queue).
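A rough end-to-end sketch follows. NewCrawler's parameters and Run's argument are not documented on this page, so the calls below are only assumptions about the API's shape (a state database path, seeds, a Scope, a Fetcher and a Handler; a worker count for Run); fetcher and handler stand for values built as shown in the Fetcher and Handler sections below.

	seeds := crawl.MustParseURLs([]string{"https://example.com/"})
	scope := crawl.NewSeedScope(seeds)

	// Assumed signature: NewCrawler(path, seeds, scope, fetcher, handler).
	c, err := crawl.NewCrawler("crawl.db", seeds, scope, fetcher, handler)
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Assumed: Run takes the number of concurrent workers.
	c.Run(4)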
type Fetcher ¶
type Fetcher interface {
	// Fetch retrieves a URL and returns the response.
	Fetch(string) (*http.Response, error)
}
A Fetcher retrieves contents from remote URLs.
type FetcherFunc ¶
FetcherFunc wraps a simple function into the Fetcher interface.
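The underlying function type is not shown on this page, but presumably it mirrors Fetch. Under that assumption, an inline fetcher that sets a custom User-Agent could look like this:

	// Assumed: type FetcherFunc func(string) (*http.Response, error)
	var f crawl.Fetcher = crawl.FetcherFunc(func(u string) (*http.Response, error) {
		req, err := http.NewRequest("GET", u, nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("User-Agent", "my-crawler/0.1")
		// DefaultClient is the shared crawling client described under Variables.
		return crawl.DefaultClient.Do(req)
	})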
type Handler ¶
type Handler interface {
	// Handle the response from a URL.
	Handle(Publisher, string, int, int, *http.Response, *os.File, error) error
}
A Handler processes crawled contents. Any errors returned by public implementations of this interface are considered fatal and will cause the crawl to abort. The URL will be removed from the queue unless the handler returns the special error ErrRetryRequest.
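Implementing the interface directly only requires the Handle method above. The meaning of the two int arguments is not spelled out on this page (they are likely the link depth and tag), so this sketch ignores them and just illustrates the queue semantics around ErrRetryRequest:

	// retryOn503 drops URLs on fetch errors and asks for a retry when throttled.
	type retryOn503 struct{}

	func (retryOn503) Handle(p crawl.Publisher, u string, depth, tag int, resp *http.Response, body *os.File, err error) error {
		if err != nil {
			// Returning nil removes the URL from the queue without aborting the crawl.
			return nil
		}
		if resp.StatusCode == http.StatusServiceUnavailable {
			// Keep the URL queued and retry it after some time.
			return crawl.ErrRetryRequest
		}
		// ... inspect resp and body here ...
		return nil
	}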
func FilterErrors ¶
FilterErrors returns a Handler that forwards only requests with a "successful" HTTP status code (anything < 400). When using this wrapper, subsequent Handle calls will always have err set to nil.
func FollowRedirects ¶
FollowRedirects returns a Handler that follows HTTP redirects and adds them to the queue for crawling. It will call the wrapped handler on all requests regardless.
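Both wrappers presumably take the Handler to wrap and return a new Handler; under that assumption, a typical chain (with myHandler standing in for your own implementation) would be:

	// Assumed shape: FilterErrors(Handler) Handler and FollowRedirects(Handler) Handler.
	h := crawl.FollowRedirects(crawl.FilterErrors(myHandler))
	// Redirects get queued for crawling, and myHandler only sees responses with
	// a status code below 400 (with err always nil, per FilterErrors above).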
type HandlerFunc ¶
HandlerFunc wraps a function into the Handler interface.
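If the underlying type matches Handle's parameter list, a quick logging handler can be written inline (the signature below is an assumption):

	// Assumed: type HandlerFunc func(Publisher, string, int, int, *http.Response, *os.File, error) error
	logHandler := crawl.HandlerFunc(func(p crawl.Publisher, u string, depth, tag int, resp *http.Response, body *os.File, err error) error {
		if err == nil {
			log.Printf("%s -> %s", u, resp.Status)
		}
		return nil
	})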
type Publisher ¶
Publisher is an interface to anything with an Enqueue() method that can add new potential URLs to the crawl queue.
type Scope ¶
type Scope interface {
	// Check a URL to see if it's in scope for crawling.
	Check(Outlink, int) bool
}
Scope defines the crawling scope.
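Any type with that single Check method is a Scope. As an illustration only (NewDepthScope below already covers this case), a hand-rolled depth cap might look like:

	// shallowScope allows links up to a fixed depth; the second Check argument
	// is assumed to be the current link depth.
	type shallowScope struct{ max int }

	func (s shallowScope) Check(link crawl.Outlink, depth int) bool {
		return depth <= s.max
	}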
func NewDepthScope ¶
func NewDepthScope(maxDepth int) Scope
NewDepthScope returns a Scope that will limit crawls to a maximum link depth with respect to the crawl seeds.
func NewIncludeRelatedScope ¶
func NewIncludeRelatedScope() Scope
NewIncludeRelatedScope always includes resources with TagRelated.
func NewRegexpIgnoreScope ¶
func NewRegexpIgnoreScope(ignores []*regexp.Regexp) Scope
NewRegexpIgnoreScope returns a Scope that filters out URLs according to a list of regular expressions.
func NewSchemeScope ¶
func NewSchemeScope(schemes []string) Scope
NewSchemeScope limits the crawl to the specified URL schemes.
func NewSeedScope ¶
func NewSeedScope(seeds []*url.URL) Scope
NewSeedScope returns a Scope that will only allow crawling the seed prefixes.
func NewURLPrefixScope ¶
func NewURLPrefixScope(prefixes URLPrefixMap) Scope
NewURLPrefixScope returns a Scope that limits the crawl to a set of allowed URL prefixes.
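The scope constructors compose with the AND and OR combinators listed in the index. For example, a scope that sticks to the seed sites, additionally allows a set of URL prefixes, skips large binary downloads, and still fetches related page resources might be built like this (the URLPrefixMap key format is an assumption, see below):

	seeds := crawl.MustParseURLs([]string{"https://example.com/"})
	prefixes := crawl.URLPrefixMap{
		// Key format assumed: host + path prefix, no scheme.
		"mirror.example.org/pub/": {},
	}
	ignores := []*regexp.Regexp{
		regexp.MustCompile(`\.(iso|zip|tar\.gz)$`),
	}

	scope := crawl.AND(
		crawl.NewSchemeScope([]string{"http", "https"}),
		crawl.NewRegexpIgnoreScope(ignores),
		crawl.OR(
			crawl.NewSeedScope(seeds),
			crawl.NewURLPrefixScope(prefixes),
			crawl.NewIncludeRelatedScope(), // also fetch TagRelated resources
		),
	)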
type URLPrefixMap ¶
type URLPrefixMap map[string]struct{}
A URLPrefixMap makes it easy to check for URL prefixes (even for very large lists). The URL scheme is ignored, as is a leading "www." prefix if present.
func (URLPrefixMap) Contains ¶
func (m URLPrefixMap) Contains(uri *url.URL) bool
Contains returns true if the given URL matches the prefix map.
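The exact string format of the map keys is not documented here; the sketch below assumes host plus path prefix, relying on Contains to ignore the scheme and the "www." prefix as described above:

	m := crawl.URLPrefixMap{
		"example.com/docs/": {}, // key format assumed: host + path prefix
	}
	u, err := url.Parse("https://www.example.com/docs/page.html")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(m.Contains(u)) // expected true: scheme and "www." are ignored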