Documentation ¶
Overview ¶
gocrawl is a polite, slim and concurrent web crawler written in Go.
Index ¶
- Constants
- Variables
- type CrawlError
- type CrawlErrorKind
- type Crawler
- type DefaultExtender
- func (this *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration
- func (this *DefaultExtender) Disallowed(ctx *URLContext)
- func (this *DefaultExtender) End(err error)
- func (this *DefaultExtender) Enqueued(ctx *URLContext)
- func (this *DefaultExtender) Error(err *CrawlError)
- func (this *DefaultExtender) Fetch(ctx *URLContext, userAgent string, headRequest bool) (*http.Response, error)
- func (this *DefaultExtender) FetchedRobots(ctx *URLContext, res *http.Response)
- func (this *DefaultExtender) Filter(ctx *URLContext, isVisited bool) bool
- func (this *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)
- func (this *DefaultExtender) RequestGet(ctx *URLContext, headRes *http.Response) bool
- func (this *DefaultExtender) RequestRobots(ctx *URLContext, robotAgent string) (data []byte, doRequest bool)
- func (this *DefaultExtender) Start(seeds interface{}) interface{}
- func (this *DefaultExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (harvested interface{}, findLinks bool)
- func (this *DefaultExtender) Visited(ctx *URLContext, harvested interface{})
- type DelayInfo
- type Extender
- type FetchInfo
- type LogFlags
- type Options
- type S
- type U
- type URLContext
Constants ¶
const (
    DefaultUserAgent          string                    = `Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2`
    DefaultRobotUserAgent     string                    = `Googlebot (gocrawl v0.4)`
    DefaultEnqueueChanBuffer  int                       = 100
    DefaultHostBufferFactor   int                       = 10
    DefaultCrawlDelay         time.Duration             = 5 * time.Second
    DefaultIdleTTL            time.Duration             = 10 * time.Second
    DefaultNormalizationFlags purell.NormalizationFlags = purell.FlagsAllGreedy
)
Default options
Variables ¶
var (
    // The error returned when a redirection is requested, so that the
    // worker knows that this is not an actual Fetch error, but a request to
    // enqueue the redirect-to URL.
    ErrEnqueueRedirect = errors.New("redirection not followed")

    // The error returned when the maximum number of visits, as specified by the
    // Options field MaxVisits, is reached.
    ErrMaxVisits = errors.New("the maximum number of visits is reached")

    ErrInterrupted = errors.New("interrupted")
)
var HttpClient = &http.Client{CheckRedirect: func(req *http.Request, via []*http.Request) error {
    if isRobotsURL(req.URL) {
        if len(via) >= 10 {
            return errors.New("stopped after 10 redirects")
        }
        return nil
    }
    return ErrEnqueueRedirect
}}
The default HTTP client used by DefaultExtender's fetch requests (this is thread-safe). The client's fields can be customized (e.g. for a different redirection strategy, a different Transport object, ...). Any such customization should be done prior to starting the crawler.
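For illustration, a minimal sketch of such a customization (the Timeout value and Transport settings are arbitrary; the CheckRedirect function installed by the package is left untouched so the redirect-enqueuing behaviour described below keeps working):

package main

import (
    "net/http"
    "time"

    "github.com/PuerkitoBio/gocrawl"
)

func init() {
    // Customize the shared client before any crawler is started.
    gocrawl.HttpClient.Timeout = 30 * time.Second
    gocrawl.HttpClient.Transport = &http.Transport{
        MaxIdleConnsPerHost: 4,
    }
}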
Functions ¶
This section is empty.
Types ¶
type CrawlError ¶
type CrawlError struct {
    Ctx  *URLContext
    Err  error
    Kind CrawlErrorKind
    // contains filtered or unexported fields
}
Crawl error information.
func (CrawlError) Error ¶
func (this CrawlError) Error() string
Implementation of the error interface.
type CrawlErrorKind ¶
type CrawlErrorKind uint8
Enum indicating the kind of crawling error.
const (
    CekFetch CrawlErrorKind = iota
    CekParseRobots
    CekHttpStatusCode
    CekReadBody
    CekParseBody
    CekParseURL
    CekProcessLinks
    CekParseRedirectURL
)
func (CrawlErrorKind) String ¶
func (this CrawlErrorKind) String() string
type Crawler ¶
type Crawler struct {
    Options *Options
    // contains filtered or unexported fields
}
The crawler itself, the master of the whole process.
func NewCrawler ¶
Crawler constructor with the specified extender object.
func NewCrawlerWithOptions ¶
Crawler constructor with a pre-initialized Options object.
func (*Crawler) Run ¶
Run starts the crawling process, based on the given seeds and the current Options settings. Execution stops either when MaxVisits is reached (if specified) or when no more URLs need visiting. If an error occurs, it is returned (if MaxVisits is reached, the error ErrMaxVisits is returned).
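As a minimal sketch (the seed URL and MaxVisits value are placeholders), a crawler can be built with the default extender and run like this:

package main

import (
    "log"

    "github.com/PuerkitoBio/gocrawl"
)

func main() {
    // Crawl with the default extender; stop after a handful of visits.
    c := gocrawl.NewCrawler(new(gocrawl.DefaultExtender))
    c.Options.MaxVisits = 10

    // Run blocks until MaxVisits is reached or no URLs are left to visit.
    if err := c.Run("https://example.com/"); err != nil && err != gocrawl.ErrMaxVisits {
        log.Fatal(err)
    }
}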
type DefaultExtender ¶
type DefaultExtender struct {
    EnqueueChan chan<- interface{}
}
Default working implementation of an extender.
func (*DefaultExtender) ComputeDelay ¶
func (this *DefaultExtender) ComputeDelay(host string, di *DelayInfo, lastFetch *FetchInfo) time.Duration
ComputeDelay returns the delay specified in the Crawler's Options, unless a crawl-delay is specified in the robots.txt file, which has precedence.
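A hypothetical override that keeps this behaviour but adds random jitter on top of whatever delay the embedded DefaultExtender computes (the 500ms bound is arbitrary):

package extenders

import (
    "math/rand"
    "time"

    "github.com/PuerkitoBio/gocrawl"
)

// jitterExtender is a hypothetical extender that reuses the default delay
// computation and adds up to 500ms of random jitter per request.
type jitterExtender struct {
    gocrawl.DefaultExtender
}

func (x *jitterExtender) ComputeDelay(host string, di *gocrawl.DelayInfo, last *gocrawl.FetchInfo) time.Duration {
    base := x.DefaultExtender.ComputeDelay(host, di, last)
    return base + time.Duration(rand.Int63n(int64(500*time.Millisecond)))
}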
func (*DefaultExtender) Disallowed ¶
func (this *DefaultExtender) Disallowed(ctx *URLContext)
Disallowed is a no-op.
func (*DefaultExtender) Enqueued ¶
func (this *DefaultExtender) Enqueued(ctx *URLContext)
Enqueued is a no-op.
func (*DefaultExtender) Error ¶
func (this *DefaultExtender) Error(err *CrawlError)
Error is a no-op (logging is done automatically, regardless of the implementation of the Error() hook).
func (*DefaultExtender) Fetch ¶
func (this *DefaultExtender) Fetch(ctx *URLContext, userAgent string, headRequest bool) (*http.Response, error)
Fetch requests the specified URL using the given user agent string. It uses a custom http Client instance that doesn't follow redirections. Instead, the redirected-to URL is enqueued so that it goes through the same Filter() and Fetch() process as any other URL.
Two options were considered for the default Fetch() implementation:

1. Not following any redirections, enqueuing the redirect-to URL, and failing the current call with the 3xx status code.
2. Following all redirections, enqueuing only the last one (where redirection stops), and returning the response of the next-to-last request.
Ultimately, option 1 was implemented, as it is the most generic solution that makes sense as a default for the library. It involves no "magic" and gives full control over what happens, with the disadvantage that Filter() must be aware of all possible intermediary URLs before the final destination of a redirection is reached (i.e. if A redirects to B, which redirects to C, Filter has to allow A, B and C to be fetched, whereas solution 2 would only have required Filter to allow A and C).
Solution 2 also has the disadvantage of fetching the final URL twice (once while processing the original URL, to learn that there is no further redirection status code, and again when the actual destination URL is fetched to be visited).
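For example, under option 1 a host allow-list in Filter has to cover every hop of a redirect chain, not only the final destination. A hypothetical sketch (host names are placeholders):

package extenders

import (
    "github.com/PuerkitoBio/gocrawl"
)

// redirectAwareExtender is a hypothetical extender whose Filter accepts every
// not-yet-visited URL on an allow-listed host. Because redirects are enqueued
// rather than followed, intermediary hops (B in A -> B -> C) must also pass
// this check or the chain is cut short.
type redirectAwareExtender struct {
    gocrawl.DefaultExtender
}

var allowedHosts = map[string]bool{
    "example.com":     true, // placeholder hosts
    "www.example.com": true,
}

func (x *redirectAwareExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
    return !isVisited && allowedHosts[ctx.URL().Host]
}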
func (*DefaultExtender) FetchedRobots ¶
func (this *DefaultExtender) FetchedRobots(ctx *URLContext, res *http.Response)
FetchedRobots is a no-op.
func (*DefaultExtender) Filter ¶
func (this *DefaultExtender) Filter(ctx *URLContext, isVisited bool) bool
Enqueue the URL if it hasn't been visited yet.
func (*DefaultExtender) Log ¶
func (this *DefaultExtender) Log(logFlags LogFlags, msgLevel LogFlags, msg string)
Log prints to the standard error by default, based on the requested log verbosity.
func (*DefaultExtender) RequestGet ¶
func (this *DefaultExtender) RequestGet(ctx *URLContext, headRes *http.Response) bool
Ask the worker to actually request the URL's body (issue a GET), unless the HEAD response's status code is outside the 2xx range.
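When Options.HeadBeforeGet is enabled, this hook can be overridden to inspect the HEAD response before committing to a GET. A hypothetical sketch that only downloads HTML bodies:

package extenders

import (
    "net/http"
    "strings"

    "github.com/PuerkitoBio/gocrawl"
)

// htmlOnlyExtender is a hypothetical extender meant to be used with
// Options.HeadBeforeGet = true: the GET is only issued for 2xx HTML responses.
type htmlOnlyExtender struct {
    gocrawl.DefaultExtender
}

func (x *htmlOnlyExtender) RequestGet(ctx *gocrawl.URLContext, headRes *http.Response) bool {
    ct := headRes.Header.Get("Content-Type")
    return headRes.StatusCode >= 200 && headRes.StatusCode < 300 &&
        strings.HasPrefix(ct, "text/html")
}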
func (*DefaultExtender) RequestRobots ¶
func (this *DefaultExtender) RequestRobots(ctx *URLContext, robotAgent string) (data []byte, doRequest bool)
Ask the worker to actually request (fetch) the Robots.txt document.
func (*DefaultExtender) Start ¶
func (this *DefaultExtender) Start(seeds interface{}) interface{}
Return the same seeds as those received (those that were passed to Run() initially).
func (*DefaultExtender) Visit ¶
func (this *DefaultExtender) Visit(ctx *URLContext, res *http.Response, doc *goquery.Document) (harvested interface{}, findLinks bool)
Ask the worker to harvest the links in this page.
func (*DefaultExtender) Visited ¶
func (this *DefaultExtender) Visited(ctx *URLContext, harvested interface{})
Visited is a no-op.
type DelayInfo ¶
Delay information: the Options delay, the Robots.txt delay, and the last delay used.
type Extender ¶
type Extender interface {
    // Start, End, Error and Log are not related to a specific URL, so they don't
    // receive a URLContext struct.
    Start(interface{}) interface{}
    End(error)
    Error(*CrawlError)
    Log(LogFlags, LogFlags, string)

    // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
    // is related to a URLContext (holds a ctx field).
    ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

    // All other extender methods are executed in the context of an URL, and thus
    // receive an URLContext struct as first argument.
    Fetch(*URLContext, string, bool) (*http.Response, error)
    RequestGet(*URLContext, *http.Response) bool
    RequestRobots(*URLContext, string) ([]byte, bool)
    FetchedRobots(*URLContext, *http.Response)
    Filter(*URLContext, bool) bool
    Enqueued(*URLContext)
    Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
    Visited(*URLContext, interface{})
    Disallowed(*URLContext)
}
Extension methods required to provide an extender instance.
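Implementations rarely need to provide all of these methods from scratch: embedding DefaultExtender satisfies the interface, and only the hooks of interest are overridden. A sketch (the title extraction is illustrative):

package extenders

import (
    "fmt"
    "net/http"

    "github.com/PuerkitoBio/gocrawl"
    "github.com/PuerkitoBio/goquery"
)

// titleExtender is a hypothetical extender: DefaultExtender provides every hook,
// and only Visit is overridden to print the <title> of each visited page.
type titleExtender struct {
    gocrawl.DefaultExtender
}

func (x *titleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
    if doc != nil {
        fmt.Printf("%s -> %q\n", ctx.URL(), doc.Find("title").Text())
    }
    // Return true so gocrawl harvests the page's links itself.
    return nil, true
}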
type FetchInfo ¶
type FetchInfo struct {
    Ctx           *URLContext
    Duration      time.Duration
    StatusCode    int
    IsHeadRequest bool
}
Fetch information: the duration of the fetch, the returned status code, whether or not it was a HEAD request, and whether or not it was a robots.txt request.
type LogFlags ¶
type LogFlags uint
const (
    LogError LogFlags = 1 << iota
    LogInfo
    LogEnqueued
    LogIgnored
    LogTrace
    LogNone LogFlags = 0
    LogAll  LogFlags = LogError | LogInfo | LogEnqueued | LogIgnored | LogTrace
)
Log levels for the library's logger.
type Options ¶
type Options struct {
    UserAgent             string
    RobotUserAgent        string
    MaxVisits             int
    EnqueueChanBuffer     int
    HostBufferFactor      int
    CrawlDelay            time.Duration // Applied per host
    WorkerIdleTTL         time.Duration
    SameHostOnly          bool
    HeadBeforeGet         bool
    URLNormalizationFlags purell.NormalizationFlags
    LogFlags              LogFlags
    Extender              Extender
}
The Options available to control and customize the crawling process.
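A sketch of a fully configured run, assuming (as in the library's own examples) that NewOptions takes the Extender to use; every value below is arbitrary and the user-agent strings are placeholders:

package main

import (
    "time"

    "github.com/PuerkitoBio/gocrawl"
)

func main() {
    opts := gocrawl.NewOptions(new(gocrawl.DefaultExtender))
    opts.RobotUserAgent = "MyBot"                          // token matched against robots.txt rules
    opts.UserAgent = "Mozilla/5.0 (compatible; MyBot/1.0)" // sent with every request
    opts.CrawlDelay = 2 * time.Second
    opts.SameHostOnly = true
    opts.MaxVisits = 100
    opts.LogFlags = gocrawl.LogError | gocrawl.LogInfo

    c := gocrawl.NewCrawlerWithOptions(opts)
    _ = c.Run("https://example.com/")
}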
func NewOptions ¶
type URLContext ¶
type URLContext struct {
    HeadBeforeGet bool
    State         interface{}
    // contains filtered or unexported fields
}
func (*URLContext) IsRobotsURL ¶
func (this *URLContext) IsRobotsURL() bool
func (*URLContext) NormalizedSourceURL ¶
func (this *URLContext) NormalizedSourceURL() *url.URL
func (*URLContext) NormalizedURL ¶
func (this *URLContext) NormalizedURL() *url.URL
func (*URLContext) SourceURL ¶
func (this *URLContext) SourceURL() *url.URL
func (*URLContext) URL ¶
func (this *URLContext) URL() *url.URL
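A hypothetical extender that uses these accessors to log, for each enqueued URL, where it was found (seed URLs have no source URL, hence the nil check):

package extenders

import (
    "log"

    "github.com/PuerkitoBio/gocrawl"
)

// traceExtender is a hypothetical extender that logs every enqueued URL along
// with the page it was harvested from.
type traceExtender struct {
    gocrawl.DefaultExtender
}

func (x *traceExtender) Enqueued(ctx *gocrawl.URLContext) {
    src := "<seed>"
    if ctx.SourceURL() != nil {
        src = ctx.SourceURL().String()
    }
    log.Printf("enqueued %s (found on %s)", ctx.NormalizedURL(), src)
}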