Documentation
Constants
const (
	DefaultUserAgent   = "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"
	DefaultRobotsAgent = "Googlebot (crawlbot v1)"
	DefaultTimeToLive  = 3 * DefaultDelay
	DefaultDelay       = 3 * time.Second
)
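Since DefaultTimeToLive is defined in terms of DefaultDelay, it resolves to nine seconds. A small standalone sketch that checks this (the constants are copied here only for illustration):

package main

import (
	"fmt"
	"time"
)

// Copies of the package constants above, for illustration only.
const (
	DefaultDelay      = 3 * time.Second
	DefaultTimeToLive = 3 * DefaultDelay
)

func main() {
	fmt.Println(DefaultDelay)      // 3s
	fmt.Println(DefaultTimeToLive) // 9s
}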
Variables
This section is empty.
Functions
This section is empty.
Types
type Worker
type Worker struct {
	// GetFunc issues a GET request to the specified URL and returns the
	// response body and an error, if any.
	GetFunc func(*url.URL) (io.ReadCloser, error)

	// IsAcceptedFunc can be used to control the crawler.
	IsAcceptedFunc func(*url.URL) bool

	// ProcessFunc can be used to scrape data.
	ProcessFunc func(*url.URL, *html.Node, []byte)

	// Host defines the hostname to crawl. Worker is a single-host crawler.
	Host *url.URL

	// UserAgent defines the user-agent string to use for URL fetching.
	UserAgent string

	Accept []*regexp.Regexp
	Reject []*regexp.Regexp

	// Delay to use between requests to the same host if there is no
	// robots.txt crawl delay. The delay starts as soon as the response
	// is received from the host.
	Delay time.Duration

	// MaxEnqueue is the maximum number of pages visited before stopping
	// the crawl. Note that the Crawler will send its stop signal once
	// this number of visits is reached, but workers may be in the
	// process of visiting other pages, so when the crawling stops, the
	// number of pages visited will be at least MaxEnqueue, possibly more.
	MaxEnqueue int64

	Robots Robots

	Concurrent int
}
Worker represents a crawler worker implementation.
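A minimal configuration sketch follows. The package's import path and the method used to start a configured Worker are not shown in this section, so the crawler import alias below is a placeholder and the final step is left open; only the documented fields and constants are used. GetFunc here uses net/http directly so the request can carry the configured user-agent string.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"regexp"

	"golang.org/x/net/html"

	crawler "example.com/crawler" // placeholder: substitute this package's real import path
)

func main() {
	host, err := url.Parse("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}

	w := &crawler.Worker{
		// Fetch each page with net/http, sending the default user-agent.
		GetFunc: func(u *url.URL) (io.ReadCloser, error) {
			req, err := http.NewRequest("GET", u.String(), nil)
			if err != nil {
				return nil, err
			}
			req.Header.Set("User-Agent", crawler.DefaultUserAgent)
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return nil, err
			}
			return resp.Body, nil
		},
		// Stay on the configured host.
		IsAcceptedFunc: func(u *url.URL) bool {
			return u.Host == host.Host
		},
		// Scrape step: here we only report what was visited.
		ProcessFunc: func(u *url.URL, doc *html.Node, body []byte) {
			fmt.Printf("visited %s (%d bytes)\n", u, len(body))
		},
		Host:       host,
		UserAgent:  crawler.DefaultUserAgent,
		Reject:     []*regexp.Regexp{regexp.MustCompile(`\.(png|jpg|css|js)$`)},
		Delay:      crawler.DefaultDelay,
		MaxEnqueue: 100,
		Concurrent: 4,
	}

	_ = w // starting the crawl (e.g. a Run/Start method) is not covered by this section
}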