colly

package module

Version: v0.0.0-...-269842a | Published: Oct 5, 2017 | License: Apache-2.0 | Imports: 13 | Imported by: 0

Repository

github.com/gummiboll/colly

README ¶

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Documentation

Features

Clean API
Fast (>1k requests/sec on a single core)
Manages request delays and maximum concurrency per domain
Automatic cookie and session handling
Sync/async/parallel scraping

Example

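package main

import (
	"fmt"

	"github.com/gummiboll/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Println(link)
		// Resolve relative links against the current page before following them
		c.Visit(e.Request.AbsoluteURL(link))
	})

	c.Visit("https://en.wikipedia.org/")
}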

See examples folder for more detailed examples.

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode

Documentation ¶

Overview ¶

Package colly implements an HTTP scraping framework.

Index ¶

type Collector
- func NewCollector() *Collector
type Context
- func NewContext() *Context
- func (c *Context) Get(key string) string
- func (c *Context) Put(key, value string)
type HTMLCallback
type HTMLElement
- func (h *HTMLElement) Attr(k string) string
type LimitRule
- func (r *LimitRule) Init() error
- func (r *LimitRule) Match(domain string) bool
type Request
type RequestCallback
type Response
type ResponseCallback

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Collector ¶

type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// AllowURLRevisit allows multiple downloads of the same URL
	AllowURLRevisit bool
	// contains filtered or unexported fields
}

Collector provides the scraper instance for a scraping job
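
The interaction between MaxDepth and AllowURLRevisit can be illustrated without any network access. The following is a minimal sketch, not the library's implementation: `crawl` and the in-memory link graph are hypothetical, and it assumes the start page has depth 0 (matching the "number of parents" wording used for Request.Depth) and that a MaxDepth of 0 means no limit.

```go
package main

import "fmt"

// crawl walks an in-memory link graph breadth-first. It is a hypothetical
// stand-in for real HTTP fetching so the depth and revisit rules can be
// seen in isolation. Assumptions: the start page has depth 0, and
// maxDepth == 0 means "no limit".
// Note: allowRevisit with maxDepth == 0 never terminates on a cyclic graph.
func crawl(links map[string][]string, start string, maxDepth int, allowRevisit bool) []string {
	type item struct {
		url   string
		depth int
	}
	var visited []string
	seen := map[string]bool{}
	queue := []item{{start, 0}}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if maxDepth > 0 && cur.depth > maxDepth {
			continue // too deep: skip this page and everything below it
		}
		if !allowRevisit && seen[cur.url] {
			continue // already downloaded once
		}
		seen[cur.url] = true
		visited = append(visited, cur.url)
		for _, next := range links[cur.url] {
			queue = append(queue, item{next, cur.depth + 1})
		}
	}
	return visited
}

func main() {
	graph := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/c", "/"},
		"/c": {"/d"},
	}
	fmt.Println(crawl(graph, "/", 1, false)) // [/ /a /b]
	fmt.Println(crawl(graph, "/", 0, false)) // [/ /a /b /c /d]
}
```

With a depth limit of 1 only the start page and its direct links are visited; with no limit the whole graph is visited once, because the back-link to "/" is dropped by the revisit check.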

func NewCollector ¶

func NewCollector() *Collector

NewCollector creates a new Collector instance with default configuration

func (*Collector) DisableCookies ¶

func (c *Collector) DisableCookies()

DisableCookies turns off cookie handling for this collector

func (*Collector) Init ¶

func (c *Collector) Init()

Init initializes the Collector's private variables and sets default configuration for the Collector

func (*Collector) Limit ¶

func (c *Collector) Limit(rule *LimitRule) error

Limit adds a new `LimitRule` to the collector

func (*Collector) Limits ¶

func (c *Collector) Limits(rules []*LimitRule) error

Limits adds new `LimitRule`s to the collector

func (*Collector) OnHTML ¶

func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)

OnHTML registers a function that is executed on every HTML element matched by the `goquerySelector` parameter. `goquerySelector` is a selector as used by https://github.com/PuerkitoBio/goquery

func (*Collector) OnRequest ¶

func (c *Collector) OnRequest(f RequestCallback)

OnRequest registers a function that is executed on every request made by the Collector

func (*Collector) OnResponse ¶

func (c *Collector) OnResponse(f ResponseCallback)

OnResponse registers a function that is executed on every response

func (*Collector) Post ¶

func (c *Collector) Post(URL string, requestData map[string]string) error

Post starts a collecting job by creating a POST request. Post also calls the previously registered OnRequest, OnResponse and OnHTML callbacks
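
The documentation does not say how `requestData` is serialized. A common choice for a `map[string]string` payload is an `application/x-www-form-urlencoded` body; the sketch below shows that encoding with the standard library only. Treat the encoding as an assumption, and `encodeForm` as a hypothetical helper rather than part of this package.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeForm turns a flat string map into an
// application/x-www-form-urlencoded body. This is an assumed wire
// format; the package documentation does not specify one.
func encodeForm(data map[string]string) string {
	form := url.Values{}
	for k, v := range data {
		form.Set(k, v)
	}
	return form.Encode() // keys are emitted in sorted order
}

func main() {
	body := encodeForm(map[string]string{"q": "go lang", "page": "1"})
	fmt.Println(body) // page=1&q=go+lang
}
```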

func (*Collector) Visit ¶

func (c *Collector) Visit(URL string) error

Visit starts the Collector's collecting job by creating a request to the URL given as parameter. Visit also calls the previously registered OnRequest, OnResponse and OnHTML callbacks

func (*Collector) Wait ¶

func (c *Collector) Wait()

Wait returns when the collector jobs are finished

func (*Collector) WithTransport ¶

func (c *Collector) WithTransport(transport *http.Transport)

WithTransport allows you to set a custom http.Transport for this collector.

type Context ¶

type Context struct {
	// contains filtered or unexported fields
}

Context provides a tiny layer for passing data between callbacks

func NewContext ¶

func NewContext() *Context

NewContext initializes a new Context instance

func (*Context) Get ¶

func (c *Context) Get(key string) string

Get retrieves a value from Context. Get returns an empty string if no value is found for `key`

func (*Context) Put ¶

func (c *Context) Put(key, value string)

Put stores a value in Context
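
The Get/Put contract described above can be modelled with a small string map. The `store` type below is a hypothetical stand-in, not the package's Context; the mutex is an assumption added so the sketch stays safe if callbacks run concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for Context: a string-to-string map
// guarded by a mutex (the locking is an assumption of this sketch).
type store struct {
	mu   sync.RWMutex
	data map[string]string
}

func newStore() *store { return &store{data: map[string]string{}} }

// Put stores value under key, replacing any previous value.
func (s *store) Put(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// Get returns the value stored under key, or "" if the key is absent.
func (s *store) Get(key string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.data[key]
}

func main() {
	ctx := newStore()
	ctx.Put("referrer", "https://en.wikipedia.org/")
	fmt.Printf("%q %q\n", ctx.Get("referrer"), ctx.Get("missing"))
}
```

Because a missing key and an empty stored value both read back as an empty string, callers that need to tell them apart would have to track presence separately.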

type HTMLCallback ¶

type HTMLCallback func(*HTMLElement)

HTMLCallback is a type alias for OnHTML callback functions

type HTMLElement ¶

type HTMLElement struct {
	// Name is the name of the tag
	Name string
	Text string

	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the goquery parsed DOM object of the page. DOM is relative
	// to the current HTMLElement
	DOM *goquery.Selection
	// contains filtered or unexported fields
}

HTMLElement is the representation of an HTML tag.

func (*HTMLElement) Attr ¶

func (h *HTMLElement) Attr(k string) string

Attr returns the selected attribute of an HTMLElement, or an empty string if the attribute is not found

type LimitRule ¶

type LimitRule struct {
	// DomainRegexp is a regular expression to match against domains
	DomainRegexp string
	// DomainGlob is a glob pattern to match against domains
	DomainGlob string
	// Delay is the duration to wait before creating a new request to the matching domains
	Delay time.Duration
	// Parallelism is the number of the maximum allowed concurrent requests of the matching domains
	Parallelism int
	// contains filtered or unexported fields
}

LimitRule provides connection restrictions for domains. There are two kinds of limits:

Parallelism: Set limit for the number of concurrent requests to a domain
Delay: Set rate limit for a domain (this means no parallelism on the matching domains)
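
A parallelism limit of this kind is commonly enforced with a buffered channel used as a semaphore. The sketch below shows that general technique only; `runLimited` is a hypothetical helper, and nothing here is taken from the package's own implementation. The Delay variant is not covered.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runLimited runs every job, keeping at most parallelism of them in
// flight at once, and returns the highest concurrency it observed.
// parallelism must be at least 1.
func runLimited(jobs []func(), parallelism int) int32 {
	sem := make(chan struct{}, parallelism)
	var wg sync.WaitGroup
	var running, peak int32
	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while parallelism jobs are in flight
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot after the job ends
			n := atomic.AddInt32(&running, 1)
			for { // record a new peak if n exceeds the current one
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			job()
			atomic.AddInt32(&running, -1)
		}(job)
	}
	wg.Wait()
	return peak
}

func main() {
	done := make([]bool, 20)
	jobs := make([]func(), len(done))
	for i := range jobs {
		i := i
		jobs[i] = func() { done[i] = true } // each job writes its own slot
	}
	peak := runLimited(jobs, 3)
	fmt.Println("peak within limit:", peak >= 1 && peak <= 3)
}
```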

func (*LimitRule) Init ¶

func (r *LimitRule) Init() error

Init initializes the private members of LimitRule

func (*LimitRule) Match ¶

func (r *LimitRule) Match(domain string) bool

Match checks that the domain parameter triggers the rule
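
To make the two pattern fields concrete, here is a minimal sketch of domain matching against a regular expression or a glob. The `domainRule` type and its methods are hypothetical stand-ins for LimitRule, and the glob is evaluated with the standard library's `path.Match`, whose syntax may differ from whatever glob implementation the package actually uses.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// domainRule is a hypothetical stand-in for LimitRule that keeps only
// the two pattern fields.
type domainRule struct {
	DomainRegexp string
	DomainGlob   string
	re           *regexp.Regexp // compiled form of DomainRegexp
}

// init compiles DomainRegexp, if one is set.
func (r *domainRule) init() error {
	if r.DomainRegexp == "" {
		return nil
	}
	re, err := regexp.Compile(r.DomainRegexp)
	if err != nil {
		return err
	}
	r.re = re
	return nil
}

// match reports whether domain satisfies the regexp or the glob.
func (r *domainRule) match(domain string) bool {
	if r.re != nil && r.re.MatchString(domain) {
		return true
	}
	if r.DomainGlob != "" {
		ok, err := path.Match(r.DomainGlob, domain)
		return err == nil && ok
	}
	return false
}

func main() {
	glob := &domainRule{DomainGlob: "*.example.com"}
	re := &domainRule{DomainRegexp: `^(www\.)?example\.org$`}
	if err := re.init(); err != nil {
		panic(err)
	}
	fmt.Println(glob.match("api.example.com"), glob.match("example.com")) // true false
	fmt.Println(re.match("www.example.org"), re.match("example.org.evil.com")) // true false
}
```

Anchoring the regular expression with `^` and `$` matters here: without the anchors, `example.org.evil.com` would also match.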

type Request ¶

type Request struct {
	// URL is the parsed URL of the HTTP request
	URL *url.URL
	// Headers contains the Request's HTTP headers
	Headers *http.Header
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Depth is the number of parents of this request
	Depth int
	// contains filtered or unexported fields
}

Request is the representation of an HTTP request made by a Collector

func (*Request) AbsoluteURL ¶

func (r *Request) AbsoluteURL(u string) string

AbsoluteURL returns the resolved absolute URL of a URL chunk. AbsoluteURL returns an empty string if the URL chunk is a fragment or could not be parsed
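
The behaviour described here can be approximated with `net/url` alone. The function below is a sketch under that assumption, not the package's code: `absoluteURL` is a hypothetical free function that takes the page URL as a string instead of reading it from a Request.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// absoluteURL resolves ref against base. It returns "" when ref is a
// bare fragment or when either URL cannot be parsed.
func absoluteURL(base, ref string) string {
	if strings.HasPrefix(ref, "#") {
		return "" // fragments point into the current page
	}
	b, err := url.Parse(base)
	if err != nil {
		return ""
	}
	u, err := url.Parse(ref)
	if err != nil {
		return ""
	}
	return b.ResolveReference(u).String()
}

func main() {
	page := "https://en.wikipedia.org/wiki/Go"
	fmt.Println(absoluteURL(page, "/wiki/Main_Page")) // https://en.wikipedia.org/wiki/Main_Page
	fmt.Println(absoluteURL(page, "Rust"))            // https://en.wikipedia.org/wiki/Rust
	fmt.Printf("%q\n", absoluteURL(page, "#history")) // ""
}
```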

func (*Request) Post ¶

func (r *Request) Post(URL string, requestData map[string]string) error

Post continues a collector job by creating a POST request. Post also calls the previously registered OnRequest, OnResponse and OnHTML callbacks

func (*Request) Visit ¶

func (r *Request) Visit(URL string) error

Visit continues the Collector's collecting job by creating a request to the URL given as parameter. Visit also calls the previously registered OnRequest, OnResponse and OnHTML callbacks

type RequestCallback ¶

type RequestCallback func(*Request)

RequestCallback is a type alias for OnRequest callback functions

type Response ¶

type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
}

Response is the representation of an HTTP response received by a Collector

type ResponseCallback ¶

type ResponseCallback func(*Response)

ResponseCallback is a type alias for OnResponse callback functions

Directories ¶

Path
examples/basic
examples/coursera_courses
examples/max_depth
examples/parallel
examples/rate_limit
examples/request_context