Documentation ¶
Overview ¶
Package colly implements a HTTP scraping framework
Index ¶
- type Collector
- func (c *Collector) DisableCookies()
- func (c *Collector) Init()
- func (c *Collector) Limit(rule *LimitRule) error
- func (c *Collector) Limits(rules []*LimitRule) error
- func (c *Collector) OnError(f ErrorCallback)
- func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)
- func (c *Collector) OnRequest(f RequestCallback)
- func (c *Collector) OnResponse(f ResponseCallback)
- func (c *Collector) Post(URL string, requestData map[string]string) error
- func (c *Collector) SetRequestTimeout(timeout time.Duration)
- func (c *Collector) Visit(URL string) error
- func (c *Collector) Wait()
- func (c *Collector) WithTransport(transport *http.Transport)
- type Context
- type ErrorCallback
- type HTMLCallback
- type HTMLElement
- type LimitRule
- type Request
- type RequestCallback
- type Response
- type ResponseCallback
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Collector ¶
type Collector struct { // UserAgent is the User-Agent string used by HTTP requests UserAgent string // MaxDepth limits the recursion depth of visited URLs. // Set it to 0 for infinite recursion (default). MaxDepth int // AllowedDomains is a domain whitelist. // Leave it blank to allow any domains to be visited AllowedDomains []string // AllowURLRevisit allows multiple downloads of the same URL AllowURLRevisit bool // MaxBodySize limits the retrieved response body. `0` means unlimited. // The default value for MaxBodySize is 10240 (10MB) MaxBodySize int // contains filtered or unexported fields }
Collector provides the scraper instance for a scraping job
func NewCollector ¶
func NewCollector() *Collector
NewCollector creates a new Collector instance with default configuration
func (*Collector) DisableCookies ¶
func (c *Collector) DisableCookies()
DisableCookies turns off cookie handling for this collector
func (*Collector) Init ¶
func (c *Collector) Init()
Init initializes the Collector's private variables and sets default configuration for the Collector
func (*Collector) OnError ¶
func (c *Collector) OnError(f ErrorCallback)
OnError registers a function. Function will be executed on every error.
func (*Collector) OnHTML ¶
func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)
OnHTML registers a function. Function will be executed on every HTML element matched by the `goquerySelector` parameter. `goquerySelector` is a selector used by https://github.com/PuerkitoBio/goquery
func (*Collector) OnRequest ¶
func (c *Collector) OnRequest(f RequestCallback)
OnRequest registers a function. Function will be executed on every request made by the Collector
func (*Collector) OnResponse ¶
func (c *Collector) OnResponse(f ResponseCallback)
OnResponse registers a function. Function will be executed on every response
func (*Collector) Post ¶
Post starts collecting job by creating a POST request. Post also calls the previously provided OnRequest, OnResponse, OnHTML callbacks
func (*Collector) SetRequestTimeout ¶
SetRequestTimeout overrides the default timeout (10 seconds) for this collector
func (*Collector) Visit ¶
Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided OnRequest, OnResponse, OnHTML callbacks
func (*Collector) Wait ¶
func (c *Collector) Wait()
Wait returns when the collector jobs are finished
func (*Collector) WithTransport ¶
WithTransport allows you to set a custom http.Transport for this collector.
type Context ¶
type Context struct {
// contains filtered or unexported fields
}
Context provides a tiny layer for passing data between callbacks
type ErrorCallback ¶
ErrorCallback is a type alias for OnError callback functions
type HTMLCallback ¶
type HTMLCallback func(*HTMLElement)
HTMLCallback is a type alias for OnHTML callback functions
type HTMLElement ¶
type HTMLElement struct { // Name is the name of the tag Name string Text string // Request is the request object of the element's HTML document Request *Request // Response is the Response object of the element's HTML document Response *Response // DOM is the goquery parsed DOM object of the page. DOM is relative // to the current HTMLElement DOM *goquery.Selection // contains filtered or unexported fields }
HTMLElement is the representation of a HTML tag.
func (*HTMLElement) Attr ¶
func (h *HTMLElement) Attr(k string) string
Attr returns the selected attribute of a HTMLElement or empty string if no attribute found
type LimitRule ¶
type LimitRule struct { // DomainRegexp is a regular expression to match against domains DomainRegexp string // DomainRegexp is a glob pattern to match against domains DomainGlob string // Delay is the duration to wait before creating a new request to the matching domains Delay time.Duration // Parallelism is the number of the maximum allowed concurrent requests of the matching domains Parallelism int // contains filtered or unexported fields }
LimitRule provides connection restrictions for domains. There can be two kind of limitations:
- Parallelism: Set limit for the number of concurrent requests to a domain
- Delay: Set rate limit for a domain (this means no parallelism on the matching domains)
type Request ¶
type Request struct { // URL is the parsed URL of the HTTP request URL *url.URL // Headers contains the Request's HTTP headers Headers *http.Header // Ctx is a context between a Request and a Response Ctx *Context // Depth is the number of the parents of this request Depth int // contains filtered or unexported fields }
Request is the representation of a HTTP request made by a Collector
func (*Request) AbsoluteURL ¶
AbsoluteURL returns with the resolved absolute URL of an URL chunk. AbsoluteURL returns empty string if the URL chunk is a fragment or could not be parsed
type RequestCallback ¶
type RequestCallback func(*Request)
RequestCallback is a type alias for OnRequest callback functions
type Response ¶
type Response struct { // StatusCode is the status code of the Response StatusCode int // Body is the content of the Response Body []byte // Ctx is a context between a Request and a Response Ctx *Context // Request is the Request object of the response Request *Request // Headers contains the Response's HTTP headers Headers *http.Header }
Response is the representation of a HTTP response made by a Collector
type ResponseCallback ¶
type ResponseCallback func(*Response)
ResponseCallback is a type alias for OnResponse callback functions