colly

package module
v2.0.0-...-bd4983f
Warning

This package is not in the latest version of its module.

Published: Feb 12, 2023 License: Apache-2.0 Imports: 43 Imported by: 0

README

Colly

Lightning Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Features

  • Clean API
  • Fast (>1k requests/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions

Example

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log every request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}

See examples folder for more detailed examples.

Installation

Add colly to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/gocolly/colly/v2 latest
)

Bugs

Bugs or suggestions? Visit the issue tracker or join #colly on freenode

Other Projects Using Colly

Below is a list of public, open source projects that use Colly:

If you are using Colly in a project please send a pull request to add it to the list.

Contributors

This project exists thanks to all the people who contribute. [Contribute].

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

Documentation

Overview

Package colly implements an HTTP scraping framework.

Constants

const ProxyURLKey key = iota

ProxyURLKey is the context key for the request proxy address.

Variables

var (
	// ErrForbiddenDomain is the error thrown if visiting
	// a domain which is not allowed in AllowedDomains
	ErrForbiddenDomain = errors.New("Forbidden domain")
	// ErrMissingURL is the error type for missing URL errors
	ErrMissingURL = errors.New("Missing URL")
	// ErrMaxDepth is the error type for exceeding max depth
	ErrMaxDepth = errors.New("Max depth limit reached")
	// ErrForbiddenURL is the error thrown if visiting
	// a URL which is not allowed by URLFilters
	ErrForbiddenURL = errors.New("ForbiddenURL")

	// ErrNoURLFiltersMatch is the error thrown if visiting
	// a URL which is not allowed by URLFilters
	ErrNoURLFiltersMatch = errors.New("No URLFilters match")
	// ErrRobotsTxtBlocked is the error type for robots.txt errors
	ErrRobotsTxtBlocked = errors.New("URL blocked by robots.txt")
	// ErrNoCookieJar is the error type for missing cookie jar
	ErrNoCookieJar = errors.New("Cookie jar is not available")
	// ErrNoPattern is the error type for LimitRules without patterns
	ErrNoPattern = errors.New("No pattern defined in LimitRule")
	// ErrEmptyProxyURL is the error type for empty Proxy URL list
	ErrEmptyProxyURL = errors.New("Proxy URL list is empty")
	// ErrAbortedAfterHeaders is the error returned when OnResponseHeaders aborts the transfer.
	ErrAbortedAfterHeaders = errors.New("Aborted after receiving response headers")
	// ErrQueueFull is the error returned when the queue is full
	ErrQueueFull = errors.New("Queue MaxSize reached")
	// ErrMaxRequests is the error returned when exceeding max requests
	ErrMaxRequests = errors.New("Max Requests limit reached")
)

Functions

func SanitizeFileName

func SanitizeFileName(fileName string) string

SanitizeFileName replaces dangerous characters in a string so the return value can be used as a safe file name.

func UnmarshalHTML

func UnmarshalHTML(v interface{}, s *goquery.Selection, structMap map[string]string) error

UnmarshalHTML declaratively extracts text or attributes to a struct from HTML response using struct tags composed of css selectors. Allowed struct tags:

  • "selector" (required): CSS (goquery) selector of the desired data
  • "attr" (optional): Selects the matching element's attribute's value. Leave it blank or omit to get the text of the element.

Example struct declaration:

type Nested struct {
	String  string   `selector:"div > p"`
	Classes []string `selector:"li" attr:"class"`
	Struct  *Nested  `selector:"div > div"`
}

Supported types: struct, *struct, string, []string
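The following is a minimal, stdlib-only sketch of how "selector" and "attr" struct tags like the ones above can be read with the reflect package. It is not colly's implementation and performs no HTML extraction; fieldRule and tagRules are hypothetical names introduced only for this illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// Nested mirrors the example struct declaration above.
type Nested struct {
	String  string   `selector:"div > p"`
	Classes []string `selector:"li" attr:"class"`
	Struct  *Nested  `selector:"div > div"`
}

// fieldRule is a hypothetical helper type holding the tags of one field.
type fieldRule struct {
	Field    string
	Selector string
	Attr     string // empty means "use the element's text"
}

// tagRules lists the selector/attr tags of every tagged field of v.
// v must be a struct or a pointer to a struct; anything else panics.
func tagRules(v interface{}) []fieldRule {
	t := reflect.TypeOf(v)
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	rules := make([]fieldRule, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		sel, ok := f.Tag.Lookup("selector")
		if !ok {
			continue // fields without a selector tag are skipped in this sketch
		}
		rules = append(rules, fieldRule{
			Field:    f.Name,
			Selector: sel,
			Attr:     f.Tag.Get("attr"),
		})
	}
	return rules
}

func main() {
	for _, r := range tagRules(Nested{}) {
		fmt.Printf("%s: selector=%q attr=%q\n", r.Field, r.Selector, r.Attr)
	}
}
```

An empty Attr in this sketch corresponds to the documented case where the tag is omitted and the element's text is used.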

Types

type AlreadyVisitedError

type AlreadyVisitedError struct {
	// Destination is the URL that was attempted to be visited.
	// It might not match the URL passed to Visit if redirect
	// was followed.
	Destination *url.URL
}

AlreadyVisitedError is the error type for already visited URLs.

It's returned synchronously by Visit when the URL passed to Visit is already visited.

When an already visited URL is encountered after following redirects, this error is passed to the OnError callback and, if Async mode is not enabled, is also returned by Visit.

func (*AlreadyVisitedError) Error

func (e *AlreadyVisitedError) Error() string

Error implements error interface.

type Collector

type Collector struct {
	// UserAgent is the User-Agent string used by HTTP requests
	UserAgent string
	// Custom headers for the request
	Headers *http.Header
	// MaxDepth limits the recursion depth of visited URLs.
	// Set it to 0 for infinite recursion (default).
	MaxDepth int
	// AllowedDomains is a domain whitelist.
	// Leave it blank to allow any domains to be visited
	AllowedDomains []string
	// DisallowedDomains is a domain blacklist.
	DisallowedDomains []string
	// DisallowedURLFilters is a list of regular expressions which restricts
	// visiting URLs. If any of the rules matches to a URL the
	// request will be stopped. DisallowedURLFilters will
	// be evaluated before URLFilters
	// Leave it blank to allow any URLs to be visited
	DisallowedURLFilters []*regexp.Regexp

	// URLFilters is a list of regular expressions which restricts visiting URLs.
	// Leave it blank to allow any URLs to be visited
	URLFilters []*regexp.Regexp

	// AllowURLRevisit allows multiple downloads of the same URL
	AllowURLRevisit bool
	// MaxBodySize is the limit of the retrieved response body in bytes.
	// 0 means unlimited.
	// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
	MaxBodySize int
	// CacheDir specifies a location where GET requests are cached as files.
	// When it's not defined, caching is disabled.
	CacheDir string
	// IgnoreRobotsTxt allows the Collector to ignore any restrictions set by
	// the target host's robots.txt file.  See http://www.robotstxt.org/ for more
	// information.
	IgnoreRobotsTxt bool
	// Async turns on asynchronous network communication. Use Collector.Wait() to
	// be sure all requests have been finished.
	Async bool
	// ParseHTTPErrorResponse allows parsing HTTP responses with non 2xx status codes.
	// By default, Colly parses only successful HTTP responses. Set ParseHTTPErrorResponse
	// to true to enable it.
	ParseHTTPErrorResponse bool
	// ID is the unique identifier of a collector
	ID uint32
	// DetectCharset can enable character encoding detection for non-utf8 response bodies
	// without explicit charset declaration. This feature uses https://github.com/saintfish/chardet
	DetectCharset bool

	// CheckHead performs a HEAD request before every GET to pre-validate the response
	CheckHead bool
	// TraceHTTP enables capturing and reporting request performance for crawler tuning.
	// When set to true, the Response.Trace will be filled in with an HTTPTrace object.
	TraceHTTP bool
	// Context is the context that will be used for HTTP requests. You can set this
	// to support clean cancellation of scraping.
	Context context.Context
	// MaxRequests limits the number of requests made by the instance.
	// Set it to 0 for infinite requests (default).
	MaxRequests uint32
	// contains filtered or unexported fields
}

Collector provides the scraper instance for a scraping job

func NewCollector

func NewCollector(options ...CollectorOption) *Collector

NewCollector creates a new Collector instance with default configuration

func (*Collector) Appengine

func (c *Collector) Appengine(ctx context.Context)

Appengine replaces the Collector's backend http.Client with an http.Client provided by appengine/urlfetch. Use this function when the scraper runs on Google App Engine. Example:

func startScraper(w http.ResponseWriter, r *http.Request) {
  ctx := appengine.NewContext(r)
  c := colly.NewCollector()
  c.Appengine(ctx)
   ...
  c.Visit("https://google.ca")
}

func (*Collector) Clone

func (c *Collector) Clone() *Collector

Clone creates an exact copy of a Collector without callbacks. HTTP backend, robots.txt cache and cookie jar are shared between collectors.

func (*Collector) Cookies

func (c *Collector) Cookies(URL string) []*http.Cookie

Cookies returns the cookies to send in a request for the given URL.

func (*Collector) DisableCookies

func (c *Collector) DisableCookies()

DisableCookies turns off cookie handling

func (*Collector) HasPosted

func (c *Collector) HasPosted(URL string, requestData map[string]string) (bool, error)

HasPosted checks whether the provided URL and requestData have already been posted. It is mainly useful for avoiding a repeated POST of the same URL and body.

func (*Collector) HasVisited

func (c *Collector) HasVisited(URL string) (bool, error)

HasVisited checks if the provided URL has been visited

func (*Collector) Head

func (c *Collector) Head(URL string) error

Head starts a collector job by creating a HEAD request.

func (*Collector) Init

func (c *Collector) Init()

Init initializes the Collector's private variables and sets default configuration for the Collector

func (*Collector) Limit

func (c *Collector) Limit(rule *LimitRule) error

Limit adds a new LimitRule to the collector

func (*Collector) Limits

func (c *Collector) Limits(rules []*LimitRule) error

Limits adds new LimitRules to the collector

func (*Collector) OnError

func (c *Collector) OnError(f ErrorCallback)

OnError registers a function. Function will be executed if an error occurs during the HTTP request.

func (*Collector) OnHTML

func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback)

OnHTML registers a function. Function will be executed on every HTML element matched by the GoQuery Selector parameter. GoQuery Selector is a selector used by https://github.com/PuerkitoBio/goquery

func (*Collector) OnHTMLDetach

func (c *Collector) OnHTMLDetach(goquerySelector string)

OnHTMLDetach deregisters a function. The function will not be executed after it is detached.

func (*Collector) OnRequest

func (c *Collector) OnRequest(f RequestCallback)

OnRequest registers a function. Function will be executed on every request made by the Collector

func (*Collector) OnResponse

func (c *Collector) OnResponse(f ResponseCallback)

OnResponse registers a function. Function will be executed on every response

func (*Collector) OnResponseHeaders

func (c *Collector) OnResponseHeaders(f ResponseHeadersCallback)

OnResponseHeaders registers a function. Function will be executed on every response when headers and status are already received, but body is not yet read.

Like in OnRequest, you can call Request.Abort to abort the transfer. This might be useful if, for example, you're following all hyperlinks, but want to avoid downloading files.

Be aware that using this will prevent HTTP/1.1 connection reuse, as the only way to abort a download is to immediately close the connection. HTTP/2 doesn't suffer from this problem, as it's possible to close specific stream inside the connection.

func (*Collector) OnScraped

func (c *Collector) OnScraped(f ScrapedCallback)

OnScraped registers a function. Function will be executed after OnHTML, as a final part of the scraping.

func (*Collector) OnXML

func (c *Collector) OnXML(xpathQuery string, f XMLCallback)

OnXML registers a function. Function will be executed on every XML element matched by the xpath Query parameter. xpath Query is used by https://github.com/antchfx/xmlquery

func (*Collector) OnXMLDetach

func (c *Collector) OnXMLDetach(xpathQuery string)

OnXMLDetach deregisters a function. The function will not be executed after it is detached.

func (*Collector) Post

func (c *Collector) Post(URL string, requestData map[string]string) error

Post starts a collector job by creating a POST request. Post also calls the previously provided callbacks

func (*Collector) PostMultipart

func (c *Collector) PostMultipart(URL string, requestData map[string][]byte) error

PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks

func (*Collector) PostRaw

func (c *Collector) PostRaw(URL string, requestData []byte) error

PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw also calls the previously provided callbacks.

func (*Collector) Request

func (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error

Request starts a collector job by creating a custom HTTP request where method, context, headers and request data can be specified. Set requestData, ctx, hdr parameters to nil if you don't want to use them. Valid methods:

  • "GET"
  • "HEAD"
  • "POST"
  • "PUT"
  • "DELETE"
  • "PATCH"
  • "OPTIONS"

func (*Collector) SetClient

func (c *Collector) SetClient(client *http.Client)

SetClient will override the previously set http.Client

func (*Collector) SetCookieJar

func (c *Collector) SetCookieJar(j http.CookieJar)

SetCookieJar overrides the previously set cookie jar

func (*Collector) SetCookies

func (c *Collector) SetCookies(URL string, cookies []*http.Cookie) error

SetCookies handles the receipt of the cookies in a reply for the given URL

func (*Collector) SetDebugger

func (c *Collector) SetDebugger(d debug.Debugger)

SetDebugger attaches a debugger to the collector

func (*Collector) SetProxy

func (c *Collector) SetProxy(proxyURL string) error

SetProxy sets a proxy for the collector. This method overrides the previously used http.Transport if the type of the transport is not http.RoundTripper. The proxy type is determined by the URL scheme. "http" and "socks5" are supported. If the scheme is empty, "http" is assumed.

func (*Collector) SetProxyFunc

func (c *Collector) SetProxyFunc(p ProxyFunc)

SetProxyFunc sets a custom proxy setter/switcher function. See built-in ProxyFuncs for more details. This method overrides the previously used http.Transport if the type of the transport is not http.RoundTripper. The proxy type is determined by the URL scheme. "http" and "socks5" are supported. If the scheme is empty, "http" is assumed.

func (*Collector) SetRedirectHandler

func (c *Collector) SetRedirectHandler(f func(req *http.Request, via []*http.Request) error)

SetRedirectHandler sets the function the Collector uses to handle redirects. The callback has the same signature as the CheckRedirect field of http.Client.

func (*Collector) SetRequestTimeout

func (c *Collector) SetRequestTimeout(timeout time.Duration)

SetRequestTimeout overrides the default timeout (10 seconds) for this collector

func (*Collector) SetStorage

func (c *Collector) SetStorage(s storage.Storage) error

SetStorage overrides the default in-memory storage. Storage stores scraping related data like cookies and visited urls

func (*Collector) String

func (c *Collector) String() string

String is the text representation of the collector. It contains useful debug information about the collector's internals

func (*Collector) UnmarshalRequest

func (c *Collector) UnmarshalRequest(r []byte) (*Request, error)

UnmarshalRequest creates a Request from serialized data

func (*Collector) Visit

func (c *Collector) Visit(URL string) error

Visit starts Collector's collecting job by creating a request to the URL specified in parameter. Visit also calls the previously provided callbacks

func (*Collector) Wait

func (c *Collector) Wait()

Wait returns when the collector jobs are finished

func (*Collector) WithTransport

func (c *Collector) WithTransport(transport http.RoundTripper)

WithTransport allows you to set a custom http.RoundTripper (transport)

type CollectorOption

type CollectorOption func(*Collector)

A CollectorOption sets an option on a Collector.

func AllowURLRevisit

func AllowURLRevisit() CollectorOption

AllowURLRevisit instructs the Collector to allow multiple downloads of the same URL

func AllowedDomains

func AllowedDomains(domains ...string) CollectorOption

AllowedDomains sets the domain whitelist used by the Collector.

func Async

func Async(a ...bool) CollectorOption

Async turns on asynchronous network requests.

func CacheDir

func CacheDir(path string) CollectorOption

CacheDir specifies the location where GET requests are cached as files.

func CheckHead

func CheckHead() CollectorOption

CheckHead performs a HEAD request before every GET to pre-validate the response

func Debugger

func Debugger(d debug.Debugger) CollectorOption

Debugger sets the debugger used by the Collector.

func DetectCharset

func DetectCharset() CollectorOption

DetectCharset enables character encoding detection for non-utf8 response bodies without explicit charset declaration. This feature uses https://github.com/saintfish/chardet

func DisallowedDomains

func DisallowedDomains(domains ...string) CollectorOption

DisallowedDomains sets the domain blacklist used by the Collector.

func DisallowedURLFilters

func DisallowedURLFilters(filters ...*regexp.Regexp) CollectorOption

DisallowedURLFilters sets the list of regular expressions which restricts visiting URLs. If any of the rules matches to a URL the request will be stopped.

func Headers

func Headers(headers map[string]string) CollectorOption

Headers sets the custom headers used by the Collector.

func ID

func ID(id uint32) CollectorOption

ID sets the unique identifier of the Collector.

func IgnoreRobotsTxt

func IgnoreRobotsTxt() CollectorOption

IgnoreRobotsTxt instructs the Collector to ignore any restrictions set by the target host's robots.txt file.

func MaxBodySize

func MaxBodySize(sizeInBytes int) CollectorOption

MaxBodySize sets the limit of the retrieved response body in bytes.
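A body-size cap of this kind can be pictured with io.LimitReader from the standard library. The sketch below is only an illustration under the documented rule that 0 means unlimited; readBody and readString are hypothetical helpers, not part of colly.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readBody reads at most maxBytes from r.
// A maxBytes of 0 (or less) means "no limit", matching the documented rule.
func readBody(r io.Reader, maxBytes int) ([]byte, error) {
	if maxBytes > 0 {
		r = io.LimitReader(r, int64(maxBytes))
	}
	return io.ReadAll(r)
}

// readString is a small convenience wrapper for string input.
// It returns an empty string if reading fails.
func readString(s string, maxBytes int) string {
	b, err := readBody(strings.NewReader(s), maxBytes)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(readString("hello, colly", 5)) // truncated to the first 5 bytes
	fmt.Println(readString("hello, colly", 0)) // unlimited
}
```

Note that a limit expressed in bytes can cut a multi-byte character in half, so a truncated body is not guaranteed to be valid text.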

func MaxDepth

func MaxDepth(depth int) CollectorOption

MaxDepth limits the recursion depth of visited URLs.

func MaxRequests

func MaxRequests(max uint32) CollectorOption

MaxRequests limits the number of requests made by the instance. Set it to 0 for infinite requests (default).

func ParseHTTPErrorResponse

func ParseHTTPErrorResponse() CollectorOption

ParseHTTPErrorResponse allows parsing responses with HTTP errors

func StdlibContext

func StdlibContext(ctx context.Context) CollectorOption

StdlibContext sets the context that will be used for HTTP requests. You can set this to support clean cancellation of scraping.

func TraceHTTP

func TraceHTTP() CollectorOption

TraceHTTP instructs the Collector to collect and report request trace data on the Response.Trace.

func URLFilters

func URLFilters(filters ...*regexp.Regexp) CollectorOption

URLFilters sets the list of regular expressions which restricts visiting URLs. When the list is not empty, a request proceeds only if at least one of the rules matches the URL.
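The interplay of the two filter lists can be sketched with the standard regexp package. This is an illustration based on the documented rules (DisallowedURLFilters is evaluated first, and an empty URLFilters list allows any URL), not colly's own code; compileAll, matchesAny and urlAllowed are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileAll compiles every pattern and panics on an invalid one.
func compileAll(patterns ...string) []*regexp.Regexp {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		out = append(out, regexp.MustCompile(p))
	}
	return out
}

// matchesAny reports whether at least one filter matches u.
func matchesAny(filters []*regexp.Regexp, u string) bool {
	for _, f := range filters {
		if f.MatchString(u) {
			return true
		}
	}
	return false
}

// urlAllowed applies the disallow list first, then the allow list.
// An empty allow list permits every URL that was not disallowed.
func urlAllowed(u string, disallowed, allowed []*regexp.Regexp) bool {
	if matchesAny(disallowed, u) {
		return false
	}
	if len(allowed) > 0 && !matchesAny(allowed, u) {
		return false
	}
	return true
}

func main() {
	allowed := compileAll(`^https?://go-colly\.org/`)
	disallowed := compileAll(`/private/`)
	for _, u := range []string{
		"http://go-colly.org/docs/",
		"http://go-colly.org/private/notes",
		"http://example.com/",
	} {
		fmt.Println(u, urlAllowed(u, disallowed, allowed))
	}
}
```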

func UserAgent

func UserAgent(ua string) CollectorOption

UserAgent sets the user agent used by the Collector.

type Context

type Context struct {
	// contains filtered or unexported fields
}

Context provides a tiny layer for passing data between callbacks

func NewContext

func NewContext() *Context

NewContext initializes a new Context instance

func (*Context) ForEach

func (c *Context) ForEach(fn func(k string, v interface{}) interface{}) []interface{}

ForEach iterates over the key/value pairs stored in the Context.

func (*Context) Get

func (c *Context) Get(key string) string

Get retrieves a string value from Context. Get returns an empty string if key not found

func (*Context) GetAny

func (c *Context) GetAny(key string) interface{}

GetAny retrieves a value from Context. GetAny returns nil if key not found

func (*Context) MarshalBinary

func (c *Context) MarshalBinary() (_ []byte, _ error)

MarshalBinary encodes the Context value. This function is used by request caching.

func (*Context) Put

func (c *Context) Put(key string, value interface{})

Put stores a value of any type in Context
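The Put, Get and GetAny semantics described above can be pictured as a small mutex-guarded map. The sketch below is a stand-in written for illustration only: kvContext and newKVContext are hypothetical names, and the way Get treats non-string values here (returning an empty string) is an assumption of the sketch, since the text above does not say what colly does in that case.

```go
package main

import (
	"fmt"
	"sync"
)

// kvContext is a hypothetical stand-in for a context that passes data between callbacks.
type kvContext struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func newKVContext() *kvContext {
	return &kvContext{data: make(map[string]interface{})}
}

// Put stores a value of any type under key.
func (c *kvContext) Put(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// GetAny returns the stored value, or nil if the key is not found.
func (c *kvContext) GetAny(key string) interface{} {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.data[key]
}

// Get returns the stored string, or "" if the key is not found.
// Assumption of this sketch: non-string values also yield "".
func (c *kvContext) Get(key string) string {
	s, _ := c.GetAny(key).(string)
	return s
}

func main() {
	ctx := newKVContext()
	ctx.Put("title", "Colly")
	ctx.Put("depth", 2)
	fmt.Println(ctx.Get("title"), ctx.GetAny("depth"), ctx.Get("missing") == "")
}
```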

func (*Context) UnmarshalBinary

func (c *Context) UnmarshalBinary(_ []byte) error

UnmarshalBinary decodes the Context value to nil. This function is used by request caching.

type ErrorCallback

type ErrorCallback func(*Response, error)

ErrorCallback is a type alias for OnError callback functions

type HTMLCallback

type HTMLCallback func(*HTMLElement)

HTMLCallback is a type alias for OnHTML callback functions

type HTMLElement

type HTMLElement struct {
	// Name is the name of the tag
	Name string
	Text string

	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the goquery parsed DOM object of the page. DOM is relative
	// to the current HTMLElement
	DOM *goquery.Selection
	// Index stores the position of the current element within all the elements matched by an OnHTML callback
	Index int
	// contains filtered or unexported fields
}

HTMLElement is the representation of a HTML tag.

func NewHTMLElementFromSelectionNode

func NewHTMLElementFromSelectionNode(resp *Response, s *goquery.Selection, n *html.Node, idx int) *HTMLElement

NewHTMLElementFromSelectionNode creates a HTMLElement from a goquery.Selection Node.

func (*HTMLElement) Attr

func (h *HTMLElement) Attr(k string) string

Attr returns the selected attribute of a HTMLElement or empty string if no attribute found

func (*HTMLElement) ChildAttr

func (h *HTMLElement) ChildAttr(goquerySelector, attrName string) string

ChildAttr returns the stripped text content of the first matching element's attribute.

func (*HTMLElement) ChildAttrs

func (h *HTMLElement) ChildAttrs(goquerySelector, attrName string) []string

ChildAttrs returns the stripped text content of all the matching element's attributes.

func (*HTMLElement) ChildText

func (h *HTMLElement) ChildText(goquerySelector string) string

ChildText returns the concatenated and stripped text content of the matching elements.

func (*HTMLElement) ChildTexts

func (h *HTMLElement) ChildTexts(goquerySelector string) []string

ChildTexts returns the stripped text content of all the matching elements.

func (*HTMLElement) ForEach

func (h *HTMLElement) ForEach(goquerySelector string, callback func(int, *HTMLElement))

ForEach iterates over the elements matched by the first argument and calls the callback function on every HTMLElement match.

func (*HTMLElement) ForEachWithBreak

func (h *HTMLElement) ForEachWithBreak(goquerySelector string, callback func(int, *HTMLElement) bool)

ForEachWithBreak iterates over the elements matched by the first argument and calls the callback function on every HTMLElement match. It is identical to ForEach except that the loop can be stopped early by returning false from the callback function.

func (*HTMLElement) Unmarshal

func (h *HTMLElement) Unmarshal(v interface{}) error

Unmarshal is a shorthand for colly.UnmarshalHTML

func (*HTMLElement) UnmarshalWithMap

func (h *HTMLElement) UnmarshalWithMap(v interface{}, structMap map[string]string) error

UnmarshalWithMap is a shorthand for colly.UnmarshalHTML, extended to allow maps to be passed in.

type HTTPTrace

type HTTPTrace struct {
	ConnectDuration   time.Duration
	FirstByteDuration time.Duration
	// contains filtered or unexported fields
}

HTTPTrace provides a data structure for storing an HTTP trace.

func (*HTTPTrace) WithTrace

func (ht *HTTPTrace) WithTrace(req *http.Request) *http.Request

WithTrace returns the given HTTP Request with this HTTPTrace added to its context.

type LimitRule

type LimitRule struct {
	// DomainRegexp is a regular expression to match against domains
	DomainRegexp string
	// DomainGlob is a glob pattern to match against domains
	DomainGlob string
	// Delay is the duration to wait before creating a new request to the matching domains
	Delay time.Duration
	// RandomDelay is the extra randomized duration to wait added to Delay before creating a new request
	RandomDelay time.Duration
	// Parallelism is the number of the maximum allowed concurrent requests of the matching domains
	Parallelism int
	// contains filtered or unexported fields
}

LimitRule provides connection restrictions for domains. Both DomainRegexp and DomainGlob can be used to specify the domain patterns the rule applies to, but at least one is required. Two kinds of limits are available:

  • Parallelism: Set limit for the number of concurrent requests to matching domains
  • Delay: Wait specified amount of time between requests (parallelism is 1 in this case)
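One way to picture the Parallelism limit is a counting semaphore built from a buffered channel. The sketch below is a generic illustration of that idea and makes no claim about how colly implements it; runLimited is a hypothetical helper, and the sleep merely stands in for an HTTP request.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited starts n jobs but lets at most parallelism of them run at once.
// It returns the highest number of jobs observed running concurrently.
// parallelism must be at least 1, otherwise no job could ever start.
func runLimited(n, parallelism int, work time.Duration) int {
	sem := make(chan struct{}, parallelism) // one slot per allowed concurrent job
	var (
		wg            sync.WaitGroup
		mu            sync.Mutex
		running, peak int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when the job ends

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			time.Sleep(work) // stands in for the HTTP request

			mu.Lock()
			running--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runLimited(8, 2, 5*time.Millisecond)
	fmt.Println("peak concurrency:", peak)
}
```

With a parallelism of 1 the same structure serializes the jobs, which is the situation the Delay bullet describes.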

func (*LimitRule) Init

func (r *LimitRule) Init() error

Init initializes the private members of LimitRule

func (*LimitRule) Match

func (r *LimitRule) Match(domain string) bool

Match checks that the domain parameter triggers the rule

type ProxyFunc

type ProxyFunc func(*http.Request) (*url.URL, error)

ProxyFunc is a type alias for proxy setter functions.
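Because ProxyFunc is an ordinary function type, a proxy switcher can be written with the standard library alone. The sketch below builds a round-robin selector with the same signature; roundRobinProxy is a hypothetical name, the proxy addresses are placeholders, and the error text is the sketch's own.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobinProxy returns a function with the ProxyFunc signature that
// cycles through the given proxy URLs, one per call.
func roundRobinProxy(proxyURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(proxyURLs) == 0 {
		return nil, errors.New("proxy URL list is empty")
	}
	parsed := make([]*url.URL, 0, len(proxyURLs))
	for _, raw := range proxyURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		parsed = append(parsed, u)
	}
	var next uint32 // call counter, advanced atomically so the function is safe for concurrent use
	return func(_ *http.Request) (*url.URL, error) {
		i := atomic.AddUint32(&next, 1) - 1
		return parsed[i%uint32(len(parsed))], nil
	}, nil
}

func main() {
	pick, err := roundRobinProxy("socks5://127.0.0.1:1337", "http://127.0.0.1:8080")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i := 0; i < 3; i++ {
		u, _ := pick(nil) // the request is ignored by this selector
		fmt.Println(u)
	}
}
```

A function of this shape should be usable wherever a ProxyFunc is expected, for example as the argument to Collector.SetProxyFunc.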

type Request

type Request struct {
	// URL is the parsed URL of the HTTP request
	URL *url.URL
	// Headers contains the Request's HTTP headers
	Headers *http.Header
	// the Host header
	Host string
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Depth is the number of the parents of the request
	Depth int
	// Method is the HTTP method of the request
	Method string
	// Body is the request body which is used on POST/PUT requests
	Body io.Reader
	// ResponseCharacterEncoding is the character encoding of the response body.
	// Leave it blank to allow automatic character encoding of the response body.
	// It is empty by default and it can be set in OnRequest callback.
	ResponseCharacterEncoding string
	// ID is the Unique identifier of the request
	ID uint32

	// ProxyURL is the proxy address that handles the request
	ProxyURL string
	// contains filtered or unexported fields
}

Request is the representation of a HTTP request made by a Collector

func (*Request) Abort

func (r *Request) Abort()

Abort cancels the HTTP request when called in an OnRequest callback

func (*Request) AbsoluteURL

func (r *Request) AbsoluteURL(u string) string

AbsoluteURL returns the resolved absolute URL of a URL chunk. It returns an empty string if the URL chunk is a fragment or could not be parsed.
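Resolution of this kind can be approximated with net/url from the standard library. The sketch below follows the behaviour described above (empty string for fragments and unparsable input) but is not colly's implementation; absoluteURL is a hypothetical helper that takes the base URL explicitly.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// absoluteURL resolves chunk against base.
// It returns "" if chunk is a fragment or if either URL cannot be parsed.
func absoluteURL(base, chunk string) string {
	if strings.HasPrefix(chunk, "#") {
		return "" // fragments point inside the current page
	}
	b, err := url.Parse(base)
	if err != nil {
		return ""
	}
	abs, err := b.Parse(chunk) // parses chunk in the context of b
	if err != nil {
		return ""
	}
	return abs.String()
}

func main() {
	base := "http://go-colly.org/docs/"
	for _, chunk := range []string{"../articles/", "/about", "#top"} {
		fmt.Printf("%q -> %q\n", chunk, absoluteURL(base, chunk))
	}
}
```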

func (*Request) Do

func (r *Request) Do() error

Do submits the request

func (*Request) HasVisited

func (r *Request) HasVisited(URL string) (bool, error)

HasVisited checks if the provided URL has been visited

func (*Request) Marshal

func (r *Request) Marshal() ([]byte, error)

Marshal serializes the Request

func (*Request) New

func (r *Request) New(method, URL string, body io.Reader) (*Request, error)

New creates a new request with the context of the original request

func (*Request) Post

func (r *Request) Post(URL string, requestData map[string]string) error

Post continues a collector job by creating a POST request and preserves the Context of the previous request. Post also calls the previously provided callbacks

func (*Request) PostMultipart

func (r *Request) PostMultipart(URL string, requestData map[string][]byte) error

PostMultipart starts a collector job by creating a Multipart POST request with raw binary data. PostMultipart also calls the previously provided callbacks.

func (*Request) PostRaw

func (r *Request) PostRaw(URL string, requestData []byte) error

PostRaw starts a collector job by creating a POST request with raw binary data. PostRaw preserves the Context of the previous request and calls the previously provided callbacks

func (*Request) Retry

func (r *Request) Retry() error

Retry submits HTTP request again with the same parameters

func (*Request) Visit

func (r *Request) Visit(URL string) error

Visit continues Collector's collecting job by creating a request and preserves the Context of the previous request. Visit also calls the previously provided callbacks

type RequestCallback

type RequestCallback func(*Request)

RequestCallback is a type alias for OnRequest callback functions

type Response

type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
	// Trace contains the HTTPTrace for the request. Will only be set by the
	// collector if Collector.TraceHTTP is set to true.
	Trace *HTTPTrace
}

Response is the representation of a HTTP response made by a Collector

func (*Response) FileName

func (r *Response) FileName() string

FileName returns the sanitized file name parsed from "Content-Disposition" header or from URL
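The lookup order described above (header first, URL as a fallback) can be sketched with mime.ParseMediaType. The helper below is an illustration only: fileNameFor is a hypothetical name, it takes the header value and URL as plain strings, and it skips the sanitizing step that FileName is documented to perform.

```go
package main

import (
	"fmt"
	"mime"
	"net/url"
	"path"
)

// fileNameFor prefers the filename parameter of a Content-Disposition
// header value and falls back to the last path element of rawURL.
// Unlike the documented FileName, it does not sanitize the result.
func fileNameFor(contentDisposition, rawURL string) string {
	if _, params, err := mime.ParseMediaType(contentDisposition); err == nil {
		if name := params["filename"]; name != "" {
			return name
		}
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	return path.Base(u.Path)
}

func main() {
	fmt.Println(fileNameFor(`attachment; filename="report.pdf"`, "http://example.com/download?id=1"))
	fmt.Println(fileNameFor("", "http://example.com/files/data.csv"))
}
```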

func (*Response) Save

func (r *Response) Save(fileName string) error

Save writes response body to disk

type ResponseCallback

type ResponseCallback func(*Response)

ResponseCallback is a type alias for OnResponse callback functions

type ResponseHeadersCallback

type ResponseHeadersCallback func(*Response)

ResponseHeadersCallback is a type alias for OnResponseHeaders callback functions

type ScrapedCallback

type ScrapedCallback func(*Response)

ScrapedCallback is a type alias for OnScraped callback functions

type XMLCallback

type XMLCallback func(*XMLElement)

XMLCallback is a type alias for OnXML callback functions

type XMLElement

type XMLElement struct {
	// Name is the name of the tag
	Name string
	Text string

	// Request is the request object of the element's HTML document
	Request *Request
	// Response is the Response object of the element's HTML document
	Response *Response
	// DOM is the DOM object of the page. DOM is relative
	// to the current XMLElement and is either a html.Node or xmlquery.Node
	// based on how the XMLElement was created.
	DOM interface{}
	// contains filtered or unexported fields
}

XMLElement is the representation of a XML tag.

func NewXMLElementFromHTMLNode

func NewXMLElementFromHTMLNode(resp *Response, s *html.Node) *XMLElement

NewXMLElementFromHTMLNode creates a XMLElement from a html.Node.

func NewXMLElementFromXMLNode

func NewXMLElementFromXMLNode(resp *Response, s *xmlquery.Node) *XMLElement

NewXMLElementFromXMLNode creates a XMLElement from a xmlquery.Node.

func (*XMLElement) Attr

func (h *XMLElement) Attr(k string) string

Attr returns the selected attribute of an XMLElement or empty string if no attribute found

func (*XMLElement) ChildAttr

func (h *XMLElement) ChildAttr(xpathQuery, attrName string) string

ChildAttr returns the stripped text content of the first matching element's attribute.

func (*XMLElement) ChildAttrs

func (h *XMLElement) ChildAttrs(xpathQuery, attrName string) []string

ChildAttrs returns the stripped text content of all the matching element's attributes.

func (*XMLElement) ChildText

func (h *XMLElement) ChildText(xpathQuery string) string

ChildText returns the concatenated and stripped text content of the matching elements.

func (*XMLElement) ChildTexts

func (h *XMLElement) ChildTexts(xpathQuery string) []string

ChildTexts returns an array of strings corresponding to child elements that match the xpath query. Each item in the array is the stripped text content of the corresponding matching child element.

Directories

Path Synopsis
_examples
cmd
Package extensions implements various helper addons for Colly