Documentation ¶
Index ¶
- Variables
- func Attr(n Finder, attr, selector string) string
- func Callbacks(v ...string) []string
- func ConstructHTTPRequest(req *Request) (r *http.Request, err error)
- func FindAny(finder Finder, selectors ...string) (node *goquery.Selection)
- func NodeAttr(attr string) func(int, *goquery.Selection) string
- func NodeResolveURL(resp *Response) func(int, *goquery.Selection) string
- func NodeText(_ int, n *goquery.Selection) string
- func ParseFloat(n Finder, selector string) (res float64, err error)
- func ParseUint(n Finder, selector string) (res uint64, err error)
- func ProxyFromContext(ctx context.Context) (addrs []string, ok bool)
- func Text(n Finder, selector string) string
- func WithProxy(ctx context.Context, addrs ...string) context.Context
- func WriteResponseFile(r *Response, fname string) (err error)
- type Crawler
- type Finder
- type Handler
- type Job
- type Middleware
- type Option
- func WithConcurrency(n int) Option
- func WithDefaultHeaders(headers map[string]string) Option
- func WithDefaultTimeout(d time.Duration) Option
- func WithQueue(queue Queue) Option
- func WithQueueCapacity(n int) Option
- func WithSpiders(spiders ...func(Crawler)) Option
- func WithTransport(transport *http.Transport) Option
- func WithUserAgent(ua string) Option
- type Queue
- type Request
- type RequestError
- type Response
- func (r *Response) Bytes() (body []byte, err error)
- func (r *Response) Close() error
- func (r *Response) Find(selector string) *goquery.Selection
- func (r *Response) ParseHTML() (err error)
- func (r *Response) Query() *goquery.Document
- func (r *Response) Status() string
- func (r *Response) URL() *url.URL
Constants ¶
This section is empty.
Variables ¶
var DefaultHeaders = map[string]string{
	"Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
	"Accept-Language": "en-US,en;q=0.8",
	"User-Agent":      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
}
DefaultHeaders - Default crawler headers.
var NodeDataPhoto = NodeAttr("data-photo")
NodeDataPhoto - Node "data-photo" attribute selector.
var NodeHref = NodeAttr("href")
NodeHref - Node "href" attribute selector.
var NodeSrc = NodeAttr("src")
NodeSrc - Node "src" attribute selector.
Functions ¶
func ConstructHTTPRequest ¶
ConstructHTTPRequest - Constructs an http.Request structure.
func NodeResolveURL ¶
NodeResolveURL - Returns a selector helper which takes a node's href and resolves it against the response URL. The returned function matches the signature expected by (*goquery.Selection).Map().
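A minimal sketch combining the node helpers with goquery's Map; the import path and the "a.item" selector are illustrative assumptions:

import (
	crawl "github.com/crackcomm/crawl" // import path assumed
)

// pageLinks returns raw href values and absolute URLs for matched anchors.
func pageLinks(resp *crawl.Response) (hrefs, absolute []string) {
	sel := resp.Find("a.item")
	hrefs = sel.Map(crawl.NodeHref)                // raw "href" attribute values
	absolute = sel.Map(crawl.NodeResolveURL(resp)) // resolved against the response URL
	return hrefs, absolute
}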
func ParseFloat ¶
ParseFloat - Finds a node in the response and parses its text as a float64. When no text is found, it returns 0.0 and a nil error. A non-nil error originates from strconv.ParseFloat.
func ParseUint ¶
ParseUint - Finds a node in the response and parses its text as a uint64. When no text is found, it returns 0 and a nil error. A non-nil error originates from strconv.ParseUint.
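A hedged usage sketch; the selectors are illustrative, and Response is assumed to satisfy Finder through its Find method:

import (
	crawl "github.com/crackcomm/crawl" // import path assumed
)

// parseStats reads a float rating and an integer vote count from the page.
func parseStats(resp *crawl.Response) (rating float64, votes uint64, err error) {
	if rating, err = crawl.ParseFloat(resp, "span.rating"); err != nil {
		return 0, 0, err
	}
	votes, err = crawl.ParseUint(resp, "span.votes")
	return rating, votes, err
}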
func ProxyFromContext ¶
ProxyFromContext - Returns proxy from context metadata.
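A sketch pairing WithProxy with ProxyFromContext; the proxy address and request URL are illustrative:

import (
	"context"
	"log"

	crawl "github.com/crackcomm/crawl" // import path assumed
)

func scheduleViaProxy(c crawl.Crawler) error {
	// Attach proxy addresses to the context that travels with the request.
	ctx := crawl.WithProxy(context.Background(), "socks5://127.0.0.1:9050")

	// Anything holding the context can read the addresses back.
	if addrs, ok := crawl.ProxyFromContext(ctx); ok {
		log.Println("proxies:", addrs)
	}
	return c.Schedule(ctx, &crawl.Request{URL: "https://example.com/"})
}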
func WriteResponseFile ¶
WriteResponseFile - Reads the response body into memory and writes it to a file.
Types ¶
type Crawler ¶
type Crawler interface {
	// Schedule - Schedules request.
	// Context is passed to queue in a job.
	Schedule(context.Context, *Request) error

	// Execute - Makes an HTTP request respecting the context deadline.
	// If request Raw is not true, the ParseHTML() method is executed on Response.
	// Then all callbacks are executed with context.
	Execute(context.Context, *Request) (*Response, error)

	// Handlers - Returns all registered handlers.
	Handlers() map[string][]Handler

	// Register - Registers a crawl handler.
	Register(name string, h Handler)

	// Middleware - Registers a middleware.
	// Request is not executed if middleware returns an error.
	Middleware(Middleware)

	// Start - Starts the crawler.
	// All errors should be received from the Errors() channel.
	Start()

	// Close - Closes the queue and the crawler.
	Close() error

	// Errors - Returns a channel that receives all crawl errors.
	// Only errors from queued requests are delivered here,
	// including queue errors as well as request errors.
	Errors() <-chan error
}
Crawler - Crawler interface.
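An illustrative spider sketch using the interface above. The Handler signature is not reproduced in this documentation; it is assumed here to be func(context.Context, *Response) error, so verify it against the package source. The handler names and selectors are made up:

import (
	"context"

	crawl "github.com/crackcomm/crawl" // import path assumed
)

// Spider registers a handler that follows links from list pages.
func Spider(c crawl.Crawler) {
	c.Register("list", func(ctx context.Context, resp *crawl.Response) error { // assumed Handler signature
		for _, href := range resp.Find("a.item").Map(crawl.NodeResolveURL(resp)) {
			req := &crawl.Request{URL: href, Callbacks: crawl.Callbacks("detail")}
			if err := c.Schedule(ctx, req); err != nil {
				return err
			}
		}
		return nil
	})
}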
type Job ¶
type Job interface {
	// Request - Returns the crawl request.
	Request() *Request

	// Context - Returns job context.
	Context() context.Context

	// Done - Marks the job as done.
	Done()
}
Job - Crawl job interface.
type Middleware ¶
Middleware - Crawler middleware.
type Option ¶
type Option func(*crawl)
Option - Crawl option.
func WithConcurrency ¶
WithConcurrency - Sets crawl concurrency. Default: 1000.
func WithDefaultHeaders ¶
WithDefaultHeaders - Sets crawl default headers. Default: empty.
func WithDefaultTimeout ¶
WithDefaultTimeout - Sets default request timeout duration.
func WithQueue ¶
WithQueue - Sets the crawl queue. Default: creates a queue using NewQueue() with the capacity set by WithQueueCapacity().
func WithQueueCapacity ¶
WithQueueCapacity - Sets the queue capacity. The value is used as the capacity of the in-memory channel queue when one is created, and as the capacity of the buffered errors channel. Default: 10000.
func WithSpiders ¶
WithSpiders - Registers spiders on a crawler.
func WithTransport ¶
WithTransport - Sets crawl HTTP transport.
func WithUserAgent ¶
WithUserAgent - Sets crawl default user-agent.
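A construction sketch tying the options together. The index above does not list a constructor, so crawl.New below is an assumption made purely for illustration; substitute the package's actual constructor:

import (
	"time"

	crawl "github.com/crackcomm/crawl" // import path assumed
)

func newCrawler() crawl.Crawler {
	return crawl.New( // assumed constructor name
		crawl.WithConcurrency(100),
		crawl.WithQueueCapacity(1000),
		crawl.WithDefaultTimeout(30*time.Second),
		crawl.WithDefaultHeaders(crawl.DefaultHeaders),
		crawl.WithUserAgent("my-crawler/0.1"),
	)
}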
type Queue ¶
type Queue interface {
	// Get - Gets request from Queue channel.
	// Returns io.EOF if queue is empty.
	Get() (Job, error)

	// Schedule - Schedules a Request.
	// Returns io.ErrClosedPipe if queue is closed.
	Schedule(context.Context, *Request) error

	// Close - Closes the queue.
	Close() error
}
Queue - Requests queue.
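A consumer-loop sketch using only the methods shown above. Whether Execute releases the response itself is not stated here, so this sketch closes it explicitly:

import (
	"io"

	crawl "github.com/crackcomm/crawl" // import path assumed
)

// drain executes jobs from the queue until it is empty or closed.
func drain(q crawl.Queue, c crawl.Crawler) error {
	for {
		job, err := q.Get()
		if err == io.EOF { // queue is empty
			return nil
		}
		if err != nil {
			return err
		}
		resp, execErr := c.Execute(job.Context(), job.Request())
		if resp != nil {
			resp.Close() // responses must always be released
		}
		job.Done()
		if execErr != nil {
			return execErr
		}
	}
}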
type Request ¶
type Request struct {
	// URL - It can be an absolute URL or relative to the source URL if Referer is set.
	URL string `json:"url,omitempty"`

	// Method - "GET" by default.
	Method string `json:"method,omitempty"`

	// Referer - Request referer.
	Referer string `json:"referer,omitempty"`

	// Form - Form values which are sent as the request body.
	Form url.Values `json:"form,omitempty"`

	// Query - Form values which are sent as the URL query.
	Query url.Values `json:"query,omitempty"`

	// Cookies - Request cookies.
	Cookies url.Values `json:"cookies,omitempty"`

	// Header - Header values.
	Header map[string]string `json:"header,omitempty"`

	// Raw - When false, the response is expected to be HTML and is parsed.
	Raw bool `json:"raw,omitempty"`

	// Callbacks - Crawl callback list.
	Callbacks []string `json:"callbacks,omitempty"`
}
Request - HTTP Request. Multipart form is not implemented.
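A request-construction sketch; the URL, form values, and callback name are illustrative:

import (
	"net/url"

	crawl "github.com/crackcomm/crawl" // import path assumed
)

// searchRequest builds a POST request whose results are handled by the
// "search-results" callback.
func searchRequest(query string) *crawl.Request {
	return &crawl.Request{
		URL:       "https://example.com/search",
		Method:    "POST",
		Form:      url.Values{"q": {query}},
		Header:    map[string]string{"X-Requested-With": "XMLHttpRequest"},
		Callbacks: crawl.Callbacks("search-results"),
	}
}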
type RequestError ¶
RequestError - Crawl error.
func (*RequestError) Error ¶
func (err *RequestError) Error() string
Error - Returns request error message.
type Response ¶
Response - Crawl HTTP response. It is expected to be an HTML response, but this is not required. It ALWAYS has to be released using the Close() method.
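A direct-execution sketch showing typical Response handling; the URL and selector are illustrative:

import (
	"context"

	crawl "github.com/crackcomm/crawl" // import path assumed
)

// fetchTitle fetches a page and extracts a heading.
func fetchTitle(ctx context.Context, c crawl.Crawler) (string, error) {
	resp, err := c.Execute(ctx, &crawl.Request{URL: "https://example.com/"})
	if err != nil {
		return "", err
	}
	// Responses must always be released with Close().
	defer resp.Close()
	return crawl.Text(resp, "h1.title"), nil
}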
Source Files ¶
Directories ¶
Path | Synopsis
---|---
examples |
imdb | This is only an example, please don't harm imdb servers; if you need movies data, check out http://www.imdb.com/interfaces. I can also recommend checking out the source code of https://github.com/BurntSushi/goim, which implements importing data into SQL databases and comes with a command line search tool.
imdb/spider | Package spider implements imdb spider.
forms | Package forms implements helpers for filling forms.
nsq |
consumer | Package consumer implements command line crawl consumer from nsq.