Documentation ¶
Overview ¶
Example ¶
package main

import (
    "fmt"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    startURL := "https://godoc.org"
    cr, err := crawler.New()
    if err != nil {
        panic(err)
    }
    err = cr.Crawl(startURL, func(url string, res *crawler.Response, err error) error {
        if err != nil {
            fmt.Printf("error: %s", err.Error())
            return nil
        }
        fmt.Printf("%s - Links: %d Assets: %d\n", url, len(res.Links), len(res.Assets))
        return crawler.ErrSkipURL
    })
    if err != nil {
        panic(err)
    }
}
Output:

https://godoc.org/ - Links: 39 Assets: 5
Index ¶
- Variables
- func ReadResponse(base *url.URL, r io.Reader, res *Response) error
- type Asset
- type CheckFetchFunc
- type CheckFetchStack
- type CrawlFunc
- type InMemoryQueue
- type Link
- type Option
- func WithAllowedHosts(hosts ...string) Option
- func WithCheckFetch(fn CheckFetchFunc) Option
- func WithConcurrentRequests(n int) Option
- func WithExcludedHosts(hosts ...string) Option
- func WithHTTPTransport(rt http.RoundTripper) Option
- func WithMaxDepth(depth int) Option
- func WithOneRequestPerURL() Option
- type Queue
- type Request
- type Response
- type Runner
- type Simple
- type Worker
Examples ¶
Constants ¶
This section is empty.
Variables ¶
var ErrSkipURL = errors.New("skip URL")
ErrSkipURL can be returned by a CrawlFunc to avoid crawling the links from the given URL.
Functions ¶
Types ¶
type Asset ¶
type Asset struct {
    // Tag used to link the asset
    Tag string `json:"tag"`
    // URL of the asset
    URL string `json:"url"`
    // Rel contains the text of the rel attribute
    Rel string `json:"rel,omitempty"`
    // Type contains the text of the type attribute
    Type string `json:"type,omitempty"`
}
Asset represents linked assets such as link, script and img tags
type CheckFetchFunc ¶
CheckFetchFunc is used to check whether a page should be fetched during the crawl or not
type CheckFetchStack ¶
type CheckFetchStack []CheckFetchFunc
CheckFetchStack is a stack of CheckFetchFunc types where all have to pass for the fetch to happen.
func (CheckFetchStack) CheckFetch ¶
func (s CheckFetchStack) CheckFetch(req *Request) bool
CheckFetch returns true if all funcs in the stack return true, and false otherwise.
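A minimal sketch of composing checks into a stack follows. It assumes CheckFetchFunc has the signature func(*Request) bool, as the CheckFetch method above suggests; the fetch-limit and always-allow checks are purely illustrative, and the zero-value Request exists only to exercise the call.

package main

import (
    "fmt"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    fetched := 0
    stack := crawler.CheckFetchStack{
        // Illustrative check: allow at most 100 fetches in total.
        func(req *crawler.Request) bool {
            fetched++
            return fetched <= 100
        },
        // Illustrative check: always allow.
        func(req *crawler.Request) bool { return true },
    }

    // CheckFetch returns true only when every func in the stack returns true.
    fmt.Println(stack.CheckFetch(&crawler.Request{}))
}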
type CrawlFunc ¶
CrawlFunc is the type of the function called for each webpage visited by Crawl. The incoming url specifies which URL was fetched, while res contains the response of the fetched URL if it was successful. If the fetch failed, the incoming error will specify the reason and res will be nil.

Returning ErrSkipURL will avoid queuing up the resource's links to be crawled.
Returning any other error from the function will immediately stop the crawl.
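A sketch of those three outcomes in one CrawlFunc is shown below; the 1000-page limit and the "/static/" filter are illustrative, not part of the package.

package main

import (
    "errors"
    "log"
    "strings"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    cr, err := crawler.New()
    if err != nil {
        log.Fatal(err)
    }

    errTooMany := errors.New("too many pages")
    crawled := 0

    err = cr.Crawl("https://godoc.org", func(url string, res *crawler.Response, err error) error {
        if err != nil {
            // The fetch failed; log it and let the crawl continue.
            log.Printf("fetching %s: %v", url, err)
            return nil
        }
        if crawled++; crawled > 1000 {
            // Any error other than ErrSkipURL stops the crawl immediately.
            return errTooMany
        }
        if strings.Contains(url, "/static/") {
            // Skip queuing this page's links, but keep crawling elsewhere.
            return crawler.ErrSkipURL
        }
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}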
type InMemoryQueue ¶
type InMemoryQueue struct {
// contains filtered or unexported fields
}
InMemoryQueue holds a queue of items to be crawled in memory
func NewInMemoryQueue ¶
func NewInMemoryQueue(ctx context.Context) *InMemoryQueue
NewInMemoryQueue returns an in-memory queue ready to be used by different workers
func (*InMemoryQueue) PopFront ¶
func (q *InMemoryQueue) PopFront() (*Request, error)
PopFront gets the next request from the queue. It will return a nil request and a nil error if the queue is empty.
func (*InMemoryQueue) PushBack ¶
func (q *InMemoryQueue) PushBack(req *Request) error
PushBack adds a request to the queue
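A minimal sketch of the queue contract: the zero-value Request is there purely to exercise PushBack, and the drain loop relies on the documented nil, nil return once the queue is empty.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    q := crawler.NewInMemoryQueue(context.Background())

    // Queue a request (zero-value here, just to exercise the API).
    if err := q.PushBack(&crawler.Request{}); err != nil {
        log.Fatal(err)
    }

    // Drain the queue: PopFront returns (nil, nil) when it is empty.
    for {
        req, err := q.PopFront()
        if err != nil {
            log.Fatal(err)
        }
        if req == nil {
            break
        }
        fmt.Printf("got request: %+v\n", req)
    }
}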
type Link ¶
type Link struct {
    // URL contains the href attribute of the link. e.g: <a href="{href}">...</a>
    URL string `json:"url"`
}

Link contains the information from a single `a` tag
type Option ¶
type Option func(*options) error
Option is used to provide optional configuration to a crawler
func WithAllowedHosts ¶
WithAllowedHosts adds a check to only allow URLs with the given hosts
func WithCheckFetch ¶
func WithCheckFetch(fn CheckFetchFunc) Option
WithCheckFetch takes CheckFetchFunc that will be run before fetching each page to check whether it should be fetched or not
func WithConcurrentRequests ¶
WithConcurrentRequests sets how many concurrent requests to allow
func WithExcludedHosts ¶
WithExcludedHosts adds a check to only allow URLs with hosts other than the given ones
func WithHTTPTransport ¶
func WithHTTPTransport(rt http.RoundTripper) Option
WithHTTPTransport sets the optional http.RoundTripper used for HTTP requests
func WithMaxDepth ¶
WithMaxDepth sets the max depth of the crawl. The depth must be greater than zero or the call will panic.
func WithOneRequestPerURL ¶
func WithOneRequestPerURL() Option
WithOneRequestPerURL adds a check to only allow URLs once
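A sketch of combining several options follows. It assumes New accepts variadic Option values (consistent with Option being described as optional crawler configuration); the hosts, depth, and concurrency values are only examples.

package main

import (
    "log"
    "net/http"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    // Assumed: New(opts ...Option) applies these options to the crawler.
    cr, err := crawler.New(
        crawler.WithMaxDepth(3),
        crawler.WithConcurrentRequests(5),
        crawler.WithAllowedHosts("godoc.org"),
        crawler.WithHTTPTransport(http.DefaultTransport),
        crawler.WithOneRequestPerURL(),
    )
    if err != nil {
        log.Fatal(err)
    }
    _ = cr // use cr.Crawl as in the package example
}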
type Queue ¶
Queue is used by workers to keep track of the urls that need to be fetched. Queue must be safe to use concurrently.
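InMemoryQueue above is presumably one implementation of this interface; a compile-time assertion like the following sketch would verify that assumption.

package main

import "github.com/ernesto-jimenez/crawler"

// Assumed: *InMemoryQueue satisfies Queue via its PushBack and PopFront methods.
var _ crawler.Queue = (*crawler.InMemoryQueue)(nil)

func main() {}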
type Request ¶
Request is used to fetch a page and information about its resources
func NewRequest ¶
NewRequest initialises a new crawling request to extract information from a single URL
type Response ¶
type Response struct {
    URL        string  `json:"url"`
    RedirectTo string  `json:"redirect_to,omitempty"`
    Links      []Link  `json:"links"`
    Assets     []Asset `json:"assets"`
    // contains filtered or unexported fields
}
Response has the details from crawling a single URL
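Inside a CrawlFunc the response exposes the discovered links and assets directly through the exported fields above; a brief sketch, using the same New and Crawl calls as the package example:

package main

import (
    "fmt"
    "log"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    cr, err := crawler.New()
    if err != nil {
        log.Fatal(err)
    }
    err = cr.Crawl("https://godoc.org", func(url string, res *crawler.Response, err error) error {
        if err != nil {
            return nil
        }
        // Print every discovered link and asset for the fetched page.
        for _, l := range res.Links {
            fmt.Println("link:", l.URL)
        }
        for _, a := range res.Assets {
            fmt.Printf("asset: %s (%s)\n", a.URL, a.Tag)
        }
        return crawler.ErrSkipURL
    })
    if err != nil {
        log.Fatal(err)
    }
}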
type Simple ¶
type Simple struct {
// contains filtered or unexported fields
}
Simple is responsible for running a crawl, allowing you to queue new URLs to be crawled and build requests to be crawled.
func (*Simple) Crawl ¶
Crawl will fetch all the linked websites starting from startURL, invoking crawlFn for each fetched URL with either the response or the error.
It will return an error if the crawl was prematurely stopped or could not be started.
Crawl will always add WithOneRequestPerURL to the options of the worker to avoid infinite loops.