Documentation
Overview
Package crawl is a simple link scraper and web crawler scoped to a single domain. Crawling can be bounded by a timeout and interrupted with signals.
Three public functions give access to single-page link scraping (ScrapLinks) and single-host web crawling (FetchLinks and StreamLinks). FetchLinks and StreamLinks produce the same result, as FetchLinks is a wrapper around StreamLinks. The only difference is that FetchLinks blocks and returns once a stopping condition is reached (link tree exhaustion, timeout, or signal), whereas StreamLinks returns immediately, exposing a channel (via the Stream method of the returned CrawlerResults) on which the caller can receive results as they arrive.
The returned links can be used to build a site map.
Some precautions have been taken to prevent infinite loops, such as stripping queries and fragments from URLs.
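The following sketch is not taken from the package; it only illustrates, with the standard net/url package, the kind of normalization described above (the helper name and sample URL are made up for illustration).

package main

import (
	"fmt"
	"net/url"
)

// normalize strips the query and fragment from a raw URL so that
// "https://example.com/page?id=1#top" and "https://example.com/page"
// resolve to the same key, preventing the same page from being
// revisited under different URL variants.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.Fragment = ""
	return u.String(), nil
}

func main() {
	clean, err := normalize("https://example.com/page?id=1#top")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(clean) // https://example.com/page
}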
A sample program calling the package is given in the project repository.
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
Types
type CrawlerResults
type CrawlerResults struct {
// contains filtered or unexported fields
}
CrawlerResults is sent back to the caller, containing results and information about the crawl.
func FetchLinks
func FetchLinks(domain string, timeout time.Duration) (*CrawlerResults, error)
FetchLinks is a wrapper around StreamLinks that behaves the same way, except that it blocks and accumulates all links before returning them to the caller.
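A minimal sketch of a blocking crawl, assuming a placeholder import path (github.com/example/crawl, not the package's real path) and that the domain parameter accepts a full URL:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/example/crawl" // placeholder import path, adjust to the real module
)

func main() {
	// Crawl for at most 30 seconds; FetchLinks blocks until the link
	// tree is exhausted, the timeout expires, or a signal is received.
	results, err := crawl.FetchLinks("https://example.com", 30*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("stopped because:", results.ExitContext())
	for _, link := range results.Links() {
		fmt.Println(link)
	}
}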
func StreamLinks
func StreamLinks(domain string, timeout time.Duration) (*CrawlerResults, error)
StreamLinks starts the crawl and reports links as they come on the channel exposed by the Stream method of the returned CrawlerResults. The caller should range over that channel to continuously retrieve messages. StreamLinks closes the channel when all encountered links have been visited and none are left, when the deadline set by the timeout parameter is reached, or when a SIGINT or SIGTERM signal is received.
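A minimal sketch of the streaming pattern, under the same assumptions as above (placeholder import path, full URL as the domain argument); since the fields of LinkMap are not documented here, each received value is simply printed.

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/example/crawl" // placeholder import path, adjust to the real module
)

func main() {
	// StreamLinks returns immediately; links are delivered on the
	// channel exposed by Stream.
	results, err := crawl.StreamLinks("https://example.com", 30*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	// The channel is closed when the link tree is exhausted, the
	// timeout is reached, or a SIGINT or SIGTERM signal is received.
	for lm := range results.Stream() {
		fmt.Printf("%+v\n", lm)
	}
}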
func (*CrawlerResults) ExitContext
func (cr *CrawlerResults) ExitContext() string
func (*CrawlerResults) Links
func (cr *CrawlerResults) Links() []string
func (*CrawlerResults) Stream
func (cr *CrawlerResults) Stream() <-chan *LinkMap