Documentation
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Config

type Config struct {
    PrivateNetworkDetector PrivateNetworkDetector
    URLGetter              URLGetter
    Graph                  Graph
    Indexer                Indexer
    FetchWorkers           int
}
type Crawler
type Crawler struct {
// contains filtered or unexported fields
}
Crawler implements a web-page crawling pipeline consisting of the following stages:
- Given a URL, retrieve the web-page contents from the remote server.
- Extract and resolve absolute and relative links from the retrieved page.
- Extract page title and text content from the retrieved page.
- Update the link graph: add new links and create edges between the crawled page and the links within it.
- Index crawled page title and text content.
func NewCrawler
func (*Crawler) Crawl
Crawl iterates linkIter and sends each link through the crawler pipeline, returning the total count of links that went through the pipeline. Calls to Crawl block until the link iterator is exhausted, an error occurs, or the context is cancelled. It is safe to call Crawl from concurrent goroutines.
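A minimal usage sketch follows. The import paths, the NewCrawler(cfg Config) constructor, and the Crawl(ctx, linkIter) signature are assumptions inferred from the descriptions on this page, not documented signatures; the dependency values are declared but left nil for brevity and must be real implementations in practice.

package main

import (
    "context"
    "log"
    "time"

    // Hypothetical import paths; substitute the real module paths.
    "example.com/project/crawler"
    "example.com/project/graph"
)

func main() {
    // Wire up the pipeline dependencies. Real implementations must be
    // supplied for each field; they are left nil here for brevity.
    var (
        detector  crawler.PrivateNetworkDetector
        getter    crawler.URLGetter
        linkGraph crawler.Graph
        idx       crawler.Indexer
        linkIter  graph.LinkIterator // assumed iterator type yielding the links to crawl
    )

    c := crawler.NewCrawler(crawler.Config{
        PrivateNetworkDetector: detector,
        URLGetter:              getter,
        Graph:                  linkGraph,
        Indexer:                idx,
        FetchWorkers:           4, // number of concurrent fetch workers (assumed semantics)
    })

    ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
    defer cancel()

    // Crawl blocks until linkIter is exhausted, an error occurs, or ctx
    // is cancelled, and reports how many links went through the pipeline.
    n, err := c.Crawl(ctx, linkIter)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("processed %d links", n)
}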
type Graph

type Graph interface {
    UpsertLink(link *graph.Link) error
    UpsertEdge(edge *graph.Edge) error

    // RemoveStaleEdges removes any edge that originates from the specified
    // link ID and was updated before the specified timestamp.
    RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
}
Graph is implemented by objects that can upsert links and edges into a link graph instance.
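Since the Graph method set is fully specified above, a toy in-memory implementation can illustrate the expected upsert and stale-edge semantics. The Link and Edge shapes below are local stand-ins for graph.Link and graph.Edge (the real types may carry more fields), so this sketch mirrors the documented behavior rather than literally satisfying the package's interface.

package memgraph

import (
    "sync"
    "time"

    "github.com/google/uuid"
)

// Link and Edge are simplified stand-ins for graph.Link and graph.Edge.
type Link struct {
    ID  uuid.UUID
    URL string
}

type Edge struct {
    ID        uuid.UUID
    Src, Dst  uuid.UUID
    UpdatedAt time.Time
}

// MemGraph is a toy, concurrency-safe graph suitable for tests.
type MemGraph struct {
    mu    sync.Mutex
    links map[uuid.UUID]*Link
    edges map[uuid.UUID]*Edge
}

func New() *MemGraph {
    return &MemGraph{
        links: make(map[uuid.UUID]*Link),
        edges: make(map[uuid.UUID]*Edge),
    }
}

// UpsertLink inserts the link or overwrites an existing entry with the same ID.
func (g *MemGraph) UpsertLink(link *Link) error {
    g.mu.Lock()
    defer g.mu.Unlock()
    if link.ID == uuid.Nil {
        link.ID = uuid.New()
    }
    g.links[link.ID] = link
    return nil
}

// UpsertEdge inserts the edge or refreshes its UpdatedAt timestamp.
func (g *MemGraph) UpsertEdge(edge *Edge) error {
    g.mu.Lock()
    defer g.mu.Unlock()
    if edge.ID == uuid.Nil {
        edge.ID = uuid.New()
    }
    edge.UpdatedAt = time.Now()
    g.edges[edge.ID] = edge
    return nil
}

// RemoveStaleEdges drops every edge that originates from fromID and was
// last updated before updatedBefore, matching the documented contract.
func (g *MemGraph) RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error {
    g.mu.Lock()
    defer g.mu.Unlock()
    for id, e := range g.edges {
        if e.Src == fromID && e.UpdatedAt.Before(updatedBefore) {
            delete(g.edges, id)
        }
    }
    return nil
}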
type Indexer
Indexer is implemented by objects that can index the contents of web pages retrieved by the crawler pipeline.
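The Indexer method set is not shown on this page. Assuming a single Index method that accepts the crawled document (a guess mirroring the description above, with a local Document stand-in), a logging no-op implementation might look like:

package indexstub

import "log"

// Document is a local stand-in for whatever document type the real
// indexer consumes; the actual type and fields are not shown on this page.
type Document struct {
    URL     string
    Title   string
    Content string
}

// LogIndexer satisfies an assumed Index(doc *Document) error method by
// logging the document instead of writing it to a search index.
type LogIndexer struct{}

func (LogIndexer) Index(doc *Document) error {
    log.Printf("indexing %q (%s): %d bytes of text", doc.Title, doc.URL, len(doc.Content))
    return nil
}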
type PrivateNetworkDetector
PrivateNetworkDetector is implemented by objects that can detect whether a host resolves to a private network address. It is used as a security mechanism to prevent exposing internal services to the crawler.
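The method set is likewise not shown here. A sketch of the documented behavior, assuming a single IsPrivate(host string) (bool, error) method, can lean on the standard library's address classification helpers:

package main

import (
    "fmt"
    "net"
)

// netDetector resolves a host and reports whether any of its addresses
// fall in a private, loopback, or link-local range. The method name and
// signature are assumptions; only the behavior is documented above.
type netDetector struct{}

func (netDetector) IsPrivate(host string) (bool, error) {
    ips, err := net.LookupIP(host)
    if err != nil {
        return false, err
    }
    for _, ip := range ips {
        // IsPrivate covers the RFC 1918 and RFC 4193 ranges (Go 1.17+).
        if ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
            return true, nil
        }
    }
    return false, nil
}

func main() {
    private, err := netDetector{}.IsPrivate("localhost")
    fmt.Println(private, err) // true <nil>
}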