Documentation
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Config
type Config struct {
	// A PrivateNetworkDetector instance.
	PrivateNetworkDetector PrivateNetworkDetector

	// A URLGetter instance for fetching links.
	URLGetter URLGetter

	// A Graph instance for adding new links to the link graph.
	Graph Graph

	// An Indexer instance for indexing the content of each retrieved link.
	Indexer Indexer

	// The number of concurrent workers used for retrieving links.
	FetchWorkers int
}
Config encapsulates the configuration options for creating a new Crawler.
type Crawler
type Crawler struct {
// contains filtered or unexported fields
}
Crawler implements a web-page crawling pipeline consisting of the following stages:
- Given a URL, retrieve the web-page contents from the remote server.
- Extract and resolve absolute and relative links from the retrieved page.
- Extract page title and text content from the retrieved page.
- Update the link graph: add new links and create edges between the crawled page and the links within it.
- Index crawled page title and text content.
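The link extraction and resolution stage can be illustrated with the standard library alone. The sketch below is not the package's implementation: `resolveLinks` and `hrefRegex` are hypothetical names, and the regular expression is a deliberately simple stand-in for a real HTML parser. It assumes that only `http` and `https` links are of interest and that fragments can be dropped.

```go
package main

import (
	"net/url"
	"regexp"
)

// hrefRegex is a simplified pattern used only for illustration; it matches
// double-quoted href attributes and nothing else.
var hrefRegex = regexp.MustCompile(`(?i)href\s*=\s*"([^"]+)"`)

// resolveLinks is a hypothetical helper. It extracts href values from the
// page content and resolves each one against the URL of the page itself,
// so that relative links become absolute.
func resolveLinks(pageURL, content string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}

	var links []string
	seen := make(map[string]bool)
	for _, match := range hrefRegex.FindAllStringSubmatch(content, -1) {
		ref, err := url.Parse(match[1])
		if err != nil {
			continue // skip malformed links
		}

		// ResolveReference leaves absolute links untouched and resolves
		// relative ones against the base URL.
		abs := base.ResolveReference(ref)
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue // assumption: ignore mailto:, javascript:, etc.
		}

		// Drop the fragment so that page.html#a and page.html#b collapse
		// into a single link.
		abs.Fragment, abs.RawFragment = "", ""

		if s := abs.String(); !seen[s] {
			seen[s] = true
			links = append(links, s)
		}
	}
	return links, nil
}
```

Calling `resolveLinks("https://example.com/docs/intro.html", content)` on a page that links to `/about` and `guide.html#top` would yield `https://example.com/about` and `https://example.com/docs/guide.html`.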
type Graph
type Graph interface {
	// UpsertLink creates a new link or updates an existing link.
	UpsertLink(link *graph.Link) error

	// UpsertEdge creates a new edge or updates an existing edge.
	UpsertEdge(edge *graph.Edge) error

	// RemoveStaleEdges removes any edge that originates from the specified
	// link ID and was updated before the specified timestamp.
	RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
}
Graph is implemented by objects that can upsert links and edges into a link graph instance.
type Indexer
type Indexer interface {
	// Index inserts a new document into the index or updates the index entry
	// for an existing document.
	Index(doc *index.Document) error
}
Indexer is implemented by objects that can index the contents of web-pages retrieved by the crawler pipeline.
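The insert-or-update behaviour described for Index can be sketched with a map keyed by a document identifier. This is only an illustration: `Document` is a hypothetical stand-in for `index.Document`, its fields are assumptions, and `memIndexer` performs no text analysis at all.

```go
package main

import (
	"errors"
	"sync"
)

// Document is a simplified stand-in for index.Document. The field set is an
// assumption made for this sketch.
type Document struct {
	LinkID  string
	URL     string
	Title   string
	Content string
}

// memIndexer is a hypothetical in-memory store mirroring the Index method.
type memIndexer struct {
	mu   sync.RWMutex
	docs map[string]Document
}

func newMemIndexer() *memIndexer {
	return &memIndexer{docs: make(map[string]Document)}
}

// Index inserts the document, or replaces the stored entry when a document
// with the same LinkID has been indexed before.
func (i *memIndexer) Index(doc *Document) error {
	if doc == nil || doc.LinkID == "" {
		return errors.New("index: document must have a link ID")
	}
	i.mu.Lock()
	defer i.mu.Unlock()
	i.docs[doc.LinkID] = *doc
	return nil
}
```

Indexing the same `LinkID` twice leaves a single entry holding the latest title and content, which is the behaviour a crawler needs when it revisits a page.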
type PrivateNetworkDetector
PrivateNetworkDetector is implemented by objects that can detect whether a host resolves to a private network address.
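The interface definition is not shown on this page, so the following is only a sketch of how such a check might look. The type `hostDetector`, the method name `IsPrivate` and its signature are assumptions. The sketch uses `net.IP.IsPrivate` (available since Go 1.17) and also treats loopback and link-local addresses as private, which is a design choice of this example.

```go
package main

import "net"

// hostDetector is a hypothetical detector; the method name and signature
// below are assumptions, not the package's actual interface.
type hostDetector struct{}

// IsPrivate reports whether host is, or resolves to, a private, loopback or
// link-local address. IP literals are checked directly without a DNS lookup.
func (hostDetector) IsPrivate(host string) (bool, error) {
	var ips []net.IP
	if ip := net.ParseIP(host); ip != nil {
		ips = []net.IP{ip}
	} else {
		resolved, err := net.LookupIP(host)
		if err != nil {
			return false, err
		}
		ips = resolved
	}

	// A host counts as private if any of its addresses is private.
	for _, ip := range ips {
		if ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			return true, nil
		}
	}
	return false, nil
}
```

A crawler could consult a check like this before fetching a link and skip any URL whose host points into an internal network.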