Documentation ¶
Index ¶
type Config
type Crawler
type Graph
type GraphAPI
type Indexer
type PrivateNetworkDetector
type Service
    func NewService(cfg ServiceConfig) (*Service, error)
type ServiceConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Config ¶
type Config struct {
    // A PrivateNetworkDetector instance.
    PrivateNetworkDetector PrivateNetworkDetector

    // A URLGetter instance for fetching links.
    URLGetter URLGetter

    // A Graph instance for adding new links to the link graph.
    Graph Graph

    // An Indexer instance for indexing the content of each retrieved link.
    Indexer Indexer

    // The number of concurrent workers used for retrieving links.
    FetchWorkers int
}
Config encapsulates the configuration options for creating a new Crawler.
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
Crawler implements a web-page crawling pipeline consisting of the following stages:
- Given a URL, retrieve the web-page contents from the remote server.
- Extract and resolve absolute and relative links from the retrieved page.
- Extract page title and text content from the retrieved page.
- Update the link graph: add new links and create edges between the crawled page and the links within it.
- Index crawled page title and text content.
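Concretely, a single pass of the pipeline might be driven as in the sketch below. Neither a NewCrawler constructor nor a Crawl method appears in this excerpt, so both are assumptions about the package's exported API, and the import paths are placeholders:

import (
    "context"
    "log"

    "github.com/example/linksrus/crawler"         // assumed import path
    "github.com/example/linksrus/linkgraph/graph" // assumed import path
)

// runPass drives a single pass of the pipeline over the links yielded by
// linkIt. NewCrawler and Crawl are not shown in this excerpt and are
// assumptions about the package's exported API.
func runPass(ctx context.Context, cfg crawler.Config, linkIt graph.LinkIterator) error {
    c := crawler.NewCrawler(cfg)
    processed, err := c.Crawl(ctx, linkIt)
    if err != nil {
        return err
    }
    log.Printf("crawled %d link(s)", processed)
    return nil
}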
type Graph ¶
type Graph interface {
    // UpsertLink creates a new link or updates an existing link.
    UpsertLink(link *graph.Link) error

    // UpsertEdge creates a new edge or updates an existing edge.
    UpsertEdge(edge *graph.Edge) error

    // RemoveStaleEdges removes any edge that originates from the specified
    // link ID and was updated before the specified timestamp.
    RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
}
Graph is implemented by objects that can upsert links and edges into a link graph instance.
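Because the interface is small, a throwaway implementation for exercising the rest of the pipeline is short. A minimal sketch, assuming placeholder import paths for the linkgraph and uuid packages:

import (
    "time"

    "github.com/example/linksrus/linkgraph/graph" // assumed import path
    "github.com/google/uuid"                      // assumed uuid package
)

// discardGraph satisfies Graph but drops every write; useful as a
// stand-in while testing the fetch/extract stages of the pipeline.
type discardGraph struct{}

func (discardGraph) UpsertLink(link *graph.Link) error { return nil }
func (discardGraph) UpsertEdge(edge *graph.Edge) error { return nil }
func (discardGraph) RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error {
    return nil
}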
type GraphAPI ¶
type GraphAPI interface {
    UpsertLink(link *graph.Link) error
    UpsertEdge(edge *graph.Edge) error
    RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error

    Links(fromID, toID uuid.UUID, retrievedBefore time.Time) (graph.LinkIterator, error)
}
GraphAPI defines a set of API methods for accessing the link graph.
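A usage sketch for the extra Links call follows. The iterator's Next/Link/Error/Close methods and the Link.URL field are assumptions about the linkgraph package, and the import paths are placeholders:

import (
    "fmt"
    "time"

    "github.com/example/linksrus/crawler" // assumed import path
    "github.com/google/uuid"              // assumed uuid package
)

// printLinks walks the iterator returned by Links and prints each URL.
// The iterator's method set and the URL field are assumptions here.
func printLinks(api crawler.GraphAPI, from, to uuid.UUID, cutoff time.Time) error {
    it, err := api.Links(from, to, cutoff)
    if err != nil {
        return err
    }
    defer func() { _ = it.Close() }()
    for it.Next() {
        fmt.Println(it.Link().URL)
    }
    return it.Error()
}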
type Indexer ¶
type Indexer interface {
    // Index inserts a new document into the index or updates the index
    // entry for an existing document.
    Index(doc *index.Document) error
}
Indexer is implemented by objects that can index the contents of web-pages retrieved by the crawler pipeline.
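As with Graph, a stand-in implementation is tiny; the sketch below logs each document instead of indexing it (the index package import path is an assumption):

import (
    "log"

    "github.com/example/linksrus/textindexer/index" // assumed import path
)

// logIndexer satisfies Indexer by logging each document it receives;
// handy for smoke-testing the crawler without a search backend.
type logIndexer struct{}

func (logIndexer) Index(doc *index.Document) error {
    log.Printf("would index document: %+v", doc)
    return nil
}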
type PrivateNetworkDetector ¶
PrivateNetworkDetector is implemented by objects that can detect whether a host resolves to a private network address.
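The interface body is not shown in this excerpt. Going by the description, it is most likely a single-method interface along the lines of the sketch below; the method name and signature are assumptions:

// Assumed shape of PrivateNetworkDetector; the method name and signature
// are guesses based on the type's documented behaviour.
type PrivateNetworkDetector interface {
    // IsPrivate returns true if the specified host resolves to a
    // private network address.
    IsPrivate(host string) (bool, error)
}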
type Service ¶
type Service struct {
// contains filtered or unexported fields
}
Service implements the web-crawler component for the Links 'R' Us project.
func NewService ¶
func NewService(cfg ServiceConfig) (*Service, error)
NewService creates a new crawler service instance with the specified config.
type ServiceConfig ¶
type ServiceConfig struct {
    // An API for managing and iterating links and edges in the link graph.
    GraphAPI GraphAPI

    // An API for indexing documents.
    IndexAPI IndexAPI

    // An API for detecting private network addresses. If not specified,
    // a default implementation that handles the private network ranges
    // defined in RFC1918 will be used instead.
    PrivateNetworkDetector PrivateNetworkDetector

    // An API for performing HTTP requests. If not specified,
    // http.DefaultClient will be used instead.
    URLGetter URLGetter

    // An API for detecting the partition assignments for this service.
    PartitionDetector partition.Detector

    // A clock instance for generating time-related events. If not specified,
    // the default wall-clock will be used instead.
    Clock clock.Clock

    // The number of concurrent workers used for retrieving links.
    FetchWorkers int

    // The time between subsequent crawler passes.
    UpdateInterval time.Duration

    // The minimum amount of time before re-indexing an already-crawled link.
    ReIndexThreshold time.Duration

    // The logger to use. If not defined, an output-discarding logger will
    // be used instead.
    Logger *logrus.Entry
}
ServiceConfig encapsulates the settings for configuring the web-crawler service. It is not to be confused with Config, which configures the Crawler pipeline itself.
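A sketch of assembling a ServiceConfig and creating the service with NewService. Only the mandatory fields are set here, so the optional detector, URL getter, clock, and logger fields fall back to the documented defaults; the import paths are placeholders:

import (
    "time"

    "github.com/example/linksrus/crawler"   // assumed import path
    "github.com/example/linksrus/partition" // assumed import path
)

// newCrawlerService wires the mandatory ServiceConfig fields and lets
// the optional ones fall back to their documented defaults.
func newCrawlerService(g crawler.GraphAPI, i crawler.IndexAPI, pd partition.Detector) (*crawler.Service, error) {
    cfg := crawler.ServiceConfig{
        GraphAPI:          g,
        IndexAPI:          i,
        PartitionDetector: pd,
        FetchWorkers:      64,
        UpdateInterval:    5 * time.Minute,
        ReIndexThreshold:  7 * 24 * time.Hour,
    }
    return crawler.NewService(cfg)
}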