Documentation
Overview
Package crawler is a distributed web crawler.
Constants
This section is empty.
Variables

```go
var RobotsPath, _ = url.Parse("/robots.txt")
```

RobotsPath is the robots.txt path.
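Because RobotsPath is stored as an already-parsed *url.URL, it can be resolved against any page URL to locate that site's robots.txt. A minimal sketch (the variable is redeclared locally rather than imported, since the package's import path is not shown on this page):

```go
package main

import (
	"fmt"
	"net/url"
)

// RobotsPath mirrors the package variable: the relative robots.txt path.
var RobotsPath, _ = url.Parse("/robots.txt")

func main() {
	// Resolve the robots.txt location for a page we are about to crawl.
	page, err := url.Parse("https://example.com/blog/post-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(page.ResolveReference(RobotsPath)) // https://example.com/robots.txt
}
```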
Functions
This section is empty.
Types
type Backend

```go
type Backend interface {
	Setup() error
	CrawledAndCount(u, domain string) (time.Time, int, error) // gotta be a better name for this
	Upsert(*document.Document) error
}
```

Backend outlines the methods to save documents and count the documents a domain has.
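As a sketch of what an implementation involves, here is a hypothetical in-memory Backend of the kind one might use in tests. The Document type is a stand-in for the package's document.Document, which lives in a separate package, so only the fields the sketch needs are included:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Document stands in for document.Document; real documents carry more fields.
type Document struct {
	ID     string // the document's URL
	Domain string
}

// memoryBackend is a hypothetical in-memory Backend implementation.
type memoryBackend struct {
	mu      sync.Mutex
	crawled map[string]time.Time // URL -> time it was last crawled
	counts  map[string]int       // domain -> number of documents stored
}

// Setup initializes the maps, mirroring Backend's Setup method.
func (m *memoryBackend) Setup() error {
	m.crawled = map[string]time.Time{}
	m.counts = map[string]int{}
	return nil
}

// CrawledAndCount returns when u was last crawled (zero time if never)
// and how many documents its domain already has.
func (m *memoryBackend) CrawledAndCount(u, domain string) (time.Time, int, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.crawled[u], m.counts[domain], nil
}

// Upsert records the document and bumps the domain's count.
func (m *memoryBackend) Upsert(d *Document) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, seen := m.crawled[d.ID]; !seen {
		m.counts[d.Domain]++ // only count a document the first time we see it
	}
	m.crawled[d.ID] = time.Now()
	return nil
}

func main() {
	b := &memoryBackend{}
	_ = b.Setup()
	_ = b.Upsert(&Document{ID: "https://example.com/a", Domain: "example.com"})
	when, n, _ := b.CrawledAndCount("https://example.com/a", "example.com")
	fmt.Println(when, n) // the crawl time and a count of 1
}
```

The package's ElasticSearch type, documented below, is the concrete Backend used in practice.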
type Crawler

```go
type Crawler struct {
	HTTPClient *http.Client
	UserAgent
	Robots  robots.Cacher
	Queue   queue.Queuer
	Backend
	// contains filtered or unexported fields
}
```

Crawler holds crawler settings for our UserAgent, Seed URLs, etc.
type ElasticSearch

```go
type ElasticSearch struct {
	*document.ElasticSearch
	Bulk *elastic.BulkProcessor
	sync.Mutex
}
```

ElasticSearch satisfies the crawler's Backend interface.
func (*ElasticSearch) CrawledAndCount

CrawledAndCount returns the crawled date of the URL (if any) and the total number of links the domain has.
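A typical reason to ask for both values at once is to decide whether a URL is worth fetching again. The sketch below is not from the package: the shouldCrawl helper, its freshness window, and the per-domain cap are made-up policy values, and the interface is trimmed to the single method used.

```go
package main

import (
	"fmt"
	"time"
)

// counter is a trimmed-down stand-in for the Backend interface,
// declaring only the method this sketch calls.
type counter interface {
	CrawledAndCount(u, domain string) (time.Time, int, error)
}

// shouldCrawl is a hypothetical policy helper: skip a URL if it was crawled
// within maxAge, or if its domain already holds maxDocs documents.
func shouldCrawl(b counter, u, domain string, maxAge time.Duration, maxDocs int) (bool, error) {
	crawled, count, err := b.CrawledAndCount(u, domain)
	if err != nil {
		return false, err
	}
	if !crawled.IsZero() && time.Since(crawled) < maxAge {
		return false, nil // fetched recently enough; skip it
	}
	if count >= maxDocs {
		return false, nil // the domain has hit its document quota
	}
	return true, nil
}

// stub returns fixed values so the sketch runs on its own.
type stub struct{}

func (stub) CrawledAndCount(u, domain string) (time.Time, int, error) {
	return time.Now().Add(-48 * time.Hour), 3, nil // crawled two days ago, 3 docs
}

func main() {
	ok, err := shouldCrawl(stub{}, "https://example.com/a", "example.com", 24*time.Hour, 100)
	fmt.Println(ok, err) // true <nil>: stale enough and under the quota
}
```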
Directories

Path | Synopsis |
---|---|
 | Command crawler demonstrates how to run the crawler |
queue | Package queue manages the queue for a distributed crawler |
robots | Package robots handles caching robots.txt files |