Documentation ¶
Overview ¶
Package crawl implements a basic web crawler for crawling a portion of a web site. Construct a Crawler, configure it, and then call its [Run] method. The crawler stores the crawled data in a storage.DB, and then Crawler.PageWatcher can be used to watch for new pages.
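For example, a minimal end-to-end sketch; the logger, database, and HTTP client here (including the in-memory storage.MemDB) are stand-ins for whatever the surrounding program provides, and New is assumed to take them in the order described under func New below:

lg := slog.Default()
db := storage.MemDB() // assumed in-memory storage.DB, for illustration only
hc := http.DefaultClient

c := crawl.New(lg, db, hc)
c.Add("https://go.dev/")   // root page to start from
c.Allow("https://go.dev/") // stay within this prefix

if err := c.Run(context.Background()); err != nil {
    log.Fatal(err)
}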
Index ¶
- Constants
- type Crawler
- func (c *Crawler) Add(url string)
- func (c *Crawler) Allow(prefix ...string)
- func (c *Crawler) Clean(clean func(*url.URL) error)
- func (c *Crawler) Deny(prefix ...string)
- func (cr *Crawler) DocWatcher() *timed.Watcher[*Page]
- func (c *Crawler) Get(url string) (*Page, bool)
- func (c *Crawler) PageWatcher(name string) *timed.Watcher[*Page]
- func (c *Crawler) Run(ctx context.Context) error
- func (c *Crawler) Set(p *Page)
- func (c *Crawler) SetRecrawl(d time.Duration)
- func (*Crawler) ToDocs(p *Page) (iter.Seq[*docs.Doc], bool)
- type Page
Constants ¶
const DocWatcherID = "crawldocs"
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
A Crawler is a basic web crawler.
Note that this package does not load or process robots.txt. Instead, the assumption is that the site owner is crawling a portion of their own site and will configure the crawler appropriately. (In the case of Go's Oscar instance, we only crawl go.dev.)
func New ¶
New returns a new Crawler that uses the given logger, database, and HTTP client. The caller should configure the Crawler further by calling Crawler.Add, Crawler.Allow, Crawler.Deny, Crawler.Clean, and Crawler.SetRecrawl. Once configured, the crawler can be run by calling Crawler.Run.
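For example, a fuller configuration sketch building on the overview example above; the URLs, the 12-hour interval, and the ctx variable are illustrative, not package defaults:

c := crawl.New(lg, db, hc)
c.Add("https://go.dev/doc/")
c.Allow("https://go.dev/doc/", "https://go.dev/blog/")
c.Deny("https://go.dev/doc/tutorial/") // skip this subtree
c.SetRecrawl(12 * time.Hour)           // recrawl sooner than the 24-hour default

if err := c.Run(ctx); err != nil {
    // handle the crawl error
}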
func (*Crawler) Add ¶
Add adds the URL to the list of roots for the crawl. The added URL must not include a URL fragment (#name).
func (*Crawler) Allow ¶
Allow records that the crawler is allowed to crawl URLs with the given list of prefixes. A URL is considered to match a prefix if one of the following is true:
- The URL is exactly the prefix.
- The URL begins with the prefix, and the prefix ends in /.
- The URL begins with the prefix, and the next character in the URL is / or ?.
The companion function Crawler.Deny records that the crawler is not allowed to crawl URLs with a list of prefixes. When deciding whether a URL can be crawled, longer prefixes take priority over shorter prefixes. If the same prefix is added to both Crawler.Allow and Crawler.Deny, the last call wins. The default outcome is that a URL is not allowed to be crawled.
For example, consider this call sequence:
c.Allow("https://go.dev/a/") c.Allow("https://go.dev/a/b/c") c.Deny("https://go.dev/a/b")
Given these rules, the crawler makes the following decisions about these URLs:
- https://go.dev/a: not allowed
- https://go.dev/a/: allowed
- https://go.dev/a/?x=1: allowed
- https://go.dev/a/x: allowed
- https://go.dev/a/b: not allowed
- https://go.dev/a/b/x: not allowed
- https://go.dev/a/b/c: allowed
- https://go.dev/a/b/c/x: allowed
- https://go.dev/x: not allowed
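The prefix rule itself is compact when written out. The helper below is a hypothetical illustration of the matching rule described above, not the package's implementation:

// matchesPrefix reports whether url matches prefix: the URL is exactly the
// prefix, the URL begins with a prefix ending in "/", or the next character
// of the URL after the prefix is "/" or "?". When deciding allow vs. deny,
// the longest matching prefix wins.
func matchesPrefix(url, prefix string) bool {
    if url == prefix {
        return true
    }
    if !strings.HasPrefix(url, prefix) {
        return false
    }
    if strings.HasSuffix(prefix, "/") {
        return true
    }
    next := url[len(prefix)]
    return next == '/' || next == '?'
}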
func (*Crawler) Clean ¶
Clean adds a cleaning function to the crawler's list of cleaners. Each time the crawler considers queuing a URL to be crawled, it calls the cleaning functions to canonicalize or otherwise clean the URL first. A cleaning function might remove unnecessary URL parameters or canonicalize host names or paths. The Crawler automatically removes any URL fragment before applying registered cleaners.
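For example, a cleaner that drops query parameters and lower-cases the host; the specific canonicalization is illustrative, and cleaners should be chosen to suit the site being crawled:

c.Clean(func(u *url.URL) error {
    u.RawQuery = ""                  // drop query parameters
    u.Host = strings.ToLower(u.Host) // canonicalize the host name
    return nil
})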
func (*Crawler) Deny ¶
Deny records that the crawler is not allowed to crawl URLs with the given list of prefixes. See the Crawler.Allow documentation for details about prefixes and interactions with Allow.
func (*Crawler) DocWatcher ¶
DocWatcher returns the page watcher with name "crawldocs". Implements docs.Source.DocWatcher.
func (*Crawler) Get ¶
Get returns the result of the most recent crawl for the given URL. If the page has been crawled, Get returns a non-nil *Page, true. If the page has not been crawled, Get returns nil, false.
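For example, assuming c is a Crawler that has already run:

if p, ok := c.Get("https://go.dev/doc/"); ok {
    fmt.Printf("crawled %s at %v (%d bytes of HTML)\n", p.URL, p.LastCrawl, len(p.HTML))
} else {
    fmt.Println("page not crawled yet")
}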
func (*Crawler) PageWatcher ¶
PageWatcher returns a timed.Watcher over Pages that the Crawler has stored in its database.
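A sketch of watching for new pages; the Recent and MarkOld calls are assumptions about the timed.Watcher API in the companion timed package:

w := c.PageWatcher("mywatcher") // the name identifies this watcher's position
for p := range w.Recent() {     // assumed: iterate pages stored since the last mark
    fmt.Println("new page:", p.URL)
    w.MarkOld(p.DBTime)         // assumed: record progress past this page
}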
func (*Crawler) Run ¶
Run crawls all the pages it can, returning when the entire site has been crawled, either during this run or within the recrawl duration set by Crawler.SetRecrawl.
func (*Crawler) Set ¶
Set adds p to the crawled page database. It is typically only used for setting up tests.
func (*Crawler) SetRecrawl ¶
SetRecrawl sets the time to wait before recrawling a page. The default is 24 hours.
type Page ¶
type Page struct {
    DBTime    timed.DBTime
    URL       string    // URL of page
    From      string    // a page where we found the link to this one
    LastCrawl time.Time // time of last crawl
    Redirect  string    // HTTP redirect during fetch
    HTML      []byte    // HTML content, if any
    Error     string    // error fetching page, if any
}
A Page records the result of crawling a single page.
func (*Page) LastWritten ¶
LastWritten implements docs.Entry.LastWritten.