crawl

package
v0.0.0-...-0bec3fc

Published: Nov 26, 2024 License: BSD-3-Clause Imports: 16 Imported by: 0

Documentation

Overview

Package crawl implements a basic web crawler for crawling a portion of a web site. Construct a Crawler, configure it, and then call its Run method. The crawler stores the crawled data in a storage.DB, and then Crawler.PageWatcher can be used to watch for new pages.

Index

Constants

const DocWatcherID = "crawldocs"

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

A Crawler is a basic web crawler.

Note that this package does not load or process robots.txt. Instead, the assumption is that the site owner is crawling a portion of their own site and will configure the crawler appropriately. (In the case of Go's Oscar instance, we only crawl go.dev.)

func New

func New(lg *slog.Logger, db storage.DB, hc *http.Client) *Crawler

New returns a new Crawler that uses the given logger, database, and HTTP client. The caller should configure the Crawler further by calling Crawler.Add, Crawler.Allow, Crawler.Deny, Crawler.Clean, and Crawler.SetRecrawl. Once configured, the crawler can be run by calling Crawler.Run.
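
A minimal configuration sketch follows. The import paths and the in-memory database constructor storage.MemDB are assumptions about the surrounding module, not part of this package's documented API; substitute whatever storage.DB implementation you actually use.

package main

import (
	"context"
	"log/slog"
	"net/http"
	"time"

	"golang.org/x/oscar/internal/crawl"   // assumed import path
	"golang.org/x/oscar/internal/storage" // assumed import path
)

func main() {
	lg := slog.Default()
	db := storage.MemDB() // assumed in-memory storage.DB

	c := crawl.New(lg, db, http.DefaultClient)

	// Roots to start from and prefixes the crawler may visit.
	c.Add("https://go.dev/")
	c.Allow("https://go.dev/")
	c.Deny("https://go.dev/play/") // illustrative deny

	// Wait at least 24 hours before recrawling a page (the default).
	c.SetRecrawl(24 * time.Hour)

	if err := c.Run(context.Background()); err != nil {
		lg.Error("crawl failed", "err", err)
	}
}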

func (*Crawler) Add

func (c *Crawler) Add(url string)

Add adds the URL to the list of roots for the crawl. The added URL must not include a URL fragment (#name).

func (*Crawler) Allow

func (c *Crawler) Allow(prefix ...string)

Allow records that the crawler is allowed to crawl URLs with the given list of prefixes. A URL is considered to match a prefix if one of the following is true:

  • The URL is exactly the prefix.
  • The URL begins with the prefix, and the prefix ends in /.
  • The URL begins with the prefix, and the next character in the URL is / or ?.

The companion function Crawler.Deny records that the crawler is not allowed to crawl URLs with a list of prefixes. When deciding whether a URL can be crawled, longer prefixes take priority over shorter prefixes. If the same prefix is added to both Crawler.Allow and Crawler.Deny, the last call wins. The default outcome is that a URL is not allowed to be crawled.

For example, consider this call sequence:

c.Allow("https://go.dev/a/")
c.Allow("https://go.dev/a/b/c")
c.Deny("https://go.dev/a/b")

Given these rules, the crawler makes the following decisions about these URLs:

  • https://go.dev/a/ is allowed (it is exactly the first Allow prefix).
  • https://go.dev/a/x is allowed (it begins with "https://go.dev/a/", which ends in /).
  • https://go.dev/a/b is denied (the longest matching prefix is the Deny prefix "https://go.dev/a/b").
  • https://go.dev/a/b/x is denied (it begins with "https://go.dev/a/b" and the next character is /).
  • https://go.dev/a/b/c is allowed (it is exactly the longest matching prefix, which was passed to Allow).
  • https://go.dev/a/b/c/d is allowed (it begins with "https://go.dev/a/b/c" and the next character is /).
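
The prefix test itself is small enough to state in code. The helper below is only an illustration of the matching rule described above, written against the documented behavior; it is not part of the package's API.

package prefixdemo

import "strings"

// matchesPrefix reports whether url matches prefix under the documented
// rule: the URL is exactly the prefix, or the URL begins with the prefix
// and either the prefix ends in "/" or the next character of the URL is
// "/" or "?".
func matchesPrefix(url, prefix string) bool {
	if url == prefix {
		return true
	}
	if !strings.HasPrefix(url, prefix) {
		return false
	}
	if strings.HasSuffix(prefix, "/") {
		return true
	}
	switch url[len(prefix)] {
	case '/', '?':
		return true
	}
	return false
}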

func (*Crawler) Clean

func (c *Crawler) Clean(clean func(*url.URL) error)

Clean adds a cleaning function to the crawler's list of cleaners. Each time the crawler considers queuing a URL to be crawled, it calls the cleaning functions to canonicalize or otherwise clean the URL first. A cleaning function might remove unnecessary URL parameters or canonicalize host names or paths. The Crawler automatically removes any URL fragment before applying registered cleaners.
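
For example, a cleaner might drop query parameters and strip a leading "www." from host names. This is only a sketch of one possible policy; which parameters and hosts are safe to rewrite depends on the site being crawled. It assumes c is a *Crawler and that "net/url" and "strings" are imported.

c.Clean(func(u *url.URL) error {
	u.RawQuery = ""                             // drop all query parameters
	u.Host = strings.TrimPrefix(u.Host, "www.") // canonicalize the host
	return nil
})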

func (*Crawler) Deny

func (c *Crawler) Deny(prefix ...string)

Deny records that the crawler is not allowed to crawl URLs with the given list of prefixes. See the Crawler.Allow documentation for details about prefixes and interactions with Allow.

func (*Crawler) DocWatcher

func (cr *Crawler) DocWatcher() *timed.Watcher[*Page]

DocWatcher returns the page watcher with name "crawldocs". Implements docs.Source.DocWatcher.

func (*Crawler) Get

func (c *Crawler) Get(url string) (*Page, bool)

Get returns the result of the most recent crawl for the given URL. If the page has been crawled, Get returns a non-nil *Page, true. If the page has not been crawled, Get returns nil, false.
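
A sketch of inspecting a crawl result; the URL is a placeholder, and c is assumed to be a *Crawler that has already run:

p, ok := c.Get("https://go.dev/doc/")
if !ok {
	// The page has not been crawled yet.
	return
}
if p.Error != "" {
	log.Printf("fetch of %s failed: %s", p.URL, p.Error)
} else {
	log.Printf("crawled %s at %v (%d bytes of HTML)", p.URL, p.LastCrawl, len(p.HTML))
}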

func (*Crawler) PageWatcher

func (c *Crawler) PageWatcher(name string) *timed.Watcher[*Page]

PageWatcher returns a timed.Watcher over Pages that the Crawler has stored in its database.
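
A typical loop over a watcher might look like the sketch below. It assumes the timed.Watcher type provides Recent and MarkOld methods for reading new entries and recording progress; check the timed package's documentation before relying on those names. The watcher name "example" is a placeholder.

w := c.PageWatcher("example")
for p := range w.Recent() {
	// Process the newly stored page, then record progress so the
	// next loop resumes after it.
	fmt.Println(p.URL)
	w.MarkOld(p.DBTime)
}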

func (*Crawler) Run

func (c *Crawler) Run(ctx context.Context) error

Run crawls all the pages it can, returning when the entire site has been crawled either during this run or within the recrawl duration set by Crawler.SetRecrawl.

func (*Crawler) Set

func (c *Crawler) Set(p *Page)

Set adds p to the crawled page database. It is typically only used for setting up tests.
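
In a test, a page can be injected directly; the URL and HTML below are placeholders, and the test is assumed to import the package as crawl:

c.Set(&crawl.Page{
	URL:       "https://go.dev/test/",
	LastCrawl: time.Now(),
	HTML:      []byte("<html><body><h1>Test</h1></body></html>"),
})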

func (*Crawler) SetRecrawl

func (c *Crawler) SetRecrawl(d time.Duration)

SetRecrawl sets the time to wait before recrawling a page. The default is 24 hours.

func (*Crawler) ToDocs

func (*Crawler) ToDocs(p *Page) (iter.Seq[*docs.Doc], bool)

ToDocs converts a crawled page to a list of embeddable documents, split into sections using htmlutil.Split.

Implements docs.Source.ToDocs.
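
A sketch of draining the document watcher and converting pages, under the same timed.Watcher assumption noted for Crawler.PageWatcher above; cr is a *Crawler:

w := cr.DocWatcher()
for p := range w.Recent() {
	if seq, ok := cr.ToDocs(p); ok {
		n := 0
		for range seq {
			n++
		}
		log.Printf("%s: %d document sections", p.URL, n)
	}
	w.MarkOld(p.DBTime)
}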

type Page

type Page struct {
	DBTime    timed.DBTime
	URL       string    // URL of page
	From      string    // a page where we found the link to this one
	LastCrawl time.Time // time of last crawl
	Redirect  string    // HTTP redirect during fetch
	HTML      []byte    // HTML content, if any
	Error     string    // error fetching page, if any
}

A Page records the result of crawling a single page.

func (*Page) LastWritten

func (p *Page) LastWritten() timed.DBTime

LastWritten implements docs.Entry.LastWritten.
