crawl

package
v0.0.0-...-0bec3fc

Published: Nov 26, 2024 License: BSD-3-Clause Imports: 16 Imported by: 0

Documentation

Overview

Package crawl implements a basic web crawler for crawling a portion of a web site. Construct a Crawler, configure it, and then call its Run method. The crawler stores the crawled data in a storage.DB, and then Crawler.PageWatcher can be used to watch for new pages.

Index

Constants

const DocWatcherID = "crawldocs"

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

A Crawler is a basic web crawler.

Note that this package does not load or process robots.txt. Instead, the assumption is that the site owner is crawling a portion of their own site and will configure the crawler appropriately. (In the case of Go's Oscar instance, we only crawl go.dev.)

func New

func New(lg *slog.Logger, db storage.DB, hc *http.Client) *Crawler

New returns a new Crawler that uses the given logger, database, and HTTP client. The caller should configure the Crawler further by calling Crawler.Add, Crawler.Allow, Crawler.Deny, Crawler.Clean, and Crawler.SetRecrawl. Once configured, the crawler can be run by calling Crawler.Run.
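
A minimal configuration sketch follows. The import paths and the in-memory database constructor storage.MemDB are assumptions about the surrounding module, not part of this package's documented API; substitute whatever storage.DB implementation you actually use.

package main

import (
	"context"
	"log/slog"
	"net/http"
	"time"

	"golang.org/x/oscar/internal/crawl"   // assumed import path
	"golang.org/x/oscar/internal/storage" // assumed import path
)

func main() {
	lg := slog.Default()
	db := storage.MemDB() // assumed in-memory storage.DB

	c := crawl.New(lg, db, http.DefaultClient)

	// Roots to start from and prefixes the crawler may visit.
	c.Add("https://go.dev/")
	c.Allow("https://go.dev/")
	c.Deny("https://go.dev/play/") // illustrative deny

	// Wait at least 24 hours before recrawling a page (the default).
	c.SetRecrawl(24 * time.Hour)

	if err := c.Run(context.Background()); err != nil {
		lg.Error("crawl failed", "err", err)
	}
}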

func (*Crawler) Add

func (c *Crawler) Add(url string)

Add adds the URL to the list of roots for the crawl. The added URL must not include a URL fragment (#name).

func (*Crawler) Allow

func (c *Crawler) Allow(prefix ...string)

Allow records that the crawler is allowed to crawl URLs with the given list of prefixes. A URL is considered to match a prefix if one of the following is true:

  • The URL is exactly the prefix.
  • The URL begins with the prefix, and the prefix ends in /.
  • The URL begins with the prefix, and the next character in the URL is / or ?.

The companion function Crawler.Deny records that the crawler is not allowed to crawl URLs with a list of prefixes. When deciding whether a URL can be crawled, longer prefixes take priority over shorter prefixes. If the same prefix is added to both Crawler.Allow and Crawler.Deny, the last call wins. The default outcome is that a URL is not allowed to be crawled.

For example, consider this call sequence:

c.Allow("https://go.dev/a/")
c.Allow("https://go.dev/a/b/c")
c.Deny("https://go.dev/a/b")

Given these rules, the crawler makes the following decisions about these URLs:

  • https://go.dev/a/ is allowed (it is exactly the first Allow prefix).
  • https://go.dev/a/x is allowed (it begins with "https://go.dev/a/", which ends in /).
  • https://go.dev/a/b is denied (the longest matching prefix is the Deny prefix "https://go.dev/a/b").
  • https://go.dev/a/b/x is denied (it begins with "https://go.dev/a/b" and the next character is /).
  • https://go.dev/a/b/c is allowed (it is exactly the longest matching prefix, which was passed to Allow).
  • https://go.dev/a/b/c/d is allowed (it begins with "https://go.dev/a/b/c" and the next character is /).
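
The prefix test itself is small enough to state in code. The helper below is only an illustration of the matching rule described above, written against the documented behavior; it is not part of the package's API.

package prefixdemo

import "strings"

// matchesPrefix reports whether url matches prefix under the documented
// rule: the URL is exactly the prefix, or the URL begins with the prefix
// and either the prefix ends in "/" or the next character of the URL is
// "/" or "?".
func matchesPrefix(url, prefix string) bool {
	if url == prefix {
		return true
	}
	if !strings.HasPrefix(url, prefix) {
		return false
	}
	if strings.HasSuffix(prefix, "/") {
		return true
	}
	switch url[len(prefix)] {
	case '/', '?':
		return true
	}
	return false
}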

func (*Crawler) Clean

func (c *Crawler) Clean(clean func(*url.URL) error)

Clean adds a cleaning function to the crawler's list of cleaners. Each time the crawler considers queuing a URL to be crawled, it calls the cleaning functions to canonicalize or otherwise clean the URL first. A cleaning function might remove unnecessary URL parameters or canonicalize host names or paths. The Crawler automatically removes any URL fragment before applying registered cleaners.
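
For example, a cleaner might drop query parameters and strip a leading "www." from host names. This is only a sketch of one possible policy; which parameters and hosts are safe to rewrite depends on the site being crawled. It assumes c is a *Crawler and that "net/url" and "strings" are imported.

c.Clean(func(u *url.URL) error {
	u.RawQuery = ""                             // drop all query parameters
	u.Host = strings.TrimPrefix(u.Host, "www.") // canonicalize the host
	return nil
})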

func (*Crawler) Deny

func (c *Crawler) Deny(prefix ...string)

Deny records that the crawler is not allowed to crawl URLs with the given list of prefixes. See the Crawler.Allow documentation for details about prefixes and interactions with Allow.

func (*Crawler) DocWatcher

func (cr *Crawler) DocWatcher() *timed.Watcher[*Page]

DocWatcher returns the page watcher with name "crawldocs". Implements docs.Source.DocWatcher.

func (*Crawler) Get

func (c *Crawler) Get(url string) (*Page, bool)

Get returns the result of the most recent crawl for the given URL. If the page has been crawled, Get returns a non-nil *Page, true. If the page has not been crawled, Get returns nil, false.
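
A sketch of inspecting a crawl result; the URL is a placeholder, and c is assumed to be a *Crawler that has already run:

p, ok := c.Get("https://go.dev/doc/")
if !ok {
	// The page has not been crawled yet.
	return
}
if p.Error != "" {
	log.Printf("fetch of %s failed: %s", p.URL, p.Error)
} else {
	log.Printf("crawled %s at %v (%d bytes of HTML)", p.URL, p.LastCrawl, len(p.HTML))
}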

func (*Crawler) PageWatcher

func (c *Crawler) PageWatcher(name string) *timed.Watcher[*Page]

PageWatcher returns a timed.Watcher over Pages that the Crawler has stored in its database.
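
A typical loop over a watcher might look like the sketch below. It assumes the timed.Watcher type provides Recent and MarkOld methods for reading new entries and recording progress; check the timed package's documentation before relying on those names. The watcher name "example" is a placeholder.

w := c.PageWatcher("example")
for p := range w.Recent() {
	// Process the newly stored page, then record progress so the
	// next loop resumes after it.
	fmt.Println(p.URL)
	w.MarkOld(p.DBTime)
}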

func (*Crawler) Run

func (c *Crawler) Run(ctx context.Context) error

Run crawls all the pages it can, returning when the entire site has been crawled either during this run or within the recrawl duration set by Crawler.SetRecrawl.

func (*Crawler) Set

func (c *Crawler) Set(p *Page)

Set adds p to the crawled page database. It is typically only used for setting up tests.
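
In a test, a page can be injected directly; the URL and HTML below are placeholders, and the test is assumed to import the package as crawl:

c.Set(&crawl.Page{
	URL:       "https://go.dev/test/",
	LastCrawl: time.Now(),
	HTML:      []byte("<html><body><h1>Test</h1></body></html>"),
})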

func (*Crawler) SetRecrawl

func (c *Crawler) SetRecrawl(d time.Duration)

SetRecrawl sets the time to wait before recrawling a page. The default is 24 hours.

func (*Crawler) ToDocs

func (*Crawler) ToDocs(p *Page) (iter.Seq[*docs.Doc], bool)

ToDocs converts a crawled page to a list of embeddable documents, split into sections using htmlutil.Split.

Implements docs.Source.ToDocs.
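
A sketch of draining the document watcher and converting pages, under the same timed.Watcher assumption noted for Crawler.PageWatcher above; cr is a *Crawler:

w := cr.DocWatcher()
for p := range w.Recent() {
	if seq, ok := cr.ToDocs(p); ok {
		n := 0
		for range seq {
			n++
		}
		log.Printf("%s: %d document sections", p.URL, n)
	}
	w.MarkOld(p.DBTime)
}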

type Page

type Page struct {
	DBTime    timed.DBTime
	URL       string    // URL of page
	From      string    // a page where we found the link to this one
	LastCrawl time.Time // time of last crawl
	Redirect  string    // HTTP redirect during fetch
	HTML      []byte    // HTML content, if any
	Error     string    // error fetching page, if any
}

A Page records the result of crawling a single page.

func (*Page) LastWritten

func (p *Page) LastWritten() timed.DBTime

LastWritten implements docs.Entry.LastWritten.
