package crawler

v0.0.0-...-e97be17

Published: Oct 21, 2022 License: MIT Imports: 18 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	PrivateNetworkDetector PrivateNetworkDetector
	URLGetter              URLGetter
	Graph                  Graph
	Indexer                Indexer
	FetchWorkers           int
}

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler implements a web-page crawling pipeline consisting of the following stages:

  • Given a URL, retrieve the web-page contents from the remote server.
  • Extract and resolve absolute and relative links from the retrieved page.
  • Extract page title and text content from the retrieved page.
  • Update the link graph: add new links and create edges between the crawled page and the links within it.
  • Index crawled page title and text content.
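
The second stage above, resolving absolute and relative links against the URL of the page they were found on, can be sketched with the standard library's net/url package. The resolveLink helper is illustrative only and is not part of this package:

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink resolves a link found on a page (absolute or relative)
// against the URL of the page it was found on. Hypothetical helper,
// not part of the crawler package.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference returns ref unchanged when it is already absolute,
	// so the same code path handles both absolute and relative links.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, _ := resolveLink("https://example.com/blog/", "../about")
	fmt.Println(abs) // https://example.com/about
}
```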

func NewCrawler

func NewCrawler(cfg Config) *Crawler

func (*Crawler) Crawl

func (c *Crawler) Crawl(ctx context.Context, linkIter graph.LinkIterator) (int, error)

Crawl iterates over linkIter, sending each link through the crawler pipeline, and returns the total number of links that passed through the pipeline. Calls to Crawl block until the link iterator is exhausted, an error occurs, or the context is cancelled. Crawl is safe for concurrent use by multiple goroutines.

type Graph

type Graph interface {
	UpsertLink(link *graph.Link) error
	UpsertEdge(edge *graph.Edge) error

	// RemoveStaleEdges removes any edge that originates from the specified
	// link ID and was updated before the specified timestamp.
	RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
}

Graph is implemented by objects that can upsert links and edges into a link graph instance.

type Indexer

type Indexer interface {
	Index(doc *indexer.Document) error
}

Indexer is implemented by objects that can index the contents of web-pages retrieved by the crawler pipeline.
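
For tests, a hand-rolled in-memory recorder is a lightweight alternative to the generated GoMock mocks listed under Directories. The Document type below is a stand-in for indexer.Document, whose real fields are defined outside this package:

```go
package main

import "fmt"

// Document is a stand-in for indexer.Document; the fields here are
// illustrative assumptions, not the package's real definition.
type Document struct {
	URL, Title, Content string
}

// memIndexer records every document it is asked to index, mirroring the
// shape of the package's Indexer interface for use in tests.
type memIndexer struct {
	docs []Document
}

func (m *memIndexer) Index(doc *Document) error {
	m.docs = append(m.docs, *doc)
	return nil
}

func main() {
	idx := &memIndexer{}
	idx.Index(&Document{URL: "https://example.com", Title: "Example"})
	fmt.Println(len(idx.docs)) // 1
}
```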

type PrivateNetworkDetector

type PrivateNetworkDetector interface {
	IsPrivate(host string) (bool, error)
}

PrivateNetworkDetector is implemented by objects that can detect whether a host resolves to a private network address. It is used as a security mechanism to prevent exposing internal services to the crawler.
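
One possible implementation, sketched here with the standard library only (netDetector is a hypothetical name, not part of this package): resolve the host and report whether any resulting address is private, loopback, or link-local. Checking every resolved address matters because an attacker can publish a DNS record that points a public-looking name at an internal IP:

```go
package main

import (
	"fmt"
	"net"
)

// netDetector is a sketch of a PrivateNetworkDetector: it resolves the
// host and reports whether any of its addresses falls in a private,
// loopback, or link-local range.
type netDetector struct{}

func (netDetector) IsPrivate(host string) (bool, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return false, err
	}
	for _, ip := range ips {
		if ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// IP literals resolve without a DNS round-trip.
	private, _ := netDetector{}.IsPrivate("10.0.0.1")
	public, _ := netDetector{}.IsPrivate("8.8.8.8")
	fmt.Println(private, public) // true false
}
```

net.IP.IsPrivate (Go 1.17+) covers the RFC 1918 and RFC 4193 ranges; loopback and link-local addresses are checked separately because IsPrivate does not include them.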

type URLGetter

type URLGetter interface {
	Get(url string) (*http.Response, error)
}

URLGetter is implemented by objects that can perform HTTP GET requests.

Directories

Path Synopsis
mock_crawler Package mock_crawler is a generated GoMock package.
