Documentation ¶
Index ¶
type Config
type Crawler
type Graph
type GraphAPI
type Indexer
type PrivateNetworkDetector
type Service
    func NewService(cfg ServiceConfig) (*Service, error)
type ServiceConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Config ¶
type Config struct {
    // A PrivateNetworkDetector instance.
    PrivateNetworkDetector PrivateNetworkDetector

    // A URLGetter instance for fetching links.
    URLGetter URLGetter

    // A Graph instance for adding new links to the link graph.
    Graph Graph

    // An Indexer instance for indexing the content of each retrieved link.
    Indexer Indexer

    // The number of concurrent workers used for retrieving links.
    FetchWorkers int
}
Config encapsulates the configuration options for creating a new Crawler.
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
Crawler implements a web-page crawling pipeline consisting of the following stages:
- Given a URL, retrieve the web-page contents from the remote server.
- Extract and resolve absolute and relative links from the retrieved page.
- Extract page title and text content from the retrieved page.
- Update the link graph: add new links and create edges between the crawled page and the links within it.
- Index crawled page title and text content.
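Concretely, a single pass of the pipeline might be driven as in the sketch below. Neither a NewCrawler constructor nor a Crawl method appears in this excerpt, so both are assumptions about the package's exported API, and the import paths are placeholders:

import (
    "context"
    "log"

    "github.com/example/linksrus/crawler"         // assumed import path
    "github.com/example/linksrus/linkgraph/graph" // assumed import path
)

// runPass drives a single pass of the pipeline over the links yielded by
// linkIt. NewCrawler and Crawl are not shown in this excerpt and are
// assumptions about the package's exported API.
func runPass(ctx context.Context, cfg crawler.Config, linkIt graph.LinkIterator) error {
    c := crawler.NewCrawler(cfg)
    processed, err := c.Crawl(ctx, linkIt)
    if err != nil {
        return err
    }
    log.Printf("crawled %d link(s)", processed)
    return nil
}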
type Graph ¶
type Graph interface {
    // UpsertLink creates a new link or updates an existing link.
    UpsertLink(link *graph.Link) error

    // UpsertEdge creates a new edge or updates an existing edge.
    UpsertEdge(edge *graph.Edge) error

    // RemoveStaleEdges removes any edge that originates from the specified
    // link ID and was updated before the specified timestamp.
    RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error
}
Graph is implemented by objects that can upsert links and edges into a link graph instance.
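Because the interface is small, a throwaway implementation for exercising the rest of the pipeline is short. A minimal sketch, assuming placeholder import paths for the linkgraph and uuid packages:

import (
    "time"

    "github.com/example/linksrus/linkgraph/graph" // assumed import path
    "github.com/google/uuid"                      // assumed uuid package
)

// discardGraph satisfies Graph but drops every write; useful as a
// stand-in while testing the fetch/extract stages of the pipeline.
type discardGraph struct{}

func (discardGraph) UpsertLink(link *graph.Link) error { return nil }
func (discardGraph) UpsertEdge(edge *graph.Edge) error { return nil }
func (discardGraph) RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error {
    return nil
}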
type GraphAPI ¶
type GraphAPI interface {
    UpsertLink(link *graph.Link) error
    UpsertEdge(edge *graph.Edge) error
    RemoveStaleEdges(fromID uuid.UUID, updatedBefore time.Time) error

    Links(fromID, toID uuid.UUID, retrievedBefore time.Time) (graph.LinkIterator, error)
}
GraphAPI defines a set of API methods for accessing the link graph.
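A usage sketch for the extra Links call follows. The iterator's Next/Link/Error/Close methods and the Link.URL field are assumptions about the linkgraph package, and the import paths are placeholders:

import (
    "fmt"
    "time"

    "github.com/example/linksrus/crawler" // assumed import path
    "github.com/google/uuid"              // assumed uuid package
)

// printLinks walks the iterator returned by Links and prints each URL.
// The iterator's method set and the URL field are assumptions here.
func printLinks(api crawler.GraphAPI, from, to uuid.UUID, cutoff time.Time) error {
    it, err := api.Links(from, to, cutoff)
    if err != nil {
        return err
    }
    defer func() { _ = it.Close() }()
    for it.Next() {
        fmt.Println(it.Link().URL)
    }
    return it.Error()
}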
type Indexer ¶
type Indexer interface {
    // Index inserts a new document into the index or updates the index
    // entry for an existing document.
    Index(doc *index.Document) error
}
Indexer is implemented by objects that can index the contents of web-pages retrieved by the crawler pipeline.
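As with Graph, a stand-in implementation is tiny; the sketch below logs each document instead of indexing it (the index package import path is an assumption):

import (
    "log"

    "github.com/example/linksrus/textindexer/index" // assumed import path
)

// logIndexer satisfies Indexer by logging each document it receives;
// handy for smoke-testing the crawler without a search backend.
type logIndexer struct{}

func (logIndexer) Index(doc *index.Document) error {
    log.Printf("would index document: %+v", doc)
    return nil
}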
type PrivateNetworkDetector ¶
PrivateNetworkDetector is implemented by objects that can detect whether a host resolves to a private network address.
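The interface body is not shown in this excerpt. Going by the description, it is most likely a single-method interface along the lines of the sketch below; the method name and signature are assumptions:

// Assumed shape of PrivateNetworkDetector; the method name and signature
// are guesses based on the type's documented behaviour.
type PrivateNetworkDetector interface {
    // IsPrivate returns true if the specified host resolves to a
    // private network address.
    IsPrivate(host string) (bool, error)
}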
type Service ¶
type Service struct {
// contains filtered or unexported fields
}
Service implements the web-crawler component for the Links 'R' Us project.
func NewService ¶
func NewService(cfg ServiceConfig) (*Service, error)
NewService creates a new crawler service instance with the specified config.
type ServiceConfig ¶
type ServiceConfig struct {
    // An API for managing and iterating links and edges in the link graph.
    GraphAPI GraphAPI

    // An API for indexing documents.
    IndexAPI IndexAPI

    // An API for detecting private network addresses. If not specified,
    // a default implementation that handles the private network ranges
    // defined in RFC1918 will be used instead.
    PrivateNetworkDetector PrivateNetworkDetector

    // An API for performing HTTP requests. If not specified,
    // http.DefaultClient will be used instead.
    URLGetter URLGetter

    // An API for detecting the partition assignments for this service.
    PartitionDetector partition.Detector

    // A clock instance for generating time-related events. If not specified,
    // the default wall-clock will be used instead.
    Clock clock.Clock

    // The number of concurrent workers used for retrieving links.
    FetchWorkers int

    // The time between subsequent crawler passes.
    UpdateInterval time.Duration

    // The minimum amount of time before re-indexing an already-crawled link.
    ReIndexThreshold time.Duration

    // The logger to use. If not defined, an output-discarding logger will
    // be used instead.
    Logger *logrus.Entry
}
ServiceConfig encapsulates the settings for configuring the web-crawler service. It is not to be confused with Config, which configures the Crawler pipeline itself.
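A sketch of assembling a ServiceConfig and creating the service with NewService. Only the mandatory fields are set here, so the optional detector, URL getter, clock, and logger fields fall back to the documented defaults; the import paths are placeholders:

import (
    "time"

    "github.com/example/linksrus/crawler"   // assumed import path
    "github.com/example/linksrus/partition" // assumed import path
)

// newCrawlerService wires the mandatory ServiceConfig fields and lets
// the optional ones fall back to their documented defaults.
func newCrawlerService(g crawler.GraphAPI, i crawler.IndexAPI, pd partition.Detector) (*crawler.Service, error) {
    cfg := crawler.ServiceConfig{
        GraphAPI:          g,
        IndexAPI:          i,
        PartitionDetector: pd,
        FetchWorkers:      64,
        UpdateInterval:    5 * time.Minute,
        ReIndexThreshold:  7 * 24 * time.Hour,
    }
    return crawler.NewService(cfg)
}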