package crawler

v0.0.0-...-a08b8de

Published: Dec 21, 2023 License: MIT Imports: 13 Imported by: 1

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Anchor

type Anchor struct {
	Href  string
	Title string
}

Anchor represents an HTML anchor tag.

type Collector

type Collector interface {
	Collect(url *url.URL) (*http.Response, []Anchor, error)
}
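
Any type with a matching Collect method satisfies this interface. Below is a minimal sketch of a custom Collector that fetches each page but reports no anchors, so a crawl using it never fans out; the import path is a placeholder for this module's real path:

package collectors

import (
	"net/http"
	"net/url"

	"example.com/crawler" // hypothetical import path for this package
)

// StatusCollector is a sketch of a custom Collector: it performs the GET
// request but extracts no anchors from the response body.
type StatusCollector struct{}

func (StatusCollector) Collect(u *url.URL) (*http.Response, []crawler.Anchor, error) {
	resp, err := http.Get(u.String())
	if err != nil {
		// Match the interface contract: return the error with a nil response.
		return nil, nil, err
	}
	return resp, nil, nil
}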

type Crawler

type Crawler struct {
	URL *url.URL

	Collector
	Processor
	Filter
	Executor *executor.Executor
}
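
The embedded Collector, Processor, and Filter fields mean any implementations of those interfaces can be plugged in. A wiring sketch, assuming a placeholder import path and an assumed anchor regex (the package presumably supplies its own); the Executor field is left unset here:

package main

import (
	"fmt"
	"log"
	"net/url"
	"regexp"
	"sync"

	"example.com/crawler" // hypothetical import path
)

func main() {
	seed, err := url.Parse("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	c := crawler.Crawler{
		URL: seed,
		Collector: &crawler.URLCollector{
			URLMap:      make(map[uint64]bool),
			AnchorRegex: regexp.MustCompile(`href="([^"]*)"`), // assumed pattern
			Mutex:       &sync.Mutex{},
		},
		Processor: crawler.DefaultProcessor{},
		Filter:    crawler.CrossDomainFilter{Domain: "example.com"},
		// Executor is left nil; constructing an executor.Executor is not
		// shown in this sketch.
	}
	fmt.Println(c.URL)
}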

type CrossDomainFilter

type CrossDomainFilter struct {
	Domain string
}

func (CrossDomainFilter) Filter

func (filter CrossDomainFilter) Filter(urls []string) []string
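
The method's behavior is not documented here; assuming it keeps only URLs on the configured Domain, usage would look like:

f := crawler.CrossDomainFilter{Domain: "example.com"}
kept := f.Filter([]string{
	"https://example.com/docs",
	"https://other.org/page",
})
fmt.Println(kept) // assuming same-domain filtering: [https://example.com/docs]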

type DefaultProcessor

type DefaultProcessor struct{}

DefaultProcessor is the default implementation of the crawler.Processor interface. It simply creates an appropriate Report instance from the results of the collector.

func (DefaultProcessor) Process

func (processor DefaultProcessor) Process(URL *url.URL, res *http.Response, anchors []Anchor,
	err error) executor.Report

Process creates a Report instance from the given parameters. Note that when the http.Response is nil, the HTTPStatus in the Report is set to 0.
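
A sketch of the nil-response case, assuming the returned executor.Report exposes the Status method that Report implements:

u, _ := url.Parse("https://unreachable.invalid/")
rep := crawler.DefaultProcessor{}.Process(u, nil, nil, errors.New("dial failed"))
fmt.Println(rep.Status()) // prints 0 because the response was nil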

type Filter

type Filter interface {
	Filter(urls []string) []string
}
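
To illustrate the interface, here is a hypothetical custom Filter that restricts the crawl frontier to HTTPS links:

package filters

import "strings"

// HTTPSOnlyFilter is a hypothetical Filter that keeps only https:// URLs,
// dropping everything else from the crawl frontier.
type HTTPSOnlyFilter struct{}

func (HTTPSOnlyFilter) Filter(urls []string) []string {
	var kept []string
	for _, u := range urls {
		if strings.HasPrefix(u, "https://") {
			kept = append(kept, u)
		}
	}
	return kept
}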

type NoneFilter

type NoneFilter struct{}

func (NoneFilter) Filter

func (filter NoneFilter) Filter(urls []string) []string

type Processor

type Processor interface {
	Process(url *url.URL, response *http.Response, connectedURLs []Anchor, err error) executor.Report
}

type Report

type Report struct {
	Url        string
	HTTPStatus int
	Err        string
	Anchors    []Anchor
}

Report represents the results of the crawling task of a single URL.

func (Report) Status

func (report Report) Status() int

Status returns the HTTPStatus of the GET request on the URL of this report.

func (Report) String

func (report Report) String() string
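
A sketch of building and reading a Report directly; the exact String output format is unspecified:

rep := crawler.Report{
	Url:        "https://example.com",
	HTTPStatus: 200,
	Anchors:    []crawler.Anchor{{Href: "/about", Title: "About"}},
}
fmt.Println(rep.Status()) // 200
fmt.Println(rep)          // formatted by the String method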

type Task

type Task struct {
	Crawler Crawler
}

func (Task) Execute

func (task Task) Execute() executor.Report
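
Execute presumably runs one crawl step with the embedded Crawler and returns its report. Continuing the Crawler wiring sketch above:

task := crawler.Task{Crawler: c} // c assembled as in the Crawler sketch
report := task.Execute()
fmt.Println(report)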

func (Task) String

func (task Task) String() string

type URLCollector

type URLCollector struct {
	URLMap      map[uint64]bool
	AnchorRegex *regexp.Regexp
	*sync.Mutex
}

URLCollector is an implementation of the Collector interface that collects Anchors from the pages the crawler visits.

func (*URLCollector) Collect

func (collector *URLCollector) Collect(URL *url.URL) (*http.Response, []Anchor, error)

Collect gathers all the URLs that appear in href="..." attributes on the fetched page.
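
A usage sketch; the uint64 keys in URLMap suggest it deduplicates visited URLs by hash, and the anchor regex shown is an assumption since the package likely ships its own:

collector := &crawler.URLCollector{
	URLMap:      make(map[uint64]bool),                // shared dedupe map
	AnchorRegex: regexp.MustCompile(`href="([^"]*)"`), // assumed pattern
	Mutex:       &sync.Mutex{},                        // guards URLMap across goroutines
}
u, _ := url.Parse("https://example.com")
resp, anchors, err := collector.Collect(u)
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()
for _, a := range anchors {
	fmt.Println(a.Href, a.Title)
}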
