crawler

package

v0.0.0-...-0a64c4a Published: Oct 31, 2018 License: MIT Imports: 15 Imported by: 0

Repository

github.com/katzien/crawler

Documentation ¶

Index ¶

Constants
Variables
func Graph(s Sitemap) error
func Text(s Sitemap) (string, error)
type CanonicalURL
type Crawler
- func NewCrawler(start *url.URL, depth int) Crawler
- func (c *Crawler) Crawl(ctx context.Context) Sitemap
type Links
type Page
type Parser
- func NewParser(domainScheme string, domainHost string) Parser
type Sitemap

Constants ¶

const (
	// DefaultOutputFileDot is the .dot file location to save the sitemap graph information to.
	DefaultOutputFileDot = "sitemap.dot"

	// DefaultOutputFileSvg is the .svg file location to save the sitemap graph to.
	DefaultOutputFileSvg = "sitemap.svg"
)

const FetchTimeout = 5 * time.Second

FetchTimeout is the maximum amount of time the parser spends trying to fetch a single page.

Variables ¶

var (
	// ErrExternalDomain is returned when the given URL redirects to a domain outside the starting domain.
	ErrExternalDomain = errors.New("URL is outside the starting domain, ignoring")

	// ErrTooManyRedirects is returned after 10 consecutive redirects from a given URL
	ErrTooManyRedirects = errors.New("stopped after 10 redirects")
)

Functions ¶

func Graph ¶

func Graph(s Sitemap) error

Graph renders the given sitemap as a graph saved in an SVG file. The graph is generated with dot, a Graphviz tool, which is invoked through exec and is assumed to be installed already. The sitemap data is first written to a .dot file, which is then passed to the dot command as its source.
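
The two steps described above (write a .dot file, then run dot on it) can be sketched as follows. This is an illustration, not the package's code: toDot and renderSVG are hypothetical names, the exact DOT layout is an assumption, and the types are local copies of the documented ones.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sort"
	"strings"
)

// Local copies of the documented types, so the sketch compiles on its own.
type (
	CanonicalURL string
	Links        []string
	Sitemap      map[CanonicalURL]Links
)

// toDot renders s as a DOT digraph with one node per page and one edge
// per (page, link) pair. Pages are sorted so the output is deterministic.
func toDot(s Sitemap) string {
	pages := make([]string, 0, len(s))
	for p := range s {
		pages = append(pages, string(p))
	}
	sort.Strings(pages)

	var b strings.Builder
	b.WriteString("digraph sitemap {\n")
	for _, p := range pages {
		fmt.Fprintf(&b, "\t%q;\n", p)
		for _, l := range s[CanonicalURL(p)] {
			fmt.Fprintf(&b, "\t%q -> %q;\n", p, l)
		}
	}
	b.WriteString("}\n")
	return b.String()
}

// renderSVG writes the DOT source to dotPath, then asks the dot binary to
// produce an SVG at svgPath. It returns an error if dot is not installed.
func renderSVG(s Sitemap, dotPath, svgPath string) error {
	if err := os.WriteFile(dotPath, []byte(toDot(s)), 0o644); err != nil {
		return err
	}
	return exec.Command("dot", "-Tsvg", "-o", svgPath, dotPath).Run()
}
```

With the documented defaults, the call would look like renderSVG(s, "sitemap.dot", "sitemap.svg").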

func Text ¶

func Text(s Sitemap) (string, error)

Text renders the given sitemap as a list of pages and links found.
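
The actual output format of Text is not documented here. As a rough sketch of one possible layout, the hypothetical toText function below lists each page followed by its links, indented; the types are local copies of the documented ones.

```go
package main

import (
	"sort"
	"strings"
)

// Local copies of the documented types, so the sketch compiles on its own.
type (
	CanonicalURL string
	Links        []string
	Sitemap      map[CanonicalURL]Links
)

// toText lists every page on its own line, followed by one indented line
// per link found on that page. Pages are sorted for stable output.
// The layout is an assumption, not the format the package produces.
func toText(s Sitemap) string {
	pages := make([]string, 0, len(s))
	for p := range s {
		pages = append(pages, string(p))
	}
	sort.Strings(pages)

	var b strings.Builder
	for _, p := range pages {
		b.WriteString(p + "\n")
		for _, l := range s[CanonicalURL(p)] {
			b.WriteString("  -> " + l + "\n")
		}
	}
	return b.String()
}
```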

Types ¶

type CanonicalURL ¶

type CanonicalURL string

CanonicalURL represents the normalised page URL (a full URL with no query params or fragments).
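
A minimal sketch of this normalisation using net/url is shown below. The canonicalise helper is hypothetical and only strips the query string and fragment; it does not resolve relative URLs, and the package may apply further rules.

```go
package main

import "net/url"

// CanonicalURL is a local copy of the documented type.
type CanonicalURL string

// canonicalise parses raw and returns it with the query string and
// fragment removed. Scheme, host and path are left untouched.
func canonicalise(raw string) (CanonicalURL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.ForceQuery = false
	u.Fragment = ""
	u.RawFragment = ""
	return CanonicalURL(u.String()), nil
}
```

For example, under this sketch https://example.com/docs?page=2#intro and https://example.com/docs both map to the same key, so a crawler would visit the page only once.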

type Crawler ¶

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler is used to crawl a given starting URL, up to a max depth.

func NewCrawler ¶

func NewCrawler(start *url.URL, depth int) Crawler

NewCrawler returns an instance of the Crawler with all its required properties initialised.

func (*Crawler) Crawl ¶

func (c *Crawler) Crawl(ctx context.Context) Sitemap

Crawl crawls from the starting URL given to the Crawler. It returns a Sitemap with the results once the maximum depth is reached or no new pages are found. Crawl accepts a cancellable context; when the context is cancelled, it stops crawling and returns the results collected so far.

type Links ¶

type Links []string

Links is a slice containing links found on a given page.

type Page ¶

type Page struct {
	Addr  CanonicalURL
	Links Links
}

Page defines the data structure representing a single web page. Addr is the full URL of the page with no query params or fragments. Links is a collection of links found on the page.

type Parser ¶

type Parser struct {
	// contains filtered or unexported fields
}

Parser parses the DOM of a single web page.

func NewParser ¶

func NewParser(domainScheme string, domainHost string) Parser

NewParser returns an instance of the Parser with all its required properties initialised. The given domain scheme and host are applied to any relative URLs found on the page.
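
To illustrate how a scheme and host can be applied to relative links, the sketch below uses url.URL.ResolveReference. The resolve helper is hypothetical, and resolving path-relative links against the site root is an assumption; the parser may resolve them differently.

```go
package main

import "net/url"

// resolve turns href into an absolute URL. Relative references take their
// scheme and host from the arguments; absolute URLs are returned unchanged.
// Path-relative links are resolved against the site root ("/").
func resolve(scheme, host, href string) (string, error) {
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	base := &url.URL{Scheme: scheme, Host: host, Path: "/"}
	return base.ResolveReference(ref).String(), nil
}
```

A call such as resolve("https", "example.com", "/about") yields an absolute URL on the starting domain, which can then be compared against that domain or normalised further.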

type Sitemap ¶

type Sitemap map[CanonicalURL]Links

Sitemap is the data structure holding the current sitemap information. It maps each page URL to the links found on that page.
