crawl

package
v0.0.0-...-ee38f16
Published: Jan 27, 2018 License: Apache-2.0 Imports: 8 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Crawl

func Crawl(id int, userAgent string, waiting <-chan *url.URL, processed chan<- *url.URL, content chan<- string)

Crawl reads URLs from the `waiting` channel, places the fetched content on the `content` channel, and places each URL on the `processed` channel.
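
A minimal usage sketch based only on the signature above; the import path, channel buffering, and user-agent string are assumptions, not part of this package.

```go
package main

import (
	"fmt"
	"net/url"

	"example.com/crawl" // assumed import path; substitute the module's real path
)

func main() {
	waiting := make(chan *url.URL, 1)
	processed := make(chan *url.URL, 1)
	content := make(chan string, 1)

	// Start one crawler worker; more workers can share the same channels.
	go crawl.Crawl(1, "my-crawler/0.1", waiting, processed, content)

	u, _ := url.Parse("https://example.com/")
	waiting <- u

	fmt.Println("crawled:", (<-processed).String())
	fmt.Println("HTML length:", len(<-content))
}
```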

func Download

func Download(id int, userAgent string, dir string, wainting <-chan *url.URL, processed chan<- *url.URL)

Download downloads the resource at each URI read from the `waiting` channel into the given `dir` and puts the URI on the `processed` channel.
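
A hedged sketch of running Download as a worker, again assuming the import path; the target directory and user-agent string are placeholders.

```go
package main

import (
	"fmt"
	"net/url"

	"example.com/crawl" // assumed import path; substitute the module's real path
)

func main() {
	waiting := make(chan *url.URL, 1)
	processed := make(chan *url.URL, 1)

	// One download worker writing resources into ./assets.
	go crawl.Download(1, "my-crawler/0.1", "./assets", waiting, processed)

	img, _ := url.Parse("https://example.com/logo.png")
	waiting <- img

	fmt.Println("downloaded:", (<-processed).String())
}
```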

func Harvest

func Harvest(id int, domain *url.URL, filterPattern string, content <-chan string, sites, images chan<- *url.URL)

Harvest extracts URIs from anchor (`a`) and image (`img`) tags in the HTML strings read from the `content` channel and sends them to the `sites` and `images` channels. `domain` is used to resolve relative URLs to absolute URLs.
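
A sketch of feeding Harvest one HTML string, assuming anchor URIs go to `sites` and image URIs to `images`, and that a `filterPattern` of `".*"` matches everything; these routing and pattern semantics are inferred from the description above.

```go
package main

import (
	"fmt"
	"net/url"

	"example.com/crawl" // assumed import path; substitute the module's real path
)

func main() {
	domain, _ := url.Parse("https://example.com")

	content := make(chan string, 1)
	sites := make(chan *url.URL, 8)
	images := make(chan *url.URL, 8)

	go crawl.Harvest(1, domain, ".*", content, sites, images)

	content <- `<a href="/about">About</a><img src="/logo.png">`

	// Relative URLs are expected to be resolved against domain.
	fmt.Println("site:", (<-sites).String())
	fmt.Println("image:", (<-images).String())
}
```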

func MakeURIParser

func MakeURIParser(tag, element string, domain *url.URL, filterPattern string) func(html string) []*url.URL

MakeURIParser returns a function that takes an `html` string and returns the URIs within it that match a given regex pattern. If a parsed URI is a relative URL, the `domain` URL is used to resolve it to an absolute URL.
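
A sketch of building and applying a parser; the `"a"`/`"href"` tag-and-element pair and the `".*"` pattern are illustrative assumptions based on the Harvest description.

```go
package main

import (
	"fmt"
	"net/url"

	"example.com/crawl" // assumed import path; substitute the module's real path
)

func main() {
	domain, _ := url.Parse("https://example.com")

	// Parser for the href values of anchor tags, keeping every match.
	parseLinks := crawl.MakeURIParser("a", "href", domain, ".*")

	for _, u := range parseLinks(`<a href="/docs">Docs</a> <a href="https://other.org/x">X</a>`) {
		fmt.Println(u.String())
	}
}
```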

func ParseHTMLElementValues

func ParseHTMLElementValues(html, tag, element string) []string

ParseHTMLElementValues parses `html` and returns the values of the specified `element` for each occurrence of the specified `tag`.
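
A sketch under the assumption that `element` names the attribute whose values are returned (for example `src` on `img` tags); that interpretation is inferred from MakeURIParser, not stated by the package.

```go
package main

import (
	"fmt"

	"example.com/crawl" // assumed import path; substitute the module's real path
)

func main() {
	html := `<img src="/logo.png" alt="logo"> <img src="/banner.jpg">`

	// Assumption: returns the src value of every img tag in html.
	for _, v := range crawl.ParseHTMLElementValues(html, "img", "src") {
		fmt.Println(v)
	}
}
```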

Types

This section is empty.
