Documentation ¶
Overview ¶
Example ¶
package main

import (
    "fmt"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    startURL := "https://godoc.org"
    cr, err := crawler.New()
    if err != nil {
        panic(err)
    }
    err = cr.Crawl(startURL, func(url string, res *crawler.Response, err error) error {
        if err != nil {
            fmt.Printf("error: %s", err.Error())
            return nil
        }
        fmt.Printf("%s - Links: %d Assets: %d\n", url, len(res.Links), len(res.Assets))
        return crawler.ErrSkipURL
    })
    if err != nil {
        panic(err)
    }
}
Output:

https://godoc.org/ - Links: 39 Assets: 5
Index ¶
- Variables
- func ReadResponse(base *url.URL, r io.Reader, res *Response) error
- type Asset
- type CheckFetchFunc
- type CheckFetchStack
- type CrawlFunc
- type InMemoryQueue
- type Link
- type Option
- func WithAllowedHosts(hosts ...string) Option
- func WithCheckFetch(fn CheckFetchFunc) Option
- func WithConcurrentRequests(n int) Option
- func WithExcludedHosts(hosts ...string) Option
- func WithHTTPTransport(rt http.RoundTripper) Option
- func WithMaxDepth(depth int) Option
- func WithOneRequestPerURL() Option
- type Queue
- type Request
- type Response
- type Runner
- type Simple
- type Worker
Examples ¶
Constants ¶
This section is empty.
Variables ¶
var ErrSkipURL = errors.New("skip URL")
ErrSkipURL can be returned by a CrawlFunc to avoid crawling the links from the given URL.
Functions ¶
Types ¶
type Asset ¶
type Asset struct {
    // Tag used to link the asset
    Tag string `json:"tag"`
    // URL of the asset
    URL string `json:"url"`
    // Rel contains the text of the rel attribute
    Rel string `json:"rel,omitempty"`
    // Type contains the text of the type attribute
    Type string `json:"type,omitempty"`
}
Asset represents linked assets such as link, script and img tags
type CheckFetchFunc ¶
CheckFetchFunc is used to check whether a page should be fetched during the crawl or not
type CheckFetchStack ¶
type CheckFetchStack []CheckFetchFunc
CheckFetchStack is a stack of CheckFetchFunc types where all have to pass for the fetch to happen.
func (CheckFetchStack) CheckFetch ¶
func (s CheckFetchStack) CheckFetch(req *Request) bool
CheckFetch returns true if all funcs in the stack return true, and false otherwise.
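A minimal sketch of composing checks into a stack follows. It assumes CheckFetchFunc has the signature func(*Request) bool, as the CheckFetch method above suggests; the fetch-limit and always-allow checks are purely illustrative, and the zero-value Request exists only to exercise the call.

package main

import (
    "fmt"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    fetched := 0
    stack := crawler.CheckFetchStack{
        // Illustrative check: allow at most 100 fetches in total.
        func(req *crawler.Request) bool {
            fetched++
            return fetched <= 100
        },
        // Illustrative check: always allow.
        func(req *crawler.Request) bool { return true },
    }

    // CheckFetch returns true only when every func in the stack returns true.
    fmt.Println(stack.CheckFetch(&crawler.Request{}))
}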
type CrawlFunc ¶
CrawlFunc is the type of the function called for each webpage visited by Crawl. The incoming url specifies which URL was fetched, while res contains the response of the fetched URL if it was successful. If the fetch failed, the incoming error will specify the reason and res will be nil.

Returning ErrSkipURL will avoid queuing up the resource's links to be crawled.
Returning any other error from the function will immediately stop the crawl.
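A sketch of those three outcomes in one CrawlFunc is shown below; the 1000-page limit and the "/static/" filter are illustrative, not part of the package.

package main

import (
    "errors"
    "log"
    "strings"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    cr, err := crawler.New()
    if err != nil {
        log.Fatal(err)
    }

    errTooMany := errors.New("too many pages")
    crawled := 0

    err = cr.Crawl("https://godoc.org", func(url string, res *crawler.Response, err error) error {
        if err != nil {
            // The fetch failed; log it and let the crawl continue.
            log.Printf("fetching %s: %v", url, err)
            return nil
        }
        if crawled++; crawled > 1000 {
            // Any error other than ErrSkipURL stops the crawl immediately.
            return errTooMany
        }
        if strings.Contains(url, "/static/") {
            // Skip queuing this page's links, but keep crawling elsewhere.
            return crawler.ErrSkipURL
        }
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}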
type InMemoryQueue ¶
type InMemoryQueue struct {
// contains filtered or unexported fields
}
InMemoryQueue holds a queue of items to be crawled in memory
func NewInMemoryQueue ¶
func NewInMemoryQueue(ctx context.Context) *InMemoryQueue
NewInMemoryQueue returns an in-memory queue ready to be used by different workers
func (*InMemoryQueue) PopFront ¶
func (q *InMemoryQueue) PopFront() (*Request, error)
PopFront gets the next request from the queue. It will return a nil request and a nil error if the queue is empty.
func (*InMemoryQueue) PushBack ¶
func (q *InMemoryQueue) PushBack(req *Request) error
PushBack adds a request to the queue
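A minimal sketch of the queue contract: the zero-value Request is there purely to exercise PushBack, and the drain loop relies on the documented nil, nil return once the queue is empty.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    q := crawler.NewInMemoryQueue(context.Background())

    // Queue a request (zero-value here, just to exercise the API).
    if err := q.PushBack(&crawler.Request{}); err != nil {
        log.Fatal(err)
    }

    // Drain the queue: PopFront returns (nil, nil) when it is empty.
    for {
        req, err := q.PopFront()
        if err != nil {
            log.Fatal(err)
        }
        if req == nil {
            break
        }
        fmt.Printf("got request: %+v\n", req)
    }
}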
type Link ¶
type Link struct {
    // URL contains the href attribute of the link. e.g: <a href="{href}">...</a>
    URL string `json:"url"`
}

Link contains the information from a single `a` tag
type Option ¶
type Option func(*options) error
Option is used to provide optional configuration to a crawler
func WithAllowedHosts ¶
WithAllowedHosts adds a check to only allow URLs with the given hosts
func WithCheckFetch ¶
func WithCheckFetch(fn CheckFetchFunc) Option
WithCheckFetch takes CheckFetchFunc that will be run before fetching each page to check whether it should be fetched or not
func WithConcurrentRequests ¶
WithConcurrentRequests sets how many concurrent requests to allow
func WithExcludedHosts ¶
WithExcludedHosts adds a check to only allow URLs with hosts other than the given ones
func WithHTTPTransport ¶
func WithHTTPTransport(rt http.RoundTripper) Option
WithHTTPTransport sets the optional http.RoundTripper used for HTTP requests
func WithMaxDepth ¶
WithMaxDepth sets the max depth of the crawl. The depth must be greater than zero or the call will panic.
func WithOneRequestPerURL ¶
func WithOneRequestPerURL() Option
WithOneRequestPerURL adds a check to only allow URLs once
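A sketch of combining several options follows. It assumes New accepts variadic Option values (consistent with Option being described as optional crawler configuration); the hosts, depth, and concurrency values are only examples.

package main

import (
    "log"
    "net/http"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    // Assumed: New(opts ...Option) applies these options to the crawler.
    cr, err := crawler.New(
        crawler.WithMaxDepth(3),
        crawler.WithConcurrentRequests(5),
        crawler.WithAllowedHosts("godoc.org"),
        crawler.WithHTTPTransport(http.DefaultTransport),
        crawler.WithOneRequestPerURL(),
    )
    if err != nil {
        log.Fatal(err)
    }
    _ = cr // use cr.Crawl as in the package example
}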
type Queue ¶
Queue is used by workers to keep track of the urls that need to be fetched. Queue must be safe to use concurrently.
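InMemoryQueue above is presumably one implementation of this interface; a compile-time assertion like the following sketch would verify that assumption.

package main

import "github.com/ernesto-jimenez/crawler"

// Assumed: *InMemoryQueue satisfies Queue via its PushBack and PopFront methods.
var _ crawler.Queue = (*crawler.InMemoryQueue)(nil)

func main() {}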
type Request ¶
Request is used to fetch a page and information about its resources
func NewRequest ¶
NewRequest initialises a new crawling request to extract information from a single URL
type Response ¶
type Response struct {
    URL        string  `json:"url"`
    RedirectTo string  `json:"redirect_to,omitempty"`
    Links      []Link  `json:"links"`
    Assets     []Asset `json:"assets"`
    // contains filtered or unexported fields
}
Response has the details from crawling a single URL
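Inside a CrawlFunc the response exposes the discovered links and assets directly through the exported fields above; a brief sketch, using the same New and Crawl calls as the package example:

package main

import (
    "fmt"
    "log"

    "github.com/ernesto-jimenez/crawler"
)

func main() {
    cr, err := crawler.New()
    if err != nil {
        log.Fatal(err)
    }
    err = cr.Crawl("https://godoc.org", func(url string, res *crawler.Response, err error) error {
        if err != nil {
            return nil
        }
        // Print every discovered link and asset for the fetched page.
        for _, l := range res.Links {
            fmt.Println("link:", l.URL)
        }
        for _, a := range res.Assets {
            fmt.Printf("asset: %s (%s)\n", a.URL, a.Tag)
        }
        return crawler.ErrSkipURL
    })
    if err != nil {
        log.Fatal(err)
    }
}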
type Simple ¶
type Simple struct {
// contains filtered or unexported fields
}
Simple is responsible for running a crawl, allowing you to queue new URLs to be crawled and build requests to be crawled.
func (*Simple) Crawl ¶
Crawl will fetch all the linked websites starting from startURL, invoking crawlFn for each fetched URL with either the response or the error.
It will return an error if the crawl was prematurely stopped or could not be started.
Crawl will always add WithOneRequestPerURL to the options of the worker to avoid infinite loops.