fetch

package
v0.0.0-...-d33463d Latest
Warning

This package is not in the latest version of its module.
Published: Jun 12, 2020 License: BSD-3-Clause Imports: 37 Imported by: 5

Documentation

Overview

Package fetch is part of the Dataflow kit. It is used by the fetch.d service, which downloads HTML content from web pages to feed Dataflow kit scrapers.

Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.

Two types of fetcher are currently available: Chrome Fetcher and Base Fetcher.

Base Fetcher downloads HTML pages using the Go standard library's HTTP client.

Chrome Fetcher connects to Headless Chrome, which renders JavaScript-driven pages.

RobotsTxtMiddleware checks whether scraping of the specified resource is allowed by robots.txt.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func AllowedByRobots

func AllowedByRobots(rawurl string, robotsData *robotstxt.RobotsData) bool

AllowedByRobots reports whether scraping of the specified URL is allowed by robots.txt.
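The idea behind such a check can be illustrated with a minimal, standard-library-only sketch. This is not the package's implementation, which relies on a parsed *robotstxt.RobotsData value. The helper names disallowedPrefixes and allowed are hypothetical, and the matching is deliberately simplified: only Disallow rules in the "User-agent: *" group are read, and they are compared by plain path prefix (no Allow rules, no wildcards).

```go
package main

import (
	"bufio"
	"fmt"
	"net/url"
	"strings"
)

// disallowedPrefixes (hypothetical helper) collects the Disallow paths from
// the "User-agent: *" group of a robots.txt body.
func disallowedPrefixes(robots string) []string {
	var prefixes []string
	inStarGroup := false
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		key, value, ok := strings.Cut(strings.TrimSpace(sc.Text()), ":")
		if !ok {
			continue // blank line or a line without a directive
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)
		switch key {
		case "user-agent":
			inStarGroup = value == "*"
		case "disallow":
			if inStarGroup && value != "" {
				prefixes = append(prefixes, value)
			}
		}
	}
	return prefixes
}

// allowed (hypothetical helper) reports whether the path of rawurl avoids
// every disallowed prefix. Unparseable URLs are treated as not allowed.
func allowed(rawurl string, prefixes []string) bool {
	u, err := url.Parse(rawurl)
	if err != nil {
		return false
	}
	path := u.Path
	if path == "" {
		path = "/"
	}
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

func main() {
	robots := "User-agent: *\nDisallow: /private/\n"
	prefixes := disallowedPrefixes(robots)
	fmt.Println(allowed("https://example.com/private/report", prefixes))
	fmt.Println(allowed("https://example.com/products", prefixes))
}
```

A real robots.txt parser also handles Allow rules, wildcards, and agent-specific groups, which is why the package delegates to a dedicated library.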

func AssembleRobotstxtURL

func AssembleRobotstxtURL(rawurl string) (string, error)

AssembleRobotstxtURL builds the robots.txt URL from the given URL.
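As a rough illustration of what assembling a robots.txt URL involves, the sketch below keeps the scheme and host of the input and replaces everything else with /robots.txt. The function name robotsTxtURL is hypothetical, and the error handling for URLs without a scheme or host is an assumption rather than documented behavior of AssembleRobotstxtURL.

```go
package main

import (
	"fmt"
	"net/url"
)

// robotsTxtURL is a hypothetical stand-in for AssembleRobotstxtURL: it keeps
// the scheme and host of rawurl and points the path at /robots.txt.
func robotsTxtURL(rawurl string) (string, error) {
	u, err := url.Parse(rawurl)
	if err != nil {
		return "", err
	}
	// Assumption: a URL without scheme or host cannot be resolved to a site.
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("url %q has no scheme or host", rawurl)
	}
	return u.Scheme + "://" + u.Host + "/robots.txt", nil
}

func main() {
	s, err := robotsTxtURL("https://example.com/catalog/page?id=7")
	fmt.Println(s, err) // prints "https://example.com/robots.txt <nil>"
}
```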

func GetCrawlDelay

func GetCrawlDelay(r *robotstxt.RobotsData) time.Duration

GetCrawlDelay retrieves the Crawl-delay directive from robots.txt. Crawl-delay is not part of the standard robots.txt protocol and, according to Wikipedia, bots interpret its value differently, which may be why many websites do not define rate limits in robots.txt at all. For the moment, the Crawl-delay value has no effect on delays between consecutive requests to the same domain; FetchDelay and RandomizeFetchDelay from ScrapeOptions are used to throttle crawler speed.
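To make the directive concrete, here is a minimal sketch of reading Crawl-delay from raw robots.txt text using only the standard library. It is not how GetCrawlDelay is implemented (that function works on a parsed *robotstxt.RobotsData). The function name crawlDelay is hypothetical, and the sketch assumes the common interpretation of the value as a number of seconds, ignoring user-agent groups.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// crawlDelay (hypothetical helper) returns the first valid Crawl-delay value
// found in a robots.txt body, interpreted as seconds. It returns 0 when the
// directive is absent or malformed.
func crawlDelay(robots string) time.Duration {
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		key, value, ok := strings.Cut(sc.Text(), ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(key), "crawl-delay") {
			continue
		}
		secs, err := strconv.ParseFloat(strings.TrimSpace(value), 64)
		if err != nil || secs < 0 {
			continue // skip values that are not non-negative numbers
		}
		return time.Duration(secs * float64(time.Second))
	}
	return 0
}

func main() {
	robots := "User-agent: *\nCrawl-delay: 2.5\nDisallow: /tmp/\n"
	fmt.Println(crawlDelay(robots)) // prints "2.5s"
}
```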

func RobotstxtData

func RobotstxtData(url string) (robotsData *robotstxt.RobotsData, err error)

RobotstxtData generates robots.txt url, retrieves its content through API fetch endpoint.

Types

type Action

type Action interface {
	Execute(ctx context.Context, f *ChromeFetcher) error
}

func NewAction

func NewAction(actionType string, params json.RawMessage) (Action, error)
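NewAction carries no description here, but its signature suggests a factory that picks a concrete Action by type name and decodes its parameters from JSON. The sketch below shows that general pattern with local stand-in types. The type names "click" and "paginate", the describe method, and the error for unknown types are all assumptions made for illustration; only the JSON field tags are taken from the ClickAction and PaginateAction structs in this package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// action is a local stand-in for the package's Action interface.
type action interface {
	describe() string
}

type clickAction struct {
	Element string `json:"element"`
}

func (a *clickAction) describe() string { return "click " + a.Element }

type paginateAction struct {
	MaxPage int    `json:"maxpage"`
	Element string `json:"element"`
}

func (a *paginateAction) describe() string {
	return fmt.Sprintf("paginate via %s, up to %d pages", a.Element, a.MaxPage)
}

// newAction (hypothetical) selects a concrete action by name, then decodes
// the raw JSON parameters into it.
func newAction(actionType string, params json.RawMessage) (action, error) {
	var a action
	switch actionType {
	case "click": // assumed type name
		a = &clickAction{}
	case "paginate": // assumed type name
		a = &paginateAction{}
	default:
		return nil, fmt.Errorf("unknown action type %q", actionType)
	}
	if err := json.Unmarshal(params, a); err != nil {
		return nil, err
	}
	return a, nil
}

func main() {
	a, err := newAction("paginate", json.RawMessage(`{"maxpage":3,"element":".next"}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(a.describe()) // prints "paginate via .next, up to 3 pages"
}
```

Deferring the decoding with json.RawMessage lets each action type define its own parameter shape without the caller knowing it in advance.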

type BaseFetcher

type BaseFetcher struct {
	// contains filtered or unexported fields
}

BaseFetcher is a Fetcher that uses the Go standard library's http client to fetch URLs.

func (*BaseFetcher) Fetch

func (bf *BaseFetcher) Fetch(request Request) (io.ReadCloser, error)

Fetch retrieves a document from the remote server.

type ChromeFetcher

type ChromeFetcher struct {
	// contains filtered or unexported fields
}

ChromeFetcher is used to fetch JavaScript-rendered pages.

func (*ChromeFetcher) Fetch

func (f *ChromeFetcher) Fetch(request Request) (io.ReadCloser, error)

Fetch retrieves a document from the remote server. It returns web page content along with cache and expiration information.

func (ChromeFetcher) RunJSFromFile

func (f ChromeFetcher) RunJSFromFile(ctx context.Context, path string, entryPointFunction string) error

type ClickAction

type ClickAction struct {
	Element string `json:"element"`
}

func (*ClickAction) Execute

func (a *ClickAction) Execute(ctx context.Context, f *ChromeFetcher) error

type Config

type Config struct {
	Host    string
	Version string
}

Config provides basic configuration

type FetchService

type FetchService struct {
}

FetchService implements Service with an empty struct.

func (FetchService) Fetch

func (fs FetchService) Fetch(req Request) (io.ReadCloser, error)

Fetch fetches content from a web page with either the Base or the Chrome fetcher.
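Choosing between two fetchers based on a request's type can be sketched as follows. This is an illustration only: the fetcher interface, stubFetcher, pickFetcher, and fetchString are hypothetical local names, the stubs return canned content instead of performing network requests, and the case-insensitive matching is an assumption. The default of "base" for an empty type follows the Request.Type field documentation.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// fetcher is a local stand-in for the package's Fetcher interface.
type fetcher interface {
	Fetch(url string) (io.ReadCloser, error)
}

// stubFetcher returns canned content so the sketch runs without a network.
type stubFetcher struct{ name string }

func (s stubFetcher) Fetch(url string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(s.name + ":" + url)), nil
}

// pickFetcher (hypothetical) maps a request type to a fetcher; an empty
// type falls back to the base fetcher.
func pickFetcher(fetcherType string) (fetcher, error) {
	switch strings.ToLower(fetcherType) { // case-insensitivity is an assumption
	case "", "base":
		return stubFetcher{name: "base"}, nil
	case "chrome":
		return stubFetcher{name: "chrome"}, nil
	default:
		return nil, fmt.Errorf("unknown fetcher type %q", fetcherType)
	}
}

// fetchString picks a fetcher, fetches url, and reads the whole body.
func fetchString(fetcherType, url string) (string, error) {
	f, err := pickFetcher(fetcherType)
	if err != nil {
		return "", err
	}
	rc, err := f.Fetch(url)
	if err != nil {
		return "", err
	}
	defer rc.Close() // the caller owns the returned io.ReadCloser
	body, err := io.ReadAll(rc)
	return string(body), err
}

func main() {
	s, err := fetchString("", "https://example.com")
	fmt.Println(s, err) // prints "base:https://example.com <nil>"
}
```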

type Fetcher

type Fetcher interface {
	// Fetch is called to retrieve the HTML content of a document from the remote server.
	Fetch(request Request) (io.ReadCloser, error)
	// contains filtered or unexported methods
}

Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.

Note: Fetchers may or may not be safe to use concurrently. Please read the documentation for each fetcher for more details.

type HTMLServer

type HTMLServer struct {
	// contains filtered or unexported fields
}

HTMLServer represents the web service that serves up HTML

func Start

func Start(cfg Config) *HTMLServer

Start func launches Parsing service

func (*HTMLServer) Stop

func (htmlServer *HTMLServer) Stop() error

Stop turns off the HTML Server

type LogCodec

type LogCodec struct {
	// contains filtered or unexported fields
}

LogCodec captures the output from writing RPC requests and reading responses on the connection. It implements rpcc.Codec via WriteRequest and ReadResponse.

func (*LogCodec) ReadResponse

func (c *LogCodec) ReadResponse(resp *rpcc.Response) error

ReadResponse unmarshals from the connection into resp while echoing what is read into a buffer for logging.

func (*LogCodec) WriteRequest

func (c *LogCodec) WriteRequest(req *rpcc.Request) error

WriteRequest marshals req into a buffer, writes its contents onto the connection, and logs it.

type PaginateAction

type PaginateAction struct {
	MaxPage int    `json:"maxpage"`
	Element string `json:"element"`
}

func (*PaginateAction) Execute

func (pa *PaginateAction) Execute(ctx context.Context, f *ChromeFetcher) error

type Request

type Request struct {
	// Type defines the Fetcher type. It may be "chrome" or "base". Defaults to "base".
	Type string `json:"type"`
	// URL to be retrieved.
	URL string `json:"url"`
	// HTTP method: GET, POST.
	Method string
	// FormData is a string value for passing form data parameters.
	//
	// For example, it may be used for processing pages that require authentication.
	//
	// Example:
	//
	// "auth_key=880ea6a14ea49e853634fbdc5015a024&referer=http%3A%2F%2Fexample.com%2F&ips_username=user&ips_password=userpassword&rememberMe=1"
	//
	FormData string `json:"formData,omitempty"`
	// UserToken identifies the user in order to keep personal cookie information.
	UserToken string `json:"userToken"`
	// Actions contains the list of actions to perform on the page.
	Actions string `json:"actions"`
}

Request struct contains request information sent to Fetchers
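Because the struct carries JSON tags, its wire format can be previewed with encoding/json. The sketch below uses a local mirror of the struct (named request here, a hypothetical copy) so that it runs without importing the package. Note that Method has no tag, so encoding/json emits it under its Go field name, and FormData is omitted when empty.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// request mirrors the fields and JSON tags of the Request struct.
type request struct {
	Type      string `json:"type"`
	URL       string `json:"url"`
	Method    string
	FormData  string `json:"formData,omitempty"`
	UserToken string `json:"userToken"`
	Actions   string `json:"actions"`
}

// encode returns the JSON form of r as a string.
func encode(r request) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	s, err := encode(request{Type: "base", URL: "https://example.com", Method: "GET"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
	// prints {"type":"base","url":"https://example.com","Method":"GET","userToken":"","actions":""}
}
```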

func (Request) Host

func (req Request) Host() (string, error)

Host returns Host value from Request

type Service

type Service interface {
	Fetch(req Request) (io.ReadCloser, error)
}

Service defines Fetch service interface

func NewHTTPClient

func NewHTTPClient(instance string) (Service, error)

NewHTTPClient returns a Fetch Service backed by an HTTP server living at the remote instance. We expect instance to come from a service discovery system, so it is likely of the form "host:port". We bake in certain middlewares, implementing the client library pattern.

type ServiceMiddleware

type ServiceMiddleware func(Service) Service

ServiceMiddleware defines a middleware for a Fetch service

func LoggingMiddleware

func LoggingMiddleware(logger *zap.Logger) ServiceMiddleware

LoggingMiddleware logs Service endpoints
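The middleware type is a function that takes a Service and returns a Service, so wrappers can be stacked around a core implementation. The sketch below demonstrates the pattern with local stand-ins: service, serviceMiddleware, staticService, and countingMiddleware are hypothetical names, and the wrapper counts calls rather than logging with zap, so that the example needs only the standard library.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// service and serviceMiddleware are local stand-ins for Service and
// ServiceMiddleware; Fetch takes a plain URL instead of a Request.
type service interface {
	Fetch(url string) (io.ReadCloser, error)
}

type serviceMiddleware func(service) service

// staticService is a core service that always returns the same body.
type staticService struct{ body string }

func (s staticService) Fetch(url string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(s.body)), nil
}

// countingService wraps another service and counts Fetch calls.
type countingService struct {
	next  service
	calls *int
}

func (c countingService) Fetch(url string) (io.ReadCloser, error) {
	*c.calls++ // side effect before delegating, as a logger would do
	return c.next.Fetch(url)
}

// countingMiddleware returns a middleware that records calls in *calls.
func countingMiddleware(calls *int) serviceMiddleware {
	return func(next service) service {
		return countingService{next: next, calls: calls}
	}
}

func main() {
	var calls int
	var svc service = staticService{body: "<html></html>"}
	svc = countingMiddleware(&calls)(svc)

	rc, err := svc.Fetch("https://example.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer rc.Close()
	fmt.Println("calls:", calls) // prints "calls: 1"
}
```

Because the wrapper satisfies the same interface as the service it wraps, callers do not need to know whether any middleware is installed.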

type Type

type Type string

Type represents types of fetcher

const (
	// Base fetcher downloads HTML pages using the Go standard library's HTTP client.
	Base Type = "Base"
	// Chrome uses Headless Chrome to download content from JavaScript-driven pages.
	Chrome = "Chrome"
)

Fetcher types
