Documentation ¶
Overview ¶
Package fetch of the Dataflow kit is used by the fetch.d service, which downloads HTML content from web pages to feed Dataflow kit scrapers.
Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.
Currently, two types of fetcher are available: Chrome Fetcher and Base Fetcher.
Base Fetcher downloads HTML web pages using the Go standard HTTP library.
Chrome Fetcher connects to Headless Chrome, which renders JavaScript-driven pages.
RobotsTxtMiddleware checks whether scraping of the specified resource is allowed by robots.txt.
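A minimal usage sketch, assuming the package import path github.com/slotix/dataflowkit/fetch: build a Request, choose a fetcher type, and read the returned page body. The URL and field values are illustrative only.

package main

import (
	"fmt"
	"io/ioutil"
	"log"

	"github.com/slotix/dataflowkit/fetch"
)

func main() {
	// Type "base" uses the standard HTTP client; "chrome" renders JavaScript pages.
	req := fetch.Request{
		Type:   "base",
		URL:    "http://example.com",
		Method: "GET",
	}

	svc := fetch.FetchService{}
	body, err := svc.Fetch(req)
	if err != nil {
		log.Fatal(err)
	}
	defer body.Close()

	html, err := ioutil.ReadAll(body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(html))
}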
Index ¶
- func AllowedByRobots(rawurl string, robotsData *robotstxt.RobotsData) bool
- func AssembleRobotstxtURL(rawurl string) (string, error)
- func GetCrawlDelay(r *robotstxt.RobotsData) time.Duration
- func RobotstxtData(url string) (robotsData *robotstxt.RobotsData, err error)
- type Action
- type BaseFetcher
- type ChromeFetcher
- type ClickAction
- type Config
- type FetchService
- type Fetcher
- type HTMLServer
- type LogCodec
- type PaginateAction
- type Request
- type Service
- type ServiceMiddleware
- type Type
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AllowedByRobots ¶
func AllowedByRobots(rawurl string, robotsData *robotstxt.RobotsData) bool
AllowedByRobots checks if scraping of the specified URL is allowed by robots.txt.
func AssembleRobotstxtURL ¶
func AssembleRobotstxtURL(rawurl string) (string, error)
AssembleRobotstxtURL assembles the robots.txt URL from the given URL.
func GetCrawlDelay ¶
func GetCrawlDelay(r *robotstxt.RobotsData) time.Duration
GetCrawlDelay retrieves the Crawl-delay directive from robots.txt. Crawl-delay is not part of the standard robots.txt protocol and, according to Wikipedia, some bots interpret this value differently; that may be why many websites do not bother defining rate limits in robots.txt at all. At the moment the Crawl-delay value has no effect on delays between consecutive requests to the same domain. FetchDelay and RandomizeFetchDelay from ScrapeOptions are used for throttling crawler speed instead.
func RobotstxtData ¶
func RobotstxtData(url string) (robotsData *robotstxt.RobotsData, err error)
RobotstxtData generates the robots.txt URL and retrieves its content through the fetch API endpoint.
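A sketch combining the robots.txt helpers above. It assumes the fetch API endpoint used by RobotstxtData is reachable; the target URL is illustrative.

package main

import (
	"fmt"
	"log"

	"github.com/slotix/dataflowkit/fetch"
)

func main() {
	pageURL := "http://example.com/some/page"

	// Derive the robots.txt location for the page's host.
	robotsURL, err := fetch.AssembleRobotstxtURL(pageURL)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("robots.txt:", robotsURL)

	// Download and parse robots.txt through the fetch API endpoint.
	robotsData, err := fetch.RobotstxtData(pageURL)
	if err != nil {
		log.Fatal(err)
	}

	// Check permission and read the (non-standard) Crawl-delay directive.
	if fetch.AllowedByRobots(pageURL, robotsData) {
		fmt.Println("allowed; crawl-delay:", fetch.GetCrawlDelay(robotsData))
	}
}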
Types ¶
type BaseFetcher ¶
type BaseFetcher struct {
// contains filtered or unexported fields
}
BaseFetcher is a Fetcher that uses the Go standard library's http client to fetch URLs.
func (*BaseFetcher) Fetch ¶
func (bf *BaseFetcher) Fetch(request Request) (io.ReadCloser, error)
Fetch retrieves document from the remote server.
type ChromeFetcher ¶
type ChromeFetcher struct {
// contains filtered or unexported fields
}
ChromeFetcher is used to fetch JavaScript-rendered pages.
func (*ChromeFetcher) Fetch ¶
func (f *ChromeFetcher) Fetch(request Request) (io.ReadCloser, error)
Fetch retrieves document from the remote server. It returns web page content along with cache and expiration information.
func (ChromeFetcher) RunJSFromFile ¶
type ClickAction ¶
type ClickAction struct {
Element string `json:"element"`
}
func (*ClickAction) Execute ¶
func (a *ClickAction) Execute(ctx context.Context, f *ChromeFetcher) error
type FetchService ¶
type FetchService struct{}
FetchService implements the Service interface as an empty struct.
func (FetchService) Fetch ¶
func (fs FetchService) Fetch(req Request) (io.ReadCloser, error)
Fetch method implements fetching content from web page with Base or Chrome fetcher.
type Fetcher ¶
type Fetcher interface {
	// Fetch is called to retrieve HTML content of a document from the remote server.
	Fetch(request Request) (io.ReadCloser, error)
	// contains filtered or unexported methods
}
Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.
Note: Fetchers may or may not be safe to use concurrently. Please read the documentation for each fetcher for more details.
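Because BaseFetcher and ChromeFetcher both satisfy Fetcher, calling code can depend on the interface alone. A minimal sketch, assuming the github.com/slotix/dataflowkit/fetch import path; the download helper and its package name are illustrative, not part of this package.

package scrape

import (
	"io/ioutil"

	"github.com/slotix/dataflowkit/fetch"
)

// download retrieves a single page body through whichever Fetcher it is given.
func download(f fetch.Fetcher, url string) ([]byte, error) {
	body, err := f.Fetch(fetch.Request{URL: url, Method: "GET"})
	if err != nil {
		return nil, err
	}
	defer body.Close()
	return ioutil.ReadAll(body)
}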
type HTMLServer ¶
type HTMLServer struct {
// contains filtered or unexported fields
}
HTMLServer represents the web service that serves up HTML
type LogCodec ¶
type LogCodec struct {
// contains filtered or unexported fields
}
LogCodec captures the output from writing RPC requests and reading responses on the connection. It implements rpcc.Codec via WriteRequest and ReadResponse.
func (*LogCodec) ReadResponse ¶
ReadResponse unmarshals from the connection into v whilst echoing what is read into a buffer for logging.
type PaginateAction ¶
func (*PaginateAction) Execute ¶
func (pa *PaginateAction) Execute(ctx context.Context, f *ChromeFetcher) error
type Request ¶
type Request struct {
	// Type defines the Fetcher type. It may be "chrome" or "base". Defaults to "base".
	Type string `json:"type"`
	// URL to be retrieved
	URL string `json:"url"`
	// HTTP method: GET, POST
	Method string
	// FormData is a string value for passing form data parameters.
	//
	// For example, it may be used for processing pages which require authentication.
	//
	// Example:
	//
	// "auth_key=880ea6a14ea49e853634fbdc5015a024&referer=http%3A%2F%2Fexample.com%2F&ips_username=user&ips_password=userpassword&rememberMe=1"
	FormData string `json:"formData,omitempty"`
	// UserToken identifies the user to keep personal cookies information.
	UserToken string `json:"userToken"`
	// Actions contains the list of actions to perform on the page.
	Actions string `json:"actions"`
}
Request struct contains request information sent to Fetchers.
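A sketch of a Request that submits form data before scraping pages behind a login. All field values, including the token, are illustrative.

package main

import (
	"log"

	"github.com/slotix/dataflowkit/fetch"
)

func main() {
	req := fetch.Request{
		Type:   "base",
		URL:    "http://example.com/login",
		Method: "POST",
		// URL-encoded form fields, as in the FormData comment above.
		FormData: "ips_username=user&ips_password=userpassword&rememberMe=1",
		// UserToken keeps this user's cookies separate from other users'.
		UserToken: "example-user-token",
	}

	body, err := fetch.FetchService{}.Fetch(req)
	if err != nil {
		log.Fatal(err)
	}
	defer body.Close()
}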
type Service ¶
type Service interface {
Fetch(req Request) (io.ReadCloser, error)
}
Service defines the Fetch service interface.
func NewHTTPClient ¶
NewHTTPClient returns a Fetch Service backed by an HTTP server living at the remote instance. We expect instance to come from a service discovery system, so it is likely of the form "host:port". We bake in certain middlewares, implementing the client library pattern.
type ServiceMiddleware ¶
ServiceMiddleware defines a middleware for the Fetch service.
func LoggingMiddleware ¶
func LoggingMiddleware(logger *zap.Logger) ServiceMiddleware
LoggingMiddleware logs Service endpoints
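A sketch of wiring the logging middleware around the service. It assumes ServiceMiddleware has the common go-kit shape func(Service) Service; check the package source for the exact definition.

package main

import (
	"log"

	"github.com/slotix/dataflowkit/fetch"
	"go.uber.org/zap"
)

func main() {
	logger, err := zap.NewProduction()
	if err != nil {
		log.Fatal(err)
	}
	defer logger.Sync()

	// Wrap the plain service so every Fetch call is logged.
	// Assumption: ServiceMiddleware is func(Service) Service.
	var svc fetch.Service = fetch.FetchService{}
	svc = fetch.LoggingMiddleware(logger)(svc)

	body, err := svc.Fetch(fetch.Request{Type: "base", URL: "http://example.com"})
	if err != nil {
		log.Fatal(err)
	}
	defer body.Close()
}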