Documentation ¶
Overview ¶
Package fetch of the Dataflow kit is used by the fetch.d service, which downloads HTML content from web pages to feed Dataflow kit scrapers.
Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.
Currently, two types of fetcher are available: Chrome Fetcher and Base Fetcher.
Base Fetcher downloads HTML web pages using the Go standard HTTP library.
Chrome Fetcher connects to Headless Chrome, which renders JavaScript-driven pages.
RobotsTxtMiddleware checks whether scraping of the specified resource is allowed by robots.txt.
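A minimal usage sketch, assuming the package import path github.com/slotix/dataflowkit/fetch: build a Request, choose a fetcher type, and read the returned page body. The URL and field values are illustrative only.

package main

import (
	"fmt"
	"io/ioutil"
	"log"

	"github.com/slotix/dataflowkit/fetch"
)

func main() {
	// Type "base" uses the standard HTTP client; "chrome" renders JavaScript pages.
	req := fetch.Request{
		Type:   "base",
		URL:    "http://example.com",
		Method: "GET",
	}

	svc := fetch.FetchService{}
	body, err := svc.Fetch(req)
	if err != nil {
		log.Fatal(err)
	}
	defer body.Close()

	html, err := ioutil.ReadAll(body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(html))
}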
Index ¶
- func AllowedByRobots(rawurl string, robotsData *robotstxt.RobotsData) bool
- func AssembleRobotstxtURL(rawurl string) (string, error)
- func GetCrawlDelay(r *robotstxt.RobotsData) time.Duration
- func RobotstxtData(url string) (robotsData *robotstxt.RobotsData, err error)
- type Action
- type BaseFetcher
- type ChromeFetcher
- type ClickAction
- type Config
- type FetchService
- type Fetcher
- type HTMLServer
- type LogCodec
- type PaginateAction
- type Request
- type Service
- type ServiceMiddleware
- type Type
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AllowedByRobots ¶
func AllowedByRobots(rawurl string, robotsData *robotstxt.RobotsData) bool
AllowedByRobots checks if scraping of the specified URL is allowed by robots.txt.
func AssembleRobotstxtURL ¶
func AssembleRobotstxtURL(rawurl string) (string, error)
AssembleRobotstxtURL assembles the robots.txt URL from the given URL.
func GetCrawlDelay ¶
func GetCrawlDelay(r *robotstxt.RobotsData) time.Duration
GetCrawlDelay retrieves the Crawl-delay directive from robots.txt. Crawl-delay is not part of the standard robots.txt protocol and, according to Wikipedia, some bots interpret this value differently; that may be why many websites do not bother defining rate limits in robots.txt at all. At the moment the Crawl-delay value has no effect on delays between consecutive requests to the same domain. FetchDelay and RandomizeFetchDelay from ScrapeOptions are used for throttling crawler speed instead.
func RobotstxtData ¶
func RobotstxtData(url string) (robotsData *robotstxt.RobotsData, err error)
RobotstxtData generates the robots.txt URL and retrieves its content through the fetch API endpoint.
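A sketch combining the robots.txt helpers above. It assumes the fetch API endpoint used by RobotstxtData is reachable; the target URL is illustrative.

package main

import (
	"fmt"
	"log"

	"github.com/slotix/dataflowkit/fetch"
)

func main() {
	pageURL := "http://example.com/some/page"

	// Derive the robots.txt location for the page's host.
	robotsURL, err := fetch.AssembleRobotstxtURL(pageURL)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("robots.txt:", robotsURL)

	// Download and parse robots.txt through the fetch API endpoint.
	robotsData, err := fetch.RobotstxtData(pageURL)
	if err != nil {
		log.Fatal(err)
	}

	// Check permission and read the (non-standard) Crawl-delay directive.
	if fetch.AllowedByRobots(pageURL, robotsData) {
		fmt.Println("allowed; crawl-delay:", fetch.GetCrawlDelay(robotsData))
	}
}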
Types ¶
type BaseFetcher ¶
type BaseFetcher struct {
// contains filtered or unexported fields
}
BaseFetcher is a Fetcher that uses the Go standard library's http client to fetch URLs.
func (*BaseFetcher) Fetch ¶
func (bf *BaseFetcher) Fetch(request Request) (io.ReadCloser, error)
Fetch retrieves document from the remote server.
type ChromeFetcher ¶
type ChromeFetcher struct {
// contains filtered or unexported fields
}
ChromeFetcher is used to fetch JavaScript-rendered pages.
func (*ChromeFetcher) Fetch ¶
func (f *ChromeFetcher) Fetch(request Request) (io.ReadCloser, error)
Fetch retrieves document from the remote server. It returns web page content along with cache and expiration information.
func (ChromeFetcher) RunJSFromFile ¶
type ClickAction ¶
type ClickAction struct {
Element string `json:"element"`
}
func (*ClickAction) Execute ¶
func (a *ClickAction) Execute(ctx context.Context, f *ChromeFetcher) error
type FetchService ¶
type FetchService struct{}
FetchService implements the Service interface as an empty struct.
func (FetchService) Fetch ¶
func (fs FetchService) Fetch(req Request) (io.ReadCloser, error)
Fetch method implements fetching content from web page with Base or Chrome fetcher.
type Fetcher ¶
type Fetcher interface {
	// Fetch is called to retrieve HTML content of a document from the remote server.
	Fetch(request Request) (io.ReadCloser, error)
	// contains filtered or unexported methods
}
Fetcher is the interface that must be satisfied by things that can fetch remote URLs and return their contents.
Note: Fetchers may or may not be safe to use concurrently. Please read the documentation for each fetcher for more details.
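Because BaseFetcher and ChromeFetcher both satisfy Fetcher, calling code can depend on the interface alone. A minimal sketch, assuming the github.com/slotix/dataflowkit/fetch import path; the download helper and its package name are illustrative, not part of this package.

package scrape

import (
	"io/ioutil"

	"github.com/slotix/dataflowkit/fetch"
)

// download retrieves a single page body through whichever Fetcher it is given.
func download(f fetch.Fetcher, url string) ([]byte, error) {
	body, err := f.Fetch(fetch.Request{URL: url, Method: "GET"})
	if err != nil {
		return nil, err
	}
	defer body.Close()
	return ioutil.ReadAll(body)
}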
type HTMLServer ¶
type HTMLServer struct {
// contains filtered or unexported fields
}
HTMLServer represents the web service that serves up HTML
type LogCodec ¶
type LogCodec struct {
// contains filtered or unexported fields
}
LogCodec captures the output from writing RPC requests and reading responses on the connection. It implements rpcc.Codec via WriteRequest and ReadResponse.
func (*LogCodec) ReadResponse ¶
ReadResponse unmarshals from the connection into v whilst echoing what is read into a buffer for logging.
type PaginateAction ¶
func (*PaginateAction) Execute ¶
func (pa *PaginateAction) Execute(ctx context.Context, f *ChromeFetcher) error
type Request ¶
type Request struct {
	// Type defines the Fetcher type. It may be "chrome" or "base". Defaults to "base".
	Type string `json:"type"`
	// URL to be retrieved
	URL string `json:"url"`
	// HTTP method: GET, POST
	Method string
	// FormData is a string value for passing form data parameters.
	//
	// For example, it may be used for processing pages which require authentication.
	//
	// Example:
	//
	// "auth_key=880ea6a14ea49e853634fbdc5015a024&referer=http%3A%2F%2Fexample.com%2F&ips_username=user&ips_password=userpassword&rememberMe=1"
	FormData string `json:"formData,omitempty"`
	// UserToken identifies the user to keep personal cookies information.
	UserToken string `json:"userToken"`
	// Actions contains the list of actions to perform on the page.
	Actions string `json:"actions"`
}
Request struct contains request information sent to Fetchers.
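A sketch of a Request that submits form data before scraping pages behind a login. All field values, including the token, are illustrative.

package main

import (
	"log"

	"github.com/slotix/dataflowkit/fetch"
)

func main() {
	req := fetch.Request{
		Type:   "base",
		URL:    "http://example.com/login",
		Method: "POST",
		// URL-encoded form fields, as in the FormData comment above.
		FormData: "ips_username=user&ips_password=userpassword&rememberMe=1",
		// UserToken keeps this user's cookies separate from other users'.
		UserToken: "example-user-token",
	}

	body, err := fetch.FetchService{}.Fetch(req)
	if err != nil {
		log.Fatal(err)
	}
	defer body.Close()
}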
type Service ¶
type Service interface {
Fetch(req Request) (io.ReadCloser, error)
}
Service defines the Fetch service interface.
func NewHTTPClient ¶
NewHTTPClient returns a Fetch Service backed by an HTTP server living at the remote instance. We expect instance to come from a service discovery system, so it is likely of the form "host:port". We bake in certain middlewares, implementing the client library pattern.
type ServiceMiddleware ¶
ServiceMiddleware defines a middleware for the Fetch service.
func LoggingMiddleware ¶
func LoggingMiddleware(logger *zap.Logger) ServiceMiddleware
LoggingMiddleware logs Service endpoints
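A sketch of wiring the logging middleware around the service. It assumes ServiceMiddleware has the common go-kit shape func(Service) Service; check the package source for the exact definition.

package main

import (
	"log"

	"github.com/slotix/dataflowkit/fetch"
	"go.uber.org/zap"
)

func main() {
	logger, err := zap.NewProduction()
	if err != nil {
		log.Fatal(err)
	}
	defer logger.Sync()

	// Wrap the plain service so every Fetch call is logged.
	// Assumption: ServiceMiddleware is func(Service) Service.
	var svc fetch.Service = fetch.FetchService{}
	svc = fetch.LoggingMiddleware(logger)(svc)

	body, err := svc.Fetch(fetch.Request{Type: "base", URL: "http://example.com"})
	if err != nil {
		log.Fatal(err)
	}
	defer body.Close()
}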