scrape

package

v0.0.0-...-d33463d Latest Latest Go to latest Published: Jun 12, 2020 License: BSD-3-Clause Imports: 27 Imported by: 2

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/slotix/dataflowkit

Links

Open Source Insights

Documentation ¶

Overview ¶

Package scrape of the Dataflow kit is for structured data extraction from webpages starting from JSON payload processing to encoding scraped data to one of output formats like JSON, Excel, CSV, XML

Index ¶

Constants
func EncodeResults(ctx context.Context, task *Task, ei encodeInfo) (string, error)
type CSVEncoder
type Field
type Filter
- func (f Filter) Apply(data string) (string, error)
type JSONEncoder
type Payload
- func (p *Payload) InitUID()
type Task
- func NewTask() *Task
- func (task *Task) Parse(ctx context.Context, payload Payload) (io.ReadCloser, error)
type XLSXEncoder
type XMLEncoder

Constants ¶

View Source

const (
	COMMENT_INFO  = "Generated by Dataflow Kit. https://dataflowkit.com"
	GZIP_COMPRESS = "gz"
)

Variables ¶

This section is empty.

Functions ¶

func EncodeResults ¶

func EncodeResults(ctx context.Context, task *Task, ei encodeInfo) (string, error)

Types ¶

type CSVEncoder ¶

type CSVEncoder struct {
	// contains filtered or unexported fields
}

CSVEncoder transforms parsed data to CSV format.

type Field ¶

type Field struct {
	//Name is a name of fields. It is required, and will be used to aggregate results.
	Name string `json:"name"`
	//Selector is a CSS selector within the given block to process.  Pass in "." to use the root block's selector.
	CSSSelector string `json:"selector"`
	//Attrs specify attributes which will be extracted from element
	Attrs []string `json:"attrs"`
	//Details is an optional field strictly for Link extractor type. It guides scraper to parse additional pages following the links according to the set of fields specified inside "details"
	Details Payload `json:"details"`
	//Filters
	Filters []Filter `json:"filters"`
}

A Field corresponds to a given chunk of data to be extracted from every block in each page of a scrape.

type Filter ¶

type Filter struct {
	Name  string
	Param string
}

func (Filter) Apply ¶

func (f Filter) Apply(data string) (string, error)

type JSONEncoder ¶

type JSONEncoder struct {
	JSONL bool
}

JSONEncoder transforms parsed data to JSON format.

type Payload ¶

type Payload struct {
	// Name - Collection name.
	Name string `json:"name"`
	//Request struct represents HTTP request to be sent to a server. It combines parameters for passing for downloading html pages by Fetch Endpoint.
	//Request.URL field is required. All other fields including Params, Cookies, Func are optional.
	Request fetch.Request `json:"request"`
	//Fields is a set of fields used to extract data from a web page.
	Fields []Field `json:"fields"`
	//PayloadMD5 encodes payload content to MD5. It is used for generating file name to be stored.
	PayloadMD5 string
	//FetcherType represent fetcher which is used for document download.
	//Set up it to either `base` or `chrome` values
	//If FetcherType is omitted the value of FETCHER_TYPE of parse.d service is used by default.
	//FetcherType string `json:"fetcherType"`
	//Format represents output format (CSV, JSON, XML)
	Format string `json:"format"`
	//Compressed represents if result will be compressed into GZip
	Compressor string `json:"compressor"`
	//Paginator is used to scrape multiple pages.
	//If Paginator is nil, then no pagination is performed and it is assumed that the initial URL is the only page.
	Paginator string `json:"paginator"`
	//Paginated results are returned if true.
	//Default value is false
	// Single list of combined results from every block on all pages is returned by default.
	//
	// Paginated results are applicable for JSON and XML output formats.
	//
	// Combined list of results is always returned for CSV format.
	PaginateResults *bool `json:"paginateResults"`
	//FetchDelay should be used for a scraper to throttle the crawling speed to avoid hitting the web servers too frequently.
	//FetchDelay specifies sleep time for multiple requests for the same domain. It is equal to FetchDelay * random value between 500 and 1500 msec
	FetchDelay *time.Duration
	//Some web sites track  statistically significant similarities in the time between requests to them. RandomizeCrawlDelay setting decreases the chance of a crawler being blocked by such sites. This way a random delay ranging from 0.5  CrawlDelay to 1.5  CrawlDelay seconds is used between consecutive requests to the same domain. If CrawlDelay is zero (default) this option has no effect.
	RandomizeFetchDelay *bool
	//Maximum number of times to retry, in addition to the first download.
	//RETRY_HTTP_CODES
	//Default: [500, 502, 503, 504, 408]
	//Failed pages should be rescheduled for download at the end. once the spider has finished crawling all other (non failed) pages.
	RetryTimes int `json:"retryTimes"`
	// ContainPath means that one of the field just a path and we have to ignore all other fields (if present)
	// that are not a path
	IsPath bool `json:"path"`
	// contains filtered or unexported fields
}

Payload structure contain information and rules to be passed to a scraper Find the most actual information in docs/payload.md

func (*Payload) InitUID ¶

func (p *Payload) InitUID()

type Task ¶

type Task struct {
	Robots map[string]*robotstxt.RobotsData
	// contains filtered or unexported fields
}

Task keeps Results of Task generated from Payload along with other auxiliary information

func NewTask ¶

func NewTask() *Task

NewTask creates new task to parse fetched page following the rules from Payload.

func (*Task) Parse ¶

func (task *Task) Parse(ctx context.Context, payload Payload) (io.ReadCloser, error)

Parse specified payload.

type XLSXEncoder ¶

type XLSXEncoder struct {
	// contains filtered or unexported fields
}

type XMLEncoder ¶

type XMLEncoder struct {
}

XMLEncoder transforms parsed data to XML format.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL