scraper

package
v0.5.8
Warning

This package is not in the latest version of its module.

Published: Nov 5, 2023 License: GPL-3.0 Imports: 20 Imported by: 0

Documentation


Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	Writer   output.WriterConfig `yaml:"writer,omitempty"`
	Scrapers []Scraper           `yaml:"scrapers,omitempty"`
	Global   GlobalConfig        `yaml:"global,omitempty"`
}

Config defines the overall structure of the scraper configuration. Values are taken from a YAML config file, from environment variables, or from both.

func NewConfig added in v0.2.1

func NewConfig(configPath string) (*Config, error)

type DateComponent

type DateComponent struct {
	Covers          date.CoveredDateParts `yaml:"covers"`
	ElementLocation ElementLocation       `yaml:"location"`
	Layout          []string              `yaml:"layout"`
	Transform       []TransformConfig     `yaml:"transform,omitempty"`
}

A DateComponent is used to find a specific part of a date within an HTML document.

type ElementLocation

type ElementLocation struct {
	Selector      string      `yaml:"selector,omitempty"`
	JsonSelector  string      `yaml:"json_selector,omitempty"`
	NodeIndex     int         `yaml:"node_index,omitempty"`
	ChildIndex    int         `yaml:"child_index,omitempty"`
	RegexExtract  RegexConfig `yaml:"regex_extract,omitempty"`
	Attr          string      `yaml:"attr,omitempty"`
	MaxLength     int         `yaml:"max_length,omitempty"`
	EntireSubtree bool        `yaml:"entire_subtree,omitempty"`
	AllNodes      bool        `yaml:"all_nodes,omitempty"`
	Separator     string      `yaml:"separator,omitempty"`
}

ElementLocation is used to find a specific string in an HTML document.

type ElementLocations added in v0.4.3

type ElementLocations []ElementLocation

func (*ElementLocations) UnmarshalYAML added in v0.4.3

func (e *ElementLocations) UnmarshalYAML(value *yaml.Node) error

type Field added in v0.2.10

type Field struct {
	Name             string           `yaml:"name"`
	Value            string           `yaml:"value,omitempty"`
	Type             string           `yaml:"type,omitempty"`     // can currently be text, url or date
	ElementLocations ElementLocations `yaml:"location,omitempty"` // elements are string joined using the given Separator
	Separator        string           `yaml:"separator,omitempty"`
	// If a field can be found on a subpage the following variable has to contain a field name of
	// a field of type 'url' that is located on the main page.
	OnSubpage    string          `yaml:"on_subpage,omitempty"`    // applies to text, url, date
	CanBeEmpty   bool            `yaml:"can_be_empty,omitempty"`  // applies to text, url
	Components   []DateComponent `yaml:"components,omitempty"`    // applies to date
	DateLocation string          `yaml:"date_location,omitempty"` // applies to date
	DateLanguage string          `yaml:"date_language,omitempty"` // applies to date
	Hide         bool            `yaml:"hide,omitempty"`          // applies to text, url, date
	GuessYear    bool            `yaml:"guess_year,omitempty"`    // applies to date
}

A Field contains all the information necessary to scrape a dynamic field from a website, i.e. a field whose value changes for each item.

type Filter

type Filter struct {
	Field      string `yaml:"field"`
	Type       string
	Expression string `yaml:"exp"` // changed from 'regex' to 'exp' in version 0.5.7
	RegexComp  *regexp.Regexp
	DateComp   time.Time
	DateOp     string
	Match      bool `yaml:"match"`
}

A Filter is used to filter certain items from the result list.

func (*Filter) FilterMatch added in v0.5.7

func (f *Filter) FilterMatch(value interface{}) bool

func (*Filter) Initialize added in v0.5.7

func (f *Filter) Initialize(fieldType string) error
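To illustrate how a regular-expression filter with a Match flag might behave, here is a self-contained sketch that does not use this package. The names regexFilter, newRegexFilter and filterItems are hypothetical. The semantics are also an assumption: an item is kept when the result of the regex test equals Match, so Match set to false drops matching items.

```go
package main

import (
	"fmt"
	"regexp"
)

// regexFilter is a hypothetical stand-in for a Filter of regex type.
type regexFilter struct {
	Field string
	re    *regexp.Regexp
	Match bool
}

// newRegexFilter compiles the expression once, similar in spirit to an
// initialization step that runs before any item is tested.
func newRegexFilter(field, exp string, match bool) (regexFilter, error) {
	re, err := regexp.Compile(exp)
	if err != nil {
		return regexFilter{}, err
	}
	return regexFilter{Field: field, re: re, Match: match}, nil
}

// filterItems keeps the items whose field value matches (Match == true) or
// does not match (Match == false) the expression. This is an assumed semantic.
func filterItems(items []map[string]interface{}, f regexFilter) []map[string]interface{} {
	var kept []map[string]interface{}
	for _, item := range items {
		s, _ := item[f.Field].(string) // non-string values are treated as ""
		if f.re.MatchString(s) == f.Match {
			kept = append(kept, item)
		}
	}
	return kept
}

func main() {
	items := []map[string]interface{}{
		{"title": "Jazz Night"},
		{"title": "Cancelled: Rock Show"},
	}
	f, err := newRegexFilter("title", `(?i)cancelled`, false)
	if err != nil {
		panic(err)
	}
	for _, item := range filterItems(items, f) {
		fmt.Println(item["title"])
	}
}
```

Compiling the pattern once up front avoids recompiling it for every item in the result list.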

type GlobalConfig added in v0.2.1

type GlobalConfig struct {
	UserAgent string `yaml:"user-agent"`
}

GlobalConfig stores global configuration parameters that are needed across all scrapers.

type Paginator added in v0.5.0

type Paginator struct {
	Location ElementLocation `yaml:"location,omitempty"`
	MaxPages int             `yaml:"max_pages,omitempty"`
}

A Paginator is used to paginate through a website.

type RegexConfig

type RegexConfig struct {
	RegexPattern string `yaml:"exp"`
	Index        int    `yaml:"index"`
}

RegexConfig is used for extracting a substring from a string based on the given RegexPattern and Index.
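The sketch below shows one way a pattern and an index could select a substring, using only the standard library. The type regexConfig and the function extract are hypothetical. Treating Index as a zero-based position among all matches is an assumption; this page does not state how the package interprets it.

```go
package main

import (
	"fmt"
	"regexp"
)

// regexConfig mirrors the two documented fields of RegexConfig.
type regexConfig struct {
	RegexPattern string
	Index        int
}

// extract returns the match at position cfg.Index among all matches of
// cfg.RegexPattern in s. The zero-based indexing is an assumption.
func extract(s string, cfg regexConfig) (string, error) {
	re, err := regexp.Compile(cfg.RegexPattern)
	if err != nil {
		return "", err
	}
	matches := re.FindAllString(s, -1)
	if cfg.Index < 0 || cfg.Index >= len(matches) {
		return "", fmt.Errorf("index %d out of range for %d matches", cfg.Index, len(matches))
	}
	return matches[cfg.Index], nil
}

func main() {
	cfg := regexConfig{RegexPattern: `[0-9]{2}:[0-9]{2}`, Index: 1}
	got, err := extract("Doors 19:00, show 20:30", cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // prints "20:30"
}
```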

type Scraper

type Scraper struct {
	Name                string            `yaml:"name"`
	URL                 string            `yaml:"url"`
	Item                string            `yaml:"item"`
	ExcludeWithSelector []string          `yaml:"exclude_with_selector,omitempty"`
	Fields              []Field           `yaml:"fields,omitempty"`
	Filters             []*Filter         `yaml:"filters,omitempty"`
	Paginator           Paginator         `yaml:"paginator,omitempty"`
	RenderJs            bool              `yaml:"renderJs,omitempty"`
	PageLoadWaitSeconds int               `yaml:"page_load_wait_sec,omitempty"` // only taken into account when renderJs = true
	Interaction         types.Interaction `yaml:"interaction,omitempty"`
	// contains filtered or unexported fields
}

A Scraper contains all the config parameters and structs needed to extract the desired information from a website.

func (Scraper) GetItems added in v0.1.2

func (c Scraper) GetItems(globalConfig *GlobalConfig, rawDyn bool) ([]map[string]interface{}, error)

GetItems fetches and returns all items from a website according to the Scraper's parameters. When rawDyn is true, the returned items are not processed according to their type; instead, the raw values based only on the location are returned (it is unclear whether regex_extract is ignored). In that case only dynamic fields are returned, i.e. fields that have no predefined value and that are present on the main page (not on subpages). This is used by the ML feature generation.

type TransformConfig added in v0.3.5

type TransformConfig struct {
	TransformType string `yaml:"type,omitempty"`    // only regex-replace for now
	RegexPattern  string `yaml:"regex,omitempty"`   // a container for the pattern
	Replacement   string `yaml:"replace,omitempty"` // a plain string for replacement
}

TransformConfig is used to replace an existing substring with another string. Transforms need to be applied before dates are extracted.
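As a rough illustration of why a transform runs before date extraction, the sketch below removes a stray dot from a scraped string so that a Go time layout can parse it. The names transformConfig and applyTransforms are hypothetical, and the sketch uses only the standard library rather than this package. Because the Replacement field is described as a plain string, the sketch uses ReplaceAllLiteralString; whether the package does the same is not confirmed here.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// transformConfig mirrors the three documented fields of TransformConfig.
type transformConfig struct {
	TransformType string
	RegexPattern  string
	Replacement   string
}

// applyTransforms runs each regex-replace transform in order. Rejecting any
// other type is an assumption based on the "only regex-replace for now" comment.
func applyTransforms(s string, ts []transformConfig) (string, error) {
	for _, t := range ts {
		if t.TransformType != "regex-replace" {
			return "", fmt.Errorf("unsupported transform type %q", t.TransformType)
		}
		re, err := regexp.Compile(t.RegexPattern)
		if err != nil {
			return "", err
		}
		// The replacement is treated literally, so "$1" would not be expanded.
		s = re.ReplaceAllLiteralString(s, t.Replacement)
	}
	return s, nil
}

func main() {
	raw := "4 Nov. 2023"
	cleaned, err := applyTransforms(raw, []transformConfig{
		{TransformType: "regex-replace", RegexPattern: `\.`, Replacement: ""},
	})
	if err != nil {
		panic(err)
	}
	d, err := time.Parse("2 Jan 2006", cleaned)
	if err != nil {
		panic(err)
	}
	fmt.Println(cleaned, "->", d.Format("2006-01-02"))
}
```

Without the transform, the layout "2 Jan 2006" would fail on the trailing dot after the month abbreviation.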
