scraper

package
v0.4.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 7, 2023 License: GPL-3.0 Imports: 18 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	Writer   output.WriterConfig `yaml:"writer,omitempty"`
	Scrapers []Scraper           `yaml:"scrapers,omitempty"`
	Global   GlobalConfig        `yaml:"global,omitempty"`
}

Config defines the overall structure of the scraper configuration. Values are taken from a YAML config file, from environment variables, or from both.

func NewConfig added in v0.2.1

func NewConfig(configPath string) (*Config, error)

type CoveredDateParts

type CoveredDateParts struct {
	Day   bool `yaml:"day,omitempty"`
	Month bool `yaml:"month,omitempty"`
	Year  bool `yaml:"year,omitempty"`
	Time  bool `yaml:"time,omitempty"`
}

CoveredDateParts specifies which parts of a date a DateComponent covers.

type DateComponent

type DateComponent struct {
	Covers          CoveredDateParts  `yaml:"covers"`
	ElementLocation ElementLocation   `yaml:"location"`
	Layout          []string          `yaml:"layout"`
	Transform       []TransformConfig `yaml:"transform,omitempty"`
}

A DateComponent is used to find a specific part of a date within an HTML document.

type ElementLocation

type ElementLocation struct {
	Selector      string      `yaml:"selector,omitempty"`
	NodeIndex     int         `yaml:"node_index,omitempty"`
	ChildIndex    int         `yaml:"child_index,omitempty"`
	RegexExtract  RegexConfig `yaml:"regex_extract,omitempty"`
	Attr          string      `yaml:"attr,omitempty"`
	MaxLength     int         `yaml:"max_length,omitempty"`
	EntireSubtree bool        `yaml:"entire_subtree,omitempty"`
}

ElementLocation is used to find a specific string in an HTML document.

type Field added in v0.2.10

type Field struct {
	Name  string `yaml:"name"`
	Value string `yaml:"value,omitempty"`
	Type  string `yaml:"type,omitempty"` // can currently be text, url or date
	// If a field can be found on a subpage, OnSubpage has to contain the name of
	// a field of type 'url' that is located on the main page.
	ElementLocation ElementLocation `yaml:"location,omitempty"`
	OnSubpage       string          `yaml:"on_subpage,omitempty"`    // applies to text, url, date
	CanBeEmpty      bool            `yaml:"can_be_empty,omitempty"`  // applies to text, url
	Components      []DateComponent `yaml:"components,omitempty"`    // applies to date
	DateLocation    string          `yaml:"date_location,omitempty"` // applies to date
	DateLanguage    string          `yaml:"date_language,omitempty"` // applies to date
	Hide            bool            `yaml:"hide,omitempty"`          // applies to text, url, date
}

A Field contains all the information necessary to scrape a dynamic field from a website, i.e. a field whose value changes for each item.

type Filter

type Filter struct {
	Field string `yaml:"field"`
	Regex string `yaml:"regex"`
	Match bool   `yaml:"match"`
}

A Filter is used to filter certain items from the result list
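The documentation does not spell out how Match interacts with Regex. The sketch below assumes an item is kept when the regex result on the named field equals Match, so that `Match: false` drops matching items. The `Filter` type is a local copy for illustration, and `keep` and the sample items are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

// Filter is a local copy of the documented struct, without YAML tags.
type Filter struct {
	Field string
	Regex string
	Match bool
}

// keep is a hypothetical helper. It assumes an item passes the filter when
// "the regex matches the field value" equals f.Match.
func keep(f Filter, item map[string]interface{}) (bool, error) {
	re, err := regexp.Compile(f.Regex)
	if err != nil {
		return false, err
	}
	// Missing or non-string values are treated as the empty string here.
	val, _ := item[f.Field].(string)
	return re.MatchString(val) == f.Match, nil
}

func main() {
	f := Filter{Field: "title", Regex: "(?i)cancelled", Match: false}
	items := []map[string]interface{}{
		{"title": "Jazz Night"},
		{"title": "CANCELLED: Rock Show"},
	}
	for _, it := range items {
		ok, err := keep(f, it)
		fmt.Println(it["title"], ok, err)
	}
}
```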

type GlobalConfig added in v0.2.1

type GlobalConfig struct {
	UserAgent string `yaml:"user-agent"`
}

GlobalConfig is used for storing global configuration parameters that are needed across all scrapers

type RegexConfig

type RegexConfig struct {
	Exp   string `yaml:"exp"`
	Index int    `yaml:"index"`
}

RegexConfig is used for extracting a substring from a string based on the given Exp and Index
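As an illustration of Exp and Index, here is a minimal sketch that assumes Index is a zero-based position among all non-overlapping matches of Exp. That interpretation, the helper `extract`, and the sample input are assumptions and not taken from the package source:

```go
package main

import (
	"fmt"
	"regexp"
)

// RegexConfig is a local copy of the documented struct, without YAML tags.
type RegexConfig struct {
	Exp   string
	Index int
}

// extract is a hypothetical helper. It returns the match at position
// rc.Index among all matches of rc.Exp in s.
func extract(rc RegexConfig, s string) (string, error) {
	re, err := regexp.Compile(rc.Exp)
	if err != nil {
		return "", err
	}
	matches := re.FindAllString(s, -1)
	if rc.Index < 0 || rc.Index >= len(matches) {
		return "", fmt.Errorf("index %d out of range for %d matches", rc.Index, len(matches))
	}
	return matches[rc.Index], nil
}

func main() {
	// The matches are "19", "30", "20", "15", so Index 1 selects "30".
	got, err := extract(RegexConfig{Exp: `[0-9]+`, Index: 1}, "Doors 19:30, show 20:15")
	fmt.Println(got, err)
}
```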

type Scraper

type Scraper struct {
	Name                string   `yaml:"name"`
	URL                 string   `yaml:"url"`
	Item                string   `yaml:"item"`
	ExcludeWithSelector []string `yaml:"exclude_with_selector,omitempty"`
	Fields              []Field  `yaml:"fields,omitempty"`
	Filters             []Filter `yaml:"filters,omitempty"`
	Paginator           struct {
		Location ElementLocation `yaml:"location,omitempty"`
		MaxPages int             `yaml:"max_pages,omitempty"`
	} `yaml:"paginator,omitempty"`
	RenderJs bool `yaml:"renderJs,omitempty"`
}

A Scraper contains all the necessary config parameters and structs needed to extract the desired information from a website

func (Scraper) GetItems added in v0.1.2

func (c Scraper) GetItems(globalConfig *GlobalConfig, rawDyn bool) ([]map[string]interface{}, error)

GetItems fetches and returns all items from a website according to the Scraper's parameters. When rawDyn is set to true, the items are not processed according to their type. Instead, the raw values based only on the location are returned (whether regex_extract is ignored in this mode is an open question), and only for dynamic fields, i.e. fields that don't have a predefined value and that are present on the main page (not on subpages). This is used by the ML feature generation.
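GetItems returns each item as a `map[string]interface{}`, so callers have to inspect value types at run time. The sketch below shows one way to do that on a hand-written item. The helper `describe`, the sample data, and the assumption that date fields arrive as `time.Time` are all hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// describe is a hypothetical helper that renders one item as sorted
// "key=value" strings, switching on the dynamic type of each value.
func describe(item map[string]interface{}) []string {
	keys := make([]string, 0, len(item))
	for k := range item {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort for stable output

	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		switch v := item[k].(type) {
		case string:
			lines = append(lines, fmt.Sprintf("%s=%s", k, v))
		case time.Time: // assumption: date fields are returned as time.Time
			lines = append(lines, fmt.Sprintf("%s=%s", k, v.Format(time.RFC3339)))
		default:
			lines = append(lines, fmt.Sprintf("%s=%v", k, v))
		}
	}
	return lines
}

func main() {
	// Hand-written stand-in for one element of the slice GetItems returns.
	item := map[string]interface{}{
		"title": "Jazz Night",
		"url":   "https://example.com/jazz",
	}
	for _, line := range describe(item) {
		fmt.Println(line)
	}
}
```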

type TransformConfig added in v0.3.5

type TransformConfig struct {
	TransformType string `yaml:"type,omitempty"`    // only regex-replace for now
	RegexPattern  string `yaml:"regex,omitempty"`   // a container for the pattern
	Replacement   string `yaml:"replace,omitempty"` // a plain string for replacement
}

TransformConfig is used to replace an existing substring with some other kind of string. Processing needs to happen before extracting dates.
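A minimal sketch of a regex-replace transform applied before date parsing, using only the standard library. The struct is a local copy, `applyTransforms` is a hypothetical helper, and the use of `ReplaceAllLiteralString` (so that the replacement is treated as a plain string) is an assumption about the package's behavior:

```go
package main

import (
	"fmt"
	"regexp"
)

// TransformConfig is a local copy of the documented struct, without YAML tags.
type TransformConfig struct {
	TransformType string
	RegexPattern  string
	Replacement   string
}

// applyTransforms is a hypothetical helper that applies each transform in
// order. Only "regex-replace" is handled, matching the documented comment.
func applyTransforms(ts []TransformConfig, s string) (string, error) {
	for _, t := range ts {
		if t.TransformType != "regex-replace" {
			return "", fmt.Errorf("unsupported transform type %q", t.TransformType)
		}
		re, err := regexp.Compile(t.RegexPattern)
		if err != nil {
			return "", err
		}
		// Literal replacement: "$1" and similar are not expanded.
		s = re.ReplaceAllLiteralString(s, t.Replacement)
	}
	return s, nil
}

func main() {
	ts := []TransformConfig{
		{TransformType: "regex-replace", RegexPattern: `Sept\.`, Replacement: "Sep"},
	}
	out, err := applyTransforms(ts, "Sept. 5 2023")
	fmt.Println(out, err)
}
```

Normalizing the string first means a standard layout such as `Jan 2 2006` can parse a value that a site writes with a non-standard month abbreviation.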
