scraper

package
v0.3.0
Warning

This package is not in the latest version of its module.

Published: Oct 7, 2022 | License: GPL-3.0 | Imports: 18 | Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	Writer   output.WriterConfig `yaml:"writer,omitempty"`
	Scrapers []Scraper           `yaml:"scrapers,omitempty"`
	Global   GlobalConfig        `yaml:"global,omitempty"`
}

Config defines the overall structure of the scraper configuration. Values are read from a YAML config file, from environment variables, or from both.

func NewConfig added in v0.2.1

func NewConfig(configPath string) (*Config, error)

type CoveredDateParts

type CoveredDateParts struct {
	Day   bool `yaml:"day"`
	Month bool `yaml:"month"`
	Year  bool `yaml:"year"`
	Time  bool `yaml:"time"`
}

CoveredDateParts is used to determine what parts of a date a DateComponent covers

type DateComponent

type DateComponent struct {
	Covers          CoveredDateParts `yaml:"covers"`
	ElementLocation ElementLocation  `yaml:"location"`
	Layout          []string         `yaml:"layout"`
}

A DateComponent is used to find a specific part of a date within an HTML document.

type ElementLocation

type ElementLocation struct {
	Selector      string      `yaml:"selector,omitempty"`
	NodeIndex     int         `yaml:"node_index,omitempty"`
	ChildIndex    int         `yaml:"child_index,omitempty"`
	RegexExtract  RegexConfig `yaml:"regex_extract,omitempty"`
	Attr          string      `yaml:"attr,omitempty"`
	MaxLength     int         `yaml:"max_length,omitempty"`
	EntireSubtree bool        `yaml:"entire_subtree,omitempty"`
}

ElementLocation is used to find a specific string in an HTML document.

type Field added in v0.2.10

type Field struct {
	Name  string `yaml:"name"`
	Value string `yaml:"value,omitempty"`
	Type  string `yaml:"type,omitempty"` // can currently be text, url or date
	// If a field can be found on a subpage the following variable has to contain a field name of
	// a field of type 'url' that is located on the main page.
	ElementLocation ElementLocation `yaml:"location,omitempty"`
	OnSubpage       string          `yaml:"on_subpage,omitempty"`    // applies to text, url, date
	CanBeEmpty      bool            `yaml:"can_be_empty,omitempty"`  // applies to text, url
	Components      []DateComponent `yaml:"components,omitempty"`    // applies to date
	DateLocation    string          `yaml:"date_location,omitempty"` // applies to date
	DateLanguage    string          `yaml:"date_language,omitempty"` // applies to date
	Hide            bool            `yaml:"hide,omitempty"`          // applies to text, url, date
}

A Field contains all the information necessary to scrape a dynamic field from a website, i.e. a field whose value changes for each item.

type Filter

type Filter struct {
	Field string `yaml:"field"`
	Regex string `yaml:"regex"`
	Match bool   `yaml:"match"`
}

A Filter is used to filter certain items from the result list.
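The documentation does not spell out how Match interacts with Regex. One plausible reading, sketched below, is that an item is kept when "the regex matches the field value" equals Match, so `match: false` drops matching items. The local `Filter` type mirrors the documented struct, and the `keep` function is hypothetical, not the package's implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// Filter mirrors the documented struct (yaml tags omitted for brevity).
type Filter struct {
	Field string
	Regex string
	Match bool
}

// keep is a hypothetical helper reporting whether item passes f.
// Assumption: an item is kept when the regex result equals f.Match.
func keep(item map[string]interface{}, f Filter) (bool, error) {
	re, err := regexp.Compile(f.Regex)
	if err != nil {
		return false, err
	}
	// Missing or non-string values are treated as the empty string.
	s, _ := item[f.Field].(string)
	return re.MatchString(s) == f.Match, nil
}

func main() {
	// Sample items, made up for illustration.
	items := []map[string]interface{}{
		{"title": "Jazz Night"},
		{"title": "CANCELLED: Rock Show"},
	}
	f := Filter{Field: "title", Regex: "(?i)cancelled", Match: false}
	for _, item := range items {
		ok, err := keep(item, f)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		if ok {
			fmt.Println(item["title"]) // prints "Jazz Night" only
		}
	}
}
```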

type GlobalConfig added in v0.2.1

type GlobalConfig struct {
	UserAgent string `yaml:"user-agent"`
}

GlobalConfig is used for storing global configuration parameters that are needed across all scrapers

type RegexConfig

type RegexConfig struct {
	Exp   string `yaml:"exp"`
	Index int    `yaml:"index"`
}

RegexConfig is used for extracting a substring from a string based on the given Exp and Index.
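As an illustration of the Exp and Index pair, here is a minimal sketch that assumes Index selects the n-th match of Exp, counting from 0. The local `RegexConfig` type mirrors the documented struct, and the `extract` function is hypothetical; the package's own handling (for example of out-of-range or negative indices) may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// RegexConfig mirrors the documented struct (yaml tags omitted for brevity).
type RegexConfig struct {
	Exp   string
	Index int
}

// extract is a hypothetical helper returning the match of rc.Exp at
// position rc.Index in s.
// Assumption: Index counts matches from 0; out-of-range indices are errors.
func extract(s string, rc RegexConfig) (string, error) {
	re, err := regexp.Compile(rc.Exp)
	if err != nil {
		return "", err
	}
	matches := re.FindAllString(s, -1)
	if rc.Index < 0 || rc.Index >= len(matches) {
		return "", fmt.Errorf("no match at index %d for %q", rc.Index, rc.Exp)
	}
	return matches[rc.Index], nil
}

func main() {
	rc := RegexConfig{Exp: `[0-9]{2}:[0-9]{2}`, Index: 1}
	got, err := extract("Doors 19:30, show 20:15", rc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got) // prints "20:15"
}
```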

type Scraper

type Scraper struct {
	Name                string   `yaml:"name"`
	URL                 string   `yaml:"url"`
	Item                string   `yaml:"item"`
	ExcludeWithSelector []string `yaml:"exclude_with_selector,omitempty"`
	Fields              []Field  `yaml:"fields,omitempty"`
	Filters             []Filter `yaml:"filters,omitempty"`
	Paginator           struct {
		Location ElementLocation `yaml:"location,omitempty"`
		MaxPages int             `yaml:"max_pages,omitempty"`
	} `yaml:"paginator,omitempty"`
	RenderJs bool `yaml:"renderJs,omitempty"`
}

A Scraper contains all the necessary config parameters and structs needed to extract the desired information from a website

func (Scraper) GetItems added in v0.1.2

func (c Scraper) GetItems(globalConfig *GlobalConfig) ([]map[string]interface{}, error)

GetItems fetches and returns all items from a website according to the Scraper's parameters.
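Since each item is a `map[string]interface{}`, callers need type assertions to read field values. The sketch below shows one defensive way to do that on hand-written sample data in the same shape; it does not call GetItems. The `stringField` helper is hypothetical, and the assumption that text and url fields arrive as strings is not confirmed by this documentation.

```go
package main

import "fmt"

// stringField is a hypothetical helper: it returns the named field of item
// as a string, and false if the field is absent or not a string.
func stringField(item map[string]interface{}, name string) (string, bool) {
	s, ok := item[name].(string)
	return s, ok
}

func main() {
	// Sample data in the shape GetItems returns, made up for illustration.
	items := []map[string]interface{}{
		{"title": "Jazz Night", "url": "https://example.com/jazz"},
		{"title": 42}, // a non-string value, to show the failure path
	}
	for i, item := range items {
		title, ok := stringField(item, "title")
		if !ok {
			fmt.Printf("item %d: no string title\n", i)
			continue
		}
		fmt.Printf("item %d: %s\n", i, title)
	}
}
```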
