scraper

package
v0.2.4
Published: Jun 28, 2022 License: GPL-3.0 Imports: 16 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	Writer   output.WriterConfig `yaml:"writer"`
	Scrapers []Scraper           `yaml:"scrapers"`
	Global   GlobalConfig        `yaml:"global"`
}

Config defines the overall structure of the scraper configuration. Values are taken from a YAML config file, from environment variables, or from both.

func NewConfig added in v0.2.1

func NewConfig(configPath string) (*Config, error)

type CoveredDateParts

type CoveredDateParts struct {
	Day   bool `yaml:"day"`
	Month bool `yaml:"month"`
	Year  bool `yaml:"year"`
	Time  bool `yaml:"time"`
}

CoveredDateParts indicates which parts of a date a DateComponent covers.

type DateComponent

type DateComponent struct {
	Covers          CoveredDateParts `yaml:"covers"`
	ElementLocation ElementLocation  `yaml:"location"`
	Layout          []string         `yaml:"layout"`
}

A DateComponent is used to find a specific part of a date within an HTML document.

type DynamicField

type DynamicField struct {
	Name string `yaml:"name"`
	Type string `yaml:"type"` // can currently be text, url or date
	// If a field can be found on a subpage, OnSubpage has to contain the name of a field of
	// type 'url' that is located on the main page.
	ElementLocation ElementLocation `yaml:"location"`
	OnSubpage       string          `yaml:"on_subpage"`    // applies to text, url, date
	CanBeEmpty      bool            `yaml:"can_be_empty"`  // applies to text, url
	Components      []DateComponent `yaml:"components"`    // applies to date
	DateLocation    string          `yaml:"date_location"` // applies to date
	DateLanguage    string          `yaml:"date_language"` // applies to date
	Hide            bool            `yaml:"hide"`          // applies to text, url, date
}

A DynamicField contains all the information necessary to scrape a dynamic field from a website, i.e. a field whose value changes for each item.

type ElementLocation

type ElementLocation struct {
	Selector      string      `yaml:"selector"`
	NodeIndex     int         `yaml:"node_index"`
	ChildIndex    int         `yaml:"child_index"`
	RegexExtract  RegexConfig `yaml:"regex_extract"`
	Attr          string      `yaml:"attr"`
	MaxLength     int         `yaml:"max_length"`
	EntireSubtree bool        `yaml:"entire_subtree"`
}

ElementLocation is used to find a specific string in an HTML document.

type Filter

type Filter struct {
	Field string `yaml:"field"`
	Regex string `yaml:"regex"`
	Match bool   `yaml:"match"`
}

A Filter is used to filter certain items from the result list
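The documentation does not spell out how Field, Regex and Match interact. The sketch below shows one plausible reading: an item is kept when the result of matching Regex against the named field equals Match. The `keep` helper and this interpretation are assumptions, not taken from the package source.

```go
package main

import (
	"fmt"
	"regexp"
)

// Filter mirrors the fields of the struct documented above (yaml tags omitted).
type Filter struct {
	Field string
	Regex string
	Match bool
}

// keep is a hypothetical helper that reports whether an item passes a filter.
// Assumption: the item passes when "regex matches the field value" == f.Match.
func keep(item map[string]interface{}, f Filter) (bool, error) {
	re, err := regexp.Compile(f.Regex)
	if err != nil {
		return false, err
	}
	// Missing or non-string values are treated as the empty string.
	s, _ := item[f.Field].(string)
	return re.MatchString(s) == f.Match, nil
}

func main() {
	f := Filter{Field: "title", Regex: "(?i)cancelled", Match: false}
	items := []map[string]interface{}{
		{"title": "Jazz Night"},
		{"title": "Cancelled: Open Mic"},
	}
	for _, it := range items {
		ok, err := keep(it, f)
		fmt.Println(it["title"], ok, err)
	}
}
```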

type GlobalConfig added in v0.2.1

type GlobalConfig struct {
	UserAgent string `yaml:"user-agent"`
}

GlobalConfig is used for storing global configuration parameters that are needed across all scrapers
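As an illustration of where a global user agent would typically end up, the following sketch sets it on an outgoing request using only `net/http`. The `newRequest` helper is hypothetical; how the package itself applies `UserAgent` is not documented here.

```go
package main

import (
	"fmt"
	"net/http"
)

// GlobalConfig mirrors the struct documented above (yaml tag omitted).
type GlobalConfig struct {
	UserAgent string
}

// newRequest is a hypothetical helper that builds a GET request and, if a
// user agent is configured, sets the User-Agent header accordingly.
func newRequest(url string, gc *GlobalConfig) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if gc != nil && gc.UserAgent != "" {
		req.Header.Set("User-Agent", gc.UserAgent)
	}
	return req, nil
}

func main() {
	req, err := newRequest("https://example.com/events", &GlobalConfig{UserAgent: "my-scraper/1.0"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.UserAgent())
}
```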

type RegexConfig

type RegexConfig struct {
	Exp   string `yaml:"exp"`
	Index int    `yaml:"index"`
}

RegexConfig is used for extracting a substring from a string based on the given Exp and Index.
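A minimal sketch of such an extraction, using the standard `regexp` package, might look as follows. It assumes that Index selects the n-th (0-based) match of Exp in the input; that interpretation and the `extract` helper are assumptions rather than documented behavior.

```go
package main

import (
	"fmt"
	"regexp"
)

// RegexConfig mirrors the struct documented above (yaml tags omitted).
type RegexConfig struct {
	Exp   string
	Index int
}

// extract is a hypothetical helper. Assumption: Index is the 0-based
// position of the wanted match among all matches of Exp in s.
func extract(s string, rc RegexConfig) (string, error) {
	re, err := regexp.Compile(rc.Exp)
	if err != nil {
		return "", err
	}
	matches := re.FindAllString(s, -1)
	if rc.Index < 0 || rc.Index >= len(matches) {
		return "", fmt.Errorf("regex %q: no match at index %d in %q", rc.Exp, rc.Index, s)
	}
	return matches[rc.Index], nil
}

func main() {
	// Pick the second time of day mentioned in the string.
	got, err := extract("Doors 19:00, show 20:30", RegexConfig{Exp: `[0-9]{2}:[0-9]{2}`, Index: 1})
	fmt.Println(got, err)
}
```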

type Scraper

type Scraper struct {
	Name                string   `yaml:"name"`
	URL                 string   `yaml:"url"`
	Item                string   `yaml:"item"`
	ExcludeWithSelector []string `yaml:"exclude_with_selector"`
	Fields              struct {
		Static  []StaticField  `yaml:"static"`
		Dynamic []DynamicField `yaml:"dynamic"`
	} `yaml:"fields"`
	Filters   []Filter `yaml:"filters"`
	Paginator struct {
		Location ElementLocation `yaml:"location"`
		MaxPages int             `yaml:"max_pages"`
	}
}

A Scraper contains all the necessary config parameters and structs needed to extract the desired information from a website

func (Scraper) GetItems added in v0.1.2

func (c Scraper) GetItems(globalConfig *GlobalConfig) ([]map[string]interface{}, error)

GetItems fetches and returns all items from a website according to the Scraper's parameters.

type StaticField

type StaticField struct {
	Name  string `yaml:"name"`
	Value string `yaml:"value"`
}

A StaticField defines a field that has a fixed name and value across all scraped items
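Since a static field carries the same name and value for every item, applying it amounts to copying that pair into each item map. The sketch below illustrates this with the `[]map[string]interface{}` shape returned by GetItems; the `applyStatic` helper is hypothetical and not part of the package.

```go
package main

import "fmt"

// StaticField mirrors the struct documented above (yaml tags omitted).
type StaticField struct {
	Name  string
	Value string
}

// applyStatic is a hypothetical helper that writes every static field
// into every item, overwriting an existing key of the same name.
func applyStatic(items []map[string]interface{}, fields []StaticField) {
	for _, item := range items {
		for _, f := range fields {
			item[f.Name] = f.Value
		}
	}
}

func main() {
	items := []map[string]interface{}{
		{"title": "Jazz Night"},
		{"title": "Open Mic"},
	}
	applyStatic(items, []StaticField{{Name: "city", Value: "Zurich"}})
	for _, it := range items {
		fmt.Println(it["title"], it["city"])
	}
}
```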
