et

Version: v0.0.0-...-e89bf66
Published: Mar 5, 2021 License: Apache-2.0 Imports: 17 Imported by: 5

README

Extract-Transform from the web


Documentation


Constants

This section is empty.

Variables

This section is empty.

Functions

func Parse

func Parse(fname, url, page string) ([]map[string]interface{}, error)

func ParseExt

func ParseExt(fname, url, page string) (string, error)

func ParseLinks

func ParseLinks(page, url string) ([]string, error)

ParseLinks returns all URLs contained in the HTML page.
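The signature takes both the page and its URL, which suggests the URL serves as the base for resolving relative links. That reading is an assumption; the documentation does not say so. A minimal sketch of that kind of resolution, using only the standard library, might look like the following. The helper `resolveLinks` and the sample URLs are hypothetical and are not part of this package.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLinks is a hypothetical helper, not part of the et package.
// It resolves each href against base and returns absolute URLs.
func resolveLinks(base string, hrefs []string) ([]string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, h := range hrefs {
		u, err := url.Parse(h)
		if err != nil {
			continue // skip hrefs that do not parse
		}
		out = append(out, b.ResolveReference(u).String())
	}
	return out, nil
}

func main() {
	links, err := resolveLinks("https://example.com/a/b.html",
		[]string{"c.html", "/d", "https://other.example/e"})
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```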

func ParseNewLinks

func ParseNewLinks(page, url string) ([]string, error)

ParseNewLinks returns the new URLs contained in the HTML page.

Types

type DomNode

type DomNode struct {
	Name string
	Node *html.Node
	Item map[string]interface{}
}

DomNode is for internal use.

type Parser

type Parser struct {
	Name          string             `json:"name"`
	DefaultFields bool               `json:"default_fields"`
	ZipContent    bool               `json:"zip_content"`
	ExampleUrl    string             `json:"example_url"`
	UA            string             `json:"ua"`
	Urls          []string           `json:"urls"`
	Rules         map[string][]*Rule `json:"rules"`
	Js            string             `json:"js"`
}

Parser contains a set of cascaded rules and optional JS code for parsing the corresponding HTML pages.
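Because every field carries a JSON tag, a parser definition can presumably be stored as a JSON document and decoded with encoding/json. The sketch below shows that decoding step only. The import path is not shown on this page, so the structs are copied locally. The sample configuration, the "root" rule-group name, and the `loadParser` helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rule and Parser are local copies of the struct definitions documented
// on this page, so the sketch compiles without the package import path.
type Rule struct {
	Type  string   `json:"type"`
	Key   string   `json:"key"`
	Xpath string   `json:"xpath"`
	Re    []string `json:"re"`
	Js    string   `json:"js"`
}

type Parser struct {
	Name          string             `json:"name"`
	DefaultFields bool               `json:"default_fields"`
	ZipContent    bool               `json:"zip_content"`
	ExampleUrl    string             `json:"example_url"`
	UA            string             `json:"ua"`
	Urls          []string           `json:"urls"`
	Rules         map[string][]*Rule `json:"rules"`
	Js            string             `json:"js"`
}

// exampleConfig is hypothetical. The "root" group name is an assumption;
// this page does not document which keys Rules expects.
const exampleConfig = `{
  "name": "example",
  "example_url": "https://example.com/list",
  "rules": {
    "root": [
      {"type": "text", "key": "title", "xpath": "//h1"},
      {"type": "url", "key": "links", "xpath": "//a/@href"}
    ]
  }
}`

// loadParser is a hypothetical helper that decodes a JSON definition.
func loadParser(data string) (*Parser, error) {
	var p Parser
	if err := json.Unmarshal([]byte(data), &p); err != nil {
		return nil, err
	}
	return &p, nil
}

func main() {
	p, err := loadParser(exampleConfig)
	if err != nil {
		panic(err)
	}
	for _, r := range p.Rules["root"] {
		fmt.Printf("%s rule %q uses xpath %s\n", r.Type, r.Key, r.Xpath)
	}
}
```

With the real package, the decoded value would be a *Parser whose Parse or ParseURL method could then be called. That step is omitted here because it needs the package itself.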

func (*Parser) Do

func (p *Parser) Do() ([]*UrlTask, []map[string]interface{}, error)

func (*Parser) Parse

func (p *Parser) Parse(
	page, pageUrl string) ([]*UrlTask, []map[string]interface{}, error)

Parse parses the page of pageUrl and returns new UrlTasks and Items

func (*Parser) ParseURL

func (p *Parser) ParseURL(url string) ([]*UrlTask, []map[string]interface{}, error)

func (*Parser) RunJs

func (p *Parser) RunJs(
	items []map[string]interface{}) ([]map[string]interface{}, error)

RunJs runs the parser's JS code on items.

type Rule

type Rule struct {
	Type  string   `json:"type"`
	Key   string   `json:"key"`
	Xpath string   `json:"xpath"`
	Re    []string `json:"re"`
	Js    string   `json:"js"`
}

Rule extracts a specific key by applying XPath, regexp, and JS in sequence. Five types are supported for now: url, dom, text, html, and attr.
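The description does not spell out how the patterns in Re are combined. One plausible reading, sketched below as an assumption, is that each pattern is applied to the output of the previous one, keeping the first capture group when there is one. The helper `applyRegexps` and the sample input are hypothetical. The XPath and JS stages are left out because they need libraries beyond the standard one.

```go
package main

import (
	"fmt"
	"regexp"
)

// applyRegexps is a hypothetical helper, not the package's implementation.
// It feeds value through each pattern in order. A pattern with a capture
// group keeps group 1; a pattern without one keeps the whole match.
// A pattern that does not match yields an empty result.
func applyRegexps(value string, patterns []string) (string, error) {
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return "", err
		}
		m := re.FindStringSubmatch(value)
		if m == nil {
			return "", nil
		}
		if len(m) > 1 {
			value = m[1]
		} else {
			value = m[0]
		}
	}
	return value, nil
}

func main() {
	v, err := applyRegexps("Price: USD 42.50 (incl. tax)",
		[]string{`USD ([0-9.]+)`, `^[0-9]+`})
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // prints "42"
}
```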

func (*Rule) RunJs

func (r *Rule) RunJs(v interface{}) (interface{}, error)

RunJs runs the rule's JS code on v.

type UrlTask

type UrlTask struct {
	ParserName string      `json:"parser_name"`
	Url        string      `json:"url"`
	TaskName   string      `json:"task_name"`
	Ext        interface{} `json:"ext"`
}

UrlTask describes a crawl task: the page at Url should be parsed by the parser named ParserName.
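The JSON tags determine how a task looks when serialized, for example when handed to a queue. The sketch below encodes a task with encoding/json, using a local copy of the struct because the import path is not shown on this page. The field values and the `encodeTask` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UrlTask is a local copy of the struct documented on this page.
type UrlTask struct {
	ParserName string      `json:"parser_name"`
	Url        string      `json:"url"`
	TaskName   string      `json:"task_name"`
	Ext        interface{} `json:"ext"`
}

// encodeTask is a hypothetical helper that serializes a task to JSON.
func encodeTask(t *UrlTask) (string, error) {
	b, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encodeTask(&UrlTask{
		ParserName: "example",
		Url:        "https://example.com/page/2",
		TaskName:   "demo",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	// prints {"parser_name":"example","url":"https://example.com/page/2","task_name":"demo","ext":null}
}
```

Ext is an empty interface, so an unset value encodes as null and any JSON-serializable value can be attached to a task.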

Directories

Path      Synopsis
cmd/et
