sq

Version: v0.0.0-...-1ef4f4d
Published: Jan 11, 2019 License: MIT Imports: 16 Imported by: 1

sq is a simple yet powerful scraping library for Go.

sq uses struct tags for configuration, together with reflection and goquery, to unmarshal data out of HTML pages.

type ExamplePage struct {
	Title string `sq:"title | text"`

	Users []struct {
		ID        int       `sq:"td:nth-child(1) | text | regexp(\\d+)"`
		Name      string    `sq:"td:nth-child(2) | text"`
		Email     string    `sq:"td:nth-child(3) a | attr(href) | regexp(mailto:(.+))"`
		Website   *url.URL  `sq:"td:nth-child(4) > a | attr(href)"`
		Timestamp time.Time `sq:"td:nth-child(5) | text | time(2006 01 02)"`
		RowMarkup string    `sq:" . | html"`
	} `sq:"table tr"`

	Stylesheets []*css.Stylesheet `sq:"style"`
	Javascripts []*ast.Program    `sq:"script[type$=javascript]"`

	HTMLSnippet      *html.Node         `sq:"div.container"`
	GoquerySelection *goquery.Selection `sq:"[href], [src]"`
}

resp, err := http.Get("https://example.com")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

var p ExamplePage

// Scrape continues on error and returns a slice of errors that occurred.
errs := sq.Scrape(&p, resp.Body)
for _, err := range errs {
	fmt.Println(err)
}

Note: Go struct tag values are parsed as quoted strings, so every backslash must be escaped (e.g. \d+ becomes \\d+).
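The escaping rule can be checked with the standard library alone. The sketch below is not part of sq; the `row` type and the `tagValue` helper are hypothetical names used only for illustration. It reads an `sq` tag through `reflect.StructTag.Get`, which unquotes the tag value and so turns `\\d+` into `\d+`.

```go
package main

import (
	"fmt"
	"reflect"
)

// row is a hypothetical struct used only for this illustration.
type row struct {
	ID int `sq:"td:nth-child(1) | text | regexp(\\d+)"`
}

// tagValue returns the decoded value of the sq tag on row.ID.
// StructTag.Get unquotes the value, so the doubled backslash
// written in the source collapses to a single one.
func tagValue() string {
	f, _ := reflect.TypeOf(row{}).FieldByName("ID")
	return f.Tag.Get("sq")
}

func main() {
	fmt.Println(tagValue())
}
```

Without the doubled backslash, `\d` would be an invalid escape sequence inside the quoted tag value, and the tag lookup would yield an empty string.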

Accessors, Parsers, and Loaders

Accessors, parsers, and loaders are specified in the tag as a Unix-style pipeline: a CSS selector comes first, followed by the functions to apply, each separated by |.
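To make the pipeline shape concrete, here is a minimal sketch of how such a tag could be split into a selector and its stages. This is an assumption about the general approach, not sq's actual parser: the `step` type and `splitPipeline` function are hypothetical, and the sketch assumes that `|` never appears inside a selector or an argument.

```go
package main

import (
	"fmt"
	"strings"
)

// step is a hypothetical representation of one pipeline stage.
type step struct {
	Name string // e.g. "attr"
	Arg  string // e.g. "href"; empty when the stage takes no argument
}

// splitPipeline splits a tag into its selector and its stages.
// Assumption: "|" never appears inside a selector or an argument.
func splitPipeline(tag string) (selector string, steps []step) {
	parts := strings.Split(tag, "|")
	selector = strings.TrimSpace(parts[0])
	for _, p := range parts[1:] {
		p = strings.TrimSpace(p)
		name, arg := p, ""
		// A stage written as name(arg) carries a single string argument.
		if i := strings.Index(p, "("); i >= 0 && strings.HasSuffix(p, ")") {
			name, arg = p[:i], p[i+1:len(p)-1]
		}
		steps = append(steps, step{Name: name, Arg: arg})
	}
	return selector, steps
}

func main() {
	sel, steps := splitPipeline(`td:nth-child(3) a | attr(href) | regexp(mailto:(.+))`)
	fmt.Println("selector:", sel)
	for _, s := range steps {
		fmt.Printf("stage: %s arg: %q\n", s.Name, s.Arg)
	}
}
```

A parser built this way would mis-split a regular expression that uses `|` for alternation, which is why the assumption is stated explicitly.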

Accessors

  • text: The text accessor emits the result of goquery's Text() method on the matched Selection.
  • html: The html accessor emits the result of goquery's Html() method on the matched Selection.
  • attr(<attr>): The attr() accessor emits the result of goquery's Attr() method with the supplied argument on the matched Selection. An error will be returned if the specified attribute is not found.

Parsers

  • regexp(<regexp>): The regexp parser takes a regular expression and applies it to the input emitted by the previous accessor or parser function. When no subcapture group is specified, the first match is emitted. If a subcapture group is specified, the first subcapture is returned.
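The match rule described above can be sketched with the standard `regexp` package. This is an illustration of the documented behaviour, not sq's source: `firstMatch` and `errNoMatch` are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// errNoMatch is a local stand-in for this sketch; it is not sq's error value.
var errNoMatch = errors.New("regexp did not match the content")

// firstMatch returns the first subcapture when the pattern defines one,
// and the whole first match otherwise.
func firstMatch(s, pattern string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", err
	}
	m := re.FindStringSubmatch(s)
	if m == nil {
		return "", errNoMatch
	}
	if len(m) > 1 {
		return m[1], nil // first subcapture group
	}
	return m[0], nil // whole match
}

func main() {
	id, _ := firstMatch("ID: 42 (of 99)", `\d+`)
	email, _ := firstMatch("mailto:jane@example.com", `mailto:(.+)`)
	fmt.Println(id, email)
}
```

The second call shows why the Email field in the example struct pairs attr(href) with regexp(mailto:(.+)): the capture group strips the scheme and keeps only the address.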

Loaders

  • time(<format>): The time() loader calls time.Parse() with the supplied format on the input emitted from the previous accessor or parser function.
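The format argument is a Go reference-time layout, in which 2006 stands for the year, 01 for the month, 02 for the day, and 03 or 15 for the hour. The sketch below uses only the standard library; `ymd` is a hypothetical helper and the input string is made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// ymd parses s with the given reference-time layout and returns
// the year, month, and day of the result.
func ymd(layout, s string) (year, month, day int, err error) {
	t, err := time.Parse(layout, s)
	if err != nil {
		return 0, 0, 0, err
	}
	return t.Year(), int(t.Month()), t.Day(), nil
}

func main() {
	// "2006 01 02" reads as year, month, day.
	y, m, d, err := ymd("2006 01 02", "2019 01 11")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(y, m, d)
}
```

Because each element of the layout is identified by its value rather than its position, writing 02 where a month is meant makes time.Parse read that field as the day.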

Custom parsers and loaders may be added or overridden:

// unescapes HTML entities (html here is the standard library "html" package)
sq.RegisterParseFunc("unescape", func(s, _ string) (string, error) {
	return html.UnescapeString(s), nil
})

// loads a time.Duration from a datestamp
sq.RegisterLoadFunc("age", func(_ *goquery.Selection, s, layout string) (interface{}, error) {
	t, err := time.Parse(layout, s)
	if err != nil {
		return nil, err
	}
	return time.Since(t), nil
})

// example use
type Page struct {
	Alerts []struct {
		Title string        `sq:"h3 | text"`
		Age   time.Duration `sq:"span.posted | unescape | age(2006 01 02 15:04:05 MST)"`
	} `sq:"div.alert"`
}

Types

sq supports all native Go types except map, func, chan, and the complex types.

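Several web-related data structures are also detected and loaded. The example at the top of this README uses *url.URL, *css.Stylesheet, *ast.Program, *html.Node, and *goquery.Selection.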

Each of these types is detected and loaded automatically using a TypeLoader. Overriding or adding type loaders is simple.

A TypeLoader is a named pair of functions: one that checks whether a type matches, and one that does the loading.

// This is the typeloader for detecting url.URLs and loading them.
sq.RegisterTypeLoader("url",
	func(t reflect.Type) bool {
		return t.PkgPath() == "net/url" && t.Name() == "URL"
	},
	func(_ *goquery.Selection, s string) (interface{}, error) {
		return url.Parse(s)
	},
)
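How a matcher and a loader might cooperate can be sketched without sq or goquery. Everything below is hypothetical: the `typeLoader` struct, the `loaders` slice, the `link` type, and `loadField` are stand-ins, since sq's TypeLoader fields are unexported. The sketch also assumes that pointer field types are dereferenced before matching, because `PkgPath()` and `Name()` return empty strings for an unnamed pointer type such as `*url.URL`. Whether sq does the same is not confirmed by this page.

```go
package main

import (
	"fmt"
	"net/url"
	"reflect"
)

// typeLoader is a hypothetical stand-in for sq's TypeLoader.
type typeLoader struct {
	name   string
	isType func(t reflect.Type) bool
	load   func(s string) (interface{}, error)
}

// loaders holds a single entry that mirrors the url example above.
var loaders = []typeLoader{
	{
		name: "url",
		isType: func(t reflect.Type) bool {
			return t.PkgPath() == "net/url" && t.Name() == "URL"
		},
		load: func(s string) (interface{}, error) { return url.Parse(s) },
	},
}

// link is a made-up target struct for the illustration.
type link struct {
	Website *url.URL
	Name    string
}

// loadField looks up a loader for the named field of link and, if one
// matches, loads s with it and returns the value formatted as text.
func loadField(field, s string) (string, bool) {
	f, ok := reflect.TypeOf(link{}).FieldByName(field)
	if !ok {
		return "", false
	}
	t := f.Type
	// Assumption: strip pointers so that *url.URL matches as url.URL.
	for t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	for _, l := range loaders {
		if l.isType(t) {
			v, err := l.load(s)
			if err != nil {
				return "", false
			}
			return fmt.Sprint(v), true
		}
	}
	return "", false
}

func main() {
	fmt.Println(loadField("Website", "https://example.com/a"))
	fmt.Println(loadField("Name", "plain text"))
}
```

Fields with no matching loader, such as Name here, would fall through to whatever handling exists for native types.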
Docs

godoc

License

MIT 2016

Documentation

Variables

var (
	// reflection errors
	ErrInvalidKind       = errors.New("invalid kind")
	ErrNotSettable       = errors.New("v is not settable")
	ErrNonStructPtrValue = errors.New("*struct type required")
	ErrTagNotFound       = errors.New("sq tag not found")

	// not found errors
	ErrNodeNotFound      = errors.New("node not found")
	ErrAttributeNotFound = errors.New("attribute not found")
)
var (
	ErrNoRegexpMatch = errors.New("regexp did not match the content")
)

Functions

func RegisterLoadFunc

func RegisterLoadFunc(name string, f LoadFunc)

func RegisterParseFunc

func RegisterParseFunc(name string, f ParseFunc)

func RegisterTypeLoader

func RegisterTypeLoader(name string, isType func(t reflect.Type) bool, load func(sel *goquery.Selection, text string) (interface{}, error))

func Scrape

func Scrape(structPtr interface{}, r io.Reader) []error

Types

type LoadFunc

type LoadFunc func(sel *goquery.Selection, s, arg string) (interface{}, error)

type ParseFunc

type ParseFunc func(s, arg string) (string, error)

type TypeLoader

type TypeLoader struct {
	// contains filtered or unexported fields
}
