htmlquery

Version: v0.0.0-...-1fbec5a | Published: Mar 3, 2023 | License: MIT | Imports: 14 | Imported by: 2

Repository

github.com/hktalent/htmlquery

README ¶

htmlquery

Features

InsecureSkipVerify: true
LoadURLWithPost(url string, szPost string, isPostJson bool)
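
The `InsecureSkipVerify: true` item presumably refers to the `crypto/tls` configuration field of the same name, which turns off TLS certificate verification. The sketch below is not taken from this package's source. It uses only the standard library, and the names `newInsecureClient` and `compareClients` are hypothetical. It starts a local HTTPS test server with a self-signed certificate and compares a default client with one that sets `InsecureSkipVerify: true`.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// newInsecureClient returns an HTTP client that does not verify TLS
// certificates (hypothetical helper, not part of htmlquery).
func newInsecureClient() *http.Client {
	return &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
}

// compareClients requests a local self-signed HTTPS server twice: once with
// a default client and once with the insecure client.
func compareClients() (strictFailed bool, insecureStatus int, err error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body>ok</body></html>")
	}))
	defer srv.Close()

	// The default client verifies the certificate chain and should fail here.
	strict := &http.Client{Timeout: 5 * time.Second}
	if resp, strictErr := strict.Get(srv.URL); strictErr != nil {
		strictFailed = true
	} else {
		resp.Body.Close()
	}

	// The insecure client skips verification and completes the request.
	resp, err := newInsecureClient().Get(srv.URL)
	if err != nil {
		return strictFailed, 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return strictFailed, resp.StatusCode, nil
}

func main() {
	strictFailed, status, err := compareClients()
	if err != nil {
		panic(err)
	}
	fmt.Println("default client rejected certificate:", strictFailed)
	fmt.Println("insecure client status:", status)
}
```

Skipping verification removes protection against man-in-the-middle attacks, so it is best limited to testing or to hosts you control.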

Overview

htmlquery is an XPath query package for HTML. It lets you extract data from HTML documents, or evaluate expressions against them, using XPath expressions.

htmlquery has a built-in LRU cache for query objects, which keeps the most recently used XPath query strings in compiled form. With query caching enabled, an XPath expression is not recompiled on every query.

For the supported XPath (1.0/2.0) syntax, see https://github.com/antchfx/xpath

XPath query packages for Go

Name	Description
htmlquery	XPath query package for the HTML document
xmlquery	XPath query package for the XML document
jsonquery	XPath query package for the JSON document

Installation

go get github.com/hktalent/htmlquery

Getting Started

Query and return the matched elements, or an error.

nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}

Load HTML document from URL.

doc, err := htmlquery.LoadURL("http://example.com/")

Load HTML document from a file.

filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)

Load HTML document from string.

s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))

Find all A elements.

list := htmlquery.Find(doc, "//a")

Find all A elements that have `href` attribute.

list := htmlquery.Find(doc, "//a[@href]")

Find all A elements with `href` attribute and only return `href` value.

list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.SelectAttr(n, "href")) // output @href value
}

Find the third A element.

a := htmlquery.FindOne(doc, "//a[3]")

Find the child element (img) under an A element and print its `src`.

a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Println(htmlquery.SelectAttr(img, "src")) // output @src value

Count all IMG elements.

expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)

Quick Start

package main

import (
	"fmt"

	"github.com/hktalent/htmlquery"
)

func main() {
	// Load and parse the search results page.
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		// Print the text and target of the matching link in each item.
		a := htmlquery.FindOne(n, "//a")
		if a != nil {
			fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
		}
	}
}

FAQ

`Find()` vs `QueryAll()`, which is better?

Find and QueryAll do the same thing: both return all matching HTML nodes. Find panics if you give it an invalid XPath query, whereas QueryAll returns an error.

Can I save my query expression object for the next query?

Yes, you can. The QuerySelector and QuerySelectorAll functions accept your compiled query expression object.

Caching or reusing a query expression object avoids recompiling the XPath expression and improves query performance.

XPath query object cache performance

goos: windows
goarch: amd64
pkg: github.com/hktalent/htmlquery
BenchmarkSelectorCache-4                20000000                55.2 ns/op
BenchmarkDisableSelectorCache-4           500000              3162 ns/op

How to disable caching?

htmlquery.DisableSelectorCache = true

Questions

Please let me know if you have any questions.

Documentation ¶

Overview ¶

Package htmlquery provides functions to extract data from HTML documents using XPath expressions.

Index ¶

Variables
func ExistsAttr(n *html.Node, name string) bool
func Find(top *html.Node, expr string) []*html.Node
func FindOne(top *html.Node, expr string) *html.Node
func InnerText(n *html.Node) string
func LoadDoc(path string) (*html.Node, error)
func LoadURL(url string) (*html.Node, error)
func LoadURLWithPost(url string, szPost string, isPostJson bool) (*html.Node, error)
func OutputHTML(n *html.Node, self bool) string
func Parse(r io.Reader) (*html.Node, error)
func Query(top *html.Node, expr string) (*html.Node, error)
func QueryAll(top *html.Node, expr string) ([]*html.Node, error)
func QuerySelector(top *html.Node, selector *xpath.Expr) *html.Node
func QuerySelectorAll(top *html.Node, selector *xpath.Expr) []*html.Node
func SelectAttr(n *html.Node, name string) (val string)
type NodeNavigator
- func CreateXPathNavigator(top *html.Node) *NodeNavigator

Constants ¶

This section is empty.

Variables ¶

var DisableSelectorCache = false

DisableSelectorCache disables caching for query selectors when set to true.

var SelectorCacheMaxEntries = 50

SelectorCacheMaxEntries sets the maximum number of selector objects that can be cached. The default is 50. Caching is disabled if SelectorCacheMaxEntries <= 0.
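
To make the effect of a bounded LRU cache concrete, here is a minimal sketch using only the standard library. It is not this package's implementation: plain strings stand in for compiled selectors, the names `lruCache`, `newLRUCache`, `get`, and `add` are hypothetical, and the type is not safe for concurrent use. It mirrors the documented rule that a maximum of 0 or less disables caching.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry pairs a query string with its cached value.
type entry struct {
	key   string
	value string // stands in for a compiled selector
}

// lruCache is a minimal LRU cache keyed by query string.
type lruCache struct {
	maxEntries int
	order      *list.List // front = most recently used
	items      map[string]*list.Element
}

func newLRUCache(maxEntries int) *lruCache {
	return &lruCache{
		maxEntries: maxEntries,
		order:      list.New(),
		items:      make(map[string]*list.Element),
	}
}

// get returns the cached value and marks it as most recently used.
func (c *lruCache) get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// add stores a value and evicts the least recently used entry when full.
func (c *lruCache) add(key, value string) {
	if c.maxEntries <= 0 {
		return // caching disabled
	}
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newLRUCache(2)
	c.add("//a", "compiled(//a)")
	c.add("//img", "compiled(//img)")
	c.get("//a")                  // touch //a so that //img becomes the oldest entry
	c.add("//p", "compiled(//p)") // cache is full, so //img is evicted
	_, ok := c.get("//img")
	fmt.Println("//img still cached:", ok)
}
```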

Functions ¶

func ExistsAttr ¶

func ExistsAttr(n *html.Node, name string) bool

ExistsAttr reports whether an attribute with the specified name exists.

func Find ¶

func Find(top *html.Node, expr string) []*html.Node

Find is like QueryAll but panics if the expression `expr` cannot be parsed.

See the `QueryAll()` function.

func FindOne ¶

func FindOne(top *html.Node, expr string) *html.Node

FindOne is like Query but panics if the expression `expr` cannot be parsed. See the `Query()` function.

func InnerText ¶

func InnerText(n *html.Node) string

InnerText returns the text between the start and end tags of the object.

func LoadDoc ¶

func LoadDoc(path string) (*html.Node, error)

LoadDoc loads the HTML document from the specified file path.

func LoadURL ¶

func LoadURL(url string) (*html.Node, error)

LoadURL loads the HTML document from the specified URL. Gzip is enabled by default on the HTTP request.

func LoadURLWithPost ¶

func LoadURLWithPost(url string, szPost string, isPostJson bool) (*html.Node, error)
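
This function has no documentation comment, so its exact behavior is not stated here. Judging only by the parameter names, `szPost` looks like the request body and `isPostJson` looks like a switch between a JSON and a form-encoded payload. The standard-library sketch below illustrates that reading. It is an assumption, not the package's code, and `newPostRequest` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newPostRequest builds a POST request for url with body as the payload.
// Hypothetical helper: it only mirrors the parameter list of LoadURLWithPost.
func newPostRequest(url, body string, isJSON bool) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	// Assumption: the flag selects between a JSON and a form-encoded body.
	contentType := "application/x-www-form-urlencoded"
	if isJSON {
		contentType = "application/json"
	}
	req.Header.Set("Content-Type", contentType)
	return req, nil
}

func main() {
	form, err := newPostRequest("http://example.com/search", "q=golang", false)
	if err != nil {
		panic(err)
	}
	jsonReq, err := newPostRequest("http://example.com/search", `{"q":"golang"}`, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(form.Method, form.Header.Get("Content-Type"))
	fmt.Println(jsonReq.Method, jsonReq.Header.Get("Content-Type"))
}
```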

func OutputHTML ¶

func OutputHTML(n *html.Node, self bool) string

OutputHTML returns the node's content as text, including tag names.

func Parse ¶

func Parse(r io.Reader) (*html.Node, error)

Parse returns the parse tree for the HTML from the given Reader.

func Query ¶

func Query(top *html.Node, expr string) (*html.Node, error)

Query runs the given XPath expression against the given html.Node and returns the first matching html.Node, or nil if no matches are found.

Returns an error if the expression `expr` cannot be parsed.

func QueryAll ¶

func QueryAll(top *html.Node, expr string) ([]*html.Node, error)

QueryAll searches for all html.Node values that match the specified XPath expression. Returns an error if the expression `expr` cannot be parsed.

func QuerySelector ¶

func QuerySelector(top *html.Node, selector *xpath.Expr) *html.Node

QuerySelector returns the first matched html.Node by the specified XPath selector.

func QuerySelectorAll ¶

func QuerySelectorAll(top *html.Node, selector *xpath.Expr) []*html.Node

QuerySelectorAll searches for all html.Node values that match the specified XPath selector.

func SelectAttr ¶

func SelectAttr(n *html.Node, name string) (val string)

SelectAttr returns the attribute value with the specified name.

Types ¶

type NodeNavigator ¶

type NodeNavigator struct {
	// contains filtered or unexported fields
}

func CreateXPathNavigator ¶

func CreateXPathNavigator(top *html.Node) *NodeNavigator

CreateXPathNavigator creates a new xpath.NodeNavigator for the specified html.Node.

func (*NodeNavigator) Copy ¶

func (h *NodeNavigator) Copy() xpath.NodeNavigator

func (*NodeNavigator) Current ¶

func (h *NodeNavigator) Current() *html.Node

func (*NodeNavigator) LocalName ¶

func (h *NodeNavigator) LocalName() string

func (*NodeNavigator) MoveTo ¶

func (h *NodeNavigator) MoveTo(other xpath.NodeNavigator) bool

func (*NodeNavigator) MoveToChild ¶

func (h *NodeNavigator) MoveToChild() bool

func (*NodeNavigator) MoveToFirst ¶

func (h *NodeNavigator) MoveToFirst() bool

func (*NodeNavigator) MoveToNext ¶

func (h *NodeNavigator) MoveToNext() bool

func (*NodeNavigator) MoveToNextAttribute ¶

func (h *NodeNavigator) MoveToNextAttribute() bool

func (*NodeNavigator) MoveToParent ¶

func (h *NodeNavigator) MoveToParent() bool

func (*NodeNavigator) MoveToPrevious ¶

func (h *NodeNavigator) MoveToPrevious() bool

func (*NodeNavigator) MoveToRoot ¶

func (h *NodeNavigator) MoveToRoot()

func (*NodeNavigator) NodeType ¶

func (h *NodeNavigator) NodeType() xpath.NodeType

func (*NodeNavigator) Prefix ¶

func (*NodeNavigator) Prefix() string

func (*NodeNavigator) String ¶

func (h *NodeNavigator) String() string

func (*NodeNavigator) Value ¶

func (h *NodeNavigator) Value() string
