Documentation ¶
Overview ¶
Package htmlutil implements a wrapper for Go's HTML5 tokeniser / parser implementation (`golang.org/x/net/html`), making it much easier to find and extract information. It aims to be powerful and intuitive while remaining a minimal and logical extension.
There are three core components: the `htmlutil.Node` struct (a wrapper for `*html.Node`), the `htmlutil.Parse` function (optional), and a ubiquitous filter algorithm used throughout this implementation. The filter algorithm provides functionality similar to CSS selectors, and is powered by optional (varargs) parameters in the form of chained closures with a signature of `func(htmlutil.Node) bool`.
Filter behavior ¶
- based on a recursive algorithm where each node can match at most one filter, consuming it (for that sub-tree), and is added to the result if `len(filters) == 0`
- in general, every node in the tree is searched (the exception is a "find" mode, where only one result is returned)
- nil filters are preemptively stripped, and so are treated like they were omitted
- each node will be present in the result at most once, and will retain (depth first) order
- behavior is undefined if the tree is not "well formed" (e.g. any cycles)
- providing no filters will return ALL nodes (or if only one result is needed, the first node)
- filter closures will not be called with a node with a nil `Data` field
- filter closures will receive nodes with a `Depth` field relative to the original
- the node's `Match` field stores the last "matched" node in the chain (note: duplicate matches for the same `*html.Node` are squashed); the root node is always treated as an initial match
- resulting node values will retain the match chain (will always be non-nil if the root was non-nil)
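The bullets above can be read as the following minimal sketch. It is not the package's implementation: `node`, `filter`, `search`, `filterNodes` and `tag` are hypothetical stand-ins that avoid any dependency on `htmlutil` or `html`, and the `Depth` / `Match` bookkeeping is left out. It also assumes that descendants of a final match are not added automatically, so they must satisfy the last filter themselves.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a parsed HTML element.
type node struct {
	tag      string
	id       string
	children []*node
}

// filter is a hypothetical analogue of func(htmlutil.Node) bool.
type filter func(*node) bool

// stripNil drops nil filters so they behave as if they were omitted.
func stripNil(filters []filter) []filter {
	var out []filter
	for _, f := range filters {
		if f != nil {
			out = append(out, f)
		}
	}
	return out
}

// search walks the tree depth first. A node may consume at most the first
// pending filter; the remaining filters then apply to its sub-tree only.
func search(n *node, filters []filter, out *[]*node) {
	if n == nil {
		return
	}
	switch {
	case len(filters) == 0:
		// No filters at all: every node is a result.
		*out = append(*out, n)
	case filters[0](n):
		if len(filters) == 1 {
			// Last filter matched: n is a result (assumption: the last
			// filter stays pending for the descendants of n).
			*out = append(*out, n)
		} else {
			filters = filters[1:]
		}
	}
	for _, c := range n.children {
		search(c, filters, out)
	}
}

// filterNodes returns every match in depth-first order.
func filterNodes(root *node, filters ...filter) []*node {
	var out []*node
	search(root, stripNil(filters), &out)
	return out
}

// tag builds a filter that matches on the element name.
func tag(name string) filter {
	return func(n *node) bool { return n.tag == name }
}

func sampleTree() *node {
	return &node{tag: "body", id: "body", children: []*node{
		{tag: "div", id: "d1", children: []*node{
			{tag: "span", id: "s1"},
			{tag: "div", id: "d2", children: []*node{
				{tag: "span", id: "s2"},
			}},
		}},
		{tag: "span", id: "s3"},
	}}
}

func ids(nodes []*node) []string {
	out := make([]string, 0, len(nodes))
	for _, n := range nodes {
		out = append(out, n.id)
	}
	return out
}

func main() {
	root := sampleTree()
	fmt.Println(ids(filterNodes(root, tag("div"), tag("span")))) // [s1 s2]
	fmt.Println(ids(filterNodes(root, tag("div"), tag("div"))))  // [d2]
	fmt.Println(ids(filterNodes(root, nil, tag("span"))))        // [s1 s2 s3]
	fmt.Println(ids(filterNodes(root)))                          // [body d1 s1 d2 s2 s3]
}
```

In this sketch the chain `tag("div"), tag("span")` behaves roughly like the CSS descendant selector `div span`, and `tag("div"), tag("div")` only returns `d2` because `d1` consumes the first filter and cannot also match the second.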
General behavior ¶
- a nil `Data` field for a `htmlutil.Node` indicates no node / no result, and methods should return default values, or another intuitive analog (this behavior makes chaining far simpler)
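To illustrate why treating a nil `Data` field as "no node" simplifies chaining, here is a minimal sketch of the pattern. The `rawNode` and `safeNode` types are hypothetical stand-ins for `*html.Node` and `htmlutil.Node`; only the nil-handling idea is taken from the description above.

```go
package main

import "fmt"

// rawNode is a hypothetical stand-in for *html.Node.
type rawNode struct {
	tag        string
	firstChild *rawNode
}

// safeNode is a hypothetical wrapper whose zero value means "no node".
type safeNode struct {
	Data *rawNode
}

// Tag returns the element name, or the default value for an empty node.
func (n safeNode) Tag() string {
	if n.Data == nil {
		return ""
	}
	return n.Data.tag
}

// FirstChild returns the first child, or an empty node if there is none.
func (n safeNode) FirstChild() safeNode {
	if n.Data == nil || n.Data.firstChild == nil {
		return safeNode{}
	}
	return safeNode{Data: n.Data.firstChild}
}

func main() {
	root := safeNode{Data: &rawNode{tag: "ul", firstChild: &rawNode{tag: "li"}}}

	fmt.Printf("%q\n", root.FirstChild().Tag()) // "li"

	// The chain runs past the end of the tree without any nil checks.
	fmt.Printf("%q\n", root.FirstChild().FirstChild().FirstChild().Tag()) // ""
}
```

Because every method accepts an empty receiver, the caller only needs to check the final result of a chain rather than each intermediate step.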
Index ¶
- type Node
- func (n Node) Attr() []html.Attribute
- func (n Node) Children(filters ...func(node Node) bool) (children []Node)
- func (n Node) Classes() []string
- func (n Node) FilterNodes(filters ...func(node Node) bool) []Node
- func (n Node) FindNode(filters ...func(node Node) bool) (Node, bool)
- func (n Node) FirstChild(filters ...func(node Node) bool) Node
- func (n Node) GetAttr(namespace string, key string) (html.Attribute, bool)
- func (n Node) GetAttrVal(namespace string, key string) string
- func (n Node) GetNode(filters ...func(node Node) bool) Node
- func (n Node) HasClass(class string) bool
- func (n Node) InnerHTML(filters ...func(node Node) bool) string
- func (n Node) InnerText(filters ...func(node Node) bool) string
- func (n Node) InnerWords(filters ...func(node Node) bool) string
- func (n Node) LastChild(filters ...func(node Node) bool) Node
- func (n Node) NextSibling(filters ...func(node Node) bool) Node
- func (n Node) Offset() int
- func (n Node) OuterHTML() string
- func (n Node) OuterText() string
- func (n Node) OuterWords() string
- func (n Node) Parent(filters ...func(node Node) bool) Node
- func (n Node) PrevSibling(filters ...func(node Node) bool) Node
- func (n Node) Range(fn func(i int, node Node) bool, filters ...func(node Node) bool)
- func (n Node) SiblingIndex(filters ...func(node Node) bool) int
- func (n Node) SiblingLength(filters ...func(node Node) bool) int
- func (n Node) String() string
- func (n Node) Tag() string
- func (n Node) Type() html.NodeType
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Node ¶
type Node struct {
	// Data is the underlying html data for this node
	Data *html.Node
	// Depth is the relative depth to the top of the tree (being parsed, filtered, etc)
	Depth int
	// Match is the last match (set by filter impl.), and is used to check previous matches for chained filters
	Match *Node
}
Node is the data structure this package provides to allow use of its utility methods, plus extra metadata such as the last match (`Match` property) for filter / find / get calls, as well as the overall (relative) depth. This allows matching on things such as "all the table row elements that are direct children of a given tbody", à la CSS selectors.
func Parse ¶
Parse first performs html.Parse, parsing through any errors, before applying a find to the resulting Node (wrapped like `Node{Data: node}`), returning the first matching Node, or an error if no matches were found
func (Node) Children ¶
Children builds a slice containing all child nodes using the `Range` method, passing through filters
func (Node) Classes ¶ added in v1.1.0
Classes will return all the (whitespace-separated) values for the (first) `class` attribute, or an empty slice if n is not a valid element node with a class attribute with at least one non-whitespace character
func (Node) FilterNodes ¶
FilterNodes returns all nodes from the sub-tree (a search including the receiver) matching the filters (see package comment for filter behavior)
func (Node) FindNode ¶
FindNode returns the first node from the sub-tree (a search including the receiver) matching the filters (see package comment for filter behavior)
func (Node) FirstChild ¶
FirstChild will return the leftmost child node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match, note that depth will be automatically incremented
func (Node) GetAttr ¶
GetAttr matches on the first attribute (if any) for this node with the same namespace and key (key being case insensitive if namespace is empty), returning false if no match was found
func (Node) GetAttrVal ¶
GetAttrVal returns the value of any attribute matched by `n.GetAttr`
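The matching rule described for `GetAttr` and `GetAttrVal` can be sketched as follows. This is an illustration only: `attribute`, `getAttr` and `getAttrVal` are hypothetical names, with `attribute` mirroring the `Namespace`, `Key` and `Val` fields of `html.Attribute`, and the sketch assumes an exact key comparison whenever the namespace is non-empty.

```go
package main

import (
	"fmt"
	"strings"
)

// attribute is a hypothetical stand-in mirroring html.Attribute.
type attribute struct {
	Namespace, Key, Val string
}

// getAttr returns the first attribute with the same namespace and key.
// The key is compared case-insensitively only when namespace is empty.
func getAttr(attrs []attribute, namespace, key string) (attribute, bool) {
	for _, a := range attrs {
		if a.Namespace != namespace {
			continue
		}
		if a.Key == key || (namespace == "" && strings.EqualFold(a.Key, key)) {
			return a, true
		}
	}
	return attribute{}, false
}

// getAttrVal returns the matched value, or "" if nothing matched.
func getAttrVal(attrs []attribute, namespace, key string) string {
	a, _ := getAttr(attrs, namespace, key)
	return a.Val
}

func main() {
	attrs := []attribute{
		{Key: "href", Val: "/home"},
		{Namespace: "xlink", Key: "href", Val: "#icon"},
	}
	fmt.Println(getAttrVal(attrs, "", "HREF"))      // /home
	fmt.Println(getAttrVal(attrs, "xlink", "href")) // #icon
	fmt.Printf("%q\n", getAttrVal(attrs, "xlink", "HREF"))
}
```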
func (Node) GetNode ¶
GetNode returns the node returned by FindNode without the boolean flag indicating if there was a match, it is provided for chaining purposes, since this package deliberately handles a nil `Data` field gracefully
func (Node) HasClass ¶ added in v1.1.0
HasClass will return true if n is a valid element node with the given html class (case sensitive)
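The class handling described for `Classes` and `HasClass` amounts to splitting the `class` attribute value on whitespace and comparing the parts exactly. A minimal sketch, assuming the attribute value has already been looked up (the `classes` and `hasClass` helpers are hypothetical, not the package's methods):

```go
package main

import (
	"fmt"
	"strings"
)

// classes splits a class attribute value on any run of whitespace.
// It returns an empty slice for an empty or whitespace-only value.
func classes(classAttr string) []string {
	return strings.Fields(classAttr)
}

// hasClass reports whether class is present, using a case-sensitive match.
func hasClass(classAttr, class string) bool {
	for _, c := range classes(classAttr) {
		if c == class {
			return true
		}
	}
	return false
}

func main() {
	value := "  btn   btn-primary\tactive "
	fmt.Println(classes(value))            // [btn btn-primary active]
	fmt.Println(hasClass(value, "active")) // true
	fmt.Println(hasClass(value, "Active")) // false
	fmt.Println(len(classes(" \n ")))      // 0
}
```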
func (Node) InnerHTML ¶
InnerHTML builds a string using the outer html of all children matching all filters (see the `FindNode` method)
func (Node) InnerText ¶
InnerText builds a string using the outer text of all children matching all filters (see the `FindNode` method)
func (Node) InnerWords ¶ added in v1.2.0
InnerWords builds a string using the outer words of all children matching all filters (see the `FindNode` and `OuterWords` methods)
func (Node) LastChild ¶
LastChild will return the rightmost child node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match, note that depth will be automatically incremented
func (Node) NextSibling ¶
NextSibling will return the leftmost next sibling node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match
func (Node) Offset ¶
Offset is the difference between the depth of this node and the depth of the last match, returning the depth of this node if `n.Match` is nil
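The arithmetic behind `Offset` is small enough to sketch directly. The `matchNode` type below is a hypothetical stand-in that keeps only the `Depth` and `Match` fields of `htmlutil.Node`, and the depths are made-up values chosen for illustration.

```go
package main

import "fmt"

// matchNode is a hypothetical stand-in keeping only Depth and Match.
type matchNode struct {
	Tag   string
	Depth int
	Match *matchNode
}

// Offset is the depth of n relative to its last match, or the depth of n
// itself when there is no previous match.
func (n matchNode) Offset() int {
	if n.Match == nil {
		return n.Depth
	}
	return n.Depth - n.Match.Depth
}

func main() {
	tbody := matchNode{Tag: "tbody", Depth: 2}
	direct := matchNode{Tag: "tr", Depth: 3, Match: &tbody}
	nested := matchNode{Tag: "tr", Depth: 6, Match: &tbody}

	fmt.Println(tbody.Offset())  // 2
	fmt.Println(direct.Offset()) // 1
	fmt.Println(nested.Offset()) // 4
}
```

Under these assumptions, a filter that checks `Offset() == 1` keeps only nodes that are direct children of the previously matched node, which is one way to express "table rows that are direct children of a given tbody".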
func (Node) OuterHTML ¶
OuterHTML encodes this node as html using the `html.Render` function, note that it will return an empty string if `n.Data` is nil, and will panic if any error is returned (which should only occur if the sub-tree is not "well formed")
func (Node) OuterText ¶
OuterText builds a string from the data of all text nodes in the sub-tree, starting from and including `n`
func (Node) OuterWords ¶ added in v1.2.0
OuterWords builds a space-separated string from the whitespace-separated data of all text nodes in the sub-tree, starting from and including `n`, note that text separated / split across multiple elements will be considered as multiple words (words within non-empty sibling elements will be split by a single space)
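The difference between `OuterText` and `OuterWords` is easiest to see side by side. The sketch below uses a hypothetical `textTree` type in place of parsed HTML, where a non-empty `text` field marks a text node; it follows the descriptions above but is not the package's code.

```go
package main

import (
	"fmt"
	"strings"
)

// textTree is a hypothetical stand-in: text is set for text nodes only.
type textTree struct {
	text     string
	children []*textTree
}

// outerText concatenates the data of all text nodes, depth first.
func outerText(n *textTree) string {
	var b strings.Builder
	var walk func(*textTree)
	walk = func(n *textTree) {
		if n == nil {
			return
		}
		b.WriteString(n.text)
		for _, c := range n.children {
			walk(c)
		}
	}
	walk(n)
	return b.String()
}

// outerWords splits each text node on whitespace and joins all the words
// with a single space, so text in separate nodes never fuses together.
func outerWords(n *textTree) string {
	var words []string
	var walk func(*textTree)
	walk = func(n *textTree) {
		if n == nil {
			return
		}
		words = append(words, strings.Fields(n.text)...)
		for _, c := range n.children {
			walk(c)
		}
	}
	walk(n)
	return strings.Join(words, " ")
}

func main() {
	// Roughly: <p>Hello <b>big</b>world\n  again</p>
	p := &textTree{children: []*textTree{
		{text: "Hello "},
		{children: []*textTree{{text: "big"}}},
		{text: "world\n  again"},
	}}
	fmt.Printf("%q\n", outerText(p))  // "Hello bigworld\n  again"
	fmt.Printf("%q\n", outerWords(p)) // "Hello big world again"
}
```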
func (Node) Parent ¶
Parent will return the first parent node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match, note that depth will be automatically decremented (potentially multiple times)
func (Node) PrevSibling ¶
PrevSibling will return the rightmost previous sibling node matching any filters (see the `FindNode` method), or a node with a nil `Data` property for no match
func (Node) Range ¶
Range iterates on any children matching any filters (see the `FindNode` method), providing the (filtered) index and node to the provided fn, note that it will panic if fn is nil
func (Node) SiblingIndex ¶
SiblingIndex returns the total number of previous siblings matching any filters (see the `FindNode` method)
func (Node) SiblingLength ¶
SiblingLength returns the total number of siblings matching any filters (see the `FindNode` method) incremented by one for the current node, or returns 0 if the receiver has nil data (is empty)
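The counting rules for `SiblingIndex` and `SiblingLength` can be sketched with a doubly linked list of siblings. The `sib` type and helpers are hypothetical, a single `match` function stands in for the filter chain, and the sketch assumes a nil `match` accepts every sibling.

```go
package main

import "fmt"

// sib is a hypothetical stand-in for a node linked to its siblings.
type sib struct {
	tag        string
	prev, next *sib
}

// accepts treats a nil match function as "match everything" (assumption).
func accepts(match func(*sib) bool, n *sib) bool {
	return match == nil || match(n)
}

// siblingIndex counts the previous siblings accepted by match.
func siblingIndex(n *sib, match func(*sib) bool) int {
	if n == nil {
		return 0
	}
	count := 0
	for p := n.prev; p != nil; p = p.prev {
		if accepts(match, p) {
			count++
		}
	}
	return count
}

// siblingLength counts accepted siblings on both sides, plus one for n
// itself, or returns 0 for an empty receiver.
func siblingLength(n *sib, match func(*sib) bool) int {
	if n == nil {
		return 0
	}
	count := siblingIndex(n, match) + 1
	for s := n.next; s != nil; s = s.next {
		if accepts(match, s) {
			count++
		}
	}
	return count
}

// link builds a sibling list from tag names.
func link(tags ...string) []*sib {
	nodes := make([]*sib, len(tags))
	for i, t := range tags {
		nodes[i] = &sib{tag: t}
		if i > 0 {
			nodes[i].prev = nodes[i-1]
			nodes[i-1].next = nodes[i]
		}
	}
	return nodes
}

func main() {
	cells := link("th", "td", "td", "td")
	isTd := func(n *sib) bool { return n.tag == "td" }

	fmt.Println(siblingIndex(cells[2], isTd))  // 1
	fmt.Println(siblingLength(cells[2], isTd)) // 3
	fmt.Println(siblingIndex(cells[2], nil))   // 2
	fmt.Println(siblingLength(cells[2], nil))  // 4
}
```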