README
Exploring HTML structure
HTML is parsed using golang.org/x/net/html which produces a tree.
The module provides basic functionality to compare HTML tags or nodes and their trees.
Searching for an HTML tag using a *html.Node ignores pointers and always returns the first match.
By ignoring some properties, tags like <button> are easy to count.
The text value of a tag (title, error message, ...) can be checked.
Good to know
Parsing does not apply the complete syntax checking of HTML.
For instance, a tag like <p>, whose closing tag may be omitted, can fail a comparison.
Siblings must always have the same order or comparison fails. Order of attributes is treated as irrelevant.
How to start
Detailed documentation includes examples.
Versions
v1.0.6 updates the golang.org/x/net package to remove CVE-2022-27664, which does not affect x/net/html.
v1.0.5 requires Go 1.16+ as use of the ioutil package is removed.
v1.0.4 requires Go 1.17+, which implements lazy loading of modules to avoid go.mod updates.
v1.0.0 was created on Go 1.12, which supports modules.
Documentation
Overview
Package parsing provides basic search and comparison of HTML documents. To limit storage of references, it uses the net/html package and its Node type to structure HTML.
Searching for a tag in a Node, with options
- searching for a tag by its name, whatever its attributes; the node type is optional
- searching for a tag by its non-pointer values: type, name, attributes and namespace
- comparing tags including their list of attributes, where order is irrelevant
- comparing Node structures with an optional type
Three ways to print a node tree
- select the type of node and the node value at which to stop.
- select the type of nodes, or none.
- complete, with indentation.
Good to know
- a non-matching closed tag is one element.
- a non-closed tag is closed by the following opening tag. The elements that follow are discarded as the tag is closed by the parser.
Index
- func AttrIncluded(m, n *html.Node) bool
- func Equal(m, n *html.Node) bool
- func ExploreNode(n *html.Node, s string, t html.NodeType)
- func FindNode(m *html.Node, n html.Node) *html.Node
- func FindTag(n *html.Node, s string, t html.NodeType) *html.Node
- func FindTags(n *html.Node, s string, t html.NodeType) (a []*html.Node)
- func GetText(m *html.Node, b *bytes.Buffer)
- func IdenticalNodes(m, n *html.Node, t html.NodeType) *html.Node
- func IncludedNode(m, n *html.Node) *html.Node
- func IncludedNodeTyped(m, n *html.Node, t html.NodeType) *html.Node
- func IsTextNode(b io.ReadCloser, ns *html.Node, s string) error
- func IsTextTag(b io.ReadCloser, t, s string) error
- func ParseFile(f string) (*html.Node, error)
- func PrintData(n *html.Node) string
- func PrintNodes(m, n *html.Node, t html.NodeType, d int)
- func PrintTags(n *html.Node, s string, tagOnly bool)
Functions
func AttrIncluded
func AttrIncluded(m, n *html.Node) bool
AttrIncluded returns true if every attribute of n is also present in the reference node m, whatever their order.
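The order-independent attribute check described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the Attribute type is a stand-in that mirrors the Namespace, Key and Val fields of html.Attribute so the sketch runs with the standard library only, and attrIncluded is a hypothetical helper name.

```go
package main

import "fmt"

// Attribute is a stand-in mirroring the fields of html.Attribute
// (assumption: used here only to avoid a non-standard dependency).
type Attribute struct {
	Namespace, Key, Val string
}

// attrIncluded reports whether every attribute of sub also appears in ref,
// regardless of position. Hypothetical helper, not the package's code.
func attrIncluded(ref, sub []Attribute) bool {
	seen := make(map[Attribute]bool, len(ref))
	for _, a := range ref {
		seen[a] = true
	}
	for _, a := range sub {
		if !seen[a] {
			return false
		}
	}
	return true
}

func main() {
	ref := []Attribute{{Key: "class", Val: "ex1"}, {Key: "id", Val: "intro"}}
	reordered := []Attribute{{Key: "id", Val: "intro"}, {Key: "class", Val: "ex1"}}
	other := []Attribute{{Key: "class", Val: "ex2"}}

	fmt.Println(attrIncluded(ref, reordered)) // true: same attributes, different order
	fmt.Println(attrIncluded(ref, other))     // false: value differs
}
```

Using a set keyed on the whole attribute makes position irrelevant, which is the property the description relies on.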
func Equal
func Equal(m, n *html.Node) bool
Equal returns true if all fields of nodes m and n are equal, except pointers. reflect.DeepEqual(tag1, tag2) is unusable as pointers are checked too.
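To illustrate why reflect.DeepEqual is unsuitable here, the sketch below compares two nodes on their value fields only. The node and attribute types are simplified stand-ins for html.Node and html.Attribute, equalValues is a hypothetical name, and comparing attributes position by position is an assumption of this sketch rather than something the package documents.

```go
package main

import (
	"fmt"
	"reflect"
)

// attribute and node are simplified stand-ins for html.Attribute and
// html.Node (assumption: only the fields needed for the comparison).
type attribute struct{ Namespace, Key, Val string }

type node struct {
	Parent, FirstChild, NextSibling *node
	Type                            int
	Data                            string
	Namespace                       string
	Attr                            []attribute
}

// equalValues compares the non-pointer fields of m and n.
// Attributes are compared in order in this sketch.
func equalValues(m, n *node) bool {
	if m == nil || n == nil {
		return m == n
	}
	if m.Type != n.Type || m.Data != n.Data || m.Namespace != n.Namespace {
		return false
	}
	if len(m.Attr) != len(n.Attr) {
		return false
	}
	for i := range m.Attr {
		if m.Attr[i] != n.Attr[i] {
			return false
		}
	}
	return true
}

func main() {
	attached := &node{Data: "p", Parent: &node{Data: "body"}, Attr: []attribute{{Key: "class", Val: "ex1"}}}
	detached := &node{Data: "p", Attr: []attribute{{Key: "class", Val: "ex1"}}}

	fmt.Println(equalValues(attached, detached))       // true: value fields match
	fmt.Println(reflect.DeepEqual(attached, detached)) // false: Parent differs
}
```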
func ExploreNode
func ExploreNode(n *html.Node, s string, t html.NodeType)
ExploreNode prints node tags with name s and type t. Without a name, all tags are printed. When the type is ErrorNode (iota == 0), tags of all types are printed.
Example (All)
ExampleExploreNode_all prints the complete node tree.
Output:
(Document)
html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
HTML Fragment to compare against (Text)
em (Element)
others below (Text)
to test (Text)
sub (Element)
diffs (Text)
Example (Tags)
ExampleExploreNode_tags only prints text.
Output:
HTML Fragment to compare against (Text)
others below (Text)
to test (Text)
diffs (Text)
func FindTag
func FindTag(n *html.Node, s string, t html.NodeType) *html.Node
FindTag finds the first occurrence of a tag name, whatever its attributes. If ErrorNode is passed, tags of any type are searched.
func FindTags
func FindTags(n *html.Node, s string, t html.NodeType) (a []*html.Node)
FindTags finds all occurrences of a tag name, whatever their attributes. If ErrorNode is passed, tags of any type are searched.
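A name-based search that ignores attributes, with an optional type filter, can be sketched as a depth-first walk. The node type and the type constants below are stand-ins for html.Node and html.NodeType, and findAll and link are hypothetical helpers; the value 0 plays the role of ErrorNode, meaning "any type".

```go
package main

import "fmt"

// Stand-ins for html.NodeType values (assumption: 0 means "any type",
// as ErrorNode does in the functions documented above).
const (
	anyType = iota
	textType
	documentType
	elementType
)

// node is a simplified stand-in for html.Node.
type node struct {
	FirstChild, NextSibling *node
	Type                    int
	Data                    string
}

// link attaches children to parent in order and returns parent.
func link(parent *node, children ...*node) *node {
	var prev *node
	for _, c := range children {
		if prev == nil {
			parent.FirstChild = c
		} else {
			prev.NextSibling = c
		}
		prev = c
	}
	return parent
}

// findAll collects, in document order, every node named name whose type is t.
// Passing anyType disables the type filter. Attributes are never consulted.
func findAll(n *node, name string, t int) []*node {
	if n == nil {
		return nil
	}
	var found []*node
	if n.Data == name && (t == anyType || n.Type == t) {
		found = append(found, n)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		found = append(found, findAll(c, name, t)...)
	}
	return found
}

func main() {
	doc := link(&node{Type: elementType, Data: "body"},
		link(&node{Type: elementType, Data: "form"},
			&node{Type: elementType, Data: "button"},
			&node{Type: textType, Data: "button"}, // text that happens to read "button"
			&node{Type: elementType, Data: "button"}))

	fmt.Println(len(findAll(doc, "button", elementType))) // 2
	fmt.Println(len(findAll(doc, "button", anyType)))     // 3
}
```

Returning only the first element of such a walk gives the single-match behaviour described for FindTag.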
func GetText
func GetText(m *html.Node, b *bytes.Buffer)
GetText prints the text content of a tree structure, like PrintNodes, without any formatting. TODO: check usage of the (*Tokenizer).Text equivalent in the net/html package.
func IdenticalNodes
func IdenticalNodes(m, n *html.Node, t html.NodeType) *html.Node
IdenticalNodes fails if the trees have different sizes.
func IncludedNode
func IncludedNode(m, n *html.Node) *html.Node
IncludedNode checks if n is included in m. Included means that the subtree is identical to m, including the order of siblings. If it is identical, nil is returned. Otherwise, the tag from which the trees diverge is returned. If m has more tags than n, nil is returned as the search stops when the exploration of one subtree is exhausted.
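The return convention described above (nil when no difference is seen, otherwise the point of divergence) can be sketched with a parallel walk of both trees. This is a simplified illustration, not the package's algorithm: node is a stand-in for html.Node, only the tag name is compared, firstDivergence and link are hypothetical helpers, and returning the node from the second tree is an assumption of this sketch.

```go
package main

import "fmt"

// node is a simplified stand-in for html.Node (name and tree links only).
type node struct {
	FirstChild, NextSibling *node
	Data                    string
}

// link attaches children to parent in order and returns parent.
func link(parent *node, children ...*node) *node {
	var prev *node
	for _, c := range children {
		if prev == nil {
			parent.FirstChild = c
		} else {
			prev.NextSibling = c
		}
		prev = c
	}
	return parent
}

// firstDivergence walks m and n in parallel, children first and then
// siblings. It returns the node of n where the names stop matching, or nil
// if no difference is seen before either side is exhausted.
func firstDivergence(m, n *node) *node {
	if m == nil || n == nil {
		return nil // one side is exhausted: the search stops here
	}
	if m.Data != n.Data {
		return n
	}
	if d := firstDivergence(m.FirstChild, n.FirstChild); d != nil {
		return d
	}
	return firstDivergence(m.NextSibling, n.NextSibling)
}

func main() {
	ref := link(&node{Data: "p"}, &node{Data: "em"}, &node{Data: "sub"})
	same := link(&node{Data: "p"}, &node{Data: "em"}, &node{Data: "sub"})
	swapped := link(&node{Data: "p"}, &node{Data: "sub"}, &node{Data: "em"})
	shorter := link(&node{Data: "p"}, &node{Data: "em"})

	fmt.Println(firstDivergence(ref, same) == nil)    // true: identical trees
	fmt.Println(firstDivergence(ref, swapped).Data)   // sub: sibling order matters
	fmt.Println(firstDivergence(ref, shorter) == nil) // true: n is exhausted first
}
```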
Example
ExampleIncludeNode is using the test files to demonstrate usage.
Output:
func IncludedNodeTyped
func IncludedNodeTyped(m, n *html.Node, t html.NodeType) *html.Node
IncludedNodeTyped is like IncludedNode, but only tags of type t are compared.
func IsTextNode
func IsTextNode(b io.ReadCloser, ns *html.Node, s string) error
IsTextNode checks the presence of a node and its text value in a buffer. An error message is returned if the node is not found or if the text is not the expected one.
func IsTextTag
func IsTextTag(b io.ReadCloser, t, s string) error
IsTextTag checks the presence of a tag and its text value in a buffer. An error message is returned if the tag is not found or if the text is not the expected one.
func PrintData
func PrintData(n *html.Node) string
PrintData returns a string with the Node information (not its relationships). Passing nil will panic.
func PrintNodes
func PrintNodes(m, n *html.Node, t html.NodeType, d int)
PrintNodes prints the tree structure of node m until a node equal to n is found. If nil is passed, the complete node is printed. Values are indented based on the recursion depth d, which is usually 0 when called. html.ErrorNode (iota) displays every tag except the error node.
Example (WSearch)
ExamplePrintNodes_wSearch is the previous example stopping at a searched node.
Output: html (Element) . head (Element) body (Element) .. p (Element) [{ class ex1}] tag found: p (Element) [{ class ex1}] ... HTML Fragment to compare against (Text) em (Element) .... others below (Text) to test (Text) sub (Element) .... diffs (Text)
Example (WoSearch)
ExamplePrintNodes_woSearch prints all nodes without using search.
Output: html (Element) . head (Element) body (Element) .. p (Element) [{ class ex1}] ... HTML Fragment to compare against (Text) em (Element) .... others below (Text) to test (Text) sub (Element) .... diffs (Text)
func PrintTags
func PrintTags(n *html.Node, s string, tagOnly bool)
PrintTags prints the node structure until a tag name is found, whatever its attributes. Without a name, all tags are printed. tagOnly selects ElementNode; otherwise tags are printed whatever their type. If the node tree has no ErrorNode, there is no difference with the previous case, i.e. if exploreNode(n, "", html.ErrorNode) prints nothing, then both are equivalent.
Example (WSearch)
ExamplePrintTags_wSearch is the previous example stopping at a searched tag.
Output:
html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
em (Element)
[em] found. Stopping exploration
sub (Element)
Example (WoSearch)
ExamplePrintTags_woSearch is not using the search part.
Output:
(Document)
html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
HTML Fragment to compare against (Text)
em (Element)
others below (Text)
to test (Text)
sub (Element)
diffs (Text)