Documentation ¶
Overview ¶
Package parsing provides basic search and comparison of HTML documents. To avoid storing additional references, it relies on the golang.org/x/net/html package and its Node type to represent the HTML structure.
Searching for a tag in a Node, with options
- searching for a tag by its name, whatever its attributes; the node type is optional
- searching for a tag by its non-pointer values: type, name, attributes and namespace
- comparing tags, including their lists of attributes, where attribute order is irrelevant
- comparing Node structures, optionally restricted to one node type
Three ways to print a node tree
- select the type of nodes and the node value at which to stop.
- select a type of nodes, or none.
- print the complete tree with indentation.
Good to know
- a non-matching closed tag is one element.
- an unclosed tag is closed by the next opening tag. The elements that follow are discarded because the parser has already closed the tag.
Index ¶
- func AttrIncluded(m, n *html.Node) bool
- func Equal(m, n *html.Node) bool
- func ExploreNode(n *html.Node, s string, t html.NodeType)
- func FindNode(m *html.Node, n html.Node) *html.Node
- func FindTag(n *html.Node, s string, t html.NodeType) *html.Node
- func FindTags(n *html.Node, s string, t html.NodeType) (a []*html.Node)
- func GetText(m *html.Node, b *bytes.Buffer)
- func IdenticalNodes(m, n *html.Node, t html.NodeType) *html.Node
- func IncludedNode(m, n *html.Node) *html.Node
- func IncludedNodeTyped(m, n *html.Node, t html.NodeType) *html.Node
- func IsTextNode(b io.ReadCloser, ns *html.Node, s string) error
- func IsTextTag(b io.ReadCloser, t, s string) error
- func ParseFile(f string) (*html.Node, error)
- func PrintData(n *html.Node) string
- func PrintNodes(m, n *html.Node, t html.NodeType, d int)
- func PrintTags(n *html.Node, s string, tagOnly bool)
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AttrIncluded ¶
AttrIncluded returns true if every attribute of n is also present in the reference node m, whatever their order.
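The order-independent inclusion described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `attribute` is a stand-in that mirrors the Namespace, Key and Val fields of html.Attribute so the sketch runs with the standard library only, and `attrIncluded` is a hypothetical helper name.

```go
package main

import "fmt"

// attribute is a stand-in mirroring the string fields of html.Attribute
// (assumption: used here only to keep the sketch free of external imports).
type attribute struct {
	Namespace, Key, Val string
}

// attrIncluded reports whether every attribute of sub also appears in ref,
// regardless of position. Hypothetical helper, not the package's code.
func attrIncluded(ref, sub []attribute) bool {
	seen := make(map[attribute]bool, len(ref))
	for _, a := range ref {
		seen[a] = true
	}
	for _, a := range sub {
		if !seen[a] {
			return false
		}
	}
	return true
}

func main() {
	ref := []attribute{{Key: "class", Val: "ex1"}, {Key: "id", Val: "intro"}}
	sub := []attribute{{Key: "id", Val: "intro"}}

	fmt.Println(attrIncluded(ref, sub)) // true: position and extra attributes in ref do not matter
	fmt.Println(attrIncluded(sub, ref)) // false: class="ex1" is missing from sub
}
```

Using the attribute itself as a map key works because all three fields are strings, so the struct is comparable.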
func Equal ¶
Equal returns true if all fields of nodes m and n are equal, pointers excepted. reflect.DeepEqual(tag1, tag2) is unusable here because pointers are checked too.
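To see why reflect.DeepEqual does not fit, consider the sketch below. It assumes a simplified stand-in `node` type with a single relationship pointer (the real html.Node has several), and `equalValues` is a hypothetical helper that, for brevity, compares attributes position by position.

```go
package main

import (
	"fmt"
	"reflect"
)

// attribute and node are simplified stand-ins for html.Attribute and
// html.Node (assumption: only one relationship pointer is modelled).
type attribute struct {
	Namespace, Key, Val string
}

type node struct {
	Parent    *node // relationship pointer, ignored by equalValues
	Type      int
	Data      string
	Namespace string
	Attr      []attribute
}

const elementNode = 3 // stand-in for an element node type

// equalValues compares only the non-pointer fields of two nodes.
// Hypothetical helper, not the package's code.
func equalValues(m, n *node) bool {
	if m.Type != n.Type || m.Data != n.Data || m.Namespace != n.Namespace {
		return false
	}
	if len(m.Attr) != len(n.Attr) {
		return false
	}
	for i := range m.Attr {
		if m.Attr[i] != n.Attr[i] {
			return false
		}
	}
	return true
}

func main() {
	body := &node{Type: elementNode, Data: "body"}
	attached := &node{Parent: body, Type: elementNode, Data: "p",
		Attr: []attribute{{Key: "class", Val: "ex1"}}}
	detached := &node{Type: elementNode, Data: "p",
		Attr: []attribute{{Key: "class", Val: "ex1"}}}

	fmt.Println(reflect.DeepEqual(attached, detached)) // false: the Parent pointers differ
	fmt.Println(equalValues(attached, detached))       // true: same type, name and attributes
}
```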
func ExploreNode ¶
ExploreNode prints node tags with name s and type t. Without a name, all tags are printed. When the type is ErrorNode (iota == 0), tags of all types are printed.
Example (All) ¶
ExampleExploreNode_all prints the complete node tree.
package main

import (
	"bytes"
	"fmt"
	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.ExploreNode(o, "", html.ErrorNode)
}

Output:
(Document)
html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
HTML Fragment to compare against (Text)
em (Element)
others below (Text)
to test (Text)
sub (Element)
diffs (Text)
Example (Tags) ¶
ExampleExploreNode_tags only prints text.
package main

import (
	"bytes"
	"fmt"
	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
	"log"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, err := html.Parse(b)
	// Only place where err of Parse is checked
	if err != nil {
		log.Fatalf("parsing error:%v\n", err)
	}
	parsing.ExploreNode(o, "", html.TextNode)
}

Output:
HTML Fragment to compare against (Text)
others below (Text)
to test (Text)
diffs (Text)
func FindTag ¶
FindTag finds the first occurrence of a tag name (i.e. whatever its attributes). If ErrorNode is passed, any tag type will be searched.
func FindTags ¶
FindTags finds all occurrences of a tag name whatever their attributes. If ErrorNode is passed, any tag type will be searched.
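The search rule shared by FindTag and FindTags (match on the tag name, with ErrorNode acting as "any type") can be sketched as a depth-first walk. The `node` and `nodeType` types below are stand-ins for html.Node and html.NodeType, and `findTags` is a hypothetical helper, so this shows the idea rather than the package's code.

```go
package main

import "fmt"

// nodeType and node are stand-ins for html.NodeType and html.Node
// (assumption: only the fields needed for the walk are modelled).
type nodeType int

const (
	errorNode nodeType = iota // treated as "any type" by findTags
	textNode
	documentNode
	elementNode
)

type node struct {
	FirstChild, NextSibling *node
	Type                    nodeType
	Data                    string
}

// findTags collects, depth first, every node named s whose type is t.
// Passing errorNode as t matches nodes of any type.
func findTags(n *node, s string, t nodeType) []*node {
	if n == nil {
		return nil
	}
	var found []*node
	if n.Data == s && (t == errorNode || n.Type == t) {
		found = append(found, n)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		found = append(found, findTags(c, s, t)...)
	}
	return found
}

func main() {
	// <p><em>a</em><em>b</em></p>, linked by hand.
	em2 := &node{Type: elementNode, Data: "em", FirstChild: &node{Type: textNode, Data: "b"}}
	em1 := &node{Type: elementNode, Data: "em", FirstChild: &node{Type: textNode, Data: "a"}, NextSibling: em2}
	p := &node{Type: elementNode, Data: "p", FirstChild: em1}

	fmt.Println(len(findTags(p, "em", elementNode))) // 2
	fmt.Println(len(findTags(p, "em", textNode)))    // 0
	fmt.Println(len(findTags(p, "a", errorNode)))    // 1: any type matches
}
```

Returning only the first element of the result would give the FindTag behaviour, at the cost of walking the whole tree; stopping at the first match would be cheaper.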
func GetText ¶
GetText writes the text content of a tree structure to b, like PrintNodes but without any formatting. TODO: check whether (*Tokenizer).Text in the net/html package is an equivalent.
Example ¶
package main

import (
	"bytes"
	"fmt"
	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	_, _ = fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b) // Any parsing error would have occurred elsewhere
	w := new(bytes.Buffer)
	parsing.GetText(o, w)
	if s := fmt.Sprint(w); s != "HTML Fragment to compare against others below to test diffs" {
		fmt.Println("incorrect text")
	}
}
Output:
func IdenticalNodes ¶
IdenticalNodes fails if the trees have different sizes.
func IncludedNode ¶
IncludedNode checks if n is included in m. Included means that the subtree is identical to m including order of siblings. If it is identical, nil is returned. Otherwise, the tag from which trees diverge is returned. If m has more tags than n, nil is returned as the search stops when one subtree exploration is exhausted.
Example ¶
ExampleIncludedNode uses the test files to demonstrate usage.
// f1 is the main table tag included in f2
toFind := html.Node{Type: html.ElementNode, Data: "table",
	Attr: []html.Attribute{{Namespace: "", Key: "class", Val: "fixed"}},
}
pm, _ := ParseFile(f1)
m := FindNode(pm, toFind) // searching <table> in d1
if m == nil {
	fmt.Printf("%s not found in %s \n", PrintData(&toFind), f1)
}
pn, _ := ParseFile(f2)
n := FindNode(pn, toFind) // searching <table> in d2
if n == nil {
	fmt.Printf("%s not found in %s \n", PrintData(&toFind), f2)
}
// Is n included in m
if f := IncludedNode(n, m); f != nil {
	fmt.Printf("nodes structures diverge from : %s\n", PrintData(f))
}
Output:
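The rule described for IncludedNode (walk both trees in step, report the first tag where they diverge, and return nil as soon as either side is exhausted) can be sketched as below. The `node` type is a stand-in for html.Node, `firstDivergence` is a hypothetical helper, and returning the node from the second tree is an assumption made for this sketch only.

```go
package main

import "fmt"

// node is a stand-in for html.Node with only the fields the walk needs.
type node struct {
	FirstChild, NextSibling *node
	Data                    string
}

// firstDivergence walks m and n in parallel, siblings in order and
// children first. It returns the node of n at which the tag names differ,
// or nil if no difference is seen before one side runs out.
func firstDivergence(m, n *node) *node {
	for m != nil && n != nil {
		if m.Data != n.Data {
			return n
		}
		if d := firstDivergence(m.FirstChild, n.FirstChild); d != nil {
			return d
		}
		m, n = m.NextSibling, n.NextSibling
	}
	return nil // one side is exhausted: nothing to report
}

func main() {
	// m: <p><em/><sub/></p>   n: <p><em/><b/></p>
	m := &node{Data: "p", FirstChild: &node{Data: "em", NextSibling: &node{Data: "sub"}}}
	n := &node{Data: "p", FirstChild: &node{Data: "em", NextSibling: &node{Data: "b"}}}
	if d := firstDivergence(m, n); d != nil {
		fmt.Println("diverge at", d.Data) // diverge at b
	}

	// short has fewer tags than m, so the walk ends without a divergence.
	short := &node{Data: "p", FirstChild: &node{Data: "em"}}
	fmt.Println(firstDivergence(m, short) == nil) // true
}
```

The second call shows why a nil result does not prove the trees are the same size: the shorter tree simply ran out first.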
func IncludedNodeTyped ¶
IncludedNodeTyped is like IncludedNode, except that only tags of type t are compared.
func IsTextNode ¶
IsTextNode checks the presence of a node and its text value in a buffer. An error message is returned if the node is not found or if the text is not the expected one.
func IsTextTag ¶
func IsTextTag(b io.ReadCloser, t, s string) error
IsTextTag checks the presence of a tag and its text value in a buffer. An error message is returned if the tag is not found or if the text is not the expected one.
func PrintData ¶
PrintData returns a string with the Node information (not its relationships). Passing nil will panic.
func PrintNodes ¶
PrintNodes prints the tree structure of node m until a node equal to n is found. If nil is passed as n, the complete node tree is printed. Values are indented based on the recursion depth d, which is usually 0 when called. Passing html.ErrorNode (iota) as t displays every tag except the error node.
Example (WSearch) ¶
ExamplePrintNodes_wSearch is the previous example stopping at a searched node.
package main

import (
	"bytes"
	"fmt"
	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	var tagToFind html.Node
	tagToFind.Type = html.ElementNode
	tagToFind.Data = "p"
	tagToFind.Attr = []html.Attribute{{Namespace: "", Key: "class", Val: "ex1"}}
	parsing.PrintNodes(o, &tagToFind, html.ErrorNode, 0)
}

Output:
html (Element)
. head (Element)
body (Element)
.. p (Element) [{ class ex1}]
tag found: p (Element) [{ class ex1}]
... HTML Fragment to compare against (Text)
em (Element)
.... others below (Text)
to test (Text)
sub (Element)
.... diffs (Text)
Example (WoSearch) ¶
ExamplePrintNodes_woSearch prints all nodes without using search.
package main

import (
	"bytes"
	"fmt"
	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.PrintNodes(o, nil, html.ErrorNode, 0)
}

Output:
html (Element)
. head (Element)
body (Element)
.. p (Element) [{ class ex1}]
... HTML Fragment to compare against (Text)
em (Element)
.... others below (Text)
to test (Text)
sub (Element)
.... diffs (Text)
func PrintTags ¶
PrintTags prints the node structure until a tag name is found (whatever its attributes). Without a name, all tags are printed. tagOnly selects ElementNode; otherwise tags are printed whatever their type. If the node tree has no ErrorNode, there is no difference with the previous, i.e. if exploreNode(n, "", html.ErrorNode) prints nothing, then both are equivalent.
Example (WSearch) ¶
ExamplePrintTags_wSearch is the previous example stopping at a searched tag
package main

import (
	"bytes"
	"fmt"
	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b) // err ignored as failure is detected before
	parsing.PrintTags(o, "em", true)
}

Output:
html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
em (Element)
[em] found. Stopping exploration
sub (Element)
Example (WoSearch) ¶
ExamplePrintTags_woSearch is not using the search part.
package main

import (
	"bytes"
	"fmt"
	parsing "github.com/iwdgo/htmlutils"
	"golang.org/x/net/html"
)

const HTMLf = `<p class="ex1">HTML Fragment to compare against <em>others below</em> to test <sub>diffs</sub></p>`

func main() {
	b := new(bytes.Buffer)
	fmt.Fprint(b, HTMLf)
	o, _ := html.Parse(b)
	parsing.PrintTags(o, "", false) // +1,6%
}

Output:
(Document)
html (Element)
head (Element)
body (Element)
p (Element) [{ class ex1}]
HTML Fragment to compare against (Text)
em (Element)
others below (Text)
to test (Text)
sub (Element)
diffs (Text)
Types ¶
This section is empty.