dom

v0.2.0 (latest) | Published: Dec 26, 2024 | License: MIT | Imports: 4 | Imported by: 7

Repository

github.com/JohannesKaufmann/dom

README ¶

dom

Helper functions for "net/html" that make it easier to interact with *html.Node.

🚀 Getting Started - 📚 Documentation - 🧑‍💻 Examples

Installation

go get -u github.com/JohannesKaufmann/dom

[!NOTE] This "dom" library was developed for the needs of the html-to-markdown library. That being said, please submit any functions that you need.

Getting Started

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/JohannesKaufmann/dom"
	"golang.org/x/net/html"
)

func main() {
	input := `
	<ul>
		<li><a href="github.com/JohannesKaufmann/dom">dom</a></li>
		<li><a href="github.com/JohannesKaufmann/html-to-markdown">html-to-markdown</a></li>
	</ul>
	`

	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		log.Fatal(err)
	}

	// - - - //

	firstLink := dom.FindFirstNode(doc, func(node *html.Node) bool {
		return dom.NodeName(node) == "a"
	})

	fmt.Println("href:", dom.GetAttributeOr(firstLink, "href", ""))
}

Node vs Element

The naming scheme in this library is:

"Node" means *html.Node{}
- This means any node in the tree of nodes.
"Element" means *html.Node{Type: html.ElementNode}
- This means only nodes with the type of ElementNode. For example <p>, <span>, <a>, ... but not #text nodes, comment nodes, ...

For most functions, there are two versions. For example:

FirstChildNode() and FirstChildElement()
AllChildNodes() and AllChildElements()
...

Documentation

Attributes & Content

You can get the attributes of a node using GetAttribute, GetAttributeOr or the more specialized GetClasses that returns a slice of strings.

For matching nodes, HasID and HasClass can be used.

If you want to collect the #text of all the child nodes, you can call CollectText.

name := dom.NodeName(node)
// "h2"

href := dom.GetAttributeOr(node, "href", "")
// "github.com"

isHeading := dom.HasClass(node, "repo__name")
// `true`

content := dom.CollectText(node)
// "Lorem ipsum"

Children & Siblings

You can already use node.FirstChild to get the first child node. For convenience, FirstChildNode() and FirstChildElement() are also available; both return *html.Node.

To get all direct children, use AllChildNodes and AllChildElements, which return []*html.Node.

To move between siblings, use:

PrevSiblingNode and PrevSiblingElement
NextSiblingNode and NextSiblingElement

Find Nodes

Searching for nodes deep in the tree is made easier with:

firstParagraph := dom.FindFirstNode(doc, func(node *html.Node) bool {
    return dom.NodeName(node) == "p"
})
// *html.Node


allParagraphs := dom.FindAllNodes(doc, func(node *html.Node) bool {
    return dom.NodeName(node) == "p"
})
// []*html.Node

🧑‍💻 Example code, find
🧑‍💻 Example code, selectors

Get next/previous neighbors

What is special about this? The order!

If you are somewhere in the DOM, you can call GetNextNeighborNode to get the next node, even if it is further up the tree. The nodes are returned in document order, that is, in the order in which they appear in the DOM.

node := startNode
for node != nil {
    fmt.Println(dom.NodeName(node))

    node = dom.GetNextNeighborNode(node)
}

If we start the for loop at the <button> and repeatedly call GetNextNeighborNode, the nodes are visited in this order:

#document
├─html
│ ├─head
│ ├─body
│ │ ├─nav
│ │ │ ├─p
│ │ │ │ ├─#text "up"
│ │ ├─main
│ │ │ ├─button   *️⃣
│ │ │ │ ├─span  0️⃣
│ │ │ │ │ ├─#text "start"  1️⃣
│ │ │ ├─div  2️⃣
│ │ │ │ ├─h3  3️⃣
│ │ │ │ │ ├─#text "heading"  4️⃣
│ │ │ │ ├─p  5️⃣
│ │ │ │ │ ├─#text "description"  6️⃣
│ │ ├─footer  7️⃣
│ │ │ ├─p  8️⃣
│ │ │ │ ├─#text "down"  9️⃣

If you only want to visit the ElementNodes (and skip the #text nodes), you can use GetNextNeighborElement instead.

If you want to skip the children, you can use GetNextNeighborNodeExcludingOwnChild. In the example above, when starting at the <button>, the next node would be the <div>.

The same functions also exist for the previous nodes, e.g. GetPrevNeighborNode.

🧑‍💻 Example code, next basics
🧑‍💻 Example code, next inside a loop

Remove & Replace Node

if dom.HasClass(node, "lang__old") {
	newNode := &html.Node{
		Type: html.TextNode,
		Data: "🪦",
	}
	dom.ReplaceNode(node, newNode)
}


for _, node := range emptyTextNodes {
	dom.RemoveNode(node)
}

🧑‍💻 Example code, remove and replace

Unwrap Node

#document
├─html
│ ├─head
│ ├─body
│ │ ├─article   *️⃣
│ │ │ ├─h3
│ │ │ │ ├─#text "Heading"
│ │ │ ├─p
│ │ │ │ ├─#text "short description"

If we take the input above and run UnwrapNode(articleNode) we can "unwrap" the <article>. That means removing the <article> while keeping the children (<h3> and <p>).

#document
├─html
│ ├─head
│ ├─body
│ │ ├─h3
│ │ │ ├─#text "Heading"
│ │ ├─p
│ │ │ ├─#text "short description"

For the reverse you can use WrapNode(existingNode, newNode).

RenderRepresentation

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/JohannesKaufmann/dom"
	"golang.org/x/net/html"
)

func main() {
	input := `<a href="/about">Read More</a>`

	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(dom.RenderRepresentation(doc))
}

The tree representation helps to visualize the tree structure of the DOM, and it makes the #text nodes stand out.

[!TIP] This function could be useful for debugging and test cases, for example in neighbors_test.go.

#document
├─html
│ ├─head
│ ├─body
│ │ ├─a (href=/about)
│ │ │ ├─#text "Read More"

By contrast, the normal "net/html" Render() function would have produced this:

<html><head></head><body><a href="/about">Read More</a></body></html>

🧑‍💻 Example code, dom representation

Documentation ¶

Overview ¶

dom makes it easier to interact with the html document.

Node = return all the nodes
Element = return all the nodes that are of type Element. This e.g. excludes #text nodes.

Index ¶

func AllChildElements(node *html.Node) (children []*html.Node)
func AllChildNodes(node *html.Node) (children []*html.Node)
func AllNodes(startNode *html.Node) (allNodes []*html.Node)
func CollectText(node *html.Node) string
func ContainsNode(startNode *html.Node, matchFn func(node *html.Node) bool) bool
func FindAllNodes(startNode *html.Node, matchFn func(node *html.Node) bool) (foundNodes []*html.Node)
func FindFirstNode(startNode *html.Node, matchFn func(node *html.Node) bool) *html.Node
func FirstChildElement(node *html.Node) *html.Node
func FirstChildNode(node *html.Node) *html.Node
func GetAttribute(node *html.Node, key string) (string, bool)
func GetAttributeOr(node *html.Node, key string, fallback string) string
func GetClasses(node *html.Node) []string
func GetNextNeighborElement(node *html.Node) *html.Node
func GetNextNeighborElementExcludingOwnChild(node *html.Node) *html.Node
func GetNextNeighborNode(node *html.Node) *html.Node
func GetNextNeighborNodeExcludingOwnChild(node *html.Node) *html.Node
func GetPrevNeighborElement(node *html.Node) *html.Node
func GetPrevNeighborElementExcludingOwnChild(node *html.Node) *html.Node
func GetPrevNeighborNode(node *html.Node) *html.Node
func GetPrevNeighborNodeExcludingOwnChild(node *html.Node) *html.Node
func HasClass(node *html.Node, expectedClass string) bool
func HasID(node *html.Node, expectedID string) bool
func NameIsBlockNode(name string) bool
func NameIsHeading(name string) bool
func NameIsInlineNode(name string) bool
func NextSiblingElement(node *html.Node) *html.Node
func NextSiblingNode(node *html.Node) *html.Node
func NodeName(node *html.Node) string
func PrevSiblingElement(node *html.Node) *html.Node
func PrevSiblingNode(node *html.Node) *html.Node
func RemoveNode(node *html.Node)
func RenderRepresentation(startNode *html.Node) string
func ReplaceNode(node, newNode *html.Node)
func UNSTABLE_initGetNeighbor(firstChildFunc func(node *html.Node) *html.Node, ...) func(*html.Node) *html.Node
func UnwrapNode(node *html.Node)
func WrapNode(existingNode, newNode *html.Node) *html.Node

Examples ¶

FindFirstNode

Functions ¶

func AllChildElements ¶

func AllChildElements(node *html.Node) (children []*html.Node)

AllChildElements is similar to AllChildNodes but only returns nodes of type `ElementNode`.

func AllChildNodes ¶

func AllChildNodes(node *html.Node) (children []*html.Node)

func AllNodes ¶ added in v0.2.0

func AllNodes(startNode *html.Node) (allNodes []*html.Node)

AllNodes recursively gets all the nodes in the tree.

func CollectText ¶

func CollectText(node *html.Node) string

func ContainsNode ¶

func ContainsNode(startNode *html.Node, matchFn func(node *html.Node) bool) bool

func FindAllNodes ¶

func FindAllNodes(startNode *html.Node, matchFn func(node *html.Node) bool) (foundNodes []*html.Node)

func FindFirstNode ¶

func FindFirstNode(startNode *html.Node, matchFn func(node *html.Node) bool) *html.Node

Example ¶

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/JohannesKaufmann/dom"
	"golang.org/x/net/html"
)

func main() {
	input := `
	<ul>
		<li><a href="github.com/JohannesKaufmann/dom">dom</a></li>
		<li><a href="github.com/JohannesKaufmann/html-to-markdown">html-to-markdown</a></li>
	</ul>
	`

	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		log.Fatal(err)
	}

	// - - - //

	firstLink := dom.FindFirstNode(doc, func(node *html.Node) bool {
		return dom.NodeName(node) == "a"
	})

	fmt.Println(dom.GetAttributeOr(firstLink, "href", ""))
}

Output:

github.com/JohannesKaufmann/dom

func FirstChildElement ¶

func FirstChildElement(node *html.Node) *html.Node

func FirstChildNode ¶

func FirstChildNode(node *html.Node) *html.Node

func GetAttribute ¶

func GetAttribute(node *html.Node, key string) (string, bool)

func GetAttributeOr ¶

func GetAttributeOr(node *html.Node, key string, fallback string) string

func GetClasses ¶

func GetClasses(node *html.Node) []string

func GetNextNeighborElement ¶

func GetNextNeighborElement(node *html.Node) *html.Node

func GetNextNeighborElementExcludingOwnChild ¶

func GetNextNeighborElementExcludingOwnChild(node *html.Node) *html.Node

func GetNextNeighborNode ¶

func GetNextNeighborNode(node *html.Node) *html.Node

func GetNextNeighborNodeExcludingOwnChild ¶

func GetNextNeighborNodeExcludingOwnChild(node *html.Node) *html.Node

func GetPrevNeighborElement ¶

func GetPrevNeighborElement(node *html.Node) *html.Node

func GetPrevNeighborElementExcludingOwnChild ¶

func GetPrevNeighborElementExcludingOwnChild(node *html.Node) *html.Node

func GetPrevNeighborNode ¶

func GetPrevNeighborNode(node *html.Node) *html.Node

func GetPrevNeighborNodeExcludingOwnChild ¶

func GetPrevNeighborNodeExcludingOwnChild(node *html.Node) *html.Node

func HasClass ¶

func HasClass(node *html.Node, expectedClass string) bool

func HasID ¶

func HasID(node *html.Node, expectedID string) bool

func NameIsBlockNode ¶

func NameIsBlockNode(name string) bool

func NameIsHeading ¶

func NameIsHeading(name string) bool

func NameIsInlineNode ¶

func NameIsInlineNode(name string) bool

func NextSiblingElement ¶

func NextSiblingElement(node *html.Node) *html.Node

NextSiblingElement returns the element immediately following the passed-in node or nil. In contrast to `node.NextSibling` this only returns the next `ElementNode`.

func NextSiblingNode ¶

func NextSiblingNode(node *html.Node) *html.Node

func NodeName ¶

func NodeName(node *html.Node) string

In order to stay consistent with v1 of the library, this follows the naming scheme of goquery. E.g. "#text", "div", ...

func PrevSiblingElement ¶

func PrevSiblingElement(node *html.Node) *html.Node

func PrevSiblingNode ¶

func PrevSiblingNode(node *html.Node) *html.Node

func RemoveNode ¶

func RemoveNode(node *html.Node)

func RenderRepresentation ¶

func RenderRepresentation(startNode *html.Node) string

RenderRepresentation is useful for debugging. It renders out the *structure* of the dom.

func ReplaceNode ¶

func ReplaceNode(node, newNode *html.Node)

func UNSTABLE_initGetNeighbor ¶

func UNSTABLE_initGetNeighbor(
	firstChildFunc func(node *html.Node) *html.Node,
	prevNextFunc func(node *html.Node) *html.Node,
	goUpUntilFunc func(node *html.Node) bool,
) func(*html.Node) *html.Node

Warning: It is not meant to be called directly and may change signature from release to release!

func UnwrapNode ¶

func UnwrapNode(node *html.Node)

func WrapNode ¶ added in v0.2.0

func WrapNode(existingNode, newNode *html.Node) *html.Node

WrapNode wraps the newNode around the existingNode.

Directories ¶

Path	Synopsis
examples/dom_representation
examples/find
examples/next_basics
examples/next_loop
examples/remove_replace
examples/selectors
