dom

v0.2.0
Published: Dec 26, 2024 License: MIT Imports: 4 Imported by: 7

README

Helper functions for "net/html" that make it easier to interact with *html.Node.

Installation

go get -u github.com/JohannesKaufmann/dom

[!NOTE] This "dom" library was developed for the needs of the html-to-markdown library. That being said, please submit any functions that you need.

Getting Started

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/JohannesKaufmann/dom"
	"golang.org/x/net/html"
)

func main() {
	input := `
	<ul>
		<li><a href="github.com/JohannesKaufmann/dom">dom</a></li>
		<li><a href="github.com/JohannesKaufmann/html-to-markdown">html-to-markdown</a></li>
	</ul>
	`

	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		log.Fatal(err)
	}

	// - - - //

	firstLink := dom.FindFirstNode(doc, func(node *html.Node) bool {
		return dom.NodeName(node) == "a"
	})

	fmt.Println("href:", dom.GetAttributeOr(firstLink, "href", ""))
}

Node vs Element

The naming scheme in this library is:

  • "Node" means *html.Node{}
    • This means any node in the tree of nodes.
  • "Element" means *html.Node{Type: html.ElementNode}
    • This means only nodes with the type of ElementNode. For example <p>, <span>, <a>, ... but not #text, <!--comment-->, ...

For most functions, there are two versions. For example:

  • FirstChildNode() and FirstChildElement()
  • AllChildNodes() and AllChildElements()
  • ...
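
To make the Node/Element distinction concrete, here is a minimal sketch. It does not use the library itself: `NodeType`, `Node`, `allChildNodes`, `allChildElements` and `sampleParagraph` are hypothetical stand-ins that mirror a few fields of `html.Node`, so treat it as an illustration of the idea rather than the actual implementation.

```go
package main

import "fmt"

// NodeType and Node are simplified, hypothetical stand-ins for
// html.NodeType and html.Node; only the fields needed here are kept.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type        NodeType
	Data        string
	FirstChild  *Node
	NextSibling *Node
}

// allChildNodes returns every direct child, whatever its type.
func allChildNodes(n *Node) []*Node {
	var children []*Node
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		children = append(children, c)
	}
	return children
}

// allChildElements returns only the direct children that are elements.
func allChildElements(n *Node) []*Node {
	var children []*Node
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if c.Type == ElementNode {
			children = append(children, c)
		}
	}
	return children
}

// sampleParagraph builds the tree for <p>hello <b>world</b></p>
// (the text inside <b> is left out for brevity).
func sampleParagraph() *Node {
	b := &Node{Type: ElementNode, Data: "b"}
	text := &Node{Type: TextNode, Data: "hello ", NextSibling: b}
	return &Node{Type: ElementNode, Data: "p", FirstChild: text}
}

func main() {
	p := sampleParagraph()
	// The "Node" variant sees the #text and the <b>, the "Element" variant only the <b>.
	fmt.Println(len(allChildNodes(p)), len(allChildElements(p))) // 2 1
}
```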

Documentation

Attributes & Content

You can get the attributes of a node using GetAttribute, GetAttributeOr or the more specialized GetClasses that returns a slice of strings.

For matching nodes, HasID and HasClass can be used.

If you want to collect the #text of all the child nodes, you can call CollectText.

name := dom.NodeName(node)
// "h2"

href := dom.GetAttributeOr(node, "href", "")
// "github.com"

isHeading := dom.HasClass(node, "repo__name")
// `true`

content := dom.CollectText(node)
// "Lorem ipsum"
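
The relationship between the lookup helpers can be sketched as follows. This is not the library's code: `Attribute`, `Node`, `getAttribute`, `getAttributeOr` and `getClasses` are hypothetical stand-ins, and splitting the class attribute on whitespace is an assumption about how a GetClasses-style helper would behave.

```go
package main

import (
	"fmt"
	"strings"
)

// Attribute and Node are simplified, hypothetical stand-ins for
// html.Attribute and html.Node.
type Attribute struct {
	Key, Val string
}

type Node struct {
	Data string
	Attr []Attribute
}

// getAttribute returns the value and whether the attribute is present,
// so an empty value can be told apart from a missing attribute.
func getAttribute(n *Node, key string) (string, bool) {
	for _, a := range n.Attr {
		if a.Key == key {
			return a.Val, true
		}
	}
	return "", false
}

// getAttributeOr returns the fallback when the attribute is missing.
func getAttributeOr(n *Node, key, fallback string) string {
	if val, ok := getAttribute(n, key); ok {
		return val
	}
	return fallback
}

// getClasses splits the class attribute on whitespace (an assumption).
func getClasses(n *Node) []string {
	return strings.Fields(getAttributeOr(n, "class", ""))
}

func main() {
	a := &Node{Data: "a", Attr: []Attribute{
		{Key: "href", Val: "/about"},
		{Key: "class", Val: "link  link--primary"},
	}}

	fmt.Println(getAttributeOr(a, "href", "#"))    // /about
	fmt.Println(getAttributeOr(a, "title", "n/a")) // n/a
	fmt.Println(getClasses(a))                     // [link link--primary]
}
```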

Children & Siblings

You can already use node.FirstChild to get the first child node. For convenience, the library adds FirstChildNode() and FirstChildElement(), which both return a *html.Node.

To get all direct children, use AllChildNodes and AllChildElements, which return []*html.Node.

For siblings, the equivalent functions are:

  • PrevSiblingNode and PrevSiblingElement

  • NextSiblingNode and NextSiblingElement
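
The reason the "Element" variants exist is that whitespace between tags is usually parsed as a #text node, so node.NextSibling is often not the element you expect. A rough sketch of the idea, using hypothetical stand-in types (`Node`, `nextSiblingElement`, `sampleSiblings`) rather than the library itself:

```go
package main

import "fmt"

// NodeType and Node are simplified, hypothetical stand-ins for
// html.NodeType and html.Node.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type        NodeType
	Data        string
	PrevSibling *Node
	NextSibling *Node
}

// nextSiblingElement skips over non-element siblings (such as #text)
// and returns the next element, or nil if there is none.
func nextSiblingElement(n *Node) *Node {
	for s := n.NextSibling; s != nil; s = s.NextSibling {
		if s.Type == ElementNode {
			return s
		}
	}
	return nil
}

// sampleSiblings builds <h3></h3>, a whitespace #text node and <p></p>
// as consecutive siblings and returns the two elements.
func sampleSiblings() (h3, p *Node) {
	h3 = &Node{Type: ElementNode, Data: "h3"}
	space := &Node{Type: TextNode, Data: "\n  "}
	p = &Node{Type: ElementNode, Data: "p"}

	h3.NextSibling, space.PrevSibling = space, h3
	space.NextSibling, p.PrevSibling = p, space
	return h3, p
}

func main() {
	h3, p := sampleSiblings()

	fmt.Println(h3.NextSibling.Type == TextNode) // true
	fmt.Println(nextSiblingElement(h3).Data)     // p
	fmt.Println(nextSiblingElement(p) == nil)    // true
}
```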

Find Nodes

Searching for nodes deep in the tree is made easier with:

firstParagraph := dom.FindFirstNode(doc, func(node *html.Node) bool {
    return dom.NodeName(node) == "p"
})
// *html.Node


allParagraphs := dom.FindAllNodes(doc, func(node *html.Node) bool {
    return dom.NodeName(node) == "p"
})
// []*html.Node
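
Conceptually, both functions walk the tree and apply your match function to each node. The sketch below shows one way this could work, assuming a depth-first walk that also tests the start node; whether the library does exactly this is an assumption. `Node`, `el`, `link`, `isLink`, `findFirst` and `findAll` are hypothetical names for illustration.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for html.Node.
// Href replaces the attribute list to keep the sketch short.
type Node struct {
	Data        string
	Href        string
	FirstChild  *Node
	LastChild   *Node
	NextSibling *Node
}

// el creates a node and appends the given children in order.
func el(name string, children ...*Node) *Node {
	n := &Node{Data: name}
	for _, c := range children {
		if n.LastChild == nil {
			n.FirstChild = c
		} else {
			n.LastChild.NextSibling = c
		}
		n.LastChild = c
	}
	return n
}

func link(href string) *Node { return &Node{Data: "a", Href: href} }

func isLink(n *Node) bool { return n.Data == "a" }

// findFirst returns the first match in depth-first order, or nil.
func findFirst(start *Node, match func(*Node) bool) *Node {
	if start == nil {
		return nil
	}
	if match(start) {
		return start
	}
	for c := start.FirstChild; c != nil; c = c.NextSibling {
		if found := findFirst(c, match); found != nil {
			return found
		}
	}
	return nil
}

// findAll collects every match in depth-first order.
func findAll(start *Node, match func(*Node) bool) []*Node {
	var found []*Node
	var walk func(n *Node)
	walk = func(n *Node) {
		if match(n) {
			found = append(found, n)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	if start != nil {
		walk(start)
	}
	return found
}

// sampleList builds <ul><li><a href="/dom"></a></li><li><a href="/html-to-markdown"></a></li></ul>.
func sampleList() *Node {
	return el("ul", el("li", link("/dom")), el("li", link("/html-to-markdown")))
}

func main() {
	list := sampleList()
	fmt.Println(findFirst(list, isLink).Href) // /dom
	fmt.Println(len(findAll(list, isLink)))   // 2
}
```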

Get next/previous neighbors

What is special about this? The order!

If you are somewhere in the DOM, you can call GetNextNeighborNode to get the next node, even if it is further up the tree. The nodes are returned in the same order in which they appear in the DOM.

node := startNode
for node != nil {
    fmt.Println(dom.NodeName(node))

    node = dom.GetNextNeighborNode(node)
}

If we start the for loop at the <button> and repeatedly call GetNextNeighborNode, this is the order in which the nodes are visited.

#document
├─html
│ ├─head
│ ├─body
│ │ ├─nav
│ │ │ ├─p
│ │ │ │ ├─#text "up"
│ │ ├─main
│ │ │ ├─button   *️⃣
│ │ │ │ ├─span  0️⃣
│ │ │ │ │ ├─#text "start"  1️⃣
│ │ │ ├─div  2️⃣
│ │ │ │ ├─h3  3️⃣
│ │ │ │ │ ├─#text "heading"  4️⃣
│ │ │ │ ├─p  5️⃣
│ │ │ │ │ ├─#text "description"  6️⃣
│ │ ├─footer  7️⃣
│ │ │ ├─p  8️⃣
│ │ │ │ ├─#text "down"  9️⃣

If you only want to visit the ElementNodes (and skip the #text nodes), you can use GetNextNeighborElement instead.

If you want to skip the children you can use GetNextNeighborNodeExcludingOwnChild. In the example above, when starting at the <button> the next node would be the <div>.

The same functions also exist for the previous nodes, e.g. GetPrevNeighborNode.
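
The traversal order described above can be sketched in a few lines: take the first child if there is one, otherwise the next sibling, otherwise climb up until an ancestor has a next sibling. This is only an illustration with hypothetical stand-ins (`Node`, `el`, `nextNeighbor`, `nextNeighborExcludingOwnChild`, `walkFrom`), not the library's implementation, and text nodes are represented simply by their content.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified, hypothetical stand-in for html.Node.
type Node struct {
	Data        string
	Parent      *Node
	FirstChild  *Node
	LastChild   *Node
	NextSibling *Node
}

// el creates a node and appends the given children in order.
func el(name string, children ...*Node) *Node {
	n := &Node{Data: name}
	for _, c := range children {
		c.Parent = n
		if n.LastChild == nil {
			n.FirstChild = c
		} else {
			n.LastChild.NextSibling = c
		}
		n.LastChild = c
	}
	return n
}

// nextNeighborExcludingOwnChild ignores the node's children: it returns
// the next sibling, or the next sibling of the closest ancestor that has one.
func nextNeighborExcludingOwnChild(n *Node) *Node {
	for cur := n; cur != nil; cur = cur.Parent {
		if cur.NextSibling != nil {
			return cur.NextSibling
		}
	}
	return nil
}

// nextNeighbor returns the next node in document order.
func nextNeighbor(n *Node) *Node {
	if n.FirstChild != nil {
		return n.FirstChild
	}
	return nextNeighborExcludingOwnChild(n)
}

// sampleTree rebuilds the <body> from the diagram above and returns the <button>.
func sampleTree() *Node {
	button := el("button", el("span", el("start")))
	div := el("div", el("h3", el("heading")), el("p", el("description")))
	el("body", el("main", button, div), el("footer", el("p", el("down"))))
	return button
}

// walkFrom lists the names of all nodes visited after start.
func walkFrom(start *Node) string {
	var names []string
	for n := nextNeighbor(start); n != nil; n = nextNeighbor(n) {
		names = append(names, n.Data)
	}
	return strings.Join(names, " ")
}

func main() {
	button := sampleTree()
	fmt.Println(walkFrom(button))
	// span start div h3 heading p description footer p down
	fmt.Println(nextNeighborExcludingOwnChild(button).Data) // div
}
```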


Remove & Replace Node

if dom.HasClass(node, "lang__old") {
	newNode := &html.Node{
		Type: html.TextNode,
		Data: "🪦",
	}
	dom.ReplaceNode(node, newNode)
}


for _, node := range emptyTextNodes {
	dom.RemoveNode(node)
}

Unwrap Node

#document
├─html
│ ├─head
│ ├─body
│ │ ├─article   *️⃣
│ │ │ ├─h3
│ │ │ │ ├─#text "Heading"
│ │ │ ├─p
│ │ │ │ ├─#text "short description"

If we take the input above and run UnwrapNode(articleNode) we can "unwrap" the <article>. That means removing the <article> while keeping the children (<h3> and <p>).

#document
├─html
│ ├─head
│ ├─body
│ │ ├─h3
│ │ │ ├─#text "Heading"
│ │ ├─p
│ │ │ ├─#text "short description"

For the reverse you can use WrapNode(existingNode, newNode).
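
To illustrate what "unwrapping" does to the tree, here is a rough sketch. It uses a hypothetical slice-based `Node` (the real `html.Node` is a linked list of siblings), so `unwrap`, `el` and `childNames` only demonstrate the idea and are not the library's code.

```go
package main

import "fmt"

// Node is a hypothetical, slice-based stand-in for html.Node.
type Node struct {
	Data     string
	Parent   *Node
	Children []*Node
}

// el creates a node and attaches the given children.
func el(name string, children ...*Node) *Node {
	n := &Node{Data: name, Children: children}
	for _, c := range children {
		c.Parent = n
	}
	return n
}

// unwrap removes n from its parent and puts n's children in its place.
// A node without a parent is left untouched.
func unwrap(n *Node) {
	parent := n.Parent
	if parent == nil {
		return
	}
	var kept []*Node
	for _, c := range parent.Children {
		if c != n {
			kept = append(kept, c)
			continue
		}
		// Splice the grandchildren in at the position of n.
		for _, gc := range n.Children {
			gc.Parent = parent
			kept = append(kept, gc)
		}
	}
	parent.Children = kept
	n.Parent, n.Children = nil, nil
}

// childNames lists the names of the direct children.
func childNames(n *Node) []string {
	var names []string
	for _, c := range n.Children {
		names = append(names, c.Data)
	}
	return names
}

func main() {
	article := el("article", el("h3"), el("p"))
	body := el("body", article)

	fmt.Println(childNames(body)) // [article]
	unwrap(article)
	fmt.Println(childNames(body)) // [h3 p]
}
```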


RenderRepresentation

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/JohannesKaufmann/dom"
	"golang.org/x/net/html"
)

func main() {
	input := `<a href="/about">Read More</a>`

	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(dom.RenderRepresentation(doc))
}

The tree representation helps to visualize the structure of the DOM, and it makes the #text nodes stand out.

[!TIP] This function is useful for debugging and test cases. For an example, see neighbors_test.go.

#document
├─html
│ ├─head
│ ├─body
│ │ ├─a (href=/about)
│ │ │ ├─#text "Read More"

By contrast, the normal "net/html" Render() function would have produced this:

<html><head></head><body><a href="/about">Read More</a></body></html>
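
If you are curious how such a tree format can be produced, the sketch below reproduces the output shown above with a small recursive walk. It is an assumption about the approach, not the library's code: `Node`, `render` and `sampleDoc` are hypothetical, and each node carries a precomputed label instead of real attributes.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for html.Node. Label holds the text
// that should be printed for the node, e.g. `a (href=/about)`.
type Node struct {
	Label    string
	Children []*Node
}

// render prints one node per line. Every level below the root adds
// a "│ " guide, and each non-root node is introduced with "├─".
func render(root *Node) string {
	var sb strings.Builder
	var walk func(n *Node, depth int)
	walk = func(n *Node, depth int) {
		if depth > 0 {
			sb.WriteString(strings.Repeat("│ ", depth-1) + "├─")
		}
		sb.WriteString(n.Label + "\n")
		for _, c := range n.Children {
			walk(c, depth+1)
		}
	}
	walk(root, 0)
	return sb.String()
}

// sampleDoc mirrors the parsed document for <a href="/about">Read More</a>.
func sampleDoc() *Node {
	text := &Node{Label: `#text "Read More"`}
	a := &Node{Label: "a (href=/about)", Children: []*Node{text}}
	head := &Node{Label: "head"}
	body := &Node{Label: "body", Children: []*Node{a}}
	html := &Node{Label: "html", Children: []*Node{head, body}}
	return &Node{Label: "#document", Children: []*Node{html}}
}

func main() {
	fmt.Print(render(sampleDoc()))
}
```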

Documentation

Overview

dom makes it easier to interact with the html document.

  • Node: the functions return all the nodes.
  • Element: the functions return only the nodes that are of type Element. This excludes, for example, #text nodes.

Functions

func AllChildElements

func AllChildElements(node *html.Node) (children []*html.Node)

AllChildElements is similar to AllChildNodes but only returns nodes of type `ElementNode`.

func AllChildNodes

func AllChildNodes(node *html.Node) (children []*html.Node)

func AllNodes added in v0.2.0

func AllNodes(startNode *html.Node) (allNodes []*html.Node)

AllNodes recursively gets all the nodes in the tree.

func CollectText

func CollectText(node *html.Node) string

func ContainsNode

func ContainsNode(startNode *html.Node, matchFn func(node *html.Node) bool) bool

func FindAllNodes

func FindAllNodes(startNode *html.Node, matchFn func(node *html.Node) bool) (foundNodes []*html.Node)

func FindFirstNode

func FindFirstNode(startNode *html.Node, matchFn func(node *html.Node) bool) *html.Node

Example

package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/JohannesKaufmann/dom"
	"golang.org/x/net/html"
)

func main() {
	input := `
	<ul>
		<li><a href="github.com/JohannesKaufmann/dom">dom</a></li>
		<li><a href="github.com/JohannesKaufmann/html-to-markdown">html-to-markdown</a></li>
	</ul>
	`

	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		log.Fatal(err)
	}

	// - - - //

	firstLink := dom.FindFirstNode(doc, func(node *html.Node) bool {
		return dom.NodeName(node) == "a"
	})

	fmt.Println(dom.GetAttributeOr(firstLink, "href", ""))
}

Output:

github.com/JohannesKaufmann/dom

func FirstChildElement

func FirstChildElement(node *html.Node) *html.Node

func FirstChildNode

func FirstChildNode(node *html.Node) *html.Node

func GetAttribute

func GetAttribute(node *html.Node, key string) (string, bool)

func GetAttributeOr

func GetAttributeOr(node *html.Node, key string, fallback string) string

func GetClasses

func GetClasses(node *html.Node) []string

func GetNextNeighborElement

func GetNextNeighborElement(node *html.Node) *html.Node

func GetNextNeighborElementExcludingOwnChild

func GetNextNeighborElementExcludingOwnChild(node *html.Node) *html.Node

func GetNextNeighborNode

func GetNextNeighborNode(node *html.Node) *html.Node

func GetNextNeighborNodeExcludingOwnChild

func GetNextNeighborNodeExcludingOwnChild(node *html.Node) *html.Node

func GetPrevNeighborElement

func GetPrevNeighborElement(node *html.Node) *html.Node

func GetPrevNeighborElementExcludingOwnChild

func GetPrevNeighborElementExcludingOwnChild(node *html.Node) *html.Node

func GetPrevNeighborNode

func GetPrevNeighborNode(node *html.Node) *html.Node

func GetPrevNeighborNodeExcludingOwnChild

func GetPrevNeighborNodeExcludingOwnChild(node *html.Node) *html.Node

func HasClass

func HasClass(node *html.Node, expectedClass string) bool

func HasID

func HasID(node *html.Node, expectedID string) bool

func NameIsBlockNode

func NameIsBlockNode(name string) bool

func NameIsHeading

func NameIsHeading(name string) bool

func NameIsInlineNode

func NameIsInlineNode(name string) bool

func NextSiblingElement

func NextSiblingElement(node *html.Node) *html.Node

NextSiblingElement returns the element immediately following the passed-in node or nil. In contrast to `node.NextSibling` this only returns the next `ElementNode`.

func NextSiblingNode

func NextSiblingNode(node *html.Node) *html.Node

func NodeName

func NodeName(node *html.Node) string

In order to stay consistent with v1 of the library, this follows the naming scheme of goquery. E.g. "#text", "div", ...

func PrevSiblingElement

func PrevSiblingElement(node *html.Node) *html.Node

func PrevSiblingNode

func PrevSiblingNode(node *html.Node) *html.Node

func RemoveNode

func RemoveNode(node *html.Node)

func RenderRepresentation

func RenderRepresentation(startNode *html.Node) string

RenderRepresentation is useful for debugging. It renders out the *structure* of the dom.

func ReplaceNode

func ReplaceNode(node, newNode *html.Node)

func UNSTABLE_initGetNeighbor

func UNSTABLE_initGetNeighbor(
	firstChildFunc func(node *html.Node) *html.Node,
	prevNextFunc func(node *html.Node) *html.Node,
	goUpUntilFunc func(node *html.Node) bool,
) func(*html.Node) *html.Node

Warning: It is not meant to be called directly and may change signature from release to release!

func UnwrapNode

func UnwrapNode(node *html.Node)

func WrapNode added in v0.2.0

func WrapNode(existingNode, newNode *html.Node) *html.Node

WrapNode wraps the newNode around the existingNode.

Directories

Path Synopsis
examples