flattenhtml

v0.3.4 (latest) | Published: Jan 7, 2025 | License: MIT | Imports: 6 | Imported by: 0

README ¶

flattenhtml

flattenhtml is a Go package that helps you access specific nodes in an HTML document directly, without traversing all nodes.

Installation

go get github.com/seinshah/flattenhtml

Overview

Use built-in or custom flatteners to access HTML document nodes directly using your desired selectors, whether you want all div nodes (selected by tag name), all elements that have a class attribute, or all elements whose class value is container.

flattenhtml currently supports the following flatteners out of the box:

TagFlattener: flattens all nodes based on their tag name.

You can build a custom in-house flattener by implementing the flattenhtml.Flattener interface. If your implementation is generic and useful to others, please consider contributing it to this package.

Usage

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/seinshah/flattenhtml"
)

func main() {
    // HTML document to be flattened.
    html := `
        <html>
            <head>
                <title>flattenhtml</title>
            </head>
            <body>
                <div class="container" id="target">
                    <div class="row">
                        <div class="col-md-6">
                            <h1>flattenhtml</h1>
                            <p>flattens HTML documents</p>
                        </div>
                        <div class="col-md-6">
                            <h1>flattenhtml</h1>
                            <p>flattens HTML documents</p>
                        </div>
                    </div>
                </div>
            </body>
        </html>
    `

    nm, err := flattenhtml.NewNodeManagerFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }

    mc, err := nm.Parse(flattenhtml.NewTagFlattener())
    if err != nil {
        log.Fatal(err)
    }

    tf, err := mc.SelectCursor(&flattenhtml.TagFlattener{})
    if err != nil {
        log.Fatal(err)
    }

    divs := tf.SelectNodes("div")

    divs.
        Filter(flattenhtml.WithAttributeValueAs("class", "container")).
        Each(func(n *flattenhtml.Node) {
            val, _ := n.Attribute("id")

            fmt.Println(val)

            // Output:
            // target
        })
}

Documentation ¶

Overview ¶

Package flattenhtml provides a way to flatten the HTML tree structure and then use the flattened data to do different kinds of lookups.

Go provides the html package, which does the heavy lifting of parsing HTML. However, that package produces a tree structure. Although a tree is generic and can serve any traversal purpose, it is inconvenient for some use cases, such as searching for a specific element.

This is where flattenhtml comes in. It provides different mechanisms to flatten the HTML tree structure based on the use case. For example, if you want to work with nodes by tag name, you can use TagFlattener to flatten all the nodes by their tag name once and then perform repeated tag lookups without traversing the tree each time.
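The idea of flattening by tag name can be sketched without the package at all. The following is a minimal illustration that uses a toy node type and a plain map; the names node, flattenByTag, and sampleTree are assumptions made for this sketch and are not part of flattenhtml, whose internal representation may differ.

```go
package main

import "fmt"

// node is a toy stand-in for a parsed HTML element.
// It is not the package's Node type.
type node struct {
	tag      string
	children []*node
}

// flattenByTag walks the tree once and groups every node under its tag name.
func flattenByTag(root *node) map[string][]*node {
	index := make(map[string][]*node)

	var walk func(n *node)
	walk = func(n *node) {
		index[n.tag] = append(index[n.tag], n)
		for _, c := range n.children {
			walk(c)
		}
	}
	walk(root)

	return index
}

// sampleTree builds html > body > (div > (p, p), div).
func sampleTree() *node {
	return &node{tag: "html", children: []*node{
		{tag: "body", children: []*node{
			{tag: "div", children: []*node{{tag: "p"}, {tag: "p"}}},
			{tag: "div"},
		}},
	}}
}

func main() {
	index := flattenByTag(sampleTree())

	// Each lookup is now a map access instead of a tree traversal.
	fmt.Println(len(index["div"]), len(index["p"]), len(index["span"])) // 2 2 0
}
```

After the single walk, looking up a tag that never occurred simply yields an empty result, which is the behavior a caller would want from a lookup structure of this kind.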

TagFlattener is currently the only built-in flattener of this package. However, all flatteners implement the flattenhtml.Flattener interface, so you can easily implement your own.

When you initialize the NodeManager and call Parse as in the following statements, the parsed HTML tree is traversed once; any further lookups use the flattened data without traversing the tree again. You can also use multiple flatteners at the same time. For example, you can use TagFlattener to flatten the nodes by tag name alongside a custom flattener, such as an AttributeFlattener of your own, that flattens the nodes by their attributes. As before, the HTML tree is traversed only once to feed all flatteners.

html := "<html><head></head><body><div><p></p></div></body></html>"
flatteners := []flattenhtml.Flattener{flattenhtml.NewTagFlattener() /* , ... */}
nm, err := flattenhtml.NewNodeManagerFromReader(strings.NewReader(html))
if err != nil {
	log.Fatal(err)
}
mc, err := nm.Parse(flatteners...) // check err the same way before using mc

Once the flattening process is done, you will have a *flattenhtml.MultiCursor, which holds a pointer to all the flatteners. Before proceeding, you need to select a single flattener of your choice to continue the lookup process.

tagFlattenerCursor := mc.First()

Now, you can get nodes of the same tag name using the following statement:

nodes := tagFlattenerCursor.SelectNodes("div")

This returns a *flattenhtml.NodeIterator that can be used to iterate over the nodes selected by the given key, in this case all the nodes with the "div" tag name.

Note that the underlying engine for parsing HTML is the golang.org/x/net/html package, so everything about how that package normalizes the HTML tree applies to this package as well.

Index ¶

Variables
type Cursor
- func (c *Cursor) Len() int
- func (c *Cursor) RegisterNewNode(node *Node) error
- func (c *Cursor) SelectNodes(key string) *NodeIterator
type FilterOption
- func WithAttribute(key string) FilterOption
- func WithAttributeValueAs(key, value string) FilterOption
- func WithTag(tag string) FilterOption
type Flattener
type MultiCursor
- func NewMultiCursor(flatteners ...Flattener) *MultiCursor
- func (m *MultiCursor) First() *Cursor
- func (m *MultiCursor) RegisterNewNode(node *Node) error
- func (m *MultiCursor) SelectCursor(flattener Flattener) (*Cursor, error)
type Node
- func NewNode(htmlNode *html.Node) *Node
- func (n *Node) AppendChild(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
- func (n *Node) AppendSibling(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
- func (n *Node) Attribute(key string) (string, bool)
- func (n *Node) Attributes() map[string]string
- func (n *Node) HTMLNode() *html.Node
- func (n *Node) IsRemoved() bool
- func (n *Node) PrependChild(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
- func (n *Node) PrependSibling(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
- func (n *Node) Remove() error
- func (n *Node) RemoveAttribute(key string)
- func (n *Node) SetAttribute(key, value string)
- func (n *Node) TagName() string
type NodeIterator
- func NewNodeIterator() *NodeIterator
- func (n *NodeIterator) Add(node *Node) *NodeIterator
- func (n *NodeIterator) Each(f func(node *Node))
- func (n *NodeIterator) Filter(option FilterOption) *NodeIterator
- func (n *NodeIterator) FilterAnd(options ...FilterOption) *NodeIterator
- func (n *NodeIterator) FilterOr(options ...FilterOption) *NodeIterator
- func (n *NodeIterator) First() *Node
- func (n *NodeIterator) Len() int
- func (n *NodeIterator) Next() *Node
- func (n *NodeIterator) Reset()
type NodeManager
- func NewNodeManager(root *html.Node) *NodeManager
- func NewNodeManagerFromReader(r io.Reader) (*NodeManager, error)
- func NewNodeManagerFromURL(ctx context.Context, url string) (*NodeManager, error)
- func (n *NodeManager) Parse(flatteners ...Flattener) (*MultiCursor, error)
- func (n *NodeManager) Render(w io.Writer) error
type NodeType
type TagFlattener
- func NewTagFlattener() *TagFlattener
- func (t *TagFlattener) Flatten(node *html.Node) error
- func (t *TagFlattener) GetNodesByKey(key string) *NodeIterator
- func (t *TagFlattener) IsMyType(flattener Flattener) bool
- func (t *TagFlattener) Len() int

Examples ¶

TagFlattener

Constants ¶

This section is empty.

Variables ¶

var ErrNoFlattener = errors.New("at least one flattener should be provided")

ErrNoFlattener is returned when no flattener is provided to the Parse method, or no flattener is found in the MultiCursor.

var ErrParentlessNode = errors.New("node with no parent cannot be removed")

Functions ¶

This section is empty.

Types ¶

type Cursor ¶

type Cursor struct {
	// contains filtered or unexported fields
}

Cursor is a helper struct that holds the flattener selected from the MultiCursor. It lets the caller perform different operations on the flattened document using the flattener selected with the MultiCursor.SelectCursor method.

func (*Cursor) Len ¶

func (c *Cursor) Len() int

Len returns the final number of categories or keys that were created by the flattener.

func (*Cursor) RegisterNewNode ¶ added in v0.3.1

func (c *Cursor) RegisterNewNode(node *Node) error

RegisterNewNode adds a node that the user created manually to the cycle. It calls the Flatten method of the cursor's flattener with the Node's underlying html.Node. A new node can be accessed through NodeIterator and Cursor only after it has been added to the cycle with this method.

func (*Cursor) SelectNodes ¶

func (c *Cursor) SelectNodes(key string) *NodeIterator

SelectNodes returns a new NodeIterator that can iterate over the nodes selected by the given key and perform different operations on them. If the given key is not found in the flattened document, the NodeIterator will have zero length.

type FilterOption ¶

type FilterOption func(node *Node) bool

FilterOption is a function that accepts a *Node and returns a boolean. The boolean value is true if the given *Node should be included in the NodeIterator and false otherwise.

func WithAttribute ¶

func WithAttribute(key string) FilterOption

WithAttribute returns a FilterOption that filters nodes by the given key. The Node will be included in the final output if it has an attribute with the given key.

func WithAttributeValueAs ¶

func WithAttributeValueAs(key, value string) FilterOption

WithAttributeValueAs returns a FilterOption that filters nodes by the given key and value. The Node will be included in the final output if it has an attribute with the given key and the value of that attribute is equal to the given value.

func WithTag ¶

func WithTag(tag string) FilterOption

WithTag returns a FilterOption that filters nodes by their tag name. If a node's tag name is the same as the given tag, it will be included in the final output.

type Flattener ¶

type Flattener interface {
	// Flatten is a callback function called for each node
	// in the HTML tree. It accepts a *html.Node as the argument and returns
	// an error if any. If the error is not nil, the iteration stops and the
	// error is returned.
	Flatten(node *html.Node) error

	// GetNodesByKey returns a NodeIterator that can iterate over the nodes
	// that are flattened using the flattener and filtered by the given key.
	// If the given key is not found in the flattened document, it returns
	// nil.
	GetNodesByKey(key string) *NodeIterator

	// IsMyType allows each flattener implementation to decide whether the given
	// Flattener is of the same type as itself or not.
	IsMyType(flattener Flattener) bool

	// Len returns the final number of categories or keys that were created by the flattener.
	Len() int
}

Flattener is an interface for the logic that decides how the HTML tree should be traversed and flattened.
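To make the four-method contract concrete, here is a dependency-free sketch of a flattener that groups nodes by the value of their id attribute. Everything in it is an assumption made for illustration: htmlNode stands in for *html.Node, a plain slice stands in for *NodeIterator, the local flattener interface only mirrors the shape of Flattener, and idFlattener is a hypothetical type. A real implementation would use the package's own types, and the type assertion in IsMyType is just one plausible way to satisfy that method.

```go
package main

import "fmt"

// htmlNode is a local stand-in for *html.Node so the sketch runs without dependencies.
type htmlNode struct {
	Tag  string
	Attr map[string]string
}

// flattener mirrors the shape of the Flattener interface, with a plain slice
// standing in for *NodeIterator.
type flattener interface {
	Flatten(node *htmlNode) error
	GetNodesByKey(key string) []*htmlNode
	IsMyType(f flattener) bool
	Len() int
}

// idFlattener (hypothetical) groups nodes by the value of their id attribute.
type idFlattener struct {
	byID map[string][]*htmlNode
}

func newIDFlattener() *idFlattener {
	return &idFlattener{byID: make(map[string][]*htmlNode)}
}

// Flatten is called once per node; nodes without an id attribute are skipped.
func (f *idFlattener) Flatten(node *htmlNode) error {
	if id, ok := node.Attr["id"]; ok {
		f.byID[id] = append(f.byID[id], node)
	}
	return nil
}

// GetNodesByKey returns the nodes stored under key, or nil if the key is unknown.
func (f *idFlattener) GetNodesByKey(key string) []*htmlNode { return f.byID[key] }

// IsMyType reports whether other is also an *idFlattener.
func (f *idFlattener) IsMyType(other flattener) bool {
	_, ok := other.(*idFlattener)
	return ok
}

// Len returns the number of distinct keys collected so far.
func (f *idFlattener) Len() int { return len(f.byID) }

func main() {
	var f flattener = newIDFlattener()
	nodes := []*htmlNode{
		{Tag: "div", Attr: map[string]string{"id": "target"}},
		{Tag: "p"},
		{Tag: "h1", Attr: map[string]string{"id": "title"}},
	}
	for _, n := range nodes {
		if err := f.Flatten(n); err != nil {
			fmt.Println("flatten failed:", err)
			return
		}
	}
	fmt.Println(f.Len(), len(f.GetNodesByKey("target")), f.IsMyType(newIDFlattener())) // 2 1 true
}
```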

type MultiCursor ¶

type MultiCursor struct {
	// contains filtered or unexported fields
}

MultiCursor is a helper struct that holds all the configured flatteners. It is usually initiated by the NodeManager with the configured flatteners, and can later be narrowed down to a single flattener using the MultiCursor.SelectCursor method.

func NewMultiCursor ¶

func NewMultiCursor(flatteners ...Flattener) *MultiCursor

NewMultiCursor returns a new MultiCursor; it is usually initiated by the NodeManager. It holds all the configured flatteners, which are used separately to flatten the HTML tree. To perform operations on the flattened document, first select your desired flattener cursor using the methods defined on MultiCursor.

func (*MultiCursor) First ¶ added in v0.2.0

func (m *MultiCursor) First() *Cursor

First returns the first Cursor from the MultiCursor initiated by the NodeManager. This Cursor will hold the reference to the first flattener you configured for the NodeManager.Parse method. If MultiCursor has no cursor, the result will be nil.

func (*MultiCursor) RegisterNewNode ¶ added in v0.3.0

func (m *MultiCursor) RegisterNewNode(node *Node) error

RegisterNewNode adds a node that the user created manually to the cycle. It calls the Flatten method of all of its flatteners with the Node's underlying html.Node. A new node can be accessed through NodeIterator and Cursor only after it has been added to the cycle with this method.
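Why registration is a separate step can be seen in a small toy model: a tree and a lookup index are two different structures, so attaching a child to the tree does not make it visible to lookups until it is also indexed. The types node and tagIndex and the methods appendChild and register below are assumptions made for this sketch, not the package's implementation.

```go
package main

import "fmt"

// node is a toy stand-in for an HTML element in the tree.
type node struct {
	tag      string
	children []*node
}

// tagIndex maps a tag name to the nodes that were registered under it.
type tagIndex map[string][]*node

// register adds a single node to the index, much like a flattener visiting it.
func (ix tagIndex) register(n *node) {
	ix[n.tag] = append(ix[n.tag], n)
}

// appendChild attaches a new child to the tree only; the index is left untouched.
func (n *node) appendChild(tag string) *node {
	child := &node{tag: tag}
	n.children = append(n.children, child)
	return child
}

func main() {
	root := &node{tag: "div"}
	ix := tagIndex{}
	ix.register(root)

	// The new span is part of the tree, but lookups cannot see it yet.
	span := root.appendChild("span")
	fmt.Println(len(root.children), len(ix["span"])) // 1 0

	// After registration the span is reachable through the index as well.
	ix.register(span)
	fmt.Println(len(ix["span"])) // 1
}
```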

func (*MultiCursor) SelectCursor ¶ added in v0.2.0

func (m *MultiCursor) SelectCursor(flattener Flattener) (*Cursor, error)

SelectCursor returns a new Cursor with the selected flattener from the MultiCursor initiated by the NodeManager. If the given flattener is not found in the MultiCursor, it returns ErrNoFlattener.

type Node ¶

type Node struct {
	// contains filtered or unexported fields
}

Node is a simple wrapper around *html.Node. It allows read/write operations on the *html.Node along with keeping the structure of the HTML tree.

func NewNode ¶

func NewNode(htmlNode *html.Node) *Node

NewNode creates a new Node with the given *html.Node.

func (*Node) AppendChild ¶ added in v0.3.0

func (n *Node) AppendChild(
	nodeType NodeType,
	tagNameOrContent string,
	attributes map[string]string,
) *Node

AppendChild appends a new child to the end of the Node's children list and returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the text content if nodeType is NodeTypeText. The new node appears in the output when you render the NodeManager, but it is not accessible through NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.

func (*Node) AppendSibling ¶ added in v0.3.0

func (n *Node) AppendSibling(
	nodeType NodeType,
	tagNameOrContent string,
	attributes map[string]string,
) *Node

AppendSibling appends a new sibling to the Node and returns the newly added Node. The new node becomes the node immediately after this node in the parent's children list. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the text content if nodeType is NodeTypeText. The new node appears in the output when you render the NodeManager, but it is not accessible through NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.

func (*Node) Attribute ¶

func (n *Node) Attribute(key string) (string, bool)

Attribute returns the value of the given attribute key. The second return value is a boolean that indicates whether the given key is found.

func (*Node) Attributes ¶

func (n *Node) Attributes() map[string]string

Attributes returns a map of the Node's attribute keys to their values.

func (*Node) HTMLNode ¶ added in v0.2.0

func (n *Node) HTMLNode() *html.Node

HTMLNode returns the underlying *html.Node of the Node. Any write operation on the *html.Node might corrupt the structure of the HTML tree.

func (*Node) IsRemoved ¶

func (n *Node) IsRemoved() bool

IsRemoved returns true if the Node is removed from the NodeIterator and html.Node tree.

func (*Node) PrependChild ¶ added in v0.3.0

func (n *Node) PrependChild(
	nodeType NodeType,
	tagNameOrContent string,
	attributes map[string]string,
) *Node

PrependChild prepends a new child to the beginning of the Node's children list and returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the text content if nodeType is NodeTypeText. The new node appears in the output when you render the NodeManager, but it is not accessible through NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.

func (*Node) PrependSibling ¶ added in v0.3.0

func (n *Node) PrependSibling(
	nodeType NodeType,
	tagNameOrContent string,
	attributes map[string]string,
) *Node

PrependSibling prepends a new sibling to the Node and returns the newly added Node. The new node becomes the node immediately before this node in the parent's children list. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the text content if nodeType is NodeTypeText. The new node appears in the output when you render the NodeManager, but it is not accessible through NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.

func (*Node) Remove ¶

func (n *Node) Remove() error

Remove removes the Node from the NodeIterator and the html.Node tree. A removed node no longer appears in the output of NodeManager.Render.

func (*Node) RemoveAttribute ¶

func (n *Node) RemoveAttribute(key string)

RemoveAttribute removes the given attribute key from the node. If the given key does not exist, it will be ignored.

func (*Node) SetAttribute ¶

func (n *Node) SetAttribute(key, value string)

SetAttribute sets the value of the given attribute key for the node. If the given key does not exist, it will be added to the node as a new attribute. Otherwise, the value of the given key will be updated.

func (*Node) TagName ¶

func (n *Node) TagName() string

TagName returns the tag name of the Node.

type NodeIterator ¶

type NodeIterator struct {
	// contains filtered or unexported fields
}

NodeIterator is a simple iterator that can iterate over a slice of *Node. It is used to iterate over the nodes that are flattened by a Flattener and perform different operations using the methods that are defined on the NodeIterator.

func NewNodeIterator ¶

func NewNodeIterator() *NodeIterator

NewNodeIterator creates a new NodeIterator.

func (*NodeIterator) Add ¶

func (n *NodeIterator) Add(node *Node) *NodeIterator

Add adds the given *Node to the NodeIterator. This does not change the html.Node tree. It is expected that NodeIterator and Node are managed by the flattener.

func (*NodeIterator) Each ¶

func (n *NodeIterator) Each(f func(node *Node))

Each iterates over the nodes in the NodeIterator and calls the given function.

func (*NodeIterator) Filter ¶

func (n *NodeIterator) Filter(option FilterOption) *NodeIterator

Filter filters the nodes in the NodeIterator using the given FilterOption. It returns a new NodeIterator that can iterate over the filtered nodes. For more complex filtering, you can use FilterOr or FilterAnd methods.

func (*NodeIterator) FilterAnd ¶

func (n *NodeIterator) FilterAnd(options ...FilterOption) *NodeIterator

FilterAnd filters the nodes in the NodeIterator using the given FilterOptions. All the given options will be combined using AND operator. It means that if all the given options return true for a node, it will be included in the filtered NodeIterator. If any of the given options returns false for a node, the node will be filtered out and the rest of the options will be ignored for that node.

func (*NodeIterator) FilterOr ¶

func (n *NodeIterator) FilterOr(options ...FilterOption) *NodeIterator

FilterOr filters the nodes in the NodeIterator using the given FilterOptions. All the given options will be combined using OR operator. It means that if any of the given options returns true for a node, it will be included in the filtered NodeIterator and the rest of options will be ignored for that node.
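The AND and OR combination rules described for FilterAnd and FilterOr, including the short-circuiting per node, can be illustrated with plain predicates. This is only a sketch under stated assumptions: item stands in for a node, predicate plays the role of FilterOption, and filterAnd, filterOr, withTag, withAttribute, and sampleItems are local helpers written for this example rather than the package's code.

```go
package main

import "fmt"

// item is a toy stand-in for the package's Node type.
type item struct {
	tag   string
	attrs map[string]string
}

// predicate plays the role of FilterOption: true means "keep this item".
type predicate func(it item) bool

func withTag(tag string) predicate {
	return func(it item) bool { return it.tag == tag }
}

func withAttribute(key string) predicate {
	return func(it item) bool {
		_, ok := it.attrs[key]
		return ok
	}
}

// filterAnd keeps an item only if every predicate accepts it.
// Evaluation for an item stops at the first predicate that rejects it.
func filterAnd(items []item, preds ...predicate) []item {
	var kept []item
	for _, it := range items {
		keep := true
		for _, p := range preds {
			if !p(it) {
				keep = false
				break
			}
		}
		if keep {
			kept = append(kept, it)
		}
	}
	return kept
}

// filterOr keeps an item if at least one predicate accepts it.
// Evaluation for an item stops at the first predicate that accepts it.
func filterOr(items []item, preds ...predicate) []item {
	var kept []item
	for _, it := range items {
		for _, p := range preds {
			if p(it) {
				kept = append(kept, it)
				break
			}
		}
	}
	return kept
}

// sampleItems returns two divs (one with a class), a p with a class, and a bare span.
func sampleItems() []item {
	return []item{
		{tag: "div", attrs: map[string]string{"class": "container"}},
		{tag: "div"},
		{tag: "p", attrs: map[string]string{"class": "lead"}},
		{tag: "span"},
	}
}

func main() {
	items := sampleItems()
	isDiv, hasClass := withTag("div"), withAttribute("class")

	fmt.Println(len(filterAnd(items, isDiv, hasClass))) // 1
	fmt.Println(len(filterOr(items, isDiv, hasClass)))  // 3
}
```

In this sketch, calling filterAnd with no predicates keeps every item and filterOr with no predicates keeps none; how the package treats an empty option list is not stated in its documentation.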

func (*NodeIterator) First ¶ added in v0.2.0

func (n *NodeIterator) First() *Node

First returns the first non-removed node in the NodeIterator. If there is no non-removed node, it returns nil.

func (*NodeIterator) Len ¶

func (n *NodeIterator) Len() int

Len returns the number of nodes in the NodeIterator.

func (*NodeIterator) Next ¶ added in v0.2.0

func (n *NodeIterator) Next() *Node

Next iterates over the nodes in the NodeIterator and returns the next non-removed node. It starts from the first element of the NodeIterator and proceeds to the next item on each call. If there is no non-removed node left, it returns nil; a nil result must be treated as the end of the iteration. Use Reset to start the iteration from the beginning.

func (*NodeIterator) Reset ¶ added in v0.2.0

func (n *NodeIterator) Reset()

Reset resets the cursor index to the beginning of the NodeIterator.
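The Next and Reset contract (skip removed nodes, return nil at the end, rewind on Reset) follows a common iterator pattern. The sketch below models it with local types; entry, iterator, next, reset, and collect are assumptions made for illustration and do not reflect how NodeIterator is implemented.

```go
package main

import "fmt"

// entry is a toy stand-in for a node that may have been removed.
type entry struct {
	name    string
	removed bool
}

// iterator walks a slice of entries and remembers its position between calls.
type iterator struct {
	entries []*entry
	pos     int
}

// next returns the next non-removed entry, or nil when the iteration is over.
func (it *iterator) next() *entry {
	for it.pos < len(it.entries) {
		e := it.entries[it.pos]
		it.pos++
		if !e.removed {
			return e
		}
	}
	return nil
}

// reset rewinds the iterator to the first entry.
func (it *iterator) reset() { it.pos = 0 }

// collect drains the iterator from its current position and returns the visited names.
func collect(it *iterator) []string {
	var names []string
	for e := it.next(); e != nil; e = it.next() {
		names = append(names, e.name)
	}
	return names
}

func main() {
	it := &iterator{entries: []*entry{{name: "a"}, {name: "b", removed: true}, {name: "c"}}}

	fmt.Println(collect(it)) // [a c]
	fmt.Println(collect(it)) // [] because the iterator is exhausted

	it.reset()
	fmt.Println(collect(it)) // [a c]
}
```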

type NodeManager ¶

type NodeManager struct {
	// contains filtered or unexported fields
}

NodeManager is the entry point for the top-level logic of this package. It is responsible for parsing HTML nodes, performing modifications or read-only operations on them, and then rendering the HTML tree. There are three ways to create a NodeManager:

1. NewNodeManagerFromReader: accepts an io.Reader and parses the HTML tree from it.
2. NewNodeManagerFromURL: accepts a URL and parses the HTML tree from the response body of the URL.
3. NewNodeManager: accepts a *html.Node and uses it as the root of the HTML tree.

Approaches 1 and 2 use html.Parse to parse the HTML tree.

func NewNodeManager ¶

func NewNodeManager(root *html.Node) *NodeManager

NewNodeManager creates a new NodeManager with the given *html.Node as the root of the HTML tree.

func NewNodeManagerFromReader ¶

func NewNodeManagerFromReader(r io.Reader) (*NodeManager, error)

NewNodeManagerFromReader creates a new NodeManager with the HTML tree parsed from the given io.Reader.

func NewNodeManagerFromURL ¶

func NewNodeManagerFromURL(ctx context.Context, url string) (*NodeManager, error)

NewNodeManagerFromURL creates a new NodeManager with the HTML tree parsed from the response body of the given URL.

func (*NodeManager) Parse ¶

func (n *NodeManager) Parse(flatteners ...Flattener) (*MultiCursor, error)

Parse parses the HTML tree that has already been converted to *html.Node. It accepts a set of Flatteners that decide how the HTML tree should be traversed and flattened. If any of the flatteners returns an error, the iteration stops and the error is returned.

func (*NodeManager) Render ¶

func (n *NodeManager) Render(w io.Writer) error

Render renders the HTML tree to the given writer.

type NodeType ¶ added in v0.3.0

type NodeType html.NodeType

const (
	NodeTypeElement NodeType = NodeType(html.ElementNode)
	NodeTypeText    NodeType = NodeType(html.TextNode)
)

type TagFlattener ¶

type TagFlattener struct {
	// contains filtered or unexported fields
}

TagFlattener is a Flattener that flattens the HTML tree by tag name. When NodeManager.Parse is called with this flattener, it groups nodes into one NodeIterator per tag name. You can therefore access all nodes with the same tag name (e.g., meta, a, p) using the GetNodesByKey method or the Cursor.SelectNodes method.

Example ¶

package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/seinshah/flattenhtml"
)

func main() {
	rawHTML := `<html><body><div><p class="p1">hello</p><p class="p2">world</p></div></body></html>`

	manager, err := flattenhtml.NewNodeManagerFromReader(strings.NewReader(rawHTML))
	if err != nil {
		panic(err)
	}

	mc, err := manager.Parse(flattenhtml.NewTagFlattener())
	if err != nil {
		panic(err)
	}

	cursor := mc.First()
	targetP := cursor.SelectNodes("p").Filter(flattenhtml.WithAttributeValueAs("class", "p2"))

	targetP.Each(func(node *flattenhtml.Node) {
		err = node.Remove()
		if err != nil {
			panic(err)
		}
	})

	output := bytes.Buffer{}

	err = manager.Render(&output)
	if err != nil {
		panic(err)
	}

	fmt.Println(output.String())

}

Output:

<html><head></head><body><div><p class="p1">hello</p></div></body></html>

func NewTagFlattener ¶

func NewTagFlattener() *TagFlattener

NewTagFlattener creates a new TagFlattener.

func (*TagFlattener) Flatten ¶

func (t *TagFlattener) Flatten(node *html.Node) error

Flatten is a callback function called for each node during the NodeManager.Parse. It will continue to categorize all nodes in their tag NodeIterator as NodeManager traverses the HTML tree. This method does not return an error.

func (*TagFlattener) GetNodesByKey ¶

func (t *TagFlattener) GetNodesByKey(key string) *NodeIterator

func (*TagFlattener) IsMyType ¶

func (t *TagFlattener) IsMyType(flattener Flattener) bool

func (*TagFlattener) Len ¶

func (t *TagFlattener) Len() int

For TagFlattener, Len returns the number of distinct tag names found in the HTML tree.
