flattenhtml

package module
v0.3.3 Latest
Published: Oct 19, 2024 License: MIT Imports: 6 Imported by: 0

README

flattenhtml

flattenhtml is a Go package that lets you access specific nodes in an HTML document directly, without having to traverse every node.

Installation

go get github.com/seinshah/flattenhtml

Overview

Use built-in or custom flatteners to access HTML document nodes directly using your desired selectors, whether you want all div nodes (selected by tag name), all elements that have a class attribute, or all elements whose class value is container.

flattenhtml currently supports the following flatteners out of the box:

  • TagFlattener: flattens all nodes based on their tag name.

You can build a custom in-house flattener by implementing the flattenhtml.Flattener interface. If your implementation is generic and can be used by others, please consider contributing it to this package.

Usage

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/seinshah/flattenhtml"
)

func main() {
    // HTML document to be flattened.
    html := `
        <html>
            <head>
                <title>flattenhtml</title>
            </head>
            <body>
                <div class="container" id="target">
                    <div class="row">
                        <div class="col-md-6">
                            <h1>flattenhtml</h1>
                            <p>flattens HTML documents</p>
                        </div>
                        <div class="col-md-6">
                            <h1>flattenhtml</h1>
                            <p>flattens HTML documents</p>
                        </div>
                    </div>
                </div>
            </body>
        </html>
    `

    nm, err := flattenhtml.NewNodeManagerFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }

    mc, err := nm.Parse(flattenhtml.NewTagFlattener())
    if err != nil {
        log.Fatal(err)
    }

    tf, err := mc.SelectCursor(&flattenhtml.TagFlattener{})
    if err != nil {
        log.Fatal(err)
    }

    divs := tf.SelectNodes("div")

    divs.
        Filter(flattenhtml.WithAttributeValueAs("class", "container")).
        Each(func(n *flattenhtml.Node) {
            val, _ := n.Attribute("id")

            fmt.Println(val)

            // Output:
            // target
        })
}

Documentation

Overview

Package flattenhtml provides a way to flatten the HTML tree structure and then use the flattened data to do different kinds of lookups.

Go's golang.org/x/net/html package bears the heavy load of parsing HTML. However, it produces a tree structure. Although that structure is generic and supports any kind of traversal, it is inconvenient for some use cases, such as searching for a specific element.

This is where flattenhtml comes in. It provides different mechanisms to flatten the HTML tree structure depending on the use case. For example, if you want to work with nodes by their tag name, you can use TagFlattener to flatten all the nodes by tag name once and then perform repeated tag lookups without traversing the tree each time.

TagFlattener is currently the only built-in flattener of this package. However, all flatteners implement the flattenhtml.Flattener interface, so you can easily implement your own.

When you initialize the NodeManager and call Parse as in the following statements, the parsed HTML tree is traversed once; any further lookup reads the flattened data without traversing the tree again. You can also use multiple flatteners at the same time. For example, you can use TagFlattener to flatten the nodes by their tag name together with a custom flattener of your own (say, an AttributeFlattener) that flattens the nodes by their attributes. As before, the HTML tree is traversed only once to feed all flatteners.

html := "<html><head></head><body><div><p></p></div></body></html>"
flatteners := []flattenhtml.Flattener{flattenhtml.NewTagFlattener()} // plus any other flatteners
nm, err := flattenhtml.NewNodeManagerFromReader(strings.NewReader(html))
if err != nil {
	log.Fatal(err)
}
mc, err := nm.Parse(flatteners...)
// check err here as well before using mc
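
The single-traversal idea can be illustrated without the library. The sketch below is not flattenhtml's implementation: node, flattener, tagIndex, attrIndex, walk, and sampleTree are hypothetical stand-ins built only on the standard library. It shows one depth-first pass feeding every flattener, after which lookups are plain map reads.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for an HTML element.
type node struct {
	tag      string
	attrs    map[string]string
	children []*node
}

// flattener is a hypothetical stand-in for the idea behind a flattener:
// it is called once for each node while the tree is walked.
type flattener interface {
	flatten(n *node)
}

// tagIndex groups nodes by tag name.
type tagIndex map[string][]*node

func (t tagIndex) flatten(n *node) { t[n.tag] = append(t[n.tag], n) }

// attrIndex groups nodes by the attribute names they carry.
type attrIndex map[string][]*node

func (a attrIndex) flatten(n *node) {
	for key := range n.attrs {
		a[key] = append(a[key], n)
	}
}

// walk visits every node exactly once and hands it to every flattener.
// It returns the number of nodes visited.
func walk(root *node, fs ...flattener) int {
	visited := 1
	for _, f := range fs {
		f.flatten(root)
	}
	for _, c := range root.children {
		visited += walk(c, fs...)
	}
	return visited
}

// sampleTree builds <div id="a"><p class="x"></p><p></p></div>.
func sampleTree() *node {
	return &node{tag: "div", attrs: map[string]string{"id": "a"}, children: []*node{
		{tag: "p", attrs: map[string]string{"class": "x"}},
		{tag: "p"},
	}}
}

func main() {
	tags, attrs := tagIndex{}, attrIndex{}
	visited := walk(sampleTree(), tags, attrs)

	// After one pass, lookups no longer touch the tree.
	fmt.Println(visited, len(tags["p"]), len(attrs["class"])) // prints: 3 2 1
}
```

Adding a third index here costs one more call per node, not another walk over the tree, which is the trade-off the paragraph above describes.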

Once the flattening process is done, you will have a *flattenhtml.MultiCursor, which holds a pointer to each of the flatteners. Before proceeding, you need to select a single flattener of your choice to continue the lookup process.

tagFlattenerCursor := mc.First()

Now, you can get nodes of the same tag name using the following statement:

nodes := tagFlattenerCursor.SelectNodes("div")

This returns a *flattenhtml.NodeIterator that iterates over the nodes selected by the given key; in this case, all the nodes with the "div" tag name.

Note that the underlying HTML parsing engine is the golang.org/x/net/html package, so everything about how that package standardizes the HTML tree applies to this package as well.

Constants

This section is empty.

Variables

var ErrNoFlattener = errors.New("at least one flattener should be provided")

ErrNoFlattener is returned when no flattener is provided to the Parse method, or no flattener is found in the MultiCursor.

var ErrParentlessNode = errors.New("node with no parent cannot be removed")

Functions

This section is empty.

Types

type Cursor

type Cursor struct {
	// contains filtered or unexported fields
}

Cursor is a helper struct that holds the flattener selected from the MultiCursor. It allows the caller to perform different operations on the flattened document using the flattener selected by the *MultiCursor.SelectCursor method.

func (*Cursor) Len

func (c *Cursor) Len() int

Len returns the final number of categories or keys that were created by the flattener.

func (*Cursor) RegisterNewNode added in v0.3.1

func (c *Cursor) RegisterNewNode(node *Node) error

RegisterNewNode adds a node that the user created manually to the cycle. It calls the Flatten method of the cursor's flattener with the Node's underlying html.Node. A new node is accessible through the NodeIterator and Cursor only if it has been added to the cycle using this method.

func (*Cursor) SelectNodes

func (c *Cursor) SelectNodes(key string) *NodeIterator

SelectNodes returns a new NodeIterator that iterates over the nodes selected by the given key and can perform different operations on them. If the given key is not found in the flattened document, the NodeIterator has zero length.

type FilterOption

type FilterOption func(node *Node) bool

FilterOption is a function that accepts a *Node and returns a boolean. The boolean value is true if the given *Node should be included in the NodeIterator and false otherwise.

func WithAttribute

func WithAttribute(key string) FilterOption

WithAttribute returns a FilterOption that filters nodes by the given key. The Node will be included in the final output if it has an attribute with the given key.

func WithAttributeValueAs

func WithAttributeValueAs(key, value string) FilterOption

WithAttributeValueAs returns a FilterOption that filters nodes by the given key and value. The Node will be included in the final output if it has an attribute with the given key and the value of that attribute is equal to the given value.

func WithTag

func WithTag(tag string) FilterOption

WithTag returns a FilterOption that filters nodes by their tag name. If the node's tag name is the same as the given tag, it will be included in the final output.

type Flattener

type Flattener interface {
	// Flatten is a callback function called for each node
	// in the HTML tree. It accepts a *html.Node as the argument and returns
	// an error if any. If the error is not nil, the iteration stops and the
	// error is returned.
	Flatten(node *html.Node) error

	// GetNodesByKey returns a NodeIterator that can iterate over the nodes
	// that are flattened using the flattener and filtered by the given key.
	// If the given key is not found in the flattened document, it returns
	// nil.
	GetNodesByKey(key string) *NodeIterator

	// IsMyType allows each flattener implementation to decide whether the given
	// Flattener is of the same type as itself or not.
	IsMyType(flattener Flattener) bool

	// Len returns the final number of categories or keys that were created by the flattener.
	Len() int
}

Flattener is an interface for the logic that decides how the HTML tree should be traversed and flattened.

type MultiCursor

type MultiCursor struct {
	// contains filtered or unexported fields
}

MultiCursor is a helper struct that holds all the configured flatteners. It is usually created by the NodeManager from the configured flatteners, and can later be narrowed down to a single flattener using the *MultiCursor.SelectCursor method.

func NewMultiCursor

func NewMultiCursor(flatteners ...Flattener) *MultiCursor

NewMultiCursor returns a new MultiCursor; it is normally created by the NodeManager. It holds all the configured flatteners, each of which flattens the HTML tree separately. To operate on the flattened document, first select the cursor of your desired flattener using the methods defined on MultiCursor.

func (*MultiCursor) First added in v0.2.0

func (m *MultiCursor) First() *Cursor

First returns the first Cursor from the MultiCursor initiated by the NodeManager. This Cursor will hold the reference to the first flattener you configured for the NodeManager.Parse method. If MultiCursor has no cursor, the result will be nil.

func (*MultiCursor) RegisterNewNode added in v0.3.0

func (m *MultiCursor) RegisterNewNode(node *Node) error

RegisterNewNode adds a node that the user created manually to the cycle. It calls the Flatten method of all its flatteners with the Node's underlying html.Node. A new node is accessible through the NodeIterator and Cursor only if it has been added to the cycle using this method.
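
Why registration is a separate step can be seen in a small model. The sketch below is an assumption-laden illustration, not the package's code: node, index, build, register, and appendChild are hypothetical names. It shows that changing the tree after flattening leaves the flattened view stale until the new node is registered.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an element in the HTML tree.
type node struct {
	tag      string
	children []*node
}

// index is a hypothetical flattened view: tag name to nodes.
type index map[string][]*node

// register records a single node in the flattened view.
func (ix index) register(n *node) { ix[n.tag] = append(ix[n.tag], n) }

// build walks the tree once and registers every node.
func (ix index) build(n *node) {
	ix.register(n)
	for _, c := range n.children {
		ix.build(c)
	}
}

// appendChild changes the tree only; it knows nothing about any index.
func (n *node) appendChild(tag string) *node {
	child := &node{tag: tag}
	n.children = append(n.children, child)
	return child
}

func main() {
	root := &node{tag: "div", children: []*node{{tag: "p"}}}
	ix := index{}
	ix.build(root)

	// The tree now has two <p> children, but the index still knows one.
	added := root.appendChild("p")
	fmt.Println(len(root.children), len(ix["p"])) // prints: 2 1

	// Registering the node brings the index back in sync.
	ix.register(added)
	fmt.Println(len(ix["p"])) // prints: 2
}
```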

func (*MultiCursor) SelectCursor added in v0.2.0

func (m *MultiCursor) SelectCursor(flattener Flattener) (*Cursor, error)

SelectCursor returns a new Cursor with the selected flattener from the MultiCursor initiated by the NodeManager. If the given flattener is not found in the MultiCursor, it returns ErrNoFlattener.
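
One way such a type-based selection can work is sketched below. This is a guess at the general technique, not the package's actual code: flattener, tagFlattener, classFlattener, selectFlattener, and errNoFlattener are hypothetical stand-ins. Each flattener decides through a type assertion whether the requested flattener is of its own type, and the lookup returns an error when none matches.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoFlattener is a hypothetical stand-in for the package's ErrNoFlattener.
var errNoFlattener = errors.New("at least one flattener should be provided")

// flattener keeps only the type-matching part of the real interface.
type flattener interface {
	isMyType(other flattener) bool
}

type tagFlattener struct{}

func (*tagFlattener) isMyType(other flattener) bool {
	_, ok := other.(*tagFlattener)
	return ok
}

type classFlattener struct{}

func (*classFlattener) isMyType(other flattener) bool {
	_, ok := other.(*classFlattener)
	return ok
}

// selectFlattener returns the first configured flattener that recognizes
// want as being of its own type, or errNoFlattener if there is none.
func selectFlattener(configured []flattener, want flattener) (flattener, error) {
	for _, f := range configured {
		if f.isMyType(want) {
			return f, nil
		}
	}
	return nil, errNoFlattener
}

func main() {
	configured := []flattener{&tagFlattener{}}

	_, err := selectFlattener(configured, &tagFlattener{})
	fmt.Println(err == nil) // prints: true

	_, err = selectFlattener(configured, &classFlattener{})
	fmt.Println(errors.Is(err, errNoFlattener)) // prints: true
}
```

Matching on type rather than on identity is why the value passed as want only needs to be of the right type; it does not have to be the instance that was given to Parse.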

type Node

type Node struct {
	// contains filtered or unexported fields
}

Node is a simple wrapper around *html.Node. It allows read/write operations on the *html.Node along with keeping the structure of the HTML tree.

func NewNode

func NewNode(htmlNode *html.Node) *Node

NewNode creates a new Node with the given *html.Node.

func (*Node) AppendChild added in v0.3.0

func (n *Node) AppendChild(
	nodeType NodeType,
	tagNameOrContent string,
	attributes map[string]string,
) *Node

AppendChild appends a new child to the Node, at the end of its children list, and returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the text content if nodeType is NodeTypeText. A node added this way appears in the output when you render the NodeManager, but it is not accessible through NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.

func (*Node) AppendSibling added in v0.3.0

func (n *Node) AppendSibling(
	nodeType NodeType,
	tagNameOrContent string,
	attributes map[string]string,
) *Node

AppendSibling appends a new sibling to the Node, placing it immediately after this node in the parent's children list, and returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the text content if nodeType is NodeTypeText. A node added this way appears in the output when you render the NodeManager, but it is not accessible through NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.

func (*Node) Attribute

func (n *Node) Attribute(key string) (string, bool)

Attribute returns the value of the given attribute key. The second return value is a boolean that indicates whether the given key is found.

func (*Node) Attributes

func (n *Node) Attributes() map[string]string

Attributes returns a map of strings containing attributes key and values of the Node.

func (*Node) HTMLNode added in v0.2.0

func (n *Node) HTMLNode() *html.Node

HTMLNode returns the underlying *html.Node of the Node. Any write operation on the *html.Node might corrupt the structure of the HTML tree.

func (*Node) IsRemoved

func (n *Node) IsRemoved() bool

IsRemoved returns true if the Node is removed from the NodeIterator and html.Node tree.

func (*Node) PrependChild added in v0.3.0

func (n *Node) PrependChild(
	nodeType NodeType,
	tagNameOrContent string,
	attributes map[string]string,
) *Node

PrependChild prepends a new child to the Node, at the beginning of its children list, and returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the text content if nodeType is NodeTypeText. A node added this way appears in the output when you render the NodeManager, but it is not accessible through NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.

func (*Node) PrependSibling added in v0.3.0

func (n *Node) PrependSibling(
	nodeType NodeType,
	tagNameOrContent string,
	attributes map[string]string,
) *Node

PrependSibling prepends a new sibling to the Node, placing it immediately before this node in the parent's children list, and returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the text content if nodeType is NodeTypeText. A node added this way appears in the output when you render the NodeManager, but it is not accessible through NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.

func (*Node) Remove

func (n *Node) Remove() error

Remove removes the Node from the NodeIterator and the html.Node tree. The node will no longer appear in the output of NodeManager.Render.

func (*Node) RemoveAttribute

func (n *Node) RemoveAttribute(key string)

RemoveAttribute removes the given attribute key from the node. If the given key does not exist, it will be ignored.

func (*Node) SetAttribute

func (n *Node) SetAttribute(key, value string)

SetAttribute sets the value of the given attribute key for the node. If the given key does not exist, it will be added to the node as a new attribute. Otherwise, the value of the given key will be updated.
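
The update-or-add behavior described for SetAttribute, and the ignore-if-missing behavior of RemoveAttribute, can be sketched over a plain slice of key/value pairs. The attribute type and both helper functions below are hypothetical and say nothing about how the package stores attributes internally.

```go
package main

import "fmt"

// attribute is a hypothetical key/value pair.
type attribute struct{ key, val string }

// setAttribute updates the value in place when the key exists,
// and appends a new attribute otherwise.
func setAttribute(attrs []attribute, key, val string) []attribute {
	for i := range attrs {
		if attrs[i].key == key {
			attrs[i].val = val
			return attrs
		}
	}
	return append(attrs, attribute{key, val})
}

// removeAttribute drops the attribute with the given key.
// A missing key leaves the slice unchanged.
func removeAttribute(attrs []attribute, key string) []attribute {
	for i := range attrs {
		if attrs[i].key == key {
			return append(attrs[:i], attrs[i+1:]...)
		}
	}
	return attrs
}

func main() {
	attrs := []attribute{{"class", "row"}}
	attrs = setAttribute(attrs, "class", "container") // existing key: updated
	attrs = setAttribute(attrs, "id", "target")       // new key: added
	attrs = removeAttribute(attrs, "missing")         // unknown key: ignored
	attrs = removeAttribute(attrs, "class")
	fmt.Println(attrs) // prints: [{id target}]
}
```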

func (*Node) TagName

func (n *Node) TagName() string

TagName returns the tag name of the Node.

type NodeIterator

type NodeIterator struct {
	// contains filtered or unexported fields
}

NodeIterator is a simple iterator that can iterate over a slice of *Node. It is used to iterate over the nodes that are flattened by a Flattener and perform different operations using the methods that are defined on the NodeIterator.

func NewNodeIterator

func NewNodeIterator() *NodeIterator

NewNodeIterator creates a new NodeIterator.

func (*NodeIterator) Add

func (n *NodeIterator) Add(node *Node) *NodeIterator

Add adds the given *Node to the NodeIterator. This does not change the html.Node tree. It is expected that NodeIterator and Node are managed by the flattener.

func (*NodeIterator) Each

func (n *NodeIterator) Each(f func(node *Node))

Each iterates over the nodes in the NodeIterator and calls the given function.

func (*NodeIterator) Filter

func (n *NodeIterator) Filter(option FilterOption) *NodeIterator

Filter filters the nodes in the NodeIterator using the given FilterOption. It returns a new NodeIterator that can iterate over the filtered nodes. For more complex filtering, you can use FilterOr or FilterAnd methods.

func (*NodeIterator) FilterAnd

func (n *NodeIterator) FilterAnd(options ...FilterOption) *NodeIterator

FilterAnd filters the nodes in the NodeIterator using the given FilterOptions. All the given options will be combined using AND operator. It means that if all the given options return true for a node, it will be included in the filtered NodeIterator. If any of the given options returns false for a node, the node will be filtered out and the rest of the options will be ignored for that node.

func (*NodeIterator) FilterOr

func (n *NodeIterator) FilterOr(options ...FilterOption) *NodeIterator

FilterOr filters the nodes in the NodeIterator using the given FilterOptions. All the given options will be combined using OR operator. It means that if any of the given options returns true for a node, it will be included in the filtered NodeIterator and the rest of options will be ignored for that node.
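
The AND and OR semantics described for FilterAnd and FilterOr can be modeled with plain predicate functions. The following is a minimal sketch under stated assumptions: elem, filterOption, withTag, withAttributeValueAs, filterAnd, filterOr, and sampleElems are hypothetical stand-ins that only mirror the shapes in this documentation. How the real methods treat an empty option list is not stated here, so the behavior of this sketch in that case is an arbitrary choice.

```go
package main

import "fmt"

// elem is a hypothetical stand-in for a node with a tag and attributes.
type elem struct {
	tag   string
	attrs map[string]string
}

// filterOption mirrors the shape of FilterOption: a predicate over one node.
type filterOption func(e elem) bool

func withTag(tag string) filterOption {
	return func(e elem) bool { return e.tag == tag }
}

func withAttributeValueAs(key, value string) filterOption {
	return func(e elem) bool {
		v, ok := e.attrs[key]
		return ok && v == value
	}
}

// filterAnd keeps an element only if every option accepts it.
// It stops evaluating options at the first one that rejects the element.
func filterAnd(in []elem, opts ...filterOption) []elem {
	var out []elem
next:
	for _, e := range in {
		for _, opt := range opts {
			if !opt(e) {
				continue next
			}
		}
		out = append(out, e)
	}
	return out
}

// filterOr keeps an element as soon as any option accepts it.
func filterOr(in []elem, opts ...filterOption) []elem {
	var out []elem
	for _, e := range in {
		for _, opt := range opts {
			if opt(e) {
				out = append(out, e)
				break
			}
		}
	}
	return out
}

func sampleElems() []elem {
	return []elem{
		{tag: "div", attrs: map[string]string{"class": "container"}},
		{tag: "div", attrs: map[string]string{"class": "row"}},
		{tag: "p", attrs: map[string]string{"class": "container"}},
		{tag: "span"},
	}
}

func main() {
	isDiv := withTag("div")
	isContainer := withAttributeValueAs("class", "container")

	and := filterAnd(sampleElems(), isDiv, isContainer)
	or := filterOr(sampleElems(), isDiv, isContainer)
	fmt.Println(len(and), len(or)) // prints: 1 3
}
```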

func (*NodeIterator) First added in v0.2.0

func (n *NodeIterator) First() *Node

First returns the first non-removed node in the NodeIterator. If there is no non-removed node, it returns nil.

func (*NodeIterator) Len

func (n *NodeIterator) Len() int

Len returns the number of nodes in the NodeIterator.

func (*NodeIterator) Next added in v0.2.0

func (n *NodeIterator) Next() *Node

Next iterates over the nodes in the NodeIterator and returns the next non-removed node. It starts from the first element of the NodeIterator and proceeds to the next item on each call. If no non-removed node is left, it returns nil, which must be treated as the end of the iteration. Use Reset to start the iteration from the beginning.

func (*NodeIterator) Reset added in v0.2.0

func (n *NodeIterator) Reset()

Reset resets the cursor index to the beginning of the NodeIterator.
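
The Next and Reset contract can be illustrated with a small stand-in iterator. The sketch below assumes nothing about the package's internals: item, iterator, next, reset, and collect are hypothetical names. It shows removed entries being skipped, nil marking the end of the iteration, and a reset allowing a second pass.

```go
package main

import "fmt"

// item is a hypothetical stand-in for a node that may have been removed.
type item struct {
	name    string
	removed bool
}

// iterator walks a slice of items with an internal position.
type iterator struct {
	items []*item
	pos   int
}

// next returns the next non-removed item, or nil once the end is reached.
func (it *iterator) next() *item {
	for it.pos < len(it.items) {
		cur := it.items[it.pos]
		it.pos++
		if !cur.removed {
			return cur
		}
	}
	return nil
}

// reset moves the position back to the beginning.
func (it *iterator) reset() { it.pos = 0 }

// collect drains the iterator and returns the names it yielded.
func collect(it *iterator) []string {
	var names []string
	for n := it.next(); n != nil; n = it.next() {
		names = append(names, n.name)
	}
	return names
}

func main() {
	it := &iterator{items: []*item{{name: "a"}, {name: "b", removed: true}, {name: "c"}}}

	fmt.Println(collect(it)) // prints: [a c]
	fmt.Println(collect(it)) // prints: [] because the iterator is exhausted

	it.reset()
	fmt.Println(collect(it)) // prints: [a c]
}
```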

type NodeManager

type NodeManager struct {
	// contains filtered or unexported fields
}

NodeManager is the entry point for the top-level logic of this package. It is responsible for parsing HTML nodes, performing modifications or read-only operations on them, and then rendering the HTML tree. There are different ways to create a NodeManager:

  1. NewNodeManagerFromReader: It accepts an io.Reader and parses the HTML tree from it.
  2. NewNodeManagerFromURL: It accepts a URL and parses the HTML tree from the response body of the URL.
  3. NewNodeManager: It accepts a *html.Node and uses it as the root of the HTML tree.

Approaches 1 and 2 rely on html.Parse to parse the HTML tree; approach 3 takes a tree that has already been parsed.

func NewNodeManager

func NewNodeManager(root *html.Node) *NodeManager

NewNodeManager creates a new NodeManager with the given *html.Node as the root of the HTML tree.

func NewNodeManagerFromReader

func NewNodeManagerFromReader(r io.Reader) (*NodeManager, error)

NewNodeManagerFromReader creates a new NodeManager with the HTML tree parsed from the given io.Reader.

func NewNodeManagerFromURL

func NewNodeManagerFromURL(ctx context.Context, url string) (*NodeManager, error)

NewNodeManagerFromURL creates a new NodeManager with the HTML tree parsed from the response body of the given URL.

func (*NodeManager) Parse

func (n *NodeManager) Parse(flatteners ...Flattener) (*MultiCursor, error)

Parse processes the HTML tree that has already been converted to *html.Node. It accepts a set of Flatteners that decide how the HTML tree should be traversed and flattened. If any of the flatteners returns an error, the iteration stops and the error is returned.

func (*NodeManager) Render

func (n *NodeManager) Render(w io.Writer) error

Render renders the HTML tree to the given writer.

type NodeType added in v0.3.0

type NodeType html.NodeType
const (
	NodeTypeElement NodeType = NodeType(html.ElementNode)
	NodeTypeText    NodeType = NodeType(html.TextNode)
)

type TagFlattener

type TagFlattener struct {
	// contains filtered or unexported fields
}

TagFlattener is a Flattener that flattens the HTML tree by tag name. When it is passed to NodeManager.Parse, it groups nodes into one NodeIterator per tag name. Therefore, you can access all nodes with the same tag name (e.g., meta, a, p) using the GetNodesByKey method or the Cursor.SelectNodes method.

Example
package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/seinshah/flattenhtml"
)

func main() {
	rawHTML := `<html><body><div><p class="p1">hello</p><p class="p2">world</p></div></body></html>`

	manager, err := flattenhtml.NewNodeManagerFromReader(strings.NewReader(rawHTML))
	if err != nil {
		panic(err)
	}

	mc, err := manager.Parse(flattenhtml.NewTagFlattener())
	if err != nil {
		panic(err)
	}

	cursor := mc.First()
	targetP := cursor.SelectNodes("p").Filter(flattenhtml.WithAttributeValueAs("class", "p2"))

	targetP.Each(func(node *flattenhtml.Node) {
		err = node.Remove()
		if err != nil {
			panic(err)
		}
	})

	output := bytes.Buffer{}

	err = manager.Render(&output)
	if err != nil {
		panic(err)
	}

	fmt.Println(output.String())
}
Output:

<html><head></head><body><div><p class="p1">hello</p></div></body></html>

func NewTagFlattener

func NewTagFlattener() *TagFlattener

NewTagFlattener creates a new TagFlattener.

func (*TagFlattener) Flatten

func (t *TagFlattener) Flatten(node *html.Node) error

Flatten is a callback function called for each node during NodeManager.Parse. As the NodeManager traverses the HTML tree, it adds each node to the NodeIterator for that node's tag name. This method does not return an error.

func (*TagFlattener) GetNodesByKey

func (t *TagFlattener) GetNodesByKey(key string) *NodeIterator

func (*TagFlattener) IsMyType

func (t *TagFlattener) IsMyType(flattener Flattener) bool

func (*TagFlattener) Len

func (t *TagFlattener) Len() int

Len for TagFlattener returns the number of distinct tag names found in the HTML tree.
