Documentation ¶
Overview ¶
Package flattenhtml provides a way to flatten the HTML tree structure and then use the flattened data to do different kinds of lookups.
Go provides the html package (golang.org/x/net/html), which bears the heavy load of parsing HTML. However, this package produces a tree structure. Although a tree is generic and can be used for any kind of traversal, it is not very convenient for some use cases, such as searching for a specific element.
This is where flattenhtml comes in. It provides different mechanisms to flatten the HTML tree structure depending on the use case. For example, if you want to work with nodes based on their tag name, you can use the TagFlattener to first group all nodes by tag name and then perform repeated tag lookups without traversing the tree each time.
TagFlattener is currently the only built-in flattener in this package. However, all flatteners implement the flattenhtml.Flattener interface, so you can implement your own flattener.
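To make the idea of flattening by tag name concrete, here is a minimal sketch that uses only the standard library. The `node` type, `flattenByTag`, and `sampleTree` are hypothetical stand-ins and are not part of flattenhtml, which works on *html.Node values instead.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for an HTML element.
type node struct {
	tag      string
	children []*node
}

// flattenByTag walks the tree once and groups every node under its tag name.
func flattenByTag(root *node) map[string][]*node {
	index := make(map[string][]*node)

	var walk func(n *node)
	walk = func(n *node) {
		if n == nil {
			return
		}
		index[n.tag] = append(index[n.tag], n)
		for _, child := range n.children {
			walk(child)
		}
	}
	walk(root)

	return index
}

// sampleTree builds html > body > (div > p, p), div.
func sampleTree() *node {
	firstDiv := &node{tag: "div", children: []*node{{tag: "p"}, {tag: "p"}}}
	body := &node{tag: "body", children: []*node{firstDiv, {tag: "div"}}}
	return &node{tag: "html", children: []*node{body}}
}

func main() {
	index := flattenByTag(sampleTree())
	// After the single walk, lookups are plain map reads.
	fmt.Println(len(index["div"]), len(index["p"]), len(index["span"]))
}
```

The trade-off is the usual one for an index: one up-front traversal and some extra memory in exchange for cheap repeated lookups.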
When you use the following statements to initialize the NodeManager and call Parse, the parsed HTML tree is traversed once; any further lookups read the flattened data without traversing the tree again. You can also use multiple flatteners at the same time. For example, you can use TagFlattener to flatten the nodes by tag name together with a custom flattener (say, a hypothetical AttributeFlattener) that flattens the nodes by their attributes. As before, the HTML tree is traversed only once to feed all flatteners.
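html := "<html><head></head><body><div><p></p></div></body></html>"
flatteners := []flattenhtml.Flattener{flattenhtml.NewTagFlattener() /* , ... */}

nm, err := flattenhtml.NewNodeManagerFromReader(strings.NewReader(html))
if err != nil {
	// handle the parsing error
}

mc, err := nm.Parse(flatteners...)
if err != nil {
	// handle the flattening error
}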
Once the flattening process is done, you will have a *flattenhtml.MultiCursor, which holds a reference to all the flatteners. Before proceeding, you need to select a single flattener to continue the lookup process.
tagFlattenerCursor := mc.First()
Now, you can get nodes of the same tag name using the following statement:
nodes := tagFlattenerCursor.SelectNodes("div")
This returns a *flattenhtml.NodeIterator that can be used to iterate over the nodes selected by the given key; in this case, all the nodes whose tag name is "div".
Note that the underlying engine for parsing the HTML is the golang.org/x/net/html package, so everything about how that package normalizes the HTML tree applies to this package as well.
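The single-traversal behavior described in this overview can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `node`, `flattener`, `tagIndex`, `attrIndex`, and `parse` are hypothetical names, and the local `flattener` interface mirrors only the Flatten callback.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML element.
type node struct {
	tag      string
	attrs    map[string]string
	children []*node
}

// flattener mirrors only the Flatten callback, using the local node type.
type flattener interface {
	flatten(n *node) error
}

// tagIndex groups nodes by tag name.
type tagIndex map[string][]*node

func (t tagIndex) flatten(n *node) error {
	t[n.tag] = append(t[n.tag], n)
	return nil
}

// attrIndex groups nodes by attribute key.
type attrIndex map[string][]*node

func (a attrIndex) flatten(n *node) error {
	for key := range n.attrs {
		a[key] = append(a[key], n)
	}
	return nil
}

// parse visits every node exactly once and hands it to each flattener.
// It returns the number of visited nodes and stops at the first error.
func parse(root *node, flatteners ...flattener) (int, error) {
	visited := 0
	if root == nil {
		return visited, nil
	}

	var walk func(n *node) error
	walk = func(n *node) error {
		visited++
		for _, f := range flatteners {
			if err := f.flatten(n); err != nil {
				return err
			}
		}
		for _, child := range n.children {
			if err := walk(child); err != nil {
				return err
			}
		}
		return nil
	}

	err := walk(root)
	return visited, err
}

// sampleTree builds body > div(id, class) > p(class).
func sampleTree() *node {
	p := &node{tag: "p", attrs: map[string]string{"class": "intro"}}
	div := &node{
		tag:      "div",
		attrs:    map[string]string{"id": "main", "class": "box"},
		children: []*node{p},
	}
	return &node{tag: "body", children: []*node{div}}
}

func main() {
	tags, attrs := tagIndex{}, attrIndex{}
	visited, err := parse(sampleTree(), tags, attrs)
	if err != nil {
		panic(err)
	}
	// Three nodes were visited once, yet both indexes are populated.
	fmt.Println(visited, len(tags["div"]), len(attrs["class"]))
}
```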
Index ¶
- Variables
- type Cursor
- type FilterOption
- type Flattener
- type MultiCursor
- type Node
- func (n *Node) AppendChild(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
- func (n *Node) AppendSibling(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
- func (n *Node) Attribute(key string) (string, bool)
- func (n *Node) Attributes() map[string]string
- func (n *Node) HTMLNode() *html.Node
- func (n *Node) IsRemoved() bool
- func (n *Node) PrependChild(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
- func (n *Node) PrependSibling(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
- func (n *Node) Remove() error
- func (n *Node) RemoveAttribute(key string)
- func (n *Node) SetAttribute(key, value string)
- func (n *Node) TagName() string
- type NodeIterator
- func (n *NodeIterator) Add(node *Node) *NodeIterator
- func (n *NodeIterator) Each(f func(node *Node))
- func (n *NodeIterator) Filter(option FilterOption) *NodeIterator
- func (n *NodeIterator) FilterAnd(options ...FilterOption) *NodeIterator
- func (n *NodeIterator) FilterOr(options ...FilterOption) *NodeIterator
- func (n *NodeIterator) First() *Node
- func (n *NodeIterator) Len() int
- func (n *NodeIterator) Next() *Node
- func (n *NodeIterator) Reset()
- type NodeManager
- type NodeType
- type TagFlattener
Constants ¶
This section is empty.
Variables ¶
var ErrNoFlattener = errors.New("at least one flattener should be provided")
ErrNoFlattener is returned when no flattener is provided to the Parse method, or no flattener is found in the MultiCursor.
var ErrParentlessNode = errors.New("node with no parent cannot be removed")
Functions ¶
This section is empty.
Types ¶
type Cursor ¶
type Cursor struct {
// contains filtered or unexported fields
}
Cursor is a helper struct that holds the flattener selected from the MultiCursor. It allows the caller to perform different operations on the flattened document using the flattener selected by the *MultiCursor.SelectCursor method.
func (*Cursor) Len ¶
Len returns the final number of categories or keys that were created by the flattener.
func (*Cursor) RegisterNewNode ¶ added in v0.3.1
RegisterNewNode adds a node that was newly and manually created by the user to the cycle. It calls the flatten method of the cursor's flattener with the Node's underlying html.Node. A new node can only be accessed through the NodeIterator and Cursor if it is added to the cycle using this method.
func (*Cursor) SelectNodes ¶
func (c *Cursor) SelectNodes(key string) *NodeIterator
SelectNodes returns a new NodeIterator that can iterate over the nodes selected by the given key and perform different operations on them. If the given key is not found in the flattened document, the NodeIterator will have zero length.
type FilterOption ¶
FilterOption is a function that accepts a *Node and returns a boolean. The boolean value is true if the given *Node should be included in the NodeIterator and false otherwise.
func WithAttribute ¶
func WithAttribute(key string) FilterOption
WithAttribute returns a FilterOption that filters nodes by the given key. The Node will be included in the final output if it has an attribute with the given key.
func WithAttributeValueAs ¶
func WithAttributeValueAs(key, value string) FilterOption
WithAttributeValueAs returns a FilterOption that filters nodes by the given key and value. The Node will be included in the final output if it has an attribute with the given key and the value of that attribute is equal to the given value.
func WithTag ¶
func WithTag(tag string) FilterOption
WithTag returns a FilterOption that filters nodes by their tag name. If the node's tag name is the same as the given tag, it will be included in the final output.
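The three constructors above all follow the same pattern: each returns a predicate closed over its arguments. A minimal sketch of that pattern, assuming a hypothetical `node` type and lower-case helper names that are not part of flattenhtml:

```go
package main

import "fmt"

// node is a hypothetical stand-in for the package's Node wrapper.
type node struct {
	tag   string
	attrs map[string]string
}

// filterOption reports whether a node should be kept.
type filterOption func(n *node) bool

// withAttribute keeps nodes that carry the given attribute key.
func withAttribute(key string) filterOption {
	return func(n *node) bool {
		_, ok := n.attrs[key]
		return ok
	}
}

// withAttributeValueAs keeps nodes whose attribute key has exactly this value.
func withAttributeValueAs(key, value string) filterOption {
	return func(n *node) bool {
		got, ok := n.attrs[key]
		return ok && got == value
	}
}

// withTag keeps nodes with the given tag name.
func withTag(tag string) filterOption {
	return func(n *node) bool { return n.tag == tag }
}

// filter returns the nodes for which the option returns true.
func filter(nodes []*node, option filterOption) []*node {
	var kept []*node
	for _, n := range nodes {
		if option(n) {
			kept = append(kept, n)
		}
	}
	return kept
}

func sampleNodes() []*node {
	return []*node{
		{tag: "a", attrs: map[string]string{"href": "/home"}},
		{tag: "a", attrs: map[string]string{"href": "/docs", "rel": "nofollow"}},
		{tag: "p"},
	}
}

func main() {
	nodes := sampleNodes()
	fmt.Println(len(filter(nodes, withAttribute("rel"))))
	fmt.Println(len(filter(nodes, withAttributeValueAs("href", "/home"))))
	fmt.Println(len(filter(nodes, withTag("a"))))
}
```

Because a filter option is just a function value, callers can also write their own predicates inline when none of the provided constructors fit.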
type Flattener ¶
type Flattener interface {
	// Flatten is a callback function called for each node
	// in the HTML tree. It accepts a *html.Node as the argument and returns
	// an error if any. If the error is not nil, the iteration stops and the
	// error is returned.
	Flatten(node *html.Node) error

	// GetNodesByKey returns a NodeIterator that can iterate over the nodes
	// that are flattened using the flattener and filtered by the given key.
	// If the given key is not found in the flattened document, it returns
	// nil.
	GetNodesByKey(key string) *NodeIterator

	// IsMyType allows each flattener implementation to decide whether the given
	// Flattener is of the same type as itself or not.
	IsMyType(flattener Flattener) bool

	// Len returns the final number of categories or keys that were created by the flattener.
	Len() int
}
Flattener is an interface for the logic that decides how the HTML tree should be traversed and flattened.
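As a rough idea of what a custom flattener could look like, the sketch below groups nodes by CSS class. It is self-contained and therefore declares a local `flattener` interface with the same four method names but simplified, hypothetical types (`*node` instead of *html.Node, a slice instead of *NodeIterator); a real implementation would satisfy flattenhtml.Flattener instead. Using a type assertion for IsMyType is one possible approach, not necessarily what TagFlattener does.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for *html.Node.
type node struct {
	tag   string
	attrs map[string]string
}

// flattener mirrors the shape of the documented interface with local types.
type flattener interface {
	Flatten(n *node) error
	GetNodesByKey(key string) []*node
	IsMyType(other flattener) bool
	Len() int
}

// classFlattener groups nodes by each name in their class attribute.
type classFlattener struct {
	index map[string][]*node
}

// Compile-time check that classFlattener satisfies the local interface.
var _ flattener = (*classFlattener)(nil)

func newClassFlattener() *classFlattener {
	return &classFlattener{index: make(map[string][]*node)}
}

// Flatten is called once per node and files it under every class it has.
func (c *classFlattener) Flatten(n *node) error {
	for _, class := range strings.Fields(n.attrs["class"]) {
		c.index[class] = append(c.index[class], n)
	}
	return nil
}

// GetNodesByKey returns nil when the class was never seen.
func (c *classFlattener) GetNodesByKey(key string) []*node {
	return c.index[key]
}

// IsMyType uses a type assertion to recognize flatteners of the same type.
func (c *classFlattener) IsMyType(other flattener) bool {
	_, ok := other.(*classFlattener)
	return ok
}

// Len returns the number of distinct class names.
func (c *classFlattener) Len() int {
	return len(c.index)
}

func sampleNodes() []*node {
	return []*node{
		{tag: "p", attrs: map[string]string{"class": "lead note"}},
		{tag: "div", attrs: map[string]string{"class": "note"}},
		{tag: "span"},
	}
}

func main() {
	f := newClassFlattener()
	for _, n := range sampleNodes() {
		if err := f.Flatten(n); err != nil {
			panic(err)
		}
	}
	fmt.Println(f.Len(), len(f.GetNodesByKey("note")), len(f.GetNodesByKey("lead")))
}
```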
type MultiCursor ¶
type MultiCursor struct {
// contains filtered or unexported fields
}
MultiCursor is a helper struct that holds all the configured flatteners. It is usually initiated by the NodeManager using the configured flatteners, and can later be narrowed down to a single flattener using the *MultiCursor.SelectCursor method.
func NewMultiCursor ¶
func NewMultiCursor(flatteners ...Flattener) *MultiCursor
NewMultiCursor returns a new MultiCursor initiated by the NodeManager. It holds all the configured flatteners, each of which is used separately to flatten the HTML tree. To perform operations on the flattened document, you first need to select your desired flattener cursor using the methods defined on MultiCursor.
func (*MultiCursor) First ¶ added in v0.2.0
func (m *MultiCursor) First() *Cursor
First returns the first Cursor from the MultiCursor initiated by the NodeManager. This Cursor will hold the reference to the first flattener you configured for the NodeManager.Parse method. If MultiCursor has no cursor, the result will be nil.
func (*MultiCursor) RegisterNewNode ¶ added in v0.3.0
func (m *MultiCursor) RegisterNewNode(node *Node) error
RegisterNewNode adds a node that was newly and manually created by the user to the cycle. It calls the flatten method of all its flatteners with the Node's underlying html.Node. A new node can only be accessed through the NodeIterator and Cursor if it is added to the cycle using this method.
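The reason registration is needed can be illustrated with a small sketch: the flattened index is built during the initial traversal, so a node attached to the tree afterwards is unknown to it until it is flattened explicitly. The `node`, `tagIndex`, `buildIndex`, and `register` names below are hypothetical and only model that behavior.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML element.
type node struct {
	tag      string
	children []*node
}

// tagIndex is a snapshot of the tree grouped by tag name.
type tagIndex struct {
	byTag map[string][]*node
}

// buildIndex traverses the tree once and registers every node it finds.
func buildIndex(root *node) *tagIndex {
	idx := &tagIndex{byTag: make(map[string][]*node)}

	var walk func(n *node)
	walk = func(n *node) {
		idx.register(n)
		for _, child := range n.children {
			walk(child)
		}
	}
	walk(root)

	return idx
}

// register files a single node under its tag without walking the tree.
func (t *tagIndex) register(n *node) {
	t.byTag[n.tag] = append(t.byTag[n.tag], n)
}

func main() {
	div := &node{tag: "div"}
	root := &node{tag: "body", children: []*node{div}}
	idx := buildIndex(root)

	// Attach a new node to the tree after the index was built.
	span := &node{tag: "span"}
	div.children = append(div.children, span)
	fmt.Println(len(idx.byTag["span"])) // the index has not seen the node yet

	// Registering the node makes it visible to lookups.
	idx.register(span)
	fmt.Println(len(idx.byTag["span"]))
}
```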
func (*MultiCursor) SelectCursor ¶ added in v0.2.0
func (m *MultiCursor) SelectCursor(flattener Flattener) (*Cursor, error)
SelectCursor returns a new Cursor with the selected flattener from the MultiCursor initiated by the NodeManager. If the given flattener is not found in the MultiCursor, it returns ErrNoFlattener.
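One plausible way to implement this kind of selection is to ask each stored flattener whether the requested one is of its type and to return a sentinel error when none matches. The sketch below assumes exactly that; all lower-case names (`multiCursor`, `selectCursor`, `errNoFlattener`, and so on) are hypothetical and only mirror the documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoFlattener mirrors the documented ErrNoFlattener sentinel.
var errNoFlattener = errors.New("at least one flattener should be provided")

// flattener keeps only what selection needs.
type flattener interface {
	isMyType(other flattener) bool
	name() string
}

type tagFlattener struct{}

func (t *tagFlattener) isMyType(other flattener) bool {
	_, ok := other.(*tagFlattener)
	return ok
}

func (t *tagFlattener) name() string { return "tag" }

type attrFlattener struct{}

func (a *attrFlattener) isMyType(other flattener) bool {
	_, ok := other.(*attrFlattener)
	return ok
}

func (a *attrFlattener) name() string { return "attr" }

// cursor wraps a single selected flattener.
type cursor struct {
	selected flattener
}

// multiCursor holds one cursor per configured flattener, in order.
type multiCursor struct {
	cursors []*cursor
}

func newMultiCursor(flatteners ...flattener) *multiCursor {
	m := &multiCursor{}
	for _, f := range flatteners {
		m.cursors = append(m.cursors, &cursor{selected: f})
	}
	return m
}

// first returns the cursor of the first flattener, or nil if there is none.
func (m *multiCursor) first() *cursor {
	if len(m.cursors) == 0 {
		return nil
	}
	return m.cursors[0]
}

// selectCursor returns the cursor whose flattener has the same type as want.
func (m *multiCursor) selectCursor(want flattener) (*cursor, error) {
	for _, c := range m.cursors {
		if c.selected.isMyType(want) {
			return c, nil
		}
	}
	return nil, errNoFlattener
}

func main() {
	mc := newMultiCursor(&tagFlattener{}, &attrFlattener{})

	c, err := mc.selectCursor(&attrFlattener{})
	if err != nil {
		panic(err)
	}
	fmt.Println(c.selected.name(), mc.first().selected.name())

	_, err = newMultiCursor(&tagFlattener{}).selectCursor(&attrFlattener{})
	fmt.Println(errors.Is(err, errNoFlattener))
}
```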
type Node ¶
type Node struct {
// contains filtered or unexported fields
}
Node is a simple wrapper around *html.Node. It allows read/write operations on the *html.Node while preserving the structure of the HTML tree.
func (*Node) AppendChild ¶ added in v0.3.0
func (n *Node) AppendChild(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
AppendChild appends a new child to the Node. The new child is added to the end of the Node's children list. It returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the content if nodeType is NodeTypeText. A node added this way will appear in the output when you render the NodeManager. However, it will not be accessible using NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.
func (*Node) AppendSibling ¶ added in v0.3.0
func (n *Node) AppendSibling(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
AppendSibling appends a new sibling to the Node. The new node becomes the node immediately after this node in the parent's children list. It returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the content if nodeType is NodeTypeText. A node added this way will appear in the output when you render the NodeManager. However, it will not be accessible using NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.
func (*Node) Attribute ¶
func (n *Node) Attribute(key string) (string, bool)
Attribute returns the value of the given attribute key. The second return value is a boolean that indicates whether the given key is found.
func (*Node) Attributes ¶
func (n *Node) Attributes() map[string]string
Attributes returns a map containing the attribute keys and values of the Node.
func (*Node) HTMLNode ¶ added in v0.2.0
func (n *Node) HTMLNode() *html.Node
HTMLNode returns the underlying *html.Node of the Node. Any write operation on the *html.Node might corrupt the structure of the HTML tree.
func (*Node) IsRemoved ¶
func (n *Node) IsRemoved() bool
IsRemoved returns true if the Node is removed from the NodeIterator and html.Node tree.
func (*Node) PrependChild ¶ added in v0.3.0
func (n *Node) PrependChild(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
PrependChild prepends a new child to the Node. The new child is added to the beginning of the Node's children list. It returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the content if nodeType is NodeTypeText. A node added this way will appear in the output when you render the NodeManager. However, it will not be accessible using NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.
func (*Node) PrependSibling ¶ added in v0.3.0
func (n *Node) PrependSibling(nodeType NodeType, tagNameOrContent string, attributes map[string]string) *Node
PrependSibling prepends a new sibling to the Node. The new node becomes the node immediately before this node in the parent's children list. It returns the newly added Node. tagNameOrContent is used as the tag name if nodeType is NodeTypeElement, or as the content if nodeType is NodeTypeText. A node added this way will appear in the output when you render the NodeManager. However, it will not be accessible using NodeIterator or Cursor. To add the new node to the cycle, use the MultiCursor.RegisterNewNode method.
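The positions used by the four insertion methods (end of the children list, start of the children list, right after a node, right before a node) can be sketched on a small linked tree. The `node` type here is hypothetical; it only borrows the parent/child/sibling pointer layout that html.Node also uses, and the method names are lower-case stand-ins for the documented ones.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical linked tree node.
type node struct {
	name                     string
	parent                   *node
	firstChild, lastChild    *node
	prevSibling, nextSibling *node
}

// insertBefore links child into parent's children directly before ref.
// A nil ref appends child to the end of the list.
func insertBefore(parent, child, ref *node) {
	prev := parent.lastChild
	if ref != nil {
		prev = ref.prevSibling
	}

	child.parent = parent
	child.prevSibling = prev
	child.nextSibling = ref

	if prev != nil {
		prev.nextSibling = child
	} else {
		parent.firstChild = child
	}
	if ref != nil {
		ref.prevSibling = child
	} else {
		parent.lastChild = child
	}
}

// appendChild adds a new node at the end of n's children.
func (n *node) appendChild(name string) *node {
	child := &node{name: name}
	insertBefore(n, child, nil)
	return child
}

// prependChild adds a new node at the start of n's children.
func (n *node) prependChild(name string) *node {
	child := &node{name: name}
	insertBefore(n, child, n.firstChild)
	return child
}

// appendSibling adds a new node right after n. It returns nil if n has no parent.
func (n *node) appendSibling(name string) *node {
	if n.parent == nil {
		return nil
	}
	sibling := &node{name: name}
	insertBefore(n.parent, sibling, n.nextSibling)
	return sibling
}

// prependSibling adds a new node right before n. It returns nil if n has no parent.
func (n *node) prependSibling(name string) *node {
	if n.parent == nil {
		return nil
	}
	sibling := &node{name: name}
	insertBefore(n.parent, sibling, n)
	return sibling
}

// childNames lists the children in document order.
func (n *node) childNames() string {
	var names []string
	for c := n.firstChild; c != nil; c = c.nextSibling {
		names = append(names, c.name)
	}
	return strings.Join(names, " ")
}

func main() {
	root := &node{name: "root"}
	b := root.appendChild("b")
	root.prependChild("a")
	b.appendSibling("c")
	b.prependSibling("a2")
	fmt.Println(root.childNames())
}
```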
func (*Node) Remove ¶
func (n *Node) Remove() error
Remove removes the Node from the NodeIterator and the html.Node tree. The node will no longer appear in the output of NodeManager.Render.
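A rough sketch of removal on a simplified tree follows. It assumes, based only on the ErrParentlessNode message, that removing a node without a parent is reported as an error; the `node` type, `errParentlessNode`, and the helper names are hypothetical and do not reflect the package's internals.

```go
package main

import (
	"errors"
	"fmt"
)

// errParentlessNode mirrors the message of the documented ErrParentlessNode.
var errParentlessNode = errors.New("node with no parent cannot be removed")

// node is a hypothetical stand-in for the package's Node wrapper.
type node struct {
	tag      string
	parent   *node
	children []*node
	removed  bool
}

// addChild attaches a new child and records its parent.
func (n *node) addChild(tag string) *node {
	child := &node{tag: tag, parent: n}
	n.children = append(n.children, child)
	return child
}

// remove detaches the node from its parent and marks it as removed,
// so that iterators holding a reference to it can skip it.
func (n *node) remove() error {
	if n.parent == nil {
		return errParentlessNode
	}

	siblings := n.parent.children
	for i, s := range siblings {
		if s == n {
			n.parent.children = append(siblings[:i], siblings[i+1:]...)
			break
		}
	}

	n.parent = nil
	n.removed = true
	return nil
}

func main() {
	root := &node{tag: "div"}
	first := root.addChild("p")
	root.addChild("span")

	if err := first.remove(); err != nil {
		panic(err)
	}
	fmt.Println(len(root.children), root.children[0].tag, first.removed)

	// The root has no parent, so it cannot be removed.
	fmt.Println(errors.Is(root.remove(), errParentlessNode))
}
```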
func (*Node) RemoveAttribute ¶
func (n *Node) RemoveAttribute(key string)
RemoveAttribute removes the given attribute key from the node. If the given key does not exist, it will be ignored.
func (*Node) SetAttribute ¶
func (n *Node) SetAttribute(key, value string)
SetAttribute sets the value of the given attribute key for the node. If the given key does not exist, it will be added to the node as a new attribute. Otherwise, the value of the given key will be updated.
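The attribute semantics described above (lookup with a found flag, update-or-add on set, silent no-op when removing a missing key) can be sketched over a slice of key/value pairs. The `attribute` and `node` types below are hypothetical and intentionally simpler than html.Node.

```go
package main

import "fmt"

// attribute is a hypothetical key/value pair.
type attribute struct {
	key, val string
}

// node keeps its attributes in insertion order.
type node struct {
	attrs []attribute
}

// attribute returns the value for key and whether the key was found.
func (n *node) attribute(key string) (string, bool) {
	for _, a := range n.attrs {
		if a.key == key {
			return a.val, true
		}
	}
	return "", false
}

// setAttribute updates an existing key or appends a new attribute.
func (n *node) setAttribute(key, value string) {
	for i := range n.attrs {
		if n.attrs[i].key == key {
			n.attrs[i].val = value
			return
		}
	}
	n.attrs = append(n.attrs, attribute{key: key, val: value})
}

// removeAttribute drops the key if present and does nothing otherwise.
func (n *node) removeAttribute(key string) {
	kept := n.attrs[:0]
	for _, a := range n.attrs {
		if a.key != key {
			kept = append(kept, a)
		}
	}
	n.attrs = kept
}

// attributes returns a copy of the attributes as a map.
func (n *node) attributes() map[string]string {
	out := make(map[string]string, len(n.attrs))
	for _, a := range n.attrs {
		out[a.key] = a.val
	}
	return out
}

func main() {
	n := &node{}
	n.setAttribute("class", "a")
	n.setAttribute("id", "x")
	n.setAttribute("class", "b") // updates the existing key

	class, _ := n.attribute("class")
	fmt.Println(class, len(n.attrs))

	n.removeAttribute("id")
	n.removeAttribute("missing") // ignored
	_, found := n.attribute("id")
	fmt.Println(found, len(n.attributes()))
}
```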
type NodeIterator ¶
type NodeIterator struct {
// contains filtered or unexported fields
}
NodeIterator is a simple iterator that can iterate over a slice of *Node. It is used to iterate over the nodes that are flattened by a Flattener and perform different operations using the methods that are defined on the NodeIterator.
func NewNodeIterator ¶
func NewNodeIterator() *NodeIterator
NewNodeIterator creates a new NodeIterator.
func (*NodeIterator) Add ¶
func (n *NodeIterator) Add(node *Node) *NodeIterator
Add adds the given *Node to the NodeIterator. This does not change the html.Node tree. It is expected that NodeIterator and Node are managed by the flattener.
func (*NodeIterator) Each ¶
func (n *NodeIterator) Each(f func(node *Node))
Each iterates over the nodes in the NodeIterator and calls the given function.
func (*NodeIterator) Filter ¶
func (n *NodeIterator) Filter(option FilterOption) *NodeIterator
Filter filters the nodes in the NodeIterator using the given FilterOption. It returns a new NodeIterator that can iterate over the filtered nodes. For more complex filtering, you can use FilterOr or FilterAnd methods.
func (*NodeIterator) FilterAnd ¶
func (n *NodeIterator) FilterAnd(options ...FilterOption) *NodeIterator
FilterAnd filters the nodes in the NodeIterator using the given FilterOptions. All the given options will be combined using AND operator. It means that if all the given options return true for a node, it will be included in the filtered NodeIterator. If any of the given options returns false for a node, the node will be filtered out and the rest of the options will be ignored for that node.
func (*NodeIterator) FilterOr ¶
func (n *NodeIterator) FilterOr(options ...FilterOption) *NodeIterator
FilterOr filters the nodes in the NodeIterator using the given FilterOptions. All the given options will be combined using the OR operator. This means that if any of the given options returns true for a node, the node will be included in the filtered NodeIterator and the rest of the options will be ignored for that node.
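The short-circuit behavior of the AND and OR combinations can be sketched with plain predicate functions. Everything below is a hypothetical model (`filterAnd`, `filterOr`, `counting`, and the `node` type are not part of flattenhtml); the `counting` wrapper exists only to show which predicates were actually evaluated. How the real methods treat an empty option list is not stated in this documentation, so the sketch's behavior there (AND keeps, OR drops) is an assumption.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the package's Node wrapper.
type node struct {
	tag   string
	attrs map[string]string
}

// filterOption reports whether a node should be kept.
type filterOption func(n *node) bool

// filterAnd stops at the first option that returns false.
func filterAnd(options ...filterOption) filterOption {
	return func(n *node) bool {
		for _, option := range options {
			if !option(n) {
				return false
			}
		}
		return true
	}
}

// filterOr stops at the first option that returns true.
func filterOr(options ...filterOption) filterOption {
	return func(n *node) bool {
		for _, option := range options {
			if option(n) {
				return true
			}
		}
		return false
	}
}

// isTag keeps nodes with the given tag name.
func isTag(tag string) filterOption {
	return func(n *node) bool { return n.tag == tag }
}

// hasAttr keeps nodes that carry the given attribute key.
func hasAttr(key string) filterOption {
	return func(n *node) bool {
		_, ok := n.attrs[key]
		return ok
	}
}

// counting wraps an option and counts how often it is evaluated.
func counting(calls *int, option filterOption) filterOption {
	return func(n *node) bool {
		*calls++
		return option(n)
	}
}

func main() {
	n := &node{tag: "p", attrs: map[string]string{"class": "p2"}}
	calls := 0

	// AND: the tag check fails first, so the counted option never runs.
	fmt.Println(filterAnd(isTag("div"), counting(&calls, hasAttr("class")))(n), calls)

	// OR: the tag check succeeds first, so the counted option never runs.
	fmt.Println(filterOr(isTag("p"), counting(&calls, hasAttr("class")))(n), calls)

	// OR: the tag check fails, so the counted option decides.
	fmt.Println(filterOr(isTag("div"), counting(&calls, hasAttr("class")))(n), calls)
}
```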
func (*NodeIterator) First ¶ added in v0.2.0
func (n *NodeIterator) First() *Node
First returns the first non-removed node in the NodeIterator. If there is no non-removed node, it returns nil.
func (*NodeIterator) Len ¶
func (n *NodeIterator) Len() int
Len returns the number of nodes in the NodeIterator.
func (*NodeIterator) Next ¶ added in v0.2.0
func (n *NodeIterator) Next() *Node
Next iterates over the nodes in the NodeIterator and returns the next non-removed node. It starts from the first element of the NodeIterator and proceeds to the next item on each call. If there is no remaining non-removed node, it returns nil. A nil result must be treated as the end of the iteration. Use Reset to start the iteration from the beginning.
func (*NodeIterator) Reset ¶ added in v0.2.0
func (n *NodeIterator) Reset()
Reset resets the cursor index to the beginning of the NodeIterator.
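The First, Next, and Reset contract (skip removed nodes, return nil at the end, restart on reset) can be modeled with a slice and an index. This is a minimal sketch with hypothetical names; in particular, whether the real First moves the iteration position is not stated in this documentation, and the sketch assumes it does not.

```go
package main

import "fmt"

// node is a hypothetical stand-in with a removed flag.
type node struct {
	name    string
	removed bool
}

// nodeIterator walks a slice of nodes and skips removed ones.
type nodeIterator struct {
	nodes []*node
	index int
}

// next returns the next non-removed node, or nil when the iteration is over.
func (it *nodeIterator) next() *node {
	for it.index < len(it.nodes) {
		n := it.nodes[it.index]
		it.index++
		if !n.removed {
			return n
		}
	}
	return nil
}

// reset moves the iteration position back to the beginning.
func (it *nodeIterator) reset() {
	it.index = 0
}

// first returns the first non-removed node without moving the position.
func (it *nodeIterator) first() *node {
	for _, n := range it.nodes {
		if !n.removed {
			return n
		}
	}
	return nil
}

func newSampleIterator() *nodeIterator {
	return &nodeIterator{nodes: []*node{
		{name: "a"},
		{name: "b", removed: true},
		{name: "c"},
	}}
}

func main() {
	it := newSampleIterator()

	// The removed node "b" is skipped.
	for n := it.next(); n != nil; n = it.next() {
		fmt.Println(n.name)
	}

	// After reset, iteration starts over.
	it.reset()
	fmt.Println(it.next().name, it.first().name)
}
```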
type NodeManager ¶
type NodeManager struct {
// contains filtered or unexported fields
}
NodeManager provides the top-level logic of this package. It is responsible for parsing HTML nodes, performing modifications or read-only operations on them, and then rendering the HTML tree. There are different ways to initiate a NodeManager:
- NewNodeManagerFromReader: It accepts an io.Reader and parses the HTML tree from it.
- NewNodeManagerFromURL: It accepts a URL and parses the HTML tree from the response body of the URL.
- NewNodeManager: It accepts a *html.Node and uses it as the root of the HTML tree.
The first two approaches (NewNodeManagerFromReader and NewNodeManagerFromURL) use html.Parse to parse the HTML tree.
func NewNodeManager ¶
func NewNodeManager(root *html.Node) *NodeManager
NewNodeManager creates a new NodeManager with the given *html.Node as the root of the HTML tree.
func NewNodeManagerFromReader ¶
func NewNodeManagerFromReader(r io.Reader) (*NodeManager, error)
NewNodeManagerFromReader creates a new NodeManager with the HTML tree parsed from the given io.Reader.
func NewNodeManagerFromURL ¶
func NewNodeManagerFromURL(ctx context.Context, url string) (*NodeManager, error)
NewNodeManagerFromURL creates a new NodeManager with the HTML tree parsed from the response body of the given URL.
func (*NodeManager) Parse ¶
func (n *NodeManager) Parse(flatteners ...Flattener) (*MultiCursor, error)
Parse traverses the HTML tree that has already been converted to *html.Node. It accepts a set of Flatteners that decide how the HTML tree should be traversed and flattened. If any of the flatteners returns an error, the iteration stops and the error is returned.
type TagFlattener ¶
type TagFlattener struct {
// contains filtered or unexported fields
}
TagFlattener is a Flattener that flattens the HTML tree by tag name. When the NodeManager parses the tree with this flattener, nodes are grouped into one NodeIterator per tag name. Therefore, you can access all nodes with the same tag name (e.g., meta, a, p) using the GetNodesByKey method or the Cursor.SelectNodes method.
Example ¶
package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/seinshah/flattenhtml"
)

func main() {
	rawHTML := `<html><body><div><p class="p1">hello</p><p class="p2">world</p></div></body></html>`

	manager, err := flattenhtml.NewNodeManagerFromReader(strings.NewReader(rawHTML))
	if err != nil {
		panic(err)
	}

	mc, err := manager.Parse(flattenhtml.NewTagFlattener())
	if err != nil {
		panic(err)
	}

	cursor := mc.First()

	targetP := cursor.SelectNodes("p").Filter(flattenhtml.WithAttributeValueAs("class", "p2"))

	targetP.Each(func(node *flattenhtml.Node) {
		err = node.Remove()
		if err != nil {
			panic(err)
		}
	})

	output := bytes.Buffer{}

	err = manager.Render(&output)
	if err != nil {
		panic(err)
	}

	fmt.Println(output.String())
}
Output: <html><head></head><body><div><p class="p1">hello</p></div></body></html>
func NewTagFlattener ¶
func NewTagFlattener() *TagFlattener
NewTagFlattener creates a new TagFlattener.
func (*TagFlattener) Flatten ¶
func (t *TagFlattener) Flatten(node *html.Node) error
Flatten is a callback function called for each node during the NodeManager.Parse. It will continue to categorize all nodes in their tag NodeIterator as NodeManager traverses the HTML tree. This method does not return an error.
func (*TagFlattener) GetNodesByKey ¶
func (t *TagFlattener) GetNodesByKey(key string) *NodeIterator
func (*TagFlattener) IsMyType ¶
func (t *TagFlattener) IsMyType(flattener Flattener) bool
func (*TagFlattener) Len ¶
func (t *TagFlattener) Len() int
Len for TagFlattener returns the number of distinct tag names found in the HTML tree.