chtml

package
v0.0.0-...-c8449d3 Latest
Warning

This package is not in the latest version of its module.

Published: Oct 19, 2024 License: BSD-3-Clause, MIT Imports: 20 Imported by: 0

Documentation

Overview

Package chtml implements an HTML parser to be used by the `chtml` package.

The parser is based on golang.org/x/net/html with the following modifications:

  • The Parse function may parse an entire HTML document or a fragment.
  • The original ParseFragment function is removed, since it is always context-aware. See StackOverflow post: https://stackoverflow.com/questions/21421704/using-html-parsefragment-in-a-generic-way
  • There is no goal to follow HTML5 spec.
  • The modified package tries to use the upstream `golang.org/x/net/html` package as much as possible.
  • The contents of the <noscript> tag are always parsed as HTML nodes. The scripting flag is removed.
  • Frameset/frame tags are not handled to simplify the parser. Those tags are deprecated in HTML5.
  • Foreign content is not handled per spec. Nested elements of <svg> and <math> tags are parsed as regular HTML nodes.
  • Foster parenting is disabled.
  • Active Formatting Elements algorithm is removed for simplicity and performance.

Tokenization is done by creating a Tokenizer for an io.Reader r. It is the caller's responsibility to ensure that r provides UTF-8 encoded HTML.

z := html.NewTokenizer(r)

Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(), which parses the next token and returns its type, or an error:

for {
	tt := z.Next()
	if tt == html.ErrorToken {
		// ...
		return ...
	}
	// Process the current token.
}

There are two APIs for retrieving the current token. The high-level API is to call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs allow optionally calling Raw after Next but before Token, Text, TagName, or TagAttr. In EBNF notation, the valid call sequence per token is:

Next {Raw} [ Token | Text | TagName {TagAttr} ]
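The call sequence above can be read as a small state machine. The following is a minimal sketch of a checker for that rule, applied once per token. The names validSequence and the state constants are hypothetical and exist only for this illustration; the Tokenizer itself does not expose such a check.

```go
package main

import "fmt"

// state tracks the position inside one
// "Next {Raw} [ Token | Text | TagName {TagAttr} ]" group.
type state int

const (
	idle  state = iota // no Next call has been made yet
	fresh              // after Next, and after any number of Raw calls
	attrs              // after TagName; TagAttr may repeat
	done               // after Token or Text; only Next may follow
)

// validSequence reports whether calls is a valid series of method names
// under the EBNF rule quoted above. It is an illustration only.
func validSequence(calls []string) bool {
	s := idle
	for _, c := range calls {
		switch {
		case c == "Next":
			// Next always starts a new token group.
			s = fresh
		case s == fresh && c == "Raw":
			// Raw may repeat, but only before Token, Text, or TagName.
		case s == fresh && (c == "Token" || c == "Text"):
			s = done
		case s == fresh && c == "TagName":
			s = attrs
		case s == attrs && c == "TagAttr":
			// TagAttr may repeat after TagName.
		default:
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(validSequence([]string{"Next", "Raw", "TagName", "TagAttr", "Next", "Text"})) // true
	fmt.Println(validSequence([]string{"Next", "Token", "Raw"}))                              // false
}
```

In this reading, Raw after Token is rejected because the rule only allows Raw between Next and the first retrieval call.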

Token returns an independent data structure that completely describes a token. Entities (such as "&lt;") are unescaped, tag names and attribute keys are lower-cased, and attributes are collected into a []Attribute. For example:

for {
	if z.Next() == html.ErrorToken {
		// Returning io.EOF indicates success.
		return z.Err()
	}
	emitToken(z.Token())
}

The low-level API performs fewer allocations and copies, but the contents of the []byte values returned by Text, TagName and TagAttr may change on the next call to Next. For example, to extract an HTML page's anchor text:

depth := 0
for {
	tt := z.Next()
	switch tt {
	case html.ErrorToken:
		return z.Err()
	case html.TextToken:
		if depth > 0 {
			// emitBytes should copy the []byte it receives,
			// if it doesn't process it immediately.
			emitBytes(z.Text())
		}
	case html.StartTagToken, html.EndTagToken:
		tn, _ := z.TagName()
		if len(tn) == 1 && tn[0] == 'a' {
			if tt == html.StartTagToken {
				depth++
			} else {
				depth--
			}
		}
	}
}
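The reason emitBytes has to copy can be shown without the tokenizer at all. The sketch below uses a hypothetical stand-in, reusingReader, that hands out slices of one internal buffer, which is an assumption about how a buffer-reusing reader typically behaves and not the Tokenizer's actual implementation. Slices kept without copying end up showing the most recent data.

```go
package main

import "fmt"

// reusingReader is a hypothetical stand-in for a tokenizer: every call to
// next overwrites the same internal buffer and returns a slice of it.
type reusingReader struct {
	buf   []byte
	words []string
	i     int
}

// next returns the next word as a slice of the shared buffer, or nil at the end.
func (r *reusingReader) next() []byte {
	if r.i >= len(r.words) {
		return nil
	}
	r.buf = append(r.buf[:0], r.words[r.i]...)
	r.i++
	return r.buf
}

// collect keeps every chunk returned by next. When copyBytes is false it
// stores the aliased slices, which later calls to next overwrite.
func collect(words []string, copyBytes bool) []string {
	r := &reusingReader{buf: make([]byte, 0, 64), words: words}
	var kept [][]byte
	for b := r.next(); b != nil; b = r.next() {
		if copyBytes {
			b = append([]byte(nil), b...) // private copy, safe to keep
		}
		kept = append(kept, b)
	}
	out := make([]string, len(kept))
	for i, b := range kept {
		out[i] = string(b)
	}
	return out
}

func main() {
	words := []string{"home", "docs"}
	fmt.Println(collect(words, false)) // [docs docs]
	fmt.Println(collect(words, true))  // [home docs]
}
```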

Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML. For example, to process each anchor node in depth-first order:

doc, err := html.Parse(r)
if err != nil {
	// ...
}
var f func(*html.Node)
f = func(n *html.Node) {
	if n.Type == html.ElementNode && n.Data == "a" {
		// Do something with n...
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		f(c)
	}
}
f(doc)

The relevant specifications include: https://html.spec.whatwg.org/multipage/syntax.html and https://html.spec.whatwg.org/multipage/syntax.html#tokenization

Security Considerations

Care should be taken when parsing and interpreting HTML, whether full documents or fragments, within the framework of the HTML specification, especially with regard to untrusted inputs.

This package provides both a tokenizer and a parser, which implement the tokenization, and tokenization and tree construction stages of the WHATWG HTML parsing specification respectively. While the tokenizer parses and normalizes individual HTML tokens, only the parser constructs the DOM tree from the tokenized HTML, as described in the tree construction stage of the specification, dynamically modifying or extending the document's DOM tree.

If your use case requires semantically well-formed HTML documents, as defined by the WHATWG specification, the parser should be used rather than the tokenizer.

In security contexts, if trust decisions are being made using the tokenized or parsed content, the input must be re-serialized (for instance by using Render or Token.String) in order for those trust decisions to hold, as the process of tokenization or parsing may alter the content.

Example
s := "<html><body><p>Hello World</p></body></html>"
r := strings.NewReader(s)
docNode, err := Parse(r, nil)
if err != nil {
	panic(err)
}

fmt.Println(docNode)
Output:

Constants

This section is empty.

Variables

var (
	// ErrComponentNotFound is returned by Importer implementations when a component is not found.
	ErrComponentNotFound = errors.New("component not found")

	// ErrImportNotAllowed is returned when an Importer is not set for the component.
	ErrImportNotAllowed = errors.New("imports are not allowed")
)

Functions

func AnyPlusAny

func AnyPlusAny(a any, b any) any

func AnyToHtml

func AnyToHtml(a any) *html.Node

func MarshalScope

func MarshalScope(s Scope, src any) error

MarshalScope stores the variables from the source in the scope. The source must be a struct or a map. The function returns an error if the source is not a struct or a map or if the source variables cannot be stored in the scope.

func UnmarshalScope

func UnmarshalScope(s Scope, target any) error

UnmarshalScope reads the variables from the scope and converts them to a provided target. The target must be a pointer to a struct or a map. The function returns an error if the target is not a pointer or if the scope variables cannot be converted to the target.
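The struct-to-variables conversion described for MarshalScope and UnmarshalScope can be sketched with the reflect package. This is not the package's implementation: structToVars and varsToStruct are hypothetical helpers that work on a plain map[string]any instead of a Scope, use field names as variable names, and handle only struct targets, all of which are assumptions made for the illustration.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// structToVars converts a struct (or passes through a map) into variables.
func structToVars(src any) (map[string]any, error) {
	if m, ok := src.(map[string]any); ok {
		return m, nil
	}
	v := reflect.ValueOf(src)
	if v.Kind() != reflect.Struct {
		return nil, errors.New("source must be a struct or a map")
	}
	vars := make(map[string]any)
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		if f := t.Field(i); f.IsExported() {
			vars[f.Name] = v.Field(i).Interface()
		}
	}
	return vars, nil
}

// varsToStruct copies variables into the exported fields of *target.
func varsToStruct(vars map[string]any, target any) error {
	p := reflect.ValueOf(target)
	if p.Kind() != reflect.Ptr || p.IsNil() || p.Elem().Kind() != reflect.Struct {
		return errors.New("target must be a non-nil pointer to a struct")
	}
	s := p.Elem()
	for name, val := range vars {
		f := s.FieldByName(name)
		if !f.IsValid() || !f.CanSet() {
			continue // no matching exported field; this sketch ignores the variable
		}
		rv := reflect.ValueOf(val)
		if !rv.IsValid() || !rv.Type().AssignableTo(f.Type()) {
			return fmt.Errorf("variable %q cannot be converted to %s", name, f.Type())
		}
		f.Set(rv)
	}
	return nil
}

// page is a hypothetical source and target type.
type page struct {
	Title string
	Count int
}

func main() {
	vars, err := structToVars(page{Title: "Home", Count: 3})
	fmt.Println(vars, err)

	var p page
	err = varsToStruct(vars, &p)
	fmt.Println(p, err)
}
```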

Types

type Attribute

type Attribute struct {
	Namespace string
	Key       string
	Val       Expr
}

type BaseScope

type BaseScope struct {
	// contains filtered or unexported fields
}

BaseScope is a base implementation of the Scope interface. For extra functionality, this type can be wrapped (embedded) in a custom scope implementation.

func NewBaseScope

func NewBaseScope(vars map[string]any) *BaseScope

func (*BaseScope) Spawn

func (s *BaseScope) Spawn(vars map[string]any) Scope

Spawn creates a new child scope. If the current scope is closed, the new scope is also closed.

func (*BaseScope) Touch

func (s *BaseScope) Touch()

func (*BaseScope) Touched

func (s *BaseScope) Touched() <-chan struct{}

func (*BaseScope) Vars

func (s *BaseScope) Vars() map[string]any

type CAttr

type CAttr struct{}

func (*CAttr) Render

func (c *CAttr) Render(s Scope) (any, error)

type Component

type Component interface {
	// Render transforms the input data from the scope into another data object, typically
	// an HTML document (*html.Node) or anything else that can be sent over the wire or
	// passed to another Component as an input.
	Render(s Scope) (any, error)
}
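A custom Component only needs a Render method. The sketch below redeclares Scope and Component locally so that it compiles on its own; in real code these would come from this package. The greeting component, the mapScope type, and the render helper are hypothetical names introduced for the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Scope and Component mirror the interfaces documented on this page.
type Scope interface {
	Spawn(vars map[string]any) Scope
	Vars() map[string]any
	Touch()
}

type Component interface {
	Render(s Scope) (any, error)
}

// mapScope is a hypothetical minimal Scope backed by a plain map.
type mapScope struct{ vars map[string]any }

func (m *mapScope) Spawn(vars map[string]any) Scope { return &mapScope{vars: vars} }
func (m *mapScope) Vars() map[string]any            { return m.vars }
func (m *mapScope) Touch()                          {}

// greeting is a hypothetical Component that turns the "name" variable
// into a string. A Component may return any value, not only *html.Node.
type greeting struct{}

func (greeting) Render(s Scope) (any, error) {
	name, ok := s.Vars()["name"].(string)
	if !ok {
		return nil, errors.New(`greeting: variable "name" must be a string`)
	}
	return "Hello, " + name + "!", nil
}

// render runs a component against a fresh scope holding vars.
func render(c Component, vars map[string]any) (any, error) {
	return c.Render(&mapScope{vars: vars})
}

func main() {
	out, err := render(greeting{}, map[string]any{"name": "World"})
	fmt.Println(out, err) // Hello, World! <nil>
}
```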

func NewComponent

func NewComponent(n *Node, opts *ComponentOptions) Component

type ComponentError

type ComponentError struct {
	// contains filtered or unexported fields
}

func (*ComponentError) Error

func (e *ComponentError) Error() string

func (*ComponentError) HTMLContext

func (e *ComponentError) HTMLContext() string

func (*ComponentError) Unwrap

func (e *ComponentError) Unwrap() error

type ComponentOptions

type ComponentOptions struct {
	// Importer is the factory for components. It is invoked when a <c:NAME> element is encountered.
	Importer Importer

	// RenderComments is a flag to enable rendering of comments
	RenderComments bool
}

type DecodeError

type DecodeError struct {
	Key string
	Err error
}

func (*DecodeError) Error

func (e *DecodeError) Error() string

func (*DecodeError) Is

func (e *DecodeError) Is(target error) bool

func (*DecodeError) Unwrap

func (e *DecodeError) Unwrap() error

type Disposable

type Disposable interface {
	// Dispose releases any resources held by the component.
	// It should be called when the component is no longer needed to prevent resource leaks.
	// If an error occurs during disposal, it should be returned.
	Dispose() error
}

Disposable is an optional interface for components that require explicit resource cleanup. Components that allocate resources such as files, network connections, or memory buffers should implement this interface to release those resources when they are no longer needed.
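Because Disposable is optional, callers would typically discover it with a type assertion. The following sketch assumes a hypothetical disposeAll helper and two hypothetical component types; how and when this package itself calls Dispose is not stated here.

```go
package main

import (
	"errors"
	"fmt"
)

// Disposable mirrors the interface documented above.
type Disposable interface {
	Dispose() error
}

// fileComponent is a hypothetical component that holds a resource.
type fileComponent struct{ closed bool }

func (f *fileComponent) Dispose() error {
	f.closed = true
	return nil
}

// brokenComponent is a hypothetical component whose cleanup fails.
type brokenComponent struct{}

func (brokenComponent) Dispose() error { return errors.New("resource busy") }

// plainComponent holds no resources and does not implement Disposable.
type plainComponent struct{}

// disposeAll calls Dispose on every value that implements Disposable.
// It keeps going after a failure and returns the first error it saw.
func disposeAll(components ...any) error {
	var first error
	for _, c := range components {
		d, ok := c.(Disposable)
		if !ok {
			continue // nothing to clean up
		}
		if err := d.Dispose(); err != nil && first == nil {
			first = err
		}
	}
	return first
}

func main() {
	f := &fileComponent{}
	err := disposeAll(plainComponent{}, brokenComponent{}, f)
	fmt.Println(f.closed, err) // true resource busy
}
```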

type Expr

type Expr struct {
	// contains filtered or unexported fields
}

Expr is a struct to hold interpolated string data for the CHTML nodes.

func NewExpr

func NewExpr(s string, args map[string]any) (Expr, error)

func NewExprConst

func NewExprConst(v any) Expr

func NewExprInterpol

func NewExprInterpol(s string, args map[string]any) (Expr, error)

func NewExprRaw

func NewExprRaw(s string) Expr

NewExprRaw creates an Expr with a raw string, no interpolation.

func (Expr) IsEmpty

func (e Expr) IsEmpty() bool

func (Expr) RawString

func (e Expr) RawString() string

func (Expr) Value

func (e Expr) Value(vm *vm.VM, env any) (any, error)

type Importer

type Importer interface {
	Import(name string) (Component, error)
}

Importer acts as a factory for components. It is invoked when a <c:NAME> element is encountered.
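A simple Importer can be backed by a map. The sketch below redeclares the relevant interfaces and the ErrComponentNotFound variable locally so that it compiles on its own. mapImporter, static, and isNotFound are hypothetical names, and wrapping the sentinel error with %w is a choice made for this illustration, not a documented requirement.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of the declarations documented on this page.
var ErrComponentNotFound = errors.New("component not found")

type Scope interface {
	Spawn(vars map[string]any) Scope
	Vars() map[string]any
	Touch()
}

type Component interface {
	Render(s Scope) (any, error)
}

type Importer interface {
	Import(name string) (Component, error)
}

// static is a hypothetical Component that always renders the same string.
type static string

func (c static) Render(Scope) (any, error) { return string(c), nil }

// mapImporter is a hypothetical Importer that resolves names from a map.
type mapImporter map[string]Component

func (m mapImporter) Import(name string) (Component, error) {
	c, ok := m[name]
	if !ok {
		// Wrapping keeps the name in the message while errors.Is still matches.
		return nil, fmt.Errorf("import %q: %w", name, ErrComponentNotFound)
	}
	return c, nil
}

// isNotFound reports whether err is, or wraps, ErrComponentNotFound.
func isNotFound(err error) bool { return errors.Is(err, ErrComponentNotFound) }

func main() {
	var imp Importer = mapImporter{"header": static("<h1>Hi</h1>")}

	_, err := imp.Import("footer")
	fmt.Println(isNotFound(err)) // true
}
```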

type Node

type Node struct {
	// The following fields are replicated from golang.org/x/net/html.Node.
	Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node

	Type      html.NodeType
	DataAtom  atom.Atom
	Data      Expr
	Namespace string

	// Attr is the list of attributes for the node. Also includes c:attr elements.
	Attr []Attribute

	// Cond is the value of c:if attribute. The c:if attribute itself is not included in Attr.
	Cond Expr

	// PrevCond is the previous c:if or c:else-if node in the condition chain. It is not
	// used during rendering (only NextCond is), but is useful for testing.
	// NextCond is the next c:else-if or c:else node in the condition chain.
	PrevCond, NextCond *Node

	// Loop is the value of c:for attribute. The c:for attribute itself is not included in Attr.
	Loop Expr

	// LoopIdx is the index variable name for c:for loops.
	LoopIdx string

	// LoopVar is the value variable name for c:for loops.
	LoopVar string
}

func Parse

func Parse(r io.Reader, imp Importer) (*Node, error)

Parse returns the parsed *Node tree for the HTML from the given Reader. The input is assumed to be UTF-8 encoded.

func (*Node) AppendChild

func (n *Node) AppendChild(c *Node)

AppendChild adds a node c as a child of n.

It will panic if c already has a parent or siblings.

func (*Node) InsertBefore

func (n *Node) InsertBefore(newChild, oldChild *Node)

InsertBefore inserts newChild as a child of n, immediately before oldChild in the sequence of n's children. oldChild may be nil, in which case newChild is appended to the end of n's children.

It will panic if newChild already has a parent or siblings.

func (*Node) IsWhitespace

func (n *Node) IsWhitespace() bool

func (*Node) RemoveChild

func (n *Node) RemoveChild(c *Node)

RemoveChild removes a node c that is a child of n. Afterwards, c will have no parent and no siblings.

It will panic if c's parent is not n.
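The panics documented for AppendChild and RemoveChild follow from how a sibling-linked child list is maintained. The sketch below uses a hypothetical, stripped-down node type with the same five link fields and follows the approach used by golang.org/x/net/html; it is assumed, not confirmed, that this package's Node behaves the same way.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in carrying only the tree links and a name.
type node struct {
	parent, firstChild, lastChild, prevSibling, nextSibling *node
	name                                                    string
}

// appendChild adds c as the last child of n. It panics if c is attached.
func (n *node) appendChild(c *node) {
	if c.parent != nil || c.prevSibling != nil || c.nextSibling != nil {
		panic("appendChild called for an attached child node")
	}
	last := n.lastChild
	if last != nil {
		last.nextSibling = c
	} else {
		n.firstChild = c
	}
	n.lastChild = c
	c.parent = n
	c.prevSibling = last
}

// removeChild detaches c from n. It panics if c is not a child of n.
func (n *node) removeChild(c *node) {
	if c.parent != n {
		panic("removeChild called for a non-child node")
	}
	if n.firstChild == c {
		n.firstChild = c.nextSibling
	}
	if c.nextSibling != nil {
		c.nextSibling.prevSibling = c.prevSibling
	}
	if n.lastChild == c {
		n.lastChild = c.prevSibling
	}
	if c.prevSibling != nil {
		c.prevSibling.nextSibling = c.nextSibling
	}
	c.parent, c.prevSibling, c.nextSibling = nil, nil, nil
}

// childNames lists the names of n's children in order, comma-separated.
func childNames(n *node) string {
	var names []string
	for c := n.firstChild; c != nil; c = c.nextSibling {
		names = append(names, c.name)
	}
	return strings.Join(names, ",")
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	root, a, b := &node{name: "ul"}, &node{name: "a"}, &node{name: "b"}
	root.appendChild(a)
	root.appendChild(b)
	fmt.Println(childNames(root)) // a,b
	root.removeChild(a)
	fmt.Println(childNames(root)) // b
}
```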

type Scope

type Scope interface {
	// Spawn creates a new child scope. It is initialized with variables that can be accessed from
	// the Component's Render() method using the Scope.Vars() method.
	Spawn(vars map[string]any) Scope

	// Vars provides access to variables stored in the scope.
	Vars() map[string]any

	// Touch marks the component as changed. The implementation should re-render the page
	// when this method is called.
	Touch()
}

Scope defines an interface for managing arguments in a CHTML component. Scopes are organized in a hierarchical structure, with each scope potentially having a parent scope and multiple child scopes. Changes in a child scope propagate to its parent scope.

A scope is closed when its associated component will not be rendered further. This occurs either when the HTTP request completes or when the component is removed from the page (e.g., due to c:if or c:for directives). Closing a parent scope results in the closure of all its child scopes.

The CHTML component creates new scopes for each loop iteration, conditional branch, and component import using the Spawn method.

This interface allows for custom implementations of components, enabling the inclusion of additional data such as HTTP request or WebSocket connection information.
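A custom scope that carries request data might look like the sketch below. It redeclares Scope locally and uses a hypothetical baseScope stand-in that counts Touch calls; the real BaseScope exposes a Touched channel instead, and its internals are not shown on this page. requestScope and its requestID field are likewise assumptions made for the illustration.

```go
package main

import "fmt"

// Scope mirrors the interface documented above.
type Scope interface {
	Spawn(vars map[string]any) Scope
	Vars() map[string]any
	Touch()
}

// baseScope is a hypothetical stand-in for a base implementation.
type baseScope struct {
	parent  *baseScope
	vars    map[string]any
	touches int
}

func (s *baseScope) Spawn(vars map[string]any) Scope {
	return &baseScope{parent: s, vars: vars}
}

func (s *baseScope) Vars() map[string]any { return s.vars }

// Touch records a change on this scope and on every ancestor, so that
// a change in a child scope propagates to its parent scope.
func (s *baseScope) Touch() {
	for p := s; p != nil; p = p.parent {
		p.touches++
	}
}

// requestScope embeds the base scope and adds per-request data.
type requestScope struct {
	*baseScope
	requestID string
}

// Spawn overrides the embedded method so that child scopes keep the
// request data instead of falling back to a plain baseScope.
func (s *requestScope) Spawn(vars map[string]any) Scope {
	child := &baseScope{parent: s.baseScope, vars: vars}
	return &requestScope{baseScope: child, requestID: s.requestID}
}

func main() {
	root := &requestScope{
		baseScope: &baseScope{vars: map[string]any{"page": "home"}},
		requestID: "req-1",
	}
	child := root.Spawn(map[string]any{"i": 0})
	child.Touch()

	fmt.Println(child.(*requestScope).requestID, root.touches) // req-1 1
}
```

Vars and Touch are promoted from the embedded type, so only Spawn needs to be overridden to keep the extra field on child scopes.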

type UnrecognizedArgumentError

type UnrecognizedArgumentError struct {
	Name string
}

func (*UnrecognizedArgumentError) Error

func (e *UnrecognizedArgumentError) Error() string

func (*UnrecognizedArgumentError) Is

func (e *UnrecognizedArgumentError) Is(target error) bool
