chtml

package

v0.0.0-...-c8449d3 Published: Oct 19, 2024 License: BSD-3-Clause, MIT Imports: 20 Imported by: 0

Repository

github.com/dpotapov/go-pages

Documentation ¶

Overview ¶

Package chtml implements an HTML parser used to build CHTML components.

The parser is based on golang.org/x/net/html with the following modifications:

- The Parse function may parse an entire HTML document or a fragment.
- The original ParseFragment function is removed, since it is always context-aware. See the StackOverflow post: https://stackoverflow.com/questions/21421704/using-html-parsefragment-in-a-generic-way
- There is no goal to follow the HTML5 spec.
- The modified package tries to use the upstream `golang.org/x/net/html` package as much as possible.
- The contents of the <noscript> tag are always parsed as HTML nodes. The scripting flag is removed.
- Frameset/frame tags are not handled, to simplify the parser. Those tags are deprecated in HTML5.
- Foreign content is not handled per spec. Nested elements of <svg> and <math> tags are parsed as regular HTML nodes.
- Foster parenting is disabled.
- The Active Formatting Elements algorithm is removed for simplicity and performance.

Tokenization is done by creating a Tokenizer for an io.Reader r. It is the caller's responsibility to ensure that r provides UTF-8 encoded HTML.

z := html.NewTokenizer(r)

Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(), which parses the next token and returns its type, or an error:

for {
	tt := z.Next()
	if tt == html.ErrorToken {
		// z.Err() returns io.EOF when the input ended normally.
		return z.Err()
	}
	// Process the current token.
}

There are two APIs for retrieving the current token. The high-level API is to call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs allow optionally calling Raw after Next but before Token, Text, TagName, or TagAttr. In EBNF notation, the valid call sequence per token is:

Next {Raw} [ Token | Text | TagName {TagAttr} ]

Token returns an independent data structure that completely describes a token. Entities (such as "&lt;") are unescaped, tag names and attribute keys are lower-cased, and attributes are collected into a []Attribute. For example:

for {
	if z.Next() == html.ErrorToken {
		// Returning io.EOF indicates success.
		return z.Err()
	}
	emitToken(z.Token())
}

The low-level API performs fewer allocations and copies, but the contents of the []byte values returned by Text, TagName and TagAttr may change on the next call to Next. For example, to extract an HTML page's anchor text:

depth := 0
for {
	tt := z.Next()
	switch tt {
	case html.ErrorToken:
		return z.Err()
	case html.TextToken:
		if depth > 0 {
			// emitBytes should copy the []byte it receives,
			// if it doesn't process it immediately.
			emitBytes(z.Text())
		}
	case html.StartTagToken, html.EndTagToken:
		tn, _ := z.TagName()
		if len(tn) == 1 && tn[0] == 'a' {
			if tt == html.StartTagToken {
				depth++
			} else {
				depth--
			}
		}
	}
}

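Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML. In this package, Parse also takes an Importer as its second argument and Node.Data is an Expr; the example below is written against the upstream golang.org/x/net/html API. For example, to process each anchor node in depth-first order: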

doc, err := html.Parse(r)
if err != nil {
	// ...
}
var f func(*html.Node)
f = func(n *html.Node) {
	if n.Type == html.ElementNode && n.Data == "a" {
		// Do something with n...
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		f(c)
	}
}
f(doc)

The relevant specifications include: https://html.spec.whatwg.org/multipage/syntax.html and https://html.spec.whatwg.org/multipage/syntax.html#tokenization

Security Considerations ¶

Care should be taken when parsing and interpreting HTML, whether full documents or fragments, within the framework of the HTML specification, especially with regard to untrusted inputs.

This package provides both a tokenizer and a parser. The tokenizer implements the tokenization stage of the WHATWG HTML parsing specification; the parser implements both the tokenization and the tree construction stages. The tokenizer parses and normalizes individual HTML tokens. Only the parser constructs the DOM tree from the tokenized HTML, as described in the tree construction stage of the specification, dynamically modifying or extending the document's DOM tree.

If your use case requires semantically well-formed HTML documents, as defined by the WHATWG specification, the parser should be used rather than the tokenizer.

In security contexts, if trust decisions are being made using the tokenized or parsed content, the input must be re-serialized (for instance by using Render or Token.String) in order for those trust decisions to hold, as the process of tokenization or parsing may alter the content.

Example ¶

s := "<html><body><p>Hello World</p></body></html>"
r := strings.NewReader(s)
docNode, err := Parse(r, nil)
if err != nil {
	panic(err)
}

fmt.Println(docNode)

Output:

Index ¶

Variables
func AnyPlusAny(a any, b any) any
func AnyToHtml(a any) *html.Node
func MarshalScope(s Scope, src any) error
func UnmarshalScope(s Scope, target any) error
type Attribute
type BaseScope
- func NewBaseScope(vars map[string]any) *BaseScope
- func (s *BaseScope) Spawn(vars map[string]any) Scope
- func (s *BaseScope) Touch()
- func (s *BaseScope) Touched() <-chan struct{}
- func (s *BaseScope) Vars() map[string]any
type CAttr
- func (c *CAttr) Render(s Scope) (any, error)
type Component
- func NewComponent(n *Node, opts *ComponentOptions) Component
type ComponentError
- func (e *ComponentError) Error() string
- func (e *ComponentError) HTMLContext() string
- func (e *ComponentError) Unwrap() error
type ComponentOptions
type DecodeError
- func (e *DecodeError) Error() string
- func (e *DecodeError) Is(target error) bool
- func (e *DecodeError) Unwrap() error
type Disposable
type Expr
- func NewExpr(s string, args map[string]any) (Expr, error)
- func NewExprConst(v any) Expr
- func NewExprInterpol(s string, args map[string]any) (Expr, error)
- func NewExprRaw(s string) Expr
- func (e Expr) IsEmpty() bool
- func (e Expr) RawString() string
- func (e Expr) Value(vm *vm.VM, env any) (any, error)
type Importer
type Node
- func Parse(r io.Reader, imp Importer) (*Node, error)
- func (n *Node) AppendChild(c *Node)
- func (n *Node) InsertBefore(newChild, oldChild *Node)
- func (n *Node) IsWhitespace() bool
- func (n *Node) RemoveChild(c *Node)
type Scope
type UnrecognizedArgumentError
- func (e *UnrecognizedArgumentError) Error() string
- func (e *UnrecognizedArgumentError) Is(target error) bool

Examples ¶

Package

Constants ¶

This section is empty.

Variables ¶

var (
	// ErrComponentNotFound is returned by Importer implementations when a component is not found.
	ErrComponentNotFound = errors.New("component not found")

	// ErrImportNotAllowed is returned when an Importer is not set for the component.
	ErrImportNotAllowed = errors.New("imports are not allowed")
)

Functions ¶

func AnyPlusAny ¶

func AnyPlusAny(a any, b any) any

func AnyToHtml ¶

func AnyToHtml(a any) *html.Node

func MarshalScope ¶

func MarshalScope(s Scope, src any) error

MarshalScope stores the variables from the source in the scope. The source must be a struct or a map. The function returns an error if the source is not a struct or a map or if the source variables cannot be stored in the scope.
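The contract described above (a struct or a map is accepted, anything else is an error) can be illustrated with a small stand-alone sketch. It does not call this package: `storeVars` is a hypothetical helper that writes into a plain map[string]any, and keying struct fields by their Go field name is an assumption, since the documentation does not say how MarshalScope names variables.

```go
package main

import (
	"fmt"
	"reflect"
)

// storeVars is a hypothetical stand-in for the documented contract of
// MarshalScope: it copies variables from src into vars and rejects any
// src that is not a struct or a map with string keys.
// Assumption: struct fields are stored under their Go field names.
func storeVars(vars map[string]any, src any) error {
	v := reflect.ValueOf(src)
	switch v.Kind() {
	case reflect.Struct:
		t := v.Type()
		for i := 0; i < t.NumField(); i++ {
			f := t.Field(i)
			if !f.IsExported() {
				continue // unexported fields cannot be read through reflection
			}
			vars[f.Name] = v.Field(i).Interface()
		}
		return nil
	case reflect.Map:
		if v.Type().Key().Kind() != reflect.String {
			return fmt.Errorf("map keys must be strings, got %s", v.Type().Key())
		}
		iter := v.MapRange()
		for iter.Next() {
			vars[iter.Key().String()] = iter.Value().Interface()
		}
		return nil
	default:
		return fmt.Errorf("source must be a struct or a map, got %T", src)
	}
}

func main() {
	type page struct {
		Title string
		Count int
	}
	vars := map[string]any{}
	if err := storeVars(vars, page{Title: "Home", Count: 3}); err != nil {
		panic(err)
	}
	fmt.Println(vars["Title"], vars["Count"])

	// An int is neither a struct nor a map, so an error is reported.
	fmt.Println(storeVars(vars, 42))
}
```

With the real package, the equivalent call would pass a Scope rather than a map as the first argument.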

func UnmarshalScope ¶

func UnmarshalScope(s Scope, target any) error

UnmarshalScope reads the variables from the scope and converts them to a provided target. The target must be a pointer to a struct or a map. The function returns an error if the target is not a pointer or if the scope variables cannot be converted to the target.

Types ¶

type Attribute ¶

type Attribute struct {
	Namespace string
	Key       string
	Val       Expr
}

type BaseScope ¶

type BaseScope struct {
	// contains filtered or unexported fields
}

BaseScope is a base implementation of the Scope interface. For extra functionality, this type can be wrapped (embedded) in a custom scope implementation.

func NewBaseScope ¶

func NewBaseScope(vars map[string]any) *BaseScope

func (*BaseScope) Spawn ¶

func (s *BaseScope) Spawn(vars map[string]any) Scope

Spawn creates a new child scope. If the current scope is closed, the new scope is also closed.

func (*BaseScope) Touch ¶

func (s *BaseScope) Touch()

func (*BaseScope) Touched ¶

func (s *BaseScope) Touched() <-chan struct{}

func (*BaseScope) Vars ¶

func (s *BaseScope) Vars() map[string]any

type CAttr ¶

type CAttr struct{}

func (*CAttr) Render ¶

func (c *CAttr) Render(s Scope) (any, error)

type Component ¶

type Component interface {
	// Render transforms the input data from the scope into another data object, typically
	// an HTML document (*html.Node) or anything else that can be sent over the wire or
	// passed to another Component as an input.
	Render(s Scope) (any, error)
}
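As a rough illustration of the interface above, the following sketch implements a custom component. To stay self-contained it declares local copies of the Scope and Component interfaces; real code would use the package's own types. `mapScope`, `greeting`, and `render` are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// Scope and Component mirror the interfaces documented in this package so
// that the sketch compiles on its own.
type Scope interface {
	Spawn(vars map[string]any) Scope
	Vars() map[string]any
	Touch()
}

type Component interface {
	Render(s Scope) (any, error)
}

// mapScope is a hypothetical, minimal Scope backed by a map.
type mapScope struct{ vars map[string]any }

func (m *mapScope) Spawn(vars map[string]any) Scope { return &mapScope{vars: vars} }
func (m *mapScope) Vars() map[string]any            { return m.vars }
func (m *mapScope) Touch()                          {}

// greeting is a hypothetical component that turns the "name" variable
// from the scope into a greeting string.
type greeting struct{}

func (greeting) Render(s Scope) (any, error) {
	name, ok := s.Vars()["name"].(string)
	if !ok {
		return nil, fmt.Errorf("greeting: variable %q must be a string", "name")
	}
	return "Hello, " + name + "!", nil
}

// render runs a component against a fresh scope holding vars.
func render(c Component, vars map[string]any) (any, error) {
	return c.Render(&mapScope{vars: vars})
}

func main() {
	out, err := render(greeting{}, map[string]any{"name": "World"})
	fmt.Println(out, err)
}
```

Returning `any` lets one component produce a string, an *html.Node, or a value consumed by another component, which is the flexibility the interface comment describes.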

func NewComponent ¶

func NewComponent(n *Node, opts *ComponentOptions) Component

type ComponentError ¶

type ComponentError struct {
	// contains filtered or unexported fields
}

func (*ComponentError) Error ¶

func (e *ComponentError) Error() string

func (*ComponentError) HTMLContext ¶

func (e *ComponentError) HTMLContext() string

func (*ComponentError) Unwrap ¶

func (e *ComponentError) Unwrap() error

type ComponentOptions ¶

type ComponentOptions struct {
	// Importer is the factory for components. It is invoked when a <c:NAME> element is encountered.
	Importer Importer

	// RenderComments is a flag to enable rendering of comments
	RenderComments bool
}

type DecodeError ¶

type DecodeError struct {
	Key string
	Err error
}

func (*DecodeError) Error ¶

func (e *DecodeError) Error() string

func (*DecodeError) Is ¶

func (e *DecodeError) Is(target error) bool

func (*DecodeError) Unwrap ¶

func (e *DecodeError) Unwrap() error
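Because DecodeError exposes Key and Err and provides Unwrap, callers can presumably inspect it with the standard errors package. The sketch below uses a local copy of the type; the Error message format and the `errNotANumber` cause are assumptions made for the example, and the package's own Is method is not modeled because its matching rules are not documented.

```go
package main

import (
	"errors"
	"fmt"
)

// DecodeError mirrors the exported fields documented above. Only Key, Err
// and the method names come from the documentation; the message format is
// an assumption.
type DecodeError struct {
	Key string
	Err error
}

func (e *DecodeError) Error() string { return fmt.Sprintf("decode %q: %v", e.Key, e.Err) }
func (e *DecodeError) Unwrap() error { return e.Err }

// errNotANumber is a hypothetical underlying cause.
var errNotANumber = errors.New("not a number")

// failedKey reports which key failed to decode when err wraps a *DecodeError.
func failedKey(err error) (string, bool) {
	var de *DecodeError
	if errors.As(err, &de) {
		return de.Key, true
	}
	return "", false
}

func main() {
	err := fmt.Errorf("render page: %w", &DecodeError{Key: "count", Err: errNotANumber})

	key, ok := failedKey(err)
	fmt.Println(key, ok)

	// Unwrap lets errors.Is reach the underlying cause through both wrappers.
	fmt.Println(errors.Is(err, errNotANumber))
}
```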

type Disposable ¶

type Disposable interface {
	// Dispose releases any resources held by the component.
	// It should be called when the component is no longer needed to prevent resource leaks.
	// If an error occurs during disposal, it should be returned.
	Dispose() error
}

Disposable is an optional interface for components that require explicit resource cleanup. Components that allocate resources such as files, network connections, or memory buffers should implement this interface to release those resources when they are no longer needed.
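Since the interface is optional, a caller would typically discover it with a type assertion. The sketch below shows that pattern with a local copy of the interface; `tempBuffer` and `disposeIfNeeded` are hypothetical names, and how this package itself invokes Dispose is not shown here.

```go
package main

import "fmt"

// Disposable mirrors the interface documented above.
type Disposable interface {
	Dispose() error
}

// tempBuffer is a hypothetical value that holds a resource (a byte buffer).
type tempBuffer struct {
	buf    []byte
	closed bool
}

// Dispose releases the buffer. Calling it again is harmless.
func (t *tempBuffer) Dispose() error {
	if t.closed {
		return nil
	}
	t.buf = nil
	t.closed = true
	return nil
}

// disposeIfNeeded releases v's resources when v implements Disposable and
// does nothing otherwise, because the interface is optional.
func disposeIfNeeded(v any) error {
	if d, ok := v.(Disposable); ok {
		return d.Dispose()
	}
	return nil
}

func main() {
	b := &tempBuffer{buf: make([]byte, 1024)}
	fmt.Println(disposeIfNeeded(b), b.closed)

	// A value without a Dispose method is simply skipped.
	fmt.Println(disposeIfNeeded("plain value"))
}
```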

type Expr ¶

type Expr struct {
	// contains filtered or unexported fields
}

Expr is a struct to hold interpolated string data for the CHTML nodes.

func NewExpr ¶

func NewExpr(s string, args map[string]any) (Expr, error)

func NewExprConst ¶

func NewExprConst(v any) Expr

func NewExprInterpol ¶

func NewExprInterpol(s string, args map[string]any) (Expr, error)

func NewExprRaw ¶

func NewExprRaw(s string) Expr

NewExprRaw creates an Expr with a raw string, no interpolation.

func (Expr) IsEmpty ¶

func (e Expr) IsEmpty() bool

func (Expr) RawString ¶

func (e Expr) RawString() string

func (Expr) Value ¶

func (e Expr) Value(vm *vm.VM, env any) (any, error)

type Importer ¶

type Importer interface {
	Import(name string) (Component, error)
}

Importer acts as a factory for components. It is invoked when a <c:NAME> element is encountered.
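A minimal sketch of such a factory, assuming the name passed to Import is the NAME part of a <c:NAME> element. The interfaces and ErrComponentNotFound are redeclared locally so the example runs on its own; `mapImporter`, `staticText`, and `isNotFound` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of the documented declarations, so the sketch is self-contained.
var ErrComponentNotFound = errors.New("component not found")

type Scope interface {
	Spawn(vars map[string]any) Scope
	Vars() map[string]any
	Touch()
}

type Component interface {
	Render(s Scope) (any, error)
}

type Importer interface {
	Import(name string) (Component, error)
}

// staticText is a hypothetical component that always renders the same string.
type staticText string

func (t staticText) Render(Scope) (any, error) { return string(t), nil }

// mapImporter is a hypothetical Importer that resolves names from a map.
type mapImporter map[string]Component

func (m mapImporter) Import(name string) (Component, error) {
	c, ok := m[name]
	if !ok {
		// Wrap the sentinel so callers can still match it with errors.Is.
		return nil, fmt.Errorf("import %q: %w", name, ErrComponentNotFound)
	}
	return c, nil
}

// isNotFound reports whether err means the component does not exist.
func isNotFound(err error) bool { return errors.Is(err, ErrComponentNotFound) }

func main() {
	var imp Importer = mapImporter{"footer": staticText("footer text")}

	c, err := imp.Import("footer")
	if err != nil {
		panic(err)
	}
	out, _ := c.Render(nil) // staticText ignores its scope
	fmt.Println(out)

	_, err = imp.Import("sidebar")
	fmt.Println(isNotFound(err))
}
```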

type Node ¶

type Node struct {
	// The following fields are replicated from golang.org/x/net/html.Node.
	Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node

	Type      html.NodeType
	DataAtom  atom.Atom
	Data      Expr
	Namespace string

	// Attr is the list of attributes for the node. Also includes c:attr elements.
	Attr []Attribute

	// Cond is the value of c:if attribute. The c:if attribute itself is not included in Attr.
	Cond Expr

	// PrevCond is the previous c:else-if or c:if node in the condition chain. It is not used
	// during rendering (only NextCond is), but is useful for testing.
	// NextCond is the next c:else-if or c:else node in the condition chain.
	PrevCond, NextCond *Node

	// Loop is the value of c:for attribute. The c:for attribute itself is not included in Attr.
	Loop Expr

	// LoopIdx is the index variable name for c:for loops.
	LoopIdx string

	// LoopVar is the value variable name for c:for loops.
	LoopVar string
}

func Parse ¶

func Parse(r io.Reader, imp Importer) (*Node, error)

Parse returns the parsed *Node tree for the HTML from the given Reader. The input is assumed to be UTF-8 encoded.

func (*Node) AppendChild ¶

func (n *Node) AppendChild(c *Node)

AppendChild adds a node c as a child of n.

It will panic if c already has a parent or siblings.

func (*Node) InsertBefore ¶

func (n *Node) InsertBefore(newChild, oldChild *Node)

InsertBefore inserts newChild as a child of n, immediately before oldChild in the sequence of n's children. oldChild may be nil, in which case newChild is appended to the end of n's children.

It will panic if newChild already has a parent or siblings.

func (*Node) IsWhitespace ¶

func (n *Node) IsWhitespace() bool

func (*Node) RemoveChild ¶

func (n *Node) RemoveChild(c *Node)

RemoveChild removes a node c that is a child of n. Afterwards, c will have no parent and no siblings.

It will panic if c's parent is not n.
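The documented behaviour of AppendChild, InsertBefore, and RemoveChild can be sketched on a reduced node type that keeps only the five link fields and a string payload. This follows the sibling-list approach used by golang.org/x/net/html; the local `node` type and the `children` helper are illustrative and are not this package's implementation.

```go
package main

import "fmt"

// node is a reduced, hypothetical version of Node with only the link fields.
type node struct {
	Parent, FirstChild, LastChild, PrevSibling, NextSibling *node
	Data                                                    string
}

// InsertBefore links newChild into n's children before oldChild, or at the
// end when oldChild is nil. It panics if newChild is already attached.
func (n *node) InsertBefore(newChild, oldChild *node) {
	if newChild.Parent != nil || newChild.PrevSibling != nil || newChild.NextSibling != nil {
		panic("InsertBefore called for an attached child node")
	}
	var prev, next *node
	if oldChild != nil {
		prev, next = oldChild.PrevSibling, oldChild
	} else {
		prev = n.LastChild
	}
	if prev != nil {
		prev.NextSibling = newChild
	} else {
		n.FirstChild = newChild
	}
	if next != nil {
		next.PrevSibling = newChild
	} else {
		n.LastChild = newChild
	}
	newChild.Parent, newChild.PrevSibling, newChild.NextSibling = n, prev, next
}

// AppendChild adds c as the last child of n.
func (n *node) AppendChild(c *node) { n.InsertBefore(c, nil) }

// RemoveChild unlinks c from n and clears c's parent and sibling pointers.
func (n *node) RemoveChild(c *node) {
	if c.Parent != n {
		panic("RemoveChild called for a non-child node")
	}
	if n.FirstChild == c {
		n.FirstChild = c.NextSibling
	}
	if c.NextSibling != nil {
		c.NextSibling.PrevSibling = c.PrevSibling
	}
	if n.LastChild == c {
		n.LastChild = c.PrevSibling
	}
	if c.PrevSibling != nil {
		c.PrevSibling.NextSibling = c.NextSibling
	}
	c.Parent, c.PrevSibling, c.NextSibling = nil, nil, nil
}

// children lists the Data of n's direct children in order.
func children(n *node) []string {
	var out []string
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		out = append(out, c.Data)
	}
	return out
}

func main() {
	ul := &node{Data: "ul"}
	a, b, c := &node{Data: "a"}, &node{Data: "b"}, &node{Data: "c"}
	ul.AppendChild(a)
	ul.AppendChild(c)
	ul.InsertBefore(b, c)
	fmt.Println(children(ul)) // [a b c]

	ul.RemoveChild(b) // b is detached, so it may be attached again
	ul.AppendChild(b)
	fmt.Println(children(ul)) // [a c b]
}
```

The panics mirror the documented preconditions: a node must be detached before it is inserted, which is why RemoveChild clears all three pointers on the removed node.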

type Scope ¶

type Scope interface {
	// Spawn creates a new child scope. It is initialized with variables that can be accessed from
	// the Component's Render() method using the Scope.Vars() method.
	Spawn(vars map[string]any) Scope

	// Vars provides access to variables stored in the scope.
	Vars() map[string]any

	// Touch marks the component as changed. The implementation should re-render the page
	// when this method is called.
	Touch()
}

Scope defines an interface for managing arguments in a CHTML component. Scopes are organized in a hierarchical structure, with each scope potentially having a parent scope and multiple child scopes. Changes in a child scope propagate to its parent scope.

A scope is closed when its associated component will not be rendered further. This occurs either when the HTTP request completes or when the component is removed from the page (e.g., due to c:if or c:for directives). Closing a parent scope results in the closure of all its child scopes.

The CHTML component creates new scopes for each loop iteration, conditional branch, and component import using the Spawn method.

This interface allows for custom implementations of components, enabling the inclusion of additional data such as HTTP request or WebSocket connection information.
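One way to carry such additional data is to embed a base scope in a custom type, as suggested in the BaseScope description. The sketch below is an assumption-laden illustration: `baseScope` is a simplified local stand-in for BaseScope (it does not model closing or propagation to parents), and `requestScope`, `Path`, and `pathOf` are hypothetical.

```go
package main

import "fmt"

// Scope mirrors the interface documented above.
type Scope interface {
	Spawn(vars map[string]any) Scope
	Vars() map[string]any
	Touch()
}

// baseScope is a simplified stand-in for this package's BaseScope so that
// the sketch runs on its own. It only stores variables and counts touches.
type baseScope struct {
	vars    map[string]any
	touched int
}

func (b *baseScope) Spawn(vars map[string]any) Scope { return &baseScope{vars: vars} }
func (b *baseScope) Vars() map[string]any            { return b.vars }
func (b *baseScope) Touch()                          { b.touched++ }

// requestScope is a hypothetical custom scope that carries request data.
type requestScope struct {
	*baseScope
	Path string // hypothetical extra data, e.g. taken from an HTTP request
}

// Spawn is overridden so that child scopes keep the extra data. Without it,
// the promoted Spawn would return a plain *baseScope and Path would be lost.
func (r *requestScope) Spawn(vars map[string]any) Scope {
	return &requestScope{baseScope: &baseScope{vars: vars}, Path: r.Path}
}

// pathOf extracts the request path from a scope derived from a requestScope.
func pathOf(s Scope) (string, bool) {
	rs, ok := s.(*requestScope)
	if !ok {
		return "", false
	}
	return rs.Path, true
}

func main() {
	root := &requestScope{
		baseScope: &baseScope{vars: map[string]any{"title": "Home"}},
		Path:      "/index",
	}
	// One child scope per loop iteration, as described above.
	for i, item := range []string{"a", "b"} {
		child := root.Spawn(map[string]any{"i": i, "item": item})
		p, _ := pathOf(child)
		fmt.Println(child.Vars()["item"], p)
	}
}
```

Overriding Spawn is the design point worth noting: method promotion alone satisfies the interface, but children created by the embedded implementation would not carry the custom fields.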

type UnrecognizedArgumentError ¶

type UnrecognizedArgumentError struct {
	Name string
}

func (*UnrecognizedArgumentError) Error ¶

func (e *UnrecognizedArgumentError) Error() string

func (*UnrecognizedArgumentError) Is ¶

func (e *UnrecognizedArgumentError) Is(target error) bool
