Documentation ¶
Overview ¶
Package chtml implements an HTML parser to be used by the `chtml` package.
The parser is based on golang.org/x/net/html, with the following modifications:
- The Parse function may parse an entire HTML document or a fragment.
- The original ParseFragment function is removed, since it is always context-aware. See StackOverflow post: https://stackoverflow.com/questions/21421704/using-html-parsefragment-in-a-generic-way
- Following the HTML5 spec is not a goal.
- The modified package tries to use the upstream `golang.org/x/net/html` package as much as possible.
- The contents of the <noscript> tag are always parsed as HTML nodes. The scripting flag is removed.
- Frameset/frame tags are not handled to simplify the parser. Those tags are deprecated in HTML5.
- Foreign content is not handled per spec. Nested elements of <svg> and <math> tags are parsed as regular HTML nodes.
- Foster parenting is disabled.
- The Active Formatting Elements algorithm is removed for simplicity and performance.
Tokenization is done by creating a Tokenizer for an io.Reader r. It is the caller's responsibility to ensure that r provides UTF-8 encoded HTML.
z := html.NewTokenizer(r)
Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(), which parses the next token and returns its type, or an error:
for {
	tt := z.Next()
	if tt == html.ErrorToken {
		// ...
		return ...
	}
	// Process the current token.
}
There are two APIs for retrieving the current token. The high-level API is to call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs allow optionally calling Raw after Next but before Token, Text, TagName, or TagAttr. In EBNF notation, the valid call sequence per token is:
Next {Raw} [ Token | Text | TagName {TagAttr} ]
Token returns an independent data structure that completely describes a token. Entities (such as "&lt;") are unescaped, tag names and attribute keys are lower-cased, and attributes are collected into a []Attribute. For example:
for {
	if z.Next() == html.ErrorToken {
		// Returning io.EOF indicates success.
		return z.Err()
	}
	emitToken(z.Token())
}
The low-level API performs fewer allocations and copies, but the contents of the []byte values returned by Text, TagName and TagAttr may change on the next call to Next. For example, to extract an HTML page's anchor text:
depth := 0
for {
	tt := z.Next()
	switch tt {
	case html.ErrorToken:
		return z.Err()
	case html.TextToken:
		if depth > 0 {
			// emitBytes should copy the []byte it receives,
			// if it doesn't process it immediately.
			emitBytes(z.Text())
		}
	case html.StartTagToken, html.EndTagToken:
		tn, _ := z.TagName()
		if len(tn) == 1 && tn[0] == 'a' {
			if tt == html.StartTagToken {
				depth++
			} else {
				depth--
			}
		}
	}
}
Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML. For example, to process each anchor node in depth-first order:
doc, err := html.Parse(r)
if err != nil {
	// ...
}
var f func(*html.Node)
f = func(n *html.Node) {
	if n.Type == html.ElementNode && n.Data == "a" {
		// Do something with n...
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		f(c)
	}
}
f(doc)
The relevant specifications include: https://html.spec.whatwg.org/multipage/syntax.html and https://html.spec.whatwg.org/multipage/syntax.html#tokenization
Security Considerations ¶
Care should be taken when parsing and interpreting HTML, whether full documents or fragments, within the framework of the HTML specification, especially with regard to untrusted inputs.
This package provides both a tokenizer and a parser, which implement the tokenization, and tokenization and tree construction stages of the WHATWG HTML parsing specification respectively. While the tokenizer parses and normalizes individual HTML tokens, only the parser constructs the DOM tree from the tokenized HTML, as described in the tree construction stage of the specification, dynamically modifying or extending the document's DOM tree.
If your use case requires semantically well-formed HTML documents, as defined by the WHATWG specification, the parser should be used rather than the tokenizer.
In security contexts, if trust decisions are being made using the tokenized or parsed content, the input must be re-serialized (for instance by using Render or Token.String) in order for those trust decisions to hold, as the process of tokenization or parsing may alter the content.
Example ¶
s := "<html><body><p>Hello World</p></body></html>"
r := strings.NewReader(s)
docNode, err := Parse(r, nil)
if err != nil {
	panic(err)
}
fmt.Println(docNode)
Output:
Index ¶
- Variables
- func AnyPlusAny(a any, b any) any
- func AnyToHtml(a any) *html.Node
- func MarshalScope(s Scope, src any) error
- func UnmarshalScope(s Scope, target any) error
- type Attribute
- type BaseScope
- type CAttr
- type Component
- type ComponentError
- type ComponentOptions
- type DecodeError
- type Disposable
- type Expr
- type Importer
- type Node
- type Scope
- type UnrecognizedArgumentError
Variables ¶
var (
	// ErrComponentNotFound is returned by Importer implementations when a component is not found.
	ErrComponentNotFound = errors.New("component not found")

	// ErrImportNotAllowed is returned when an Importer is not set for the component.
	ErrImportNotAllowed = errors.New("imports are not allowed")
)
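Because these are sentinel errors, callers can match them with errors.Is even after they have been wrapped. The following is a minimal sketch under stated assumptions: the two variables are redeclared locally so the program compiles on its own, and lookup is a hypothetical helper, not part of this package's API.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the sentinels documented above.
var (
	ErrComponentNotFound = errors.New("component not found")
	ErrImportNotAllowed  = errors.New("imports are not allowed")
)

// lookup is a hypothetical helper (an assumption for illustration only).
// A nil registry models "no importer configured". A missing entry wraps
// ErrComponentNotFound with %w so errors.Is can still match it.
func lookup(registry map[string]string, name string) (string, error) {
	if registry == nil {
		return "", ErrImportNotAllowed
	}
	src, ok := registry[name]
	if !ok {
		return "", fmt.Errorf("import %q: %w", name, ErrComponentNotFound)
	}
	return src, nil
}

func main() {
	registry := map[string]string{"header": "<h1>Title</h1>"}

	_, err := lookup(registry, "footer")
	fmt.Println(errors.Is(err, ErrComponentNotFound)) // true

	_, err = lookup(nil, "header")
	fmt.Println(errors.Is(err, ErrImportNotAllowed)) // true
}
```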
Functions ¶
func AnyPlusAny ¶
func MarshalScope ¶
MarshalScope stores the variables from the source in the scope. The source must be a struct or a map. The function returns an error if the source is not a struct or a map or if the source variables cannot be stored in the scope.
func UnmarshalScope ¶
UnmarshalScope reads the variables from the scope and converts them to a provided target. The target must be a pointer to a struct or a map. The function returns an error if the target is not a pointer or if the scope variables cannot be converted to the target.
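The struct-or-map contract described for MarshalScope can be illustrated with reflection. The sketch below is not the package's implementation: toVars is a hypothetical helper, and using field names verbatim as variable names is an assumption, since the naming rules the package applies are not documented here.

```go
package main

import (
	"fmt"
	"reflect"
)

// toVars is a hypothetical helper that flattens a struct or a string-keyed map
// into a variable map and rejects every other kind of source.
// Assumption: exported field names are used verbatim as variable names.
func toVars(src any) (map[string]any, error) {
	v := reflect.ValueOf(src)
	// Follow pointers so that *T is accepted as well as T.
	for v.Kind() == reflect.Pointer {
		if v.IsNil() {
			return nil, fmt.Errorf("source is a nil pointer")
		}
		v = v.Elem()
	}
	vars := make(map[string]any)
	switch v.Kind() {
	case reflect.Struct:
		t := v.Type()
		for i := 0; i < t.NumField(); i++ {
			// Unexported fields cannot be read through Interface, so skip them.
			if f := t.Field(i); f.IsExported() {
				vars[f.Name] = v.Field(i).Interface()
			}
		}
	case reflect.Map:
		if v.Type().Key().Kind() != reflect.String {
			return nil, fmt.Errorf("map keys must be strings, got %s", v.Type().Key())
		}
		iter := v.MapRange()
		for iter.Next() {
			vars[iter.Key().String()] = iter.Value().Interface()
		}
	default:
		return nil, fmt.Errorf("source must be a struct or a map, got %T", src)
	}
	return vars, nil
}

func main() {
	type page struct {
		Title string
		Count int
	}
	vars, err := toVars(page{Title: "Home", Count: 3})
	fmt.Println(vars, err) // map[Count:3 Title:Home] <nil>

	_, err = toVars(42)
	fmt.Println(err) // source must be a struct or a map, got int
}
```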
Types ¶
type BaseScope ¶
type BaseScope struct {
// contains filtered or unexported fields
}
BaseScope is a base implementation of the Scope interface. For extra functionality, this type can be wrapped (embedded) in a custom scope implementation.
func NewBaseScope ¶
type Component ¶
type Component interface {
	// Render transforms the input data from the scope into another data object, typically
	// an HTML document (*html.Node) or anything else that can be sent over the wire or
	// passed to another Component as an input.
	Render(s Scope) (any, error)
}
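As an illustration of the interface, here is a minimal sketch of a custom component. The Scope and Component interfaces are copied locally from this page so the program is self-contained; staticScope and greeting are hypothetical types invented for the example.

```go
package main

import "fmt"

// Scope and Component mirror the interfaces documented for this package.
type Scope interface {
	Spawn(vars map[string]any) Scope
	Vars() map[string]any
	Touch()
}

type Component interface {
	Render(s Scope) (any, error)
}

// staticScope is a hypothetical, minimal Scope used only to drive the example.
type staticScope struct{ vars map[string]any }

func (s *staticScope) Spawn(vars map[string]any) Scope { return &staticScope{vars: vars} }
func (s *staticScope) Vars() map[string]any            { return s.vars }
func (s *staticScope) Touch()                          {}

// greeting is a hypothetical component that renders a plain string from the
// "name" variable and reports an error when the variable is missing.
type greeting struct{}

func (greeting) Render(s Scope) (any, error) {
	name, ok := s.Vars()["name"].(string)
	if !ok {
		return nil, fmt.Errorf("greeting: variable %q must be a string", "name")
	}
	return "Hello, " + name + "!", nil
}

func main() {
	var c Component = greeting{}
	out, err := c.Render(&staticScope{vars: map[string]any{"name": "World"}})
	fmt.Println(out, err) // Hello, World! <nil>
}
```

Render returns any rather than a concrete node type, so the output of one component can be passed to another as input, as the interface comment describes.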
func NewComponent ¶
func NewComponent(n *Node, opts *ComponentOptions) Component
type ComponentError ¶
type ComponentError struct {
// contains filtered or unexported fields
}
func (*ComponentError) Error ¶
func (e *ComponentError) Error() string
func (*ComponentError) HTMLContext ¶
func (e *ComponentError) HTMLContext() string
func (*ComponentError) Unwrap ¶
func (e *ComponentError) Unwrap() error
type ComponentOptions ¶
type DecodeError ¶
func (*DecodeError) Error ¶
func (e *DecodeError) Error() string
func (*DecodeError) Is ¶
func (e *DecodeError) Is(target error) bool
func (*DecodeError) Unwrap ¶
func (e *DecodeError) Unwrap() error
type Disposable ¶
type Disposable interface {
	// Dispose releases any resources held by the component.
	// It should be called when the component is no longer needed to prevent resource leaks.
	// If an error occurs during disposal, it should be returned.
	Dispose() error
}
Disposable is an optional interface for components that require explicit resource cleanup. Components that allocate resources such as files, network connections, or memory buffers should implement this interface to release those resources when they are no longer needed.
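Since the interface is optional, a caller would typically discover it with a type assertion. The sketch below assumes a hypothetical fileComponent and a hypothetical disposeIfNeeded helper; neither is part of this package.

```go
package main

import (
	"errors"
	"fmt"
)

// Disposable mirrors the interface documented above.
type Disposable interface {
	Dispose() error
}

// fileComponent is a hypothetical component that holds a resource.
type fileComponent struct {
	closed bool
}

// Dispose marks the resource as released and rejects a second call.
func (c *fileComponent) Dispose() error {
	if c.closed {
		return errors.New("fileComponent: already disposed")
	}
	c.closed = true
	return nil
}

// disposeIfNeeded is a hypothetical helper: it releases resources only when v
// implements Disposable and is a no-op otherwise.
func disposeIfNeeded(v any) error {
	if d, ok := v.(Disposable); ok {
		return d.Dispose()
	}
	return nil
}

func main() {
	c := &fileComponent{}
	fmt.Println(disposeIfNeeded(c))       // <nil>
	fmt.Println(c.closed)                 // true
	fmt.Println(disposeIfNeeded("plain")) // <nil>
	fmt.Println(disposeIfNeeded(c))       // fileComponent: already disposed
}
```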
type Expr ¶
type Expr struct {
// contains filtered or unexported fields
}
Expr is a struct to hold interpolated string data for the CHTML nodes.
func NewExprConst ¶
func NewExprRaw ¶
NewExprRaw creates an Expr with a raw string, no interpolation.
type Importer ¶
Importer acts as a factory for components. It is invoked when a <c:NAME> element is encountered.
type Node ¶
type Node struct {
// The following fields are replicated from golang.org/x/net/html.Node.
Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node
Type html.NodeType
DataAtom atom.Atom
Data Expr
Namespace string
// Attr is the list of attributes for the node. Also includes c:attr elements.
Attr []Attribute
// Cond is the value of c:if attribute. The c:if attribute itself is not included in Attr.
Cond Expr
// PrevCond is the previous c:if or c:else-if node in the condition chain. Rendering does
// not use it (only NextCond), but it is useful in tests.
// NextCond is the next c:else-if or c:else node in the condition chain.
PrevCond, NextCond *Node
// Loop is the value of c:for attribute. The c:for attribute itself is not included in Attr.
Loop Expr
// LoopIdx is the index variable name for c:for loops.
LoopIdx string
// LoopVar is the value variable name for c:for loops.
LoopVar string
}
func Parse ¶
Parse returns the parsed *Node tree for the HTML from the given Reader. The input is assumed to be UTF-8 encoded.
func (*Node) AppendChild ¶
AppendChild adds a node c as a child of n.
It will panic if c already has a parent or siblings.
func (*Node) InsertBefore ¶
InsertBefore inserts newChild as a child of n, immediately before oldChild in the sequence of n's children. oldChild may be nil, in which case newChild is appended to the end of n's children.
It will panic if newChild already has a parent or siblings.
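The contract above can be sketched on a simplified tree. The node type below is a local stand-in that keeps only the link fields, and insertBefore follows the approach used by golang.org/x/net/html; whether this package's method is implemented identically is an assumption.

```go
package main

import "fmt"

// node is a local stand-in that keeps only the link fields of Node.
type node struct {
	Parent, FirstChild, LastChild, PrevSibling, NextSibling *node
	Data                                                    string
}

// insertBefore links newChild into n's child list immediately before oldChild,
// or at the end when oldChild is nil. It panics if newChild is already attached.
func (n *node) insertBefore(newChild, oldChild *node) {
	if newChild.Parent != nil || newChild.PrevSibling != nil || newChild.NextSibling != nil {
		panic("insertBefore called for an attached child node")
	}
	var prev, next *node
	if oldChild != nil {
		prev, next = oldChild.PrevSibling, oldChild
	} else {
		prev = n.LastChild
	}
	// Patch the neighbours, or the parent's head/tail pointers at the edges.
	if prev != nil {
		prev.NextSibling = newChild
	} else {
		n.FirstChild = newChild
	}
	if next != nil {
		next.PrevSibling = newChild
	} else {
		n.LastChild = newChild
	}
	newChild.Parent = n
	newChild.PrevSibling = prev
	newChild.NextSibling = next
}

// children lists the Data of n's children in document order.
func (n *node) children() []string {
	var out []string
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		out = append(out, c.Data)
	}
	return out
}

func main() {
	root := &node{Data: "ul"}
	a, b, c := &node{Data: "a"}, &node{Data: "b"}, &node{Data: "c"}
	root.insertBefore(a, nil) // append
	root.insertBefore(c, nil) // append
	root.insertBefore(b, c)   // insert between a and c
	fmt.Println(root.children()) // [a b c]
}
```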
func (*Node) IsWhitespace ¶
func (*Node) RemoveChild ¶
RemoveChild removes a node c that is a child of n. Afterwards, c will have no parent and no siblings.
It will panic if c's parent is not n.
type Scope ¶
type Scope interface {
	// Spawn creates a new child scope. It is initialized with variables that can be accessed from
	// the Component's Render() method using the Scope.Vars() method.
	Spawn(vars map[string]any) Scope

	// Vars provides access to variables stored in the scope.
	Vars() map[string]any

	// Touch marks the component as changed. The implementation should re-render the page
	// when this method is called.
	Touch()
}
Scope defines an interface for managing arguments in a CHTML component. Scopes are organized in a hierarchical structure, with each scope potentially having a parent scope and multiple child scopes. Changes in a child scope propagate to its parent scope.
A scope is closed when its associated component will not be rendered further. This occurs either when the HTTP request completes or when the component is removed from the page (e.g., due to c:if or c:for directives). Closing a parent scope results in the closure of all its child scopes.
The CHTML component creates new scopes for each loop iteration, conditional branch, and component import using the Spawn method.
This interface allows for custom implementations of components, enabling the inclusion of additional data such as HTTP request or WebSocket connection information.
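To make the hierarchy concrete, here is a minimal sketch of a custom scope in which a change in a child is propagated to its ancestors. The Scope interface is copied locally from this page; treeScope and its touched counter are hypothetical, and counting touches stands in for whatever re-render signal a real implementation would send.

```go
package main

import "fmt"

// Scope mirrors the interface documented above.
type Scope interface {
	Spawn(vars map[string]any) Scope
	Vars() map[string]any
	Touch()
}

// treeScope is a hypothetical Scope that remembers its parent so that a Touch
// on any descendant is visible at every ancestor up to the root.
type treeScope struct {
	parent  *treeScope
	vars    map[string]any
	touched int // number of change notifications seen by this scope
}

// Spawn creates a child scope that points back at its parent.
func (s *treeScope) Spawn(vars map[string]any) Scope {
	return &treeScope{parent: s, vars: vars}
}

// Vars returns the variables stored in this scope only.
func (s *treeScope) Vars() map[string]any { return s.vars }

// Touch records the change on this scope and on each ancestor in turn.
func (s *treeScope) Touch() {
	for cur := s; cur != nil; cur = cur.parent {
		cur.touched++
	}
}

func main() {
	root := &treeScope{vars: map[string]any{"title": "Home"}}
	item := root.Spawn(map[string]any{"i": 0}) // e.g. one loop iteration
	item.Touch()
	fmt.Println(item.Vars()["i"], root.touched) // 0 1
}
```

A custom implementation like this could also carry extra fields, such as request or connection information, alongside the variables.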
type UnrecognizedArgumentError ¶
type UnrecognizedArgumentError struct {
Name string
}
func (*UnrecognizedArgumentError) Error ¶
func (e *UnrecognizedArgumentError) Error() string
func (*UnrecognizedArgumentError) Is ¶
func (e *UnrecognizedArgumentError) Is(target error) bool