Documentation ¶
Overview ¶
Package html implements an HTML5-compliant tokenizer and parser.
Tokenization is done by creating a Tokenizer for an io.Reader r. It is the caller's responsibility to ensure that r provides UTF-8 encoded HTML.
z := html.NewTokenizer(r)
Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(), which parses the next token and returns its type, or an error:
for { tt := z.Next() if tt == html.ErrorToken { // ... return ... } // Process the current token. }
There are two APIs for retrieving the current token. The high-level API is to call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs allow optionally calling Raw after Next but before Token, Text, TagName, or TagAttr. In EBNF notation, the valid call sequence per token is:
Next {Raw} [ Token | Text | TagName {TagAttr} ]
Token returns an independent data structure that completely describes a token. Entities (such as "<") are unescaped, tag names and attribute keys are lower-cased, and attributes are collected into a []Attribute. For example:
for { if z.Next() == html.ErrorToken { // Returning io.EOF indicates success. return z.Err() } emitToken(z.Token()) }
The low-level API performs fewer allocations and copies, but the contents of the []byte values returned by Text, TagName and TagAttr may change on the next call to Next. For example, to extract an HTML page's anchor text:
depth := 0 for { tt := z.Next() switch tt { case html.ErrorToken: return z.Err() case html.TextToken: if depth > 0 { // emitBytes should copy the []byte it receives, // if it doesn't process it immediately. emitBytes(z.Text()) } case html.StartTagToken, html.EndTagToken: tn, _ := z.TagName() if len(tn) == 1 && tn[0] == 'a' { if tt == html.StartTagToken { depth++ } else { depth-- } } } }
Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML. For example, to process each anchor node in depth-first order:
doc, err := html.Parse(r) if err != nil { // ... } var f func(*html.Node) f = func(n *html.Node) { if n.Type == html.ElementNode && n.Data == "a" { // Do something with n... } for c := n.FirstChild; c != nil; c = c.NextSibling { f(c) } } f(doc)
The relevant specifications include: https://html.spec.whatwg.org/multipage/syntax.html and https://html.spec.whatwg.org/multipage/syntax.html#tokenization
Index ¶
- Variables
- func EscapeString(s string) string
- func Render(w io.Writer, n *Node) error
- func UnescapeString(s string) string
- type Attribute
- type Node
- type NodeType
- type ParseOption
- type Token
- type TokenType
- type Tokenizer
- func (z *Tokenizer) AllowCDATA(allowCDATA bool)
- func (z *Tokenizer) Buffered() []byte
- func (z *Tokenizer) Err() error
- func (z *Tokenizer) Next() TokenType
- func (z *Tokenizer) NextIsNotRawText()
- func (z *Tokenizer) Raw() []byte
- func (z *Tokenizer) SetMaxBuf(n int)
- func (z *Tokenizer) TagAttr() (key, val []byte, moreAttr bool)
- func (z *Tokenizer) TagName() (name []byte, hasAttr bool)
- func (z *Tokenizer) Text() []byte
- func (z *Tokenizer) Token() Token
Constants ¶
This section is empty.
Variables ¶
ErrBufferExceeded means that the buffering limit was exceeded.
Functions ¶
func EscapeString ¶
EscapeString escapes special characters like "<" to become "<". It escapes only five such characters: <, >, &, ' and ". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
func Render ¶
Render renders the parse tree n to the given writer.
Rendering is done on a 'best effort' basis: calling Parse on the output of Render will always result in something similar to the original tree, but it is not necessarily an exact clone unless the original tree was 'well-formed'. 'Well-formed' is not easily specified; the HTML5 specification is complicated.
Calling Parse on arbitrary input typically results in a 'well-formed' parse tree. However, it is possible for Parse to yield a 'badly-formed' parse tree. For example, in a 'well-formed' parse tree, no <a> element is a child of another <a> element: parsing "<a><a>" results in two sibling elements. Similarly, in a 'well-formed' parse tree, no <a> element is a child of a <table> element: parsing "<p><table><a>" results in a <p> with two sibling children; the <a> is reparented to the <table>'s parent. However, calling Parse on "<a><table><a>" does not return an error, but the result has an <a> element with an <a> child, and is therefore not 'well-formed'.
Programmatically constructed trees are typically also 'well-formed', but it is possible to construct a tree that looks innocuous but, when rendered and re-parsed, results in a different tree. A simple example is that a solitary text node would become a tree containing <html>, <head> and <body> elements. Another example is that the programmatic equivalent of "a<head>b</head>c" becomes "<html><head><head/><body>abc</body></html>".
func UnescapeString ¶
UnescapeString unescapes entities like "<" to become "<". It unescapes a larger range of entities than EscapeString escapes. For example, "á" unescapes to "á", as does "á" and "&xE1;". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
Types ¶
type Attribute ¶
type Attribute struct {
Namespace, Key, Val string
}
An Attribute is an attribute namespace-key-value triple. Namespace is non-empty for foreign attributes like xlink, Key is alphabetic (and hence does not contain escapable characters like '&', '<' or '>'), and Val is unescaped (it looks like "a<b" rather than "a<b").
Namespace is only used by the parser, not the tokenizer.
type Node ¶
type Node struct {
Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node
Type NodeType
DataAtom atom.Atom
Data string
Namespace string
Attr []Attribute
}
A Node consists of a NodeType and some Data (tag name for element nodes, content for text) and are part of a tree of Nodes. Element nodes may also have a Namespace and contain a slice of Attributes. Data is unescaped, so that it looks like "a<b" rather than "a<b". For element nodes, DataAtom is the atom for Data, or zero if Data is not a known tag name.
An empty Namespace implies a "http://www.w3.org/1999/xhtml" namespace. Similarly, "math" is short for "http://www.w3.org/1998/Math/MathML", and "svg" is short for "http://www.w3.org/2000/svg".
func Parse ¶
Parse returns the parse tree for the HTML from the given Reader.
It implements the HTML5 parsing algorithm (https://html.spec.whatwg.org/multipage/syntax.html#tree-construction), which is very complicated. The resultant tree can contain implicitly created nodes that have no explicit <tag> listed in r's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end <tag>s. Conversely, explicit <tag>s in r's data can be silently dropped, with no corresponding node in the resulting tree.
The input is assumed to be UTF-8 encoded.
func ParseFragment ¶
ParseFragment parses a fragment of HTML and returns the nodes that were found. If the fragment is the InnerHTML for an existing element, pass that element in context.
It has the same intricacies as Parse.
func ParseFragmentWithOptions ¶
ParseFragmentWithOptions is like ParseFragment, with options.
func ParseWithOptions ¶
func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error)
ParseWithOptions is like Parse, with options.
func (*Node) AppendChild ¶
AppendChild adds a node c as a child of n.
It will panic if c already has a parent or siblings.
func (*Node) InsertBefore ¶
InsertBefore inserts newChild as a child of n, immediately before oldChild in the sequence of n's children. oldChild may be nil, in which case newChild is appended to the end of n's children.
It will panic if newChild already has a parent or siblings.
type ParseOption ¶
type ParseOption func(p *parser)
ParseOption configures a parser.
func ParseOptionEnableScripting ¶
func ParseOptionEnableScripting(enable bool) ParseOption
ParseOptionEnableScripting configures the scripting flag. https://html.spec.whatwg.org/multipage/webappapis.html#enabling-and-disabling-scripting
By default, scripting is enabled.
type Token ¶
A Token consists of a TokenType and some Data (tag name for start and end tags, content for text, comments and doctypes). A tag Token may also contain a slice of Attributes. Data is unescaped for all Tokens (it looks like "a<b" rather than "a<b"). For tag Tokens, DataAtom is the atom for Data, or zero if Data is not a known tag name.
type TokenType ¶
type TokenType uint32
A TokenType is the type of a Token.
const ( // ErrorToken means that an error occurred during tokenization. ErrorToken TokenType = iota // TextToken means a text node. TextToken // A StartTagToken looks like <a>. StartTagToken // An EndTagToken looks like </a>. EndTagToken // A SelfClosingTagToken tag looks like <br/>. SelfClosingTagToken // A CommentToken looks like <!--x-->. CommentToken // A DoctypeToken looks like <!DOCTYPE x> DoctypeToken )
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
A Tokenizer returns a stream of HTML Tokens.
func NewTokenizer ¶
NewTokenizer returns a new HTML Tokenizer for the given Reader. The input is assumed to be UTF-8 encoded.
func NewTokenizerFragment ¶
NewTokenizerFragment returns a new HTML Tokenizer for the given Reader, for tokenizing an existing element's InnerHTML fragment. contextTag is that element's tag, such as "div" or "iframe".
For example, how the InnerHTML "a<b" is tokenized depends on whether it is for a <p> tag or a <script> tag.
The input is assumed to be UTF-8 encoded.
func (*Tokenizer) AllowCDATA ¶
AllowCDATA sets whether or not the tokenizer recognizes <![CDATA[foo]]> as the text "foo". The default value is false, which means to recognize it as a bogus comment "<!-- [CDATA[foo]] -->" instead.
Strictly speaking, an HTML5 compliant tokenizer should allow CDATA if and only if tokenizing foreign content, such as MathML and SVG. However, tracking foreign-contentness is difficult to do purely in the tokenizer, as opposed to the parser, due to HTML integration points: an <svg> element can contain a <foreignObject> that is foreign-to-SVG but not foreign-to- HTML. For strict compliance with the HTML5 tokenization algorithm, it is the responsibility of the user of a tokenizer to call AllowCDATA as appropriate. In practice, if using the tokenizer without caring whether MathML or SVG CDATA is text or comments, such as tokenizing HTML to find all the anchor text, it is acceptable to ignore this responsibility.
func (*Tokenizer) Buffered ¶
Buffered returns a slice containing data buffered but not yet tokenized.
func (*Tokenizer) Err ¶
Err returns the error associated with the most recent ErrorToken token. This is typically io.EOF, meaning the end of tokenization.
func (*Tokenizer) Next ¶
Next scans the next token and returns its type.
func (*Tokenizer) NextIsNotRawText ¶
func (z *Tokenizer) NextIsNotRawText()
NextIsNotRawText instructs the tokenizer that the next token should not be considered as 'raw text'. Some elements, such as script and title elements, normally require the next token after the opening tag to be 'raw text' that has no child elements. For example, tokenizing "<title>a<b>c</b>d</title>" yields a start tag token for "<title>", a text token for "a<b>c</b>d", and an end tag token for "</title>". There are no distinct start tag or end tag tokens for the "<b>" and "</b>".
This tokenizer implementation will generally look for raw text at the right times. Strictly speaking, an HTML5 compliant tokenizer should not look for raw text if in foreign content: <title> generally needs raw text, but a <title> inside an <svg> does not. Another example is that a <textarea> generally needs raw text, but a <textarea> is not allowed as an immediate child of a <select>; in normal parsing, a <textarea> implies </select>, but one cannot close the implicit element when parsing a <select>'s InnerHTML. Similarly to AllowCDATA, tracking the correct moment to override raw-text- ness is difficult to do purely in the tokenizer, as opposed to the parser. For strict compliance with the HTML5 tokenization algorithm, it is the responsibility of the user of a tokenizer to call NextIsNotRawText as appropriate. In practice, like AllowCDATA, it is acceptable to ignore this responsibility for basic usage.
Note that this 'raw text' concept is different from the one offered by the Tokenizer.Raw method.
func (*Tokenizer) Raw ¶
Raw returns the unmodified text of the current token. Calling Next, Token, Text, TagName or TagAttr may change the contents of the returned slice.
func (*Tokenizer) SetMaxBuf ¶
SetMaxBuf sets a limit on the amount of data buffered during tokenization. A value of 0 means unlimited.
func (*Tokenizer) TagAttr ¶
TagAttr returns the lower-cased key and unescaped value of the next unparsed attribute for the current tag token and whether there are more attributes. The contents of the returned slices may change on the next call to Next.
func (*Tokenizer) TagName ¶
TagName returns the lower-cased name of a tag token (the `img` out of `<IMG SRC="foo">`) and whether the tag has attributes. The contents of the returned slice may change on the next call to Next.
func (*Tokenizer) Text ¶
Text returns the unescaped text of a text, comment or doctype token. The contents of the returned slice may change on the next call to Next.
Source Files ¶
Directories ¶
Path | Synopsis |
---|---|
Package atom provides integer codes (also known as atoms) for a fixed set of frequently occurring HTML strings: tag names and attribute keys such as "p" and "id".
|
Package atom provides integer codes (also known as atoms) for a fixed set of frequently occurring HTML strings: tag names and attribute keys such as "p" and "id". |