Documentation ¶
Overview ¶
Package html implements an HTML5-compliant tokenizer and parser. INCOMPLETE.
Tokenization is done by creating a Tokenizer for an io.Reader r. It is the caller's responsibility to ensure that r provides UTF-8 encoded HTML.
z := html.NewTokenizer(r)
Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(), which parses the next token and returns its type, or an error:
for { tt := z.Next() if tt == html.ErrorToken { // ... return ... } // Process the current token. }
There are two APIs for retrieving the current token. The high-level API is to call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs allow optionally calling Raw after Next but before Token, Text, TagName, or TagAttr. In EBNF notation, the valid call sequence per token is:
Next {Raw} [ Token | Text | TagName {TagAttr} ]
Token returns an independent data structure that completely describes a token. Entities (such as "<") are unescaped, tag names and attribute keys are lower-cased, and attributes are collected into a []Attribute. For example:
for { if z.Next() == html.ErrorToken { // Returning os.EOF indicates success. return z.Error() } emitToken(z.Token()) }
The low-level API performs fewer allocations and copies, but the contents of the []byte values returned by Text, TagName and TagAttr may change on the next call to Next. For example, to extract an HTML page's anchor text:
depth := 0 for { tt := z.Next() switch tt { case ErrorToken: return z.Error() case TextToken: if depth > 0 { // emitBytes should copy the []byte it receives, // if it doesn't process it immediately. emitBytes(z.Text()) } case StartTagToken, EndTagToken: tn, _ := z.TagName() if len(tn) == 1 && tn[0] == 'a' { if tt == StartTag { depth++ } else { depth-- } } } }
A Tokenizer typically skips over HTML comments. To return comment tokens, set Tokenizer.ReturnComments to true before looping over calls to Next.
Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML. For example, to process each anchor node in depth-first order:
doc, err := html.Parse(r) if err != nil { // ... } var f func(*html.Node) f = func(n *html.Node) { if n.Type == html.ElementNode && n.Data == "a" { // Do something with n... } for _, c := range n.Child { f(c) } } f(doc)
The relevant specifications include: http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html and http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func EscapeString ¶
EscapeString escapes special characters like "<" to become "<". It escapes only five such characters: amp, apos, lt, gt and quot. UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
func UnescapeString ¶
UnescapeString unescapes entities like "<" to become "<". It unescapes a larger range of entities than EscapeString escapes. For example, "á" unescapes to "á", as does "á" and "&xE1;". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
Types ¶
type Attribute ¶
type Attribute struct {
Key, Val string
}
An Attribute is an attribute key-value pair. Key is alphabetic (and hence does not contain escapable characters like '&', '<' or '>'), and Val is unescaped (it looks like "a<b" rather than "a<b").
type Node ¶
A Node consists of a NodeType and some Data (tag name for element nodes, content for text) and are part of a tree of Nodes. Element nodes may also contain a slice of Attributes. Data is unescaped, so that it looks like "a<b" rather than "a<b".
func Parse ¶
Parse returns the parse tree for the HTML from the given Reader. The input is assumed to be UTF-8 encoded.
type Token ¶
A Token consists of a TokenType and some Data (tag name for start and end tags, content for text, comments and doctypes). A tag Token may also contain a slice of Attributes. Data is unescaped for all Tokens (it looks like "a<b" rather than "a<b").
type TokenType ¶
type TokenType int
A TokenType is the type of a Token.
const ( // ErrorToken means that an error occurred during tokenization. ErrorToken TokenType = iota // TextToken means a text node. TextToken // A StartTagToken looks like <a>. StartTagToken // An EndTagToken looks like </a>. EndTagToken // A SelfClosingTagToken tag looks like <br/>. SelfClosingTagToken // A CommentToken looks like <!--x-->. CommentToken // A DoctypeToken looks like <!DOCTYPE x> DoctypeToken )
type Tokenizer ¶
type Tokenizer struct { // If ReturnComments is set, Next returns comment tokens; // otherwise it skips over comments (default). ReturnComments bool // contains filtered or unexported fields }
A Tokenizer returns a stream of HTML Tokens.
func NewTokenizer ¶
NewTokenizer returns a new HTML Tokenizer for the given Reader. The input is assumed to be UTF-8 encoded.
func (*Tokenizer) Error ¶
Error returns the error associated with the most recent ErrorToken token. This is typically os.EOF, meaning the end of tokenization.
func (*Tokenizer) Raw ¶
Raw returns the unmodified text of the current token. Calling Next, Token, Text, TagName or TagAttr may change the contents of the returned slice.
func (*Tokenizer) TagAttr ¶
TagAttr returns the lower-cased key and unescaped value of the next unparsed attribute for the current tag token and whether there are more attributes. The contents of the returned slices may change on the next call to Next.
func (*Tokenizer) TagName ¶
TagName returns the lower-cased name of a tag token (the `img` out of `<IMG SRC="foo">`) and whether the tag has attributes. The contents of the returned slice may change on the next call to Next.