html

package

v0.0.0-...-2286dd8 Published: Feb 29, 2012 License: BSD-3-Clause Imports: 8 Imported by: 0

Repository

github.com/tav/go

Documentation ¶

Overview ¶

Package html implements an HTML5-compliant tokenizer and parser. INCOMPLETE.

Tokenization is done by creating a Tokenizer for an io.Reader r. It is the caller's responsibility to ensure that r provides UTF-8 encoded HTML.

z := html.NewTokenizer(r)

Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(), which parses the next token and returns its type, or an error:

for {
	tt := z.Next()
	if tt == html.ErrorToken {
		// ...
		return ...
	}
	// Process the current token.
}

There are two APIs for retrieving the current token. The high-level API is to call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs allow optionally calling Raw after Next but before Token, Text, TagName, or TagAttr. In EBNF notation, the valid call sequence per token is:

Next {Raw} [ Token | Text | TagName {TagAttr} ]

Token returns an independent data structure that completely describes a token. Entities (such as "&lt;") are unescaped, tag names and attribute keys are lower-cased, and attributes are collected into a []Attribute. For example:

for {
	if z.Next() == html.ErrorToken {
		// Returning io.EOF indicates success.
		return z.Err()
	}
	emitToken(z.Token())
}

The low-level API performs fewer allocations and copies, but the contents of the []byte values returned by Text, TagName and TagAttr may change on the next call to Next. For example, to extract an HTML page's anchor text:

depth := 0
for {
	tt := z.Next()
	switch tt {
	case html.ErrorToken:
		return z.Err()
	case html.TextToken:
		if depth > 0 {
			// emitBytes should copy the []byte it receives,
			// if it doesn't process it immediately.
			emitBytes(z.Text())
		}
	case html.StartTagToken, html.EndTagToken:
		tn, _ := z.TagName()
		if len(tn) == 1 && tn[0] == 'a' {
			if tt == html.StartTagToken {
				depth++
			} else {
				depth--
			}
		}
	}
}
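
The aliasing hazard behind the copy advice can be reproduced without the tokenizer. The sketch below is not this package's code: reusingSource is a hypothetical stand-in that hands out a slice backed by a buffer it overwrites on the next call, which is the behaviour the low-level API is documented to allow. It shows why a retained slice should be copied first.

```go
package main

import "fmt"

// reusingSource is a hypothetical stand-in for a tokenizer's internal buffer.
// It is an assumption made for illustration, not part of package html.
type reusingSource struct {
	buf []byte
}

// next overwrites the shared buffer with s and returns a slice aliasing it.
func (r *reusingSource) next(s string) []byte {
	r.buf = append(r.buf[:0], s...)
	return r.buf
}

// retain returns the aliased slice and an independent copy of the first
// value, after the source has moved on to the second value. Both inputs
// are assumed to fit in the 64-byte buffer.
func retain(first, second string) (aliased, copied []byte) {
	src := &reusingSource{buf: make([]byte, 0, 64)}
	aliased = src.next(first)
	copied = append([]byte(nil), aliased...) // owns its own memory
	src.next(second)                         // overwrites the shared buffer
	return aliased, copied
}

func main() {
	aliased, copied := retain("first", "other")
	fmt.Printf("aliased=%q copied=%q\n", aliased, copied)
}
```

In this sketch the aliased slice ends up reading "other" while the copy still reads "first". A function such as emitBytes above would need the same kind of copy if it keeps the bytes beyond the current iteration.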

Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML. For example, to process each anchor node in depth-first order:

doc, err := html.Parse(r)
if err != nil {
	// ...
}
var f func(*html.Node)
f = func(n *html.Node) {
	if n.Type == html.ElementNode && n.Data == "a" {
		// Do something with n...
	}
	for _, c := range n.Child {
		f(c)
	}
}
f(doc)
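
As a self-contained variation on the traversal above, the following sketch collects the href value of every anchor element. It redeclares Node, Attribute and NodeType locally, mirroring the declarations documented on this page, so that it compiles without importing the package. The helper hrefs and the hand-built sampleTree are hypothetical; sampleTree merely stands in for a tree that Parse might return, and it leaves the Parent pointers unset because the walk does not need them.

```go
package main

import "fmt"

// These types mirror the declarations documented for this package; they are
// redeclared locally so the sketch compiles on its own.
type NodeType int

const (
	ErrorNode NodeType = iota
	TextNode
	DocumentNode
	ElementNode
	CommentNode
	DoctypeNode
)

type Attribute struct {
	Namespace, Key, Val string
}

type Node struct {
	Parent    *Node
	Child     []*Node
	Type      NodeType
	Data      string
	Namespace string
	Attr      []Attribute
}

// hrefs walks the tree depth-first and returns the href value of every <a>
// element in document order. It is a hypothetical helper.
func hrefs(n *Node) []string {
	var out []string
	if n.Type == ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				out = append(out, a.Val)
			}
		}
	}
	for _, c := range n.Child {
		out = append(out, hrefs(c)...)
	}
	return out
}

// sampleTree builds by hand a small tree: a document whose body holds a
// paragraph containing one link, followed by a second link.
func sampleTree() *Node {
	link := func(href string) *Node {
		return &Node{
			Type: ElementNode,
			Data: "a",
			Attr: []Attribute{{Key: "href", Val: href}},
		}
	}
	p := &Node{Type: ElementNode, Data: "p", Child: []*Node{link("/one")}}
	body := &Node{Type: ElementNode, Data: "body", Child: []*Node{p, link("/two")}}
	return &Node{Type: DocumentNode, Child: []*Node{body}}
}

func main() {
	fmt.Println(hrefs(sampleTree()))
}
```

Returning a slice instead of acting inside the closure keeps the walk easy to test; the recursive closure form shown above is equally valid when the work is done in place.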

The relevant specifications include: http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html and http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html

Index ¶

func EscapeString(s string) string
func Render(w io.Writer, n *Node) error
func UnescapeString(s string) string
type Attribute
type Node
- func Parse(r io.Reader) (*Node, error)
- func ParseFragment(r io.Reader, context *Node) ([]*Node, error)
- func (n *Node) Add(child *Node)
- func (n *Node) Remove(child *Node)
type NodeType
type Token
- func (t Token) String() string
type TokenType
- func (t TokenType) String() string
type Tokenizer
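- func NewTokenizer(r io.Reader) *Tokenizer
- func (z *Tokenizer) Err() error
- func (z *Tokenizer) Next() TokenType
- func (z *Tokenizer) Raw() []byte
- func (z *Tokenizer) TagAttr() (key, val []byte, moreAttr bool)
- func (z *Tokenizer) TagName() (name []byte, hasAttr bool)
- func (z *Tokenizer) Text() []byte
- func (z *Tokenizer) Token() Token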

Functions ¶

func EscapeString ¶

func EscapeString(s string) string

EscapeString escapes special characters like "<" to become "&lt;". It escapes only five such characters: amp, apos, lt, gt and quot. UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
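
The round-trip guarantee can be checked with a few lines of Go. This sketch assumes the standard library's html package, which exposes EscapeString and UnescapeString with the same documented contract; it does not import the snapshot documented on this page. The helper escapeThenUnescape is hypothetical.

```go
package main

import (
	"fmt"
	"html" // standard library package, assumed to match the contract above
)

// escapeThenUnescape returns the escaped form of s and the result of
// unescaping that escaped form again.
func escapeThenUnescape(s string) (escaped, restored string) {
	escaped = html.EscapeString(s)
	restored = html.UnescapeString(escaped)
	return escaped, restored
}

func main() {
	s := "a<b & c"
	escaped, restored := escapeThenUnescape(s)
	fmt.Println(escaped)       // the "<" and "&" are replaced by entities
	fmt.Println(restored == s) // the round trip restores the input
}
```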

func Render ¶

func Render(w io.Writer, n *Node) error

Render renders the parse tree n to the given writer.

Rendering is done on a 'best effort' basis: calling Parse on the output of Render will always result in something similar to the original tree, but it is not necessarily an exact clone unless the original tree was 'well-formed'. 'Well-formed' is not easily specified; the HTML5 specification is complicated.

Calling Parse on arbitrary input typically results in a 'well-formed' parse tree. However, it is possible for Parse to yield a 'badly-formed' parse tree. For example, in a 'well-formed' parse tree, no <a> element is a child of another <a> element: parsing "<a><a>" results in two sibling elements. Similarly, in a 'well-formed' parse tree, no <a> element is a child of a <table> element: parsing "<p><table><a>" results in a <p> with two sibling children; the <a> is reparented to the <table>'s parent. However, calling Parse on "<a><table><a>" does not return an error, but the result has an <a> element with an <a> child, and is therefore not 'well-formed'.

Programmatically constructed trees are typically also 'well-formed', but it is possible to construct a tree that looks innocuous but, when rendered and re-parsed, results in a different tree. A simple example is that a solitary text node would become a tree containing <html>, <head> and <body> elements. Another example is that the programmatic equivalent of "a<head>b</head>c" becomes "<html><head><head/><body>abc</body></html>".

func UnescapeString ¶

func UnescapeString(s string) string

UnescapeString unescapes entities like "&lt;" to become "<". It unescapes a larger range of entities than EscapeString escapes. For example, "&aacute;" unescapes to "á", as does "&#225;" and "&#xE1;". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.
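
The failing converse is easy to observe: several spellings of an entity unescape to the same character, and escaping that character does not bring any of them back. As a hedged illustration, the sketch below again uses the standard library's html package as a stand-in for the functions documented here; unescapeThenEscape is a hypothetical helper.

```go
package main

import (
	"fmt"
	"html" // standard library package, used as a stand-in
)

// unescapeThenEscape unescapes s and then escapes the result again.
func unescapeThenEscape(s string) (unescaped, reescaped string) {
	unescaped = html.UnescapeString(s)
	return unescaped, html.EscapeString(unescaped)
}

func main() {
	// Named, decimal and hexadecimal spellings of the same character.
	for _, s := range []string{"&aacute;", "&#225;", "&#xE1;"} {
		u, e := unescapeThenEscape(s)
		fmt.Printf("%q -> %q -> %q (same as input: %v)\n", s, u, e, e == s)
	}
}
```

Because "á" is not one of the five escaped characters, the re-escaped form stays "á" and never matches the original entity text.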

Types ¶

type Attribute ¶

type Attribute struct {
	Namespace, Key, Val string
}

An Attribute is an attribute namespace-key-value triple. Namespace is non-empty for foreign attributes like xlink, Key is alphabetic (and hence does not contain escapable characters like '&', '<' or '>'), and Val is unescaped (it looks like "a<b" rather than "a&lt;b").

Namespace is only used by the parser, not the tokenizer.

type Node ¶

type Node struct {
	Parent    *Node
	Child     []*Node
	Type      NodeType
	Data      string
	Namespace string
	Attr      []Attribute
}

A Node consists of a NodeType and some Data (tag name for element nodes, content for text) and is part of a tree of Nodes. Element nodes may also have a Namespace and contain a slice of Attributes. Data is unescaped, so that it looks like "a<b" rather than "a&lt;b".

An empty Namespace implies a "http://www.w3.org/1999/xhtml" namespace. Similarly, "math" is short for "http://www.w3.org/1998/Math/MathML", and "svg" is short for "http://www.w3.org/2000/svg".
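
Code that needs the full URI can expand the short form with a small lookup. The function namespaceURI below is a hypothetical helper, not part of this package; it covers only the three values listed above and reports any other value as unknown.

```go
package main

import "fmt"

// namespaceURI expands a short Namespace value into its full URI.
// The boolean reports whether short is one of the three documented values.
// This helper is an illustration and is not provided by package html.
func namespaceURI(short string) (string, bool) {
	switch short {
	case "":
		return "http://www.w3.org/1999/xhtml", true
	case "math":
		return "http://www.w3.org/1998/Math/MathML", true
	case "svg":
		return "http://www.w3.org/2000/svg", true
	}
	return "", false
}

func main() {
	for _, short := range []string{"", "math", "svg", "xlink"} {
		uri, ok := namespaceURI(short)
		fmt.Printf("%q -> %q (known: %v)\n", short, uri, ok)
	}
}
```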

func Parse ¶

func Parse(r io.Reader) (*Node, error)

Parse returns the parse tree for the HTML from the given Reader. The input is assumed to be UTF-8 encoded.

func ParseFragment ¶

func ParseFragment(r io.Reader, context *Node) ([]*Node, error)

ParseFragment parses a fragment of HTML and returns the nodes that were found. If the fragment is the InnerHTML for an existing element, pass that element in context.

func (*Node) Add ¶

func (n *Node) Add(child *Node)

Add adds a node as a child of n. It will panic if the child's parent is not nil.

func (*Node) Remove ¶

func (n *Node) Remove(child *Node)

Remove removes a node as a child of n. It will panic if the child's parent is not n.
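
Taken together, the two panic conditions mean that a node must be removed from its current parent before it can be added to another. The sketch below models that contract on a pared-down local type. The lower-case node, add and remove are hypothetical stand-ins written for illustration; they are not the package's implementation.

```go
package main

import "fmt"

// node is a pared-down stand-in for the documented Node type.
type node struct {
	parent *node
	child  []*node
	data   string
}

// add appends c to n's children. It panics if c already has a parent.
func (n *node) add(c *node) {
	if c.parent != nil {
		panic("add: child already has a parent")
	}
	c.parent = n
	n.child = append(n.child, c)
}

// remove detaches c from n. It panics if c's parent is not n.
func (n *node) remove(c *node) {
	if c.parent != n {
		panic("remove: n is not the child's parent")
	}
	for i, x := range n.child {
		if x == c {
			n.child = append(n.child[:i], n.child[i+1:]...)
			break
		}
	}
	c.parent = nil
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	ul, ol, li := &node{data: "ul"}, &node{data: "ol"}, &node{data: "li"}
	ul.add(li)
	// li still belongs to ul, so adding it elsewhere panics.
	fmt.Println(panics(func() { ol.add(li) }))
	// Reparenting: remove from the old parent first, then add to the new one.
	ul.remove(li)
	ol.add(li)
	fmt.Println(li.parent == ol, len(ul.child), len(ol.child))
}
```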

type NodeType ¶

type NodeType int

A NodeType is the type of a Node.

const (
	ErrorNode NodeType = iota
	TextNode
	DocumentNode
	ElementNode
	CommentNode
	DoctypeNode
)

type Token ¶

type Token struct {
	Type TokenType
	Data string
	Attr []Attribute
}

A Token consists of a TokenType and some Data (tag name for start and end tags, content for text, comments and doctypes). A tag Token may also contain a slice of Attributes. Data is unescaped for all Tokens (it looks like "a<b" rather than "a&lt;b").

func (Token) String ¶

func (t Token) String() string

String returns a string representation of the Token.

type TokenType ¶

type TokenType int

A TokenType is the type of a Token.

const (
	// ErrorToken means that an error occurred during tokenization.
	ErrorToken TokenType = iota
	// TextToken means a text node.
	TextToken
	// A StartTagToken looks like <a>.
	StartTagToken
	// An EndTagToken looks like </a>.
	EndTagToken
	// A SelfClosingTagToken tag looks like <br/>.
	SelfClosingTagToken
	// A CommentToken looks like <!--x-->.
	CommentToken
	// A DoctypeToken looks like <!DOCTYPE x>
	DoctypeToken
)

func (TokenType) String ¶

func (t TokenType) String() string

String returns a string representation of the TokenType.

type Tokenizer ¶

type Tokenizer struct {
	// contains filtered or unexported fields
}

A Tokenizer returns a stream of HTML Tokens.

func NewTokenizer ¶

func NewTokenizer(r io.Reader) *Tokenizer

NewTokenizer returns a new HTML Tokenizer for the given Reader. The input is assumed to be UTF-8 encoded.

func (*Tokenizer) Err ¶

func (z *Tokenizer) Err() error

Err returns the error associated with the most recent ErrorToken token. This is typically io.EOF, meaning the end of tokenization.

func (*Tokenizer) Next ¶

func (z *Tokenizer) Next() TokenType

Next scans the next token and returns its type.

func (*Tokenizer) Raw ¶

func (z *Tokenizer) Raw() []byte

Raw returns the unmodified text of the current token. Calling Next, Token, Text, TagName or TagAttr may change the contents of the returned slice.

func (*Tokenizer) TagAttr ¶

func (z *Tokenizer) TagAttr() (key, val []byte, moreAttr bool)

TagAttr returns the lower-cased key and unescaped value of the next unparsed attribute for the current tag token and whether there are more attributes. The contents of the returned slices may change on the next call to Next.

func (*Tokenizer) TagName ¶

func (z *Tokenizer) TagName() (name []byte, hasAttr bool)

TagName returns the lower-cased name of a tag token (the `img` out of `<IMG SRC="foo">`) and whether the tag has attributes. The contents of the returned slice may change on the next call to Next.

func (*Tokenizer) Text ¶

func (z *Tokenizer) Text() []byte

Text returns the unescaped text of a text, comment or doctype token. The contents of the returned slice may change on the next call to Next.

func (*Tokenizer) Token ¶

func (z *Tokenizer) Token() Token

Token returns the current Token, that is, the one scanned by the most recent call to Next. The result's Data and Attr values remain valid after subsequent Next calls.
