htmltoken

package module
v0.2.0
Published: Nov 2, 2024 License: BSD-3-Clause Imports: 8 Imported by: 1

README

This is a fork of part of the golang.org/x/net/html package.

v0.2.0

For v0.2.0 we made a more radical change to the tokenizer package.

We added a new syntax to allow attributes to be set with '{}' syntax. Any valid JSON expression is allowed within the curly brackets (this more closely matches JSX syntax).

<div data-num={5}></div>

To support proper decoding in the client, attributes now have an IsJson bool field, which is set to true if an attribute was parsed with the new {} syntax.
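A minimal sketch of how a client might use the IsJson flag. The `attribute` struct is a local stand-in that mirrors the package's Attribute fields, and `decodeAttrValue` is a hypothetical helper, not part of the package. It also assumes that Val holds the JSON text that appeared between the braces, which the README does not state explicitly.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// attribute is a local stand-in mirroring the Key, Val and IsJson fields
// of the package's Attribute type (Namespace is omitted for brevity).
type attribute struct {
	Key, Val string
	IsJson   bool
}

// decodeAttrValue is a hypothetical helper. It returns Val unchanged for
// ordinary attributes and the decoded JSON value for {} attributes.
// ASSUMPTION: when IsJson is true, Val contains the JSON text itself.
func decodeAttrValue(a attribute) (any, error) {
	if !a.IsJson {
		return a.Val, nil
	}
	var v any
	if err := json.Unmarshal([]byte(a.Val), &v); err != nil {
		return nil, fmt.Errorf("attribute %q: %w", a.Key, err)
	}
	return v, nil
}

func main() {
	// <div data-num={5}> would carry a JSON attribute.
	num, err := decodeAttrValue(attribute{Key: "data-num", Val: "5", IsJson: true})
	fmt.Printf("%T %v %v\n", num, num, err)

	// <div data-num="5"> would carry a plain string attribute.
	str, err := decodeAttrValue(attribute{Key: "data-num", Val: "5"})
	fmt.Printf("%T %v %v\n", str, str, err)
}
```

With encoding/json, a JSON number decoded into an `any` becomes a float64, so the two cases above are distinguishable by type even though Val is the same text.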

If you only need the case-sensitive tokenization for tags/attributes it is recommended to use v0.1.0 and not v0.2.0.

v0.1.0

It is not a complete fork, as we only want to modify https://pkg.go.dev/golang.org/x/net/html#Tokenizer. This is the minimal amount of code needed to get html.Tokenizer working.

The reason for the fork is to allow returning case-sensitive tag names and attribute names. The upstream package normalizes tag names and attribute names by calling (the equivalent of) strings.ToLower on them before returning them to the caller. We made a very small two-line change in token.go to remove those ToLower calls. The other changes involve copying enough code from other files to satisfy all the dependencies and get the package compiling again.

Why did we not fork the entire package? Because the rest of the html package is a validating html parser and is quite complicated. As the HTML rules can change over time, it would need to be continually updated and synced with the upstream to keep it compliant. As the actual syntax (tokenization rules) of HTML does not change often, this part of the package is likely much more stable.

Documentation

Index

Constants

This section is empty.

Variables

var ErrBufferExceeded = errors.New("max buffer exceeded")

ErrBufferExceeded means that the buffering limit was exceeded.

Functions

func EscapeString

func EscapeString(s string) string

EscapeString escapes special characters like "<" to become "&lt;". It escapes only five such characters: <, >, &, ' and ". UnescapeString(EscapeString(s)) == s always holds, but the converse isn't always true.

Types

type Attribute

type Attribute struct {
	Namespace, Key, Val string
	IsJson              bool // MOD - added to support json attributes
}

An Attribute is an attribute namespace-key-value triple. Namespace is non-empty for foreign attributes like xlink, Key is alphabetic (and hence does not contain escapable characters like '&', '<' or '>'), and Val is unescaped (it looks like "a<b" rather than "a&lt;b").

Namespace is only used by the parser, not the tokenizer.

type Token

type Token struct {
	Type     TokenType
	DataAtom atom.Atom
	Data     string
	Attr     []Attribute
}

A Token consists of a TokenType and some Data (tag name for start and end tags, content for text, comments and doctypes). A tag Token may also contain a slice of Attributes. Data is unescaped for all Tokens (it looks like "a<b" rather than "a&lt;b"). For tag Tokens, DataAtom is the atom for Data, or zero if Data is not a known tag name.

func (Token) String

func (t Token) String() string

String returns a string representation of the Token.

type TokenType

type TokenType uint32

A TokenType is the type of a Token.

const (
	// ErrorToken means that an error occurred during tokenization.
	ErrorToken TokenType = iota
	// TextToken means a text node.
	TextToken
	// A StartTagToken looks like <a>.
	StartTagToken
	// An EndTagToken looks like </a>.
	EndTagToken
	// A SelfClosingTagToken tag looks like <br/>.
	SelfClosingTagToken
	// A CommentToken looks like <!--x-->.
	CommentToken
	// A DoctypeToken looks like <!DOCTYPE x>
	DoctypeToken
)

func (TokenType) String

func (t TokenType) String() string

String returns a string representation of the TokenType.

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

A Tokenizer returns a stream of HTML Tokens.

func NewTokenizer

func NewTokenizer(r io.Reader) *Tokenizer

NewTokenizer returns a new HTML Tokenizer for the given Reader. The input is assumed to be UTF-8 encoded.

func NewTokenizerFragment

func NewTokenizerFragment(r io.Reader, contextTag string) *Tokenizer

NewTokenizerFragment returns a new HTML Tokenizer for the given Reader, for tokenizing an existing element's InnerHTML fragment. contextTag is that element's tag, such as "div" or "iframe".

For example, how the InnerHTML "a<b" is tokenized depends on whether it is for a <p> tag or a <script> tag.

The input is assumed to be UTF-8 encoded.

func (*Tokenizer) AllowCDATA

func (z *Tokenizer) AllowCDATA(allowCDATA bool)

AllowCDATA sets whether or not the tokenizer recognizes <![CDATA[foo]]> as the text "foo". The default value is false, which means to recognize it as a bogus comment "<!-- [CDATA[foo]] -->" instead.

Strictly speaking, an HTML5 compliant tokenizer should allow CDATA if and only if tokenizing foreign content, such as MathML and SVG. However, tracking foreign-contentness is difficult to do purely in the tokenizer, as opposed to the parser, due to HTML integration points: an <svg> element can contain a <foreignObject> that is foreign-to-SVG but not foreign-to-HTML. For strict compliance with the HTML5 tokenization algorithm, it is the responsibility of the user of a tokenizer to call AllowCDATA as appropriate. In practice, if using the tokenizer without caring whether MathML or SVG CDATA is text or comments, such as tokenizing HTML to find all the anchor text, it is acceptable to ignore this responsibility.

func (*Tokenizer) Buffered

func (z *Tokenizer) Buffered() []byte

Buffered returns a slice containing data buffered but not yet tokenized.

func (*Tokenizer) Err

func (z *Tokenizer) Err() error

Err returns the error associated with the most recent ErrorToken token. This is typically io.EOF, meaning the end of tokenization.

func (*Tokenizer) Next

func (z *Tokenizer) Next() TokenType

Next scans the next token and returns its type.

func (*Tokenizer) NextIsNotRawText

func (z *Tokenizer) NextIsNotRawText()

NextIsNotRawText instructs the tokenizer that the next token should not be considered as 'raw text'. Some elements, such as script and title elements, normally require the next token after the opening tag to be 'raw text' that has no child elements. For example, tokenizing "<title>a<b>c</b>d</title>" yields a start tag token for "<title>", a text token for "a<b>c</b>d", and an end tag token for "</title>". There are no distinct start tag or end tag tokens for the "<b>" and "</b>".

This tokenizer implementation will generally look for raw text at the right times. Strictly speaking, an HTML5 compliant tokenizer should not look for raw text if in foreign content: <title> generally needs raw text, but a <title> inside an <svg> does not. Another example is that a <textarea> generally needs raw text, but a <textarea> is not allowed as an immediate child of a <select>; in normal parsing, a <textarea> implies </select>, but one cannot close the implicit element when parsing a <select>'s InnerHTML. Similarly to AllowCDATA, tracking the correct moment to override raw-text-ness is difficult to do purely in the tokenizer, as opposed to the parser. For strict compliance with the HTML5 tokenization algorithm, it is the responsibility of the user of a tokenizer to call NextIsNotRawText as appropriate. In practice, like AllowCDATA, it is acceptable to ignore this responsibility for basic usage.

Note that this 'raw text' concept is different from the one offered by the Tokenizer.Raw method.

func (*Tokenizer) Raw

func (z *Tokenizer) Raw() []byte

Raw returns the unmodified text of the current token. Calling Next, Token, Text, TagName or TagAttr may change the contents of the returned slice.

The token stream's raw bytes partition the byte stream (up until an ErrorToken). There are no overlaps or gaps between two consecutive tokens' raw bytes. One implication is that the byte offset of the current token is the sum of the lengths of all previous tokens' raw bytes.

func (*Tokenizer) SetMaxBuf

func (z *Tokenizer) SetMaxBuf(n int)

SetMaxBuf sets a limit on the amount of data buffered during tokenization. A value of 0 means unlimited.

func (*Tokenizer) TagAttr

func (z *Tokenizer) TagAttr() (key, val []byte, isJson bool, moreAttr bool)

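TagAttr returns the key and unescaped value of the next unparsed attribute for the current tag token, whether the attribute was written with the {} JSON syntax, and whether there are more attributes. The upstream documentation describes the key as lower-cased; per the README, this fork removes that normalization, so the key keeps its original case. The contents of the returned slices may change on the next call to Next.

MOD: added the isJson bool return value.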
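A sketch of the usual attribute loop: keep calling TagAttr while the last result says more attributes remain, and copy the byte slices into strings before moving on. The `attrSource` interface mirrors the documented TagAttr signature so the example runs on its own; `attr`, `collectAttrs` and `fakeTag` are hypothetical names introduced for illustration. A real *Tokenizer is expected to satisfy `attrSource`, and the initial `hasAttr` value would come from TagName.

```go
package main

import "fmt"

// attrSource mirrors the documented TagAttr method signature.
type attrSource interface {
	TagAttr() (key, val []byte, isJson bool, moreAttr bool)
}

// attr holds copies of one attribute's data.
type attr struct {
	Key, Val string
	IsJson   bool
}

// collectAttrs drains all attributes of the current tag. hasAttr is the
// second result of TagName. Converting to string copies the bytes, so the
// result stays valid after the underlying buffer changes.
func collectAttrs(src attrSource, hasAttr bool) []attr {
	var out []attr
	more := hasAttr
	for more {
		k, v, isJson, m := src.TagAttr()
		out = append(out, attr{Key: string(k), Val: string(v), IsJson: isJson})
		more = m
	}
	return out
}

// fakeTag is a stand-in for a tokenizer positioned on a tag token.
type fakeTag struct {
	attrs []attr
	i     int
}

func (f *fakeTag) TagAttr() (key, val []byte, isJson bool, moreAttr bool) {
	if f.i >= len(f.attrs) {
		return nil, nil, false, false
	}
	a := f.attrs[f.i]
	f.i++
	return []byte(a.Key), []byte(a.Val), a.IsJson, f.i < len(f.attrs)
}

func main() {
	// Stand-in for a tag such as <div dataNum={5} id="x">.
	tag := &fakeTag{attrs: []attr{
		{Key: "dataNum", Val: "5", IsJson: true},
		{Key: "id", Val: "x"},
	}}
	for _, a := range collectAttrs(tag, true) {
		fmt.Printf("%s=%s json=%v\n", a.Key, a.Val, a.IsJson)
	}
}
```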

func (*Tokenizer) TagName

func (z *Tokenizer) TagName() (name []byte, hasAttr bool)

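TagName returns the name of a tag token and whether the tag has attributes. The upstream documentation describes the name as lower-cased (the `img` out of `<IMG SRC="foo">`); per the README, this fork removes that normalization, so the name keeps its original case. The contents of the returned slice may change on the next call to Next.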

func (*Tokenizer) Text

func (z *Tokenizer) Text() []byte

Text returns the unescaped text of a text, comment or doctype token. The contents of the returned slice may change on the next call to Next.

func (*Tokenizer) Token

func (z *Tokenizer) Token() Token

Token returns the current Token. The result's Data and Attr values remain valid after subsequent Next calls.
