html

v2.7.12
Published: Feb 17, 2024 License: MIT Imports: 7 Imported by: 13

README

HTML API reference

This package is an HTML5 lexer written in Go. It follows the specification at The HTML syntax. The lexer reads from a parse.Input, which can wrap an io.Reader, and converts the input into tokens until EOF.

Installation

Run the following command:

go get -u github.com/tdewolff/parse/v2/html

or add the following import and run go get in your project:

import "github.com/tdewolff/parse/v2/html"

Lexer

Usage

The following initializes a new Lexer with io.Reader r:

l := html.NewLexer(parse.NewInput(r))

To tokenize until EOF or an error, use:

for {
	tt, data := l.Next()
	switch tt {
	case html.ErrorToken:
		// error or EOF set in l.Err()
		return
	case html.StartTagToken:
		// ...
		for {
			ttAttr, dataAttr := l.Next()
			if ttAttr != html.AttributeToken {
				break
			}
			// ...
		}
	// ...
	}
}

All tokens:

ErrorToken TokenType = iota // extra token when errors occur
CommentToken
DoctypeToken
StartTagToken
StartTagCloseToken
StartTagVoidToken
EndTagToken
AttributeToken
TextToken
SvgToken
MathToken

Examples

package main

import (
	"fmt"
	"io"
	"os"

	"github.com/tdewolff/parse/v2"
	"github.com/tdewolff/parse/v2/html"
)

// Tokenize HTML from stdin.
func main() {
	l := html.NewLexer(parse.NewInput(os.Stdin))
	for {
		tt, data := l.Next()
		switch tt {
		case html.ErrorToken:
			if l.Err() != io.EOF {
				fmt.Println("Error on line", l.Line(), ":", l.Err())
			}
			return
		case html.StartTagToken:
			fmt.Println("Tag", string(data))
			for {
				ttAttr, dataAttr := l.Next()
				if ttAttr != html.AttributeToken {
					break
				}

				key := dataAttr
				val := l.AttrVal()
				fmt.Println("Attribute", string(key), "=", string(val))
			}
		// ...
		}
	}
}

License

Released under the MIT license.

Documentation

Overview

Package html is an HTML5 lexer following the specifications at http://www.w3.org/TR/html5/syntax.html.

Index

Examples

Constants

This section is empty.

Variables

var ASPTemplate = [2]string{"<%", "%>"}
var EJSTemplate = [2]string{"<%", "%>"}
var GoTemplate = [2]string{"{{", "}}"}
var HandlebarsTemplate = [2]string{"{{", "}}"}
var MustacheTemplate = [2]string{"{{", "}}"}
var PHPTemplate = [2]string{"<?", "?>"}

Functions

func EscapeAttrVal

func EscapeAttrVal(buf *[]byte, b []byte, origQuote byte, mustQuote bool) []byte

EscapeAttrVal returns the escaped attribute value bytes with quotes. Either single or double quotes are used, whichever is shorter. If there are no quotes present in the value and the value is in HTML (not XML), it will return the value without quotes.

func ParseSelector added in v2.7.7

func ParseSelector(s string) (selector, error)

Types

type AST added in v2.7.7

type AST struct {
	Children []*Tag
	Text     []byte
}

func Parse added in v2.7.7

func Parse(r *parse.Input) (*AST, error)

func (*AST) Query added in v2.7.7

func (ast *AST) Query(s string) (*Tag, error)

func (*AST) QueryAll added in v2.7.7

func (ast *AST) QueryAll(s string) ([]*Tag, error)

func (*AST) String added in v2.7.7

func (ast *AST) String() string

type Attr added in v2.7.7

type Attr struct {
	Key, Val []byte
}

func (*Attr) String added in v2.7.7

func (attr *Attr) String() string

type Hash

type Hash uint32

Hash defines perfect hashes for a predefined list of strings.

const (
	Iframe    Hash = 0x6    // iframe
	Math      Hash = 0x604  // math
	Plaintext Hash = 0x1e09 // plaintext
	Script    Hash = 0xa06  // script
	Style     Hash = 0x1405 // style
	Svg       Hash = 0x1903 // svg
	Textarea  Hash = 0x2308 // textarea
	Title     Hash = 0xf05  // title
	Xmp       Hash = 0x1c03 // xmp
)

Unique hash definitions to be used instead of strings.
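
The constant values listed above look like they pack two numbers: the low byte equals the length of the name, and the remaining bits look like an offset into one shared string of overlapping names. This is an inference from the listed values, not documented behavior. The packed string and the decode function below are hypothetical reconstructions, and the sketch deliberately does not import the package.

```go
package main

import "fmt"

// packed is an assumed reconstruction of the shared name table. Names
// overlap where possible, e.g. "script" and "title" share a 't'.
const packed = "iframemathscriptitlestylesvgxmplaintextarea"

// decode treats the low byte of h as a length and the bits above it
// as an offset into packed. Both meanings are assumptions.
func decode(h uint32) string {
	off, n := h>>8, h&0xff
	return packed[off : off+n]
}

func main() {
	// The nine values from the const block above, in the same order.
	for _, h := range []uint32{0x6, 0x604, 0x1e09, 0xa06, 0x1405, 0x1903, 0x2308, 0xf05, 0x1c03} {
		fmt.Printf("%#x -> %s\n", h, decode(h))
	}
}
```

Under this reading, every value decodes to the name given in its comment, which would explain why the constants are not consecutive integers.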

func ToHash

func ToHash(s []byte) Hash

ToHash returns the hash whose name is s. It returns zero if there is no such hash. It is case sensitive.

func (Hash) String

func (i Hash) String() string

String returns the hash's name.

type Lexer

type Lexer struct {
	// contains filtered or unexported fields
}

Lexer is the state for the lexer.

func NewLexer

func NewLexer(r *parse.Input) *Lexer

NewLexer returns a new Lexer for a given parse.Input, which can wrap an io.Reader.

Example
l := NewLexer(parse.NewInputString("<span class='user'>John Doe</span>"))
out := ""
for {
	tt, data := l.Next()
	if tt == ErrorToken {
		break
	}
	out += string(data)
}
fmt.Println(out)
Output:

<span class='user'>John Doe</span>

func NewTemplateLexer added in v2.7.2

func NewTemplateLexer(r *parse.Input, tmpl [2]string) *Lexer

func (*Lexer) AttrKey added in v2.7.2

func (l *Lexer) AttrKey() []byte

AttrKey returns the attribute key when an AttributeToken was returned from Next.

func (*Lexer) AttrVal

func (l *Lexer) AttrVal() []byte

AttrVal returns the attribute value when an AttributeToken was returned from Next.

func (*Lexer) Err

func (l *Lexer) Err() error

Err returns the error encountered during lexing. This is often io.EOF, but other errors can be returned as well.

func (*Lexer) HasTemplate added in v2.7.2

func (l *Lexer) HasTemplate() bool

HasTemplate returns true if the token value contains a template.

func (*Lexer) Next

func (l *Lexer) Next() (TokenType, []byte)

Next returns the next token. It returns ErrorToken when an error was encountered; use Err() to retrieve the error.

func (*Lexer) Text

func (l *Lexer) Text() []byte

Text returns the textual representation of a token. This excludes delimiters and additional leading/trailing characters.

type Tag added in v2.7.7

type Tag struct {
	Root       *AST
	Parent     *Tag
	Prev, Next *Tag
	Children   []*Tag
	Index      int

	Name  []byte
	Attrs []Attr
	// contains filtered or unexported fields
}

func (*Tag) ASTString added in v2.7.7

func (tag *Tag) ASTString() string

func (*Tag) GetAttr added in v2.7.7

func (tag *Tag) GetAttr(key string) (string, bool)

func (*Tag) String added in v2.7.7

func (tag *Tag) String() string

func (*Tag) Text added in v2.7.7

func (tag *Tag) Text() string

type TokenType

type TokenType uint32

TokenType determines the type of token, e.g. a start tag or a comment.

const (
	ErrorToken TokenType = iota // extra token when errors occur
	CommentToken
	DoctypeToken
	StartTagToken
	StartTagCloseToken
	StartTagVoidToken
	EndTagToken
	AttributeToken
	TextToken
	SvgToken
	MathToken
)

TokenType values.

func (TokenType) String

func (tt TokenType) String() string

String returns the string representation of a TokenType.
