Documentation ¶
Overview ¶
Package tokenizer tokenizes CSS based on part four of the CSS Syntax Module Level 3 (W3C Candidate Recommendation Draft), 24 December 2021.
The main elements of this package are the New function, which returns a new Tokenizer, and the Tokenizer.Next method.
This package also exposes several low-level "Consume" functions, which implement specific algorithms in the CSS specification. Note that all "Consume" functions may panic on I/O error. The Tokenizer.Next method catches these panics. Also note that all "Consume" functions operate on a stream of filtered code points (see https://www.w3.org/TR/css-syntax-3/#input-preprocessing), not raw input. This filtering is implemented by css/tokenizer/filter.Transform and is handled automatically by a Tokenizer returned by New.
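As a hedged illustration of the notes above, the sketch below filters raw input before handing it to a low-level "Consume" function. The wiring is partly assumed: the runeio import path and constructor, and the exact way filter.Transform plugs into x/text/transform, should be checked against those packages before use. Most callers should simply use New and Tokenizer.Next.

package main

import (
    "fmt"
    "strings"

    "golang.org/x/text/transform"

    "github.com/tawesoft/golib/v2/css/tokenizer"
    "github.com/tawesoft/golib/v2/css/tokenizer/filter"
    "github.com/tawesoft/golib/v2/text/runeio" // assumed import path
)

func main() {
    // Apply the code point filtering preprocessing step first: the
    // Consume functions expect filtered input, not raw input. This
    // assumes filter.Transform is a transform.Transformer value.
    filtered := transform.NewReader(strings.NewReader("12.5em"), filter.Transform)

    // Assumed constructor; consult the runeio package for the actual
    // way to construct a *runeio.Reader.
    rdr := runeio.NewReader(filtered)

    // Remember that Consume functions may panic on I/O error;
    // Tokenizer.Next normally catches those panics for you.
    tok := tokenizer.ConsumeNumericToken(rdr)
    fmt.Println(tok) // e.g. a <dimension-token> with value 12.5 and unit "em"
}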
Disclaimer: although this software runs against a thorough and diverse set of test cases, no claims are made about this software's conformance to the W3C specification itself (there is no official W3C test suite for the tokenization step alone).
This software includes material derived from CSS Syntax Module Level 3, W3C Candidate Recommendation Draft, 24 December 2021. Copyright © 2021 W3C® (MIT, ERCIM, Keio, Beihang). See LICENSE-PARTS.txt and TRADEMARKS.md.
Index ¶
- Variables
- func ConsumeBadUrl(rdr *runeio.Reader)
- func ConsumeComments(rdr *runeio.Reader) error
- func ConsumeEscapedCodepoint(rdr *runeio.Reader) rune
- func ConsumeIdentLikeToken(rdr *runeio.Reader) (token.Token, error)
- func ConsumeIdentSequence(rdr *runeio.Reader) string
- func ConsumeNumber(rdr *runeio.Reader) (nt token.NumberType, repr string, value float64)
- func ConsumeNumericToken(rdr *runeio.Reader) token.Token
- func ConsumeString(rdr *runeio.Reader, endpoint rune) (t token.Token, err error)
- func ConsumeUrlToken(rdr *runeio.Reader) (token.Token, error)
- func ConsumeWhitespace(rdr *runeio.Reader) token.Token
- func StringToNumber(x string) float64
- type Tokenizer
Examples ¶
Constants ¶
This section is empty.
Variables ¶
Functions ¶
func ConsumeBadUrl ¶
ConsumeBadUrl consumes the remnants of a bad url from a stream of code points, "cleaning up" after the tokenizer realizes that it’s in the middle of a <bad-url-token> rather than a <url-token>. It returns nothing; its sole use is to consume enough of the input stream to reach a recovery point where normal tokenizing can resume.
func ConsumeComments ¶
ConsumeComments consumes zero or more CSS comments.
func ConsumeEscapedCodepoint ¶
ConsumeEscapedCodepoint consumes an escaped code point. It assumes that the U+005C REVERSE SOLIDUS (\) has already been consumed and that the next input code point has already been verified to be part of a valid escape.
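A sketch (using the assumed runeio setup from the Overview): after the backslash of the escape "\41 " has been consumed, the function decodes the hex escape and consumes one trailing whitespace code point.

    // Remaining input after the consumed backslash: "41 rest".
    // "41" is hex for U+0041 ('A'); the single space after a hex
    // escape is consumed as part of the escape.
    rdr := runeio.NewReader(strings.NewReader("41 rest")) // assumed constructor; input must be pre-filtered
    r := tokenizer.ConsumeEscapedCodepoint(rdr)
    fmt.Printf("%c\n", r) // prints: A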
func ConsumeIdentLikeToken ¶
ConsumeIdentLikeToken consumes an ident-like token from a stream of code points. It returns an <ident-token>, <function-token>, <url-token>, or <bad-url-token>.
func ConsumeIdentSequence ¶
ConsumeIdentSequence consumes an ident sequence from a stream of code points. It returns a string containing the largest name that can be formed from adjacent code points in the stream, starting from the first.
Note: This algorithm does not do the verification of the first few code points that are necessary to ensure the returned code points would constitute an <ident-token>. If that is the intended use, ensure that the stream starts with an ident sequence before calling this algorithm.
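For illustration (reader construction assumed, as in the Overview sketch), the function reads the longest run of name code points, here stopping at the colon:

    rdr := runeio.NewReader(strings.NewReader("background-color: red")) // assumed constructor
    name := tokenizer.ConsumeIdentSequence(rdr)
    fmt.Println(name) // prints: background-color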
func ConsumeNumber ¶
ConsumeNumber consumes a number from a stream of code points. It returns a numeric type (either "integer" or "number"), a string representation, and a numeric value.
The representation is the token lexeme as it appears in the input stream. This preserves details such as whether the value 0.009 was written as ".009" or as "9e-3".
Note: This algorithm does not do the verification of the first few code points that are necessary to ensure a number can be obtained from the stream. Ensure that the stream starts with a number before calling this algorithm.
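A sketch of the repr/value distinction (reader construction assumed, as in the Overview sketch):

    // "9e-3" and ".009" denote the same value but keep different
    // representations.
    rdr := runeio.NewReader(strings.NewReader("9e-3")) // assumed constructor
    nt, repr, value := tokenizer.ConsumeNumber(rdr)
    fmt.Println(nt, repr, value) // type "number" (not "integer"), repr "9e-3", value 0.009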
func ConsumeNumericToken ¶
ConsumeNumericToken consumes a numeric token from a stream of code points. It returns a <number-token>, <percentage-token>, or <dimension-token>.
func ConsumeString ¶
ConsumeString consumes a string token. It assumes that the code point that opens a string (if any) has already been consumed. It returns either a <string-token> or a <bad-string-token>. The endpoint argument specifies the code point that terminates the string (e.g. a double or single quotation mark).
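For example (reader construction assumed, as in the Overview sketch), with the opening quotation mark already consumed:

    // Remaining input after the opening `"` has been consumed.
    rdr := runeio.NewReader(strings.NewReader(`external"]`)) // assumed constructor
    t, err := tokenizer.ConsumeString(rdr, '"') // endpoint: the double quotation mark
    if err == nil {
        fmt.Println(t) // a <string-token> with value "external"
    }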
func ConsumeUrlToken ¶
ConsumeUrlToken consumes a url token from a stream of code points. It returns either a <url-token> or a <bad-url-token>.
Note: This algorithm assumes that the initial "url(" has already been consumed. This algorithm also assumes that it’s being called to consume an "unquoted" value, like url(foo). A quoted value, like url("foo"), is parsed as a <function-token>. ConsumeIdentLikeToken automatically handles this distinction; this algorithm shouldn’t be called directly otherwise.
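A sketch (reader construction assumed, as in the Overview sketch), with the initial "url(" already consumed:

    rdr := runeio.NewReader(strings.NewReader("foo.png) body")) // assumed constructor
    tok, err := tokenizer.ConsumeUrlToken(rdr)
    if err == nil {
        fmt.Println(tok) // a <url-token> for "foo.png"
    }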
func ConsumeWhitespace ¶
ConsumeWhitespace consumes as much whitespace as possible and returns a <whitespace-token>.
func StringToNumber ¶
StringToNumber converts a string to a number according to the CSS specification.
Note: This algorithm does not do any verification to ensure that the string contains only a number. Ensure that the string contains only a valid CSS number before calling this algorithm.
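For example, two spellings of the same number convert to the same value:

    fmt.Println(tokenizer.StringToNumber(".009")) // prints: 0.009
    fmt.Println(tokenizer.StringToNumber("9e-3")) // prints: 0.009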
Types ¶
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Example ¶
package main

import (
    "fmt"
    "strings"

    "github.com/tawesoft/golib/v2/css/tokenizer"
    "github.com/tawesoft/golib/v2/css/tokenizer/token"
)

func main() {
    str := `/* example */ #something[rel~="external"] { background-color: rgb(128, 64, 64); }`
    t := tokenizer.New(strings.NewReader(str))
    for {
        tok := t.NextExcept(token.TypeWhitespace)
        if tok.Is(token.TypeEOF) {
            break
        }
        fmt.Println(tok)
    }
    if len(t.Errors()) > 0 {
        fmt.Printf("%v\n", t.Errors())
    }
}

Output:

<hash-token>{type: "id", value: "something"}
<[-token>
<ident-token>{value: "rel"}
<delim-token>{delim: '~'}
<delim-token>{delim: '='}
<string-token>{value: "external"}
<]-token>
<{-token>
<ident-token>{value: "background-color"}
<colon-token>
<function-token>{value: "rgb"}
<number-token>{type: "integer", value: 128.000000, repr: "128"}
<comma-token>
<number-token>{type: "integer", value: 64.000000, repr: "64"}
<comma-token>
<number-token>{type: "integer", value: 64.000000, repr: "64"}
<)-token>
<semicolon-token>
<}-token>
func (*Tokenizer) Next ¶
Next returns the next token from the input stream. Once the stream has ended, it returns token.EOF().
To detect parse errors, check z.Errors() once the stream has ended, or at any point if you want to fail fast without recovering.
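A minimal read loop (the package Example above shows the NextExcept variant):

    t := tokenizer.New(strings.NewReader(`a { color: red }`))
    for {
        tok := t.Next()
        if tok.Is(token.TypeEOF) {
            break
        }
        fmt.Println(tok)
    }
    // Once the stream has ended, check for parse errors.
    if errs := t.Errors(); len(errs) > 0 {
        fmt.Printf("%v\n", errs)
    }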
func (*Tokenizer) NextExcept ¶
NextExcept is like Tokenizer.Next, except that any tokens matching the given types are suppressed. For example, it is common to ignore whitespace. token.EOF() is never ignored.
Directories ¶
Path | Synopsis
---|---
filter | Package filter implements a [transform.Transformer] that performs the Unicode code point filtering preprocessing step defined in [CSS Syntax Module Level 3, section 3.3].
token | Package token defines CSS tokens produced by a tokenizer.