gosaxml

package module
v0.0.39
Published: Aug 18, 2022 License: MIT Imports: 5 Imported by: 0

README

gosaxml is a streaming XML decoder and encoder with an interface similar to encoding/xml, but with a focus on performance and a low memory footprint, and on fixing many of the issues in encoding/xml, mainly those related to namespace handling (see https://github.com/golang/go/issues/13400).

In addition to handling namespaces, gosaxml can canonicalize and minify the XML namespace bindings in a document (with and without prefixes). Unlike encoding/xml, it does not repeat the prefix-less namespace declaration on every encoded XML element.

Due to the current implementation, the bytes read from the provided io.Reader must be ASCII or UTF-8 encoded. UTF-16, UTF-32, and other encodings that do not encode ASCII characters identically, or whose multi-byte sequences do not always have the high bit set, are not supported.
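
Because of this restriction, a caller may want to reject unsupported input before handing it to the decoder. The following is a minimal sketch and not part of gosaxml: the helper name checkSupportedEncoding and the error value are hypothetical, and it assumes the whole document is available as a byte slice. It relies on two facts: the bytes 0xFE and 0xFF of a UTF-16 or UTF-32 byte order mark never occur in valid UTF-8, and the '<' character of any XML document contains a zero byte when encoded as UTF-16 or UTF-32.

```go
package main

import (
	"bytes"
	"errors"
	"unicode/utf8"
)

// errUnsupportedEncoding is a hypothetical error used only in this sketch.
var errUnsupportedEncoding = errors.New("input is not ASCII or UTF-8")

// checkSupportedEncoding returns an error if data is not valid UTF-8 or
// contains a NUL byte. ASCII is a subset of UTF-8 and therefore passes.
func checkSupportedEncoding(data []byte) error {
	// A UTF-16/UTF-32 byte order mark contains 0xFE and 0xFF,
	// which are never valid in UTF-8.
	if !utf8.Valid(data) {
		return errUnsupportedEncoding
	}
	// NUL is valid UTF-8 but not a legal XML 1.0 character; UTF-16 and
	// UTF-32 encode '<' with at least one zero byte.
	if bytes.IndexByte(data, 0) >= 0 {
		return errUnsupportedEncoding
	}
	return nil
}
```

For large streams, the same two checks could be applied chunk by chunk, taking care not to split a multi-byte UTF-8 sequence at a chunk boundary.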

Get it

go get -u github.com/HBTGmbH/gosaxml

Features

  • zero-allocation stream decoding of XML inputs (from io.Reader)
  • zero-allocation stream encoding of XML elements (to io.Writer)
  • tidying of XML namespace declarations of the encoder input

Simple examples

Decode and re-encode

The following example (in the form of a Go test) decodes from a given io.Reader and encodes the same tokens into a provided io.Writer:

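package gosaxml_test

import (
	"bytes"
	"io"
	"strings"
	"testing"

	"github.com/HBTGmbH/gosaxml"
	// assert is assumed to be the testify assertion package.
	"github.com/stretchr/testify/assert"
)

func TestDecodeAndEncode(t *testing.T) {
	// given
	var r io.Reader = strings.NewReader(
		`<a xmlns="http://mynamespace.org">
		<b>Hi!</b>
		<c></c>
		</a>`)
	var w bytes.Buffer
	dec := gosaxml.NewDecoder(r)
	enc := gosaxml.NewEncoder(&w)

	// when
	var tk gosaxml.Token // reused for every decoded token
	for {
		err := dec.NextToken(&tk)
		if err == io.EOF {
			break
		}
		assert.Nil(t, err)

		err = enc.EncodeToken(&tk)
		assert.Nil(t, err)
	}
	// Flush writes any remaining buffered bytes to w.
	assert.Nil(t, enc.Flush())

	// then
	assert.Equal(t,
		`<a xmlns="http://mynamespace.org"><b>Hi!</b><c/></a>`,
		w.String())
}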

Documentation

Constants

const (
	TokenTypeInvalid = iota
	TokenTypeStartElement
	TokenTypeEndElement
	TokenTypeProcInst
	TokenTypeDirective
	TokenTypeTextElement
	TokenTypeCharData
)

Constants for Token.Kind.
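
For logging or debugging it can help to turn a Token.Kind value into a readable name. The sketch below is an illustration only: kindName is a hypothetical helper that gosaxml does not provide, and the constants are a local copy of the block above so that the example compiles without the module.

```go
package main

// Local copy of the documented constants, so the sketch is self-contained.
const (
	TokenTypeInvalid = iota
	TokenTypeStartElement
	TokenTypeEndElement
	TokenTypeProcInst
	TokenTypeDirective
	TokenTypeTextElement
	TokenTypeCharData
)

// kindName is a hypothetical helper that maps a Token.Kind value to a
// readable name. Unknown values are reported as "Invalid".
func kindName(kind byte) string {
	switch kind {
	case TokenTypeStartElement:
		return "StartElement"
	case TokenTypeEndElement:
		return "EndElement"
	case TokenTypeProcInst:
		return "ProcInst"
	case TokenTypeDirective:
		return "Directive"
	case TokenTypeTextElement:
		return "TextElement"
	case TokenTypeCharData:
		return "CharData"
	default:
		return "Invalid"
	}
}
```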

Variables

This section is empty.

Functions

This section is empty.

Types

type Attr

type Attr struct {
	Name        Name
	Value       []byte
	SingleQuote bool
}

Attr is an attribute of an element. Only tokens of type TokenTypeStartElement can have attributes.

type Decoder

type Decoder interface {
	// NextToken decodes and stores the next Token into
	// the provided Token pointer.
	// Only the fields relevant for the decoded token type
	// are written to the Token. Other fields may have previous
	// values. The caller should thus determine the Token.Kind
	// and then only read/touch the fields relevant for that kind.
	NextToken(t *Token) error

	// Reset resets the Decoder to the given io.Reader.
	Reset(r io.Reader)
}

Decoder decodes an XML input stream into Token values.
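
Since NextToken only writes the fields relevant for the decoded kind, a reused Token can carry stale values from an earlier call. The following sketch shows one way to structure a decode loop around that rule. It is not taken from gosaxml: the types are trimmed local stand-ins for the documented declarations, fakeDecoder is a hypothetical test double, and startElementNames is a hypothetical helper. Copying the name with string(...) is a precaution that assumes the decoder may reuse its buffers, which the documentation above does not state.

```go
package main

import "io"

// Trimmed local stand-ins for the documented types and constants.
type Name struct{ Local, Prefix []byte }

type Token struct {
	Name     Name
	ByteData []byte
	Kind     byte
}

const (
	TokenTypeInvalid = iota
	TokenTypeStartElement
	TokenTypeEndElement
	TokenTypeProcInst
	TokenTypeDirective
	TokenTypeTextElement
	TokenTypeCharData
)

type Decoder interface {
	NextToken(t *Token) error
	Reset(r io.Reader)
}

// startElementNames drains dec and returns the local names of all
// start elements it produced.
func startElementNames(dec Decoder) ([]string, error) {
	var names []string
	var tk Token // reused across iterations
	for {
		err := dec.NextToken(&tk)
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return names, err
		}
		// Check Kind first: tk.Name may be stale for other kinds.
		if tk.Kind == TokenTypeStartElement {
			names = append(names, string(tk.Name.Local))
		}
	}
}

// fakeDecoder is a hypothetical stand-in that, like the documented
// Decoder, writes only the fields relevant for each token kind.
type fakeDecoder struct {
	kinds []byte
	data  []string
	pos   int
}

func (d *fakeDecoder) NextToken(t *Token) error {
	if d.pos >= len(d.kinds) {
		return io.EOF
	}
	t.Kind = d.kinds[d.pos]
	switch t.Kind {
	case TokenTypeStartElement, TokenTypeEndElement:
		t.Name.Local = []byte(d.data[d.pos])
	default:
		t.ByteData = []byte(d.data[d.pos]) // Name keeps its old value
	}
	d.pos++
	return nil
}

func (d *fakeDecoder) Reset(r io.Reader) { d.pos = 0 }
```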

func NewDecoder

func NewDecoder(r io.Reader) Decoder

NewDecoder creates a new Decoder.

type Encoder

type Encoder struct {
	// contains filtered or unexported fields
}

Encoder encodes Token values to an io.Writer.

func NewEncoder

func NewEncoder(w io.Writer, middlewares ...EncoderMiddleware) *Encoder

NewEncoder creates a new Encoder with the given middlewares and returns a pointer to it.

func (*Encoder) EncodeToken

func (thiz *Encoder) EncodeToken(t *Token) error

EncodeToken first calls any EncoderMiddleware and then writes the byte-representation of that Token to the io.Writer of this Encoder.

func (*Encoder) Flush added in v0.0.30

func (thiz *Encoder) Flush() error

Flush writes all buffered output into the io.Writer. It must be called after token encoding is done in order to write all remaining bytes into the io.Writer.

func (*Encoder) Reset

func (thiz *Encoder) Reset(w io.Writer)

Reset resets this Encoder to write into the provided io.Writer and resets all middlewares.

type EncoderMiddleware

type EncoderMiddleware interface {
	// EncodeToken will be called by the Encoder before the provided Token
	// is finally byte-encoded into the io.Writer.
	// The Encoder will ensure that the pointed-to Token and all its contained
	// field values will remain unmodified for the lexical scope of the
	// XML-element represented by the Token.
	// If, for example, the Token represents a TokenTypeStartElement, then
	// the Token and all of its contained fields/byte-slices will contain
	// their values until after its corresponding TokenTypeEndElement is processed
	// by the EncoderMiddleware.
	EncodeToken(token *Token) error

	// Reset resets the state of an EncoderMiddleware.
	// This can be used to e.g. reset all pre-allocated data structures
	// and reinitialize the EncoderMiddleware to its state before the
	// first call to EncodeToken.
	Reset()
}

EncoderMiddleware allows pre-processing a Token before it is finally encoded and written.
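
As an illustration of the interface, the sketch below implements a small middleware that rejects documents nested deeper than a configured limit. It is a hypothetical example, not part of gosaxml: depthLimiter and errTooDeep are invented names, the types are trimmed local stand-ins for the documented declarations, and it assumes that an error returned from a middleware makes the Encoder abort the token, which the documentation above does not confirm.

```go
package main

import "errors"

// Trimmed local stand-ins for the documented types and constants.
type Token struct {
	Kind byte
}

const (
	TokenTypeInvalid = iota
	TokenTypeStartElement
	TokenTypeEndElement
)

type EncoderMiddleware interface {
	EncodeToken(token *Token) error
	Reset()
}

// errTooDeep is a hypothetical error used only in this sketch.
var errTooDeep = errors.New("element nesting too deep")

// depthLimiter tracks the current element depth and fails once a
// start element would exceed max.
type depthLimiter struct {
	max   int
	depth int
}

func (d *depthLimiter) EncodeToken(t *Token) error {
	switch t.Kind {
	case TokenTypeStartElement:
		if d.depth >= d.max {
			return errTooDeep
		}
		d.depth++
	case TokenTypeEndElement:
		if d.depth > 0 {
			d.depth--
		}
	}
	return nil
}

// Reset restores the state from before the first EncodeToken call.
func (d *depthLimiter) Reset() { d.depth = 0 }

// Compile-time check that depthLimiter satisfies the interface.
var _ EncoderMiddleware = (*depthLimiter)(nil)
```

With the real package, such a value would presumably be passed as one of the middlewares arguments of NewEncoder.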

type Name

type Name struct {
	Local  []byte
	Prefix []byte
}

Name is an XML name, either with a prefix, like "xmlns:blubb", or without one, like "a".
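
Because Local and Prefix are stored separately, printing a name means joining them again. A minimal sketch, assuming that "xmlns:blubb" is split into Prefix "xmlns" and Local "blubb" and that an absent prefix is an empty slice; qualifiedName is a hypothetical helper and Name is a local copy of the struct above:

```go
package main

// Local copy of the documented struct, so the sketch is self-contained.
type Name struct {
	Local  []byte
	Prefix []byte
}

// qualifiedName is a hypothetical helper that returns "prefix:local",
// or just "local" when the name has no prefix.
func qualifiedName(n Name) string {
	if len(n.Prefix) == 0 {
		return string(n.Local)
	}
	return string(n.Prefix) + ":" + string(n.Local)
}
```

The string conversions allocate, so a helper like this fits logging or tests better than a hot decoding loop.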

type NamespaceModifier

type NamespaceModifier struct {
	PreserveOriginalPrefixes bool
	// contains filtered or unexported fields
}

NamespaceModifier can be used to obtain information about the effective namespace of a decoded Token via NamespaceOfToken and to canonicalize/minify namespace declarations.

func NewNamespaceModifier

func NewNamespaceModifier() *NamespaceModifier

NewNamespaceModifier creates a new NamespaceModifier and returns a pointer to it.

func (*NamespaceModifier) EncodeToken

func (thiz *NamespaceModifier) EncodeToken(t *Token) error

EncodeToken will be called by the Encoder before the provided Token is finally byte-encoded into the io.Writer. The Encoder will ensure that the pointed-to Token and all its contained field values will remain unmodified for the lexical scope of the XML-element represented by the Token. If, for example, the Token represents a TokenTypeStartElement, then the Token and all of its contained fields/byte-slices will contain their values until after its corresponding TokenTypeEndElement is processed by the EncoderMiddleware.

func (*NamespaceModifier) NamespaceOfToken

func (thiz *NamespaceModifier) NamespaceOfToken(t *Token) []byte

NamespaceOfToken returns the effective namespace (as byte slice) of the pointed-to Token. The caller must make sure that the Token's fields/values will remain unmodified for the lexical scope of the XML element represented by that token, as per the documentation of EncoderMiddleware.EncodeToken.

func (*NamespaceModifier) Reset

func (thiz *NamespaceModifier) Reset()

Reset resets this NamespaceModifier.

type Token

type Token struct {
	// only for TokenTypeStartElement, TokenTypeEndElement and TokenTypeProcInst
	Name Name

	// only for TokenTypeStartElement
	Attr []Attr

	// only for TokenTypeDirective, TokenTypeTextElement, TokenTypeCharData and TokenTypeProcInst
	ByteData []byte

	Kind byte
}

Token represents the union of all possible token types with their respective information.
