text

package
v1.0.7
Published: Jan 1, 2024 License: CC0-1.0 Imports: 6 Imported by: 0

Documentation

Overview

Package text implements RFC 8259 compliant string escaping, with a pre-calculation stage that avoids multiple allocations for long inputs.

Index

Constants

const (
	QuotationMark    = 0x22
	QuotationMarkGo  = '"'
	ReverseSolidus   = 0x5c
	ReverseSolidusGo = '\\'
	Solidus          = 0x2f
	SolidusGo        = '/'
	Backspace        = 0x08
	BackspaceGo      = '\b'
	FormFeed         = 0x0c
	FormFeedGo       = '\f'
	LineFeed         = 0x0a
	LineFeedGo       = '\n'
	CarriageReturn   = 0x0d
	CarriageReturnGo = '\r'
	Tab              = 0x09
	TabGo            = '\t'
	Space            = 0x20
	SpaceGo          = ' '
)

The character constants are named after the characters they represent. Each appears twice: once as a hexadecimal value and once, with the Go suffix, as the equivalent Go rune literal. IDEs with inlay hints that expand constant values will show that both members of a pair have the same numeric value.

The human-readable forms are given mainly for educational purposes. The same escape sequences can be used in regular Go double-quoted "" strings to denote the same characters.

Different rules apply to backtick-quoted strings. They can contain any character except a backtick, and escape sequences are taken literally instead of being parsed to their respective bytes. Editors generally won't allow control characters to be placed in these strings. Their purpose is to hold properly flowed text containing line breaks, such as embedded literal text. Backtick strings can contain printf formatting verbs, the same as double-quoted strings.
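The equivalence of the paired constants, and the difference between double-quoted and backtick strings, can be checked with a small program. This is only an illustrative sketch: the constants are local copies of two pairs from the block above, and the names interpreted and raw are made up for the example.

```go
package main

import "fmt"

// Local copies of two constant pairs from the block above; the package
// itself is not imported here.
const (
	LineFeed   = 0x0a
	LineFeedGo = '\n'
	Tab        = 0x09
	TabGo      = '\t'
)

// interpreted and raw are hypothetical names. Both have the same four
// source characters between their quotes.
const (
	interpreted = "a\tb" // \t is parsed into a single tab byte
	raw         = `a\tb` // \t stays as a backslash followed by 't'
)

func main() {
	// Both members of each pair have the same numeric value.
	fmt.Println(LineFeed == LineFeedGo, Tab == TabGo)
	// The interpreted string is one byte shorter than the raw one.
	fmt.Println(len(interpreted), len(raw))
	// A backtick string works as a printf format like any other string.
	fmt.Printf(`%d items`+"\n", 2)
}
```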

Variables

This section is empty.

Functions

func EscapeJSONStringAndWrap

func EscapeJSONStringAndWrap(s string) (escaped []byte)

func EscapeJSONStringAndWrapOld

func EscapeJSONStringAndWrapOld(s string) (escaped []byte)

EscapeJSONStringAndWrapOld takes an arbitrary string and escapes all control characters as per RFC 8259 section 7, https://www.rfc-editor.org/rfc/rfc8259 (retrieved 2023-11-21):

The representation of strings is similar to conventions used in the C family
of programming languages. A string begins and ends with quotation marks. All
Unicode characters may be placed within the quotation marks, except for the
characters that MUST be escaped: quotation mark, reverse solidus, and the
control characters (U+0000 through U+001F).

The string is assumed to be UTF-8 and only the above escapes are processed. The string will be wrapped in double quotes `"` as it is assumed that the string will be added to a JSON document in a place where a string is valid.

The processing proceeds in two passes. The first calculates the expansion required by the characters in the provided string. The second copies the bytes over, adding the escape sequences where indicated. This ensures that even for very long strings only one allocation, of precisely the correct size, is made.
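The two-pass idea can be sketched as follows. This is not the package's code: escapedLen and escapeAndWrap are hypothetical names, the choice of short escapes versus \u00XX is an assumption, and the sketch leaves out any special handling of already-escaped backslashes.

```go
package main

import "fmt"

const hexDigits = "0123456789abcdef"

// escapedLen returns how many bytes c occupies once escaped.
func escapedLen(c byte) int {
	switch c {
	case '"', '\\', '\b', '\f', '\n', '\r', '\t':
		return 2 // backslash plus one character
	}
	if c < 0x20 {
		return 6 // \u00XX
	}
	return 1
}

// escapeAndWrap is a hypothetical two-pass escaper that wraps its
// result in double quotes.
func escapeAndWrap(s string) []byte {
	// Pass 1: compute the exact output size, including the two quotes.
	size := 2
	for i := 0; i < len(s); i++ {
		size += escapedLen(s[i])
	}
	// Pass 2: allocate once, then fill.
	out := make([]byte, 0, size)
	out = append(out, '"')
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch c {
		case '"', '\\':
			out = append(out, '\\', c)
		case '\b':
			out = append(out, '\\', 'b')
		case '\f':
			out = append(out, '\\', 'f')
		case '\n':
			out = append(out, '\\', 'n')
		case '\r':
			out = append(out, '\\', 'r')
		case '\t':
			out = append(out, '\\', 't')
		default:
			if c < 0x20 {
				out = append(out, '\\', 'u', '0', '0', hexDigits[c>>4], hexDigits[c&0x0f])
			} else {
				out = append(out, c)
			}
		}
	}
	return append(out, '"')
}

func main() {
	out := escapeAndWrap("a\"b\n\x01")
	// len and cap are equal because the size was computed up front.
	fmt.Printf("%s len=%d cap=%d\n", out, len(out), cap(out))
}
```

Because the capacity is known before the first append, none of the appends in the second pass has to grow the slice.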

Note that the iteration through the string must proceed byte by byte, as though the string were a []byte, rather than with `for _, c := range s`. The range form makes Go decode the string as UTF-8 and can yield different values. The difference appears for byte values of 128 (0x80) and above, which UTF-8 uses only as parts of multi-byte sequences.
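The difference between the two iteration styles is easy to observe. The helper names byteValues and runeValues below are made up for this illustration.

```go
package main

import "fmt"

// byteValues walks s one byte at a time, as the escaper needs to.
func byteValues(s string) []int {
	var out []int
	for i := 0; i < len(s); i++ {
		out = append(out, int(s[i]))
	}
	return out
}

// runeValues walks s with range, which decodes UTF-8 as it goes.
func runeValues(s string) []int {
	var out []int
	for _, r := range s {
		out = append(out, int(r))
	}
	return out
}

func main() {
	// U+00E9 is two bytes in UTF-8; \xff alone is not valid UTF-8.
	s := "\u00e9\xff"
	fmt.Println(byteValues(s)) // [195 169 255]
	fmt.Println(runeValues(s)) // [233 65533]
}
```

The invalid byte 0xff comes back from range as the replacement rune U+FFFD (65533), so a range-based loop would never see the original byte.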

One last thing to note: the stdlib function `json.Marshal` automatically applies HTML escaping, which turns some valid characters into `\u` escape sequences. From its documentation:

String values encode as JSON strings coerced to valid UTF-8, replacing
invalid bytes with the Unicode replacement rune. So that the JSON will be
safe to embed inside HTML <script> tags, the string is encoded using
HTMLEscape, which replaces "<", ">", "&", U+2028, and U+2029 are escaped to
"\u003c","\u003e", "\u0026", "\u2028", and "\u2029". This replacement can be
disabled when using an Encoder, by calling SetEscapeHTML(false).

And so this code assumes that backslashes need to be escaped, but with special handling to avoid escaping what is already escaped. This allows custom JSON marshalers to not keep adding backslashes to valid UTF-8 entities.
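The HTML escaping described in the quoted stdlib documentation can be seen by comparing json.Marshal with an Encoder that has SetEscapeHTML(false). The wrapper names marshalDefault and marshalNoHTMLEscape are hypothetical; only the encoding/json calls are real.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// marshalDefault encodes s with json.Marshal, which HTML-escapes.
func marshalDefault(s string) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

// marshalNoHTMLEscape encodes s with an Encoder that has HTML escaping
// turned off. Encode appends a newline, which is trimmed here.
func marshalNoHTMLEscape(s string) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(s); err != nil {
		return "", err
	}
	return string(bytes.TrimSuffix(buf.Bytes(), []byte("\n"))), nil
}

func main() {
	a, _ := marshalDefault("<a&b>")
	b, _ := marshalNoHTMLEscape("<a&b>")
	fmt.Println(a) // "\u003ca\u0026b\u003e"
	fmt.Println(b) // "<a&b>"
}
```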

func EscapeString

func EscapeString(dst []byte, s string) []byte

EscapeString for JSON encoding according to RFC8259.

Taken from https://github.com/nbd-wtf/go-nostr/blob/master/utils.go and replaced by EscapeJSONStringAndWrap in file rfc8259.go, which has been tested to be functionally equivalent. The purpose of the replacement is to eliminate extra heap allocations for very long strings such as long-form posts.

Formatting is retained from the original despite being ugly.

func FirstHexCharToValue

func FirstHexCharToValue(in byte) (out byte)

FirstHexCharToValue returns the value of the provided hex character when it occupies the first place of a two-character, 8 bit value.

Two of these functions exist to minimise the computation cost, at the price of doubling the memory cost of the switch lookup tables.

func SecondHexCharToValue

func SecondHexCharToValue(in byte) (out byte)

SecondHexCharToValue returns the hex value of a provided character from the second (last) place in an 8 bit value.
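One plausible reading of the first/second split is that the first-place function returns the digit's value already shifted into the upper four bits, so that a byte is assembled with a single OR. That is an assumption, not something the documentation states. The sketch below uses hypothetical names and computes the values arithmetically instead of using lookup tables; its handling of invalid input (returning 0) is likewise made up.

```go
package main

import "fmt"

// nibble is a hypothetical helper returning the 0-15 value of a hex
// digit. Invalid input yields 0 in this sketch.
func nibble(c byte) byte {
	switch {
	case c >= '0' && c <= '9':
		return c - '0'
	case c >= 'a' && c <= 'f':
		return c - 'a' + 10
	case c >= 'A' && c <= 'F':
		return c - 'A' + 10
	}
	return 0
}

// firstPlace returns the digit's value shifted into the upper four
// bits (assumed behaviour).
func firstPlace(c byte) byte { return nibble(c) << 4 }

// secondPlace returns the digit's value in the lower four bits.
func secondPlace(c byte) byte { return nibble(c) }

// decodeHexByte combines two hex characters into one byte.
func decodeHexByte(first, second byte) byte {
	return firstPlace(first) | secondPlace(second)
}

func main() {
	fmt.Printf("%#x\n", decodeHexByte('5', 'c')) // 0x5c, the reverse solidus
}
```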

func UnescapeByteString

func UnescapeByteString(bs []byte) (o []byte)

UnescapeByteString scans a string assumed to be UTF-8 for escaped UTF-8 characters that must be escaped for JSON/HTML encoding. This means octal `\xxx` escapes and the Unicode backslash escapes `\uXXXX` and `\UXXXX`.

func Unwrap

func Unwrap(wrapped []byte) (unwrapped []byte)

Unwrap is a dumb function that just slices off the first and last bytes, which in the output of the EscapeJSONStringAndWrap function are the quotes around the string.

This can be unsafe to run, as it assumes there are at least two bytes.

TODO: rewrite this all to work from []byte and optional quote wrapping.
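A caller that cannot guarantee the input length could guard the slicing first. The function safeUnwrap below is a hypothetical wrapper, not part of the package, and its extra check that both end bytes are quotes goes beyond what Unwrap is documented to do.

```go
package main

import "fmt"

// safeUnwrap strips one pair of surrounding double quotes. It reports
// false instead of panicking when the input is too short or unquoted.
func safeUnwrap(wrapped []byte) ([]byte, bool) {
	n := len(wrapped)
	if n < 2 || wrapped[0] != '"' || wrapped[n-1] != '"' {
		return nil, false
	}
	return wrapped[1 : n-1], true
}

func main() {
	inner, ok := safeUnwrap([]byte(`"hello"`))
	fmt.Printf("%s %v\n", inner, ok) // hello true

	_, ok = safeUnwrap([]byte(`"`))
	fmt.Println(ok) // false
}
```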

Types

type Buffer

type Buffer struct {
	Pos int
	Buf []byte
}

func NewBuffer

func NewBuffer(b []byte) (buf *Buffer)

NewBuffer returns a new buffer containing the provided slice. This slice can/will be mutated.

func (*Buffer) Bytes

func (b *Buffer) Bytes() (bb []byte)

func (*Buffer) Copy

func (b *Buffer) Copy(length, src, dest int) (e error)

Copy a given length of bytes starting at src position to dest position, and move the cursor to the end of the written segment.

func (*Buffer) Head

func (b *Buffer) Head() []byte

Head returns the buffer from the start until the current Pos position.

func (*Buffer) Read

func (b *Buffer) Read() (bb byte, e error)

Read the next byte out of the buffer or return io.EOF if there are no more.

func (*Buffer) ReadBytes

func (b *Buffer) ReadBytes(count int) (bb []byte, e error)

ReadBytes returns the specified number of bytes and advances the cursor, or returns io.EOF if fewer than that remain after the cursor.

func (*Buffer) ReadEnclosed

func (b *Buffer) ReadEnclosed() (bb []byte, e error)

ReadEnclosed scans quickly while keeping count of open and close brackets [] or braces {} and returns the byte sub-slice starting with a bracket and ending with the closing bracket at the same depth. The kind of bracket counted is selected by the first character.

Ignores anything within quotes.

Useful for quickly finding a potentially valid array or object in JSON.
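The depth-counting scan can be sketched as a standalone function. This is an illustration under assumptions, not the package's implementation: enclosed is a hypothetical name, it works on a plain slice instead of a Buffer, and it reports failure with a bool where ReadEnclosed returns an error.

```go
package main

import "fmt"

// enclosed returns the prefix of b that runs from its opening bracket
// or brace to the matching close at the same depth. Brackets inside
// double-quoted sections are ignored.
func enclosed(b []byte) ([]byte, bool) {
	if len(b) == 0 {
		return nil, false
	}
	var open, closer byte
	switch b[0] {
	case '[':
		open, closer = '[', ']'
	case '{':
		open, closer = '{', '}'
	default:
		return nil, false
	}
	depth := 0
	inQuotes := false
	for i := 0; i < len(b); i++ {
		c := b[i]
		if inQuotes {
			switch c {
			case '\\':
				i++ // skip the escaped byte
			case '"':
				inQuotes = false
			}
			continue
		}
		switch c {
		case '"':
			inQuotes = true
		case open:
			depth++
		case closer:
			depth--
			if depth == 0 {
				return b[:i+1], true
			}
		}
	}
	return nil, false // ran out of input before the depth returned to zero
}

func main() {
	sub, ok := enclosed([]byte(`[1,"]",[2]] tail`))
	fmt.Printf("%s %v\n", sub, ok) // [1,"]",[2]] true
}
```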

func (*Buffer) ReadThrough

func (b *Buffer) ReadThrough(c byte) (bb []byte, e error)

ReadThrough is the same as ReadUntil except it returns a slice *including* the character being sought.

func (*Buffer) ReadUntil

func (b *Buffer) ReadUntil(c byte) (bb []byte, e error)

ReadUntil returns all of the buffer from the Pos at invocation, until the index immediately before the match of the requested character.

The next Read or Write after this will return the found character or mutate it. If the first character at the index of the Pos is the one being sought, it returns a zero length slice.

Note that the implementation does not update the Pos position until either the end of the buffer is reached or the requested character is found, because there is no need to write the value more than once.

When this function returns an error, the state of the buffer is unchanged from prior to the invocation.

If the character is not `"` then any match within a pair of unescaped `"` is ignored. The closing `"` is not counted if it is escaped with a \.

If the character is `"` then any `"` with a `\` before it is ignored (and included in the returned slice).
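The quote-handling rules above can be sketched as an index search over a plain slice. The function indexUnquoted is a hypothetical name, and returning -1 where ReadUntil returns an error is a simplification for the example.

```go
package main

import "fmt"

// indexUnquoted returns the index of the first occurrence of c in b
// that is neither escaped with a backslash nor, when c is not a quote,
// inside a pair of unescaped double quotes. It returns -1 if none.
func indexUnquoted(b []byte, c byte) int {
	inQuotes := false
	for i := 0; i < len(b); i++ {
		ch := b[i]
		if ch == '\\' && c != '\\' {
			i++ // skip the escaped byte, for example the quote in \"
			continue
		}
		if ch == '"' && c != '"' {
			inQuotes = !inQuotes
			continue
		}
		if ch == c && !inQuotes {
			return i
		}
	}
	return -1
}

func main() {
	b := []byte(`"a,b",c`)
	i := indexUnquoted(b, ',')
	// The comma inside the quotes is skipped; the slice stops just
	// before the match.
	fmt.Printf("%d %s\n", i, b[:i]) // 5 "a,b"

	fmt.Println(indexUnquoted([]byte(`a\"b"c`), '"')) // 4
}
```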

func (*Buffer) Scan

func (b *Buffer) Scan(c byte, through, slice bool) (subSlice []byte, e error)

Scan is the utility back end that does all the scan/read functionality.

func (*Buffer) ScanForOneOf

func (b *Buffer) ScanForOneOf(through bool, c ...byte) (which byte, e error)

ScanForOneOf provides the ability to scan for two or more different bytes.

For simplicity it does not skip quoted sections. It was written to find quotes or braces, and to be clear, it is very bare.

If through is set to true, the cursor is advanced to the next position after the match.

func (*Buffer) ScanThrough

func (b *Buffer) ScanThrough(c byte) (e error)

ScanThrough does the same as ScanUntil except it leaves the cursor at the next index *after* the found item.

func (*Buffer) ScanUntil

func (b *Buffer) ScanUntil(c byte) (e error)

ScanUntil does the same as ReadUntil except it doesn't slice what it passed over.

func (*Buffer) String

func (b *Buffer) String() (s string)

func (*Buffer) Tail

func (b *Buffer) Tail() []byte

Tail returns the buffer starting from the current Pos position.

func (*Buffer) Write

func (b *Buffer) Write(bb byte) (e error)

Write a byte into the next index of the buffer or return io.EOF if there is no space left.

func (*Buffer) WriteBytes

func (b *Buffer) WriteBytes(bb []byte) (e error)

WriteBytes overwrites the buffer, starting at the current position, with the bytes given.

Returns io.EOF if the write would exceed the end of the buffer, and does not perform the operation, nor move the cursor.
