Documentation ¶
Overview ¶
Package text implements RFC 8259 compliant string escaping, with a pre-calculation stage that eliminates the risk of multiple allocations for long inputs.
Index ¶
- Constants
- func EscapeJSONStringAndWrap(s string) (escaped []byte)
- func EscapeJSONStringAndWrapOld(s string) (escaped []byte)
- func EscapeString(dst []byte, s string) []byte
- func FirstHexCharToValue(in byte) (out byte)
- func SecondHexCharToValue(in byte) (out byte)
- func UnescapeByteString(bs []byte) (o []byte)
- func Unwrap(wrapped []byte) (unwrapped []byte)
- type Buffer
- func (b *Buffer) Bytes() (bb []byte)
- func (b *Buffer) Copy(length, src, dest int) (e error)
- func (b *Buffer) Head() []byte
- func (b *Buffer) Read() (bb byte, e error)
- func (b *Buffer) ReadBytes(count int) (bb []byte, e error)
- func (b *Buffer) ReadEnclosed() (bb []byte, e error)
- func (b *Buffer) ReadThrough(c byte) (bb []byte, e error)
- func (b *Buffer) ReadUntil(c byte) (bb []byte, e error)
- func (b *Buffer) Scan(c byte, through, slice bool) (subSlice []byte, e error)
- func (b *Buffer) ScanForOneOf(through bool, c ...byte) (which byte, e error)
- func (b *Buffer) ScanThrough(c byte) (e error)
- func (b *Buffer) ScanUntil(c byte) (e error)
- func (b *Buffer) String() (s string)
- func (b *Buffer) Tail() []byte
- func (b *Buffer) Write(bb byte) (e error)
- func (b *Buffer) WriteBytes(bb []byte) (e error)
Constants ¶
const (
	QuotationMark      = 0x22
	QuotationMarkGo    = '"'
	ReverseSolidus     = 0x5c
	ReverseSolidusGo   = '\\'
	Solidus            = 0x2f
	SolidusGo          = '/'
	Backspace          = 0x08
	BackspaceGo        = '\b'
	FormFeed           = 0x0c
	FormFeedGo         = '\f'
	LineFeed           = 0x0a
	LineFeedGo         = '\n'
	CarriageReturn     = 0x0d
	CarriageReturnGo   = '\r'
	Tab                = 0x09
	TabGo              = '\t'
	Space              = 0x20
	SpaceGo            = ' '
)
The character constants are named after the characters they represent. Each one is paired with a second constant, suffixed `Go`, that holds the equivalent Go rune literal. IDEs with inlay hints that expand the values will show that both members of a pair have the same numeric value.
The human readable forms are given in order to educate more than anything else. The same symbols can be used in regular Go double quoted "" strings to indicate the same character.
Different rules apply to backtick quoted strings. They allow any character except a backtick to be placed in the string, and escape sequences are taken literally instead of being parsed to their respective bytes. Editors generally won't allow control characters to be placed in these strings; their purpose is to hold properly flowed text that contains line breaks, such as embedded literal text. Backtick strings can contain printf formatting, the same as double quoted strings.
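The relationship between the numeric constants, the rune literals, and the two string forms can be checked directly. The following is a minimal sketch that copies a few of the constant values from the list above into a standalone program; the variable names `interpreted` and `raw` are illustrative only and are not part of this package.

```go
package main

import "fmt"

// A subset of the constants listed above, repeated here so the example
// compiles on its own.
const (
	QuotationMark    = 0x22
	QuotationMarkGo  = '"'
	ReverseSolidus   = 0x5c
	ReverseSolidusGo = '\\'
	LineFeed         = 0x0a
	LineFeedGo       = '\n'
)

// Hypothetical example values: the same source text in both string forms.
var (
	interpreted = "a\tb" // \t is parsed to a single tab byte
	raw         = `a\tb` // \t stays as a backslash followed by 't'
)

func main() {
	// Each pair holds the same numeric value.
	fmt.Println(QuotationMark == QuotationMarkGo)
	fmt.Println(ReverseSolidus == ReverseSolidusGo)
	fmt.Println(LineFeed == LineFeedGo)

	// The interpreted string is one byte shorter than the raw one.
	fmt.Println(len(interpreted), len(raw))
}
```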
Variables ¶
This section is empty.
Functions ¶
func EscapeJSONStringAndWrap ¶
func EscapeJSONStringAndWrapOld ¶
EscapeJSONStringAndWrapOld takes an arbitrary string and escapes all control characters as per RFC 8259 section 7, https://www.rfc-editor.org/rfc/rfc8259 (retrieved 2023-11-21):
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
The string is assumed to be UTF-8 and only the above escapes are processed. The string will be wrapped in double quotes `"` as it is assumed that the string will be added to a JSON document in a place where a string is valid.
The processing proceeds in two passes, first calculating the required expansion for the characters in the provided string, and then copying over and adding the required escape code expansions as indicated, to ensure that for very long strings only one allocation, of precisely the correct amount, is made.
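The two-pass approach can be sketched as follows. This is not the package's implementation; `shortEscape`, `escapedLen` and `escapeAndWrap` are hypothetical names, and the sketch assumes that the short forms `\b`, `\f`, `\n`, `\r` and `\t` are used where RFC 8259 defines them and `\u00XX` for the remaining control characters.

```go
package main

import "fmt"

const hexDigits = "0123456789abcdef"

// shortEscape returns the character that follows the backslash in a
// two-character escape for c, or 0 if c has no short escape.
func shortEscape(c byte) byte {
	switch c {
	case '"', '\\':
		return c
	case '\b':
		return 'b'
	case '\f':
		return 'f'
	case '\n':
		return 'n'
	case '\r':
		return 'r'
	case '\t':
		return 't'
	}
	return 0
}

// escapedLen is the first pass: it only counts the output bytes,
// excluding the surrounding quotes.
func escapedLen(s string) (n int) {
	for i := 0; i < len(s); i++ {
		switch c := s[i]; {
		case shortEscape(c) != 0:
			n += 2 // backslash plus one character
		case c < 0x20:
			n += 6 // \u00XX
		default:
			n++
		}
	}
	return n
}

// escapeAndWrap is the second pass: it allocates once, with exactly the
// capacity computed by escapedLen plus two bytes for the quotes.
func escapeAndWrap(s string) []byte {
	out := make([]byte, 0, escapedLen(s)+2)
	out = append(out, '"')
	for i := 0; i < len(s); i++ {
		c := s[i]
		if e := shortEscape(c); e != 0 {
			out = append(out, '\\', e)
		} else if c < 0x20 {
			out = append(out, '\\', 'u', '0', '0', hexDigits[c>>4], hexDigits[c&0x0f])
		} else {
			out = append(out, c)
		}
	}
	return append(out, '"')
}

func main() {
	out := escapeAndWrap("say \"hi\"\n")
	fmt.Printf("%s len=%d cap=%d\n", out, len(out), cap(out))
}
```

Because the capacity is computed up front, none of the `append` calls in the second pass can trigger a reallocation, and the final length equals the capacity.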
Note that iteration through the string must proceed byte by byte, as though the string were a []byte, rather than using `for _, c := range s`. A range loop prompts Go to decode the string as UTF-8 and can return a different result: once the byte values pass 0x7F, a byte that does not form part of a valid UTF-8 sequence is reported as the Unicode replacement rune, and a valid multi-byte sequence is returned as a single rune.
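The difference between the two loop forms is easy to demonstrate. In this small sketch, `byteValues` and `runeValues` are hypothetical helpers written only for the comparison.

```go
package main

import "fmt"

// byteValues walks s one byte at a time, the way the escaping code must.
func byteValues(s string) (out []int) {
	for i := 0; i < len(s); i++ {
		out = append(out, int(s[i]))
	}
	return out
}

// runeValues walks s with range, which decodes UTF-8 as it goes.
func runeValues(s string) (out []int) {
	for _, c := range s {
		out = append(out, int(c))
	}
	return out
}

func main() {
	// One ASCII byte followed by two bytes that are not valid UTF-8
	// on their own.
	s := "\x7f\x80\xff"
	fmt.Println(byteValues(s)) // [127 128 255]
	fmt.Println(runeValues(s)) // [127 65533 65533]

	// A valid two-byte sequence: two bytes, but a single rune.
	e := "\u00e9"
	fmt.Println(len(byteValues(e)), len(runeValues(e))) // 2 1
}
```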
One last thing to note: the stdlib function `json.Marshal` automatically runs HTML escape processing, which turns some valid characters into `\u` escapes. The encoding/json documentation describes it as follows:
String values encode as JSON strings coerced to valid UTF-8, replacing invalid bytes with the Unicode replacement rune. So that the JSON will be safe to embed inside HTML <script> tags, the string is encoded using HTMLEscape, which replaces "<", ">", "&", U+2028, and U+2029 are escaped to "\u003c","\u003e", "\u0026", "\u2028", and "\u2029". This replacement can be disabled when using an Encoder, by calling SetEscapeHTML(false).
The assumption this code makes, therefore, is that backslashes need to be escaped, but that this needs special handling so as not to escape what is already escaped. This allows custom JSON marshalers to avoid repeatedly adding backslashes to valid UTF-8 entities.
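The HTML escaping performed by `json.Marshal`, and the way to turn it off with an `Encoder`, can be seen with the standard library alone. The helper names `marshalDefault` and `marshalNoHTMLEscape` below are hypothetical and exist only for this illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// marshalDefault encodes s with json.Marshal, which applies HTML escaping.
func marshalDefault(s string) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// marshalNoHTMLEscape encodes s with an Encoder that has HTML escaping
// disabled. Encode appends a newline, which is trimmed here.
func marshalNoHTMLEscape(s string) string {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(s); err != nil {
		panic(err)
	}
	return string(bytes.TrimSuffix(buf.Bytes(), []byte("\n")))
}

func main() {
	fmt.Println(marshalDefault("<a&b>"))      // "\u003ca\u0026b\u003e"
	fmt.Println(marshalNoHTMLEscape("<a&b>")) // "<a&b>"
}
```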
func EscapeString ¶
EscapeString for JSON encoding according to RFC8259.
Taken from https://github.com/nbd-wtf/go-nostr/blob/master/utils.go. It has been replaced by EscapeJSONStringAndWrap in the file rfc8259.go, which has been tested to be functionally equivalent. The purpose of that function is to eliminate extra heap allocations for very long strings such as long form posts.
Formatting is retained from the original despite being ugly.
func FirstHexCharToValue ¶
FirstHexCharToValue returns the hex value of a provided character from the first place in an 8 bit value of two characters.
Two of these functions exist to minimise the computation cost, thus doubling the memory cost in the switch lookup table.
func SecondHexCharToValue ¶
SecondHexCharToValue returns the hex value of a provided character from the second (last) place in an 8 bit value.
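The trade-off described above (two lookups, so that no shift is needed when combining the digits) might look something like the sketch below. It is an assumption about the intent, not the package's code: it swaps the switch statements for two 256-entry arrays, assumes the first-place lookup returns the value already shifted into the high four bits, and uses the hypothetical names `firstTable`, `secondTable` and `hexPairToByte`. Characters that are not hex digits silently map to 0 here.

```go
package main

import "fmt"

// firstTable maps a hex digit to its value in the high four bits;
// secondTable maps it to its value in the low four bits.
var firstTable, secondTable [256]byte

func init() {
	const digits = "0123456789abcdef"
	for v := 0; v < len(digits); v++ {
		c := digits[v]
		firstTable[c] = byte(v) << 4
		secondTable[c] = byte(v)
		if c >= 'a' {
			// Also accept the upper case letters A to F.
			firstTable[c-32] = byte(v) << 4
			secondTable[c-32] = byte(v)
		}
	}
}

// hexPairToByte combines two hex digits into one byte with a single OR.
func hexPairToByte(hi, lo byte) byte {
	return firstTable[hi] | secondTable[lo]
}

func main() {
	fmt.Printf("%#02x\n", hexPairToByte('5', 'c')) // 0x5c
	fmt.Printf("%#02x\n", hexPairToByte('F', 'f')) // 0xff
}
```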
func UnescapeByteString ¶
UnescapeByteString scans a string assumed to be UTF-8 for escaped UTF-8 characters that must be escaped for JSON/HTML encoding. This means octal `\xxx` escapes and the Unicode backslash escapes `\uXXXX` and `\UXXXX`.
Types ¶
type Buffer ¶
func NewBuffer ¶
NewBuffer returns a new buffer containing the provided slice. This slice can/will be mutated.
func (*Buffer) Copy ¶
Copy a given length of bytes starting at src position to dest position, and move the cursor to the end of the written segment.
func (*Buffer) ReadBytes ¶
ReadBytes returns the specified number of bytes and advances the cursor, or returns io.EOF if fewer than that many bytes remain after the cursor.
func (*Buffer) ReadEnclosed ¶
ReadEnclosed scans quickly while keeping count of open and close brackets [] or braces {} and returns the byte sub-slice starting with a bracket and ending with the same depth bracket. Selects the counted characters based on the first.
Ignores anything within quotes.
Useful for quickly finding a potentially valid array or object in JSON.
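The depth-counting scan can be sketched as a standalone function. This is an illustration under assumptions rather than the method itself: the hypothetical `enclosed` works on a plain slice that starts at the opening bracket, has no cursor, and chooses io.EOF for unterminated input.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// enclosed returns the prefix of b that starts with '[' or '{' and ends
// at the matching closing character, ignoring anything inside quotes.
func enclosed(b []byte) ([]byte, error) {
	if len(b) == 0 {
		return nil, io.EOF
	}
	var open, closer byte
	switch b[0] {
	case '[':
		open, closer = '[', ']'
	case '{':
		open, closer = '{', '}'
	default:
		return nil, errors.New("input does not start with [ or {")
	}
	depth := 0
	inQuotes := false
	for i := 0; i < len(b); i++ {
		c := b[i]
		if inQuotes {
			switch c {
			case '\\':
				i++ // skip the escaped character
			case '"':
				inQuotes = false
			}
			continue
		}
		switch c {
		case '"':
			inQuotes = true
		case open:
			depth++
		case closer:
			depth--
			if depth == 0 {
				return b[:i+1], nil
			}
		}
	}
	return nil, io.EOF // ran out of input before the depth returned to zero
}

func main() {
	out, err := enclosed([]byte(`{"a":"}","b":[1,{"c":2}]} trailing`))
	fmt.Printf("%s %v\n", out, err)
}
```

Only the bracket type found at the first byte is counted, so a `]` inside an object or a `}` inside a quoted string does not affect the depth.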
func (*Buffer) ReadThrough ¶
ReadThrough is the same as ReadUntil except it returns a slice *including* the character being sought.
func (*Buffer) ReadUntil ¶
ReadUntil returns all of the buffer from the Pos at invocation, until the index immediately before the match of the requested character.
The next Read or Write after this will return the found character or mutate it. If the first character at the index of the Pos is the one being sought, it returns a zero length slice.
Note that the implementation does not update the Pos position until either the end of the buffer is reached or the requested character is found, because there is no need to write the value more than once.
When this function returns an error, the state of the buffer is unchanged from prior to the invocation.
If the character is not `"` then any match within a pair of unescaped `"` is ignored. The closing `"` is not counted if it is escaped with a \.
If the character is `"` then any `"` with a `\` before it is ignored (and included in the returned slice).
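The quote handling rules above can be sketched as a plain index search. The function `indexUnquoted` is hypothetical and leaves out the buffer and cursor; it also assumes the simplest reading of the rules, in which a `"` counts as escaped whenever the byte before it is a backslash.

```go
package main

import "fmt"

// indexUnquoted returns the index of the first match of c in b under the
// quoting rules described above, or -1 if there is none.
func indexUnquoted(b []byte, c byte) int {
	inQuotes := false
	for i := 0; i < len(b); i++ {
		ch := b[i]
		// Simplification: only the immediately preceding byte is checked.
		escaped := i > 0 && b[i-1] == '\\'
		if c == '"' {
			// Seeking a quote: escaped quotes are passed over.
			if ch == '"' && !escaped {
				return i
			}
			continue
		}
		if ch == '"' && !escaped {
			// An unescaped quote opens or closes a quoted section.
			inQuotes = !inQuotes
			continue
		}
		if ch == c && !inQuotes {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexUnquoted([]byte(`"a,b",c`), ',')) // 5
	fmt.Println(indexUnquoted([]byte(`a\"b"c`), '"'))  // 4
}
```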
func (*Buffer) ScanForOneOf ¶
ScanForOneOf provides the ability to scan for two or more different bytes.
For simplicity it does not skip quoted sections. It was written to find quotes or braces, and the implementation is deliberately very bare.
If through is set to true, the cursor is advanced to the position after the match.
func (*Buffer) ScanThrough ¶
ScanThrough does the same as ScanUntil except it returns the next index *after* the found item.
func (*Buffer) ScanUntil ¶
ScanUntil does the same as ReadUntil except it doesn't slice what it passed over.
func (*Buffer) Write ¶
Write a byte into the next index of the buffer or return io.EOF if there is no space left.
func (*Buffer) WriteBytes ¶
WriteBytes copies over top of the current buffer with the bytes given.
Returns io.EOF if the write would exceed the end of the buffer, and does not perform the operation, nor move the cursor.
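The all-or-nothing behaviour of the write methods can be sketched with a small fixed-size buffer. The type `fixedBuffer` and its fields `buf` and `pos` are hypothetical and are not the layout of `Buffer`; the sketch only assumes a slice of fixed length and an integer cursor.

```go
package main

import (
	"fmt"
	"io"
)

// fixedBuffer is a hypothetical stand-in: a slice that never grows and a
// cursor marking the next write position.
type fixedBuffer struct {
	buf []byte
	pos int
}

// write stores one byte at the cursor, or returns io.EOF if the buffer
// is full.
func (b *fixedBuffer) write(c byte) error {
	if b.pos >= len(b.buf) {
		return io.EOF
	}
	b.buf[b.pos] = c
	b.pos++
	return nil
}

// writeBytes copies bb over the buffer at the cursor. If bb does not fit,
// nothing is written and the cursor does not move.
func (b *fixedBuffer) writeBytes(bb []byte) error {
	if len(bb) > len(b.buf)-b.pos {
		return io.EOF
	}
	copy(b.buf[b.pos:], bb)
	b.pos += len(bb)
	return nil
}

func main() {
	b := &fixedBuffer{buf: make([]byte, 4)}
	fmt.Println(b.writeBytes([]byte("abc")), b.pos)
	fmt.Println(b.writeBytes([]byte("de")), b.pos) // too long: rejected
	fmt.Println(b.write('d'), b.pos)
	fmt.Println(b.write('e'), b.pos) // buffer is full
}
```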