stringutils

package

Version: v0.0.2 (latest) Published: Oct 16, 2024 License: MIT, Unlicense Imports: 7 Imported by: 17

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/unionj-cloud/toolkit

README ¶

Go-string

Useful string utility functions for Go projects, included either because they are faster than the common Go version or because they do not exist in the standard library.

You can find all details here: https://pkg.go.dev/github.com/boyter/go-string

Probably the most useful methods are IndexAll and IndexAllIgnoreCase, which for string literal searches should be drop-in replacements for regexp.FindAllIndex while avoiding the regular expression engine entirely, and as such are much faster.

Some quick benchmarks using a simple program which opens a 550MB file and searches over it in memory. Each search is run three times, first using regexp.FindAllIndex and then using IndexAllIgnoreCase.

For this specific example the wall clock time is at least 10x lower, with the same matching results.

$ ./csperf ſecret 550MB
File length 576683100

FindAllIndex (regex ignore case)
Scan took 25.403231773s 16680
Scan took 25.39742299s 16680
Scan took 25.227218738s 16680

IndexAllIgnoreCase (custom)
Scan took 2.04013314s 16680
Scan took 2.019360935s 16680
Scan took 1.996732171s 16680

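The code for the above example, for you to copy:

// Simple test comparison between various search methods
package main

import (
	"fmt"
	"os"
	"regexp"
	"time"

	// Import path taken from the README link above; adjust it if you
	// consume these functions through a different module.
	str "github.com/boyter/go-string"
)

func main() {
	// Expect the search term and the file to search as arguments.
	if len(os.Args) < 3 {
		fmt.Println("usage: csperf <needle> <file>")
		return
	}
	arg1 := os.Args[1]
	arg2 := os.Args[2]

	b, err := os.ReadFile(arg2)
	if err != nil {
		fmt.Println(err)
		return
	}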

	fmt.Println("File length", len(b))

	haystack := string(b)

	var start time.Time
	var elapsed time.Duration

	fmt.Println("\nFindAllIndex (regex)")
	r := regexp.MustCompile(regexp.QuoteMeta(arg1))
	for i := 0; i < 3; i++ {
		start = time.Now()
		all := r.FindAllIndex(b, -1)
		elapsed = time.Since(start)
		fmt.Println("Scan took", elapsed, len(all))
	}

	fmt.Println("\nIndexAll (custom)")
	for i := 0; i < 3; i++ {
		start = time.Now()
		all := str.IndexAll(haystack, arg1, -1)
		elapsed = time.Since(start)
		fmt.Println("Scan took", elapsed, len(all))
	}

	r = regexp.MustCompile(`(?i)` + regexp.QuoteMeta(arg1))
	fmt.Println("\nFindAllIndex (regex ignore case)")
	for i := 0; i < 3; i++ {
		start = time.Now()
		all := r.FindAllIndex(b, -1)
		elapsed = time.Since(start)
		fmt.Println("Scan took", elapsed, len(all))
	}

	fmt.Println("\nIndexAllIgnoreCase (custom)")
	for i := 0; i < 3; i++ {
		start = time.Now()
		all := str.IndexAllIgnoreCase(haystack, arg1, -1)
		elapsed = time.Since(start)
		fmt.Println("Scan took", elapsed, len(all))
	}
}

Note that it performs best with real documents and worst when searching over random data. Depending on what you are searching, you may see a similar speed-up or only a marginal one.

IndexAll shows a similar speed-up over regexp.FindAllIndex:

// BenchmarkFindAllIndex-8                         2458844	       480.0 ns/op
// BenchmarkIndexAll-8                            14819680	        79.6 ns/op

See the benchmarks for the full evidence; they cover various edge cases.

The other most useful method is HighlightString. HighlightString takes in some content and locations and then inserts in/out strings which can be used for highlighting around matching terms. For example you could pass in "test" and have it return "<strong>te</strong>st". The argument locations accepts output from regexp.FindAllIndex or the included IndexAllIgnoreCase or IndexAll.

All code is dual-licenced under either the MIT licence or the Unlicense. The choice is yours when you use it.

Note that as an Australian I cannot put this into the public domain, hence the choice of the most liberal licences I can find.

Documentation ¶

Index ¶

Variables
func AllSimpleFold(input rune) []rune
func Contains(elements []string, needle string) bool
func ContainsI(s string, substr string) bool
func HasPrefixI(s, prefix string) bool
func HighlightString(content string, locations [][]int, in string, out string) string
func IndexAll(haystack string, needle string, limit int) [][]int
func IndexAllIgnoreCase(haystack string, needle string, limit int) [][]int
func IsEmpty(s string) bool
func IsNotEmpty(s string) bool
func IsSpace(firstByte, nextByte byte) bool
func PermuteCase(input string) []string
func PermuteCaseFolding(input string) []string
func RemoveStringDuplicates(elements []string) []string
func ReplaceAtRuneIndex(in string, r rune, i int) string
func ReplaceStringAtByteIndex(in string, replace string, start int, end int) string
func ReplaceStringAtByteIndexBatch(in string, args []string, locs [][]int) string
func StartOfRune(b byte) bool
func ToCamel(s string) string
func ToTitle(s string) string

Constants ¶

This section is empty.

Variables ¶

var CacheSize = 10

CacheSize is public so that it can be modified depending on project needs. Increasing this value caches more of the case permutations, which can improve performance when the same searches are run over and over.

Functions ¶

func AllSimpleFold ¶

func AllSimpleFold(input rune) []rune

AllSimpleFold, given an input rune, returns a rune slice containing all of its possible simple folds.
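As a rough illustration of what "all simple folds" means, the sketch below walks the orbit returned by the standard library's unicode.SimpleFold. The name simpleFolds is hypothetical, and this is only one plausible way to get such a result, not necessarily how AllSimpleFold is implemented.

```go
package main

import (
	"fmt"
	"unicode"
)

// simpleFolds is a hypothetical stand-in for AllSimpleFold. It follows the
// unicode.SimpleFold orbit of r until the orbit returns to r itself.
func simpleFolds(r rune) []rune {
	folds := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		folds = append(folds, next)
	}
	return folds
}

func main() {
	// 's' folds to itself, 'S' and the long s 'ſ' (U+017F).
	fmt.Println(string(simpleFolds('s')))
	// A digit has no case variants, so only the input comes back.
	fmt.Println(string(simpleFolds('1')))
}
```

The returned order depends on where the orbit is entered, so callers of a helper like this should not rely on it.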

func Contains ¶

func Contains(elements []string, needle string) bool

Contains checks the supplied slice of string for the existence of a string and returns true if found, and false otherwise

func ContainsI ¶

func ContainsI(s string, substr string) bool

ContainsI reports whether s contains substr, ignoring case.

func HasPrefixI ¶

func HasPrefixI(s, prefix string) bool

HasPrefixI reports whether s begins with prefix, ignoring case.

func HighlightString ¶

func HighlightString(content string, locations [][]int, in string, out string) string

HighlightString takes in some content and locations and then inserts in/out strings which can be used for highlighting around matching terms. For example you could pass in "test" and have it return "<strong>te</strong>st". The locations argument accepts output from regexp.FindAllIndex, IndexAllIgnoreCase or IndexAll.
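To make the in/out insertion concrete, here is a minimal sketch of the idea using only the standard library. The function highlight is hypothetical and deliberately simplified: it assumes the locations are sorted, non-overlapping [start, end) byte offsets and skips anything else, which may differ from how HighlightString treats such input.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// highlight is a hypothetical, simplified stand-in for HighlightString.
// Assumption: locations are sorted, non-overlapping [start, end) byte offsets.
func highlight(content string, locations [][]int, in, out string) string {
	var sb strings.Builder
	prev := 0
	for _, loc := range locations {
		// Skip malformed, out-of-range or overlapping spans in this sketch.
		if len(loc) < 2 || loc[0] < prev || loc[0] > loc[1] || loc[1] > len(content) {
			continue
		}
		sb.WriteString(content[prev:loc[0]]) // text before the match
		sb.WriteString(in)
		sb.WriteString(content[loc[0]:loc[1]]) // the match itself
		sb.WriteString(out)
		prev = loc[1]
	}
	sb.WriteString(content[prev:]) // remaining text after the last match
	return sb.String()
}

func main() {
	content := "test"
	// Locations come straight from the regexp package, as the docs describe.
	locs := regexp.MustCompile(`te`).FindAllIndex([]byte(content), -1)
	fmt.Println(highlight(content, locs, "<strong>", "</strong>"))
}
```

Because the locations are plain [][]int pairs, the same call works with offsets from any matcher that reports byte positions.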

func IndexAll ¶

func IndexAll(haystack string, needle string, limit int) [][]int

IndexAll extracts all of the locations of a string inside another string up to the defined limit, and does so without regular expressions, which makes it faster than FindAllIndex in most situations while not being any slower. It performs worst when working against random data.

Some benchmark results to illustrate the point (find more in index_benchmark_test.go):

BenchmarkFindAllIndex-8     2458844    480.0 ns/op
BenchmarkIndexAll-8        14819680     79.6 ns/op

For pure literal searches, i.e. no regular expression logic, this method is a drop-in replacement for regexp.FindAllIndex but generally much faster.

As with FindAllIndex, the limit can be passed as -1 to get all matches.

Note that this method is explicitly case sensitive in its matching. A return value of nil indicates no match.
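The sketch below shows the general shape of a literal, regex-free search that returns the same [][]int layout as the regexp package. It is built on strings.Index, the name indexAllLiteral is hypothetical, and treating an empty needle or a zero limit as "no match" is an assumption of this sketch rather than documented behaviour of IndexAll.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// indexAllLiteral is a hypothetical stand-in for IndexAll. It returns
// non-overlapping [start, end) byte offsets of needle in haystack, stopping
// after limit matches, or returning every match when limit is negative.
func indexAllLiteral(haystack, needle string, limit int) [][]int {
	if needle == "" || limit == 0 {
		return nil // assumption made for this sketch
	}
	var locs [][]int
	offset := 0
	for limit < 0 || len(locs) < limit {
		i := strings.Index(haystack[offset:], needle)
		if i < 0 {
			break
		}
		start := offset + i
		end := start + len(needle)
		locs = append(locs, []int{start, end})
		offset = end // continue searching after this match
	}
	return locs // nil when nothing matched
}

func main() {
	haystack := "a secret is a Secret until the secret is out"

	// Literal search without the regexp engine.
	fmt.Println(indexAllLiteral(haystack, "secret", -1))

	// The equivalent regexp call, which reports the same offsets.
	re := regexp.MustCompile(regexp.QuoteMeta("secret"))
	fmt.Println(re.FindAllStringIndex(haystack, -1))
}
```

Both calls are case sensitive, so the capitalised "Secret" is not reported by either.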

func IndexAllIgnoreCase ¶

func IndexAllIgnoreCase(haystack string, needle string, limit int) [][]int

IndexAllIgnoreCase extracts all of the locations of a string inside another string up to the defined limit. It is designed to be faster than uses of FindAllIndex with case insensitive matching enabled, by looking for string literals first and then checking for exact matches. It also does so in a Unicode-aware way, such that a search for S will match S, s and ſ, which a simple strings.ToLower over the haystack and the needle will not.

The result is the ability to search for literals without hitting the regex engine, which can at times be horribly slow. This by contrast is much faster. See index_ignorecase_benchmark_test.go for some head-to-head results. Generally, so long as we aren't dealing with random data, this method should be considerably faster (in some cases thousands of times) or just as fast. Of course it cannot do regular expressions, but that's fine.

For pure literal searches, i.e. no regular expression logic, this method is a drop-in replacement for regexp.FindAllIndex but generally much faster.
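The Unicode point is easy to miss, so here is a small sketch of the matching semantics only. It is a naive rune-by-rune scan built on unicode.SimpleFold, not the literal-first strategy described above, and it makes no claim about performance. The names foldEqual, matchAt and indexAllFold are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// foldEqual reports whether a and b are equal under simple case folding.
func foldEqual(a, b rune) bool {
	if a == b {
		return true
	}
	for r := unicode.SimpleFold(a); r != a; r = unicode.SimpleFold(r) {
		if r == b {
			return true
		}
	}
	return false
}

// matchAt tries to match needle at byte offset start and returns the end offset.
func matchAt(haystack string, start int, needle []rune) (int, bool) {
	pos := start
	for _, want := range needle {
		if pos >= len(haystack) {
			return 0, false
		}
		got, size := utf8.DecodeRuneInString(haystack[pos:])
		if !foldEqual(got, want) {
			return 0, false
		}
		pos += size
	}
	return pos, true
}

// indexAllFold is a hypothetical, naive stand-in for IndexAllIgnoreCase.
func indexAllFold(haystack, needle string, limit int) [][]int {
	if needle == "" || limit == 0 {
		return nil // assumption made for this sketch
	}
	needleRunes := []rune(needle)
	var locs [][]int
	for start := 0; start < len(haystack); {
		if end, ok := matchAt(haystack, start, needleRunes); ok {
			locs = append(locs, []int{start, end})
			if limit > 0 && len(locs) >= limit {
				break
			}
			start = end
			continue
		}
		_, size := utf8.DecodeRuneInString(haystack[start:])
		start += size // advance by one whole rune
	}
	return locs
}

func main() {
	haystack := "Secret ſecret SECRET"
	// Fold-aware scan: also matches the word starting with the long s.
	fmt.Println(len(indexAllFold(haystack, "secret", -1)))
	// Lowercasing both sides misses it, because ſ is already lowercase.
	fmt.Println(strings.Count(strings.ToLower(haystack), "secret"))
}
```

Note that the match beginning with ſ is one byte longer than the needle, which is why matches are reported as start and end offsets rather than a start plus the needle length.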

func IsEmpty ¶

func IsEmpty(s string) bool

IsEmpty reports whether s is empty.

func IsNotEmpty ¶

func IsNotEmpty(s string) bool

IsNotEmpty reports whether s is not empty.

func IsSpace ¶

func IsSpace(firstByte, nextByte byte) bool

IsSpace checks the given bytes, which MUST be UTF-8 encoded, for a space. The spaces detected are the same ones unicode.IsSpace recognises in the Latin-1 range: '\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP). N.B. only two bytes are required for these cases. If we decided to support spaces like '，' then we'll need more bytes.
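The reason two bytes are enough is that NEL and NBSP are encoded in UTF-8 as 0xC2 0x85 and 0xC2 0xA0, while the other spaces are single ASCII bytes. A minimal sketch of such a check follows; isSpaceUTF8 is a hypothetical name and the real function may be structured differently.

```go
package main

import "fmt"

// isSpaceUTF8 is a hypothetical stand-in for IsSpace. firstByte is the byte
// under inspection and nextByte is the byte that follows it in the input.
func isSpaceUTF8(firstByte, nextByte byte) bool {
	switch {
	case firstByte >= '\t' && firstByte <= '\r': // \t \n \v \f \r
		return true
	case firstByte == ' ':
		return true
	case firstByte == 0xC2 && (nextByte == 0x85 || nextByte == 0xA0): // NEL, NBSP
		return true
	}
	return false
}

func main() {
	s := "a\u00a0b" // 'a', then NBSP encoded as 0xC2 0xA0, then 'b'
	fmt.Println(isSpaceUTF8(s[1], s[2])) // the NBSP byte pair
	fmt.Println(isSpaceUTF8(s[0], s[1])) // 'a' followed by 0xC2
}
```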

func PermuteCase ¶

func PermuteCase(input string) []string

PermuteCase, given a string, returns a slice containing all possible case permutations of that string, such that an input of foo will return foo Foo fOo FOo foO FoO fOO FOO. Note that very long inputs can produce an enormous number of results in the returned slice, or result in an overflow and return nothing.
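The growth is exponential: an input of n runes has 2^n permutations. The sketch below enumerates them with a bitmask, which happens to reproduce the order shown above for foo. The name permuteCase and the length guard of 20 runes are assumptions of this sketch, not behaviour taken from the package.

```go
package main

import (
	"fmt"
	"unicode"
)

// permuteCase is a hypothetical stand-in for PermuteCase. Bit i of mask
// decides whether rune i is upper-cased or lower-cased.
func permuteCase(input string) []string {
	runes := []rune(input)
	if len(runes) > 20 { // arbitrary guard chosen for this sketch
		return nil
	}
	total := 1 << len(runes) // 2^n permutations
	results := make([]string, 0, total)
	for mask := 0; mask < total; mask++ {
		out := make([]rune, len(runes))
		for i, r := range runes {
			if mask&(1<<i) != 0 {
				out[i] = unicode.ToUpper(r)
			} else {
				out[i] = unicode.ToLower(r)
			}
		}
		results = append(results, string(out))
	}
	return results
}

func main() {
	fmt.Println(permuteCase("foo"))
}
```

Runes without case variants, such as digits, produce duplicate entries in this sketch, so a real implementation would likely want to deduplicate.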

func PermuteCaseFolding ¶

func PermuteCaseFolding(input string) []string

PermuteCaseFolding, given a string, returns a slice containing all possible case permutations with characters being folded, such that S will return S, s and ſ.

func RemoveStringDuplicates ¶

func RemoveStringDuplicates(elements []string) []string

RemoveStringDuplicates is a simple helper method that removes duplicates from any given string slice and returns a duplicate-free string slice.

func ReplaceAtRuneIndex ¶

func ReplaceAtRuneIndex(in string, r rune, i int) string

func ReplaceStringAtByteIndex ¶

func ReplaceStringAtByteIndex(in string, replace string, start int, end int) string

func ReplaceStringAtByteIndexBatch ¶

func ReplaceStringAtByteIndexBatch(in string, args []string, locs [][]int) string

func StartOfRune ¶

func StartOfRune(b byte) bool

StartOfRune takes a byte and returns true if it is the start of a multibyte character or a single-byte character, and false otherwise.
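In UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so a byte starts a rune exactly when its top two bits are not 10. A minimal sketch of that test, and of using it to cut a string at a byte offset without splitting a character, is shown below. The names startOfRune and truncateAt are hypothetical; the standard library's utf8.RuneStart performs the same check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// startOfRune is a hypothetical stand-in for StartOfRune: it reports whether
// b is not a UTF-8 continuation byte (bit pattern 10xxxxxx).
func startOfRune(b byte) bool {
	return b&0xC0 != 0x80
}

// truncateAt cuts s at or before byte offset n without splitting a rune.
func truncateAt(s string, n int) string {
	if n >= len(s) {
		return s
	}
	if n < 0 {
		n = 0
	}
	for n > 0 && !startOfRune(s[n]) {
		n-- // back up to the start of the rune that offset n falls inside
	}
	return s[:n]
}

func main() {
	s := "héllo" // 'é' occupies two bytes
	fmt.Println(truncateAt(s, 2))
	// Cross-check the sketch against the standard library for every byte.
	for i := 0; i < 256; i++ {
		if startOfRune(byte(i)) != utf8.RuneStart(byte(i)) {
			fmt.Println("mismatch at", i)
		}
	}
}
```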

func ToCamel ¶

func ToCamel(s string) string

func ToTitle ¶

func ToTitle(s string) string

Types ¶

This section is empty.
