SortFold
SortFold is a Go package that enables sorting string slices in increasing
order using a case-insensitive comparison.
Sorting String Slices
The standard library's sort
package provides the ability to sort
slices of arbitrary types in increasing order, but when it comes to
strings, Go can be a little sensitive. For example:
(run):
package main
import (
"fmt"
"sort"
)
func main() {
data := []string{"A", "b", "c", "D"}
fmt.Println(sort.StringsAreSorted(data))
sort.Strings(data)
fmt.Println(data)
fmt.Println(sort.StringsAreSorted(data))
}
false
[A D b c]
true
While A b c D
is sorted if elements are compared in a case-insensitive
manner, the sort
package's string sorting is case-sensitive. Luckily,
the sort
package also makes it trivial to provide custom comparison
logic.
Sorting Sans Sensitivity
The sort
package defines functions that accept a custom comparator
function in order to implement custom sorting. The following example
illustrates how to use these functions to provide a case-insensitive
sort (run):
package main
import (
"fmt"
"sort"
"strings"
)
func main() {
data := []string{"A", "b", "c", "D"}
less := func(i, j int) bool {
return strings.ToLower(data[i]) < strings.ToLower(data[j])
}
fmt.Println(sort.SliceIsSorted(data, less))
}
true
The above example showcases the use of sort.SliceIsSorted
and the
comparator function less
to determine the slice data
is already
sorted in a case-insensitive manner. While this may seem like the end
of the story, the above implementation comes with a cost.
The Go sort.Sort
function is an
implementation of the quicksort
algorithm. This means at best a call to sort.Sort
is O(n*log(n))
complex
and at worst is O(n^2)
. In other words, in an ideal scenario the sort
operation will make n*log(n)
calls to Less
and in a worst case situation
Less
will be invoked n^2
times.
The example in the previous section uses
strings.ToLower
to standardize
the case of both operands prior to comparison. Because string
values in
Go are immutable, a function like ToLower
allocates new memory to hold
the result of the operation.
Thus, in a best-case scenario, 2(n*log(n))
new strings are allocated
when sorting slices of strings in a case-insensitive manner with
strings.ToLower
and up to 2(n^2)
new strings are created at worst.
Obviously this solution does not scale.
Comparing Case Folding
The reason strings.ToLower
(or
strings.ToUpper
) is required
in the first place is due to Go's lack of a function that uses
Unicode case-folding
to indicate if string a
is less than string b
. The Go standard library
includes the following functions for comparing strings:
This package introduces sortfold.CompareFold
, a function that compares
strings using Unicode case-folding like strings.EqualFold
and behaves
like strings.Compare
with respect to the return value:
package main
import (
"fmt"
"sort"
"strings"
"github.com/akutz/sortfold"
)
func main() {
data := []string{"A", "b", "c", "D"}
less := func(i, j int) bool {
return sortfold.CompareFold(data[i], data[j]) < 0
}
fmt.Println(sort.SliceIsSorted(data, less))
}
true
The introduction of sortfold.CompareFold
provides the ability to sort
slices of strings in case-insensitive manner that is highly performant.
The benchmark results below illustrate the difference between using two
implementations of sort.Interface
to sort string slices of varying
length. The benchmarks that have the prefix Benchmark_FoldedSort
use
sortfold.StringSlice
, which itself uses sortfold.CompareFold
. The
benchmarks with the prefix Benchmark_LCasedSort
convert strings to
lowercase prior to comparison:
Charted Data
Benchmark |
Chart |
2 Chars |
|
26 Chars |
|
542 Words |
|
54,200 Words |
|
542,000 Words |
|
Raw Data
$ go test -benchmem -run Bench -bench . -v
goos: darwin
goarch: amd64
pkg: github.com/akutz/sortfold
Benchmark_FoldedSort______2_Chars____LowerCase_Sorted-8 30000000 40.5 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort______2_Chars____LowerCase_Sorted-8 30000000 49.3 ns/op 0 B/op 0 allocs/op
Benchmark_FoldedSort______2_Chars____LowerCase_Shuffled-8 30000000 46.4 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort______2_Chars____LowerCase_Shuffled-8 20000000 59.7 ns/op 0 B/op 0 allocs/op
Benchmark_FoldedSort______2_Chars____MixedCase_Sorted-8 50000000 37.9 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort______2_Chars____MixedCase_Sorted-8 10000000 114 ns/op 8 B/op 2 allocs/op
Benchmark_FoldedSort______2_Chars__1_MixedCase_Shuffled-8 50000000 39.4 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort______2_Chars__1_MixedCase_Shuffled-8 10000000 113 ns/op 8 B/op 2 allocs/op
Benchmark_FoldedSort_____26_Chars____LowerCase_Sorted-8 500000 2889 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_____26_Chars____LowerCase_Sorted-8 500000 2584 ns/op 0 B/op 0 allocs/op
Benchmark_FoldedSort_____26_Chars____LowerCase_Shuffled-8 500000 2803 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_____26_Chars____LowerCase_Shuffled-8 500000 4112 ns/op 0 B/op 0 allocs/op
Benchmark_FoldedSort_____26_Chars__1_MixedCase_Sorted-8 500000 2694 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_____26_Chars__1_MixedCase_Sorted-8 500000 2677 ns/op 40 B/op 10 allocs/op
Benchmark_FoldedSort_____26_Chars__1_MixedCase_Shuffled-8 500000 2821 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_____26_Chars__1_MixedCase_Shuffled-8 300000 5272 ns/op 72 B/op 18 allocs/op
Benchmark_FoldedSort_____26_Chars__5_MixedCase_Sorted-8 1000000 1647 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_____26_Chars__5_MixedCase_Sorted-8 500000 3448 ns/op 160 B/op 40 allocs/op
Benchmark_FoldedSort_____26_Chars__5_MixedCase_Shuffled-8 1000000 2594 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_____26_Chars__5_MixedCase_Shuffled-8 300000 5918 ns/op 280 B/op 70 allocs/op
Benchmark_FoldedSort_____26_Chars_10_MixedCase_Sorted-8 1000000 1496 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_____26_Chars_10_MixedCase_Sorted-8 300000 4417 ns/op 328 B/op 82 allocs/op
Benchmark_FoldedSort_____26_Chars_10_MixedCase_Shuffled-8 1000000 2113 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_____26_Chars_10_MixedCase_Shuffled-8 200000 7425 ns/op 536 B/op 134 allocs/op
Benchmark_FoldedSort____542_Words____MixedCase_Sorted-8 10000 149475 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort____542_Words____MixedCase_Sorted-8 3000 477477 ns/op 20400 B/op 2046 allocs/op
Benchmark_FoldedSort____542_Words____MixedCase_Shuffled-8 10000 195502 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort____542_Words____MixedCase_Shuffled-8 3000 542718 ns/op 21856 B/op 2208 allocs/op
Benchmark_FoldedSort__54200_Words____MixedCase_Shuffled-8 100 16854892 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort__54200_Words____MixedCase_Shuffled-8 30 53410004 ns/op 2307312 B/op 216820 allocs/op
Benchmark_FoldedSort_542000_Words____MixedCase_Shuffled-8 10 176222982 ns/op 0 B/op 0 allocs/op
Benchmark_LCasedSort_542000_Words____MixedCase_Shuffled-8 2 545716283 ns/op 30080968 B/op 3180838 allocs/op
PASS
ok github.com/akutz/sortfold 93.355s