external

package module
v0.0.0-...-e9c89f9 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 10, 2024 License: MIT Imports: 13 Imported by: 0

README

external go library

External is a library that takes an almost arbitrary iter.Seq[T] using Go1.23's range over func API and returns a sorted iter.Seq[T]. The difference from sort.Slice is that it can be used for data sets that are too large to fit into memory. external.Sort splits the massive input into small chunks, sorts each one using sort.Slice, writes them to temporary files, and merges them to return the final sorted iter.Seq[T]. By using json.Encoder for writing to temporary files and json.Decoder for reading back, it is possible to handle arbitrary iter.Seq[T]. In other words, types that can be encoded/decoded with the standard json package can be sorted with external.Sort/SortFunc. Simple integers or strings meet this condition, and any struct with public fields containing integers, strings, or slices thereof will also meet this condition without modifications. Structures with floating-point numbers and private fields may lose information during writing/reading. To avoid this, implement json.Marshaler/json.Unmarshaler.

Usage Example

import (
	"github.com/ajiyoshi-vg/external"
	"github.com/ajiyoshi-vg/external/scan"
)

func sort(r io.Reader) error {
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()
	// Sort scanned strings line by line
	// Write sorted lines to standard output
	for x := range external.Sort(scan.Lines(r)) {
		fmt.Fprintln(out, x)
	}
	return nil
}

Performance

Here are the results of a benchmark sorting 20 million lines of UUIDs. On a MacBook Pro 2023 with an Apple M3, it was faster than the sort command without any options. I assume the sort command uses a similar strategy to external.Sort, so I don't know why this is the case. Since the sort command does not write out in JSON, ultimately, the sort command should be faster. It might be achieved by adjusting the options. It seems like the sort command is not using multiple CPUs, but according to the man page, the default value for the parallel option is the same as the number of CPUs.

$ make bench sort_command
time cat bench.dat | go run cmd/sort/main.go | go run cmd/sorted/main.go

real    0m10.192s
user    0m20.503s
sys     0m3.466s
sort --version
2.3-Apple (165.100.8)
time cat bench.dat | sort | go run cmd/sorted/main.go

real    1m51.083s
user    1m49.487s
sys     0m1.324s

License

MIT

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Compare

func Compare[T constraints.Ordered](a, b T) int

func Merge

func Merge[T any](a, b <-chan T, cmp func(T, T) int) <-chan T

func Sort

func Sort[T constraints.Ordered](seq iter.Seq[T], opt ...Option) iter.Seq[T]

func SortFunc

func SortFunc[T any](seq iter.Seq[T], cmp func(T, T) int, opt ...Option) iter.Seq[T]

Types

type Chunk

type Chunk[T any] struct {
	// contains filtered or unexported fields
}

func NewChunk

func NewChunk[T any](data []T) *Chunk[T]

func (*Chunk[T]) Clean

func (x *Chunk[T]) Clean() error

func (*Chunk[T]) Length

func (x *Chunk[T]) Length() int

func (*Chunk[T]) Restore

func (x *Chunk[T]) Restore() (iter.Seq[T], error)

func (*Chunk[T]) Store

func (x *Chunk[T]) Store() error

type Chunks

type Chunks[T any] struct {
	// contains filtered or unexported fields
}

func NewChunks

func NewChunks[T any](chunk []*Chunk[T]) *Chunks[T]

func (*Chunks[T]) Clean

func (x *Chunks[T]) Clean() error

func (*Chunks[T]) Iters

func (x *Chunks[T]) Iters() ([]iter.Seq[T], error)

func (*Chunks[T]) Length

func (x *Chunks[T]) Length() int

type Merger

type Merger[T any] struct {
	// contains filtered or unexported fields
}

func NewMerger

func NewMerger[T any](cmp func(T, T) int) *Merger[T]

func (*Merger[T]) Merge

func (m *Merger[T]) Merge(xs []iter.Seq[T]) iter.Seq[T]

type Option

type Option func(*option)

func ChunkSize

func ChunkSize(size int) Option

func Limit

func Limit(n int) Option

type Sorter

type Sorter[T any] struct {
	// contains filtered or unexported fields
}

func New

func New[T any](cmp func(T, T) int, opt ...Option) *Sorter[T]

func (*Sorter[T]) Err

func (s *Sorter[T]) Err() error

func (*Sorter[T]) Sort

func (x *Sorter[T]) Sort(seq iter.Seq[T]) iter.Seq[T]

type Splitter

type Splitter[T any] struct {
	// contains filtered or unexported fields
}

func NewSplitter

func NewSplitter[T any](cmp func(T, T) int, opt ...Option) *Splitter[T]

func (*Splitter[T]) Split

func (s *Splitter[T]) Split(seq iter.Seq[T]) (*Chunks[T], error)

Directories

Path Synopsis
cmd
gen

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL