go-sortfile

Version: v0.0.1-rc20230124
Published: Jan 24, 2023 License: MIT

go-sortfile is a simple Go library for sorting large files.

If the file fits in the available memory, it is sorted in-memory. If the file is larger than the available memory, the library performs an external sort using a K-way merge.
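To illustrate the merge step of an external sort, here is a minimal sketch of a K-way merge over chunks that are already sorted. It is not taken from the library's source: the names `item`, `minHeap`, and `mergeSorted` are hypothetical, and the chunks are kept as in-memory slices for brevity, whereas a real external sort would stream each chunk from a temporary file.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is the current head line of one sorted chunk (hypothetical type).
type item struct {
	line  string // the line at the head of the chunk
	chunk int    // index of the chunk the line came from
	pos   int    // position of the line inside that chunk
}

// minHeap orders items with a caller-supplied comparison function.
type minHeap struct {
	items  []item
	isLess func(a, b string) bool
}

func (h *minHeap) Len() int           { return len(h.items) }
func (h *minHeap) Less(i, j int) bool { return h.isLess(h.items[i].line, h.items[j].line) }
func (h *minHeap) Swap(i, j int)      { h.items[i], h.items[j] = h.items[j], h.items[i] }
func (h *minHeap) Push(x any)         { h.items = append(h.items, x.(item)) }
func (h *minHeap) Pop() any {
	last := h.items[len(h.items)-1]
	h.items = h.items[:len(h.items)-1]
	return last
}

// mergeSorted performs a K-way merge over chunks that are each already
// sorted according to isLess. The heap always holds at most one line per chunk.
func mergeSorted(chunks [][]string, isLess func(a, b string) bool) []string {
	h := &minHeap{isLess: isLess}
	total := 0
	for i, c := range chunks {
		total += len(c)
		if len(c) > 0 {
			h.items = append(h.items, item{line: c[0], chunk: i, pos: 0})
		}
	}
	heap.Init(h)

	out := make([]string, 0, total)
	for h.Len() > 0 {
		// Take the smallest head line, then refill from the same chunk.
		top := heap.Pop(h).(item)
		out = append(out, top.line)
		if next := top.pos + 1; next < len(chunks[top.chunk]) {
			heap.Push(h, item{line: chunks[top.chunk][next], chunk: top.chunk, pos: next})
		}
	}
	return out
}

func main() {
	chunks := [][]string{
		{"Alice", "Dave", "Zoe"},
		{"Bob", "Eve"},
		{"Carol"},
	}
	merged := mergeSorted(chunks, func(a, b string) bool { return a < b })
	fmt.Println(merged) // [Alice Bob Carol Dave Eve Zoe]
}
```

Because the heap holds only one line per chunk, the memory needed for merging grows with the number of chunks rather than with the total file size.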

Usage

go get "github.com/KEINOS/go-sortfile"

Example

import (
    "fmt"
    "log"
    "os"
    "path/filepath"

    "github.com/KEINOS/go-sortfile/sortfile"
)

func ExampleFromPath() {
    // Input and output file paths
    pathFileIn := filepath.Join("path", "to", "large_file.txt")
    pathFileOut := filepath.Join("path", "to", "large_file-sorted.txt")

    // Let the library auto detect the best way to sort the file.
    forceExternalSort := false // false ==> auto detect sort method

    // Sort the file, either in-memory or with an external sort.
    //
    // If the 3rd argument is false, the library auto-detects the best way
    // to sort the file (in-memory or external sorting). If it is true, the
    // library always uses external sort.
    err := sortfile.FromPath(pathFileIn, pathFileOut, forceExternalSort)
    if err != nil {
        log.Fatal(err)
    }

    // Print the result
    data, err := os.ReadFile(pathFileOut)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(string(data))
    // Output:
    // Alice
    // Bob
    // Carol
    // Charlie
    // Dave
    // Ellen
    // Eve
    // Frank
    // Isaac
    // Ivan
    // Justin
    // Mallet
    // Mallory
    // Marvin
    // Matilda
    // Oscar
    // Pat
    // Peggy
    // Steve
    // Trent
    // Trudy
    // Victor
    // Walter
    // Zoe
}

func ExampleFromPathFunc() {
    exitOnError := func(err error) {
        if err != nil {
            log.Fatal(err)
        }
    }

    // Input and output file paths
    pathFileIn := filepath.Join("testdata", "sorted_chunks", "input_shuffled.txt")
    pathFileOut := filepath.Join(os.TempDir(), "pkg-sortfile_example_from_path.txt")

    // Clean up the output file after the test
    defer func() {
        exitOnError(os.Remove(pathFileOut))
    }()

    // User-defined sort function. If isLess is nil, the default sort function
    // is used, which makes the call equivalent to `sortfile.FromPath`.
    isLess := func(a, b string) bool {
        return a > b // reverse sort
    }

    // Let the library auto detect the best way to sort the file.
    forceExternalSort := false // auto detect

    // Sort file in-memory or external sort.
    err := sortfile.FromPathFunc(pathFileIn, pathFileOut, forceExternalSort, isLess)
    exitOnError(err)

    // Print the result
    data, err := os.ReadFile(pathFileOut)
    exitOnError(err)

    fmt.Println(string(data))
    // Output:
    // Zoe
    // Walter
    // Victor
    // Trudy
    // Trent
    // Steve
    // Peggy
    // Pat
    // Oscar
    // Matilda
    // Marvin
    // Mallory
    // Mallet
    // Justin
    // Ivan
    // Isaac
    // Frank
    // Eve
    // Ellen
    // Dave
    // Charlie
    // Carol
    // Bob
    // Alice
}
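
The `isLess` argument above has the signature `func(a, b string) bool`, so other orderings can be plugged in the same way. The sketch below shows two hypothetical comparators, `caseInsensitiveLess` and `numericLess`. It deliberately does not call the library: the helper `sortedCopy` uses the standard `sort` package to demonstrate the ordering, and it assumes lines arrive without a trailing newline, which the README does not state.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// caseInsensitiveLess orders lines while ignoring letter case.
func caseInsensitiveLess(a, b string) bool {
	return strings.ToLower(a) < strings.ToLower(b)
}

// numericLess orders integer lines by value. Lines that are not integers
// sort after all integer lines and are compared as plain strings, which
// keeps the ordering consistent for mixed input.
func numericLess(a, b string) bool {
	na, errA := strconv.Atoi(strings.TrimSpace(a))
	nb, errB := strconv.Atoi(strings.TrimSpace(b))
	switch {
	case errA == nil && errB == nil:
		return na < nb
	case errA == nil:
		return true // a is a number, b is not
	case errB == nil:
		return false // b is a number, a is not
	default:
		return a < b
	}
}

// sortedCopy is a stand-in for the sort call, used here only to show
// what each comparator does.
func sortedCopy(lines []string, isLess func(a, b string) bool) []string {
	out := append([]string(nil), lines...)
	sort.SliceStable(out, func(i, j int) bool { return isLess(out[i], out[j]) })
	return out
}

func main() {
	fmt.Println(sortedCopy([]string{"bob", "Alice", "carol"}, caseInsensitiveLess)) // [Alice bob carol]
	fmt.Println(sortedCopy([]string{"10", "9", "x", "100"}, numericLess))           // [9 10 100 x]
}
```

A comparator should define a consistent ordering (if a sorts before b and b before c, then a sorts before c); otherwise the merged chunks of an external sort may come out in an unexpected order.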
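
Speed

Even this simple implementation is much faster than the standard sort command on Linux/Unix: on about 1 GB of randomly shuffled data it finished in roughly 43 seconds, versus about 5 and a half minutes for sort. We believe it can be improved further.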

$ # Around 1 GB of randomly shuffled data
$ ls -lah shuffled_huge.txt
-rw-r--r--  1 keinos  staff   985M  1 12 22:29 shuffled_huge.txt

$ # Ordinary sort command of linux/unix
$ time sort shuffled_huge.txt -o out_sort.txt
real    5m35.706s
user    11m52.320s
sys     0m40.690s

$ # Our sortfile command
$ time sortfile shuffled_huge.txt out_sortfile.txt
real    0m43.294s
user    0m36.283s
sys     0m5.751s

$ # Compare the result (no diff)
$ diff out_sort.txt out_sortfile.txt
$

Contribute

  • Pull requests
    • Branch to PR: main
    • Any contribution toward a better, faster, stronger implementation is welcome!
  • Issues
    • Bug/vulnerability report: Please attach a simple reproducible example or a link to a reference. It will help us a lot to fix the issue faster.
    • Feature request: Please describe the feature you want to add and its use case. That said, we recommend submitting the feature as a PR, since PRs are given higher priority.
  • Help wanted

Directories

Path      Synopsis
cmd       Package sortfile provides functions to sort a file.
chunk     Package chunk is a chunk file manager.
datasize  Package datasize defines the type InBytes which represents a size in bytes.
inmemory  Package inmemory provides sorting algorithms for in-memory data.