chunk

package
v0.0.1-rc20230124
Published: Jan 24, 2023 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package chunk is a chunk file manager. It defines several types of readers and writers for the split chunks of data.

Constants

This section is empty.

Variables

var OsOpen = os.Open

OsOpen is a copy of os.Open(), exposed as a dependency injection point to ease testing.
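
A minimal test sketch, not part of the package's own examples, showing how the variable might be swapped out in a test. The test name and the stubbed error are hypothetical, and it assumes FileSplit() opens its input through chunk.OsOpen.

package chunk_test

import (
	"errors"
	"os"
	"testing"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func TestForcedOpenFailure(t *testing.T) {
	// Keep the original function and restore it after the test.
	oldOsOpen := chunk.OsOpen
	defer func() { chunk.OsOpen = oldOsOpen }()

	// Stub the file opener so it always fails.
	chunk.OsOpen = func(name string) (*os.File, error) {
		return nil, errors.New("forced error")
	}

	// Assumption: FileSplit opens its input via chunk.OsOpen, so the forced
	// error should surface here.
	if _, err := chunk.FileSplit("dummy.txt", 32, nil); err == nil {
		t.Fatal("expected an error but got nil")
	}
}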

Functions

func Chunker

func Chunker(inFile io.Reader, sizeFileIn datasize.InBytes, sizeChunk datasize.InBytes, isLess func(string, string) bool) ([]string, error)

Chunker is a file chunker. It returns a list of paths to the temporary chunked files. These files are sorted by lines using the given isLess function. If isLess is nil, the default is used.

It is similar to FileSplit() but takes a file pointer instead of a file path. To chunk a file via its path, use FileSplit().
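
A rough, unverified sketch of calling Chunker with an in-memory reader (Chunker has no example of its own on this page). The input text and chunk size are arbitrary.

package main

import (
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
	"github.com/KEINOS/go-sortfile/sortfile/datasize"
)

func main() {
	input := "Carol\nAlice\nBob\nDave\n"

	// Split the in-memory data into chunk files of up to 12 bytes each,
	// sorting each chunk with the default isLess function (nil).
	chunks, err := chunk.Chunker(
		strings.NewReader(input),
		datasize.InBytes(len(input)), // total size of the input data
		datasize.InBytes(12),         // max size of each chunk file
		nil,
	)
	if err != nil {
		log.Fatal(err)
	}

	// Clean up the temporary chunk files.
	defer func() {
		for _, pathChunk := range chunks {
			_ = os.Remove(pathChunk)
		}
	}()

	fmt.Println("number of chunk files:", len(chunks))
}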

func FileSplit

func FileSplit(pathFileIn string, sizeChunk datasize.InBytes, isLess func(string, string) bool) (data []string, err error)

FileSplit is a file chunker. It returns a list of paths to the temporary chunked files. These files are sorted by lines using the given isLess function. If isLess is nil, the default is used.

It is similar to Chunker() but takes a file path instead of a file pointer. To chunk a file via a file pointer, use Chunker().

Example
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	// Shuffled lines as a test file.
	pathFileIn := filepath.Join("..", "testdata", "sorted_chunks", "input_shuffled.txt")

	// Divide the lines of the file into 32-byte chunks, each stored in a temporary file.
	// Each chunk file is also sorted by lines using the default isLess function (nil).
	chunks, err := chunk.FileSplit(pathFileIn, 32, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Clean up the temporary chunk files after the test.
	defer func() {
		for _, chunk := range chunks {
			_ = os.Remove(chunk)
		}
	}()

	// Print the contents of each chunk file.
	for index, chunk := range chunks {
		fmt.Printf("Chunk file #%d\n", index+1)

		data, err := os.ReadFile(chunk)
		if err != nil {
			log.Fatal(err)
		}

		fmt.Println(string(data))
	}
}
Output:

Chunk file #1
Bob
Carol
Eve
Peggy
Walter

Chunk file #2
Mallet
Mallory
Matilda
Zoe

Chunk file #3
Dave
Ellen
Frank
Ivan
Marvin

Chunk file #4
Alice
Pat
Steve
Victor

func IsLess

func IsLess(a, b string) bool

IsLess returns true if a is less than b.
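
A quick sketch of the default comparison, assuming plain lexicographic string order:

package main

import (
	"fmt"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	fmt.Println(chunk.IsLess("Alice", "Bob")) // expected: true
	fmt.Println(chunk.IsLess("Bob", "Alice")) // expected: false
}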

Types

type FileReader

type FileReader struct {
	// contains filtered or unexported fields
}

FileReader is a line reader for the chunk file.

It aims to provide a simple interface for reading a chunk file line by line during K-way merge sort.

Example
package main

import (
	"errors"
	"fmt"
	"io"
	"log"
	"path/filepath"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	pathFileTarget := filepath.Join("..", "testdata", "small_chunk.txt")

	fReader, err := chunk.NewFileReader(pathFileTarget)
	if err != nil {
		log.Fatal(err)
	}

	defer fReader.Close()

	for {
		// Read the next line from the chunk file.
		// In K-way merge sort, this is called once the current line has been used.
		if err := fReader.NextLine(); err != nil {
			if errors.Is(err, io.EOF) {
				fmt.Println("EOF reached")

				break
			}

			log.Fatal(err)
		}

		// Print the current line twice.
		// Note that the returned line is the same both times and does not
		// include the trailing line break.
		fmt.Printf("1st call: %#v\n", fReader.CurrentLine())
		fmt.Printf("2nd call: %#v\n", fReader.CurrentLine())
	}
}
Output:

1st call: "foo line"
2nd call: "foo line"
1st call: "bar line"
2nd call: "bar line"
1st call: "baz line"
2nd call: "baz line"
EOF reached

func NewFileReader

func NewFileReader(path string) (*FileReader, error)

NewFileReader returns a new FileReader object.

It opens the file at the given path but does not read the first line. The caller should call NextLine() to read the first line, right after deferring FileReader.Close(). See example_test.go for the actual use case.

func NewIOReader

func NewIOReader(reader io.Reader) *FileReader

NewIOReader returns a new FileReader object.

It is similar to NewFileReader() but it takes an io.Reader instead of a file path.
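
A minimal, unverified sketch of reading lines from an in-memory io.Reader via NewIOReader. Close() is skipped here on the assumption that no underlying file is involved when the reader is not backed by a file.

package main

import (
	"errors"
	"fmt"
	"io"
	"log"
	"strings"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	// Any io.Reader works as input; here an in-memory reader is used.
	fReader := chunk.NewIOReader(strings.NewReader("foo line\nbar line\n"))

	for {
		if err := fReader.NextLine(); err != nil {
			if errors.Is(err, io.EOF) {
				break
			}

			log.Fatal(err)
		}

		fmt.Println(fReader.CurrentLine())
	}
}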

func (*FileReader) Close

func (f *FileReader) Close() error

Close closes the current chunk file.

It is the caller's responsibility to close the file. Use defer to close the file.

func (*FileReader) CurrentLine

func (f *FileReader) CurrentLine() string

CurrentLine returns the line currently read from the file.

It will return the same line until NextLine() is called. If the line is used or selected for merge sort, the caller should call NextLine() to move to the next line.

func (*FileReader) IsEOF

func (f *FileReader) IsEOF() bool

IsEOF returns true if the end of the file is reached.

func (*FileReader) NextLine

func (f *FileReader) NextLine() error

NextLine reads the next line from the file and sets it as the line returned by CurrentLine().

Once it reaches the end of the file, it returns an io.EOF error.

type FileWriter

type FileWriter struct {
	// contains filtered or unexported fields
}

FileWriter is a buffered wrapper around a file writer.

Lines passed to it are buffered up to a certain size before being written to the file. See the WriteLine method for more information.

The data is written in the given order; if the data needs to be sorted or otherwise processed first, use a Lines object instead.

func NewFileWriter

func NewFileWriter(pathFileOut string, maxSizeBuf datasize.InBytes) (*FileWriter, error)

NewFileWriter returns a new FileWriter object.

The maxSizeBuf is the maximum size the buffer may reach before it is flushed to the file. It is similar to NewIOWriter() but takes a file path instead of a file pointer.

Example
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
	"github.com/KEINOS/go-sortfile/sortfile/datasize"
)

func main() {
	// Prepare a temp file to write to.
	pathFileTemp := filepath.Join(os.TempDir(), "ExampleFileWriter.txt")
	// Clean up the temp file after the test.
	defer func() {
		if err := os.Remove(pathFileTemp); err != nil {
			log.Fatal(err)
		}
	}()

	// The max size of the buffer. The buffer is flushed to the file each time
	// it reaches 128 bytes.
	sizeBufMax := datasize.InBytes(128)

	// Instantiate a new file writer.
	fWriter, err := chunk.NewFileWriter(pathFileTemp, sizeBufMax)
	if err != nil {
		log.Fatal(err)
	}

	defer fWriter.Close()

	for _, line := range []string{
		"智恵子は東京に空が無いといふ、",
		"ほんとの空が見たいといふ。",
		"私は驚いて空を見る。",
		"桜若葉の間に在るのは、",
		"切つても切れない",
		"むかしなじみのきれいな空だ。",
		"どんよりけむる地平のぼかしは",
		"うすもも色の朝のしめりだ。",
		"智恵子は遠くを見ながら言ふ。",
		"阿多多羅山あたたらやまの山の上に",
		"毎日出てゐる青い空が",
		"智恵子のほんとの空だといふ。",
		"あどけない空の話である。",
	} {
		// Let the writer write the line whether it buffers or not.
		if _, err := fWriter.WriteLine(line); err != nil {
			log.Fatal(err)
		}
	}

	// Flush the remaining buffer to the file.
	if err := fWriter.Done(); err != nil {
		log.Fatal(err)
	}

	dataWritten, err := os.ReadFile(pathFileTemp)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(dataWritten))
}
Output:

智恵子は東京に空が無いといふ、
ほんとの空が見たいといふ。
私は驚いて空を見る。
桜若葉の間に在るのは、
切つても切れない
むかしなじみのきれいな空だ。
どんよりけむる地平のぼかしは
うすもも色の朝のしめりだ。
智恵子は遠くを見ながら言ふ。
阿多多羅山あたたらやまの山の上に
毎日出てゐる青い空が
智恵子のほんとの空だといふ。
あどけない空の話である。

func NewIOWriter

func NewIOWriter(ptrFileOut io.Writer, maxSizeBuf datasize.InBytes) *FileWriter

NewIOWriter returns a new FileWriter object.

The maxSizeBuf is the maximum size the buffer may reach before it is flushed to the file. It is similar to NewFileWriter() but takes a file pointer instead of a path.
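
A minimal, unverified sketch of NewIOWriter writing to an in-memory bytes.Buffer instead of a file. Close() is not called here on the assumption that Done() is enough to flush when no real file is involved.

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
	"github.com/KEINOS/go-sortfile/sortfile/datasize"
)

func main() {
	var buff bytes.Buffer

	// Buffer up to 64 bytes before each flush to the underlying writer.
	fWriter := chunk.NewIOWriter(&buff, datasize.InBytes(64))

	for _, line := range []string{"foo line", "bar line", "baz line"} {
		if _, err := fWriter.WriteLine(line); err != nil {
			log.Fatal(err)
		}
	}

	// Flush the remaining buffer to the writer.
	if err := fWriter.Done(); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%#v\n", buff.String())
}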

func (*FileWriter) Close

func (fw *FileWriter) Close() error

func (*FileWriter) Done

func (fw *FileWriter) Done() error

Done flushes the remaining buffer to the file.

func (*FileWriter) WriteLine

func (fw *FileWriter) WriteLine(line string) (int, error)

WriteLine writes the line to the file adding a line break at the end.

It will buffer the line until it reaches the max size of the buffer, then flushes the buffer to the file.

type Lines

type Lines struct {
	// IsLess is the function to compare two strings during chunk file creation.
	// This function must be the same as the one to be used for merge-sorting.
	IsLess func(a, b string) bool
	// contains filtered or unexported fields
}

Lines holds a slice of string as a chunk of data with sort and save functionality.

This is a helper object to create a temporary file with sorted lines for the K-way merge sort process. To simply write a huge number of lines to a file, use a FileWriter object instead.

Example
package main

import (
	"bytes"
	"fmt"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	const maxBytes = 32 // Max size of the chunk in bytes.

	// Instantiate a new chunk to be written to the output.
	chunk := chunk.Lines{}

	for _, line := range []string{
		"foo line",
		"bar line",
		"baz line",
		"hoge line", // this will overflow the maxSize so will not be appended.
	} {
		// Append only if appending the line would not make the chunk exceed
		// maxBytes, checking the size before appending.
		if !chunk.WillOverSize(line, maxBytes) {
			// It will append the line with a line break at the end of the line.
			chunk.AppendLine(line)
			// Print current size of the chunk to be written to the output.
			fmt.Println("Size:", chunk.Size())
		}
	}

	// Recalculate the size of the chunk.
	fmt.Println("Size cached:", chunk.Size(), "Size caltulate:", chunk.SizeRaw())

	// Lines method returns the lines in the chunk (a slice of string) as is.
	fmt.Printf("%#v\n", chunk.Lines()[0])
	fmt.Printf("%#v\n", chunk.Lines()[1])
	fmt.Printf("%#v\n", chunk.Lines()[2])

	// Dump the chunk sorted by the lines to the io.Writer.
	var buff bytes.Buffer

	chunk.WriteSortedLines(&buff)
	fmt.Printf("Written: %#v\n", buff.String())

}
Output:

Size: 9
Size: 18
Size: 27
Size cached: 27 Size calculated: 27
"foo line\n"
"bar line\n"
"baz line\n"
Written: "bar line\nbaz line\nfoo line\n"
Example (Custom_sort)
package main

import (
	"bytes"
	"fmt"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	const maxBytes = 32 // Max size of the chunk in bytes.

	// User-defined sort function, which is the reverse of the default sort.
	isLess := func(a, b string) bool {
		return a > b
	}

	// Instantiate a new chunk to be written to the output.
	chunk := chunk.Lines{}

	// Assign the custom sort function.
	chunk.IsLess = isLess

	// The loop below will use the "alice", "bob", and "david" lines for the chunk.
	for _, line := range []string{
		"alice line",
		"bob line",
		"charlie line", // this will overflow the maxSize so will not be appended.
		"david line",
	} {
		// Append only if appending the line would not make the chunk exceed
		// maxBytes, checking the size before appending.
		if !chunk.WillOverSize(line, maxBytes) {
			// It will append the line with a line break at the end of the line.
			chunk.AppendLine(line)
			// Print current size of the chunk to be written to the output.
			fmt.Printf("'%s' appended. Current chunk size: %v\n", line, chunk.Size())
		} else {
			fmt.Printf("Skip: '%s' will overflow. Current chunk size: %v\n", line, chunk.Size())
		}
	}

	// Recalculate the size of the chunk.
	fmt.Println("Size cached:", chunk.Size(), "Size caltulate:", chunk.SizeRaw())

	// Lines method returns the lines in the chunk (a slice of string) as is.
	fmt.Printf("%#v\n", chunk.Lines()[0])
	fmt.Printf("%#v\n", chunk.Lines()[1])
	fmt.Printf("%#v\n", chunk.Lines()[2])

	// Dump the chunk sorted by the lines to the io.Writer.
	var buff bytes.Buffer

	chunk.WriteSortedLines(&buff)
	fmt.Printf("Written: %#v\n", buff.String())

}
Output:

'alice line' appended. Current chunk size: 11
'bob line' appended. Current chunk size: 20
Skip: 'charlie line' will overflow. Current chunk size: 20
'david line' appended. Current chunk size: 31
Size cached: 31 Size calculated: 31
"alice line\n"
"bob line\n"
"david line\n"
Written: "david line\nbob line\nalice line\n"

func NewLines

func NewLines() Lines

NewLines returns a new Lines object.

By default it uses the standard slice sort to sort the lines. Set IsLess to use a custom function to compare two strings while sorting.
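
A minimal sketch of constructing a Lines object via NewLines and assigning a custom comparison before sorting; the reverse-order function here is only an illustration.

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	lines := chunk.NewLines()

	// Optional: override the default sort with a custom comparison.
	lines.IsLess = func(a, b string) bool { return a > b }

	for _, line := range []string{"foo line", "bar line", "baz line"} {
		lines.AppendLine(line)
	}

	// Write the sorted lines to any io.Writer.
	var buff bytes.Buffer

	if err := lines.WriteSortedLines(&buff); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%#v\n", buff.String())
}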

func (*Lines) AppendLine

func (l *Lines) AppendLine(line string)

AppendLine appends the given line to the chunk.

func (*Lines) Dump

func (l *Lines) Dump() (string, error)

Dump sorts and writes the lines in the chunk to a temporary file and returns the path to the file.
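
Dump has no example on this page; the following unverified sketch appends a few lines, dumps them to a temporary file, and reads the file back. Removing the temporary file is assumed to be the caller's responsibility, in line with the other examples.

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	lines := chunk.Lines{}

	lines.AppendLine("foo line")
	lines.AppendLine("bar line")

	// Sort the lines and write them to a temporary file.
	pathFileTemp, err := lines.Dump()
	if err != nil {
		log.Fatal(err)
	}

	// Clean up the temporary file.
	defer func() {
		_ = os.Remove(pathFileTemp)
	}()

	data, err := os.ReadFile(pathFileTemp)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%#v\n", string(data))
}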

func (*Lines) Lines

func (l *Lines) Lines() []string

Lines returns the lines in the chunk (a slice of string) as is.

func (*Lines) Size

func (l *Lines) Size() int

Size returns the byte size of the chunk to be written to the output.

func (*Lines) SizeRaw

func (l *Lines) SizeRaw() int

SizeRaw is similar to the Size method but it calculates the size from scratch and updates the cached value. Thus, it is slower than Size().

It must return the same value as the Size() method by design. It is for debugging purposes only.

func (Lines) UniformLineBreak

func (l Lines) UniformLineBreak(line string) string

UniformLineBreak returns the given line with a uniform line break at the end.
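
A small sketch of the helper. Exactly which line-break character it normalizes to is not documented here, so the output is printed rather than asserted.

package main

import (
	"fmt"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
)

func main() {
	lines := chunk.Lines{}

	// Both calls presumably return the line ending with the same uniform
	// line break.
	fmt.Printf("%#v\n", lines.UniformLineBreak("foo line"))
	fmt.Printf("%#v\n", lines.UniformLineBreak("foo line\n"))
}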

func (*Lines) WillOverSize

func (l *Lines) WillOverSize(line string, sizeMax int) bool

WillOverSize returns true if appending the given line would make the chunk exceed sizeMax, the size limit.

func (*Lines) WriteSortedLines

func (l *Lines) WriteSortedLines(output io.Writer) error

WriteSortedLines writes the sorted lines in the chunk to the given output.

type MergeSorter

type MergeSorter struct {

	// IsLess is the function to compare two strings during merge-sorting. This
	// function must be the same as the one used to sort the chunk files.
	IsLess func(a, b string) bool
	// contains filtered or unexported fields
}

MergeSorter merge-sorts the sorted chunk files.

Note that each chunk file must be sorted.

Example

This example shows how to use MergeSorter to merge-sort the sorted chunk files.

To generate the sorted chunk files, use the Lines object.

package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	"github.com/KEINOS/go-sortfile/sortfile/chunk"
	"github.com/KEINOS/go-sortfile/sortfile/datasize"
)

func main() {
	// Prepare the sorted chunk files
	chunks := make([]*chunk.FileReader, 3)

	for index, pathFile := range []string{
		filepath.Join("..", "testdata", "sorted_chunks", "chunk1_sorted.txt"),
		filepath.Join("..", "testdata", "sorted_chunks", "chunk2_sorted.txt"),
		filepath.Join("..", "testdata", "sorted_chunks", "chunk3_sorted.txt"),
	} {
		// Create the chunk file
		reader, err := chunk.NewFileReader(pathFile)
		if err != nil {
			log.Fatal(err)
		}

		defer func(fReader *chunk.FileReader) {
			if err := fReader.Close(); err != nil {
				log.Fatal(err)
			}
		}(reader)

		chunks[index] = reader
	}

	// Prepare the output file
	pathFileOut := filepath.Join(os.TempDir(), "example_merge_sorter.out.txt")

	defer func() {
		_ = os.Remove(pathFileOut) // clean up the output file after the test
	}()

	sizeBufMax := datasize.InBytes(128)

	fWriter, err := chunk.NewFileWriter(pathFileOut, sizeBufMax)
	if err != nil {
		log.Fatal(err)
	}

	// Create the merge sorter
	mergeSorter := chunk.NewMergeSorter(chunks, fWriter)

	// Sort and write to the output file
	if err := mergeSorter.Sort(); err != nil {
		log.Fatal(err)
	}

	// Close the output file
	if err := fWriter.Close(); err != nil {
		log.Fatal(err)
	}

	// Print the sorted output file
	outData, err := os.ReadFile(pathFileOut)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(outData))
}
Output:

Alice
Bob
Carol
Charlie
Dave
Ellen
Eve
Frank
Isaac
Ivan
Justin
Mallet
Mallory
Marvin
Matilda
Oscar
Pat
Peggy
Steve
Trent
Trudy
Victor
Walter
Zoe

func NewMergeSorter

func NewMergeSorter(inFiles []*FileReader, outFile *FileWriter) *MergeSorter

NewMergeSorter returns a new MergeSorter object. The inFiles must be a slice of FileReader objects and each file must be sorted.

func (*MergeSorter) Sort

func (ms *MergeSorter) Sort() error

Sort merge-sorts the chunk files and writes the result to the output file.
