gosync

v0.0.0-...-d9b3aeb
Published: Aug 8, 2020 License: MIT Imports: 14 Imported by: 6

README

Go-Sync

The Command-line tool has moved!

In order to split issues between the library and the CLI tool, as well as correctly vendor dependencies, the command-line tool code has been moved to its own repository: https://github.com/Redundancy/gosync-cmd

Why not use a Zsync mechanism?

Consider if a binary differential sync mechanism is appropriate to your use case:

The ZSync mechanism has the weakness that HTTP 1.1 ranged requests are not always well supported by CDN providers and ISP proxies. When problems occur, they are very difficult (if not impossible) to handle correctly in software. Using HTTP 1.0 and fully completed GET requests would be better, if possible.
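To make the failure mode concrete, here is a minimal sketch (not part of gosync) of a ranged GET using only the standard library. The helper names fetchRange and demo are hypothetical. A server or proxy that ignores the Range header typically answers 200 with the full body instead of 206 Partial Content, so a client has to check the status code rather than trust the body length.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// fetchRange requests bytes [start, end] (inclusive) of url and reports
// whether the server honoured the range with a 206 Partial Content reply.
func fetchRange(url string, start, end int64) (body []byte, partial bool, err error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, false, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()

	body, err = io.ReadAll(resp.Body)
	return body, resp.StatusCode == http.StatusPartialContent, err
}

// demo serves a fixed string from a local test server and fetches one
// 4-byte block (bytes 4 to 7) from it.
func demo() (string, bool, error) {
	const reference = "The quick brown fox jumped over the lazy dog"
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// http.ServeContent implements Range handling for us.
		http.ServeContent(w, r, "", time.Time{}, strings.NewReader(reference))
	}))
	defer srv.Close()

	body, partial, err := fetchRange(srv.URL, 4, 7)
	return string(body), partial, err
}

func main() {
	body, partial, err := demo()
	fmt.Printf("body=%q partial=%v err=%v\n", body, partial, err)
}
```

If partial comes back false, the safest reaction is probably to treat the response as a full download rather than as a block.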

There are some other issues too - ZSync doesn't (as far as I'm aware) solve any issues to do with the storage of files, which can get more and more onerous for large files that are not changing much from one version to another.

On a project I worked on, we switched instead to storing the individual files that made up a larger build (like an ISO) by filename and hash, and maintaining an index of which files comprised the full build. By doing this, we significantly decreased the required storage (new files were only stored when they changed), allowed multiple versions to sit efficiently side by side, and made very simple file serving efficient (with a tiny library to resolve and fetch files).
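The approach described above can be sketched roughly as follows. This is an illustration only: the Manifest type, the helper names and the choice of SHA-256 are assumptions, not details of the original project.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Manifest is a hypothetical build index: it maps a file name within a
// build to the hex SHA-256 of that file's content.
type Manifest map[string]string

// buildManifest hashes every file of a build.
func buildManifest(files map[string][]byte) Manifest {
	m := make(Manifest, len(files))
	for name, data := range files {
		sum := sha256.Sum256(data)
		m[name] = hex.EncodeToString(sum[:])
	}
	return m
}

// missingObjects returns the sorted, de-duplicated hashes referenced by next
// that are not yet in the store. Only these need to be uploaded or fetched.
func missingObjects(next Manifest, stored map[string]bool) []string {
	seen := map[string]bool{}
	var missing []string
	for _, h := range next {
		if !stored[h] && !seen[h] {
			seen[h] = true
			missing = append(missing, h)
		}
	}
	sort.Strings(missing)
	return missing
}

// newObjectCount builds two versions that differ in one file and reports
// how many new objects the second version adds to the store.
func newObjectCount() int {
	v1 := buildManifest(map[string][]byte{
		"boot.img": []byte("bootloader v1"),
		"data.bin": []byte("payload v1"),
	})
	v2 := buildManifest(map[string][]byte{
		"boot.img": []byte("bootloader v1"), // unchanged between versions
		"data.bin": []byte("payload v2"),    // changed
	})
	stored := map[string]bool{}
	for _, h := range v1 {
		stored[h] = true
	}
	return len(missingObjects(v2, stored))
}

func main() {
	fmt.Println("new objects needed for v2:", newObjectCount())
}
```

Because objects are addressed by hash, both versions share the unchanged file and only the changed one costs extra storage.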

The GoSync library

gosync is a library inspired by zsync and rsync. Here are the goals:

Fast

Using the concurrency and performance features of Golang, Go-sync is designed to take advantage of multiple processors and multiple HTTP connections to make the most of modern hardware and minimize the impact of the bandwidth latency product.

Cross Platform

Works on Windows and Linux, without cygwin or fuss.

Easy

A new high-level interface designed to reduce the work of implementing block transfer in your application:

fs := &BasicSummary{...}

rsync, err := MakeRSync(
    localFilename,
    referencePath,
    outFilename,
    fs,
)

if err != nil {
    return err
}

err = rsync.Patch()

if err != nil {
    return err
}

return rsync.Close()

Extensible

All functionality is based on interfaces, allowing customization of behavior:

// Here, the input version is a local string
inputFile := bytes.NewReader(localVersionAsBytes)

// And the output is a buffer
patchedFile := bytes.NewBuffer(nil)

// This information is meta-data on the file that should be loaded / provided
// You can also provide your own implementation of the FileSummary interface
summary := &BasicSummary{
    ChecksumIndex:  referenceFileIndex,
    // Disable verification of hashes for downloaded data (not really a good idea!)
    ChecksumLookup: nil,
    BlockCount:     uint(blockCount),
    BlockSize:      blockSize,
    FileSize:       int64(len(referenceAsBytes)),
}

rsync := &RSync{
    Input:  inputFile,
    Output: patchedFile,
    // An in-memory block source
    Source: blocksources.NewReadSeekerBlockSource(
        bytes.NewReader(referenceAsBytes),
        blocksources.MakeNullFixedSizeResolver(uint64(blockSize)),
    ),
    Summary: summary,
    OnClose: nil,
}

Reuse low level objects to build a new high level library, or implement a new lower-level object to add a new transfer protocol (for example).

Tested

GoSync has been built from the ground up with unit tests. The GoSync command-line tool has acceptance tests, although not everything is covered.

HOWEVER this library has never been used in production against real-world network problems, and I cannot personally guarantee that it will work as intended.

Current State

The GoSync library is fairly well unit-tested, but not tested through exposure to real-world network conditions. As an example, the HTTP client used is a default HTTP client, and is therefore lacking decent timeouts. As such, I would not recommend depending on the code in production unless you're willing to validate the results and debug issues like that.
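As an illustration of the timeout concern, the sketch below configures a standard library http.Client with explicit limits. The specific durations are arbitrary assumptions, and whether a custom client can be passed into gosync's HTTP block source is not shown in this document, so treat this purely as an example of what "decent timeouts" might look like.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newTimeoutClient returns an HTTP client with explicit limits, unlike
// http.DefaultClient, which has no overall timeout.
func newTimeoutClient() *http.Client {
	return &http.Client{
		// Upper bound for the whole request, including reading the body.
		Timeout: 60 * time.Second,
		Transport: &http.Transport{
			// Limit how long connecting may take.
			DialContext: (&net.Dialer{Timeout: 5 * time.Second}).DialContext,
			// Limit the TLS handshake for https sources.
			TLSHandshakeTimeout: 5 * time.Second,
			// Limit the wait for response headers after the request is sent.
			ResponseHeaderTimeout: 10 * time.Second,
		},
	}
}

func main() {
	fmt.Println("overall timeout:", newTimeoutClient().Timeout)
}
```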

In terms of activity, I have been extremely busy with other things for the last few months, and will continue to be. I do not expect to put a huge amount more work into this, since we solved our problem in a simpler (and significantly more elegant) way explained in a section above.

Request for Enhancement

If the library or tool are still something that you feel would be useful, here are some issues and ideas for work that could be done.

GZip support - Performance / Efficiency Enhancement (!)

In order to be more efficient in the transfer of data from the source to the client, gosync should support compressed blocks. This requires changing any assumptions about the offset of a block, and the length of a block to read (especially when merging block ranges), then adding a compression / decompression call to the interfaces.

In terms of the CLI tool, this probably means that gosync should build a version of the source file where each block is independently compressed and store the block-sizes in the index. It can then rebuild the offsets incrementally.
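A rough sketch of that idea, using only the standard library: each block is gzipped independently, the compressed sizes are recorded, and block offsets are rebuilt as a running sum of those sizes. None of these helper names exist in gosync; they are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compressBlocks gzips each blockSize chunk of data independently and
// returns the concatenated stream plus the compressed size of every block.
func compressBlocks(data []byte, blockSize int) ([]byte, []int, error) {
	var out bytes.Buffer
	var sizes []int
	for start := 0; start < len(data); start += blockSize {
		end := start + blockSize
		if end > len(data) {
			end = len(data)
		}
		before := out.Len()
		zw := gzip.NewWriter(&out)
		if _, err := zw.Write(data[start:end]); err != nil {
			return nil, nil, err
		}
		if err := zw.Close(); err != nil {
			return nil, nil, err
		}
		sizes = append(sizes, out.Len()-before)
	}
	return out.Bytes(), sizes, nil
}

// offsets rebuilds each block's start offset incrementally from the sizes.
func offsets(sizes []int) []int {
	offs := make([]int, len(sizes))
	total := 0
	for i, s := range sizes {
		offs[i] = total
		total += s
	}
	return offs
}

// readBlock decompresses block i out of the concatenated stream.
func readBlock(stream []byte, sizes []int, i int) ([]byte, error) {
	if i < 0 || i >= len(sizes) {
		return nil, fmt.Errorf("block %d out of range", i)
	}
	off := offsets(sizes)[i]
	zr, err := gzip.NewReader(bytes.NewReader(stream[off : off+sizes[i]]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// blockAt compresses text block by block and reads back block i.
func blockAt(text string, blockSize, i int) (string, error) {
	stream, sizes, err := compressBlocks([]byte(text), blockSize)
	if err != nil {
		return "", err
	}
	b, err := readBlock(stream, sizes, i)
	return string(b), err
}

func main() {
	s, err := blockAt("The quick brown fox jumped over the lazy dog", 4, 1)
	fmt.Printf("block 1 = %q, err = %v\n", s, err)
}
```

With blocks this small the gzip header overhead outweighs any saving; the sketch only shows the bookkeeping, not a sensible block size.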

Patch payloads - Feature

Given a known original version, and a known desired state, it would be possible to create a "patch", which has enough information to store the required blocks for the transformation only, and only enough of the index to validate that it's transforming the correct file.

Patched file Validation - Feature (!)

GoSync should validate the full MD5 and length of a file after it is done with patching it. This should be minimally expensive, and help increase confidence that GoSync has produced the correct result.

This one is pretty simple. :)
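A minimal sketch of such a check, assuming the expected MD5 and length are available from the index. verifyPatched and check are hypothetical names, not gosync functions.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"fmt"
	"io"
	"strings"
)

// verifyPatched streams the patched output through MD5 and compares both
// the byte count and the digest with the expected values.
func verifyPatched(r io.Reader, wantSize int64, wantMD5 []byte) error {
	h := md5.New()
	n, err := io.Copy(h, r)
	if err != nil {
		return err
	}
	if n != wantSize {
		return fmt.Errorf("size mismatch: got %d bytes, want %d", n, wantSize)
	}
	if got := h.Sum(nil); !bytes.Equal(got, wantMD5) {
		return fmt.Errorf("MD5 mismatch: got %x, want %x", got, wantMD5)
	}
	return nil
}

// check verifies patched against the MD5 and length of reference.
func check(patched, reference string) error {
	want := md5.Sum([]byte(reference))
	return verifyPatched(strings.NewReader(patched), int64(len(reference)), want[:])
}

func main() {
	const reference = "The quick brown fox jumped over the lazy dog"
	fmt.Println("good patch:", check(reference, reference))
	fmt.Println("bad patch: ", check("The qwik brown fox jumped 0v3r the lazy", reference))
}
```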

Network Error handling - Improvement (!!)

The HTTP Blocksource does not handle connection / read timeouts or the myriad other possible network failures. Handling these correctly is important to making it robust and production-ready.

Rolled into this is correctly identifying resumable errors (including rate-limiting, try-again-later and temporary errors) and implementing back-off strategies.
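One possible shape for this, sketched with the standard library only. The errTemporary sentinel, the retry helper and the choice of which status codes count as resumable are all assumptions made for illustration; gosync does not provide them.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// errTemporary is a hypothetical marker for errors worth retrying.
var errTemporary = errors.New("temporary failure")

// retryableStatus treats rate-limiting and try-again-later replies as resumable.
func retryableStatus(code int) bool {
	return code == http.StatusTooManyRequests || code == http.StatusServiceUnavailable
}

// retry runs op up to attempts times, doubling the delay after each
// temporary failure. Any other error is returned immediately.
func retry(attempts int, base time.Duration, sleep func(time.Duration), op func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if !errors.Is(err, errTemporary) {
			return err // not resumable
		}
		if i < attempts-1 {
			sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// simulate fails the given number of times before succeeding and reports
// how many calls were made. It uses a no-op sleep so it returns instantly.
func simulate(failures int) (calls int, err error) {
	err = retry(5, 100*time.Millisecond, func(time.Duration) {}, func() error {
		calls++
		if calls <= failures {
			return errTemporary
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulate(2)
	fmt.Println("calls:", calls, "err:", err)
	fmt.Println("503 retryable:", retryableStatus(http.StatusServiceUnavailable))
}
```

In real use, sleep would be time.Sleep, and the delay would usually get some random jitter and an upper bound.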

Rate limiting - Feature

In order to be a good network denizen, GoSync should be able to support rate-limiting.
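As a sketch of what this could look like, the reader below paces reads by sleeping in proportion to the bytes it returns. It is a deliberately simple pacing scheme, not a token bucket, and pacedReader is a hypothetical type rather than anything in gosync. The sleep function is injectable so the example can run without real delays.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"time"
)

// pacedReader limits throughput to roughly bytesPerSec by sleeping after
// every read for the time those bytes "should" have taken.
type pacedReader struct {
	r           io.Reader
	bytesPerSec int
	sleep       func(time.Duration) // time.Sleep in real use
}

func (p *pacedReader) Read(buf []byte) (int, error) {
	if p.bytesPerSec <= 0 {
		return p.r.Read(buf) // no limit configured
	}
	// Never hand out more than one second's worth of data per call.
	if len(buf) > p.bytesPerSec {
		buf = buf[:p.bytesPerSec]
	}
	n, err := p.r.Read(buf)
	if n > 0 {
		p.sleep(time.Duration(n) * time.Second / time.Duration(p.bytesPerSec))
	}
	return n, err
}

// pacedDuration reads total bytes through a pacedReader with a fake sleep
// and returns how long the reader would have slept in total.
func pacedDuration(total, bytesPerSec int) time.Duration {
	var slept time.Duration
	r := &pacedReader{
		r:           strings.NewReader(strings.Repeat("x", total)),
		bytesPerSec: bytesPerSec,
		sleep:       func(d time.Duration) { slept += d },
	}
	io.Copy(io.Discard, r)
	return slept
}

func main() {
	fmt.Println("1000 bytes at 500 B/s would take:", pacedDuration(1000, 500))
}
```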

Better / Consistent naming - Improvement

The current naming of some packages and concepts is a bit broken. The RSync object, for example, has nothing to do with RSync. Blocks and Chunks are used interchangeably for a byte range.

Testing

All tests are run by Travis-CI

Unit tests
go test github.com/Redundancy/go-sync/...

Documentation

Overview

Package gosync is inspired by zsync, and rsync. It aims to take the fundamentals and create a very flexible library that can be adapted to work in many ways.

We rely heavily on built in Go abstractions like io.Reader, hash.Hash and our own interfaces - this makes the code easier to change, and to test. In particular, no part of the core library should know anything about the transport or layout of the reference data. If you want to do rsync and do http/https range requests, that's just as good as zsync client-server over an SSH tunnel. The goal is also to allow support for multiple concurrent connections, so that you can make the best use of your line in the face of the bandwidth latency product (or other concerns that require concurrency to solve).

The following optimizations are possible:
* Generate hashes with multiple threads (both during reference generation and local file interrogation)
* Multiple ranged requests (can even be used to get the hashes)
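The multi-threaded hashing idea can be sketched with a small worker pool. This is not gosync's generator; hashBlocks is a hypothetical helper, and it only computes the strong (MD5) checksum per block, assuming blockSize is positive.

```go
package main

import (
	"crypto/md5"
	"fmt"
	"sync"
)

// hashBlocks computes the MD5 of every blockSize chunk of data using the
// given number of worker goroutines. Results are stored by block index, so
// the output order does not depend on scheduling.
func hashBlocks(data []byte, blockSize, workers int) [][md5.Size]byte {
	if workers < 1 {
		workers = 1
	}
	n := (len(data) + blockSize - 1) / blockSize // round up for a short last block
	sums := make([][md5.Size]byte, n)
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				start := i * blockSize
				end := start + blockSize
				if end > len(data) {
					end = len(data)
				}
				sums[i] = md5.Sum(data[start:end]) // each worker writes a distinct index
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return sums
}

// sameAsSequential checks that the concurrent result matches one worker.
func sameAsSequential(text string, blockSize, workers int) bool {
	a := hashBlocks([]byte(text), blockSize, workers)
	b := hashBlocks([]byte(text), blockSize, 1)
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	const reference = "The quick brown fox jumped over the lazy dog"
	fmt.Println("blocks:", len(hashBlocks([]byte(reference), 4, 3)))
	fmt.Println("matches sequential:", sameAsSequential(reference, 4, 3))
}
```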

Example
// due to short example strings, use a very small block size
// using one this small in practice would increase your file transfer!
const blockSize = 4

// This is the "file" as described by the authoritative version
const reference = "The quick brown fox jumped over the lazy dog"

// This is what we have locally. Not too far off, but not correct.
const localVersion = "The qwik brown fox jumped 0v3r the lazy"

generator := filechecksum.NewFileChecksumGenerator(blockSize)
_, referenceFileIndex, _, err := indexbuilder.BuildIndexFromString(
	generator,
	reference,
)

if err != nil {
	return
}

referenceAsBytes := []byte(reference)
localVersionAsBytes := []byte(localVersion)

blockCount := len(referenceAsBytes) / blockSize
if len(referenceAsBytes)%blockSize != 0 {
	blockCount++
}

inputFile := bytes.NewReader(localVersionAsBytes)
patchedFile := bytes.NewBuffer(nil)

// This is more complicated than usual, because we're using in-memory
// "files" and sources. Normally you would use MakeRSync
summary := &BasicSummary{
	ChecksumIndex:  referenceFileIndex,
	ChecksumLookup: nil,
	BlockCount:     uint(blockCount),
	BlockSize:      blockSize,
	FileSize:       int64(len(referenceAsBytes)),
}

rsync := &RSync{
	Input:  inputFile,
	Output: patchedFile,
	Source: blocksources.NewReadSeekerBlockSource(
		bytes.NewReader(referenceAsBytes),
		blocksources.MakeNullFixedSizeResolver(uint64(blockSize)),
	),
	Summary: summary,
	OnClose: nil,
}

if err := rsync.Patch(); err != nil {
	fmt.Printf("Error: %v", err)
	return
}

fmt.Printf("Patched result: \"%s\"\n", patchedFile.Bytes())
Output:

Patched result: "The quick brown fox jumped over the lazy dog"
Example (HttpBlockSource)

This is exceedingly similar to the module Example, but uses the http blocksource and a local http server

package main

import (
	"bytes"
	"crypto/md5"
	"fmt"
	"net"
	"net/http"
	"time"

	"github.com/Redundancy/go-sync/blocksources"
	"github.com/Redundancy/go-sync/comparer"
	"github.com/Redundancy/go-sync/filechecksum"
	"github.com/Redundancy/go-sync/indexbuilder"
	"github.com/Redundancy/go-sync/patcher"
)

// due to short example strings, use a very small block size
// using one this small in practice would increase your file transfer!
const BLOCK_SIZE = 4

// This is the "file" as described by the authoritative version
const REFERENCE = "The quick brown fox jumped over the lazy dog"

// This is what we have locally. Not too far off, but not correct.
const LOCAL_VERSION = "The qwik brown fox jumped 0v3r the lazy"

var content = bytes.NewReader([]byte(REFERENCE))

func handler(w http.ResponseWriter, req *http.Request) {
	http.ServeContent(w, req, "", time.Now(), content)
}

// set up a http server locally that will respond predictably to ranged requests
func setupServer() <-chan int {
	var PORT = 8000
	s := http.NewServeMux()
	s.HandleFunc("/content", handler)

	portChan := make(chan int)

	go func() {
		var listener net.Listener
		var err error

		for {
			PORT++
			p := fmt.Sprintf(":%v", PORT)
			listener, err = net.Listen("tcp", p)

			if err == nil {
				break
			}
		}
		portChan <- PORT
		http.Serve(listener, s)
	}()

	return portChan
}

// This is exceedingly similar to the module Example, but uses the http blocksource and a local http server
func main() {
	PORT := <-setupServer()
	LOCAL_URL := fmt.Sprintf("http://localhost:%v/content", PORT)

	generator := filechecksum.NewFileChecksumGenerator(BLOCK_SIZE)
	_, referenceFileIndex, checksumLookup, err := indexbuilder.BuildIndexFromString(generator, REFERENCE)

	if err != nil {
		return
	}

	fileSize := int64(len([]byte(REFERENCE)))

	// This would normally be saved in a file

	blockCount := fileSize / BLOCK_SIZE
	if fileSize%BLOCK_SIZE != 0 {
		blockCount++
	}

	fs := &BasicSummary{
		ChecksumIndex:  referenceFileIndex,
		ChecksumLookup: checksumLookup,
		BlockCount:     uint(blockCount),
		BlockSize:      uint(BLOCK_SIZE),
		FileSize:       fileSize,
	}

	/*
		// Normally, this would be:
		rsync, err := MakeRSync(
			"toPatch.file",
			"http://localhost/content",
			"out.file",
			fs,
		)
	*/
	// Need to replace the output and the input
	inputFile := bytes.NewReader([]byte(LOCAL_VERSION))
	patchedFile := bytes.NewBuffer(nil)

	resolver := blocksources.MakeFileSizedBlockResolver(
		uint64(fs.GetBlockSize()),
		fs.GetFileSize(),
	)

	rsync := &RSync{
		Input:  inputFile,
		Output: patchedFile,
		Source: blocksources.NewHttpBlockSource(
			LOCAL_URL,
			1,
			resolver,
			&filechecksum.HashVerifier{
				Hash:                md5.New(),
				BlockSize:           fs.GetBlockSize(),
				BlockChecksumGetter: fs,
			},
		),
		Summary: fs,
		OnClose: nil,
	}

	err = rsync.Patch()

	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	err = rsync.Close()

	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Patched content: \"%v\"\n", patchedFile.String())

	// Just for inspection
	remoteReferenceSource := rsync.Source.(*blocksources.BlockSourceBase)
	fmt.Printf("Downloaded Bytes: %v\n", remoteReferenceSource.ReadBytes())

}

func ToPatcherFoundSpan(sl comparer.BlockSpanList, blockSize int64) []patcher.FoundBlockSpan {
	result := make([]patcher.FoundBlockSpan, len(sl))

	for i, v := range sl {
		result[i].StartBlock = v.StartBlock
		result[i].EndBlock = v.EndBlock
		result[i].MatchOffset = v.ComparisonStartOffset
		result[i].BlockSize = blockSize
	}

	return result
}

func ToPatcherMissingSpan(sl comparer.BlockSpanList, blockSize int64) []patcher.MissingBlockSpan {
	result := make([]patcher.MissingBlockSpan, len(sl))

	for i, v := range sl {
		result[i].StartBlock = v.StartBlock
		result[i].EndBlock = v.EndBlock
		result[i].BlockSize = blockSize
	}

	return result
}
Output:

Patched content: "The quick brown fox jumped over the lazy dog"
Downloaded Bytes: 16

Constants

This section is empty.

Variables

var (
	// DefaultConcurrency is the default concurrency level used by patching and downloading
	DefaultConcurrency = runtime.NumCPU()
)

Functions

func IsSameFile

func IsSameFile(path1, path2 string) (same bool, err error)

IsSameFile checks if two file paths are the same file
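The implementation of IsSameFile is not shown here. As an illustration of how such a check can be done with the standard library, the sketch below compares two paths with os.Stat and os.SameFile. The names sameFile and demo are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sameFile reports whether two paths refer to the same underlying file,
// even if the path strings differ.
func sameFile(path1, path2 string) (bool, error) {
	fi1, err := os.Stat(path1)
	if err != nil {
		return false, err
	}
	fi2, err := os.Stat(path2)
	if err != nil {
		return false, err
	}
	return os.SameFile(fi1, fi2), nil
}

// demo creates two temporary files and compares one of them with an
// alternative spelling of its own path, and with the other file.
func demo() (aliasSame, distinctSame bool, err error) {
	dir, err := os.MkdirTemp("", "samefile")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	a := filepath.Join(dir, "a.txt")
	b := filepath.Join(dir, "b.txt")
	for _, p := range []string{a, b} {
		if err := os.WriteFile(p, []byte("x"), 0o600); err != nil {
			return false, false, err
		}
	}

	// Same file, different path string.
	aliasSame, err = sameFile(a, dir+"/./a.txt")
	if err != nil {
		return false, false, err
	}
	// Different files with identical content.
	distinctSame, err = sameFile(a, b)
	return aliasSame, distinctSame, err
}

func main() {
	aliasSame, distinctSame, err := demo()
	fmt.Println(aliasSame, distinctSame, err)
}
```

A check like this matters when the input and output paths may point at the same file, since patching in place could overwrite blocks that are still needed.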

Types

type BasicSummary

type BasicSummary struct {
	BlockSize  uint
	BlockCount uint
	FileSize   int64
	*index.ChecksumIndex
	filechecksum.ChecksumLookup
}

BasicSummary implements a version of the FileSummary interface

func (*BasicSummary) GetBlockCount

func (fs *BasicSummary) GetBlockCount() uint

GetBlockCount gets the number of blocks

func (*BasicSummary) GetBlockSize

func (fs *BasicSummary) GetBlockSize() uint

GetBlockSize gets the size of each block

func (*BasicSummary) GetFileSize

func (fs *BasicSummary) GetFileSize() int64

GetFileSize gets the file size of the file

type FileSummary

type FileSummary interface {
	GetBlockSize() uint
	GetBlockCount() uint
	GetFileSize() int64
	FindWeakChecksum2(bytes []byte) interface{}
	FindStrongChecksum2(bytes []byte, weak interface{}) []chunks.ChunkChecksum
	GetStrongChecksumForBlock(blockID int) []byte
}

FileSummary combines many of the interfaces that are needed. It is expected that you might implement it by embedding existing structs.

type RSync

type RSync struct {
	Input  ReadSeekerAt
	Source patcher.BlockSource
	Output io.Writer

	Summary FileSummary

	OnClose []closer
}

RSync is an object designed to make the standard use-case for gosync as easy as possible.

To this end, it hides away many low level choices by default, and makes some assumptions.

func MakeRSync

func MakeRSync(
	InputFile,
	Source,
	OutFile string,
	Summary FileSummary,
) (r *RSync, err error)

MakeRSync creates an RSync object using string paths, inferring most of the configuration

func (*RSync) Close

func (rsync *RSync) Close() error

Close - close open files, copy to the final location from a temporary one if needed

func (*RSync) Patch

func (rsync *RSync) Patch() (err error)

Patch the files

type ReadSeekerAt

type ReadSeekerAt interface {
	io.ReadSeeker
	io.ReaderAt
}

ReadSeekerAt is the combination of ReadSeeker and ReaderAt interfaces
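Several standard library types already satisfy this combination. The sketch below redeclares the interface locally (so it compiles without importing gosync) and uses compile-time assertions to show that *bytes.Reader, *strings.Reader and *os.File all qualify. The peekAt helper is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"strings"
)

// ReadSeekerAt is redeclared here with the same definition as above.
type ReadSeekerAt interface {
	io.ReadSeeker
	io.ReaderAt
}

// Compile-time checks: these lines fail to build if a type does not
// implement the interface.
var (
	_ ReadSeekerAt = (*bytes.Reader)(nil)
	_ ReadSeekerAt = (*strings.Reader)(nil)
	_ ReadSeekerAt = (*os.File)(nil)
)

// peekAt reads n bytes at offset off using ReadAt, then reports the current
// Seek position to show that ReadAt did not move it.
func peekAt(r ReadSeekerAt, off int64, n int) (string, int64, error) {
	buf := make([]byte, n)
	if _, err := r.ReadAt(buf, off); err != nil {
		return "", 0, err
	}
	pos, err := r.Seek(0, io.SeekCurrent)
	return string(buf), pos, err
}

func demo() (string, int64, error) {
	return peekAt(strings.NewReader("The quick brown fox"), 4, 5)
}

func main() {
	s, pos, err := demo()
	fmt.Printf("read %q, position still %d, err %v\n", s, pos, err)
}
```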

Directories

Path                 Synopsis
chunks               Package chunks provides the basic structure for a pair of the weak and strong checksums.
cmd/gosync           gosync is a command-line implementation of the gosync package functionality, primarily as a demonstration of usage but supposed to be functional in itself.
comparer             package comparer is responsible for using a FileChecksumGenerator (filechecksum) and an index to move through a file and compare it to the index, producing a FileDiffSummary
filechecksum         package filechecksum provides the FileChecksumGenerator, whose main responsibility is to read a file, and generate both weak and strong checksums for every block.
index                Package index provides the functionality to describe a reference 'file' and its contents in terms of the weak and strong checksums, in such a way that you can check if a weak checksum is present, then check if there is a strong checksum that matches.
indexbuilder         Package indexbuilder provides a few shortcuts to building a checksum index by generating and then loading the checksums, and building an index from that.
patcher              Package patcher follows a pattern established by hash, which defines the interface in the top level package, and then provides implementations below it.
patcher/sequential   Sequential Patcher will stream the patched version of the file to output. Since it works strictly in order, it cannot patch the local file directly (it might overwrite a block needed later), so there would have to be a final copy once the patching was done.
rollsum              rollsum provides an implementation of a rolling checksum - a checksum that's efficient to advance a byte or more at a time.
util/readers         util/readers exists to provide convenient and composable io.Reader compatible streams to allow testing without having to check in large binary files.
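To illustrate what "efficient to advance a byte at a time" means for the rollsum entry above, here is a sketch of an rsync-style weak checksum. It is not the rollsum package's code, and the exact formula that package uses may differ. The point is that sliding the window by one byte costs a constant amount of work instead of re-summing the whole window.

```go
package main

import "fmt"

// rolling holds an rsync-style weak checksum over a fixed-size window:
// a is the sum of the bytes, b is the sum of the bytes weighted by their
// distance from the end of the window, both modulo 2^16.
type rolling struct {
	a, b uint16
	n    int
}

// newRolling computes the checksum of window from scratch.
func newRolling(window []byte) rolling {
	r := rolling{n: len(window)}
	for i, x := range window {
		r.a += uint16(x)
		r.b += uint16(len(window)-i) * uint16(x)
	}
	return r
}

// roll slides the window one byte: out leaves at the front, in enters at
// the back. uint16 arithmetic wraps, which gives the modulo for free.
func (r *rolling) roll(out, in byte) {
	r.a += uint16(in) - uint16(out)
	r.b += r.a - uint16(r.n)*uint16(out)
}

// sum packs both halves into a single 32-bit value.
func (r rolling) sum() uint32 {
	return uint32(r.b)<<16 | uint32(r.a)
}

// rollMatchesFresh slides an n-byte window across text and checks that the
// rolled checksum always equals one computed from scratch.
func rollMatchesFresh(text string, n int) bool {
	data := []byte(text)
	if n <= 0 || n > len(data) {
		return false
	}
	r := newRolling(data[:n])
	for i := 1; i+n <= len(data); i++ {
		r.roll(data[i-1], data[i+n-1])
		if r.sum() != newRolling(data[i:i+n]).sum() {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(rollMatchesFresh("The quick brown fox jumped over the lazy dog", 4))
}
```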
