gobox

module
v0.0.0-...-dac2ecc Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 24, 2021 License: Apache-2.0

README

gobox

tests

A file sync utility, similar to rsync, written in go.

Algorithm

gobox uses a simplified version of the rsync algorithm. All files are split into 1 KB chunks. A map of the md5 checksum of each transferred chunk is kept in memory on the client side in the form of a map[md5sum]ChunkID, where ChunkID is made up of a (FileID,ChunkNumber) tuple.

The checksum of each chunk is computed before being sent, and if it has already been transmitted, the ChunkID of the first occurrence is sent instead, without the contents of the chunk. The server will then read the previous chunk from disk and write that to the new file instead.

Flow

At startup, the client recursively walks the watched directory, and calls the SendFileInfo RPC method for each file and folder it encounters. This sends the file name, its type, size and modification time. The servers checks that against the files it has on disk, and if there's a mismatch, it replies with SendChunkIds=true.

The client will send the Chunks for all the files the server has requested. In the case of chunks whose md5hash was already known, just the metadata is sent (FileID, ChunkNumber), and the server retrieves the bytes from previously sent files.

Once all the initial files have been sent, the client calls InitialSyncComplete. This tells the server that any files that are on disk that haven't been sent by the client need to be deleted.

Next, the client uses the fsnotify library to monitor for file system events on the watched directory. This library was chosen because of its cross-platform support.

Any event in the watched directory will trigger a SendFileInfo method call, the rest of the flow being the same as before. The only difference is the DeleteFile method which will be called when files are deleted.

Protocol

gobox uses protobuf to communicate between the client and the server. It was chosen because it can easily handle binary data, and asynchronous streams.

The protobuf RPC methods are defined in proto/gobox.proto.

Testing

The only dependencies required to run the test are pytest and the go compiler. With those installed, running make test from the root directory should run the integration tests. In addition, the tests also run on every push on github actions.

Future improvements

  • A production-ready version of this utility would use mTLS to authenticate the server and the client using server and client certificates.

  • The rsync algorithm uses a rolling checksum which can identify identical blocks at any arbitrary position in the file, thus making it more efficient. For example, with the current implementation, adding 1 byte at the beginning of a large file would mean the entire file needs to be resent, since none of the 1 KB chunks would still match. A future version of this utility could possibly implement an algorithm more similar to the one used by rsync.

  • When transferring large files with non-repeating chunks (such as a several GB zip file), the 1 KB chunks are probably inefficient. Instead of having a constant chunk size, it should be able to be determined for each file individually, while keeping the 1 KB minimum chunk size.

    For example, chunkSize = max(1024, fileSize/100)

  • There are some edge-cases not covered by the current tests which would break syncing. Identical chunks are identified by the chunkID of the first occurrence. If the file containing the first occurrence were to be deleted, the server would have no way of writing that chunk again. This can be solved by creating a way for the server to request chunks it doesn't have the data for.

  • rsync also supports a --checksum mode where the checksum of the files are verified on both the client and server side. This could also be added to gobox, although the file size and modification times should be enough in the vast majority of cases to determine if a file needs to be uploaded.

  • In addition to the current integration tests, unit tests should be written where appropriate.

Directories

Path Synopsis
cmd
pkgs

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL