Overview
bk is a tool for backing things up--both raw data streams and directory
hierarchies. I wrote it because I wanted to have personal responsibility
for my data's integrity, up to and including being responsible for data
loss due to bugs in the backup system. You should probably use something
else to back up your data--bup is a great
choice.
That said, thanks to Google for letting me open source it.
Features
My goal was to implement the absolute minimum number of features necessary
for my needs; the idea was that a minimal feature set (and in turn, a
minimal number of lines of code) would reduce the probability of bugs (and
in turn, the probability of data corruption).
- Data de-duplication (using a rolling
hash)
- Compression (gzip)
- Optional encryption (using Go's AES
implementation).
- Data integrity (and corruption recovery) using Reed-Solomon encoding.
- Direct backups to cloud storage.
- Ability to access backups via FUSE.
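To illustrate how a rolling hash enables de-duplication, here is a minimal Go sketch of content-defined chunking. It is not bk's code: the polynomial hash, the constants `window`, `prime`, and `mask`, the helper names, and the use of SHA-256 to compare chunks are all assumptions made for this example.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// All three constants are arbitrary choices for this sketch.
const (
	window = 48        // number of trailing bytes the rolling hash covers
	prime  = 16777619  // odd multiplier for the polynomial hash
	mask   = 1<<11 - 1 // split when the low 11 bits are all ones (~2 KiB chunks)
)

// powMod32 returns base^n with uint32 wraparound.
func powMod32(base uint32, n int) uint32 {
	r := uint32(1)
	for i := 0; i < n; i++ {
		r *= base
	}
	return r
}

// split cuts data at positions where the hash of the last `window` bytes
// matches the mask, so boundaries depend on content rather than offsets.
func split(data []byte) [][]byte {
	outFactor := powMod32(prime, window)
	var chunks [][]byte
	var h uint32
	start := 0
	for i, b := range data {
		h = h*prime + uint32(b) // bring the new byte into the window
		if i >= window {
			h -= outFactor * uint32(data[i-window]) // drop the oldest byte
		}
		if i+1-start >= window && h&mask == mask {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

// countShared returns how many chunks of b have the same SHA-256 as some chunk of a.
func countShared(a, b [][]byte) int {
	seen := make(map[[32]byte]bool)
	for _, c := range a {
		seen[sha256.Sum256(c)] = true
	}
	n := 0
	for _, c := range b {
		if seen[sha256.Sum256(c)] {
			n++
		}
	}
	return n
}

// pseudoRandom produces deterministic test data from a simple LCG.
func pseudoRandom(n int) []byte {
	out := make([]byte, n)
	state := uint32(1)
	for i := range out {
		state = state*1664525 + 1013904223
		out[i] = byte(state >> 24)
	}
	return out
}

func main() {
	data := pseudoRandom(1 << 16)
	before := split(data)
	after := split(append([]byte("new header bytes"), data...))
	fmt.Printf("%d chunks before, %d after, %d shared\n",
		len(before), len(after), countShared(before, after))
}
```

Because each boundary depends only on the preceding `window` bytes, inserting data near the start of a stream shifts the chunks but leaves most of them byte-identical, so they hash to the same keys and need not be stored again.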
Usage
Set your BK_DIR environment variable either to a local directory or to a
Google Cloud Storage bucket name of the form "gs://somebucketname".
To set up a backup repository, run:
% bk init
It's assumed that the target directory exists but is empty. To back up to
Google Cloud Storage:
% env BK_GCS_PROJECT_ID=myproject-1234 bk init
For an encrypted repository:
% env BK_PASSPHRASE=yolo bk init --encrypt
Though don't do it like that, since you don't want your passphrase in your
shell command history.
To back up a directory hierarchy (e.g., your home directory):
% bk backup home ~
(BK_PASSPHRASE must be set if the repository is encrypted.)
Here, the backup is named "home". bk adds the current date and time to
the name of the backup; all available backups can be listed with "bk
list".
Backups can be referred to via their full name and time as provided by "bk
list"--e.g. "home@20170413104506". If just the base backup name is given
("home"), the most recent backup with that base name is used.
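The name-resolution rule above can be sketched in Go as follows. This is an illustration only, not bk's implementation: the `resolve` function and its signature are hypothetical, and it assumes every full name has the form "base@YYYYMMDDhhmmss" as in the example above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resolve maps a user-supplied name to a full backup name.
// A name containing "@" must match exactly; a bare base name
// selects the most recent backup with that base.
func resolve(name string, all []string) (string, bool) {
	if strings.Contains(name, "@") {
		for _, b := range all {
			if b == name {
				return b, true
			}
		}
		return "", false
	}
	var matches []string
	prefix := name + "@"
	for _, b := range all {
		if strings.HasPrefix(b, prefix) {
			matches = append(matches, b)
		}
	}
	if len(matches) == 0 {
		return "", false
	}
	// Fixed-width timestamps sort lexically in chronological order.
	sort.Strings(matches)
	return matches[len(matches)-1], true
}

func main() {
	all := []string{
		"home@20170413104506",
		"photos@20170101000000",
		"home@20170501120000",
	}
	full, ok := resolve("home", all)
	fmt.Println(full, ok)
}
```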
Incremental backups can be performed using the --base argument; the
following uses the most recent backup from the set named "home" as the
baseline.
% bk backup --base home home ~
Note that incremental backups only make backups run faster (by not scanning
the contents of every file); there is no space benefit, since bk applies
low-level deduplication to the data it stores.
To restore from a backup:
% bk restore home /tmp/restored
To mount all backups as a FUSE directory (if you have FUSE installed):
% bk mount /mnt
The resulting hierarchy has the structure
"backup_name/year/month/day/hhmmss".
Run "bk help" for more information and additional commands.
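As a sketch of how a full backup name relates to the mounted hierarchy described above, the Go function below turns "home@20170413104506" into a "backup_name/year/month/day/hhmmss" path. The function name `mountPath` is hypothetical, and the zero-padded month and day are an assumption; bk's actual FUSE code may differ.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// stampLayout is Go's reference-time layout for YYYYMMDDhhmmss.
const stampLayout = "20060102150405"

// mountPath converts "name@YYYYMMDDhhmmss" into "name/YYYY/MM/DD/hhmmss".
func mountPath(full string) (string, error) {
	base, stamp, ok := strings.Cut(full, "@")
	if !ok || base == "" {
		return "", fmt.Errorf("%q is not of the form name@timestamp", full)
	}
	t, err := time.Parse(stampLayout, stamp)
	if err != nil {
		return "", err
	}
	return base + "/" + t.Format("2006/01/02/150405"), nil
}

func main() {
	p, err := mountPath("home@20170413104506")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // home/2017/04/13/104506
}
```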
Influences
- Venti: A New Approach to Archival
Storage,
Sean Quinlan and Sean Dorward. Hash-based archival storage, from the
Plan 9 project.
- A Low-bandwidth Network File
System, Athicha
Muthitacharoen, Benjie Chen, and David Mazieres: rolling hashes to break
up bitstreams.
- bup: rolling hashes, hash-based archival
storage, all wrapped up in git packfiles. bk's rolling hash code comes
from bup.
- Foundation:
hash-based archival storage, revisiting some of Venti's design decisions,
showed that rolling hashes (versus block-based archiving) weren't a big
win.
In general, bup and Foundation both go through some effort to provide
efficient access to hash-addressed data without loading an entire index
that goes from hashes to storage locations into memory. For my use of
bk, the indices are a few hundred MB, so they're just all loaded at
startup time. Note that this isn't an ideal approach when using cloud
storage; something along the lines of Foundation's approach (or keeping a
local cache of the index) would probably be better.
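The "load everything at startup" approach amounts to keeping one in-memory map from hashes to storage locations. The Go sketch below shows the idea under stated assumptions: the `location` fields, the `index` type, and its methods are hypothetical, and SHA-256 stands in for the hash so the example needs only the standard library.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// location says where a blob lives; these fields are illustrative only.
type location struct {
	pack   string // name of the pack file holding the blob
	offset int64  // byte offset of the blob within the pack
	length int64  // size of the blob in bytes
}

// index maps a 32-byte hash to the blob's storage location.
type index map[[32]byte]location

// keyFor hashes a blob into its index key.
func keyFor(blob []byte) [32]byte {
	return sha256.Sum256(blob)
}

// add records a blob's location and reports whether the blob was new.
// A false result means the data is already stored and can be skipped.
func (ix index) add(blob []byte, pack string, offset int64) bool {
	key := keyFor(blob)
	if _, ok := ix[key]; ok {
		return false
	}
	ix[key] = location{pack, offset, int64(len(blob))}
	return true
}

// lookup finds the storage location for a hash.
func (ix index) lookup(key [32]byte) (location, bool) {
	loc, ok := ix[key]
	return loc, ok
}

func main() {
	ix := index{}
	fmt.Println(ix.add([]byte("hello"), "pack-0001", 0))
	fmt.Println(ix.add([]byte("hello"), "pack-0001", 5))
	loc, ok := ix.lookup(keyFor([]byte("hello")))
	fmt.Println(loc, ok)
}
```

Every lookup is a single map access, at the cost of memory proportional to the number of stored blobs and a full index read before the first operation.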
FAQs that no one has asked
Q: Wouldn't it be easier to just buy a Time Capsule?
A: Enjoy your "sparse bundle in
use" errors
that leave all of your backups corrupt and irrecoverable but aren't
reported until you try to restore.
Q: Isn't most of this functionality provided by
upspin?
A: It looks like it, especially as they implement the rest of the
infrastructure for some of their key use
cases.
Q: Why did you invent your own packfile format rather than using git's?
A: bk's pack files are simpler than git's (but don't have many of their
advantages, like efficient lookups after just a few seeks in index files,
without needing to read them all into memory). OTOH, bk uses
SHAKE256 to hash data blobs into 32
bytes of hash. git's choice of SHA-1 now looks somewhat
unfortunate,
though for personal backups, this probably isn't something to worry much
about.
Q: Why not use bup?
A: You should use bup. It has lots of users, which makes it less likely to
have subtle bugs. I wrote bk for fun (Go is fun) and because I wanted to
own responsibility for my bits. Also, bup doesn't directly support
encryption or uploading directly to
GCS.
Q: Your use of Check() and CheckError() isn't idiomatic Go error handling.
A: That's not a question. For a backup system, I believe that most errors
should cause the system to immediately stop and fail obviously rather than
make an attempt to recover (since the recovery code paths won't be well
exercised and are thus likely to be buggy). Given this decision, I'd rather
have those checks take a single line of code rather than three lines to
test the error against nil and then panic.
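For readers unfamiliar with the pattern, single-line checks of this kind might look roughly like the following. Only the names Check() and CheckError() come from the text above; the signatures, panic messages, and the `panics` and `errDemo` helpers are assumptions for illustration, not bk's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// errDemo is a stand-in error used only by this example.
var errDemo = errors.New("simulated write failure")

// Check panics if the condition does not hold.
func Check(ok bool) {
	if !ok {
		panic("check failed")
	}
}

// CheckError panics if err is non-nil.
func CheckError(err error) {
	if err != nil {
		panic(err)
	}
}

// panics reports whether f panicked; it exists only to demonstrate
// the behavior without crashing the example program.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	CheckError(nil) // no error: execution continues
	Check(1+1 == 2) // condition holds: execution continues
	fmt.Println(panics(func() { CheckError(errDemo) }))
}
```

Each call site then costs one line, where the conventional form would spend three on the nil test and the panic.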