s3gof3r

package module
v0.3.0
Published: Feb 18, 2014 License: MIT Imports: 21 Imported by: 0

README

s3gof3r

s3gof3r provides fast, parallelized, pipelined streaming access to Amazon S3. It includes a command-line interface: gof3r.

It is optimized for high speed transfer of large objects into and out of Amazon S3. Streaming support allows for usage like:

  $ tar -czf - <my_dir/> | gof3r put -b <s3_bucket> -k <s3_object>    
  $ gof3r get -b <s3_bucket> -k <s3_object> | tar -zx

Speed Benchmarks

On an EC2 instance, gof3r can exceed 1 Gbps for both puts and gets:

  $ gof3r get -b test-bucket -k 8_GB_tar | pv -a | tar -x
  Duration: 53.201632211s
  [ 167MB/s]
  

  $ tar -cf - test_dir/ | pv -a | gof3r put -b test-bucket -k 8_GB_tar
  Duration: 1m16.080800315s
  [ 119MB/s]

These tests were performed on an m1.xlarge EC2 instance with a virtualized 1 Gigabit ethernet interface. See Amazon EC2 Instance Details for more information.

Features

  • Speed: Especially for larger S3 objects, where parallelism can be exploited, s3gof3r will saturate the bandwidth of an EC2 instance. See the Benchmarks above.

  • Streaming Uploads and Downloads: As the examples above illustrate, streaming allows the gof3r command-line tool to be used with Linux/Unix pipes, so data can be transformed in parallel as it is uploaded to or downloaded from S3. A Go sketch of the same pattern using the package API follows this list.

  • End-to-end Integrity Checking: s3gof3r calculates the md5 hash of the stream in parallel while uploading and downloading. On upload, a file containing the md5 hash is saved in S3 and checked against the calculated md5 on download. The Content-MD5 of each part is also calculated on upload and sent with the header to be checked by AWS, and s3gof3r checks the 'hash of hashes' returned by S3 in the ETag field on completion of a multipart upload. See the S3 API Reference for details.

  • Retry Everything: Every HTTP request and every part is retried on both uploads and downloads. Requests to S3 frequently time out, especially under high load, so retrying is essential to complete large uploads or downloads.

  • Memory Efficiency: Memory used to upload and download parts is recycled. For an upload with the default concurrency of 10 and part size of 20 MB, the maximum memory usage is less than 250 MB and does not depend on the size of the upload. For downloads with the same default configuration, maximum memory usage will not exceed 450 MB. The additional memory usage vs. uploading is due to the need to reorder parts before adding them to the stream.
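
The same streaming behavior is available from the Go package documented below. The following is a minimal sketch, not part of the package itself: the bucket and key names are placeholders, and the AWS keys come from the environment variables described under Installation.

package main

import (
	"io"
	"log"
	"os"

	"github.com/rlmcpherson/s3gof3r"
)

func main() {
	// Read the required AWS credentials from the environment variables
	// described under Installation and Command-line Interface Usage below.
	keys := s3gof3r.Keys{
		AccessKey: os.Getenv("AWS_ACCESS_KEY_ID"),
		SecretKey: os.Getenv("AWS_SECRET_ACCESS_KEY"),
	}

	// "my-bucket" and "backups/my_dir.tar.gz" are placeholder names.
	// An empty domain selects the default, s3.amazonaws.com.
	b := s3gof3r.New("", keys).Bucket("my-bucket")

	// PutWriter returns an io.WriteCloser that performs a multipart
	// upload; copying stdin into it is the Go equivalent of piping a
	// tar stream into `gof3r put`. Nil header and config use the defaults.
	w, err := b.PutWriter("backups/my_dir.tar.gz", nil, nil)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(w, os.Stdin); err != nil {
		log.Fatal(err)
	}
	// Closing the writer completes the multipart upload.
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}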

Installation

s3gof3r is written in Go and requires a Go installation. It can be downloaded and compiled from source with go get. To install the command-line tool, gof3r:

$ go get github.com/rlmcpherson/s3gof3r/gof3r

To install just the package for use in other Go programs:

$ go get github.com/rlmcpherson/s3gof3r

Command-line Interface Usage:

  To stream up to S3:
     $  <input_stream> | gof3r put -b <bucket> -k <s3_path>
  To stream down from S3:
     $ gof3r get -b <bucket> -k <s3_path> | <output_stream>
  To upload a file to S3:
     $ gof3r put --path=<local_path> --bucket=<bucket> --key=<s3_path> --header=<http_header1> --header=<http_header2>...
  To download a file from S3:
     $ gof3r get --bucket=<bucket> --key=<s3_path>

Set AWS keys as environment variables (required):

  $ export AWS_ACCESS_KEY_ID=<access_key>
  $ export AWS_SECRET_ACCESS_KEY=<secret_key>

Try the gof3r command-line tool with statically-linked binaries:
Linux amd64 binary (go1.2)
Mac OS X binary (go1.2)

Examples:

$ tar -cf - /foo_dir/ | gof3r put -b my_s3_bucket -k bar_dir/s3_object -m x-amz-meta-custom-metadata:abc123 -m x-amz-server-side-encryption:AES256
$ gof3r get -b my_s3_bucket -k bar_dir/s3_object | tar -x

Complete Usage: get command:

  gof3r [OPTIONS] get [get-OPTIONS]

  get (download) from S3

  Help Options:
  -h, --help          Show this help message

  get (download) from S3:
  -p, --path=         Path to file. Defaults to standard output for streaming. (/dev/stdout)
  -k, --key=          key of s3 object
  -b, --bucket=       s3 bucket
  --md5Check-off      Do not use md5 hash checking to ensure data integrity.
                      By default, the md5 hash of the stream is calculated concurrently
                      during puts, stored at <bucket>.md5/<key>.md5, and verified on gets.
  -c, --concurrency=  Concurrency of transfers (20)
  -s, --partsize=     initial size of concurrent parts, in bytes (20 MB)
  --debug             Print debug statements and dump stacks.

  Help Options:
  -h, --help          Show this help message

Complete Usage: put command:

  gof3r [OPTIONS] put [put-OPTIONS]

  put (upload) to S3

  Help Options:
    -h, --help          Show this help message

  put (upload) to S3:
    -p, --path=         Path to file. Defaults to standard input for streaming. (/dev/stdin)
    -m, --header=       HTTP headers
    -k, --key=          key of s3 object
    -b, --bucket=       s3 bucket
    --md5Check-off      Do not use md5 hash checking to ensure data integrity.
                        By default, the md5 hash of the stream is calculated concurrently
                        during puts, stored at <bucket>.md5/<key>.md5, and verified on gets.
    -c, --concurrency=  Concurrency of transfers (20)
    -s, --partsize=     initial size of concurrent parts, in bytes (20 MB)
    --debug         Print debug statements and dump stacks.

  Help Options:
    -h, --help          Show this help message

See godoc.org for more documentation, including the s3gof3r package API:

s3gof3r package: http://godoc.org/github.com/rlmcpherson/s3gof3r

command-line interface: http://godoc.org/github.com/rlmcpherson/s3gof3r/gof3r

Documentation

Index

Constants

This section is empty.

Variables

var DefaultConfig = &Config{
	Concurrency: 10,
	PartSize:    20 * mb,
	NTry:        10,
	Md5Check:    true,
	Scheme:      "https",
}

Defaults

var DefaultDomain = "s3.amazonaws.com"

Functions

func ClientWithTimeout

func ClientWithTimeout(timeout time.Duration) *http.Client
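
A brief sketch of how the returned client might be plugged into a Config; configWithTimeout is a hypothetical helper, not part of the package.

import (
	"time"

	"github.com/rlmcpherson/s3gof3r"
)

// configWithTimeout copies DefaultConfig and swaps in an http.Client
// whose requests time out after d.
func configWithTimeout(d time.Duration) *s3gof3r.Config {
	cfg := *s3gof3r.DefaultConfig
	cfg.Client = s3gof3r.ClientWithTimeout(d)
	return &cfg
}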

func NewBufferPool

func NewBufferPool(bufsz int64) (np *bp)

Types

type Bucket

type Bucket struct {
	*S3
	Name string
}

func (*Bucket) GetReader

func (b *Bucket) GetReader(path string, c *Config) (r io.ReadCloser, h http.Header, err error)

Provides a reader and downloads data using parallel ranged get requests. Data from the requests is reordered and written sequentially.

Data integrity is verified via the option specified in c. Header data from the downloaded object is also returned, useful for reading object metadata.
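
A minimal sketch of using GetReader to save an object to a local file; download is a hypothetical helper, and the key and destination path are caller-supplied.

import (
	"io"
	"log"
	"os"

	"github.com/rlmcpherson/s3gof3r"
)

// download streams an S3 object into a local file and logs the object's
// Content-Type from the returned header.
func download(b *s3gof3r.Bucket, key, dest string) error {
	r, h, err := b.GetReader(key, s3gof3r.DefaultConfig)
	if err != nil {
		return err
	}
	defer r.Close()

	log.Println("Content-Type:", h.Get("Content-Type"))

	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, r)
	return err
}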

func (*Bucket) PutWriter

func (b *Bucket) PutWriter(path string, h http.Header, c *Config) (w io.WriteCloser, err error)

Provides a writer to upload data as multipart upload requests.

Each header in h is added to the HTTP request header. This is useful for specifying options such as server-side encryption in metadata as well as custom user metadata. DefaultConfig is used if c is nil.
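
A minimal sketch of using PutWriter to upload a local file, requesting server-side encryption through a custom header; upload is a hypothetical helper.

import (
	"io"
	"net/http"
	"os"

	"github.com/rlmcpherson/s3gof3r"
)

// upload copies a local file to S3 and asks for server-side encryption
// via a custom request header.
func upload(b *s3gof3r.Bucket, key, src string) error {
	f, err := os.Open(src)
	if err != nil {
		return err
	}
	defer f.Close()

	h := make(http.Header)
	h.Set("x-amz-server-side-encryption", "AES256")

	// A nil Config selects DefaultConfig, as noted above.
	w, err := b.PutWriter(key, h, nil)
	if err != nil {
		return err
	}
	if _, err := io.Copy(w, f); err != nil {
		w.Close()
		return err
	}
	// Closing the writer completes the multipart upload.
	return w.Close()
}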

func (*Bucket) Sign

func (b *Bucket) Sign(req *http.Request)

func (*Bucket) Url

func (b *Bucket) Url(path string, c *Config) url.URL

Returns a parsed URL for the given path, using the scheme specified in Config.Scheme.
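
A brief sketch combining Url and Sign to issue an authenticated HEAD request for object metadata; head is a hypothetical helper.

import (
	"net/http"

	"github.com/rlmcpherson/s3gof3r"
)

// head builds a request URL with Url, adds authentication with Sign, and
// issues a HEAD request for the object's metadata.
func head(b *s3gof3r.Bucket, key string) (*http.Response, error) {
	u := b.Url(key, s3gof3r.DefaultConfig)
	req, err := http.NewRequest("HEAD", u.String(), nil)
	if err != nil {
		return nil, err
	}
	b.Sign(req)
	return http.DefaultClient.Do(req)
}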

type Config

type Config struct {
	*http.Client        // nil to use s3gof3r default client
	Concurrency  int    // number of parts to get or put concurrently
	PartSize     int64  //  initial  part size in bytes to use for multipart gets or puts
	NTry         int    // maximum attempts for each part
	Md5Check     bool   // the md5 hash of the object is stored in <bucket>/.md5/<object_key> and verified on gets
	Scheme       string // url scheme, defaults to 'https'
}
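
A sketch of a custom configuration tuned for large objects; the specific values are illustrative, not recommendations from the package.

import "github.com/rlmcpherson/s3gof3r"

// largeObjectConfig uses more concurrency and larger parts than
// DefaultConfig, with md5 checking and https left enabled.
var largeObjectConfig = &s3gof3r.Config{
	Concurrency: 20,
	PartSize:    100 * 1024 * 1024, // 100 MB parts, in bytes
	NTry:        10,
	Md5Check:    true,
	Scheme:      "https",
}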

type Keys

type Keys struct {
	AccessKey string
	SecretKey string
}

Keys for an Amazon Web Services account. Used for signing http requests.
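
A brief sketch of populating Keys from the environment variables required by the command-line tool; envKeys is a hypothetical helper.

import (
	"os"

	"github.com/rlmcpherson/s3gof3r"
)

// envKeys reads the same environment variables that the gof3r
// command-line tool requires (see the README above).
func envKeys() s3gof3r.Keys {
	return s3gof3r.Keys{
		AccessKey: os.Getenv("AWS_ACCESS_KEY_ID"),
		SecretKey: os.Getenv("AWS_SECRET_ACCESS_KEY"),
	}
}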

type S3

type S3 struct {
	Domain string // The s3-compatible service domain. Defaults to "s3.amazonaws.com"
	Keys
}

func New

func New(domain string, keys Keys) *S3

Returns a new S3. The domain defaults to DefaultDomain if empty.

func (*S3) Bucket

func (s3 *S3) Bucket(name string) *Bucket

Returns a bucket on s3.
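
A brief sketch of pointing the package at an S3-compatible service by passing a non-empty domain; compatBucket, the domain, and the bucket name are all hypothetical.

import "github.com/rlmcpherson/s3gof3r"

// compatBucket targets an S3-compatible service instead of the default
// AWS endpoint; "objects.example.com" and "my-bucket" are placeholders.
func compatBucket(keys s3gof3r.Keys) *s3gof3r.Bucket {
	s3 := s3gof3r.New("objects.example.com", keys)
	return s3.Bucket("my-bucket")
}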

Directories

Path	Synopsis
gof3r	Command gof3r is a command-line interface for s3gof3r: fast, concurrent, streaming access to Amazon S3.
