s3tar

package module
v1.0.6-pre Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 26, 2023 License: Apache-2.0 Imports: 27 Imported by: 0

README

Amazon S3 Tar Tool

s3tar is utility tool to create a tarball of existing objects in Amazon S3.

s3tar allows customers to group existing Amazon S3 objects into TAR files without having to download the files. This cli tool leverages existing Amazon S3 APIs to create the archives on Amazon S3 that can be later transitioned to any of the cold storage tiers. The files generated follow the tar file format and can be extracted with standard tar tools.

Using the Multipart Uploads API, in particular UploadPartCopy API, we can copy existing objects into one object. This utility will create the intermediate TAR header files that go between each file and then concatenate all of the objects into a single tarball.

Usage

The tool follows the tar syntax for creation and extraction of tarballs with a few additions to support Amazon S3 operations.

flag description required
-c create yes, unless using -x
-x extract yes, unless using -c
-C destination to extract yes when using -x
-f file that will be generated or extracted: s3://bucket/prefix/file.tar yes
-t list files in archive no
--extended to use with -t to extend the output to filename,loc,length,etag no
-m manifest input no
--region aws region where the bucket is yes
-v, -vv, -vvv level of verbose no
--format Tar format PAX or GNU, default is PAX no
--endpointUrl specify an Amazon S3 endpoint no

The syntax for creating and extracting tarballs remains similar to traditional tar tools:

   s3tar --region region [-c --create] | [-x --extract] [-v] -f s3://bucket/prefix/file.tar s3://bucket/prefix
Examples

To create a tarball s3://bucket/prefix/archive.tar from all the objects located under s3://bucket/files/

s3tar --region us-west-2 -cvf s3://bucket/prefix/archive.tar s3://bucket/files/

The tool supports an input manifest -m. The manifest is a comma-separated-value (csv) file with bucket,key,content-length. Content-length is the size in bytes of the object. For example:

$ cat manifest.input.csv
my-bucket,prefix/file.0001.exr,68365312
my-bucket,prefix/file.0002.exr,50172928
my-bucket,prefix/file.0003.exr,67663872

$ s3tar --region us-west-2 -cvf s3://bucket/prefix/archive.tar -m /Users/bolyanko/manifest.input.csv

# the manifest file can be a local file or an object in Amazon S3

$ s3tar --region us-west-2 -cvf s3://bucket/prefix/archive.tar -m s3://bucket/prefix/manifest.input.csv


TOC & Extract

Tarballs created with this tool generate a Table of Contents (TOC). This TOC file is at the beginning of the archive and it contains a csv line per file with the name, byte location, content-length, Etag. This added functionality allows archives that are created this way to also be extracted without having to download the tar object.

You can extract a tarball from Amazon S3 into another Amazon S3 location with the following command:

s3tar --region us-west-2 -xvf s3://bucket/prefix/archive.tar -C s3://bucket/destination/

To extract a single file in a tar, or a prefix

s3tar --region us-west-2 -xvf s3://bucket/prefix/archive.tar -C s3://bucket/destination/ folder/image1.jpg 
# or a dir
s3tar --region us-west-2 -xvf s3://bucket/prefix/archive.tar -C s3://bucket/destination/ folder/ 
List

If you want to list the files in a tar

s3tar --region us-west-2 -tf s3://bucket/prefix/archive.tar 
folder/image1.jpg
folder/image2.jpg
folder/image3.jpg
other-folder/image1.jpg
other-folder/image2.jpg
other-folder/image3.jpg
Performance

The tool's performance is bound by the API calls limitations. The table below has a few tests with files of different sizes.

Number of Files Final archive size Average Object Size Creation Time Extraction Time Estimated Cost (us-west-2)
41,593 20 GB 512 KB 6m10s 3m11s $0.4159
124,779 61 GB 512 KB 18m24s 10m5s $1.2478
249,558 123 GB 512 KB 40m56s 21m42s $2.4956
499,116 246 GB 512 KB 1h34m 38m58s $4.9912
748,674 369 GB 512 KB 2h36m $7.48674
14,400 73 GB 70 MB 2m15s 1m20s $0.1440
69,121 3.75 TB 70 MB 1h11m30s 32m20s $0.6912

The application is configured to retry every Amazon S3 operation up to 10 times with a Max backoff time of 20 seconds. If you get a timeout error, try reducing the number of files.

Installation

A make file is included that helps building the application for darwin-arm64 linux-arm64 linux-amd64. Place the resulting s3tar binary in your PATH.

How the tool works

This tools utilizes Amazon S3 Multipart Upload (MPU). MPU allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object.

Multipart upload is a three-step process: You initiate the upload, you upload the object parts or copy from an existing Amazon S3 Object, and after you have all the parts, you complete the multipart upload. Upon receiving the complete multipart upload request, Amazon S3 constructs the object from all the parts, and you can then access the object just as you would any other object in your bucket. You can learn more about Multipart Upload on the MPU Overview

There are two Amazon S3 API Operations that allow adding data to a Multipart Upload. UploadPart and UploadPartCopy. This tool generates TAR header files and uses s3.UploadPart to upload the header data into a MPU, and then it uses s3.UploadPartCopy to copy your existing Amazon S3 Object into the newly created object.

Currently Multipart Uploads have a minimum requirement of 5MB per part and each part can go up to 5GiB. The total maximum MPU object size is 5TiB.

s3tar automatically detects the size of the objects it needs to tar. The total size of all the files must be greater than 5MB. If the individual files are smaller than the 5MB multipart limitation the tool will recursively concatenate groups of files into 10MB S3 objects. The tool generates an empty 5MB file (zeros) and everything gets appended to this file, on the last file of the group a CopySourceRange is performed removing the 5MB pad. As a last step the tool will merge all the objects together creating the final tar.

Group1 = remove5MB([(((((5MB File) + header1) + file1) + header2) + file2)...])
Group2 = remove5MB([(((((5MB File) + header1) + file1) + header2) + file2)...])
NewObject = Concat(Group1, Group2)

If the files being tar-ed are larger than 5MB then it will create pairs of (file + next header) and then merge. The first file will have a 5MB padding, this will be removed at the end:

NewS3Object = [(5MB Zeroes + tar_header1) + (S3 Existing Object 1) + tar_header2 + (S3 Existing Object 1) ... (EOF 2x512 blocks)]

Testing & Validation

We encourage the end-user to write validation workflows to verify the data has been properly tared. If objects being tared are smaller than 5GB, users can use Amazon S3 Batch Operations to generate checksums for the individual objects. After the creation of the tar, users can extract the data into a separate bucket/folder and run the same batch operations job on the new data and verify that the checksums match. To learn more about using checksums for data validation, along with some demos, please watch Get Started With Checksums in Amazon S3 for Data Integrity Checking.

Pricing

It's important to understand that Amazon S3's API has costs associated with it. In particular PUT, COPY, POST are charged at a higher rate than GET. The majority of requests performed by this tool are COPY and PUT operations. Please refer to the Amazon S3 Pricing page for a breakdown of the API costs. You can also use the AWS Cost Calculator to help you price your operations.

During the build process the tool uses Amazon S3 Standard to work on files. If you are aggregating 1,000 objects, then it will require at least 1,000 COPY operations and 1,000 PUT operations for the tar headers.

Example: If we want to aggregate 10,000 files

$0.005 PUT, COPY, POST, LIST requests (per 1,000 requests)
To Copy the 10,000 files to the archive we will do at least 10,000 COPY operations

10,000 / 1,000 * $0.005 = $0.05 
We need to generate at least 10,000 header files, that's 10,000 PUT opeartions

10,000 / 1,000 * $0.005 = $0.05 
There are other intermidiate operations of creating multipart
It would cost a little over $0.1 to create an archive of 10,000 files

The cost example above only prices the cost of performing the operation. It doesn't include how much it would cost to store the final object.

Limitations of the tool

This tool still has the same limitations of Multipart Object sizes:

  • The cumulative size of the TAR must be over 5MB
  • The final size cannot be larger than 5TB

Security

See CONTRIBUTING for more information.


Frequently Asked Questions (FAQ)

Does the tool download any files?

No, all files are copied from their current location in Amazon S3 to their destination using the s3.UploadPartCopy API call.


Does the tool upload any files?

We are using the go archive/tar library to generate the Tar headers that go in between the files. These files are uploaded to Amazon S3 and concatenated with the Multipart Upload.


Does the tool delete any files?

No, the original files will remain untouched. The user is responsible for the lifecycle of the objects.


Is compression supported?

No, the tool is only copying existing data from Amazon S3 to another Amazon S3 location. To compress the objects it would require the tool to download the data, compress and then re-upload to Amazon S3.


Are Amazon S3 tags and meta-data copied to the tarball

No. Currently we're storing the Etag in the TOC, there is a possibility that could allow us to expand this.


What size of files are supported?

Any size that is within the Amazon S3 Multipart Object limitations. On the small side they can be as small as a few bytes, as long as the total archive at the end is over 5MB. On the large side the max size per object is 5GB, and the total archive is 5TB.


Can I open the resulting tar anywhere?

Yes, the tarballs are created with either PAX (default) or GNU headers. You can download the tar file generated and extract it using the same tools you use to operate on tar files.


What type of compute resources do I need to run the tool?

Since the tool is only doing API calls, any compute that can reach the Amazon S3 API should suffice. This could run on a t4g.nano or Lambda, as long as the number of files is low enough for the 15 minute window.

License

This project is licensed under the Apache-2.0 License.

Documentation

Index

Constants

This section is empty.

Variables

View Source
var ErrUnableToAccess = errors.New("unable to access")

Functions

func Debugf

func Debugf(ctx context.Context, format string, v ...interface{})

func DeleteAllMultiparts

func DeleteAllMultiparts(client *s3.Client, bucket string) error

DeleteAllMultiparts helper function to clear ALL MultipartUploads in a bucket. This will delete all incomplete (or in progress) MPUs for a bucket.

func Errorf added in v1.0.3

func Errorf(ctx context.Context, format string, v ...interface{})

Errorf, always log regardless of log level, but don't stop the application

func Extract

func Extract(ctx context.Context, svc *s3.Client, prefix string, opts *S3TarS3Options) error

Extract will unpack the tar file from source to target without downloading the archive locally. The archive has to be created with the manifest option.

func ExtractBucketAndPath

func ExtractBucketAndPath(s3url string) (bucket string, path string)

ExtractBucketAndPath helper function to extract bucket and key from s3://bucket/prefix/key URLs

func Fatalf

func Fatalf(ctx context.Context, format string, v ...interface{})

func GenerateToc added in v1.0.6

func GenerateToc(tarFile, outputToc string, opts *S3TarS3Options) error

GenerateToc creates a TOC csv of an existing TAR file (not created by s3tar) tar file MUST NOT have compression. tar file must be on the local file system to. TODO: It should be possible to generate a TOC from an existing TAR already by only reading the headers and skipping the data.

func GetS3Client

func GetS3Client(ctx context.Context) *s3.Client

func Infof

func Infof(ctx context.Context, format string, v ...interface{})

func ServerSideTar

func ServerSideTar(incoming context.Context, svc *s3.Client, opts *S3TarS3Options)

func SetLogLevel

func SetLogLevel(ctx context.Context, level int) context.Context

func SetupLogger

func SetupLogger(incoming context.Context) context.Context

func StringToInt64

func StringToInt64(s string) (int64, error)

func Warnf

func Warnf(ctx context.Context, format string, v ...interface{})

func WithBucketAndKey

func WithBucketAndKey(bucket, key string) func(*S3Obj)

func WithClient

func WithClient(client *s3.Client, optFns ...func(*RecursiveConcatOptions)) func(*RecursiveConcatOptions)

func WithSize

func WithSize(size int64) func(*S3Obj)

Types

type FileMetadata

type FileMetadata struct {
	Filename string
	Start    int64
	Size     int64
	Etag     string
}

type Index

type Index struct {
	Start int
	End   int
	Size  int
}

type Logger

type Logger struct {
	Level int
}

type PartsMessage

type PartsMessage struct {
	Parts   []*S3Obj
	PartNum int
}

type RecursiveConcat

type RecursiveConcat struct {
	Client      *s3.Client
	Region      string
	EndpointUrl string
	Bucket      string
	DstPrefix   string
	DstKey      string
	// contains filtered or unexported fields
}

func NewRecursiveConcat

func NewRecursiveConcat(ctx context.Context, options RecursiveConcatOptions, optFns ...func(*RecursiveConcatOptions)) (*RecursiveConcat, error)

func (*RecursiveConcat) ConcatObjects

func (r *RecursiveConcat) ConcatObjects(ctx context.Context, objectList []*S3Obj, bucket, key string) (*S3Obj, error)

func (*RecursiveConcat) CreateFirstBlock

func (r *RecursiveConcat) CreateFirstBlock(ctx context.Context)

type RecursiveConcatOptions

type RecursiveConcatOptions struct {
	Client      *s3.Client
	Region      string
	EndpointUrl string
	Bucket      string
	DstPrefix   string
	DstKey      string
}

func (RecursiveConcatOptions) Copy

Copy creates a clone where the APIOptions list is deep copied.

type S3Obj

type S3Obj struct {
	types.Object
	Bucket           string
	PartNum          int
	Data             []byte
	NoHeaderRequired bool
}

func LoadCSV

func LoadCSV(ctx context.Context, svc *s3.Client, fpath string, skipHeader bool) ([]*S3Obj, error)

func NewS3Obj

func NewS3Obj() *S3Obj

func NewS3ObjFromObject

func NewS3ObjFromObject(o types.Object) *S3Obj

func NewS3ObjOptions

func NewS3ObjOptions(options ...func(*S3Obj)) *S3Obj

func (*S3Obj) AddData

func (s *S3Obj) AddData(data []byte)

type S3TarS3Options

type S3TarS3Options struct {
	SrcManifest        string
	SkipManifestHeader bool
	SrcBucket          string
	SrcPrefix          string
	SrcKey             string
	DstBucket          string
	DstPrefix          string
	DstKey             string
	Threads            uint
	DeleteSource       bool
	SmallFiles         bool
	Region             string
	EndpointUrl        string
	TarFormat          string
	ExternalToc        string
}

S3TarS3Options options to create an archive

type TOC added in v1.0.2

type TOC []*FileMetadata

func List added in v1.0.2

func List(ctx context.Context, svc *s3.Client, bucket, key string, opts *S3TarS3Options) (TOC, error)

List will print out the contents in a tar, we do this by just printing from the TOC.

type VirtualArchive

type VirtualArchive []*S3Obj

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL