s3tar is a utility to create tarballs from existing objects in Amazon S3.
s3tar allows customers to group existing Amazon S3 objects into TAR files without having to download them. This CLI tool leverages existing Amazon S3 APIs to create the archives on Amazon S3, which can later be transitioned to any of the cold storage tiers. The files generated follow the tar file format and can be extracted with standard tar tools.
Using the Multipart Upload API, in particular the UploadPartCopy API, we can copy existing objects into one object. This utility creates the intermediate TAR header files that go between each file and then concatenates all of the objects into a single tarball.
Usage
The tool follows the tar syntax for creation and extraction of tarballs with a few additions to support Amazon S3 operations.
| flag | description | required |
| --- | --- | --- |
| -c | create | yes, unless using -x |
| -x | extract | yes, unless using -c |
| -C | destination to extract to | yes when using -x |
| -f | file that will be generated or extracted: s3://bucket/prefix/file.tar | yes |
| -t | list files in the archive | no |
| --extended | use with -t to extend the output to filename,loc,length,etag | no |
| -m | manifest input | no |
| --region | AWS Region where the bucket is located | yes |
| -v, -vv, -vvv | verbosity level | no |
| --format | tar format, PAX or GNU; default is PAX | no |
| --endpointUrl | specify an Amazon S3 endpoint | no |
The syntax for creating and extracting tarballs remains similar to traditional tar tools:
s3tar --region region [-c --create] | [-x --extract] [-v] -f s3://bucket/prefix/file.tar s3://bucket/prefix
Examples
To create a tarball s3://bucket/prefix/archive.tar from all the objects located under s3://bucket/files/:
s3tar --region us-west-2 -cvf s3://bucket/prefix/archive.tar s3://bucket/files/
The tool supports an input manifest with -m. The manifest is a comma-separated value (CSV) file with bucket,key,content-length, where content-length is the size of the object in bytes. For example:
$ cat manifest.input.csv
my-bucket,prefix/file.0001.exr,68365312
my-bucket,prefix/file.0002.exr,50172928
my-bucket,prefix/file.0003.exr,67663872
$ s3tar --region us-west-2 -cvf s3://bucket/prefix/archive.tar -m /Users/bolyanko/manifest.input.csv
# the manifest file can be a local file or an object in Amazon S3
$ s3tar --region us-west-2 -cvf s3://bucket/prefix/archive.tar -m s3://bucket/prefix/manifest.input.csv
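If you need to build such a manifest yourself, one option is to list the prefix and emit one line per object. The sketch below is not part of s3tar; it uses the AWS SDK for Go v2, the bucket and prefix names are placeholders, and the Size field type varies between SDK versions.

```go
// Illustrative sketch: build a bucket,key,content-length manifest from a prefix listing.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("us-west-2"))
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	bucket, prefix := "my-bucket", "prefix/" // placeholders
	p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	})
	for p.HasMorePages() {
		page, err := p.NextPage(context.TODO())
		if err != nil {
			log.Fatal(err)
		}
		for _, obj := range page.Contents {
			// One CSV line per object: bucket,key,content-length.
			// Size is *int64 in recent aws-sdk-go-v2 versions.
			fmt.Printf("%s,%s,%d\n", bucket, aws.ToString(obj.Key), aws.ToInt64(obj.Size))
		}
	}
}
```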
Tarballs created with this tool include a Table of Contents (TOC). The TOC file sits at the beginning of the archive and contains one CSV line per file with the name, byte location, content-length, and ETag. This added functionality allows archives created this way to also be extracted without having to download the tar object.
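As an illustration of what the TOC enables, the hedged sketch below shows how one TOC entry could drive a ranged GetObject for a single member instead of downloading the whole tar. The tocEntry struct and the rangedGet helper are assumptions for this example, not part of s3tar; the field order mirrors the filename,loc,length,etag listing produced by -t --extended.

```go
// Illustrative sketch: use one TOC entry to fetch a single member with a ranged GET.
package tocread

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

type tocEntry struct {
	Name   string // member name inside the tar
	Offset int64  // byte location of the member's data within the tar object
	Length int64  // member size in bytes
	ETag   string // ETag of the original object
}

// rangedGet fetches only the bytes belonging to one member of the tar object.
func rangedGet(ctx context.Context, client *s3.Client, bucket, tarKey string, e tocEntry) (*s3.GetObjectOutput, error) {
	// HTTP Range is inclusive on both ends.
	rng := fmt.Sprintf("bytes=%d-%d", e.Offset, e.Offset+e.Length-1)
	return client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(tarKey),
		Range:  aws.String(rng),
	})
}
```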
You can extract a tarball from Amazon S3 into another Amazon S3 location with the following command:
s3tar --region us-west-2 -xvf s3://bucket/prefix/archive.tar -C s3://bucket/destination/
To extract a single file from a tar, or a whole prefix:
s3tar --region us-west-2 -xvf s3://bucket/prefix/archive.tar -C s3://bucket/destination/ folder/image1.jpg
# or a dir
s3tar --region us-west-2 -xvf s3://bucket/prefix/archive.tar -C s3://bucket/destination/ folder/
List
If you want to list the files in a tar
s3tar --region us-west-2 -tf s3://bucket/prefix/archive.tar
folder/image1.jpg
folder/image2.jpg
folder/image3.jpg
other-folder/image1.jpg
other-folder/image2.jpg
other-folder/image3.jpg
The tool's performance is bound by the limits of the underlying API calls. The table below shows a few tests with files of different sizes.
| Number of Files | Final Archive Size | Average Object Size | Creation Time | Extraction Time | Estimated Cost (us-west-2) |
| --- | --- | --- | --- | --- | --- |
| 41,593 | 20 GB | 512 KB | 6m10s | 3m11s | $0.4159 |
| 124,779 | 61 GB | 512 KB | 18m24s | 10m5s | $1.2478 |
| 249,558 | 123 GB | 512 KB | 40m56s | 21m42s | $2.4956 |
| 499,116 | 246 GB | 512 KB | 1h34m | 38m58s | $4.9912 |
| 748,674 | 369 GB | 512 KB | 2h36m | | $7.48674 |
| 14,400 | 73 GB | 70 MB | 2m15s | 1m20s | $0.1440 |
| 69,121 | 3.75 TB | 70 MB | 1h11m30s | 32m20s | $0.6912 |
The application is configured to retry every Amazon S3 operation up to 10 times, with a maximum backoff of 20 seconds. If you get a timeout error, try reducing the number of files.
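For reference, an equivalent retry policy in a standalone AWS SDK for Go v2 program would look roughly like the sketch below; this is illustrative only and not the tool's actual configuration code.

```go
// Illustrative sketch: up to 10 attempts per S3 operation with a 20 second max backoff.
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRegion("us-west-2"),
		config.WithRetryer(func() aws.Retryer {
			return retry.NewStandard(func(o *retry.StandardOptions) {
				o.MaxAttempts = 10
				o.MaxBackoff = 20 * time.Second
			})
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = s3.NewFromConfig(cfg) // every call made with this client inherits the retry policy
}
```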
Installation
A Makefile is included that builds the application for darwin-arm64, linux-arm64, and linux-amd64. Place the resulting s3tar binary in your PATH.
This tool utilizes Amazon S3 Multipart Upload (MPU). MPU allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object.
Multipart upload is a three-step process: you initiate the upload, you upload the object parts or copy them from existing Amazon S3 objects, and after you have all the parts, you complete the multipart upload. Upon receiving the complete multipart upload request, Amazon S3 constructs the object from all the parts, and you can then access the object just as you would any other object in your bucket. You can learn more about multipart upload in the MPU Overview.
There are two Amazon S3 API operations that allow adding data to a multipart upload: UploadPart and UploadPartCopy. This tool generates TAR header files and uses s3.UploadPart to upload the header data into an MPU, and then it uses s3.UploadPartCopy to copy your existing Amazon S3 object into the newly created object.
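To make the mechanism concrete, here is a simplified sketch of stitching a generated tar header and an existing object into one destination object with the AWS SDK for Go v2. It is not s3tar's actual source: bucket and key names are placeholders, error handling is minimal, field types such as PartNumber vary between SDK versions, and in practice every part except the last must be at least 5MB, which is exactly why s3tar pads and groups small pieces as described below.

```go
// Illustrative sketch: upload tar header bytes with UploadPart, copy the existing
// object server-side with UploadPartCopy, then complete the multipart upload.
package tarcopy

import (
	"bytes"
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

func appendHeaderAndObject(ctx context.Context, client *s3.Client,
	destBucket, destKey string, header []byte, srcBucket, srcKey string) error {

	// 1. Start the multipart upload for the destination tar object.
	mpu, err := client.CreateMultipartUpload(ctx, &s3.CreateMultipartUploadInput{
		Bucket: aws.String(destBucket),
		Key:    aws.String(destKey),
	})
	if err != nil {
		return err
	}

	// 2. Upload the locally generated tar header bytes as one part.
	p1, err := client.UploadPart(ctx, &s3.UploadPartInput{
		Bucket:     aws.String(destBucket),
		Key:        aws.String(destKey),
		UploadId:   mpu.UploadId,
		PartNumber: aws.Int32(1),
		Body:       bytes.NewReader(header),
	})
	if err != nil {
		return err
	}

	// 3. Copy the existing S3 object in place as the next part, with no download.
	p2, err := client.UploadPartCopy(ctx, &s3.UploadPartCopyInput{
		Bucket:     aws.String(destBucket),
		Key:        aws.String(destKey),
		UploadId:   mpu.UploadId,
		PartNumber: aws.Int32(2),
		CopySource: aws.String(srcBucket + "/" + srcKey),
	})
	if err != nil {
		return err
	}

	// 4. Complete the upload; S3 stitches the parts into a single object.
	_, err = client.CompleteMultipartUpload(ctx, &s3.CompleteMultipartUploadInput{
		Bucket:   aws.String(destBucket),
		Key:      aws.String(destKey),
		UploadId: mpu.UploadId,
		MultipartUpload: &types.CompletedMultipartUpload{
			Parts: []types.CompletedPart{
				{ETag: p1.ETag, PartNumber: aws.Int32(1)},
				{ETag: p2.CopyPartResult.ETag, PartNumber: aws.Int32(2)},
			},
		},
	})
	return err
}
```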
Currently Multipart Uploads have a minimum requirement of 5MB per part and each part can go up to 5GiB. The total maximum MPU object size is 5TiB.
s3tar automatically detects the size of the objects it needs to tar. The total size of all the files must be greater than 5MB. If the individual files are smaller than the 5MB multipart limitation, the tool will recursively concatenate groups of files into 10MB S3 objects. The tool generates an empty 5MB file (zeros) and everything gets appended to this file; on the last file of the group, a CopySourceRange is performed to remove the 5MB pad. As a last step, the tool merges all the objects together, creating the final tar.
Group1 = remove5MB([(((((5MB File) + header1) + file1) + header2) + file2)...])
Group2 = remove5MB([(((((5MB File) + header1) + file1) + header2) + file2)...])
NewObject = Concat(Group1, Group2)
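As a hedged illustration of the pad-removal step, the sketch below copies a staged group object into the final multipart upload while skipping its leading 5MB of zeros using CopySourceRange. The function and variable names are assumptions for this example, not s3tar's real code.

```go
// Illustrative sketch: drop the 5MB zero pad from a staged group while copying it
// into the final multipart upload.
package padcopy

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

const padSize = 5 * 1024 * 1024 // the zero pad at the start of each staged group object

func copyGroupWithoutPad(ctx context.Context, client *s3.Client,
	destBucket, destKey, uploadID string, partNumber int32,
	groupBucket, groupKey string, groupSize int64) (*s3.UploadPartCopyOutput, error) {

	// The range is inclusive: start right after the pad, end at the group's last byte.
	srcRange := fmt.Sprintf("bytes=%d-%d", padSize, groupSize-1)
	return client.UploadPartCopy(ctx, &s3.UploadPartCopyInput{
		Bucket:          aws.String(destBucket),
		Key:             aws.String(destKey),
		UploadId:        aws.String(uploadID),
		PartNumber:      aws.Int32(partNumber),
		CopySource:      aws.String(groupBucket + "/" + groupKey),
		CopySourceRange: aws.String(srcRange),
	})
}
```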
If the files being tarred are larger than 5MB, the tool creates pairs of (file + next header) and then merges them. The first file will have 5MB of padding, which is removed at the end:
NewS3Object = [(5MB Zeroes + tar_header1) + (S3 Existing Object 1) + tar_header2 + (S3 Existing Object 2) ... (EOF 2x512 blocks)]
Testing & Validation
We encourage the end user to write validation workflows to verify the data has been properly tarred. If the objects being tarred are smaller than 5GB, users can use Amazon S3 Batch Operations to generate checksums for the individual objects. After the creation of the tar, users can extract the data into a separate bucket/folder and run the same Batch Operations job on the new data to verify that the checksums match. To learn more about using checksums for data validation, along with some demos, please watch Get Started With Checksums in Amazon S3 for Data Integrity Checking.
Pricing
It's important to understand that the Amazon S3 API has costs associated with it. In particular, PUT, COPY, and POST requests are charged at a higher rate than GET requests. The majority of requests performed by this tool are COPY and PUT operations. Please refer to the Amazon S3 Pricing page for a breakdown of the API costs. You can also use the AWS Cost Calculator to help you price your operations.
During the build process the tool uses Amazon S3 Standard to work on files. If you are aggregating 1,000 objects, it will require at least 1,000 COPY operations and 1,000 PUT operations for the tar headers.
Example: if we want to aggregate 10,000 files, with PUT, COPY, POST, and LIST requests priced at $0.005 per 1,000 requests:
- Copying the 10,000 files into the archive takes at least 10,000 COPY operations: 10,000 / 1,000 * $0.005 = $0.05
- Generating at least 10,000 header files takes 10,000 PUT operations: 10,000 / 1,000 * $0.005 = $0.05
- There are also other intermediate operations for creating the multipart upload.

It would cost a little over $0.10 to create an archive of 10,000 files.
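A tiny sketch of the same arithmetic, purely for illustration; the $0.005 per 1,000 requests figure is the example rate used above and should be checked against the current Amazon S3 pricing page.

```go
package pricing

// estimateRequestCost approximates the COPY + PUT request cost of archiving n objects.
func estimateRequestCost(n int) float64 {
	const pricePer1000 = 0.005                   // PUT, COPY, POST, LIST price per 1,000 requests
	copyCost := float64(n) / 1000 * pricePer1000 // one COPY per archived object
	putCost := float64(n) / 1000 * pricePer1000  // one PUT per generated tar header
	return copyCost + putCost                    // excludes intermediate MPU requests and storage
}

// estimateRequestCost(10000) == 0.10
```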
The cost example above only prices the cost of performing the operation. It doesn't include how much it would cost to store the final object.
This tool is still subject to the same multipart object size limitations:
- The cumulative size of the TAR must be over 5MB
- The final size cannot be larger than 5TB
Security
See CONTRIBUTING for more information.
Frequently Asked Questions (FAQ)
Does the tool download any files?
No, all files are copied from their current location in Amazon S3 to their destination using the s3.UploadPartCopy API call.
Does the tool upload any files?
Only the TAR headers. We use the Go archive/tar library to generate the TAR headers that go in between the files; these headers are uploaded to Amazon S3 and concatenated with the multipart upload.
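For reference, a minimal sketch of producing a standalone PAX header block with archive/tar is shown below. The key name and size are placeholders taken from the manifest example above; s3tar's real header handling is more involved.

```go
// Illustrative sketch: generate just the tar header blocks for one object, without its data.
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"log"
)

func main() {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{
		Name:   "prefix/file.0001.exr", // placeholder object key
		Mode:   0o644,
		Size:   68365312, // the object's content-length in bytes
		Format: tar.FormatPAX,
	}
	// WriteHeader emits only the header blocks into buf; the file data itself is never
	// written here, since it is copied straight from S3 with UploadPartCopy.
	if err := tw.WriteHeader(hdr); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("header is %d bytes\n", buf.Len()) // uploaded ahead of the object's data
}
```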
Does the tool delete any files?
No, the original files will remain untouched. The user is responsible for the lifecycle of the objects.
Is compression supported?
No, the tool only copies existing data from Amazon S3 to another Amazon S3 location. Compressing the objects would require the tool to download the data, compress it, and then re-upload it to Amazon S3.
Are Amazon S3 tags and metadata copied to the tarball?
No. Currently we store the ETag in the TOC; there is a possibility that this could allow us to expand the functionality later.
What size of files are supported?
Any size that is within the Amazon S3 multipart object limitations. On the small side, files can be as small as a few bytes, as long as the total archive at the end is over 5MB. On the large side, the maximum size per object is 5GB, and the total archive can be up to 5TB.
Can I open the resulting tar anywhere?
Yes, the tarballs are created with either PAX (default) or GNU headers. You can download the tar file generated and extract it using the same tools you use to operate on tar files.
What type of compute resources do I need to run the tool?
Since the tool only makes API calls, any compute that can reach the Amazon S3 API should suffice. This could run on a t4g.nano or on AWS Lambda, as long as the number of files is small enough to finish within the 15-minute execution limit.
License
This project is licensed under the Apache-2.0 License.