aws-s3-stream

s3stream - Merge lines from S3 objects to stdout


Read large numbers of text files from Amazon S3 line by line and dump their merged contents to standard output, so you can process them with a pipeline of Unix/Linux filters (e.g. sed, grep, gzip, jq, the AWS CLI).


Features

  • Read S3 objects line by line and write their lines to standard output.
  • Download and merge up to 32 S3 objects in parallel.
  • Automatically decompress GZIP-ed text files (a sketch of the line-by-line streaming and GZIP detection follows this list).
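
The tool's internals aren't documented here, but a minimal sketch of the first and third features might look like the following. It assumes the AWS SDK for Go v1 and GZIP detection by magic bytes; both are assumptions for illustration, not confirmed details of s3stream's implementation.

package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// streamLines copies one S3 object to w line by line, transparently
// decompressing it when the first two bytes are the GZIP magic number.
// Hypothetical helper, not s3stream's actual code.
func streamLines(client *s3.S3, bucket, key string, w io.Writer) error {
	out, err := client.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return err
	}
	defer out.Body.Close()

	br := bufio.NewReader(out.Body)
	var r io.Reader = br
	if magic, err := br.Peek(2); err == nil && magic[0] == 0x1f && magic[1] == 0x8b {
		gz, err := gzip.NewReader(br)
		if err != nil {
			return err
		}
		defer gz.Close()
		r = gz
	}

	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 4*1024*1024) // tolerate long lines
	for sc.Scan() {
		if _, err := fmt.Fprintln(w, sc.Text()); err != nil {
			return err
		}
	}
	return sc.Err()
}

func main() {
	// Like s3stream itself, this sketch takes the region from AWS_REGION.
	sess := session.Must(session.NewSession())
	if err := streamLines(s3.New(sess), "your-bucket", "path/to/object", os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}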

Install

$ go get github.com/gogama/aws-s3-stream

Build

$ go build -o s3stream

Run

Read a single S3 object and dump its lines to standard output.

$ AWS_REGION=<region> ./s3stream s3://your-bucket/path/to/object 

Read multiple objects and dump them to stdout.

$ AWS_REGION=<region> ./s3stream -p s3://your-bucket/some/prefix obj1 obj2 obj3 obj4 obj5

Read all objects under a prefix with maximum concurrency, dump them to stdout, then GZIP the merged stream and upload it back to S3 (uses the AWS CLI).

$ aws --region <region> s3 ls --recursive input-bucket/prefix |
    awk '{print $4}' |
    AWS_REGION=<region> ./s3stream -p s3://input-bucket/prefix -c 32 |
    gzip --best |
    aws s3 cp - s3://output-bucket/merged.gz
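
The -c flag above raises the parallelism to its cap of 32. A minimal sketch of how such a bounded parallel merge might be structured in Go follows; mergeObjects and fetch are hypothetical names, and the tool's real internals may differ.

package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

// mergeObjects fetches up to maxParallel objects at once and serializes
// their lines onto stdout. fetch stands in for a per-object line reader
// such as streamLines above; lines from different objects may interleave,
// but individual lines are never split.
func mergeObjects(keys []string, maxParallel int, fetch func(key string, out chan<- string) error) error {
	lines := make(chan string, 1024)
	errs := make(chan error, len(keys))
	sem := make(chan struct{}, maxParallel) // semaphore bounding concurrency
	var wg sync.WaitGroup

	for _, key := range keys {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			if err := fetch(key, lines); err != nil {
				errs <- err
			}
		}(key)
	}

	// Close the line channel once every download goroutine is done.
	go func() {
		wg.Wait()
		close(lines)
	}()

	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()
	for line := range lines {
		if _, err := fmt.Fprintln(w, line); err != nil {
			return err
		}
	}

	select {
	case err := <-errs:
		return err
	default:
		return nil
	}
}

Funneling all output through a single channel is one way to keep lines intact under concurrency: many goroutines download at once, but only one goroutine ever writes to stdout.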

License

MIT

Backlog

  • Profile performance, assess bottlenecks, and determine whether increasing or reducing parallelism would help in places.
  • Support ways to provide the AWS region other than the AWS_REGION environment variable.
  • Add an option, on by default, to detect and ignore objects that aren't text.
  • Add an option, off by default, to unpack and look inside archives.
  • Support the S3 ARN format as well as S3 URLs (see the parsing sketch after this list).
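
For the last item, unified reference parsing might look like the sketch below. parseS3Ref is a hypothetical helper, not part of the tool; S3 object ARNs take the form arn:aws:s3:::bucket/key, with the region and account fields empty.

package main

import (
	"fmt"
	"strings"
)

// parseS3Ref accepts either an s3://bucket/key URL or an
// arn:aws:s3:::bucket/key object ARN and returns the bucket and key.
func parseS3Ref(ref string) (bucket, key string, err error) {
	var rest string
	switch {
	case strings.HasPrefix(ref, "s3://"):
		rest = strings.TrimPrefix(ref, "s3://")
	case strings.HasPrefix(ref, "arn:aws:s3:::"):
		rest = strings.TrimPrefix(ref, "arn:aws:s3:::")
	default:
		return "", "", fmt.Errorf("not an S3 URL or ARN: %s", ref)
	}
	parts := strings.SplitN(rest, "/", 2)
	if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("missing bucket or key: %s", ref)
	}
	return parts[0], parts[1], nil
}

func main() {
	for _, ref := range []string{
		"s3://your-bucket/path/to/object",
		"arn:aws:s3:::your-bucket/path/to/object",
	} {
		b, k, err := parseS3Ref(ref)
		fmt.Println(b, k, err)
	}
}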
