# About
A pretty simple tool to search for text in objects stored in AWS's S3.
# Usage
## Installation
To install you can simply:

```sh
go get github.com/joboscribe/s3grep
```
Then inside the source directory:

```sh
go install
```
Now (assuming your `GOPATH` is in your `PATH`) you should be able to run `s3grep`.
## AWS Credentials
Since this is built on the AWS SDK, it will use credentials in the same order of preference as laid out in the SDK documentation. I've tested it with environment variables and with a credentials file as generated by running `aws configure`.
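For reference, here's a minimal sketch of those two setups, with obviously fake placeholder values:

```sh
# Option 1: environment variables
export AWS_ACCESS_KEY_ID=AKIAEXAMPLEKEY
export AWS_SECRET_ACCESS_KEY=exampleSecretKey

# Option 2: a shared credentials file at ~/.aws/credentials,
# in the format `aws configure` writes:
# [default]
# aws_access_key_id = AKIAEXAMPLEKEY
# aws_secret_access_key = exampleSecretKey
```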
## Options and Arguments
```sh
s3grep [-i] [-e pattern] [-k path] [-n num] [--ignore-case] [--keep=path] [--num-workers=num] [--regexp=pattern] [pattern] [bucket] [key] [region]
```
The only required arguments are:
- `pattern`: a regex that object contents will be matched against (should probably be in quotes)
- `bucket`: the name of the bucket containing the objects to be searched
- `key`: a regex used to find the objects in which the search will be performed (you can avoid shell expansion by putting this in quotes)
- `region`: the AWS region where the bucket is located
The optional arguments and flags are:
- `-i` or `--ignore-case`: performs a case-insensitive match
- `-e [pattern]` or `--regexp [pattern]`: an additional regex that will be matched against; can be used multiple times
- `-k [path]` or `--keep [path]`: objects containing matches will be stored locally at the file path indicated by `[path]`
- `-n [num]` or `--num-workers [num]`: how many S3 operations to perform in parallel (WARNING: on a *nix OS you can pretty easily run out of file descriptors if you set this much higher than 1,000)
## Examples
Let's say you want to search for the string `"super-duper"` in all the objects in the bucket named `my-wonderful-secret-stuff` in the `us-east-1` region:

```sh
s3grep "super-duper" my-wonderful-secret-stuff ".*" us-east-1
```
Then you realize that there are photos and mp3s and various other file formats in that bucket, so you cancel that search and perform it again, this time only in objects with keys ending in `.txt`:

```sh
s3grep "super-duper" my-wonderful-secret-stuff ".*\.txt" us-east-1
```
Then you realize that you wanted to find all occurrences of `"super-duper"` regardless of case:

```sh
s3grep -i "super-duper" my-wonderful-secret-stuff ".*\.txt" us-east-1
```
This takes forever. You remember that you have several thousand files to look through, and doing it 10 at a time means you'll be here all day, so you increase the number of workers to 500:

```sh
s3grep -i -n 500 "super-duper" my-wonderful-secret-stuff ".*\.txt" us-east-1
```
Suddenly you remember that you want not just `"super-duper"` but also `"awesome-possum"`:

```sh
s3grep -i -e "awesome-possum" "super-duper" my-wonderful-secret-stuff ".*\.txt" us-east-1
```
And that's when you think maybe it'd be a good idea to hang onto all those matching objects locally in your `~/literature` directory so you can read them at your leisure:

```sh
s3grep -i -k "~/literature" -e "awesome-possum" "super-duper" my-wonderful-secret-stuff ".*\.txt" us-east-1
```
# Possible FAQs
Q: Why would anyone be searching inside text files stored on S3?

A: Maybe you work at a company that uses S3 to store things like, oh, I don't know, logs or digitized documents, and now you need to find all the files containing certain entries. I've had to do almost exactly that, hence my inspiration to create this tool.
Q: Is using this going to cost me money?

A: Probably. I mean, almost assuredly this is going to cost you something, though you would have to consult the S3 pricing guide to get an estimate of how much. I figure one `GET` request per 1,000 objects in the bucket in order to get a list of all the relevant keys, and then at least one `GET` request per relevant key. The math is left as an exercise for the reader.
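For a rough sketch of the request count (the bucket size and match count below are made up purely for illustration):

```sh
# Hypothetical bucket: 50,000 objects total, 5,000 of which match the key regex.
# One listing request per 1,000 objects, plus at least one GET per matching key:
echo $(( 50000 / 1000 + 5000 ))   # 5050 requests, before any retries
```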
Q: A little presumptuous, don't you think, naming your work after the venerable `grep`?

A: I know, right? Believe me, I don't feel entirely comfortable with it, but I decided to go with that name and then tried to make it look and behave as similarly as possible, even though there are some obvious differences, since `grep` doesn't need to know things like "region" or "bucket".
Q: OMGOSH it's taking forever!

A: Lots of big files, huh? Bummer. In terms of improvements there is certainly some low-hanging fruit, which I plan to address (grab? pick?). In the meantime, I guess it makes for a good excuse for the user to go get something to drink, do a little burst of exercise, read an article or what have you.
Q: What's all this about file descriptors?

A: If you use the `-n` or `--num-workers` option with a high enough number (say 1,024 on Linux) without increasing the default number of file descriptors per process, then `s3grep` will (assuming there are enough objects in the relevant bucket) bump up against that limit, because every HTTP request uses up a file descriptor. That said, if you are looking for snippets of text in millions of files, this might not be the best tool for the job.
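If you do want to run with more workers, one common workaround on Linux and macOS is to raise the soft file-descriptor limit for the current shell session before invoking `s3grep` (the limit value below is just an example):

```sh
# Show the current soft limit on open file descriptors
ulimit -n

# Raise it for this shell session; this is capped by the hard limit,
# and going beyond the hard limit requires root or a system-level change
ulimit -n 4096
```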