# Sequins!!!
Sequins is a dead-simple static database. It indexes and serves SequenceFiles
over HTTP, so it's perfect for serving data created with Hadoop.
## Installing
There are tarballs on the releases page.
There's also a docker image, if you're into that.
## Building
To create a `sequins` binary (you'll need `go` on your path):
```
$ git clone https://github.com/stripe/sequins
$ cd sequins
$ make
```
Or, to install a binary to `$GOPATH/bin`:

```
$ make install
```
## Usage
```
$ sequins -b ':9599' -cr 1m hdfs://namenode:8020/path/to/mydata
```
That tells sequins to load your data from HDFS, check every minute for new
versions, and bind to port 9599 to listen for requests. The URL can point to
HDFS, S3, or just a local path.
Sequins expects your data to be versioned. Inside the top-level directory you
specify, you should have subdirectories, like this:
```
/mydata/
  version0/
    part-00000
    part-00001
    ...
  version1/
    ...
```
The versions can be timestamps, dates, or anything - sequins will automatically
choose whichever version is the greatest, in lexicographical order.
This may seem a little weird, but it works really well for aggregates that you
produce periodically, and it allows sequins to easily hotload new data (see the
corresponding section, below).
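As an illustration (this is a sketch of the rule described above, not sequins'
actual code), picking the served version amounts to taking the lexicographically
greatest directory name. One thing to watch for: bare numbers don't sort the way
you might expect, so zero-padded or timestamp-style names are the safest choice.

```go
package main

import (
	"fmt"
	"sort"
)

// pickVersion mirrors the rule described above: given the subdirectory names
// under the top-level data path, the greatest one in lexicographical order is
// the version that gets served.
func pickVersion(versions []string) string {
	if len(versions) == 0 {
		return ""
	}
	sort.Strings(versions)
	return versions[len(versions)-1]
}

func main() {
	// Timestamp-style names sort the way you'd expect...
	fmt.Println(pickVersion([]string{"1401490544", "1409830778"})) // 1409830778

	// ...but bare numbers may not: "version9" sorts after "version10".
	fmt.Println(pickVersion([]string{"version9", "version10"})) // version9
}
```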
Once sequins has started and built the index, you can get the value for a given
key over HTTP. The body of the response will be the value; if the key doesn't
exist, you'll get a 404 instead. For example:
```
$ http localhost:9599/foo
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 3
Content-Type: text/plain; charset=utf-8
Date: Thu, 04 Sep 2014 11:42:01 GMT
Last-Modified: Thu, 04 Sep 2014 11:39:38 GMT

bar

$ http localhost:9599/baz
HTTP/1.1 404 Not Found
Content-Length: 0
Content-Type: text/plain; charset=utf-8
Date: Thu, 04 Sep 2014 11:42:20 GMT
```
Note the `Last-Modified` header: this corresponds to the last time sequins was
given new data (see 'Hotloading', below). Sequins will also happily
(and correctly) respond to requests with a `Range` header.
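If you're calling sequins from another service, any plain HTTP client will do.
Here's a minimal Go sketch along those lines; the host, port, and key are just
the examples from above, and the error handling is the bare minimum:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// get fetches the value for key from a sequins instance, returning
// (nil, false) if the key doesn't exist. The base URL is an assumption
// taken from the examples above.
func get(base, key string) ([]byte, bool, error) {
	resp, err := http.Get(base + "/" + key)
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		body, err := io.ReadAll(resp.Body)
		return body, true, err
	case http.StatusNotFound:
		return nil, false, nil
	default:
		return nil, false, fmt.Errorf("unexpected status: %s", resp.Status)
	}
}

func main() {
	value, found, err := get("http://localhost:9599", "foo")
	if err != nil {
		log.Fatal(err)
	}
	if !found {
		fmt.Println("key not found")
		return
	}
	fmt.Printf("foo => %s\n", value)
}
```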
## Hotloading
Sequins knows how to hotload new data without dropping any requests. After
you've dropped a new, lexicographically-greater version into your top-level
directory, just send SIGHUP to the running process:
```
kill -HUP <pid>
```
and it'll download the files (if necessary), build an index in the background,
then switch when it's done. If it fails while building the new index for some
reason, it'll continue to serve the current one.
You can also tell sequins to automatically look for new versions with the
`--refresh-period` option.
If you're working with Hadoop output, hotloading might accidentally load a
partial result, because Hadoop creates directories when it starts a job. To
mitigate this, you can pass in `--check-for-success`, which will tell sequins to
only load versions with a `_SUCCESS` file in them (Hadoop creates these files
automatically when it's done running a job).
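To tie those pieces together, here's a rough sketch of the hand-off from a
producer when sequins is running with `--check-for-success`: write the new
version directory, drop the `_SUCCESS` marker once everything is in place, then
HUP the server. The directory and pid here are hypothetical, and a real Hadoop
job writes `_SUCCESS` for you when it finishes.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

// publish drops a _SUCCESS marker into a freshly written version directory and
// then HUPs the running sequins process so it picks the new version up.
func publish(versionDir, pid string) error {
	success := filepath.Join(versionDir, "_SUCCESS")
	f, err := os.Create(success)
	if err != nil {
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	// Equivalent to `kill -HUP <pid>`.
	return exec.Command("kill", "-HUP", pid).Run()
}

func main() {
	// Hypothetical version directory and pid, for illustration only.
	if err := publish("/mydata/version2", "12345"); err != nil {
		log.Fatal(err)
	}
}
```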
## Status
Sending a plain GET request to `/` will make sequins dump out its current
status, like so:
```
$ http localhost:9599/ | python -m json.tool
{
    "count": 3,
    "path": "path/to/stuff/1401490544",
    "started": 1409830778,
    "updated": 1409830778
}
```
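If you want to watch that status from code, the JSON is easy to decode. A small
Go sketch follows; the fields are taken from the example above, and the comments
on them are educated guesses rather than documented behavior:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// status mirrors the JSON shown above. The field comments are guesses,
// not documentation.
type status struct {
	Count   int    `json:"count"`   // number of keys in the current index (presumably)
	Path    string `json:"path"`    // the version currently being served
	Started int64  `json:"started"` // Unix timestamp: when sequins started
	Updated int64  `json:"updated"` // Unix timestamp: when data was last (re)loaded
}

func main() {
	resp, err := http.Get("http://localhost:9599/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var s status
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("serving %s (%d keys), last updated %s\n",
		s.Path, s.Count, time.Unix(s.Updated, 0))
}
```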
## Miscellany, Caveats
- Here's a Scalding sink for generating sequins-compatible SequenceFiles.
It works for anything that can be converted to a JSON value.
- The HDFS code uses this hdfs library, which currently only supports
Hadoop 2.0.0 and up (including CDH5).
- SequenceFiles don't strictly enforce that you have only one value for each
key; if your data has multiple values for a key, sequins will load it without
complaint, but only index one value for the key (probably
nondeterministically).
- Currently, there's no support for compressed SequenceFiles, or for key/value
serializations other than `BytesWritable`.