s4

package module
v0.0.0-...-15eb671
Published: Jun 15, 2024 License: MIT Imports: 11 Imported by: 0

README

s4

why

s3 is awesome, but it can be expensive and slow, and it doesn't expose data local compute or efficient shuffle.

what

an s3 cli compatible storage cluster that is cheap and fast, with data local compute and efficient shuffle.

data local compute maps arbitrary commands over immutable keys in 1:1, n:1 and 1:n operations.

data shuffle is implicit in 1:n mappings.

server placement is based on the hash of basename or a numeric prefix.

key                              method                 placement
s4://bucket/dir/name.txt         int(hash("name.txt"))  ?
s4://bucket/dir/000_bucket0.txt  int("000")             0
s4://bucket/dir/000              int("000")             0
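
the placement rule in the table can be sketched in go as follows. this is a minimal illustration, not the s4 implementation: the fnv hash, the modulo over the server count, and the rule that a numeric prefix is either the whole basename or the digits before the first `_` are all assumptions, and `placement` and `numericPrefix` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"path"
	"strconv"
	"strings"
)

// numericPrefix returns the integer value of the basename's leading field
// when that field is all digits, e.g. "000" or "000_bucket0.txt".
// The exact prefix rule is an assumption made for this sketch.
func numericPrefix(name string) (int, bool) {
	head, _, _ := strings.Cut(name, "_")
	if head == "" {
		return 0, false
	}
	for _, c := range head {
		if c < '0' || c > '9' {
			return 0, false
		}
	}
	n, err := strconv.Atoi(head)
	if err != nil {
		return 0, false
	}
	return n, true
}

// placement picks a server index in [0, numServers) for key.
// numServers must be greater than zero.
func placement(key string, numServers int) int {
	name := path.Base(key)
	if n, ok := numericPrefix(name); ok {
		return n % numServers
	}
	// Assumption: any stable hash of the basename works for illustration.
	h := fnv.New32a()
	h.Write([]byte(name))
	return int(h.Sum32() % uint32(numServers))
}

func main() {
	fmt.Println(placement("s4://bucket/dir/000_bucket0.txt", 3)) // 0
	fmt.Println(placement("s4://bucket/dir/000", 3))             // 0
	fmt.Println(placement("s4://bucket/dir/name.txt", 3))        // depends on the hash
}
```

because only the basename is used, keys with the same name in different directories land on the same server in this sketch.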

keys are strongly consistent and cannot be updated unless first deleted.

when

use this for efficiently processing ephemeral data.

keep durable inputs, outputs, and checkpoints in s3.

how

a ring of servers stores files on disk.

a metadata controller on each server orchestrates out-of-process operations for data transfer and local compute.

a cli client coordinates cluster activity.

non goals

high availability. every key lives on one and only one server.

high durability. data lives on a single disk, and is as durable as that disk.

security. data transfers are checked for integrity, but not encrypted. service access is unauthenticated. secure the network with wireguard if needed.

fine granularity. data should be medium to coarse granularity.

safety for all inputs. service access should be considered to be at the level of root ssh. any user input should be escaped for shell.

cluster resizing. clusters should be short lived and data ephemeral. instead of resizing create a new cluster.

pagination of list results. data layout and partitioning must be considered.

install

go install:

go install github.com/nathants/s4/cmd/s4@latest
go install github.com/nathants/s4/cmd/s4_server@latest
sudo mv -f $(go env GOPATH)/bin/s4 /usr/local/bin/s4
sudo mv -f $(go env GOPATH)/bin/s4_server /usr/local/bin/s4-server

git clone:

git clone https://github.com/nathants/s4
cd s4
git checkout go
make -j
sudo mv -fv bin/s4 bin/s4-server /usr/local/bin/

test

>> tox

automatic deployment

cd s4
name=s4-cluster
bash scripts/new_cluster.sh $name

manual deployment

deploy

ssh $server1 "curl -s https://raw.githubusercontent.com/nathants/s4/go/scripts/install_archlinux.sh | bash"
ssh $server2 "curl -s https://raw.githubusercontent.com/nathants/s4/go/scripts/install_archlinux.sh | bash"

configure

echo $server1:8080 >  ~/.s4.conf
echo $server2:8080 >> ~/.s4.conf
scp ~/.s4.conf $server1:
scp ~/.s4.conf $server2:

start

ssh $server1 s4-server
ssh $server2 s4-server

usage

echo hello world | s4 cp - s4://bucket/data.txt
s4 cp s4://bucket/data.txt -
s4 ls s4://bucket --recursive
s4 --help

examples

structured analysis of nyc taxi data with bsv and hive

adhoc exploration of nyc taxi data with python

bsv - a simple and efficient data format for easily manipulating chunks of rows of columns while minimizing allocations and copies.

optimizing a bsv data processing pipeline

performant batch processing with bsv, s4, and presto

discovering a baseline for data processing performance

refactoring common distributed data patterns into s4

scaling python data processing horizontally

scaling python data processing vertically

api

name           description
s4 rm          delete data from s4
s4 eval        eval a bash cmd with key data as stdin
s4 ls          list keys
s4 cp          copy data to or from s4
s4 map         process data
s4 map-to-n    shuffle data
s4 map-from-n  merge shuffled data
s4 config      list the server addresses
s4 health      health check every server

usage

s4 rm
usage: s4 rm [-h] [-r] prefix

    delete data from s4.

    - recursive to delete directories.


positional arguments:
  prefix           -

optional arguments:
  -h       show this help message and exit
  -r       False

s4 eval
usage: s4 eval [-h] key cmd

    eval a bash cmd with key data as stdin


positional arguments:
  key         -
  cmd         -

optional arguments:
  -h  show this help message and exit

s4 ls
usage: s4 ls [-h] [-r] [prefix]

    list keys


positional arguments:
  prefix           -

optional arguments:
  -h, --help       show this help message and exit
  -r, --recursive  False

s4 cp
usage: s4 cp [-h] [-r] src dst

    copy data to or from s4.

    - paths can be:
      - remote:       "s4://bucket/key.txt"
      - local:        "./dir/key.txt"
      - stdin/stdout: "-"
    - use recursive to copy directories.
    - keys cannot be updated, but can be deleted and recreated.
    - note: to copy from s4, the local machine must be reachable by the cluster, otherwise use `s4 eval`.


positional arguments:
  src              -
  dst              -

optional arguments:
  -h       show this help message and exit
  -r       False

s4 map
usage: s4 map [-h] indir outdir cmd

    process data.

    - map a bash cmd 1:1 over every key in indir putting result in outdir.
    - cmd receives data via stdin and returns data via stdout.
    - every key in indir will create a key with the same name in outdir.
    - indir will be listed recursively to find keys to map.


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit

s4 map-to-n
usage: s4 map-to-n [-h] indir outdir cmd

    shuffle data.

    - map a bash cmd 1:n over every key in indir putting results in outdir.
    - cmd receives data via stdin, writes files to disk, and returns file paths via stdout.
    - every key in indir will create a directory with the same name in outdir.
    - outdir directories contain zero or more files output by cmd.
    - cmd runs in a tempdir which is deleted on completion.


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit

s4 map-from-n
usage: s4 map-from-n [-h] indir outdir cmd

    merge shuffled data.

    - map a bash cmd n:1 over every key in indir putting result in outdir.
    - indir will be listed recursively to find keys to map.
    - cmd receives file paths via stdin and returns data via stdout.
    - each cmd receives all keys with the same name or numeric prefix
    - output name is that name


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit

s4 config
usage: s4 config [-h]

    list the server addresses


optional arguments:
  -h  show this help message and exit

s4 health
usage: s4 health [-h]

    health check every server


optional arguments:
  -h  show this help message and exit

Documentation

Index

Constants

This section is empty.

Variables

var Err409 = errors.New("409")

Functions

func Cp

func Cp(src string, dst string, recursive bool, servers []lib.Server) error

func Eval

func Eval(key string, cmd string, servers []lib.Server) (string, error)

func GetFile

func GetFile(src string, dst string, servers []lib.Server) error

func GetWriter

func GetWriter(src string, dst io.Writer, servers []lib.Server) error

func List

func List(prefix string, recursive bool, servers []lib.Server) ([][]string, error)

func ListBuckets

func ListBuckets(servers []lib.Server) ([][]string, error)

func Map

func Map(indir string, outdir string, cmd string, servers []lib.Server, progress func()) error

func MapFromN

func MapFromN(indir string, outdir string, cmd string, servers []lib.Server, progress func()) error

func MapToN

func MapToN(indir string, outdir string, cmd string, servers []lib.Server, progress func()) error

func PutFile

func PutFile(src string, dst string, servers []lib.Server) error

func PutReader

func PutReader(src io.Reader, dst string, servers []lib.Server) error

func Rm

func Rm(prefix string, recursive bool, servers []lib.Server) error

Types

This section is empty.

Directories

Path Synopsis
cmd
s4
