s4

version v0.0.0-...-15eb671, published Jun 15, 2024, license MIT

Repository

github.com/nathants/s4

README

s4

why

s3 is awesome, but it can be expensive and slow, and it doesn't expose data local compute or efficient shuffle.

what

an s3 cli compatible storage cluster that is cheap and fast, with data local compute and efficient shuffle.

data local compute maps arbitrary commands over immutable keys in 1:1, n:1 and 1:n operations.

data shuffle is implicit in 1:n mappings.

server placement is based on the hash of basename or a numeric prefix.

key	method	placement
s4://bucket/dir/name.txt	int(hash("name.txt"))	?
s4://bucket/dir/000_bucket0.txt	int("000")	0
s4://bucket/dir/000	int("000")	0
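
the placement table above can be sketched in go roughly as follows. this is only an illustration: the function names are hypothetical, fnv-1a stands in for whatever hash s4 actually uses, and taking the result modulo the number of servers is an assumption, as is treating the text before the first "_" as the numeric prefix.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"path"
	"strconv"
	"strings"
)

// numericPrefix reports the integer value of the basename's leading field
// (the text before the first "_") when that field is all digits.
// Splitting on "_" is an assumption based on the examples in the table.
func numericPrefix(base string) (int, bool) {
	field := strings.SplitN(base, "_", 2)[0]
	if field == "" {
		return 0, false
	}
	for _, c := range field {
		if c < '0' || c > '9' {
			return 0, false
		}
	}
	n, err := strconv.Atoi(field)
	if err != nil { // e.g. too many digits to fit in an int
		return 0, false
	}
	return n, true
}

// placement picks a server index in [0, numServers) for a key.
// numServers must be greater than zero.
// FNV-1a and the modulo step are stand-ins, not s4's confirmed behavior.
func placement(key string, numServers int) int {
	base := path.Base(key)
	if n, ok := numericPrefix(base); ok {
		return n % numServers
	}
	h := fnv.New32a()
	h.Write([]byte(base))
	return int(h.Sum32() % uint32(numServers))
}

func main() {
	keys := []string{
		"s4://bucket/dir/name.txt",
		"s4://bucket/dir/000_bucket0.txt",
		"s4://bucket/dir/000",
	}
	for _, key := range keys {
		fmt.Printf("%s -> server %d\n", key, placement(key, 3))
	}
}
```

because only the basename is considered, keys with the same name in different directories land on the same server, which is what lets a later merge step find them together.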

keys are strongly consistent and cannot be updated unless first deleted.

when

use this for efficiently processing ephemeral data.

keep durable inputs, outputs, and checkpoints in s3.

how

a ring of servers stores files on disk.

a metadata controller on each server orchestrates out of process operations for data transfer and local compute.

a cli client coordinates cluster activity.

non goals

high availability. every key lives on one and only one server.

high durability. data lives on a single disk, and is as durable as that disk.

security. data transfers are checked for integrity, but not encrypted. service access is unauthenticated. secure the network with wireguard if needed.

fine granularity. data should be medium to coarse granularity.

safety for all inputs. service access should be considered to be at the level of root ssh. any user input should be escaped for shell.

cluster resizing. clusters should be short lived and data ephemeral. instead of resizing create a new cluster.

pagination of list results. data layout and partitioning must be considered.

install

go install:

go install github.com/nathants/s4/cmd/s4@latest
go install github.com/nathants/s4/cmd/s4_server@latest
sudo mv -f $(go env GOPATH)/bin/s4 /usr/local/bin/s4
sudo mv -f $(go env GOPATH)/bin/s4_server /usr/local/bin/s4-server

git clone:

git clone https://github.com/nathants/s4
cd s4
git checkout go
make -j
sudo mv -fv bin/s4 bin/s4-server /usr/local/bin/

test

>> tox

automatic deployment

cd s4
name=s4-cluster
bash scripts/new_cluster.sh $name

manual deployment

deploy

ssh $server1 "curl -s https://raw.githubusercontent.com/nathants/s4/go/scripts/install_archlinux.sh | bash"
ssh $server2 "curl -s https://raw.githubusercontent.com/nathants/s4/go/scripts/install_archlinux.sh | bash"

configure

echo $server1:8080 >  ~/.s4.conf
echo $server2:8080 >> ~/.s4.conf
scp ~/.s4.conf $server1:
scp ~/.s4.conf $server2:

start

ssh $server1 s4-server
ssh $server2 s4-server

usage

echo hello world | s4 cp - s4://bucket/data.txt
s4 cp s4://bucket/data.txt -
s4 ls s4://bucket --recursive
s4 --help

examples

structured analysis of nyc taxi data with bsv and hive

adhoc exploration of nyc taxi data with python

related projects

bsv - a simple and efficient data format for easily manipulating chunks of rows of columns while minimizing allocations and copies.

related posts

optimizing a bsv data processing pipeline

performant batch processing with bsv, s4, and presto

discovering a baseline for data processing performance

refactoring common distributed data patterns into s4

scaling python data processing horizontally

scaling python data processing vertically

api

name	description
s4 rm	delete data from s4
s4 eval	eval a bash cmd with key data as stdin
s4 ls	list keys
s4 cp	copy data to or from s4
s4 map	process data
s4 map-to-n	shuffle data
s4 map-from-n	merge shuffled data
s4 config	list the server addresses
s4 health	health check every server

usage

s4 rm

usage: s4 rm [-h] [-r] prefix

    delete data from s4.

    - recursive to delete directories.


positional arguments:
  prefix           -

optional arguments:
  -h       show this help message and exit
  -r       False

s4 eval

usage: s4 eval [-h] key cmd

    eval a bash cmd with key data as stdin


positional arguments:
  key         -
  cmd         -

optional arguments:
  -h  show this help message and exit

s4 ls

usage: s4 ls [-h] [-r] [prefix]

    list keys


positional arguments:
  prefix           -

optional arguments:
  -h, --help       show this help message and exit
  -r, --recursive  False

s4 cp

usage: s4 cp [-h] [-r] src dst

    copy data to or from s4.

    - paths can be:
      - remote:       "s4://bucket/key.txt"
      - local:        "./dir/key.txt"
      - stdin/stdout: "-"
    - use recursive to copy directories.
    - keys cannot be updated, but can be deleted and recreated.
    - note: to copy from s4, the local machine must be reachable by the cluster, otherwise use `s4 eval`.


positional arguments:
  src              -
  dst              -

optional arguments:
  -h       show this help message and exit
  -r       False

s4 map

usage: s4 map [-h] indir outdir cmd

    process data.

    - map a bash cmd 1:1 over every key in indir putting result in outdir.
    - cmd receives data via stdin and returns data via stdout.
    - every key in indir will create a key with the same name in outdir.
    - indir will be listed recursively to find keys to map.


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit

s4 map-to-n

usage: s4 map-to-n [-h] indir outdir cmd

    shuffle data.

    - map a bash cmd 1:n over every key in indir putting results in outdir.
    - cmd receives data via stdin, writes files to disk, and returns file paths via stdout.
    - every key in indir will create a directory with the same name in outdir.
    - outdir directories contain zero or more files output by cmd.
    - cmd runs in a tempdir which is deleted on completion.


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit

s4 map-from-n

usage: s4 map-from-n [-h] indir outdir cmd

    merge shuffled data.

    - map a bash cmd n:1 over every key in indir putting result in outdir.
    - indir will be listed recursively to find keys to map.
    - cmd receives file paths via stdin and returns data via stdout.
    - each cmd receives all keys with the same name or numeric prefix
    - output name is that name


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit
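
the grouping rule above (each cmd receives all keys with the same name or numeric prefix) can be sketched in go like this. groupKeys and isDigits are hypothetical helpers, and treating the text before the first "_" as the numeric prefix is an assumption carried over from the placement examples.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// isDigits reports whether s is non-empty and made only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, c := range s {
		if c < '0' || c > '9' {
			return false
		}
	}
	return true
}

// groupKeys groups keys by the numeric prefix of their basename when there
// is one, and by the full basename otherwise. The group name is also the
// name the merged output would take.
func groupKeys(keys []string) map[string][]string {
	groups := make(map[string][]string)
	for _, key := range keys {
		base := path.Base(key)
		name := base
		if prefix := strings.SplitN(base, "_", 2)[0]; isDigits(prefix) {
			name = prefix
		}
		groups[name] = append(groups[name], key)
	}
	return groups
}

func main() {
	// Keys shaped like the output of a 1:n step: one directory per input key.
	keys := []string{
		"s4://bucket/shuffled/a.txt/000",
		"s4://bucket/shuffled/a.txt/001",
		"s4://bucket/shuffled/b.txt/000",
	}
	groups := groupKeys(keys)
	names := make([]string, 0, len(groups))
	for name := range groups {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name, groups[name])
	}
}
```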

s4 config

usage: s4 config [-h]

    list the server addresses


optional arguments:
  -h  show this help message and exit

s4 health

usage: s4 health [-h]

    health check every server


optional arguments:
  -h  show this help message and exit

Documentation

Variables

var Err409 = errors.New("409")

Functions

func Cp

func Cp(src string, dst string, recursive bool, servers []lib.Server) error

func Eval

func Eval(key string, cmd string, servers []lib.Server) (string, error)

func GetFile

func GetFile(src string, dst string, servers []lib.Server) error

func GetWriter

func GetWriter(src string, dst io.Writer, servers []lib.Server) error

func List

func List(prefix string, recursive bool, servers []lib.Server) ([][]string, error)

func ListBuckets

func ListBuckets(servers []lib.Server) ([][]string, error)

func Map

func Map(indir string, outdir string, cmd string, servers []lib.Server, progress func()) error

func MapFromN

func MapFromN(indir string, outdir string, cmd string, servers []lib.Server, progress func()) error

func MapToN

func MapToN(indir string, outdir string, cmd string, servers []lib.Server, progress func()) error

func PutFile

func PutFile(src string, dst string, servers []lib.Server) error

func PutReader

func PutReader(src io.Reader, dst string, servers []lib.Server) error

func Rm

func Rm(prefix string, recursive bool, servers []lib.Server) error

Source Files

s4.go

Directories

cmd/s4
cmd/s4_server
lib
