cedana-cli

command module
v0.2.5
Published: Aug 4, 2023 License: AGPL-3.0 Imports: 1 Imported by: 0

README

cedana-cli

Cedana is a framework for the democratization and (eventually) commodification of compute. We achieve this by leveraging checkpoint/restore to seamlessly migrate work across machines, clouds and beyond.

This repo contains a self serve CLI tool to allow developers to experiment with our system. With it, you can:

  • Launch instances anywhere, with guaranteed price and capacity optimization. We look across your configured providers (AWS, Paperspace, etc.) to select the optimal instance for the requirements defined in a provided job spec. This abstracts away cloud infrastructure burdens.
  • Leverage our client code (installed on every launched instance) to checkpoint/restore across instances, improving reliability, increasing performance and lowering cost.
  • Deploy and manage any kind of job, whether a PyTorch training job, a web service or a multibody physics simulation.
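
To make the instance-selection idea in the first bullet concrete, here is a minimal sketch in Go. It is not Cedana's implementation: the JobSpec and Offer types, the SelectCheapest function, and all instance names and prices are hypothetical and exist only for illustration. It assumes "optimal" means the cheapest available offer that satisfies the spec; the real optimizer may weigh other factors.

```go
package main

import (
	"errors"
	"fmt"
)

// JobSpec is a hypothetical description of a job's requirements.
// It is NOT Cedana's actual job spec format.
type JobSpec struct {
	MinVCPUs     int
	MinMemoryGB  int
	MaxHourlyUSD float64
}

// Offer is a hypothetical instance offer from one configured provider.
type Offer struct {
	Provider  string
	Name      string
	VCPUs     int
	MemoryGB  int
	HourlyUSD float64
	Available bool
}

// ErrNoMatch is returned when no offer satisfies the spec.
var ErrNoMatch = errors.New("no available offer satisfies the job spec")

// SelectCheapest returns the lowest-priced available offer that meets
// the spec's CPU, memory and price limits.
func SelectCheapest(spec JobSpec, offers []Offer) (Offer, error) {
	var best Offer
	found := false
	for _, o := range offers {
		// Skip offers that are unavailable, too small, or too expensive.
		if !o.Available || o.VCPUs < spec.MinVCPUs ||
			o.MemoryGB < spec.MinMemoryGB || o.HourlyUSD > spec.MaxHourlyUSD {
			continue
		}
		if !found || o.HourlyUSD < best.HourlyUSD {
			best, found = o, true
		}
	}
	if !found {
		return Offer{}, ErrNoMatch
	}
	return best, nil
}

func main() {
	spec := JobSpec{MinVCPUs: 4, MinMemoryGB: 16, MaxHourlyUSD: 0.50}

	// Made-up offers; names and prices are placeholders, not real quotes.
	offers := []Offer{
		{"aws", "instance-a", 4, 16, 0.20, true},
		{"paperspace", "instance-b", 8, 32, 0.15, true},
		{"aws", "instance-c", 2, 8, 0.05, true},   // too small
		{"aws", "instance-d", 4, 16, 0.10, false}, // no capacity
	}

	best, err := SelectCheapest(spec, offers)
	if err != nil {
		fmt.Println("selection failed:", err)
		return
	}
	fmt.Printf("%s/%s at $%.2f/h\n", best.Provider, best.Name, best.HourlyUSD)
}
```

In this sketch, capacity is a simple boolean filter and price is the only ranking criterion. A selector that also migrates work between instances would likely need to re-run the choice as prices and availability change.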

To access our managed service, contact founders@cedana.ai.

Usage

Cedana consists of the client code (found here) running on compute in the cloud (or anywhere else) and the orchestration/daemon, which runs on your local machine.

To build from source: go build

To run: ./cedana-cli

If you prefer to install from a package manager, we publish packages to packagecloud and maintain a Homebrew tap. Check out the documentation for instructions.

Documentation

You can view the official documentation here.

Demo

https://www.youtube.com/watch?v=KC4STzSQ_DU

(Note: The video is sped up for brevity. It shows how a CPU-bound PyTorch training job can be migrated between instances in real time.)

Todos

We're working on building out a public roadmap. Until then, here are a few of the highest-priority todos:

  • Add more cloud providers to arbitrage between
  • runc container checkpointing
  • Advanced optimization strategies to pick and migrate work between clouds
  • Way more tests
  • GPU checkpointing
  • Simulation environment for rapid checkpoint/migrate
  • Kubernetes and cluster formation support
  • Batch compute paradigms
  • SLURM feature parity

For checkpoint/restore-specific work, refer to the README in the client code repo.

Contributing

See CONTRIBUTING.md for guidelines.
