Documentation
Constants
const PodLevelJobDir = "/job"
PodLevelJobDir represents the place to store the job state indicator files, as well as the $BREAK_FILE and $EXITCODE_FILE.
const PodLevelLogDir = PodLevelJobDir + "/logs"
PodLevelLogDir represents the place to store the per-learner logs.
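Because PodLevelLogDir is built from PodLevelJobDir, the log directory always sits under the job directory. The following is a minimal sketch of how a per-learner log path could be derived from these constants. The helper learnerLogDir and the "learner-<n>" directory naming are assumptions made for illustration; they are not part of this package.

```go
package main

import (
	"fmt"
	"path"
)

// Local copies of the documented constants so the sketch compiles on its own.
const (
	PodLevelJobDir = "/job"
	PodLevelLogDir = PodLevelJobDir + "/logs"
)

// learnerLogDir is a hypothetical helper. The "learner-<n>" naming is an
// assumption for illustration, not something this package documents.
func learnerLogDir(learnerID int) string {
	return path.Join(PodLevelLogDir, fmt.Sprintf("learner-%d", learnerID))
}

func main() {
	fmt.Println(PodLevelLogDir)   // /job/logs
	fmt.Println(learnerLogDir(1)) // /job/logs/learner-1
}
```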
Variables
var (
	// NativeFrameworks which support native distribution
	NativeFrameworks = []string{"tensorflow", "caffe2", "mxnet", "horovod", "pytorch"}
)
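A caller would typically check a framework name against this list. The sketch below shows one way to do that; isNativeFramework is a hypothetical helper, and the exact-match, case-sensitive comparison is an assumption rather than documented behavior.

```go
package main

import "fmt"

// Local copy of the documented variable so the sketch compiles on its own.
var NativeFrameworks = []string{"tensorflow", "caffe2", "mxnet", "horovod", "pytorch"}

// isNativeFramework is a hypothetical helper, not part of the package.
// It assumes names are compared exactly (case-sensitive).
func isNativeFramework(name string) bool {
	for _, f := range NativeFrameworks {
		if f == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isNativeFramework("pytorch")) // true
	fmt.Println(isNativeFramework("keras"))   // false
}
```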
Functions
func GetVolumeClaim
func GetVolumeClaim(volumeSize int64) (*v1core.PersistentVolumeClaim, error)
GetVolumeClaim returns a PersistentVolumeClaim struct for the given volume size (specified in bytes).
Types
type Service
type Service interface {
	service.LifecycleManagerServer
	service.LifecycleHandler
	StopLCM()
}
Service is the LCM (lifecycle manager) service. It manages the lifecycle of an entire distributed deep learning job.
type Training
type Training interface {
	Start() error
}
Training represents a training job that can be started with Start.
func NewTraining
func NewTraining(ctx context.Context, k8sClient kubernetes.Interface, req *service.JobDeploymentRequest, log *logger.LocLoggingEntry) Training
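NewTraining returns a Training for the given job deployment request, using the supplied context, Kubernetes client, and logger.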