Documentation ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type JobCondition ¶
type JobCondition struct {
	// Type of job condition.
	Type JobConditionType `json:"type"`
	// Status of the condition, one of True, False, Unknown.
	Status corev1.ConditionStatus `json:"status"`
	// The reason for the condition's last transition.
	Reason string `json:"reason,omitempty"`
	// A human readable message indicating details about the transition.
	Message string `json:"message,omitempty"`
	// The last time this condition was updated.
	LastUpdateTime metav1.Time `json:"lastUpdateTime,omitempty"`
	// Last time the condition transitioned from one status to another.
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
}
JobCondition describes the state of the job at a certain point.
type JobConditionType ¶
type JobConditionType string
JobConditionType defines the condition types that can appear in a JobStatus.
const (
	// JobCreated means the job has been accepted by the system,
	// but one or more of the pods/services has not been started.
	// This includes the time before pods are scheduled and launched.
	JobCreated JobConditionType = "Created"

	// JobRunning means all sub-resources (e.g. services/pods) of this job
	// have been successfully scheduled and launched.
	// The training is running without error.
	JobRunning JobConditionType = "Running"

	// JobRestarting means one or more sub-resources (e.g. services/pods) of this job
	// reached phase failed but may be restarted according to its restart policy,
	// which is specified by the user in v1.PodTemplateSpec.
	// The training is freezing/pending.
	JobRestarting JobConditionType = "Restarting"

	// JobSucceeded means all sub-resources (e.g. services/pods) of this job
	// have terminated successfully.
	// The training is complete without error.
	JobSucceeded JobConditionType = "Succeeded"

	// JobFailed means one or more sub-resources (e.g. services/pods) of this job
	// reached phase failed with no restarting.
	// The training has failed its execution.
	JobFailed JobConditionType = "Failed"
)
type JobStatus ¶
type JobStatus struct {
	// Conditions is an array of current observed job conditions.
	Conditions []JobCondition `json:"conditions"`

	// ReplicaStatuses is a map from ReplicaType to ReplicaStatus;
	// it specifies the status of each replica.
	ReplicaStatuses map[ReplicaType]*ReplicaStatus `json:"replicaStatuses"`

	// Represents time when the job was acknowledged by the job controller.
	// It is not guaranteed to be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// Represents time when the job was completed. It is not guaranteed to
	// be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`

	// Represents last time when the job was reconciled. It is not guaranteed to
	// be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	LastReconcileTime *metav1.Time `json:"lastReconcileTime,omitempty"`
}
JobStatus represents the current observed state of the training Job.
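A caller typically inspects Conditions to decide whether a job has reached a terminal state. The following is a minimal sketch using trimmed local copies of the types; hasCondition and isFinished are hypothetical helpers, and treating Succeeded and Failed as the only terminal condition types is an assumption drawn from the constant descriptions.

```go
package main

import "fmt"

type JobConditionType string

const (
	JobCreated    JobConditionType = "Created"
	JobRunning    JobConditionType = "Running"
	JobRestarting JobConditionType = "Restarting"
	JobSucceeded  JobConditionType = "Succeeded"
	JobFailed     JobConditionType = "Failed"
)

// Trimmed stand-ins: only the fields this sketch needs.
type JobCondition struct {
	Type   JobConditionType
	Status string // "True", "False", or "Unknown"
}

type JobStatus struct {
	Conditions []JobCondition
}

// hasCondition (hypothetical) reports whether a condition of type t is
// present with status "True".
func hasCondition(status JobStatus, t JobConditionType) bool {
	for _, c := range status.Conditions {
		if c.Type == t && c.Status == "True" {
			return true
		}
	}
	return false
}

// isFinished (hypothetical) treats Succeeded and Failed as terminal.
func isFinished(status JobStatus) bool {
	return hasCondition(status, JobSucceeded) || hasCondition(status, JobFailed)
}

func main() {
	status := JobStatus{Conditions: []JobCondition{
		{Type: JobCreated, Status: "True"},
		{Type: JobRunning, Status: "False"},
		{Type: JobSucceeded, Status: "True"},
	}}
	fmt.Println(isFinished(status)) // true
}
```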
type MPIJobSpec ¶
type MPIJobSpec struct {
	SlotsPerWorker  *int32                                                  `json:"slotsPerWorker,omitempty"`
	RunPolicy       RunPolicy                                               `json:"runPolicy,omitempty"`
	MainContainer   string                                                  `json:"mainContainer,omitempty"`
	MPIReplicaSpecs map[operationv1.MPIReplicaType]*operationv1.KFReplicaSpec `json:"mpiReplicaSpecs"`
}
MPIJobSpec resource definition.
type MXJobSpec ¶
type MXJobSpec struct {
	RunPolicy      RunPolicy                                              `json:"runPolicy,omitempty"`
	JobMode        operationv1.MXJobModeType                              `json:"jobMode,omitempty"`
	MXReplicaSpecs map[operationv1.MXReplicaType]*operationv1.KFReplicaSpec `json:"mxReplicaSpecs"`
}
MXJobSpec is a desired state description of the MXNetJob.
type PaddleJobSpec ¶
type PaddleJobSpec struct {
	RunPolicy          RunPolicy                                                  `json:"runPolicy,omitempty"`
	ElasticPolicy      *operationv1.PaddleElasticPolicy                           `json:"elasticPolicy,omitempty"`
	PaddleReplicaSpecs map[operationv1.PaddleReplicaType]*operationv1.KFReplicaSpec `json:"paddleReplicaSpecs"`
}
PaddleJobSpec is a desired state description of the PaddleJob.
type PyTorchJobSpec ¶
type PyTorchJobSpec struct {
	RunPolicy           RunPolicy                                                   `json:"runPolicy,omitempty"`
	ElasticPolicy       *operationv1.PytorchElasticPolicy                           `json:"elasticPolicy,omitempty"`
	NprocPerNode        *string                                                     `json:"nprocPerNode,omitempty"`
	PyTorchReplicaSpecs map[operationv1.PyTorchReplicaType]*operationv1.KFReplicaSpec `json:"pytorchReplicaSpecs"`
}
PyTorchJobSpec is a desired state description of the PyTorchJob.
type ReplicaStatus ¶
type ReplicaStatus struct {
	// The number of actively running pods.
	Active int32 `json:"active,omitempty"`
	// The number of pods which reached phase Succeeded.
	Succeeded int32 `json:"succeeded,omitempty"`
	// The number of pods which reached phase Failed.
	Failed int32 `json:"failed,omitempty"`
}
ReplicaStatus represents the current observed state of the replica.
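Because JobStatus keys ReplicaStatus values by ReplicaType, a job-wide pod count has to be summed across the map. The sketch below uses local copies of the two types; totals is a hypothetical helper, and the "Worker" and "Master" keys are illustrative only. Since the map holds pointers, the helper skips nil entries.

```go
package main

import "fmt"

// Local copies of the documented types.
type ReplicaType string

type ReplicaStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

// totals (hypothetical) sums the per-replica counters into one ReplicaStatus.
func totals(statuses map[ReplicaType]*ReplicaStatus) ReplicaStatus {
	var sum ReplicaStatus
	for _, rs := range statuses {
		if rs == nil {
			continue // pointer values may be unset
		}
		sum.Active += rs.Active
		sum.Succeeded += rs.Succeeded
		sum.Failed += rs.Failed
	}
	return sum
}

func main() {
	statuses := map[ReplicaType]*ReplicaStatus{
		"Worker": {Active: 2, Succeeded: 1},
		"Master": {Active: 1},
		"Spare":  nil,
	}
	fmt.Printf("%+v\n", totals(statuses)) // {Active:3 Succeeded:1 Failed:0}
}
```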
type ReplicaType ¶
type ReplicaType string
ReplicaType represents the type of the replica. Each operator needs to define its own set of ReplicaTypes.
type RunPolicy ¶
type RunPolicy struct {
	// CleanPodPolicy defines the policy to kill pods after the job completes.
	// Defaults to Running.
	CleanPodPolicy *operationv1.CleanPodPolicy `json:"cleanPodPolicy,omitempty"`

	// TTLSecondsAfterFinished is the TTL to clean up jobs.
	// It may take extra ReconcilePeriod seconds for the cleanup, since
	// reconcile gets called periodically.
	// Defaults to infinite.
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`

	// Specifies the duration in seconds relative to the startTime that the job may be active
	// before the system tries to terminate it; value must be a positive integer.
	// +optional
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`

	// Optional number of retries before marking this job failed.
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`

	// SchedulingPolicy defines the policy related to scheduling, e.g. gang-scheduling
	// +optional
	SchedulingPolicy *operationv1.SchedulingPolicy `json:"schedulingPolicy,omitempty"`
}
type TFJobSpec ¶
type TFJobSpec struct {
	RunPolicy           RunPolicy                                              `json:"runPolicy,omitempty"`
	EnableDynamicWorker bool                                                   `json:"enableDynamicWorker,omitempty"`
	SuccessPolicy       *operationv1.TFSuccessPolicy                           `json:"successPolicy,omitempty"`
	TFReplicaSpecs      map[operationv1.TFReplicaType]*operationv1.KFReplicaSpec `json:"tfReplicaSpecs"`
}
TFJobSpec is a desired state description of the TFJob.
type XGBoostJobSpec ¶
type XGBoostJobSpec struct {
	RunPolicy       RunPolicy                                               `json:"runPolicy,omitempty"`
	XGBReplicaSpecs map[operationv1.XGBReplicaType]*operationv1.KFReplicaSpec `json:"xgbReplicaSpecs"`
}
XGBoostJobSpec is a desired state description of the XGBoostJob.