Documentation ¶
Overview ¶
Package trainer manages TensorFlow training jobs.
Index ¶
- Constants
- type ClusterSpec
- type KubernetesLabels
- type TFConfig
- type TFReplicaSet
- func (s *TFReplicaSet) CreatePodWithIndex(index int32) (*v1.Pod, error)
- func (s *TFReplicaSet) CreateServiceWithIndex(index int32) (*v1.Service, error)
- func (s *TFReplicaSet) Delete() error
- func (s *TFReplicaSet) GetSingleReplicaStatus(index int32) tfv1alpha1.ReplicaState
- func (s *TFReplicaSet) GetStatus() (tfv1alpha1.TFReplicaStatus, error)
- func (s *TFReplicaSet) Labels() KubernetesLabels
- func (s *TFReplicaSet) LabelsByIndex(index int32) KubernetesLabels
- func (s *TFReplicaSet) SyncPods() error
- func (s *TFReplicaSet) SyncServices() error
- type TFReplicaSetInterface
- type TaskSpec
- type TrainingJob
- func (j *TrainingJob) ClusterSpec() ClusterSpec
- func (j *TrainingJob) CreatePdb(nrReplicas int32) (*v1beta1.PodDisruptionBudget, error)
- func (j *TrainingJob) Delete()
- func (j *TrainingJob) GetStatus() (tfv1alpha1.State, []*tfv1alpha1.TFReplicaStatus, error)
- func (j *TrainingJob) Reconcile(config *tfv1alpha1.ControllerConfig, enableGangScheduling bool) error
- func (j *TrainingJob) SchedulerName() string
- func (j *TrainingJob) UID() types.UID
- func (j *TrainingJob) Update(newJob *tfv1alpha1.TFJob)
Constants ¶
const (
	SuccessfulCreateReason = "SuccessfulCreate"
	FailedCreateReason     = "FailedCreate"
)
Types ¶
type ClusterSpec ¶
ClusterSpec represents a TensorFlow cluster specification. It is a map from job names to network addresses. See https://www.tensorflow.org/deploy/distributed#create_a_tftrainclusterspec_to_describe_the_cluster
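As a rough illustration of that map shape, the sketch below builds a cluster specification for a made-up job. The local `clusterSpec` type, the job names, and the host names are assumptions for illustration only; the package's actual type definition is not shown on this page.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterSpec is a hypothetical stand-in for the package's ClusterSpec:
// a map from TensorFlow job name to the network addresses of its tasks.
type clusterSpec map[string][]string

// jobNames returns the job names in sorted order so iteration is deterministic.
func jobNames(c clusterSpec) []string {
	names := make([]string, 0, len(c))
	for name := range c {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Host names and ports below are invented for the example.
	spec := clusterSpec{
		"ps":     {"myjob-ps-0:2222"},
		"worker": {"myjob-worker-0:2222", "myjob-worker-1:2222"},
	}
	for _, name := range jobNames(spec) {
		fmt.Println(name, spec[name])
	}
}
```

Sorting the job names is only there to make the printed order stable, since Go map iteration order is unspecified.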
type KubernetesLabels ¶
KubernetesLabels represents a set of labels to apply to a Kubernetes resource.
func (KubernetesLabels) ToSelector ¶
func (l KubernetesLabels) ToSelector() (string, error)
ToSelector converts the labels to a selector matching the labels.
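One plausible way to perform such a conversion is to render the labels as a Kubernetes equality-based selector string (`key=value` pairs joined by commas). The sketch below is an assumption about the general technique, not the package's implementation: the `labels` type, the `toSelector` helper, and the label keys are hypothetical, and unlike the real method it does not return an error.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labels is a hypothetical stand-in for KubernetesLabels.
type labels map[string]string

// toSelector renders the labels as comma-separated key=value pairs,
// sorted by key so the result is deterministic.
func toSelector(l labels) string {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+l[k])
	}
	return strings.Join(pairs, ",")
}

func main() {
	// Label keys and values are invented for the example.
	fmt.Println(toSelector(labels{"job_type": "worker", "runtime_id": "abcd"}))
	// prints "job_type=worker,runtime_id=abcd"
}
```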
type TFConfig ¶
type TFConfig struct {
	// Cluster represents a TensorFlow ClusterSpec.
	// See: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
	Cluster ClusterSpec `json:"cluster"`
	Task    TaskSpec    `json:"task"`
	// Environment is used by tensorflow.contrib.learn.python.learn in versions <= 1.3
	// TODO(jlewi): I don't think it is used in TF versions >= 1.4, so we can eventually get rid of it.
	Environment string `json:"environment"`
}
TFConfig is a struct representing the TensorFlow config. This struct is serialized into an environment variable, which TensorFlow processes read to configure themselves.
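To make the serialization step concrete, here is a minimal sketch that marshals a config of this shape to JSON, as it might be placed in TensorFlow's `TF_CONFIG` environment variable. The local types mirror the JSON tags shown above, but the `taskSpec` fields, the `tfConfigEnv` helper, and the sample values are assumptions; this page does not show how the package itself performs the conversion.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterSpec and taskSpec are hypothetical stand-ins; their shapes are assumed.
type clusterSpec map[string][]string

type taskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

// tfConfig mirrors the JSON tags of the TFConfig struct.
type tfConfig struct {
	Cluster     clusterSpec `json:"cluster"`
	Task        taskSpec    `json:"task"`
	Environment string      `json:"environment"`
}

// tfConfigEnv serializes the config to the JSON string that would be
// used as the value of an environment variable.
func tfConfigEnv(c tfConfig) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	cfg := tfConfig{
		Cluster:     clusterSpec{"worker": {"w0:2222"}},
		Task:        taskSpec{Type: "worker", Index: 0},
		Environment: "cloud",
	}
	value, err := tfConfigEnv(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println("TF_CONFIG=" + value)
}
```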
type TFReplicaSet ¶
type TFReplicaSet struct {
	ClientSet kubernetes.Interface
	// Job is a pointer to the TrainingJob to which this replica belongs.
	Job  *TrainingJob
	Spec tfv1alpha1.TFReplicaSpec
	// contains filtered or unexported fields
}
TFReplicaSet is a set of TF processes all acting in the same role (e.g. worker).
func NewTFReplicaSet ¶
func NewTFReplicaSet(clientSet kubernetes.Interface, recorder record.EventRecorder, tfReplicaSpec tfv1alpha1.TFReplicaSpec, job *TrainingJob) (*TFReplicaSet, error)
NewTFReplicaSet returns a TFReplicaSet object for an existing replica.
func (*TFReplicaSet) CreatePodWithIndex ¶
func (s *TFReplicaSet) CreatePodWithIndex(index int32) (*v1.Pod, error)
CreatePodWithIndex creates a new pod with the specified index.
func (*TFReplicaSet) CreateServiceWithIndex ¶
func (s *TFReplicaSet) CreateServiceWithIndex(index int32) (*v1.Service, error)
CreateServiceWithIndex creates a new service with the specified index.
func (*TFReplicaSet) GetSingleReplicaStatus ¶
func (s *TFReplicaSet) GetSingleReplicaStatus(index int32) tfv1alpha1.ReplicaState
GetSingleReplicaStatus returns the status of a single replica.
func (*TFReplicaSet) GetStatus ¶
func (s *TFReplicaSet) GetStatus() (tfv1alpha1.TFReplicaStatus, error)
GetStatus returns the status of the replica set.
func (*TFReplicaSet) Labels ¶
func (s *TFReplicaSet) Labels() KubernetesLabels
Labels returns the labels for this replica set.
func (*TFReplicaSet) LabelsByIndex ¶
func (s *TFReplicaSet) LabelsByIndex(index int32) KubernetesLabels
LabelsByIndex returns the labels for a pod in this replica set.
func (*TFReplicaSet) SyncPods ¶
func (s *TFReplicaSet) SyncPods() error
SyncPods checks the current pods for this TFReplicaSet and tries to bring them to the desired state.
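A sync pass of this kind typically compares the desired replica indexes with those that already exist and creates whatever is missing. The sketch below shows only that comparison step under stated assumptions: `missingIndexes` and the `existing` map are hypothetical, and the package's actual logic (for example, how it handles failed or terminating pods) is not shown on this page.

```go
package main

import "fmt"

// missingIndexes returns the replica indexes in [0, replicas) for which
// no pod currently exists, in ascending order.
func missingIndexes(replicas int32, existing map[int32]bool) []int32 {
	var missing []int32
	for i := int32(0); i < replicas; i++ {
		if !existing[i] {
			missing = append(missing, i)
		}
	}
	return missing
}

func main() {
	// Suppose 4 replicas are desired and pods 0 and 2 already exist.
	existing := map[int32]bool{0: true, 2: true}
	for _, idx := range missingIndexes(4, existing) {
		// A real controller would call something like CreatePodWithIndex(idx) here.
		fmt.Println("would create pod with index", idx)
	}
}
```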
func (*TFReplicaSet) SyncServices ¶
func (s *TFReplicaSet) SyncServices() error
SyncServices checks the current services for this TFReplicaSet and tries to bring them to the desired state.
type TFReplicaSetInterface ¶
type TFReplicaSetInterface interface {
	Create() error
	Delete() error
	GetStatus() (tfv1alpha1.TFReplicaStatus, error)
}
TFReplicaSetInterface is an interface for managing a set of replicas.
type TrainingJob ¶
type TrainingJob struct {
	KubeCli  kubernetes.Interface
	Replicas []*TFReplicaSet
	// contains filtered or unexported fields
}
TrainingJob represents a training job specification. TODO(jlewi): We should switch to a New pattern and make trainingJob private so we can ensure correctness on creation.
func NewJob ¶
func NewJob(kubeCli kubernetes.Interface, tfJobClient tfjobclient.Interface, recorder record.EventRecorder, job *tfv1alpha1.TFJob, config *tfv1alpha1.ControllerConfig) (*TrainingJob, error)
NewJob invokes initJob and checks for errors.
func (*TrainingJob) ClusterSpec ¶
func (j *TrainingJob) ClusterSpec() ClusterSpec
ClusterSpec returns the cluster specification for the training job.
func (*TrainingJob) CreatePdb ¶ added in v0.3.0
func (j *TrainingJob) CreatePdb(nrReplicas int32) (*v1beta1.PodDisruptionBudget, error)
func (*TrainingJob) Delete ¶
func (j *TrainingJob) Delete()
Delete deletes the pods and frees the allocated resources.
func (*TrainingJob) GetStatus ¶
func (j *TrainingJob) GetStatus() (tfv1alpha1.State, []*tfv1alpha1.TFReplicaStatus, error)
GetStatus returns the status of the training job.
func (*TrainingJob) Reconcile ¶
func (j *TrainingJob) Reconcile(config *tfv1alpha1.ControllerConfig, enableGangScheduling bool) error
Reconcile tries to get the job into the desired state.
func (*TrainingJob) SchedulerName ¶
func (j *TrainingJob) SchedulerName() string
SchedulerName returns the scheduler name for the job.
func (*TrainingJob) UID ¶
func (j *TrainingJob) UID() types.UID
UID returns the unique identifier (types.UID) of the job object, not a user ID.
func (*TrainingJob) Update ¶
func (j *TrainingJob) Update(newJob *tfv1alpha1.TFJob)
Update replaces the TFJob corresponding to the TrainingJob with the provided job. This function is used when the Spec/Status of the job is modified outside the controller, for example if the user issues a delete request. Such a request updates the metadata on the object, so we need to replace the spec.