trainer

package
v0.2.0-rc1
Warning

This package is not in the latest version of its module.

Published: Jun 21, 2018 License: Apache-2.0 Imports: 23 Imported by: 0

Documentation

Overview

Package trainer manages TensorFlow training jobs.

Index

Constants

const (
	SuccessfulCreateReason = "SuccessfulCreate"
	FailedCreateReason     = "FailedCreate"
)

Variables

This section is empty.

Functions

This section is empty.

Types

type ClusterSpec

type ClusterSpec map[string][]string

ClusterSpec represents a TensorFlow cluster specification. It is a map from job names to network addresses. See https://www.tensorflow.org/deploy/distributed#create_a_tftrainclusterspec_to_describe_the_cluster

type KubernetesLabels

type KubernetesLabels map[string]string

KubernetesLabels represents a set of labels to apply to a Kubernetes resource.

func (KubernetesLabels) ToSelector

func (l KubernetesLabels) ToSelector() (string, error)

ToSelector converts the labels to a selector matching the labels.
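The package's implementation of ToSelector is not shown on this page. One plausible way to build such a selector, sketched here under that caveat, is to join the labels as comma-separated key=value pairs, which is the equality-based selector syntax Kubernetes accepts. The function toSelector and the label keys below are hypothetical, and unlike the real method this sketch performs no validation and returns no error.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// KubernetesLabels is redeclared here so the example is self-contained.
type KubernetesLabels map[string]string

// toSelector is a hypothetical stand-in for ToSelector. It renders the
// labels as "k1=v1,k2=v2", sorting the keys so the result is deterministic.
func toSelector(l KubernetesLabels) string {
	keys := make([]string, 0, len(l))
	for k := range l {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+"="+l[k])
	}
	return strings.Join(pairs, ",")
}

func main() {
	// Illustrative label keys and values, not the ones the package uses.
	labels := KubernetesLabels{"role": "worker", "app": "trainer"}
	fmt.Println(toSelector(labels))
}
```

A string of this form can be passed as a label selector when listing pods or services, so that only resources carrying every listed label are returned.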

type TFConfig

type TFConfig struct {
	// Cluster represents a TensorFlow ClusterSpec.
	// See: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
	Cluster ClusterSpec `json:"cluster"`
	Task    TaskSpec    `json:"task"`
	// Environment is used by tensorflow.contrib.learn.python.learn in versions <= 1.3.
	// TODO(jlewi): I don't think it is used in TF versions >= 1.4, so we can eventually get rid of it.
	Environment string `json:"environment"`
}

TFConfig is a struct representing the TensorFlow config. This struct is turned into an environment variable, which TensorFlow processes use to configure themselves.
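A minimal sketch of how a TFConfig value might be serialized, using the JSON tags from the struct above. The types are redeclared locally; the helper encodeTFConfig, the addresses, and the "cloud" value are illustrative assumptions. TF_CONFIG is the variable name TensorFlow's distributed APIs conventionally read, but this page does not state which variable the package sets, so treat that name as an assumption too.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types, so the example is self-contained.
type ClusterSpec map[string][]string

type TaskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type TFConfig struct {
	Cluster     ClusterSpec `json:"cluster"`
	Task        TaskSpec    `json:"task"`
	Environment string      `json:"environment"`
}

// encodeTFConfig is a hypothetical helper that renders the config as the
// JSON string a TensorFlow process would read from its environment.
func encodeTFConfig(c TFConfig) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	cfg := TFConfig{
		Cluster:     ClusterSpec{"worker": {"myjob-worker-0:2222"}},
		Task:        TaskSpec{Type: "worker", Index: 0},
		Environment: "cloud", // illustrative value
	}
	s, err := encodeTFConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Assumed variable name; see the note above.
	fmt.Printf("TF_CONFIG=%s\n", s)
}
```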

type TFReplicaSet

type TFReplicaSet struct {
	ClientSet kubernetes.Interface

	// Job is a pointer to the TrainingJob to which this replica belongs.
	Job  *TrainingJob
	Spec tfv1alpha1.TFReplicaSpec
	// contains filtered or unexported fields
}

TFReplicaSet is a set of TF processes all acting in the same role (e.g. worker).

func NewTFReplicaSet

func NewTFReplicaSet(clientSet kubernetes.Interface, recorder record.EventRecorder, tfReplicaSpec tfv1alpha1.TFReplicaSpec, job *TrainingJob) (*TFReplicaSet, error)

NewTFReplicaSet returns a TFReplicaSet object for an existing replica.

func (*TFReplicaSet) CreatePodWithIndex

func (s *TFReplicaSet) CreatePodWithIndex(index int32) (*v1.Pod, error)

CreatePodWithIndex creates a new pod with the specified index.

func (*TFReplicaSet) CreateServiceWithIndex

func (s *TFReplicaSet) CreateServiceWithIndex(index int32) (*v1.Service, error)

CreateServiceWithIndex creates a new service with the specified index.

func (*TFReplicaSet) Delete

func (s *TFReplicaSet) Delete() error

Delete deletes the replicas.

func (*TFReplicaSet) GetSingleReplicaStatus

func (s *TFReplicaSet) GetSingleReplicaStatus(index int32) tfv1alpha1.ReplicaState

GetSingleReplicaStatus returns the status of a single replica.

func (*TFReplicaSet) GetStatus

func (s *TFReplicaSet) GetStatus() (tfv1alpha1.TFReplicaStatus, error)

GetStatus returns the status of the replica set.

func (*TFReplicaSet) Labels

func (s *TFReplicaSet) Labels() KubernetesLabels

Labels returns the labels for this replica set.

func (*TFReplicaSet) LabelsByIndex

func (s *TFReplicaSet) LabelsByIndex(index int32) KubernetesLabels

LabelsByIndex returns the labels for a pod in this replica set.

func (*TFReplicaSet) SyncPods

func (s *TFReplicaSet) SyncPods() error

SyncPods checks the current pods for this TFReplicaSet and tries to bring them to the desired state.

func (*TFReplicaSet) SyncServices

func (s *TFReplicaSet) SyncServices() error

SyncServices checks the current services for this TFReplicaSet and tries to bring them to the desired state.

type TFReplicaSetInterface

type TFReplicaSetInterface interface {
	Create() error
	Delete() error
	GetStatus() (tfv1alpha1.TFReplicaStatus, error)
}

TFReplicaSetInterface is an interface for managing a set of replicas.

type TaskSpec

type TaskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

TaskSpec represents a Task specification.

type TrainingJob

type TrainingJob struct {
	KubeCli kubernetes.Interface

	Replicas []*TFReplicaSet
	// contains filtered or unexported fields
}

TrainingJob represents a training job specification. TODO(jlewi): We should switch to a New pattern and make trainingJob private so we can ensure correctness on creation.

func NewJob

NewJob invokes initJob and checks for errors.

func (*TrainingJob) ClusterSpec

func (j *TrainingJob) ClusterSpec() ClusterSpec

ClusterSpec returns the cluster specification for the training job.

func (*TrainingJob) CreatePdb added in v0.3.0

func (j *TrainingJob) CreatePdb(nrReplicas int32) (*v1beta1.PodDisruptionBudget, error)

func (*TrainingJob) Delete

func (j *TrainingJob) Delete()

Delete deletes the pods and frees the allocated resources.

func (*TrainingJob) GetStatus

GetStatus returns the status of the training job.

func (*TrainingJob) Reconcile

func (j *TrainingJob) Reconcile(config *tfv1alpha1.ControllerConfig, enableGangScheduling bool) error

Reconcile tries to get the job into the desired state.

func (*TrainingJob) SchedulerName

func (j *TrainingJob) SchedulerName() string

SchedulerName returns the scheduler name for the job.

func (*TrainingJob) UID

func (j *TrainingJob) UID() types.UID

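UID returns the UID of the training job. Note that the return type, types.UID, is the Kubernetes unique object identifier type rather than a user ID.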

func (*TrainingJob) Update

func (j *TrainingJob) Update(newJob *tfv1alpha1.TFJob)

Update replaces the TFJob corresponding to TrainingJob with the provided job. This function is used when the Spec/Status of the job is modified outside the controller. For example, when the user issues a delete request, the metadata on the object is updated, so the spec needs to be replaced.
