Documentation ¶
Index ¶
- Constants
- func InsertTfConfigToEnv(container *corev1.Container, cluster SparseClusterSpec, ...)
- type ClusterSpec
- type SparseClusterSpec
- type SparseTFConfig
- type TFConfig
- type TaskManager
- func (m *TaskManager) HandleFaultPods(client runtime_client.Client, job *elasticv1alpha1.ElasticJob) error
- func (m *TaskManager) ReconcilePods(client runtime_client.Client, job *elasticv1alpha1.ElasticJob, ...) error
- func (m *TaskManager) StopRunningPods(client runtime_client.Client, job *elasticv1alpha1.ElasticJob) error
- func (m *TaskManager) SyncJobState(client runtime_client.Client, job *elasticv1alpha1.ElasticJob) error
- type TaskSpec
Constants ¶
const (
	// ReplicaTypeWorker is the type for training worker replica.
	ReplicaTypeWorker commonv1.ReplicaType = "worker"

	// ReplicaTypePS is the type for training parameter server replica.
	ReplicaTypePS commonv1.ReplicaType = "ps"

	// ReplicaTypeChief is the type for training chief replica of TensorFlow PS.
	ReplicaTypeChief commonv1.ReplicaType = "chief"

	// ReplicaTypeEvaluator is the type for evaluator replica.
	ReplicaTypeEvaluator commonv1.ReplicaType = "evaluator"

	// LabelRestartCount is the count to relaunch failed nodes.
	LabelRestartCount = "restart-count"

	// EnvTfConfigName is the environment variable name of TensorFlow cluster spec.
	EnvTfConfigName = "TF_CONFIG"

	// PSServicePort is the port of service.
	PSServicePort int = 2222

	// WorkerServicePort is the port of service.
	WorkerServicePort int = 3333
)
Variables ¶
This section is empty.
Functions ¶
func InsertTfConfigToEnv ¶
func InsertTfConfigToEnv(
	container *corev1.Container,
	cluster SparseClusterSpec,
	taskType commonv1.ReplicaType,
	rankIndex int,
)
InsertTfConfigToEnv inserts the TFConfig into the environment variables of the given container.
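As an illustration only, the sketch below shows one way such an insertion could work: serialize a sparse config to JSON and append it to a container's environment under the TF_CONFIG name. It is not the package's implementation. `Container` and `EnvVar` are simplified stand-ins for `corev1.Container` and `corev1.EnvVar`, the task type is a plain string, `insertTfConfig` is a hypothetical helper that (unlike the documented function) returns an error, and the host names are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EnvVar and Container are simplified stand-ins for the corev1 types
// (assumption: only the fields needed for this sketch are modelled).
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

// Local copies of the documented config types, with a plain string task type.
type TaskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type SparseClusterSpec struct {
	Worker    map[int]string `json:"worker,omitempty"`
	PS        []string       `json:"ps"`
	Chief     map[int]string `json:"chief"`
	Evaluator map[int]string `json:"evaluator,omitempty"`
}

type SparseTFConfig struct {
	Cluster SparseClusterSpec `json:"sparseCluster"`
	Task    TaskSpec          `json:"task"`
}

// envTfConfigName matches the documented EnvTfConfigName constant.
const envTfConfigName = "TF_CONFIG"

// insertTfConfig is a hypothetical helper: it marshals the config to JSON
// and appends it to the container's environment variables.
func insertTfConfig(c *Container, cluster SparseClusterSpec, taskType string, rankIndex int) error {
	cfg := SparseTFConfig{
		Cluster: cluster,
		Task:    TaskSpec{Type: taskType, Index: rankIndex},
	}
	data, err := json.Marshal(cfg)
	if err != nil {
		return fmt.Errorf("marshal TF config: %w", err)
	}
	c.Env = append(c.Env, EnvVar{Name: envTfConfigName, Value: string(data)})
	return nil
}

func main() {
	c := &Container{Name: "main"}
	cluster := SparseClusterSpec{
		Worker: map[int]string{1: "job-worker-1:3333"}, // made-up address
		PS:     []string{"job-ps-0:2222"},              // made-up address
	}
	if err := insertTfConfig(c, cluster, "worker", 1); err != nil {
		panic(err)
	}
	fmt.Printf("%s=%s\n", c.Env[0].Name, c.Env[0].Value)
}
```

Appending rather than replacing keeps the sketch short; a real implementation might first check whether the variable is already set.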
Types ¶
type ClusterSpec ¶
type ClusterSpec struct {
	Worker    []string `json:"worker,omitempty"`
	PS        []string `json:"ps"`
	Chief     []string `json:"chief"`
	Evaluator []string `json:"evaluator,omitempty"`
}
ClusterSpec represents a TensorFlow cluster specification. It is a map from job names to network addresses. See: https://www.tensorflow.org/deploy/distributed#create_a_tftrainclusterspec_to_describe_the_cluster
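The JSON tags above determine how a ClusterSpec is serialized. The following sketch copies the struct locally and marshals a small example to show that `ps` and `chief` are always emitted (as `null` when unset), while `worker` and `evaluator` are dropped when empty. `marshalClusterSpec` is a hypothetical helper and the addresses are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterSpec is a local copy of the documented struct.
type ClusterSpec struct {
	Worker    []string `json:"worker,omitempty"`
	PS        []string `json:"ps"`
	Chief     []string `json:"chief"`
	Evaluator []string `json:"evaluator,omitempty"`
}

// marshalClusterSpec is a hypothetical helper that returns the JSON form.
func marshalClusterSpec(spec ClusterSpec) (string, error) {
	data, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	spec := ClusterSpec{
		Worker: []string{"job-worker-0:3333", "job-worker-1:3333"}, // made-up addresses
		PS:     []string{"job-ps-0:2222"},
	}
	out, err := marshalClusterSpec(spec)
	if err != nil {
		panic(err)
	}
	// Prints: {"worker":["job-worker-0:3333","job-worker-1:3333"],"ps":["job-ps-0:2222"],"chief":null}
	fmt.Println(out)
}
```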
type SparseClusterSpec ¶
type SparseClusterSpec struct {
	Worker    map[int]string `json:"worker,omitempty"`
	PS        []string       `json:"ps"`
	Chief     map[int]string `json:"chief"`
	Evaluator map[int]string `json:"evaluator,omitempty"`
}
SparseClusterSpec enables a server to be configured without needing to know the identity of (for example) all other worker tasks. See: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
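To make the difference from a dense ClusterSpec concrete, the sketch below derives a sparse spec for a single worker: it keeps every PS address but only that worker's own entry, keyed by its index. This is an assumption about how a sparse spec might be built, not the package's code. `sparseForWorker` is a hypothetical helper and the addresses are made up.

```go
package main

import "fmt"

// Local copies of the documented structs (JSON tags omitted for brevity).
type ClusterSpec struct {
	Worker []string
	PS     []string
}

type SparseClusterSpec struct {
	Worker map[int]string
	PS     []string
}

// sparseForWorker is a hypothetical helper: it returns a spec that lists
// all parameter servers but only the worker at the given index.
func sparseForWorker(full ClusterSpec, index int) (SparseClusterSpec, error) {
	if index < 0 || index >= len(full.Worker) {
		return SparseClusterSpec{}, fmt.Errorf(
			"worker index %d out of range [0, %d)", index, len(full.Worker))
	}
	return SparseClusterSpec{
		Worker: map[int]string{index: full.Worker[index]},
		// Copy the PS list so the sparse spec does not alias the input slice.
		PS: append([]string(nil), full.PS...),
	}, nil
}

func main() {
	full := ClusterSpec{
		Worker: []string{"job-worker-0:3333", "job-worker-1:3333", "job-worker-2:3333"},
		PS:     []string{"job-ps-0:2222", "job-ps-1:2222"},
	}
	sparse, err := sparseForWorker(full, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(sparse.Worker), sparse.Worker[2], len(sparse.PS))
}
```

Because the worker map is keyed by task index, adding or removing other workers does not change the entry a given worker sees.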
type SparseTFConfig ¶
type SparseTFConfig struct {
	// Cluster represents a TensorFlow ClusterSpec.
	// See: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
	Cluster SparseClusterSpec `json:"sparseCluster"`
	Task    TaskSpec          `json:"task"`
}
SparseTFConfig is a struct representing the distributed TensorFlow config with a sparse cluster spec.
type TFConfig ¶
type TFConfig struct {
	// Cluster represents a TensorFlow ClusterSpec.
	// See: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
	Cluster ClusterSpec `json:"cluster"`
	Task    TaskSpec    `json:"task"`
}
TFConfig is a struct representing the distributed TensorFlow config.
type TaskManager ¶
type TaskManager struct {
// contains filtered or unexported fields
}
TaskManager generates Pods for tasks in a distributed PS job.
func (*TaskManager) HandleFaultPods ¶
func (m *TaskManager) HandleFaultPods(
	client runtime_client.Client,
	job *elasticv1alpha1.ElasticJob,
) error
HandleFaultPods processes faulty Pods.
func (*TaskManager) ReconcilePods ¶
func (m *TaskManager) ReconcilePods(
	client runtime_client.Client,
	job *elasticv1alpha1.ElasticJob,
	scalePlan *elasticv1alpha1.ScalePlan,
) error
ReconcilePods creates a Pod on a K8s cluster
func (*TaskManager) StopRunningPods ¶
func (m *TaskManager) StopRunningPods(
	client runtime_client.Client,
	job *elasticv1alpha1.ElasticJob,
) error
StopRunningPods stops all running Pods
func (*TaskManager) SyncJobState ¶
func (m *TaskManager) SyncJobState(
	client runtime_client.Client,
	job *elasticv1alpha1.ElasticJob,
) error
SyncJobState synchronizes the job status by replicas.
type TaskSpec ¶
type TaskSpec struct {
	Type  commonv1.ReplicaType `json:"type"`
	Index int                  `json:"index"`
}
TaskSpec is the specification for a task (PS or worker) of the ElasticJob using ParameterServerStrategy.
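For orientation, a process that receives the config can recover its own role from the `task` object. The sketch below decodes only that part of a TF_CONFIG value read from the environment. `parseTask` is a hypothetical helper, the task type is modelled as a plain string rather than `commonv1.ReplicaType`, and the sample value is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// TaskSpec is a local copy of the documented struct with a string type field.
type TaskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

// parseTask is a hypothetical helper that decodes only the "task" object
// and ignores the cluster section of the config.
func parseTask(raw string) (TaskSpec, error) {
	var cfg struct {
		Task TaskSpec `json:"task"`
	}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return TaskSpec{}, fmt.Errorf("decode TF_CONFIG: %w", err)
	}
	return cfg.Task, nil
}

func main() {
	// Made-up value, set here only so the example runs on its own.
	sample := `{"sparseCluster":{"ps":["job-ps-0:2222"],"chief":null},"task":{"type":"ps","index":0}}`
	if err := os.Setenv("TF_CONFIG", sample); err != nil {
		panic(err)
	}
	task, err := parseTask(os.Getenv("TF_CONFIG"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("type=%s index=%d\n", task.Type, task.Index)
}
```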