training

package
v0.0.0-...-d9dc1c1 Latest
Warning

This package is not in the latest version of its module.

Published: Nov 1, 2024 License: Apache-2.0, BSD-3-Clause Imports: 15 Imported by: 0

Documentation

Constants

const (
	// ReplicaTypeWorker is the type for training worker replica.
	ReplicaTypeWorker commonv1.ReplicaType = "worker"

	// ReplicaTypePS is the type for training parameter server replica.
	ReplicaTypePS commonv1.ReplicaType = "ps"

	// ReplicaTypeChief is the type for training chief replica of TensorFlow PS.
	ReplicaTypeChief commonv1.ReplicaType = "chief"

	// ReplicaTypeEvaluator is the type for evaluator replica.
	ReplicaTypeEvaluator commonv1.ReplicaType = "evaluator"

	// LabelRestartCount is the label key for the count of relaunches of a failed node.
	LabelRestartCount = "restart-count"

	// EnvTfConfigName is the environment variable name of TensorFlow cluster spec.
	EnvTfConfigName = "TF_CONFIG"

	// PSServicePort is the service port for parameter server replicas.
	PSServicePort int = 2222

	// WorkerServicePort is the service port for worker replicas.
	WorkerServicePort int = 3333
)

Variables

This section is empty.

Functions

func InsertTfConfigToEnv

func InsertTfConfigToEnv(
	container *corev1.Container,
	cluster SparseClusterSpec,
	taskType commonv1.ReplicaType,
	rankIndex int,
)

InsertTfConfigToEnv inserts the TFConfig into the environment variables of the given container.
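The following is a minimal sketch of what such an insertion could look like, not the package's actual implementation. It assumes the function serializes a SparseTFConfig to JSON and stores it under the TF_CONFIG variable named by EnvTfConfigName. The types `envVar` and `container` are simplified stand-ins for `corev1.EnvVar` and `corev1.Container`, the lowercase config types are local copies of the structs documented on this page, and `insertTfConfig` is a hypothetical name. Unlike the real function, the sketch returns an error.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envVar and container are simplified stand-ins (assumptions) for
// corev1.EnvVar and corev1.Container, so the sketch needs no Kubernetes deps.
type envVar struct {
	Name  string
	Value string
}

type container struct {
	Env []envVar
}

// Local copies of the documented structs, with the same JSON tags.
type sparseClusterSpec struct {
	Worker    map[int]string `json:"worker,omitempty"`
	PS        []string       `json:"ps"`
	Chief     map[int]string `json:"chief"`
	Evaluator map[int]string `json:"evaluator,omitempty"`
}

type taskSpec struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type sparseTFConfig struct {
	Cluster sparseClusterSpec `json:"sparseCluster"`
	Task    taskSpec          `json:"task"`
}

// insertTfConfig (hypothetical) marshals the config and appends it to the
// container's environment as TF_CONFIG.
func insertTfConfig(c *container, cluster sparseClusterSpec, taskType string, rankIndex int) error {
	raw, err := json.Marshal(sparseTFConfig{
		Cluster: cluster,
		Task:    taskSpec{Type: taskType, Index: rankIndex},
	})
	if err != nil {
		return err
	}
	c.Env = append(c.Env, envVar{Name: "TF_CONFIG", Value: string(raw)})
	return nil
}

func main() {
	c := &container{}
	cluster := sparseClusterSpec{
		Worker: map[int]string{1: "worker-1:3333"},
		PS:     []string{"ps-0:2222"},
	}
	if err := insertTfConfig(c, cluster, "worker", 1); err != nil {
		panic(err)
	}
	fmt.Printf("%s=%s\n", c.Env[0].Name, c.Env[0].Value)
}
```

The host names in the example are placeholders; only the ports 2222 and 3333 come from the constants above.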

Types

type ClusterSpec

type ClusterSpec struct {
	Worker    []string `json:"worker,omitempty"`
	PS        []string `json:"ps"`
	Chief     []string `json:"chief"`
	Evaluator []string `json:"evaluator,omitempty"`
}

ClusterSpec represents a TensorFlow cluster specification. It is a map from job names to network addresses. See https://www.tensorflow.org/deploy/distributed#create_a_tftrainclusterspec_to_describe_the_cluster
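To illustrate how the JSON tags shape the output, here is a small sketch that marshals a local copy of ClusterSpec with the standard library. The helper `encodeCluster` and the host names are hypothetical. Because `ps` and `chief` lack `omitempty`, a nil slice in those fields is emitted as `null`, while an empty `evaluator` is dropped entirely.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterSpec is a local copy of the documented struct so the example
// compiles without importing the package.
type ClusterSpec struct {
	Worker    []string `json:"worker,omitempty"`
	PS        []string `json:"ps"`
	Chief     []string `json:"chief"`
	Evaluator []string `json:"evaluator,omitempty"`
}

// encodeCluster (hypothetical helper) returns the JSON form of spec.
func encodeCluster(spec ClusterSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	spec := ClusterSpec{
		Worker: []string{"worker-0:3333", "worker-1:3333"},
		PS:     []string{"ps-0:2222"},
	}
	out, err := encodeCluster(spec)
	if err != nil {
		panic(err)
	}
	// Prints: {"worker":["worker-0:3333","worker-1:3333"],"ps":["ps-0:2222"],"chief":null}
	fmt.Println(out)
}
```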

type SparseClusterSpec

type SparseClusterSpec struct {
	Worker    map[int]string `json:"worker,omitempty"`
	PS        []string       `json:"ps"`
	Chief     map[int]string `json:"chief"`
	Evaluator map[int]string `json:"evaluator,omitempty"`
}

SparseClusterSpec enables a server to be configured without needing to know the identity of (for example) all other worker tasks. https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
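As a rough illustration of the sparse form, the sketch below marshals a local copy of SparseClusterSpec in which a worker lists only its own address, keyed by task index. `encoding/json` writes integer map keys as JSON strings. The helper name `encodeSparse` and the host names are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SparseClusterSpec is a local copy of the documented struct.
type SparseClusterSpec struct {
	Worker    map[int]string `json:"worker,omitempty"`
	PS        []string       `json:"ps"`
	Chief     map[int]string `json:"chief"`
	Evaluator map[int]string `json:"evaluator,omitempty"`
}

// encodeSparse (hypothetical helper) returns the JSON form of spec.
func encodeSparse(spec SparseClusterSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Worker 3 knows every PS address but only its own worker address.
	spec := SparseClusterSpec{
		Worker: map[int]string{3: "worker-3:3333"},
		PS:     []string{"ps-0:2222", "ps-1:2222"},
	}
	out, err := encodeSparse(spec)
	if err != nil {
		panic(err)
	}
	// Prints: {"worker":{"3":"worker-3:3333"},"ps":["ps-0:2222","ps-1:2222"],"chief":null}
	fmt.Println(out)
}
```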

type SparseTFConfig

type SparseTFConfig struct {
	// Cluster represents a TensorFlow ClusterSpec.
	// See: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
	Cluster SparseClusterSpec `json:"sparseCluster"`
	Task    TaskSpec          `json:"task"`
}

SparseTFConfig is a struct representing the distributed TensorFlow config.

type TFConfig

type TFConfig struct {
	// Cluster represents a TensorFlow ClusterSpec.
	// See: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
	Cluster ClusterSpec `json:"cluster"`
	Task    TaskSpec    `json:"task"`
}

TFConfig is a struct representing the distributed TensorFlow config.
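For orientation, the sketch below decodes a TF_CONFIG-style JSON string into local copies of TFConfig, ClusterSpec, and TaskSpec, then looks up the address of the current task. `ReplicaType` stands in for `commonv1.ReplicaType`, assumed here to be a string type as the constants suggest. The helper `parseTFConfig` and the sample addresses are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReplicaType is a stand-in (assumption) for commonv1.ReplicaType.
type ReplicaType string

// Local copies of the documented structs, with the same JSON tags.
type ClusterSpec struct {
	Worker    []string `json:"worker,omitempty"`
	PS        []string `json:"ps"`
	Chief     []string `json:"chief"`
	Evaluator []string `json:"evaluator,omitempty"`
}

type TaskSpec struct {
	Type  ReplicaType `json:"type"`
	Index int         `json:"index"`
}

type TFConfig struct {
	Cluster ClusterSpec `json:"cluster"`
	Task    TaskSpec    `json:"task"`
}

// parseTFConfig (hypothetical helper) decodes a TF_CONFIG JSON string.
func parseTFConfig(raw string) (TFConfig, error) {
	var cfg TFConfig
	err := json.Unmarshal([]byte(raw), &cfg)
	return cfg, err
}

func main() {
	raw := `{"cluster":{"worker":["w0:3333","w1:3333"],"ps":["p0:2222"],"chief":["c0:3333"]},` +
		`"task":{"type":"worker","index":1}}`
	cfg, err := parseTFConfig(raw)
	if err != nil {
		panic(err)
	}
	// The task index selects this task's address within its job list.
	fmt.Println(cfg.Task.Type, cfg.Cluster.Worker[cfg.Task.Index])
}
```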

type TaskManager

type TaskManager struct {
	// contains filtered or unexported fields
}

TaskManager generates Pods for the tasks of a distributed PS job.

func (*TaskManager) HandleFaultPods

func (m *TaskManager) HandleFaultPods(
	client runtime_client.Client, job *elasticv1alpha1.ElasticJob,
) error

HandleFaultPods processes faulty Pods.

func (*TaskManager) ReconcilePods

func (m *TaskManager) ReconcilePods(
	client runtime_client.Client,
	job *elasticv1alpha1.ElasticJob,
	scalePlan *elasticv1alpha1.ScalePlan,
) error

ReconcilePods creates a Pod on a K8s cluster

func (*TaskManager) StopRunningPods

func (m *TaskManager) StopRunningPods(
	client runtime_client.Client,
	job *elasticv1alpha1.ElasticJob,
) error

StopRunningPods stops all running Pods

func (*TaskManager) SyncJobState

func (m *TaskManager) SyncJobState(
	client runtime_client.Client,
	job *elasticv1alpha1.ElasticJob,
) error

SyncJobState synchronizes the job status by replicas.

type TaskSpec

type TaskSpec struct {
	Type  commonv1.ReplicaType `json:"type"`
	Index int                  `json:"index"`
}

TaskSpec is the specification for a task (PS or worker) of the ElasticJob using ParameterServerStrategy.
