# TorchElastic Controller for Kubernetes
**IMPORTANT:** This CRD is no longer actively maintained. We encourage you to look at TorchX to launch PyTorch jobs on Kubernetes. The two relevant sections are:

- TorchX Kubernetes setup: https://pytorch.org/torchx/latest/schedulers/kubernetes.html
- TorchX `dist.ddp` builtin (uses torchelastic under the hood): https://pytorch.org/torchx/latest/components/distributed.html
## Overview
TorchElastic Controller for Kubernetes manages a Kubernetes custom resource, `ElasticJob`, and makes it easy to run Torch Elastic workloads on Kubernetes.
## Prerequisites
**NOTE:**

- (recommended) Create a cluster with GPU instances, as some examples (e.g. imagenet) only work on GPU.
- If you provision instances with a single GPU, you will only be able to run a single worker per node.
- Our examples assume 1 GPU per node. To run multiple workers per container, adjust `--nproc_per_node` to equal the number of CUDA devices on the instance you are using.
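If you are unsure how many CUDA devices an instance exposes, here is a quick, generic sketch for checking (not specific to this repo; the kubectl query only reports GPUs after the NVIDIA device plugin from the Setup section is installed):

```shell
# On the instance itself: count the CUDA devices visible to PyTorch.
python -c "import torch; print(torch.cuda.device_count())"

# From outside the cluster: see how many GPUs each node advertises to Kubernetes
# (requires the NVIDIA device plugin installed in the Setup section below).
kubectl describe nodes | grep "nvidia.com/gpu"
```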
## (Optional) Setup
Here we provide instructions to create an Amazon EKS, Microsoft AKS, or Google GKE cluster. If you are not using AWS, Azure, or Google Cloud, please refer to your cloud/infrastructure provider's manual to set up a Kubernetes cluster.

**NOTE:** EKS/AKS/GKE is not required to run this controller; you can use any other Kubernetes cluster.
### Create your Kubernetes Cluster
#### AWS: EKS

Use `eksctl` to create an Amazon EKS cluster. This process takes ~15 minutes.
```shell
eksctl create cluster \
    --name=torchelastic \
    --node-type=p3.2xlarge \
    --region=us-west-2 \
    --version=1.15 \
    --ssh-access \
    --ssh-public-key=~/.ssh/id_rsa.pub \
    --nodes=2
```
#### Azure: AKS

Use `az aks` to create a Microsoft AKS cluster. This process takes ~4 minutes.
```shell
az aks create \
    --resource-group myResourceGroup \
    --name torchelastic \
    --node-vm-size Standard_NC6 \
    --node-count 3 \
    --location westus2 \
    --kubernetes-version 1.15 \
    --generate-ssh-keys
```
#### Google Cloud: GKE

Use `gcloud` to create a Google GKE cluster. This process takes ~3 minutes.
```shell
gcloud container clusters create torchelastic \
    --accelerator type=nvidia-tesla-v100,count=1 \
    --machine-type=n1-standard-4 \
    --zone=us-west1-b \
    --cluster-version=1.15 \
    --num-nodes=2
```
Add `--preemptible` to use preemptible instances.
Install the NVIDIA device plugin to enable GPU support on your cluster. Deploy the following DaemonSet:
```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
```
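To sanity-check the plugin, a generic sketch (the grep patterns assume the default names in the manifest above):

```shell
# A device-plugin pod should be running on every GPU node.
kubectl get pods -n kube-system | grep nvidia-device-plugin

# Once the plugin registers, GPU nodes advertise nvidia.com/gpu capacity.
kubectl describe nodes | grep "nvidia.com/gpu"
```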
## Install the ElasticJob controller and CRD
```shell
git clone https://github.com/pytorch/elastic.git
cd elastic/kubernetes

kubectl apply -k config/default
# or
# kustomize build config/default | kubectl apply -f -
```
You should see output like the following:
```
namespace/elastic-job created
customresourcedefinition.apiextensions.k8s.io/elasticjobs.elastic.pytorch.org created
role.rbac.authorization.k8s.io/leader-election-role created
clusterrole.rbac.authorization.k8s.io/manager-role created
rolebinding.rbac.authorization.k8s.io/leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/elastic-job-k8s-controller-rolebinding created
deployment.apps/elastic-job-k8s-controller created
```
Verify that the `ElasticJob` custom resource is installed:

```shell
kubectl get crd
```

The output should include `elasticjobs.elastic.pytorch.org`:

```
NAME                                       CREATED AT
...
elasticjobs.elastic.pytorch.org            2020-03-18T07:40:53Z
...
```
Verify that the controller is ready:

```shell
kubectl get pods -n elastic-job
```

```
NAME                                          READY   STATUS    RESTARTS   AGE
elastic-job-k8s-controller-6d4884c75b-z22cm   1/1     Running   0          15s
```
Check the controller logs:

```shell
kubectl logs -f elastic-job-k8s-controller-6d4884c75b-z22cm -n elastic-job
```

```
2020-03-19T10:13:43.532Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
2020-03-19T10:13:43.534Z INFO controller-runtime.controller Starting EventSource {"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z INFO controller-runtime.controller Starting EventSource {"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z INFO controller-runtime.controller Starting EventSource {"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z INFO setup starting manager
2020-03-19T10:13:43.534Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2020-03-19T10:13:43.822Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"elastic-job","name":"controller-leader-election-helper","uid":"50269b8b-69ca-11ea-b995-0653198c16be","apiVersion":"v1","resourceVersion":"2107564"}, "reason": "LeaderElection", "message": "elastic-job-k8s-controller-6d4884c75b-z22cm_4cf549b7-3289-4285-8e64-647d067178bf became leader"}
2020-03-19T10:13:44.021Z INFO controller-runtime.controller Starting Controller {"controller": "elasticjob"}
2020-03-19T10:13:44.121Z INFO controller-runtime.controller Starting workers {"controller": "elasticjob", "worker count": 1}
```
## Deploy an ElasticJob

1. Deploy an etcd server. This exposes a Kubernetes service `etcd-service` on port `2379`.

   ```shell
   kubectl apply -f config/samples/etcd.yaml
   ```

2. Get the etcd server endpoint:

   ```shell
   $ kubectl get svc -n elastic-job

   NAME           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
   etcd-service   ClusterIP   10.100.104.168   <none>        2379/TCP   5m5s
   ```

3. Update `config/samples/imagenet.yaml` (a sketch of the relevant fields follows this list):

   - Set `rdzvEndpoint` (e.g. `10.100.104.168:2379`) to the etcd server you just provisioned.
   - Set `minReplicas` and `maxReplicas` to the desired min and max number of nodes (max should not exceed your cluster capacity).
   - Set `Worker.replicas` to the number of nodes to start with (you may modify this later to scale the job in/out).
   - Set the correct `--nproc_per_node` in `container.args` based on the instance you are running on.

   **NOTE:** the `ENTRYPOINT` of `torchelastic/examples` is `python -m torchelastic.distributed.launch <args...>`. Notice that you do not have to specify certain `launch` options such as `--rdzv_endpoint` and `--rdzv_id`; these are set automatically by the controller.

   **IMPORTANT:** a `Worker` in the context of Kubernetes refers to a `Node` in `torchelastic.distributed.launch`. Each Kubernetes `Worker` can run multiple trainer processes (a.k.a. `worker` in `torchelastic.distributed.launch`).

4. Submit the training job.

   ```shell
   kubectl apply -f config/samples/imagenet.yaml
   ```

   As you can see, the training pods and headless services have been created:

   ```shell
   $ kubectl get pods -n elastic-job

   NAME                                          READY   STATUS    RESTARTS   AGE
   elastic-job-k8s-controller-6d4884c75b-z22cm   1/1     Running   0          11m
   imagenet-worker-0                             1/1     Running   0          5s
   imagenet-worker-1                             1/1     Running   0          5s
   ```

5. You can scale the number of nodes by adjusting `.spec.replicaSpecs[Worker].replicas` and applying the change:

   ```shell
   kubectl apply -f config/samples/imagenet.yaml
   ```

   **NOTE:** since you are scaling the containers, you will be scaling in increments of `nproc_per_node` trainers. In our case `--nproc_per_node=1`. For better performance, consider using an instance with multiple GPUs and setting `--nproc_per_node=$NUM_CUDA_DEVICES`.

   **WARNING:** the name of the job is used as `rdzv_id`, which uniquely identifies a job run instance. Hence, to run multiple parallel jobs with the same spec you need to change `.spec.metadata.name` to give it a unique run id (e.g. `imagenet_run_0`). Otherwise the new nodes will attempt to join the membership of a different run.
## Monitoring jobs
You can describe the job to check its status and related events. In the following example, the `imagenet` job was created in the `elastic-job` namespace; substitute your own job name and namespace in the command.
```shell
$ kubectl describe elasticjob imagenet -n elastic-job

Name:         imagenet
Namespace:    elastic-job
<... OMITTED ...>
Status:
  Conditions:
    Last Transition Time:  2020-03-19T10:30:55Z
    Last Update Time:      2020-03-19T10:30:55Z
    Message:               ElasticJob imagenet is running.
    Reason:                ElasticJobRunning
    Status:                True
    Type:                  Running
<... OMITTED ...>
Events:
  Type    Reason               Age   From                    Message
  ----    ------               ----  ----                    -------
  Normal  SuccessfulCreatePod  13s   elastic-job-controller  Created pod: imagenet-worker-0
```
Tail the logs of a worker:

```shell
$ kubectl logs -f -n elastic-job imagenet-worker-0
```
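Two generic kubectl commands (nothing specific to this controller) are also handy for watching the job scale in and out:

```shell
# Watch worker pods come and go as you change the number of replicas
kubectl get pods -n elastic-job -w

# List recent events in the namespace, newest last
kubectl get events -n elastic-job --sort-by=.lastTimestamp
```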
## Next Steps
We have included other sample job specs in the `config/samples` directory (e.g. `config/samples/classy_vision.yaml`). Try them out by replacing `imagenet.yaml` with the appropriate spec filename in the instructions above.
To use your own script, build a Docker image containing your script. You can use `torchelastic/examples` as your base image. Then point your job spec to your container by editing `Worker.template.spec.containers.image`.
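A minimal Dockerfile sketch (the script name, destination path, and registry are placeholders, not part of this repo):

```Dockerfile
# Sketch only: build on top of the examples image, which already sets
# ENTRYPOINT to python -m torchelastic.distributed.launch.
# In practice, pin a released tag of the base image rather than relying on latest.
FROM torchelastic/examples
COPY my_trainer.py /workspace/my_trainer.py
# Pass /workspace/my_trainer.py (plus its flags) via container.args in the job spec;
# the launch options handled by the controller (e.g. --rdzv_endpoint) stay omitted.
```

Push the resulting image to a registry your cluster can pull from before referencing it in the spec.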
Our examples save checkpoints and models inside the container, so the trained model and checkpoints are not accessible after the job completes. In your own scripts, use a persistent store such as AWS S3 or Azure Blob Storage.
## Troubleshooting

Please check `TROUBLESHOOTING.md`.