pytorch-operator

Module version: v1.0.0-rc.0
Published: Jun 27, 2019 License: Apache-2.0

README

Kubernetes Custom Resource and Operator for PyTorch jobs

Overview

This repository contains the specification and implementation of the PyTorchJob custom resource definition (CRD). Using this custom resource, users can create and manage PyTorch jobs like other built-in resources in Kubernetes. See the CRD definition.

Prerequisites

Installing PyTorch Operator

Please refer to the installation instructions in the Kubeflow user guide. This installs the PyTorchJob CRD and the pytorch-operator controller, which manages the lifecycle of PyTorch jobs.

Creating a PyTorch Job

You can create a PyTorch job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file to suit your requirements.

cat examples/mnist/v1/pytorch_job_mnist_gloo.yaml
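For orientation, the manifest has roughly the following shape. This is a sketch reconstructed from the spec shown in the sample output under "Monitoring a PyTorch Job", not a copy of the example file. The output file name `pytorch_job_sketch.yaml` is hypothetical, and the GPU resource limits that appear in that sample output are left out here on the assumption of a CPU-only cluster.

```shell
#!/bin/sh
# Write a minimal PyTorchJob manifest to a file (hypothetical file name).
# Field names and values mirror the spec in the sample output; GPU limits
# are omitted as an assumption for CPU-only clusters.
MANIFEST="${MANIFEST:-pytorch_job_sketch.yaml}"

cat > "$MANIFEST" <<'EOF'
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist-gloo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
            args: ["--backend", "gloo"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
            args: ["--backend", "gloo"]
EOF

echo "wrote $MANIFEST"
```

The Master and Worker entries differ only in their key, so changing `replicas` under Worker is the usual way to scale the job out.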

Deploy the PyTorchJob resource to start training:

kubectl create -f examples/mnist/v1/pytorch_job_mnist_gloo.yaml

You should now be able to see the created pods matching the specified number of replicas.

kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo
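To check the pod count from a script rather than by eye, the same label selector can be counted and compared with the expected number of replicas. The helper names `count_job_pods` and `pods_match_replicas` below are hypothetical; this is a minimal sketch that assumes `kubectl` is already pointed at the right cluster and namespace.

```shell
#!/bin/sh
# count_job_pods JOB_NAME: print how many pods carry the job's label.
count_job_pods() {
  kubectl get pods -l "pytorch-job-name=$1" -o name | wc -l | tr -d ' '
}

# pods_match_replicas JOB_NAME EXPECTED: succeed if the pod count equals EXPECTED.
pods_match_replicas() {
  [ "$(count_job_pods "$1")" -eq "$2" ]
}

# Usage (the example job has 1 Master + 1 Worker, so 2 pods are expected):
#   pods_match_replicas pytorch-dist-mnist-gloo 2 && echo "all pods created"
```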

Training runs for about 10 epochs and takes 5-10 minutes on a CPU cluster. Inspect the logs of the master pod to follow training progress.

PODNAME=$(kubectl get pods -l pytorch-job-name=pytorch-dist-mnist-gloo,pytorch-replica-type=master -o name)
kubectl logs -f ${PODNAME}

Monitoring a PyTorch Job

kubectl get -o yaml pytorchjobs pytorch-dist-mnist-gloo

See the status section to monitor the job status. Here is sample output for a job that completed successfully.

apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    creationTimestamp: 2019-01-11T00:51:48Z
    generation: 1
    name: pytorch-dist-mnist-gloo
    namespace: default
    resourceVersion: "2146573"
    selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/pytorchjobs/pytorch-dist-mnist-gloo
    uid: 13ad0e7f-153b-11e9-b5c1-42010a80001e
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
            - args:
              - --backend
              - gloo
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: "1"
      Worker:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
            - args:
              - --backend
              - gloo
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: "1"
  status:
    completionTime: 2019-01-11T01:03:15Z
    conditions:
    - lastTransitionTime: 2019-01-11T00:51:48Z
      lastUpdateTime: 2019-01-11T00:51:48Z
      message: PyTorchJob pytorch-dist-mnist-gloo is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2019-01-11T00:57:22Z
      lastUpdateTime: 2019-01-11T00:57:22Z
      message: PyTorchJob pytorch-dist-mnist-gloo is running.
      reason: PyTorchJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2019-01-11T01:03:15Z
      lastUpdateTime: 2019-01-11T01:03:15Z
      message: PyTorchJob pytorch-dist-mnist-gloo is successfully completed.
      reason: PyTorchJobSucceeded
      status: "True"
      type: Succeeded
    replicaStatuses:
      Master:
        succeeded: 1
      Worker:
        succeeded: 1
    startTime: 2019-01-11T00:57:22Z
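
The conditions list above can also be polled from a script. The sketch below reads a single condition with a `kubectl` JSONPath filter and waits for the job to finish. The function names are hypothetical, and the `Failed` condition type is an assumption: it does not appear in the sample output, which only shows Created, Running, and Succeeded. Note that in the sample the Running condition is "False" once the job has completed, so the check keys on Succeeded instead.

```shell
#!/bin/sh
# job_condition JOB_NAME TYPE: print the status ("True"/"False") of one condition.
job_condition() {
  kubectl get pytorchjobs "$1" \
    -o jsonpath="{.status.conditions[?(@.type==\"$2\")].status}"
}

# wait_for_job JOB_NAME [TRIES] [DELAY_SECONDS]
# Poll until the Succeeded condition is "True" (return 0), the Failed
# condition is "True" (return 1; "Failed" is an assumed condition type),
# or TRIES polls have elapsed (return 2).
wait_for_job() {
  job=$1 tries=${2:-60} delay=${3:-10}
  while [ "$tries" -gt 0 ]; do
    [ "$(job_condition "$job" Succeeded)" = "True" ] && return 0
    [ "$(job_condition "$job" Failed)" = "True" ] && return 1
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 2
}

# Usage:
#   wait_for_job pytorch-dist-mnist-gloo && echo "training finished"
```

With the defaults (60 tries, 10 seconds apart) the loop gives up after about 10 minutes, which matches the upper end of the training time quoted earlier; raise TRIES for longer jobs.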

Contributing

Please refer to the developer_guide.

Directories

| Path | Synopsis |
|---|---|
| cmd | |
| pkg | |
| apis/pytorch/v1 | Package v1 is the v1 version of the API. |
| apis/pytorch/v1beta2 | Package v1beta2 is the v1beta2 version of the API. |
| client/clientset/versioned | This package has the automatically generated clientset. |
| client/clientset/versioned/fake | This package has the automatically generated fake clientset. |
| client/clientset/versioned/scheme | This package contains the scheme of the automatically generated clientset. |
| client/clientset/versioned/typed/pytorch/v1 | This package has the automatically generated typed clients. |
| client/clientset/versioned/typed/pytorch/v1/fake | Package fake has the automatically generated clients. |
| client/clientset/versioned/typed/pytorch/v1beta2 | This package has the automatically generated typed clients. |
| client/clientset/versioned/typed/pytorch/v1beta2/fake | Package fake has the automatically generated clients. |
| common/util/v1/unstructured | Package unstructured is the package for unstructured informer, which is from https://github.com/argoproj/argo/blob/master/util/unstructured/unstructured.go. This is a temporary solution for https://github.com/kubeflow/tf-operator/issues/561. |
| common/util/v1beta2/unstructured | Package unstructured is the package for unstructured informer, which is from https://github.com/argoproj/argo/blob/master/util/unstructured/unstructured.go. This is a temporary solution for https://github.com/kubeflow/tf-operator/issues/561. |
| controller.v1/pytorch | Package controller provides a Kubernetes controller for a PyTorchJob resource. |
| controller.v1beta2/pytorch | Package controller provides a Kubernetes controller for a PyTorchJob resource. |
| util | Package util provides various helper routines. |
| test | |
