caffe2-operator

module

v0.0.0-...-2c5db8d Latest Latest Go to latest Published: Jun 18, 2018 License: Apache-2.0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/carmark/caffe2-operator

Links

Open Source Insights

README ¶

caffe2-operator

Experimental repository for a caffe2 operator

Motivation

Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.

For distributed training, Caffe2 has no parameter server compared with Tensorflow, so it has to use Redis/Gloo to find the other nodes to communicate.

Build

$ make
mkdir -p _output/bin
go build -o _output/bin/caffe2-operator ./cmd/caffe2-operator/
$ _output/bin/caffe2-operator --help

Custom Resource Definition

The custom resource submitted to the Kubernetes API would look something like this:

apiVersion: "kubeflow.org/v1alpha1"
kind: "Caffe2Job"
metadata:
  name: "example-job"
spec:
  backend: "redis"
  replicaSpecs:
      replicas: 2
      template:
        spec:
          hostNetwork: true
          containers:
          - image: kubeflow/caffe2:py2-cuda9.0-cudnn7-ubuntu16.04
            name: caffe2
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /usr/local/caffe2/caffe2/python/examples/
            command: ["python", "resnet50_trainer.py"]

A full resnet50 trainer example is here.

This Caffe2Job resembles the existing TFJob for the tf-operator. The main differences being the omission of the parameter server replica type, and the addition of backend options.

backend Defines the distributed type the Caffe2 master and workers will use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found here.

For redis backend, you need a working Redis server to serve for workers communication.

Resulting Worker

apiVersion: v1
kind: Pod
metadata:
  name: caffe2-worker-${job_id}
  labels:
      caffe2_job_key: default-example-job
      caffe2_replica_index: "0"
      caffe2_replica_type: worker
      group_name: kubeflow.org
      runtime_id: "1529307087"
spec:
  containers:
    image: kubeflow/caffe2:py2-cuda9.0-cudnn7-ubuntu16.04
    imagePullPolicy: IfNotPresent
    name: caffe2
    env:
      - name: SHARD_ID
        value: "0"
      - name: NUM_SHARDS
        value: "1"
      - name: RUN_ID
        value: "1529307087"
      - name: CAFFE2_CONFIG
        value: '{"cluster":{"worker":["default-example-job-worker-0:2222"]},"task":{"type":"worker","index":0}}'
    ...

The worker spec generates a pod. They will communicate to the master through the redis's service name.

NOTE: There are three additional environments which are generated based on worker role, such as index for SHARD_ID, replicas for NUM_SHARDS and running ID for RUN_ID.

Design

This is an implementaion of the Caffe2 distributed design patterns, found here.

Other backends

Form here, Caffe2 also support NFS backend, however, we do not test the nfs backend now.

How to setup

Setup kubernetes

A full function kubernetes.
Open the features-gate if you want to use GPU

Create a CRD for kuberntes

# kubectl apply -f https://raw.githubusercontent.com/kubeflow/caffe2-operator/master/examples/crd.yaml
customresourcedefinition.apiextensions.k8s.io "caffe2jobs.kubeflow.org" created

Start the caffe2-operator

# ./caffe2-operator -alsologtostderr -v 4 -controller-config-file /root/admin.conf

Prepare the dataset

In the example, we use handwritten. You need to convert it to levelDB type by using make_mnist_db.

$ make_mnist_db --channel_first --db leveldb --image_file data/mnist/train-images-idx3-ubyte --label_file data/mnist/train-labels-idx1-ubyte --output_file data/mnist/mnist-train-nchw-leveldb 

$ make_mnist_db --channel_first --db leveldb --image_file data/mnist/t10k-images-idx3-ubyte --label_file data/mnist/t10k-labels-idx1-ubyte --output_file data/mnist/mnist-test-nchw-leveldb

Run the job

$ kubectl apply -f ./examples/resnet50.yaml
$ kubectl get caffe2jobs
NAME          AGE
example-job   29m
$ kubectl get pods
NAME                READY     STATUS    RESTARTS   AGE
example-job-pdcbs   1/1       Running   0          29m

In this example, we use hostNetwork = true, it is not the better solution, but it will train more quickly. Because the overlay network will reduce some performance.

Directories ¶

Path	Synopsis
cmd
caffe2-operator
caffe2-operator/app
caffe2-operator/app/options
pkg
apis/caffe2/helper
apis/caffe2/v1alpha1 Package v1alpha1 is the v1alpha1 version of the API.	Package v1alpha1 is the v1alpha1 version of the API.
apis/caffe2/validation
client/clientset/versioned This package has the automatically generated clientset.	This package has the automatically generated clientset.
client/clientset/versioned/fake This package has the automatically generated fake clientset.	This package has the automatically generated fake clientset.
client/clientset/versioned/scheme This package contains the scheme of the automatically generated clientset.	This package contains the scheme of the automatically generated clientset.
client/clientset/versioned/typed/kubeflow/v1alpha1 This package has the automatically generated typed clients.	This package has the automatically generated typed clients.
client/clientset/versioned/typed/kubeflow/v1alpha1/fake Package fake has the automatically generated clients.	Package fake has the automatically generated clients.
client/informers/externalversions
client/informers/externalversions/internalinterfaces
client/informers/externalversions/kubeflow
client/informers/externalversions/kubeflow/v1alpha1
client/listers/kubeflow/v1alpha1
controller Package controller provides a Kubernetes controller for a Caffe2Job resource.	Package controller provides a Kubernetes controller for a Caffe2Job resource.
trainer Package trainer is to manage Caffe2 training jobs.	Package trainer is to manage Caffe2 training jobs.
util Package util provides various helper routines.	Package util provides various helper routines.
util/k8sutil
util/retryutil
version

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL