k8sgpusharing

module v0.0.0-...-78ba78e
Published: Feb 20, 2020 License: Apache-2.0

README

Please refer to NTHU-LSALAB/KubeShare

K8sGPUSharing

Make GPUs shareable in Kubernetes

Features

  1. Compatible with native K8s resource management
  2. Fine-grained resource definition
  3. Supports GPU compute and memory limits
  4. Supports multiple GPUs in a node
  5. Avoids the GPU fragmentation problem
  6. Supports GPU namespaces

Interested?

Follow this link

Directories

  • pkg: golang packages for K8s generated by code-generator release-1.14 (6c2a4329ac29)

Limitations

  • Only supports the Nvidia GPU device plugin with nvidia-docker2 in K8s.
    Not compatible with Docker (version >= 19) using the newer GPU resource API.
  • Currently only supports CUDA 9.0.
  • Requires Kubernetes version >= 1.10 (device plugin & CRD).

Prerequisites

  • A K8s cluster with Nvidia GPU device plugin.
  • kubectl with admin permissions.

Installation

kubectl create -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/crd.yaml
kubectl create -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/controller.yaml
kubectl create -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/daemonset.yaml

Using Shareable GPU

The extra information required by a shareable GPU (certain environment variables and volume mounts) is immutable once a Pod has been created, so it cannot be injected into Pods created by the user after the fact. We therefore define a new CustomResourceDefinition (CRD) named MtgpuPod (Multi-tenant GPU Pod) as the basic execution unit, taking the role normally played by a Pod.

An example MtgpuPod spec:

apiVersion: lsalab.nthu/v1
kind: MtgpuPod
metadata:
  name: pod1
  annotations:
    "lsalab.nthu/gpu_request": "0.5"
    "lsalab.nthu/gpu_limit": "1.0"
    "lsalab.nthu/gpu_mem": "1073741824" # 1Gi, in bytes
    "lsalab.nthu/GPUID": "abc"
spec:
  nodeName: node1 # must be assigned
  containers:
  - name: sleep
    image: nvidia/cuda:9.0-base
    command: ["sh", "-c"]
    args:
    - 'nvidia-smi -L'
    resources:
      requests:
        cpu: "1"
        memory: "500Mi"
      limits:
        cpu: "1"
        memory: "500Mi"

Because K8s forbids floating-point requests for custom devices, the GPU resource usage definitions are moved to annotations (a small parsing sketch follows the list below).

  • lsalab.nthu/gpu_request: guaranteed GPU usage of the Pod; gpu_request <= "1.0".
  • lsalab.nthu/gpu_limit: maximum extra usage if the GPU still has free resources; gpu_request <= gpu_limit <= "1.0".
  • lsalab.nthu/gpu_mem: maximum GPU memory usage of the Pod, in bytes.
  • lsalab.nthu/GPUID: described in the section Controlling everything of shareable GPU.
  • spec is a normal PodSpec definition to be deployed.
  • spec.nodeName must be assigned (a deployed MtgpuPod must already be scheduled). More information in the section Cluster resources accounting.
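
As mentioned above, here is a small parsing and validation sketch for these annotations, written in Go with only the standard library. It is not the project's actual parser; only the annotation keys and the gpu_request <= gpu_limit <= 1.0 constraint come from this README, while the function name and error wording are illustrative assumptions.

package main

import (
	"fmt"
	"strconv"
)

// parseGPUAnnotations reads the GPU-sharing annotations listed above and
// checks the constraint 0 <= gpu_request <= gpu_limit <= 1.0.
func parseGPUAnnotations(ann map[string]string) (request, limit float64, memBytes int64, err error) {
	request, err = strconv.ParseFloat(ann["lsalab.nthu/gpu_request"], 64)
	if err != nil {
		return 0, 0, 0, fmt.Errorf("bad lsalab.nthu/gpu_request: %v", err)
	}
	limit, err = strconv.ParseFloat(ann["lsalab.nthu/gpu_limit"], 64)
	if err != nil {
		return 0, 0, 0, fmt.Errorf("bad lsalab.nthu/gpu_limit: %v", err)
	}
	memBytes, err = strconv.ParseInt(ann["lsalab.nthu/gpu_mem"], 10, 64)
	if err != nil {
		return 0, 0, 0, fmt.Errorf("bad lsalab.nthu/gpu_mem: %v", err)
	}
	if request < 0 || request > limit || limit > 1.0 {
		return 0, 0, 0, fmt.Errorf("want 0 <= gpu_request (%v) <= gpu_limit (%v) <= 1.0", request, limit)
	}
	return request, limit, memBytes, nil
}

func main() {
	// The annotation values from the MtgpuPod example above.
	req, lim, mem, err := parseGPUAnnotations(map[string]string{
		"lsalab.nthu/gpu_request": "0.5",
		"lsalab.nthu/gpu_limit":   "1.0",
		"lsalab.nthu/gpu_mem":     "1073741824",
	})
	fmt.Println(req, lim, mem, err) // 0.5 1 1073741824 <nil>
}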

Controlling everything of shareable GPU

A lsalab.nthu/GPUID (abbrev. GPUID) value is a randomly generated string of length 5 that temporarily represents a physical GPU card. A GPUID value is valid only while at least one Pod on the Node uses it. The same GPUID can be assigned to multiple Pods simultaneously if those Pods want to share the same physical GPU card.

A GPUID must be unique within the scope of a Node.
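
The README states only that a GPUID is a random string of length 5; the alphabet is not specified. A minimal generator sketch in Go, assuming a lowercase-alphanumeric alphabet and a hypothetical helper name newGPUID:

package main

import (
	"fmt"
	"math/rand"
)

// newGPUID returns a random 5-character identifier, e.g. "zxcvb".
// The lowercase-alphanumeric alphabet is an assumption for illustration;
// the README only states that the string has length 5.
func newGPUID() string {
	const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
	b := make([]byte, 5)
	for i := range b {
		b[i] = alphabet[rand.Intn(len(alphabet))]
	}
	return string(b)
}

func main() {
	fmt.Println(newGPUID())
}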

An example of controlling GPU sharing (one node with two physical GPUs):

GPU1             GPU2
+--------------+ +--------------+
|              | |              |
|              | |              |
|              | |              |
|              | |              |
|              | |              |
+--------------+ +--------------+

randomString(5): zxcvb (wants a new physical GPU)
Create Pod1 gpu_request:0.2 GPUID:zxcvb

GPU1             GPU2(zxcvb)
+--------------+ +--------------+
|              | |   Pod1:0.2   |
|              | |              |
|              | |              |
|              | |              |
|              | |              |
+--------------+ +--------------+

randomString(5): qwert (doesn't want to share with Pod1; wants a new physical GPU)
Create Pod2 gpu_request:0.3 GPUID:qwert

GPU1(qwert)      GPU2(zxcvb)
+--------------+ +--------------+
|   Pod2:0.3   | |   Pod1:0.2   |
|              | |              |
|              | |              |
|              | |              |
|              | |              |
+--------------+ +--------------+

Create Pod3 gpu_request:0.4 GPUID:zxcvb (wants to share with Pod1)

GPU1(qwert)      GPU2(zxcvb)
+--------------+ +--------------+
|   Pod2:0.3   | |   Pod1:0.2   |
|              | |   Pod3:0.4   |
|              | |              |
|              | |              |
|              | |              |
+--------------+ +--------------+

Delete Pod2 (GPUID qwert is no longer available)

GPU1             GPU2(zxcvb)
+--------------+ +--------------+
|              | |   Pod1:0.2   |
|              | |   Pod3:0.4   |
|              | |              |
|              | |              |
|              | |              |
+--------------+ +--------------+

randomString(5): asdfg (doesn't want to share with Pod1 or Pod3; wants a new physical GPU)
Create Pod4 gpu_request:0.5 GPUID:asdfg

GPU1(asdfg)      GPU2(zxcvb)
+--------------+ +--------------+
|   Pod4:0.5   | |   Pod1:0.2   |
|              | |   Pod3:0.4   |
|              | |              |
|              | |              |
|              | |              |
+--------------+ +--------------+

Compatible with Nvidia device plugin

The Occupy-Pod

The K8s default-scheduler cannot recognize MtgpuPods, so it may schedule ordinary Pods onto a Node whose physical GPUs are already in use by MtgpuPods. To prevent this, whenever a new GPUID is generated we run an Occupy-Pod that requests one GPU through the Nvidia device plugin (spec.containers[0].resources.requests: "nvidia.com/gpu": 1), telling the default-scheduler that this GPU is claimed by the MtgpuPod system.

Occupy-Pods are created in the kube-system namespace, with the naming format mtgpupod-occupypod-{NodeName}-{GPUID}.
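
For illustration, such an Occupy-Pod object can be sketched with the upstream k8s.io/api types as below. Only the name format, the kube-system namespace, and the "nvidia.com/gpu": 1 request/limit come from this README; the container name, image, and command are placeholder assumptions, not the controller's actual values.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildOccupyPod builds an Occupy-Pod manifest for the given node and GPUID,
// requesting one whole GPU from the Nvidia device plugin so that the
// default-scheduler sees the card as taken.
func buildOccupyPod(nodeName, gpuID string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			// Naming format from this README: mtgpupod-occupypod-{NodeName}-{GPUID}
			Name:      fmt.Sprintf("mtgpupod-occupypod-%s-%s", nodeName, gpuID),
			Namespace: "kube-system",
		},
		Spec: corev1.PodSpec{
			NodeName: nodeName, // pin the Occupy-Pod to the node that owns the GPUID
			Containers: []corev1.Container{{
				Name:    "occupy",               // placeholder name
				Image:   "nvidia/cuda:9.0-base", // placeholder image, reused from the MtgpuPod example
				Command: []string{"sh", "-c", "sleep infinity"},
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
					Limits:   corev1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
				},
			}},
		},
	}
}

func main() {
	fmt.Println(buildOccupyPod("node1", "zxcvb").Name) // mtgpupod-occupypod-node1-zxcvb
}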

Cluster resources accounting

The K8s default-scheduler does not currently support shareable custom devices, so an MtgpuPod-scheduler is required to deploy MtgpuPods automatically. Although we may provide a basic MtgpuPod-scheduler in the future that supports only resource-request scheduling, the method for accounting shareable custom device resources is described below in pseudo-code (a Go sketch of the same logic follows it).

GPUResources := list of free GPU resources of every Node

for each Node:
    availableGPU := total number of GPUs on Node
    allocatedGPUMap := map of GPUID => remaining free fraction of that GPU

    for each Pod on Node:
        // skip Occupy-Pods: their GPU is accounted below through the GPUID that created them
        if ! Pod.Name.Contains("mtgpupod-occupypod"):
            availableGPU -= sum of "nvidia.com/gpu" requests of containers in Pod

    for each MtgpuPod on Node:
        if GPUID of MtgpuPod exists in allocatedGPUMap:
            allocatedGPUMap.Get(GPUID) -= GPU request of MtgpuPod
        else:
            availableGPU -= 1
            allocatedGPUMap.Add(GPUID, 1.0 - GPU request of MtgpuPod)

    GPUResources.Add(availableGPU, allocatedGPUMap)
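
The same accounting can be written as a short Go function for a single node. This is only a sketch of the pseudo-code above, not the project's scheduler; the pod and mtgpuPod structs are simplified stand-ins for the objects a real scheduler would read from the API server.

package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the objects a real scheduler would read from the API server.
type pod struct {
	name       string
	gpuRequest int64 // sum of "nvidia.com/gpu" requests over the Pod's containers
}

type mtgpuPod struct {
	gpuID      string  // lsalab.nthu/GPUID
	gpuRequest float64 // lsalab.nthu/gpu_request
}

type nodeGPUResources struct {
	availableGPU int64              // whole GPUs not yet bound to any Pod or GPUID
	allocatedGPU map[string]float64 // GPUID -> remaining free fraction of that GPU
}

// accountNode mirrors the pseudo-code above for a single node.
func accountNode(totalGPU int64, pods []pod, mtgpuPods []mtgpuPod) nodeGPUResources {
	res := nodeGPUResources{availableGPU: totalGPU, allocatedGPU: map[string]float64{}}

	// Ordinary Pods consume whole GPUs. Occupy-Pods are skipped here because
	// the GPU they hold is charged below through the GPUID that created them.
	for _, p := range pods {
		if !strings.Contains(p.name, "mtgpupod-occupypod") {
			res.availableGPU -= p.gpuRequest
		}
	}

	// MtgpuPods consume fractions of the GPU bound to their GPUID; the first
	// MtgpuPod seen for a GPUID removes one whole GPU from the free pool.
	for _, m := range mtgpuPods {
		if _, ok := res.allocatedGPU[m.gpuID]; ok {
			res.allocatedGPU[m.gpuID] -= m.gpuRequest
		} else {
			res.availableGPU--
			res.allocatedGPU[m.gpuID] = 1.0 - m.gpuRequest
		}
	}
	return res
}

func main() {
	// Matches the walkthrough above: Pod1 (0.2) and Pod3 (0.4) share GPUID "zxcvb"
	// on a node with two physical GPUs and no other GPU Pods.
	res := accountNode(2, nil, []mtgpuPod{
		{gpuID: "zxcvb", gpuRequest: 0.2},
		{gpuID: "zxcvb", gpuRequest: 0.4},
	})
	fmt.Printf("free whole GPUs: %d, per-GPUID free fraction: %v\n", res.availableGPU, res.allocatedGPU)
	// Output: free whole GPUs: 1, per-GPUID free fraction: map[zxcvb:0.4]
}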

Uninstallation

kubectl delete -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/crd.yaml
kubectl delete -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/controller.yaml
kubectl delete -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/daemonset.yaml

Issues

Currently the GPU memory usage control may not work properly.

Directories

Path Synopsis
pkg
    client/clientset/versioned: This package has the automatically generated clientset.
    client/clientset/versioned/fake: This package has the automatically generated fake clientset.
    client/clientset/versioned/scheme: This package contains the scheme of the automatically generated clientset.
    client/clientset/versioned/typed/mtgpupod/v1: This package has the automatically generated typed clients.
    client/clientset/versioned/typed/mtgpupod/v1/fake: Package fake has the automatically generated clients.
