azurebatch

package

Version: v0.7.4 Published: Dec 3, 2018 License: Apache-2.0 Imports: 26 Imported by: 0

Repository

github.com/sylabs/virtual-kubelet

README ¶

Kubernetes Virtual Kubelet with Azure Batch

Azure Batch provides an HPC environment in Azure for distributed tasks. It handles the scheduling of discrete jobs and tasks across pools of VMs, and is commonly used for batch processing tasks such as rendering.

The Virtual Kubelet integration allows you to take advantage of this from within Kubernetes. The primary use case for the provider is to make it easy to run GPU-based workloads from normal Kubernetes clusters: for example, creating Kubernetes Jobs which train or execute ML models using Nvidia GPUs, or which use FFMPEG.

Azure Batch also supports low-priority nodes, which can help to reduce cost for non-time-sensitive workloads.

The ACI provider is the best option unless you're looking to utilise some specific features of Azure Batch.

Status: Experimental

This provider is currently in the experimental stages. Contributions welcome!

Quick Start

The following Terraform template deploys an AKS cluster with the Virtual Kubelet, an Azure Batch account and a GPU-enabled Azure Batch pool. The Batch pool contains 1 dedicated NC6 node and 2 low-priority NC6 nodes.

Set up Terraform for Azure following this guide here
From the command line, move to the deployment folder with cd ./providers/azurebatch/deployment, then edit vars.example.tfvars, adding in your Service Principal details
Download the latest version of the Community Kubernetes Provider for Terraform (the current official Terraform K8s provider doesn't support Deployments). Get the correct link from here and use it as follows:

curl -L -o - PUT_RELEASE_BINARY_LINK_YOU_FOUND_HERE | gunzip > terraform-provider-kubernetes
chmod +x ./terraform-provider-kubernetes

Use terraform init to initialize the template
Use terraform plan -var-file=./vars.example.tfvars and terraform apply -var-file=./vars.example.tfvars to deploy the template
Run kubectl describe deployment/vkdeployment to check that the virtual kubelet is running correctly
Run kubectl create -f examplegpupod.yaml
Run pods=$(kubectl get pods --selector=app=examplegpupod --show-all --output=jsonpath={.items..metadata.name}) then kubectl logs $pods to view the logs. You should see:

	[Vector addition of 50000 elements]
	Copy input data from the host memory to the CUDA device
	CUDA kernel launch with 196 blocks of 256 threads
	Copy output data from the CUDA device to the host memory
	Test PASSED
	Done

Tweaking the Quickstart

You can update main.tf to increase the number of nodes allocated to the Azure Batch pool or update ./aks/main.tf to increase the number of agent nodes allocated to your AKS cluster.

Advanced Setup

Prerequisites

A configured Azure Batch account
An Azure Batch pool created with the necessary VM spec. VMs in the pool must have:
- docker installed and correctly configured
- nvidia-docker and cuda drivers installed
A Kubernetes cluster
An Azure Service Principal with access to the Azure Batch account

Setup

The provider expects the following environment variables to be configured:

	ClientID:         AZURE_CLIENT_ID
	ClientSecret:     AZURE_CLIENT_SECRET
	ResourceGroup:    AZURE_RESOURCE_GROUP
	SubscriptionID:   AZURE_SUBSCRIPTION_ID
	TenantID:         AZURE_TENANT_ID
	PoolID:           AZURE_BATCH_POOLID
	JobID (optional): AZURE_BATCH_JOBID
	AccountLocation:  AZURE_BATCH_ACCOUNT_LOCATION
	AccountName:      AZURE_BATCH_ACCOUNT_NAME

Running

The provider assigns pods to machines in the Azure Batch pool. By default, each machine can process only one pod at a time; running more than one pod per machine isn't currently supported and will result in errors.

Azure Batch queues tasks when no machines are available, so pods will sit in the PodPending state while waiting for a VM to become available.
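
The one-pod-per-machine behaviour can be pictured with a toy model. The sketch below is not the provider's scheduling code (Azure Batch does the real queuing); the function assign and the pod names are made up for illustration. It only shows which pods would be running and which would still be pending for a given number of machines.

```go
package main

import "fmt"

// assign is a toy model: each machine takes exactly one pod, in submission
// order, and any remaining pods wait as pending until a machine frees up.
func assign(pods []string, machines int) (running, pending []string) {
	for i, p := range pods {
		if i < machines {
			running = append(running, p)
		} else {
			pending = append(pending, p)
		}
	}
	return running, pending
}

func main() {
	// Three machines, as in the Quick Start pool (1 dedicated + 2 low priority).
	running, pending := assign([]string{"train-a", "train-b", "train-c", "train-d"}, 3)
	fmt.Println("running:", running)
	fmt.Println("pending:", pending)
}
```

With four pods and three machines, the fourth pod is the one left pending.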

Documentation ¶

Index ¶

type Config
type ConfigError
- func (e *ConfigError) Error() string
type Provider
- func NewBatchProvider(configString string, rm *manager.ResourceManager, ...) (*Provider, error)
- func NewBatchProviderFromConfig(config *Config, rm *manager.ResourceManager, nodeName, operatingSystem string, ...) (*Provider, error)

Types ¶

type Config ¶

type Config struct {
	ClientID        string
	ClientSecret    string
	SubscriptionID  string
	TenantID        string
	ResourceGroup   string
	PoolID          string
	JobID           string
	AccountName     string
	AccountLocation string
}

Config holds the basic Azure configuration used to interact with ARM resources.

type ConfigError ¶

type ConfigError struct {
	CurrentConfig *Config
	ErrorDetails  string
}

ConfigError is the error returned when reading configuration values fails.

func (*ConfigError) Error ¶

func (e *ConfigError) Error() string

type Provider ¶

type Provider struct {
	// contains filtered or unexported fields
}

Provider is the base struct for the Azure Batch provider.

func NewBatchProvider ¶

func NewBatchProvider(configString string, rm *manager.ResourceManager, nodeName, operatingSystem string, internalIP string, daemonEndpointPort int32) (*Provider, error)

NewBatchProvider creates a Batch provider from configString.

func NewBatchProviderFromConfig ¶

func NewBatchProviderFromConfig(config *Config, rm *manager.ResourceManager, nodeName, operatingSystem string, internalIP string, daemonEndpointPort int32) (*Provider, error)

NewBatchProviderFromConfig creates a Batch provider from an existing Config.

func (*Provider) Capacity ¶

func (p *Provider) Capacity(ctx context.Context) v1.ResourceList

Capacity returns a resource list containing the capacity limits

func (*Provider) CreatePod ¶

func (p *Provider) CreatePod(ctx context.Context, pod *v1.Pod) error

CreatePod accepts a Pod definition

func (*Provider) DeletePod ¶

func (p *Provider) DeletePod(ctx context.Context, pod *v1.Pod) error

DeletePod accepts a Pod definition

func (*Provider) ExecInContainer ¶

func (p *Provider) ExecInContainer(name string, uid types.UID, container string, cmd []string, in io.Reader, out, err io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize, timeout time.Duration) error

ExecInContainer executes a command in a container in the pod, copying data between in/out/err and the container's stdin/stdout/stderr. Not yet implemented (marked TODO in the source).

func (*Provider) GetContainerLogs ¶

func (p *Provider) GetContainerLogs(ctx context.Context, namespace, podName, containerName string, tail int) (string, error)

GetContainerLogs returns the logs of a container running in a pod by name.

func (*Provider) GetPod ¶

func (p *Provider) GetPod(ctx context.Context, namespace, name string) (*v1.Pod, error)

GetPod returns a pod by name

func (*Provider) GetPodFullName ¶

func (p *Provider) GetPodFullName(namespace string, pod string) string

GetPodFullName returns the full pod name as defined in the provider context. Not yet implemented (marked TODO in the source).

func (*Provider) GetPodStatus ¶

func (p *Provider) GetPodStatus(ctx context.Context, namespace, name string) (*v1.PodStatus, error)

GetPodStatus retrieves the status of a given pod by name.

func (*Provider) GetPods ¶

func (p *Provider) GetPods(ctx context.Context) ([]*v1.Pod, error)

GetPods retrieves a list of all pods scheduled to run.

func (*Provider) NodeAddresses ¶

func (p *Provider) NodeAddresses(ctx context.Context) []v1.NodeAddress

NodeAddresses returns a list of addresses for the node status within Kubernetes.

func (*Provider) NodeConditions ¶

func (p *Provider) NodeConditions(ctx context.Context) []v1.NodeCondition

NodeConditions returns a list of conditions (Ready, OutOfDisk, etc), for updates to the node status within Kubernetes.

func (*Provider) NodeDaemonEndpoints ¶

func (p *Provider) NodeDaemonEndpoints(ctx context.Context) *v1.NodeDaemonEndpoints

NodeDaemonEndpoints returns NodeDaemonEndpoints for the node status within Kubernetes.

func (*Provider) OperatingSystem ¶

func (p *Provider) OperatingSystem() string

OperatingSystem returns the operating system for this provider.

func (*Provider) UpdatePod ¶

func (p *Provider) UpdatePod(ctx context.Context, pod *v1.Pod) error

UpdatePod accepts a Pod definition
