gke-tpu-env-injector

command module

v0.5.0 (latest) Published: Sep 5, 2023 License: MIT

Repository

github.com/abatilo/gke-tpu-env-injector

README

gke-tpu-env-injector

Automatically inject the environment variables used by libtpu when running TPUs on GKE.

On August 31, 2023, Google officially released support for running their TPU VMs (v4 and v5e) on Google Kubernetes Engine.

The tpu_driver application, together with the accompanying TPU device driver that gets installed on GKE clusters with TPU support enabled, requires two environment variables to be available to your applications when you run them: TPU_WORKER_ID and TPU_WORKER_HOSTNAMES.

Taken directly from the GCP documentation:

TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique
worker-id in the TPU slice. The supported values for this field range from zero
to the number of Pods minus one.

TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses
that need to communicate with each other within the slice. There should be a
hostname or IP address for each TPU VM in the slice. The list of IP addresses or
hostnames are ordered and zero indexed by the TPU_WORKER_ID.

Together, these two variables impose two requirements: you must be able to inject a different TPU_WORKER_ID into each instance of your application, and TPU_WORKER_HOSTNAMES must contain an individually addressable DNS name for each worker. Each worker represents one piece of a TPU PodSlice.
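As an illustration of how the two variables relate, the following minimal sketch shows how a worker could look up its own entry in the ordered hostname list. The function name `selfHostname` is hypothetical and not part of gke-tpu-env-injector or libtpu; it only demonstrates the zero-indexed relationship described in the quoted documentation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// selfHostname (hypothetical helper) returns the entry of the ordered,
// zero-indexed TPU_WORKER_HOSTNAMES list that belongs to the worker
// identified by TPU_WORKER_ID.
func selfHostname(workerID, hostnames string) (string, error) {
	id, err := strconv.Atoi(workerID)
	if err != nil {
		return "", fmt.Errorf("invalid TPU_WORKER_ID %q: %w", workerID, err)
	}
	hosts := strings.Split(hostnames, ",")
	if id < 0 || id >= len(hosts) {
		return "", fmt.Errorf("TPU_WORKER_ID %d is out of range for %d hostnames", id, len(hosts))
	}
	return strings.TrimSpace(hosts[id]), nil
}

func main() {
	host, err := selfHostname(os.Getenv("TPU_WORKER_ID"), os.Getenv("TPU_WORKER_HOSTNAMES"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(host)
}
```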

Conveniently, GKE will automatically inject these environment variables into pods for you, but only under very specific conditions:

GKE automatically injects these environment variables by using a mutating
webhook when a Job is created with the completionMode: Indexed, subdomain,
parallelism > 1, and requesting google.com/tpu properties.

However, what if you're not launching a Kubernetes Job at all? What if you have your own applications to launch that still need these environment variables? That's what gke-tpu-env-injector is for.

gke-tpu-env-injector will do this same environment variable injection for Kubernetes StatefulSets, which can also leverage a Kubernetes headless service, in order to get predictable, addressable individual pod DNS addresses.
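To make the StatefulSet approach concrete, here is a minimal sketch of how both values could be derived from StatefulSet naming conventions. It relies on two Kubernetes behaviors: StatefulSet pods are named `<statefulset>-<ordinal>`, and a headless service gives each pod a DNS name of the form `<pod>.<service>` within the same namespace. The function names are hypothetical and this is not necessarily how gke-tpu-env-injector computes the values internally.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// workerID (hypothetical helper) extracts the ordinal that the StatefulSet
// controller appends to each pod name, e.g. "tpu-worker-2" has ordinal 2.
func workerID(podName string) (int, error) {
	i := strings.LastIndex(podName, "-")
	if i < 0 || i == len(podName)-1 {
		return 0, fmt.Errorf("pod name %q has no ordinal suffix", podName)
	}
	id, err := strconv.Atoi(podName[i+1:])
	if err != nil {
		return 0, fmt.Errorf("pod name %q has a non-numeric suffix: %w", podName, err)
	}
	return id, nil
}

// workerHostnames (hypothetical helper) builds the comma-separated list of
// per-pod DNS names, ordered by ordinal. The short "<pod>.<service>" form is
// an assumption that all workers live in the same namespace.
func workerHostnames(statefulSet, service string, replicas int) string {
	if replicas < 0 {
		replicas = 0
	}
	hosts := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		hosts = append(hosts, fmt.Sprintf("%s-%d.%s", statefulSet, i, service))
	}
	return strings.Join(hosts, ",")
}

func main() {
	id, err := workerID("tpu-worker-2")
	if err != nil {
		panic(err)
	}
	fmt.Println("TPU_WORKER_ID:", id)
	fmt.Println("TPU_WORKER_HOSTNAMES:", workerHostnames("tpu-worker", "tpu-workers", 4))
}
```

Because the ordinal in the pod name and the position in the hostname list come from the same index, the list stays ordered and zero-indexed by TPU_WORKER_ID, as the GCP documentation requires.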

gke-tpu-env-injector does this through the Kubernetes-native MutatingAdmissionWebhook functionality, which will intercept all scheduled StatefulSets and Pods that are annotated with gke-tpu-env-injector.aaronbatilo.dev/inject: enabled.
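The general shape of such a mutation can be sketched as follows: check the annotation, then answer with a JSON Patch (RFC 6902) that appends the two variables to a container. This is an illustrative sketch using only the standard library, not the project's actual handler. The type and function names are hypothetical, and the patch path assumes the target container already has an `env` array.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation key taken from the README; the value must be "enabled".
const injectAnnotation = "gke-tpu-env-injector.aaronbatilo.dev/inject"

// patchOp is a single RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value envVar `json:"value"`
}

// envVar mirrors the name/value shape of a Kubernetes container env entry.
type envVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// shouldInject (hypothetical helper) reports whether the object opted in.
func shouldInject(annotations map[string]string) bool {
	return annotations[injectAnnotation] == "enabled"
}

// envPatch (hypothetical helper) builds a patch that appends both variables
// to the container at containerIndex. The "/-" suffix appends to an array,
// so this assumes the container already defines an env array.
func envPatch(containerIndex int, workerID, hostnames string) ([]byte, error) {
	path := fmt.Sprintf("/spec/containers/%d/env/-", containerIndex)
	ops := []patchOp{
		{Op: "add", Path: path, Value: envVar{Name: "TPU_WORKER_ID", Value: workerID}},
		{Op: "add", Path: path, Value: envVar{Name: "TPU_WORKER_HOSTNAMES", Value: hostnames}},
	}
	return json.Marshal(ops)
}

func main() {
	annotations := map[string]string{injectAnnotation: "enabled"}
	if !shouldInject(annotations) {
		return
	}
	patch, err := envPatch(0, "0", "tpu-worker-0.tpu-workers,tpu-worker-1.tpu-workers")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

In a real webhook, a patch like this would be returned to the API server inside an AdmissionReview response with the patch type set to JSONPatch.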

Getting started

To install gke-tpu-env-injector, we've provided a Helm chart that's hosted on the GitHub Container Registry as an OCI artifact:

helm upgrade --install gke-tpu-env-injector oci://ghcr.io/abatilo/gke-tpu-env-injector --set cert-manager.enabled=true

Setting cert-manager.enabled=true will create a certificate authority for self-signed certificates, request the required TLS certificates from cert-manager, and mount them so that gke-tpu-env-injector can receive encrypted webhooks from the Kubernetes control plane.

Configuration

| CLI flag | Environment variable | Description | Default |
| --- | --- | --- | --- |
| `--tls-cert-file` | `GTEI_TLS_CERT_FILE` | The path to the file containing the default x509 certificate for HTTPS. | `/etc/tls/tls.crt` |
| `--tls-key-file` | `GTEI_TLS_KEY_FILE` | The path to the file containing the default x509 private key matching `--tls-cert-file`. | `/etc/tls/tls.key` |
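For readers who want to see how a flag with an environment-variable fallback typically behaves, here is a minimal sketch using Go's standard `flag` package. The precedence shown (flag, then environment variable, then default) is an assumption, and the project itself may use a different CLI library. The `parseConfig` and `config` names are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// config holds the two settings from the table above.
type config struct {
	TLSCertFile string
	TLSKeyFile  string
}

// parseConfig (hypothetical helper) resolves each setting from the CLI flag
// first, then the environment variable, then the documented default.
// getenv is injected so the lookup can be replaced in tests.
func parseConfig(args []string, getenv func(string) string) (config, error) {
	envOr := func(key, fallback string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return fallback
	}

	var c config
	fs := flag.NewFlagSet("gke-tpu-env-injector", flag.ContinueOnError)
	fs.StringVar(&c.TLSCertFile, "tls-cert-file",
		envOr("GTEI_TLS_CERT_FILE", "/etc/tls/tls.crt"),
		"path to the x509 certificate for HTTPS")
	fs.StringVar(&c.TLSKeyFile, "tls-key-file",
		envOr("GTEI_TLS_KEY_FILE", "/etc/tls/tls.key"),
		"path to the x509 private key matching --tls-cert-file")
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	return c, nil
}

func main() {
	c, err := parseConfig(os.Args[1:], os.Getenv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("cert=%s key=%s\n", c.TLSCertFile, c.TLSKeyFile)
}
```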
