gke-tpu-env-injector
Automatically inject the environment variables used by libtpu when running TPUs on GKE.
On August 31, 2023, Google officially released support for running their TPU VMs (v4 and v5e) on Google Kubernetes Engine.
The tpu_driver application, together with the accompanying TPU device driver
application that gets installed to GKE clusters with TPU support enabled,
requires two environment variables to be available to your applications when
you run them: TPU_WORKER_ID and TPU_WORKER_HOSTNAMES.
Taken directly from the GCP documentation:
TPU_WORKER_ID: A unique integer for each Pod. This ID denotes a unique
worker-id in the TPU slice. The supported values for this field range from zero
to the number of Pods minus one.
TPU_WORKER_HOSTNAMES: A comma-separated list of TPU VM hostnames or IP addresses
that need to communicate with each other within the slice. There should be a
hostname or IP address for each TPU VM in the slice. The list of IP addresses or
hostnames are ordered and zero indexed by the TPU_WORKER_ID.
These two environment variables require that you can dynamically inject the
TPU_WORKER_ID into each application, and that TPU_WORKER_HOSTNAMES contains
individually addressable DNS names for each specific worker, which will
represent the pieces of a TPU PodSlice.
Conveniently, GKE will automatically inject these environment variables into pods for you, but it only does so under very specific conditions:
GKE automatically injects these environment variables by using a mutating
webhook when a Job is created with the completionMode: Indexed, subdomain,
parallelism > 1, and requesting google.com/tpu properties.
However, what if you're not launching a Kubernetes Job at all? What if you
have your own applications to launch that still need these environment
variables? That's what gke-tpu-env-injector is for.
gke-tpu-env-injector performs the same environment variable injection for
Kubernetes StatefulSets, which can also leverage a Kubernetes headless
service in order to get predictable, individually addressable pod DNS names.
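To illustrate why a StatefulSet behind a headless service fits these two variables, here is a minimal Go sketch. It is not taken from the gke-tpu-env-injector source; it only assumes the standard Kubernetes behavior that StatefulSet pods are named `<statefulset>-<ordinal>` and are resolvable as `<pod>.<service>` within the same namespace. The function names and the `tpu-worker` / `tpu-svc` values are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// workerID extracts the ordinal that a StatefulSet appends to each pod name,
// for example "tpu-worker-2" yields 2. (Hypothetical helper for illustration.)
func workerID(podName string) (int, error) {
	i := strings.LastIndex(podName, "-")
	if i < 0 || i == len(podName)-1 {
		return 0, fmt.Errorf("pod name %q has no ordinal suffix", podName)
	}
	return strconv.Atoi(podName[i+1:])
}

// workerHostnames builds a comma-separated list of per-pod DNS names,
// ordered and zero-indexed by ordinal, as TPU_WORKER_HOSTNAMES expects.
// The "<pod>.<service>" form assumes all pods share one namespace.
func workerHostnames(statefulSet, service string, replicas int) string {
	hosts := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		hosts = append(hosts, fmt.Sprintf("%s-%d.%s", statefulSet, i, service))
	}
	return strings.Join(hosts, ",")
}

func main() {
	id, err := workerID("tpu-worker-2")
	if err != nil {
		panic(err)
	}
	fmt.Printf("TPU_WORKER_ID=%d\n", id)
	fmt.Printf("TPU_WORKER_HOSTNAMES=%s\n", workerHostnames("tpu-worker", "tpu-svc", 4))
}
```

Because the ordinal is stable across pod restarts, the worker ID and the position of each hostname in the list stay consistent for the lifetime of the StatefulSet.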
gke-tpu-env-injector does this through the Kubernetes native
MutatingAdmissionWebhook functionality, which will intercept all scheduled
StatefulSets and Pods that are annotated with
gke-tpu-env-injector.aaronbatilo.dev/inject: enabled.
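The opt-in check described above can be sketched in Go as follows. This is an illustration under assumptions, not the project's actual code: the helper name `shouldInject` is hypothetical, and treating the value as an exact, case-sensitive match on `enabled` is an assumption.

```go
package main

import "fmt"

// injectAnnotation is the annotation key named in this README.
const injectAnnotation = "gke-tpu-env-injector.aaronbatilo.dev/inject"

// shouldInject reports whether an object's annotations opt it in to
// environment variable injection. Exact matching on "enabled" is an
// assumption made for this sketch.
func shouldInject(annotations map[string]string) bool {
	return annotations[injectAnnotation] == "enabled"
}

func main() {
	optedIn := map[string]string{injectAnnotation: "enabled"}
	optedOut := map[string]string{"app": "trainer"}

	fmt.Println(shouldInject(optedIn))
	fmt.Println(shouldInject(optedOut))
}
```

Reading from a nil or empty map is safe in Go, so objects without any annotations are simply skipped.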
Getting started
To install gke-tpu-env-injector, we've provided a helm chart that's hosted on
the GitHub Container Registry as an OCI artifact:
helm upgrade --install gke-tpu-env-injector oci://ghcr.io/abatilo/gke-tpu-env-injector --set cert-manager.enabled=true
Setting cert-manager.enabled=true creates a certificate authority for
self-signed certificates and requests the required TLS certificates from
cert-manager. The certificates are then mounted so that gke-tpu-env-injector
can receive encrypted webhook requests from the Kubernetes control plane.
Configuration
| CLI flag | Environment variable | Description | Default |
|---|---|---|---|
| --tls-cert-file | GTEI_TLS_CERT_FILE | The path to the file containing the default x509 certificate for HTTPS. | /etc/tls/tls.crt |
| --tls-key-file | GTEI_TLS_KEY_FILE | The path to the file containing the default x509 private key matching --tls-cert-file. | /etc/tls/tls.key |
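
One common way to wire a CLI flag to an environment variable with a default is sketched below in Go, using only the standard `flag` and `os` packages. The flag names, variable names, and defaults come from the table above; the helpers `envOr` and `parseConfig`, and the precedence order (flag, then environment variable, then default), are assumptions for illustration and may differ from how gke-tpu-env-injector actually resolves its settings.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the value of the environment variable key, or fallback
// when the variable is unset or empty. (Hypothetical helper.)
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

// parseConfig resolves the TLS file paths. An explicit flag wins over the
// environment variable, which wins over the built-in default; this
// precedence is an assumption of the sketch.
func parseConfig(args []string) (certFile, keyFile string, err error) {
	fs := flag.NewFlagSet("gke-tpu-env-injector", flag.ContinueOnError)
	cert := fs.String("tls-cert-file", envOr("GTEI_TLS_CERT_FILE", "/etc/tls/tls.crt"),
		"path to the x509 certificate for HTTPS")
	key := fs.String("tls-key-file", envOr("GTEI_TLS_KEY_FILE", "/etc/tls/tls.key"),
		"path to the x509 private key matching --tls-cert-file")
	if err := fs.Parse(args); err != nil {
		return "", "", err
	}
	return *cert, *key, nil
}

func main() {
	certFile, keyFile, err := parseConfig(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("cert:", certFile)
	fmt.Println("key:", keyFile)
}
```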