zeropod - pod that scales down to zero
Zeropod is a Kubernetes runtime (more specifically a containerd shim) that automatically checkpoints containers to disk after a configurable amount of time has passed since the last TCP connection. While scaled down, it listens on the same port the application inside the container was listening on and restores the container on the first incoming connection. Depending on the memory size of the checkpointed program, this happens within tens to a few hundred milliseconds, virtually unnoticeable to the user. As all memory contents are stored to disk during checkpointing, the full state of the application is preserved across a restore.
Use-cases
Only time will tell how useful zeropod actually is. Some made-up use-cases that could work are:
- Low traffic sites
- Dev/Staging environments
- "Mostly static" sites that still need some server component
- Hopefully more to be found
How it works
First off, what is this containerd shim? The shim sits between containerd and the container sandbox. Each container gets such a long-running shim process, which calls out to runc to manage the container's lifecycle.
(diagram: containerd architecture)
There are several components that make zeropod work but here are the most important ones:
- Checkpointing is done using CRIU.
- After checkpointing, a userspace TCP proxy (activator) is created on a random port and an eBPF program is loaded to redirect packets destined to the checkpointed container to the activator. The activator then accepts the connection, restores the process, signals to disable the eBPF redirect and then proxies the initial request(s) to the restored application. See activation sequence for more details.
- All subsequent connections go directly to the application without any proxying or performance impact.
- An eBPF probe is used to track the last TCP activity on the running application. This lets zeropod delay checkpointing while there is recent activity and avoids excessive flapping of a frequently used service.
- To the container runtime (e.g. Kubernetes), the container appears to be running even though the process is technically not. This is required to prevent the runtime from trying to restart the container.
- When running kubectl exec on a scaled-down container, it is restored and the exec works just as with any normal Kubernetes container.
- Metrics are recorded continuously within each shim. The zeropod-manager process, which runs once per node (DaemonSet), is responsible for collecting and merging the metrics from the different shim processes. Each shim exposes a unix socket for the manager to connect to, and the manager exposes the merged metrics on an HTTP endpoint.
Activation sequence
This diagram shows what happens when a user initiates a connection to a checkpointed container.
sequenceDiagram
actor User
participant Redirector
participant Activator
participant Container
Note over Container: checkpointed
Note over Activator: listening on port 41234
User->>Redirector: TCP connect to port 80
Note right of User: local port 12345
Redirector->>Redirector: redirect to port 41234
Redirector->>Activator: TCP connect
Activator->>Activator: TCP accept
Activator->>Container: restore
loop every millisecond
Activator->>Container: TCP connect to port 80
end
Note over Container: restored
Container-->>Activator: TCP accept
Activator-->>Redirector: TCP accept
Redirector-->>Redirector: redirect to port 12345
Redirector-->>User: TCP accept
Note right of User: connection between user<br>and container established
User->>Container: TCP connect to port 80
Note over Redirector: pass
Container-->>User: TCP accept
Note over Redirector: pass
Compatibility
Most programs should just work with zeropod out of the box. The examples directory contains a variety of software that has been tested successfully. If something fails, the containerd logs can be useful for figuring out what went wrong, as the CRIU log is output on checkpoint/restore failures. What has proven somewhat flaky at times are arm64 workloads running in a Linux VM on top of macOS. If you run into any issues with your software, please don't hesitate to create an issue.
Getting started
Requirements
- Kubernetes v1.23+
- Containerd 1.6+
As zeropod is implemented using a runtime class, it needs to install binaries to your cluster nodes (by default in /opt/zeropod) and also configure Containerd to load the shim. If you are trying this out for the first time, it's probably best to use a kind cluster or something similar that you can quickly set up and delete again. Zeropod uses a DaemonSet called zeropod-node for installing components on the node itself; it also runs the manager component, which attaches the eBPF programs and collects metrics.
Installation
The config directory comes with a few predefined manifests for use with different Kubernetes distributions.
⚠️ The installer will restart the Containerd systemd service on each targeted node on the first install to load the config changes that are required for the zeropod shim to load. This is usually non-disruptive as Containerd is designed to be restarted without affecting any workloads.
# install zeropod runtime and manager
# "default" installation:
kubectl apply -k https://github.com/ctrox/zeropod/config/production
# GKE:
kubectl apply -k https://github.com/ctrox/zeropod/config/gke
⚠️⚠️⚠️ For k3s and rke2, the initial installation needs to restart the k3s/k3s-agent or rke2-server/rke2-agent services, since it's not possible to just restart Containerd. This will lead to restarts of other workloads on each targeted node.
# k3s:
kubectl apply -k https://github.com/ctrox/zeropod/config/k3s
# rke2:
kubectl apply -k https://github.com/ctrox/zeropod/config/rke2
By default, zeropod will only be installed on nodes with the label zeropod.ctrox.dev/node=true. So after applying the manifest, label the node(s) that should have it installed accordingly:
$ kubectl label node <node-name> zeropod.ctrox.dev/node=true
Once applied, check for node pod(s) in the zeropod-system namespace. If everything worked, they should be in status Running:
$ kubectl -n zeropod-system wait --for=condition=Ready pod -l app.kubernetes.io/name=zeropod-node
pod/zeropod-node-wgzrv condition met
Now you can create workloads which make use of zeropod.
# create an example pod which makes use of zeropod
kubectl apply -f https://github.com/ctrox/zeropod/config/examples/nginx.yaml
Depending on your cluster setup, none of the predefined configs might match yours. In this case you need to clone the repo and adjust the manifests in config/ accordingly. If your setup is common, a PR to add your configuration adjustments would be most welcome.
Uninstalling
To uninstall zeropod, apply the uninstall manifest to spawn a pod that does the cleanup on all labelled zeropod nodes. After all the uninstall pods have finished, you can delete all the manifests:
kubectl apply -k https://github.com/ctrox/zeropod/config/uninstall
kubectl -n zeropod-system wait --for=condition=Ready pod -l app.kubernetes.io/name=zeropod-node
kubectl delete -k https://github.com/ctrox/zeropod/config/production
Configuration
A pod can make use of zeropod only if the runtimeClassName is set to zeropod. See this minimal example of a pod:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  runtimeClassName: zeropod
  containers:
    - name: nginx
      image: nginx
      ports:
        - containerPort: 80
Then there are also a few optional annotations that can be set on the pod to tweak the behaviour of zeropod:
# container-names of containers in the pod that should be considered for
# scaling to zero. If empty all containers will be considered.
zeropod.ctrox.dev/container-names: "nginx,sidecar"
# ports-map configures the ports the to-be-scaled-down application(s) are
# listening on. As ports have to be matched with containers in a pod, the
# key is the container name and the value is a comma-delimited list of ports.
# Any TCP connection on one of these ports will restore the application.
# If omitted, zeropod will try to find the listening ports automatically;
# use this option in case that fails for your application.
zeropod.ctrox.dev/ports-map: "nginx=80,81;sidecar=8080"
# Configures how long to wait before scaling down again after the last
# connection. The duration is reset whenever a connection happens.
# Setting it to 0 means the application will be checkpointed as soon
# as possible after restore. Use with caution as this will cause lots
# of checkpoints/restores.
# Default is 1 minute.
zeropod.ctrox.dev/scaledown-duration: 10s
# Execute a pre-dump before the full checkpoint and process stop. This can
# reduce the checkpoint time in some cases but testing has shown that it also
# has a small impact on restore time so YMMV. The default is false.
# See https://criu.org/Memory_changes_tracking for details on what this does.
zeropod.ctrox.dev/pre-dump: "true"
# Disable checkpointing completely. This option was introduced for testing
# purposes to measure how fast some applications can be restored from a complete
# restart instead of from memory images. If enabled, the process will be
# killed on scale-down and all state is lost. This might be useful for some
# use-cases where the application is stateless and super fast to start up.
zeropod.ctrox.dev/disable-checkpointing: "true"
# Experimental:
# It's possible to reduce the resource usage further by grouping multiple pods
# into one shim process. The value of the annotation specifies the group id,
# each of which will result in a shim process. This is currently marked as
# experimental since not much testing has been done and new issues might
# surface when using grouping.
io.containerd.runc.v2.group: "zeropod"
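Putting it together, a pod that uses some of these annotations could look like the following sketch. It simply extends the minimal example above; the annotation values are illustrative, not recommendations.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
  annotations:
    # only scale down the nginx container, restore on TCP connections to port 80
    zeropod.ctrox.dev/container-names: "nginx"
    zeropod.ctrox.dev/ports-map: "nginx=80"
    # wait 30s after the last connection before checkpointing again
    zeropod.ctrox.dev/scaledown-duration: 30s
spec:
  runtimeClassName: zeropod
  containers:
    - name: nginx
      image: nginx
      ports:
        - containerPort: 80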
zeropod-node
The zeropod-node DaemonSet is scheduled on every node labelled zeropod.ctrox.dev/node=true. The individual components of the node daemon are documented in this section.
Installer
The installer runs as an init-container; it executes the binary cmd/installer/main.go with some distro-specific options to install the runtime binaries, configure containerd and register the RuntimeClass.
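For reference, the RuntimeClass the installer registers is what pods select via runtimeClassName. A minimal sketch of such an object, assuming the containerd handler is also named zeropod, would look like this:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: zeropod
# the handler must match the runtime name configured in containerd;
# "zeropod" is an assumption based on the runtimeClassName used above
handler: zeropod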
Manager
The manager component starts after the installer init-container has succeeded. It provides functionality that is needed at the node level and would otherwise bloat the shim. For example, loading eBPF programs can be quite memory intensive, so it has been moved from the shim to the manager to keep the shim's memory usage as minimal as possible.
These are the responsibilities of the manager:
- Load the eBPF programs that the shim(s) rely on.
- Collect metrics from all shim processes and expose them on HTTP for scraping.
- Subscribe to shim scaling events and adjust pod resource requests.
In-place Resource scaling (Experimental)
This makes use of the feature flag InPlacePodVerticalScaling to automatically update the pod resource requests to a minimum on scale-down events and revert them again on scale-up. Once the Kubernetes feature flag is enabled, it also needs to be enabled using the manager flag -in-place-scaling=true, and some additional permissions are required for the node driver to patch pods. To deploy this, simply uncomment the in-place-scaling component in config/production/kustomization.yaml. This will add the flag and the required permissions when building the kustomization.
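Assuming a standard kustomize layout, the relevant part of config/production/kustomization.yaml would then look roughly like this (the exact component path is whatever the commented entry in the file already contains):
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# ...existing resources of the production kustomization...
components:
  # hypothetical path; adds -in-place-scaling=true and the extra RBAC
  - ../in-place-scaling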
Status Labels
To reflect the container scaling status in the k8s API, the manager can set status labels on a pod. This requires the flag -status-labels=true, which is set by default in the production deployment.
The resulting labels have the following structure:
status.zeropod.ctrox.dev/<container name>: <container status>
So if our pod has two containers, one of them running and one in scaled-down state, the labels would be set like this:
labels:
  status.zeropod.ctrox.dev/container1: RUNNING
  status.zeropod.ctrox.dev/container2: SCALED_DOWN
Flags
-metrics-addr=":8080" sets the address of the metrics server
-debug enables debug logging
-in-place-scaling=false enable in-place resource scaling, requires InPlacePodVerticalScaling feature flag
-status-labels=false update pod labels to reflect container status
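These flags are set as arguments of the manager container in the zeropod-node DaemonSet. As a rough sketch (the container name is an assumption, the flags are the ones listed above):
# excerpt of the DaemonSet pod template
containers:
  - name: manager   # assumed name, check the actual manifest
    args:
      - -metrics-addr=:8080
      - -status-labels=true
      - -in-place-scaling=false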
Metrics
The zeropod-node pod exposes metrics on 0.0.0.0:8080/metrics in Prometheus format on each installed node. The metrics address can be configured with the -metrics-addr flag. The following metrics are currently available:
# HELP zeropod_checkpoint_duration_seconds The duration of the last checkpoint in seconds.
# TYPE zeropod_checkpoint_duration_seconds histogram
zeropod_checkpoint_duration_seconds_bucket{container="nginx",namespace="default",pod="nginx",le="+Inf"} 3
zeropod_checkpoint_duration_seconds_sum{container="nginx",namespace="default",pod="nginx"} 0.749254206
zeropod_checkpoint_duration_seconds_count{container="nginx",namespace="default",pod="nginx"} 3
# HELP zeropod_last_checkpoint_time A unix timestamp in nanoseconds of the last checkpoint.
# TYPE zeropod_last_checkpoint_time gauge
zeropod_last_checkpoint_time{container="nginx",namespace="default",pod="nginx"} 1.688065891505882e+18
# HELP zeropod_last_restore_time A unix timestamp in nanoseconds of the last restore.
# TYPE zeropod_last_restore_time gauge
zeropod_last_restore_time{container="nginx",namespace="default",pod="nginx"} 1.688065880496497e+18
# HELP zeropod_restore_duration_seconds The duration of the last restore in seconds.
# TYPE zeropod_restore_duration_seconds histogram
zeropod_restore_duration_seconds_bucket{container="nginx",namespace="default",pod="nginx",le="+Inf"} 4
zeropod_restore_duration_seconds_sum{container="nginx",namespace="default",pod="nginx"} 0.684013211
zeropod_restore_duration_seconds_count{container="nginx",namespace="default",pod="nginx"} 4
# HELP zeropod_running Reports if the process is currently running or checkpointed.
# TYPE zeropod_running gauge
zeropod_running{container="nginx",namespace="default",pod="nginx"} 0
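To scrape these with Prometheus, a minimal static config could look like the sketch below; the target address is a placeholder for a node's IP, and in a real cluster you would more likely use Kubernetes service discovery to find the zeropod-node pods:
scrape_configs:
  - job_name: zeropod
    static_configs:
      # placeholder node address; the manager listens on :8080 by default
      - targets: ["10.0.0.10:8080"]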
Development
For iterating on shim development it's recommended to use kind. Once installed and a cluster has been created (kind create cluster --config=e2e/kind.yaml), run make install-kind to build and install everything on the kind cluster. After making code changes, the fastest way to update the shim is make build-kind, since this will only build the binary and copy the updated binary to the cluster.
Developing on an M1+ Mac
It can be a bit hard to get this running on an ARM Mac. First off, the shim itself does not run on macOS at all, as it requires Linux. But we can run it inside a kind cluster using a podman machine. One important thing to note is that the podman machine needs to run rootful, else checkpointing (CRIU) does not seem to work. Also, so far I have not been able to get this running with Docker Desktop.
Dependencies:
podman machine init --rootful
podman machine start
kind create cluster --config=e2e/kind.yaml
make install-kind
Now your kind cluster should have a working zeropod installation. The e2e tests can also be run, but it's a bit more involved than just running go test since that requires GOOS=linux. You can use make docker-test-e2e to run the e2e tests within a Docker container, so everything will be run on the Linux podman VM.