coreos-reboot-operator

v0.0.0-...-2eff2a1 Published: Jun 9, 2017 License: GPL-3.0

Repository

github.com/jamiehannaford/coreos-reboot-operator

README

CoreOS reboot operator

NOTE: This codebase has been deprecated in favour of CoreOS's official operator.

A Kubernetes operator that manages the reboot cycle for CoreOS nodes. Normally, when a node self-updates, it waits to be rebooted before the changes take effect. Traditionally this reboot has been triggered either by manual intervention or by sync tools like locksmith. Although the latter works very well, it does not offer the full programmatic extensibility needed by some orgs that require high availability for their Kubernetes clusters.

This project was inspired by Aaron Levy's KubeCon talk and is heavily based on his demo controller repository. Although this project has been verified to work, it's still very much in alpha so it's advised to use this in dev environments only.

How it works

The operator is composed of two components: the controller which synchronizes the reboots, ensuring that the cluster will not be negatively impacted; and the agent DaemonSet, which listens out for reboot requests on systemd and performs the reboot itself.

This is the lifecycle of a reboot:

The update engine detects that a new update is available, then downloads and installs it. When the installation has completed, the engine signals this by setting its status to UPDATE_STATUS_UPDATED_NEED_REBOOT.
The agent listens on a DBus interface for this state change. When it detects that a reboot is needed, it tags the Kubernetes node with a reboot-needed annotation.
The controller uses an informer to fire hooks when node resources are updated. When the controller sees that a node is marked for reboot (i.e. it has the reboot-needed annotation), it performs a series of checks to make sure the operation is permitted. For example, it enforces a node quota, ensuring that only a specific number of nodes are rebooted at once. If these checks pass, it permits the operation to go ahead and marks the node as approved for reboot.
The agent also uses an informer to listen out for node state changes. Once the controller gives the green light, the agent cordons the Kubernetes node, preventing further pods from being scheduled on it. It then gracefully deletes the pods on the node. Once this is done, it sends a reboot command over DBus and the node is rebooted.
After the reboot, the agent re-marks the node as schedulable and removes any reboot annotations.
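
The controller's quota check in the lifecycle above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the project's actual code: the `Node` type is a simplified stand-in for a Kubernetes node object, the annotation keys are hypothetical, and `approveReboots` is an invented helper name. It assumes that an approved node keeps its approval annotation until the agent removes it after the reboot, so approved nodes count against the quota.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical annotation keys; the real operator's keys may differ.
const (
	annotationRebootNeeded   = "example.com/reboot-needed"
	annotationRebootApproved = "example.com/reboot-approved"
)

// Node is a simplified stand-in for a Kubernetes node object.
type Node struct {
	Name        string
	Annotations map[string]string
}

// approveReboots returns the names of nodes that may be approved now,
// so that at most maxConcurrent nodes are approved at the same time.
func approveReboots(nodes []Node, maxConcurrent int) []string {
	inFlight := 0
	var pending []string
	for _, n := range nodes {
		_, approved := n.Annotations[annotationRebootApproved]
		_, needed := n.Annotations[annotationRebootNeeded]
		switch {
		case approved:
			// Already approved: counts against the quota.
			inFlight++
		case needed:
			// Waiting for approval.
			pending = append(pending, n.Name)
		}
	}
	// Sort so the decision is deterministic regardless of input order.
	sort.Strings(pending)

	free := maxConcurrent - inFlight
	if free <= 0 {
		return nil
	}
	if free > len(pending) {
		free = len(pending)
	}
	return pending[:free]
}

func main() {
	nodes := []Node{
		{Name: "node-a", Annotations: map[string]string{
			annotationRebootNeeded:   "true",
			annotationRebootApproved: "true",
		}},
		{Name: "node-b", Annotations: map[string]string{annotationRebootNeeded: "true"}},
		{Name: "node-c", Annotations: map[string]string{annotationRebootNeeded: "true"}},
		{Name: "node-d"},
	}
	// With a quota of 2 and node-a already approved, one slot is free.
	fmt.Println(approveReboots(nodes, 2)) // prints "[node-b]"
}
```

In a real controller this decision would run inside the informer's update hook, and the result would be written back to the cluster as an annotation on each approved node.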

Further work

Allow better configuration through TPRs or ConfigMaps
Add some kind of E2E testing
Upgrade to client-go v3 when released
Support pod eviction if available
Improve pod filtering so that specific types are not force deleted

Prerequisites

The nodes must disable auto-reboots. You can do so by following the update strategy docs, or by disabling locksmith:

systemctl stop locksmithd

How to deploy

# Create reboot-operator ns
kubectl create -f manifests/namespace.yaml

# Create cluster roles and sa bindings
kubectl create -f manifests/cluster-role.yaml

# Create controller RS
kubectl create -f manifests/reboot-controller.yaml

# Create agent DS (manifest name assumed by analogy with reboot-controller.yaml)
kubectl create -f manifests/reboot-agent.yaml

Building

Build agent and controller binaries:

make clean all

Build agent and controller Docker images:

make clean images

Directories

Path	Synopsis
pkg
common
reboot-agent
reboot-controller
