Node Healthcheck Operator
Introduction
Hardware is imperfect, and software contains bugs. When node-level failures such
as kernel hangs or dead NICs occur, the work required from the cluster does not
decrease: workloads from affected nodes need to be restarted somewhere.
However, some workloads, such as RWO volumes and StatefulSets, may require
at-most-one semantics. Failures affecting these kinds of workloads risk data
loss and/or corruption if nodes (and the workloads running on them) are assumed
to be dead whenever we stop hearing from them. For this reason it is important
to know that the node has reached a safe state before initiating recovery of the
workload.
Unfortunately it is not always practical to require admin intervention to
confirm the node's true status. To automate the recovery of exclusive
workloads, the Medik8s project provides a collection of operators that can be installed on any
Kubernetes-based cluster to handle failure detection and fencing/remediation.
For more information, visit our homepage.
Failure detection with the Node Healthcheck operator
Handling unhealthy nodes
A Node that has been in an unready state for 5 minutes is an obvious sign that a
failure occurred. However, other criteria or thresholds may be
more appropriate for your particular physical environment, workloads,
and tolerance for risk.
The Node Healthcheck operator
checks each Node's set of NodeConditions
against the criteria and thresholds defined in NodeHealthCheck (NHC) custom
resources (CRs).
If the Node is deemed to be in a failed state, and remediation is appropriate,
the controller will instantiate a remediation custom resource based on the
remediation template(s) defined in the NHC CR. NHC can be configured with
a single remediation method, or with a list of remediation methods that are
tried one after another, in a specified order and each with its own timeout.
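As a rough, illustrative sketch (not part of this document), a NodeHealthCheck CR with a single remediation method might look like the following. The selector, the durations, and the Self Node Remediation template reference are assumptions; adjust them to your cluster and the remediator you actually installed, and verify the field names against the CRD version you are running:

```yaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-worker-default
spec:
  # Which nodes this check applies to (assumed: all worker nodes)
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  # Conditions that mark a node as unhealthy once they persist for `duration`
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
  # Reference to a remediation template provided by the chosen remediator
  # (here: Self Node Remediation; namespace and name are illustrative)
  remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    namespace: openshift-operators
    name: self-node-remediation-resource-deletion-template
```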
This template-based mechanism allows cluster admins to use the best remediator
for their environment, without NHC having to know about it beforehand. Remediators
might use e.g. Kubernetes' Cluster API, OpenShift's Machine API, a BMC, a watchdog,
or software-based reboots for fencing the workloads.
For more details see the remediation documentation.
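To illustrate combining more than one remediator, a hedged sketch of the escalating variant could look like this. The spec.escalatingRemediations field follows the NHC API as described above (ordered attempts, each with a timeout), but the two referenced templates, their namespaces, and the timeouts here are purely hypothetical and must be replaced with templates that actually exist on your cluster:

```yaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-escalating
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s
  # Remediation methods are tried in ascending `order`; if one does not
  # succeed within its `timeout`, the next one is attempted.
  escalatingRemediations:
    - order: 1
      timeout: 5m
      remediationTemplate:
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        namespace: openshift-operators
        name: self-node-remediation-resource-deletion-template
    - order: 2
      timeout: 10m
      remediationTemplate:
        # Hypothetical second remediator (e.g. BMC-based fencing)
        apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
        kind: FenceAgentsRemediationTemplate
        namespace: openshift-operators
        name: fence-agents-remediation-template
```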
When the Node recovers and becomes healthy again, NHC deletes the
remediation CR to signal that node recovery was successful.
Special cases
Control plane problems
Remediation is not always the correct response to a failure. Especially in
larger clusters, we want to protect against failures that appear to take out
large portions of compute capacity but are really the result of failures on or
near the control plane. For this reason, the NHC CR includes the ability to
define a minimum number of healthy nodes, as a percentage or an absolute number.
When the cluster falls short of this threshold, no further remediation
will be started.
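A minimal sketch of how this threshold might be expressed, assuming the spec.minHealthy field of the NodeHealthCheck CR (check the CRD installed on your cluster for the exact field name and its default value):

```yaml
# Excerpt of a NodeHealthCheck spec (illustrative only)
spec:
  # Remediation is only started while at least this many of the selected
  # nodes are healthy, expressed as a percentage of the selected nodes ...
  minHealthy: "51%"
  # ... or as an absolute number of nodes:
  # minHealthy: 3
```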
Cluster Upgrades
Cluster upgrades usually involve worker reboots, mainly to apply OS updates.
These nodes might be unhealthy for some time during these reboots.
This disruption can also cause other nodes to become overloaded and appear unhealthy
while they compensate for the lost compute capacity. Making remediation decisions
at such a moment may interfere with the upgrade and may even cause it to fail completely.
For that reason NHC stops remediating newly unhealthy nodes when it
detects that the cluster is upgrading.
At the moment this is only supported on OpenShift, by monitoring the
ClusterVersionOperator.
Manual pausing
Before running cluster upgrades on Kubernetes, or for any other reason, cluster
admins can prevent new remediations by pausing the NHC CR.
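As a hedged sketch, pausing might be expressed by adding entries to the spec.pauseRequests field of the NodeHealthCheck CR (an assumption; verify the field name against your installed CRD). While at least one entry exists, no new remediations are started; removing all entries resumes normal operation:

```yaml
# Excerpt of a NodeHealthCheck spec (illustrative only)
spec:
  pauseRequests:
    # Free-form reason strings; any non-empty list pauses new remediations,
    # while already running remediations continue.
    - "pausing for cluster upgrade, requested by the cluster admin"
```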
For more details about using or contributing to Node Healthcheck, check out our
docs.
Help
Please join our Google group to ask
questions. If you find a bug, please open an issue in this repository.