node-healthcheck-operator

command module

Version: v0.5.0-rc.1 (latest) | Published: Apr 5, 2023 | License: Apache-2.0 | Imports: 24 | Imported by: 0

Repository

github.com/medik8s/node-healthcheck-operator

README

Node Healthcheck Operator

Introduction

Hardware is imperfect, and software contains bugs. When node-level failures such as kernel hangs or dead NICs occur, the work required from the cluster does not decrease: workloads from affected nodes need to be restarted somewhere.

However, some workloads, such as RWO volumes and StatefulSets, may require at-most-one semantics. Failures affecting these kinds of workloads risk data loss and/or corruption if nodes (and the workloads running on them) are assumed to be dead whenever we stop hearing from them. For this reason it is important to know that the node has reached a safe state before initiating recovery of the workload.

Unfortunately, it is not always practical to require admin intervention to confirm the node's true status. To automate the recovery of exclusive workloads, the Medik8s project presents a collection of operators that can be installed on any Kubernetes-based cluster to automate failure detection and fencing / remediation. For more information, visit our homepage.

Failure detection with the Node Healthcheck operator

Handling unhealthy nodes

A Node that has been in an unready state for 5 minutes is an obvious sign that a failure occurred. However, there may be other criteria or thresholds that are more appropriate based on your particular physical environment, workloads, and tolerance for risk.

The Node Healthcheck operator checks each Node's set of NodeConditions against the criteria and thresholds defined in NodeHealthCheck (NHC) custom resources (CRs).

If the Node is deemed to be in a failed state, and remediation is appropriate, the controller will instantiate a remediation custom resource based on the remediation template(s) defined in the NHC CR. NHC can be configured with a single remediation method, or with a list of remediation methods that are used one after another in a specified order, each with a timeout.

This template-based mechanism allows cluster admins to use the best remediator for their environment, without NHC having to know them beforehand. Remediators might use e.g. Kubernetes' ClusterAPI, OpenShift's MachineAPI, BMC, Watchdog or software-based reboots for fencing the workloads. For more details see the remediation documentation.

When the Node recovers and becomes healthy again, NHC will delete the remediation CR to signal that node recovery was successful.

Special cases

Control plane problems

Remediation is not always the correct response to a failure. Especially in larger clusters, we want to protect against failures that appear to take out large portions of compute capacity but are really the result of failures on or near the control plane. For this reason, the NHC CR includes the ability to define a minimum number of healthy nodes, by percentage or absolute number. When the cluster falls short of this threshold, no further remediation will be started.
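The threshold arithmetic can be sketched as follows in Go. This is only an illustration: the function names `minHealthyCount` and `remediationAllowed`, the string form of the threshold ("51%" or "3"), and the choice to round percentages up are assumptions, not details confirmed by this README.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minHealthyCount resolves a threshold given either as a percentage ("51%")
// or as an absolute number ("3") into a node count. Percentages are rounded
// up here, which is an assumption made for this sketch.
func minHealthyCount(minHealthy string, totalNodes int) (int, error) {
	if strings.HasSuffix(minHealthy, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(minHealthy, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %w", minHealthy, err)
		}
		return (pct*totalNodes + 99) / 100, nil // integer ceiling of pct% of totalNodes
	}
	n, err := strconv.Atoi(minHealthy)
	if err != nil {
		return 0, fmt.Errorf("invalid node count %q: %w", minHealthy, err)
	}
	return n, nil
}

// remediationAllowed reports whether enough nodes are healthy for a new
// remediation to be started.
func remediationAllowed(healthyNodes, totalNodes int, minHealthy string) (bool, error) {
	required, err := minHealthyCount(minHealthy, totalNodes)
	if err != nil {
		return false, err
	}
	return healthyNodes >= required, nil
}

func main() {
	for _, healthy := range []int{6, 5} {
		ok, err := remediationAllowed(healthy, 10, "51%")
		fmt.Println(healthy, ok, err)
	}
}
```

With 10 nodes and a threshold of "51%", this sketch requires 6 healthy nodes, so remediation would proceed with 6 healthy nodes but not with 5.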

Cluster Upgrades

Cluster upgrades usually involve worker reboots, mainly to apply OS updates. These nodes might become unhealthy for some time during these reboots. This disruption can also cause other nodes to become overloaded and appear unhealthy while compensating for the lost compute capacity. Making remediation decisions at this moment may interfere with the upgrade and may even cause it to fail completely. For that reason, NHC will stop remediating new unhealthy nodes when it detects that a cluster is upgrading.

At the moment this is only supported on OpenShift, by monitoring the ClusterVersionOperator.

Manual pausing

Before running cluster upgrades on Kubernetes, or for any other reason, cluster admins can prevent new remediation by pausing the NHC CR.
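Taken together, the special cases in this section act as guards that are evaluated before a new remediation is started. The Go sketch below shows one way to combine them. The `ClusterState` type, its field names, and the order in which the guards are checked are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// ClusterState is a hypothetical snapshot of the inputs to the decision.
type ClusterState struct {
	HealthyNodes  int
	MinHealthy    int      // absolute threshold, already resolved from a percentage if needed
	Upgrading     bool     // true while a cluster upgrade is detected
	PauseRequests []string // non-empty when an admin has paused remediation
}

// mayStartRemediation reports whether a new remediation may be started and,
// if not, a short reason why it is blocked.
func mayStartRemediation(s ClusterState) (bool, string) {
	switch {
	case len(s.PauseRequests) > 0:
		return false, "paused: " + s.PauseRequests[0]
	case s.Upgrading:
		return false, "cluster upgrade in progress"
	case s.HealthyNodes < s.MinHealthy:
		return false, "too few healthy nodes"
	}
	return true, ""
}

func main() {
	states := []ClusterState{
		{HealthyNodes: 5, MinHealthy: 3},
		{HealthyNodes: 5, MinHealthy: 3, Upgrading: true},
		{HealthyNodes: 5, MinHealthy: 3, PauseRequests: []string{"manual upgrade"}},
		{HealthyNodes: 2, MinHealthy: 3},
	}
	for _, s := range states {
		ok, reason := mayStartRemediation(s)
		fmt.Println(ok, reason)
	}
}
```

Note that all three guards only block new remediations; the text above does not say that they cancel remediations already in progress, and the sketch does not model that either.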

Further information

For more details about using or contributing to Node Healthcheck, check out our docs.

Help

Please join our Google group to ask questions. When you find a bug, please open an issue in this repository.

Source Files

main.go

Directories

Path	Synopsis
api
v1alpha1	Package v1alpha1 contains API Schema definitions for the remediation v1alpha1 API group (groupName=remediation.medik8s.io)
controllers
cluster
console
defaults
initializer
mhc
rbac
resources
utils
e2e
utils
metrics
version
