ingress-daemonset-controller
ingress-daemonset-controller is a Kubernetes controller dedicated to managing ingress controllers. It provides an IngressDaemonSet CRD to deploy an ingress controller so that:
- You can run two or more pods per node for high availability
- You get a dedicated "health-checker" daemonset for more reliable deployments
Adding replicas to a DaemonSet for high availability
IngressDaemonSet has Spec.Replicas to configure the number of per-node replicas.
You use a NodePort service with externalTrafficPolicy: Local, or hostNetwork: true pods, to expose the pods so they can receive and serve traffic from the external load balancer. Note that in the latter case the applications running in the pods require REUSE_PORT (the SO_REUSEPORT socket option) to bind the ports.
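For the NodePort option, a minimal Service sketch would look like the following. This assumes the ingress pods carry the app: ingress-nginx label used in the example further below; the node port numbers are illustrative only.

```yaml
# Illustrative NodePort Service: externalTrafficPolicy: Local keeps traffic on
# the node it arrives at and preserves the client source IP.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
spec:
  type: NodePort
  externalTrafficPolicy: Local
  selector:
    app: ingress-nginx        # assumed to match the ingress pods' labels
  ports:
    - name: http
      port: 80
      targetPort: 80
      nodePort: 30080         # illustrative port the external LB targets
    - name: https
      port: 443
      targetPort: 443
      nodePort: 30443
```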
Note that for ingress deployment use-cases you shouldn't need to use hostPort, as it prevents running two or more pods with the same host port and adds a performance penalty due to SNAT and DNAT. See the CNI portmap plugin for more information.
This way, you can deploy an ingress controller or an ingress gateway like envoy, nginx-ingress, ambassador, istio-gateway, and so on, while balancing availability (multiple replicas per node) against efficiency (fewer network hops).
Usually the former approach of using node ports and the external traffic policy works. But in a default Kubernetes setup, a NodePort service may hit scalability limits due to the iptables conntrack table. If that matters to you, use a CNI plugin that provides an alternative service implementation that doesn't rely on iptables.
For example, Cilium provides a service implementation backed by XDP. In addition, Cilium 1.7 or later implements DSR for pod-svc-pod communication across nodes. So it might be a good idea to use Cilium anyway to reduce the total number of network hops from the external load balancer to the application pod and vice versa.
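If you take the Cilium route, the settings involved are roughly the following Helm values. This is only a sketch; the exact value names and supported modes depend on your Cilium version.

```yaml
# Rough sketch of Cilium Helm values (names and defaults vary by Cilium version):
kubeProxyReplacement: strict   # eBPF-based service handling instead of iptables
loadBalancer:
  mode: dsr                    # DSR for pod-svc-pod traffic across nodes (Cilium 1.7+)
  acceleration: native         # XDP acceleration where the NIC driver supports it
```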
Theoretical Performance (the former is better):
- hostNetwork: true + REUSE_PORT
- externalTrafficPolicy: Local + Cilium
- externalTrafficPolicy: Local + iptables

Ease of Use (the former is better):
- externalTrafficPolicy: Local + iptables (no extra component needed; regular K8s manifests work)
- externalTrafficPolicy: Local + Cilium (Cilium needed; regular K8s manifests work)
- hostNetwork: true + REUSE_PORT (requires your app to support REUSE_PORT; the ingress solution of your choice might not provide official example manifests for this setup)
Reliable deployment
IngressDaemonSet is smart enough to let you tell the external load balancer to stop sending traffic to a node before updating its pods, either by:
- starting to fail health-checks and waiting for a grace period, or
- explicitly detaching the node by annotating it with a node-detacher annotation and waiting for a grace period (or until node-detacher finishes the detachment), so that traffic stops before a pod gets deleted and restarted.
The health-checker responds with 200 OK if and only if all the replicas on the node are ready AND there's no rolling update scheduled for the replicas on the node.
In other words, it enables traffic to stop before a pod gets deleted and restarted for a rolling update, which makes your service more reliable during updates.
Example
Here's an example of IngressDaemonSet:
kind: IngressDaemonSet
apiVersion: ingressdaemonsets.mumoshu.github.io/v1alpha1
metadata:
name: nginx-ingress
labels:
app: ingress-nginx
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
kubernetes.io/cluster-service: "true"
spec:
# NEW: 2 pods per node (See below)
podsPerNode: 2
  # NEW: The host/node port to be used for responding to health-check HTTP requests from the external load balancer
healthCheckNodePorts:
- 10080
updateStrategy:
    # Has a different meaning and options than the standard DaemonSet (See below)
type: RollingUpdate
rollingUpdate:
# If this is greater than or equal to 1, the controller communicates with node-detacher to detach the node
# and makes the node unavailable to the external lb before updating pods on the node
#
# If this is set, `maxUnavaiablePodsPerNode` is automatically set to `100%` and `maxDegradedNodes` must be `0`,
# which means the controller updates all the pods in a node concurrently after detaching the node from external lb, which is safe and can be faster than the other method.
#
      # The default value is 0, which means it relies on maxUnavaiablePodsPerNode and maxDegradedNodes only.
maxUnavailableNodes: 2
      # The controller rolling-updates up to 3 pods on the node at a time
maxUnavaiablePodsPerNode: 3
# maxSurgedPodsPerNode is the maxSurge for the per-node deployment
maxSurgedPodsPerNode: 1
      # The controller concurrently runs rolling-updates of pods on up to 3 nodes.
      # The default is 1. Only one of `maxUnavailableNodes` and `maxDegradedNodes` can be set to 1 or greater.
maxDegradedNodes: 3
# The controller annotates the per-node deployment to start failing health-checks from the external lb
# even before any pod gets replaced.
#annotateDeploymentToDetach:
# key: "ingressdaemonsets.mumoshu.github.com/to-be-updated"
# gracePeriodSeconds: 10
# annotateNodeToDetach configures the controller to annotate the node before updating pods scheduled onto the node.
      # It can either be (1) an annotation key or (2) an annotation key=value pair.
      # When the first option is used, the controller annotates the node with the specified key and an empty value.
annotateNodeToDetach:
key: "node-detacher.variant.run/detached"
value: "true"
gracePeriodSeconds: 10
      # waitForDetachmentByAnnotatedTimestamp configures the controller to wait for a certain period after the detachment
      # timestamp stored in the specified annotation.
      # This depends on and requires configuring annotateNodeToDetach, too.
waitForDetachmentByAnnotatedTimestamp:
key: "node-detacher.variant.run/detachment-timestamp"
format: RFC3339
gracePeriodSeconds: 30
healthChecker:
image: mumoshu/ingress-daemonset-controller:latest
imagePullPolicy: Always
serviceAccountName: default
template:
metadata:
labels:
app: ingress-nginx
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
annotations:
prometheus.io/port: '10254'
prometheus.io/scrape: 'true'
spec:
nodeSelector:
ingress-controller-node: "true"
hostNetwork: true
terminationGracePeriodSeconds: 300
serviceAccountName: nginx-ingress-serviceaccount
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.26.1
imagePullPolicy: Always
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
lifecycle:
preStop:
exec:
command:
- /wait-shutdown
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-http-backend
- --configmap=$(POD_NAMESPACE)/nginx-configuration
- --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
- --udp-services-configmap=$(POD_NAMESPACE)/udp-services
- --publish-service=$(POD_NAMESPACE)/ingress-nginx
- --annotations-prefix=nginx.ingress.kubernetes.io
securityContext:
allowPrivilegeEscalation: true
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
# www-data -> 33
runAsUser: 33
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- name: http
containerPort: 80
hostPort: 80
- name: https
containerPort: 443
hostPort: 443
So spec-wise, the main difference between IngressDaemonSet and a standard DaemonSet is that the former adds per-node replicas.
The type: RollingUpdate update strategy has a different meaning in IngressDaemonSet. It relies on a "health-checker" daemonset managed by the controller to make the external LB stop sending traffic to the node while any of the pods on the node is not ready. In the above example, due to healthCheckNodePorts: [10080], the health-checker daemonset pods bind port 10080 on the host to respond to health-check HTTP requests from the external load balancer.
It can optionally rely on node-detacher too, to make the external LB stop sending traffic to the node BEFORE any of the pods on the node start stopping, which gives you an extra grace period to drain traffic compared to relying only on the health-checker daemonset.
How it works
The controller deploys a deployment per node and a daemonset per cluster for each IngressDaemonSet.
The per-node deployment is used to manage daemonset pod replicas for each node.
The generated deployment spec has a pod template that is mostly equivalent to IngressDaemonSet.Spec.Template.
IngressDaemonSet.Spec.UpdateStrategy.RollingUpdate.MaxUnavaiablePodsPerNode is copied literally to Deployment.Spec.Strategy.RollingUpdate.MaxUnavailable.
Deployment.Spec.Template.Spec.NodeName is set to the node that the per-node deployment is responsible for, so that its pods short-circuit the scheduler and get scheduled onto the target node without stressing the scheduler.
The deployment's name, selector (matchLabels), and pod labels are configured so that they don't collide with another per-node deployment.
The deployment's name is <INGRESS_DS_NAME>-<NODE_NAME>. The deployment labels contain IngressDaemonSet.Spec.Selector.MatchLabels and ingress-daemon-for-node=NODE_NAME.
The former is defined by the user and used by the ingress daemonset to differentiate pods across different ingress daemonsets.
The latter is automatically generated by the controller to differentiate deployments across different nodes.
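Putting the above together, the per-node deployment generated for the example IngressDaemonSet on a node named node-1 would look roughly like the following. This is a sketch reconstructed from the description above, not the controller's literal output.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-node-1             # <INGRESS_DS_NAME>-<NODE_NAME>
  labels:
    app: ingress-nginx                   # from IngressDaemonSet.Spec.Selector.MatchLabels
    ingress-daemon-for-node: node-1      # added per node by the controller
spec:
  replicas: 2                            # podsPerNode
  selector:
    matchLabels:
      app: ingress-nginx
      ingress-daemon-for-node: node-1
  strategy:
    rollingUpdate:
      maxUnavailable: 3                  # copied from maxUnavaiablePodsPerNode
      maxSurge: 1                        # copied from maxSurgedPodsPerNode
  template:
    metadata:
      labels:
        app: ingress-nginx
        ingress-daemon-for-node: node-1
    spec:
      nodeName: node-1                   # pins the pods onto the node, bypassing the scheduler
      # ...the rest of the pod spec is copied from IngressDaemonSet.Spec.Template...
```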
The per-cluster daemonset is dedicated to serving external load balancer health-checks.
The daemonset is named <INGRESS_DS_NAME>-<PORT NUMBER>. It creates pods from one of the health-checker container images that are published as part of the ingress-daemonset-controller's release.
Each pod uses the host network and binds the port designated by IngressDaemonSet.Spec.HealthCheckNodePort to serve external load balancer health-checks.
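Again only as a rough sketch (the labels here are hypothetical, and this is not the controller's literal output), the per-cluster health-checker daemonset for the example above would look something like:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-ingress-10080              # <INGRESS_DS_NAME>-<PORT NUMBER>
spec:
  selector:
    matchLabels:
      health-checker-for: nginx-ingress  # hypothetical label
  template:
    metadata:
      labels:
        health-checker-for: nginx-ingress
    spec:
      hostNetwork: true                  # binds the health-check port on every node
      serviceAccountName: default        # healthChecker.serviceAccountName
      containers:
        - name: health-checker
          image: mumoshu/ingress-daemonset-controller:latest   # healthChecker.image
          imagePullPolicy: Always                              # healthChecker.imagePullPolicy
          ports:
            - name: healthz
              containerPort: 10080       # the IngressDaemonSet's healthCheckNodePorts entry
              hostPort: 10080
```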
You'll either:
- point the service load balancer's Service.Spec.HealthCheckNodePort to the ingress daemonset's health-check node port, or
- configure the external load balancer via a cloud-provider-specific API to target health-check requests at the health-check node port
If you're using one of the public clouds that has solid L4 and/or L7 load balancers, the second option is recommended so that you can reuse the external load balancer across cluster replacements.
If you're using the ingress daemonset to implement load-balancer-as-a-service on on-premise infrastructure, the first option can be the right choice for various reasons, for example when there's no easy way to continually attach a set of K8s nodes to an LB outside of Kubernetes.
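For the first option, a minimal sketch looks like the following, assuming a service load balancer implementation that honors Service.Spec.HealthCheckNodePort and the 10080 health-check port from the example above.

```yaml
# Illustrative: make the load balancer's health checks hit the ingress daemonset's
# health-checker port instead of kube-proxy's auto-allocated one.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # required for healthCheckNodePort to take effect
  healthCheckNodePort: 10080     # matches the IngressDaemonSet's healthCheckNodePorts
  selector:
    app: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```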
You don't need to fine-tune preStop hooks for your containers anymore, as the controller annotates the per-node deployment to make the health-check start failing and waits for a configurable grace period before any pod gets deleted and recreated.
Please see annotateDeploymentToDetach in the above example for more information.