Red Hat OpenShift 4 is making an important and powerful change to the way pod evictions work. OpenShift has transitioned from using node conditions to using a Taint/Toleration based eviction process, which gives individual pods more control over how they are evicted. This new capability was added in Kubernetes 1.12 and enabled in OpenShift 4.1.
How node condition based eviction works
Before we talk about what new capabilities this enables, let’s dive into the way it works today in OpenShift 3 and most other Kubernetes distributions. On each node in your Kubernetes cluster, the Kubelet runs as the local agent that starts, stops and monitors each pod assigned to that node. In addition to monitoring the pods, the Kubelet also reports the health status of the node back to the control plane, so the scheduler knows how to assign new pods to available capacity. These statuses are the “node conditions” and this is what they typically look like on a healthy node:
node conditions in the OpenShift Console
The Memory, Disk and PID pressure conditions influence the placement of new workloads on the node. For evictions, the Ready condition is the one to pay attention to. When the control plane (specifically the Controller Manager) notices a node transition to a NotReady state, a 5-minute timer starts. When this timer expires, the pods on that node are rescheduled onto other suitable nodes in the cluster. Evictions may also occur when specific resource pressure is present on the node.
This behavior is useful: our workloads find new places to run and we can diagnose the node. But what if we need more flexibility in this logic? What if I am running a workload that is tolerant of this failure? What if it changes on a per-Namespace basis?
How Taint/Toleration based eviction works
The new Taint/Toleration based process in OpenShift 4 allows for these nuances and grants more control to each Namespace while remaining backwards compatible with the previous behavior.
Before we dig in, what is a “taint”? What is a “toleration”? A “taint” is a special attribute on a node, similar to how you might have a tainted pool of water. A farmer doesn’t want animals drinking from tainted water, but it is ok to use that water to spray on a road to keep down dust. The task of spraying is compatible with the tainted water, so we say that it can “tolerate” it. OpenShift workloads can tolerate many different types of taints, which allows for many use-cases:
- Prevent GPU-enabled nodes from being used by pods that don’t require a GPU
- Prevent workloads with specific security classifications from running in certain areas
- Isolate special workloads like cluster monitoring or ingress/routing to dedicated nodes
- Prevent rescheduling of workloads that are tolerant of an unreachable node
An OpenShift 4 cluster automatically taints your nodes with the information that used to be communicated via the node conditions, in order to provide the ability to tolerate them on a per-Namespace or per-pod basis. A few examples of the new taints are:
- node.kubernetes.io/not-ready
- node.kubernetes.io/unreachable
- node.kubernetes.io/out-of-disk
- node.kubernetes.io/memory-pressure
- node.kubernetes.io/disk-pressure
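To make the relationship between the old node conditions and the new taints concrete, here is a minimal Python sketch. It is a simplified model, not the cluster's actual implementation: the pairing of condition type and status with each taint key follows the upstream Kubernetes conventions as I understand them, and the names `CONDITION_TAINTS` and `taints_for` are hypothetical.

```python
# Simplified model: which taint key corresponds to which node condition.
# The (condition type, status) pairs are an assumption based on upstream
# Kubernetes conventions; all names in this sketch are illustrative only.
CONDITION_TAINTS = {
    ("Ready", "False"): "node.kubernetes.io/not-ready",
    ("Ready", "Unknown"): "node.kubernetes.io/unreachable",
    ("OutOfDisk", "True"): "node.kubernetes.io/out-of-disk",
    ("MemoryPressure", "True"): "node.kubernetes.io/memory-pressure",
    ("DiskPressure", "True"): "node.kubernetes.io/disk-pressure",
}


def taints_for(conditions):
    """Return the taint keys implied by a list of node condition dicts."""
    keys = []
    for cond in conditions:
        key = CONDITION_TAINTS.get((cond["type"], cond["status"]))
        if key is not None:
            keys.append(key)
    return keys


healthy = [
    {"type": "MemoryPressure", "status": "False"},
    {"type": "DiskPressure", "status": "False"},
    {"type": "Ready", "status": "True"},
]
unreachable = [
    {"type": "MemoryPressure", "status": "Unknown"},
    {"type": "Ready", "status": "Unknown"},
]

print(taints_for(healthy))      # []
print(taints_for(unreachable))  # ['node.kubernetes.io/unreachable']
```

A healthy node carries none of these taints, while a node whose Ready condition has gone Unknown picks up the unreachable taint that pods can then choose to tolerate.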
For example, it might be beneficial to tolerate an unreachable node if the workload is safe to remain running while a networking issue resolves (e.g. video transcoding). The most powerful part of this is that each engineering team using the cluster can control this in a self-service manner, rather than relying on a node-level setting, which would evict everything running on the node.
Under the hood
In fact, you have even more control, because each taint carries an effect. The two that matter for this discussion are NoSchedule and NoExecute (Kubernetes also defines a softer PreferNoSchedule effect).
NoSchedule – A pod that does not have a NoSchedule toleration for the taint cannot be scheduled to a node to which the taint is applied.
NoExecute – A pod that does not have a NoExecute toleration for the taint will be evicted from a node to which the taint is applied.
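The two effects can be illustrated with a small Python model of how a toleration is matched against a taint. This is a sketch of the matching rules as documented upstream (an `Exists` operator ignores the value, an empty effect matches every effect), not the scheduler's real code, and the helper names `tolerates` and `placement` are hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint (simplified rules)."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # toleration is scoped to a different effect
    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")
    if not key:
        # An empty key is only meaningful with Exists and matches any taint key.
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def placement(pod_tolerations, node_taints):
    """Classify the outcome for a pod: 'evict', 'no-schedule' or 'ok'."""
    untolerated = [
        taint for taint in node_taints
        if not any(tolerates(tol, taint) for tol in pod_tolerations)
    ]
    effects = {taint["effect"] for taint in untolerated}
    if "NoExecute" in effects:
        return "evict"        # running pods are removed, new ones kept away
    if "NoSchedule" in effects:
        return "no-schedule"  # running pods stay, new ones are kept away
    return "ok"


unreachable = {"key": "node.kubernetes.io/unreachable", "effect": "NoExecute"}
pressure = {"key": "node.kubernetes.io/memory-pressure", "effect": "NoSchedule"}
tolerant = [{"key": "node.kubernetes.io/unreachable",
             "operator": "Exists", "effect": "NoExecute"}]

print(placement(tolerant, [unreachable]))  # ok
print(placement([], [unreachable]))        # evict
print(placement(tolerant, [pressure]))     # no-schedule
```

Note that a toleration scoped to NoExecute does nothing for a NoSchedule taint, which is why the last call still reports that the pod cannot be scheduled.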
Pods can tolerate a taint for a specified time using tolerationSeconds, which allows you to tune the eviction per taint. Once this window expires, the pod is evicted from the node. The default remains at 5 minutes in order to preserve the existing behavior if you don’t customize the tolerations.
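As a rough illustration of the timing, the following Python sketch computes when a pod would be evicted after a NoExecute taint lands on its node. It assumes the window starts when the taint is added and that a missing tolerationSeconds means the taint is tolerated indefinitely; the function name `eviction_deadline` and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def eviction_deadline(taint_added, toleration_seconds=300):
    """Return when a pod tolerating a NoExecute taint would be evicted.

    toleration_seconds=None models a toleration without tolerationSeconds,
    which (by assumption here) tolerates the taint forever.
    """
    if toleration_seconds is None:
        return None
    return taint_added + timedelta(seconds=toleration_seconds)


# Hypothetical moment at which the node was tainted.
added = datetime(2019, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

print(eviction_deadline(added))        # 2019-06-01 12:05:00+00:00
print(eviction_deadline(added, 3600))  # 2019-06-01 13:00:00+00:00
print(eviction_deadline(added, None))  # None
```

The first call reflects the 5-minute default, the second a workload that prefers to wait an hour for the node to come back, and the third a workload that is never evicted for this taint.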
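These default settings are enforced through a new admission controller. In addition, a default NoSchedule toleration for the node.kubernetes.io/memory-pressure taint is added to any pod whose QoS class is not BestEffort, so BestEffort pods (those with no CPU or memory requests or limits) are the ones kept off a node that is under memory pressure.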
Tolerations set on a pod by the admission controller
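The defaulting of the 5-minute window can be modelled in a few lines of Python. This is only a sketch of the idea, assuming the defaults apply to the not-ready and unreachable taints and are skipped when the pod already declares its own toleration for that key; the function name `add_default_tolerations` is hypothetical and the real admission controller may differ in detail.

```python
DEFAULT_SECONDS = 300  # the 5-minute default window

DEFAULT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def add_default_tolerations(tolerations):
    """Return a copy of the pod's tolerations with missing defaults added."""
    result = list(tolerations)
    for key in DEFAULT_KEYS:
        # Respect a toleration the pod author already set for this key.
        already_set = any(
            t.get("key") == key and t.get("effect") in ("NoExecute", None, "")
            for t in result
        )
        if not already_set:
            result.append({
                "key": key,
                "operator": "Exists",
                "effect": "NoExecute",
                "tolerationSeconds": DEFAULT_SECONDS,
            })
    return result


# A pod that wants to ride out an unreachable node for up to an hour.
custom = [{
    "key": "node.kubernetes.io/unreachable",
    "operator": "Exists",
    "effect": "NoExecute",
    "tolerationSeconds": 3600,
}]

for toleration in add_default_tolerations(custom):
    print(toleration["key"], toleration["tolerationSeconds"])
# node.kubernetes.io/unreachable 3600
# node.kubernetes.io/not-ready 300
```

The pod keeps its own one-hour window for unreachable nodes and only receives the default for the taint it did not mention, which is how the previous 5-minute behavior is preserved for pods that customize nothing.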