Most Red Hat OpenShift maintenance operations follow the same pattern: one or more nodes are temporarily taken out of the cluster, the required maintenance is performed on them, and they are then re-added to the cluster. This cycle repeats until the maintenance operation has been performed on all nodes.
In order to gracefully remove a node from the cluster, that node must first be drained. Draining the node means killing all of the pods running on it, until the node is completely empty.
In this post, we will look at a set of design principles that help applications cope with this necessary OpenShift maintenance pattern.
1: Always run at least 2 pods
This is an obvious HA principle, but it has some less obvious ramifications. We know that our pods will be killed, so we need to have more than one running. This means that the application needs to be designed to operate with more than one instance.
This is typically fine for most web services, but some apps may need to be adjusted in order to work as multiple instances.
In some cases, it may be acceptable to take a temporary outage. We can count on the fact that when a pod gets killed, it will be scheduled somewhere else. It may take a few minutes for the pods to come back and be operational.
spec:
  replicas: 5
If you are running a horizontal pod autoscaler, set minReplicas to 2.
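As a minimal sketch of that setting, assuming a Deployment named mypod (a hypothetical name) and CPU-based scaling, a HorizontalPodAutoscaler with a floor of two replicas could look like the following. The maxReplicas and CPU target values are illustrative assumptions, not recommendations; if the workload is a DeploymentConfig, scaleTargetRef would point at that instead.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: mypod                          # hypothetical name
spec:
  scaleTargetRef:                      # the workload being scaled (assumed to be a Deployment)
    apiVersion: apps/v1
    kind: Deployment
    name: mypod
  minReplicas: 2                       # never scale below two pods
  maxReplicas: 5                       # illustrative upper bound
  targetCPUUtilizationPercentage: 80   # illustrative CPU target
```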
2: Spread your application pods evenly across nodes
We do not want to have all our pods on the same node, because that would make having multiple pods irrelevant when the node must be taken offline.
OpenShift already does a good job of this. By default, the scheduler will try to spread your pods across all the nodes that can house them. The general recommendation here is to not interfere with OpenShift’s normal scheduling policy.
If, for some reason, you need to reinforce this behavior, one way this can be accomplished is to use a pod anti-affinity rule as shown in the following example:
metadata:
  name: mypod
  labels:
    app: mypod
spec:
  affinity:
    podAntiAffinity:
      # Prefer nodes that do not already run a pod labeled app=mypod
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - mypod
          topologyKey: kubernetes.io/hostname
3: Define a pod disruption budget
Even with the above techniques in place, there is still a chance that multiple pods of your app will land on the same node. Also, an administrator may decide to take multiple nodes offline at once, and those may be the very nodes the application is running on. Through the PodDisruptionBudget object, we can tell OpenShift how many pods an app can tolerate being down without affecting its normal operation.
OpenShift will try as much as it can to respect the budget. If necessary during the node drain operation, it will kill your pods sequentially, waiting for one pod to be rescheduled somewhere else before killing the next one.
Here is an example of a PodDisruptionBudget object.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: mypod
spec:
  selector:
    matchLabels:
      app: mypod
  minAvailable: 2
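The same budget can also be expressed from the other direction. As a sketch, assuming the five replicas from the earlier example, maxUnavailable: 1 says that at most one pod of the app may be down at a time because of a voluntary disruption such as a node drain, whereas minAvailable: 2 would allow up to three of the five pods to be evicted at once. A PodDisruptionBudget takes either minAvailable or maxUnavailable, not both.

```yaml
apiVersion: policy/v1beta1   # same version as the example above; newer clusters serve this kind as policy/v1
kind: PodDisruptionBudget
metadata:
  name: mypod
spec:
  selector:
    matchLabels:
      app: mypod
  maxUnavailable: 1          # at most one matching pod may be voluntarily disrupted at a time
```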
4: Define a descheduler policy
Even with all the above techniques accounted for, once a maintenance event has been completed, pods may be scheduled in an unbalanced manner. To be better prepared for the next maintenance event, we may have to make sure pods are scheduled evenly.
Some customers have been doing this manually, but fortunately in OpenShift 3.11 we can use the cluster descheduler to evict pods that do not respect the defined scheduling policies.
I recommend running the descheduler job periodically with at least the removeDuplicates policy turned on.
This policy makes the descheduler evict pods when multiple pods belonging to the same app are running on the same node.
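For illustration, a descheduler policy file that enables only this strategy might look like the following sketch. It assumes the descheduler/v1alpha1 policy format of the upstream descheduler from that era, in which the strategy is spelled RemoveDuplicates; check the documentation for your version before relying on it.

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":   # evict extra pods of the same app that share a node
    enabled: true
```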
When deciding whether to use the descheduler, consider that the descheduler is in tech preview in 3.11.
Configuring the descheduler is an action performed by the cluster administrator and not the application team.
5: Do not use local/host path storage
In order to let your pods move around freely, your application should not use local or hostPath volumes. In fact, local storage will tie pods to the node where the storage exists, making it impossible for the pod to migrate. So during a maintenance operation, those pods will simply not be available.
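As a rough sketch of the difference, the fragment below mounts data through a PersistentVolumeClaim instead of a hostPath volume. The claim name mypod-data, the container name and image, and the mount path are all hypothetical, and the sketch assumes the claim is backed by network-attached storage that can be reattached on another node.

```yaml
spec:
  containers:
  - name: mypod
    image: example/mypod:latest    # hypothetical image
    volumeMounts:
    - name: data
      mountPath: /var/lib/mypod    # hypothetical path
  volumes:
  - name: data
    # Avoid this form: it pins the data to a single node's filesystem.
    # hostPath:
    #   path: /data/mypod
    persistentVolumeClaim:
      claimName: mypod-data        # hypothetical claim, assumed to be network-backed
```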
There should not be a reason to use local storage for the average application. Administrative DaemonSets and applications with strict I/O performance requirements are an exception to this rule.
If your app needs to use local storage, then you need to be running multiple instances and you need to make sure that the maintenance process never takes down the nodes where the multiple instances are running at the same time. This requires coordination with the team that performs the maintenance operations.
6: Design your application so that it tolerates losing pods
Your application pods will be killed during the maintenance process. The application code must be designed with this in mind.
The way OpenShift kills pods is by sending a SIGTERM signal to the app.
At this point the app has some time to shut down. If the pod does not exit within a given amount of time, OpenShift will send a SIGKILL to the app process, terminating it instantly.
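The amount of time between the two signals is configurable per pod. As a sketch, the fragment below sets terminationGracePeriodSeconds (the Kubernetes default is 30 seconds) and adds an optional preStop hook, which runs before SIGTERM is delivered and counts against the same grace period. The container name, image, 60-second value, and sleep command are illustrative assumptions, and the hook assumes the image ships a shell.

```yaml
spec:
  terminationGracePeriodSeconds: 60   # illustrative; time allowed before SIGKILL
  containers:
  - name: mypod
    image: example/mypod:latest       # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Runs before SIGTERM is sent; here it simply pauses briefly.
          command: ["sh", "-c", "sleep 5"]
```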
You need to instrument your app to perform any necessary cleanup when it receives a SIGTERM. This phase has to be quick.
In the case of a web app, there is no time at this point to wait for existing client sessions to be concluded. This leads us to the next principle.
7: It shouldn’t matter which pod receives a request
Because there is no way to cleanly wind down in-flight sessions when a pod gets killed, subsequent requests of an in-flight session may go to a different pod after the pod that was managing that session is killed.
Your application needs to be designed around this eventuality. It might be OK to lose the session in some cases, but in most circumstances you will want to give your customer the best user experience, which means not losing the session.
If your server-side application is completely stateless, then you have no issues to worry about.
If your server-side application maintains state in the form of a session, then you need to persist that state to either a cache or a database, so that it can be retrieved by other pods.
8: Capacity considerations
Doing maintenance requires having some spare capacity. At a minimum, one needs to be able to take one node off the cluster. Performing cluster maintenance on one node at a time can take a long time for a large cluster. One can obviously take multiple nodes offline at once, but that requires having more spare capacity. Find the right balance between the spare capacity reserved for maintenance and the time the maintenance operation will take.
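To make the trade-off concrete, here is a rough back-of-the-envelope model. It assumes N equally sized nodes, k nodes drained per round, and a fixed time t per round; the symbols and the sample numbers are illustrative assumptions, not measurements.

```latex
\text{spare capacity needed} \approx \frac{k}{N},
\qquad
\text{total maintenance time} \approx \left\lceil \frac{N}{k} \right\rceil t
```

For example, with N = 20 and t = 10 minutes, draining one node at a time (k = 1) needs about 5% spare capacity and takes about 200 minutes, while draining four at a time (k = 4) needs about 20% spare capacity and takes about 50 minutes.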
If you follow the set of practices contained in this post, your applications should be able to go through an upgrade process without interrupting service. Many of these practices also help applications survive node failures, so you will get the added benefit of better overall application availability.