Eight Application Design Principles to Cope with OpenShift Maintenance Operations

January 14, 2019 | Raffaele Spazzoli

Most Red Hat OpenShift maintenance operations follow the same pattern: one or more nodes are temporarily taken off the cluster to perform the required maintenance, and they are re-added to the cluster when it is complete. This cycle repeats until the maintenance operation has been performed on all nodes.

In order to gracefully remove a node from the cluster, that node must first be drained. Draining a node means killing all of the pods on it until the node is completely empty.

In this post, we will look at a set of design principles that help applications cope with this necessary OpenShift maintenance pattern.

1: Always run at least 2 pods

This is an obvious HA principle, but it has some less obvious ramifications. We know that our pods will be killed, so we need to have more than one running. This means that the application needs to be designed to operate with more than one instance.

While typically acceptable for most web services, some apps may need to be tweaked in order to cope with working as multiple instances.

In some cases, it may be acceptable to take a temporary outage. We can count on the fact that when a pod gets killed, it will be scheduled somewhere else. It may take a few minutes for the pods to come back and be operational.

Running multiple pods is as simple as setting the replicas field of your Deployment or DeploymentConfig object or, more generally, of any workload API object.

spec:
  replicas: 5

If you are running a horizontal pod autoscaler, set minReplicas to 2.
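As a rough sketch of the autoscaler case, the manifest below pins the lower bound at two replicas. The object name `myapp`, the target Deployment, the `maxReplicas` value, and the CPU threshold are illustrative assumptions, not values from this post.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: myapp                          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                        # hypothetical Deployment to scale
  minReplicas: 2                       # never scale below two pods
  maxReplicas: 5                       # assumed upper bound
  targetCPUUtilizationPercentage: 80   # assumed threshold
```

If the workload is a DeploymentConfig rather than a Deployment, the `scaleTargetRef` would need to point at that object instead.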

2: Spread your application pods evenly across nodes

We do not want to have all our pods on the same node, because that would make having multiple pods irrelevant when the node must be taken offline.

OpenShift already does a good job of this. By default, the scheduler will try to spread your pods across all the nodes that can house them. The general recommendation here is to not interfere with OpenShift’s normal scheduling policy.

If, for some reason, you need to reinforce this behavior, one way this can be accomplished is to use a pod anti-affinity rule as shown in the following example:

metadata:
  name: mypod
  labels:
    app: mypod
...
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - mypod
          topologyKey: kubernetes.io/hostname

3: Define a pod disruption budget

Even with the above techniques in place, there is still a chance that multiple pods of your app will land on the same node. Also, an administrator may decide to take multiple nodes offline at once, and those may be the very nodes the application is running on. Through the PodDisruptionBudget object, we can tell OpenShift how many pods an app can tolerate being down without affecting its normal execution.

OpenShift will try as much as it can to respect the budget. If necessary during the node drain operation, it will kill your pods sequentially, waiting for one pod to be rescheduled somewhere else before killing the next one.

Here is an example of a PodDisruptionBudget object.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: mypod
spec:
  selector:
    matchLabels:
      app: mypod
  minAvailable: 2
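One thing to watch: if minAvailable equals the number of replicas, no pod can ever be evicted voluntarily and a node drain will block. A possible alternative, sketched here under the assumption that the app tolerates losing one pod at a time, is to express the budget as `maxUnavailable`, which keeps working as the replica count changes. The object name is reused from the example above. On newer clusters the API group version is `policy/v1`.

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: mypod
spec:
  selector:
    matchLabels:
      app: mypod
  maxUnavailable: 1   # assumption: one pod at a time may be evicted
```

Only one of `minAvailable` and `maxUnavailable` can be set in a single budget.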

4: Define a descheduler policy

Even with all the above techniques accounted for, once a maintenance event has been completed, pods may be scheduled in an unbalanced manner. To be better prepared for the next maintenance event, we may have to make sure pods are scheduled evenly.

Some customers have been doing this manually, but fortunately, as of OpenShift 3.11, we can use the cluster descheduler to evict pods that do not respect the defined scheduling policies.

I recommend running the descheduler job periodically with at least the RemoveDuplicates policy turned on.

This policy makes the descheduler evict pods when multiple pods belonging to the same app are running on the same node.
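For illustration, a minimal policy file enabling only this strategy might look like the sketch below. It assumes the upstream descheduler's `v1alpha1` policy format, so check the field names against the descheduler version shipped with your cluster before relying on it.

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true    # evict duplicate pods of the same owner found on one node
```

How the policy file is handed to the descheduler job (for example, through a mounted ConfigMap) depends on how the job is deployed in your cluster.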

When deciding whether to use the descheduler, consider that the descheduler is in tech preview in 3.11.

Configuring the descheduler is an action performed by the cluster administrator and not the application team.

5: Do not use local/host path storage

In order to let your pods move around freely, your application should not use local or hostPath volumes. In fact, local storage will tie pods to the node where the storage exists, making it impossible for the pod to migrate. So during a maintenance operation, those pods will simply not be available.
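As a sketch of the alternative, the pod below mounts its data from a PersistentVolumeClaim instead of a hostPath volume, so the data can follow the pod to another node. This assumes the claim is bound to network-attached storage. The image, mount path, and claim name are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
  labels:
    app: mypod
spec:
  containers:
  - name: app
    image: registry.example.com/myapp:latest   # hypothetical image
    volumeMounts:
    - name: data
      mountPath: /var/lib/myapp                # hypothetical path
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: myapp-data                    # hypothetical claim on network storage
```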

There should not be a reason to use local storage for the average application. Administrative DaemonSets and applications with strict I/O performance requirements are an exception to this rule.

If your app needs to use local storage, then you need to be running multiple instances and you need to make sure that the maintenance process never takes down the nodes where the multiple instances are running at the same time. This requires coordination with the team that performs the maintenance operations.

6: Design your application so that it tolerates losing pods

Your application pods will be killed during the maintenance process. The application code must be designed with this in mind.

The way OpenShift kills pods is by sending a SIGTERM signal to the app.

At this point, the app has some time to shut down. If the pod does not die within a given amount of time, OpenShift will send a SIGKILL to the app process, terminating it instantly.

You need to instrument your app to do any necessary cleanup when it receives a SIGTERM. This phase has to be quick.
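The amount of time between SIGTERM and SIGKILL is controlled per pod. The fragment below is a sketch of the relevant pod spec fields: `terminationGracePeriodSeconds` sets the window (30 seconds is the Kubernetes default), and an optional `preStop` hook runs before SIGTERM is delivered. The image and the `drain.sh` script are hypothetical placeholders.

```yaml
spec:
  terminationGracePeriodSeconds: 30   # window before SIGKILL; 30 is the default
  containers:
  - name: app
    image: registry.example.com/myapp:latest   # hypothetical image
    lifecycle:
      preStop:
        exec:
          # hypothetical script; runs before SIGTERM is sent,
          # and its run time counts against the grace period
          command: ["/bin/sh", "-c", "/opt/app/bin/drain.sh"]
```

Raising the grace period makes each node drain slower, so it is worth keeping the shutdown path short rather than relying on a large value.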

In the case of a web app, there is no time at this point to wait for existing client sessions to be concluded. This leads us to the next principle.

7: It shouldn’t matter which pod receives a request

Because there is no way to cleanly conclude in-flight sessions when a pod gets killed, subsequent requests of an in-flight session may go to a different pod once the pod that was managing that session is gone.

Your application needs to be designed around this eventuality. It might be OK to lose the session in some cases. But in most circumstances, you will want to give your customer the best user experience, which means not losing the session.

If your server-side application is completely stateless, then you have no issues to worry about.

If your server-side application maintains a state in the form of a session, then you need a way to persist it to either a cache or a database, so that it can be retrieved by other pods.

8: Capacity considerations

Doing maintenance requires having some spare capacity. At a minimum, one needs to be able to take one node off the cluster. Performing cluster maintenance on one node at a time can take a long time for a large cluster. One can obviously take multiple nodes offline at once, but that requires having more spare capacity. Find the right balance between the spare capacity reserved for maintenance and the time the maintenance operation will take.
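A back-of-the-envelope way to reason about this trade-off, assuming identical nodes and evenly spread load (a simplification): let N be the number of nodes, u their average utilization, and k the number of nodes taken offline at once. These symbols are introduced here for illustration and are not from the original post. The remaining nodes must be able to absorb the whole load:

```latex
u \cdot N \le N - k
\quad\Longrightarrow\quad
k \le N\,(1 - u),
\qquad
\text{rounds} = \left\lceil \frac{N}{k} \right\rceil
```

For example, with N = 10 nodes at u = 0.8, at most k = 2 nodes can be offline together, so the maintenance takes 5 rounds. At u = 0.9 only one node can be drained at a time, and the same maintenance takes 10 rounds.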

Conclusion

If you follow the set of practices contained in this post, your applications should be able to go through an upgrade process without interrupting service. Many of these practices also help applications survive node failures, so you will get the added benefit of better overall application resilience.

About the author

Raffaele Spazzoli

Senior Principal Architect

Raffaele is a full-stack enterprise architect with 20+ years of experience. Raffaele started his career in Italy as a Java Architect then gradually moved to Integration Architect and then Enterprise Architect. Later he moved to the United States to eventually become an OpenShift Architect for Red Hat consulting services, acquiring, in the process, knowledge of the infrastructure side of IT.

Currently, Raffaele holds a consulting position as a cross-portfolio application architect with a focus on OpenShift. For most of his career, Raffaele has worked with large financial institutions, which allowed him to acquire an understanding of the enterprise processes and the security and compliance requirements of large enterprise customers.

Raffaele has become part of the CNCF TAG Storage and contributed to the Cloud Native Disaster Recovery whitepaper.

Recently Raffaele has been focusing on how to improve the developer experience by implementing internal development platforms (IDP).
