Introduction

This is the second part of the How Full is my Cluster series; you can find part one here.

In part one, we covered some basic concepts and a series of recommendations to help OpenShift make informed decisions when scheduling pods.

This alone is still not enough to guarantee that an individual node will not be overloaded by its pods' workload, forcing the node to shut down pods or, even worse, jeopardizing the stability of the node itself.

A node may run out of resources when it is overcommitted, and it will start killing pods when it comes under resource pressure (that is, when it is running out of resources). Pod eviction is a defense mechanism of the node and should be properly configured (it is off by default).

Node Overcommitment

Technically, a node is overcommitted when the sum of the limits is greater than the allocatable resources. In this situation, if all the pods on that node were to claim their limits, the node would come under resource pressure.

The following picture shows a node that is strongly overcommitted on CPU:

Overcommitted node

Overcommitment is not a bad thing per se. In fact, it allows you to pack your workloads more densely. It works on the assumption that not all the pods will claim all of their usable resources at the same time (the same principle on which banks work: they assume that not all customers will want to withdraw all their money at the same time).

The right question to ask is then: By how much should we overcommit? And, also, how can we enforce that value?

Each organization will have to find its own correct overcommitment level, and I recommend a cautious strategy for identifying it. You could, for example, progressively increase the overcommitment level and observe your workload: when the rate of evicted pods starts to become unacceptable to you, you have found the right overcommitment level for that workload.

Once you have defined your overcommitment policy, you can enforce it either at the project level or at the cluster level.

At the project level, you can use the maxLimitRequestRatio field of the LimitRange object.
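As a sketch (the object name and ratio values here are illustrative, not a recommendation), a LimitRange enforcing a maximum limit-to-request ratio could look like this:

apiVersion: v1
kind: LimitRange
metadata:
  name: overcommit-ratio              # hypothetical name
spec:
  limits:
    - type: Container
      maxLimitRequestRatio:
        cpu: "4"                      # a container's CPU limit may be at most 4x its request
        memory: "2"                   # a container's memory limit may be at most 2x its request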

If you want to set this ratio at the cluster level, you can use memoryRequestToLimitPercent and cpuRequestToLimitPercent from the ClusterResourceOverride configuration, which can be specified in the master config file.
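For example, in an OpenShift 3.x master config the override plugin can be enabled roughly as follows (the percentages are illustrative):

admissionConfig:
  pluginConfig:
    ClusterResourceOverride:
      configuration:
        apiVersion: v1
        kind: ClusterResourceOverrideConfig
        memoryRequestToLimitPercent: 25   # memory requests forced to 25% of the limit
        cpuRequestToLimitPercent: 25      # CPU requests forced to 25% of the limit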

Recommendation: Define and enforce your overcommitment policy.

Recommendation: Develop the ability to create a report with your nodes and their level of commitment.

Recommendation: Create an alert to notify you if a node is overcommitted beyond what you defined in your overcommitment policy.
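As a sketch of such an alert, assuming Prometheus with kube-state-metrics is available (metric names vary between kube-state-metrics versions, and the 200% threshold is only an example):

groups:
  - name: node-overcommitment              # hypothetical rule group
    rules:
      - alert: NodeMemoryOvercommitted
        expr: |
          sum by (node) (kube_pod_container_resource_limits{resource="memory"})
            / on (node) kube_node_status_allocatable{resource="memory"} > 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Memory limits on {{ $labels.node }} exceed 200% of allocatable"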

Resource Types

In terms of behavior under resource pressure, there are two types of resources:

  • Compressible resources: Resources that never run out as long as you can afford to wait for them. You may have access to only a limited quantity in a given period of time, but if you are willing and able to wait, the total supply is effectively unlimited. Examples of this type of resource are CPU, block I/O, and network I/O.
  • Incompressible resources: Resources that are finite; when you run out of them, your application will not get any more. Examples of this type of resource are memory and disk space.

It is important to keep this difference in mind when configuring settings for resources. Certain types of workloads may be more sensitive to one or the other. If memory and other incompressible resources are not set up correctly, pods may be killed. On the other hand, if CPU and other compressible resources are not set up correctly, workloads can starve.

Protecting a Node from Resource Pressure

As we have discussed, if a node is overcommitted, it can run out of resources. This situation is called resource pressure. If the node service realizes that it is under resource pressure, it stops accepting new pods and, if the resource in question is incompressible, it starts trying to resolve the situation by evicting (that is, killing) pods. The pods to be evicted are chosen according to their QoS class.

Pod's Quality of Service

The following Quality of Service (QoS) classes exist in OpenShift; a pod's class is determined by how its requests and limits are defined:

QoS class     When
Guaranteed    Requests = limits
Burstable     Requests < limits
Best Effort   Requests and limits are not defined
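For instance, a pod whose containers set requests equal to limits lands in the Guaranteed class (the names and image below are only placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo                              # hypothetical pod
spec:
  containers:
    - name: app
      image: registry.example.com/app:latest  # placeholder image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 250m                           # equal to the request -> Guaranteed
          memory: 256Mi

Raising the limits above the requests would turn the same pod into Burstable; omitting requests and limits entirely would make it Best Effort.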

Pods start to be evicted when a given resource passes the eviction threshold. Pods can be evicted immediately (hard eviction threshold) or after being given time to shut down gracefully (soft eviction threshold). A soft eviction threshold also lets you define a grace period that starts when resource usage crosses the threshold; if usage is still above the threshold at the end of that period, the eviction begins.

Currently, eviction thresholds can be defined for the following incompressible resources: memory, node filesystem, and image filesystem (the Docker storage).

Recommendation: Always configure at least a hard eviction threshold for memory. For example:

kubeletArguments:
  eviction-hard:
    - "memory.available<500Mi"
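A soft threshold and its grace period can be configured in the same place (the values here are only an example):

kubeletArguments:
  eviction-soft:
    - "memory.available<1Gi"
  eviction-soft-grace-period:
    - "memory.available=1m30s"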

In addition to defining an eviction threshold, you can also reserve resources for the node service specifically and for other OS-level services by configuring the following in the node service:

kubeletArguments:
  kube-reserved:
    - "cpu=<cpu>,memory=<mem>"
  system-reserved:
    - "cpu=<cpu>,memory=<mem>"

Kube-reserved: Reserves resources for the Kubelet service (the node service in OpenShift).

System-reserved: Reserves resources for all the other non-pod processes (excluding the Kubelet service).

Kube-reserved, system-reserved, and the eviction threshold together determine the allocatable level. The picture below shows how the final allocatable resources are calculated:

Allocatable resources
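Expressed as a formula (memory shown; for CPU there is no eviction term, because eviction thresholds exist only for incompressible resources):

allocatable = node capacity - kube-reserved - system-reserved - hard eviction threshold

For example, a node with 8 GiB of memory, 500M of kube-reserved, 500M of system-reserved, and a 500Mi hard eviction threshold advertises roughly 6.5 GiB as allocatable.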

Recommendation: Always reserve some resources for the node service and other OS-level services through the Kubelet arguments kube-reserved and system-reserved.

Note: You can reserve resources for both memory and CPU, but you can define an eviction threshold only for memory, not for CPU, because CPU is a compressible resource.

You can control these settings from the Ansible installer configuration file; here is an example:

openshift_node_kubelet_args={'kube-reserved': ['cpu=xxxm,memory=xxxM'], 'system-reserved': ['cpu=xxxm,memory=xxxM']}

Coming up with a good default for these settings is difficult (you should profile the behavior of your systems to find the ideal settings for your installation); however, here is a formula that has worked for me for small-sized nodes:

  • kube-reserved CPU: 5m x max number of pods per node
  • kube-reserved memory: 5M x max number of pods per node
  • system-reserved CPU: 5m x max number of pods per node
  • system-reserved memory: 10M x max number of pods per node

Note that max number of pods per node is by default 10 pods per vCPU (you can change this value).
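As a worked example of these rules of thumb, take a hypothetical node with 4 vCPUs: the default maximum is 40 pods, which gives:

kubeletArguments:
  kube-reserved:
    - "cpu=200m,memory=200M"      # 5m x 40 pods and 5M x 40 pods
  system-reserved:
    - "cpu=200m,memory=400M"      # 5m x 40 pods and 10M x 40 pods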

Additional Considerations for Memory

For memory, there is an additional defense mechanism: If the node service fails to recognize a memory pressure situation (for example, because a memory spike was so sudden that the node service did not have time to register it), the kernel's OOM killer steps in and kills a process. The OOM killer can be advised on which priority to assign to each process, but it cannot be fully controlled, and there is no guarantee that it will honor the OpenShift QoS classes. Even worse, the OOM killer may decide to kill processes that are not related to pods at all.

The OOM killer may kill one of the key services that must always be up and running, or else the node is lost to the cluster: the node service itself, Docker, or the SDN service.

This is one of the reasons why it is always critical to protect the node from memory pressure.

Memory metrics (single node)

This diagram represents various memory-related metrics for a single node over time.

The blue and orange lines represent the total amount of memory and the allocatable memory (calculated as described above), respectively. These two measures are constant.

The green line represents the sum of the requests of the pods allocated to this node. The scheduler makes sure that this measure always stays below the amount of allocatable memory.

The red line is the actual memory usage on the node, which includes all the pods plus the amount consumed by the node daemons. This measure fluctuates more than the sum of the requests, depending on several factors, such as the way the applications are written and the current load.

The final measure, illustrated by the yellow line, is the sum of the limits. Because of the way cgroups are organized, the actual memory usage will always stay below the sum of the limits. However, the sum of the limits can become greater than the node allocatable if the workload is burstable; in this case, as we have seen, the node is overcommitted.

The two important events to consider in this diagram are when the actual memory usage crosses the allocatable line and when it reaches the total memory of the node. In the first case, an eviction is triggered; in the second, an OOM kill.

Additional Considerations for CPU

Some OpenShift clusters are built specifically for burstable workloads. Big data, machine learning, and batch or asynchronous integration workloads, for example, are typically inherently burstable.

This means that these processes can run with their specified CPU request, but if there are spare resources available they can consume more, up to their limit, and finish earlier.

It is possible to disable enforcement of the CPU limit. This allows burstable processes to consume all the CPU that is available. If two processes compete for the same CPU, fair scheduling is still guaranteed because CPU shares are calculated from the request value.

Here is how you can disable enforcing the CPU limit:

kubeletArguments:
  cpu-cfs-quota:
    - "false"

Getting Started with Monitoring Nodes' Overcommitment

I have created a set of Grafana graphs to track the commitment level of OpenShift cluster nodes. You can set them up by following the instructions you can find here. This setup is a quick start intended to let you study this problem space; it is not meant to be a production-quality monitoring solution.

Conclusions

In this post, we examined a series of best practices to protect nodes from resource pressure in an overcommitted cluster. Implementing these practices will improve the overall stability of the cluster.


About the author

Raffaele is a full-stack enterprise architect with 20+ years of experience. Raffaele started his career in Italy as a Java Architect then gradually moved to Integration Architect and then Enterprise Architect. Later he moved to the United States to eventually become an OpenShift Architect for Red Hat consulting services, acquiring, in the process, knowledge of the infrastructure side of IT.
