One of the benefits of adopting a system like OpenShift is the ability to run burstable and scalable workloads. Horizontal application scaling involves adding or removing instances of an application to match demand. When OpenShift schedules a Pod, it’s important that the nodes have enough resources to actually run it. If a user schedules a large application (in the form of a Pod) on a node with limited resources, the node can run out of memory or CPU and things stop working!

It’s also possible for applications to take up more resources than they should. This could be caused by a team spinning up more replicas than they need to artificially decrease latency or simply because of a configuration change that causes a program to go out of control and try to use 100% of the available CPU resources. Regardless of whether the issue is caused by a bad developer, bad code, or bad luck, what’s important is how a cluster administrator can manage and maintain control of the resources.

In this blog, let’s take a look at how you can solve these problems using best practices.

What does “overcommitment” mean in OpenShift?

In an overcommitted state, the sum of the container compute resource requests and limits exceeds the resources available on the system. 

Overcommitment might be desirable in development environments, where trading guaranteed performance for capacity is acceptable. In an overcommitted environment, it is therefore important to properly configure your worker nodes to provide the best system behavior. With that in mind, let's find out what needs to be enabled on the worker nodes in an overcommitted environment.

Prerequisites for the overcommitted worker nodes: 

The following flow chart describes all the prerequisite checks that should be performed on the worker nodes. Let’s go into the details one by one.

 1. Is the worker node ready for overcommitment? 

In OpenShift Container Platform, overcommitment is enabled by default, but it is always advisable to cross-check. When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.

To ensure this behavior, OpenShift Container Platform configures the kernel to always overcommit memory by setting the vm.overcommit_memory parameter to 1, overriding the default operating system setting.

OpenShift Container Platform also configures the kernel not to panic when it runs out of memory by setting vm.panic_on_oom parameter to 0. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority.

You can view the current setting by running the following commands on your nodes:

$ oc debug node/<worker node>
Starting pod/<worker node>-debug ...

If you don't see a command prompt, try pressing enter.

sh-4.2# sysctl -a |grep commit
vm.overcommit_memory = 1
sh-4.2# sysctl -a |grep panic
vm.panic_on_oom = 0

If your worker node settings differ from the expected values, you can easily set them via the Machine Config Operator for RHCOS (a sketch follows the command below), or on RHEL directly with the command below.

$ sysctl -w vm.overcommit_memory=1
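For RHCOS workers, one way to persist such a sysctl is a MachineConfig that drops a file into /etc/sysctl.d. The sketch below is illustrative rather than taken from the product documentation: the object name, file name, and Ignition version are assumptions you would adapt to your cluster.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-overcommit-sysctl        # illustrative name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0                       # assumed; match your cluster's Ignition spec version
    storage:
      files:
        - path: /etc/sysctl.d/99-overcommit.conf
          filesystem: root
          mode: 420                        # 0644
          contents:
            # URL-encoded "vm.overcommit_memory = 1"
            source: data:,vm.overcommit_memory%20%3D%201%0A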

 2. Is the worker node enforcing CPU limits using CPU CFS quotas?

The Completely Fair Scheduler (CFS) is a process scheduler which was merged into the Linux Kernel 2.6.23 release (October 2007) and is the default scheduler. It handles CPU resource allocation for executing processes, and aims to maximize overall CPU utilization while also maximizing interactive performance.

By default, the kubelet uses CFS quota to enforce pod CPU limits. For example, when a user sets a CPU limit of 100 millicores for a pod, Kubernetes (via the kubelet on the node) specifies a CFS quota for CPU on the pod’s processes. The pod’s processes get throttled if they try to use more than the CPU limit.

When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and thus work fine without any intervention. CFS quota enforcement is controlled by the kubelet's cpu-cfs-quota setting, shown below in the older kubeletArguments form; an OpenShift 4 style KubeletConfig sketch follows it.

kubeletArguments:
 cpu-cfs-quota:
   - "true"
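On OpenShift 4, kubelet settings are managed through a KubeletConfig CR rather than kubeletArguments. A minimal sketch is shown below, assuming the upstream KubeletConfiguration field name cpuCFSQuota and the custom-kubelet label used later in this post on the worker MachineConfigPool:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: enforce-cfs-quota          # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods   # label assumed on the worker MachineConfigPool
  kubeletConfig:
    cpuCFSQuota: true              # set to false to stop enforcing CPU limits via CFS quota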

 3. Are enough resources reserved for system and kube processes per node?

To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for use by all underlying node components (such as kubelet, kube-proxy) and the remaining system components (such as sshd, NetworkManager) on the host.

CPU and memory resources reserved for node components in OpenShift Container Platform are based on two node settings:

  • kube-reserved: Resources reserved for node components. Default is none.
  • system-reserved: Resources reserved for the remaining system components. Default is none.

 

If a flag is not set, it defaults to 0. If none of the flags are set, the allocated resource is set to the node’s capacity as it was before the introduction of allocatable resources.

The table below summarizes the recommended resources to reserve per worker node. This is based upon OpenShift version 4.1. Also note that this does not include the resources required to run any third-party CNI plugin, its operator, and so on.

You can set the reserved resources with the help of a MachineConfigPool and a KubeletConfig custom resource (CR), as shown in the example below.

Find out the correct MachineConfigPool for your worker nodes and label it, if that is not done already.

$ oc describe machineconfigpool worker

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
 creationTimestamp: 2019-02-08T14:52:39Z
 generation: 1
 labels:
   custom-kubelet: small-pods

$ oc label machineconfigpool worker custom-kubelet=small-pods

Create a KubeletConfig as shown below and set the desired resources for system and Kube processes; a rough calculation of the resulting allocatable capacity follows the example.

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
 name: set-allocatable 
spec:
 machineConfigPoolSelector:
   matchLabels:
     custom-kubelet: small-pods 
 kubeletConfig:
   systemReserved:
     cpu: 500m
     memory: 512Mi
   kubeReserved:
     cpu: 500m
     memory: 512Mi
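
With the reservations above, the allocatable capacity the scheduler sees is roughly the node capacity minus kube-reserved, system-reserved, and the hard eviction threshold. A rough, illustrative calculation for a hypothetical 4-core, 16Gi worker:

allocatable CPU    ~ 4000m - 500m (kubeReserved) - 500m (systemReserved)                 = 3000m
allocatable memory ~ 16Gi  - 512Mi (kubeReserved) - 512Mi (systemReserved) - eviction threshold
                   ~ 15Gi minus the hard eviction value (100Mi by default upstream)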

 4. Is swap memory disabled on the worker node?

By default, OpenShift disables swap partitions on the node. A good practice in Kubernetes clusters is to disable swap on the cluster nodes in order to preserve quality of service (QoS) guarantees. Otherwise, physical resources on a node can be oversubscribed, affecting the resource guarantees the Kubernetes scheduler makes during pod placement.

For example, if two guaranteed pods have reached their memory limit, each container could start using swap memory. Eventually, if there is not enough swap space, processes in the pods can be terminated due to the system being oversubscribed.

Failing to disable swap results in nodes not recognizing that they are experiencing MemoryPressure, and in pods not receiving the memory they asked for in their scheduling requests. As a result, additional pods are placed on the node, further increasing memory pressure and ultimately increasing your risk of experiencing a system out of memory (OOM) event.
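
A quick way to confirm that swap is indeed off on a worker node (no output from swapon means there are no active swap devices):

$ oc debug node/<worker node>
sh-4.2# swapon --show       # no output means swap is disabled
sh-4.2# cat /proc/swaps     # only the header line should appear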

 5. Is QoS defined?

In an overcommitted environment, it is possible that the pods on the node will attempt to use more compute resources than are available at any given point in time. When this occurs, the node must give priority to one pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) class.

For each compute resource, a container is assigned to one of three QoS classes, in decreasing order of priority:

  • Guaranteed (priority 1): If limits and optionally requests are set (not equal to 0) for all resources and they are equal, the container is classified as Guaranteed.
  • Burstable (priority 2): If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, the container is classified as Burstable.
  • BestEffort (priority 3): If requests and limits are not set for any of the resources, the container is classified as BestEffort.
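
For example, the following pod (names and values are illustrative) sets equal requests and limits for every resource, so it is classified as Guaranteed; raising the limits above the requests would make it Burstable, and removing the resources section entirely would make it BestEffort:

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo                     # illustrative name
spec:
  containers:
    - name: app
      image: openshift/hello-openshift
      resources:
        requests:
          cpu: 200m
          memory: 128Mi
        limits:
          cpu: 200m                  # requests == limits for all resources -> Guaranteed
          memory: 128Mi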

 

A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Numbers larger than one billion are reserved for critical pods that should not be preempted or evicted. Two such critical priority classes are defined out of the box (a sketch of a custom priority class follows the list):

  • system-node-critical: This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node.
  • system-cluster-critical: This priority class has a value of 2000000000 (two billion) and is used for pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances.
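
A minimal sketch of a custom priority class and how a pod references it; the name and value are illustrative and stay well below the one-billion threshold reserved for critical classes:

apiVersion: scheduling.k8s.io/v1     # use scheduling.k8s.io/v1beta1 on older clusters
kind: PriorityClass
metadata:
  name: high-priority                # illustrative name
value: 1000000                       # must be <= 1000000000 for non-critical classes
globalDefault: false
description: "Priority class for latency-sensitive application pods."

A pod opts in by setting spec.priorityClassName: high-priority.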

You can also use the qos-reserved parameter to specify a percentage of memory to be reserved by a pod in a particular QoS level. This feature attempts to reserve requested resources so that pods in lower QoS classes cannot use resources requested by pods in higher QoS classes. For example, a value of qos-reserved=memory=100% prevents the Burstable and BestEffort QoS classes from consuming memory that was requested by the higher Guaranteed QoS class. A value of qos-reserved=memory=0%, on the other hand, allows the Burstable and BestEffort QoS classes to consume up to the full node allocatable amount if available, but increases the risk that a Guaranteed workload will not have access to the requested memory.

 

Mechanisms to control the resources on the overcommitted worker nodes:

After applying the prerequisites on the worker nodes and the cluster, it's time to see which mechanisms Kubernetes offers to control resources such as CPU, memory, ephemeral storage, and ingress and egress traffic.

  1. Limit Ranges: 

A limit range, defined by a LimitRange object, enumerates compute resource constraints in a project at the pod, container, image, image stream, and persistent volume claim level, and specifies the amount of resources that a pod, container, image, image stream, or persistent volume claim can consume. All resource creation and modification requests are evaluated against each LimitRange object in the project. If the resource violates any of the enumerated constraints, the resource is rejected. If the resource does not set an explicit value, and if the constraint supports a default value, the default value is applied to the resource.

Below is an example of a limit range definition.

apiVersion: "v1"
kind: "LimitRange"
metadata:
 name: "core-resource-limits" 
spec:
 limits:
   - type: "Pod"
     max:
       cpu: "2" 
       memory: "1Gi" 
     min:
       cpu: "200m" 
       memory: "6Mi" 
   - type: "Container"
     max:
       cpu: "2" 
       memory: "1Gi" 
     min:
       cpu: "100m" 
       memory: "4Mi" 
     default:
       cpu: "300m" 
       memory: "200Mi" 
     defaultRequest:
       cpu: "200m" 
       memory: "100Mi" 
     maxLimitRequestRatio:
       cpu: "10" 

 2. CPU Requests:

Each container in a pod can specify the amount of CPU it requests on a node. The scheduler uses CPU requests to find a node with an appropriate fit for a container. The CPU request represents the minimum amount of CPU that your container may consume, but if there is no contention for CPU, it can use all available CPU on the node. If there is CPU contention on the node, CPU requests provide a relative weight across all containers on the system for how much CPU time the container may use. On the node, CPU requests map to kernel CFS shares to enforce this behavior.
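
A rough illustration of that mapping (the kubelet converts millicores to cgroup cpu.shares as milliCPU * 1024 / 1000):

cpu request 250m   ->  cpu.shares ~ 250 * 1024 / 1000  = 256
cpu request 500m   ->  cpu.shares ~ 500 * 1024 / 1000  = 512
cpu request 1000m  ->  cpu.shares ~ 1024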

 3. CPU Limits:

Each container in a pod can specify the amount of CPU it is limited to use on a node. CPU limits control the maximum amount of CPU that your container may use independent of contention on the node. If a container attempts to exceed the specified limit, the system will throttle the container. This allows the container to have a consistent level of service independent of the number of pods scheduled to the node.
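
In cgroup terms, the limit translates to a CFS quota per scheduling period (100000 microseconds by default), roughly limit_millicores * 100:

cpu limit 500m  ->  cpu.cfs_quota_us ~ 50000 of cpu.cfs_period_us = 100000 (half a core per period)
cpu limit 2     ->  cpu.cfs_quota_us ~ 200000 (two full cores per period)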

 4. Memory Requests:

By default, a container is able to consume as much memory on the node as possible. In order to improve placement of pods in the cluster, specify the amount of memory required for a container to run. The scheduler will then take available node memory capacity into account prior to binding your pod to a node. A container is still able to consume as much memory on the node as possible even when specifying a request.

 5. Memory Limits:

If you specify a memory limit, you can constrain the amount of memory the container can use. For example, if you specify a limit of 200Mi, a container will be limited to using that amount of memory on the node. If the container exceeds the specified memory limit, it will be terminated and potentially restarted dependent upon the container restart policy.

 6. Ephemeral Storage Requests:

By default, a container is able to consume as much local ephemeral storage on the node as is available. In order to improve placement of pods in the cluster, specify the amount of required local ephemeral storage for a container to run. The scheduler will then take available node local storage capacity into account prior to binding your pod to a node. A container is still able to consume as much local ephemeral storage on the node as possible even when specifying a request.

 7. Ephemeral Storage Limits:

If you specify an ephemeral storage limit, you can constrain the amount of ephemeral storage the container can use. For example, if you specify a limit of 2Gi, a container will be limited to using that amount of ephemeral storage on the node. If the container exceeds the specified ephemeral storage limit, the pod is evicted from the node.
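
Putting sections 2 through 7 together, a single container spec can carry all of these requests and limits (values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo                # illustrative name
spec:
  containers:
    - name: app
      image: openshift/hello-openshift
      resources:
        requests:
          cpu: 250m                  # scheduling weight (CFS shares)
          memory: 256Mi
          ephemeral-storage: 1Gi
        limits:
          cpu: "1"                   # hard cap enforced by CFS quota
          memory: 512Mi              # exceeding this triggers an OOM kill of the container
          ephemeral-storage: 2Gi     # exceeding this evicts the pod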

 8. Pods per Core:

The podsPerCore parameter limits the number of pods the node can run based on the number of processor cores on the node. For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node is 40.

 9. Max Pods per node:

The maxPods parameter limits the number of pods the node can run to a fixed value, regardless of the properties of the node. Two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore and maxPods. If you use both options, the lower of the two limits the number of pods on a node. In order to configure these parameters, label the MachineConfigPool.

$ oc label machineconfigpool worker custom-kubelet=small-pods

$ oc describe machineconfigpool worker

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
 creationTimestamp: 2019-02-08T14:52:39Z
 generation: 1
 labels:
   custom-kubelet: small-pods

Create the KubeletConfig (CR) as shown below.

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
 name: set-max-pods 
spec:
 machineConfigPoolSelector:
   matchLabels:
     custom-kubelet: small-pods 
 kubeletConfig:
   podsPerCore: 10 
   maxPods: 250

 10. Limiting the bandwidth available to the Pods:

You can apply quality-of-service traffic shaping to a pod and effectively limit its available bandwidth. Egress traffic (from the pod) is handled by policing, which simply drops packets in excess of the configured rate. Ingress traffic (to the pod) is handled by shaping queued packets to effectively handle data. The limits you place on a pod do not affect the bandwidth of other pods. To limit the bandwidth of a pod, specify the data traffic speed using the kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth annotations, as shown below.

{
   "kind": "Pod",
   "spec": {
       "containers": [
           {
               "image": "openshift/hello-openshift",
               "name": "hello-openshift"
           }
       ]
   },
   "apiVersion": "v1",
   "metadata": {
       "name": "iperf-slow",
       "annotations": {
           "kubernetes.io/ingress-bandwidth": "10M",
           "kubernetes.io/egress-bandwidth": "10M"
       }
   }
}

Conclusion:

As you can see from this post, there are about ten mechanisms available from the Kubernetes side that can be used very effectively to control the resources on worker nodes in an overcommitted state, provided the prerequisites are applied in the first place. Which mechanism to use ultimately depends on the end user and the use case they are trying to solve.