The Basics

A common request from OpenShift users has long been to raise the number of pods per node. The limit has been set at 250 from the first Kubernetes-based release (3.0) through 4.2, but very powerful nodes can support many more than that.

This blog describes the work we did to achieve 500 pods per node: the initial testing, the bug fixes and other changes we needed to make, the testing we performed to verify that it works, and what you need to do if you’d like to try this yourself.

Background

Computer systems have continued their relentless progress in computational power, memory, storage capacity, and I/O bandwidth. Systems that not long ago were exotic supercomputers are now dwarfed in capability (if not physical size and power consumption) by very modest servers. Not surprisingly, one of the most frequent questions we've received from customers over the years is "can we run more than 250 pods (the tested maximum until now) per node?" Today we're happy to announce that the answer is yes!

In this blog, I'm going to discuss the changes to OpenShift, the testing process to verify our ability to run much larger numbers of pods, and what you need to do if you want to increase your pod density.

Goals

Our goal with this project was to run 500 pods per node on a cluster with a reasonably large number of nodes. We also considered it important that these pods actually do something; pause pods, while convenient for testing, aren't a workload that most people are interested in running on their clusters. At the same time, we recognized that the incredible variety of workloads in the rich OpenShift ecosystem would be impractical to model, so we wanted a simple workload that's easy to understand and measure. We'll discuss this workload, which you can clone and experiment with, below.

Initial Testing

Early experiments on OpenShift 4.2 identified issues with communication between the control plane (in particular, the kube-apiserver) and the kubelet when attempting to run nodes with a large number of pods. Using a client/server builder application replicated to produce the desired number of pods, we observed that the apiserver was not getting timely updates from the kubelet as pods came into existence, resulting in problems such as networking not coming up for pods, and pods that required networking failing as a result.

Our test was to run many replicas of the application to reach the requisite number of pods. We observed that up to about 380 pods per node, the applications would start and run normally. Beyond that, we would see some pods remain in Pending state, and some start but then terminate. A pod that is expected to run but terminates does so because the code inside it decides to exit. There were no messages in the logs identifying particular problems; the pods appeared to be starting up correctly, but the code within them was failing, causing them to terminate. Studying the application, the most likely reason for this was that the client pod was unable to connect to the server, indicating that it did not have a working network.

As an aside, we observed that the kubelet declared the pods to be Running very quickly; the delay was in the apiserver realizing this. Again, there were no log messages in either the kubelet or the apiserver logs indicating any issue. The network team asked us to collect logs from openshift-sdn, which manages pod networking; that too showed nothing out of the ordinary. Indeed, even using host networking didn't help.

To simplify the test, we wrote a much simpler client/server deployment, running on only two nodes, where the client would simply retry connecting to the server until it succeeded rather than failing. The client pods logged the number of connection attempts made and the elapsed time before success. We ran 500 replicas of this deployment and found that up to about 450 pods total (225 per node), the pods started up and quickly went into Running state. Between 450 and 620 pods, the rate of pods transitioning to Running slowed down, and actually stalled for about 10 minutes, after which the backlog cleared at a rate of about 3 pods/minute until eventually (after a few more hours) all of the pods were running. This supported the hypothesis that there was nothing really wrong with the kubelet; in the original test, the client pods were able to start running but most likely timed out connecting to the server and, since they did not retry, failed.
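
The client logic was essentially a retry loop. A minimal sketch of the idea (hypothetical, not the exact code we used, and assuming the container image provides nc) looks like this:

# Keep trying to reach the server, then report how many attempts it took and how long.
# SERVER_HOST and SERVER_PORT are assumed to be provided via the pod's environment.
start=$(date +%s)
attempts=0
until nc -z "$SERVER_HOST" "$SERVER_PORT" 2>/dev/null; do
  attempts=$((attempts + 1))
  sleep 1
done
echo "Connected after $attempts attempts in $(( $(date +%s) - start )) seconds"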

On the hypothesis that the issue was the rate of pod creation, we tried adding a sleep 30 between creating each client-server pair. This pushed out the point at which pod creation slowed down to about 375 pods/node, but eventually the same problem occurred. We tried another experiment placing all of the pods within one namespace, which succeeded: all of the pods quickly started and ran correctly. As a final experiment, we used pause pods (which do not use the network) with separate namespaces, and hit the same problem starting at around 450 pods (225/node). So this was clearly a function of the number of namespaces, not the number of pods; we had established that it was possible to run 500 pods per node, but without being able to use multiple namespaces, we couldn't declare success.

Fixing the problem

By this point, it was quite clear that the issue was that the kubelet was unable to communicate with the apiserver at a fast enough rate. The most obvious suspect when that happens is the kubelet throttling its transmissions to the apiserver per the kubeAPIQPS and kubeAPIBurst kubelet parameters, which are enforced by client-side (Go) rate limiting. The defaults that we inherit from upstream Kubernetes are 5 and 10, respectively: the kubelet may send at most 5 queries to the apiserver per second, with a short-term burst of 10. It's easy to see how, under heavy load, the kubelet may need more bandwidth to the apiserver. In particular, each namespace requires a certain number of secrets, which have to be retrieved from the apiserver via queries, eating into those limits; additional user-defined secrets and configmaps only increase the pressure.

Throttling is used to protect the apiserver from inadvertent overload by the kubelet, but this mechanism is a very broad brush. Rewriting it, however, would have been a major architectural change that we didn't consider warranted. Therefore, the goal was to identify the lowest safe settings for kubeAPIQPS and kubeAPIBurst.

Experimenting with different settings, we found that setting QPS/burst to 25/50 worked fine for 2000 pods on 3 nodes with a reasonable number of secrets and configmaps, but 15/30 didn't.

The difficulty in tracking this down was that nothing in either the logs or the Prometheus metrics identified it. Throttling is reported by the kubelet at verbosity 4 (v=4 in the kubelet arguments), but the default verbosity, both upstream and within OpenShift, is 3, and we didn't want to change that globally. Throttling had been seen as a temporary, harmless condition, hence it was logged only at a higher verbosity level. However, with our experiments frequently showing throttling delays of 30 seconds or more, leading to pod failures, it clearly was not harmless. Therefore, I opened https://github.com/kubernetes/kubernetes/pull/80649, which eventually merged and which we pulled into OpenShift in time for OpenShift 4.3. While this alone does not solve throttling, it greatly simplifies diagnosis. Adding throttling metrics to Prometheus would be desirable, but that is a longer-term project.
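
With that change in place (or with the kubelet verbosity raised), throttling shows up directly in the kubelet journal. Something along these lines works; the exact message text varies by Kubernetes version, so grepping loosely is safest:

% oc adm node-logs --role=worker -u kubelet | grep -i throttl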

The next question was what to set kubeAPIQPS and kubeAPIBurst to. It was clear that 5/10 wouldn't be suitable for larger numbers of pods. We wanted some safety margin above the tested 25/50, and settled on 50/100 after running node scaling tests on OpenShift 4.2 with those parameters set.

Another piece of the puzzle was the watch-based configmap and secret manager for the kubelet. This allows the kubelet to set watches on secrets and configmaps supplied by the apiserver; for objects that don't change very often, watches are much more efficient for the apiserver to handle, as the watched objects are cached locally. This change, which didn't make OpenShift 4.2, enables the apiserver to handle a heavier load of secrets and configmaps, easing the potential burden of the higher QPS/burst values. If you’re interested in the underlying details of the change in Go 1.12, see the Go 1.12 release notes, under net/http.

To summarize, we made the following changes between OpenShift 4.2 and 4.3 to set the stage for scaling up the number of pods:

  • Change the default kubeAPIQPS from 5 to 50.
  • Change the default kubeAPIBurst from 10 to 100.
  • Change the default configMapAndSecretChangeDetectionStrategy from Cache to Watch.
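
On OpenShift 4.3 and later these are the defaults, so no action is required. Purely as an illustration, here is roughly what those settings look like when expressed through a custom KubeletConfig (a sketch only; the machine config pool label matches the example later in this post, and on 4.3+ you shouldn't need this at all):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-kubelet-rates
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    kubeAPIQPS: 50
    kubeAPIBurst: 100
    configMapAndSecretChangeDetectionStrategy: Watch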

Testing 500 pods/node

The stage was now set to actually test 500 pods/node as part of OpenShift 4.3 scaling testing. The questions we had to decide were:

  • What hardware do we want to use?
  • What OpenShift configuration changes would be needed?
  • How many nodes do we want to test?
  • What kind of workload do we want to run?

Hardware

A large number of pods, particularly spread across many namespaces, can put considerable stress on the control plane and the monitoring infrastructure. Therefore, we deemed it essential to use large nodes for both. As we expected the monitoring database to have very large memory requirements, we placed the monitoring stack (as is our standard practice) on a separate set of infrastructure nodes rather than sharing the worker nodes. We settled on the following, using AWS as our underlying platform:

  • Master Nodes

    The master nodes were r5.4xlarge instances. r5 instances are memory-optimized, to allow for large apiserver and etcd processes. The instance type consists of:

    • CPU: 16 vCPUs, Intel Xeon Platinum 8175
    • Memory: 128 GB
    • Storage: EBS (no local storage), 4.75 Gbps
    • Network: up to 10 Gbps.
  • Infrastructure Nodes

    The infrastructure nodes were m5.12xlarge instances. m5 instances are general purpose. The instance type consists of:

    • CPU: 48 vCPUs, Intel Xeon Platinum 8175
    • Memory: 192 GB
    • Storage: EBS (no local storage), up to 9.5 Gbps
    • Network: 10 Gbps
  • Worker Nodes

    The worker nodes were m5.2xlarge. This allows us to run quite a few reasonably simple pods, but typical application workloads would be heavier (and customers are interested in very big nodes!). The instance type consists of:

    • CPU: 8 vCPUs, Intel Xeon Platinum 8175
    • Memory: 32 GB
    • Storage: EBS (no local storage), 4.75 Gbps
    • Network: up to 10 Gbps

Configuration Changes

The OpenShift default for maximum pods per node is 250. Worker nodes also have to run parts of the cluster infrastructure in addition to user pods; there are about 10 such control pods per node. Therefore, to ensure that we could definitely achieve 500 worker pods per node, we elected to set maxPods to 520 with a custom KubeletConfig, using the procedure described in the OpenShift documentation:

% oc label --overwrite machineconfigpool worker custom-kubelet=large-pods
% oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: "set-max-pods"
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: 520
EOF
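
Once the machine config pool has finished rolling out the change, one way to confirm the new limit is to check a node's reported pod capacity, for example:

% oc get node <node-name> -o jsonpath='{.status.capacity.pods}{"\n"}'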

This requires an additional configuration change. Every pod on a node needs a distinct IP address, allocated out of the address range assigned to that node. By default, when creating a cluster, the hostPrefix is set to 23 (i.e., a /23 network), allowing for up to 510 addresses -- not quite enough. So we had to set hostPrefix to 22 for this test in the install-config.yaml used to install the cluster.
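
For reference, here is a minimal sketch of the relevant networking stanza of install-config.yaml; the CIDRs shown are the usual defaults, and only hostPrefix differs from stock:

networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 22
  serviceNetwork:
  - 172.30.0.0/16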

In the end, no other configuration changes from stock 4.3 were needed. Note that if you want to run 500 pods per node, you’ll need to make these two changes yourself, as we did not change the defaults.

How many nodes do we want to test?

This is a function of how many large nodes we believe customers will want to run in a cluster. We settled on 100 for this test.

What kind of workload do we want to run?

Picking the workload is a matter of striking a balance between doing something interesting and being easy to set up and run. We settled on a simple client-server workload in which the client sends blocks of data to the server, which the server returns, all at a predefined rate. We elected to start at 25 nodes, then follow up with 50 and 100, using varying numbers of namespaces and pods per namespace. Large numbers of namespaces typically stress the control plane more than the worker nodes, but with a larger number of pods per worker node, we didn't want to discount the possibility that this would affect the test results.

Test Results

We used ClusterBuster to generate the necessary namespaces and deployments for this test to run.

ClusterBuster is a simple tool that I wrote to generate a specified number of namespaces, along with secrets and deployments within those namespaces. It generates two main types of deployments: pausepod and client-server data exchange. Each namespace can have a specified number of deployments, each of which can have a defined number of replicas; the client-server mode can additionally specify multiple client containers per pod, but we didn't use that feature. The tool uses oc create and oc apply to create objects; we created 5 objects per oc apply, running two processes concurrently. This allows the test to proceed more quickly, but we've found that it also creates more stress on the cluster. ClusterBuster labels all objects it creates with a known label, making it easy to clean up everything with

oc delete ns -l clusterbuster
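
The batching described above (five objects per oc apply, two concurrent streams) can be approximated in plain shell. This is only an illustration of the idea, not ClusterBuster itself; it assumes one object per YAML file in an objects/ directory, with no spaces in the filenames:

% ls objects/*.yaml | xargs -n 5 -P 2 sh -c 'oc apply $(printf " -f %s" "$@")' _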

In client-server mode, the clients can be configured to exchange data at a fixed rate for either a fixed number of bytes or for a fixed amount of time. We used both here in different tests.

We ran tests on 25, 50, and 100 nodes, all of which were successful; the "highest" test (i.e., the one with the greatest number of namespaces) in each sequence was:

  • 25 node pausepod: 12500 namespaces, each containing one pod.
  • 25 node client-server: 2500 namespaces, each containing one client-server deployment consisting of four client replicas and one server (5 pods per deployment). Data exchange was at 100 KB/sec (in each direction) per client, 10 MB total in each direction per client.
  • 50 node pausepod: 12500 namespaces, each containing two pods.
  • 50 node client-server: 5000 namespaces, each with one deployment of 4 clients + server, 100 KB/sec, 10 MB total.
  • 100 node client-server: 5000 namespaces, each with one deployment of 9 clients + server, 100 KB/sec for 28800 seconds. In addition, we created and mounted 10 secrets per namespace.

Test Data

I'm going to cover what we found in the 100 node test here, as we didn't observe anything markedly different (scaled appropriately) during the smaller tests.

We collected a variety of data during the test runs, including Prometheus metrics (with Grafana dashboards to monitor cluster activity) and the output of monitor-pod-status, another utility from my OpenShift 4 tools package. Strictly speaking, monitor-pod-status duplicates what we can get from Grafana, but it presents it in an easy-to-read textual format. Finally, I used yet another tool, clusterbuster-connstat, to retrieve the log data left by the client pods and analyze the rate of data flow.
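
If you don't want to pull in the tool, a rough equivalent of monitor-pod-status is a one-liner that counts pods by status:

% oc get pods --all-namespaces --no-headers | awk '{count[$4]++} END {for (s in count) printf "%8d %s\n", count[s], s}'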

Test Timings

The time required to create and tear down the test infrastructure is a measure of how fast the API and nodes can perform operations. This test was run with a relatively low parallelism factor, and operations didn't lag significantly.

Operation                     Approximate Time (minutes)
Create namespaces               4
Create secrets                 43
Create deployments             34
Exchange data                 480
Delete pods and namespaces     29

One interesting observation from the pod creation phase is that pods were being created at about 1600/minute (roughly 27/second), while at any given time there were about 270 pods in ContainerCreating state. Dividing the number of pods in flight by the creation rate (270 ÷ 27/second) indicates that creating each pod took about 10 seconds throughout the run.

Networking

The expected total rate of data exchange is 2 * Nclients * XferRate. In this case, of the 50,000 total pods, 45,000 were clients. At 0.1 MB/sec, this yields an expected aggregate throughput of 9000 MB/sec (72,000 Mbit/sec), or an average of 720 Mbit/sec per node across the 100 worker nodes. Since we'd expect about 1% of the clients, on average, to be colocated with their server, the actual average network traffic would be slightly less. In addition, we'd expect variation depending on how many server pods happened to land on each node; in the configuration we used, each server pod handles 9x the data of each client pod.

I inspected 5 nodes at random; each node showed a transfer rate during the steady state data transfer of between 650 and 780 Mbit/sec, with no noticeable peaks or valleys, which is as expected. This is nowhere near the 10 Gbps limit of the worker nodes we used, but the goal of this test was not to stress the network.
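
One way to spot-check a node's transmit rate is to sample its NIC counters directly. A rough sketch (ens5 is the typical interface name on AWS Nitro instances; verify the name on your own nodes):

% oc debug node/<node-name> -- chroot /host sh -c '
    a=$(cat /sys/class/net/ens5/statistics/tx_bytes); sleep 10
    b=$(cat /sys/class/net/ens5/statistics/tx_bytes)
    echo "$(( (b - a) * 8 / 10 / 1000000 )) Mbit/s"'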

Quis custodiet ipsos custodes?

With apologies to linguistic purists, the few events we observed were related to Prometheus. During the tests, one of the Prometheus replicas typically used about 130 GB of RAM, but a few times memory usage spiked toward 300 GB before ramping back down over several hours. In two cases Prometheus crashed; while we don't have records of why, we believe it likely ran out of memory. The high resource consumption of Prometheus reinforces the importance of robust monitoring infrastructure nodes!
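
If you want to keep an eye on this yourself, the monitoring stack's own resource usage is easy to check (the cluster monitoring components run in the openshift-monitoring namespace):

% oc adm top pod -n openshift-monitoring | grep prometheus-k8s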

Future Work

We have barely scratched the surface of pod density scaling with this investigation. There are many other things we want to look at, over time:

  • Even more pods: as systems grow even more powerful, we can look at even greater pod densities.
  • Adding CPU and memory requests: investigate the interaction between CPU/memory requests and large numbers of pods.

  • Investigate the interaction with other API objects: raw pods per node is only part of what stresses the control plane and worker nodes. Our synthetic test was very simple, and real-world applications will do a lot more. There are a lot of other dimensions we can investigate:

    • Number of configmaps/secrets: very large numbers of these objects in combination with many pods can stress the QPS to the apiserver, in addition to the runtime and the Linux kernel (as each of these objects must be mounted as a filesystem into the pods).
    • Many containers per pod: this stresses the container runtime.
    • Probes: these likewise could stress the container runtime.
  • More workloads: the synthetic workload we used is easy to analyze, but is hardly representative of every use people make of OpenShift. What would you like to see us focus on? Leave a comment with your suggestions.
  • More nodes: 100 nodes is a starting point, but we're surely going to want to go higher. We’d also like to determine whether there’s a curve for maximum number of pods per node vs. number of nodes.

  • Bare metal: typical users of large nodes are running on bare metal, not virtual instances in the cloud.

Credits

I'd like to thank the many members of the OpenShift Perf/Scale, Node, and QE teams who worked with me on this, including (in alphabetical order) Ashish Kamra, Joe Talerico, Ravi Elluri, Ryan Phillips, Seth Jennings, and Walid Abouhamad.