Over the last few days, the OpenShift CI/CD team (responsible for the rolling deployments of OpenShift code to the OpenShift Online environments) has been upgrading all clusters to OpenShift 3.6. In fact, OpenShift.com was running 3.6 code over a week before GA.

From a systems thinking perspective, I have been watching very closely during these upgrade windows, alternating my attention between Grafana dashboards, Prometheus, and good old ad hoc Ansible to learn the rules of these systems.

Given these large production clusters, what are the rules of the road? What are the key performance indicators that define success, so that we can measure ourselves against them? There is very little prior art where we live: OpenShift.com is a multi-tenant, Kubernetes-based container platform where users can bring their own code via source-to-image technology and run it, for free. That makes it unique within the Kubernetes landscape; I'm not aware of any other production deployment of comparable size and scale that offers a secure, multi-tenant distribution of Kubernetes for free.

Our team is responsible for performance and scalability testing of OpenShift. To that end, we stand up very large clusters and blast them with load generators. Largely due to research and community development done by CoreOS and Google, we anticipated gains in scalability when upgrading from etcd2 to etcd3. You can read more about the anticipated improvements in this blog or watch this presentation. The two main improvements we were after were the move from JSON to protobuf as the transport format and the change in on-disk storage format. We expected a reduction in CPU usage related to marshaling JSON objects, a reduction in network traffic (again from the transport format change), and a reduction in disk IOPS. We published the gory technical details from our lab experiments in the OpenShift Scaling and Performance Guide here.
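For illustration only, here is a minimal Go sketch of what that client-side difference looks like: the etcd v2 API marshals JSON over plain HTTP, while the v3 API sends protobuf messages over a single gRPC connection. The endpoint address is a placeholder, the import paths are the pre-module github.com/coreos/etcd layout from the 3.x era, and none of this is code from our environment.

// etcd_clients.go: contrast the etcd v2 (JSON over HTTP) and v3 (protobuf over
// gRPC) client APIs. Illustrative sketch only; assumes an etcd member reachable
// at 127.0.0.1:2379.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv2 "github.com/coreos/etcd/client"
	clientv3 "github.com/coreos/etcd/clientv3"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// v2 API: every request/response is a JSON document over plain HTTP.
	c2, err := clientv2.New(clientv2.Config{
		Endpoints: []string{"http://127.0.0.1:2379"},
		Transport: clientv2.DefaultTransport,
	})
	if err != nil {
		log.Fatal(err)
	}
	kapi := clientv2.NewKeysAPI(c2)
	if _, err := kapi.Set(ctx, "/demo/key", "v2-value", nil); err != nil {
		log.Fatal(err)
	}

	// v3 API: requests are protobuf messages multiplexed over one gRPC (HTTP/2)
	// connection, which is where the CPU and network savings come from.
	c3, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c3.Close()
	if _, err := c3.Put(ctx, "demo/key", "v3-value"); err != nil {
		log.Fatal(err)
	}
	resp, err := c3.Get(ctx, "demo/key")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}

In our case the interesting part is not the client calls themselves but the transport underneath them, which is what the CPU, network, and IOPS graphs below reflect.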

OpenShift began supporting etcd3 once it was supported by Kubernetes (v1.5). However, it did not support the migration to the v3 schema until 3.6 (which GA'd on August 9th).

As I mentioned earlier, the CI/CD team has now completed the migration of the on-disk storage format from v2 to v3 in most of our larger multi-tenant OpenShift Online clusters.

I wanted to share several charts that illustrate the improvements brought by the migration to the etcd3 v3 schema. On each graph, I would like to draw your attention to the night of August 10th.

First, a significant reduction in CPU usage by the etcd processes (all of our OpenShift Online environments run 3-node etcd clusters, so there are 3 colored lines on this graph).

The Y-axis is %CPU, so after the upgrade we are, on average, more than one full CPU core more efficient on each node. Side note: We co-locate the apiserver (master), controllers, and etcd services on the same machines.

[Figure etcd_cpu1: etcd CPU usage (%CPU) per etcd node, before and after the migration]

Second, a significant reduction in memory usage (RSS) of the etcd processes. This graph is average memory usage of etcd across the three etcd nodes, and represents a reduction from over 30GB per etcd instance down to 17GB. That's nearly 40GB RAM saved across the three nodes.

[Figure etcd_rss: etcd resident memory (RSS), averaged across the three etcd nodes]

One thing I'd like to call out in the etcd RSS graph is a change in behavior before and after the upgrade. Note the relative flatness of the line before the upgrade and the sawtooth pattern after. The sawtooth represents etcd allocating and freeing roughly 6GB about every 15 minutes. We looked into it more deeply and discovered the following:

Snapshots are occurring at an interval that matches the RSS spikes, roughly every 15 minutes:

Aug 18 10:12:46 ip-172-31-55-199.ec2.internal etcd[87310]: saved snapshot at index 7021433842
Aug 18 10:28:21 ip-172-31-55-199.ec2.internal etcd[87310]: start to snapshot (applied: 7021493843, lastsnap: 7021433842)
Aug 18 10:28:39 ip-172-31-55-199.ec2.internal etcd[87310]: saved snapshot at index 7021493843
Aug 18 10:43:54 ip-172-31-55-199.ec2.internal etcd[87310]: start to snapshot (applied: 7021553845, lastsnap: 7021493843)
Aug 18 10:44:13 ip-172-31-55-199.ec2.internal etcd[87310]: saved snapshot at index 7021553845
Aug 18 10:59:48 ip-172-31-55-199.ec2.internal etcd[87310]: start to snapshot (applied: 7021613846, lastsnap: 7021553845)
Aug 18 11:00:07 ip-172-31-55-199.ec2.internal etcd[87310]: saved snapshot at index 7021613846
Aug 18 11:15:13 ip-172-31-55-199.ec2.internal etcd[87310]: start to snapshot (applied: 7021673847, lastsnap: 7021613846)
Aug 18 11:15:31 ip-172-31-55-199.ec2.internal etcd[87310]: saved snapshot at index 7021673847

Based on our etcd configuration, a snapshot is taken every 60,000 writes:

ETCD_SNAPSHOT_COUNT=60000
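As a quick sanity check (this arithmetic is mine, not part of the original investigation), the applied-index deltas in the journal excerpt above line up with that setting: roughly 60,000 applied entries between consecutive snapshots, spread over about fifteen and a half minutes, which works out to a steady write rate in the mid-60s per second. A short Go sketch of the math, treating applied raft entries as a rough proxy for writes:

// snapshot_rate.go: back-of-the-envelope check that the ~15 minute snapshot
// cadence is simply ETCD_SNAPSHOT_COUNT writes arriving at a steady rate.
// Indices and timestamps are copied from the journal excerpt above.
package main

import (
	"fmt"
	"time"
)

func main() {
	// Two consecutive "start to snapshot" events from the log.
	const (
		firstApplied  = 7021493843 // 10:28:21
		secondApplied = 7021553845 // 10:43:54
	)
	interval := 15*time.Minute + 33*time.Second // 10:28:21 -> 10:43:54

	writes := secondApplied - firstApplied // ~60,000, i.e. ETCD_SNAPSHOT_COUNT
	rate := float64(writes) / interval.Seconds()

	fmt.Printf("%d applied entries in %v = about %.0f writes/sec\n", writes, interval, rate)
	// Prints: 60002 applied entries in 15m33s = about 64 writes/sec
}

That steady, modest write rate is exactly what the mutating-request graph below shows.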

Looking only at mutating apiserver requests, with this PromQL query:

sort_desc(drop_common_labels(sum without (instance,type,code) (rate(apiserver_request_count{verb=~"POST|PUT|DELETE|PATCH"}[5m]))))

We see a very consistent rate of change, and nothing that correlates with the spikes. This supports our theory that the apparent periodicity of the spikes simply falls out of a cluster whose etcd write traffic is relatively steady: a snapshot every 60,000 writes at a steady write rate lands roughly every 15 minutes. We therefore conclude that the spikes are likely "the new normal" (systems thinking!). Below is a Prometheus graph of the mutating API server request query. Note that this particular graph does not bracket the upgrade timeframe; it is entirely post-upgrade. The point of the graph is to show that mutating requests (those which cannot be served from caches or watches and must be written to etcd, counting toward the snapshot threshold) are consistent over time:
[Figure prom_mutating: Prometheus graph of mutating API server request rates, post-upgrade]
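As a side note, if you want to pull that series programmatically instead of through the Prometheus UI, a minimal sketch against Prometheus's HTTP API (/api/v1/query) could look like the following. The endpoint URL is a placeholder, and I've dropped the sort_desc/drop_common_labels wrappers since they only affect presentation (drop_common_labels was removed in Prometheus 2.0 anyway):

// query_prom.go: fetch the mutating-request rate from Prometheus over its HTTP
// API. The Prometheus URL below is a placeholder, not an endpoint from our
// environment.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	promURL := "http://prometheus.example.com:9090" // placeholder

	q := `sum without (instance,type,code) (rate(apiserver_request_count{verb=~"POST|PUT|DELETE|PATCH"}[5m]))`

	v := url.Values{}
	v.Set("query", q)

	resp, err := http.Get(promURL + "/api/v1/query?" + v.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// The response is JSON: {"status":"success","data":{"resultType":"vector",...}}
	fmt.Println(string(body))
}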

Third, and most impressive to me anyway, is a roughly 3x reduction in disk IOPS for each etcd node. For the last several months, OpenShift developers have been working in Kubernetes to reduce the impact of events on etcd (events are stored in etcd), and significant gains were realized there. For stability reasons, we decided to land those changes before conducting the etcd migration. Once the system was stabilized and the etcd migration completed, our I/O profile looks like this:
[Figure etcd_iops: disk IOPS per etcd node, before and after the migration]

Roughly a 3x reduction in IOPS for each etcd node, and a much tighter standard deviation. And yes, we are using io1 volumes.

Fourth, a 10x reduction in network traffic, thanks to the efficiency of protobuf/gRPC versus JSON:
[Figure etcd_net: etcd network traffic, before and after the migration]

These changes are foundational for us to reach the next levels of density and scale in OpenShift Online, as well as in our enterprise on-premises products. Rolling pre-release bits out to the production "free" environments has unmistakably improved our quality, as well as our understanding of the rules and operational models unique to large, multi-tenant Kubernetes clusters.


About the author

A 20+ year tech industry veteran, Jeremy is a Distinguished Engineer within the Red Hat OpenShift AI product group, building Red Hat's AI/ML and open source strategy. His role involves working with engineering and product leaders across the company to devise a strategy that will deliver a sustainable open source, enterprise software business around artificial intelligence and machine learning.
