OpenShift Scale-CI: Part 1 – Evolution

If you’ve played around with Kubernetes or Red Hat OpenShift, which is an enterprise-ready version of Kubernetes for production environments, the following questions may have occurred to you:

  • How large can we scale our OpenShift-based applications in our environments?
  • What are the cluster limits? How can we plan our environment according to object limits?
  • How can we tune our cluster to get maximum performance?
  • What are the challenges of running and maintaining a large and/or dense cluster?
  • How can we make sure each OpenShift release is stable and performant, and satisfies our requirements in our own environments?

We, the OpenShift Scalability team at Red Hat, created an automation pipeline and tooling called OpenShift Scale-CI to help answer all of these questions. OpenShift Scale-CI automates the installation and configuration of OpenShift and the running of various performance and scale tests on it across multiple cloud providers.

Motivation behind building Scale-CI

There are two areas which led us to build Scale-CI:

  • Providing a green signal for every OpenShift product release and for every product change that needs to support scale, and shipping our Scalability and Performance guide with the product.
  • Onboarding workloads to see how well they perform at scales of thousands of nodes per cluster.

It is important to find out at what point a system starts to slow down or falls apart completely. This can happen for various reasons, for example:

  • The QPS and burst values configured for the master API server or the kubelet are too low.
  • The etcd backend quota size might be too low for large and dense clusters.
  • The number of objects running on the cluster is beyond the supported cluster limits.
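To illustrate the first reason, here is a minimal sketch of how QPS and burst settings can throttle requests, assuming the limiter behaves like a simple token bucket. The class, the function and all numbers below are hypothetical and chosen only for illustration; they are not OpenShift defaults, and this is not code from Scale-CI.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Toy rate limiter: refills `qps` tokens per second, holds at most `burst`."""
    qps: float
    burst: int
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is allowed.
        self.tokens = float(self.burst)

    def try_acquire(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at `burst`.
        elapsed = now - self.last
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # request would be throttled


def admitted_in_first_second(qps: float, burst: int, requests: int) -> int:
    """Count how many of `requests` evenly spaced calls within one second pass."""
    bucket = TokenBucket(qps=qps, burst=burst)
    return sum(bucket.try_acquire(now=i / requests) for i in range(requests))


if __name__ == "__main__":
    # Hypothetical settings: the same load against a low and a higher limit.
    print("low limits :", admitted_in_first_second(qps=5, burst=10, requests=100))
    print("high limits:", admitted_in_first_second(qps=50, burst=100, requests=100))
```

With the low settings most of the 100 requests are rejected once the initial burst is used up, while the higher settings admit all of them. On a large or dense cluster, where many components talk to the API server at once, this kind of throttling is one way low values can show up as a slowdown.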

This motivated us to scale test every release of OpenShift and to ship the Scalability and Performance guide with each release, which helps users plan and tune their environments accordingly.

Lab hardware is limited, and compute and storage in the public cloud are paid for by the hour, which can get very expensive at large scale. To use these resources efficiently, automation does a better job than humans at the endless wash, rinse and repeat cycle of CI-based testing. This led us to create automation and tooling which works on any cloud provider and runs performance and scale tests that cover various components of OpenShift: Kubelet, Control plane, SDN, Monitoring with Prometheus, Router, Logging, Cluster Limits and Storage can all be tested with the click of a button.

We used to spend weeks running tests and capturing data. Scale-CI speeds up the process, saving a lot of time and money on compute and storage resources. Most importantly, it gave us the time to work on creative tasks like tooling and designing new scale tests to add to the framework.

Not every team or user has the luxury of building automation and tooling, or has access to the hardware needed to test how well their application or OpenShift component works at scales above 2000 nodes. As part of the Performance and Scalability team, we have access to a huge amount of hardware resources, and this motivated us to build Scale-CI in such a way that anyone can use it and participate in the community around it. Users can submit a pull request on GitHub with a set of templates to get their workload onboarded into the pipeline. The onboarded workloads are automatically tested at scale on an OpenShift cluster built from the latest builds. It doesn’t hurt that this entire process is managed and maintained by the OpenShift Scalability team.

You can find us online at the openshift-scale GitHub organization. Any feedback or contributions are most welcome. Keep an eye out for our next blog, OpenShift Scale-CI Deep Dive: Part 2, which will cover the various Scale-CI components, including the workloads, the pipeline and the tooling we use to test OpenShift at scale.

Categories
Kubernetes, OpenShift Container Platform, OpenShift Ecosystem