Firstly, a big thank you to Thomas Wiest and the OpenShift Online/Dedicated teams, whose input was key to this post.

Red Hat operates hosted OpenShift services using the same codebase as OpenShift Container Platform. There are two hosted offerings: OpenShift Online, a shared multi-tenant service, and OpenShift Dedicated, a dedicated hosted environment for single-tenant use.

All of this is monitored and managed with tooling developed and deployed by Red Hat, and the approach has been refined through the experience of running OpenShift Online (v2) and now OpenShift Dedicated and Online (v3). The monitoring is currently based on Zabbix. NB: watch for future posts on the future of CloudForms and on the cm-ops project, which is underway to add alerting and thresholds to CloudForms as well.

All of the code can be found in the following Git repository:

https://github.com/openshift/openshift-tools

NOTE: Some of the Standard Operating Procedures (SOPs) referred to in the public Git repository are not publicly visible; they are kept in a private repository because they are specific to Red Hat Operations.

The tooling has been designed to be deployed and operated as containers, and the complete installation is automated using Ansible. This allows users to stand up a stand-alone, single-machine deployment on which they can develop monitoring, metrics, and alerts, and test any changes. Once a new threshold or metric has been developed, it can be pushed into the stg branch and then promoted into the int and then prod branches.

For more information about how to build an all-in-one deployment, please see:

https://github.com/openshift/openshift-tools/blob/stg/docs/local_development_monitoring.adoc

Today, this solution has been adopted by some customers who do not already have monitoring and alerting capabilities on premises, either through self-installation or by working with Red Hat Consultancy based on the open code. Please note that Red Hat does not officially support the code in the openshift-tools repository. This code is in active use and may break from time to time. Also, no effort is being made to keep it backwards-compatible.

The rest of this post is not going to explore the architectural decisions made, how monitoring works or components like the Zagg in detail, but will instead focus on alerts and thresholds.

Alerts and Thresholds

Alerts and thresholds can be reused by enterprises that already have an event and alert infrastructure in place. For example, many organisations already have tooling such as IBM Netcool, CA Unified Infrastructure Monitoring, SolarWinds, BMC TrueSight or one of the many other solutions out there.

If you use one of these tools, you will want to:

  1. Harvest the counters that need to be tracked.
  2. Harvest the thresholds that are being alerted.
  3. Harvest the metrics to collect.

All of these counters and thresholds are configured by Ansible as part of the openshift-tools monitoring installation, so let’s walk through a couple of examples. The configuration is stored as Ansible variables for the playbooks, in the following directory of the Git repo:

https://github.com/openshift/openshift-tools/tree/stg/ansible/roles/os_zabbix/vars

Firstly, let’s define a couple of terms that are used in the playbooks. In Zabbix, an item is a piece of data we want to receive, that is, a metric, and each item is referred to by its “key”.

Also in Zabbix, a trigger is a logical expression that defines a problem threshold and is used to “evaluate” data received in items. The part that defines what to evaluate is called the trigger’s expression.

Based on this, we can look at a key from one of the Ansible variable files, for example:

$ cat template_docker.yml | grep docker.storage.data.space.percent_available
- key: docker.storage.data.space.percent_available
expression: "{Template Docker:docker.storage.data.space.percent_available.max(#2)}<5 or {Template Docker:docker.storage.data.space.available.max(#2)}<5" # < 5% or < 5GB
expression: "{Template Docker:docker.storage.data.space.percent_available.max(#2)}<10 or {Template Docker:docker.storage.data.space.available.max(#2)}<10" # < 10% or < 10GB

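Here you can see that we have defined a key (item) and then two expressions (triggers). In Zabbix, max(#2) is the maximum of the last two collected values, so each trigger fires only once both of the two most recent readings are below the threshold: less than 5 (or 10) percent available, or less than 5 (or 10) GB available, in the storage space used by the graph driver of the docker daemon on a node.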
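To make the trigger logic concrete, here is a minimal Python sketch of how such an expression could be evaluated in another alerting tool. The function names are hypothetical and are not part of openshift-tools or Zabbix; raising an error when fewer than two samples exist is a choice made for this sketch only.

```python
def max_of_last(values, count):
    """Mimic Zabbix max(#count): the largest of the most recent `count` values."""
    recent = values[-count:]
    if len(recent) < count:
        # Assumption of this sketch: refuse to evaluate on too little data.
        raise ValueError("not enough samples to evaluate the trigger")
    return max(recent)


def docker_storage_trigger(percent_samples, gb_samples, threshold):
    """Return True when either series has been below `threshold`
    for both of its two most recent samples."""
    return (max_of_last(percent_samples, 2) < threshold
            or max_of_last(gb_samples, 2) < threshold)


# Made-up sample data, oldest first.
percent_available = [12.0, 4.8, 4.1]
gb_available = [30.0, 28.5, 27.9]

print(docker_storage_trigger(percent_available, gb_available, 5))
```

Because the maximum of the last two values is compared, a single low reading followed by a recovery does not fire the alert, which damps short-lived dips.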

You can pull out all of the expressions to see all of the thresholds that are currently alerted for on the OpenShift Online and Dedicated platforms with a simple bit of “grepping” once you have cloned the repo:

# git clone https://github.com/openshift/openshift-tools
# cd openshift-tools
# git show-branch
[stg] …
# cat ./ansible/roles/os_zabbix/vars/* | grep -B1 -A2 expression
………………<snip>...............

Using this method you should be able to collect all of the threshold expressions that are alerted on in the platform and replicate them in your own OpenShift Container Platform deployment.
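If you would rather have the thresholds in a structured form than as grep output, something along the following lines may help. It is only a sketch: the regular expression handles simple comparisons of the shape shown earlier ({template:key.function(parameter)} operator value) and will miss more complex expressions, and the function names harvest and harvest_dir are hypothetical.

```python
import re
from pathlib import Path

# One comparison, for example:
# {Template Docker:docker.storage.data.space.available.max(#2)}<5
TERM = re.compile(
    r"\{(?P<template>[^:{}]+):(?P<key>[^{}]+)\.(?P<func>\w+)\((?P<param>[^)]*)\)\}"
    r"\s*(?P<op>[<>=#]+)\s*(?P<value>[\d.]+\w?)"
)


def harvest(text):
    """Yield one dict per simple comparison found on 'expression:' lines."""
    for line in text.splitlines():
        if "expression:" not in line:
            continue
        for match in TERM.finditer(line):
            yield match.groupdict()


def harvest_dir(vars_dir):
    """Yield (file name, comparison) pairs for every .yml file in vars_dir."""
    for path in sorted(Path(vars_dir).glob("*.yml")):
        for term in harvest(path.read_text()):
            yield path.name, term


# Demonstration on one of the expression lines quoted earlier in this post.
sample = ('expression: "{Template Docker:docker.storage.data.space.'
          'percent_available.max(#2)}<5 or {Template Docker:docker.storage.'
          'data.space.available.max(#2)}<5" # < 5% or < 5GB')

for term in harvest(sample):
    print(term["key"], term["func"], term["param"], term["op"], term["value"])
```

From the root of a cloned repo, harvest_dir("ansible/roles/os_zabbix/vars") would walk all of the variable files; the resulting dicts can then be mapped onto whatever threshold format your own alerting tool expects.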

How the Metrics for the Keys (Items) are Generated

The first example above looked at the docker daemon storage space and the alerts that fire when it drops below 5 or 10 percent/GB. For these alerts to fire, the raw metrics must be collected from the hosts. The agents that collect these values are in the same repo, this time in the following folder:

https://github.com/openshift/openshift-tools/blob/stg/scripts/monitoring

For the docker daemon metrics you can see the following script, which is simply called from the crontab:

https://github.com/openshift/openshift-tools/blob/stg/scripts/monitoring/cron-send-docker-metrics.py

The script calls a utility script as below to collect the information and then posts this to the custom Zabbix Collector (which is beyond the scope of this blog post).
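If you are feeding a different monitoring system, the important part is producing values under the same item keys that the thresholds refer to. The sketch below is not the logic from dockerutil.py, which is not reproduced here; it is a hypothetical helper that turns two raw byte counts into the two items used by the docker storage triggers, and it assumes the absolute value is reported in GiB.

```python
def docker_storage_items(total_bytes, used_bytes):
    """Build the two item values compared by the docker storage triggers.

    Hypothetical helper: the inputs are assumed to be raw byte counts
    for the docker data storage on a node.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    available_bytes = total_bytes - used_bytes
    return {
        "docker.storage.data.space.percent_available":
            round(100.0 * available_bytes / total_bytes, 2),
        # Assumption: absolute space is expressed in GiB (1024 ** 3 bytes).
        "docker.storage.data.space.available":
            round(available_bytes / 1024 ** 3, 2),
    }


# Made-up numbers: a 100 GiB pool with 96 GiB in use.
gib = 1024 ** 3
for key, value in docker_storage_items(100 * gib, 96 * gib).items():
    print(key, value)
```

A dictionary keyed by item name keeps the sender independent of the transport, so the same values could be posted to Zabbix or to any other collector.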

https://github.com/openshift/openshift-tools/blob/stg/openshift_tools/monitoring/dockerutil.py

Summary

We have looked at the openshift-tools repo and the OpenShift platform-level monitoring, metrics and alerts that are configured and collected as part of the Red Hat OpenShift cloud services. Hopefully this will allow users who are looking to deploy OpenShift on premises to harvest this knowledge and plug it into their existing alerting and eventing systems.

Further Reading

To complement the OpenShift platform monitoring, also make sure to look at application-level monitoring, either using tools such as Prometheus (the Fabric8 tooling has some great quick starts for this):

https://fabric8.io/

Or from partners such as:

http://www.coscale.com/blog/openshift-monitoring
https://blog.openshift.com/appdynamics-integration-with-openshift/
https://newrelic.com/openshift
https://blog.openshift.com/openshift-ecosystem-using-sysdig-monitor-openshift

Plus many more.

Thanks for reading this far.

Chris