Monitoring OpenShift Health and Application performance in Splunk

 

This is a guest post by Olga Chernysheva, co-founder of Outcold Solutions.

Outcold Solutions provides solutions for monitoring Red Hat OpenShift clusters in Splunk Enterprise and Splunk Cloud. We help businesses reduce the complexity of logging and monitoring by providing an easier-to-use solution. With the power of Splunk Enterprise and Splunk Cloud, we keep metrics and logs in one place, so users can answer complex questions about container performance and cluster health more quickly.

After a typical 10-minute setup, users get a monitoring solution that includes log aggregation, performance and system metrics, metrics from the control plane, application metrics, a dashboard for reviewing network activity, and alerts that notify them about cluster or application performance issues.

Our solutions are powered by Collectord, a container-native application built by Outcold Solutions. Collectord discovers, transforms, and forwards logs, collects system metrics, collects metrics from the control plane of the orchestration framework, and forwards network activity. It provides flexible tools for transforming logs: users can hide sensitive information in log lines before forwarding them, and they can reduce the licensing costs associated with log aggregation by choosing which data to forward from the log streams. Collectord forwards container logs and host logs, and can discover logs written to files by containerized applications.

Outcold Solutions offers a Red Hat certified Collectord image, available in the Red Hat OpenShift Container Catalog.

In this post, we will share some of the ways customers use our application.

Getting started with Monitoring OpenShift

The Monitoring OpenShift solution consists of two parts. The first part is Collectord, a container-native application deployed as a DaemonSet and running on OpenShift nodes. The second part is the Monitoring OpenShift application built for Splunk, which includes pre-built dashboards and alerts. If you don’t have a Splunk deployment, you can download a trial version of Splunk Enterprise or sign up for a Splunk Cloud trial.

In minutes, users can start collecting metrics and forwarding logs. We provide installation instructions that work with OpenShift clusters starting from version 3.4. Out of the box, users can monitor the health of their clusters, define custom triggers for pre-built alerts, navigate through container, system component, and application logs, review cluster configurations, and monitor network activity in the cluster.

Our applications can be integrated with the OpenShift console to help you more quickly get information about the running workloads and pods, and start diagnosing performance issues within your applications.

Figure 1: OpenShift web console with integrated links to Monitoring OpenShift application

Application Monitoring and Log Aggregation

The Monitoring OpenShift application provides rich dashboards for developers. These dashboards are powered by metrics collected from containers and processes, and enriched with OpenShift metadata. Developers can monitor the performance of their applications, review container limits, and set up custom triggers for application-specific alerts. The application can also correlate container and application logs with the metrics collected from the system.

In addition to system metrics, Collectord can forward application-specific metrics, exported in Prometheus format. As an example, you can define exports from your load balancers, Java applications, and databases.
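
For readers unfamiliar with the Prometheus text format, the sketch below shows what an exported payload can look like and how its samples break down into a name, labels, and a value. It is an illustration only, not Collectord's implementation: the metric names, label values, and the `parse_metrics` helper are made up, and the parser deliberately ignores optional timestamps and escaped quotes in label values.

```python
import re

# Sample payload in the Prometheus text exposition format.
# Metric names and label values are hypothetical.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
jvm_memory_used_bytes 52428800
"""

# name, optional {labels}, value (simplified: no timestamp support)
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from a text payload."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, labels, value = match.groups()
        samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each sample becomes a metric name, a set of labels, and a numeric value, which is the shape of data that a forwarder can send on for dashboards and alerts.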

Figure 2: Host overview dashboard in Monitoring OpenShift application

Working closely with our customers, we have found that three teams are usually involved in log aggregation: a team managing the log aggregation infrastructure (the Splunk team), a team managing the application infrastructure (the OpenShift team), and the application developers who use both. We have built a simpler approach for application developers: they can define how they want their logs forwarded without requiring OpenShift admins to redeploy Collectord configurations or Splunk admins to modify configurations on the indexers.

Collectord offers a large number of annotations. For example, a common problem with container logs is multi-line events, such as Java call stacks. Developers can attach annotations to their containers that define the pattern marking the start of an event, and Collectord uses these annotations to transform the logs before forwarding them to Splunk. Similarly, developers can remove terminal escape sequences from the logs. With annotations, users can also reduce the number of log lines or remove sensitive information, including PII, to help keep logs GDPR compliant.
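
To make these transformations concrete, here is a minimal sketch of the three ideas: joining multi-line events by a start-of-event pattern, stripping terminal escape sequences, and masking sensitive values. It does not use Collectord's annotation syntax or reflect its implementation. The patterns and the `join_multiline` and `clean` helpers are hypothetical, and the example assumes that every event starts with a date and that email addresses are the sensitive data to hide.

```python
import re

# Assumption: a new event starts with a date such as "2018-06-01 ".
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} ")
# ANSI CSI sequences such as color codes ("\x1b[31m").
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# Simplified email pattern, used here as a stand-in for sensitive data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def join_multiline(lines, start_pattern=EVENT_START):
    """Append lines that do not start a new event to the previous event."""
    events = []
    for line in lines:
        if start_pattern.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


def clean(event):
    """Strip terminal escape sequences, then mask email addresses."""
    event = ANSI_ESCAPE.sub("", event)
    return EMAIL.sub("<redacted>", event)


raw_lines = [
    "2018-06-01 12:00:00 ERROR \x1b[31mrequest failed\x1b[0m for bob@example.com",
    "java.lang.IllegalStateException: boom",
    "\tat com.example.App.main(App.java:10)",
    "2018-06-01 12:00:01 INFO recovered",
]
for event in (clean(e) for e in join_multiline(raw_lines)):
    print(repr(event))
```

The four raw lines collapse into two events: the error with its stack trace attached, and the following INFO line.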

Collectord automatically enriches logs with OpenShift-specific metadata. This helps users browse logs by Deployment name, BuildConfig name, pod name, container name, or even the IP address of the pod.

Figure 3: Container logs enriched with metadata

A best practice for collecting container logs is to write them to the standard output and standard error of your container. Unfortunately, we have found that with a very complex application, such as a database or a legacy application, this option is not always available. Sometimes two streams are just not enough. Legacy log aggregation tools recommend running a sidecar container, but this approach adds complexity.

Collectord was built to be a container-native system for forwarding logs. It is designed to remove the requirement of running sidecar containers for this scenario. Collectord can automatically discover the application logs from containers and forward them to Splunk. Similar to container logs, these logs are enriched with the OpenShift specific metadata.

Collectord has been tested in large-scale environments. It is highly performant and limited only by the resources it is given. Collectord can easily forward over 10,000 events of 1 KB each per second from a single host.

Cluster Health Monitoring

The application comes with a rich set of dashboards that help monitor cluster health. It has dashboards for control plane components, including etcd clusters, API servers, kubelets, and controllers. With the Monitoring OpenShift application, you can review the capacity and allocations of your clusters and verify how well you are utilizing resources.

Figure 4: Allocatable resources dashboard

Collectord offers over 30 pre-built alerts that notify users about issues in their deployment. Splunkbase offers a number of Alert Actions, including pre-built alerts and the pieces necessary to define custom actions.

Security and Audit

We have found that our customers control access to projects on their OpenShift clusters in various ways. Some use projects inside OpenShift to limit access, while others deploy several OpenShift clusters for different teams. Collectord supports both use cases.

OpenShift administrators can map projects to Splunk indexes and Splunk administrators can define role-based access to the indexes in Splunk. With this approach, users can give teams access to logs and metrics only from their projects, and provide admins access to the logs and metrics from the hosts and control plane.
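
The idea of mapping projects to indexes and granting roles access to indexes can be sketched as two lookup tables. This is a conceptual illustration only, not Collectord or Splunk configuration: the project names, index names, role names, and both helper functions are hypothetical.

```python
# Hypothetical project-to-index mapping, as an OpenShift admin might define it.
PROJECT_TO_INDEX = {
    "payments": "openshift_payments",
    "storefront": "openshift_storefront",
}
# Hypothetical index for host and control plane data, which has no project.
INFRA_INDEX = "openshift_infra"

# Hypothetical role-to-index grants, as a Splunk admin might define them.
ROLE_TO_INDEXES = {
    "payments_dev": {"openshift_payments"},
    "cluster_admin": {
        "openshift_payments",
        "openshift_storefront",
        "openshift_infra",
    },
}


def index_for(event):
    """Pick the index from the event's project; unmapped data goes to INFRA_INDEX."""
    return PROJECT_TO_INDEX.get(event.get("project"), INFRA_INDEX)


def searchable_by(role, events):
    """Return only the events stored in indexes the role may search."""
    allowed = ROLE_TO_INDEXES.get(role, set())
    return [e for e in events if index_for(e) in allowed]


events = [
    {"project": "payments", "message": "charge ok"},
    {"project": "storefront", "message": "cart updated"},
    {"source": "kubelet", "message": "node ready"},
]
print([e["message"] for e in searchable_by("payments_dev", events)])
print([e["message"] for e in searchable_by("cluster_admin", events)])
```

In this sketch a developer role sees only its own project's data, while the admin role also sees host and control plane data.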

Figure 5: Overview dashboard in Monitoring OpenShift application

While reviewing security issues or deployment failures, it is always good to have a history of what has changed. This information is available through OpenShift events. After enabling the audit logs on the API server, users are able to find the initiator of a change, including what has changed, and the IP address from where the request was issued.

Figure 6: Audit dashboard in Monitoring OpenShift application

Additionally, Collectord forwards information about network socket tables from containers and hosts. This allows users to review network activity in their clusters, monitor connections established to IP addresses outside of the clusters, and find unexpected activity and connections.
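
As a rough illustration of the kind of question socket table data can answer, the sketch below filters a list of connections down to those whose remote address lies outside the cluster. It is not how the Network dashboard is implemented. The record layout, the `external_connections` helper, and the network ranges are assumptions; substitute the pod, service, and host networks of your own cluster.

```python
import ipaddress

# Assumed cluster-internal ranges (pod, service, and host networks).
# These values are placeholders for illustration.
CLUSTER_NETWORKS = [
    ipaddress.ip_network("10.128.0.0/14"),
    ipaddress.ip_network("172.30.0.0/16"),
    ipaddress.ip_network("192.168.0.0/24"),
]


def is_external(ip):
    """True if the address is outside every known cluster network."""
    address = ipaddress.ip_address(ip)
    return not any(address in network for network in CLUSTER_NETWORKS)


def external_connections(connections):
    """Keep only connections whose remote address is outside the cluster."""
    return [c for c in connections if is_external(c["remote_ip"])]


# Hypothetical socket table records.
connections = [
    {"pod": "web-1", "remote_ip": "10.129.0.5", "remote_port": 5432},
    {"pod": "web-1", "remote_ip": "172.30.0.1", "remote_port": 443},
    {"pod": "batch-7", "remote_ip": "93.184.216.34", "remote_port": 443},
]
for conn in external_connections(connections):
    print(conn["pod"], "->", conn["remote_ip"], conn["remote_port"])
```

Only the connection from `batch-7` is reported, because its remote address falls outside all of the configured ranges.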

Figure 7: Network dashboard in Monitoring OpenShift application

Learn More

Follow the installation instructions to get started with Monitoring OpenShift in Splunk. The product has a built-in free 30-day trial license.

 
