Microservices, and the Observability Macroheadache

Moving to a microservice architecture, deployed on a cloud platform such as OpenShift, can have significant benefits. However, it also makes it harder to understand how your business requests are executed across a potentially large number of microservices.

If we wish to locate where problems occurred in the execution of a business request, whether due to performance issues or errors, we may have to examine the metrics and logs of every service that could have been involved. Metrics can give a general indication of where problems have occurred, but they are not specific to individual requests. Logs may contain errors or warnings, but these cannot necessarily be correlated with the individual requests of interest.

Distributed tracing is a technique that has become indispensable in helping users understand how their business transactions execute across a set of collaborating services. A trace instance documents the flow of a business transaction, including interactions between services, internal work units, relevant metadata, latency details and contextualized logging. This information can be used to perform root cause analysis to locate the problem quickly.

How Does OpenShift Service Mesh Help?

The OpenShift Service Mesh simplifies the implementation of services by moving some capabilities, such as circuit breaking and intelligent routing, into the platform. These capabilities include the ability to report tracing data associated with the HTTP interactions between services.

This means that the service does not need to support distributed tracing directly: the sidecar proxy handles sampling decisions, creates the spans (the building blocks of a trace instance) and ensures that consistent metadata is reported.

The only responsibility that cannot be handled by the OpenShift Service Mesh is the propagation of the trace context between inbound and outbound requests within the service itself. This needs to be implemented by the service, either by copying the relevant headers from the inbound request to the outbound request, or by using a suitable library to handle it.
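As a minimal sketch of the header-copying approach, the Python function below picks the trace headers out of an inbound request so they can be attached to outbound calls. The header list is an assumption based on the B3-style headers commonly documented for Istio and Envoy based meshes, and the name `propagation_headers` is hypothetical; check the documentation for your mesh version for the authoritative set.

```python
# Headers commonly forwarded for Istio/Envoy-based tracing. This list is an
# assumption for illustration; confirm it against your mesh documentation.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def propagation_headers(inbound_headers):
    """Return the trace headers from an inbound request that should be
    copied onto any outbound request made while handling it."""
    # HTTP header names are case-insensitive, so normalize before matching.
    lowered = {name.lower(): value for name, value in inbound_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

A request handler would call this once per inbound request and pass the resulting dictionary as extra headers to whatever HTTP client it uses for downstream calls. Headers that are absent from the inbound request are simply omitted rather than invented.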

Jaeger to the Rescue

Instrumenting the service mesh and your business application is only one part of the story. Presenting this data in a way that is easy to consume and understand is the role of a tracing solution. That’s why OpenShift Service Mesh bundles a component called Jaeger, which can be used to collect, store, query and visualize the tracing data.

The Jaeger UI/console allows users to search for trace instances that meet certain criteria, including service name, operation name, tag names/values, a time frame, and a minimum or maximum span duration.

The UI shows a scatter plot of the trace instance durations so that users can focus on performance issues. The list of results also highlights trace instances that represent error situations.

Once a trace instance of interest is selected, the UI shows the individual spans in a Gantt chart style. Each line represents a unit of work, typically called a ‘span’ in the distributed tracing world, color coded by the service it represents, with a length that indicates its duration. This enables a user to focus on the services and operations where most time is spent for the business transaction.
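To make the idea of spans and their durations concrete, here is a small Python sketch of the kind of question the Gantt view answers visually. The `Span` type, its field names and the helper functions are hypothetical and much simpler than Jaeger's real data model; they only illustrate ranking units of work by the time they took.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    service: str      # service that performed the work
    operation: str    # name of the unit of work
    start_us: int     # start time in microseconds
    duration_us: int  # elapsed time in microseconds


def slowest_spans(spans, limit=3):
    """Return the longest spans first, like the widest bars in the chart."""
    return sorted(spans, key=lambda s: s.duration_us, reverse=True)[:limit]


def duration_by_service(spans):
    """Sum span durations per service.

    Nested spans overlap in time, so these totals are only a rough guide
    and can exceed the wall-clock duration of the whole trace.
    """
    totals = defaultdict(int)
    for span in spans:
        totals[span.service] += span.duration_us
    return dict(totals)
```

Because a parent span's duration includes the time spent in its children, the per-service totals are a starting point for investigation rather than an exact cost breakdown.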

When a span is selected, it will be expanded to show further details, including tag names/values and log entries. This can provide additional information that may help diagnose issues.

It is also possible to compare the structure of trace instances against each other, by selecting multiple trace instances on the search page and pressing the “Compare Traces” button.

This feature is useful for narrowing down the search space in traces with a large number of spans. The visualization highlights operations that were added or are missing between the two trace instances.
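The idea behind such a comparison can be sketched in a few lines of Python. This is not how Jaeger implements its comparison view; it is a simplified set difference over (service, operation) pairs, assuming each trace is available as a list of dictionaries with hypothetical `service` and `operation` keys.

```python
def operations(trace):
    """Collect the (service, operation) pairs that appear in a trace."""
    return {(span["service"], span["operation"]) for span in trace}


def compare_traces(trace_a, trace_b):
    """Report operations present in only one of the two traces."""
    ops_a, ops_b = operations(trace_a), operations(trace_b)
    return {
        "only_in_a": sorted(ops_a - ops_b),
        "only_in_b": sorted(ops_b - ops_a),
    }
```

Comparing a successful trace with a failing one in this way can point to a call that was skipped or an unexpected extra call, although it ignores how often each operation occurs and how the spans are nested.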

One Less Headache for your Microservices Journey

While distributed tracing on its own is not the monitoring panacea that DevOps teams require, it is a prerequisite for understanding the root cause of problems that will arise in complex, distributed architectures. When used in conjunction with other observability signals, such as metrics and logging, it can help diagnose problems and provide a more comprehensive view of the health of our business applications.
