This is a guest post co-written by Solarflare, a Xilinx company. Miklos Reiter is Software Development Manager at Solarflare and leads the development of Solarflare’s Cloud Onload Operator. Zvonko Kaiser is Team Lead at Red Hat and leads the development of the Node Feature Discovery operator.

Figure 1: Demo of Onload accelerating Netperf in a Pod

Solarflare, now part of Xilinx, and Red Hat have collaborated to bring Solarflare’s Cloud Onload for Kubernetes to Red Hat’s OpenShift Container Platform. Solarflare’s Cloud Onload accelerates and scales network-intensive applications such as in-memory databases, software load balancers, and web servers. OpenShift Container Platform, the leading enterprise Kubernetes container platform for the hybrid cloud, empowers developers to innovate and ship faster.

The Solarflare Cloud Onload Operator automates the deployment of Cloud Onload for Red Hat OpenShift and Kubernetes. Two distinct use cases are supported:

  1. Acceleration of workloads using Multus and macvlan/ipvlan
  2. Acceleration of workloads over a Calico network

This blog post describes the first use case; a future blog post will focus on the second use case. 

Solarflare’s Cloud Onload Operator provides an integration path with Red Hat OpenShift Container Platform’s Device Plugin framework, which allows OpenShift to allocate and schedule containers according to the availability of specialized hardware resources. The Cloud Onload Operator uses Multus multi-networking support and is compatible with both the immutable Red Hat CoreOS operating system and Red Hat Enterprise Linux. The Node Feature Discovery (NFD) operator is also part of this story: it helps automatically discover compute nodes with high-performance Solarflare network adapters, which Multus then makes available to containers in addition to the usual Kubernetes network interface. OpenShift 4.2 will include the Node Feature Discovery operator.

Below is a network benchmark showing the benefits of Cloud Onload on OpenShift.

Up to 15x Performance Increase

Figure 2: NetPerf request-response performance with Onload versus the kernel

Figure 2 above illustrates the dramatic acceleration in network performance that can be achieved with Cloud Onload. With Cloud Onload, a NetPerf TCP request-response test sustains a significantly higher number of transactions per second than with the native kernel network stack alone.

Moreover, performance scales almost linearly as we scale the number of NetPerf test streams up to the number of CPU cores in each server. In this test, Cloud Onload achieves eight times the kernel transaction rate with one stream, rising to a factor of 15 for 36 concurrent streams.

This test used a pair of machines with Solarflare XtremeScale X2541 100G adapters connected back-to-back (without going via a switch). Each server had two Intel Xeon E5-2697 v4 CPUs running at 2.30 GHz.

Integration with Red Hat OpenShift

Deployment of Onload Drivers

The Cloud Onload Operator automates the deployment of the Onload kernel drivers and userspace libraries in Kubernetes.

For portability across operating systems, the kernel drivers are distributed as a driver container image. The operator ships with versions built against Red Hat Enterprise Linux and Red Hat CoreOS kernels; for non-standard kernels, one can build a custom driver container image. The operator automatically runs the driver container on each Kubernetes node, where it loads the kernel modules.

The driver container also installs the user-space libraries on the host. Using a device plugin, the operator then injects the user-space libraries, together with the necessary Onload device files, into every pod that requires Onload.

The operator significantly simplifies deploying Onload on Kubernetes: there is no need to build Onload into application container images, or to write a custom sidecar injector or other logic to achieve the same effect.

Configuring Multus

OpenShift 4 ships with the Multus multi-networking plugin. Multus enables the creation of multiple network interfaces for Kubernetes pods.

Before we can create accelerated pods, we need to define a Network Attachment Definition (NAD) in the Kubernetes API. This object specifies which of the node's interfaces to use for accelerated traffic, and also how to assign IP addresses to pod interfaces.

The Multus network configuration can vary from node to node, which is useful for assigning static IPs to pods, or when the name of the Solarflare interface to use varies between nodes.

The following steps create a Multus network that can provide a macvlan subinterface for every pod that requests one. The plugin automatically allocates static IPs to configure the subinterface for each pod.

First, we create the NetworkAttachmentDefinition (NAD) object:

cat << EOF | oc apply -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: onload-network
EOF

Then on each node that uses this network, we write a Multus config file specifying the properties of this network:

mkdir -p /etc/cni/multus/net.d

cat << EOF > /etc/cni/multus/net.d/onload-network.conf
{
  "cniVersion": "0.3.0",
  "type": "macvlan",
  "name": "onload-network",
  "master": "sfc0",
  "mode": "bridge",
  "ipam": {
      "type": "host-local",
      "subnet": "172.20.0.0/16",
      "rangeStart": "172.20.10.1",
      "rangeEnd": "172.20.10.253",
      "routes": [
          { "dst": "0.0.0.0/0" }
      ]
  }
}
EOF

Here, master specifies the name of the Solarflare interface on the node, while rangeStart and rangeEnd limit allocation to a slice of the subnet IP range. Because the host-local IPAM plugin assigns addresses independently on each node, each node should be configured with a non-overlapping slice.
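
For example, a second node whose Solarflare interface is named sfc1 (the interface name and address range here are illustrative) could be given its own slice of the same subnet:

cat << EOF > /etc/cni/multus/net.d/onload-network.conf
{
  "cniVersion": "0.3.0",
  "type": "macvlan",
  "name": "onload-network",
  "master": "sfc1",
  "mode": "bridge",
  "ipam": {
      "type": "host-local",
      "subnet": "172.20.0.0/16",
      "rangeStart": "172.20.11.1",
      "rangeEnd": "172.20.11.253",
      "routes": [
          { "dst": "0.0.0.0/0" }
      ]
  }
}
EOF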

An alternative to the macvlan plugin is the ipvlan plugin. The main difference is that ipvlan subinterfaces share the parent interface’s MAC address, which scales better in the L2 switching infrastructure. Cloud Onload 7.0 adds support for accelerating ipvlan subinterfaces in addition to macvlan subinterfaces, and OpenShift 4.2 will add support for ipvlan.
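
As a minimal sketch, an ipvlan version of the per-node config above changes only the plugin type and mode (l2 is the ipvlan plugin’s default mode); the ipam block stays the same:

{
  "cniVersion": "0.3.0",
  "type": "ipvlan",
  "name": "onload-network",
  "master": "sfc0",
  "mode": "l2",
  "ipam": {
      "type": "host-local",
      "subnet": "172.20.0.0/16",
      "rangeStart": "172.20.10.1",
      "rangeEnd": "172.20.10.253",
      "routes": [
          { "dst": "0.0.0.0/0" }
      ]
  }
}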

Node Feature Discovery

A large cluster often runs on servers with different hardware. This means that workloads requiring high-performance networking may need to be explicitly scheduled to nodes with the appropriate hardware specification. In particular, Cloud Onload requires Solarflare XtremeScale X2 network adapters.

To assist with scheduling, we can use Node Feature Discovery (NFD). The NFD operator automatically detects hardware features and advertises them using node labels. We can use these node labels to restrict which nodes the Cloud Onload Operator uses, by setting the Cloud Onload Operator’s nodeSelector property.

NFD will be available in the operator marketplace from OpenShift 4.2 onwards. At the time of writing, NFD is installed manually as follows:

$ git clone https://github.com/openshift/cluster-nfd-operator
$ cd cluster-nfd-operator
$ make deploy

We can check that NFD has started successfully by confirming that all pods in the openshift-nfd namespace are running:

$ oc get pods -n openshift-nfd

At this point, all compute nodes with Solarflare NICs should have a node label indicating the presence of a PCI device with the Solarflare vendor ID (0x1924). We can check this by querying for nodes with the relevant label:

$ oc get nodes -l feature.node.kubernetes.io/pci-1924.present=true
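
To list every feature label that NFD has applied to a particular node, we can run, for example:

$ oc describe node <node_name> | grep feature.node.kubernetes.io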

We can now use this node label in the Cloud Onload Operator’s nodeSelector to restrict the nodes used with Onload. For maximum flexibility, we can, of course, use any node labels configured in the cluster.
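
For example, a purely illustrative custom label could be applied by hand and then referenced from the nodeSelector:

$ oc label node <node_name> example.com/onload=enabled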

Cloud Onload Installation

Installation requires downloading a zip file containing a number of YAML manifests from the Solarflare support website https://support.solarflare.com. Following the installation instructions in the README.txt contained in the zip file, we edit the example custom resource to specify the kernel version of the cluster worker nodes we are running:

kernelVersion: "4.18.0-80.1.2.el8_0.x86_64"

and the node selector:

nodeSelector:
  beta.kubernetes.io/arch: amd64
  node-role.kubernetes.io/worker: ''
  feature.node.kubernetes.io/pci-1924.present: 'true'

We then apply the manifests:

$ for yaml_spec in manifests/*; do oc apply -f "$yaml_spec"; done

We expect to list the Solarflare Cloud Onload Operator on https://operatorhub.io soon, for installation using the Operator Lifecycle Manager and OpenShift’s built-in Operator Hub support.

Running the NetPerf Benchmark

We are now ready to create pods that can run Onload.

Netperf Test Image

We now build a container image that includes the netperf performance benchmark tool, using a Fedora base image. Most common distributions use glibc and are compatible with Onload; this excludes extremely lightweight images such as Alpine Linux, which uses musl instead.

The following Dockerfile produces the required image.

netperf.Dockerfile:
FROM fedora:29
RUN dnf -y install gcc make net-tools httpd iproute iputils procps-ng kmod which
ADD https://github.com/HewlettPackard/netperf/archive/netperf-2.7.0.tar.gz /root/
# Extract, configure, and install netperf in a single layer
RUN tar -xzf /root/netperf-2.7.0.tar.gz && \
    cd netperf-netperf-2.7.0 && \
    ./configure --prefix=/usr && \
    make install
CMD ["/bin/bash"]

We build the image:

$ docker build -t netperf -f netperf.Dockerfile .
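
The DaemonSet in the next section pulls this image from a registry, so we also tag and push it (substituting the registry hostname for {{ docker_registry }}):

$ docker tag netperf {{ docker_registry }}/netperf:latest
$ docker push {{ docker_registry }}/netperf:latest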

Example Onloaded NetPerf DaemonSet

This is an example daemonset that runs netperf test pods on all nodes that have Solarflare interfaces.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: netperf
spec:
  selector:
    matchLabels:
      name: netperf
  template:
    metadata:
      labels:
        name: netperf
      annotations:
        k8s.v1.cni.cncf.io/networks: onload-network
    spec:
      nodeSelector:
        node-role.kubernetes.io/worker: ''
      containers:
        - name: netperf
          image: {{ docker_registry }}/netperf:latest
          stdin: true
          tty: true
          resources:
            limits:
              solarflare.com/sfc: 1

Here {{ docker_registry }} is the registry hostname (and :port if required).

The important sections are:

  1. The annotations section under spec/template/metadata specifies which Multus network to use. With this annotation, Multus will provision a macvlan interface for the pods.
  2. The resources section under containers requests Onload acceleration from the Cloud Onload Operator.

Running Onload inside accelerated pods with OpenShift/Multus

Each netperf test pod we have created has two network interfaces:

eth0: the default OpenShift interface

net1: the Solarflare macvlan interface to be used with Onload

Any traffic between the net1 interfaces of two pods can be accelerated using Onload by either:

  1. Prefixing the command with "onload"
  2. Running with the environment variable LD_PRELOAD=libonload.so
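
For example, the following two commands are equivalent ways of accelerating the netperf client used in the test below:

$ onload netperf -p 4444 -H 172.20.0.16 -t TCP_RR
$ LD_PRELOAD=libonload.so netperf -p 4444 -H 172.20.0.16 -t TCP_RR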

Note: One caveat to the above is that two accelerated pods can only communicate using Onload if they are running on different nodes. (Onload bypasses the kernel's macvlan driver to send traffic directly to the NIC, so traffic directed at another pod on the same node will not arrive.)
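
To confirm that two test pods are scheduled on different nodes, the wide output includes each pod's node assignment:

$ kubectl get pods -o wide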

To run a simple netperf latency test we open a shell on each of two pods by running:

$ kubectl get pods
$ kubectl exec -it <pod_name> -- bash

On pod 1:

$ ifconfig net1
net1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
      inet 172.20.0.16  netmask 255.255.0.0  broadcast 0.0.0.0
[...]
bash-4.4$ onload --profile=latency netserver -p 4444
oo:netserver[107]: Using OpenOnload 201811 Copyright 2006-2018 Solarflare Communications, 2002-2005 Level 5 Networks [4]
Starting netserver with host 'IN(6)ADDR_ANY' port '4444' and family AF_UNSPEC

On pod 2:

$ onload --profile=latency netperf -p 4444 -H 172.20.0.16 -t TCP_RR

Running multiple NetPerf pairs concurrently produced the results shown in Figure 2 above.
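
As a rough sketch, such a sweep can be scripted from pod 2, assuming a matching netserver instance is listening on each port on pod 1 (the port base and stream count here are illustrative):

# Launch N concurrent request-response streams, one per port.
N=8
for i in $(seq 1 $N); do
  onload --profile=latency netperf -p $((4443 + i)) -H 172.20.0.16 -t TCP_RR &
done
wait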

Obtaining Cloud Onload

Visit https://solarflare.com/cloud-onload/ to learn more about Cloud Onload or make a sales inquiry. An evaluation of Solarflare’s Cloud Onload Operator for Kubernetes and OpenShift can be arranged on request.