Connecting Multiple OpenShift SDNs with a Network Tunnel

Introduction

Istio, the upstream project for Red Hat OpenShift Service Mesh, has an interesting feature that allows you to extend the service mesh across multiple OpenShift clusters.

The main requirement to implement this feature is that the IPs of the pods of the clusters that comprise the service mesh are all routable between each other. That is to say that a pod in cluster A should be able to communicate with a pod in cluster B, assuming the pod in cluster A knows the IP address of the pod in cluster B.

This article shows a way to implement this requirement, though this is not yet officially supported by Red Hat.

Potentially this capability will enable more use cases besides Istio Multicluster.

Assumptions

For the design that follows we are going to assume that the nodes of the clusters that we want to connect are not directly routable from each other. Instead we will assume that it is possible to create public VIPs that are reachable by the nodes, and that those VIPs can load balance connections to the nodes and services of the OpenShift cluster. Potentially these clusters could exist in different clouds.

This setup should be realistic for both cloud and on premises deployments of OpenShift.

The automation that sets up the SDN connectivity assumes that you can create LoadBalancer type services. If you can’t, you can still build the described design, but you’ll have to develop different automation.

If your clusters do not support LoadBalancer services, it is possible to use NodePort services. In this case, though, the nodes must be routable, and therefore mechanisms better than the one proposed here may be available to build the network tunnel.

Finally, the pod CIDRs of all the clusters that we are trying to connect must not overlap.
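The non-overlap requirement is easy to check before building anything. The following is a minimal sketch using Python's standard ipaddress module; the cluster names and CIDRs are made-up examples, not values taken from a real installation.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(cluster_cidrs):
    """Return the pairs of cluster names whose pod CIDRs overlap."""
    networks = {
        name: ipaddress.ip_network(cidr) for name, cidr in cluster_cidrs.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


# Hypothetical pod CIDRs for two clusters (assumed values).
clusters = {"cluster-a": "10.128.0.0/14", "cluster-b": "10.132.0.0/14"}
conflicts = overlapping_cidrs(clusters)
print(conflicts if conflicts else "no overlapping pod CIDRs")
```

An empty result means the clusters can be connected as far as addressing is concerned.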

Tunnel Design

For the rest of the document we will assume that we want to connect two clusters for simplicity, but the design can be generalized to multiple clusters.

In order to connect the different SDNs we have to create a network tunnel between them.

Because we can only use VIPs (a pair of IP and port) to connect from one cluster to the other, the transport of the tunnel has to be a layer 4 protocol. So we need a layer 3 over layer 4 tunnel (remember that the tunnel needs to transport IP packets, which are layer 3).

UDP, with its connectionless property, lends itself to this job.

The tool that we can use in Linux to do this job is a tunnel device. A tunnel device is a network device that has one end attached to the network stack and the other end managed by some software (it can be a kernel module or a user-space application). It is a way to let an application manipulate network packets directly instead of through the abstraction of the socket API. In our case the application will encapsulate these packets and forward them over a different type of transport to the other cluster.

The Linux kernel provides two types of tunnel devices: tap devices for layer 2 encapsulation and tun devices for layer 3 encapsulation. In our case, we are going to use a tun device.
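To make the idea of a tun device more concrete, here is a minimal sketch of how a user-space program can create one from Python. It is an illustration only, not the code used by the automation described later. The constants come from the Linux header linux/if_tun.h, the device name tun0 is an assumption, and open_tun needs root (or CAP_NET_ADMIN) to succeed.

```python
import fcntl
import os
import struct

# Constants from <linux/if_tun.h>.
TUNSETIFF = 0x400454CA
IFF_TUN = 0x0001    # layer 3 device (IP packets); IFF_TAP (0x0002) would be layer 2
IFF_NO_PI = 0x1000  # do not prepend the extra packet-information header


def tun_ifreq(name):
    """Build the 40-byte ifreq structure passed to the TUNSETIFF ioctl."""
    return struct.pack("16sH22x", name.encode(), IFF_TUN | IFF_NO_PI)


def open_tun(name="tun0"):
    """Create (or attach to) a tun device and return its file descriptor.

    Requires root or CAP_NET_ADMIN. The device name is a hypothetical example.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(fd, TUNSETIFF, tun_ifreq(name))
    return fd
```

Once the device exists, os.read on the returned descriptor yields the raw IP packets that the kernel routes to the device, and os.write injects packets back into the network stack.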

With this in mind, here is a possible design for our tunnel:

A daemonset pod will create the tunnel and ensure that packets directed to the other cluster are correctly routed. The flow works as follows:

SETUP PHASE:

  1. A tun device is created by the tunnel daemon set.
  2. The tun is wired to the bridge so that IP packets destined to the CIDR of the other cluster are routed to the tunnel.

TRANSMIT PHASE:

  1. A pod puts a packet with an IP belonging to the CIDR of the other cluster as the destination in the bridge.
  2. The flow rules send the packet to the tunnel.
  3. The tunnel daemonset process manages the wired side of the tunnel and sends the UDP encapsulated packet to the VIP of the other cluster.

RECEIVE PHASE:

  1. A UDP encapsulated packet is received by the VIP and load balanced to one of the tunnel daemon set processes.
  2. The tunnel daemon set process extracts the packet from the UDP envelope and puts it in the tun device.
  3. The packet ends up in the bridge.
  4. The bridge examines the destination. If it is local to the node, the packet is delivered immediately. Otherwise it is forwarded to the right node.
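The transmit and receive phases above can be sketched in a few lines of Python. This is a toy model under stated assumptions: a UDP socket on the loopback interface stands in for the remote VIP, the inner packet is a hand-built IPv4 header with its checksum left at zero, and all function names and addresses are hypothetical.

```python
import ipaddress
import socket
import struct


def build_ipv4_packet(src, dst, payload):
    """Build a minimal IPv4 packet (header checksum left at 0 for brevity)."""
    header = struct.pack(
        "!BBHHHBBH4s4s",
        (4 << 4) | 5,        # version 4, header length of 5 32-bit words
        0,                   # DSCP/ECN
        20 + len(payload),   # total length
        0, 0,                # identification, flags/fragment offset
        64,                  # TTL
        socket.IPPROTO_UDP,  # inner protocol (arbitrary for this sketch)
        0,                   # header checksum (not computed here)
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )
    return header + payload


def inner_destination(packet):
    """Read the destination address from the inner IPv4 header."""
    return ipaddress.ip_address(packet[16:20])


def relay_over_udp(packet):
    """Send the IP packet as a UDP payload to a local stand-in for the VIP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as vip, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        vip.bind(("127.0.0.1", 0))
        vip.settimeout(2)
        sender.sendto(packet, vip.getsockname())
        received, _ = vip.recvfrom(65535)
    return received


def next_hop(packet, local_node_subnet):
    """Receive-side decision: deliver on this node or forward to another."""
    dst = inner_destination(packet)
    return "deliver" if dst in ipaddress.ip_network(local_node_subnet) else "forward"


packet = build_ipv4_packet("10.128.2.5", "10.132.4.7", b"hello")
received = relay_over_udp(packet)
print(inner_destination(received), next_hop(received, "10.132.4.0/23"))
```

In the real design the tun device supplies and consumes the inner packets, and the bridge, not the daemon, makes the deliver-or-forward decision.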

Note: a sequence of IP packets belonging to the same TCP stream does not necessarily follow the same path because the load balancer may choose a different daemon set pod for each element of the sequence. Also, return packets do not necessarily follow the same path as the original packets. This is all fine with IP packet routing.

This design creates a many-to-many mesh of connections with no single point of failure. It also better leverages the bandwidth that we have at our disposal relative to a solution with a single jump node that establishes the tunnel with the other cluster. We have no control over the latency of the communication, which mainly depends on what is between the two clusters.

It turns out that the Linux kernel has the capability of creating IP over UDP tunnels via the fou module.

Fou stands for Foo over UDP and implements a couple of generic encapsulation mechanisms over UDP. We are interested in the IP over UDP option.

When I started implementing with fou I quickly realized that the fou module is not available in RHEL.

It is possible to build a user-space surrogate of the IP over UDP tunnel using socat. This article explains how. This approach also didn’t work for me. Contributions to make it work are welcome.

I didn’t spend much time on the solution based on socat because the more I thought about this problem and talked about it with my colleagues, the more I realized that I needed to build an encrypted tunnel.

Encrypted Tunnel Design

I believe that applications should be responsible for encrypting their own data if they need confidentiality in their communications. So that’s not the reason why we need an encrypted tunnel. The reason is that having a public VIP to which it’s possible to send packets that will then be routed inside the OpenShift SDN, creates an architecture that is vulnerable to denial of service attacks.

We need a way to authenticate legitimate packets and drop the others as soon as they are received by the daemon set. In theory we just need to authenticate the packet with a cryptographic signature, such as, for example, the one provided by the Authentication Header (AH) protocol. In practice most VPN software will not implement this feature in isolation and will always also add encryption.
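To illustrate what "authenticate and drop" means, here is a minimal sketch that appends an HMAC-SHA256 tag to each datagram and discards anything whose tag does not verify. It is not how AH or any VPN product works internally: the pre-shared key, the tag layout and the function names are assumptions made for the example, and the sketch offers no replay protection.

```python
import hashlib
import hmac

TAG_LEN = 32  # size of a SHA-256 digest in bytes


def seal(key, packet):
    """Append an authentication tag to the packet before sending it."""
    tag = hmac.new(key, packet, hashlib.sha256).digest()
    return packet + tag


def open_or_drop(key, datagram):
    """Return the inner packet if the tag verifies, otherwise None (drop)."""
    if len(datagram) < TAG_LEN:
        return None
    packet, tag = datagram[:-TAG_LEN], datagram[-TAG_LEN:]
    expected = hmac.new(key, packet, hashlib.sha256).digest()
    # compare_digest runs in constant time to avoid leaking tag bytes.
    return packet if hmac.compare_digest(tag, expected) else None


key = b"k" * 32  # hypothetical pre-shared key
print(open_or_drop(key, seal(key, b"legitimate packet")))
print(open_or_drop(key, b"spurious packet" + b"\x00" * TAG_LEN))
```

The receiver performs one cheap check per datagram and never hands an unauthenticated packet to the SDN.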

Encryption necessarily introduces statefulness to the communication, even if it occurs over UDP. Therefore we need to adjust the tunnel design as shown in the picture below:

Now each daemonset pod has a dedicated VIP that load balances only to it.

The flow works as follows:

SETUP PHASE:

  1. A tun device is created by the tunnel daemon set.
  2. The tun is wired to the bridge so that IP packets destined to the CIDR of the other cluster are routed to the tunnel.

TRANSMIT PHASE:

  1. A pod puts a packet with an IP belonging to the CIDR of the other cluster as the destination in the bridge.
  2. The flow rules send the packet to the tunnel.
  3. The tunnel daemonset process manages the wired side of the tunnel and sends the UDP-encapsulated and encrypted packet to the VIP connected to the correct node of the other cluster.

RECEIVE PHASE:

  1. A UDP-encapsulated and encrypted packet is received by the VIP and sent to the corresponding tunnel daemonset process.
  2. The tunnel daemonset process extracts and decrypts the packet from the UDP envelope and puts it in the tun device.
  3. The packet ends up in the bridge.
  4. The bridge examines the destination, which will be local to the node, and delivers the packet immediately.

We now have one VIP per node. Each VIP will forward the traffic only to the daemonset pod deployed on that node.

Notwithstanding this change, our tunnel still retains the properties of not having a single point of failure and of fully leveraging the bandwidth at our disposal.

With this design, the tunnel always delivers the packet to the node that houses the destination pod, so the SDN never has to route the packets across nodes, saving a hop.

Not being a VPN expert, I started looking for VPN products that could be used to implement this VPN mesh. OpenVPN and IPsec were the obvious ones to start with. I found both of them to be not easy to configure in a mesh fashion, not very container friendly, and sometimes they made too many assumptions about the way packets should be routed.

Eventually I found WireGuard. This modern VPN seems to have everything that was needed. It is easy to configure for a mesh deployment, it is container friendly, and it just creates a tunnel, leaving to the user the responsibility to configure routing for it.
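As an illustration of why a mesh is easy to express, here is a minimal sketch that renders a WireGuard configuration with one peer per node of the other cluster. The key placeholders, VIP addresses, port and per-node pod subnets are all made-up values, and the helper is hypothetical; it is not taken from the automation described later. Setting each peer's AllowedIPs to the pod subnet of its node is what lets WireGuard pick the VIP of the node that hosts the destination pod.

```python
def render_wg_config(private_key, listen_port, peers):
    """Render a WireGuard configuration with one [Peer] section per remote node."""
    lines = [
        "[Interface]",
        f"PrivateKey = {private_key}",
        f"ListenPort = {listen_port}",
    ]
    for peer in peers:
        lines += [
            "",
            "[Peer]",
            f"PublicKey = {peer['public_key']}",
            f"Endpoint = {peer['vip']}",           # VIP dedicated to that node
            f"AllowedIPs = {peer['pod_subnet']}",  # pods hosted on that node
        ]
    return "\n".join(lines) + "\n"


# Hypothetical nodes of the other cluster (placeholder keys and addresses).
remote_nodes = [
    {"public_key": "<node1-public-key>", "vip": "203.0.113.10:5555",
     "pod_subnet": "10.132.0.0/23"},
    {"public_key": "<node2-public-key>", "vip": "203.0.113.11:5555",
     "pod_subnet": "10.132.2.0/23"},
]
print(render_wg_config("<local-private-key>", 5555, remote_nodes))
```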

Note: when introducing a new security product one should make sure that all the needed security standards are met. I am not claiming that WireGuard is secure. You should make up your own mind on that. This article may be a good start, followed by this deeper and more technical description.

Routing Packets

We said a few times that the tunnel device was wired to the SDN bridge.

In OpenShift the SDN bridge is a logical bridge implemented with Open vSwitch.

This bridge is programmed to create the logical abstraction of the OpenShift SDN. Out of the box it is not configured to manage routing to other SDNs, so we will have to add this logic.

In the OpenShift SDN bridge, routed packets go through a set of rules (not unlike what happens with iptables) and eventually they are either delivered to one of the ports attached to the bridge or they are dropped.

The bridge contains several tables of rules in which different sets of actions are performed. At a high level, the first set of tables verifies that the packet is legitimate and drops spurious packets. The second set of rules (from table 30 up) analyses the destination of the packet and makes the correct routing decision.

Pods are attached to the bridge by means of virtual ethernet interface (veth) pairs. Veths are a Linux kernel mechanism to let packets traverse network namespaces. Whatever enters one end of the pair exits on the other end, and one can put each end of the veth pair in a different namespace. This setup is prepared for us by the OpenShift SDN when a pod is created.

We can add a few rules to the bridge to make sure that the traffic to the other cluster is channeled through the pod’s veth device. Then we can create the WireGuard tunnel in the pod and route the traffic coming from the veth device to the tunnel.

Here is the design:

In our MVP implementation we are going to be less meticulous in terms of packet validation and add two simple routing rules to the bridge that can be paraphrased as follows:

  1. If you see a packet destined to the CIDR of the other cluster, route it to the daemonset pod.
  2. If you see a packet coming from the daemonset pod whose source IP falls into the CIDR of the other cluster, skip all the verification steps and make the routing decision for this packet.
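The two rules above could be expressed as Open vSwitch flows along the following lines. This is a sketch, not the flows installed by the automation: the table number 0 for the entry point, the priority 300, the OpenFlow port number of the daemonset pod's veth and the remote CIDR are all assumptions for illustration. Only the use of table 30 as the start of the routing tables comes from the description above.

```python
def tunnel_flows(remote_cidr, tunnel_port, first_routing_table=30):
    """Build the two flow specifications that wire the tunnel pod to the bridge.

    remote_cidr: pod CIDR of the other cluster.
    tunnel_port: OpenFlow port of the daemonset pod's veth (assumed known).
    """
    # Rule 1: traffic destined to the other cluster goes to the daemonset pod.
    to_tunnel = (
        f"table=0,priority=300,ip,nw_dst={remote_cidr},"
        f"actions=output:{tunnel_port}"
    )
    # Rule 2: traffic arriving from the daemonset pod with a source address in
    # the other cluster skips validation and jumps to the routing tables.
    from_tunnel = (
        f"table=0,priority=300,ip,in_port={tunnel_port},nw_src={remote_cidr},"
        f"actions=goto_table:{first_routing_table}"
    )
    return [to_tunnel, from_tunnel]


for flow in tunnel_flows("10.132.0.0/14", 7):
    print(flow)
```

Each string could then be installed with something like ovs-ofctl -O OpenFlow13 add-flow br0 "<flow>", assuming the SDN bridge is named br0.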

In terms of routing rules inside the pod, we just need to add a rule that sends all the packets whose destination is the other cluster to the WireGuard tunnel.

Once the packet is in the WireGuard tunnel, it will follow the flow described in the previous sections, and when it lands in the daemonset pod it will be routed to the bridge and then to its destination pod.

Installation

I created an Ansible automation to connect the SDNs of multiple clusters through WireGuard as described above. You can find the Ansible playbook here, together with the instructions on how to use it.

A word of warning:

WireGuard installs an uncertified kernel module. This taints the kernel. You should proceed with caution when making that decision (see this knowledge base article on tainted kernels).

Also, the Open vSwitch bridge and its internal routing rules are not part of the public OpenShift API. They are an implementation detail that can change from one release to another. This means that a design that works today may stop working in future releases of OpenShift (at the time of writing this article, this implementation was tested with OCP 3.9 and 3.10).

Finally, this solution is currently not officially supported by Red Hat: at this stage it is just my individual effort, so the quality as we said is at MVP level and support will be best effort.

Note: you need a real OpenShift SDN to make this tunnel work, so clusters such as Minishift that do not install the SDN components of OpenShift will not work.

Conclusions

As we saw in this article, we had to accept a couple of compromises to be able to create this solution. The fou module is not yet available in the RHEL kernel and the incumbent VPNs are not really suitable for building a VPN mesh. I think this indicates that the technology still needs to mature in this space and over time it will become much easier to implement these types of designs.

Also, it’s worth repeating that we assumed that there was no direct connectivity between the nodes of the clusters involved in this design. If you don’t have that constraint, other potentially simpler designs become possible (for example, IP-over-IP tunneling).

My hope is that with this article, we can start a conversation on how to build these types of topologies.

Finally, I’d like to thank all the colleagues who have spent time with me explaining networking and helping me fix issues. Thanks to: Vadim Zharov, Brent Roscos, Matthew Witzenman, Øystein Bedin, Julio Villarreal Pelegrino, Qi Jun Ding, Clark Hale.
