Authors: Davis Phillips, Annette Clewett
Kubernetes Federation's objective is to provide a control plane for managing multiple Kubernetes clusters. Unfortunately, Federation is still considered an alpha project with no timeline for a General Availability release. As a stopgap for Federation services, a couple of different solutions are available for dispersing cluster endpoints: a single cluster stretched across multiple datacenters, or multiple clusters deployed across datacenters.
Kubernetes recommends that all VMs be isolated to a single datacenter: “when the Kubernetes developers are designing the system (e.g. making assumptions about latency, bandwidth, or correlated failures) they are assuming all the machines are in a single data center, or are otherwise closely connected.” Therefore, stretching an OpenShift Container Platform cluster across multiple datacenters is not recommended. However, if you need to have a disaster recovery plan today, this article will detail a potential solution.
Some design considerations that are important to keep in mind concerning stretching a cluster across multiple datacenters:
- Network latency & bandwidth
- Providing a storage solution across multiple data centers
- Disaster recovery for etcd and GlusterFS storage
- Application placement
As bulleted above, storage will need to be replicated to any number of datacenters. While some storage providers have replication services to solve this, these are usually asynchronous and do not provide read/write (RW) access at all locations. In this design, Red Hat OpenShift Container Storage (RHOCS) in converged mode was chosen for its robust replication capabilities and tight integration with OpenShift.
RHOCS was formerly called Container-Native Storage and utilizes Red Hat GlusterFS. GlusterFS is an open-source software-based network-attached filesystem that deploys on commodity hardware.
Additionally, quorum is the toughest obstacle in the design of an OpenShift stretch cluster. etcd and RHOCS use a 2n+1 scheme for cluster members, and a cluster of n members can tolerate (n-1)/2 permanent failures; a 3-node cluster can only tolerate the loss of a single node. Ideally, this design consideration would be solved via a three-datacenter design.
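The quorum arithmetic above is easy to check. A minimal shell sketch (the member counts are illustrative):

```shell
# A cluster of n members tolerates (n - 1) / 2 permanent failures
# (integer division), which is why odd member counts are used.
for members in 1 3 5 7; do
  echo "$members members tolerate $(( (members - 1) / 2 )) failure(s)"
done
```

For the 3-node etcd and GlusterFS clusters in this design, the loop confirms that only a single node may be lost before quorum fails.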
GlusterFS requires a network latency of 5 milliseconds RTT or lower. Unfortunately, most customers do not have three datacenters with the required connectivity. This required latency, less than 5 milliseconds RTT, can be found using a metropolitan area network (MAN) interconnecting the cluster in a geographic area or region. A stretched OpenShift cluster with latency higher than 5 milliseconds RTT between nodes was found to fail health checks for persistent applications.
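Before committing to a stretch design, the inter-site RTT should be measured (e.g. with `ping` between nodes in each datacenter) and compared against the 5 ms budget. A minimal sketch of that comparison; the 4.2 ms value is a placeholder, not a real measurement:

```shell
# Hypothetical measured round-trip time between datacenters, in milliseconds
# (e.g. the avg value reported by `ping -c 20 <remote-node>`).
rtt_ms=4.2

# GlusterFS needs 5 ms RTT or lower between all nodes.
if awk -v rtt="$rtt_ms" 'BEGIN { exit !(rtt <= 5) }'; then
  echo "RTT ${rtt_ms} ms: within the GlusterFS latency budget"
else
  echo "RTT ${rtt_ms} ms: too high for a stretch cluster"
fi
```

awk is used for the comparison because shell arithmetic is integer-only and the threshold check needs to handle fractional milliseconds.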
OpenShift Stretch Cluster Configuration
This test environment consisted of two vSphere clusters (6.x) deployed on a single VLAN across two blade enclosures to simulate multiple datacenters. The underlying storage utilized two iSCSI LUNs, one assigned to each vSphere cluster for persistent storage.
VMware vSphere, a commonly used on-premises platform, was used for deployment. Other IaaS platforms could be utilized provided the requirements above are all met for compute, storage, and network.
The RHEL virtual machines were created and OpenShift 3.9 was installed using the relevant Ansible playbooks. The basic premise is that all masters run etcd. Two masters are located in DC1 and one in DC2. Additionally, the OpenShift nodes for storage (GlusterFS PODs) are positioned in a similar fashion, two in DC1 and one in DC2. The layout is shown in Figure 2.
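This node placement might look like the following in an OpenShift 3.9 Ansible inventory. The hostnames, group layout, and device paths are illustrative assumptions, not the exact inventory used in the test environment:

```ini
# Illustrative placement: two masters/storage nodes in DC1, one in DC2
[masters]
master1-dc1.example.com
master2-dc1.example.com
master3-dc2.example.com

[etcd]
master1-dc1.example.com
master2-dc1.example.com
master3-dc2.example.com

[glusterfs]
node1-dc1.example.com glusterfs_devices='["/dev/sdb"]'
node2-dc1.example.com glusterfs_devices='["/dev/sdb"]'
node3-dc2.example.com glusterfs_devices='["/dev/sdb"]'
```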
In an attempt to simulate geographically dispersed DNS, two A records were created to represent endpoints destined for DC1 or DC2. Each HAproxy node owns one of these virtual IP addresses.
Keepalived monitors the HAproxy nodes and handles failover between the datacenters. The configuration resembles an active/active cluster configuration. Lastly, the HAproxy servers prefer to serve from the service pools in their home datacenter.
$ dig +noall +answer stretch-haproxy.example.com
stretch-haproxy.example.com. 1800 IN A 10.x.y.20
stretch-haproxy.example.com. 1800 IN A 10.x.y.21
Keepalived VIP Data center 1 – 10.x.y.20
Keepalived VIP Data center 2 – 10.x.y.21
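A minimal keepalived.conf sketch for the DC1 VIP is shown below; the interface name, priorities, and password are assumptions, and the DC2 node would mirror this configuration with the roles reversed for its own VIP:

```
vrrp_instance haproxy_dc1 {
    state MASTER            # the DC2 peer would be BACKUP for this VIP
    interface eth0          # assumed interface name
    virtual_router_id 51
    priority 100            # peer uses a lower priority, e.g. 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme  # placeholder
    }
    virtual_ipaddress {
        10.x.y.20
    }
}
```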
Figure 3 below highlights the HAproxy configuration used for the stretch cluster:
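In case the figure is unavailable, a minimal haproxy.cfg sketch of the idea follows; the backend hostnames and ports are assumptions. The local datacenter's routers are preferred, with the remote routers marked as backups so they only serve traffic if the local pool is down:

```
frontend openshift_router
    bind 10.x.y.20:443
    mode tcp
    default_backend routers_dc1_preferred

backend routers_dc1_preferred
    mode tcp
    balance source
    # Local DC1 routers are preferred; DC2 only serves if DC1 is down
    server infra1-dc1 infra1-dc1.example.com:443 check
    server infra2-dc1 infra2-dc1.example.com:443 check
    server infra1-dc2 infra1-dc2.example.com:443 check backup
```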
Disaster Recovery Scenarios
As you can see in Figure 2, the test environment was an OpenShift deployment stretched across Datacenter 1 (DC1) and Datacenter 2 (DC2). The network between DC1 and DC2 in this case was a MAN, so there were no bandwidth restrictions and latency was approximately 5 milliseconds RTT.
DC1 had two of the three OpenShift Container Platform (OCP) nodes with GlusterFS PODs and DC2 had the third. The goal was to recover after an outage of DC1. The persistent storage needs to continue to work for read/write (RW) in a scenario where neither GlusterFS nor etcd has quorum.
After reviewing several different failure scenarios based on conversations with users, the goal was to account for recovery operations that include infrastructure, applications, and the OpenShift control plane. The following section describes some of these failure scenarios in more detail.
Datacenter 2 Failure: Quorum Remains
In the event of failure for DC2, DC1 remains online and fully operational. However, there is a single node failure in both GlusterFS and etcd. Any workloads running in DC2 will be rescheduled onto nodes in DC1.
The etcd cluster remains RW so the OpenShift cluster is responsive. The MySQL POD will stay online (provided the POD is running in DC1) and able to RW data to the mounted GlusterFS volume. The MySQL POD will be rescheduled in DC1 eventually and will mount the same GlusterFS volume.
For the storage, there are three OpenShift nodes hosting GlusterFS PODs and one of the PODs is offline. Therefore all of the GlusterFS volumes are in a degraded state but still accepting RW operations. Once DC2 is returned to an operational state, the third replica of every volume, at DC2, will be healed with any data that changed during the outage.
Datacenter 1 Failure: Quorum Failure
In the event of failure for DC1, DC2 remains responsive. However, there is now quorum failure for both GlusterFS and etcd. Both services must be recovered for any persistent PODs to remain operational for RW workloads. Also, any workloads lost in DC1 will not be rescheduled in DC2 until etcd is recovered. At this time PODs using ephemeral storage will become operational but any PODs using GlusterFS persistent storage will be in an Error state until the storage is recovered as well.
Datacenter 1 Failure: Recovering etcd
First, etcd is offline and the cluster API is no longer processing requests, so etcd must be recovered before anything else. The following commands create a RW single-node etcd cluster from the remaining etcd member:
# systemctl stop etcd.service
Add the following line to the etcd.conf file.
# echo "ETCD_FORCE_NEW_CLUSTER=true" >> /etc/etcd/etcd.conf
Restart the services.
# systemctl restart etcd.service
Remove the line that was added above.
# sed -i '/^ETCD_FORCE_NEW_CLUSTER=/d' /etc/etcd/etcd.conf
The cluster should now be operational and oc commands should properly respond.
# oc get nodes
Datacenter 1 Failure: Recovering GlusterFS
Secondly, GlusterFS volumes must be recovered for workloads with persistent storage. With quorum lost, all volumes will be in a Read Only (RO) state. The following options must be set on every volume in the GlusterFS cluster for RW functionality to be restored:
# oc get pod -n app-storage | grep glusterfs
glusterfs-storage-cnvpj   1/1   Running   0   1d
# oc rsh -n app-storage glusterfs-storage-cnvpj
sh-4.2# gluster vol list
glusterfs-registry-volume
heketidbstorage
vol_c1381534e3779fe2fad1e52b0f1ad148
...
sh-4.2# gluster vol set vol_c1381534e3779fe2fad1e52b0f1ad148 cluster.quorum-type fixed
sh-4.2# gluster vol set vol_c1381534e3779fe2fad1e52b0f1ad148 cluster.quorum-count 1
Alternatively, all of the existing GlusterFS volumes can be recovered with the following loop:
sh-4.2# for vol in $(gluster vol list); do gluster vol set $vol cluster.quorum-type fixed && gluster vol set $vol cluster.quorum-count 1; done
Now, all workloads originally in DC1 will become operational in DC2 with RW access to GlusterFS persistent storage. All workloads running in DC2 should also now be online and functional.
Datacenter 1 Failback: etcd and GlusterFS
After DC1 is fully operational again, etcd and GlusterFS must be restored to their original state before the failure of DC1.
Add back recovered etcd nodes with etcdctl.
# etcdctl -C https://192.168.1.2:2379 --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt member add master2.example.com https://192.168.1.2:2380 2>/dev/null | grep '^ETCD_INITIAL_CLUSTER='
Add the name back to the etcd.conf file:
# sed -i '/^ETCD_NAME=/d' /etc/etcd/etcd.conf
# echo "ETCD_NAME=master2" >> /etc/etcd/etcd.conf
The returned value from the above command should be added to the etcd.conf file for that node:
# sed -i '/^ETCD_INITIAL_CLUSTER=/d' /etc/etcd/etcd.conf
# echo "ETCD_INITIAL_CLUSTER=$OUTPUT" >> /etc/etcd/etcd.conf
Lastly, tell the nodes that they are joining an existing cluster:
# sed -i '/^ETCD_INITIAL_CLUSTER_STATE=/d' /etc/etcd/etcd.conf
# echo "ETCD_INITIAL_CLUSTER_STATE=existing" >> /etc/etcd/etcd.conf
# systemctl restart etcd
Next, now that the two GlusterFS PODs are back online, the GlusterFS volume settings need to be set back to their defaults as shown below. This may require adding the glusterfs=storage-host (or glusterfs=registry-host) label back to the OCP nodes that were offline in DC1.
cluster.quorum-type=auto
cluster.quorum-count=3
sh-4.2# for vol in $(gluster vol list); do gluster vol set $vol cluster.quorum-type auto && gluster vol set $vol cluster.quorum-count 3; done
After DC1 is returned to an operational state, all GlusterFS volume changes will be copied to the DC1 volume replicas thereby healing the cluster.
Thoughts and Conclusions:
As outlined in this article, the Metro Stretch Cluster solution does provide disaster recovery. Given your organization's requirements for Recovery Time Objective (RTO) and Recovery Point Objective (RPO), extensive testing and validation need to be done before implementation.
Also, not all use cases are a good fit for this scenario. The manual recovery process involved with both etcd and RHOCS adds operational overhead at a time of crisis. Another concern for operations is application performance: if the cluster is heavily utilized with limited capacity and headroom, the increased workload in the surviving datacenter (after the failover and recovery) could cause unacceptable service levels and unhappy customers.
This post explains the pros and cons of stretching an OpenShift cluster across datacenters. Kubernetes Federation is around the corner, but the design described above can serve as an acceptable stopgap until its release. The information in this article provides some requirements, considerations, and conclusions for implementing this design; weigh these against your own application use cases and architectures when making deployment decisions.
Plan the cluster design appropriately. At the moment, there is no transition from a stretch cluster to multi-cluster with Federation.