OpenShift’s Automated Node Recovery Architecture - Archived

One of the great things about OpenShift is that users have the flexibility to run applications the way they want. OpenShift provides ways for users to run applications in a completely stateless way, or if they prefer, use persistent storage for stateful applications.

If you run your application on OpenShift Online (public cloud), we take care of all this for you. If you’re an administrator of OpenShift Enterprise (private cloud), what we share here should be useful as you consider how you want to manage your environment.

Stateless Apps and Cattle

We agree that applications keeping state should be avoided, and that running nodes as “cattle” is a very desirable way to run a datacenter. However, we also understand that not all applications are engineered to work in such an environment today. If the OpenShift application is stateless, then all nodes are treated as “cattle”: they are allowed to come and go without issue. We have scripts, executable from cron, which clean up these types of nodes when needed.

Stateful App Best Practices

One of our long-held best practices is that if you allow user gears to have state, that state should be kept on a separate shared storage volume (like a SAN, iSCSI, EBS, etc.). There are various reasons for this, but the biggest one is to make node recovery quick and easy, even in the most extreme cases.

Accidents Happen, Even Here

A few weeks ago, one of my teammates accidentally deleted a few nodes in our staging environment. Since we follow this best practice, we were able to easily recover them. We simply created new nodes, pointed the node DNS at them, mounted the gear data volumes and ran oo-admin-regenerate-gear-metadata.
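The node-side part of that recovery (mount the gear data volume, then regenerate the metadata) can be sketched as a small script. This is only an illustration of the order of operations, not the tooling we actually run: the device name and mount point below are assumptions, and the function names are hypothetical. It defaults to a dry run that only reports the commands it would execute.

```python
import shlex
import subprocess

# Hypothetical values for illustration; substitute your own device and mount point.
GEAR_DEVICE = "/dev/xvdf"
GEAR_MOUNT_POINT = "/var/lib/openshift"


def recovery_commands(device, mount_point):
    """Return the node-side recovery commands, in order, as argument lists."""
    return [
        ["mount", device, mount_point],
        ["oo-admin-regenerate-gear-metadata"],
    ]


def recover(device, mount_point, dry_run=True):
    """Run (or, in a dry run, only render) each recovery command in order."""
    rendered = []
    for cmd in recovery_commands(device, mount_point):
        rendered.append(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence if a step fails.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    for line in recover(GEAR_DEVICE, GEAR_MOUNT_POINT):
        print(line)
```

Creating the replacement node and repointing its DNS entry happen outside the node itself, so they are left out of this sketch.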

The purpose of oo-admin-regenerate-gear-metadata is to go through each gear that’s on the gear data volume and ensure it has entries in the following:
* /etc/passwd
* /etc/shadow
* /etc/group
* /etc/cgrules.conf
* /etc/cgconfig.conf
* /etc/security/limits.d

oo-admin-regenerate-gear-metadata is smart and won’t make any changes unless there are missing entries. It’s pretty easy to add this to a startup script in your node images so recovery is even easier.
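To make the "only touch what is missing" idea concrete, here is a minimal sketch of that kind of check for /etc/passwd alone. It is not the actual implementation of oo-admin-regenerate-gear-metadata. It assumes each gear maps to a system user named after the gear, and the function names are hypothetical.

```python
def passwd_users(passwd_text):
    """Return the set of login names found in /etc/passwd-style text."""
    users = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The login name is the first colon-separated field.
        users.add(line.split(":", 1)[0])
    return users


def missing_gears(gear_ids, passwd_text):
    """Return, sorted, the gear IDs that have no passwd entry yet."""
    known = passwd_users(passwd_text)
    return sorted(g for g in gear_ids if g not in known)


if __name__ == "__main__":
    sample_passwd = (
        "root:x:0:0:root:/root:/bin/bash\n"
        "gear-a:x:1000:1000::/var/lib/openshift/gear-a:/bin/bash\n"
    )
    # Assumption: gear IDs would come from the directory names on the gear data volume.
    print(missing_gears(["gear-a", "gear-b"], sample_passwd))
```

When nothing is missing the function returns an empty list, so a caller has nothing to change. That is what makes a check like this safe to run on every boot.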

Check it out yourself

You can find oo-admin-regenerate-gear-metadata in OpenShift Origin, under the origin-server repository on GitHub.

Integrating these scripts into a config management system like Puppet or Ansible makes it even easier for systems to automatically recover from unexpected node issues. oo-admin-regenerate-gear-metadata is another great tool from OpenShift that gives operations teams (like ours) the ability to focus on our jobs and not on interruptions like hardware failure or even the occasional fat-finger.
