Simplifying deployments of accelerated AI workloads on Red Hat OpenShift with NVIDIA GPU Operator

In this blog we would like to demonstrate how to use the new NVIDIA GPU operator to deploy GPU-accelerated workloads on an OpenShift cluster.

The new GPU operator enables OpenShift to schedule workloads that require use of GPGPUs as easily as one would schedule CPU or memory for more traditional not accelerated workloads. Start by creating a container that has a GPU workload inside it and request the GPU resource when creating the pod and OpenShift will take care of the rest. This makes deployment of GPU workloads to OpenShift clusters straightforward for users and administrators as it is all managed at the cluster level and not on the host machines. The GPU operator for OpenShift will help to simplify and accelerate the compute-intensive ML/DL modeling tasks for data scientists, as well as  help running inferencing tasks across data centers, public clouds, and at the edge. Typical workloads that can benefit from GPU acceleration include image and speech recognition, visual search and several others.

We assume that you have an OpenShift 4.x cluster deployed with some worker nodes that have GPU devices.

$ oc get no

NAME                           STATUS ROLES AGE VERSION

ip-10-0-130-177.ec2.internal   Ready worker 33m v1.16.2

ip-10-0-132-41.ec2.internal    Ready master 42m v1.16.2

ip-10-0-156-85.ec2.internal    Ready worker 33m v1.16.2

ip-10-0-157-132.ec2.internal   Ready master 42m v1.16.2

ip-10-0-170-127.ec2.internal   Ready worker 4m15s v1.16.2

ip-10-0-174-93.ec2.internal    Ready master 42m v1.16.2

In order to expose what features and devices each node has to OpenShift we first need to deploy the Node Feature Discovery (NFD) Operator (see here for more detailed instructions).

Once the NFD Operator is deployed we can take a look at one of our nodes; here we see the difference between the node before and after. Among the new labels describing the node features, we see:

feature.node.kubernetes.io/pci-10de.present=true

This indicates that we have at least one PCIe device from the vendor ID 0x10de, which is for Nvidia. These labels created by the NFD operator are what the GPU Operator uses in order to determine where to deploy the driver containers for the GPU(s).

However, before we can deploy the GPU Operator we need to ensure that the appropriate RHEL entitlements have been created in the cluster (see here for more detailed instructions). After the RHEL entitlements have been deployed to the cluster, then we may proceed with installation of the GPU Operator.

The GPU Operator is currently installed via helm chart, so make sure that you have helm v3+ installed. Once you have helm installed we can begin the GPU Operator installation.

     1. Add the Nvidia helm repo:

 

$ helm repo add nvidia https://nvidia.github.io/gpu-operator 
"nvidia" has been added to your repositories

     2. Update the helm repo:


$ helm repo update Hang tight while we grab the latest from your chart repositories...
 ...Successfully got an update from the "nvidia" chart repository Update Complete. ⎈ Happy Helming!⎈

     3. Install the GPU Operator helm chart:

 

$ helm install --devel https://nvidia.github.io/gpu-operator/gpu-operator-1.0.0.tgz 
--set platform.openshift=true,operator.defaultRuntime=crio,nfd.enabled=false --wait --generate-name

     4. Monitor deployment of GPU Operator:

 

$ oc get pods -n gpu-operator-resources -w


This command will watch the gpu-operator-resources namespace as the operator rolls out on the cluster. Once the installation is completed you should
see something like this in the gpu-operator-resources namespace.

We can see that both the nvidia-driver-validation and the nvidia-device-plugin-validation pods have completed successfully and we have four daemonsets, each running the number of pods that match the node label feature.node.kubernetes.io/pci-10de.present=true. Now we can inspect our GPU node once again.

Here we can see the latest changes to our node which now include Capacity, Allocatable and Allocatable Resources for a new resource called nvidia.com/gpu. As we see above, since our GPU node only has one GPU we can see that reflected.

Now that we have the NFD Operator, cluster entitlements, and the GPU Operator deployed we can assign workloads that will use the GPU resources.

Let’s begin by configuring Cluster Autoscaling for our GPU devices. This will allow us to create workloads that request GPU resources and then will automatically scale our GPU nodes up and down depending on the amount of requests pending for these devices.

The first step is to create a ClusterAutoscaler resource definition, for example:

$ cat 0001-clusterautoscaler.yaml

apiVersion: "autoscaling.openshift.io/v1"

kind: "ClusterAutoscaler"

metadata:

  name: "default"

spec:

  podPriorityThreshold: -10

  resourceLimits:

    maxNodesTotal: 24

    gpus:

      - type: nvidia.com/gpu

        min: 0

        max: 16

  scaleDown:

    enabled: true

    delayAfterAdd: 10m

    delayAfterDelete: 5m

    delayAfterFailure: 30s

    unneededTime: 10m


$ oc create -f 0001-clusterautoscaler.yaml

clusterautoscaler.autoscaling.openshift.io/default created

 

Here we define the number of nvidia.com/gpu resources that we expect for the Autoscaler.

 

After we deploy the ClusterAutoscaler, we deploy the MachineAutoscaler resource that references the MachineSet that is used to scale the cluster:

$ cat 0002-machineautoscaler.yaml

apiVersion: "autoscaling.openshift.io/v1beta1"

kind: "MachineAutoscaler"

metadata:

  name: "gpu-worker-us-east-1a"

  namespace: "openshift-machine-api"

spec:

  minReplicas: 1

  maxReplicas: 6

  scaleTargetRef:

    apiVersion: machine.openshift.io/v1beta1

    kind: MachineSet

    name: gpu-worker-us-east-1a


$ oc create -f 0002-machineautoscaler.yaml

machineautoscaler.autoscaling.openshift.io/sj-022820-01-h4vrj-worker-us-east-1c created

 

The metadata name should be a unique MachineAutoscaler name, and the MachineSet name at the end of the file should be the value of an existing MachineSet.

 

Looking at our cluster, we check what MachineSets are available:

 

$ oc get machinesets -n openshift-machine-api

NAME                                   DESIRED CURRENT READY AVAILABLE AGE

sj-022820-01-h4vrj-worker-us-east-1a   1 1 1 1 4h45m

sj-022820-01-h4vrj-worker-us-east-1b   1 1 1 1 4h45m

sj-022820-01-h4vrj-worker-us-east-1c   1 1 1 1 4h45m

In this example the third MachineSet sj-022820-01-h4vrj-worker-us-east-1c is the one that has GPU nodes.

 

$ oc get machineset sj-022820-01-h4vrj-worker-us-east-1c -n openshift-machine-api -o yaml 

apiVersion: machine.openshift.io/v1beta1

kind: MachineSet

metadata:

  name: sj-022820-01-h4vrj-worker-us-east-1c

  namespace: openshift-machine-api

...

spec:

  replicas: 1

...

    spec:

      metadata:

          instanceType: p3.2xlarge

          kind: AWSMachineProviderConfig

          placement:

            availabilityZone: us-east-1c

            region: us-east-1

 

We can create our MachineAutoscaler resource definition, which would look like this:

 

$ cat 0002-machineautoscaler.yaml

apiVersion: "autoscaling.openshift.io/v1beta1"

kind: "MachineAutoscaler"

metadata:

  name: "sj-022820-01-h4vrj-worker-us-east-1c"

  namespace: "openshift-machine-api"

spec:

  minReplicas: 1

  maxReplicas: 6

  scaleTargetRef:

    apiVersion: machine.openshift.io/v1beta1

    kind: MachineSet

    name: sj-022820-01-h4vrj-worker-us-east-1c



$ oc create -f 0002-machineautoscaler.yaml

machineautoscaler.autoscaling.openshift.io/sj-022820-01-h4vrj-worker-us-east-1c created

We can now start to deploy RAPIDs using shared storage between multiple instances. Begin by creating a new project:

 

$ oc new-project rapids

 

Assuming you have a StorageClass that provides ReadWriteMany functionality like OpenShift Container Storage with cephfs, we can create a PVC to attach to our RAPIDs instances. (‘storageClassName` is the name of the StorageClass)

 

$ cat 0003-pvc-for-ceph.yaml

apiVersion: v1

kind: PersistentVolumeClaim

metadata:

  name: rapids-cephfs-pvc

spec:

  accessModes:

  - ReadWriteMany

  resources:

    requests:

      storage: 25Gi

  storageClassName: example-storagecluster-cephfs



$ oc create -f 0003-pvc-for-ceph.yaml

persistentvolumeclaim/rapids-cephfs-pvc created



$ oc get pvc -n rapids

NAME                STATUS VOLUME                                 CAPACITY ACCESS MODES STORAGECLASS       AGE

rapids-cephfs-pvc   Bound pvc-a6ba1c38-6498-4b55-9565-d274fb8b003e   25Gi RWX example-storagecluster-cephfs   33s

 

Now that we have our shared storage deployed we can finally deploy the RAPIDs template and create the new application inside our rapids namespace:

 

$ oc create -f 0004-rapids_template.yaml

template.template.openshift.io/rapids created

$ oc new-app rapids

--> Deploying template "rapids/rapids" to project rapids


     RAPIDS

     ---------

     Template for RAPIDS


     A RAPIDS pod has been created.


     * With parameters:

        * Number of GPUs=1

        * Rapids instance number=1


--> Creating resources ...

    service "rapids" created

    route.route.openshift.io "rapids" created

    pod "rapids" created

--> Success

    Access your application via route 'rapids-rapids.apps.sj-022820-01.perf-testing.devcluster.openshift.com'

    Run 'oc status' to view your app.

 

In a browser we can now load the route that the template created above: rapids-rapids.apps.sj-022820-01.perf-testing.devcluster.openshift.com

Image shows example notebook running using GPUs on OpenShift

 

We can also see on our GPU node that RAPIDs is running on it and using the GPU resource:

 

$ oc describe gpu node 

 

Given we have more than one person that wants to run Jupyter playbooks, lets create a second RAPIDs instance with its own dedicated GPU.

 

$ oc new-app rapids -p INSTANCE=2

--> Deploying template "rapids/rapids" to project rapids

     RAPIDS

     ---------

     Template for RAPIDS

     A RAPIDS pod has been created.

     * With parameters:

        * Number of GPUs=1

        * Rapids instance number=2

--> Creating resources ...

    service "rapids2" created

    route.route.openshift.io "rapids2" created

    pod "rapids2" created

--> Success

    Access your application via route 'rapids2-rapids.apps.sj-022820-01.perf-testing.devcluster.openshift.com'

    Run 'oc status' to view your app.

 

But we just used our only GPU resource on our GPU node, so the new deployment of rapids (rapids2) is not schedulable due to insufficient GPU resources.

 

$ oc get pods -n rapids

NAME      READY STATUS    RESTARTS AGE

rapids    1/1 Running   0 30m

rapids2   0/1 Pending   0 2m44s

 

If we look at the event state of the rapids2 pod:

 

$ oc describe pod/rapids -n rapids

...

Events:

  Type     Reason         Age From             Message

  ----     ------         ---- ----             -------

  Warning  FailedScheduling  <unknown> default-scheduler   0/9 nodes are available: 9 Insufficient nvidia.com/gpu.

  Normal   TriggeredScaleUp  44s cluster-autoscaler  pod triggered scale-up: [{openshift-machine-api/sj-022820-01-h4vrj-worker-us-east-1c 1->2 (max: 6)}]

 

We just need to wait for the ClusterAutoscaler and MachineAutoscaler to do their job and scale up the MachineSet as we see above. Once the new node is created:

 

$ oc get no 

NAME                           STATUS ROLES AGE VERSION

(old nodes)

...

ip-10-0-167-0.ec2.internal     Ready worker 72s v1.16.2

 

The new RAPIDs instance will deploy to the new node once it becomes Ready with no user intervention.

To summarize, the new NVIDIA GPU operator simplifies the  use of GPU resources in OpenShift clusters. In this blog we’ve demonstrated the use-case for multi-user RAPIDs development using NVIDIA GPUs. Additionally we’ve used OpenShift Container Storage and the ClusterAutoscaler to automatically scale up our special resource nodes as they are being requested by applications.

As you observed, NVIDIA GPU Operator is already relatively easy to deploy using Helm and work is ongoing to support t deployments right from OperatorHub, simplifying this process even further. 

For more information on NVIDIA GPU Operator and OpenShift, please see the official Nvidia documentation.

1 – Helm 3 is in Tech Preview in OpenShift 4.3, and will GA in OpenShift 4.4

Categories
OpenShift Ecosystem, Operators
Tags
, , ,