How to use GPUs in OpenShift 3.6 (Still Alpha)

This post updates the previous version, which was based on OpenShift 3.5, with the relevant changes for OpenShift 3.6. GPU support in Kubernetes is still alpha and is expected to remain so for the next several releases. The Resource Management Working Group is driving progress toward stabilizing these interfaces.

Environment Overview

  • Red Hat Enterprise Linux 7.4, RHEL7.4 container image
  • OpenShift 3.6.0.173.0.5 (GA) Cluster running on AWS
  • Master node:  m4.xlarge
  • Infra node:  m4.xlarge
  • Compute node 1:  m4.xlarge
  • Compute node 2:  g2.2xlarge (GPU)

NVIDIA Driver Installation

Currently, the NVIDIA driver packaging requires DKMS. DKMS is available in EPEL.

Install EPEL:

# yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

Install NVIDIA cuda repo:

# yum install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-8.0.61-1.x86_64.rpm

Install cuda and NVIDIA drivers:

# yum -y install xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel
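Before moving on, it can be useful to confirm that the driver module was actually built and loaded. A quick check, assuming the packages registered the module with DKMS as described above:

# dkms status
# lsmod | grep nvidia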

Install nvidia-docker. The nvidia-docker service provides an endpoint for communication between GPU hardware and the kubelet.

# yum install -y https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm

Enable and start the nvidia-docker service.

# systemctl enable nvidia-docker
# systemctl start nvidia-docker
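To double-check that the plugin came up cleanly before involving the kubelet, inspect the service and, assuming the nvidia-docker-plugin default REST port of 3476, query its info endpoint:

# systemctl status nvidia-docker
# curl -s http://localhost:3476/v1.0/gpu/info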

Run nvidia-smi to verify the previous steps (the nvidia-smi command will fail if the driver is not loaded correctly). You should see output like this:

# nvidia-smi
Tue Aug 22 12:45:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 00000000:00:03.0 Off |                  N/A |
| N/A   31C    P8    17W / 125W |     11MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Note the memory usage and power draw of the adapter at idle. In this case, the card is drawing 17 watts and using 11MiB of GPU memory.
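If you prefer to capture just those fields (for example, to compare the idle numbers against a running workload later), nvidia-smi can query them directly:

# nvidia-smi --query-gpu=power.draw,memory.used,utilization.gpu --format=csv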

Node Configuration

Alpha GPU support in OpenShift 3.5 leveraged Opaque Integer Resources to steer pods to nodes with the required hardware. It also leveraged the experimental-nvidia-gpus kubelet flag to populate the node with all available GPUs.

The experimental-nvidia-gpus kubelet flag was removed in Kubernetes 1.6 and replaced with an Accelerators feature gate. In OpenShift 3.6, we therefore enable the Accelerators feature gate in the kubelet configuration. In addition, we use node labels to handle workload routing.

Let’s start with the scheduler piece by manually labeling a node so that it has GPU support:

# oc label node ip-172-31-4-10.us-west-2.compute.internal alpha.kubernetes.io/nvidia-gpu-name='GRID_K520' --overwrite
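To quickly confirm the label itself, you can ask oc to print it as a column:

# oc get nodes -L alpha.kubernetes.io/nvidia-gpu-name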

Verify that the node was correctly labeled, and that the kubelet was able to see the GPU:

# oc describe node ip-172-31-4-10.us-west-2.compute.internal | egrep -B1 'Name:|gpu:'
Name:                   ip-172-31-4-10.us-west-2.compute.internal
--
Capacity:
alpha.kubernetes.io/nvidia-gpu:        1
--
Allocatable:
alpha.kubernetes.io/nvidia-gpu:        1

It is important to note that the allocatable field will not be decremented when a GPU is assigned to a pod. The allocatable count represents what the kubelet currently believes exists on the node; it does NOT represent the number of unassigned GPUs on the node. One attempt by Red Hat engineers to address cluster-level capacity and utilization is the cluster-capacity tool.
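If you only want the raw capacity and allocatable counts rather than the full describe output, a jsonpath query works as well (note the escaped dots in the extended resource name):

# oc get node ip-172-31-4-10.us-west-2.compute.internal -o jsonpath='{.status.capacity.alpha\.kubernetes\.io/nvidia-gpu}{"\n"}{.status.allocatable.alpha\.kubernetes\.io/nvidia-gpu}{"\n"}'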

Enable the Accelerators feature gate in /etc/origin/node/node-config.yaml:

kubeletArguments:
  feature-gates:
  - Accelerators=true

Restart the atomic-openshift-node service to enable the configuration change.

# systemctl restart atomic-openshift-node

You should now see the following line in the journal for atomic-openshift-node:

# journalctl -u atomic-openshift-node --since="5 minutes ago" | grep feature
Aug 22 14:57:31 ip-172-31-4-10.us-west-2.compute.internal atomic-openshift-node[27395]: I0822 14:57:31.486944   27395 feature_gate.go:144] feature gates: map[Accelerators:true]

Pod Spec

Now let’s create a pod that consumes a GPU. Here is where the scheduler affinity (routing based on nodeSelector) capability is leveraged:

# cat gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  generateName: gpu-pod-
  annotations:
    scheduler.alpha.kubernetes.io/affinity: >
      {
        "nodeAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
              {
                "matchExpressions": [
                  {
                    "key": "alpha.kubernetes.io/nvidia-gpu-name",
                    "operator": "In",
                    # This value has to match what you labeled the node with:
                    # I used alpha.kubernetes.io/nvidia-gpu-name='GRID_K520'
                    # It is just a string match.  Nothing intelligent yet.
                    "values": ["GRID_K520"]
                  }
                ]
              }
            ]
          }
        }
      }
spec:
  containers:
  - name: gpu-container-1
    # We don’t need privileges, so commented out for now.
    #securityContext:
    #  privileged: true
    image: rhel7
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    # We don’t need volume mounts, because for now we will embed all necessary
    # libraries in the container image. See next section.
    #volumeMounts:
    #- mountPath: /usr/bin
    #  name: bin
    #- mountPath: /usr/lib64/nvidia
    #  name: lib
    command:
    - python
    - "/root/mnist.py"

Wait, why are you embedding libraries and drivers in the container?
Users of NVIDIA GPUs know that there is a very tight coupling between the kernel drivers/headers and the userspace CUDA libraries that leverage the hardware. The installation of the required packages is a root-level operation, and those packages would be leveraged by every pod that uses CUDA. Thus, building the drivers/packages into the container image couples the image to the kernel module version running on the host, and it requires larger images to be available on every node with a GPU.

So, with those downsides, why do it? In short, as documented in Kubernetes (and commented out in the pod spec above), you could use host mounts to bind-mount the shared libraries into each pod at runtime. The downside of that approach is that host mounts require the pod to run as privileged. Embedding the necessary libraries in the container image means that the pod does not require elevated privileges to use a GPU.

Embedding libraries does require an operational model that triggers image rebuilds/redeploys whenever the driver version is updated on the host itself; otherwise the versions could drift apart and become incompatible.

Privileged pods are fully supported. Elevated privileges versus container image size and image hygiene is a trade-off that each site needs to evaluate.
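For reference, here is a minimal sketch of what the privileged host-mount variant might look like, based on the mounts commented out in the pod spec above (the exact host path depends on where the NVIDIA userspace libraries are installed on your nodes):

spec:
  containers:
  - name: gpu-container-1
    image: rhel7
    securityContext:
      privileged: true
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - mountPath: /usr/lib64/nvidia
      name: lib
  volumes:
  - name: lib
    hostPath:
      path: /usr/lib64/nvidia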

Dockerfile

For the short term, Red Hat Performance Engineering has chosen to develop a utility that generates a set of Dockerfiles representing a wide variety of machine learning frameworks (such as Theano, Lasagne, TensorFlow, and many more). While this utility isn’t open source as of this writing, we are preparing to publish it externally shortly.

Here is the Dockerfile generated by the utility:

FROM registry.access.redhat.com/rhel7.4

RUN yum install -y cmake curl gcc gcc-c++ git make patch pciutils unzip

RUN yum install -y cuda; export CUDA_HOME="/usr/local/cuda" CUDA_PATH="${CUDA_HOME}" PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}" LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"; echo -e 'export CUDA_HOME=/usr/local/cuda \nexport CUDA_PATH=${CUDA_HOME} \nexport PATH=${CUDA_HOME}/bin:${PATH} \nexport LD_LIBRARY_PATH=${CUDA_HOME}/lib64:/usr/local/lib:$LD_LIBRARY_PATH \n' >> ~/.bashrc; cd /tmp && curl -fsSLO "http://developer.download.nvidia.com/compute/redist/cudnn/v6.0/cudnn-8.0-linux-x64-v6.0.tgz"; tar -C /usr/local -zxf /tmp/cudnn-8.0-linux-x64-v6.0.tgz
RUN yum install -y python2*pip python-devel; pip install --upgrade pip
RUN pip install "https://github.com/Lasagne/Lasagne/archive/master.zip"
RUN yum install -y cmake3; cd /tmp && git clone "https://github.com/Theano/libgpuarray.git"; mkdir -p /tmp/libgpuarray/Build && cd /tmp/libgpuarray/Build && cmake3 .. -DCMAKE_BUILD_TYPE=Release && make && make install; pip install Cython; cd /tmp/libgpuarray && python setup.py build_ext -L /usr/local/lib64 -I /usr/local/include && python setup.py install && ldconfig; echo -e '[global] \ndevice = cuda0 \nfloatX = float32 \n[nvcc] \nfastmath=True \n[cuda] \nroot=/usr/local/cuda \n' >> ~/.theanorc ; pip install Theano
RUN curl https://raw.githubusercontent.com/Lasagne/Lasagne/master/examples/mnist.py -o /root/mnist.py

Note that the last step in the Dockerfile pulls down a small test script that we will use to prove the GPU is correctly presented in the pod, and that it can accelerate a workload.
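Building the image is the usual docker workflow; the tag below is just a placeholder, and whatever you choose (and wherever you push it) needs to match the image referenced in the pod spec:

# docker build -t rhel7-gpu-mnist .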

Running My GPU Pod

For the morbidly curious, here is some spelunking through the various portions of the system. Look at the major/minor numbers of the device files created by the NVIDIA driver:

# ls -al /dev/nvidia*
crw-rw-rw-. 1 root root 195,   0 Aug 21 16:22 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Aug 21 16:22 /dev/nvidiactl
crw-rw-rw-. 1 root root 245,   0 Aug 21 16:22 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 245,   1 Aug 21 16:22 /dev/nvidia-uvm-tools

Get the container ID from the gpu-pod:

# docker ps|grep gpu
bbe9121ba506        registry.access.redhat.com/rhel7@sha256:35e639660198b9eb6d207ad4cb23547f4ab96af4c48c81a143335a01ad4f063f   "sleep infinity"    4 hours ago         Up 4 hours                              k8s_gpu-container-1_gpu-pod-mtg81_default_d18d46d0-8682-11e7-b43e-026879ce7ee8_1

eedd3dac5344        registry.access.redhat.com/openshift3/ose-pod:v3.6.140                                                     "/usr/bin/pod"      4 hours ago         Up 4 hours                              k8s_POD_gpu-pod-mtg81_default_d18d46d0-8682-11e7-b43e-026879ce7ee8_1

Note that the kubelet has correctly whitelisted the necessary device files to bind the GPU to this pod.

# find /sys/fs/cgroup -type f -name devices.list|grep bbe9121ba506|xargs grep 195
c 195:0 rwm
c 195:255 rwm

I’ve executed the mnist.py workload in a pod with a GPU, and in one without. The workload runs a training simulation in 500 batches. Below is a graph of the average time per batch in the pod with a GPU and in the pod without a GPU (meaning it ran on the host CPU):

You can see a roughly 15x reduction in time to complete the model training.
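If you want to reproduce the CPU-only data point from the same image, one option (assuming Theano’s standard behavior of letting THEANO_FLAGS override the ~/.theanorc baked into the image) is to force the device at run time; the pod name here is the generated one seen later in the docker ps output:

# oc exec gpu-pod-mtg81 -- env THEANO_FLAGS=device=cpu python /root/mnist.py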

While the benchmark was running, I ran the nvidia-smi tool again. Note the increase in power draw (was 17W, now 43W) and memory usage (was 11MiB, now 85MiB). Additionally, you can see the python interpreter process is listed along with its memory usage.

# nvidia-smi
Tue Aug 22 12:45:22 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 00000000:00:03.0 Off |                  N/A |
| N/A   32C    P0    43W / 125W |     85MiB /  4036MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14906    C   python                                          74MiB |
+-----------------------------------------------------------------------------+
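Rather than taking single snapshots, you can also have nvidia-smi log the interesting fields on an interval while the training runs (every 5 seconds here):

# nvidia-smi --query-gpu=timestamp,power.draw,memory.used,utilization.gpu --format=csv -l 5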

One interesting little nit I stumbled upon: if I ran nvidia-smi inside the same container (via oc rsh…), it did not list any running processes. It did not even say “No running processes found”; that section of the table was simply blank. However, when I ran nvidia-smi on the host where the pod and mnist.py were running, it did see the python process. Weird.
