IMPORTANT: the examples in this blog are only valid for the corresponding version of OpenShift.  If you have a newer version of OpenShift, such as 3.9, see this blog.

Running general-purpose compute workloads on Graphics Processing Units (GPUs) has become increasingly popular across a wide range of application domains, mirroring the increasing ubiquity of deploying applications in Linux containers. Thanks to community participant Clarifai, Kubernetes has been able to schedule GPU-dependent workloads since version 1.3, enabling us to develop applications at the cutting edge of both trends on Kubernetes or OpenShift.

When folks talk about GPU-accelerated workloads (at least as of now), they generally mean NVIDIA-based GPUs and applications developed with the CUDA toolchain. These apps are typically stateful and run on dedicated resources, not the kind of stateless, greenfield microservices we usually associate with where Kubernetes shines today. But the industry at large is demonstrating a desire to expand the base of applications that can run optimally in containers, orchestrated by Kubernetes.

In the past, Red Hat has experimented with technologies like Intel DPDK and Solarflare OpenOnload -- and it's immediately obvious that NVIDIA's progress in containerizing CUDA alongside its hardware is a microcosm of the technical challenges facing those other pieces of hardware, and Kubernetes in general: it follows the familiar patterns of closed source software wanting to integrate with the open source community.

For example, distributions must be concerned with the licensing, version management, QA procedures, and kernel module and ABI/symbol conflicts that come with any closed source driver and stack. These challenges precisely mirror those faced by many other hardware vendors, whether they ship co-processors, FPGAs, bypass accelerators, or similar devices.

All of that said, the benefit of GPUs and other hardware accelerators over generic CPUs is often dramatic, with jobs potentially completing orders of magnitude faster. The demand for blending the benefits of hardware accelerators with data-center-wide workload orchestration is reaching a fever pitch. Typically, this line of thinking ends in an exercise in density, efficiency, and often the power savings attributable to those efficiency gains.

I should note that due to the "alpha" state of GPU support in Kubernetes, the following run-through on how to connect OpenShift, running on RHEL, with an NVIDIA adapter inside an EC2 instance, is currently unsupported. Polishing some of the sharp corners is a community responsibility, and indeed there is plenty of work underway.

If you're interested in following upstream developments, I encourage you to monitor Kubernetes sig-node.

Environment

  • RHEL 7.3 host, RHEL 7.3 container image
  • OpenShift 3.5.0.17
    • OpenShift Master: EC2 m4.xlarge instance
    • OpenShift Infra: EC2 m4.xlarge instance
    • OpenShift Node: EC2 g2.2xlarge instance

Howto

As with any nascent/alpha technology, documentation is somewhat lacking and there are a lot of disparate moving pieces to line up.  Here is how we're able to get a basic smoke test of a GPU going on OpenShift 3.5:

Install the nvidia-docker RPM:

# rpm -ivh --nodeps https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0/nvidia-docker-1.0.0-1.x86_64.rpm

Set up a yum repo on the host that has the GPU card; we need it to install the proprietary NVIDIA drivers.

/etc/yum.repos.d/nvidia.repo:

[NVIDIA]
name=NVIDIA
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/
failovermethod=priority
enabled=1

Install the driver and the devel headers on the host. Note that this step takes about 4 minutes to complete (it rebuilds the kernel module):

# yum -y install xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel

On the node with the GPU, ensure the new modules are loaded. On RHEL, the nouveau module loads by default, which prevents the nvidia-docker service from starting. The nvidia-docker service blacklists the nouveau module but does not unload it, so you can either reboot the node or remove the nouveau module manually:

# modprobe -r nouveau
# nvidia-modprobe
# systemctl restart nvidia-docker
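
Before moving on to the OpenShift configuration, it's worth a quick host-side sanity check (an extra step, not part of the original run-through): the nvidia module should now be loaded in place of nouveau, and nvidia-smi, which is normally installed along with the proprietary driver packages, should list the GRID K520 in this instance type.

# lsmod | grep nvidia
# nvidia-smi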

On the node that has the GPU, update /etc/origin/node/node-config.yaml to set a single NVIDIA GPU in the node's capacity/allocatable. Note that the kubelet flag is named experimental, and that this is a manual change. Also note that at the time of this writing, Kubernetes only supports a single GPU per node. These rough edges are where the “alpha” nature of GPU support in Kubernetes becomes apparent.

We hope to arrive, along with the community, at a unified hardware-discovery feature (perhaps a pod/agent) that feeds the scheduler all of the hardware information needed to make intelligent workload-routing decisions going forward.

In /etc/origin/node/node-config.yaml:

kubeletArguments:
  experimental-nvidia-gpus:
  - '1'

Then restart the openshift-node service so this setting takes effect.

# systemctl restart atomic-openshift-node

Here is what the updated node capacity looks like. You can see that there's a new capacity field, and this can now be used by the Kubernetes scheduler to route pods accordingly.

# oc describe node ip-x-x-x-x.us-west-2.compute.internal
<snip>
Capacity:
alpha.kubernetes.io/nvidia-gpu: 1
cpu: 8
memory: 14710444Ki
pods: 250
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 1
cpu: 8
memory: 14710444Ki
pods: 250
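
If you want that value in a more script-friendly form, a jsonpath query should also work (an assumed convenience, not part of the original write-up; note the escaped dots in the resource name, and substitute your own node name):

# oc get node ip-x-x-x-x.us-west-2.compute.internal -o jsonpath='{.status.capacity.alpha\.kubernetes\.io/nvidia-gpu}'

This should print 1, matching the capacity shown above.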

And here is an example pod file that requests the GPU device. The default command is "sleep infinity" so that we can connect to the pod after it is created (using the "oc rsh" command) to do some manual inspection.

# cat openshift-gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: openshift-gpu-test
spec:
  containers:
  - command:
  - sleep
  - infinity
  name: openshift-gpu
  image: rhel7
  resources:
    limits:
      alpha.kubernetes.io/nvidia-gpu: 1

 

Create a pod using the above definition:

# oc create -f openshift-gpu-test.yaml

Connect to the pod:

# oc rsh openshift-gpu-test

Inside the pod, set up the EPEL, RHEL, and NVIDIA repos, then install CUDA (note that we could have used the nvidia/cuda:centos7 container image instead). This is again a place where the experience could be smoothed out by providing an all-in-one container image with GPU/ML toolchains that developers can consume.
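
As a rough sketch of that repo setup inside the pod (assumed commands, not a transcript from the original run): EPEL can be added with the usual epel-release RPM, the RHEL repos are typically available through the host's subscription entitlements when the rhel7 image runs on a subscribed RHEL host, and the NVIDIA repo is the same definition we created on the host earlier.

sh-4.2# rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sh-4.2# cat > /etc/yum.repos.d/nvidia.repo <<'EOF'
[NVIDIA]
name=NVIDIA
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/
failovermethod=priority
enabled=1
EOF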

# yum install cuda -y

The cuda packages include some test utilities we can use to verify that the GPU can be accessed from inside the pod:

sh-4.2# /usr/local/cuda-8.0/extras/demo_suite/deviceQuery
/usr/local/cuda-8.0/extras/demo_suite/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4036 MBytes (4232052736 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

sh-4.2# /usr/local/cuda-8.0/extras/demo_suite/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: GRID K520
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 8003.2

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5496.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 119111.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Now for some more snooping around to make sure the cgroups are set up correctly. On the host running the pod, get the container ID:

# docker ps | grep rhel7
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
deb709448449 rhel7 "sleep infinity" 44 minutes ago Up 44 minutes k8s_nvidia-gpu.134f09f4_openshift-gpu-test_default_cd8446e9-ed69-11e6-86f5-02fdcf6b20ab_da81cc2c
4d2608e1808c registry.fqdn/openshift3/ose-pod:v3.5.0.17 "/pod" 44 minutes ago Up 44 minutes k8s_POD.17e9e6be_nvidia-gpu-test_default_cd8446e9-ed69-11e6-86f5-02fdcf6b20ab_0fc36347

Check out the major/minor device numbers for the NVIDIA hardware. Note that these devices are created by the proprietary NVIDIA drivers installed earlier on the host system:

# ls -al /dev/nvidia*
crw-rw-rw-. 1 root root 195, 0 Feb 7 13:54 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Feb 7 13:54 /dev/nvidiactl
crw-rw-rw-. 1 root root 247, 0 Feb 7 13:47 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 247, 1 Feb 7 13:47 /dev/nvidia-uvm-tools

# egrep '247|195' /sys/fs/cgroup/devices/system.slice/docker-deb709448449bf1ef1366c08addc2e0d68188225d9973f4eb87f2e4658f85571.scope/devices.list
c 195:0 rwm
c 195:255 rwm
c 247:0 rwm
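
As one last cross-check (again an extra step, not from the original run), the device nodes behind those allow-list entries should also be visible from inside the pod:

# oc rsh openshift-gpu-test sh -c 'ls -al /dev/nvidia*'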

Summary

While GPU support is still in an alpha state in both Kubernetes and OpenShift (and therefore unsupported), and there are some rough edges, it does work well and is making progress toward full support in the future.

Some of the important gaps that the community needs to resolve include:

  • Proper handling of proprietary drivers (some DKMS or privileged-init-container-like technology to build/rebuild/securely handle modules).
  • Manual configuration of the kubelet, necessitated by the lack of a hardware-discovery facility (device discovery).
  • A maximum of one GPU pod per node is allowed; we should eventually be able to provide secure, multi-tenant access to multiple GPUs.
  • For those interested in top performance and the best possible efficiency, Kubernetes should be able to understand the physical NUMA topology of a system and affine workload processes accordingly.

About the author

A 20+ year tech industry veteran, Jeremy is a Distinguished Engineer within the Red Hat OpenShift AI product group, building Red Hat's AI/ML and open source strategy. His role involves working with engineering and product leaders across the company to devise a strategy that will deliver a sustainable open source, enterprise software business around artificial intelligence and machine learning.
