Running general-purpose compute workloads on Graphics Processing Units (GPUs) has become increasingly popular across a wide range of application domains, mirroring the growing ubiquity of applications deployed in Linux containers. Thanks to community participant Clarifai, Kubernetes has been able to schedule workloads that depend on GPUs since version 1.3, which lets us develop applications on the cutting edge of both trends with Kubernetes or OpenShift.
When folks talk about GPU-accelerated workloads (at least as of now), they are generally referring to NVIDIA GPUs and applications developed with the CUDA toolchain. These apps are typically stateful and run on dedicated resources, unlike the stateless greenfield microservices we usually have in mind when we think of where Kubernetes shines today. But the industry at large is demonstrating a desire to expand the range of applications that can run optimally in containers orchestrated by Kubernetes.
In the past, Red Hat has experimented with technologies like Intel DPDK and Solarflare OpenOnload. It is immediately obvious that NVIDIA’s progress in containerizing CUDA along with its hardware is a microcosm of the technical challenges facing those other pieces of hardware, as well as Kubernetes in general. It follows known patterns for closed source applications that want to integrate with the open source community.
For example, distributions must be concerned with licensing, version management, QA procedures, and the kernel module and ABI/symbol conflicts that occur with any closed source driver and stack. These challenges precisely mirror those faced by many other hardware vendors, whether they make co-processors, FPGAs, bypass accelerators or similar.
All of that said, the benefits of GPUs and other hardware accelerators over generic CPUs are often dramatic, with jobs potentially completing orders of magnitude faster. The demand for blending the benefits of hardware accelerators with data-center-wide workload orchestration is reaching fever pitch. Typically, this line of thinking ends in an important exercise in density and efficiency, and often in power-consumption savings attributable to those efficiency gains.
I should note that due to the “alpha” state of GPU support in Kubernetes, the following run-through on how to connect OpenShift, running on RHEL, with an NVIDIA adapter inside an EC2 instance, is currently unsupported. Polishing some of the sharp corners is a community responsibility, and indeed there is plenty of work underway.
If you’re interested in following upstream developments, I encourage you to monitor Kubernetes sig-node.
The test environment:
- RHEL 7.3, RHEL 7.3 container image
- OpenShift 3.5
- OpenShift Master: EC2 m4.xlarge instance
- OpenShift Infra: EC2 m4.xlarge instance
- OpenShift Node: EC2 g2.xlarge instance
As with any nascent/alpha technology, documentation is somewhat lacking and there are a lot of disparate moving pieces to line up. Here is how we got a basic GPU smoke test working on OpenShift 3.5:
Install the nvidia-docker RPM:
# rpm -ivh --nodeps https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0/nvidia-docker-1.0.0-1.x86_64.rpm
Set up a yum repo on the host that has the GPU card. This is because we will have to install proprietary NVIDIA drivers.
/etc/yum.repos.d/nvidia.repo:
[NVIDIA]
name=NVIDIA
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/
failovermethod=priority
enabled=1
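For anyone scripting this step, the repo file above could be written with a heredoc. This is only a sketch: the function name `write_nvidia_repo` is made up for illustration, and the target path is a parameter so it can be tried against a scratch file before touching `/etc/yum.repos.d/` (which needs root).

```shell
# Hypothetical helper: write the NVIDIA yum repo definition shown above
# to the file given as the first argument.
write_nvidia_repo() {
    local target="$1"
    # Quoted 'EOF' keeps the content literal (no variable expansion).
    cat > "$target" <<'EOF'
[NVIDIA]
name=NVIDIA
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/
failovermethod=priority
enabled=1
EOF
}
```

On the GPU host this would be called as `write_nvidia_repo /etc/yum.repos.d/nvidia.repo`.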
Install the driver and the devel headers on the host. Note that this step takes about 4 minutes to complete because it rebuilds the kernel module.
# yum -y install xorg-x11-drv-nvidia xorg-x11-drv-nvidia-devel
On the node with the GPU, ensure the new modules are loaded. On RHEL, the nouveau module will load by default. This prevents the nvidia-docker service from starting. The nvidia-docker service blacklists the nouveau module, but does not unload it. So you can either reboot the node, or remove the nouveau module manually:
# modprobe -r nouveau
# nvidia-modprobe
# systemctl restart nvidia-docker
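Before restarting nvidia-docker it can help to confirm which module is actually loaded. The following is a minimal sketch that reads a `/proc/modules`-style listing; the function names are hypothetical, and the module names it looks for (`nouveau`, `nvidia`) are the ones discussed above.

```shell
# Return 0 if the named kernel module appears in a /proc/modules-style file.
# The second argument defaults to /proc/modules and exists so the check can
# be exercised against a sample file.
module_loaded() {
    local name="$1"
    local modules_file="${2:-/proc/modules}"
    # The module name is the first whitespace-separated field on each line.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Print a one-line summary of the nouveau/nvidia state.
report_gpu_modules() {
    local modules_file="${1:-/proc/modules}"
    if module_loaded nouveau "$modules_file"; then
        echo "nouveau is loaded: reboot or run 'modprobe -r nouveau'"
    elif module_loaded nvidia "$modules_file"; then
        echo "nvidia is loaded and nouveau is not"
    else
        echo "neither nouveau nor nvidia is loaded"
    fi
}
```

Running `report_gpu_modules` with no arguments inspects the live system.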
On the node that has the GPU, update /etc/origin/node/node-config.yaml to set a single NVIDIA GPU in node capacity/allocatable. Note that the kubelet flag carries an “experimental” prefix, and that this is a manual change. Also note that at the time of this writing, Kubernetes only supports a single GPU per node. These rough edges are where the “alpha” nature of GPU support in Kubernetes becomes apparent.
We hope to arrive, along with the community, at a unified hardware-discovery feature (perhaps a pod/agent), that feeds the scheduler with all of the hardware information to make intelligent workload-routing decisions possible going forward.
In /etc/origin/node/node-config.yaml:
kubeletArguments:
  experimental-nvidia-gpus:
  - '1'
Then restart the atomic-openshift-node service so this setting takes effect.
# systemctl restart atomic-openshift-node
Here is what the updated node capacity looks like. You can see that there’s a new capacity field, and this can now be used by the Kubernetes scheduler to route pods accordingly.
# oc describe node ip-x-x-x-x.us-west-2.compute.internal
<snip>
Capacity:
 alpha.kubernetes.io/nvidia-gpu: 1
 cpu: 8
 memory: 14710444Ki
 pods: 250
Allocatable:
 alpha.kubernetes.io/nvidia-gpu: 1
 cpu: 8
 memory: 14710444Ki
 pods: 250
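To check the new capacity field from a script rather than by eye, the describe output can be filtered. This is a rough sketch, assuming the output keeps the layout shown above (a bare `Capacity:` or `Allocatable:` header followed by one resource per line); `gpu_count` is a made-up helper name.

```shell
# Read `oc describe node` output on stdin and print the value of
# alpha.kubernetes.io/nvidia-gpu under the given section
# ("Capacity" or "Allocatable"). Prints 0 if the resource is absent.
gpu_count() {
    local section="$1"
    awk -v want="$section:" '
        # A section header is a line whose only field is e.g. "Capacity:".
        NF == 1 && $1 ~ /:$/ { current = $1; next }
        current == want && $1 == "alpha.kubernetes.io/nvidia-gpu:" { n = $2 }
        END { print n + 0 }
    '
}
```

For example: `oc describe node ip-x-x-x-x.us-west-2.compute.internal | gpu_count Allocatable`.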
And here is an example pod file that requests the GPU device. The default command is “sleep infinity” so that we can connect to the pod after it is created (using the “oc rsh” command) to do some manual inspection.
# cat openshift-gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: openshift-gpu-test
spec:
  containers:
  - command:
    - sleep
    - infinity
    name: openshift-gpu
    image: rhel7
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
Create a pod using the above definition:
# oc create -f openshift-gpu-test.yaml
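The pod may take a moment to be scheduled and started, so `oc rsh` can fail if it is run immediately. One way to wait is to poll the pod phase with `oc get pod ... -o jsonpath`. This is a sketch only; `wait_for_running`, the default of 30 tries, and the 2-second interval are illustrative choices, not part of the original walkthrough.

```shell
# Poll the pod phase until it is Running, or give up after a number of tries.
wait_for_running() {
    local pod="$1"
    local tries="${2:-30}"
    local phase=""
    local i=0
    while [ "$i" -lt "$tries" ]; do
        # Errors (e.g. pod not yet visible) are ignored and simply retried.
        phase="$(oc get pod "$pod" -o jsonpath='{.status.phase}' 2>/dev/null)"
        if [ "$phase" = "Running" ]; then
            echo "$pod is Running"
            return 0
        fi
        sleep 2
        i=$((i + 1))
    done
    echo "$pod did not reach Running (last phase: ${phase:-unknown})" >&2
    return 1
}
```

Usage would look like `wait_for_running openshift-gpu-test && oc rsh openshift-gpu-test`.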
Connect to the pod:
# oc rsh openshift-gpu-test
Inside the pod, set up the EPEL, RHEL and NVIDIA repos, then install CUDA (note that we could have used the nvidia/cuda:centos7 container image here instead). This is again a place where the experience could be smoothed out by providing an all-in-one container that includes the GPU/ML toolchains developers need.
# yum install cuda -y
The cuda packages include some test utilities we can use to verify that the GPU can be accessed from inside the pod:
sh-4.2# /usr/local/cuda-8.0/extras/demo_suite/deviceQuery
/usr/local/cuda-8.0/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4036 MBytes (4232052736 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

sh-4.2# /usr/local/cuda-8.0/extras/demo_suite/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GRID K520
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                8003.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                5496.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)   Bandwidth(MB/s)
   33554432                119111.3

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
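Both utilities end with a `Result = PASS` line, which makes the smoke test easy to automate. The sketch below filters their output on stdin; the function names are hypothetical, and the patterns assume the output format shown above.

```shell
# Succeed only if the CUDA sample output on stdin reports "Result = PASS".
cuda_sample_passed() {
    grep -q 'Result = PASS'
}

# Print the quoted device names from deviceQuery output on stdin,
# e.g. the line: Device 0: "GRID K520"
cuda_device_names() {
    sed -n 's/^ *Device [0-9]*: "\(.*\)"$/\1/p'
}
```

For example, inside the pod: `/usr/local/cuda-8.0/extras/demo_suite/deviceQuery | cuda_sample_passed && echo "GPU visible"`.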
Next, some more snooping around to make sure the cgroups are set up correctly. On the host running the pod, get the container ID:
# docker ps | grep rhel7
CONTAINER ID   IMAGE                                            COMMAND            CREATED          STATUS          PORTS   NAMES
deb709448449   rhel7                                            "sleep infinity"   44 minutes ago   Up 44 minutes           k8s_nvidia-gpu.134f09f4_openshift-gpu-test_default_cd8446e9-ed69-11e6-86f5-02fdcf6b20ab_da81cc2c
4d2608e1808c   registry.fqdn/openshift3/ose-pod:v18.104.22.168  "/pod"             44 minutes ago   Up 44 minutes           k8s_POD.17e9e6be_nvidia-gpu-test_default_cd8446e9-ed69-11e6-86f5-02fdcf6b20ab_0fc36347
Check out the major/minor device numbers for the NVIDIA hardware. Note that these devices are created by the proprietary NVIDIA drivers installed earlier on the host system:
# ls -al /dev/nvidia*
crw-rw-rw-. 1 root root 195,   0 Feb 7 13:54 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Feb 7 13:54 /dev/nvidiactl
crw-rw-rw-. 1 root root 247,   0 Feb 7 13:47 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 247,   1 Feb 7 13:47 /dev/nvidia-uvm-tools
# egrep '247|195' /sys/fs/cgroup/devices/system.slice/docker-deb709448449bf1ef1366c08addc2e0d68188225d9973f4eb87f2e4658f85571.scope/devices.list
c 195:0 rwm
c 195:255 rwm
c 247:0 rwm
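The comparison above between device numbers and the container's `devices.list` can also be done mechanically. The following is a sketch with hypothetical function names. It only matches exact `c MAJOR:MINOR` entries and does not interpret wildcard rules such as `*:*`; `stat -c` is the GNU coreutils form, which is an assumption about the host.

```shell
# Succeed if character device MAJOR:MINOR has an exact entry in a cgroup
# devices.list file (lines look like "c 195:0 rwm").
device_allowed() {
    local list_file="$1"
    local major="$2"
    local minor="$3"
    awk -v dev="$major:$minor" \
        '$1 == "c" && $2 == dev { ok = 1 } END { exit !ok }' "$list_file"
}

# For every /dev/nvidia* character device on this host, report whether it
# is listed. stat prints major/minor in hex, so convert to decimal first.
report_nvidia_devices() {
    local list_file="$1"
    local dev major minor
    for dev in /dev/nvidia*; do
        [ -c "$dev" ] || continue
        major=$((0x$(stat -c '%t' "$dev")))
        minor=$((0x$(stat -c '%T' "$dev")))
        if device_allowed "$list_file" "$major" "$minor"; then
            echo "$dev ($major:$minor) allowed"
        else
            echo "$dev ($major:$minor) not listed"
        fi
    done
}
```

Run against the listing above, this would flag 247:1 (/dev/nvidia-uvm-tools) as not listed, since that entry does not appear in the egrep output.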
While GPU support is still in an alpha state (and unsupported) in both Kubernetes and OpenShift, and there are some rough edges, it does work well and is making progress towards full support in the future.
Some of the important gaps that the community needs to resolve include:
- Proper handling of proprietary drivers (some DKMS or privileged-init-container-like technology to build/rebuild/securely handle modules).
- Manual configuration of the kubelet, necessitated by the lack of a hardware-discovery facility (device discovery).
- A maximum of one GPU pod per node is allowed; we should eventually be able to provide secure, multi-tenant access to multiple GPUs.
- For those interested in top performance and the best possible efficiency, Kubernetes should understand the physical NUMA topology of a system and set the affinity of workload processes accordingly.
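Until the scheduler understands NUMA topology, the locality of a GPU can at least be inspected by hand: the Linux kernel exposes a `numa_node` file under each PCI device's sysfs directory. A minimal sketch follows; `pci_numa_node` is a made-up helper, and the PCI address in the usage note is an assumption derived from the "0 / 0 / 3" domain/bus/location reported by deviceQuery earlier.

```shell
# Print the NUMA node the kernel reports for a PCI device, given its sysfs
# directory. A value of -1 means the kernel has no NUMA information for it.
pci_numa_node() {
    local dev_dir="$1"
    if [ -r "$dev_dir/numa_node" ]; then
        cat "$dev_dir/numa_node"
    else
        echo "unknown"
    fi
}
```

On the GPU host this might be called as `pci_numa_node /sys/bus/pci/devices/0000:00:03.0`. The result could then guide manual pinning with a tool such as `numactl --cpunodebind`.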