A primer on using Altair Grid Engine with GPU-aware applications
In Part I of this two-part article, I provided a short primer on GPUs, explained how they are used in HPC and AI, and covered some of the challenges users encounter running GPU applications on HPC clusters.
In this article, I’ll cover some of the specific innovations in Altair Grid Engine that help make GPU applications easier to deploy and manage at scale.
GPU-aware scheduling facilities in Altair Grid Engine
Altair Grid Engine provides a variety of capabilities that are essential for managing GPU workloads in HPC and AI clusters. With it, we can:
- Auto-select suitable free GPUs based on GPU-specific parameters and real-time load conditions
- Bind nearest neighbor cores and memory for efficient execution of host programs
- Allocate jobs based on nearest neighbor network interfaces, so that parallel jobs run efficiently
- Take bus topology into account when scheduling complex GPU workloads
- Start workloads with nvidia-docker so that applications with different CUDA library dependencies can share GPU hosts
- Start tasks such that only allocated GPU devices are accessible and visible to applications at runtime to promote isolation and avoid errors
Below, we describe some of the unique features in Altair Grid Engine that enable these capabilities.
1. Load sensors in Altair Grid Engine
One of the simplest ways to monitor and manage GPU workloads in Altair Grid Engine is by using a load sensor. By default, Altair Grid Engine tracks a variety of load parameters automatically. To track additional parameters (GPU-related parameters in our case), we can add a GPU-specific load sensor. A load sensor is a script or binary that runs on a host and outputs an arbitrary resource name and a current resource value so that additional resources can be factored into scheduling/workload placement decisions.
To simplify support for GPU-enabled cluster hosts, Altair Grid Engine includes default load sensors for various types of accelerators. Load sensors are included for NVIDIA GPUs (cuda_load_sensor.c) and Intel Xeon Phi coprocessors (phi_sensor.c). Users can also build their own load sensors; samples are distributed as both binaries and source code so they can be recompiled as libraries or available GPU parameters change.
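To illustrate the mechanism, below is a minimal sketch of a load sensor in Python. It assumes the standard Grid Engine load sensor protocol (read a line from stdin each scheduling cycle, exit on "quit", and emit a report between "begin" and "end" markers with one host:resource:value line per metric). The metric names mirror the cuda.* values shown later, but the values here are placeholders; a real sensor would query the driver (for example via nvidia-smi or NVML).

```python
#!/usr/bin/env python3
"""Minimal sketch of a Grid Engine GPU load sensor (illustrative values)."""
import socket
import sys


def gpu_metrics():
    # Placeholder values; a real sensor would query NVML or nvidia-smi here.
    return {"cuda.devices": 2, "cuda.0.temperature": 44.0}


def format_report(host, metrics):
    """Build the begin/end-delimited report Grid Engine expects."""
    lines = ["begin"]
    lines += ["%s:%s:%s" % (host, name, value) for name, value in metrics.items()]
    lines.append("end")
    return lines


def main():
    host = socket.gethostname()
    for line in sys.stdin:          # Grid Engine writes one line per cycle
        if line.strip() == "quit":  # and "quit" at shutdown
            break
        print("\n".join(format_report(host, gpu_metrics())), flush=True)

# When deployed as a sensor, the script is invoked directly and main()
# drives the loop; it is registered via the load_sensor host parameter.
```

In production, the script would be referenced from the load_sensor parameter of the host or global configuration so the execution daemon starts it automatically.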
When in use, UGE GPU load sensors provide values for various GPU device parameters in real time. These values can be used for monitoring GPU status and reporting.
$ qstat -F
queuename                      qtype resv/used/tot. load_avg arch     states
------------------------------------------------------------------------------
all.q@hostname                 BIPC  0/0/10         0.00     lx-amd64
    hl:cuda.verstr=270.41.06
    hl:cuda.0.name=GeForce 8400 GS
    hl:cuda.0.totalMem=511.312M
    hl:cuda.0.freeMem=500.480M
    hl:cuda.0.usedMem=10.832M
    hl:cuda.0.eccEnabled=0
    hl:cuda.0.temperature=44.000000
    hl:cuda.1.name=GeForce 8400 GS
    hl:cuda.1.totalMem=511.312M
    hl:cuda.1.freeMem=406.066M
    hl:cuda.1.usedMem=20.274M
    hl:cuda.1.eccEnabled=0
    hl:cuda.1.temperature=43.000000
    hl:cuda.devices=2
Load sensor parameters can also be used when submitting jobs to specify various resource and runtime requirements for GPU jobs. As examples:
- Run a job on any host with a Tesla V100 GPU and 8GB of free memory, but if possible, select a node in host group A.
- Run a job on any host with either a V100 or P100 GPU, with 4GB of RAM available on the host and 10GB of free memory on the GPU, but only select hosts where the GPU temperature is less than 60 degrees Celsius.
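As a sketch, requests like these can be expressed with hard and soft resource requests on the qsub command line. The complex names below (cuda.0.name, cuda.0.freeMem, cuda.0.temperature) follow the load sensor output shown earlier, while hostgroupA and gpu_job.sh are hypothetical; the temperature request also assumes the complex is configured with the <= relational operator:

```
# Require a V100 with at least 8GB of free GPU memory anywhere,
# but prefer a node in host group A (soft request):
$ qsub -hard -l 'cuda.0.name=*V100*,cuda.0.freeMem=8G' \
       -soft -q '*@@hostgroupA' gpu_job.sh

# Require 4GB of free host RAM, 10GB of free GPU memory, and a GPU
# temperature of at most 60 degrees Celsius:
$ qsub -l 'mem_free=4G,cuda.0.freeMem=10G,cuda.0.temperature=60' gpu_job.sh
```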
2. Resource Maps (RSMAPS)
When an application executing on a CPU core communicates with a PCIe device such as a GPU, it is preferable that the GPU be attached to the PCIe bus local to that CPU socket. Otherwise, traffic must cross between CPU sockets, resulting in higher latency and lower performance.
It is increasingly common to have multiple GPUs installed on a single host. For example, AWS P3 instances provide up to 8 NVIDIA V100 GPUs per machine instance. To support these and similar NUMA architectures, UGE has supported the notion of Resource Maps (RSMAPs) since version 8.1.3.
When using RSMAPs in Altair Grid Engine, each physical GPU on a host can be mapped to a set of cores that have good affinity to the device as shown below. In this example, we define a resource map complex called “gpu” with two IDs (gpu0 and gpu1), and each ID is mapped to a GPU. The second line provides RSMAP topology information to show what CPU cores have affinity to each GPU device.
To make our example more interesting, let's assume that the first GPU is an NVIDIA Tesla P100 and that the second GPU is an NVIDIA Tesla V100. Mixed GPU models in a host are a more likely scenario in an on-premise cluster than in the cloud. The symbolic names assigned to GPUs in the RSMAP do not matter, as long as they are unique. Symbolic names are useful because they can be used to select specific GPU models.
Using the qconf command, we configure an RSMAP with two GPUs (P100 and V100), and each ID is mapped to a GPU device. We express topology information in the RSMAP to indicate what CPU cores have affinity to each GPU device.
When expressing RSMAP topologies, “S” refers to sockets and “C” refers to cores, and the topology mask reflects the number of sockets, cores (and threads) available on a CPU. By convention cores expressed in upper-case (“C”) have good affinity while cores expressed in lower-case (“c”) have a poor affinity.
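As a sketch of this configuration, an administrator might define the RSMAP complex and attach it to a host as follows. The complex definition line and the topology masks are illustrative for a two-socket, four-cores-per-socket machine; exact columns and mask lengths depend on your Altair Grid Engine version and hardware:

```
# Define an RSMAP complex named "gpu" (via qconf -mc):
#   gpu   gpu   RSMAP   <=   YES   YES   NONE   0
# Then attach two GPU IDs with topology masks to the host (qconf -me <hostname>):
complex_values  gpu=2(P100:SCCCCScccc V100:SccccSCCCC)
```

Here cores on the first socket have good affinity (upper-case "C") to the P100, and cores on the second socket have good affinity to the V100.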
With RSMAPs, we have precise control over how GPU jobs can bind CPU cores to GPU workloads.
Suppose we want to run a job that needs a single CPU core associated with a GPU connected on a local PCIe bus. In this case, we can use the syntax below, and UGE will pick a host and assign a core and GPU (of any type) based on the RSMAP host-specific topology mask.
$ qsub -binding linear:1 -l gpu=1 cudajob.py
We can also submit a job that requests all free available cores with affinity to a GPU. This will prevent other jobs from being assigned to cores with affinity to the same GPU that could potentially conflict with our GPU workload.
$ qsub -l gpu=1[affinity=true] cudajob.py
For many GPU workloads (like parallel MPI-enabled Deep Learning workloads) we need to schedule GPUs as parallel workloads. This is easily accomplished using parallel environment (PE) features in UGE.
Consider an example where the allocation rule in our PE is 28 slots on each machine. We want to reserve four machines (each with 28 host slots) and four GPUs per host for a parallel job that requires 112 host slots and 16 physical GPUs.
In the example below, we create a reservation for one hour for 112 slots and four GPUs per host, and submit the parallel deep learning workload spanning four hosts and 16 GPUs by specifying the reservation ID to Altair Grid Engine:
$ qrsub -pe mpi 112 -l hgpu=4 -d 1:0:0
$ qsub -ar <id> -pe mpi 112 -par 4 -l hgpu=4 gpu_mpi_job.sh
3. NVIDIA DCGM integration
As of Altair Grid Engine 8.6.0, configuring GPUs and scheduling GPU workloads became easier still: UGE integrates directly with NVIDIA's Data Center GPU Manager (DCGM), which provides Altair Grid Engine with detailed information about GPU resources. This avoids the need for administrators to write customized load sensors.
If DCGM is running on a cluster host, Altair Grid Engine can automatically retrieve load values and other metrics for installed GPUs and expose them through Altair Grid Engine so that GPU information is available for scheduling and reporting.
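Enabling the integration is a per-host configuration step. The sketch below follows the Altair Grid Engine 8.6 documentation, where the execution daemon is pointed at the DCGM port (5555 is DCGM's default); treat the parameter name as an assumption to verify against the release notes for your version:

```
$ qconf -mconf <hostname>
...
execd_params  UGE_DCGM_PORT=5555
...
```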
DCGM provides mechanisms for discovering GPU topology. Topology information is exposed automatically in Altair Grid Engine via the load value “affinity” as shown in the example below.
In this example, cores in the first socket have good affinity to gpu0 (a P100 GPU) while cores on the second socket have good affinity to gpu1 (a V100 GPU).
$ qconf -se <host>
..
host.cuda.0.affinity=SCTTCTTCTTCTTScttcttcttctt,
host.cuda.0.gpu_temp=36,
host.cuda.0.mem_free=16280.000000M,
host.cuda.0.mem_total=16280.000000M,
host.cuda.0.mem_used=0.000000M,
host.cuda.0.name=Tesla P100-PCIE-16GB,
host.cuda.0.power_usage=28.527000,
host.cuda.0.verstr=390.46,
host.cuda.1.affinity=ScttcttcttcttSCTTCTTCTTCTT,
host.cuda.1.gpu_temp=40,
host.cuda.1.mem_free=16160.000000M,
host.cuda.1.mem_total=16160.000000M,
host.cuda.1.mem_used=0.000000M,
host.cuda.1.name=Tesla V100-PCIE-16GB,
host.cuda.1.power_usage=27.298000,
host.cuda.1.verstr=390.46,
host.cuda.devices=2
..
If we want to schedule a GPU-enabled workload taking this affinity into account, we can request a single P100 GPU as shown below and require that the job be scheduled only on a host with available cores that have affinity to the requested P100 GPU.
$ qsub -l "gpu=1(P100)[affinity=true]" gpu_workload.py
4. Controlling access to GPU devices via Linux cgroups
As explained above, Resource Maps identify the RSMAP IDs on a host (for example gpu0, gpu1, gpu2, gpu3) and associate each RSMAP ID with a physical device. This allows Altair Grid Engine users to request one or more GPUs while Altair Grid Engine keeps track of which physical devices are allocated to each GPU-enabled job. Users can see these associations by running the qstat command to see which jobs are associated with GPU devices on each host.
Normally, the configuration and assignment of devices have no effect on scheduling, but on newer Linux kernels, control groups (cgroups) can be used for fine-grained resource management. A setting called cgroups_params can be set globally or at the host level in Altair Grid Engine (host-level settings override global defaults) to provide granular control over resources and how jobs use them. By listing GPU devices in a cgroup_path under cgroups_params, Altair Grid Engine will limit access to GPUs using cgroups based on how resources are assigned in RSMAPs. This gives administrators better control over how GPU-enabled jobs use resources and prevents applications from accidentally accessing devices that have not been allocated to them.
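A host-level sketch of such a configuration might look like the following; the cgroup_path and devices entries are illustrative, and exact values depend on your Linux distribution and device layout:

```
$ qconf -mconf <hostname>
...
cgroups_params  cgroup_path=/sys/fs/cgroup cpuset=true devices=/dev/nvidia*
...
```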
5. Running nvidia-docker workloads with Altair Grid Engine
In Part I of this article, we showed how a TensorFlow deep learning workload could be run interactively using a container pulled from NGC (NVIDIA's GPU Cloud) and easily run on a compute host with a GPU and CUDA runtime.
$ docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects \
    nvcr.io/nvidia/tensorflow:18.03-py2
Providing seamless support for similar containerized workloads is critical in cluster environments because many GPU workloads run in containers.
When running with Altair Grid Engine, rather than invoking nvidia-docker from the command line, it is better practice to run docker with the alternate --runtime=nvidia syntax shown above, which uses the NVIDIA container runtime registered with the Docker daemon. This allows Altair Grid Engine to recognize the job as a Docker workload and take advantage of the rich Docker integration features in Altair Grid Engine. Using nvidia-docker requires that users be able to pass environment variables (NVIDIA_VISIBLE_DEVICES, for example) into a running container, so Altair Grid Engine needs to provide a mechanism for this as well.
Altair Grid Engine supports the NVIDIA 2.0 Docker Container Runtime (nvidia-docker) allowing transparent access to GPUs on Altair managed compute nodes. For a detailed discussion on running Docker workloads on Altair Grid Engine see our article Using Altair Grid Engine with Docker.
To submit an nvidia-docker workload to UGE, simply use the -xd "--runtime=nvidia" switch on the qsub or qrsh command line. To pass environment variables into the Docker container, add the additional -xd "--env NVIDIA_VISIBLE_DEVICES=0" switch to have Altair Grid Engine pass the environment variables used by nvidia-docker into the container.
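Putting these pieces together, a containerized GPU job might be submitted as sketched below. The image tag matches the earlier docker run example, while the docker_images request and the job script name are illustrative:

```
$ qsub -l docker,docker_images="*nvcr.io/nvidia/tensorflow:18.03-py2*",gpu=1 \
       -xd "--runtime=nvidia" -xd "--env NVIDIA_VISIBLE_DEVICES=0" \
       gpu_container_job.sh
```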
With these enhancements, users can easily run a variety of nvidia-docker enabled containers with different software and library versions across UGE cluster hosts to share GPU resources more effectively while taking advantage of the other rich, topology-aware GPU scheduling features described above.