Getting the Most Out of NVIDIA DGX Systems with Altair Grid Engine
Unlike traditional machine learning, where humans must identify the features relevant to a model, deep learning has feature identification built in. This means that models can be trained directly from raw data, including text, images, videos, and domain-specific datasets.
For big, gnarly problems with lots of data but where features are poorly understood, deep learning is where the action is. While it is more compute-intensive to develop and train deep learning models, they can substantially outperform other machine learning techniques. Not surprisingly, deep learning is being embraced in industries from oil & gas to cybersecurity to manufacturing.
GPUs are critical for deep learning
Deep learning has become practical because of tremendous advances in computing power and affordability enabled by modern GPUs. Deep learning environments typically consist of multiple servers, each with several NVIDIA GPUs connected via high-speed interconnects. The software stack typically includes Linux, NVIDIA drivers, Docker, NVIDIA Docker, and various management tools. Users often run multiple deep learning frameworks such as TensorFlow, Keras, PyTorch, Caffe, Theano, MXNet, and others. Whether on-premises or in the cloud, configuring these complex GPU environments can be challenging.
NVIDIA DGX systems simplify deep learning
Fortunately, NVIDIA offers purpose-built hardware platforms that make deep learning applications much easier to deploy and manage. NVIDIA® DGX™ Systems are designed specifically for deep learning applications. The DGX family comprises the NVIDIA DGX Station™ and the NVIDIA DGX-1™ and DGX-2™ rackmount servers. NVIDIA DGX-2 servers provide up to 16 NVIDIA Volta™ V100 GPUs with an NVIDIA NVSwitch™-powered NVLink™ fabric offering up to 2.4 TB/s of bandwidth. DGX-2 servers also come pre-configured with Mellanox® EDR InfiniBand offering 1,600 Gb/s of bi-directional bandwidth between hosts.
Managing deep learning workloads
In enterprise environments, multiple data science teams often share a DGX cluster. Workloads range from ETL flows that generate training data to training jobs to ongoing model validation. Many of these workloads are long-running, taking hours or even days to complete. Jobs can involve different software frameworks and multiple GPUs spread across hosts, and they can have different resource requirements and business priorities. Without workload management, users in these shared environments are all but guaranteed to “trip over each other,” causing conflict, confusion, and reduced productivity.
As an example, the resource request for a single distributed training job might read as follows:
"A distributed, containerized TensorFlow model needs two parameter servers and ten workers. Each parameter server needs a single CPU and 8GB of memory, and each worker requires a P100 GPU with at least 48GB and 5GB of host memory. Workers must be scheduled to processor cores on each host such that CPU-GPU pairs share memory and a direct bus connection. Workers should be concentrated on as few hosts as possible, and if the workers need to be distributed across hosts, the hosts should reside in the same rack and on the same switch to minimize network latency."
Now imagine dozens of jobs with similar constraints submitted by different groups. With many users and workloads, hardcoding hostnames and GPU device names is a recipe for disaster. This is where GPU-aware workload management and Altair Grid Engine come in.
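To make the contrast with hardcoded hostnames concrete, the example request quoted above can be written down as structured data that names resources rather than machines. The following is only an illustrative Python sketch: the `TaskGroup` type and `totals` helper are hypothetical names invented here, not part of Altair Grid Engine, and the GPU-memory figure from the request is left out because the sketch only tracks host-side resources.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskGroup:
    """One homogeneous group of tasks in a distributed job (hypothetical type)."""
    name: str
    count: int                        # number of tasks in this group
    host_memory_gb: int               # host RAM needed per task
    cpus: int = 0                     # CPU cores per task; 0 if the request does not say
    gpu_model: Optional[str] = None   # one GPU of this model per task, if set


def totals(groups):
    """Add up what a scheduler would have to find for the whole job."""
    return {
        "tasks": sum(g.count for g in groups),
        "cpus": sum(g.count * g.cpus for g in groups),
        "gpus": sum(g.count for g in groups if g.gpu_model),
        "host_memory_gb": sum(g.count * g.host_memory_gb for g in groups),
    }


# The request from the text: two parameter servers and ten GPU workers.
request = [
    TaskGroup("parameter_server", count=2, host_memory_gb=8, cpus=1),
    TaskGroup("worker", count=10, host_memory_gb=5, gpu_model="P100"),
]

summary = totals(request)
print(summary)
```

Nothing in this description mentions a host or a device ID, which is the point: where the twelve tasks actually land is left to the workload manager.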
Optimized management of deep learning workloads for DGX clusters
Altair software manages workloads across NVIDIA GPUs on some of the world’s largest AI supercomputers, including the ABCI supercomputer in Japan. Based on practical experience managing containerized deep learning environments at scale, Altair has captured best practices and made these capabilities easily available to DGX users.
Whether you are deploying a single NVIDIA DGX server, or a cluster comprised of multiple servers, Altair Grid Engine brings important capabilities to the DGX environment. Among these capabilities are:
- Support for control groups (cgroups), ensuring isolation between deep learning workloads and preventing conflicts at runtime.
- NVIDIA Data Center GPU Manager (DCGM) support – A built-in integration with NVIDIA’s DCGM enables Altair Grid Engine to monitor GPU health and utilization in real time and to place workloads optimally for maximum performance and reliability.
- Advanced Docker/NVIDIA Docker support – Grid Engine provides sophisticated support for Docker and NVIDIA Docker, transparently managing parallel containerized workloads from NGC and other repositories as if they were native jobs.
- Sophisticated topology-aware scheduling and CPU-GPU affinity – Altair Grid Engine automatically binds deep learning jobs to CPU cores with affinity to the corresponding GPUs, taking bus and switch topologies and NUMA memory characteristics into account for optimal efficiency.
- Job classes – Users can’t be expected to understand the complexities associated with optimally placing GPU-aware workloads. With job classes, users specify a class of job (e.g., PyTorch), and Altair Grid Engine transparently applies scheduling policies behind the scenes to optimize placement for each class of workload.
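As a rough illustration of what topology-aware placement involves, the sketch below packs a job's GPU workers onto as few hosts as possible while keeping them inside one rack, the preference stated in the example request earlier. It is a toy greedy heuristic written for this article, not Altair Grid Engine's scheduling algorithm; the function name `place_workers`, the host names, and the rack labels are all hypothetical.

```python
from collections import defaultdict


def place_workers(free_gpus, rack_of, workers):
    """Return {host: worker_count}, or None if no single rack can hold the job.

    free_gpus: {host: number of idle GPUs on that host}
    rack_of:   {host: rack label}
    workers:   number of single-GPU workers to place
    """
    by_rack = defaultdict(list)
    for host, free in free_gpus.items():
        if free > 0:
            by_rack[rack_of[host]].append((free, host))

    best = None
    for rack in sorted(by_rack):
        # Use the hosts with the most idle GPUs first, so the job
        # touches as few hosts as possible within this rack.
        hosts = sorted(by_rack[rack], key=lambda fh: (-fh[0], fh[1]))
        remaining, plan = workers, {}
        for free, host in hosts:
            if remaining == 0:
                break
            take = min(free, remaining)
            plan[host] = take
            remaining -= take
        # Keep the rack whose plan spans the fewest hosts.
        if remaining == 0 and (best is None or len(plan) < len(best)):
            best = plan
    return best


# Hypothetical cluster state: idle GPUs per host and each host's rack.
free_gpus = {"dgx01": 8, "dgx02": 4, "dgx03": 8, "dgx04": 1, "dgx05": 2}
rack_of = {"dgx01": "A", "dgx02": "A", "dgx03": "B", "dgx04": "B", "dgx05": "B"}

plan = place_workers(free_gpus, rack_of, workers=10)
print(plan)
```

A production scheduler has far more to weigh, including CPU core binding, NUMA locality, GPU health, and competing job priorities, which is why this logic belongs in the workload manager rather than in each user's submission script.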
For NVIDIA DGX customers, Altair Grid Engine provides the following benefits:
- Simplify the submission and management of deep-learning workloads
- Boost performance by placing distributed jobs optimally
- Align workloads to business priorities
- Improve productivity and use resources more efficiently by reducing wait times and allowing more training jobs to run simultaneously without conflict
You can learn more about how Altair Grid Engine supports GPU workloads by reading Managing GPU workloads with Altair Grid Engine.