Ten Cool Things You May Not Have Known About Altair Grid Engine
Used by thousands of organizations around the world, Altair Grid Engine is one of the world’s most popular and capable grid management systems, with a vibrant, growing user community. When application providers describe their support for Altair Grid Engine, they often lump the different Grid Engine variants into a single bucket. The basic features and interfaces are largely the same across variants, which helps ensure compatibility.
I’m frequently asked about the differences between Altair Grid Engine and the various open-source distributions. Since Univa (now part of Altair) acquired the rights to Grid Engine from Oracle in 2013, the advances in Altair Grid Engine have been dramatic. The product continues to be enhanced by essentially the same development team that worked on Grid Engine at Sun Microsystems, and Univa and Altair have been making steady improvements for almost as long as Grid Engine’s tenure at Sun.
It’s hard to keep such a list short, but here are ten interesting things you may not know about Altair Grid Engine.
- Advanced container support – Containers have taken the IT world by storm. In Altair Grid Engine we’ve made container integration seamless – not just for Docker, but for Singularity as well. Altair Grid Engine users can manage and control containerized applications (including containerized MPI parallel jobs) just like any other application with full support for reporting and accounting. With integrated container support, Altair Grid Engine users avoid the complexities and security challenges previously associated with running containerized applications on Altair Grid Engine.
- Scalability and throughput – HPC & AI workloads continue to get larger and more complex. At Altair, we’ve been laser-focused on performance. In a benchmark study, we demonstrated up to a 9x performance improvement over open-source Grid Engine. These scalability improvements are more than just an academic result. Western Digital recently announced a million-core Altair Grid Engine deployment on AWS where they ran a workload with 2.5 million simulation tasks in 8 hours. This would have taken 480 hours (20 days) to complete on their on-premises cluster. The lesson? Don’t even think about running workloads at this scale on an open-source scheduler. In the cloud, where time is money, scheduling performance is critical.
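The arithmetic behind those figures is worth spelling out. Restating only the numbers quoted above, the cloud run represents a 60x speedup and a sustained dispatch rate of roughly 87 tasks per second:

```python
# Back-of-the-envelope check of the Western Digital figures quoted above.
tasks = 2_500_000        # simulation tasks in the workload
cloud_hours = 8          # wall-clock time on the million-core AWS cluster
on_prem_hours = 480      # estimated time on the on-premises cluster (20 days)

speedup = on_prem_hours / cloud_hours
tasks_per_second = tasks / (cloud_hours * 3600)

print(f"speedup: {speedup:.0f}x")                 # 60x
print(f"throughput: {tasks_per_second:.0f} tasks/sec")  # ~87 tasks/sec
```

At roughly 87 task completions per second, even small per-task scheduling overheads multiply quickly, which is why scheduler throughput dominates at this scale.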
- GPU support – Along with containers, another trend in modern software is the increased use of general-purpose GPUs, particularly for AI and Deep Learning. Training a deep neural network can take days or weeks, even on large GPU clusters. Components such as parameter servers and workers need to be placed optimally, considering details such as GPU and processor architecture, core affinity, and bus topologies including NVLink interconnects. Altair Grid Engine provides multiple enhancements for GPU and AI workloads, such as Resource Maps (RSMAP), cgroup support (to precisely control access to GPUs), and tight integration with NVIDIA Data Center GPU Manager (DCGM) so that the scheduler has full visibility into GPU-related metrics and can make scheduling decisions in real time.
- Robustness and diagnosability – Substantial work has gone into making Altair Grid Engine clusters more reliable, easier to manage, and easier to support and troubleshoot. Improved job pending reasons are one of dozens of enhancements that have made Altair Grid Engine easier to use and support. Why a job is sitting in a queue can be surprisingly hard to diagnose. In earlier versions of Grid Engine, users would run the “qalter” command to change various parameters of a pending job until they found the reason it was pending. Altair Grid Engine now exposes additional information when users query the state of a pending job, so it is faster and easier to determine why a job is not being dispatched, making users and administrators more efficient.
- Cloud integration and bursting – Increasingly, customers want to deploy Altair Grid Engine clusters in the cloud or operate hybrid environments where some jobs run on premises while others execute in the cloud. Expanding a cluster to tap cloud resources automatically when local resources are busy is known as “cloud bursting.” Users of open-source Grid Engine typically need to devise their own cloud bursting solutions using cloud-specific APIs and tools, but in Altair Grid Engine, we’ve made this seamless.
- Job Classes – For administrators supporting many users and applications, managing complex job submission options for each application can be a major challenge. For example, submitting a distributed workload involving GPUs and expressing optimal topology- and affinity-related constraints via the command line or scripted parameters is non-trivial. A better solution is to use Job Classes, introduced in Altair Grid Engine 8.1. With job classes, administrators can define a reasonable set of default submission options reflecting best practices for each class of job, vastly simplifying job submission and configuration management. Rather than needing to understand the details of how to run a GPU-aware deep learning workload, for example, users can simply invoke a predefined job class (qsub -jc tensorflow [input file]), and the appropriate job submission directives will be applied for each class of workload.
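As an illustration, a job class for a deep learning workload might bundle the GPU, resource, and parallel environment settings such jobs need. The sketch below uses illustrative attribute names and values, not verbatim Altair Grid Engine syntax; consult the job class documentation for the exact format:

```
# Hypothetical job class definition (attribute names and values are
# illustrative only; see the Altair Grid Engine docs for real syntax)
jcname     tensorflow
owner      NONE
l_hard     gpu=2,mem_free=32G    # default resource requests for this class
pe_name    mpi                   # parallel environment for distributed training
```

A user then submits with qsub -jc tensorflow, and these defaults are applied automatically, so the submission command stays short even as the underlying requirements evolve.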
- Enterprise-level features – Numerous enhancements have been made to Altair Grid Engine to make it more enterprise-friendly. Examples include Windows support, support for enterprise authentication frameworks, reliability and diagnosability enhancements, and more accessible REST APIs that simplify integration with enterprise applications. Altair Grid Engine also provides comprehensive cluster and workload monitoring and reporting to help organizations understand exactly how resources are being used to aid in accounting, chargeback, and capacity planning.
- Scheduling policies – Altair Grid Engine provides many improvements in scheduling policies, and it does so without breaking compatibility with existing scripts and applications. There are too many scheduling improvements in Altair Grid Engine to enumerate fully, but as some examples:
- Altair Grid Engine supports fair-share based allocation across any consumable resource – not just CPU, memory, and I/O as is the case with open-source Grid Engine.
- Resource affinity scheduling enables policies where jobs “flock” toward best-suited nodes (because those nodes hold needed data, for example) or, conversely, anti-affinity policies that distribute jobs as widely as possible.
- Altair Grid Engine supports advance reservations as well as new standing reservation functionality in Altair Grid Engine 8.5.4. Standing reservations allow users to reserve resources for jobs based on recurrence patterns, reducing the effort required to manage regularly recurring jobs.
- There have also been major improvements in parallel job support, including support for different resource requirements for master and slave tasks and for parallel containerized jobs.
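The advance reservations mentioned above are created with the qrsub command. A minimal sketch follows; the times, resources, and reservation name are illustrative, and standing-reservation recurrence options are version-specific, so check the qrsub documentation for your release:

```
# Reserve 64 slots in the "mpi" parallel environment for one hour,
# starting at 09:00 on December 1 (date format [[CC]YY]MMDDhhmm)
qrsub -a 12010900 -d 01:00:00 -pe mpi 64 -N nightly_regression

qrstat          # list existing advance reservations
qrdel <ar_id>   # delete a reservation by id
```

Jobs submitted against the reservation are guaranteed the reserved resources during that window, which is what makes regularly recurring production workloads predictable.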
- Core binding, affinity scheduling, NUMA support – Most Altair Grid Engine administrators will be familiar with core binding. This functionality was first introduced in SGE 6.2u5, which exposed the sockets, cores, and threads supported by a host. Controlling the geometry of how jobs are placed on sockets, cores, and threads, rather than leaving placement to the whim of the operating system, can have dramatic impacts on performance. For example, tasks can be scheduled to maximize cache effectiveness, CUDA programs can be placed on cores close to the GPUs they communicate with, and multi-threaded tasks can be spread across multiple cores to maximize throughput. Altair Grid Engine provides ten new core binding strategies for ease of configuration, such as “pack_sockets,” “one_socket_per_task,” and “balance_sockets.”
- Modern SMP systems are not only multi-core; they tend to have non-uniform memory architectures (NUMA) as well. In a NUMA system each processor has local memory, and accessing memory attached to another socket is more expensive. Altair Grid Engine provides users with flexible policies for scheduling jobs that take this non-uniformity into account for better performance and job isolation.
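The difference between strategies like pack_sockets and balance_sockets is easiest to see with a toy model. The Python below is purely illustrative (it is not Altair Grid Engine code): packing fills one socket’s cores before touching the next, which favors cache sharing among related tasks, while balancing round-robins tasks across sockets, which spreads memory traffic across NUMA domains.

```python
# Toy illustration of two core-binding strategies on a host with
# 2 sockets x 4 cores, placing single-core tasks as (socket, core) pairs.
# This is not Altair Grid Engine code, just a sketch of the idea.

def pack_sockets(n_tasks, cores_per_socket=4):
    """Fill socket 0 completely before moving on to socket 1."""
    return [(t // cores_per_socket, t % cores_per_socket) for t in range(n_tasks)]

def balance_sockets(n_tasks, sockets=2):
    """Round-robin tasks across sockets to spread memory traffic."""
    return [(t % sockets, t // sockets) for t in range(n_tasks)]

print(pack_sockets(4))     # [(0, 0), (0, 1), (0, 2), (0, 3)] - all on socket 0
print(balance_sockets(4))  # [(0, 0), (1, 0), (0, 1), (1, 1)] - spread evenly
```

Which strategy wins depends on the workload: cache-sharing tasks benefit from packing, while memory-bandwidth-bound tasks usually benefit from balancing.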
- Advanced APIs – UGE Configuration API, DRMAAv2 – Among developers, Python has emerged as the world’s most popular programming language[1]. To provide more flexibility to Altair Grid Engine users and administrators, Altair Grid Engine 8.3.1 and later support a comprehensive Python-based SDK (PyCL) that communicates with the qmaster via the qconf command to retrieve information from Altair Grid Engine clusters and configure them programmatically. This API supports the notion of “configuration as code,” helping customers realize a software-defined HPC infrastructure. Documentation for the UGE Config Library is available on GitHub. Altair Grid Engine also supports an updated Distributed Resource Management Application API (DRMAA v2), a high-level API standardized by the Open Grid Forum with multiple language bindings for submission and control of jobs in heterogeneous grid computing environments.
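The point of “configuration as code” is that cluster state becomes data you can read, diff, and keep under version control. As a loose illustration only, the snippet below parses a small embedded sample of qconf-style “attribute value” output into a dictionary; the sample queue attributes are made up for the example, and the real PyCL SDK wraps this kind of round trip for you:

```python
# Illustrative only: turn qconf-style "attribute  value" output into a dict.
# The embedded sample mimics what a queue-configuration query might print;
# it is not actual cluster output.
sample = """\
qname        all.q
hostlist     @allhosts
slots        16
pe_list      make mpi
"""

def parse_qconf(text):
    """Parse 'attribute  value' lines into a configuration dictionary."""
    config = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        config[key] = value.strip()
    return config

conf = parse_qconf(sample)
print(conf["slots"])    # 16
```

Once configuration is plain data like this, it can be validated, templated, and applied programmatically, which is exactly the workflow a configuration API enables.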
[1] Popularity of programming languages June 2019 - http://pypl.github.io/PYPL.html