To Infinity and Beyond – Expanded support in Altair Grid Engine for large-scale, high-throughput cloud HPC
A big motivator for deploying HPC workloads to the cloud is the ability to scale out on demand, improving application throughput and performance. This is especially true in life sciences and CAE, where workloads tend to be cloud-friendly and customers have an insatiable appetite for performance.
In late 2018, working with AWS and our team at Univa (now Altair), Western Digital demonstrated HPC in the cloud at extreme scale. In one of the largest commercial deployments to date, Western Digital ran a million+ core Altair Grid Engine cluster on AWS, collapsing the runtime for a large multiphysics simulation from 20 days to just eight hours – a staggering 60x improvement! While deployments at this scale are still relatively rare, Altair is seeing demand for ever-larger cloud workloads, and clusters with thousands or even tens of thousands of vCPUs are becoming more common. The lessons learned from these large-scale deployments are being baked into Altair Grid Engine, giving rise to important new product features. In this article, we discuss some recent scalability enhancements in versions up to Altair Grid Engine 8.6.17.
Operating at scale brings unique challenges
Deploying and managing clusters at scale poses unique challenges. Large-scale clusters typically leverage spot or spot fleet instances to operate economically, which means cluster nodes are continually reclaimed while workloads are running, requiring that the scheduler continually restart preempted jobs, as sketched below. Similarly, users cannot afford to wait until clusters reach full scale to submit workloads, so the workload manager needs to tolerate clusters that are rapidly adding large numbers of instances while jobs are being submitted.
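As a minimal sketch of how jobs can tolerate preemption, the standard qsub -r option marks a job as rerunnable so the scheduler returns it to the pending list and restarts it if its execution host disappears, as happens when a spot instance is reclaimed. The job name and script below are placeholders.

```bash
# Mark the job rerunnable (-r y) so Altair Grid Engine reschedules it if its
# execution host is reclaimed mid-run (e.g., a spot interruption).
# "spot_tolerant_job" and sim_run.sh are placeholder names.
qsub -r y -N spot_tolerant_job sim_run.sh
```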
Some scalability issues are best addressed with common-sense best practices. For example, using a scalable object store such as Amazon S3 for data persistence is much more efficient than relying on NFS services. Similarly, when running containerized workloads, a good practice is to bake containers directly into cloud machine images to avoid overwhelming a container registry with thousands of simultaneous pulls. Other issues require customizations to the environment around Altair Grid Engine – for example, using a distributed cache to avoid overwhelming the cloud provider with extreme volumes of cloud API requests and DNS lookups. A minimal sketch of the S3 staging pattern follows.
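As a hedged illustration of the object-store practice, the fragment below stages input from and results to Amazon S3 with the standard AWS CLI instead of reading and writing over NFS from every node. The bucket name, object keys, and local paths are placeholders.

```bash
#!/bin/bash
# Placeholder job script: stage data through Amazon S3 rather than NFS.
# Bucket and object names below are illustrative only.

# Pull input data from the object store to fast local storage
aws s3 cp s3://example-hpc-bucket/inputs/case01.tar.gz /tmp/case01.tar.gz
tar -xzf /tmp/case01.tar.gz -C /tmp

# ... run the simulation against /tmp/case01 ...

# Push results back to the object store when the job completes
aws s3 cp /tmp/case01/results.tar.gz s3://example-hpc-bucket/results/case01.tar.gz
```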
New scalability and throughput enhancements in Altair Grid Engine
In addition to the challenges above, other more subtle bottlenecks routinely surface in large-scale deployments. Some recent Altair Grid Engine enhancements aimed at removing these bottlenecks are described below:
- Optimizing name service lookups at scale – Aside from DNS, clusters also use services such as NIS, LDAP, or Active Directory to resolve user names and group names to their corresponding OS-level IDs. Resolving supplementary groups (a feature that lets OS-level users belong to groups beyond their primary group) is particularly expensive because the same user ID can be associated with multiple group entries. To help avoid this performance bottleneck, Altair Grid Engine avoids resolving supplementary group IDs for client applications that do not need the information. Administrators can also optionally suppress supplementary group lookups entirely, or disable forwarding of supplementary group IDs within a range, when they know this information is not needed for their workloads. (Caching these lookups at the OS level also helps, as sketched after this list.)
- Disabling unnecessary runtime checks – At large scale, basic validation checks at runtime can become a luxury that administrators cannot afford. For example, when a job is submitted, Altair Grid Engine validates that the requested queue instances exist across cluster hosts and ensures that users have permission to access them. In situations where queues are known to be correct, Altair Grid Engine now allows this runtime checking to be disabled, further increasing scheduling throughput and performance (see the submission example after this list).
- Faster scheduling of parallel workloads – Scheduling parallel workloads is an expensive operation because Altair Grid Engine searches for the optimal resource assignment that best satisfies all request criteria. For example, the scheduler will try to grant as many slots as possible when a slot range is specified, and it will seek to satisfy as many soft (optional) resource requests as possible. Altair Grid Engine will also look for the earliest possible time window in which to run a job. At cloud scale, it is important that “the perfect not be the enemy of the good” – throughput is often more important than optimizing every workload placement. New scheduling parameters in Altair Grid Engine allow these optimizations to be selectively relaxed for dramatically faster scheduling of parallel workloads (see the final sketch after this list).
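On the name service side, a common complement to the Grid Engine options above is to cache user and group lookups locally on each node, for example with nscd, so repeated supplementary group resolution does not hammer NIS, LDAP, or Active Directory. The /etc/nscd.conf fragment below is a sketch; the time-to-live values are illustrative, not recommendations.

```
# /etc/nscd.conf (fragment) – cache group lookups on the local node so
# repeated supplementary group resolution stays off the directory servers.
# TTL values are illustrative only.
enable-cache            group   yes
positive-time-to-live   group   3600
negative-time-to-live   group   60
shared                  group   yes
```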
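On the submission side, the long-standing qsub -w option already lets users lower or switch off submission-time validation; the newer server-side switches are set by the administrator in the cluster configuration, and the exact parameter names are documented in the release notes. The command below is an illustrative sketch; all.q and the script name are placeholders.

```bash
# Disable submission-time validation (-w n) when the requested queue and
# resources are known to be correct. "all.q" and batch_job.sh are placeholders.
qsub -w n -q all.q batch_job.sh
```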
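Finally, to make the scheduling cost concrete, the first submission below is the kind of request that triggers the expensive search: a slot range the scheduler tries to maximize plus a soft resource request it tries to honor. Requesting a fixed slot count with hard requests only (or relaxing the behavior through the new scheduler parameters, whose names are release-specific) narrows the search space considerably. The parallel environment name, resource, and script are placeholders.

```bash
# Expensive to schedule: the scheduler tries to maximize slots across the
# 16-128 range AND to satisfy the soft (optional) bigmem request.
qsub -pe mpi 16-128 -soft -l bigmem=true solver.sh

# Cheaper to schedule: a fixed slot count and hard requests only, trading
# placement optimality for throughput.
qsub -pe mpi 64 -l bigmem=true solver.sh
```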
These enhancements complement various other improvements in releases up to Altair Grid Engine 8.6.17 aimed at maximizing performance, reliability, and integrity in large-scale environments with high job volumes.