Keeping Pace with New HPC Workloads
Altair Grid Engine and the changing face of HPC
Analytics and AI have emerged as competitive weapons across every industry. Gartner’s 2019 CIO survey found that 37% of organizations had implemented AI in the data center as of 2019, a 270% increase over 2015, with predictive analytics being the fastest-growing application segment for AI.[i] MMC Ventures reports that by 2021 two-thirds of companies will be live with new AI initiatives.[ii]
The infrastructure requirements for these new workloads are strikingly similar to those of HPC. Whether on-premises or in the cloud, both require clustered hosts with many-core CPUs, fast interconnects, and high-performance storage subsystems. Like their HPC colleagues, data scientists worry about time-to-market and about having sufficient compute and data handling capacity to build high-quality predictive models. Both HPC and data science applications are also increasingly containerized, distributed, and reliant on GPUs. Given these similarities, and the opportunity to improve efficiency when peak usage periods offset one another, it makes sense to consolidate HPC, analytics, and AI environments.
In this article, I’ll discuss how parallel environments for analytic and AI workloads are evolving and cover different strategies for supporting them. I’ll also explain why Altair Grid Engine is an excellent choice for users who need to run a mix of traditional HPC and data science workloads.
A sea change in analytic applications
While HPC is evolving quickly, analytic and data science tools are evolving even faster. Leading commercial tools have steadily ceded market share to open-source software. Tools and frameworks such as Python, R, and TensorFlow have seen explosive growth. According to a 2019 survey, almost two-thirds of data scientists use Python-based frameworks, and nearly 50% use R-language tools.[iii]
Diverse parallel execution environments
To complicate things further, the number of tools isn’t consolidating anytime soon. There are hundreds of libraries and tools in the Python and R ecosystems alone. These include multiple parallel execution solutions such as Dask, Horovod, IPython Parallel (ipyparallel), and others. While some parallel computing solutions are more popular than others, Python users can choose from dozens of potential solutions.
Some view distributed computing as critical only for compute-intensive applications such as model training, but parallel execution is essential at all stages of machine learning pipelines. Examples include data ingest, streaming, data processing/ETL, reinforcement learning, model validation, and model serving. Adding to this diversity, data scientists seldom use a single tool or framework. In a recent survey, the average data scientist reported using as many as seven different analytic tools to do their job.[iv]
How to accommodate these modern workloads?
Unfortunately, there is no single solution that supports all these workloads. Some tools integrate with Kubernetes, but Kubernetes is poorly suited to running many HPC workloads. It lacks features that HPC users take for granted, such as high-throughput array jobs, interactive jobs, and parallel jobs, along with a variety of other capabilities. Kubernetes also requires that workloads be containerized and adapted to it, but most environments run a mix of applications, including virtualized and non-containerized workloads.
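To make the array job idea concrete: a workload manager such as Grid Engine runs the same script many times and tells each copy which task it is through the SGE_TASK_ID environment variable. The following is a minimal Python sketch, assuming a job submitted with something like `qsub -t 1-3 task.sh`. The helper names `current_task_id` and `task_slice` are hypothetical and only illustrate how each task might select its share of the work.

```python
import os


def current_task_id(env):
    """Return the 1-based array task ID, or 1 when not running as an array task.

    Grid Engine sets SGE_TASK_ID to a number for array tasks; it may be
    missing or non-numeric (e.g. "undefined") otherwise.
    """
    value = env.get("SGE_TASK_ID", "")
    return int(value) if value.isdigit() else 1


def task_slice(n_items, task_id, n_tasks):
    """Return the [start, stop) index range handled by the 1-based task_id.

    Items are split as evenly as possible; the first (n_items % n_tasks)
    tasks each take one extra item.
    """
    base, extra = divmod(n_items, n_tasks)
    start = (task_id - 1) * base + min(task_id - 1, extra)
    stop = start + base + (1 if task_id <= extra else 0)
    return start, stop


if __name__ == "__main__":
    # Hypothetical workload: 10 input files shared by 3 array tasks.
    inputs = [f"input_{i}.csv" for i in range(10)]
    tid = current_task_id(os.environ)
    start, stop = task_slice(len(inputs), tid, n_tasks=3)
    print(f"task {tid} processes {inputs[start:stop]}")
```

Because every task computes its own slice from its task ID, no coordination between tasks is needed, which is what lets the scheduler dispatch them at high throughput.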
Some open-source tools provide their own built-in scheduling and distributed computing facilities with features such as load-balancing, affinity, anti-affinity, and packing. While this approach may work for some environments, users should be cautious about tool-specific solutions. This path can lead to expensive siloed infrastructure, cost inefficiencies, and the need to support multiple distributed resource managers.
Projects such as Horovod and Dask provide a more flexible and portable approach to parallelization. While they present their own APIs, they also plug in to popular HPC workload managers, such as Altair Grid Engine, enabling seamless resource sharing between HPC and machine learning workloads. Horovod enables parallel execution of most deep learning frameworks by leveraging MPI, an environment already familiar to most HPC users. Similarly, while Dask provides a native Dask.distributed library in Python, it also interfaces with popular HPC workload managers through its dask-jobqueue and dask-drmaa modules.
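As a rough illustration of how a Horovod job can borrow its host list from the workload manager, the sketch below parses the hostfile that Grid Engine writes for a parallel job (its path is exposed in the PE_HOSTFILE environment variable) and turns it into a `horovodrun` command line. The function names, the sample hostnames, and the `train.py` script are assumptions made for this example, not part of any published integration.

```python
import shlex


def parse_pe_hostfile(text):
    """Parse Grid Engine PE hostfile text into (hostname, slots) pairs.

    Each line has the form: hostname slots queue processor_range
    """
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hosts.append((fields[0], int(fields[1])))
    return hosts


def horovodrun_command(hosts, script="train.py"):
    """Build a horovodrun argument list that uses every granted slot."""
    total_slots = sum(slots for _, slots in hosts)
    host_arg = ",".join(f"{host}:{slots}" for host, slots in hosts)
    return ["horovodrun", "-np", str(total_slots), "-H", host_arg,
            "python", script]


if __name__ == "__main__":
    # Sample hostfile content; inside a real job this would be read
    # from the file named by os.environ["PE_HOSTFILE"].
    sample = (
        "node01 4 all.q@node01 UNDEFINED\n"
        "node02 4 all.q@node02 UNDEFINED\n"
    )
    print(shlex.join(horovodrun_command(parse_pe_hostfile(sample))))
```

The point of the design is that the training script never hard-codes hosts: the scheduler decides where the job runs, and the launcher simply follows that allocation.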
The case for Altair Grid Engine
For HPC environments running analytic and AI workloads, Altair Grid Engine is an excellent choice. It is flexible and configurable, and it has rich support for parallel execution environments. Users can generally adapt distributed applications to Altair Grid Engine with simple script-based integrations.
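A script-based integration often amounts to little more than a shell script with embedded `#$` scheduler directives. The sketch below generates such a script from Python. The function name `render_job_script`, the job name, the parallel environment name `mpi`, and the `mpirun` command line are assumptions for illustration; parallel environment names in particular are site-specific, and the command presumes an Open MPI build with Grid Engine support.

```python
def render_job_script(name, pe_name, slots, command):
    """Return the text of a Grid Engine job script.

    Lines beginning with "#$" are read by qsub as submission options:
      -N   job name
      -cwd run the job from the submission directory
      -j y merge stderr into stdout
      -pe  request a parallel environment and a number of slots
    """
    lines = [
        "#!/bin/bash",
        f"#$ -N {name}",
        "#$ -cwd",
        "#$ -j y",
        f"#$ -pe {pe_name} {slots}",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # NSLOTS is set by Grid Engine to the number of slots granted to the job.
    script = render_job_script(
        name="hvd-train",
        pe_name="mpi",  # hypothetical PE name; check your site's configuration
        slots=8,
        command="mpirun -np $NSLOTS python train.py",
    )
    print(script, end="")
```

The resulting text could be saved to a file and submitted with `qsub`, leaving the application itself unchanged.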
Given its open-source heritage, Altair Grid Engine supports a wide variety of analytic and machine learning workloads already. It also offers rich container management, GPU scheduling, and hybrid cloud features important for modern applications. Some of the more popular analytic and distributed machine learning frameworks integrated with Altair Grid Engine are listed below.
Applications | Description & Reference URLs |
TensorFlow | Distributed TensorFlow integration published by Altair, or use Horovod with Open MPI |
Keras | Run Distributed Keras with Horovod and launch using Open MPI with Altair Grid Engine |
PyTorch | Run Distributed PyTorch with Horovod and launch using Open MPI with Altair Grid Engine |
Dask | Dask-jobqueue deploys Dask on job queueing systems including Altair Grid Engine |
Jupyter | Users can launch Jupyter / IPython Notebooks under Altair Grid Engine control or can use remote_ikernel to launch Jupyter kernels on Altair Grid Engine hosts from within a Jupyter Notebook |
IPython Parallel | IPython Parallel provides a PBS mode that starts the ipcluster controller and engines under control of PBS-style grid managers, including Altair Grid Engine |
Anaconda | Anaconda is a complete data science distribution including popular Python and R-based tools and libraries that work in Altair Grid Engine environments including NumPy, SciPy, Dask and others |
Python | Python jobs can be launched under control of Altair Grid Engine, and Python supports multiple parallel frameworks with Altair Grid Engine integrations |
R CRAN | The open-source R CRAN task view provides multiple distributed and parallel computing solutions that work with Altair Grid Engine including Rmpi, qsub, BatchJobs, flowr, clustermq and others |
Apache Spark | While Spark is more at home in a Hadoop environment using the YARN resource manager, Spark standalone scripts can be used to launch individual Spark clusters as user jobs |
MATLAB | MathWorks provides a commercially supported integration for Altair Grid Engine via the Parallel Computing Toolbox plugin for MATLAB Parallel Server with Grid Engine |
[i] Gartner Research – 2019 CIO Survey: CIOs Have Awoken to the Importance of AI - https://www.gartner.com/en/documents/3897266/2019-cio-survey-cios-have-awoken-to-the-importance-of-ai
[ii] MMC Ventures – The State of AI – Divergence - https://mmc.vc/reports/state-of-ai-2019-divergence
[iii] KDNuggets 2019 Annual Data Science Survey - https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
[iv] KDNuggets Python eats away at R: Top Software for Analytics, Data Science, Machine Learning - https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html