How Altair Enables Sustainable HPC
This article has been adapted from the "Sustainable Computing for HPC and AI" eGuide.
As we mentioned in our last article, artificial intelligence (AI) and high-performance computing (HPC) demand a lot of energy. As these technologies expand and mature, this energy demand is only set to grow – and fast. According to research from the International Energy Agency (IEA), global electricity consumption from data centers, AI, and the cryptocurrency sector is poised to double from an estimated 460 terawatt-hours (TWh) in 2022 to more than 1,000 TWh by 2026 – an amount roughly equal to Japan's total annual electricity consumption.
Moreover, training a large language model (LLM) such as GPT-4 requires approximately 25,000 GPUs. For GPT-4’s 100-day training period (and factoring in overhead), this adds up to roughly 60,000,000 GPU hours and consumes approximately 28,800,000 kWh. This translates into 6,912 tons of CO2 equivalent (tCO2e) – the equivalent of powering 1,300 homes for one year.
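The figures above can be checked directly. The snippet below reproduces the arithmetic from the article's own numbers; the per-GPU power draw it derives is implied by those totals, not a published hardware specification:

```python
# Reproduce the article's GPT-4 training energy estimate.
# All inputs are the figures quoted above; the average per-GPU
# draw is derived from them, not an official spec.

gpus = 25_000
days = 100

gpu_hours = gpus * days * 24        # 25,000 GPUs x 100 days x 24 h = 60,000,000
energy_kwh = 28_800_000             # total consumption from the article

# Implied average draw per GPU, including overhead
avg_kw_per_gpu = energy_kwh / gpu_hours   # 0.48 kW, i.e. ~480 W per GPU

print(f"{gpu_hours:,} GPU-hours, ~{avg_kw_per_gpu * 1000:.0f} W per GPU")
```

An implied ~480 W per GPU, overhead included, is consistent with datacenter-class accelerators, which lends the article's totals internal plausibility.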
In an era where the global community is already feeling the powerful effects of climate change, sustainability and energy efficiency have become paramount. In this article, we'll explore how organizations seeking to leverage HPC, AI, and simulation can make their computing more efficient and more sustainable thanks to Altair technology.
Reducing Your HPC and AI Energy Footprint
Though modern energy demands are steep, many straightforward techniques can help organizations reduce the cost and environmental impact of their AI and HPC workloads. These include:
- GPU-Aware Scheduling: Given the enormous energy requirements of GPUs and accelerated processing units (APUs), maximizing their utilization with GPU-aware scheduling is critical to containing costs and reducing total power, cooling, and associated emissions.
- Run Mixed Workloads: Consolidate AI, HPC, and cloud-native Kubernetes workloads onto a shared infrastructure with robust GPU and container support to improve resource usage and minimize energy consumption.
- Optimize Hardware Selection: For HPC and AI, scheduling on-premises or cloud resources optimized for particular workloads is essential for reducing power requirements. For example, using specialized tensor processing units (TPUs) can boost performance per watt by 2-5x compared to general-purpose GPUs.
- Employ Purpose-Built Models: Many predictive models are over-parameterized. For narrow tasks such as sentiment analysis or classification, purpose-built models can be far more energy-efficient than general-purpose LLMs, dramatically reducing training and inference costs.
- Cloud Bursting: With policy-based cloud bursting, organizations can reduce scope 2 emissions by taking advantage of the energy efficiency and lower power usage effectiveness (PUE) ratios of carbon-neutral cloud data centers.
- Energy-Aware Scheduling: Leverage schedulers that consider power requirements in scheduling decisions with power capping, per-job power profiles, and energy accounting to reduce total power consumption.
- Allocate Resources Based on Need: By prioritizing AI model training jobs and HPC simulations based on business needs, organizations can avoid redundant or unnecessary computations and reduce their overall costs and carbon footprint.
- Workload/Resource Monitoring and Reporting: Use workload management together with monitoring solutions that can report on resource usage and power consumption to optimize efficiency and reduce energy usage over time.
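The energy-aware scheduling idea above can be sketched in a few lines. This is a minimal illustration of the decision logic only; the node names, power figures, and job profile are invented for the example and are not drawn from any Altair product:

```python
# Toy energy-aware scheduler: place a job on the node that minimizes
# estimated energy, while honoring a per-job power cap.
# Node data and the job profile are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    power_watts: float    # measured draw under load
    speed_factor: float   # relative throughput (1.0 = baseline)

def pick_node(nodes, job_baseline_hours, power_cap_watts):
    """Return the eligible node with the lowest estimated job energy."""
    eligible = [n for n in nodes if n.power_watts <= power_cap_watts]
    if not eligible:
        return None
    # Energy (Wh) = power draw x estimated runtime on that node
    return min(eligible,
               key=lambda n: n.power_watts * job_baseline_hours / n.speed_factor)

nodes = [
    Node("cpu-01", power_watts=400, speed_factor=1.0),
    Node("gpu-01", power_watts=1200, speed_factor=4.0),  # faster but hotter
    Node("gpu-02", power_watts=900, speed_factor=2.5),
]

best = pick_node(nodes, job_baseline_hours=10, power_cap_watts=1500)
print(best.name)  # gpu-01: 1200 W x 10 h / 4.0 = 3,000 Wh, lowest of the three
```

Note the non-obvious outcome: the highest-power node wins because it finishes so much faster that its total energy is lowest, which is exactly why schedulers must weigh energy per job rather than instantaneous draw.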
GPU-Aware Scheduling Dramatically Improves Sustainability for AI Workloads
GPU- and topology-aware scheduling optimally places the components of distributed GPU workloads, taking into account resource requirements and the underlying CPU, GPU, and memory architectures to maximize performance, avoid contention, and free unused CPU cores for other workloads. As a result, expensive, power-hungry GPU and CPU cores are fully utilized, and the CPU portion of each workload is pinned to cores in close proximity to its GPUs, yielding lower latency, higher throughput, and higher throughput per watt.
In addition, for partitioned models that span multiple GPU-capable nodes, topology-aware scheduling considers underlying network architectures (where power considerations are also important), including intra-server GPU interconnects and InfiniBand or Ethernet links between servers, minimizing communication overhead and reducing the time and energy required to train a model.
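A simplified sketch of the placement logic described above follows. The topology and link bandwidths are illustrative assumptions (NVLink-class within a server, InfiniBand- or Ethernet-class between servers), not a real cluster map or an Altair implementation:

```python
# Toy topology-aware placement: for a 2-GPU job, prefer the GPU pair
# connected by the fastest link, so gradient exchange costs less time
# and energy. Link bandwidths (GB/s) are illustrative assumptions.

from itertools import combinations

# (gpu_a, gpu_b) -> interconnect bandwidth in GB/s
LINKS = {
    ("gpu0", "gpu1"): 600,  # NVLink-class, same server
    ("gpu0", "gpu2"): 25,   # Ethernet between servers
    ("gpu0", "gpu3"): 25,
    ("gpu1", "gpu2"): 25,
    ("gpu1", "gpu3"): 25,
    ("gpu2", "gpu3"): 200,  # InfiniBand-class, adjacent servers
}

def bandwidth(a, b):
    return LINKS.get((a, b)) or LINKS.get((b, a), 0)

def place_pair(free_gpus):
    """Pick the free GPU pair with the highest interconnect bandwidth."""
    return max(combinations(sorted(free_gpus), 2),
               key=lambda pair: bandwidth(*pair))

print(place_pair(["gpu0", "gpu1", "gpu2", "gpu3"]))  # ('gpu0', 'gpu1')
print(place_pair(["gpu1", "gpu2", "gpu3"]))          # ('gpu2', 'gpu3')
```

When the same-server pair is busy, the scheduler falls back to the best available inter-server link rather than an arbitrary one, which is the essence of minimizing communication overhead per the text above.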
Sharing expensive resources among HPC, model training, and inference workloads results in additional financial and energy savings. Mixing CPU- and GPU-intensive workloads improves utilization on large compute nodes, and time-critical inference jobs can be prioritized to minimize wait times.
With cloud bursting and energy-aware scheduling, workloads can automatically be directed to the most cost- and energy-efficient computing resources, on-premises or in the cloud.
How Altair Technology Fosters HPC Sustainability
Whether HPC and AI workloads run on-premises or in public, private, or hybrid clouds, organizations need to optimize productivity and maximize resource utilization. Altair’s HPC solutions can help organizations manage large-scale workloads effectively while minimizing energy usage and carbon emissions.
The Altair® HPCWorks® HPC and cloud platform provides a rich set of tools to access, control, and optimize computing resources. It enables users to move seamlessly between on-premises and cloud environments and make better decisions with detailed monitoring and reporting data. Organizations can take advantage of convenient Jupyter Notebook integration, GPU acceleration, and rapid scaling to enable the latest analytics and AI workloads with flexible scheduling and workflow design.
Workload managers, including Altair® PBS Professional® and Altair® Grid Engine®, provide extensive support for containerized GPU workloads and rich, topology-aware scheduling and cloud-bursting features to help organizations simplify administration and maximize infrastructure usage for a wide variety of HPC, AI, and analytics workloads.
Altair® Access™ offers a simple, powerful, consistent interface for submitting and monitoring jobs on remote clusters, enabling data scientists and analysts to focus on their work and access the most energy-efficient resources for their workload requirements.
For organizations that need an easy-to-use application for monitoring cluster configuration and reporting in HPC and AI environments, Altair® Control™ supplies a control center for managing, optimizing, and forecasting resources with advanced analytics to support data-driven decision-making.
Altair® NavOps® lets you define intelligent, business- and workload-aware scaling automations to maximize resource utilization. NavOps reduces costs and curbs common problems such as energy wasted by cycling machines up and down too often or by scaling the wrong machine types. NavOps works with Altair schedulers and cloud providers to dynamically scale on-demand cloud resources while providing detailed visibility and control over cloud spending.
Altair Mistral™ provides live system telemetry and I/O monitoring for data-intensive distributed model training workloads, quickly pinpointing compute- and storage-related bottlenecks to maximize on-premises and cloud resources in HPC and AI environments. Altair Breeze™ profiles application file I/O to optimize data handling and ensure data-hungry model training workloads run at peak efficiency and utilization.
Sustainability is a critical concern for organizations of all sizes across all industries. With the widespread adoption of energy-intensive AI applications in the enterprise, energy efficiency has become more important than ever. Operating an efficient, sustainable computing environment isn't just good for the planet – it's good for the bottom line. Organizations that operate more efficiently can significantly reduce data center operating, power, and cooling costs on-premises and curb spending in the cloud.
Overall, Altair offers a rich portfolio of software and services to help optimize all facets of the AI and machine learning operations pipeline, from data collection to data preparation to feature engineering to model training.
To learn more about HPC and cloud solutions that can help improve the efficiency and sustainability of HPC and AI environments, visit altair.com/altair-hpcworks.
To read the full eGuide, visit "Sustainable Computing for HPC and AI."