

Enabling Simulation-Driven Design with Altair® CFD® and NVIDIA® H100 Tensor Core GPUs

For manufacturers large and small, computational fluid dynamics (CFD) is a critical part of their workflows. Engineers use CFD techniques to solve diverse problems ranging from modeling the aerodynamic performance of vehicles, to reducing noise from HVAC systems, to simulating oil flow in a gearbox, and more.

In vehicle design, optimization is critical. Manufacturers compete not only on the aesthetics of a design but also on its efficiency. Aerodynamicists need to predict drag for various body designs versus baseline cases, modeling airflow over a surface to maximize fuel efficiency or extend the range of electric vehicles (EVs). They also need to conduct aero-acoustic simulations to minimize cabin noise. They must develop "styling that sells" while meeting tight project deadlines and satisfying targets and constraints related to other CAE disciplines such as crash, noise, and cooling.

 

The Need for Speed in CFD

The challenge with high-fidelity CFD simulations is that they're enormously compute-intensive. A single simulation can involve models with hundreds of millions of nodes or cells. In addition, simulations are typically conducted over hundreds of thousands – or even millions – of timesteps on clusters of powerful servers. With traditional CPU-based solvers, models are parallelized across cores, sockets, and nodes, relying on MPI and high-performance interconnects to share state among adjacent regions of a model. Simulating just a few seconds of real-world activity can take days. It's not uncommon for a single design to require thousands of CPU core-hours, and a single design effort can involve 10,000 to 1,000,000 simulations.1
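To put that scale in perspective, here is a back-of-the-envelope estimate; the per-run cost, node size, and campaign size below are illustrative assumptions, not measured figures:

```python
# Rough cost estimate for a CPU-based CFD design campaign.
# All inputs are illustrative assumptions, not measured values.

core_hours_per_run = 5_000     # assumed CPU core-hours for one high-fidelity run
cores_per_node = 64            # assumed cores per cluster node
nodes_per_run = 16             # assumed nodes allocated to each run
runs_in_campaign = 10_000      # low end of the 10,000-1,000,000 range cited above

wall_hours_per_run = core_hours_per_run / (cores_per_node * nodes_per_run)
total_core_hours = core_hours_per_run * runs_in_campaign

print(f"Wall-clock time per run: {wall_hours_per_run:.1f} hours")
print(f"Total campaign cost:     {total_core_hours:,} CPU core-hours")
```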

 

GPU-Native Simulators Come of Age

While GPU-accelerated CFD solvers have been available for years, practical considerations limited their adoption in the past. Despite offering a clear performance advantage, GPU-based algorithms didn't always have feature parity with CPU-based solvers, and models were often too large to fit in available GPU memory. Because of these limitations, GPUs were sometimes used only for smaller workloads.

This situation has changed dramatically with Altair's CFD Lattice Boltzmann Method (LBM) solver, ultraFluidX®, and NVIDIA's latest data center GPUs. LBM solvers lend themselves to efficient GPU-based implementations because they're highly parallel, computing density and velocity at every point in a cubic grid at each timestep. ultraFluidX employs a wall-modeled Large Eddy Simulation (LES) solver to simulate turbulent flows in and around complex geometries. The solver supports complex simulations involving rotating geometries,2 making it ideal for high-fidelity aerodynamic and aero-acoustic use cases.
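To give a feel for why the method maps so naturally onto GPUs, here is a minimal, textbook D2Q9 lattice Boltzmann sketch in NumPy. It is not ultraFluidX's implementation – ultraFluidX is a GPU-native, wall-modeled LES solver in 3D – and the grid size, relaxation time, and periodic setup are assumptions chosen purely for illustration:

```python
"""Minimal D2Q9 lattice Boltzmann (BGK) sketch.

Every cell performs the same local collide-and-stream update each timestep,
which is what makes the method so amenable to massive GPU parallelism.
"""
import numpy as np

NX, NY = 200, 80          # lattice dimensions (assumed)
TAU = 0.6                 # BGK relaxation time; viscosity = (TAU - 0.5) / 3 in lattice units
N_STEPS = 500

# D2Q9 lattice velocities and weights
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, ux, uy):
    """Second-order truncated Maxwell-Boltzmann equilibrium distribution."""
    cu = 3.0 * (c[:, 0, None, None] * ux + c[:, 1, None, None] * uy)
    usq = 1.5 * (ux**2 + uy**2)
    return rho * w[:, None, None] * (1.0 + cu + 0.5 * cu**2 - usq)

# Start from uniform density with a small rightward velocity
rho = np.ones((NX, NY))
ux = np.full((NX, NY), 0.05)
uy = np.zeros((NX, NY))
f = equilibrium(rho, ux, uy)

for step in range(N_STEPS):
    # Macroscopic moments: density and velocity at every lattice node
    rho = f.sum(axis=0)
    ux = (f * c[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * c[:, 1, None, None]).sum(axis=0) / rho

    # Collide: relax toward local equilibrium (purely local, hence GPU friendly)
    f += (equilibrium(rho, ux, uy) - f) / TAU

    # Stream: shift each distribution along its lattice velocity (periodic box)
    for i in range(9):
        f[i] = np.roll(np.roll(f[i], c[i, 0], axis=0), c[i, 1], axis=1)

print(f"Mean x-velocity after {N_STEPS} steps: {ux.mean():.4f}")
```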

 

NVIDIA Hopper™ for Aerodynamic Simulations

Announced in March 2022, the NVIDIA H100 Tensor Core GPU, based on the NVIDIA Hopper™ architecture, delivers breakthrough floating-point performance for demanding CFD workloads. With up to 18,432 FP32 CUDA cores in the full GH100 implementation, the GPU is available in three configurations:3

  • H100 PCIe – delivering 26/51 teraFLOPS of FP64/FP32 performance (PCIe 5.0 slots)4
  • H100 SXM – delivering 34/67 teraFLOPS of FP64/FP32 performance (NVIDIA NVLink™ interconnect)
  • H100 NVL – delivering 68/145 teraFLOPS of FP64/FP32 performance5

The H100 succeeds the NVIDIA A100, based on the NVIDIA Ampere architecture and launched in 2020. Depending on the hardware OEM, customers can run four or eight H100 GPUs per server. The H100 SXM and H100 PCIe cards each provide 80 gigabytes of high-bandwidth memory (HBM), so a single machine with up to eight H100 cards can run production-scale CFD workloads.
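As a rough sanity check on why 80 GB of HBM per card matters, the sketch below estimates the memory footprint of a large LBM model. The bytes-per-voxel figure is an assumption for illustration, not a published ultraFluidX number:

```python
# Back-of-the-envelope check: can a production-scale LBM model fit in GPU memory?
# The bytes-per-voxel figure is an assumption (two FP32 copies of a D3Q27
# distribution set plus macroscopic fields), not a published ultraFluidX number.

bytes_per_voxel = 2 * 27 * 4 + 4 * 4   # double-buffered distributions + rho, ux, uy, uz
voxels = 330e6                         # size of the DrivAer model discussed below
hbm_per_gpu_gb = 80                    # H100 SXM / H100 PCIe memory capacity

model_gb = voxels * bytes_per_voxel / 1e9
print(f"~{model_gb:.0f} GB of solver state, i.e. roughly "
      f"{model_gb / hbm_per_gpu_gb:.1f} GPUs' worth of HBM before any overhead")
```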

 

Benchmark Results

Altair's ultraFluidX® team recently put the latest NVIDIA H100 GPUs to the test, running several production-scale aerodynamic and aero-acoustic workloads on two NVIDIA H100 PCIe cards, each with 80 gigabytes of memory. Results were compared against the same tests run on a server populated with previous-generation NVIDIA A100 SXM cards.6

This was not a strictly "apples-to-apples" comparison because the SXM form factor uses NVIDIA NVLink™. Better results would have been expected had the NVIDIA H100 SXM version of the card been available for testing, but the results were impressive nonetheless.7

In their initial test, the Altair team ran a proprietary vehicle aerodynamics benchmark (Altair Roadster) to compare a single NVIDIA A100 SXM card against the new NVIDIA H100 PCIe card, running the same set of mesh sizes on each GPU. The NVIDIA H100 card consistently delivered about 1.4x higher throughput for each model size, as illustrated next.8


Results for each simulation are plotted in millions of node updates per second (MNUPS). When running the 143M voxel model, the NVIDIA H100 GPU delivered a staggering ~1 billion node updates per second.

Next, the same model was run with different mesh sizes on single and dual NVIDIA H100 PCIe configurations to assess how well the simulation scales across multiple GPUs. As shown above, each model's throughput in node updates per second nearly doubled when the second GPU was added, delivering strong scaling efficiencies between ~93% and ~97%.9
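The efficiency arithmetic is simple enough to reproduce directly from the throughput figures quoted in footnote 9:

```python
# Strong-scaling efficiency for the dual-GPU runs, reproduced from the
# single- and dual-GPU throughput figures quoted in footnote 9.

def strong_scaling_efficiency(one_gpu_mnups: float, two_gpu_mnups: float) -> float:
    """Fraction of the ideal 2x speedup achieved when a second GPU is added."""
    return two_gpu_mnups / (2 * one_gpu_mnups)

# Measured throughput in MNUPS (millions of node updates per second), H100 PCIe
runs = {"36M voxels": (762, 1417), "143M voxels": (986, 1916)}
for model, (one_gpu, two_gpus) in runs.items():
    print(f"{model}: {strong_scaling_efficiency(one_gpu, two_gpus):.1%} efficiency")

# At 986 MNUPS, a single timestep of the 143M voxel model takes roughly:
print(f"~{143e6 / 986e6:.2f} s of wall time per timestep on one H100 PCIe card")
```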

 

A Production-Scale Test

Next, Altair conducted a more extensive aerodynamics simulation using a version of the production-level DrivAer validation test. This test involved a 1.5-millimeter mesh over the vehicle's surface, translating into a 330M voxel model simulated over 3.88 seconds of physical time. This full-scale test, including the time required for meshing, was repeated on 2-, 4-, and 8-way NVIDIA A100 SXM GPU configurations. Using two GPUs, the simulation took over 40 hours to run, and with eight A100 GPUs, the model took about 12 hours.

Although only two H100 PCIe GPUs were available for testing, extrapolating from the A100 SXM results and applying the observed multi-GPU scaling factor yields a predicted time to solution of under eight hours for the DrivAer test on eight NVIDIA H100 GPUs.
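The article doesn't spell out the exact extrapolation, but a simple version of it, sketched below under stated assumptions, lands in the same ballpark:

```python
# Illustrative extrapolation only -- the exact method Altair used isn't
# described here. This simply applies the ~1.4x per-GPU H100-vs-A100 throughput
# gain from the Roadster benchmark to the measured 8-way A100 SXM runtime.

a100_8way_hours = 12.0       # measured DrivAer runtime on eight A100 SXM GPUs
h100_vs_a100_speedup = 1.4   # per-GPU gain observed for the H100 PCIe card

predicted_hours = a100_8way_hours / h100_vs_a100_speedup
print(f"Predicted 8x H100 runtime: ~{predicted_hours:.1f} hours")
# The H100 SXM variant should do better still, consistent with the
# "under eight hours" prediction above.
```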

This level of performance is a game changer for sites running large-scale simulation-driven design. Initial tests with the latest NVIDIA H100 GPUs show that engineers can comfortably run high-fidelity simulations on a single server overnight using ultraFluidX – a previously unattainable level of performance that will dramatically improve the efficiency of the CAE design environment. We look forward to validating these results with NVIDIA H100 SXM cards once the hardware is available for testing.

 

A Range of Solvers for Different Workloads

Altair CFD provides a comprehensive set of CPU and GPU-based solvers for a wide range of fluid mechanics problems. Solvers in the portfolio include Altair® AcuSolve® (a general-purpose Navier-Stokes solver), Altair® nanoFluidX® (a smoothed-particle hydrodynamics solver), and Altair® ultraFluidX (the LBM solver used in the tests described above).

Customers can also leverage Altair® Inspire™ Studio with GPU-accelerated Altair TheaRender™ and ParaView to reduce the time required to render high-quality, noise-free images. These Altair solutions are qualified to run on various NVIDIA platforms and GPUs, including NVIDIA RTX workstations, NVIDIA DGX platforms, and OEM solutions based on NVIDIA GPUs.

 

The Bottom Line

Traditional CPU-based simulations can take a long time to complete, inhibiting a manufacturer's ability to explore design parameters and arrive at an optimal solution. Using Altair tools, aerodynamicists and engineers can completely automate their workflow – from building a model in Inspire Studio, to meshing with Altair® HyperMesh®, to importing models into Altair® Virtual Wind Tunnel™. From there, they can submit jobs to a GPU-based cluster, then seamlessly post-process and visualize results with Altair® Access™.

As these early results with the NVIDIA H100 suggest, the ability to run overnight transient simulations for full vehicle aerodynamics is a game changer for manufacturers. Using the NVIDIA H100 GPU with ultraFluidX, designers can iterate faster, create higher-quality, better-performing designs, and simulate them more thoroughly to achieve better outcomes and remove risk from the design process.

To learn more about these and additional aero-acoustic benchmarks on the NVIDIA H100, watch the webinar: "High-Fidelity CFD Simulations Overnight: A Breakthrough for Simulation-Based Design."


1. See the NVIDIA article: The Computational Fluid Dynamics Revolution Driven by GPU Acceleration.

2. Large Eddy Simulation (LES) is a cutting-edge technique used in the aerospace and automotive industries to simulate how hardware interacts with fluids.

3. See NVIDIA H100 Tensor Core GPU Architecture, page 18, for detailed specifications.

4. See the NVIDIA H100 Tensor Core GPU technical specifications.

5. The H100 NVL comprises two PCIe 5.0 cards paired with an NVLink bridge.

6. Unfortunately, the H100 SXM was not available at the time these tests were run. Altair and NVIDIA expect to run additional benchmarks in the future.

7. See the NVIDIA H100 datasheet for additional information. The H100 SXM delivers 67 teraFLOPS of FP32 performance vs. 51 teraFLOPS for the H100 PCIe card. NVLink on the A100 SXM has an interconnect bandwidth of 600 GB/s vs. 128 GB/s for PCIe Gen 5.

8. The model size is represented in voxels, generic units representative of a volume of 3D space computed when the model is meshed. 

9. Strong scaling refers to the improvement in time required to solve a fixed-size problem as the number of GPUs increases. For the 36M voxel test, the dual-GPU configuration delivered 1,417 MNUPS vs. 762 MNUPS for the single-GPU configuration: (1417/762)/2 = 93.0% scaling efficiency. For the 143M voxel test, the dual-GPU configuration delivered 1,916 MNUPS vs. 986 MNUPS for the single-GPU configuration: (1916/986)/2 = 97.2% scaling efficiency. Astute readers will notice that there was also an improvement in "weak scaling," which measures how the time per unit of work changes as the problem size grows along with the number of GPUs. Due to memory limitations, the 286 million voxel model was not run on a single-GPU configuration.