Data Management in the HPC Cloud

December 13, 2018

Data management is critical for enterprises. For a large business, it's important to understand trends and common themes in your data, which is why web scraping is popular among enterprises (take a look at other web scraping use cases here). For those considering a move to the cloud, cloud data management is often a key consideration because it offers another way to get feedback on and manage your data. Whether you're planning a hybrid environment where an on-premises cluster will share data with cloud instances, or a cloud-only solution spread across multiple clusters and availability zones, you'll need to think about data movement, replication, and synchronization. Cloud data management can also be used with a cloud-based monitoring service.

Navigating Cloud Data Management Solutions for Hybrid Cloud

There are a variety of cloud data stores (databases, key-value stores, object stores, etc.), but most HPC applications rely on filesystems. For this reason, data synchronization in HPC hybrid-cloud environments is usually done at the level of the file system. In an ideal world, regardless of the file system used (NFS, Lustre, BeeGFS, WekaIO Matrix, etc.), administrators would like a simple solution that is inexpensive and transparent to users and applications.

When a file is created, updated, or deleted at one location the change should ideally be reflected immediately at one or more other locations. In a perfect world, each cluster would have a shared consistent global filesystem view with file replication handled automatically behind the scenes.
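As a rough illustration of what "reflecting changes" involves, the sketch below compares two snapshots of a directory tree and reports which files were created, updated, or deleted. The function names (snapshot, diff_snapshots) are hypothetical and not taken from any product discussed here, and using size plus modification time as the change signal is a simplifying assumption; real replication tools typically rely on change logs, checksums, or file system events.

```python
import os


def snapshot(root):
    """Map each file path (relative to root) to its (size, mtime_ns).

    Size and modification time are an assumed, simplified change signal.
    """
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            state[os.path.relpath(full, root)] = (info.st_size, info.st_mtime_ns)
    return state


def diff_snapshots(old, new):
    """Return sorted lists of created, updated, and deleted relative paths."""
    created = sorted(path for path in new if path not in old)
    deleted = sorted(path for path in old if path not in new)
    updated = sorted(
        path for path in new if path in old and new[path] != old[path]
    )
    return created, updated, deleted
```

A replication job could take a snapshot before and after a run and pass the three lists to whichever transfer mechanism is in use. Polling like this only gives eventual consistency; the immediate propagation described above needs support from the file system itself.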

Given the variety of file systems, cloud offerings, and data replication use cases, achieving this goal is harder than it seems. A common architectural approach for network attached storage (NAS) environments is illustrated in the figure below, where an on-premise edge filer and caching filers in the cloud ensure that cloud storage stays synchronized with local file systems.

The optimal solution depends on several factors, including the nature of the applications, the existing cluster environment, performance requirements, the amount of data to be moved between on-premise clusters and the cloud or between cloud-resident clusters, and whether you are using some form of fog computing as an intermediary.

A Variety of Considerations

A good place to start is to have a clear idea of your requirements so that you can map them to potential solutions and weigh the pros and cons objectively. Some considerations when selecting cloud-friendly data management and replication solutions are offered below:

Primary location of data – Some solutions assume that you want to maintain data locally and make it available in the cloud only occasionally to support use cases like cloud-bursting. Other solutions assume that you want to store data in the cloud persistently and provide a gateway to access cloud-resident data locally using familiar protocols like NFS.
Frequency of data replication and quality of service – Do you need to replicate data only once, periodically, or continuously? Some solutions are designed for continuous replication while others are better suited to one-time or periodic data movement.
Anticipated data volume – Are you moving gigabytes, terabytes, or petabytes? When moving large data sets, transfer times and data transfer costs are important considerations. With a dedicated 1 Gbps connection and WAN optimization software, moving one terabyte can take three hours. For large file transfers, technologies like compression and WAN optimization are essential.
Block vs. object storage – Most applications perform I/O against file systems implemented on block storage. Some modern file systems employ tiering and leverage object stores behind the scenes. Object stores include AWS Simple Storage Service (S3) and similar cloud object stores, along with Ceph (which presents an S3-compatible interface) and OpenStack Swift. Understanding the type of cloud storage required by a cloud data management solution is important because object storage in the cloud is significantly less expensive than block storage. A good practice is to use an object store for storing large data sets in the cloud and to move data to storage services such as Amazon's Elastic Block Store (EBS) or Elastic File System (EFS) only when cloud instances need access to the data.
Unidirectional, bi-directional or n-way replication – Can you live with one-way synchronization (master-slave), or do directories and data need to be synchronized bi-directionally across multiple sites? Do you need global namespace consistency, or is synchronizing just a few key subdirectories good enough?
Replication bandwidth/performance requirements – How quickly do changes need to appear on a remote file system? Do you need strong consistency or will eventual consistency suffice?
File locking/concurrency issues – Will there be multiple writers to the same file at different locations, or will each site work on discrete copies of data independently?
Nature of workloads – Some workloads require parallel file access for high bandwidth. Examples include scratch storage for parallel CFD simulations or reading and writing multi-gigabyte aligned genome sequences. Other workloads require extreme IOPS (input/output operations per second) and low latency. Reading and storing millions of small snippets of data from genome sequencers or IoT sensors are good examples.
Commercial vs. open-source – Are you comfortable deploying an open-source solution or do you need a commercially supported solution? Some organizations will have adequate staff with technical talent to manage and support open-source solutions in-house. A commercial solution might be more suitable for others.
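To make the data volume consideration above concrete, the sketch below estimates transfer time from data size and link speed. The function name transfer_hours and the efficiency parameter are illustrative assumptions, not part of any tool mentioned here. Efficiency stands in for protocol overhead and contention, and compression or WAN optimization would change the effective figure.

```python
def transfer_hours(data_bytes, link_bits_per_second, efficiency=1.0):
    """Estimate hours needed to move data_bytes over a network link.

    efficiency is the assumed fraction of nominal link speed actually
    achieved (1.0 means the full line rate).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


ONE_TERABYTE = 1e12  # bytes (decimal terabyte)
ONE_GBPS = 1e9       # bits per second

# At full line rate, one terabyte takes a little over two hours.
print(round(transfer_hours(ONE_TERABYTE, ONE_GBPS), 2))
# At an assumed 75% efficiency, the estimate is just under three hours.
print(round(transfer_hours(ONE_TERABYTE, ONE_GBPS, efficiency=0.75), 2))
```

Under these assumptions, the three-hour figure quoted above corresponds to roughly three quarters of the nominal 1 Gbps being usable. Scaling the same arithmetic to petabytes shows why periodic bulk transfers and continuous replication call for different approaches.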

The Best Choice May Depend on Your Existing Cloud Data Management Environment

With the capabilities above in mind, a good place to start is to look for solutions complementary to the storage solution or filesystem that you're already using.

For example, if you're already using NetApp filers, Cloud Volumes ONTAP together with solutions like NetApp CloudSync or NetApp SnapMirror are likely worth a look. If you use Amazon Web Services (AWS) and store data in NFS on-premise, you'll want to look at the new AWS DataSync service. If you have more basic needs and don't require features like strong consistency, data de-duplication, or distributed caching, open-source rsync may be sufficient.

It's also important to remember that data replication doesn't necessarily need to be done at the file system level. If you are an Altair Grid Engine user and data sets are relatively small, you might rely on features in Grid Engine to stage data for you at the job level in advance of job execution using various transfer mechanisms. Similarly, if you are using Altair Navops Launch to manage dynamic clusters in the cloud, applet functionality in Navops Launch can be used to stage data intelligently when cluster nodes are deployed, using a variety of underlying transfer mechanisms (scp, ftp, rsync, or cloud-specific utilities) to support cloud bursting use cases.

While there are many ways to manage data staging and replication, we've concentrated on file-system-oriented solutions that either offer data replication capabilities or are HPC-friendly.

Distributed File Systems and Data Replication Solutions

We'll start with some commercial solutions purpose-built for cloud and hybrid cloud environments.

Cloud-Centric File System Solutions

Elastifile – As the name suggests, Elastifile provides scalable, elastic cloud storage. It is built to support both high-bandwidth and I/O-intensive workloads and to provide elastic multi-cloud file system storage. In addition to supporting HPC workloads, Elastifile is also marketed as a distributed persistent store for containerized environments. The key components of the Elastifile solution are:

Elastifile Cloud File System (ECFS) – a high-performance distributed file system that works with popular cloud providers such as AWS, Microsoft Azure, and Google Cloud Platform (GCP).
Elastifile Cloud Connect – providing bi-directional mobility between file storage and object storage, enabling data sharing between on-premise and cloud storage environments.
Elastifile Clear Tier – leveraging the cloud provider's object store as a secondary storage tier to reduce cost while still making the files read/write accessible via the ECFS file system.

An advantage of Elastifile is that it is cloud provider agnostic and will work with whatever storage solution you have on-premise (NFS, Lustre, etc.). It supports hybrid cloud and cloud bursting use cases by replicating local file system data to a cloud provider's object store and making file-system-related metadata available to ECFS. This architectural approach is beneficial because it helps reduce costs by using inexpensive object storage. For life sciences customers, Elastifile can be a good choice because of its suitability for working with various genomics pipelines, and its support for both high storage bandwidth and IOPS.

NetApp CloudSync – As mentioned earlier, CloudSync is a SaaS solution offered by NetApp that is worth considering if you're already operating NetApp filers on-premise and want to stay within the NetApp ecosystem. It works with Cloud Volumes ONTAP, which is optimized for both AWS and Azure. NetApp CloudSync can synchronize data continuously between on-premise NAS filers, cloud storage, and object stores with efficient transfer protocols. Similar to Avere Systems (described below), this is a good solution for cloud bursting use cases because data can continue to reside on-premise but be transparently accessible in the cloud.

Avere Systems – Microsoft's Avere® Systems is a solution that lets customers integrate legacy NAS environments with public or private object stores, implementing a global file system accessible via NFS or SMB. With this design, an on-premise FXT Edge Filer sits in front of your existing NAS filer. Additional FXT Edge Filers can sit in other locations, and Virtual FXT (vFXT) edge filers are available for major cloud platforms like Azure, AWS, and GCP, providing transparent access to files. Avere Flashmirror® provides data replication across heterogeneous filers to ensure data availability. Similar to Elastifile, Avere has the advantage that it minimizes disruption to the on-premise NAS environment.

We've focused on the cloud solutions above because they are common in HPC environments or offer HPC-specific features like high-bandwidth parallel I/O or high IOPS. There are several other hybrid cloud and cloud file system solutions as well, including Nasuni Cloud File Services, Panzura Freedom NAS, and SoftNAS Cloud.

File Replication Solutions

File replication solutions are in a different category because they are file system and cloud framework agnostic. These solutions are useful for more basic requirements, such as keeping segments of a directory tree synchronized with eventual consistency.

Rsync – Rsync is an open-source solution that is freely available. Unlike more complex commercial solutions, rsync is appealing because of its simplicity. It makes no assumptions about the underlying file systems. It is also flexible, allowing only segments of the file system to be replicated recursively over multiple network transports.

For HPC users who are technically inclined and don't need a lot of sophisticated features, rsync is worth a look because it is free, easy to implement, and doesn't have a lot of prerequisites. It can also work with the multiple file systems used in HPC environments discussed below. A good technical article explaining how to implement rsync is here.
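As a minimal sketch of how a script might drive rsync to keep one directory subtree synchronized, the helper below assembles an rsync command line. The function name build_rsync_command, the paths, and the host name cloud-head are hypothetical. The options used (--archive, --compress, --delete, --dry-run) are standard rsync options, but which ones are appropriate depends on your environment.

```python
import shlex


def build_rsync_command(source_dir, dest, delete=False, dry_run=False):
    """Build an rsync argument list that mirrors source_dir into dest.

    dest may be a local path or a remote target such as "host:/path/".
    """
    cmd = ["rsync", "--archive", "--compress"]
    if delete:
        # Remove files at the destination that no longer exist at the source.
        cmd.append("--delete")
    if dry_run:
        # Report what would change without transferring anything.
        cmd.append("--dry-run")
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd.append(source_dir.rstrip("/") + "/")
    cmd.append(dest)
    return cmd


# Hypothetical example: preview a one-way sync of a project directory.
command = build_rsync_command(
    "/data/project", "cloud-head:/data/project/", delete=True, dry_run=True
)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to subprocess.run(command, check=True). Running with dry_run=True first is a reasonable precaution when delete is enabled, since a one-way mirror removes destination files that are missing from the source.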

Rclone – Rclone is an open-source data synchronization tool similar to rsync. The key difference is that while rsync is oriented specifically around file system synchronization, Rclone provides versatile synchronization services across multiple cloud services and protocols.

Rclone supports diverse data stores such as Box, Dropbox, and Microsoft OneDrive; protocols like FTP and HTTP; and file systems and object stores like Ceph, OpenStack Swift, and Amazon S3. Often it is more efficient from a cost and performance standpoint to move data directly to and from a cloud provider's object store as explained above, and Rclone can be used for this purpose.
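A comparable sketch for Rclone is shown below, pushing a local directory straight into an object store. The function name build_rclone_sync, the remote name s3remote, and the bucket path are hypothetical; the remote is assumed to have been set up beforehand with rclone config. The sync subcommand and the --transfers and --dry-run flags are standard Rclone options.

```python
def build_rclone_sync(local_dir, remote, bucket_path, transfers=4, dry_run=False):
    """Build an rclone argument list that syncs local_dir to remote:bucket_path.

    remote is the name of a pre-configured rclone remote (an assumption here).
    """
    if transfers < 1:
        raise ValueError("transfers must be at least 1")
    cmd = [
        "rclone", "sync",
        local_dir,
        f"{remote}:{bucket_path}",
        # Number of file transfers to run in parallel.
        "--transfers", str(transfers),
    ]
    if dry_run:
        # Show what would be copied or deleted without changing anything.
        cmd.append("--dry-run")
    return cmd


# Hypothetical example: preview a sync of results into an S3 bucket.
print(build_rclone_sync("/data/results", "s3remote", "hpc-bucket/results", dry_run=True))
```

Note that rclone sync makes the destination match the source, which includes deleting destination files that are absent from the source, so a dry run is worth doing before the first real transfer.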

AWS DataSync – AWS DataSync is a new service from Amazon. It provides an easy way to transfer or synchronize data between on-premise NFS environments and AWS S3 or Amazon Elastic File System (EFS). It also provides a performance-optimized network transfer protocol for faster data movement and TLS encryption. AWS DataSync provides another useful data transfer mechanism that can be triggered via Navops Launch's automation, both for clusters that persist in the cloud and for transient clusters deployed to run specific workloads for a short time.

HPC-Friendly File Systems

While some HPC sites use NFS for file sharing, others require file systems that offer higher levels of performance and scalability. Some popular HPC filesystems are discussed below in the context of hybrid cloud deployments.

It's worth mentioning that in some cases these HPC file systems are competing with file systems offered by cloud providers, such as Amazon's Elastic File System (EFS). Although EFS is not marketed as a high-performance file system, it offers reasonable bandwidth and decent IOPS performance along with easy backup to S3, so HPC users will need to consider factors like performance, cost, and convenience when making a selection.

WekaIO – For users needing a commercially supported, high-performance parallel file system that works on-premise and in the cloud, WekaIO Matrix is a compelling choice. It provides impressive benchmark results, and as a software-only solution that runs on commodity hardware, it is easy to deploy and manage. Matrix presents a standard POSIX interface with integrated tiering to on-premise or cloud-based object stores including AWS S3, Swift, or Ceph. At this time, WekaIO Matrix targets mainly AWS for cloud deployments. The fact that Matrix supports storage in S3 is attractive to cloud users because storage costs are lower in the S3 object store than for equivalent capacities stored in AWS EBS, as explained earlier.

Something to keep in mind is that WekaIO Matrix does not specifically address data synchronization in hybrid environments, but it presents a POSIX interface and can be used with other data replication solutions. For example, in Altair Grid Engine environments, Navops Launch might be used to automate data movement (using Navops Launch Applets) to create a seamless user experience for HPC users. Rsync might be used stand-alone to keep data synchronized, or be triggered by Grid Engine or Navops Launch to synchronize data in advance of jobs running in the cloud.

Lustre – Lustre is an open-source file system with a long heritage in HPC. It started as a research project in 1999 at Carnegie Mellon University leading to the formation of Cluster File Systems (CFS) in 2001. Lustre supports thousands of client nodes, tens of petabytes of storage and hundreds of servers.

Stewardship of Lustre has evolved as CFS changed hands between Sun Microsystems, Oracle, Whamcloud, Intel, and, presently, DDN. Through all these business transitions, the non-profit Open Scalable File Systems Inc. (OpenSFS) and European Open File Systems (EOFS) have maintained the Lustre community portal with the mission of keeping Lustre open and free.

Lustre has a vibrant community, remains popular in higher-end HPC centers, and is actively maintained with current releases. Lustre runs on 60% of the top 500 supercomputers. DDN (Whamcloud) continues to provide cloud marketplace offerings and deployment recipes for Lustre in the cloud, supporting AWS and Azure.

As far as data replication is concerned, Lustre borrows from the open-source rsync described previously, relying on lustre_rsync to keep Lustre files synchronized across clusters. As with other HPC file systems, customers can use Altair Grid Engine and Navops Launch to handle data movement on an as-needed basis rather than fully replicating data between on-premise environments and the cloud.

Lustre is a popular HPC file system, but it's worth pointing out that Lustre, like similar file systems with a heritage in HPC, relies on block storage (relatively more expensive in the cloud). Modern cloud file systems and commercial competitors are delivering equal or better performance using less expensive cloud-based object storage, so this is a case where free software is not necessarily more economical in the cloud.

In November of 2018, AWS announced the availability of AWS FSx for Lustre. Users can associate their AWS FSx file system with an S3 bucket for seamless access. AWS will automatically copy S3 data to FSx for Lustre as needed, and write results back to S3 or other low-cost data stores. This is an important development for AWS customers who need a high-performance parallel file system but don't want the hassle of managing their own Lustre deployment.

BeeGFS – BeeGFS is also worth discussing since it's another popular HPC file system, particularly in Europe where it has its roots. It was developed at the Fraunhofer Institute in Germany and was previously known as FhGFS. It is regarded as easier to manage than Lustre and is often favored by small and medium-sized HPC users for this reason. The software is open-source and available free, and commercial support is available from the Fraunhofer Institute and ThinkParQ, the commercial entity behind BeeGFS. The software can be deployed on-premise or in public clouds. Automated installers are available via the AWS marketplace. On Azure, an Azure Resource Manager template is available to support BeeGFS installations.

BeeGFS supports the notion of mirroring across multiple BeeGFS clusters on a per-directory basis (referred to as "buddy groups"). Although this functionality might be used to achieve cross-cloud replication in theory, the primary use case for mirroring is high availability in on-premise environments. Like Lustre, BeeGFS requires block storage.

GlusterFS – GlusterFS is a popular open-source scale-out file system with a long heritage in HPC. Previously known as GNU Cluster, the project has its HPC roots with a team that worked on the Thunder project at Lawrence Livermore National Laboratory (LLNL) in 2003. Gluster was released as open-source software starting in 2007 and became popular among HPC users who need a scale-out file system. Red Hat has been the principal maintainer of GlusterFS since acquiring Gluster in 2011. The technology forms the basis of Red Hat Storage Server. It is attractive to many HPC users because it is open-source, viewed as stable and reliable, and commercially supported by Red Hat.

GlusterFS supports replication features including Replicated Volumes (synchronous data replication between local clusters for HA) and Geo-replication (asynchronous, eventually consistent data replication across clusters and clouds). GlusterFS is also a credible parallel file system for high-bandwidth reads and writes, but file systems like WekaIO Matrix or BeeGFS are more suitable for applications that need high IOPS. A well-written and detailed comparison between GlusterFS and BeeGFS, written by Harry Mangalam at the University of California at Irvine, is provided here.

GlusterFS users can leverage Gluster's native cross-cloud replication features and/or Altair Grid Engine and Navops Launch to automate data movement for various HPC workloads in GlusterFS environments.

Summary

For HPC users deploying solutions to the cloud, there are a variety of commercial and open-source solutions. Usually, the right choice depends on the nature of your applications, how data is stored presently, and the level of sophistication and performance needed from cloud file systems.

At Altair, we've gained considerable experience helping customers migrate high-performance applications to the cloud and solving data-related problems. Feel free to contact us to discuss your cloud data management challenges. We'd also be interested in learning how others are solving these and other HPC cloud data management challenges.
