When thinking about high availability for your BeeGFS parallel file system, you need to consider both data durability and data availability. Durability means that your data won't be lost because of drive failures, bit rot, or some other form of corruption. Availability means that your data is accessible whenever you need it. From the perspective of a BeeGFS management, metadata, or storage server, a NetApp® E-Series storage system provides both. But BeeGFS relies on multiple servers working in tandem to present a single namespace to BeeGFS clients. If one of those servers goes down, the data it serves can become unreachable even though it is still safe on the storage system. Server hardware failures, kernel panics, and other such 3 a.m. headaches are relatively rare, but we recognize that sometimes you need assurance that a single server failure won't take your file system offline.
The Current State of High Availability in BeeGFS
The only built-in high availability (HA) feature in BeeGFS is Buddy Mirroring. It's a shared-nothing architecture: Each node has access only to its own storage devices. Buddy Mirroring makes the metadata and storage services highly available (notably not the management service) by writing data to both a primary and a secondary service and allowing reads from either. The main disadvantages of Buddy Mirroring are that it requires double the underlying storage and that it reduces write performance. And because the management service has no built-in high availability, it remains a single point of failure that must be addressed separately.
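To make the mirroring trade-off concrete, here is a toy Python model of a buddy group. It is not BeeGFS code or a BeeGFS API: `Target`, `BuddyGroup`, and the chunk names are hypothetical, and real Buddy Mirroring also handles resynchronization and state tracking that this sketch deliberately ignores. It only shows the two properties described above: every write is stored on both buddies, and a read can still be served when one buddy is offline.

```python
class Target:
    """Hypothetical storage target holding chunks keyed by name (toy model)."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}
        self.online = True


class BuddyGroup:
    """Toy primary/secondary pair that keeps identical copies of each chunk."""

    def __init__(self, primary, secondary):
        self.buddies = (primary, secondary)

    def write(self, key, data):
        # Apply the write to every online buddy; each copy consumes capacity.
        written = 0
        for target in self.buddies:
            if target.online:
                target.chunks[key] = data
                written += 1
        if written == 0:
            raise RuntimeError("no buddy target is online")
        return written

    def read(self, key):
        # Serve the read from the first online buddy that holds the chunk.
        for target in self.buddies:
            if target.online and key in target.chunks:
                return target.chunks[key]
        raise KeyError(key)

    def bytes_stored(self):
        # Total capacity consumed across both buddies.
        return sum(len(d) for t in self.buddies for d in t.chunks.values())


primary, secondary = Target("storage-a"), Target("storage-b")
group = BuddyGroup(primary, secondary)
copies = group.write("chunk-1", b"hello")
stored = group.bytes_stored()  # 10 bytes consumed for 5 bytes of user data
primary.online = False          # simulate losing the primary
data = group.read("chunk-1")    # still b"hello", now served by the secondary
```

The doubled `bytes_stored()` result is the capacity cost, and the extra copy on every write is where the write-performance penalty comes from.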
Evolving Our BeeGFS Solution Support
The use of enterprise storage systems such as E-Series enables an external shared-disk HA architecture (sometimes referred to as shared storage). We can then use the industry-proven open-source components Corosync and Pacemaker to provide high availability at the BeeGFS application layer. This approach allows BeeGFS management, metadata, and storage services to fail over between multiple nodes in a cluster that share access to the same set of storage devices (for example, E-Series volumes). The biggest benefit of this approach is that it eliminates the need to duplicate data for redundancy. Instead, you can rely on your battle-tested and bulletproof (not literally, please don't shoot it) E-Series storage system to provide data durability through RAID and/or Dynamic Disk Pools.
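A quick back-of-the-envelope calculation shows what avoiding duplication is worth. The sketch below assumes, purely for illustration, 1,000 TB of raw capacity protected by RAID 6 in an 8+2 layout in both designs, with Buddy Mirroring layered on top keeping a second full copy. The function names are hypothetical, and Dynamic Disk Pools use a different layout, so their efficiency will differ.

```python
def raid_efficiency(data_drives, parity_drives):
    """Fraction of raw capacity left for data in a data+parity layout."""
    return data_drives / (data_drives + parity_drives)


def usable_tb(raw_tb, data_drives, parity_drives, copies):
    """Usable capacity after RAID parity and after keeping `copies` full copies."""
    return raw_tb * raid_efficiency(data_drives, parity_drives) / copies


RAW_TB = 1000.0  # hypothetical raw capacity

# Assumed layout for both designs: RAID 6 with 8 data + 2 parity drives.
shared_disk_tb = usable_tb(RAW_TB, 8, 2, copies=1)   # one copy, shared by HA nodes
buddy_mirror_tb = usable_tb(RAW_TB, 8, 2, copies=2)  # primary + secondary copy

print(f"shared-disk HA: {shared_disk_tb:.0f} TB usable")
print(f"buddy mirroring: {buddy_mirror_tb:.0f} TB usable")
```

Under these assumptions the script reports 800 TB usable for shared-disk HA and 400 TB for Buddy Mirroring: the same raw capacity holds half as much data when every chunk is stored twice.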
Technically, all this has always been possible for anyone to set up using the aforementioned software. However, the work needed to design, develop, and test a fully engineered HA solution for BeeGFS using these components takes time — time that many of you don’t have. Perhaps more important is having peace of mind that your environment is fully supported with experts available when you need them.
In talking with customers since NetApp began offering support for BeeGFS in 2019, we've noticed a recurring theme: the desire for a BeeGFS HA solution that uses a shared-disk architecture. We're proud to announce that our BeeGFS support now includes a shared-disk HA solution built on Corosync and Pacemaker.
Why Use NetApp E-Series?
E-Series technology complements BeeGFS by providing several enterprise features that let you rest easy while getting the most out of your investment:
- Background media scans prevent bit rot by automatically checking and correcting media errors. This feature allows you to take advantage of RAID cost savings without worrying about silent data loss.
- Proactive drive monitoring can often minimize the performance impact of failures by copying data to a hot spare before preemptively failing suspect drives. Call-home functionality automatically opens support cases when issues are detected and can trigger shipment of replacement hardware if needed.
- Dual RAID controllers offload RAID processing from the BeeGFS servers, avoiding the performance issues that arise when software RAID competes for resources with the BeeGFS services running on those servers.
- Because the BeeGFS server-side write cache must be disabled to ensure data integrity in case of failover, the E-Series cache (which is battery backed and mirrored between controllers) provides a performance boost, especially for bursty workloads.
- Support for T10 Protection Information (T10-PI), available on some protocols, helps detect data inconsistencies.
- Support for direct or fabric-attached 200Gb NVMe/IB, NVMe/RoCE, and iSCSI Extensions for RDMA (iSER) puts your data in the fast lane between BeeGFS servers and their storage systems.
- Other protocols, such as 12Gb SAS, 32Gb FCP, 32Gb NVMe/FC, 25Gb iSCSI, and 100Gb SCSI RDMA Protocol (SRP), provide cost-effective connectivity options to meet a range of performance requirements.
We also provide other integrations to simplify administration and deployment:
- Monitor your E-Series storage system alongside BeeGFS using Grafana.
- Quickly automate deployment at any scale using Ansible.
All this and more is backed by a team that brings years of E-Series and BeeGFS expertise to your environment. Together, BeeGFS and E-Series systems deliver a high-performing, highly available, and massively scalable file system at an extremely competitive price point.
More to come — stay tuned! Meanwhile, if you need a little background on BeeGFS, take a look at my earlier blog post, BeeGFS for Beginners. Or, if you’re ready to delve into details, read TR-4782: BeeGFS with NetApp E-Series Reference Architecture.