Accelerate deployment of HA for BeeGFS with Ansible

Back in July, we announced a new high-availability (HA) solution for BeeGFS backed by NetApp® E-Series storage. If you haven’t read my introductory blog post, I encourage you to start there for useful background before diving in. The first time I got hands-on with the deployment procedure, I thought, “We’ve got to automate this process.” Don’t get me wrong: it’s a rock-solid document, filled with valuable details about how to deploy BeeGFS into a Linux HA cluster. But like many good technical documents, it suffers from the weight of its own breadth and depth. Luckily, just because documentation needs to describe a bunch of manual steps doesn’t mean we can’t create a more concise summary through automation.

Accelerating deployment with Ansible

With the 2.0 release of the NetApp E-Series BeeGFS Collection, we’re taking what we learned from automating BeeGFS deployments without HA and setting out to bring the HA deployment guide to life using a new BeeGFS HA Ansible role.


Much of the guide comes down to defining resource groups for each BeeGFS service, along with dependent resources such as floating IPs and storage devices (E-Series volumes) that contain BeeGFS configuration and application data. These resource groups define the building blocks of our BeeGFS file system, with the flexibility to scale out metadata and storage independently to meet any requirements.


We started by thinking about how to simplify defining these BeeGFS resource groups in an Ansible inventory file. Leveraging our deep in-house Ansible expertise, we devised a strategy that allows the BeeGFS service, network, and storage configuration for each resource group to be defined in one place, and then our automation takes care of the rest. We feel that this approach offers a cleaner and more user-friendly experience by focusing on the overall desired end state rather than individual pieces of that end state.


What do these BeeGFS resource groups look like in an Ansible inventory?

  • We start by specifying a name for the resource group; for example, storage_01.
  • Then we list the nodes that should run this resource group:
    • The first entry is the preferred node.
    • Additional nodes are prioritized in the order in which they are listed.
  • Then we specify the configuration that applies to this resource group:
    • The type of service we want to deploy; for example, storage.
    • The port that should be used for this service.
    • The floating IPs we want to use to access this service (as many as we need).
    • The E-Series volumes we want to create and use as our storage targets, including:
      • The desired E-Series storage system.
      • The storage pool name.
      • The desired RAID level.
      • The number of drives.
      • The volume quantity and sizes.
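To make the structure above concrete, here is a rough sketch of how one such resource group might look in a YAML inventory. The key names and values below are illustrative assumptions, not the role’s exact schema; consult the collection’s documentation and example inventories for the real variable names.

```yaml
# Hypothetical inventory sketch; key and group names are illustrative assumptions.
storage_01:                        # name of the resource group
  hosts:                           # nodes that can run this resource group
    node_a:                        # first entry: preferred node
    node_b:                        # additional nodes, in failover priority order
  vars:
    service: storage               # type of BeeGFS service to deploy
    port: 8004                     # port used by this service
    floating_ips:                  # floating IPs used to reach this service
      - eth1: 192.168.1.10/24
    volumes:                       # E-Series volumes used as storage targets
      - storage_system: array_1    # desired E-Series storage system
        storage_pool: pool_1       # storage pool name
        raid_level: raid6          # desired RAID level
        drive_count: 10            # how many drives
        volume_count: 4            # volume quantity
        volume_size: 1024          # volume size (for example, in GiB)
```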

With that, we’ve described how we want to deploy a highly available BeeGFS service. Simply repeat this for as many resource groups as needed to build out the desired BeeGFS configuration. It’s worth noting that the nodes assigned to each resource group determine the failover strategy. For example, to minimize server hardware, you can choose to run in an active/active node configuration; or, to optimize performance under failure, you can run in an active/passive, N+1, or N+M configuration (or a mix of these). The provided examples make it easy to get started building out your inventory.


We also make it easy to define cluster-level attributes, such as setting the cluster name, creating a username and password, setting up alerts, and configuring any of the available fencing agents. After you’ve crafted the inventory, simply pass it off to a playbook that imports our BeeGFS HA role to take care of:

  • Configuring the E-Series storage systems
  • Connecting E-Series volumes to each BeeGFS node
  • Tuning kernel and storage performance
  • Creating the HA cluster
  • Deploying the desired BeeGFS resource groups
  • Backing up the final HA configuration
  • Setting up BeeGFS clients
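As a minimal sketch, the playbook that hands your inventory off to the role might look something like the following. The collection namespace and role name here are assumptions based on the NetApp E-Series BeeGFS collection; check the collection’s README for the exact names.

```yaml
# playbook.yml - hypothetical sketch; collection and role names are assumptions.
- hosts: all
  gather_facts: false
  collections:
    - netapp_eseries.beegfs
  tasks:
    - name: Deploy the BeeGFS HA cluster
      import_role:
        name: beegfs_ha
```

You would then run it against your crafted inventory with something like `ansible-playbook -i inventory.yml playbook.yml`.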

Through deployment and beyond

We understand that when you commit to infrastructure as code, it will be used to manage the environment for the life of your BeeGFS deployment. With this in mind, after deployment, the role is capable of:

  • Validating the deployed configuration and correcting configuration drift
  • Adding additional BeeGFS resource groups to new nodes, allowing the environment to scale seamlessly
  • Quickly pushing the existing configuration to new nodes after replacing a failed node
  • Adding and removing nodes from existing resource groups, enabling you to migrate to new hardware and decommission old hardware
  • Reverting the cluster to the preferred or desired state following failure of one or more nodes
  • Decommissioning and cleaning up the entire cluster (especially useful for getting hands on initially, and for testing)

With the BeeGFS HA role, we simplify deployment and management of a BeeGFS HA cluster with the flexibility to meet your BeeGFS requirements and desired HA architecture. You’ll know that your BeeGFS file system is highly available, your data is protected by enterprise-class E-Series storage systems, and the NetApp storage specialists have your back.


For more about deploying E-Series at scale for HPC using Ansible, check out my NetApp INSIGHT presentation, SPD-1438-2.

Joe McCormick

Joe McCormick is a software engineer at NetApp with over ten years of experience in the IT industry. With nearly seven years at NetApp, Joe’s current focus is developing high-performance computing solutions around E-Series. Joe is also a big proponent of automation, believing that if you’ve done it once, you shouldn’t have to do it again.
