NetApp® ONTAP® FlexGroup volumes have been steadily gaining fans since they were introduced in ONTAP 9.1. The questions about FlexGroup volumes alone have been keeping me very busy. If you would like to add to the volume of questions (pun not intended), email us at flexgroups-info@netapp.com.

 

One question I’m asked often is “How do I benchmark FlexGroup volumes?”

 

Usually, this question is accompanied by the plaintive statement, “I tried to test throughput with dd and I saw worse performance!” There’s a good reason for that — dd, by itself, is a fairly awful test for FlexGroup performance. Here’s why.

 

Many people use dd with a single command on a single client. For example:

# dd if=/dev/zero bs=1048576 count=4096 conv=fsync of=/mnt/flexgroup/file.dat

That’s the way people are used to testing a NetApp FlexVol® volume, because it gives a decent idea of what kind of line speed to expect. (Pro tip: It’s not a great way to test FlexVol volumes, either.)

 

When you use dd on a FlexVol volume, it looks like this.

Pretty straightforward; the file you’re writing is going to a volume that does single-threaded metadata, to a place where you know it lives — a single node in a cluster. When I ran that test, I saw an average of 805MBps and a max of 958MBps:

The test completed in 6.5 seconds.

 

However, FlexGroup volumes take the FlexVol concept and multiply it across multiple CPU threads and multiple nodes in a cluster for a true scale-out solution. Because of that, a FlexGroup can provide far more performance capability to a single namespace than a FlexVol volume could. But when you run a single-threaded, single-client dd test to a NetApp FlexGroup volume, it looks like this:

Notice that only one FlexVol member volume is being used, and that member is on a remote node? When that happens, there's a performance penalty due to cluster network traversal and the FlexGroup's remote access layer. Even with the cluster network taken out of the picture, the remote access layer still costs something, as you can see from a single-threaded dd test against a FlexGroup created on a single node:

That averages out to around 699MBps, with a max of 758MBps. This test completed in 7.85 seconds.

 

The test run on the FlexGroup volume (above) is doing fewer IOPS and fewer MBps than the FlexVol volume. It gets a bit worse on a FlexGroup that spans multiple nodes if the dd test lands on a remote node:

That averages out to around 525MBps, with a max of 713MBps. This test finished in 8 seconds.

 

But that’s to be expected: a FlexGroup volume is not meant for single-threaded, single-client operations. It’s meant to be hammered with clients and ingest requests! So that’s what we do here.

 

TR-4571 contains a link to some test scripts, and you can also visit my GitHub page for more information.

 

One of those scripts is a multithreaded dd test that lets you run a bunch of simultaneous dd operations on a single client or across multiple clients. It also runs reads of all the files and then deletes the files.
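The actual script is on the GitHub page, but as a rough illustration of the idea, a minimal parallel dd sketch might look like the following. The mount point, thread count, and file sizes here are placeholders, not the script’s defaults:

#!/bin/bash
# Minimal sketch of a parallel dd write/read/delete test (not the actual TR-4571/GitHub script).
# Assumptions: the FlexGroup is mounted at /mnt/flexgroup; each thread writes a 1GB file.
MOUNT=/mnt/flexgroup
THREADS=16

# Write phase: launch all dd writes in parallel, then wait for them to finish.
for i in $(seq 1 $THREADS); do
  dd if=/dev/zero bs=1048576 count=1024 conv=fsync of=$MOUNT/thread${i}.dat &
done
wait

# Read phase: read every file back in parallel.
for i in $(seq 1 $THREADS); do
  dd if=$MOUNT/thread${i}.dat bs=1048576 of=/dev/null &
done
wait

# Cleanup phase: delete the test files.
rm -f $MOUNT/thread*.dat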

 

The test looks like this.

When we throw in a couple of clients doing 16 dd operations at a time per client, the performance looks more like this on a single-node FlexGroup (two clients: one accessing remotely, one accessing locally). We’re looking at “data recv” as the metric, because “total recv” includes cluster network traffic.
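If you want to watch the same counters on your own cluster while a test runs, ONTAP’s per-second statistics output is one way to do it. Something along these lines should work, though the exact columns can vary by release:

ontap9-tme-8040::*> statistics show-periodic -interval 1

In that output, “data recv” and “data sent” reflect client traffic on the data network, while “cluster recv” and “cluster sent” reflect the cluster interconnect, which is why “total recv” can overstate client throughput on a multi-node FlexGroup.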

 

Here’s the single-node FlexGroup:

Note: Reads got about 586MBps.

 

On a 2-node FlexGroup, we get about 200MBps more for writes:

Note: Reads got around 1GBps.

 

Compared to this, a single FlexVol volume lagged behind both FlexGroup configurations:

But it’s not just about throughput. CPU utilization is higher for the 2-node FlexGroup, even though the value shown is averaged across two nodes, and that’s a good thing: we’re using that CPU for real work rather than letting it sit idle.

 

The dd script also finished faster on both the local and the 2-node FlexGroup than it did on the FlexVol volume.

Testing Using Larger Read and Write Sizes

In ONTAP, you can configure an SMB or NFS server to allow larger read and write sizes. Generally speaking, the defaults (a tcp-max-xfer-size of 64k for NFS servers and is-large-mtu-enabled set to false for SMB servers) are acceptable for most workloads.
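If you want to confirm what a given SVM is currently using before you change anything, the corresponding show commands should display both values. The DEMO SVM here is the one used later in this post; verify the field names against your ONTAP release:

ontap9-tme-8040::*> vserver nfs show -vserver DEMO -fields tcp-max-xfer-size

ontap9-tme-8040::*> vserver cifs options show -vserver DEMO -fields is-large-mtu-enabled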

 

In some cases, your workload may need to send more data across the network in a single read or write operation than the defaults allow. ONTAP 9.1 and later allow up to 1MB reads and writes for both NFS and SMB. When you use a larger read and write size, you see fewer IOPS, but potentially more MBps. As an analogy, how do you want to fight a fire?

 

With lots of small buckets and hoses (smaller read and write sizes)?

 

Or with one giant air drop (large MTU [maximum transmission unit] and read and write sizes)?

 

For most fires (buildings, houses, and so on) a bucket brigade — or the more modern fire hose — is enough. You wouldn’t call for an airdrop on a house fire! For a forest fire, though, you’d want the airdrop.

 

When testing with dd, it’s important to have a good idea of what your existing workloads look like. Do your users spend the day doing simple file copies? Do they require super-high performance? What is the average file size? What is the application block size?
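If you’re not sure what block size your applications actually use, ONTAP can give you a hint. For example, the per-workload QoS characteristics output includes a request-size column; something like the following should show it (treat the exact command and columns as version dependent):

ontap9-tme-8040::*> qos statistics workload characteristics show -iterations 10

Dividing a workload’s throughput by its IOPS gives you roughly the same number, which is a handy sanity check.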

 

In most cases, larger file sizes perform better with larger MTU.

 

For example, I changed the NFS maximum TCP transfer size to 1MB in my cluster and remounted:

ontap9-tme-8040::*> nfs server modify -vserver DEMO -tcp-max-xfer-size 1048576

Warning: Setting "-tcp-max-xfer-size" to a value greater than 65536 may cause performance degradation for existing connections using smaller values. Contact technical support for guidance.

Do you want to continue? {y|n}: y

[root@stlrx2540m1-68 /]# mount -o nfsvers=3,wsize=1048576,rsize=1048576 10.193.67.218:/FGlocal /mnt/flexgroup

Then I ran the same single-threaded dd test from earlier. The block size is 1MB:

# time dd if=/dev/zero bs=1048576 count=4096 conv=fsync of=/mnt/flexgroup/file.dat

With a 64k rsize and wsize, the test completed in just under 8 seconds. With a 1MB rsize and wsize, it completed in just under 11 seconds. The average throughput with the 64k transfer size was 699MBps, with a max of 758MBps. With the 1MB transfer size, the average throughput on the storage was around 466MBps, with a max of 546MBps. Notice how the ops have dropped substantially: with the larger wsize and rsize, we’re fitting more data into fewer ops.
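As a back-of-the-envelope calculation using the numbers above: at a 64k transfer size, roughly 699MBps works out to about 11,000 write operations per second (699 / 0.0625), while at a 1MB transfer size, roughly 466MBps is only about 466 operations per second. That’s around 24 times fewer operations carrying about two-thirds of the throughput, which is exactly the IOPS-versus-MBps tradeoff described earlier. (The client may coalesce or split operations, so treat these as rough figures.)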

When I change the size of the file from 4GB to 100GB, the throughput changes a bit. It starts out around the same as the previous tests (around 450MBps), but ramps up at the end with a maximum of 951MBps and ends up with an average of 531MBps. The test finished in 3 minutes, 19 seconds. The client reported 539MBps.

# time dd if=/dev/zero bs=1048576 count=102400 conv=fsync of=/mnt/flexgroup/file2.dat

102400+0 records in

102400+0 records out

107374182400 bytes (107 GB) copied, 199.246 s, 539 MB/s

real    3m19.250s

user    0m0.044s

sys     1m1.751s

When I double the dd block size to 2MB, the average throughput from the client’s perspective drops to around 518MBps, and the test takes about 12 seconds longer than the 1MB run.

# time dd if=/dev/zero bs=2097152 count=51200 conv=fsync of=/mnt/flexgroup/file2.dat

51200+0 records in

51200+0 records out

107374182400 bytes (107 GB) copied, 207.116 s, 518 MB/s

real    3m31.243s

user    0m0.023s

sys     1m1.204s

When I lower the dd block size to 512k, performance gets a bit worse still, with average throughput around 504MBps. The test takes about 10 seconds longer than the 2MB run.

# time dd if=/dev/zero bs=524288 count=204800 conv=fsync of=/mnt/flexgroup/file2.dat

204800+0 records in

204800+0 records out

107374182400 bytes (107 GB) copied, 213.16 s, 504 MB/s

real    3m41.572s

user    0m0.087s

sys     1m3.379s

Compare that to the same 512k dd test with the mount’s rsize and wsize set back to 64k:

# time dd if=/dev/zero bs=524288 count=204800 conv=fsync of=/mnt/flexgroup/file1.dat

204800+0 records in

204800+0 records out

107374182400 bytes (107 GB) copied, 151.55 s, 709 MB/s

real    2m31.555s

user    0m0.071s

sys     1m0.049s

As you can see, variance in workload types and block sizes can drastically change how storage performs. Therefore, you should test with multiple block sizes before deploying, to find what fits best in your environment. If you know that an application writes at a specific block size, test with that block size.
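If you want to automate that kind of comparison, a simple loop over candidate block sizes is enough. This is just a sketch; it assumes the FlexGroup is mounted at /mnt/flexgroup and overwrites a 4GB test file for each size:

#!/bin/bash
# Sketch: time the same 4GB sequential write at several dd block sizes.
# Assumes the FlexGroup is mounted at /mnt/flexgroup; adjust sizes to match your application.
MOUNT=/mnt/flexgroup
for BS in 65536 524288 1048576 2097152; do
  COUNT=$((4294967296 / BS))   # keep the total file size at 4GB for every block size
  echo "Block size: ${BS} bytes"
  time dd if=/dev/zero bs=$BS count=$COUNT conv=fsync of=$MOUNT/bs_test.dat
  rm -f $MOUNT/bs_test.dat
done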

 

For SMB (CIFS), when large MTU is enabled, the client and server negotiate the read and write sizes.
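On the ONTAP side, that means enabling the CIFS server option mentioned earlier. Assuming the same DEMO SVM, it would look something like this (the client negotiates the larger sizes when it connects, so a reconnect may be needed):

ontap9-tme-8040::*> vserver cifs options modify -vserver DEMO -is-large-mtu-enabled true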

File Placement

With the dd script doing multithreaded creations of files and folders, we can see how a FlexGroup attempts to evenly distribute workloads across multiple members. In the following example, I ran the script to do the following:

  • 1GB file creation
  • 8 folders
  • 8 files per folder
  • 64 files total

The FlexGroup was created on a single node with 8 member volumes and would look like this:

Although data is generally distributed across the members of a FlexGroup volume, it’s possible to find out exactly where the data was placed. With the diag-privilege command volume explore, we can map out where inodes get created in a FlexGroup. When an inode is local to the member volume being examined, it appears as a normal inode in the output. For example:

entry 7:  inum 12237, generation 16715282, name "F4.dat" (8.3 "F4.DAT")

When it’s remote, the inode number is shown along with the name of the FlexVol member volume where the inode actually lives, like this:

entry 9:  inum flexgroup_local__0002.20149, generation 16716505, type 1, name "F6.dat" (8.3 "F6.DAT")

In this example, the member volume flexgroup_local__0002 is where F6.dat lives.

 

When folders are created in a FlexGroup, ONTAP attempts to place them remotely more often than files to avoid piling up too many parent folders into a single member volume. When I run the script to create 8 folders, the placement is mapped out like this:

Next, the files get created. If the capacity and inode ingest calculations and probabilities allow it, ONTAP tries to place files in the same member volumes as their parent folders. Otherwise, the files are placed remotely.

 

In this example, we see that folder /FG/c2 was created on the first member volume in the FlexGroup. This is the directory listing for that folder from the client:

# ls -lah /mnt/flexgroup/c2

total 8.1G

drwxr-xr-x  2 root root 4.0K Oct 26 13:47 .

drwxr-xr-x 10 root root 4.0K Oct 26 13:45 ..

-rw-r--r--  1 root root 1.5K Oct 26 13:48 c2.log

-rw-r--r--  1 root root   24 Oct 26 13:48 c2.out

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F1.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F2.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F3.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F4.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F5.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:47 F6.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:47 F7.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:47 F8.dat

When we map out the inodes to see where the files land, this is what it looks like:

For folder /FG/c4, this is the directory listing:

# ls -lah /mnt/flexgroup/c4

total 8.1G

drwxr-xr-x  2 root root 4.0K Oct 26 13:47 .

drwxr-xr-x 10 root root 4.0K Oct 26 13:45 ..

-rw-r--r--  1 root root 1.5K Oct 26 13:48 c4.log

-rw-r--r--  1 root root   24 Oct 26 13:48 c4.out

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F1.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F2.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F3.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F4.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:46 F5.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:47 F6.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:47 F7.dat

-rw-r--r--  1 root root 1.0G Oct 26 13:47 F8.dat

And this is a graphical representation of the file mapping:

Even though it may appear that the files aren’t being evenly distributed, we can see from the output of volume show that the sizes and file counts are fairly even:

ontap9-tme-8040::*> vol show -vserver DEMO -volume flexgroup_local* -fields used,percent-used,size,files,files-used -sort-by used

vserver volume                size   used    percent-used files     files-used

------- --------------------- ------ ------- ------------ --------- ----------

DEMO    flexgroup_local__0002 2.50TB 85.80MB 5%           21251126  106

DEMO    flexgroup_local__0008 2.50TB 85.93MB 5%           21251126  106

DEMO    flexgroup_local__0003 2.50TB 86.05MB 5%           21251126  106

DEMO    flexgroup_local__0006 2.50TB 89.73MB 5%           21251126  107

DEMO    flexgroup_local__0004 2.50TB 90.34MB 5%           21251126  108

DEMO    flexgroup_local__0007 2.50TB 94.11MB 5%           21251126  107

DEMO    flexgroup_local__0005 2.50TB 94.19MB 5%           21251126  107

DEMO    flexgroup_local__0001 2.50TB 94.27MB 5%           21251126  114

DEMO    flexgroup_local       20TB   720.4MB 61%          170009008 861

9 entries were displayed.

And the flexgroup show output tells us that the next ingest will be done fairly evenly as well, given the nearly identical probabilities and target percentages:

With smaller files, the placement would be even more evenly distributed across member volumes.

Other (More Ideal) Tests

The GitHub repository mentioned earlier also contains a script that creates tons of small folders and files. You can use it to measure metadata performance and to see how a FlexGroup distributes files and folders across its member volumes. It’s a Python script that can be modified to create as many files and folders as you choose. This sort of test fits right into the sweet spot of a FlexGroup volume: high ingest with lots of small files and folders.
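The real test is the file-create.py script on GitHub, but purely as an illustration of the shape of the workload, a shell sketch that does something similar might look like this (the folder and file counts here are arbitrary placeholders):

#!/bin/bash
# Sketch only: the real test is the file-create.py script on GitHub.
# Creates lots of small files spread across folders, in parallel, to exercise ingest.
MOUNT=/mnt/flexgroup
FOLDERS=8
FILES=1000   # arbitrary counts; the Python script lets you dial these up much higher

time (
  for d in $(seq 1 $FOLDERS); do
    (
      mkdir -p $MOUNT/dir$d
      for f in $(seq 1 $FILES); do
        dd if=/dev/zero bs=4096 count=1 of=$MOUNT/dir$d/file$f.dat 2>/dev/null
      done
    ) &
  done
  wait
)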

 

I ran the file-create.py script to create about 550,000 objects in a FlexVol volume, a single-node (local) FlexGroup, and a 2-node FlexGroup, and measured the completion time.

FlexVol results

# python file-create.py /mnt/flexvol

Starting overall work: 2017-10-19 13:34:54.324235

End overall work: 2017-10-19 13:36:30.166374

total time: 95.842195034

Local FlexGroup results

# python file-create.py /mnt/flexgroup

Starting overall work: 2017-10-19 13:54:08.542322

End overall work: 2017-10-19 13:55:08.462744

total time: 59.9204668999

2-node FlexGroup results

# python file-create.py /mnt/flexgroup_16

Starting overall work: 2017-10-19 14:01:39.284467

End overall work: 2017-10-19 14:03:23.262777

total time: 103.97836113

The 2-node FlexGroup actually lagged behind the FlexVol volume. Why? Well, the cluster network, for starters. But also, we weren’t throwing enough data at the FlexGroup!

 

Then I tested with two clients running simultaneously against the same FlexVol volume and FlexGroups.
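To kick off the runs on both clients at roughly the same time, you can launch the script from a central host. This is only an illustrative sketch; the hostnames and mount paths are placeholders, not my actual lab setup:

# Sketch: start the same file-create run on two clients in parallel.
for client in client1 client2; do
  ssh root@$client "python file-create.py /mnt/flexgroup/$client" &
done
wait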

FlexVol results – two clients

# python file-create.py /mnt/flexvol/client1

Starting overall work: 2017-10-19 14:05:46.145822

End overall work: 2017-10-19 14:08:59.321446

total time: 193.175681114

Local FlexGroup results – two clients

# python file-create.py /mnt/flexgroup/client2

Starting overall work: 2017-10-19 14:12:07.467014

End overall work: 2017-10-19 14:13:35.712949

total time: 88.2459900379

2-node FlexGroup results – two clients

# python file-create.py /mnt/flexgroup_16/client2

Starting overall work: 2017-10-19 14:14:11.654738

End overall work: 2017-10-19 14:15:53.021375

total time: 101.366689205

The 2-node FlexGroup now greatly outperforms the FlexVol volume and is actually faster than when using the single client. The single-node FlexGroup still performs best in this test, but it would start to lag behind once CPU or hardware resources like disk utilization became a bottleneck.

 

In addition, the NetApp Customer Proof of Concept lab has been doing some great FlexGroup testing using vdbench in file system test mode. In that test, throughput was measured against a large directory structure of more than 10 million small files of various sizes. The workload was 75% read / 25% write, sequential, with files chosen at random. This was the configuration:

Some of the results from that test:

Ultimately, the best kind of test to run against a FlexGroup comes down to what your workload is doing. I hope this blog post helps you reconsider how you test FlexGroup volumes and guides you to getting the most out of your NetApp ONTAP storage system.

Justin Parisi

Justin Parisi is an ONTAP veteran, with over 10 years of experience at NetApp. He is co-host of the NetApp Tech ONTAP Podcast, a Technical Advisor with the NetApp A-Team, and a Technical Marketing Engineer at NetApp, focusing on FlexGroup volumes and all things NAS-related.