In my previous blog, I talked about the significant advantages that ONTAP AI provides for anyone who wants to deploy AI infrastructure quickly and get results fast. ONTAP AI combines:

  • NVIDIA DGX-1 AI supercomputers
  • NVIDIA’s fully optimized deep learning software stack
  • NetApp’s AFF A800 for breakthrough I/O performance
  • NetApp Data Fabric for edge to core to cloud connectivity and data management
  • Cisco Nexus 3232C switches

ONTAP AI is designed to deliver superior scalability and performance, offering up to 25x greater raw capacity and 6x greater I/O performance than other turnkey solutions. Last time, I mentioned that NVIDIA and NetApp have done a lot of work to characterize the performance of ONTAP AI. This time, as promised, I'm going to dig into those performance metrics.

How We Did Performance Testing

ONTAP AI compute performance scales through the addition of NVIDIA DGX-1 servers. I/O bandwidth scales through the addition of AFF A800 HA pairs. Each AFF A800 HA controller pair delivers bandwidth up to 25GB/s. With up to 12 HA pairs in a single cluster, that's a maximum bandwidth of 300GB/s.
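
As a quick back-of-the-envelope check, a sketch like the following shows how that bandwidth math works out. It uses only the 25GB/s-per-pair and 12-pair figures above; Python is used purely for illustration.

```python
# Sketch: estimate aggregate ONTAP AI cluster bandwidth from HA pair count.
# Figures come from the text above: ~25 GB/s per AFF A800 HA pair,
# up to 12 HA pairs in a single cluster.

GBPS_PER_HA_PAIR = 25   # approximate per-pair bandwidth (GB/s)
MAX_HA_PAIRS = 12       # maximum HA pairs in a single cluster

def cluster_bandwidth_gbps(ha_pairs: int) -> int:
    """Aggregate bandwidth for a cluster with the given number of HA pairs."""
    if not 1 <= ha_pairs <= MAX_HA_PAIRS:
        raise ValueError(f"ha_pairs must be between 1 and {MAX_HA_PAIRS}")
    return ha_pairs * GBPS_PER_HA_PAIR

print(cluster_bandwidth_gbps(1))   # 25 GB/s (our test configuration)
print(cluster_bandwidth_gbps(12))  # 300 GB/s (full scale-out)
```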


Our test configuration for the performance graphs I show below included four DGX-1 servers and a single AFF A800 HA pair. Over time, we’ll test even larger configurations with additional DGX systems.


In our testing we used several well-known and well-characterized neural network models for image classification as a reference point:

  • ResNet-152 is generally considered the most accurate of these training models
  • ResNet-50 delivers better accuracy than AlexNet with faster processing time
  • VGG16 produces the highest inter-GPU communication
  • Inception-v3 is another common TensorFlow model

These models were used to characterize the performance of ONTAP AI (a minimal benchmarking sketch follows the list below), with the express goal of providing:

  • Infrastructure sizing information for architects
  • Neural network sizing information for data scientists
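
To make the images-per-second measurements concrete, here's a minimal sketch of how one might benchmark one of these models on synthetic data. This is an illustrative TensorFlow/Keras script, not the test harness we used; the batch size and step count are arbitrary placeholders.

```python
# Sketch: measure training throughput (images/sec) for ResNet-50 on
# synthetic data. Illustrative only -- not the harness used in our tests.
import time
import tensorflow as tf

BATCH_SIZE = 64   # arbitrary; tune to GPU memory
STEPS = 100       # arbitrary number of timed steps

model = tf.keras.applications.ResNet50(weights=None)  # train from scratch
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# Synthetic ImageNet-shaped batches keep storage out of the measurement.
images = tf.random.uniform((BATCH_SIZE, 224, 224, 3))
labels = tf.random.uniform((BATCH_SIZE,), maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensors((images, labels)).repeat(STEPS)

start = time.time()
model.fit(dataset, epochs=1, verbose=0)
elapsed = time.time() - start
print(f"~{BATCH_SIZE * STEPS / elapsed:.0f} images/sec")
```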

Performance Metrics for AI Infrastructure Architects

When you’re designing AI infrastructure, you have to deliver balanced data pipeline performance across storage systems, networks, and GPU hosts to avoid bottlenecks. As I’ve described in previous blogs (you’ll find a list of all the blogs in this series at the bottom of this post), I/O bandwidth between storage and the training cluster is critical. Therefore, we focused on I/O bandwidth as a key metric for architects.


In the graphs below, the dark blue bars represent data movement from the data lake into the AI training cluster as training commences. Even though this is a fairly small dataset and a non-production neural net workload, the test configuration required ONTAP AI to deliver 5GB/s peak throughput to load the training cluster.

While delivering this level of performance, the AFF storage CPUs are at ~18% utilization, showing there is plenty of headroom to support more data traffic from many more GPUs.


The graphs also include data from Flexible I/O (fio), an I/O workload generator that comes closer to exercising the system’s full I/O potential. The sustained bandwidth achieved in this setup was ~4GB/s. As the neural network models run, they write periodic checkpoints to storage and then occasionally re-read that information, generating random I/O.
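
For reference, a sequential-read fio job similar in spirit to this test might look like the sketch below. The mount point, job size, and thread count are placeholders, not our actual test parameters; the script is Python to match the other examples, but the fio flags themselves are standard.

```python
# Sketch: drive a sequential-read fio job against an NFS mount.
# All parameter values below are illustrative placeholders.
import subprocess

subprocess.run([
    "fio",
    "--name=seqread",
    "--directory=/mnt/a800_export",  # hypothetical NFS mount point
    "--rw=read",            # sequential reads; use randread for random I/O
    "--bs=1M",              # large blocks, typical for streaming reads
    "--ioengine=libaio",
    "--direct=1",           # bypass the client page cache
    "--numjobs=8",          # parallel workers, placeholder value
    "--size=10G",           # per-job file size, placeholder value
    "--runtime=300", "--time_based",
    "--group_reporting",
], check=True)
```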


The following graph shows that all GPUs in the setup maintain ~95% utilization, and that utilization remains consistent regardless of how much data is coming from the storage system, demonstrating that storage access is not a bottleneck to GPU performance with these workloads.
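
If you want to watch for exactly this behavior on your own cluster, a small NVML-based sampler is one way to do it. This is a sketch assuming the pynvml package is installed; nvidia-smi reports the same counters.

```python
# Sketch: sample per-GPU utilization once per second via NVML.
# Assumes the pynvml package (pip install nvidia-ml-py) and NVIDIA drivers.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        print(" ".join(f"GPU{i}:{u:3d}%" for i, u in enumerate(utils)))
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```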

These tests demonstrate the performance and scalability advantage of ONTAP AI. A single AFF A800 controller pair can provide 25GB/s, delivering 5x the level of performance needed in this case.


So how do you use this information in practice? If your POC demonstrates this type of bandwidth and controller utilization, you can use this information as a scaling factor to estimate your production requirements.
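
As a concrete example of that scaling math, using the numbers above (5GB/s peak for 32 GPUs at ~18% controller CPU), a sketch might look like the following. The production GPU count is a hypothetical target, not a tested configuration.

```python
# Sketch: extrapolate POC measurements to production requirements.
# POC figures come from the text above; the production GPU count is
# a hypothetical target, not a tested configuration.
import math

POC_PEAK_GBPS = 5.0    # measured peak throughput in the POC
POC_GPUS = 32          # four DGX-1 servers x 8 GPUs
HA_PAIR_GBPS = 25.0    # bandwidth of one AFF A800 HA pair

target_gpus = 128      # hypothetical production cluster size

gbps_per_gpu = POC_PEAK_GBPS / POC_GPUS
required_gbps = target_gpus * gbps_per_gpu
ha_pairs = math.ceil(required_gbps / HA_PAIR_GBPS)

print(f"Estimated need: {required_gbps:.1f} GB/s -> {ha_pairs} HA pair(s)")
# 128 GPUs -> ~20 GB/s, still within a single HA pair's 25 GB/s
```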


If you’re early in your AI journey, you may consider scaling down to another AFF platform to reduce cost. NetApp AI gives you complete flexibility to start small and scale out as your needs grow.

Performance Metrics for AI Data Scientists

In the world of data scientists, the metric of choice is images per second. While this technically relates to computer vision, it has become the de facto metric for spec’ing out neural network performance. The graph below compares the performance of 1-node, 2-node, and 4-node DGX-1 clusters. (Each DGX-1 has 8 GPUs.)

As you can see, performance scales linearly as GPUs are added. If a data scientist can spec out an AI job in terms of dataset size and desired images per second and feed that information to the infrastructure team, the infrastructure team can use it either (see the sizing sketch after this list):

  • To size new infrastructure
  • To allocate existing infrastructure (from a larger cluster) to meet the requirement
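
Here's a minimal sketch of that handoff. The per-GPU throughput and the target numbers are hypothetical benchmark values; the linear-scaling assumption comes from the graph above.

```python
# Sketch: translate a data scientist's spec (desired images/sec) into a
# DGX-1 node count, assuming the linear scaling shown above.
# The per-GPU throughput below is a hypothetical benchmark result.
import math

GPUS_PER_DGX1 = 8
images_per_sec_per_gpu = 360.0   # hypothetical, from a single-GPU benchmark

target_images_per_sec = 9000.0   # hypothetical data-scientist requirement

gpus_needed = math.ceil(target_images_per_sec / images_per_sec_per_gpu)
nodes_needed = math.ceil(gpus_needed / GPUS_PER_DGX1)

print(f"{gpus_needed} GPUs -> {nodes_needed} DGX-1 server(s)")
# 9000 / 360 = 25 GPUs -> 4 DGX-1 servers
```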

ONTAP AI Inferencing Performance

In addition to training performance, we also measured inferencing performance using the same ONTAP AI configuration. The following graph demonstrates the high-throughput inferencing performance of ONTAP AI using both Tensor Cores and CUDA cores.
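
As a rough illustration of Tensor Core-driven inference, the sketch below enables mixed precision (which routes eligible matrix math to Tensor Cores on Volta-class GPUs) and times batched prediction. It's a generic TensorFlow example, not our actual inferencing harness; batch size and batch count are arbitrary.

```python
# Sketch: time mixed-precision inference throughput on synthetic data.
# Mixed float16 math runs on Tensor Cores (Volta and later); float32
# operations run on CUDA cores. Illustrative only.
import time
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

BATCH_SIZE = 128  # arbitrary
BATCHES = 50      # arbitrary

model = tf.keras.applications.ResNet50(weights=None)
images = tf.random.uniform((BATCH_SIZE, 224, 224, 3))

model.predict(images, verbose=0)  # warm-up: builds the graph off the clock
start = time.time()
for _ in range(BATCHES):
    model.predict(images, verbose=0)
elapsed = time.time() - start
print(f"~{BATCH_SIZE * BATCHES / elapsed:.0f} images/sec (inference)")
```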

The Best of All Worlds

The sizing information and inferencing performance data we’ve gathered for ONTAP AI can help you map between the worlds of data science and AI infrastructure. ONTAP AI lets you start small and scale I/O bandwidth and raw capacity (up to 300GB/s and 79PB) higher than other turnkey solutions.


If you contrast the NetApp all flash storage in ONTAP AI to possible alternatives, some clear distinctions are apparent. Legacy NFS systems and parallel file systems have been optimized for sequential I/O. Only NetApp can accommodate the combination of sequential and random I/O that occurs in AI. This is a by-product of the fact that NetApp spent years optimizing its systems to support highly random OLTP database workloads on NFS at a time when the rest of the industry felt workloads like Oracle and SAP could only run on block storage.


New and emerging NFS systems don’t come close to the bandwidth or capacity of the AFF A800 at full scale-out, making ONTAP AI the only choice for teams that want to deploy AI infrastructure now with the certainty that they have the headroom to meet future requirements.


Check out these resources to learn more about NetApp AI. The NetApp Validated Architecture provides additional information on ONTAP AI performance and sizing.

Previous blogs in this series:

  1. Is Your IT Infrastructure Ready to Support AI Workflows in Production?
  2. Accelerate I/O for Your Deep Learning Pipeline
  3. Addressing AI Data Lifecycle Challenges with Data Fabric
  4. Choosing an Optimal Filesystem and Data Architecture for Your AI/ML/DL Pipeline
  5. NVIDIA GTC 2018: New GPUs, Deep Learning, and Data Storage for AI
  6. Five Advantages of ONTAP AI for AI and Deep Learning

Santosh Rao

Santosh Rao is a Senior Technical Director for the Data ONTAP Engineering Group at NetApp. In this role, he is responsible for the Data ONTAP technology innovation agenda for workloads and solutions ranging from NoSQL, Big Data, and Deep Learning to other 2nd and 3rd Platform workloads.

He has held a number of roles within NetApp and led the original ground-up development of Clustered ONTAP SAN for NetApp, as well as a number of follow-on ONTAP SAN products for Data Migration, Mobility, Protection, Virtualization, SLO Management, App Integration, and All Flash SAN. Prior to joining NetApp, Santosh was a Master Technologist at HP, where he led the development of a number of storage and operating system technologies, including HP’s early-generation products across a variety of storage and OS areas.