In the first post in this blog series, I talked about the challenges of operationalizing deep learning and other AI pipelines. These challenges can include I/O performance bottlenecks, as well as issues with moving, copying, and managing growing datasets.


Whether you execute your AI workflow on premises or in the cloud, operational bottlenecks can extend the time needed to complete each training cycle, reducing the productivity of your pipeline and eating up valuable staff time.

In this post, I’ll look at I/O bottlenecks from edge to core to cloud and the solutions that address them, including:

  • Bottlenecks that slow down data ingest
  • Bottlenecks on premises
  • Bottlenecks in the cloud

The three phases at the core of the pipeline in particular—data prep, training, and deployment—create unique I/O requirements that must be addressed.

Eliminate Bottlenecks at the Edge

The amount of data generated by smart edge devices and a large number of ingestion points can overwhelm compute, storage, and networks at the edge, creating bottlenecks as data moves into your data center or the cloud.


By applying edge-level analytics, you can process and selectively pass on data during ingest. This requires infrastructure at the edge with high-performance, ultra-low-latency storage. Many NetApp customers are taking a hierarchical approach, with infrastructure at the last mile and sensors at the edge acting as endpoints. A manufacturing customer I spoke to recently uses this approach.


Sensors from manufacturing equipment feed data into infrastructure deployed in each plant to aggregate and analyze data, so it can be passed selectively up the chain. This approach also makes sense for autonomous vehicles (where each endpoint can generate up to 7TB of data per day), retail, and many other fields.
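To make the aggregate-and-forward idea concrete, here is a minimal sketch of edge-level filtering. Everything in it is illustrative (the window schema, the z-score threshold, and the function names are my own, not from any NetApp product): each window of raw sensor readings is reduced to a compact summary, and only anomalous samples are forwarded up the chain.

```python
from statistics import mean, stdev

def summarize_window(readings, z_threshold=2.0):
    """Aggregate one window of raw sensor readings at the edge.

    Returns a compact summary plus only the anomalous raw samples,
    so the bulk of normal data never has to leave the plant.
    """
    mu = mean(readings)
    sigma = stdev(readings) if len(readings) > 1 else 0.0
    anomalies = [r for r in readings
                 if sigma and abs(r - mu) / sigma > z_threshold]
    return {"count": len(readings), "mean": mu,
            "stdev": sigma, "anomalies": anomalies}

# A window of mostly-normal temperature readings with one outlier:
window = [20.1, 19.9, 20.0, 20.2, 19.8, 95.0]
summary = summarize_window(window)
```

Only `summary` (a few dozen bytes) travels upstream instead of the full raw stream, which is the essence of selective pass-on during ingest.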


Locating NetApp® ONTAP® infrastructure near the edge allows you to manage data more efficiently in three ways:

  • Edge analytics can be applied, reducing the amount of data that needs to flow back from edge to core.
  • Smart data replication—such as NetApp SnapMirror®—eliminates the need to transmit data blocks that have previously been sent, reducing bandwidth requirements.
  • Storage efficiency technologies such as compression, compaction, and deduplication further reduce the amount of data sent over the wire.
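The second bullet, incremental replication, can be sketched in a few lines. This is not SnapMirror itself, only the general block-fingerprinting idea behind any incremental replication scheme; the block size and function names are illustrative assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size, not ONTAP's actual geometry

def block_hashes(data: bytes) -> list:
    """Split data into fixed-size blocks and fingerprint each one."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def changed_blocks(old: bytes, new: bytes) -> list:
    """Return indices of blocks whose fingerprints differ.

    Only these blocks need to cross the wire; unchanged blocks
    were already transmitted in an earlier replication cycle.
    """
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or h != old_h[i]]

base = bytes(BLOCK_SIZE * 4)                  # 4 blocks of zeros
update = bytearray(base)
update[BLOCK_SIZE:BLOCK_SIZE + 4] = b"edit"   # modify block 1 only
delta = changed_blocks(base, bytes(update))
```

Here a one-block edit to a four-block file means only a quarter of the data is resent, which is why this approach cuts edge-to-core bandwidth requirements.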

NetApp addresses edge challenges cost-effectively with software-defined ONTAP Select running on commodity hardware, enabling data aggregation, analytics, intelligent replication, and advanced storage efficiency with high density and a small footprint. Where greater I/O performance is required, dedicated ONTAP storage systems can be deployed. NetApp is also complementing ONTAP storage with our Plexistor technology on servers, which I’ll explain in more detail later in this post.

Eliminate Bottlenecks On Premises

Data Lake

If the core of your deep learning pipeline is on premises, data flowing in from the edge gets collected in a data lake. An improperly implemented data lake becomes a bottleneck as the amount of data grows. A data lake can take the form of a Hadoop deployment with HDFS, or it can be implemented by using either an object store or a file store. HDFS is not optimized for performance and typically maintains three copies of each data block, slowing write performance and increasing cost.
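The cost of that triple replication is simple arithmetic, sketched below (the function is my own back-of-the-envelope illustration; the factor of 3 is HDFS's default replication factor):

```python
def hdfs_footprint(dataset_tb: float, replication: int = 3) -> dict:
    """Back-of-the-envelope cost of HDFS-style full replication:
    every logical byte is written `replication` times."""
    return {"logical_tb": dataset_tb,
            "raw_capacity_tb": dataset_tb * replication,
            "write_amplification": replication}

cost = hdfs_footprint(100.0)   # a 100TB dataset
```

A 100TB dataset consumes 300TB of raw capacity and triples the write traffic, which is exactly the write-performance and cost penalty described above.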


Object stores were originally intended for cloud archiving, not performance, but in many cases they have become the de facto data store for big data projects. For deep learning in particular, object stores leave a lot to be desired where performance is concerned.


Turning to file stores, many scale-out file systems are designed for HPC batch processing and don’t deal well with small-file workloads. Data flowing into the data lake from smart edge devices tends to arrive as many small files, which these systems are not optimized for, so performance suffers.


NetApp All Flash FAS (AFF), especially when using FlexGroup volumes, overcomes the limitations of other data lake approaches. FlexGroup can deliver high performance for both bandwidth-oriented batch workloads and small-file workloads. The other data lake solutions mentioned—HDFS, object storage, and traditional file storage—may do one or the other, but they can’t deliver good performance for both sequential and random I/O.

Training Cluster

The current state of the art in deep learning training clusters is a scale-out cluster with 32 to 64 servers and 4 to 8 GPUs per server. From an I/O standpoint, you have to keep all those GPUs 100% busy. That means delivering a parallel I/O stream to each CPU core; each core has an affinity to a GPU, processes its stream, coalesces the I/O, and feeds the data to its GPU.


This process introduces I/O bottlenecks in the following ways:

  • Data has to be streamed quickly and efficiently from the data lake into the training cluster.
  • Up to 512 parallel I/O streams (32 to 64 servers, each with 4 to 8 GPUs) need to be kept fully loaded and staged to feed GPUs so that they never have to wait for data.
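The staging pattern described above can be sketched with threads and a bounded queue. This is a toy model, not any real training framework: each worker thread plays the role of one CPU core reading its own I/O stream and coalescing records into batches, while the bounded queue represents the staged, prefetched data that keeps the GPU from ever waiting.

```python
import queue
import threading

BATCH = 4          # records coalesced per batch (illustrative)
PREFETCH = 8       # batches staged ahead of the consumer

def reader(stream_id, records, out):
    """Simulate one CPU core: read an I/O stream and coalesce
    its records into GPU-ready batches."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == BATCH:
            out.put((stream_id, batch))
            batch = []
    if batch:
        out.put((stream_id, batch))

staged = queue.Queue(maxsize=PREFETCH)  # bounded: applies backpressure
streams = {i: list(range(i * 100, i * 100 + 8)) for i in range(4)}
workers = [threading.Thread(target=reader, args=(i, recs, staged))
           for i, recs in streams.items()]
for w in workers:
    w.start()
for w in workers:
    w.join()

batches = []
while not staged.empty():
    batches.append(staged.get())
```

In a real cluster there would be hundreds of such streams, and the storage system must sustain enough parallel bandwidth that every queue stays full.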

The ONTAP architecture uniquely satisfies both of these requirements. The data lake can be designed using hybrid FAS nodes, which can stream data into the training cluster with extremely high bandwidth. AFF nodes supporting the training cluster deliver bandwidth up to 18GB/sec per two-controller HA pair and sub-500 microsecond latencies, providing the bandwidth to support many I/O streams in parallel. Other flash solutions can’t achieve these latencies or bandwidth. NetApp also offers a technology roadmap that allows you to continue to grow the I/O performance of your deep learning pipeline as your needs grow.


Once training completes, the resulting inference models are put into a DevOps-style repository and subjected to inference testing and hypothesis validation. This is where the extremely low latency of AFF running ONTAP comes into play.


With NetApp, a single storage architecture addresses all the performance needs for the core of your deep learning pipeline. Although this has immediate advantages, current practice for most customers is to operate separate clusters for each stage of the pipeline. Big data pipelines may be in place with a data lake already deployed. You may want to implement just the new elements needed for deep learning as a separate project and copy data from phase to phase. As data continues to grow, however, you’ll need to further unify the pipeline. AFF also makes this unification possible, a topic I’ll discuss in the next blog.

Eliminate Bottlenecks in the Cloud

You may decide to deploy deep learning in the cloud for agility and ease of consumption. However, the same bottlenecks often apply when you run your deep learning pipeline in the cloud:

  • Can your data lake deliver the necessary performance for data ingest? Can it stream data into the training cluster?
  • Can your cloud provider deliver the I/O parallelism required for the training cluster?
  • How can you deliver the ultra-low latency required for finished inference models?
  • What if you need to ensure data sovereignty for sensitive data?

NetApp Private Storage (NPS) allows you to store your data near the cloud so that you can use public cloud compute capabilities and other services, while maintaining full control over your data. NPS brings the same architecture and the same performance described in the previous section to the public cloud. Data sovereignty issues are eliminated, and your data never gets locked into the cloud.


If your data lake absolutely must be in the cloud, Azure NFS as a Service (Azure NFSaaS) offers an alternative.

Future-Proof Your Deep Learning Pipeline

It’s almost certain that the size of deep learning datasets and the I/O requirements of your deep learning pipeline will continue to grow as you increase the number of servers, and as CPUs, GPUs, and purpose-built AI silicon continue to grow in power. The NetApp roadmap incorporates a number of elements that will enable you to scale I/O to keep pace. These include:

  • NVMe over Fabrics. Incorporating NVMe-oF as part of the AFF architecture will allow NetApp to drive latencies an order of magnitude lower.
  • Plexistor. In June 2017, NetApp acquired Plexistor, giving us a server-side storage technology that drives down latencies even further, extending the NetApp Data Fabric into the server. Plexistor can be deployed at the edge, in the core, and in the cloud, accelerating data ingest, edge analytics, and training.

Visit the NetApp NVMe page to learn more about NetApp’s vision and plans for these technologies. Watch a 20-minute video, “Creating the Fabric of a New Generation of Enterprise Apps,” presented by NetApp Strategist and Chief Evangelist Jeff Baxter.



Next time, I’ll examine NetApp technologies that simplify data management and smooth the flow of data across the entire deep learning pipeline from end to end.

Santosh Rao

Santosh Rao is a Senior Technical Director who leads the AI & Data Engineering Full Stack Platform at NetApp. In this role, he is responsible for the technology architecture, execution, and overall NetApp AI business.

Santosh previously led the Data ONTAP technology innovation agenda for workloads and solutions spanning NoSQL, big data, virtualization, enterprise apps, and other 2nd and 3rd platform workloads. He has held a number of roles within NetApp and led the original ground-up development of clustered ONTAP SAN, as well as a number of follow-on ONTAP SAN products for data migration, mobility, protection, virtualization, SLO management, app integration, and all-flash SAN.

Prior to joining NetApp, Santosh was a Master Technologist at HP, where he led the development of a number of early-generation storage and operating system technologies.
