In my previous post in this series, I talked about approaches for eliminating I/O bottlenecks across your entire deep learning pipeline, whether the core of your pipeline is on-premises or in the cloud.
This time I want to talk more specifically about evolving approaches to infrastructure and data management that can help you cope better with massive data growth:
- Tiered data management
- Smart data movers
- Planning ahead for edge-to-cloud
- Architecting the core for faster hardware evolution
Eliminating bottlenecks (as discussed last time), managing data deliberately, and planning infrastructure carefully, as described below, lets you complete more iterations through your deep learning pipeline in less time. That leads to better models and a shorter time to implementation.
Tiered Data Management at the Edge
In my previous post I talked about the need for edge-level analytics to allow you to selectively pass on data from the edge. This gives you the ability to create tiers of data service based on the results of edge analytics.
With this approach, some data is prioritized—using either simple filtering or advanced analytics and AI—and passed efficiently into the deep learning pipeline. Other data is de-prioritized, and may either be discarded or managed with a different class of service. For example, in a variety of industrial processes thermographic cameras are used to establish a baseline for normal operating conditions. Once the baseline is established, AI algorithms filter the data so that normal state data is not persisted.
Depending on requirements, each tier of data can be processed with different transformations to achieve the necessary levels of storage efficiency and security. Low-priority data might be compressed, encrypted, and stored in a cloud repository for compliance or in case it’s needed for later processing.
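As a rough illustration of the tiering idea, the sketch below learns a baseline from a few "normal" readings, passes out-of-range readings on as high priority, and compresses the rest into a low-priority archive. Everything in it is an assumption made for the example: the names `Baseline`, `learn_baseline`, and `route`, the `temp_c` field, and the three-standard-deviation threshold. A real deployment would use whatever analytics fit the process, and encryption of the archive is left out for brevity.

```python
import json
import zlib
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Baseline:
    """Normal operating range learned from an initial set of readings (hypothetical)."""
    center: float
    spread: float

    def is_normal(self, value: float, tolerance: float = 3.0) -> bool:
        # "Normal" here means within `tolerance` standard deviations of the mean.
        return abs(value - self.center) <= tolerance * self.spread


def learn_baseline(samples: list[float]) -> Baseline:
    return Baseline(center=mean(samples), spread=stdev(samples))


def route(readings: list[dict], baseline: Baseline) -> tuple[list[dict], bytes]:
    """Split readings into a high-priority list and a compressed low-priority archive."""
    high, low = [], []
    for reading in readings:
        tier = low if baseline.is_normal(reading["temp_c"]) else high
        tier.append(reading)
    # Low-priority tier: serialize and compress before sending to cheaper storage.
    archive = zlib.compress(json.dumps(low).encode("utf-8"))
    return high, archive


baseline = learn_baseline([70.1, 69.8, 70.3, 70.0, 69.9, 70.2])
readings = [
    {"sensor": "cam-1", "temp_c": 70.1},
    {"sensor": "cam-1", "temp_c": 94.6},  # well outside the baseline
    {"sensor": "cam-2", "temp_c": 69.9},
]
high_priority, low_priority_archive = route(readings, baseline)
```

In this sketch only the out-of-range reading would enter the deep learning pipeline; the archive could instead be discarded if normal-state data does not need to be kept.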
The ability to process analytics at the edge is a function of the compute power available. We’re seeing compute and cloud vendors competing for the edge footprint with various strategies. For example, NVIDIA is bringing GPU power to the edge to enable AI for applications such as self-driving cars. From a data perspective, what all these solutions have in common is that they rely on commodity direct-attached storage (DAS), which lacks intelligent data management. There’s an obvious need here for intelligent data storage such as NetApp ONTAP 9 on AFF and FAS, or ONTAP Select. These technologies enable a Data Fabric that facilitates transformation and movement of data.
Many edge environments, such as self-driving cars and oil field deployments, are potentially harsh in terms of temperature, shock, and vibration, and could require ruggedized hardware. NetApp has partnered with Vector Data to deliver ruggedized solutions that combine Vector Data hardware with ONTAP Select software. Two options are available: a hyperconverged option that combines compute, storage, and networking in a single box, and a storage-only solution. Both options are Data Fabric capable.
Smart Data Movers
When you look at what it takes to deliver data at high bandwidth from the edge, one of the key elements is having a smart data mover. In the most common architecture today, data is moved using full data moves in the form of S3 puts. This has the disadvantage of moving data wholesale without applying any data transformation.
Replacing this crude method with smart data movers can dramatically accelerate data movement and reduce bandwidth requirements. A smart data mover coalesces the data, applies data transformations to reduce the data footprint, and then applies network transformations so that only changed blocks are moved. This is how NetApp SnapMirror moves data in the Data Fabric.
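To make the contrast with a full copy concrete, here is a minimal sketch of changed-block transfer: the sender compares per-block hashes against the previous version, then compresses and ships only the blocks that differ. This is a generic illustration of the technique, not a description of how SnapMirror is implemented; the function names and the 4 KiB `BLOCK_SIZE` are assumptions chosen for the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # illustrative block size, not a product setting


def block_digests(data: bytes) -> list[bytes]:
    """Hash each fixed-size block so versions can be compared cheaply."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)]


def build_delta(previous_digests: list[bytes], current: bytes) -> dict[int, bytes]:
    """Return {block_index: compressed_block} for blocks that are new or changed."""
    delta = {}
    for index, offset in enumerate(range(0, len(current), BLOCK_SIZE)):
        block = current[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        if index >= len(previous_digests) or previous_digests[index] != digest:
            delta[index] = zlib.compress(block)
    return delta


def apply_delta(previous: bytes, delta: dict[int, bytes], new_length: int) -> bytes:
    """Rebuild the current version on the receiver from the old copy plus the delta."""
    blocks = [previous[i:i + BLOCK_SIZE] for i in range(0, len(previous), BLOCK_SIZE)]
    for index, payload in delta.items():
        block = zlib.decompress(payload)
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)  # delta indices arrive in increasing order
    return b"".join(blocks)[:new_length]


old_version = bytes(16 * BLOCK_SIZE)           # 16 blocks of zeros
changed = bytearray(old_version)
changed[5 * BLOCK_SIZE + 10] = 1               # modify a single byte in block 5
new_version = bytes(changed)

delta = build_delta(block_digests(old_version), new_version)
bytes_sent = sum(len(payload) for payload in delta.values())
rebuilt = apply_delta(old_version, delta, len(new_version))
```

Here a one-byte change results in a single compressed block crossing the network, whereas a full put would resend all 16 blocks.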
Transitioning to Edge-to-Cloud
Many enterprises are using an edge-to-core strategy for deep learning today with the keen awareness that they may need to shift to an edge-to-cloud strategy in the future based on industry changes and competitive needs. For example, companies in the healthcare and financial services markets have been among the most hesitant to commit data to public cloud for reasons of data sovereignty, security, and compliance.
However, Amazon has published several case studies highlighting successes for well-known financial and healthcare companies. Other companies in those verticals that I’ve talked to are already scrambling to see how they can evolve their own strategies to avoid being left behind.
Some applications of deep learning have a natural affinity for edge-to-cloud. For example, for applications where the endpoints are smart devices, including smartphones, tablets, health trackers such as Fitbit, and similar devices, there is a “virtual edge” in the form of a SaaS application in the cloud that aggregates data. In these cases, processing of AI and deep learning is naturally co-located in the cloud. But even these customers are starting to look at how they can use different clouds for different use cases, with minimal dependency on the cloud stack itself so that they can move easily from one cloud vendor to another. This also requires separation of data from the AI stack.
Evolving from an edge-to-core-to-cloud strategy to an edge-to-cloud strategy requires careful thinking about how you separate data from compute. Specifically, you need the ability to separate the dataset, and the lifecycle management of that dataset, from the AI provider.
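One way to picture this separation in code is a narrow storage interface that the training side depends on, so the dataset can live anywhere that implements it. The sketch below is only an illustration: `DatasetStore`, `InMemoryStore`, and `training_batches` are hypothetical names, and the in-memory store stands in for whatever on-premises or cloud storage actually holds the data.

```python
from typing import Iterator, Protocol


class DatasetStore(Protocol):
    """Hypothetical interface: the only thing training code knows about storage."""

    def list_samples(self) -> list[str]: ...

    def read(self, name: str) -> bytes: ...


class InMemoryStore:
    """Stand-in backend; a real one might wrap a file system or an object store."""

    def __init__(self, samples: dict[str, bytes]) -> None:
        self._samples = dict(samples)

    def list_samples(self) -> list[str]:
        return sorted(self._samples)

    def read(self, name: str) -> bytes:
        return self._samples[name]


def training_batches(store: DatasetStore, batch_size: int) -> Iterator[list[bytes]]:
    """Yield batches using only the DatasetStore interface, never a vendor API."""
    names = store.list_samples()
    for start in range(0, len(names), batch_size):
        yield [store.read(name) for name in names[start:start + batch_size]]


store = InMemoryStore({"a": b"1", "b": b"2", "c": b"3"})
batches = list(training_batches(store, batch_size=2))
```

Because the batching code never touches a provider-specific API, moving to a different AI provider or cloud would mean supplying a different store implementation rather than rewriting the pipeline.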
As we saw last time, one approach to this is NetApp Private Storage or NPS, which allows you to co-locate dedicated NetApp storage—including hybrid FAS systems and/or All Flash FAS (AFF) storage—in Equinix datacenters with high-speed access to multiple clouds and very high I/O throughput.
Another alternative is to utilize virtualized, cloud-based storage as a faster alternative to S3 and other object storage. You can use ONTAP Cloud today, with Plexistor as a potential option in the future.
A key element for enabling high-performance deep learning in the cloud is the ability to look beyond the limitations of S3. While S3 is the de facto storage protocol, it is not geared for performance. A software-defined solution can take advantage of both the cloud footprint and new fast media, including storage class memory (SCM). ONTAP Select and Plexistor are examples of NetApp’s ability to bring those capabilities to market.
Core Hardware Evolution
The advantage of the cloud is that you can consume a deep learning service without having to understand the intricacies of the hardware stack. However, you pay a price for that convenience in terms of loss of control. When you consider the core of the on-premises deep learning pipeline, one of the key trends is the continued evolution of the hardware required.
There’s an intense battle taking place over who will be the hardware vendor of choice for deep learning. While NVIDIA has the clear lead at the moment, there are many emerging technologies. Each cloud vendor is building its own hardware. Google’s Tensor Processing Unit (TPU) is one example. Many startups are also building custom AI hardware.
One trend to watch for is the ability to separate the server infrastructure from the GPU infrastructure, allowing the two to evolve independently. A solution that abstracts GPU hardware from server and storage hardware will be able to evolve more easily (and at lower cost) to take advantage of new developments.
NetApp offers converged infrastructure with a variety of server partners, including Cisco (FlexPod) and Fujitsu (NFLEX). This means you can easily take advantage of NetApp storage in conjunction with a variety of server platforms when building out your deep learning cluster.
By following the guidelines I’ve outlined here, you can:
- Implement intelligent data management at the edge to better cope with data growth
- Move data more intelligently and efficiently from the edge to the core or the cloud
- Be prepared to transition to an edge-to-cloud model if that becomes necessary
- Create a more agile deep learning hardware architecture that can evolve more quickly
Next time, I’ll examine NetApp technologies that simplify data management and smooth the flow of data across the entire deep learning pipeline from end to end.