Enterprises are eager to take advantage of artificial intelligence (AI) technologies as a means to introduce new services and enhance insights from company data. However, as data science teams move past proof of concept and begin to operationalize deep learning, many are experiencing issues with data management. They struggle to deliver the necessary performance, and they find it challenging to move and copy data and to optimize storage for large and growing datasets.
The data flow necessary for successful AI isn’t isolated to the data center. As enterprises of all types embrace IoT and AI technologies, they face data challenges from edge to core to cloud.
For example, NetApp is partnering with several automotive companies that are gathering data from growing numbers of vehicles. This data is used to train the AI algorithms necessary for autonomous operation. In the process, these companies are literally driving IT to its limits.
Retailers are creating inference models based on data gathered from point-of-sale devices across hundreds of retail locations around the world. Late November to New Year’s Day is the busiest time of the year for most retailers, so it’s easy to imagine the huge spike in data that they are currently experiencing.
Some vendors would have you believe that the AI data challenge is only about delivering performance. But that’s primarily an issue in the core of the AI pipeline—the only place their solutions play. NetApp® Data Fabric technologies work together to encompass the entire data flow, from ingest to archive, ensuring your operational success while delivering optimal performance, efficiency, and cost at every phase.
In this blog series, I’ll talk about AI infrastructure challenges and describe how NetApp can help you build a data pipeline to enable deep learning. Because deep learning is the most demanding AI workflow in terms of both computation and I/O, a data pipeline designed for deep learning will also accommodate other AI and big data workflows.
You can find out more about the advantages of NetApp solutions for AI from the infographic: 10 Good Reasons to Choose NetApp for Machine Learning.
Data Flow in a Deep Learning Pipeline
Let’s start by considering the workflow necessary in a deep learning pipeline, as shown in the following figure.
- Data ingest. Ingestion usually occurs at the edge—for example, capturing data streaming from cars or point-of-sale devices. Depending on the use case, IT infrastructure might be needed at or near the ingestion point. For instance, a retailer might need a small footprint in each store, consolidating data from multiple devices.
- Data prep. Preprocessing is necessary to normalize data before training. Preprocessing takes place in a data lake, possibly in the cloud in the form of an S3 tier, or on premises as a file store or object store.
- Training. For the critical training phase of deep learning, data is typically copied from the data lake into the training cluster at regular intervals. Servers used in this phase use GPUs to parallelize operations, creating a tremendous appetite for data. Raw I/O bandwidth is crucial.
- Deployment. The resulting model is pushed out to be tested and then moved to production. Depending on the use case, the model might be deployed back to edge operations. Real-world results of the model are monitored, and feedback in the form of new data flows back into the data lake along with new incoming data, so that the process can iterate.
- Archive. Cold data from past iterations may be saved indefinitely. Many AI teams want to archive cold data to object storage, in either a private or public cloud.
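The five stages above can be written down as a small data structure, which is sometimes a useful starting point when mapping storage requirements to each phase. This is only an illustrative sketch: the `Stage` class, the `PIPELINE` tuple, and the `next_stage` helper are hypothetical names, and the field values paraphrase the list above rather than describe any NetApp product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    location: str      # where the stage typically runs, per the list above
    storage_need: str  # dominant storage requirement, paraphrased


# Order matters: each stage feeds the next one.
PIPELINE = (
    Stage("ingest", "edge", "small footprint consolidating many devices"),
    Stage("prep", "data lake, in the cloud or on premises", "file or object store"),
    Stage("training", "training cluster", "raw I/O bandwidth to feed GPUs"),
    Stage("deployment", "test, production, possibly edge", "low-latency model access"),
    Stage("archive", "private or public cloud", "object storage for cold data"),
)

# Feedback loop: monitored results from deployment flow back into the data lake.
FEEDBACK = ("deployment", "prep")


def next_stage(name: str) -> Optional[str]:
    """Return the stage that follows `name`, or None after the last stage."""
    names = [stage.name for stage in PIPELINE]
    i = names.index(name)
    return names[i + 1] if i + 1 < len(names) else None
```

Keeping the feedback edge separate from the linear order makes it explicit that the pipeline is a loop, not a one-way flow.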
Many customers have attempted to build out this pipeline for deep learning, either in the cloud or on premises, by using commodity hardware and a brute-force approach to data management. Because of the prohibitive cost of moving data out of the cloud, once data is committed there, you will probably end up running the rest of the pipeline in the cloud as well. Whether the pipeline runs in the cloud or on premises, bottlenecks inevitably arise as production proceeds and the amount of data increases.
The biggest bottleneck occurs during the training phase, where massive I/O bandwidth with extreme I/O parallelism is needed to feed data to the deep learning training cluster for processing. Following the training phase, the resulting inference models are often stored in a DevOps-style repository where they benefit from ultra-low-latency access.
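To get a rough feel for the training-phase bandwidth described above, a back-of-the-envelope estimate multiplies the number of GPUs by the rate at which each one consumes samples and by the size of a sample. The sketch below assumes that simple model; the function name and all of the numbers (32 GPUs, 1,000 samples per second per GPU, 150 KB per sample) are hypothetical placeholders, not measurements from any real cluster.

```python
def required_read_bandwidth(num_gpus: int,
                            samples_per_sec_per_gpu: float,
                            bytes_per_sample: float) -> float:
    """Aggregate read bandwidth in bytes/s needed to keep every GPU fed."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample


# Hypothetical cluster: 32 GPUs, 1,000 samples/s each, 150 KB per sample.
bandwidth = required_read_bandwidth(32, 1_000, 150_000)
print(f"{bandwidth / 1e9:.1f} GB/s")  # prints "4.8 GB/s"
```

The estimate ignores caching, compression, and repeated epochs over the same data, so treat it as a starting point for sizing discussions and not as a prediction.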
If data doesn’t flow smoothly through the entire pipeline, beginning with ingest, your deep learning pipeline will never achieve full productivity, and you’ll have to commit increasing amounts of staff time to manage the pipeline.
NetApp and the Deep Learning Pipeline
Only NetApp Data Fabric offers the necessary data management technology to satisfy the needs of the entire deep learning pipeline, from edge to core to cloud. Cloud providers don’t encompass the edge and may struggle with I/O performance. Other storage vendors attempt to solve the bandwidth problems during training, but can’t deliver ultra-low latency, and they lack the technology necessary to cover the entire workflow. This is where NetApp Data Fabric offers distinct advantages.
At the edge, NetApp offers ONTAP® Select, which runs on commodity hardware to enable data aggregation and advanced data management. Our upcoming Plexistor technology will facilitate ingest, especially where the rate of ingest is extremely high.
To address storage needs for both the data lake and training cluster, NetApp All Flash FAS (AFF) storage delivers both high performance and high capacity while reducing the need for time-consuming data copies. NetApp is working to deliver NVMe over Fabrics (NVMe-oF) and Plexistor to further extend AFF capabilities. NetApp Private Storage (NPS) delivers many of the same benefits for deep learning pipelines in the cloud.
For archiving cold data, NetApp FabricPool technology migrates data to object storage automatically based on defined policies.
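As a generic illustration of what policy-based tiering of cold data means, the sketch below selects datasets that have not been accessed within a configurable number of days. It is not FabricPool's implementation or API: the `select_cold` function, the `cooling_days` parameter, and the dataset names are hypothetical and exist only to make the idea concrete.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def select_cold(last_access: Dict[str, datetime],
                now: datetime,
                cooling_days: int) -> List[str]:
    """Return names of datasets not accessed within the last `cooling_days` days."""
    cutoff = now - timedelta(days=cooling_days)
    return sorted(name for name, ts in last_access.items() if ts < cutoff)


# Hypothetical training-run datasets and their last access times.
last_access = {
    "run-001": datetime(2017, 6, 1),
    "run-042": datetime(2017, 12, 20),
}
cold = select_cold(last_access, now=datetime(2018, 1, 1), cooling_days=31)
print(cold)  # ['run-001']
```

A real tiering engine would work at a finer granularity and move data transparently; the point here is only that the decision is driven by a declared policy instead of manual copies.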
In future blogs, I’ll explore each of these technologies. Next time, I’ll take a closer look at performance needs across the deep learning pipeline.