Companies and organizations of all sizes and across all industries are turning to artificial intelligence (AI), machine learning (ML), and deep learning (DL) to solve real-world problems, deliver innovative products and services, and gain an edge in an increasingly competitive marketplace. As organizations increase their use of AI, ML, and DL, they face many challenges, including workload scalability, difficulty of deployment, and data availability. Many frameworks in the marketplace attempt to tackle the workload scalability and deployment difficulty hurdles. Most of these frameworks, however, fail to address the data availability challenge. Many of them feature proprietary data platforms that don’t offer proven enterprise-class reliability and don’t scale across different sites and regions. The NetApp AI Control Plane, NetApp’s full stack AI data and experiment management solution, is unique in that it addresses all three challenges.
NetApp AI Control Plane
With the NetApp AI Control Plane solution, you can rapidly clone a data namespace just as you would a Git repository. Additionally, you can define and implement AI/ML/DL training workflows that incorporate the near-instant creation of data and model baselines for traceability and versioning. When you use this solution, you can trace every single model training run back to the exact dataset that was used to train and/or validate the model. You can also swiftly provision Jupyter Notebook workspaces with seamless access to massive datasets. Because this solution is targeted towards Data Scientists and MLOps Engineers, no NetApp or NetApp ONTAP expertise is required. Data management functions can be executed using simple and familiar tools and interfaces. Furthermore, this solution utilizes fully open source and free components. Thus, if you already have NetApp storage in your environment, you can implement this solution today.
The NetApp AI Control Plane solution consists of three major components: Kubeflow, Kubernetes, and the NetApp Data Fabric. The use of Kubernetes enables workload scalability and portability. Kubernetes is the industry-standard container orchestration platform and has become the standard platform for cloud-native deployments. Because it is an open platform, Kubernetes allows you to scale your workloads across edge, private cloud, and public cloud sites without locking you in to a specific vendor or public cloud provider.
Another component of the NetApp AI Control Plane, Kubeflow, enables simplicity of deployment for AI workloads. Kubeflow is an up-and-coming AI/ML/DL toolkit for Kubernetes. As a Kubernetes-native framework, it provides a standard and open platform for deploying AI/ML/DL workloads. Kubeflow abstracts away the intricacies of Kubernetes, allowing Data Scientists and Developers to focus on what they know best – AI/ML/DL. With Kubeflow, Data Scientists no longer need to be Kubernetes Administrators. Kubeflow allows Data Scientists to define end-to-end AI/ML/DL workflows using a simple Python SDK. They don’t need to know how to define Kubernetes deployments in YAML or execute `kubectl` commands. Given that most Data Scientists are already familiar with Python through the use of AI frameworks such as TensorFlow and PyTorch, the learning curve is not steep. Additionally, Jupyter Notebooks are included with Kubeflow out of the box. A Team Lead or Administrator can provision and destroy Jupyter Notebook servers for Data Scientists and Developers on demand. When Kubeflow is deployed as part of the NetApp AI Control Plane solution, data volumes, potentially containing petabytes worth of data, can be presented to Data Scientists as simple folders within a Jupyter workspace. The Data Scientist is given instant access to all of their data from within a familiar interface. They never even need to know that the data resides on NetApp storage.
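To make the "simple folders" point concrete, here is a minimal sketch of what working with such a volume might look like from inside a notebook. It assumes the data volume is mounted at a path like `/home/jovyan/data`. That path and the helper name `summarize_dataset` are hypothetical and chosen only for illustration; the actual mount point depends on how the notebook server was provisioned.

```python
from pathlib import Path


def summarize_dataset(root):
    """Count the files and total bytes under a mounted dataset folder.

    The folder is treated as an ordinary directory; nothing here is
    specific to NetApp storage or to Kubernetes.
    """
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "files": len(files),
        "bytes": sum(p.stat().st_size for p in files),
    }


if __name__ == "__main__":
    # Hypothetical mount point of the data volume inside the workspace.
    mount_point = Path("/home/jovyan/data")
    if mount_point.exists():
        print(summarize_dataset(mount_point))
    else:
        print(f"{mount_point} is not mounted in this environment")
```

The Data Scientist uses ordinary file APIs, and the storage layer stays out of sight.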
NetApp Data Fabric
The NetApp Data Fabric rounds out the solution by enabling data availability. The Data Fabric facilitates seamless data movement across edge, private cloud, and public cloud sites, all while providing enterprise-class data management and data protection capabilities. The NetApp Data Fabric provides data portability and scalability capabilities to go along with the workload portability and scalability capabilities offered by Kubernetes. No longer do Data Scientists have to wait for days while the datasets that they need for their AI projects are copied from site to site, from server to server, or even from workstation to workstation. Additionally, organizations no longer have to pay for the storage space required to store many different copies of the same dataset. Also eliminated are the headache of tracking changes across multiple different versions of the same dataset and the risk of losing track of a specific copy of a dataset. When the Data Fabric is paired with Kubernetes, AI/ML/DL workloads and petabytes worth of AI/ML/DL training data can be seamlessly scaled together across sites and regions.
The NetApp Data Fabric is, in fact, integrated directly with Kubernetes. Through the use of NetApp Trident, NetApp’s persistent storage provisioner for Kubernetes, data volumes stored within the Data Fabric can be presented to Kubernetes workloads in a Kubernetes-native format. To put it another way, Trident provides enterprise-class persistent storage for containers, in a cloud-native format. With Trident, Developers can perform data management functions using standard Kubernetes API calls and commands. Because Kubeflow is a Kubernetes-native framework and is, thus, integrated with the standard Kubernetes APIs, Data Scientists can incorporate data management tasks directly into a Kubeflow pipeline workflow. For example, a Data Scientist can define a step within an AI/ML/DL model training pipeline that triggers the creation of a Snapshot of the training dataset and trained model every time the workflow is executed. A Data Scientist can even trigger the creation of a Snapshot from within a Jupyter Notebook for on-demand dataset and/or model versioning. As long as these Snapshots are not deleted, it will always be possible to trace a specific trained model back to the exact training dataset that was used to train it. This traceability is storage-efficient, because a Snapshot consumes additional storage space only as its source volume diverges from it. Likewise, the same Data Scientist can rapidly provision a space-efficient clone of a dataset or model. This clone can subsequently be used as a dev/test workspace or for A/B testing.
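As a rough illustration of what "standard Kubernetes API calls" can mean here, the sketch below builds the two Kubernetes objects involved: a `VolumeSnapshot` that captures a dataset volume, and a `PersistentVolumeClaim` that clones an existing claim through the standard `dataSource` field. It assumes a cluster with the CSI snapshot CRDs installed and a CSI-based storage backend. The object names, the snapshot class `csi-snapclass`, and the storage class `ontap-ai` are hypothetical placeholders, and older clusters may serve the snapshot API under a different `apiVersion`.

```python
import json


def build_snapshot_manifest(pvc_name, snapshot_name,
                            snapshot_class="csi-snapclass"):
    """Return a VolumeSnapshot manifest for an existing PVC.

    snapshot_class is a hypothetical VolumeSnapshotClass name.
    """
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": snapshot_name},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def build_clone_manifest(source_pvc, clone_name, storage_class, size):
    """Return a PVC manifest that clones source_pvc via dataSource.

    For CSI cloning, the clone generally has to use the same storage
    class as the source and request at least the source's capacity.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": clone_name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
            "dataSource": {
                "kind": "PersistentVolumeClaim",
                "name": source_pvc,
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    snapshot = build_snapshot_manifest("dataset-vol", "dataset-run-001")
    clone = build_clone_manifest("dataset-vol", "dataset-dev",
                                 "ontap-ai", "1Ti")
    print(json.dumps(snapshot, indent=2))
    print(json.dumps(clone, indent=2))
```

Either manifest could then be submitted from a pipeline step or a notebook cell, for example with `kubectl apply` or the Kubernetes Python client. Recording the snapshot name alongside each training run is one way to keep the model-to-dataset link described above.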
The NetApp AI Control Plane is unique in that it addresses three of the major challenges faced by organizations when they attempt to increase their use of AI, ML, and DL: workload scalability, difficulty of deployment, and data availability. Through the use of the NetApp Data Fabric, data is always available whenever and wherever it is needed, and physical storage space is always utilized efficiently. Another component of the solution, Kubeflow, enables Data Scientists to quickly and easily define end-to-end AI workflows. Lastly, through the use of Kubernetes, AI workloads can be seamlessly scaled across regions and sites.
To learn more, refer to TR-4798 and be sure to view my GTC Digital on-demand session.
Read the NetApp GTC announcement blog by Kim Stevenson, Senior Vice President and General Manager, Foundational Data Services Business Unit, NetApp
Check out these additional resources to learn more about the NetApp AI Control Plane, or visit https://www.netapp.com/us/solutions/applications/ai-deep-learning.aspx.