In previous blogs in this series, I talked about the edge-to-core-to-cloud architecture and the fact that in a data pipeline there are different I/O characteristics for data flowing in from the edge versus data flowing from the data lake into the training cluster. In this blog, I want to go one level deeper and talk about the specific set of choices that you can make to smooth the flow of data into the training cluster.
If you think of the GPUs in your training cluster as a high-performance car, a good data pipeline is like the difference between taking that car out on a racetrack or taking it out on the freeway at rush hour. As you think about how to obtain maximal results from your AI/ML/DL deployment, your data pipeline is perhaps the single most important consideration, yet it often gets overlooked. The optimal data architecture takes into account I/O needs across the edge, data lake, and training cluster.
Filesystem Selection for the Data Pipeline
As I mentioned in the second post, object storage is not designed to deliver the level of performance that your data pipeline is going to require—now or in the future. File-based storage remains a superior choice, but there are many factors that you must consider:
- Which filesystem?
  - A scale-out filesystem such as Lustre or GPFS
  - HDFS, a commonly used big data filesystem
  - NFS, the most widely deployed shared filesystem for technical applications for the last 30 years
- Ability to accommodate and federate both structured and unstructured data from a variety of data sources without sacrificing performance for any source
  - Log and sensor data
  - Databases, including RDBMS and NoSQL
    - Random I/O for many types of databases: table scans, document and collection reads in NoSQL, columnar reads in columnar databases, key-value random reads in key-value databases
    - Sequential I/O for in-memory databases and in-memory engines like Spark
  - Email logs
  - Home directories
  - Many others
- Performance for small, random I/O versus sequential I/O
  - Some data sources generate random I/O while others are sequential as shown above.
  - Filesystem must be able to balance performance between both.
- Performance and capabilities of data movers
  - Greatest performance
  - Most efficient data movement
- Data lifecycle automation
  - Intelligent filtering to determine what data goes to the core versus archive tiers
  - Real-time performance for filtering decisions
- Support for the latest storage and memory media, providing orders of magnitude advances in performance and latency
The filesystem and data architecture you choose should account for all the factors that are important to your environment.
Data Flow into the Training Cluster
There are some nuances around the way that data flows into the training cluster that are important to understand. These factors affect:
- Where I/O is coalesced
- Requirements for a single namespace
- Metadata scaling
Data curation is a function of the data source. I/O coalescing can happen in two different locations:
- In the data lake as a part of data curation and transformation resulting in streaming I/O into the training cluster.
- In the training cluster itself, which results in random I/O from the data lake.
When you’re dealing with an unstructured data lake, that’s a filesystem almost by definition. It has the ability to curate the data and lay it out as a set of coalesced file streams that can be nicely aligned with the training cluster, allowing data to stream directly into cluster CPUs to pre-load and feed GPUs.
On the other hand, when you’re dealing with data sources such as databases, sensor logs, file logs, emails, and so on, it may be impossible to have nicely curated reads that allow you to stream data into the cluster. In these cases, data is accessed via random reads and I/O coalescing happens in the training cluster itself.
Depending on the types of data sources you have, your data architecture may need to be able to deliver both large sequential reads and small random reads into the training cluster.
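To make the idea of I/O coalescing concrete, here is a minimal Python sketch of what it amounts to wherever it happens: small read requests that touch or nearly touch are merged into fewer, larger reads. The function name `coalesce_reads` and the `max_gap` parameter are hypothetical names chosen for illustration; this is not the behavior of any particular filesystem or framework.

```python
from typing import List, Tuple

ByteRange = Tuple[int, int]  # (offset, length) in bytes


def coalesce_reads(requests: List[ByteRange], max_gap: int = 0) -> List[ByteRange]:
    """Merge read requests that touch or nearly touch into larger reads.

    max_gap is a hypothetical tuning knob: two requests separated by at most
    this many bytes are merged, trading some wasted bandwidth for fewer I/Os.
    """
    merged: List[ByteRange] = []
    for offset, length in sorted(requests):
        if merged:
            last_offset, last_length = merged[-1]
            last_end = last_offset + last_length
            if offset <= last_end + max_gap:
                # Extend the previous read so that it also covers this request.
                new_end = max(last_end, offset + length)
                merged[-1] = (last_offset, new_end - last_offset)
                continue
        merged.append((offset, length))
    return merged


if __name__ == "__main__":
    # Eight 4 KiB reads arriving out of order, as a random-read source might issue them.
    block = 4096
    random_reads = [(i * block, block) for i in (5, 0, 1, 7, 2, 6, 3, 4)]
    print(coalesce_reads(random_reads))  # one contiguous 32 KiB read
```

When the requests cover a contiguous region, the result is a single streaming read. When they are scattered, the list stays long, which is the small random read case the data architecture still has to serve well.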
AI datasets have the potential to grow to massive size, leading to tremendous data sprawl. Accommodating this growth requires a scale-out filesystem with a single namespace and the ability to scale performance linearly to a single client node as well as to multiple client nodes accessing the same data in parallel. An architecture that can continue to scale as you add compute and capacity is going to be critical.
There can be different types of client access to this single namespace, each with implications for performance. Certain training models are what is known as “async.” The dataset is partitioned statically across training cluster nodes with single-node access to regions of the namespace, resulting in a “single client active” scenario.
Other training models run synchronously. The training model and its dataset have tight coupling and the dataset is shared across all cluster nodes with simultaneous access. This “multi-client active” scenario is the most demanding case from a performance standpoint.
There are other use cases where a multi-layered neural network trains the layers of the network on different nodes. The nodes serve as a model pipeline where the model progresses from one node to the next. This results in the entire dataset being read repeatedly, one node at a time, in a “sweeping hand” style of access.
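The three access patterns described above can be sketched as read plans over a shared namespace. This is an illustrative Python sketch only: the shard and node names, the round-robin assignment, and the function names are assumptions made for the example, not part of any training framework.

```python
from typing import Dict, List, Tuple


def single_client_active(shards: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Static partitioning: each shard is read by exactly one node.

    Round-robin assignment is an assumption; any fixed partitioning works.
    """
    plan: Dict[str, List[str]] = {node: [] for node in nodes}
    for index, shard in enumerate(shards):
        plan[nodes[index % len(nodes)]].append(shard)
    return plan


def multi_client_active(shards: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Shared dataset: every node may read every shard at the same time."""
    return {node: list(shards) for node in nodes}


def sweeping_hand(shards: List[str], nodes: List[str]) -> List[Tuple[str, List[str]]]:
    """Model pipeline: the full dataset is read by one node at a time, in order."""
    return [(node, list(shards)) for node in nodes]


if __name__ == "__main__":
    shards = [f"shard-{i}" for i in range(4)]  # hypothetical dataset pieces
    nodes = ["node-a", "node-b"]               # hypothetical cluster nodes
    print(single_client_active(shards, nodes))
    print(multi_client_active(shards, nodes))
    print(sweeping_hand(shards, nodes))
```

In the first plan no two nodes touch the same region of the namespace. In the second, every region has as many concurrent readers as there are nodes, which is why it is the most demanding case.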
As you evaluate filesystems capable of addressing these usage patterns, you’ll find that NFS has been applied to a diverse set of workloads, from its roots in HPC and home directories, to databases such as Oracle and SQL running on NAS storage, to SAP, and more recently to virtualization and big data. This long history of using NFS across a variety of workloads enables it to handle both the random and sequential I/O generated by diverse access patterns to the namespace, especially when combined with the benefits of all-flash storage in a linear scale-out cluster. As a relatively new filesystem, HDFS has had limited exposure to diverse data workloads and performance characteristics. Big data vendors have been undertaking significant (and proprietary) re-writes to deal with the performance needs in the transition from MapReduce to Spark; AI introduces another wrinkle in the HDFS story.
Relying on a big-data-specific filesystem like HDFS can mean more data copies and silos as you find yourself doing yet another data copy from HDFS into a high-performance scale-out filesystem for AI.
The same access patterns discussed in the preceding section also have implications for metadata performance. Each node in the training cluster may query metadata independently, so metadata access performance must scale linearly with the growth of the filesystem. Metadata access with filesystems such as Lustre and GPFS can become a bottleneck due to reliance on separate metadata servers and storage.
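A rough back-of-the-envelope sketch shows why metadata load grows with the cluster. The Python function below is a simplification under stated assumptions: `ops_per_file=2` (for example, one lookup and one attribute query per file) is a hypothetical figure, and real counts depend on the client, the protocol, and caching.

```python
def metadata_ops_per_epoch(
    num_files: int, num_nodes: int, shared: bool, ops_per_file: int = 2
) -> int:
    """Estimate metadata operations issued in one pass over the dataset.

    shared=True models the multi-client active case, where every node
    queries metadata for every file independently. shared=False models
    static partitioning, where each file is touched by a single node.
    ops_per_file is an assumed constant, not a measured value.
    """
    readers_per_file = num_nodes if shared else 1
    return num_files * ops_per_file * readers_per_file


if __name__ == "__main__":
    files = 1_000_000  # hypothetical dataset size
    for nodes in (1, 8, 64):
        shared_ops = metadata_ops_per_epoch(files, nodes, shared=True)
        print(f"{nodes} nodes, shared dataset: {shared_ops:,} metadata ops per epoch")
```

Under these assumptions the shared case multiplies metadata traffic by the node count, so a metadata service that does not scale with the cluster becomes the limiting factor.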
There are a variety of other factors that you should also take into account when selecting a filesystem for your data pipeline needs. These include:
- Ease of management
- Quality of Service (QoS)
- Cloning capabilities
- Ecosystem of client-side caching solutions
- Ability to perform in-place AI/DL with a unified filesystem across the data lake and AI/DL tiers
- Best-of-breed media support
- Future proofing
Ease of management
As you evaluate filesystems, it’s important to ask some questions related to management. Can the filesystem scale autonomously and automatically without management intervention? How much time and technical expertise does the filesystem take to manage? How easy is it to find people with the necessary expertise?
Scale-out filesystems such as Lustre and GPFS can be challenging to configure, maintain, monitor and manage. By comparison, NFS is easy to manage and NFS expertise is widespread.
Quality of Service
QoS can also be an important element of your data architecture. You may be building multi-tenant training clusters with price tags running into the millions of dollars. QoS plays a key role in your ability to deliver multi-tenancy, enabling multiple activities to share the same resources.
- Does the filesystem offer QoS?
- Is QoS integrated end-to-end?
- Can you apply limits and maximums on performance consumption across storage, networks, and compute to partition service levels for different training models?
Part of the multi-tenancy requirement is to satisfy different job functions within your organization. You may have a set of training models in various stages of development resulting in different use cases:
- Early training
- Model validation
- Production deployment
The ability to clone datasets and assign different QoS settings to each clone allows you to provide different performance SLAs for different use cases. Space-efficient cloning is therefore a must-have for a multi-tenant cluster.
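As an illustration of what a performance maximum per clone means in practice, here is a generic token-bucket sketch in Python. It is not any vendor's QoS implementation: the class name, the per-use-case limits, and the explicit `now` argument (passed in so the example is deterministic) are all assumptions made for the example.

```python
class ThroughputLimiter:
    """Token bucket that caps bytes per second for one tenant or clone."""

    def __init__(self, max_bytes_per_sec: float, burst_bytes: float) -> None:
        self.rate = max_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full burst allowance
        self.last_time = 0.0

    def try_consume(self, nbytes: float, now: float) -> bool:
        """Return True if a read of nbytes may proceed at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last_time)
        # Refill in proportion to elapsed time, never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_time = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


if __name__ == "__main__":
    MB = 1_000_000
    # Hypothetical service levels for clones of the same dataset.
    limits = {
        "early-training": ThroughputLimiter(100 * MB, 100 * MB),
        "production": ThroughputLimiter(1000 * MB, 1000 * MB),
    }
    for name, limiter in limits.items():
        allowed = limiter.try_consume(500 * MB, now=0.0)
        print(f"{name}: 500 MB read allowed immediately? {allowed}")
```

The same 500 MB request is admitted for the production clone and deferred for the early-training clone, which is the kind of partitioning of service levels the questions above are probing for.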
Use of a client-side cache helps further accelerate performance by providing a data buffer that enables uninterrupted data flow as the training dataset is accessed from training cluster nodes. A filesystem that supports an ecosystem of client caching products (whether open-source or commercial) can provide substantial advantages.
A variety of open-source and commercial options exist for NFS-based storage. Few if any client-side caching products currently exist for Lustre, GPFS, or HDFS. Almost none are open-source and freely available.
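To show the mechanism a client-side cache relies on, here is a minimal least-recently-used block cache in Python. It is a sketch under assumptions, not a model of any specific caching product: `BlockCache`, the `fetch` callback standing in for a read from shared storage, and the block granularity are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class BlockCache:
    """Least-recently-used cache of data blocks held on the client node."""

    def __init__(self, fetch: Callable[[int], bytes], capacity_blocks: int) -> None:
        self._fetch = fetch  # called on a miss to read from shared storage
        self._capacity = capacity_blocks
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id: int) -> bytes:
        if block_id in self._blocks:
            # Mark the block as most recently used and serve it locally.
            self._blocks.move_to_end(block_id)
            self.hits += 1
            return self._blocks[block_id]
        self.misses += 1
        data = self._fetch(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self._capacity:
            # Evict the least recently used block to stay within capacity.
            self._blocks.popitem(last=False)
        return data


if __name__ == "__main__":
    def read_from_storage(block_id: int) -> bytes:
        # Stand-in for a remote read; a real client would go to the filesystem.
        return f"block-{block_id}".encode()

    cache = BlockCache(read_from_storage, capacity_blocks=2)
    for block_id in (1, 2, 1, 3, 2):
        cache.read(block_id)
    print(f"hits={cache.hits} misses={cache.misses}")
```

Repeated passes over a dataset that fits in the cache turn most reads into local hits. A dataset much larger than the cache sees little benefit from plain LRU, so cache sizing matters as much as cache availability.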
In-Place AI/DL with a Unified Filesystem
There are going to be situations where you want to use the same data to serve both big data analytics workloads and AI/ML/DL workloads. For AI that is applied in a post-process fashion—such as for surveillance, fraud detection, etc.—the right filesystem makes it possible to accomplish both workloads without the need for data copies. The dataset resides in a single location, and in-place analytics and in-place AI/ML/DL compute processing is applied (possibly with the use of client-side caching as just discussed) without copying data into dedicated filesystems for your data lake and training cluster.
However, if real-time performance is a key requirement or a key competitive differentiator, you will likely continue to need a dedicated data copy for the training cluster.
Support for state-of-the-art media and memory advances
Finally, you’ll want to pick a filesystem that’s able to support the latest advancements in media and memory so that the performance of your data pipeline can continue to evolve in lockstep with the technology roadmap. Is the filesystem optimized for flash today? Is it seamlessly extensible to support new technologies, and are vendors actively innovating in areas such as NVMe, NVMeoF, NVDIMM and 3D XPoint?
Flash today is capable of latencies around 500 microseconds. NVMeoF will take that down to 200 microseconds. NVDIMM, 3D XPoint, and persistent memory are poised to take latencies to sub-100 microseconds, sub-10 microseconds, and eventually nanoseconds. Your data pipeline vendor needs to be making sustained investments to keep pace with this evolution, across server-based and shared-storage solutions.
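A quick calculation shows what these latency figures mean for a client that issues one request at a time: the number of I/Os it can complete per second is simply the inverse of the latency. The Python sketch below uses the latencies quoted above, treating "sub-100" and "sub-10" microseconds as upper bounds of 100 and 10. The function name is hypothetical, and real systems keep many requests in flight, so these are per-outstanding-request figures rather than device limits.

```python
def serial_iops(latency_us: float) -> float:
    """I/Os per second achievable with a single request in flight at a time."""
    return 1_000_000 / latency_us


# Latencies in microseconds, taken from the figures quoted in the text.
LATENCIES_US = {
    "flash today": 500,
    "NVMeoF": 200,
    "sub-100 microsecond media (upper bound)": 100,
    "sub-10 microsecond media (upper bound)": 10,
}

if __name__ == "__main__":
    for name, latency in LATENCIES_US.items():
        print(f"{name}: {serial_iops(latency):,.0f} IOPS per outstanding request")
```

Going from 500 microseconds to 10 microseconds is a 50x increase in what each outstanding request can deliver, which is why small random reads benefit most from new media.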
Future Proofing your Data Architecture and Filesystem Choice
The whole AI field is evolving very quickly, but it can be impractical or impossible to re-build your architecture from scratch every 6 months to a year. As a final consideration, you should try to make technology choices that are as future-proof as possible. The ability to seamlessly and non-disruptively evolve different layers of technology such as filesystem, interconnect, deployment location, media and memory type within a chosen infrastructure provides long-term return on investment and ability to absorb technology evolutions as they occur.
Your choice of filesystem today will likely depend on your team’s existing comfort levels, skillset and prior expertise. You will likely want to factor in past deployment experience, existing deployments, and existing infrastructure.
As an example, if you’re comfortable with and looking to deploy on Fibre Channel or InfiniBand, you may go with a SAN architecture and Lustre or GPFS. Over time, you may decide the 100GbE or 400GbE roadmap with NFS is better for your needs. A well-thought-out data architecture accommodates this kind of change and future-proofs the solution, allowing you to switch your filesystem seamlessly without replacing infrastructure.
Similarly, you may choose NFS today but decide you need a SAN, NVMe, or NVMeoF-based filesystem or a persistent-memory-based data layout in the future. A future-proofed architecture allows you to evolve datastore technologies without needing to replace your entire deployed infrastructure.
Which Data Architecture Will You Choose?
The criteria outlined in this blog should give you a good foundation on which to select a filesystem and a data architecture well suited to your AI/ML/DL needs. We believe that the combination of NFS running on NetApp AFF storage is a leading contender based on our ability to address these needs and to evolve in place to accommodate the latest technologies.
In the next post, I’ll look at the bandwidth requirements of some I/O-intensive AI workloads and map them to what a storage solution would look like.
Want to know more about the factors involved in the architecture of a data pipeline for deep learning, specifically data management and the hybrid cloud? Then you’ll want to tune into a replay of my latest webinar, Architecting a Complete Data Infrastructure for AI and Deep Learning.