Autonomous driving vehicles were supposed to be here by 2020. But COVID-19 and the sheer complexity of anticipating what other drivers and pedestrians will do have delayed the release of fully autonomous driving vehicles.
The solution is that these vehicles need to drive more miles in test mode and log more data so that they know what to do in unexpected situations.
Ultimately, the training needed for AI models depends on data. Huge amounts of data. And all of this data is presenting a tremendous problem, because having the correct data to train the AI model for all possible scenarios is critical.
These issues are some of the reasons that Gartner, a global research and advisory firm, placed autonomous vehicles in the Trough of Disillusionment in its 2019 Hype Cycle.
StorageGRID is part of the solution
NetApp® StorageGRID® object storage brings three critical things to AI development workflows:
- Scale
- Geographic distribution
- Workflow automation integrations
Autonomous vehicle software development requires massive quantities of data that are generated in geographically diverse locations. We estimate that a single car generates 2TB of data per hour. See our technical report, NetApp StorageGRID Data Lake for Autonomous Driving Workloads, for how we calculate this number. Bringing those data assets together for labeling, processing, and model training is one of the biggest challenges that engineers face when they’re developing advanced driver assistance and autonomous vehicle capabilities.
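To get a feel for what 2TB per car per hour means at fleet level, here is a back-of-the-envelope sketch. Only the 2TB figure comes from this article; the fleet size, daily driving hours, and number of test days per year are hypothetical values chosen purely for illustration.

```python
# Rough fleet data-volume estimate.
# TB_PER_CAR_HOUR is the estimate quoted in this article; every other
# number below is a hypothetical assumption for illustration only.
TB_PER_CAR_HOUR = 2


def fleet_tb_per_day(cars, hours_per_day, tb_per_car_hour=TB_PER_CAR_HOUR):
    """Return the raw data volume, in TB, that a test fleet logs per day."""
    return cars * hours_per_day * tb_per_car_hour


daily_tb = fleet_tb_per_day(cars=100, hours_per_day=8)  # assumed fleet
yearly_pb = daily_tb * 250 / 1000  # assumed 250 test days; 1 PB = 1000 TB

print(f"{daily_tb:,.0f} TB per day")    # 1,600 TB per day
print(f"{yearly_pb:,.0f} PB per year")  # 400 PB per year
```

Even with this modest assumed fleet, raw volume reaches hundreds of petabytes per year, which is why capacity planning has to start from exabyte-scale targets.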
As the quantity of data continues to grow into tens and hundreds of exabytes, traditional storage systems cannot scale cost-effectively to reach that capacity. And because of limits in their capacity and addressable space, these systems also impose operational complexity.
To overcome these limitations, object storage is quickly becoming the storage method of choice to manage these vast quantities of data. Object storage can provide nearly unlimited storage capacity in a single namespace and offers geographically distributed storage options to ease data movement from one location to another. And with simplified access semantics, object storage makes it easy to access data.
The NetApp StorageGRID enterprise-grade, software-defined object storage system delivers all these capabilities and integrates seamlessly with the other products that are needed for a complete data pipeline solution.
Scaling to keep up
One of the biggest challenges of keeping everything in sync is scaling storage so that it can gracefully handle the enormous volume of incoming data. According to estimates, only a small percentage of data is ever going to be useful. Separating the meaningful data from the rest as early as possible becomes critical to prevent a “garbage in, garbage out” situation. For example, the AI system needs to be able to find every example of a deer in the road directly, rather than going through hundreds of terabytes of “animal in the road” instances to find the ones that are deer.
If you add tags that annotate the object and provide context, the data becomes more usable to the AI data pipeline. With NetApp object storage, data that is ingested at remote locations can be automatically moved for further processing or archiving. Data that is created in a data center can be smoothly propagated across remote or public cloud resources to support any processing or workflow need.
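As a minimal sketch of what tagging at ingest could look like: the S3 API represents object tags as URL-encoded key=value pairs, so a client can build that string before upload. The tag names and values below (scene, animal) are hypothetical and simply echo the deer example above. With the boto3 library, a string like this is what you would pass as the Tagging argument of put_object; the upload call itself is left out so that the snippet runs without a storage endpoint.

```python
from urllib.parse import urlencode


def build_tagging(tags):
    """Encode a dict of tags as an S3-style 'key=value&key=value' string.

    Keys are sorted so that the same tags always give the same string.
    """
    return urlencode(sorted(tags.items()))


# Hypothetical tags for a frame that shows a deer in the road.
frame_tags = {"scene": "animal-in-road", "animal": "deer"}
print(build_tagging(frame_tags))  # animal=deer&scene=animal-in-road
```

Tagging the specific animal at ingest is what lets a later step ask for deer directly instead of scanning every “animal in the road” object.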
In addition, StorageGRID solves the problem of managing data at scale by using information lifecycle management (ILM). ILM is a policy-driven engine that lets administrators decide where they want their data to be, how long it needs to be retained, and how efficiently it needs to be protected. As a result, your data is where you want it when you want it.
The importance of graceful geolocation
Another huge challenge of fully autonomous driving vehicles is where to store the data. If the AI data pipeline has to constantly monitor hundreds or even thousands of data repositories all over a given country, things quickly get complicated. This reduces agility, causes unnecessary data duplication, and increases the risk of mistakes.
This challenge occurs, for example, when the data storage facility is in a different location than the data training facility, or when the training takes place on premises, but the data is in the cloud.
With NetApp StorageGRID, you can have multiple sites in different locations that will all behave as a single global namespace. In other words, StorageGRID gives the application a single endpoint for obtaining the data. So, any application with the endpoint information and the right permissions can quickly and easily read or ingest the data it needs. Behind the scenes, the administrators know that the data is in multiple places, but the application doesn’t need to worry about it. The application accesses only a single endpoint to the object store.
This approach makes it easier for applications in an autonomous vehicle training data pipeline to access the data collaboratively from a single namespace. The same datasets can be accessed by apps that generate the data in remote sites and apps that model the deep learning/machine learning algorithms in core sites. In this way, you avoid the siloed architecture of the past.
StorageGRID also provides hybrid cloud integration points if you want to use hyperscalers for greater development agility or for more economical cold storage archive options in the cloud.
Workflow integrations provide flexibility and make your data more manageable
With the explosion in data being generated, administrators cannot manage what happens with the data manually. Not only is manual management cost prohibitive; it does not scale.
Flexible ILM policies can automatically determine where and for how long data is stored. When data is ingested, you can apply a tag to the data with annotations in the metadata.
Then, at each step of data processing, you can use the tags to figure out what data has been worked with, or you can apply new tags that will be used in the next step. These streamlined workflows automate time-consuming steps and allow quality assurance to be done sooner — before the data is passed to the next step. Integration with Elasticsearch enables you to search the metadata tags as part of the processing pipeline or to gain new insights into the flow of data.
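The tag-driven handoff between pipeline steps can be sketched as follows. This is an illustration only: an in-memory list stands in for the object store and its metadata search index, and the ObjectRecord class, the tag names, and the "labeled" step are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    """Stand-in for a stored object and its metadata tags."""
    key: str
    tags: dict = field(default_factory=dict)


def select_for_step(records, required, done_tag):
    """Return objects that carry all required tags and lack done_tag."""
    return [
        r for r in records
        if all(r.tags.get(k) == v for k, v in required.items())
        and done_tag not in r.tags
    ]


def mark_done(record, done_tag):
    """Tag the object so that the next step can tell it was processed."""
    record.tags[done_tag] = "done"


records = [
    ObjectRecord("cam/0001.jpg", {"location": "site-a", "source_id": "car-17"}),
    ObjectRecord("cam/0002.jpg", {"location": "site-a", "source_id": "car-17",
                                  "labeled": "done"}),
    ObjectRecord("cam/0003.jpg", {"location": "site-b", "source_id": "car-09"}),
]

# The labeling step picks up only unlabeled objects from site-a.
pending = select_for_step(records, {"location": "site-a"}, "labeled")
print([r.key for r in pending])  # ['cam/0001.jpg']

for r in pending:
    mark_done(r, "labeled")
```

Because each step records its result as a tag, rerunning the same selection afterward returns nothing, so a step never reprocesses data it has already handled.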
For example, as the images and text files from a vehicle are uploaded, the objects are also tagged with the location name and a source ID, making the data much more useful at the outset. And to automate data management, you can configure an ILM policy that maintains a copy of data for 7 days at the local site and creates another copy to retain indefinitely at the data center.
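The retention behavior in that example policy can be modeled in a few lines. This is not StorageGRID ILM rule syntax, only a sketch of the outcome the policy describes. The site names are hypothetical, and treating the local copy as expired at exactly 7 days is an assumption.

```python
from datetime import datetime, timedelta

LOCAL_RETENTION = timedelta(days=7)  # local-site retention from the example


def copy_locations(ingested_at, now,
                   local_site="local-site", core_site="data-center"):
    """Return the sites expected to hold a copy of an object at time `now`."""
    locations = {core_site}  # the data center copy is kept indefinitely
    if now - ingested_at < LOCAL_RETENTION:
        locations.add(local_site)  # local copy exists for the first 7 days
    return sorted(locations)


ingested = datetime(2024, 1, 1)
print(copy_locations(ingested, datetime(2024, 1, 3)))  # ['data-center', 'local-site']
print(copy_locations(ingested, datetime(2024, 1, 9)))  # ['data-center']
```

The point of expressing this as a policy rather than a script is that no one has to remember to clean up the local copy; placement follows from the object's age.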
These automations make the whole process much more manageable.
And that’s not all…
Increasingly, organizations need to retain data for long periods of time for liability and training purposes. Workflows and automations mean that a process, not people, determines what data is stored where and for how long. The StorageGRID enterprise-grade, software-defined object storage system makes regulatory compliance a breeze.
As if all this were not enough, you can optimize your storage economics (read: stop overpaying for storage) by creating your own storage fabric, using the hybrid cloud functionality in StorageGRID. To learn more, read about our platform services and our cloud storage pools.
For more information about StorageGRID, including news, videos, and datasheets, visit our StorageGRID webpage. And read the full technical report, TR 4851: NetApp StorageGRID Data Lake for Autonomous Driving Workloads.