Machine learning (ML) has become a growing workload for NetApp® ONTAP® powered all-flash storage arrays. As our engineering team has focused on delivering world-class quality for ONTAP software, we set out to learn more about ML and how it can help in our quality test engineering.
Selecting a Useful Test Case
Delivering a high-quality product like ONTAP software starts with its design. Then we have to test that product design to confirm that it meets our high standards. Our test engineering for ONTAP involves a wide array of complex test suites that run at scale, stressing the ONTAP systems in various ways, ranging from routine to extreme conditions. To test and to deploy ML models, we zeroed in on a systemic test of ONTAP with NFS (Network File System), which is also a popular choice for deploying ML workloads in real life.
Systemic NFS Testing
The preceding figure shows systemic NFS testing that uses more than 100 clients to generate I/O over NFS to a four-node ONTAP cluster, simulating a complex dataset that’s distributed across the cluster. Simultaneously, a control node orchestrates data administration and management operations such as NetApp FlexClone® cloning, NetApp Snapshot™ copy creation, volume moves, and NetApp SnapMirror® replication. These operations rapidly age the system to compress the real-world deployment scenario to within a span of 24 to 48 hours.
By the end of the test, each of these hundreds of clients has generated several log files that show the result of various administrative and management operations. Each has also generated I/O logs and packet trace files that can add up to hundreds of megabytes.
Although the test code itself can determine the result of most operations, test engineers typically spend a significant amount of time examining the log files to confirm that everything in the test worked as intended. Given the complex nature of a test that involves simulation of real-world conditions and workloads, certain operations are expected to fail, and many should succeed. Triaging the test logs of the controller node and of the various clients that send I/O typically leads to the following issue categories:
- A defect in the product. This category is an issue in the product that is under testing. A bug should be filed on the product and must be rectified.
- A defect in the test code. A problem in the test script code causes this issue, and the test code must be fixed.
- An issue with the test infrastructure. Something in the setup, network, or lab infrastructure causes an error and must be corrected.
- A false alarm. What looks like an error is actually not an issue and can safely be ignored.
Developing an ML Model
Is it possible to create an ML model to triage the logs and the issues into the preceding four categories? That’s the question that we set out to explore. To develop an ML model, we started with putting together the foundational data that formed the input to the ML model, and we prepared that data for ML processing.
We collected a data sample of log files across several test iterations, and we chose more than 100 different clients that had some failures in their log files. A test engineer triaged each of the logs into one of the four issue categories. A set of 100 of these logs was then divided into two parts: 70 logs to train the ML model and 30 logs to test the model. “Training the ML model” means that the model learns from the 70 logs and from their triage results to formulate an overall model or a set of rules. We can then apply those rules to logs that the model has not seen to arrive at triage results.
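The 70/30 division described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual code: the log names, the category identifiers, the shuffling step, and the `split_logs` helper are all assumptions made for the example.

```python
import random

def split_logs(labeled_logs, train_count=70, seed=0):
    """Shuffle the labeled logs and split them into training and test sets."""
    shuffled = list(labeled_logs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    return shuffled[:train_count], shuffled[train_count:]

# Hypothetical stand-in data: (log name, category assigned by a test engineer).
CATEGORIES = ["product_defect", "test_code_defect", "infrastructure_issue", "false_alarm"]
labeled_logs = [(f"client_{i:03d}.log", CATEGORIES[i % 4]) for i in range(100)]

train_set, test_set = split_logs(labeled_logs)
print(len(train_set), len(test_set))
```

Shuffling before the split is a common precaution so that neither set is dominated by logs from a single test iteration; whether the original work shuffled its sample is not stated.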
Before the data could be fed into the ML model, it first had to be anonymized so that the model could ignore data that is noise and not critical to the prediction. Following is an example of some data in the log file before and after anonymization:
Before: 20180428 080329[01_fcj/Replay_Cache/Real_IO/supervisor_118_scspr0468460011/118_worker] Updating NFS activity for nfs4_io_rg:nst_nfs4_hammer_locks_tcp
After: Updating NFS activity for nfs_io => 434
The “after” example represents the key text of the message and how often it appeared in the log file. The names of files or clients and processes are removed because they’re irrelevant to the triage.
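As a rough sketch of this kind of anonymization, the Python below strips the timestamp and the bracketed context from each line, masks run-specific tokens, and counts how often each remaining message occurs. The prefix pattern and the masking rules (dropping digits and the suffix after the first colon) are assumptions inferred from the single example above, so the output is close to, but not identical to, the "after" line shown.

```python
import re
from collections import Counter

# Assumed prefix layout: "YYYYMMDD HHMMSS[context/path]" followed by the message.
PREFIX = re.compile(r"^\d{8} \d{6}\[[^\]]*\]\s*")

def anonymize(line):
    """Strip the timestamp and bracketed context, then mask run-specific tokens."""
    message = PREFIX.sub("", line.strip())
    message = message.split(":", 1)[0]     # assumption: text after the first colon is run-specific
    message = re.sub(r"\d+", "", message)  # assumption: digits (versions, IDs) are noise
    return message

def count_messages(lines):
    """Count how often each anonymized message appears in one log file."""
    return Counter(anonymize(line) for line in lines if line.strip())

sample = ("20180428 080329[01_fcj/Replay_Cache/Real_IO/supervisor_118_scspr0468460011/118_worker] "
          "Updating NFS activity for nfs4_io_rg:nst_nfs4_hammer_locks_tcp")
counts = count_messages([sample, sample, sample])
for message, count in counts.items():
    print(f"{message} => {count}")
```

A real rule set would need tuning against actual logs; for instance, splitting on the first colon would truncate messages that legitimately contain one.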
For our testing, we wrote code in Python to further process the anonymized logs to build a feature vector table. In this table, every relevant message in the log files became a feature code, along with how many times that message occurred in the log file. Feeding this feature vector table into the Classification and Regression Trees (CART) algorithm in the R language yielded an ML model that was ready to be tested. Running this model on the 30 logs of test data (again, anonymized and transformed into the feature vector table) produced the triage results of the trained ML algorithm.
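To make the feature vector table and the tree-building idea concrete, here is a small pure-Python sketch. It is not the R code used in the work described above: it builds a table from per-log message counts and then performs one step of the kind of search that CART relies on, finding the single feature and threshold that best separate the categories by Gini impurity. The messages, counts, and labels are hypothetical.

```python
from collections import Counter

def build_feature_table(log_counts):
    """Turn per-log message counts into a vocabulary and one row of counts per log."""
    vocabulary = sorted({msg for counts in log_counts for msg in counts})
    rows = [[counts.get(msg, 0) for msg in vocabulary] for counts in log_counts]
    return vocabulary, rows

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (score, column, threshold) with the lowest weighted Gini impurity."""
    best = None
    for col in range(len(rows[0])):
        for threshold in sorted({row[col] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[col] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[col] > threshold]
            if not left or not right:
                continue  # a split must put at least one log on each side
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, col, threshold)
    return best

# Hypothetical anonymized message counts for four logs, with engineer-assigned labels.
log_counts = [
    {"Updating NFS activity for nfs_io": 434},
    {"Updating NFS activity for nfs_io": 410, "mount request timed out": 7},
    {"Updating NFS activity for nfs_io": 398},
    {"Updating NFS activity for nfs_io": 120, "mount request timed out": 12},
]
labels = ["false_alarm", "infrastructure_issue", "false_alarm", "infrastructure_issue"]

vocabulary, rows = build_feature_table(log_counts)
score, col, threshold = best_split(rows, labels)
print(f"split on {vocabulary[col]!r} <= {threshold} (weighted Gini {score})")
```

A full CART implementation applies this search recursively to each resulting subset and adds stopping and pruning rules, which this sketch omits.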
Improving Predicted Triage
Comparing the ML model’s predicted triage with the triage by a test engineer shows how accurate the model is. This comparison is represented in what’s called a confusion matrix, which counts, for each category assigned by the engineer, how many logs the model placed in each category. Our first confusion matrix showed a prediction accuracy of 92%, which was very encouraging, and it motivated us to go further along our journey. After repeated runs, we decided to deploy this model in our daily practice.
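A confusion matrix and the accuracy derived from it can be computed as in the following sketch. The ten labels are made up for illustration and do not reproduce the 92% result; the category identifiers are also assumptions.

```python
from collections import Counter

def confusion_matrix(actual, predicted, categories):
    """Rows are the engineer's triage; columns are the model's prediction."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in categories] for a in categories]

def accuracy(matrix):
    """Share of logs on the diagonal, where model and engineer agree."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

categories = ["product", "test_code", "infra", "false_alarm"]
# Hypothetical triage of ten logs by an engineer, and the model's predictions.
actual = ["product", "product", "product", "test_code", "test_code",
          "infra", "infra", "false_alarm", "false_alarm", "false_alarm"]
predicted = ["product", "product", "test_code", "test_code", "test_code",
             "infra", "infra", "false_alarm", "false_alarm", "false_alarm"]

matrix = confusion_matrix(actual, predicted, categories)
for name, row in zip(categories, matrix):
    print(f"{name:12s} {row}")
print("accuracy:", accuracy(matrix))
```

Off-diagonal cells show which categories the model confuses with each other, which is more informative for retraining than the accuracy figure alone.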
By using what we had learned, we modified the systemic NFS test suite. It now triggers the ML model at the end of the test run to predict the triage results and to present them on a page with an option for a test engineer to confirm or to correct the model’s prediction. Any differences are stored and can be used to retrain the ML model to improve it further, resulting in even better triage accuracy.
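The confirm-or-correct step can be pictured as a comparison between the model's predictions and the engineer's verdicts, keeping only the disagreements for later retraining. The sketch below is an assumption about how such a record might look; the function name, the dictionary layout, and the sample data are hypothetical.

```python
def collect_corrections(predictions, verdicts):
    """Return the logs where the engineer's verdict differs from the model's prediction."""
    corrections = []
    for log_name, predicted in predictions.items():
        # Assumption: a log with no recorded verdict counts as confirmed.
        confirmed = verdicts.get(log_name, predicted)
        if confirmed != predicted:
            corrections.append(
                {"log": log_name, "predicted": predicted, "confirmed": confirmed}
            )
    return corrections

# Hypothetical predictions from one test run and the engineer's review of them.
predictions = {
    "client_001.log": "false_alarm",
    "client_002.log": "infra",
    "client_003.log": "product",
}
verdicts = {
    "client_001.log": "false_alarm",
    "client_002.log": "test_code",  # the engineer corrected this prediction
}

corrections = collect_corrections(predictions, verdicts)
print(corrections)
```

Stored corrections like these can be appended to the labeled training data before the next retraining run.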
Expanding the Use of ML Models
As we reflect on our work, the following graph represents the class of problems that ML can help solve. One dimension is the complexity of the problem, and the other is the repetitiveness of the problem—how often we need to solve the same issue over and over.
The systemic NFS test triage belongs to the class of problems that are moderate in complexity but highly repetitive. ML models can also be applied to scenarios that are highly complex but perhaps not as repetitive. This class can include situations in which problems are similar in nature, and the trained model from one problem can be used to predict or to solve a new but similar problem.
At NetApp, we strive to continually innovate and improve our offerings. So, we are continuing our journey with ML, now looking at ways to use ML models with new types of test suites and problems. To find out more about how we’re innovating to help our customers unleash the power of their data, read this technical report.