In my previous blog, I presented some key usability changes shipped in NetApp® OnCommand® Unified Manager 7.2 and the technology and processes that enable those improvements. In today’s blog I’ll review some of the performance capacity workflows available in Unified Manager 7.2 with the integration of Performance Manager. For this blog I connected with NetApp’s Tony Gaddis, one of our best performance troubleshooting experts. Tony and I met recently at NetApp’s Research Triangle Park office in North Carolina.
We started out by discussing frequently encountered performance issues and how they can be handled through OnCommand Unified Manager. The most typical scenario (some would argue that 90% of all issues fall into this category) is when an application is having performance issues and the application admin visits the storage admin to explore whether the storage is the root cause. The first step in that exploration is to compare past baseline performance with current performance. The storage admin pulls up a period with several representative business cycles (days or hours) and reviews the IOPS, latency, and throughput for the volume (application workload) and node (infrastructure). The next step is to compare that baseline with the current period, in which application performance is problematic. Does the storage layer's behavior correlate with what the application owner is seeing? Does the latency still look normal? If it does, the application admin needs to look elsewhere for the culprit.
If the storage performance is indeed abnormal, the admin starts by investigating the infrastructure's performance – nodes and disks. Unified Manager can help identify cases where all workloads on a node suffer from high latency. (Some people call this a performance brownout.) This often occurs when one or more workloads drive higher IOPS than usual, consuming all available performance capacity on a node. At that point, all workloads "share the pain" and experience higher latency. This scenario used to be very prevalent with spinning disks, but with the migration of storage systems to all flash, the bottleneck is often node resources such as memory, network, and CPU.
Unified Manager can help you easily identify such scenarios and take steps to bring performance back under control. There are two simple metrics to review – node performance capacity and aggregate utilization. If node performance capacity is greater than 85%, the controller's processing resources are probably bottlenecked and adding latency to user workloads. If aggregate utilization for spinning disk is greater than 50%, queuing is happening at the disk layer, again increasing user latency. In either case, it's necessary to reduce the load, either by redistributing some of the workloads or by using QoS to throttle some of them back.
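The two rules of thumb above can be captured in a small sketch. This is purely illustrative: the function name and inputs are hypothetical, and in practice you read these values from Unified Manager's performance views rather than computing them yourself.

```python
# Illustrative check of the two thresholds discussed above.
NODE_PERF_CAPACITY_LIMIT = 85.0    # percent; above this, the controller is likely the bottleneck
HDD_AGGR_UTILIZATION_LIMIT = 50.0  # percent; above this, disk-level queuing adds latency

def diagnose(node_perf_capacity_used, aggr_utilization, aggr_is_hdd):
    """Return a list of likely bottlenecks for a node/aggregate pair (values in percent)."""
    findings = []
    if node_perf_capacity_used > NODE_PERF_CAPACITY_LIMIT:
        findings.append("node: processing resources likely saturated")
    if aggr_is_hdd and aggr_utilization > HDD_AGGR_UTILIZATION_LIMIT:
        findings.append("aggregate: disk queuing likely adding latency")
    return findings

# A node at 92% performance capacity with a 40%-utilized HDD aggregate:
print(diagnose(92.0, 40.0, aggr_is_hdd=True))
```

Either finding points to the same remediation the text describes: redistribute workloads or throttle some of them with QoS.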
Even when the infrastructure has sufficient performance available, workloads can still suffer performance issues, often due to misconfiguration. One issue that we often help customers with is QoS settings that are too restrictive. Setting QoS limits can be a hit-or-miss experience. Suppose that a customer has a 500GB volume with a QoS policy limit set to 200 IOPS. That may seem large enough based on a review of the volume's average IOPS over a business day — but what if the workload always spikes at certain hours? The application backed by the volume may then encounter variable latency due to QoS throttling. Let's say that the application often requires 210 IOPS at peak times. That means that about 5% of IOPS will be throttled. The impact on average latency will be minor, because most IOPS are just fine, but that 5% of IOPS will see high latency and may cause some application transactions to time out and fail. Unified Manager makes it easy to find out whether this is the case.
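The arithmetic behind that 5% figure is simple back-of-envelope math, shown here as a sketch (the numbers are the ones from the example above):

```python
# Back-of-envelope check of the example: a 200-IOPS QoS limit
# against a 210-IOPS peak demand.
qos_limit = 200    # IOPS allowed by the policy
peak_demand = 210  # IOPS the application wants at peak

throttled = max(peak_demand - qos_limit, 0)  # 10 IOPS held back
fraction = throttled / peak_demand           # ~0.048, i.e. about 5%
print(f"{fraction:.1%} of peak IOPS are throttled")
```

The point of the calculation is that the average looks healthy while a small but real slice of I/O sits in the throttle queue at exactly the busiest moments.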
Just navigate to the volume of interest and expand the latency view. Mousing over points of interest shows a breakdown of the latency sources, and if QoS is one of the main limiters, you may want to change your policy.
Simply navigate to the Volume Performance Detail tab and review the QoS data for a 24- or 72-hour period of normal application workloads. Under normal circumstances, there should be no throttling. The IOPS limit should be set high enough so that normal application workloads don’t trigger any throttling. If throttling is happening, it’s time to review the limit and probably increase it.
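The review described above can be thought of as scanning a window of samples for intervals where demand exceeded the limit. The sketch below uses invented sample data and a hypothetical helper; Unified Manager presents this information graphically rather than through code.

```python
# Sketch of the 24-/72-hour review: given per-interval samples of
# demanded IOPS, flag any interval where the QoS limit would throttle
# the workload. Sample data is invented for illustration.

def throttled_intervals(samples, qos_limit):
    """Return (hour, demanded_iops) pairs where demand exceeded the QoS limit."""
    return [(hour, iops) for hour, iops in samples if iops > qos_limit]

# (hour of day, demanded IOPS) over a business day
samples = [(9, 120), (12, 180), (14, 210), (15, 230), (18, 150)]
hits = throttled_intervals(samples, qos_limit=200)
if hits:
    print("Throttling detected; consider raising the limit:", hits)
```

With a healthy limit the list comes back empty; any hits during normal workloads are the signal to revisit the policy.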
TIP: With QoS in NetApp ONTAP® 9.3, ONTAP can now automate this step, so most customers no longer need to manually correct a too-restrictive QoS limit as a volume's workload grows.
What is your story? Have you encountered any of the scenarios just described? How have they played out in your environment? Join the conversation and share your experience!
In my next blog, we’ll look at how Unified Manager can help you maximize the performance of your storage while maintaining adequate margins of safety, thus reducing the likelihood of performance issues.