Necessity is the mother of invention, at least when you work in IT and support global resources that include approximately 5,000 servers, over 1,500 network devices, and more than 50 PB of storage. An environment of that size means NetApp’s IT environment generates a constant flow of alerts. Our eternal challenge is getting better at identifying the root cause of the issues and preventing them from happening again.
Our event monitoring strategy plays an important role in addressing this challenge. We want to ensure critical alerts quickly rise to the top for immediate attention, while informational alerts can be analyzed separately for later action. To support this strategy, we needed to consolidate our alerts into a single ecosystem made up of individual, best-in-class components that would feed alerts into our incident management software for auto-ticketing.
This ‘single pane of glass’ strategy enables the NetApp resources on the infrastructure support team, called the Command Center, to quickly resolve critical issues 24×7 across the globe and not be sidetracked by non-urgent alerts. This approach improves IT’s responsiveness and focus, ultimately resulting in increased operational stability.
Redefining our Strategy
Our first step was developing an alerting process. Like most IT shops, we have a two-tier alerting system, but we classified our alerts in a slightly different way:
Reactive: This is the only type of alert that is automatically forwarded to the Command Center for immediate action. It is defined as “actionable” and requires attention from the team.
Proactive: These alerts are typically performance related but less urgent, so they are not immediately forwarded to the Command Center for action. Dashboards are used to manage thresholds for these alerts at a broader level. The Command Center monitors the dashboards and works with partner application support teams to address issues such as storage capacity or CPU utilization proactively. These alerts remain a key volume driver for the Command Center, but our teams continue to focus on streamlining and automating the responses over time.
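The two-tier split described above can be sketched in a few lines of Python. This is only an illustration, not the tooling NetApp IT uses: the Alert fields, the actionable flag, and the device names are hypothetical and stand in for whatever metadata your monitoring tools attach to an alert.

```python
from dataclasses import dataclass
from enum import Enum


class AlertClass(Enum):
    REACTIVE = "reactive"    # actionable now; forwarded for immediate action
    PROACTIVE = "proactive"  # less urgent; tracked on dashboards


@dataclass(frozen=True)
class Alert:
    source: str       # hypothetical: tool or domain that raised the alert
    device: str       # hypothetical device name
    message: str
    actionable: bool  # hypothetical flag set when the alert is defined


def classify(alert: Alert) -> AlertClass:
    """Reactive if the alert is defined as actionable, otherwise proactive."""
    return AlertClass.REACTIVE if alert.actionable else AlertClass.PROACTIVE


def route(alerts):
    """Split alerts into a ticket queue and a dashboard feed."""
    tickets, dashboard = [], []
    for alert in alerts:
        if classify(alert) is AlertClass.REACTIVE:
            tickets.append(alert)
        else:
            dashboard.append(alert)
    return tickets, dashboard


sample = [
    Alert("storage", "filer01", "volume offline", actionable=True),
    Alert("server", "app17", "CPU utilization above threshold", actionable=False),
]
tickets, dashboard = route(sample)
```

In this sketch only the volume-offline alert lands in the ticket queue; the CPU alert goes to the dashboard feed, where thresholds can be reviewed at a broader level.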
Over the course of about nine months, the process and support teams focused on understanding which existing alerts, thresholds, and events were most important and “actionable.” This work positioned NetApp IT to implement a single, integrated service management and alerting ecosystem, with significantly less noise for those accountable for responding to the alerts.
Building an Ecosystem
Our plan was to create an event monitoring ecosystem that fed alerts into central incident management software. A single ecosystem would enable the sorting, tracking, and accurate routing of alerts from our IT systems into our incident management software through auto-ticketing. For storage events specifically, this required integrating multiple tools (Zenoss, Splunk, and NetApp OnCommand® Unified Manager, or OCUM) into our ServiceNow incident management platform.
A major hurdle arose once we began integrating our storage environment into the ecosystem. The storage alerts were a necessary part of the end-to-end alerting strategy, but configuring each individual storage controller to connect with Zenoss and ServiceNow presented an administrative and management challenge. OCUM, however, let us connect a single storage management tool to our existing alerting ecosystem. OCUM gave us the advantage of managing thresholds and administering alerts in a single tool, and Zenoss provided the ability to analyze and dedupe the critical alerts before they were auto-ticketed.
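To make the dedupe step concrete, here is a minimal Python sketch of one common approach: treat events for the same device and condition as duplicates when they arrive within a short window of each other. It is not how Zenoss implements deduplication. The event fields, the five-minute window, and the device names are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def dedupe(events, window=timedelta(minutes=5)):
    """Return one event per burst of identical (device, condition) events.

    An event is dropped when the same device and condition was last seen
    no more than `window` earlier; the window restarts on every repeat.
    """
    last_seen = {}
    unique = []
    for event in sorted(events, key=lambda e: e["time"]):
        key = (event["device"], event["condition"])
        previous = last_seen.get(key)
        if previous is None or event["time"] - previous > window:
            unique.append(event)  # first report of this burst
        last_seen[key] = event["time"]
    return unique


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    {"time": t0, "device": "filer01", "condition": "volume_offline", "source": "ocum"},
    {"time": t0 + timedelta(minutes=1), "device": "filer01", "condition": "volume_offline", "source": "splunk"},
    {"time": t0 + timedelta(minutes=2), "device": "filer02", "condition": "disk_failure", "source": "ocum"},
    {"time": t0 + timedelta(minutes=30), "device": "filer01", "condition": "volume_offline", "source": "ocum"},
]
to_ticket = dedupe(events)
```

Here the second report about filer01 is suppressed because it arrives one minute after the first, while the report 30 minutes later is treated as a new occurrence. Only the events left in to_ticket would go on to auto-ticketing.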
Improving IT Operational Stability
Our alerting strategy offers many benefits. The Command Center has greatly reduced its dependency on email for event notifications. Team members don’t need to sort through alerts to find the critical ones, dedupe alerts about the same issue from multiple sources, or run the risk of assuming someone has already addressed the issue. The team only receives alerts that specifically require action.
When a device goes offline, a storage volume becomes unavailable, or a storage system experiences a hardware failure, the team is positioned to respond appropriately. Therefore, urgent infrastructure issues are identified and fixed more rapidly, before they cause havoc in our IT environment, reducing the overall number and impact of P1 incidents.
Regardless of the incident management or event monitoring software being used, any IT organization can benefit from rationalizing the number of actionable alerts and adopting an integrated event monitoring ecosystem. By creating a strategy that enables fast action on high-priority issues, we’ve improved the efficiency and effectiveness of our Command Center. Ultimately, this approach has a direct impact on the operational stability of IT operations for our customers, partners, and employees.
The NetApp-on-NetApp blog series features advice from subject matter experts from NetApp IT who share their real-world experiences using NetApp’s industry-leading data management solutions to support business goals. Want to learn more about the program? Visit www.NetAppIT.com.