The crux of IT tech support is simple: How fast can you identify and successfully resolve an incident? The complexity lies in the details, especially when handling hundreds of tasks per day.

 

Recently, NetApp IT began automating the process to resolve incidents without human intervention, particularly those first-level incidents that generate a high volume of service tickets. The goal of response automation is to restart services, trigger a workflow, or gather log information, all before an engineer receives the ticket. We wanted to use automation to drive faster, more accurate responses to incidents, improve our service delivery, and take the first step toward self-healing in IT operations.

 

Our challenge was to create a platform that could be easily integrated into our existing ecosystem of monitoring and service management tools, including the ServiceNow CMDB and monitoring tools from Zenoss, Splunk, and NetApp OnCommand® Unified Manager. We knew what we wanted to achieve, but we first had to understand the process.

First, Understand the Process

The bulk of any automation project, and its success, lies in first understanding the process and workflows. Our Command Center engineers were challenged to define a standard response process for a handful of first-level incidents with a high volume of service tickets. They started with relatively simple-to-resolve incidents, such as rebalancing storage capacity or restarting an offline application. The team used scripts as building blocks and applied the relevant responses as needed. Today they are building our library of automated responses while continuing to provide day-to-day IT support.

Then, Apply the Technology

The automated response process is designed to work within our existing ecosystem. When our Zenoss monitoring system receives an incident, it creates a ticket in our ServiceNow service management platform. If the ticket is flagged with auto response enabled, a script is executed using Ansible. The script directs the affected application to run certain commands, collects the results, and places the information in the ticket for a tech support person to access.
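The flow described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not our production code: the `Ticket` class, the `handle_ticket` function, and the `fake_runner` stand-in for the Ansible run are all hypothetical names, and nothing here calls the real Zenoss, ServiceNow, or Ansible interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    """Hypothetical stand-in for a service management ticket."""
    host: str
    summary: str
    auto_response: bool = False          # the "auto response enabled" flag
    work_notes: List[str] = field(default_factory=list)


def handle_ticket(ticket: Ticket, run_script: Callable[[str], str]) -> bool:
    """Run the automated response only when the ticket is flagged for it.

    Returns True when a script ran and its output was attached to the ticket.
    """
    if not ticket.auto_response:
        return False                     # left untouched for an engineer
    output = run_script(ticket.host)     # in practice, an Ansible run
    ticket.work_notes.append(output)     # results land in the ticket
    return True


def fake_runner(host: str) -> str:
    """Placeholder for the real script execution (assumption)."""
    return f"collected logs from {host}"


flagged = Ticket(host="app01", summary="service down", auto_response=True)
manual = Ticket(host="app02", summary="disk alert")

handle_ticket(flagged, fake_runner)
handle_ticket(manual, fake_runner)

print(flagged.work_notes)
print(manual.work_notes)
```

Passing the runner in as a function keeps the decision logic separate from whatever actually executes the script, which also makes the flow easy to test without touching a live system.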

 

With auto response, it is important to ensure the problem is really resolved. Until the Command Center is fully confident that a script is doing its job, team members verify each resolution. Although this verification slows the process, we have still gained a huge head start in incident resolution. When a tech support person opens the ticket, they can review the basic information gathered by the auto response and begin troubleshooting immediately. This eliminates the delay that comes from running tests to diagnose the issue.

 

For incidents with auto response enabled, it takes 3 to 4 minutes on average to execute an automation script and approximately one day from when the ticket is opened to when it is resolved (known as ticket duration). Without auto response, the average ticket duration was three days, and it took an engineer approximately one hour to assess the situation. Other benefits of automating incident resolution include:

  • Elimination of human error, increase in productivity, and reduction of rework;
  • A more engaged Command Center team, whose members intentionally look for incidents where auto response can be enabled so that they can spend their time troubleshooting more complex issues that build their skills;
  • Better reporting from an integrated ecosystem that reports volumes, success rates, areas of chronic failures, and more.
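Taking the timing figures quoted above at face value, the improvement can be worked out as a simple percentage reduction. The sketch below assumes 3.5 minutes as the midpoint of the 3-to-4-minute script run and compares it with the one-hour manual assessment; the function and variable names are illustrative only.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100


# Average ticket duration in days, without and with auto response.
duration_cut = percent_reduction(before=3, after=1)

# Initial assessment in minutes: about one hour by hand versus a
# 3-to-4-minute script run (3.5 is an assumed midpoint).
assessment_cut = percent_reduction(before=60, after=3.5)

print(f"Ticket duration reduced by {duration_cut:.1f}%")
print(f"Assessment time reduced by {assessment_cut:.1f}%")
```

On those assumptions, ticket duration drops by roughly two thirds and the initial assessment time by more than 90 percent.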

Auto Response (AR) Examples

Below are some auto response (AR) examples that the NetApp IT team has implemented. These examples help keep services up, prevent interruptions, and eliminate the need for IT operations teams to manually check and restore service.

  • Auto restart of the Kitchen Police service for Storage Operations: When Zenoss detects that the Kitchen Police service is down on the Storage Admin server, it automatically opens a ServiceNow ticket and invokes AR to restart the service.
  • Auto restart of the WebLogic service after an Out-of-Memory alert for IAM Operations: When Splunk forwards an Out-of-Memory alert from the IAM servers, Zenoss creates the event, automatically opens a ServiceNow ticket, and invokes AR to restart the WebLogic service on the impacted server.
  • Auto restart of the Tivoli Workload Scheduler (TWS) service on servers: When Zenoss detects that the TWS service is down on a server, it automatically opens a ServiceNow ticket and invokes AR to restart the TWS service.
  • Auto restart of OpenShift nodes: When OpenShift detects that nodes are not ready, it generates an event that is captured by Splunk. Splunk forwards that event to Zenoss, which creates an incident and invokes AR to trigger the script that restarts the node automatically. AR also attaches the log to the incident.
  • Response automation of the Fujitsu RMA process for Unix Operations: When a new Service Request email from Fujitsu arrives in NetApp IT ServiceNow, a ServiceNow ticket is opened automatically and AR is invoked to collect the required logs from the impacted server and upload them to the Fujitsu FTP server.
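One way to picture the examples above is as a lookup table that maps each type of event to its automated response. The sketch below is a simplified illustration under that assumption: the event keys, the `AUTO_RESPONSES` table, and the `select_response` function are hypothetical names, not identifiers from our tooling.

```python
from typing import Dict, Optional

# Hypothetical event keys mapped to the response described in each example.
AUTO_RESPONSES: Dict[str, str] = {
    "kitchen_police_down": "restart Kitchen Police service",
    "weblogic_out_of_memory": "restart WebLogic service",
    "tws_down": "restart TWS service",
    "openshift_node_not_ready": "restart node and attach log",
    "fujitsu_rma_request": "collect logs and upload to Fujitsu FTP server",
}


def select_response(event_key: str) -> Optional[str]:
    """Return the automated response for an event, or None if there is none.

    A None result means the ticket follows the normal manual process.
    """
    return AUTO_RESPONSES.get(event_key)


for key in ("tws_down", "unknown_event"):
    response = select_response(key)
    print(key, "->", response if response else "no auto response; route to engineer")
```

Keeping the mapping in one table means that enabling auto response for a new incident type amounts to adding one entry and the script behind it.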

The NetApp-on-NetApp blogs feature advice from subject matter experts from NetApp IT who share their real experiences using NetApp’s industry-leading data management solutions to support business goals. Visit www.NetAppIT.com to learn more.

Andy Kranjec

Andy Kranjec is the Senior IT Manager of Infrastructure Operations at NetApp. Andy and his team provide process, technical, and administrative leadership to the organization responsible for day-to-day operations and execution of continuous service improvement, change management, and problem management processes.

Han Jo

Han Jo is a senior IT engineer, specializing in automation and monitoring tools at NetApp.
