Infrastructure monitoring tools collect enormous amounts of data about compute, network and storage environments. We do this for a simple purpose: to keep our infrastructure running smoothly and avoid outages. However, it is becoming less and less practical for a human to analyze all of this data and get value from it. We can no longer rely solely on the tried-and-true method of setting thresholds and alerts. There is simply too much data, and we do not always know in advance which value should trigger an alert.

 

Machine Learning to the rescue! Wait…what is machine learning?

Machine learning is the study of computer algorithms that improve automatically through experience. One of the hottest approaches these days is Bayesian inference. The core math is attributed to Thomas Bayes, who lived in the 18th century, so geeks have had a long time to ponder this one.

 

Simple example of Machine Learning

Bayesian logic uses knowledge of prior events to predict future events. Take a simple example: imagine that every day for the past 10 years you have left your house at 8:00 AM and it has taken 20 minutes to drive to work. If I asked you to bet on how long your commute will be tomorrow if you leave at 8:00, you would be wise to guess 20 minutes based on the past.

 

If, however, it actually took 40 minutes, we would call that unexpected behavior or, in fancy words, an anomaly. Often the algorithms also provide a score to indicate just how anomalous (unexpected) the value was. In my example, a commute time of 40 minutes would be more anomalous than one of 21 minutes.
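To make the idea of an anomaly score concrete, here is a minimal sketch in Python. It is not the Bayesian engine discussed in this post. It is a much simpler stand-in that scores a value by how many standard deviations it sits from the historical average. The function name `anomaly_score` and the commute times are made up for illustration.

```python
from statistics import mean, stdev


def anomaly_score(history, observed):
    """Score `observed` by its distance from the historical mean,
    measured in standard deviations (a simple z-score, used here
    as a stand-in for a real anomaly detection engine)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every past value was identical, so any deviation is maximally surprising.
        return 0.0 if observed == mu else float("inf")
    return abs(observed - mu) / sigma


# Hypothetical commute times in minutes (sample data invented for this sketch).
past_commutes = [20, 19, 21, 20, 22, 18, 20, 21, 19, 20]

score_21 = anomaly_score(past_commutes, 21)
score_40 = anomaly_score(past_commutes, 40)

# A 40 minute commute scores far higher than a 21 minute one.
print(score_21 < score_40)
```

A real engine learns a richer model of "normal" than a single average, but the principle is the same: the further a value falls from what the past predicts, the higher its score.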

 

Back to Infrastructure monitoring…

So we can feed all of the data that we capture about our infrastructure into machine learning and have the software watch for unexpected behavior! Now we do not have to manually program our alerts with thresholds. We just feed the software more and more data, and it will get smarter and smarter and tell us when something is unexpected. Cool, huh?

 

So what’s the downside?

First, you need lots of compute power to run these algorithms. Second, unexpected behavior is not always bad! Sometimes things are behaving strangely but harmlessly, or even better than expected, so we risk being alerted when our attention is not warranted. On the other hand, sometimes catching an abnormal event is super helpful and can save our bacon.

 

So, is this fake news?!

Nope. We just have work to do to optimize how we harness this power. The good news is that we have integrated machine learning in OnCommand Insight (OCI) for more than 2 years now, and hundreds of customers have given us feedback, good and bad, all of which informs the future.

 

Things we’ve learned along the way

Building close relationships with customers and soliciting their feedback has allowed our development team to pick up a few cool tricks for working with machine learning:

  1. Make it easy to pick how much data you feed into the engine, so that you spend compute resources on the most important stuff
  2. Warm up the learning: we feed the engine historical data to teach it before it comes online, which speeds up time to value
  3. Be careful not to pre-judge the severity of an anomaly; experts will still have an opinion
  4. We are not ready to ditch static thresholds, but using static thresholds and anomaly detection together is really powerful
  5. And lastly, pick a great machine learning engine (garbage in, garbage out still applies). We chose to partner with a company named Prelert, which is now a part of Elastic.co.
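
A few of the lessons above can be sketched in a few lines of Python: warming up on historical data, running a static threshold alongside anomaly detection, and reporting why a value was flagged rather than assigning a severity. This is only an illustration under stated assumptions. The class `MetricMonitor`, its parameters and the latency numbers are all hypothetical, the anomaly check is a simple z-score, and none of it reflects the actual OCI or Prelert implementation.

```python
from statistics import mean, stdev


class MetricMonitor:
    """Hypothetical monitor combining a static threshold with a
    simple z-score anomaly check (illustrative only)."""

    def __init__(self, static_limit, z_limit=3.0):
        self.static_limit = static_limit  # hard ceiling set by an expert
        self.z_limit = z_limit            # how unusual a value must be to flag it
        self.samples = []

    def warm_up(self, historical):
        """Teach the monitor what normal looks like before it goes online."""
        self.samples.extend(historical)

    def check(self, value):
        """Return the reasons a value was flagged; severity is left to the expert."""
        reasons = []
        if value > self.static_limit:
            reasons.append("static_threshold")
        if len(self.samples) >= 2:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.z_limit:
                reasons.append("anomaly")
        self.samples.append(value)  # keep learning from every new value
        return reasons


# Hypothetical latency readings in milliseconds (invented sample data).
monitor = MetricMonitor(static_limit=50)
monitor.warm_up([5, 6, 5, 7, 6, 5, 6, 7, 5, 6])

normal = monitor.check(6)    # typical value, nothing flagged
unusual = monitor.check(15)  # below the static limit, yet unusual for this metric
severe = monitor.check(60)   # breaks the static limit and is also unusual
print(normal, unusual, severe)
```

The middle reading is the interesting one: a static limit of 50 would never catch a jump from 6 to 15, while the anomaly check does. One design choice to note is that this sketch keeps learning from flagged values too, which a production engine may handle more carefully.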

What’s next?

The features above are just the beginning of the cool things we will see that are made possible with machine learning. Here are some features I predict we will see in the near future:

  1. Create mechanisms to alert the user when anomaly scores themselves are worthy of attention
  2. Increase scale: the more data we can process the better, as long as the cost of the compute power is worth it, so we are looking at ways to make this happen, including scaling out the machine learning engines on demand
  3. Apply machine learning to other domains: for example, could an anomaly in chargeback data indicate a financial issue?

If you want to learn more about how we use machine learning for datacenter infrastructure analytics, visit NetApp.com and read about OnCommand Insight.

Kurt Sand