According to survey results from ESG, 54 percent of enterprise organizations (i.e., those with 1,000 or more employees) consider data analytics a top-five IT priority, and 38 percent plan to deploy a new data analytics solution in the next 12-18 months. As part of these solutions, a growing number of IT organizations are using the open source Apache Hadoop MapReduce framework as the foundation for their big data analytics initiatives.


Although the typical Hadoop cluster leverages commodity server nodes with internal disk drives for storage, this approach introduces two challenges for growing enterprise customers.


First, data protection is handled in the Hadoop software layer. This means that every time a file is written to the Hadoop Distributed File System (HDFS), two additional copies are written to protect against disk drive or data node failures. This not only impacts data ingest and throughput performance, but also reduces disk capacity utilization.
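To make the capacity cost concrete, here is a minimal sketch (illustrative only, not drawn from the ESG report; the function name is hypothetical) of how the replication factor determines usable HDFS capacity:

```python
def usable_capacity_tb(raw_tb: float, replication_factor: int) -> float:
    """Usable HDFS capacity given raw disk capacity and a replication factor.

    Every block is stored replication_factor times, so usable capacity
    is the raw capacity divided by that factor.
    """
    return raw_tb / replication_factor

# With HDFS's default of three copies (one original plus two replicas),
# 300 TB of raw disk yields only 100 TB of usable capacity.
print(usable_capacity_tb(300, 3))  # 100.0

# Dropping to two copies, viable when the storage layer itself provides
# hardware RAID protection, raises that to 150 TB.
print(usable_capacity_tb(300, 2))  # 150.0
```

This simple ratio is why moving data protection out of the software layer and into hardware RAID can meaningfully improve capacity utilization.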


Second, high availability is limited by a single point of failure in the Hadoop metadata repository (the NameNode), and according to respondents, downtime due to a node failure is a key concern. A majority of the ESG respondents indicated that three hours or less of data analytics platform downtime would result in a significant revenue loss or other adverse business impact.


As indicated in the figure below, job completion time for each TeraSort run was recorded as the amount of data generated, the number of data nodes, and the number of E-Series arrays were increased in proportion. As the cluster grew, job completion time remained flat at approximately 30 minutes (+/- 2 minutes), and aggregate analytics throughput scaled linearly as data nodes and E-Series arrays were added to the cluster.


NetApp for Hadoop Performance Scalability

The benefit of NetApp was magnified as the size of the cluster and the amount of network traffic increased, as seen in the graphic below. Note how the “Replication 2 with NetApp” line, marked in green, increases linearly compared to the “Replication 3” line, marked in red. Also, the gap between the two increases as the cluster grows due to the increase in network traffic. The NetApp solution resulted in a peak aggregate throughput improvement of 62.5 percent during the 24-node test (4.629 vs. 2.849 GB/sec). The increase in cluster efficiency with NetApp not only reduced job completion times, but also increased aggregate throughput.
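The 62.5 percent figure follows directly from the two peak throughput numbers reported above; this quick arithmetic check is illustrative, with variable names of my own choosing:

```python
# Peak aggregate throughput from the 24-node test, in GB/sec.
replication3_gbps = 2.849         # "Replication 3" baseline
replication2_netapp_gbps = 4.629  # "Replication 2 with NetApp"

# Relative improvement of the NetApp configuration over the baseline.
improvement_pct = (replication2_netapp_gbps / replication3_gbps - 1) * 100
print(round(improvement_pct, 1))  # 62.5
```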

Increasing Hadoop Cluster Efficiency with NetApp

The lab validation report from ESG shows that the capacity and performance of the NetApp solution scaled linearly when data nodes and NetApp E-Series storage arrays were added to a Hadoop cluster. ESG Lab also confirmed that NetApp Open Solution for Hadoop reduces name node recovery time from hours to minutes, and that NetApp E2660s with hardware RAID dramatically improved recoverability after simulated hard drive failures. A MapReduce job running during a simulated internal drive failure took more than twice as long (225 percent) to complete as it did during the failure of a hardware RAID-protected E2660 drive.


Mike McNamara

Mike McNamara is a senior manager of product and solution marketing at NetApp with over 25 years of storage and data management marketing experience. Before joining NetApp over 10 years ago, Mike worked at Adaptec, EMC, and Digital Equipment Corporation. Mike was a key leader driving the launch of the industry's first unified scale-out storage system (NetApp), iSCSI and SAS storage system (Adaptec), and Fibre Channel storage system (EMC CLARiiON). In addition to his past role as marketing chairperson for the Fibre Channel Industry Association, he is a member of the Ethernet Technology Summit Conference Advisory Board, a member of the Ethernet Alliance, a regular contributor to industry journals, and a frequent speaker at events.