ONTAP Recipes: Did you know you can…?
Easily create a data lake using Apache Hadoop and ONTAP storage
The term “data lake” refers to a centralized store for enterprise data (structured, semi-structured, and unstructured) that is used by multiple enterprise applications.
This recipe highlights the steps to create a data lake with Apache Hadoop on ONTAP storage.
1. Determine hardware and network requirements using the Hortonworks cluster planning guide:
For evaluation purposes, a single server may be sufficient.
A two-node cluster can be configured with two servers: one master node and one worker node.
Larger clusters intended for production will require the following:
- 1 NameNode server
- 1 ResourceManager server
- Several worker node servers, each running both the DataNode and NodeManager services. The number of worker nodes will depend on the desired compute capacity. The planning guide should help with making that determination.
HA is recommended for production clusters, so you may also need a standby NameNode server and a standby ResourceManager server.
2. Determine the data set size. Since data in a Hadoop cluster can grow quickly, increase that size by 20% or more, based on growth projections.
3. Apply the HDFS replication factor to the data set size to determine storage requirements.
For ONTAP storage, a replication factor of 2 is sufficient, because the underlying ONTAP aggregates already provide RAID protection. To get the storage requirement, multiply the data set size by 2.
4. Calculate the storage for each datanode so that the total data set is spread evenly across all of them.
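To make the sizing in steps 2 through 4 concrete, here is a minimal Python sketch. The data set size, growth allowance, and node count are hypothetical values; substitute your own numbers.

```python
# Rough HDFS-on-ONTAP capacity sizing (hypothetical numbers).
raw_data_tb = 100              # current data set size, in TB
growth_factor = 1.20           # step 2: allow 20% (or more) for growth
replication_factor = 2         # step 3: HDFS replication of 2 on ONTAP
worker_nodes = 8               # number of datanode servers

projected_data_tb = raw_data_tb * growth_factor              # step 2
total_storage_tb = projected_data_tb * replication_factor    # step 3
per_datanode_tb = total_storage_tb / worker_nodes            # step 4

print(f"Projected data set:   {projected_data_tb:.1f} TB")
print(f"Total HDFS storage:   {total_storage_tb:.1f} TB")
print(f"Per-datanode storage: {per_datanode_tb:.1f} TB")
```

With these example numbers, each of the eight datanodes needs roughly 30 TB of ONTAP-backed storage.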
Per NetApp SAN best practices, configure the storage as follows:
a. Configure an SVM with the FC protocol enabled.
b. Configure LIFs, aggregates, volumes, and LUNs to meet the storage requirements. Two LUNs per datanode, with one LUN per volume, should be sufficient.
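As a sketch of what the layout in item b might look like when scripted, the following uses the netapp_ontap Python client library (the ONTAP REST API client) to create two volumes per datanode, each containing a single LUN. The cluster address, SVM, aggregate, node count, names, and sizes are hypothetical, and field names may need adjusting for your ONTAP release; the same layout can also be built in System Manager or the ONTAP CLI.

```python
# Sketch: provision two volume/LUN pairs per datanode with the
# netapp_ontap REST client (hypothetical names and sizes).
from netapp_ontap import config, HostConnection
from netapp_ontap.resources import Volume, Lun

config.CONNECTION = HostConnection(
    "cluster-mgmt.example.com", username="admin", password="********",
    verify=False,
)

SVM = "hadoop_svm"          # SVM created in step (a), FC enabled
AGGR = "aggr_hadoop01"      # data aggregate backing the volumes
LUN_SIZE = 15 * 1024**4     # 15 TiB per LUN (two per node, from the sizing above)

for node in range(1, 9):                 # 8 worker nodes
    for lun_idx in range(2):             # two LUNs per datanode
        vol_name = f"dn{node:02d}_vol{lun_idx}"
        Volume(name=vol_name,
               svm={"name": SVM},
               aggregates=[{"name": AGGR}],
               size=int(LUN_SIZE * 1.05)).post()   # small overhead margin
        Lun(name=f"/vol/{vol_name}/{vol_name}_lun",
            svm={"name": SVM},
            os_type="linux",
            space={"size": LUN_SIZE}).post()
```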
5. Refer to the Hortonworks Ambari automated install documentation and then complete the Hadoop installation:
https://docs.hortonworks.com/HDPDocuments/Ambari/Ambari-2.2.2.0/index.html
a. Determine which server operating system will be used and then configure your servers per the minimum system requirements.
b. On the storage array, create FC igroups and map storage LUNs to the datanodes.
c. On the datanode servers, partition the LUNs and create file systems on them (see the host-side sketch after this list).
d. Create mountpoints on the datanodes for the new file systems.
e. Mount the file systems on the datanodes.
f. Follow the procedure outlined in the Ambari documentation for preparing the environment, configuring the Ambari repository, installing the Ambari Server, and deploying the HDP cluster. Once the Ambari Server has been installed, the deployment is a guided, automated procedure. During cluster configuration, point the DataNode data directories (dfs.datanode.data.dir) at the ONTAP-backed mount points created above.
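For sub-steps c through e, the host-side work on each datanode amounts to partitioning each mapped LUN, creating a file system, and mounting it. Here is a minimal sketch using Python to drive standard Linux tools; the device names and mount points are hypothetical, and with multipathing the devices would appear as /dev/mapper names instead.

```python
# Sketch: prepare and mount two ONTAP LUNs on a datanode
# (hypothetical device names; run as root).
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Mapped LUNs as seen by this host, and where to mount them.
luns = {"/dev/sdb": "/hadoop/disk0",
        "/dev/sdc": "/hadoop/disk1"}

for device, mountpoint in luns.items():
    run(["parted", "-s", device, "mklabel", "gpt",
         "mkpart", "primary", "0%", "100%"])      # step c: partition the LUN
    partition = device + "1"
    run(["mkfs.xfs", "-f", partition])            # step c: create the file system
    run(["mkdir", "-p", mountpoint])              # step d: create the mountpoint
    run(["mount", partition, mountpoint])         # step e: mount the file system
    # For persistence across reboots, also add each file system to /etc/fstab.
```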
After the Ambari Hadoop deployment has finished, data can be loaded into HDFS using a number of utilities, including Flume and Sqoop. We’re now able to harness all the power of ONTAP for Hadoop.
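As a quick end-to-end check of the new data lake, a few files can be pushed into HDFS with the standard hdfs dfs commands; Flume and Sqoop have their own configuration-driven workflows that are not shown here. The paths below are hypothetical.

```python
# Sketch: copy a local file into HDFS as a simple smoke test
# (hypothetical paths; assumes the 'hdfs' CLI is on the PATH).
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and fail loudly on errors."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/datalake/raw/logs")              # create a landing area
hdfs("-put", "/tmp/weblogs.gz", "/datalake/raw/logs/")  # load a local file
hdfs("-df", "-h")                                       # confirm HDFS capacity
```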
Below is a diagram showing an example of an ONTAP-based data lake:
For more information, see the ONTAP 9 documentation center.