Calculating Hadoop Cluster Capacity

The following instructions are meant for Integrate.io users who want to calculate the number of Hadoop cluster nodes needed to process their jobs.

Before estimating the cluster resources, you must consider the following parameters as they relate to your system:

Data size, compression type, and file format
Workload type: direct ingestion, heavy joins, or CPU-intensive processing

For more information on Integrate.io's native Hadoop HDFS connector, visit our Integration page.

Storage Considerations

The number of data nodes you need is determined by the size of the data, how it will be analyzed, and the number of replicas you will keep. By default, Apache Hadoop (HDFS) stores 3 copies of each block of data.

With 3 replicas, storing X GB of data requires X*3 GB of raw storage for the forecast period.
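
The replication arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name raw_storage_gb and its parameters are hypothetical, and the default of 3 replicas is taken from the text above.

```python
def raw_storage_gb(data_gb: float, replication_factor: int = 3) -> float:
    """Return the raw storage (in GB) needed to hold data_gb of data
    when every block is stored replication_factor times."""
    if data_gb < 0:
        raise ValueError("data_gb must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return data_gb * replication_factor


if __name__ == "__main__":
    # Example: 1,000 GB of data with the default 3 replicas.
    print(raw_storage_gb(1000))
```

If your cluster uses a different replication setting, pass it as the second argument instead of relying on the default.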

Processing Considerations

In addition to having enough space to store your data, you will need room for data processing, computing, and miscellaneous other tasks.

We can assume that, on an average day, only 10% of the data is being processed, and that processing creates temporary data about three times the size of the data being processed (10% x 3 = 30%). Therefore, you need to account for around 30% of your total storage as extra space.
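
As a rough sketch of this estimate, the Python below applies the 10% and 3x assumptions to a total storage figure. The function and parameter names are hypothetical, and both ratios are rules of thumb from the paragraph above, not measured values; adjust them to match your own workload.

```python
def processing_overhead_gb(
    total_storage_gb: float,
    daily_processed_fraction: float = 0.10,  # assumed share of data processed per day
    temp_data_multiplier: float = 3.0,       # assumed temporary data per unit processed
) -> float:
    """Extra space (in GB) to reserve for temporary processing data."""
    return total_storage_gb * daily_processed_fraction * temp_data_multiplier


def storage_with_overhead_gb(total_storage_gb: float) -> float:
    """Total storage plus the processing overhead, using the default ratios."""
    return total_storage_gb + processing_overhead_gb(total_storage_gb)


if __name__ == "__main__":
    # Example: 3,000 GB of total storage needs roughly 900 GB of extra space.
    print(processing_overhead_gb(3000))
    print(storage_with_overhead_gb(3000))
```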

Number of Data Nodes Required

The final calculation for the number of data nodes required for your system depends on your JBOD (“just a bunch of disks”) capacity.

For example: Let's say that you need 500 TB of space. If you have a JBOD of 12 disks, and each disk can store 6 TB of data, then the data node capacity, or the maximum amount of data that each node can store, will be 12 x 6 = 72 TB. Data nodes can be added as the data grows, so to start with it's better to select the lowest number of data nodes required.

In this case, the number of data nodes required to store 500 TB of data equals 500/72, or approximately 6.9, which rounds up to 7.

Note: Number of data nodes = (total storage required) / (no. of disks per node * capacity per disk), rounded up to the next whole number.
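
The node count calculation can be sketched as follows. The function names are hypothetical, and the sketch assumes that every data node has the same number of disks and that all disks have the same capacity. It rounds up, because a fraction of a node still requires a whole node.

```python
import math


def node_capacity_tb(disks_per_node: int, disk_capacity_tb: float) -> float:
    """Maximum amount of data (in TB) that one data node can store."""
    return disks_per_node * disk_capacity_tb


def data_nodes_required(
    required_storage_tb: float, disks_per_node: int, disk_capacity_tb: float
) -> int:
    """Smallest whole number of data nodes that can hold required_storage_tb."""
    capacity = node_capacity_tb(disks_per_node, disk_capacity_tb)
    if capacity <= 0:
        raise ValueError("node capacity must be positive")
    return math.ceil(required_storage_tb / capacity)


if __name__ == "__main__":
    # Example figures: 500 TB required, 12 disks per node, 6 TB per disk.
    print(node_capacity_tb(12, 6))
    print(data_nodes_required(500, 12, 6))
```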