Block size in Hadoop

Saumya Singh
3 min read · Sep 29, 2020

Files in HDFS are broken into block-sized chunks called data blocks. These blocks are stored as independent units.

In Apache Hadoop 1.x the default block size is 64 MB, while in Hadoop 2.x and in distributions such as Cloudera's the default is 128 MB. If the block size were much smaller than this, there would be a huge number of blocks throughout the cluster, and the NameNode would have to manage an enormous amount of metadata.

The block size is the largest chunk the file system will store as a single unit. If you store a file that's 1 KB or 60 MB, it takes up one block (unlike a local file system, HDFS does not pad a small file out to the full block size on disk). Once you cross the 64 MB boundary, you need a second block.
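As a quick illustration, here is a minimal, standalone Java sketch (no Hadoop dependencies; the file sizes are made-up examples) of how the block count for a file follows from the block size:

```java
public class BlockCount {

    // Number of blocks = ceiling(fileSize / blockSize); a non-empty file
    // always occupies at least one block entry on the NameNode.
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        return Math.max(1, (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 64 * mb;                           // Hadoop 1.x default

        System.out.println(blocksFor(1024, blockSize));     // 1 KB  -> 1 block
        System.out.println(blocksFor(60 * mb, blockSize));  // 60 MB -> 1 block
        System.out.println(blocksFor(65 * mb, blockSize));  // 65 MB -> 2 blocks
    }
}
```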

To reduce the burden on the NameNode, HDFS prefers a block size of 64 MB or 128 MB. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x. A Hadoop administrator has control over the block size configured for the cluster.

If we are managing a cluster holding 1 petabyte of data with a 64 MB block size, roughly 15 million blocks would be created, which is difficult for the NameNode to manage. That is why the default block size was increased from 64 MB to 128 MB.
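The "15+ million" figure is simple back-of-the-envelope arithmetic; here is a sketch (assuming 1 PB = 10^15 bytes):

```java
public class ClusterBlockCount {
    public static void main(String[] args) {
        long oneMB = 1024L * 1024;
        long onePB = 1_000_000_000_000_000L;        // ~1 petabyte of data

        System.out.println(onePB / (64 * oneMB));   // ~14.9 million blocks at 64 MB
        System.out.println(onePB / (128 * oneMB));  // ~7.5 million blocks at 128 MB
    }
}
```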

Note: Block size and split size are slightly different, so don't confuse them. A split is a logical division of the input data, while a block is a physical division of the data.
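For reference, the split size that MapReduce's FileInputFormat uses is derived from the block size roughly as in the sketch below (a simplified reconstruction of that logic, not the exact library code; the min/max values normally come from the mapreduce.input.fileinputformat.split.minsize and .maxsize properties):

```java
public class SplitSize {

    // Simplified form of FileInputFormat's split-size rule:
    // the split size is the block size, clamped between min and max.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // With default min/max settings the split size equals the block size,
        // so one map task usually reads exactly one block.
        System.out.println(computeSplitSize(128 * mb, 1, Long.MAX_VALUE) / mb); // 128
    }
}
```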

The NameNode memory problem

Every directory, file, and block in Hadoop is represented as an object in memory on the NameNode. As a rule of thumb, each object requires about 150 bytes of memory.
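Combining that rule of thumb with the earlier numbers gives a feel for the NameNode heap; this is a rough sketch only (real usage also depends on the number of files and directories, replication, and the Hadoop version):

```java
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long bytesPerObject = 150;          // rule-of-thumb metadata cost per object
        long blocks = 15_000_000L;          // ~1 PB of data at a 64 MB block size

        long heapBytes = blocks * bytesPerObject;
        // ~2.1 GB of heap just for the block objects
        System.out.println(heapBytes / (1024 * 1024) + " MB for block objects alone");
    }
}
```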

Advantages of a large block size:

The reason for having such a large block size is to minimize the cost of disk seeks and to reduce the amount of metadata, since fewer blocks are generated per file.

To improve NameNode performance.

To improve the performance of MapReduce jobs, since the number of mappers depends directly on the block size.

The block size is configurable and can be changed by setting the dfs.blocksize property in hdfs-site.xml.

Depending on the workload, it is both possible and advisable to tune the block size to whatever value gives you maximum throughput.
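As a sketch of a per-client override, here is how a program could set dfs.blocksize through the standard org.apache.hadoop.fs API before writing a file (the path and the 256 MB value are purely illustrative; the cluster-wide default still lives in hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default block size (dfs.blocksize) for files created
        // through this Configuration: 256 MB instead of the cluster default.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"))) {
            out.writeBytes("data written with a 256 MB block size\n");
        }
    }
}
```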

Issues with Block Size:

Issues with small block size:

  • A small block size is also a problem for the NameNode, since it keeps the metadata for every block in memory. With very small blocks, the NameNode can run out of memory.
  • Too small a block size would also produce a large number of unnecessary splits, and in turn more tasks than the cluster has capacity to run.

Issues with large block size:

  • The cluster would be underutilized: with a large block size there are fewer splits, and therefore fewer map tasks, which slows the job down.
  • A large block size would therefore decrease parallelism.

Conclusion:

So, basically, when the block size is too small, it results in many splits, which generate more tasks than the cluster has capacity to run. And when the block size is too large, it results in under-utilization of the cluster and less parallel processing. Therefore, the 64/128 MB block size is often recommended as a good balance.

So, I think the smallest block size could in principle be tiny (on the order of bytes) and the maximum could be as large as the biggest file a DataNode can store, but both extremes are impractical for the reasons above.
