HDFS – Hadoop Distributed File System distilled

Hadoop has its own filesystem, the Hadoop Distributed File System (HDFS), built for distributed parallel computing. It appears as a single disk to the system and runs on top of native filesystems such as ext3, ext4, and XFS. HDFS is based on Google's File System (GFS).

Features of HDFS

Fault Tolerant – Can handle disk crashes, node crashes, and similar failures.

Distributed – Data is distributed across a cluster of machines.

High Availability – With replication of data, if one node goes down, processing can continue with the data on another node.

Commodity Hardware – No need for supercomputers; HDFS runs on unreliable commodity hardware. Failures are common, so data is replicated across machines.

HDFS Architecture


HDFS is good for…

Storing large files – HDFS is designed for terabytes and petabytes of data, and works better with millions of large files than with billions of small files. It performs well with files of 100 MB or more.

Streaming Data – HDFS suits write-once, read-many access patterns, and it is optimized for streaming reads rather than random reads. An append() operation was added in Hadoop 0.21.
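A quick back-of-the-envelope calculation (a sketch, using the 64 MB default block size mentioned below) shows how a single large file maps onto HDFS blocks:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size (64 MB)

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 1 GB file splits into 16 blocks; a 1 TB file into 16,384 blocks,
# which can be scattered (and replicated) across the whole cluster.
print(num_blocks(1024 ** 3))  # 16
print(num_blocks(1024 ** 4))  # 16384
```

Because each block can live on a different machine, a streaming read of a large file can pull data from many disks in parallel.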

HDFS is not so good for…

Low-latency reads – Queries that need small chunks of data at low latency, selected by conditions, are a poor fit. HBase addresses this use case.

Large numbers of small files – HDFS handles millions of large files better than billions of small files, because every file and block is tracked in the NameNode's memory. Block sizes are typically 100 MB or more; the default block size is 64 MB.

Multiple writers – HDFS allows a single writer per file, and writes happen only at the end of the file; there is no support for writing at arbitrary offsets, at least for now.
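A rough sketch of why billions of small files strain the NameNode. The ~150 bytes-per-object figure used here is a commonly cited approximation for NameNode heap usage, not a number from this article:

```python
# Assumption: each file and each block costs roughly 150 bytes of
# NameNode heap (an approximate, commonly cited figure).
BYTES_PER_OBJECT = 150

def namenode_overhead(num_files: int, blocks_per_file: int) -> int:
    """Approximate NameNode heap used: one object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 1 billion small files (1 block each) needs ~280 GiB of NameNode heap,
# while 10 million large files (16 blocks each) needs only ~24 GiB,
# even though the large files hold far more actual data.
small = namenode_overhead(1_000_000_000, 1)
large = namenode_overhead(10_000_000, 16)
print(small // 2**30, large // 2**30)
```

Since all of this metadata must fit in one machine's RAM, file count, not total data volume, is often the limiting factor.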

Block Replication in HDFS

The NameNode determines replica placement, and placement is rack-aware.

  1. HDFS must balance reliability against performance. With replication, HDFS improves reliability by placing replicas on multiple racks while trying to keep cross-rack bandwidth use low.
  2. The default replication factor is 3. It can be configured through the dfs.replication property in the config files.
  3. For the default replication factor of 3, HDFS places the first replica on the writer's node (or a random node if the writer is outside the cluster), the second replica on a node in a different rack, and the third on a different node in the same rack as the second.
  4. However, this policy may change in the future.
  5. The NameNode does not read or write data directly, which is one of the reasons for HDFS's scalability.
  6. The client interacts with the NameNode to update the HDFS namespace and to retrieve block locations for reading and writing.
  7. The client interacts directly with DataNodes to read and write data.
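The default rack-aware placement for a replication factor of 3 can be sketched as a toy simulation. This is an illustration only; the real logic lives inside the NameNode, and the function and data structures here are hypothetical:

```python
import random

def place_replicas(writer: str, nodes_by_rack: dict) -> list:
    """Toy version of default placement for replication factor 3.

    nodes_by_rack maps a rack id to the list of node names on that rack.
    """
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    replicas = [writer]  # replica 1: the writer's own node
    # replica 2: a node on a different rack (protects against rack failure)
    other_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second = random.choice(nodes_by_rack[other_rack])
    replicas.append(second)
    # replica 3: a different node on the SAME rack as replica 2
    # (cheaper than a third rack, still survives a whole-rack loss)
    third = random.choice([n for n in nodes_by_rack[other_rack] if n != second])
    replicas.append(third)
    return replicas

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", cluster))  # e.g. ['n1', 'n3', 'n4']
```

The design choice is visible in the code: two racks instead of three trades a little reliability for much less cross-rack write traffic.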

Reading and writing files to HDFS

I found this article very lucid in explaining how files are read from and written to HDFS.
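The read path described in steps 5–7 above can be sketched as a toy simulation. All class and method names here are illustrative, not Hadoop's real API; the point is that the NameNode serves only metadata, while the bytes flow directly between client and DataNodes:

```python
class NameNode:
    """Holds only metadata: file path -> ordered (block_id, [DataNodes])."""
    def __init__(self):
        self.namespace = {}

    def get_block_locations(self, path):
        return self.namespace[path]

class DataNode:
    """Holds the actual block bytes."""
    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, path):
    data = b""
    # Step 6: ask the NameNode where the blocks live (metadata only).
    for block_id, locations in namenode.get_block_locations(path):
        # Step 7: stream the bytes straight from a DataNode.
        data += locations[0].read_block(block_id)
    return data

# Wire up a tiny two-node "cluster" by hand.
nn = NameNode()
dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"
nn.namespace["/demo.txt"] = [("blk_1", [dn1]), ("blk_2", [dn2])]

print(read_file(nn, "/demo.txt"))  # b'hello world'
```

Because the NameNode never touches block data, it stays out of the data path entirely, which is exactly the scalability property noted in step 5.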


Links and References

  1. http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
  2. http://hortonworks.com/hadoop/hdfs/
  3. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/hdfs-and-mapreduce.html
  4. https://developer.yahoo.com/hadoop/tutorial/module2.html
  5. http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/
