Like other large-scale platforms, Hadoop has its own filesystem, the Hadoop Distributed File System (HDFS), designed for distributed parallel computing. It appears as a single disk to the system and runs on top of native filesystems such as Ext3, Ext4, and XFS. HDFS is based on Google's File System (GFS).
Features of HDFS
Fault Tolerant – Can handle disk crashes, node crashes, and similar failures.
Distributed – Data is distributed across a cluster of machines.
High Availability – With replication of data, if one node goes down, processing can continue with the data on another node.
Commodity Hardware – No need for supercomputers; HDFS runs on commodity, unreliable hardware. Failures are common, so data is replicated across machines.
HDFS is good for…
Storing large files – HDFS is designed for terabytes and petabytes of data, and works better with millions of large files than with billions of small files. It gives good performance with files of 100MB or more.
Streaming Data – HDFS works well with write-once, read-many-times access patterns and is optimized for streaming reads rather than random reads. (An append() operation was added in Hadoop 0.21.)
HDFS is not so good for..
Low-latency reads – If you have to fire low-latency queries that return small chunks of data based on conditions, HDFS is not a good fit. HBase addresses this use case.
Large numbers of small files – HDFS works better with millions of large files than with billions of small files. The default block size is 64MB, and it can be configured to 100MB or more.
Multiple writers – HDFS allows a single writer per file, and writes only at the end of the file; there is no support for writing at an arbitrary offset, at least for now.
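The small-files problem above comes down to NameNode memory: every file and every block is a namespace object held in the NameNode's heap. A back-of-envelope sketch (assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode memory per namespace object; exact numbers vary by Hadoop version):

```python
# Back-of-envelope: why billions of small files strain the NameNode.
# Assumes ~150 bytes of NameNode heap per namespace object (file or block),
# a commonly cited rule of thumb; exact figures vary by Hadoop version.

BLOCK_SIZE = 64 * 1024 * 1024          # default HDFS block size (64MB)
BYTES_PER_OBJECT = 150                 # rough NameNode memory per object

def namenode_objects(total_bytes, file_size):
    """Count files + blocks needed to store total_bytes as file_size chunks."""
    n_files = total_bytes // file_size
    blocks_per_file = -(-file_size // BLOCK_SIZE)   # ceiling division
    return n_files * (1 + blocks_per_file)

TB = 1024 ** 4
large = namenode_objects(1 * TB, 1024 ** 3)   # 1TB stored as 1GB files
small = namenode_objects(1 * TB, 1024 ** 2)   # 1TB stored as 1MB files

print(large)  # 1024 files * (1 file obj + 16 block objs) = 17408 objects
print(small)  # 1048576 files * (1 file obj + 1 block obj) = 2097152 objects
```

Same 1TB of data, but the small-file layout creates over a hundred times as many NameNode objects, which is why HDFS prefers fewer, larger files.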
Block Replication in HDFS
Name node determines replica placement, replica placements are rack aware.
- HDFS must balance reliability against performance. With replication, HDFS attempts to reduce bandwidth usage and to improve reliability by placing replicas on multiple racks.
- The default replication factor is 3. This can be configured through the dfs.replication property in the config files.
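For example, dfs.replication is typically set in hdfs-site.xml (shown here with the default value of 3):

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```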
- HDFS follows the replication strategy below for the default replication factor of 3.
1. The 1st replica will be stored on the local rack.
2. The 2nd replica will be on the local rack, but on a different machine.
3. The 3rd replica will be on a different rack.
- However, this policy may change in the future.
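The placement strategy above can be sketched as follows. This is an illustrative toy only — the names here (choose_replica_nodes, the cluster dict) are hypothetical, and Hadoop's actual default placement policy is considerably more involved:

```python
import random

# Toy sketch of the rack-aware placement described above (not Hadoop code).
# `cluster` is a hypothetical structure: rack name -> list of DataNodes.
# Assumes the local rack has at least 2 nodes and there are at least 2 racks.

def choose_replica_nodes(cluster, local_rack, local_node):
    """Pick 3 nodes for a block: writer's node, same-rack peer, remote rack."""
    first = local_node                                    # 1st: local machine
    second = random.choice(
        [n for n in cluster[local_rack] if n != first])   # 2nd: same rack, other machine
    remote_rack = random.choice(
        [r for r in cluster if r != local_rack])          # 3rd: a different rack
    third = random.choice(cluster[remote_rack])
    return [first, second, third]

cluster = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}
print(choose_replica_nodes(cluster, "rack1", "dn1"))
```

Keeping two replicas on the local rack reduces cross-rack write traffic, while the third replica on a remote rack protects against the loss of an entire rack.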
- The Name node doesn’t directly write or read data, which is one of the reasons for HDFS’s scalability.
- The client interacts with the Namenode to update the Namenode’s HDFS namespace and to retrieve block locations for writing and reading data.
- The client interacts directly with DataNodes to read and write data.
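The read path described in these bullets can be simulated in a few lines. This is a toy model, not real Hadoop code — all class and function names here are hypothetical — but it captures the key design point: the NameNode serves only metadata, and the actual bytes flow straight between the client and the DataNodes:

```python
# Toy simulation of the HDFS read path: one metadata round-trip to the
# NameNode, then direct block reads from DataNodes. Names are hypothetical.

class NameNode:
    """Holds namespace metadata: file path -> ordered (block_id, datanode)."""
    def __init__(self):
        self.block_map = {}

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def get_block_locations(self, path):
        return self.block_map[path]        # metadata only -- never file data

class DataNode:
    """Stores the actual block bytes."""
    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, path):
    """Client read: ask the NameNode where the blocks live, then stream
    the bytes directly from each DataNode in order."""
    data = b""
    for block_id, dn_name in namenode.get_block_locations(path):
        data += datanodes[dn_name].read_block(block_id)
    return data

nn = NameNode()
dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hdfs"
nn.add_file("/user/demo.txt", [("blk_1", "dn1"), ("blk_2", "dn2")])

print(read_file(nn, {"dn1": dn1, "dn2": dn2}, "/user/demo.txt"))  # b'hello hdfs'
```

Because the NameNode never sits on the data path, adding DataNodes scales read/write throughput without the NameNode becoming a bottleneck.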
Reading and Writing Files to HDFS
I found this article very lucid in explaining how files are read from and written to HDFS.
Links and References