Thursday, December 20, 2012

Easy Big Data - 1 word before Hadoop - HDFS

Problem – The amount of stored data keeps growing; we have more of it every day. We buy a 1 TB hard drive, then we need to buy another one, and so on.

Solution – To overcome this problem, a distributed file system was conceived.

  • When dealing with large files, I/O becomes a big bottleneck. So we divide the files into small blocks and store them on multiple machines. [Block Storage]
  • When we need to read a file, the client sends requests to multiple machines; each machine sends back a block of the file, and the blocks are then combined to piece the whole file together.
  • With block storage, data access is spread across machines, which leads to faster reads and writes.
  • Because the same block is stored on multiple machines, there is no single point of failure for the data. If one machine goes down, the client can request the block from another machine.
  • Any solution that stores files as blocks needs to have the following characteristics:

    • Manage the metadata – Since a file is broken into multiple blocks, something needs to keep track of the number of blocks and of which machines store each block [NameNode]
    • Manage the stored blocks of data and fulfill read/write requests [DataNodes]
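
The ideas in the list above can be sketched in a few lines of Python. This is only a toy model, not how HDFS is implemented: the names `split_into_blocks`, `place_blocks` and `read_file`, the tiny block size, the replication count and the round-robin placement are all assumptions made for illustration.

```python
BLOCK_SIZE = 4    # bytes; tiny on purpose (assumption, real blocks are far larger)
REPLICATION = 2   # copies kept of each block (assumption for this sketch)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, machines, replication=REPLICATION):
    """Store each block on `replication` different machines, round-robin.

    Returns (block_map, storage):
      block_map: block id -> list of machine names holding that block
      storage:   machine name -> {block id: block bytes}
    Assumes replication <= len(machines) so the copies land on distinct machines.
    """
    storage = {m: {} for m in machines}
    block_map = {}
    for block_id, block in enumerate(blocks):
        targets = [machines[(block_id + r) % len(machines)]
                   for r in range(replication)]
        for m in targets:
            storage[m][block_id] = block
        block_map[block_id] = targets
    return block_map, storage


def read_file(block_map, storage, down=frozenset()):
    """Reassemble the file, skipping machines listed in `down`."""
    parts = []
    for block_id in sorted(block_map):
        for m in block_map[block_id]:
            if m not in down:
                parts.append(storage[m][block_id])
                break
        else:
            # No live machine holds this block.
            raise IOError(f"block {block_id} is unavailable")
    return b"".join(parts)


data = b"hello distributed world"
blocks = split_into_blocks(data)
block_map, storage = place_blocks(blocks, ["m1", "m2", "m3"])
restored = read_file(block_map, storage, down={"m1"})  # m1 has failed
```

Here `block_map` plays the role of the metadata and `storage` the role of the machines holding blocks. With one machine marked as down, every block is still readable from its second copy.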

    So, in the context of Hadoop: the NameNode is the arbitrator and repository for all metadata. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. All these components together form the distributed file system called HDFS (Hadoop Distributed File System).
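
    To make the division of labour concrete, here is a rough Python sketch of the two roles. It is not the Hadoop API: `ToyNameNode`, `ToyDataNode` and all their methods are hypothetical names invented for this post, and block placement is simplified to round-robin.

```python
class ToyDataNode:
    """Holds block contents and serves reads/writes (hypothetical class)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]

    def delete_block(self, block_id):
        self.blocks.pop(block_id, None)


class ToyNameNode:
    """Holds only metadata: the namespace and the block-to-DataNode mapping."""

    def __init__(self, datanodes):
        self.datanodes = {dn.name: dn for dn in datanodes}
        self.namespace = {}   # file path -> ordered list of block ids
        self.locations = {}   # block id -> list of DataNode names
        self._next_block_id = 0

    def allocate_block(self, path, replication=2):
        """Register a new block for `path` and choose where its copies go."""
        block_id = self._next_block_id
        self._next_block_id += 1
        names = sorted(self.datanodes)
        targets = [names[(block_id + r) % len(names)] for r in range(replication)]
        self.namespace.setdefault(path, []).append(block_id)
        self.locations[block_id] = targets
        return block_id, targets

    def rename(self, old_path, new_path):
        """A namespace operation: only metadata changes, no block moves."""
        self.namespace[new_path] = self.namespace.pop(old_path)

    def get_block_locations(self, path):
        return [(b, self.locations[b]) for b in self.namespace[path]]

    def delete(self, path):
        """Drop the file and instruct DataNodes to delete its blocks."""
        for block_id in self.namespace.pop(path):
            for name in self.locations.pop(block_id):
                self.datanodes[name].delete_block(block_id)


nn = ToyNameNode([ToyDataNode(n) for n in ("dn1", "dn2", "dn3")])
for chunk in (b"abc", b"def"):
    block_id, targets = nn.allocate_block("/logs/a.txt")
    for t in targets:                      # client writes to the chosen DataNodes
        nn.datanodes[t].write_block(block_id, chunk)
nn.rename("/logs/a.txt", "/logs/b.txt")
content = b"".join(nn.datanodes[locs[0]].read_block(b)
                   for b, locs in nn.get_block_locations("/logs/b.txt"))
```

    Note that the file contents never pass through `ToyNameNode`: the client asks it where the blocks live and then talks to the DataNodes directly, which mirrors the split of responsibilities described above.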


    So what have we achieved? We work with a distributed file system spread over tens of computers, with a single point of entry.

    Source
    HDFS on Hadoop
    HDFS
