Friday, December 21, 2012

Easy Big Data - Map Reduce - Inside Hadoop

What the hell is Map Reduce?

MapReduce is the concept that Hadoop is built on, and Hadoop is one of the most popular Big-Data solutions, so we have to know the basics in order to continue.
So let's start with a problem:
Problem: count the number of occurrences of each word in a paragraph.


So the algorithm will look like this:
  • Read a word.
  • Check whether the word is one of the stop words.
  • If not, store the word in a HashMap, with the word as the key and its number of occurrences as the value:
      • If the word is not found in the HashMap, add it and set its value to 1.
      • If the word is found, increment its value and put it back in the HashMap.
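
In Java, this serial version might look roughly like the sketch below (the stop-word list here is just a small placeholder; a real one would be much longer):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SerialWordCount {

    // Placeholder stop-word list; a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "is", "of"));

    public static Map<String, Integer> count(String paragraph) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : paragraph.toLowerCase().split("\\s+")) {
            // Skip empty tokens and stop words.
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            // Not seen before: set the value to 1; otherwise increment it.
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints something like: {sat=1, mat=1, cat=1, on=1}
        System.out.println(count("the cat sat on the mat"));
    }
}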

The algorithm is serial. If its input is a sentence, it works perfectly. But if its input is all of Wikipedia, it will run for a century!
So we probably need a different solution...

New solution:

Let's take the same problem and divide it into two steps. In the first step, we take each sentence and map the occurrences of each word in that sentence.


Once the words have been mapped, let's move to the next step. In this step, we combine (reduce) the maps from two sentences into a single map.
Sentences are mapped individually, and once mapped, they are reduced into a single resulting map.

  • The whole process is distributed into small tasks, which helps the job complete faster.
  • Both steps can be broken down into tasks: first run multiple map tasks; once the mapping is done, run multiple reduce tasks to combine the partial results and finally aggregate them.
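
To make the two steps concrete, here is a minimal plain-Java sketch (no Hadoop yet). The names mapSentence and reduce are mine, just for illustration: the first builds the word-count map for a single sentence, and the second merges two such maps into one.

import java.util.HashMap;
import java.util.Map;

public class MapReduceSketch {

    // Step 1 (map): build a word-count map for a single sentence.
    static Map<String, Integer> mapSentence(String sentence) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : sentence.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        return counts;
    }

    // Step 2 (reduce): combine the maps of two sentences into a single map.
    static Map<String, Integer> reduce(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> merged = new HashMap<String, Integer>(a);
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            Integer current = merged.get(e.getKey());
            merged.put(e.getKey(), current == null ? e.getValue() : current + e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> m1 = mapSentence("the cat sat");
        Map<String, Integer> m2 = mapSentence("the dog sat");
        // Prints something like: {the=2, sat=2, dog=1, cat=1}
        System.out.println(reduce(m1, m2));
    }
}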

In other words, it is as if you wanted to run the job on separate threads and needed a way to do it without locks.
This way you have two separate kinds of tasks, map and reduce, and each of them can run totally independently.
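
As a rough sketch of that idea, here is one way to run the map tasks on separate threads with Java's ExecutorService, reusing mapSentence and reduce from the sketch above. Each map task reads only its own sentence and returns its own map, so no locks are needed:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDriver {
    public static void main(String[] args) throws Exception {
        List<String> sentences = Arrays.asList(
                "the cat sat", "the dog sat", "a cat and a dog");

        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Map phase: each task counts one sentence; no shared state, no locks.
        List<Future<Map<String, Integer>>> partials =
                new ArrayList<Future<Map<String, Integer>>>();
        for (final String sentence : sentences) {
            partials.add(pool.submit(new Callable<Map<String, Integer>>() {
                public Map<String, Integer> call() {
                    return MapReduceSketch.mapSentence(sentence);
                }
            }));
        }

        // Reduce phase: merge the partial maps into a single result.
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (Future<Map<String, Integer>> partial : partials) {
            total = MapReduceSketch.reduce(total, partial.get());
        }
        pool.shutdown();
        System.out.println(total);
    }
}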

Adding HDFS

(For HDFS, read the HDFS post on Rami on the web.)
Now, imagine this MapReduce paradigm working on top of HDFS. HDFS has data nodes that split files and store them in blocks. If we run the map tasks on each of the data nodes, we can easily leverage the compute power of those data node machines.
So each data node can run tasks (map or reduce), which are the essence of MapReduce. And since each data node stores blocks of multiple files, multiple tasks might be running at the same time for different data blocks.
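
To give a taste of what this looks like in Hadoop itself (the full walkthrough is coming in the next post), here is roughly the classic WordCount mapper and reducer from the Hadoop tutorial: the mapper emits a (word, 1) pair for every word in its block, and the reducer sums the counts it receives for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Runs on the data nodes, once per block: emits (word, 1) for each word.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Receives all the counts emitted for a given word and sums them up.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}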

Coming Next - Hadoop - First steps

Source: Hadoop
