
Thursday, April 25, 2013

HDFS Java API - Tutorial

In this tutorial I will list the basic HDFS commands you need:

Connecting to the file system, creating a directory, copying/deleting/creating files, etc.

1. Connecting to HDFS file system:
Configuration config = new Configuration();
config.set("fs.default.name","hdfs://127.0.0.1:9000/");
FileSystem dfs = FileSystem.get(config);


2. Creating directory

Path src = new Path(dfs.getWorkingDirectory()+"/"+"rami");
dfs.mkdirs(src);


3. Delete directory or file:

Path src = new Path(dfs.getWorkingDirectory()+"/"+"rami");
dfs.delete(src, true); // true = delete recursively


4. Copy files from the local FS to HDFS and back:

Path src = new Path("E://HDFS/file1.txt");
Path dst = new Path(dfs.getWorkingDirectory()+"/directory/");
dfs.copyFromLocalFile(src, dst);

Or back:
dfs.copyToLocalFile(src, dst);

Note: the destination should be a Path object pointing to the directory the source file is copied into.
The source should be a Path object pointing to the file itself.


5. Create file:

Path src = new Path(dfs.getWorkingDirectory()+"/rami.txt");
dfs.createNewFile(src);


6. Reading file:

Path src = new Path(dfs.getWorkingDirectory()+"/rami.txt");
FSDataInputStream in = dfs.open(src);
// FSDataInputStream has no readline(); wrap it in a BufferedReader to read text lines
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String str = null;
while ((str = reader.readLine()) != null)
{
    System.out.println(str);
}
reader.close();

7. Writing file:

Path src = new Path(dfs.getWorkingDirectory()+"/rami.txt");
FSDataOutputStream fs = dfs.create(src);
byte[] btr = new byte[]{1,2,3,4,5,6,7,8,9};
fs.write(btr);
fs.close();
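
Putting the snippets above together, here is a minimal end-to-end sketch (assuming the same NameNode address, hdfs://127.0.0.1:9000, and the Hadoop 1.x style API used throughout this post) showing the imports these snippets rely on:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiDemo {
    public static void main(String[] args) throws IOException {
        // 1. connect to the file system
        Configuration config = new Configuration();
        config.set("fs.default.name", "hdfs://127.0.0.1:9000/");
        FileSystem dfs = FileSystem.get(config);

        // 2. create a directory under the working directory
        Path dir = new Path(dfs.getWorkingDirectory() + "/rami");
        dfs.mkdirs(dir);

        // 5 + 7. create a file and write a few bytes to it
        Path file = new Path(dfs.getWorkingDirectory() + "/rami.txt");
        FSDataOutputStream out = dfs.create(file);
        out.write("hello hdfs\n".getBytes());
        out.close();

        // 6. read the file back line by line
        FSDataInputStream in = dfs.open(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();

        // 3. clean up (true = recursive delete)
        dfs.delete(file, true);
        dfs.delete(dir, true);
    }
}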

Sunday, April 21, 2013

Moving data into Hadoop


If you want to push all of your production servers' system log files into HDFS, use Flume.



Apache Flume is a distributed system for collecting streaming data. It's an Apache project originally developed by Cloudera. It offers various levels of reliability and transport delivery guarantees that can be tuned to your needs. It's highly customizable and supports a plugin architecture where you can add custom data sources and data sinks.




Link:

If you need to automate the process by which files on remote servers are copied into HDFS, use the HDFS File Slurper.
Features:

  • After a successful file copy you can either remove the source file or have it moved into another directory.
  • Destination files can be compressed as part of the write, with any compression codec that extends org.apache.hadoop.io.compress.CompressionCodec.
  • Capability to write a "done" file after the copy completes.
  • Verify the destination file post-copy with a CRC32 checksum comparison against the source.
  • Ignores hidden files (filenames that start with ".").
  • Customizable destination via a script which is called for every source file; alternatively, you can specify a single destination directory into which all files are copied.
  • Customizable pre-processing of the file prior to transfer via a script.
  • A daemon mode which is compatible with inittab respawn.
  • Multi-threaded data transfer.

Link:

If you want to automate periodic tasks for downloading content from an HTTP server into HDFS, use:
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Link:
http://oozie.apache.org/
If you want to import relational data using MapReduce, use the DBInputFormat class.
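
A hedged sketch of wiring up a job's input with DBInputFormat (the JDBC driver, connection URL, credentials, table, columns, and the StockRecord class, which would have to implement Writable and DBWritable, are all illustrative placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

Job job = new Job(new Configuration(), "db-import");
// JDBC driver, connection URL and credentials are placeholders
DBConfiguration.configureDB(job.getConfiguration(),
    "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/stocksdb",
    "user", "password");
job.setInputFormatClass(DBInputFormat.class);
// read the hypothetical "stocks" table ordered by "symbol", fetching two columns
DBInputFormat.setInput(job, StockRecord.class,
    "stocks", null /* WHERE conditions */, "symbol",
    "symbol", "price");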


You can do the same using Sqoop:
http://sqoop.apache.org/
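
For example, a typical Sqoop import (the connection string, credentials, table, and target directory are placeholders) looks roughly like this:

$ sqoop import \
--connect jdbc:mysql://localhost/stocksdb \
--username user --password pass \
--table stocks \
--target-dir /user/rami/stocks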

If you want to move data from HBase to HDFS, you can use HBase's Export class:


$ bin/run.sh org.apache.hadoop.hbase.mapreduce.Export \
stocks_example \
output

where stocks_example is the table name and output is the output directory.

Or export only a specific column family:


$ bin/run.sh org.apache.hadoop.hbase.mapreduce.Export \
-D hbase.mapreduce.scan.column.family=details \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=\
org.apache.hadoop.io.compress.SnappyCodec \
stocks_example output



The Export class writes the HBase output in the SequenceFile format, where the HBase row key is stored in the SequenceFile record key using org.apache.hadoop.hbase.io.ImmutableBytesWritable, and the HBase value is stored in the SequenceFile record value using org.apache.hadoop.hbase.client.Result.

Now it's time to read the exported stock records back from HDFS:


import static com.manning.hip.ch2.HBaseWriteAvroStock.*;

public class HBaseExportedStockReader {
  public static void main(String... args) throws IOException {
    read(new Path(args[0]));
  }

  public static void read(Path inputPath) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputPath, conf);
    HBaseScanAvroStock.AvroStockReader stockReader =
        new HBaseScanAvroStock.AvroStockReader();
    try {
      ImmutableBytesWritable key = new ImmutableBytesWritable();
      Result value = new Result();
      while (reader.next(key, value)) {
        Stock stock = stockReader.decode(value.getValue(
            STOCK_DETAILS_COLUMN_FAMILY_AS_BYTES,
            STOCK_COLUMN_QUALIFIER_AS_BYTES));
        System.out.println(new String(key.get()) + ": "
            + ToStringBuilder.reflectionToString(stock, ToStringStyle.SIMPLE_STYLE));
      }
    } finally {
      reader.close();
    }
  }
}


Monday, April 15, 2013

Hbase and Hadoop on Windows


After two days and a long night of deep investigation, I finally got Hadoop and HBase running on Windows by installing Cygwin.

What has been done:
1. Cygwin installed
2. SSH configured
3. Git plugin for Eclipse installed + learned
4. Maven installed + learned
5. HBase checked out, compiled and run - tested from the console
6. Toad for Cloud installed - connected to HBase
7. Hadoop installed on Cygwin and reconfigured
8. Hadoop Eclipse plugin installed. It took a day and a night to understand that the 0.19.X plugin works only on Eclipse Europa 3.3.X, which in turn works properly only with JDK 6 (I had Juno with JDK 1.7). Finally this was resolved.

Finally my account looks like this:


Full detailed tutorials, with links, are attached:
Google drive link for Tutorial

Friday, December 21, 2012

Easy Big Data - Map Reduce - Inside Hadoop

What the hell is Map Reduce?

MapReduce is the concept that Hadoop is based on, and Hadoop is one of the most popular Big Data solutions, so... we have to know the basics in order to continue.
So let's start with a problem:
Problem: count the number of words in a paragraph.
As follows:


So the algorithm will look like this:
  • Read a word.
  • Check whether the word is one of the stop words; if it is, skip it.
  • Otherwise, look the word up in a HashMap, with the word as the key and its number of occurrences as the value.
  • If the word is not found in the HashMap, add it and set the value to 1.
  • If the word is found, increment the value and put it back in the HashMap.
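
A minimal Java sketch of this serial approach (the stop-word list is just illustrative):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SerialWordCount {
    public static Map<String, Integer> count(String text) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("a", "an", "the", "is"));
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (stopWords.contains(word)) {
                continue; // skip stop words
            }
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        return counts;
    }
}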

The algorithm is serial. If its input is a sentence, it works perfectly. But if its input is all of Wikipedia, it will run for a century!
So we probably need a different solution...

New solution:

Let's take the same problem and divide it into two steps. In the first step, we take each sentence separately and map each word in that sentence to the number of times it occurs there.


Once the words have been mapped, let's move to the next step. In this step, we combine (reduce) the maps from two sentences into a single map.
The sentences were mapped individually and then, once mapped, were reduced to a single resulting map.

  • The whole process gets distributed into small tasks, which helps the job complete faster.
  • Both steps can be broken down into tasks: first run multiple map tasks; once the mapping is done, run multiple reduce tasks to combine the results and finally aggregate them.

In other words, it's as if you wanted to run the job on separate threads and needed a way to do it without locks.
This way you have two separate kinds of tasks, map and reduce, and each of them can run completely independently.
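
As a sketch of what the two kinds of tasks look like in Hadoop's Java API (this is the classic word-count shape, not code from this post; each class would go in its own source file):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: emit (word, 1) for every word in the input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce task: sum the counts emitted for each word
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}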

Adding HDFS

(For HDFS - Read HDFS on Rami on the web )
Now, imagine this MapReduce paradigm working on HDFS. HDFS has DataNodes that split files and store them as blocks. If we run the map tasks on each of the DataNodes, we can easily leverage the compute power of those DataNode machines.
So each of the DataNodes can run tasks (map or reduce), which are the essence of MapReduce. Since each DataNode stores data for multiple files, multiple tasks might be running at the same time for different data blocks.

Coming Next - Hadoop - First Steps

Source Hadoop

Thursday, December 20, 2012

Easy Big Data - 1 word before Hadoop - HDFS

Problem – The amount of stored data keeps growing. We have more and more every day: we buy a 1 TB hard drive, then we need to buy one more, and so on.

Solution – To overcome these problems, a distributed file system was conceived.

  • When dealing with large files, I/O becomes a big bottleneck. So we divide the files into small blocks and store them on multiple machines. [Block storage]
  • When we need to read a file, the client sends a request to multiple machines; each machine sends back a block of the file, and the blocks are then combined to piece together the whole file.
  • With block storage, data access becomes distributed and leads to faster reads and writes.
  • Because the data blocks are stored on multiple machines, keeping the same block on several machines removes the single point of failure: if one machine goes down, the client can request the block from another machine.
  • Any solution that implements file storage as blocks needs the following characteristics:

    • Manage the metadata – since a file gets broken into multiple blocks, somebody needs to keep track of the number of blocks and where those blocks are stored on the different machines. [NameNode]
    • Manage the stored blocks of data and fulfill the read/write requests. [DataNodes]

    So, in the context of Hadoop: the NameNode is the arbitrator and repository for all metadata. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. All these components together form the distributed file system called HDFS (Hadoop Distributed File System).


    So what have we achieved? We can work with a distributed file system spread across tens of computers through a single point of entry.
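
As a small illustration of block storage from the client's point of view, the HDFS Java API lets you ask which DataNodes hold the blocks of a file (a hedged sketch; the path is just an example):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
Path file = new Path("/user/rami/bigfile.txt"); // example path
FileStatus status = fs.getFileStatus(file);
// one BlockLocation per block, listing the DataNodes (hosts) that store it
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation block : blocks) {
    System.out.println("offset " + block.getOffset()
        + ", length " + block.getLength()
        + ", hosts " + Arrays.toString(block.getHosts()));
}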

    Source
    HDFS on Hadoop
    HDFS