Sunday, April 21, 2013

Moving data into Hadoop


If you want to push all of your production server system log files into HDFS, use Flume.



Apache Flume is a distributed system for collecting streaming data. It’s an Apache
project in incubator status, originally developed by Cloudera. It offers various levels of
reliability and transport delivery guarantees that can be tuned to your needs. It’s
highly customizable and supports a plugin architecture where you can add custom
data sources and data sinks.




Link:
http://flume.apache.org/

If you need to automate the process by which files on remote servers are copied into HDFS, use the HDFS File Slurper.
Features

  • After a successful file copy you can either remove the source file, or have it moved into another directory.
  • Destination files can be compressed as part of the write with any compression codec which extends org.apache.hadoop.io.compress.CompressionCodec.
  • Capability to write a "done" file after completion of the copy.
  • Verify the destination file post-copy with a CRC32 checksum comparison against the source.
  • Ignores hidden files (filenames that start with ".").
  • Customizable destination via a script which can be called for every source file, or alternatively let the utility know a single destination directory and all files are copied into that location.
  • Customizable pre-processing of files prior to transfer via a script.
  • A daemon mode which is compatible with inittab respawn.
  • Multi-threaded data transfer.

Link:
https://github.com/alexholmes/hdfs-file-slurper

If you want to automate periodic tasks for downloading content from an HTTP server into HDFS, use Oozie.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Link:
http://oozie.apache.org/
If you want to import relational data using MapReduce, use the DBInputFormat class.
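
As a rough sketch (my own, not from the book), here is what a DBInputFormat-based import job could look like. The JDBC driver, connection URL, credentials, the "stocks" table with its symbol and close_price columns, and the StockRecord class are all hypothetical placeholders; the DBConfiguration.configureDB and DBInputFormat.setInput calls are the standard Hadoop API from org.apache.hadoop.mapreduce.lib.db.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DBImportMapReduce {

  // One row of the hypothetical "stocks" table (columns: symbol, close_price).
  public static class StockRecord implements Writable, DBWritable {
    private String symbol;
    private double close;

    @Override
    public void readFields(ResultSet rs) throws SQLException {
      symbol = rs.getString("symbol");
      close = rs.getDouble("close_price");
    }

    @Override
    public void write(PreparedStatement ps) throws SQLException {
      // Only needed when writing back to the database; unused for imports.
      ps.setString(1, symbol);
      ps.setDouble(2, close);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      symbol = in.readUTF();
      close = in.readDouble();
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeUTF(symbol);
      out.writeDouble(close);
    }
  }

  // Map-only job: emit symbol -> closing price for every database row.
  public static class ImportMapper
      extends Mapper<LongWritable, StockRecord, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, StockRecord value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(value.symbol), new DoubleWritable(value.close));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // JDBC driver, connection URL, and credentials are placeholders.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost/stockdb", "dbuser", "dbpassword");

    Job job = new Job(conf);
    job.setJarByClass(DBImportMapReduce.class);
    job.setMapperClass(ImportMapper.class);
    job.setInputFormatClass(DBInputFormat.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    // Table name, ordering column, and field names are placeholders too.
    DBInputFormat.setInput(job, StockRecord.class,
        "stocks", null, "symbol", "symbol", "close_price");
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

DBInputFormat counts the rows up front and splits them across map tasks with LIMIT/OFFSET queries, which is why an ORDER BY column is supplied.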


You can do the same using Sqoop:
http://sqoop.apache.org/

If you want to move data from HBase to HDFS, you can use HBase's Export class:


$ bin/run.sh org.apache.hadoop.hbase.mapreduce.Export \
stocks_example \
output

Here stocks_example is the table name and output is the HDFS output directory.

Or export only a specific column family:


$ bin/run.sh org.apache.hadoop.hbase.mapreduce.Export \
-D hbase.mapreduce.scan.column.family=details \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=\
org.apache.hadoop.io.compress.SnappyCodec \
stocks_example output



The Export class writes the HBase output in the SequenceFile format, where the HBase rowkey is stored in the SequenceFile record key using org.apache.hadoop.hbase.io.ImmutableBytesWritable, and the HBase value is stored in the SequenceFile record value using org.apache.hadoop.hbase.client.Result.

Now it's time to read the exported stock records back from HDFS:


import static com.manning.hip.ch2.HBaseWriteAvroStock.*;

import java.io.IOException;

import org.apache.commons.lang.builder.ToStringBuilder;
import org.apache.commons.lang.builder.ToStringStyle;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.SequenceFile;

public class HBaseExportedStockReader {
  public static void main(String... args) throws IOException {
    read(new Path(args[0]));
  }

  public static void read(Path inputPath) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Export wrote a SequenceFile whose key is the HBase rowkey and
    // whose value is the HBase Result for that row.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputPath, conf);
    HBaseScanAvroStock.AvroStockReader stockReader =
        new HBaseScanAvroStock.AvroStockReader();
    try {
      ImmutableBytesWritable key = new ImmutableBytesWritable();
      Result value = new Result();
      while (reader.next(key, value)) {
        // Decode the Avro-encoded stock details column and print the record.
        Stock stock = stockReader.decode(value.getValue(
            STOCK_DETAILS_COLUMN_FAMILY_AS_BYTES,
            STOCK_COLUMN_QUALIFIER_AS_BYTES));
        System.out.println(new String(key.get()) + ": " +
            ToStringBuilder.reflectionToString(stock, ToStringStyle.SIMPLE_STYLE));
      }
    } finally {
      reader.close();
    }
  }
}


What if you want to operate on HBase directly within your MapReduce jobs without an
intermediary step of copying the data into HDFS?


HBase has a TableInputFormat class that can be used in your MapReduce job to pull
data directly from HBase. You can use this InputFormat to read the stock data straight out of HBase and write it to HDFS.




import static com.manning.hip.ch2.HBaseWriteAvroStock.*;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class HBaseSourceMapReduce extends
    TableMapper<Text, DoubleWritable> {

  private HBaseScanAvroStock.AvroStockReader stockReader;
  private Text outputKey = new Text();
  private DoubleWritable outputValue = new DoubleWritable();

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    stockReader = new HBaseScanAvroStock.AvroStockReader();
  }

  @Override
  public void map(ImmutableBytesWritable row, Result columns,
                  Context context)
      throws IOException, InterruptedException {
    // Each KeyValue holds an Avro-encoded stock record; emit symbol -> close.
    for (KeyValue kv : columns.list()) {
      byte[] value = kv.getValue();
      Stock stock = stockReader.decode(value);
      outputKey.set(stock.symbol.toString());
      outputValue.set(stock.close);
      context.write(outputKey, outputValue);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Restrict the scan to the stock details column family and qualifier.
    Scan scan = new Scan();
    scan.addColumn(STOCK_DETAILS_COLUMN_FAMILY_AS_BYTES,
        STOCK_COLUMN_QUALIFIER_AS_BYTES);

    Job job = new Job(conf);
    job.setJarByClass(HBaseSourceMapReduce.class);

    // The output classes match the mapper's declared Text/DoubleWritable types.
    TableMapReduceUtil.initTableMapperJob(STOCKS_TABLE_NAME,
        scan,
        HBaseSourceMapReduce.class,
        Text.class,
        DoubleWritable.class,
        job);
    job.setNumReduceTasks(0);
.....

All examples are taken from the book Hadoop in Practice.







