
Saturday, April 27, 2013

HBase - Scan Best Practices

What should we know to create efficient scan commands?

1. Example code:


Scan scan1 = new Scan();
ResultScanner scanner1 = table.getScanner(scan1);
for (Result res : scanner1) {
    System.out.println(res);
}
scanner1.close();

2. When iterating over a ResultScanner, each iteration step executes the next() method.

3. Each call to next() is a separate RPC for each row, even when you use the
next(int nbRows) method, because it is nothing more than a client-side loop over
next() calls.

4. It would therefore make sense to fetch more than one row per RPC if possible. This is called scanner caching, and it is disabled by default.

5. Caching improves performance but impacts memory, since a single row can consist of hundreds of columns, all of which will be fetched together and must fit into the client process memory. The batching feature limits the number of columns fetched per bulk.


6. Improved example

Scan scan = new Scan();
scan.setCaching(caching); // number of rows fetched per RPC
scan.setBatch(batch);     // number of columns fetched per Result
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    System.out.println(result);
}
scanner.close();

Example of fetches and RPCs (from the book HBase: The Definitive Guide).


7. Fetch only the relevant information instead of the full table (restrict the scan to specific column families, columns, etc.).
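
For example, a minimal sketch of a restricted scan (the family and qualifier names here are illustrative, not from a real schema):

Scan scan = new Scan();
// Either bring back a whole column family...
scan.addFamily(Bytes.toBytes("colfam1"));
// ...or, narrower still, a single column (overrides addFamily for that family):
scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
ResultScanner scanner = table.getScanner(scan);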

8. When performing a table scan where only the row keys are needed (no families,
qualifiers, values, or timestamps), add a FilterList with a MUST_PASS_ALL operator
to the scanner using setFilter().
The filter list should include both a FirstKeyOnlyFilter and a KeyOnlyFilter instance. Using this filter combination causes the region server to load only the row key of the first KeyValue (i.e., from the first column) found and return it to the client, minimizing network traffic.
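
A minimal sketch of such a row-key-only scan:

FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter()); // only the first KeyValue of each row
filters.addFilter(new KeyOnlyFilter());      // strip the values, keep the keys
Scan scan = new Scan();
scan.setFilter(filters);
ResultScanner scanner = table.getScanner(scan);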


9. Make sure the input Scan instance has setCaching() set to something
greater than the default of 1. Using the default value means that the map task
makes a callback to the region server for every record processed. Setting this value
to 500, for example, transfers 500 rows at a time to the client to be processed.
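
A minimal sketch of wiring such a Scan into a MapReduce job; MyMapper (a TableMapper subclass), the table name, and the output classes are illustrative assumptions, not from the original post:

Job job = new Job(conf, "scan-example");
Scan scan = new Scan();
scan.setCaching(500);       // fetch 500 rows per RPC instead of the default 1
scan.setCacheBlocks(false); // a full scan would only churn the block cache
TableMapReduceUtil.initTableMapperJob("testtable", scan, MyMapper.class,
    Text.class, Text.class, job);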


10. You can disable the WAL on Put commands by calling setWriteToWAL(false). Just know what you are doing: the consequence is that if there is a region server failure, there will be data loss.
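
A minimal sketch, using the same client API as the examples below:

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
put.setWriteToWAL(false); // faster, but this data is lost if the region server fails
table.put(put);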

HBase client API summary

Single Put
HTable table = new HTable(conf, "testtable");
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual2"), Bytes.toBytes("val2"));
table.put(put);


MultiPut(cache)
table.setAutoFlush(false)
table.put...
table.put...
table.put....
table.flushCommits();
A put can fail with an exception. That is why it is better to also call flushCommits() in the catch block.
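
A minimal sketch, assuming two Put instances put1 and put2 built as in the other examples, inside a method that declares throws IOException:

table.setAutoFlush(false); // buffer puts client-side instead of sending each one immediately
try {
    table.put(put1);
    table.put(put2);
    table.flushCommits(); // one round-trip for all buffered puts
} catch (Exception e) {
    table.flushCommits(); // flush whatever is still buffered
}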


MultiPut(batch)
List<Put> puts = new ArrayList<Put>();
Put put1 = new Put(Bytes.toBytes("row1"));
put1.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
puts.add(put1);
Put put2 = new Put(Bytes.toBytes("row2"));
put2.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val2"));
puts.add(put2);
Put put3 = new Put(Bytes.toBytes("row2"));
put3.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual2"), Bytes.toBytes("val3"));
puts.add(put3);
table.put(puts);


SingleGet
Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
Result result = table.get(get);


MultiGet
byte[] row1 = Bytes.toBytes("row1");
byte[] row2 = Bytes.toBytes("row2");
byte[] cf1 = Bytes.toBytes("colfam1");
byte[] qf1 = Bytes.toBytes("qual1");
byte[] qf2 = Bytes.toBytes("qual2");
List<Get> gets = new ArrayList<Get>();
Get get1 = new Get(row1);
get1.addColumn(cf1, qf1);
gets.add(get1);
Get get2 = new Get(row2);
get2.addColumn(cf1, qf1);
gets.add(get2);
Get get3 = new Get(row2);
get3.addColumn(cf1, qf2);
gets.add(get3);
Result[] results = table.get(gets);


Delete(Single)
Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteFamily(Bytes.toBytes("colfam3"));
table.delete(delete);


Delete(Multi)
List<Delete> deletes = new ArrayList<Delete>();
Delete delete1 = new Delete(Bytes.toBytes("row1"));
delete1.setTimestamp(4);
deletes.add(delete1);
Delete delete2 = new Delete(Bytes.toBytes("row2"));
delete2.deleteColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
delete2.deleteColumns(Bytes.toBytes("colfam2"), Bytes.toBytes("qual3"), 5);
deletes.add(delete2);
Delete delete3 = new Delete(Bytes.toBytes("row3"));
delete3.deleteFamily(Bytes.toBytes("colfam1"));
delete3.deleteFamily(Bytes.toBytes("colfam2"), 3);
deletes.add(delete3);
table.delete(deletes);


Batch
byte[] ROW1 = Bytes.toBytes("row1");
byte[] ROW2 = Bytes.toBytes("row2");
byte[] COLFAM1 = Bytes.toBytes("colfam1");
byte[] COLFAM2 = Bytes.toBytes("colfam2");
byte[] QUAL1 = Bytes.toBytes("qual1");
byte[] QUAL2 = Bytes.toBytes("qual2");
List<Row> batch = new ArrayList<Row>();

Put put = new Put(ROW2);
put.add(COLFAM2, QUAL1, Bytes.toBytes("val5"));
batch.add(put);

Get get1 = new Get(ROW1);
get1.addColumn(COLFAM1, QUAL1);
batch.add(get1);

Delete delete = new Delete(ROW1);
delete.deleteColumns(COLFAM1, QUAL2);
batch.add(delete);

Object[] results = new Object[batch.size()];
try {
    table.batch(batch, results);
} catch (Exception e) {
    System.err.println("Error: " + e);
}


Scan
Scan scan1 = new Scan();
ResultScanner scanner1 = table.getScanner(scan1);
for (Result res : scanner1) {
    System.out.println(res);
}
scanner1.close();

As noted above, each call to next() is a separate RPC for each row, even when you use the
next(int nbRows) method, because it is nothing more than a client-side loop over
next() calls. Thus it would make sense to fetch more than one row per RPC if possible.

Caching - number of rows per fetch:
scan.setCaching(caching);
Batching - number of columns per fetch:
scan.setBatch(batch);

Thursday, April 25, 2013

HDFS Java API - Tutorial

In this tutorial I will list the basic HDFS commands, such as:

Connecting to the file system, creating a directory, copying/deleting/creating files, etc.

1. Connecting to HDFS file system:
Configuration config = new Configuration();
config.set("fs.default.name","hdfs://127.0.0.1:9000/");
FileSystem dfs = FileSystem.get(config);


2. Creating directory

Path src = new Path(dfs.getWorkingDirectory()+"/"+"rami");
dfs.mkdirs(src);


3. Delete directory or file:

Path src = new Path(dfs.getWorkingDirectory()+"/"+"rami");
dfs.delete(src, true); // true = delete recursively


4. Copy files from local FS to HDFS and back:

Path src = new Path("E://HDFS/file1.txt");
Path dst = new Path(dfs.getWorkingDirectory()+"/directory/");
dfs.copyFromLocalFile(src, dst);

Or Back :
dfs.copyToLocalFile(src, dst);

Note: the destination should be a Path object that contains the directory to copy the source file to.
The source should be a Path object that contains the path to the file, including the file name itself.


5. Create file:

Path src = new Path(dfs.getWorkingDirectory()+"/rami.txt");
dfs.createNewFile(src);


6. Reading file:

Path src = new Path(dfs.getWorkingDirectory()+"/rami.txt");
FSDataInputStream in = dfs.open(src);
// FSDataInputStream has no readline(); wrap it in a BufferedReader instead
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String str = null;
while ((str = reader.readLine()) != null)
{
System.out.println(str);
}
reader.close();

7.Writing file:

Path src = new Path(dfs.getWorkingDirectory()+"/rami.txt");
FSDataOutputStream fs = dfs.create(src);
byte[] btr = new byte[]{1,2,3,4,5,6,7,8,9};
fs.write(btr);
fs.close();