Saturday, April 27, 2013

HBase - Scan Best practice

What should we know in order to write efficient scan commands?

1. Example code:


Scan scan1 = new Scan();
ResultScanner scanner1 = table.getScanner(scan1);
try {
    for (Result res : scanner1) {
        System.out.println(res);
    }
} finally {
    scanner1.close(); // release the scanner even if iteration fails
}

2. Iterating over a ResultScanner calls its next() method under the hood.

3. By default, each call to next() is a separate RPC for each row, even when you use the
next(int nbRows) method, because that method is nothing more than a client-side loop over
next() calls.

4. It makes sense to fetch more than one row per RPC where possible. This is called scanner caching, and it is disabled by default.
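
To make the effect of scanner caching concrete, here is a toy model in plain Java. It is not HBase code: FakeServer and scanAll are hypothetical names, and the rule "stop after the first fetch that comes back short" is a simplifying assumption about how the client detects the end of a scan.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of scanner caching; counts simulated round trips, no HBase involved. */
final class CachingSketch {

    /** Hypothetical stand-in for a region server holding rowCount rows. */
    static final class FakeServer {
        private final int rowCount;
        private int cursor = 0;
        int roundTrips = 0;

        FakeServer(int rowCount) { this.rowCount = rowCount; }

        /** One simulated RPC: returns up to `caching` rows. */
        List<String> fetch(int caching) {
            roundTrips++;
            List<String> chunk = new ArrayList<>();
            while (chunk.size() < caching && cursor < rowCount) {
                chunk.add("row-" + cursor++);
            }
            return chunk;
        }
    }

    /** Scans every row and returns the number of simulated round trips used. */
    static int scanAll(int rowCount, int caching) {
        if (caching < 1) throw new IllegalArgumentException("caching must be >= 1");
        FakeServer server = new FakeServer(rowCount);
        int seen = 0;
        List<String> chunk;
        do {
            chunk = server.fetch(caching);
            seen += chunk.size();
        } while (chunk.size() == caching); // a short fetch means the scan is done
        if (seen != rowCount) throw new IllegalStateException("rows were lost");
        return server.roundTrips;
    }

    public static void main(String[] args) {
        System.out.println("caching=1   -> " + scanAll(1000, 1) + " round trips");   // 1001
        System.out.println("caching=500 -> " + scanAll(1000, 500) + " round trips"); // 3
    }
}
```

In this model, 1000 rows cost 1001 round trips with caching of 1 and only 3 with caching of 500.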

5. Caching improves performance but costs memory: a single row can consist of hundreds of columns, and all of them are fetched. Everything that is cached has to fit into the memory of the client process. The batching feature limits the number of columns fetched per Result.
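
A back-of-envelope sketch of that memory trade-off, in plain Java. The class and method names are hypothetical, the column count and average cell size are assumed figures, and the bound ignores per-object overhead on the client.

```java
/** Rough upper bound on what one fetch can buffer on the client; not an HBase API. */
final class ScanMemoryEstimate {

    /**
     * @param caching      Results requested per RPC (the value given to setCaching)
     * @param batch        max columns per Result (the value given to setBatch); 0 or less means no batching
     * @param colsPerRow   assumed number of columns in each row, must be positive
     * @param avgCellBytes assumed average size of one cell in bytes
     */
    static long maxBytesPerFetch(int caching, int batch, int colsPerRow, long avgCellBytes) {
        int cellsPerResult = batch > 0 ? Math.min(batch, colsPerRow) : colsPerRow;
        return (long) caching * cellsPerResult * avgCellBytes;
    }

    public static void main(String[] args) {
        // 500 Results per RPC, rows of 300 columns, about 100 bytes per cell (all assumed).
        System.out.println(maxBytesPerFetch(500, 0, 300, 100));  // whole rows: 15000000
        System.out.println(maxBytesPerFetch(500, 50, 300, 100)); // 50 columns per Result: 2500000
    }
}
```

With these assumed numbers, batching to 50 columns cuts the worst-case buffer per fetch from about 15 MB to about 2.5 MB.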


6. Improved example

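Scan scan = new Scan();
scan.setCaching(caching); // number of rows (Results) fetched per RPC
scan.setBatch(batch);     // maximum number of columns per Result
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    System.out.println(result);
}
scanner.close();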

Example of fetches and RPCs (from the book "HBase: The Definitive Guide")
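
How caching and batch interact can be approximated with a little arithmetic. The sketch below is a simplified estimate, not the client's actual logic: it assumes every row has the same positive number of columns, that batching splits a row into ceil(columns / batch) Results, that caching counts Results, and that the scan ends with the first fetch that comes back short. The table of 10 rows with 20 columns each is a made-up example, and the RPCs for opening and closing the scanner are ignored.

```java
/** Simplified estimate of Results and fetch RPCs for a scan; names are hypothetical. */
final class ScanRpcEstimate {

    /** Number of Results returned when each row is split into chunks of at most `batch` columns. */
    static int results(int rows, int colsPerRow, int batch) {
        int cellsPerResult = batch > 0 ? Math.min(batch, colsPerRow) : colsPerRow;
        int resultsPerRow = (colsPerRow + cellsPerResult - 1) / cellsPerResult; // ceiling division
        return rows * resultsPerRow;
    }

    /** Full fetches of `caching` Results plus one final fetch that comes back short or empty. */
    static int fetches(int results, int caching) {
        return results / caching + 1;
    }

    public static void main(String[] args) {
        int rows = 10, colsPerRow = 20;                  // assumed table shape
        int[][] settings = {{1, 1}, {200, 1}, {2, 10}, {5, 100}}; // {caching, batch}
        for (int[] s : settings) {
            int r = results(rows, colsPerRow, s[1]);
            System.out.println("caching=" + s[0] + " batch=" + s[1]
                    + " -> results=" + r + " fetches=" + fetches(r, s[0]));
        }
    }
}
```

Under these assumptions, caching=1 with batch=1 gives 200 Results and 201 fetches, while caching=5 with batch=100 gives 10 Results and 3 fetches.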


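7. Fetch only the relevant data instead of the full table: restrict the scan to the column families and columns you actually need (for example with Scan.addFamily() and Scan.addColumn()).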

8. When performing a table scan where only the row keys are needed (no families,
qualifiers, values, or timestamps), add a FilterList with a MUST_PASS_ALL operator
to the scanner using setFilter().
The filter list should include both a FirstKeyOnlyFilter and a KeyOnlyFilter instance. Using this filter combination causes the region server to load only the row key of the first KeyValue (i.e., from the first column) found and return it to the client, which minimizes network traffic.


9. For MapReduce jobs, make sure the input Scan instance has setCaching() set to something
greater than the default of 1. Using the default value means that the map task will
call back to the region server for every record processed. Setting this value
to 500, for example, will transfer 500 rows at a time to the client to be processed.


10. You can disable the WAL on Put commands by calling setWriteToWAL(false) on the Put. Just know what you are doing: if a region server fails, the writes that were not yet flushed to disk will be lost.
