Monday, May 6, 2013

What the hell is Bloom filter and why its needed in HBase?

We define hash functions that assign a  bits to the key value.
Put action :  we run hash functions over they key and as a result we get a bit mask that identefies the key
Get action: we run our key over the hash functions , getting the result as a bit mask then we check if the bit mask exist in the  global bit vector.

                                                       
As an example:
x,y,z - exist in the set
W - doesnt exist
Basically Bloom filter helps us to reduce scans for  key existence test.
If the mask doesn't exist - the key defenetly doesn't exist in the set
If it does exist - - then we shall go and make the full scan to check the key exist
And why we need it in the HBase?

"..They are stored in the meta data of each HFile when it is written and then never need to be updated because HFiles are immutable. While I have no empirical data as to how much extra space they require (this also depends on the error rate you choose etc.) they do add some overhead obviously. When a HFile is opened, typically when a region is deployed to a RegionServer, the bloom filter is loaded into memory and used to determine if a given key is in that store file. They can be scoped on a row key or column key level, where the latter needs more space as it has to store many more keys compared to just using the row keys (unless you only have exactly one column per row). 

In terms of computational overhead the bloom filters in HBase are very efficient, they employ folding to keep the size down and combinatorial generation to speed up their creation. They add about 1 byte per entry and are mainly useful when your entry size is on the larger end, say a few kilobytes. Otherwise the size of filter compared to the size of the data is prohibitive. .."


Bloom Filters can be enabled per-ColumnFamily. Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per Column Family

No comments:

Post a Comment