Saturday, April 20, 2013

Serialization of data for NoSQL

In order to serialize requests and data that will be stored, we need a serialization system.
After a deep investigation and comparison of stability and serialization/deserialization engine structure,
the following candidates were found:

Protocol Buffers - https://code.google.com/p/protobuf/
Avro - http://avro.apache.org/
Thrift - http://thrift.apache.org/ (see also http://diwakergupta.github.io/thrift-missing-guide/)

Main comparison conclusions
1. Protobuf and Thrift require code generation, while Avro does not: the schema (metadata) is simply defined on both sides. In addition, because of the schema file there is no need to tag data types inside the serialized buffer.
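As an illustration, an Avro schema is just a JSON document shared by the writer and the reader, so no code generation is needed and values are stored without per-field type tags (the record name and fields below are made up for the example):

```json
{
  "type": "record",
  "name": "StoredRequest",
  "namespace": "com.example.nosql",
  "fields": [
    {"name": "id",      "type": "long"},
    {"name": "payload", "type": "string"},
    {"name": "ttl",     "type": ["null", "int"], "default": null}
  ]
}
```

Both sides parse this same file, so the reader knows the field order and types without any markers in the binary data itself.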

2. Thrift vs. Protocol Buffers - benchmark differences are in the nanosecond range for serialization/deserialization times, and the storage-size difference is around 0.05%.

Comparison links :
https://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking


NoSQL and BigData Books

In order to get into Big Data technologies, I would recommend the following books:


For non-computer-science newbies
Head First Java
Database Systems: A Practical Approach

For computer-science newbies
Seven Databases in Seven Weeks
Professional NoSQL

Hadoop
Hadoop: The Definitive Guide
Hadoop in Practice
Hadoop MapReduce Cookbook
Hadoop Operations
Hadoop Real-World Solutions Cookbook
MapReduce Design Patterns

HBase
HBase: The Definitive Guide
HBase in Action

Hive
Programming Hive

Pig
Programming Pig


Links:
Contact me if books are needed

Thursday, April 18, 2013

Zookeeper Summary

ZooKeeper is a distributed, open-source coordination service for distributed applications. It is used for synchronization, configuration maintenance, naming, and group services.


ZooKeeper allows distributed processes to coordinate with each other
through a shared hierarchal namespace which is organized similarly to a standard file system.


Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.


The namespace provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's namespace is identified by its path.

So basically what happens here is that you create a tree which can be updated from any client of the system; the model is shared and all actions are ordered.
For example, if you want to implement a distributed queue, you create a queue znode and add child znodes carrying the data; on the other side you retrieve those values in order and remove them.
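The queue idea above can be sketched as follows. To keep the example runnable without a ZooKeeper server, a sorted map stands in for the znode tree; a real client would instead call org.apache.zookeeper.ZooKeeper's create() with CreateMode.PERSISTENT_SEQUENTIAL on the producer side, and getChildren()/getData()/delete() on the consumer side:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a ZooKeeper-style distributed queue. Producers create sequential
// children under /queue; consumers take the child with the lowest sequence
// number. A TreeMap simulates the znode tree so the sketch runs stand-alone.
public class ZnodeQueueSketch {
    private final TreeMap<String, byte[]> znodes = new TreeMap<>();
    private long nextSeq = 0;

    // Producer side: like create("/queue/item-", data, PERSISTENT_SEQUENTIAL),
    // which appends a monotonically increasing 10-digit suffix to the path.
    public String offer(byte[] data) {
        String path = String.format("/queue/item-%010d", nextSeq++);
        znodes.put(path, data);
        return path;
    }

    // Consumer side: like getChildren("/queue"), picking the lowest-numbered
    // child, then calling getData() and delete() on it.
    public byte[] poll() {
        Map.Entry<String, byte[]> first = znodes.pollFirstEntry();
        return first == null ? null : first.getValue();
    }
}
```

Because the sequence suffix orders the children, concurrent producers never overwrite each other's items; in the real system a consumer's delete() can fail if another consumer won the race, in which case it simply tries the next child.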


API:
Zookeeper Javadoc


General architecture

Every ZooKeeper server services clients. Clients connect to exactly one server to submit
requests. Read requests are serviced from the local replica of each server's database. Requests
that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol all write requests from clients are forwarded to a single
server, called the leader. The rest of the ZooKeeper servers, called followers, receive
message proposals from the leader and agree upon message delivery. The messaging layer
takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic,
ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a
write request, it calculates what the state of the system is when the write is to be applied and
transforms this into a transaction that captures this new state.



Additional features

1. Watches - Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes.

2. Sequential Consistency - Updates from a client will be applied in the order that they were
sent.

3. Atomicity - Updates either succeed or fail. No partial results.

4. Single System Image - A client will see the same view of the service regardless of the
server that it connects to.

5. Reliability - Once an update has been applied, it will persist from that time forward until
a client overwrites the update.

6. Timeliness - The clients view of the system is guaranteed to be up-to-date within a
certain time bound.
7. Observer - in order not to hurt write performance when there are many, many clients, use Observer nodes.

Observers forward these requests to the Leader like Followers do, but they then simply wait to hear the result of the vote. Because of this, we can increase the number of Observers as much as we like without harming the performance of votes.
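Observers are enabled purely through configuration (hostnames below are placeholders): the Observer's own zoo.cfg sets peerType=observer, and every server's config tags the observer entry in the server list:

```
# zoo.cfg on the Observer node only:
peerType=observer

# server list in every node's zoo.cfg:
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888:observer
```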




Architecture
ZooKeeper can run in standalone mode or replicated. A replicated group of servers in the same application is called a quorum, and in replicated mode, all servers in the quorum have copies of the same configuration file.


Performance
In version 3.2, read/write performance improved by roughly 2x compared to the previous 3.1 release.

What can be implemented
Shared Barriers
Shared Queues
Shared Locks
Two-phase commit
etc.
Documentation, tutorials and examples
ZooKeeper documentation


Wednesday, April 17, 2013

Double check locking

Double-checked locking is a software design pattern used to reduce the overhead of acquiring a lock by first testing the locking criterion (the "lock hint") without actually acquiring the lock. Only if the locking criterion check indicates that locking is required does the actual locking logic proceed.

It is typically used to reduce locking overhead when implementing "lazy initialization" in a multi-threaded environment, especially as part of the Singleton pattern. Lazy initialization avoids initializing a value until the first time it is accessed.



// Single-threaded version
class Foo {
    private Helper helper = null;
    public Helper getHelper() {
        if (helper == null) {
            helper = new Helper();
        }
        return helper;
    }
 
    // other functions and members...
}

The problem is that this does not work when using multiple threads. A lock must be obtained in case two threads call getHelper() simultaneously. Otherwise, either they may both try to create the object at the same time, or one may wind up getting a reference to an incompletely initialized object.
The lock is obtained by expensive synchronizing, as is shown in the following example.
// Correct but possibly expensive multithreaded version
class Foo {
    private Helper helper = null;
    public synchronized Helper getHelper() {
        if (helper == null) {
            helper = new Helper();
        }
        return helper;
    }
 
    // other functions and members...
}
However, the first call to getHelper() will create the object and only the few threads trying to access it during that time need to be synchronized; after that all calls just get a reference to the member variable. Since synchronizing a method can decrease performance by a factor of 100 or higher,[3] the overhead of acquiring and releasing a lock every time this method is called seems unnecessary: once the initialization has been completed, acquiring and releasing the locks would appear unnecessary. Many programmers have attempted to optimize this situation in the following manner:
  1. Check that the variable is initialized (without obtaining the lock). If it is initialized, return it immediately.
  2. Obtain the lock.
  3. Double-check whether the variable has already been initialized: if another thread acquired the lock first, it may have already done the initialization. If so, return the initialized variable.
  4. Otherwise, initialize and return the variable.
// Broken multithreaded version
// "Double-Checked Locking" idiom
class Foo {
    private Helper helper = null;
    public Helper getHelper() {
        if (helper == null) {
            synchronized(this) {
                if (helper == null) {
                    helper = new Helper();
                }
            }
        }
        return helper;
    }
 
    // other functions and members...
}
Intuitively, this algorithm seems like an efficient solution to the problem. However, this technique has many subtle problems and should usually be avoided. For example, consider the following sequence of events:
  1. Thread A notices that the value is not initialized, so it obtains the lock and begins to initialize the value.
  2. Due to the semantics of some programming languages, the code generated by the compiler is allowed to update the shared variable to point to a partially constructed object before A has finished performing the initialization. For example, in Java if a call to a constructor has been inlined then the shared variable may immediately be updated once the storage has been allocated but before the inlined constructor initializes the object.[4]
  3. Thread B notices that the shared variable has been initialized (or so it appears), and returns its value. Because thread B believes the value is already initialized, it does not acquire the lock. If B uses the object before all of the initialization done by A is seen by B (either because A has not finished initializing it or because some of the initialized values in the object have not yet percolated to the memory B uses (cache coherence)), the program will likely crash.
One of the dangers of using double-checked locking in J2SE 1.4 (and earlier versions) is that it will often appear to work: it is not easy to distinguish between a correct implementation of the technique and one that has subtle problems. Depending on the compiler, the interleaving of threads by the scheduler and the nature of other concurrent system activity, failures resulting from an incorrect implementation of double-checked locking may only occur intermittently. Reproducing the failures can be difficult.
As of J2SE 5.0, this problem has been fixed. The volatile keyword now ensures that multiple threads handle the singleton instance correctly. This new idiom is described in [4]:
// Works with acquire/release semantics for volatile
// Broken under Java 1.4 and earlier semantics for volatile
class Foo {
    private volatile Helper helper = null;
    public Helper getHelper() {
        Helper result = helper;
        if (result == null) {
            synchronized(this) {
                result = helper;
                if (result == null) {
                    helper = result = new Helper();
                }
            }
        }
        return result;
    }
 
    // other functions and members...
}

Note the usage of the local variable result which seems unnecessary. For some versions of the Java VM, it will make the code 25% faster and for others, it won't hurt.[5]
If the helper object is static (one per class loader), an alternative is the initialization on demand holder idiom [6] See Listing 16.6 on [7]
// Correct lazy initialization in Java 
@ThreadSafe
class Foo {
    private static class HelperHolder {
       public static Helper helper = new Helper();
    }
 
    public static Helper getHelper() {
        return HelperHolder.helper;
    }
}
This relies on the fact that inner classes are not loaded until they are referenced.
Semantics of final field in Java 5 can be employed to safely publish the helper object without using volatile:[8]
public class FinalWrapper<T> {
    public final T value;
    public FinalWrapper(T value) { 
        this.value = value; 
    }
}
 
public class Foo {
   private FinalWrapper<Helper> helperWrapper = null;
 
   public Helper getHelper() {
      FinalWrapper<Helper> wrapper = helperWrapper;
 
      if (wrapper == null) {
          synchronized(this) {
              if (helperWrapper == null) {
                  helperWrapper = new FinalWrapper<Helper>(new Helper());
              }
              wrapper = helperWrapper;
          }
      }
      return wrapper.value;
   }
}
The local variable wrapper is required for correctness. Performance of this implementation is not necessarily better than the volatile implementation.

Tuesday, April 16, 2013

Hazelcast write behind

In order to implement write behind  solution based on Hazelcast grid feel free to use Persistence feature :


Persistence

Hazelcast allows you to load and store the distributed map entries from/to a persistent datastore such as relational database. If a loader implementation is provided, when get(key) is called, if the map entry doesn't exist in-memory then Hazelcast will call your loader implementation to load the entry from a datastore. If a store implementation is provided, when put(key,value) is called, Hazelcast will call your store implementation to store the entry into a datastore. Hazelcast can call your implementation to store the entries synchronously (write-through) with no-delay or asynchronously (write-behind) with delay and it is defined by the write-delay-seconds value in the configuration.
If it is write-through, when the map.put(key,value) call returns, you can be sure that
  • MapStore.store(key,value) is successfully called so the entry is persisted.
  • In-Memory entry is updated
  • In-Memory backup copies are successfully created on other JVMs (if backup-count is greater than 0)
If it is write-behind, when the map.put(key,value) call returns, you can be sure that
  • In-Memory entry is updated
  • In-Memory backup copies are successfully created on other JVMs (if backup-count is greater than 0)
  • The entry is marked as dirty so that after write-delay-seconds, it can be persisted.
The same behavior applies to remove(key) and MapStore.delete(key). If MapStore throws an exception, the exception is propagated back to the original put or remove call in the form of a RuntimeException. When write-through is used, Hazelcast calls MapStore.store(key,value) and MapStore.delete(key) for each entry update. When write-behind is used, Hazelcast calls MapStore.storeAll(map) and MapStore.deleteAll(collection) to do all writes in a single call. Also note that your MapStore or MapLoader implementation should not use Hazelcast Map/Queue/MultiMap/List/Set operations; it should only work with your data store. Otherwise you may get into deadlock situations.
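A minimal store can be sketched like this. To keep it self-contained, a ConcurrentHashMap stands in for the external datastore and the class is shown without the implements clause; a real one would implement com.hazelcast.core.MapStore (which extends MapLoader) and be wired in via the class-name element of the map-store configuration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a MapStore/MapLoader-style class such as the DummyStore referenced
// in the sample configuration. A ConcurrentHashMap plays the role of the
// relational database so the sketch runs without any external system.
public class DummyStoreSketch {
    private final Map<Long, String> datastore = new ConcurrentHashMap<>();

    // Called per entry on put() in write-through mode, or after
    // write-delay-seconds in write-behind mode.
    public void store(Long key, String value) { datastore.put(key, value); }

    // In write-behind mode Hazelcast batches dirty entries into one call.
    public void storeAll(Map<Long, String> map) { datastore.putAll(map); }

    // Called on remove(key).
    public void delete(Long key) { datastore.remove(key); }

    // MapLoader side: called on get(key) when the entry is not in memory.
    public String load(Long key) { return datastore.get(key); }
}
```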
Here is a sample configuration:
<hazelcast>
    ...
    <map name="default">
        ...
        <map-store enabled="true">
            <!--
               Name of the class implementing MapLoader and/or MapStore.
               The class should implement at least one of these interfaces and
               contain a no-argument constructor. Note that inner classes are not supported.
            -->
            <class-name>com.hazelcast.examples.DummyStore</class-name>
            <!--
               Number of seconds to delay to call the MapStore.store(key, value).
               If the value is zero then it is write-through so MapStore.store(key, value)
               will be called as soon as the entry is updated.
               Otherwise it is write-behind, so updates will be stored after write-delay-seconds
               by calling MapStore.storeAll(map). Default value is 0.
            -->
            <write-delay-seconds>0</write-delay-seconds>
        </map-store>
    </map>
</hazelcast>

Monday, April 15, 2013

Hbase and Hadoop on Windows


After two days and a long night of deep investigation, I finally managed to run Hadoop and HBase on Windows by installing Cygwin.

What has been done:
1. Cygwin installed
2. SSH configured
3. Git plugin installed for Eclipse + learned
4. Maven installed + learned
5. HBase checked out, compiled and run - tested via console
6. Toad for Cloud installed - connected to HBase
7. Hadoop installed on Cygwin and reconfigured
8. Hadoop plugin installed. It took a day and a night to understand that the 0.19.x plugin works only on Eclipse Europa 3.3.x, which in turn works properly with JDK 6 only (I had Juno with JDK 1.7). Finally this was resolved.

Finally my account looks like this:


Full detailed tutorials with links are attached:
Google drive link for Tutorial

Friday, January 25, 2013

In-Memory Data Grids - what and why's

The need

As we discussed previously, data storage needs grow exponentially. We need more and more space to hold more and more data, and we want it to be highly available, scalable and very fast. This is where in-memory data grids come into the picture.

The model

The data model is distributed across many servers in a single location or across multiple locations.  This distribution is known as a data fabric.  This distributed model is known as a ‘shared nothing’ architecture.
  • All servers can be active in each site.
  • All data is stored in the RAM of the servers.
  • Servers can be added or removed non-disruptively, to increase the amount of RAM available.
  • The data model is non-relational and is object-based. 
  • Distributed applications written on the .NET and Java application platforms are supported.
  • The data fabric is resilient, allowing non-disruptive automated detection and recovery of a single server or multiple servers.
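The "data distributed across many servers" point can be sketched with simple hash partitioning (an illustration only; real grids such as Hazelcast use a fixed partition table plus backup copies rather than a bare modulo):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a shared-nothing grid: each key lives in the RAM of exactly
// one member, chosen by hashing the key. Adding members changes the mapping,
// which is why real grids use a fixed partition table they can rebalance.
public class GridPartitionSketch {
    private final List<Map<String, String>> members = new ArrayList<>();

    public GridPartitionSketch(int memberCount) {
        for (int i = 0; i < memberCount; i++) members.add(new HashMap<>());
    }

    // Owner selection: hash the key onto one member (floorMod avoids a
    // negative index for negative hash codes).
    private Map<String, String> ownerOf(String key) {
        return members.get(Math.floorMod(key.hashCode(), members.size()));
    }

    public void put(String key, String value) { ownerOf(key).put(key, value); }
    public String get(String key)             { return ownerOf(key).get(key); }
}
```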

Starting point

  • VMware GemFire (Java)
  • Oracle Coherence (Java)
  • Alachisoft NCache (.NET)
  • GigaSpaces XAP Elastic Caching Edition (Java)
  • Hazelcast (Java)
  • ScaleOut StateServer (.NET)
  • IBM eXtreme Scale
  • Terracotta Enterprise Suite
  • JBoss (Red Hat) Infinispan
Links
http://www.infoq.com/articles/in-memory-data-grids