To serialize requests and data before storage, we need a serialization system. After investigating and comparing the stability and the serialization/deserialization engine structure of several frameworks, the following candidates stood out:
Protocol Buffers - https://code.google.com/p/protobuf/
Avro - http://avro.apache.org/
Thrift -
http://thrift.apache.org/
http://diwakergupta.github.io/thrift-missing-guide/
Main comparison conclusions:
1. Protobuf and Thrift require code generation, whereas Avro does not: the schema (metadata) is simply defined on both sides. In addition, because the schema is available separately, there is no need to tag datatypes inside the buffer.
2. Thrift vs. Protocol Buffers - benchmark differences in serialization/deserialization times are in the nanosecond range, and the storage-size difference is around 0.05%.
Comparison links :
https://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
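Point 1 above can be made concrete with a hypothetical Avro schema (the record and field names are illustrative): both sides agree on this JSON definition, so no code generation is needed and the binary payload carries no per-field type tags.

```json
{
  "type": "record",
  "name": "StoredRequest",
  "fields": [
    {"name": "id",      "type": "long"},
    {"name": "payload", "type": "bytes"},
    {"name": "tags",    "type": {"type": "array", "items": "string"}}
  ]
}
```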
Saturday, April 20, 2013
NoSQL and BigData Books
To get into Big Data technologies, I would recommend the following books:
For Non Computer science newbies
Head First Java
Database Systems: A Practical Approach
For Computer Science newbies
Seven Databases in Seven Weeks
Professional NoSQL
Hadoop
Hadoop - The Definitive Guide
Hadoop in Practice
Hadoop MapReduce Cookbook
Hadoop Operations
Hadoop Real-World Solutions Cookbook
MapReduce Design Patterns
HBase
HBase - The Definitive Guide
HBase in Action
Hive
Programming Hive
Pig
Programming Pig
Links:
Contact me if books are needed
Thursday, April 18, 2013
Zookeeper Summary
ZooKeeper is a distributed, open-source coordination service for distributed applications. It is used for synchronization, configuration maintenance, grouping, and naming. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace, organized much like a standard file system. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
The namespace provided by ZooKeeper resembles a standard file system: a name is a sequence of path elements separated by a slash (/), and every node in ZooKeeper's namespace is identified by its path.
In essence, you create a tree that can be updated from any client of the system; the model is shared and all actions are ordered.
For example, to implement a distributed queue, producers create child znodes under a queue znode and write data to them, while on the other side consumers retrieve those values and remove the znodes.
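The queue recipe above can be modeled in plain Java. This is only a sketch of the ordering idea, not the ZooKeeper API: ZnodeQueue is a hypothetical stand-in that mimics PERSISTENT_SEQUENTIAL znodes, where each offered item gets a monotonically increasing, zero-padded suffix and consumers always remove the lowest-numbered child.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the ZooKeeper queue recipe: items are stored under
// "queue-0000000000", "queue-0000000001", ... and consumed in
// sequence order, mimicking PERSISTENT_SEQUENTIAL znodes.
class ZnodeQueue {
    private final AtomicLong seq = new AtomicLong();
    // Sorted by path, like getChildren() followed by a sort
    private final TreeMap<String, byte[]> children = new TreeMap<>();

    public synchronized String offer(byte[] data) {
        String path = String.format("queue-%010d", seq.getAndIncrement());
        children.put(path, data);
        return path;
    }

    public synchronized byte[] take() {
        // Remove and return the lowest-numbered child, or null if empty
        Map.Entry<String, byte[]> first = children.pollFirstEntry();
        return first == null ? null : first.getValue();
    }
}
```

In real ZooKeeper the sequence suffix is assigned by the server when a sequential znode is created, which is what makes the ordering global across clients.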
API:
Zookeeper Javadoc
General architecture
Every ZooKeeper server services clients. Clients connect to exactly one server to submit
requests. Read requests are serviced from the local replica of each server's database. Requests
that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol all write requests from clients are forwarded to a single
server, called the leader. The rest of the ZooKeeper servers, called followers, receive
message proposals from the leader and agree upon message delivery. The messaging layer
takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic,
ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a
write request, it calculates what the state of the system is when the write is to be applied and
transforms this into a transaction that captures this new state.
Additional features
1. Watches - Clients can set a watch on a znode. A watch is triggered and removed when the znode changes.
2. Sequential Consistency - Updates from a client will be applied in the order that they were
sent.
3. Atomicity - Updates either succeed or fail. No partial results.
4. Single System Image - A client will see the same view of the service regardless of the
server that it connects to.
5. Reliability - Once an update has been applied, it will persist from that time forward until
a client overwrites the update.
6. Timeliness - The clients view of the system is guaranteed to be up-to-date within a
certain time bound.
7. Observer - to avoid hurting write performance when there are very many clients, use Observer nodes. Observers forward requests to the leader just as followers do, but they then simply wait to hear the result of the vote. Because of this, the number of Observers can be increased as needed without harming the performance of votes.
Architecture
ZooKeeper can be run in standalone mode or replicated. A replicated group of servers in the same application is called a quorum, and in replicated mode all servers in the quorum have copies of the same configuration file.
Performance
In version 3.2, read/write performance improved by roughly 2x compared to the previous 3.1 release.
What can be implemented
Shared Barriers
Shared Queues
Shared Locks
Two-phase commit
etc
Documentation, tutorials and examples : Zookeeper documentation
Wednesday, April 17, 2013
Double check locking
Double-checked locking is a software design pattern used to reduce the overhead of acquiring a lock by first testing the locking criterion (the "lock hint") without actually acquiring the lock. Only if that check indicates that locking is required does the actual locking logic proceed.
It is typically used to reduce locking overhead when implementing "lazy initialization" in a multi-threaded environment, especially as part of the Singleton pattern. Lazy initialization avoids initializing a value until the first time it is accessed.
class Foo {
    private Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {
            helper = new Helper();
        }
        return helper;
    }
    // other functions and members...
}
The problem is that this does not work when using multiple threads. A lock must be obtained in case two threads call getHelper()
simultaneously. Otherwise, either they may both try to create the object at the same time, or one may wind up getting a reference to an incompletely initialized object.
The lock is obtained by expensive synchronizing, as is shown in the following example.
// Correct but possibly expensive multithreaded version
class Foo {
    private Helper helper = null;

    public synchronized Helper getHelper() {
        if (helper == null) {
            helper = new Helper();
        }
        return helper;
    }
    // other functions and members...
}
However, the first call to getHelper()
will create the object and only the few threads trying to access it during that time need to be synchronized; after that all calls just get a reference to the member variable. Since synchronizing a method can decrease performance by a factor of 100 or higher,[3] the overhead of acquiring and releasing a lock every time this method is called seems unnecessary: once the initialization has been completed, acquiring and releasing the locks would appear unnecessary. Many programmers have attempted to optimize this situation in the following manner:
- Check that the variable is initialized (without obtaining the lock). If it is initialized, return it immediately.
- Obtain the lock.
- Double-check whether the variable has already been initialized: if another thread acquired the lock first, it may have already done the initialization. If so, return the initialized variable.
- Otherwise, initialize and return the variable.
// Broken multithreaded version
// "Double-Checked Locking" idiom
class Foo {
    private Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {
            synchronized (this) {
                if (helper == null) {
                    helper = new Helper();
                }
            }
        }
        return helper;
    }
    // other functions and members...
}
Intuitively, this algorithm seems like an efficient solution to the problem. However, this technique has many subtle problems and should usually be avoided. For example, consider the following sequence of events:
- Thread A notices that the value is not initialized, so it obtains the lock and begins to initialize the value.
- Due to the semantics of some programming languages, the code generated by the compiler is allowed to update the shared variable to point to a partially constructed object before A has finished performing the initialization. For example, in Java if a call to a constructor has been inlined then the shared variable may immediately be updated once the storage has been allocated but before the inlined constructor initializes the object.[4]
- Thread B notices that the shared variable has been initialized (or so it appears), and returns its value. Because thread B believes the value is already initialized, it does not acquire the lock. If B uses the object before all of the initialization done by A is seen by B (either because A has not finished initializing it or because some of the initialized values in the object have not yet percolated to the memory B uses (cache coherence)), the program will likely crash.
One of the dangers of using double-checked locking in J2SE 1.4 (and earlier versions) is that it will often appear to work: it is not easy to distinguish between a correct implementation of the technique and one that has subtle problems. Depending on the compiler, the interleaving of threads by the scheduler and the nature of other concurrent system activity, failures resulting from an incorrect implementation of double-checked locking may only occur intermittently. Reproducing the failures can be difficult.
As of J2SE 5.0, this problem has been fixed. The volatile keyword now ensures that multiple threads handle the singleton instance correctly. This new idiom is described in [4]:
// Works with acquire/release semantics for volatile
// Broken under Java 1.4 and earlier semantics for volatile
class Foo {
    private volatile Helper helper = null;

    public Helper getHelper() {
        Helper result = helper;
        if (result == null) {
            synchronized (this) {
                result = helper;
                if (result == null) {
                    helper = result = new Helper();
                }
            }
        }
        return result;
    }
    // other functions and members...
}
Note the usage of the local variable result which seems unnecessary. For some versions of the Java VM, it will make the code 25% faster and for others, it won't hurt.[5]
If the helper object is static (one per class loader), an alternative is the initialization on demand holder idiom [6] See Listing 16.6 on [7]
// Correct lazy initialization in Java
@ThreadSafe
class Foo {
    private static class HelperHolder {
        public static Helper helper = new Helper();
    }

    public static Helper getHelper() {
        return HelperHolder.helper;
    }
}
This relies on the fact that inner classes are not loaded until they are referenced.
Semantics of final field in Java 5 can be employed to safely publish the helper object without using volatile:[8]
public class FinalWrapper<T> {
    public final T value;

    public FinalWrapper(T value) {
        this.value = value;
    }
}

public class Foo {
    private FinalWrapper<Helper> helperWrapper = null;

    public Helper getHelper() {
        FinalWrapper<Helper> wrapper = helperWrapper;
        if (wrapper == null) {
            synchronized (this) {
                if (helperWrapper == null) {
                    helperWrapper = new FinalWrapper<Helper>(new Helper());
                }
                wrapper = helperWrapper;
            }
        }
        return wrapper.value;
    }
}
The local variable wrapper is required for correctness. Performance of this implementation is not necessarily better than the volatile implementation.
Tuesday, April 16, 2013
Hazelcast write behind
In order to implement write behind solution based on Hazelcast grid feel free to use Persistence feature :
Persistence
Hazelcast allows you to load and store distributed map entries from/to a persistent datastore such as a relational database. If a loader implementation is provided, then when get(key) is called and the map entry doesn't exist in memory, Hazelcast will call your loader implementation to load the entry from the datastore. If a store implementation is provided, then when put(key,value) is called, Hazelcast will call your store implementation to store the entry into the datastore. Hazelcast can call your implementation to store entries synchronously (write-through) with no delay, or asynchronously (write-behind) with a delay, as defined by the write-delay-seconds value in the configuration.
If it is write-through, when the map.put(key,value) call returns, you can be sure that:
- MapStore.store(key,value) has been successfully called, so the entry is persisted
- the in-memory entry is updated
- in-memory backup copies are successfully created on other JVMs (if backup-count is greater than 0)
If it is write-behind, when the map.put(key,value) call returns, you can be sure that:
- the in-memory entry is updated
- in-memory backup copies are successfully created on other JVMs (if backup-count is greater than 0)
- the entry is marked as dirty, so that after write-delay-seconds it can be persisted
The same behavior applies to remove(key) and MapStore.delete(key). If MapStore throws an exception, it is propagated back to the original put or remove call in the form of a RuntimeException. When write-through is used, Hazelcast calls MapStore.store(key,value) and MapStore.delete(key) for each entry update. When write-behind is used, Hazelcast calls MapStore.storeAll(map) and MapStore.deleteAll(collection) to do all writes in a single call. Also note that your MapStore or MapLoader implementation should not use Hazelcast Map/Queue/MultiMap/List/Set operations; it should only work with your datastore, otherwise you may run into deadlock situations.
Here is a sample configuration:
<hazelcast>
    ...
    <map name="default">
        ...
        <map-store enabled="true">
            <!--
                Name of the class implementing MapLoader and/or MapStore.
                The class should implement at least one of these interfaces and
                contain a no-argument constructor. Note that inner classes are not supported.
            -->
            <class-name>com.hazelcast.examples.DummyStore</class-name>
            <!--
                Number of seconds to delay the call to MapStore.store(key, value).
                If the value is zero then it is write-through, so MapStore.store(key, value)
                will be called as soon as the entry is updated.
                Otherwise it is write-behind, so updates will be stored after write-delay-seconds
                by calling MapStore.storeAll(map). Default value is 0.
            -->
            <write-delay-seconds>0</write-delay-seconds>
        </map-store>
    </map>
</hazelcast>
Monday, April 15, 2013
Hbase and Hadoop on Windows
After two days and a long night of deep investigation, I finally managed to run Hadoop and HBase on Windows by installing Cygwin.
What has been done :
1. Cygwin installed
2. SSH configured
3. Git plugin installed for Eclipse + learned
4. Maven installed + learned
5. HBase checked out, compiled and run - tested via console
6. Toad for Cloud installed - connected to HBase
7. Hadoop installed on Cygwin and reconfigured
8. Hadoop Eclipse plugin installed. It took a day and a night to understand that the 0.19.x version of the plugin works only on Eclipse Europa 3.3.x, which works properly with JDK 6 only (I had Juno with JDK 1.7). Finally this was resolved.
Finally my account looks like this:
Google drive link for Tutorial
Location:
Ashdod, Israel
Friday, January 25, 2013
In-Memory Data Grids - what and why's
The need
As we discussed previously, data storage needs increase exponentially: we need more and more space to hold more and more data, and we want it to be highly available, scalable, and extremely fast. This is where In-Memory Data Grids come into the picture.
The model
The data model is distributed across many servers in a single location or across multiple locations. This distribution is known as a data fabric. This distributed model is known as a ‘shared nothing’ architecture.
- All servers can be active in each site.
- All data is stored in the RAM of the servers.
- Servers can be added or removed non-disruptively, to increase the amount of RAM available.
- The data model is non-relational and is object-based.
- Distributed applications written on the .NET and Java application platforms are supported.
- The data fabric is resilient, allowing non-disruptive automated detection and recovery of a single server or multiple servers.
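A toy illustration of the "shared nothing" model above: each key hashes to exactly one member, so every server owns a disjoint slice of the data (the Grid class and member names are illustrative; real grids add smarter partitioning, replication, and rebalancing).

```java
import java.util.List;

// Toy shared-nothing partitioning: a key's hash picks exactly one
// owning member, so primary copies never overlap between servers.
class Grid {
    private final List<String> members;

    Grid(List<String> members) {
        this.members = members;
    }

    // Deterministic owner selection by key hash
    String ownerOf(String key) {
        int bucket = Math.floorMod(key.hashCode(), members.size());
        return members.get(bucket);
    }
}
```

Adding or removing a member changes the bucket count, which is why production grids use consistent hashing or fixed partition tables to limit data movement.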
Starting point
- VMware Gemfire (Java)
- Oracle Coherence (Java)
- Alachisoft NCache (.Net)
- Gigaspaces XAP Elastic Caching Edition (Java)
- Hazelcast (Java)
- Scaleout StateServer (.Net)
- IBM eXtreme Scale
- Terracotta Enterprise Suite
- JBoss (Red Hat) Infinispan
http://www.infoq.com/articles/in-memory-data-grids