Tuesday, August 27, 2013

Actor Model description


Actor model


1. Thread problem
The traditional way of offering concurrency in a programming language is by using threads. In this model,
the execution of the program is split up into concurrently running tasks. It is as if the program is being
executed multiple times, the difference being that each of these copies operates on shared memory.
This can lead to a series of hard-to-debug problems, as can be seen below. The first problem, on the left, is
the lost-update problem. Suppose two processes try to increment the value of a shared object acc. They
both retrieve the value of the object, increment it and store it back into the shared object. As these
operations are not atomic, their execution can be interleaved, leaving an incorrectly
updated value in acc, as shown in the example.
The solution to this problem is the use of locks. Locks provide mutual exclusion, meaning that only one
process can hold the lock at a time. By following a locking protocol, which makes sure the right locks
are acquired before an object is used, lost-update problems are avoided. However, locks have their own
share of problems. One of them is the deadlock problem, which is pictured on the right. In this example
two processes try to acquire the same two locks A and B. When both do so, but in a different order, a
deadlock occurs. Each waits for the other to release its lock, which will never happen.
These are just some of the problems that might occur when attempting to use threads and locks.
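
To make the lost-update problem concrete, here is a minimal Python sketch. It is not from the original article: the function names are made up for illustration, and the bad interleaving is replayed by hand in a single thread so the result is deterministic. The second function shows the lock-based fix with real threads.

```python
import threading


def lost_update_interleaving():
    """Replay the bad interleaving by hand: both processes read acc
    before either of them writes it back."""
    acc = 0
    read_by_p1 = acc      # process 1 reads 0
    read_by_p2 = acc      # process 2 reads 0 as well
    acc = read_by_p1 + 1  # process 1 stores 1
    acc = read_by_p2 + 1  # process 2 stores 1 and overwrites process 1's update
    return acc            # one of the two increments is lost


def locked_increments(workers=4, per_worker=1000):
    """Same increment, but the read-modify-write is protected by a lock."""
    acc = 0
    lock = threading.Lock()

    def work():
        nonlocal acc
        for _ in range(per_worker):
            with lock:    # only one thread at a time runs the increment
                acc += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acc
```

With two increments the hand-replayed version ends at 1 instead of 2, while the locked version always ends at workers * per_worker. The deadlock described above appears once a thread needs two such locks: a common way to avoid it is for every thread to acquire A and B in the same fixed order.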

2. Actor model as solution
In the actor model, each object is an actor. This is an entity that has a mailbox and a behaviour. Messages
can be exchanged between actors, which will be buffered in the mailbox. Upon receiving a message, the
behaviour of the actor is executed, upon which the actor can: send a number of messages to other actors,
create a number of actors and assume new behaviour for the next message to be received.
Of importance in this model is that all communications are performed asynchronously. This implies
that the sender does not wait for a message to be received after sending it; it immediately continues its
execution. There are no guarantees about the order in which messages will be received by the recipient, but they
will eventually be delivered.
A second important property is that all communications happen by means of messages: there is no shared
state between actors. If an actor wishes to obtain information about the internal state of another actor, it
will have to use messages to request this information. This allows actors to control access to their state,
avoiding problems like the lost-update problem. Manipulation of the internal state also happens through
messages.
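
As a rough illustration of these two properties, here is a small Python sketch of an actor. It is an assumption-laden toy, not how Erlang or Scala implement actors: `CounterActor`, its message names and `demo` are hypothetical, the mailbox is a `queue.Queue` and the behaviour runs on a single thread. Unlike the general model, this mailbox happens to be FIFO, which is a simplification.

```python
import queue
import threading


class CounterActor:
    """Hypothetical actor: private state, a mailbox, and one thread
    that handles the buffered messages one at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # internal state, touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        # Asynchronous send: put the message in the mailbox and return at once.
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)  # state is shared only via a reply message
            elif message == "stop":
                break

    def join(self):
        self._thread.join()


def demo(senders=4, per_sender=1000):
    actor = CounterActor()

    def send_increments():
        for _ in range(per_sender):
            actor.send("increment")

    threads = [threading.Thread(target=send_increments) for _ in range(senders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    reply = queue.Queue()  # plays the role of the requester's own mailbox
    actor.send("get", reply_to=reply)
    total = reply.get(timeout=5)
    actor.send("stop")
    actor.join()
    return total
```

No lock appears around the counter, yet no update is lost: many threads send messages concurrently, but only the actor's own thread ever modifies the count.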


Erlang and Scala have built-in support for the actor model.
Link:
http://savanne.be/articles/concurrency-in-erlang-scala/


Saturday, August 24, 2013

Scalding - WordCount example in local mode











Scala IDE based on Eclipse
Scalding on Scala 2.9


How to run Scalding in Eclipse

1. Install Eclipse Indigo (preferably the Java EE edition, but add the Maven plugin, m2e, from the update repository: Help -> Install New Software).
2. In Help -> Install New Software, add the site http://download.scala-ide.org/sdk/e37/scala29/stable/site
and install the Scala IDE plugin
http://scala-ide.org/download/current.html
We will work with the Scalding template created by Amit Nithan
http://hokiesuns.blogspot.co.il/2012/07/running-your-scalding-jobs-in-eclipse.html
It already contains the needed Scalding dependencies.
Once you have followed the article and tested Scalding in local mode, your next step is to run it on a real Hadoop cluster.
To run Maven's package or other goals from Eclipse:
  • right-click the project
  • Run As
  • Run Configurations...
  • double-click Maven Build (to create a new configuration)
  • give the configuration a name, e.g. package
  • click Variables
  • select "selected_resource_loc" and click OK
  • enter your goals, e.g. "package" or "clean package"
  • Run
The next time you want to package another project, you can reuse this configuration:
  • right-click the project
  • Run As
  • Run Configurations...
  • select your Maven configuration
  • Run
ENJOY:)

Thursday, August 22, 2013

Java profilers

1. YourKit
http://www.yourkit.com/

2. VisualVM
http://visualvm.java.net/

3. JProfiler
http://www.ej-technologies.com/products/jprofiler/overview.html

4. JProbe
http://www.javaperformancetuning.com/tools/jprobe/

Tools for generating sequence diagrams:
https://code.google.com/p/jtracert/
http://jsonde.sourceforge.net/



NoSQL - types and use cases



To explain what NoSQL is and why it is needed, let's first look at what came before it entered the big game: the RDBMS, which gave us the main concept, ACID.
Atomicity - actions are transactional: if a transaction contains actions A, B and C, all of them must succeed. If one of them fails, we should roll back the previous ones to the initial state.
Consistency - if transaction A has to perform 2 actions, 1. increase the balance of account a1 by $200 and 2. decrease account b1 by $100, then once the transaction is done (successfully!) both numbers will be updated. We don't care what happens during the execution of the transaction; consistency is about what holds at the end.
Isolation - transactions that are executed in parallel don't impact each other.
Durability - we don't care if the electricity goes down in the whole area or the machine crashes completely: once a transaction is done, its result is kept, and when the system comes back up the result will still be there.
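
Atomicity is easy to try out. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the `accounts` table, the `transfer` function and the amounts are made up for illustration and are not tied to any of the databases named in this post.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction.
    If any step fails, every step is rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # triggers the rollback
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a1", 100), ("b1", 0)])
    conn.commit()

    first = transfer(conn, "a1", "b1", 60)   # succeeds: a1 has enough money
    second = transfer(conn, "a1", "b1", 60)  # fails: a1 would go below zero
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    conn.close()
    return first, second, balances
```

After the failed second transfer the debit from a1 is undone, so the balances are the same as after the first transfer: nothing is left half-done.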

Great! The concept is clear, and here the main players come in...
Oracle, SQL Server, MySQL, DB2 etc.
Everything went fine... until the data started growing at huge speed. Then it became clear that a single machine is not enough anymore. In addition, Oracle licensing is based on the number of CPUs in the machine... doing some simple math shows us that something totally different should come and replace the RDBMS... not yet! Let's shoot the last bullet, the last nail in the coffin of the RDBMS...
CAP theorem (Brewer's theorem)
It says that in distributed computing it is impossible to provide Consistency, Availability and Partition tolerance all at the same time.
So let's summarize:
We have ACID (Oracle) and we want to add more machines to reach scalability. Then we go into distributed computing. And then comes the CAP theorem and kicks our a**.
So what now?
We want ACID and we would buy the licenses, but we can't afford them?
So one option is to use an open source RDBMS and save money...
But what if I say that you are still in trouble? Your system based on an RDBMS can't be infinitely scalable...
Once you reach a petabyte, you will be soooooo slow that your business will crash.
Then NoSQL comes into the game...
And what do you have there?
*Open source
*Greatly scalable
*Not relational
*Distributed
systems that have proved themselves in different companies during the last 10 years.
We have 4 families: key-value stores, column stores, document stores and graph databases.
Each family has its advantages and disadvantages:
Data Model          | Performance | Scalability     | Flexibility | Complexity | Functionality
Key-Value Stores    | high        | high            | high        | none       | variable (none)
Column Store        | high        | high            | moderate    | low        | minimal
Document Store      | high        | variable (high) | high        | low        | variable (low)
Graph Database      | variable    | variable        | high        | high       | graph theory
Relational Database | variable    | variable        | low         | moderate   | relational algebra

So you have different databases, and you will definitely ask: which one to pick?
That's what I asked the first time I saw it. Sometimes I needed a graph, sometimes relational and sometimes key-value...
So the answer is simple: you can have all of them... and combine them...

Sources:
http://en.wikipedia.org/wiki/NoSQL
http://db-engines.com/en/ranking
http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/


PS: if you look at the db-engines link you can see the ranking of the top DB engines. Sure! Oracle is number 1. But I'm almost sure that NoSQL and other technologies will do to Oracle what Linux did to Windows. Sure! Oracle will stay forever... but the world around it will change...