Friday, January 10, 2014

Order the mess: ACID, CAP theorem and NoSQL



Database transaction
Let's define a transaction as a unit of work that is reliable and recoverable. By definition, a transaction must support the ACID properties:
Atomic - if a transaction contains several actions, they happen in an all-or-nothing manner: either all of them take effect, or none of them does.
Consistent - a transaction takes the database from one valid state to another, so all constraints and invariants still hold after the commit.
Isolated - two transactions that run on the same data see separate copies until one of them commits its changes.
Example
Transaction A reads some value while transaction B updates it. Until transaction B commits, transaction A will see the old value.
Durable - once a transaction is committed, its results will not get lost even if the system crashes. This is achieved by keeping transaction logs that can be replayed.

Great! We have four properties of a transaction.

All those properties are achieved in different ways by different database engines.
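
To make this concrete, here is a minimal JDBC sketch of the classic money transfer (the connection string, table and column names are made up for illustration): the explicit commit/rollback pair gives atomicity, and the engine's transaction log is what makes the committed result durable.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferExample {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/bank", "user", "pass")) {
            con.setAutoCommit(false); // group both updates into one transaction
            try (PreparedStatement debit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = con.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setInt(1, 100);
                debit.setInt(2, 1);
                debit.executeUpdate();
                credit.setInt(1, 100);
                credit.setInt(2, 2);
                credit.executeUpdate();
                con.commit(); // durable once this returns
            } catch (SQLException e) {
                con.rollback(); // atomic: the debit is undone too
                throw e;
            }
        }
    }
}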


CAP theorem
It is a set of properties of a distributed system:
Consistency - all servers in the distributed system see the same state.
Availability - you will always be able to get data out of the system (even if it isn't consistent).
Partition tolerance - the system keeps answering even when the network between its parts drops messages or some parts crash.

It has been proven (see the Gilbert and Lynch paper linked below) that all three cannot be achieved at the same time, so a distributed system has to settle for two out of three.

So let's make some order:
1. CAP describes properties of a distributed system.
2. ACID describes features of a transaction.

Transactions are a feature of relational databases, like MySQL, Oracle etc.

Big Data era.

The amounts of data generated in recent years have made scale and performance more important than some of the ACID features. Providing consistency, for example, makes the system much slower, and sometimes we can live with the idea that the data will not be consistent at this very second - as long as it becomes consistent eventually (eventual consistency, see the link below).

That is what made other kinds of DB systems popular. Some provide only parts of ACID, and some don't have SQL as their main interface - hence the name NoSQL databases.
Some of them avoid the relational storage model altogether.
But as a main concept they provide two out of the three CAP requirements, relax ACID, and solve some predefined data-storage problem, like graph relations, columnar storage etc.

Which means the main focus now is on the data instead of the system: you look at and investigate the data, and then pick a storage solution, instead of taking some relational storage and starting to model the data for it.




http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
http://en.wikipedia.org/wiki/Eventual_consistency
http://www.cs.berkeley.edu/~istoica/classes/cs268/06/notes/20-BFTx2.pdf




Thursday, January 9, 2014

Couchbase: almost a document-oriented database





After a deep investigation of several NoSQL DBs, Couchbase looks like one of the best options:

1. JSON support
Like a document-oriented DB, but it doesn't index all the fields on all levels of the document (good and bad!).
It is something between a document-oriented DB and a key-value store, with an option to create custom indexes (Views) - a View is actually a simple map-reduce job that runs automatically. A short API sketch follows after this list.

2. Auto-sharding
In one click you can simply add servers.

3. Cross-cluster replication

4. Object-level cache based on memcached

5. Built-in management and monitoring tool

6. Asynchronous persistence.


7. Benchmarks:
http://www.couchbase.com/presentations/benchmarking-couchbase
http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-708169.pdf

8. Best practices

9. Great documentation

10. Supports (not officially) geo search

11. Good APIs
Contains bulk modes (on Views),
CAS operations (optimistic locking)
and much more - see the sketch below.
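
As a taste of the API, here is a minimal sketch using the 1.x Java SDK (couchbase-client): a plain key-value write, a CAS update, and a View query. The bucket, keys, design-document and view names are made up, and a cluster on localhost is assumed.

import java.net.URI;
import java.util.Arrays;

import com.couchbase.client.CouchbaseClient;
import com.couchbase.client.protocol.views.Query;
import com.couchbase.client.protocol.views.View;
import com.couchbase.client.protocol.views.ViewRow;

import net.spy.memcached.CASResponse;
import net.spy.memcached.CASValue;

public class CouchbaseSketch {
    public static void main(String[] args) throws Exception {
        CouchbaseClient client = new CouchbaseClient(
                Arrays.asList(URI.create("http://localhost:8091/pools")),
                "default", "");

        // plain key-value write (0 = no expiration)
        client.set("user::rami", 0, "{\"name\":\"RAMI\",\"age\":11}").get();

        // CAS / optimistic locking: read the value together with its CAS id,
        // then write back only if nobody changed the key in between
        CASValue<Object> current = client.gets("user::rami");
        CASResponse result = client.cas("user::rami", current.getCas(),
                "{\"name\":\"RAMI\",\"age\":12}");
        if (result != CASResponse.OK) {
            // somebody else updated the key first - re-read and retry
        }

        // query a View; this assumes a design doc "users" with a view
        // "by_name" was already defined (its map function is JavaScript)
        View view = client.getView("users", "by_name");
        for (ViewRow row : client.query(view, new Query().setKey("RAMI"))) {
            System.out.println(row.getId());
        }

        client.shutdown();
    }
}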

It looks like it fits all average needs, but while working with it you need to pay attention to a few things:
1. Prefer working with keys and values over indexes (indexes defined by map-reduce jobs update incrementally and you cannot control them too much).
2. Compaction on views and buckets exists and can be controlled.
3. Don't define more than a couple of buckets (read the attached best-practices paper).

Have fun with Couchbase.




Thursday, December 5, 2013

HBase? No! Kiji + HBase!

I started using HBase about a year and a half ago. It took me some time to learn the terminology, configuration, APIs etc...
Then I discovered Kiji and started using it... A well-documented framework with code examples, tutorials and a clear design definitely improved my life.

Just check this out
http://www.kiji.org/



KijiSchema

Provides a simple Java API and command line interface for importing, managing, and retrieving data from HBase (that's the MAIN component!)
  • Set up HBase layouts using user-friendly tools including a DDL
  • Implement HBase best practices in table management
  • Use evolving Avro schema management to serialize complex data
  • Perform both short-request and batch processing on data in HBase
  • Import data from HDFS into structured HBase tables
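
To give a feel for the API, here is a minimal read sketch modelled on the KijiSchema tutorials; the instance URI, the "users" table and the "info:name" column are made-up examples:

import org.kiji.schema.Kiji;
import org.kiji.schema.KijiDataRequest;
import org.kiji.schema.KijiRowData;
import org.kiji.schema.KijiTable;
import org.kiji.schema.KijiTableReader;
import org.kiji.schema.KijiURI;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        // open the Kiji instance and one of its tables (names are made up)
        Kiji kiji = Kiji.Factory.open(
                KijiURI.newBuilder("kiji://.env/default").build());
        KijiTable table = kiji.openTable("users");
        KijiTableReader reader = table.openTableReader();
        try {
            // ask only for the most recent "info:name" cell of one row
            KijiDataRequest request = KijiDataRequest.create("info", "name");
            KijiRowData row = reader.get(table.getEntityId("user-42"), request);
            CharSequence name = row.getMostRecentValue("info", "name");
            System.out.println(name);
        } finally {
            reader.close();
            table.release();
            kiji.release();
        }
    }
}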

 KijiMR
  • KijiMR allows KijiSchema users to employ MapReduce-based techniques to develop many kinds of applications, including those using machine learning and other complex analytics.
    KijiMR is organized around three core MapReduce job types: Bulk Importers, Producers and Gatherers.
  • Bulk Importers make it easy to efficiently load data into Kiji tables from a variety of formats, such as JSON or CSV files stored in HDFS.
  • Producers are entity-centric operations that use an entity’s existing data to generate new information and store it back in the entity’s row. One typical use-case for producers is to generate new recommendations for a user based on the user’s history.
  • Gatherers provide flexible MapReduce computations that scan over Kiji table rows and output key-value pairs. By using different outputs and reducers, gatherers can export data in a variety of formats (such as text or Avro) or into other Kiji tables.
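
For example, a gatherer is little more than a map phase over table rows. Here is a hedged sketch modelled on the KijiMR tutorials (the "info:name" column and all class names are mine) that emits a count of 1 per user name, ready for a summing reducer:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.kiji.mapreduce.gather.GathererContext;
import org.kiji.mapreduce.gather.KijiGatherer;
import org.kiji.schema.KijiDataRequest;
import org.kiji.schema.KijiRowData;

// scans every row of a Kiji table and emits (name, 1) pairs
public class NameCountGatherer extends KijiGatherer<Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    public KijiDataRequest getDataRequest() {
        // only the "info:name" column is shipped to the map tasks
        return KijiDataRequest.create("info", "name");
    }

    @Override
    public void gather(KijiRowData row, GathererContext<Text, LongWritable> context)
            throws IOException {
        CharSequence name = row.getMostRecentValue("info", "name");
        if (name != null) {
            context.write(new Text(name.toString()), ONE);
        }
    }

    @Override
    public Class<?> getOutputKeyClass() { return Text.class; }

    @Override
    public Class<?> getOutputValueClass() { return LongWritable.class; }
}
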
KijiREST 
Provides a REST (Representational State Transfer) interface for Kiji, allowing applications to interact with a Kiji cluster over HTTP by issuing the four standard actions: GET, POST, PUT, and DELETE.

KijiExpress
 is a set of tools designed to make defining data processing MapReduce jobs quick and expressive, particularly for data stored in Kiji tables.

Scoring 
is the application of a trained model against data to produce an actionable result. This result could be a recommendation, classification, or other derivation.

Anyhow - no more pure HBase API. Only Kiji!

Saturday, November 16, 2013

Apache Kafka 0.8 beta compiled in scala 2.10

After several changes to the Scala code and the sbt configuration, I successfully compiled and tested Kafka 0.8 (Scala 2.10).


 Feel free to download and use:

https://drive.google.com/file/d/0B83rvqbRt-ksZDlOSzFiQTFpRm8/edit?usp=sharing


Saturday, October 12, 2013

Apache kafka on windows





After downloading Apache Kafka, I failed to start it up on Windows with the existing scripts.
I made some adjustments to some of the scripts and now the broker starts successfully.

Anyone who needs it, feel free to download:

https://docs.google.com/file/d/0B83rvqbRt-ksUEVUY1ZNTTRfcVk/edit?usp=sharing

or download updated scripts only:

https://docs.google.com/file/d/0B83rvqbRt-ksR2VPb2JzeHA1U2M/edit?usp=sharing

Just unrar it, then start ZooKeeper and the server.






Fastest REST service




Target
Handling 100K TPS per core for a JSON-based service on 64-bit Linux (RH).

After deep investigations.

Comparison of the following web servers:
Undertow with Servlet
Tomcat with Servlet
Jetty with Servlet

Frameworks
play http://www.playframework.com/
spray.io http://spray.io/

Low level frameworks
Netty
ZeroMQ


Conclusions

1. In order to reach the target load we have to release the service's IO thread as soon as possible.
2. A single entry per request vs. bulk requests: bulk wins.
3. Zero computation in the IO thread.

That's how I reached the performance target.
Final architecture:
1. Jetty with a servlet-based service (POST implementation).
2. Bulk mode with 100K entries per request.
3. Release the request ASAP (return 200 as soon as possible - then process); see the sketch below.
4. Still a synchronous servlet.
5. JMeter load testing.
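
A minimal sketch of that hand-off pattern, assuming the javax.servlet API; the class name, pool size and processing hook are mine:

import java.io.BufferedReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class BulkIngestServlet extends HttpServlet {
    // worker pool that does the real processing off the IO thread
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // read the bulk JSON body (100K entries per request)
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = req.getReader();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line);
        }
        final String body = sb.toString();

        // hand the payload to the workers and release the IO thread
        workers.submit(new Runnable() {
            public void run() {
                process(body);
            }
        });
        resp.setStatus(HttpServletResponse.SC_OK); // return 200 before processing
    }

    private void process(String bulkJson) {
        // parse and handle the entries here
    }
}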

Measuring
On the server side, by defining an int[120], mapping System.currentTimeMillis() / 1000 into the array and incrementing the appropriate slot:
myArray[(int) ((System.currentTimeMillis() / 1000) % 120)]++;
then printing once every 2 minutes and zeroing the array.
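
A runnable, thread-safe version of that counter (class and method names are mine; a plain int[] can drop increments when several IO threads hit it, so this sketch uses AtomicLongArray):

import java.util.concurrent.atomic.AtomicLongArray;

public class TpsCounter {
    // one slot per second over a two-minute window
    private final AtomicLongArray buckets = new AtomicLongArray(120);

    // call once per handled request, from any thread
    public void hit() {
        int slot = (int) ((System.currentTimeMillis() / 1000) % 120);
        buckets.incrementAndGet(slot);
    }

    // print the whole window and reset it; call every two minutes
    public void dumpAndReset() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 120; i++) {
            sb.append(buckets.getAndSet(i, 0)).append(' ');
        }
        System.out.println(sb);
    }
}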

Limiting to a single core
taskset -c 0 mycommand --option  # start a command with the given affinity
taskset -c 0 -p 1234             # set the affinity of a running process
BTW:
When the process was already running, taskset -p <PID> alone didn't work for me; without a CPU list it only prints the current affinity.

Future investigation

1. Async servlet.
2. Akka-based async service.
3. Netty + RESTEasy framework

Friday, October 4, 2013

REST Webservice on NETTY RESTEasy



I had to implement some simple REST service.
The requirements were:
1. Low latency
2. Highly scalable
3. Robust
4. High throughput
5. Simple
6. JSON-based parameter passing
I started with Tomcat with a servlet on top of it and got bad throughput.
I tried Jetty, but the throughput was still terrible.
Then I decided to use Netty with some REST layer on top of it.
Will do benchmarking and update you soon....

Resteasy
http://www.jboss.org/resteasy

Service :


@Path("/message")
public class RestService {

@Path("/test")
@POST
@Consumes("application/json")
@Produces("application/json")
public Response addOrder(Container a)
{
return Response.status(200).entity("DONE").build();
}

}

import java.util.ArrayList;

import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Container {

    private ArrayList<Parameter> parameters = new ArrayList<>();

    public void addParameter(Parameter p) {
        this.parameters.add(p);
    }

    @XmlElement
    public ArrayList<Parameter> getParameters() {
        return parameters;
    }

    public void setParameters(ArrayList<Parameter> parameters) {
        this.parameters = parameters;
    }
}

import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Parameter {

    private String name;
    private int age;

    @XmlElement
    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @XmlElement
    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }
}

import java.util.HashMap;
import java.util.Map;

import org.jboss.resteasy.plugins.server.netty.NettyJaxrsServer;
import org.jboss.resteasy.spi.ResteasyDeployment;
import org.jboss.resteasy.test.TestPortProvider;

public class Server {

    public static void main(String[] args) {
        ResteasyDeployment deployment = new ResteasyDeployment();

        // map URL suffixes to media types (e.g. /message/test.json)
        Map<String, String> mediaTypeMappings = new HashMap<String, String>();
        mediaTypeMappings.put("xml", "application/xml");
        mediaTypeMappings.put("json", "application/json");
        deployment.setMediaTypeMappings(mediaTypeMappings);

        // embed RESTEasy in a Netty-based HTTP server
        NettyJaxrsServer netty = new NettyJaxrsServer();
        netty.setDeployment(deployment);
        netty.setPort(TestPortProvider.getPort()); // 8081 by default
        netty.setRootResourcePath("");
        netty.setSecurityDomain(null);
        netty.start();

        // register the resource class once the server is up
        deployment.getRegistry().addPerRequestResource(RestService.class);
    }
}

Client :

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.client.WebTarget;
import javax.ws.rs.core.Response;

public class ClientMain {

    public static void main(String[] args) {
        Client client = ClientBuilder.newBuilder().build();
        WebTarget target = client.target("http://localhost:8081/message/test");

        // build a payload with a single parameter entry
        Container c = new Container();
        Parameter param = new Parameter();
        param.setAge(11);
        param.setName("RAMI");
        c.addParameter(param);

        // POST it as JSON and print the service's reply
        Response response = target.request().post(Entity.entity(c, "application/json"));
        String value = response.readEntity(String.class);
        System.out.println(value);
        response.close();
    }
}