Thursday, December 5, 2013

HBase? No! Kiji + HBase!

I started using HBase about 1.5 years ago. It took me some time to learn the terminology, configuration, APIs, etc.
Then I discovered Kiji and started using it. The well-documented framework, with code examples, tutorials, and a clear design, definitely improved my life.

Just check this out
http://www.kiji.org/



KijiSchema

Provides a simple Java API and command-line interface for importing, managing, and retrieving data from HBase (that's the MAIN component!). A short usage sketch follows the list below.
  • Set up HBase layouts using user-friendly tools including a DDL
  • Implement HBase best practices in table management
  • Use evolving Avro schema management to serialize complex data
  • Perform both short-request and batch processing on data in HBase
  • Import data from HDFS into structured HBase tables
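
As a taste of that Java API, here is a minimal sketch of writing and reading one cell, written from memory of the KijiSchema documentation; the instance URI, the "users" table, and the info:name column are my own assumptions, so treat it as illustrative rather than exact.

import org.kiji.schema.Kiji;
import org.kiji.schema.KijiDataRequest;
import org.kiji.schema.KijiRowData;
import org.kiji.schema.KijiTable;
import org.kiji.schema.KijiTableReader;
import org.kiji.schema.KijiTableWriter;
import org.kiji.schema.KijiURI;

public class KijiSchemaExample {

    public static void main(String[] args) throws Exception {
        // Open the default Kiji instance and a hypothetical "users" table.
        Kiji kiji = Kiji.Factory.open(KijiURI.newBuilder("kiji://.env/default").build());
        KijiTable table = kiji.openTable("users");

        // Write one cell into the row for user "rami".
        KijiTableWriter writer = table.openTableWriter();
        writer.put(table.getEntityId("rami"), "info", "name", "Rami");
        writer.close();

        // Read it back with a data request for the info:name column.
        KijiTableReader reader = table.openTableReader();
        KijiRowData row = reader.get(table.getEntityId("rami"), KijiDataRequest.create("info", "name"));
        System.out.println(row.getMostRecentValue("info", "name"));
        reader.close();

        table.release();
        kiji.release();
    }
}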

KijiMR
  • KijiMR allows KijiSchema users to employ MapReduce-based techniques to develop many kinds of applications, including those using machine learning and other complex analytics. KijiMR is organized around three core MapReduce job types: Bulk Importers, Producers, and Gatherers.
  • Bulk Importers make it easy to efficiently load data into Kiji tables from a variety of formats, such as JSON or CSV files stored in HDFS.
  • Producers are entity-centric operations that use an entity’s existing data to generate new information and store it back in the entity’s row. One typical use case for producers is to generate new recommendations for a user based on the user’s history.
  • Gatherers provide flexible MapReduce computations that scan over Kiji table rows and output key-value pairs. By using different outputs and reducers, gatherers can export data in a variety of formats (such as text or Avro) or into other Kiji tables. A rough sketch of a gatherer follows this list.
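
To make that concrete, here is a hedged sketch of what a gatherer class roughly looks like, based on my recollection of the KijiMR tutorial; the "users" table, the info:name column, and the output types are assumptions, not code from a real project.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.kiji.mapreduce.gather.GathererContext;
import org.kiji.mapreduce.gather.KijiGatherer;
import org.kiji.schema.KijiDataRequest;
import org.kiji.schema.KijiRowData;

// Hypothetical gatherer: emits (name, 1) for every row of a "users" table.
public class NameCountGatherer extends KijiGatherer<Text, LongWritable> {

    @Override
    public KijiDataRequest getDataRequest() {
        // Ask only for the info:name column of each row.
        return KijiDataRequest.create("info", "name");
    }

    @Override
    public void gather(KijiRowData row, GathererContext<Text, LongWritable> context) throws IOException {
        CharSequence name = row.getMostRecentValue("info", "name");
        context.write(new Text(name.toString()), new LongWritable(1L));
    }

    @Override
    public Class<?> getOutputKeyClass() {
        return Text.class;
    }

    @Override
    public Class<?> getOutputValueClass() {
        return LongWritable.class;
    }
}
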
KijiREST 
Provides a REST (Representational State Transfer) interface for Kiji, allowing applications to interact with a Kiji cluster over HTTP by issuing the four standard actions: GET, POST, PUT, and DELETE.

KijiExpress
A set of tools designed to make defining MapReduce data-processing jobs quick and expressive, particularly for data stored in Kiji tables.

Scoring
The application of a trained model against data to produce an actionable result. This result could be a recommendation, classification, or other derivation.

Anyhow: no more pure HBase API. Only Kiji!

Saturday, November 16, 2013

Apache Kafka 0.8 beta compiled with Scala 2.10

After several changes to the Scala code and the sbt configuration, I successfully compiled and tested Kafka 0.8 against Scala 2.10.


 Feel free to download and use:

https://drive.google.com/file/d/0B83rvqbRt-ksZDlOSzFiQTFpRm8/edit?usp=sharing


Saturday, October 12, 2013

Apache Kafka on Windows





After downloading Apache Kafka, I failed to start it up on Windows with the existing scripts.
I made some adjustments to a few of the scripts, and now the broker starts successfully.

Anyone who needs it, feel free to download:

https://docs.google.com/file/d/0B83rvqbRt-ksUEVUY1ZNTTRfcVk/edit?usp=sharing

or download updated scripts only:

https://docs.google.com/file/d/0B83rvqbRt-ksR2VPb2JzeHA1U2M/edit?usp=sharing

Just unrar it, then start ZooKeeper and the Kafka server.






Fastest REST service




Target
Handle 100K TPS per core for a JSON-based service on 64-bit Linux (RHEL).

After a deep investigation.

Comparison of the following web servers:
Undertow with Servlet
Tomcat with Servlet
Jetty with Servlet

Frameworks
play http://www.playframework.com/
spray.io http://spray.io/

Low level frameworks
Netty
ZeroMQ


Conclusions

1. In order to reach the target load, the service's I/O thread must be released as soon as possible.
2. Use bulk requests (many entries per request) rather than single-entry requests.
3. Do zero calculation in the I/O thread.

That's how I reached the performance target.
Final architecture:
1. Jetty with a servlet-based service (POST implementation).
2. Bulk mode with 100K entries per request.
3. Release the request ASAP (return 200 as soon as possible, then process; see the sketch after this list).
4. Still a synchronous servlet.
5. JMeter for load testing.
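
A minimal sketch of the "return 200 first, process later" idea, assuming a plain synchronous servlet and an in-memory hand-off queue; the class name and the queue are my own, not the production code.

import java.io.BufferedReader;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet: read the bulk body, hand it off, answer 200 immediately.
public class BulkIngestServlet extends HttpServlet {

    // Shared queue; a separate worker thread pool drains it and does the real processing.
    private static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<String>();

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        StringBuilder body = new StringBuilder();
        BufferedReader reader = req.getReader();
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line);
        }
        QUEUE.offer(body.toString());              // no parsing, no calculation on the I/O thread
        resp.setStatus(HttpServletResponse.SC_OK); // return 200 as soon as possible
    }
}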

Measuring
On the server side, by defining an int[120], taking System.currentTimeMillis() / 1000, and incrementing the appropriate slot in the array:
myArray[(int) ((System.currentTimeMillis() / 1000) % 120)]++;
then printing the counters once every 2 minutes and zeroing them.
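
And a minimal sketch of that per-second counter; I use an AtomicIntegerArray here (my choice, the original was a plain int[120]) to keep the increment cheap on the request thread.

import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical per-second request counter covering a 120-second window.
public class PerSecondCounter {

    private final AtomicIntegerArray counts = new AtomicIntegerArray(120);

    // Called from the request path: one array increment, no allocation.
    public void hit() {
        int slot = (int) ((System.currentTimeMillis() / 1000) % counts.length());
        counts.incrementAndGet(slot);
    }

    // Called once every 2 minutes: print each second's count and reset it.
    public void dumpAndReset() {
        for (int i = 0; i < counts.length(); i++) {
            System.out.println("second " + i + ": " + counts.getAndSet(i, 0) + " requests");
        }
    }
}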

Limitation on single core
taskset -c 0 mycommand --option  # start a command with the given affinity
taskset -c 0 -p 1234             # set the affinity of a running process
BTW:
When the process was already running, taskset -p <PID> didn't work for me.

Future investigation

1. Async servlet.
2. Akka-based async service.
3. Netty + RESTEasy framework.

Friday, October 4, 2013

REST web service on Netty + RESTEasy



I had to implement a simple REST service.
Requirements were:
1. Low latency
2. Highly scalable
3. Robust
4. High throughput
5. Simple
6. JSON-based parameter passing
I started with Tomcat with a servlet on top of it and got bad throughput.
I tried Jetty, but the throughput was still terrible.
Then I decided to use Netty with a REST stack on top of it.
Will do benchmarking and update you soon...

Resteasy
http://www.jboss.org/resteasy

Service :


import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Response;

@Path("/message")
public class RestService {

    @Path("/test")
    @POST
    @Consumes("application/json")
    @Produces("application/json")
    public Response addOrder(Container a) {
        return Response.status(200).entity("DONE").build();
    }
}

import java.util.ArrayList;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Container {

    private ArrayList<Parameter> parameters = new ArrayList<>();

    public void AddParameter(Parameter p) {
        this.parameters.add(p);
    }

    @XmlElement
    public ArrayList<Parameter> getParameters() {
        return parameters;
    }

    public void setParameters(ArrayList<Parameter> parameters) {
        this.parameters = parameters;
    }
}

import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement
public class Parameter {

    private String name;
    private int age;

    @XmlElement
    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    @XmlElement
    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }
}

import java.util.HashMap;
import java.util.Map;
import org.jboss.resteasy.plugins.server.netty.NettyJaxrsServer;
import org.jboss.resteasy.spi.ResteasyDeployment;
import org.jboss.resteasy.test.TestPortProvider;

public class Server {

    public static void main(String[] args) {
        ResteasyDeployment deployment = new ResteasyDeployment();
        Map<String, String> mediaTypeMappings = new HashMap<String, String>();
        mediaTypeMappings.put("xml", "application/xml");
        mediaTypeMappings.put("json", "application/json");
        deployment.setMediaTypeMappings(mediaTypeMappings);

        NettyJaxrsServer netty = new NettyJaxrsServer();
        netty.setDeployment(deployment);
        netty.setPort(TestPortProvider.getPort());
        netty.setRootResourcePath("");
        netty.setSecurityDomain(null);
        netty.start();

        deployment.getRegistry().addPerRequestResource(RestService.class);
    }
}

Client :

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.client.WebTarget;
import javax.ws.rs.core.Response;

public class ClientMain {

    public static void main(String[] args) {
        Client client = ClientBuilder.newBuilder().build();
        WebTarget target = client.target("http://localhost:8081/message/test");

        Container c = new Container();
        Parameter param = new Parameter();
        param.setAge(11);
        param.setName("RAMI");
        c.AddParameter(param);

        Response response = target.request().post(Entity.entity(c, "application/json"));
        String value = response.readEntity(String.class);
        System.out.println(value);
        response.close();
    }
}

Tuesday, September 24, 2013

Typical NoSQL Big Data solution (part 1)

Big data components

In flow

This is the data that gets into the system. It can be files, any kind of events, or web pages; we don't care.

Distributor

When we receive our in flow, we need to distribute it. The distribution can be based on replicating the data to several destinations, or on routing according to some detail of the data.
Example: if a log record contains the word "event", send it to HDFS only.
Examples: Apache Flume, Logstash, Fluentd


Storages - Long term, short term 

Then we save the data to storage. We have several types of storage, and each one has its pros and cons.


Long term 

We need it to hold the whole data set and analyze it by batch processing. In most cases it will be Hadoop-based HDFS storage, and we run MapReduce / Hive / Pig jobs to create reports.
As you can understand, this is a heavy and slow process.


Short term 

If we need our data to be easily and quickly accessible, we will use some highly scalable database. We have several types here:
Key-value stores: Redis, Riak, Dynamo, Gemfire
Column stores: Vertica, MonetDB
Document stores: MongoDB, Cassandra, CouchDB
Graph databases: Neo4j

Data Model           | Performance | Scalability     | Flexibility | Complexity | Functionality
Key-value stores     | high        | high            | high        | none       | variable (none)
Column stores        | high        | high            | moderate    | low        | minimal
Document stores      | high        | variable (high) | high        | low        | variable (low)
Graph databases      | variable    | variable        | high        | high       | graph theory
Relational databases | variable    | variable        | low         | moderate   | relational algebra
The data is accessed much faster and is much more structured.



Real time processing

In most cases this component will be Storm (http://storm-project.net/). It will pull the data (in our case we use Kafka, http://kafka.apache.org/) and process it against the short-term, fast-access data.
Its decisions will probably be sent to some external systems to notify the end user; a rough sketch of such a topology follows.
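
As an illustration only, here is a minimal sketch of a Storm topology reading events from Kafka via the storm-kafka spout; the ZooKeeper address, the "events" topic, and the ProcessBolt are placeholders of mine and not taken from a real deployment.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class RealTimeTopology {

    // Hypothetical bolt: this is where lookups against the short-term store would go.
    public static class ProcessBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String event = tuple.getString(0);
            System.out.println("processing event: " + event);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt - no output fields
        }
    }

    public static void main(String[] args) {
        // Read the "events" topic from Kafka via ZooKeeper (placeholder addresses).
        SpoutConfig spoutConfig = new SpoutConfig(new ZkHosts("localhost:2181"), "events", "/kafka-spout", "event-reader");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        builder.setBolt("process", new ProcessBolt(), 2).shuffleGrouping("kafka-spout");

        new LocalCluster().submitTopology("real-time-processing", new Config(), builder.createTopology());
    }
}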


End User 

Will use some stack for visualizing the data.
It can also contain a service for querying data; in most cases this will query the short-term storage.

Next part is coming...

Wednesday, September 11, 2013

Add an auto-generated field to Solr 4





1.       Define a new type:
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
2.       Add a new field:
<field name="rami" type="uuid" indexed="true" stored="true" default="NEW"/>
(the default="NEW" attribute does the trick!)

3.       Add an update request processor chain:
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">rami</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

4.       Add the chain to the relevant handler.
Example: for /update/extract
  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
      <str name="update.chain">uuid</str>
    </lst>
  </requestHandler>



Now you can execute /update/extract without passing the field "rami", and it will be generated automatically.
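
For completeness, a small SolrJ sketch of posting a file through /update/extract without the "rami" field; the Solr URL, the file name, and the literal.id value are my own placeholders, and the uuid chain configured above should fill the field on the way in.

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithUuid {

    public static void main(String[] args) throws Exception {
        // Core URL and file name are placeholders.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // Post a document through the /update/extract handler configured above.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("sample.pdf"), "application/pdf");
        req.setParam("literal.id", "doc1");                         // note: no "rami" field is passed
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        server.request(req);
        // The uuid update chain fills "rami" with a generated UUID during indexing.
    }
}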