Tuesday, September 24, 2013

Typical NoSQL Big Data solution (part 1)

Big data components

In flow

This is the data that actually enters the system. It can be files, events of any kind, or web pages - we don't care.

Distributor

When we receive our in flow we need to distribute it. The distribution can be based on replicating the data to several destinations, or on routing it according to some detail of the data itself.
Example: if a log record contains the word "event", send it to HDFS only (see the sketch below).
Examples: Apache Flume, Logstash, Fluentd
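
To make the routing rule concrete, here is a minimal sketch in plain Java. The Sink interface and both destinations are hypothetical stand-ins for whatever your distributor (a Flume interceptor, a Logstash filter, etc.) actually provides:

import java.util.Arrays;

// Hypothetical destination abstraction - not a real Flume/Logstash API.
interface Sink {
    void send(String record);
}

class LogRouter {
    private final Sink hdfsSink;
    private final Sink defaultSink;

    LogRouter(Sink hdfsSink, Sink defaultSink) {
        this.hdfsSink = hdfsSink;
        this.defaultSink = defaultSink;
    }

    // Route based on a detail of the data itself:
    // records containing the word "event" go to HDFS only.
    void route(String record) {
        if (record.contains("event")) {
            hdfsSink.send(record);
        } else {
            defaultSink.send(record);
        }
    }

    public static void main(String[] args) {
        Sink hdfs = r -> System.out.println("HDFS    <- " + r);
        Sink other = r -> System.out.println("default <- " + r);
        LogRouter router = new LogRouter(hdfs, other);
        for (String r : Arrays.asList("user event: login", "heartbeat ok")) {
            router.route(r);
        }
    }
}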


Storages - Long term, short term 

Then we save the data to storage. We have several types of storage, and each one has its pros and cons.


Long term 

We need it to hold the entire data set and analyze it with batch processing. In most cases it will be Hadoop-based HDFS storage, and we run Map-Reduce / Hive / Pig jobs to create reports.
As you can understand, it's a heavy and slow process.
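
As a minimal illustration of the batch style, here is the classic Hadoop Map-Reduce job in Java that counts how many times each word appears in the stored logs; the input and output paths are hypothetical:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();             // sum all counts for this word
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /logs/in
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /logs/out
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}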


Short term 

If we need our data to be easily and quickly accessible, we will use some highly scalable database. We have several types here:
Key–value stores. Examples: Redis, Riak, Dynamo, GemFire
Column stores. Examples: Vertica, MonetDB
Document stores. Examples: MongoDB, Cassandra, CouchDB
Graph databases. Examples: Neo4j

Data Model            Performance   Scalability       Flexibility   Complexity   Functionality
Key–value Stores      high          high              high          none         variable (none)
Column Store          high          high              moderate      low          minimal
Document Store        high          variable (high)   high          low          variable (low)
Graph Database        variable      variable          high          high         graph theory
Relational Database   variable      variable          low           moderate     relational algebra
Here the data is accessible much faster and is much more structured.
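
For example, a short-term lookup against Redis is a single round trip. A minimal sketch using the Jedis client (the host and key names are hypothetical):

import redis.clients.jedis.Jedis;

public class ShortTermLookup {
  public static void main(String[] args) {
    // Connect to a (hypothetical) local Redis instance.
    Jedis jedis = new Jedis("localhost", 6379);
    try {
      // Write and read back a counter - fast, structured access.
      jedis.incr("events:login:count");
      String count = jedis.get("events:login:count");
      System.out.println("login events so far: " + count);
    } finally {
      jedis.close();
    }
  }
}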



Real time processing

In most cases this component will be Storm (http://storm-project.net/). It pulls the data (in our case from Kafka, http://kafka.apache.org/) and processes it, using the short-term, fast-access data for lookups.
Its decisions can then be sent to external systems to notify the end user.
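
A minimal sketch of such a processing step using Storm's Java bolt API (the alert rule is hypothetical, and the Kafka spout that feeds the bolt is omitted):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Processes records pulled from Kafka; records that look like
// alerts are emitted downstream so an external notifier can act.
public class AlertBolt extends BaseBasicBolt {

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String record = tuple.getString(0);
    // Hypothetical rule; a real bolt would also consult the
    // short-term storage (e.g. Redis) to enrich the decision.
    if (record.contains("event")) {
      collector.emit(new Values(record));
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("alert"));
  }
}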


End User 

The end user will use some stack for visualizing the data.
It can also include a service for querying the data. In most cases the queries will run against the short-term storages.

The next part is coming...

Wednesday, September 11, 2013

Add an auto-generated field to Solr 4





1.       Define a new type:
<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
2.       Add a new field:
<field name="rami" type="uuid" indexed="true" stored="true" default="NEW"/>
(the default="NEW" parameter does the trick!)

3.       Define an update request processor chain:
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">rami</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

4.       Add the chain to the relevant handler.
Example: for /update/extract
  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
      <str name="update.chain">uuid</str>
    </lst>
  </requestHandler>



Now you can execute /update/extract without passing the field "rami", and it will be generated automatically.
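
For instance, posting a file through SolrJ without the "rami" field. A minimal sketch, assuming a local Solr 4 instance; the URL, file name, and literal id are hypothetical:

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithoutUuid {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

    // Send a document to /update/extract; note that no "rami"
    // field is passed - the uuid chain fills it in.
    ContentStreamUpdateRequest req =
        new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("sample.pdf"), "application/pdf");
    req.setParam("literal.id", "doc1");
    req.setParam("commit", "true");
    server.request(req);
  }
}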

Thursday, September 5, 2013

Big Data analytics - tools



All the traditional players such as SAS, IBM SPSS, KXEN, Matlab, Statsoft, Tableau, Pentaho, and others are working toward Hadoop-based Big Data analytics. However, each of these software players has to balance its current technology and customer portfolio against the incredible pace of innovation occurring in the open-source community. Most of the tools have high-speed connectors to move data back and forth between Hadoop and their own environment. With Big Data, though, the objective is to keep the data in place and bring the analytics processing to the data, avoiding the bottleneck and constraints associated with data movement. Over time, each vendor will develop a strategy and approach for doing exactly that.
In the meantime, there are new commercial vendors and open-source projects evolving to address the voracious appetite for Big Data analytics. Karmasphere (https://karmasphere.com/) is a native Hadoop-based tool for data exploration and visualization. Datameer (http://www.datameer.com/) is a spreadsheet-like presentation tool. Alpine Data Miner (http://www.alpinedatalabs.com/) has a cross-platform analytic workbench.
R (http://cran.r-project.org/) is by far the most dominant analytics tool in the Big Data space. R is an open-source statistical language with constructs that make it easy for data scientists to explore and build models. R is also renowned for the plethora of available analytics. There are libraries focused on industry problems (i.e., clinical trials, genetics, finance, and others) as well as general purpose libraries (i.e., econometrics, natural language processing, optimization, time series, and many more). At this point, there are supposedly over two million R users around the globe and a commercial distribution is available via Revolution Analytics.
Open-source technologies include:
Reference:
  • Minelli, Chambers, Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses" (Wiley, 2013)