Big data components
In flow
This is the raw data entering the system. It can be files, events of any kind, or web pages; the exact source does not matter.
Distributor
When we receive the incoming flow we need to distribute it. Distribution can be based on replicating the data to several destinations, or on routing it according to details of the data itself.
Example: if a log record contains the word "event", send it to HDFS only.
Examples: Apache Flume, Logstash, Fluentd
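The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not an actual Flume/Logstash configuration; the destination names are hypothetical stand-ins for the real sinks.

```python
def route(record):
    """Return the destination for a single log record (content-based routing)."""
    if "event" in record:
        return "hdfs"          # the rule above: records containing "event" go to HDFS only
    return "short_term"        # everything else goes to fast short-term storage

# usage: route a small batch of records
records = ["user event: login", "heartbeat ok"]
destinations = [route(r) for r in records]   # ["hdfs", "short_term"]
```

A real distributor expresses the same idea declaratively (e.g. a conditional output block in a Logstash pipeline), but the decision logic is exactly this kind of predicate on the record's content.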
Storages - Long term, short term
Then we save the data to storage. There are several types of storage, each with its own pros and cons.
Long term
We need it to hold the whole data set and analyze it by batch processing. In most cases it will be Hadoop-based HDFS storage, and we run MapReduce / Hive / Pig jobs to create reports.
As you can understand, this is a heavy and slow process.
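The batch-processing model can be sketched in plain Python: a map phase that emits key/value pairs and a reduce phase that aggregates them, producing the kind of report a Hive or Pig job would compute over HDFS. This is a pure-Python simulation of the model, not an actual Hadoop job.

```python
from collections import defaultdict

def mapper(record):
    # emit (key, 1) pairs; here the key is the first word (the log level)
    level = record.split()[0]
    yield (level, 1)

def reducer(pairs):
    # sum the counts per key, as the reduce phase would
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

logs = ["ERROR disk full", "INFO started", "ERROR timeout"]
report = reducer(pair for rec in logs for pair in mapper(rec))
# report == {"ERROR": 2, "INFO": 1}
```

On a real cluster the mapper and reducer run distributed over HDFS blocks, which is exactly why the process is heavy: every batch job re-scans the stored data.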
Short term
If we need the data to be easily and quickly accessible, we will use a highly scalable database. There are several types here:
Key-value databases - http://en.wikipedia.org/wiki/Key-value_database
Examples: Redis, Riak, Dynamo, GemFire
Columnar databases - http://en.wikipedia.org/wiki/Column-oriented_DBMS
Examples: Vertica, MonetDB, Cassandra
Document databases - http://en.wikipedia.org/wiki/Document-oriented_database
Examples: MongoDB, CouchDB
Graph databases - http://en.wikipedia.org/wiki/Graph_database
Examples: Neo4j
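The difference between two of these models can be shown with plain Python stand-ins: a key-value store maps an opaque key to an opaque value, while a document store keeps structure it can query into. The dict and list below are illustrative substitutes for Redis- and MongoDB-style stores, not real client code.

```python
import json

kv_store = {}                      # key -> opaque blob, Redis-style
kv_store["user:42"] = json.dumps({"name": "Ann", "city": "Oslo"})

doc_store = []                     # structured documents, MongoDB-style
doc_store.append({"_id": 42, "name": "Ann", "city": "Oslo"})

# key-value: fetch by key only; the value is just text to the store
blob = kv_store["user:42"]

# document: the store can filter on fields inside the document
ann = [d for d in doc_store if d["city"] == "Oslo"]
```

This is why the table below rates key-value stores "none" on complexity and functionality: all the structure lives in the application, not in the store.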
| Data Model | Performance | Scalability | Flexibility | Complexity | Functionality |
|---|---|---|---|---|---|
| Key-value store | high | high | high | none | variable (none) |
| Column store | high | high | moderate | low | minimal |
| Document store | high | variable (high) | high | low | variable (low) |
| Graph database | variable | variable | high | high | graph theory |
| Relational database | variable | variable | low | moderate | relational algebra |
Here the data is accessed much faster and is much more structured.
Real time processing
In most cases this component will be Storm (http://storm-project.net/). It pulls the data (in our case from Kafka, http://kafka.apache.org/) and processes it against the short-term, fast-access data.
Its decisions will probably be sent to some external systems to notify the end user.
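The real-time path can be sketched as a consume-process loop: pull a message from the stream and enrich it with a lookup in the short-term store. In the sketch below a plain list stands in for the Kafka topic and a dict stands in for the fast storage; a real deployment would run this logic inside a Storm topology fed by a Kafka consumer.

```python
short_term = {"user:1": "premium", "user:2": "free"}   # fast-access data

def process(message):
    # each message is "user_id,action"; enrich it from short-term storage
    user_id, action = message.split(",")
    tier = short_term.get(user_id, "unknown")          # lookup in the fast store
    return {"user": user_id, "action": action, "tier": tier}

stream = ["user:1,click", "user:3,view"]               # stand-in for a Kafka topic
decisions = [process(m) for m in stream]
```

The resulting decisions are what would be pushed to external systems to notify the end user.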
End User
The end user will use some stack for visualizing the data. It can also include a service for querying the data, in most cases against the short-term storage.
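A query service for the visualization layer can be as small as one function over the short-term data. The metric names and the in-memory dict below are hypothetical; a real stack would expose this over HTTP in front of the fast storage.

```python
metrics = {"clicks": [10, 12, 9], "errors": [1, 0, 2]}   # stand-in for short-term data

def query(metric_name, agg="sum"):
    """Aggregate one metric from short-term storage for the visualization layer."""
    values = metrics.get(metric_name, [])
    if agg == "sum":
        return sum(values)
    if agg == "avg":
        return sum(values) / len(values) if values else 0
    raise ValueError("unsupported aggregation: " + agg)

total_clicks = query("clicks")            # 31
avg_errors = query("errors", agg="avg")   # 1.0
```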
The next part is coming...