
Tuesday, September 24, 2013

Typical NoSQL Big data solution (part 1)

Big data components

Inflow

This is the data that gets into the system. It can be files, events of any kind, or web pages; at this stage we don't care about the exact format.

Distributor

When we receive the inflow we need to distribute it. The distribution can be based on replicating the data to several destinations, or on routing it according to details of the data itself.
Example: if a log record contains the word "event", send it to HDFS only.
Examples: Apache Flume, Logstash, Fluentd
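
As a rough illustration, here is a minimal C++ sketch of that routing rule; the destination names "hdfs" and "short_term" are hypothetical, and a real deployment would express this as distributor configuration (for example, a Flume channel selector) rather than application code.

    #include <string>
    #include <vector>

    // Minimal sketch of content-based routing. The destinations are
    // hypothetical labels, not real endpoints.
    std::vector<std::string> route(const std::string& log_record) {
        // Rule from the example above: records containing the word
        // "event" go to HDFS only.
        if (log_record.find("event") != std::string::npos)
            return {"hdfs"};
        // Everything else is replicated to both storage tiers.
        return {"hdfs", "short_term"};
    }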


Storage - long term, short term

Then we save the data to storage. There are several types of storage, each with its own pros and cons.


Long term 

We need it to hold all of the data and analyze it by batch processing. In most cases it will be Hadoop-based HDFS storage, and we run MapReduce / Hive / Pig jobs to create reports.
As you can understand, this is a heavy and slow process.


Short term 

If we need our data to be easily and quickly accessible, we will use a highly scalable database. We have several types here:

Key–value stores. Examples: Redis, Riak, Dynamo, GemFire
Column stores. Examples: Vertica, MonetDB
Document stores. Examples: MongoDB, Cassandra, CouchDB
Graph databases. Examples: Neo4j

Data Model          | Performance     | Scalability | Flexibility | Complexity | Functionality
--------------------|-----------------|-------------|-------------|------------|--------------------
Key–value stores    | high            | high        | high        | none       | variable (none)
Column store        | high            | high        | moderate    | low        | minimal
Document store      | high            | variable (high) | high    | low        | variable (low)
Graph database      | variable        | variable    | high       | high       | graph theory
Relational database | variable        | variable    | low        | moderate   | relational algebra
The data is accessible much faster and is much more structured.



Real-time processing

In most cases this component will be Storm (http://storm-project.net/). It will pull the data (in our case from Kafka, http://kafka.apache.org/) and process it against the short-term, fast-access data.
Its decisions will probably be sent to some external systems to notify the end user.


End User 

The end user will use some stack for visualizing the data.
This layer can also contain a service for querying the data; in most cases queries will run against the short-term storage.

Next part is coming...

Sunday, December 23, 2012

Data Access optimization

[Figure: typical latency and bandwidth numbers for data transfer to and from different devices in computer systems.]

This sketch shows an overview of several data paths present in modern parallel computer systems, and typical ranges for their bandwidths and latencies. The functional units, which actually perform the computational work, sit at the top of this hierarchy. In terms of bandwidth, the slowest data paths are three to four orders of magnitude away, and eight in terms of latency. The deeper a data transfer must reach down through the different levels in order to obtain required operands for some calculation, the harder the impact on performance. Any optimization attempt should therefore first aim at reducing traffic over slow data paths, or, should this turn out to be infeasible, at least make data transfer as efficient as possible.

Optimization tips

Access memory in increasing address order.

In particular:

  • scan arrays in increasing order;
  • scan multidimensional arrays using the rightmost index for the innermost loops;
  • in class constructors and in assignment operators (operator=), access member variables in the order of declaration.

Data caches optimize memory accesses made in increasing sequential order. When a multidimensional array is scanned, the innermost loop should iterate over the last index, the innermost-but-one loop over the last-but-one index, and so on. This guarantees that array cells are processed in the same order in which they are arranged in memory, as the sketch below shows.
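
A minimal C++ sketch of the rule, with good and bad loop orders side by side; the array here is a flat row-major buffer, which is how C and C++ lay out built-in 2D arrays:

    #include <cstddef>

    // Good: the rightmost index j varies fastest, so consecutive
    // iterations touch consecutive addresses (stride-1, cache-friendly).
    void scale_good(double* a, std::size_t rows, std::size_t cols, double f) {
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                a[i * cols + j] *= f;
    }

    // Bad: the loops are swapped, so every iteration jumps a whole row
    // ahead and most of each loaded cache line is wasted.
    void scale_bad(double* a, std::size_t rows, std::size_t cols, double f) {
        for (std::size_t j = 0; j < cols; ++j)
            for (std::size_t i = 0; i < rows; ++i)
                a[i * cols + j] *= f;
    }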

Memory alignment

Keep the compiler's default memory alignment.
By default, compilers use an alignment criterion for fundamental types, under which objects may have only memory addresses that are a multiple of particular factors. This criterion guarantees top performance, but it may add padding (holes) between successive objects.
If it is necessary to avoid such padding for some structures, use the pragma directive only around those structure definitions.
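
A minimal sketch of the trade-off; #pragma pack is supported by MSVC, GCC, and Clang, and the exact sizes are platform-dependent (the values in the comments assume a typical platform with 4-byte alignment for uint32_t):

    #include <cstdint>
    #include <iostream>

    struct Padded {            // default alignment
        char     tag;          // 1 byte, followed by 3 bytes of padding
        uint32_t value;        // must start at a 4-byte boundary
    };                         // sizeof(Padded) == 8 on typical platforms

    #pragma pack(push, 1)      // pack only this structure definition
    struct Packed {
        char     tag;          // 1 byte, no padding follows
        uint32_t value;        // may be misaligned: smaller, but slower to access
    };
    #pragma pack(pop)          // restore the default alignment
                               // sizeof(Packed) == 5

    int main() {
        std::cout << sizeof(Padded) << ' ' << sizeof(Packed) << '\n';  // e.g. "8 5"
    }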

Grouping functions in compilation units

Define in the same compilation unit all the member functions of a class, all the friend functions of that class, and all the member functions of friend classes of that class, except when the resulting file becomes unwieldy because of its size.
In this way, both the machine code resulting from the compilation of these functions and the static data defined in these classes and functions will have addresses near each other; in addition, even compilers that do not perform whole-program optimization may optimize the calls among these functions.
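
As a sketch of this layout (the Widget class is hypothetical), with the intended file split marked in comments:

    // widget.h (sketch): declarations only.
    class Widget {
    public:
        void draw() const;
        void resize(int w, int h);
        friend bool operator==(const Widget& a, const Widget& b);
    private:
        int w_ = 0, h_ = 0;
    };

    // widget.cpp (sketch): every member function and friend function is
    // defined in this one compilation unit, so their machine code lands
    // close together and calls among them can be optimized even without
    // whole-program optimization.
    void Widget::draw() const { /* render using w_ and h_ */ }
    void Widget::resize(int w, int h) { w_ = w; h_ = h; }
    bool operator==(const Widget& a, const Widget& b) {
        return a.w_ == b.w_ && a.h_ == b.h_;
    }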

Grouping variables in compilation units

Define every global variable in the compilation unit in which it is used most often.
In this way, such variables will have addresses near to each other and to the static variables defined in that compilation unit; in addition, even compilers that do not perform whole-program optimization may optimize the access to these variables from the functions that use them most often.
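
A minimal sketch, assuming a hypothetical counter that is read and written mainly by one compilation unit:

    // stats.cpp (sketch): the global is defined in the unit that uses it
    // most often, so it sits near this unit's static data and functions.
    long g_processed = 0;

    void process_item() {
        ++g_processed;    // the frequent accesses stay local to this unit
    }

    // Other compilation units that occasionally need it would declare:
    //     extern long g_processed;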

Private functions and variables in compilation units

Declare in an anonymous namespace the variables and functions that are global to a compilation unit but not used by other compilation units.
In C, and also in C++, such variables and functions may be declared static. However, in modern C++ the use of static global variables and functions is not recommended; they should be replaced by variables and functions declared in an anonymous namespace.
In both cases, the compiler is notified that such identifiers will never be used by other compilation units. This allows compilers that do not perform whole-program optimization to optimize the usage of these variables and functions.
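
A minimal sketch (the names are hypothetical):

    // helpers.cpp (sketch): everything in the anonymous namespace has
    // internal linkage, so no other compilation unit can reference it
    // and the compiler may inline or eliminate it freely.
    namespace {
        int call_count = 0;    // private to this compilation unit

        int next_id() {        // likewise invisible outside this unit
            return ++call_count;
        }
    }

    int make_id() {            // the only externally visible symbol
        return next_id();
    }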




Source
Introduction to High Performance Computing for Scientists and Engineers,
Georg Hager, Gerhard Wellein