
Tuesday, September 24, 2013

Typical NoSQL Big data solution (part 1)

Big data components

Inflow

This is the data that gets into the system. It can be files, events of any kind, or web pages; at this stage we don't care about the exact format.

Distributor

When we receive the inflow we need to distribute it. The distribution can be based on replicating the data to several destinations, or on routing it according to details of the data itself.
Example: if a log record contains the word "event", send it to HDFS only.
Examples: Apache Flume, Logstash, Fluentd
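
As a rough illustration, here is a minimal C++ sketch of that routing rule; the destination names "hdfs" and "short_term" are hypothetical, and a real deployment would express this as distributor configuration (for example, a Flume channel selector) rather than application code.

    #include <string>
    #include <vector>

    // Minimal sketch of content-based routing. The destinations are
    // hypothetical labels, not real endpoints.
    std::vector<std::string> route(const std::string& log_record) {
        // Rule from the example above: records containing the word
        // "event" go to HDFS only.
        if (log_record.find("event") != std::string::npos)
            return {"hdfs"};
        // Everything else is replicated to both storage tiers.
        return {"hdfs", "short_term"};
    }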


Storage - long term, short term

Then we save the data to storage. There are several types of storage, each with its own pros and cons.


Long term 

We need it to hold all of the data and analyze it by batch processing. In most cases it will be Hadoop-based HDFS storage, and we run MapReduce / Hive / Pig jobs to create reports.
As you can understand, this is a heavy and slow process.


Short term 

If we need our data to be easily and quickly accessible, we will use a highly scalable database. We have several types here:

Key–value stores. Examples: Redis, Riak, Dynamo, GemFire
Column stores. Examples: Vertica, MonetDB
Document stores. Examples: MongoDB, Cassandra, CouchDB
Graph databases. Examples: Neo4j

Data Model          | Performance     | Scalability | Flexibility | Complexity | Functionality
--------------------|-----------------|-------------|-------------|------------|--------------------
Key–value stores    | high            | high        | high        | none       | variable (none)
Column store        | high            | high        | moderate    | low        | minimal
Document store      | high            | variable (high) | high    | low        | variable (low)
Graph database      | variable        | variable    | high       | high       | graph theory
Relational database | variable        | variable    | low        | moderate   | relational algebra
The data is accessible much faster and is much more structured.



Real-time processing

In most cases this component will be Storm (http://storm-project.net/). It will pull the data (in our case from Kafka, http://kafka.apache.org/) and process it against the short-term, fast-access data.
Its decisions will probably be sent to some external systems to notify the end user.


End User 

The end user will use some stack for visualizing the data.
This layer can also contain a service for querying the data; in most cases queries will run against the short-term storage.

Next part is coming...

Sunday, December 23, 2012

Data Access optimization

[Figure: typical latency and bandwidth numbers for data transfer to and from different devices in computer systems.]

This sketch shows an overview of several data paths present in modern parallel computer systems, and typical ranges for their bandwidths and latencies. The functional units, which actually perform the computational work, sit at the top of this hierarchy. In terms of bandwidth, the slowest data paths are three to four orders of magnitude away, and eight in terms of latency. The deeper a data transfer must reach down through the different levels in order to obtain required operands for some calculation, the harder the impact on performance. Any optimization attempt should therefore first aim at reducing traffic over slow data paths, or, should this turn out to be infeasible, at least make data transfer as efficient as possible.

Optimization tips

Access memory in increasing address order.

In particular:

  • scan arrays in increasing order;
  • scan multidimensional arrays using the rightmost index for the innermost loops;
  • in class constructors and in assignment operators (operator=), access member variables in the order of declaration.

Data caches optimize memory accesses made in increasing sequential order. When a multidimensional array is scanned, the innermost loop should iterate over the last index, the innermost-but-one loop over the last-but-one index, and so on. This guarantees that array cells are processed in the same order in which they are arranged in memory, as the sketch below shows.
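
A minimal C++ sketch of the rule, with good and bad loop orders side by side; the array here is a flat row-major buffer, which is how C and C++ lay out built-in 2D arrays:

    #include <cstddef>

    // Good: the rightmost index j varies fastest, so consecutive
    // iterations touch consecutive addresses (stride-1, cache-friendly).
    void scale_good(double* a, std::size_t rows, std::size_t cols, double f) {
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                a[i * cols + j] *= f;
    }

    // Bad: the loops are swapped, so every iteration jumps a whole row
    // ahead and most of each loaded cache line is wasted.
    void scale_bad(double* a, std::size_t rows, std::size_t cols, double f) {
        for (std::size_t j = 0; j < cols; ++j)
            for (std::size_t i = 0; i < rows; ++i)
                a[i * cols + j] *= f;
    }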

Memory alignment

Keep the compiler's default memory alignment.
By default, compilers use an alignment criterion for fundamental types, under which objects may have only memory addresses that are a multiple of particular factors. This criterion guarantees top performance, but it may add padding (holes) between successive objects.
If it is necessary to avoid such padding for some structures, use the pragma directive only around those structure definitions.
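
A minimal sketch of the trade-off; #pragma pack is supported by MSVC, GCC, and Clang, and the exact sizes are platform-dependent (the values in the comments assume a typical platform with 4-byte alignment for uint32_t):

    #include <cstdint>
    #include <iostream>

    struct Padded {            // default alignment
        char     tag;          // 1 byte, followed by 3 bytes of padding
        uint32_t value;        // must start at a 4-byte boundary
    };                         // sizeof(Padded) == 8 on typical platforms

    #pragma pack(push, 1)      // pack only this structure definition
    struct Packed {
        char     tag;          // 1 byte, no padding follows
        uint32_t value;        // may be misaligned: smaller, but slower to access
    };
    #pragma pack(pop)          // restore the default alignment
                               // sizeof(Packed) == 5

    int main() {
        std::cout << sizeof(Padded) << ' ' << sizeof(Packed) << '\n';  // e.g. "8 5"
    }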

Grouping functions in compilation units

Define in the same compilation unit all the member functions of a class, all the friend functions of that class, and all the member functions of friend classes of that class, except when the resulting file becomes unwieldy because of its size.
In this way, both the machine code resulting from the compilation of these functions and the static data defined in these classes and functions will have addresses near each other; in addition, even compilers that do not perform whole-program optimization may optimize the calls among these functions.
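
As a sketch of this layout (the Widget class is hypothetical), with the intended file split marked in comments:

    // widget.h (sketch): declarations only.
    class Widget {
    public:
        void draw() const;
        void resize(int w, int h);
        friend bool operator==(const Widget& a, const Widget& b);
    private:
        int w_ = 0, h_ = 0;
    };

    // widget.cpp (sketch): every member function and friend function is
    // defined in this one compilation unit, so their machine code lands
    // close together and calls among them can be optimized even without
    // whole-program optimization.
    void Widget::draw() const { /* render using w_ and h_ */ }
    void Widget::resize(int w, int h) { w_ = w; h_ = h; }
    bool operator==(const Widget& a, const Widget& b) {
        return a.w_ == b.w_ && a.h_ == b.h_;
    }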

Grouping variables in compilation units

Define every global variable in the compilation unit in which it is used most often.
In this way, such variables will have addresses near to each other and to the static variables defined in that compilation unit; in addition, even compilers that do not perform whole-program optimization may optimize the access to these variables from the functions that use them most often.
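
A minimal sketch, assuming a hypothetical counter that is read and written mainly by one compilation unit:

    // stats.cpp (sketch): the global is defined in the unit that uses it
    // most often, so it sits near this unit's static data and functions.
    long g_processed = 0;

    void process_item() {
        ++g_processed;    // the frequent accesses stay local to this unit
    }

    // Other compilation units that occasionally need it would declare:
    //     extern long g_processed;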

Private functions and variables in compilation units

Declare in an anonymous namespace the variables and functions that are global to a compilation unit but not used by other compilation units.
In C, and also in C++, such variables and functions may be declared static. However, in modern C++ the use of static global variables and functions is not recommended; they should be replaced by variables and functions declared in an anonymous namespace.
In both cases, the compiler is notified that such identifiers will never be used by other compilation units. This allows compilers that do not perform whole-program optimization to optimize the usage of these variables and functions.
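
A minimal sketch (the names are hypothetical):

    // helpers.cpp (sketch): everything in the anonymous namespace has
    // internal linkage, so no other compilation unit can reference it
    // and the compiler may inline or eliminate it freely.
    namespace {
        int call_count = 0;    // private to this compilation unit

        int next_id() {        // likewise invisible outside this unit
            return ++call_count;
        }
    }

    int make_id() {            // the only externally visible symbol
        return next_id();
    }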




Source
Introduction to High Performance Computing for Scientists and Engineers,
Georg Hager, Gerhard Wellein