Big data components
In flow
This is the raw data entering the system. It can be files, events of any kind, or web pages; the exact source does not matter.
Distributor
When we receive the incoming flow we need to distribute it. Distribution can be based on replicating the data to several destinations, or on routing it according to details of the data itself.
Example: if a log record contains the word "event", send it to HDFS only.
Examples: Apache Flume, Logstash, Fluentd
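The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not an actual Flume/Logstash configuration; the destination names are hypothetical stand-ins for the real sinks.

```python
def route(record):
    """Return the destination for a single log record (content-based routing)."""
    if "event" in record:
        return "hdfs"          # the rule above: records containing "event" go to HDFS only
    return "short_term"        # everything else goes to fast short-term storage

# usage: route a small batch of records
records = ["user event: login", "heartbeat ok"]
destinations = [route(r) for r in records]   # ["hdfs", "short_term"]
```

A real distributor expresses the same idea declaratively (e.g. a conditional output block in a Logstash pipeline), but the decision logic is exactly this kind of predicate on the record's content.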
Storages - Long term, short term
Then we save the data to storage. There are several types of storage, each with its own pros and cons.
Long term
We need it to hold the whole data set and analyze it by batch processing. In most cases it will be Hadoop-based HDFS storage, and we run MapReduce / Hive / Pig jobs to create reports.
As you can understand, this is a heavy and slow process.
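The batch-processing model can be sketched in plain Python: a map phase that emits key/value pairs and a reduce phase that aggregates them, producing the kind of report a Hive or Pig job would compute over HDFS. This is a pure-Python simulation of the model, not an actual Hadoop job.

```python
from collections import defaultdict

def mapper(record):
    # emit (key, 1) pairs; here the key is the first word (the log level)
    level = record.split()[0]
    yield (level, 1)

def reducer(pairs):
    # sum the counts per key, as the reduce phase would
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

logs = ["ERROR disk full", "INFO started", "ERROR timeout"]
report = reducer(pair for rec in logs for pair in mapper(rec))
# report == {"ERROR": 2, "INFO": 1}
```

On a real cluster the mapper and reducer run distributed over HDFS blocks, which is exactly why the process is heavy: every batch job re-scans the stored data.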
Short term
If we need the data to be easily and quickly accessible, we will use a highly scalable database. There are several types here:
Key-value databases - http://en.wikipedia.org/wiki/Key-value_database
Examples: Redis, Riak, Dynamo, GemFire
Columnar databases - http://en.wikipedia.org/wiki/Column-oriented_DBMS
Examples: Vertica, MonetDB, Cassandra
Document databases - http://en.wikipedia.org/wiki/Document-oriented_database
Examples: MongoDB, CouchDB
Graph databases - http://en.wikipedia.org/wiki/Graph_database
Examples: Neo4j
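The difference between two of these models can be shown with plain Python stand-ins: a key-value store maps an opaque key to an opaque value, while a document store keeps structure it can query into. The dict and list below are illustrative substitutes for Redis- and MongoDB-style stores, not real client code.

```python
import json

kv_store = {}                      # key -> opaque blob, Redis-style
kv_store["user:42"] = json.dumps({"name": "Ann", "city": "Oslo"})

doc_store = []                     # structured documents, MongoDB-style
doc_store.append({"_id": 42, "name": "Ann", "city": "Oslo"})

# key-value: fetch by key only; the value is just text to the store
blob = kv_store["user:42"]

# document: the store can filter on fields inside the document
ann = [d for d in doc_store if d["city"] == "Oslo"]
```

This is why the table below rates key-value stores "none" on complexity and functionality: all the structure lives in the application, not in the store.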
| Data Model | Performance | Scalability | Flexibility | Complexity | Functionality |
|---|---|---|---|---|---|
| Key-value store | high | high | high | none | variable (none) |
| Column store | high | high | moderate | low | minimal |
| Document store | high | variable (high) | high | low | variable (low) |
| Graph database | variable | variable | high | high | graph theory |
| Relational database | variable | variable | low | moderate | relational algebra |
Here the data is accessed much faster and is much more structured.
Real time processing
In most cases this component will be Storm (http://storm-project.net/). It pulls the data (in our case from Kafka, http://kafka.apache.org/) and processes it against the short-term, fast-access data.
Its decisions will probably be sent to some external systems to notify the end user.
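The real-time path can be sketched as a consume-process loop: pull a message from the stream and enrich it with a lookup in the short-term store. In the sketch below a plain list stands in for the Kafka topic and a dict stands in for the fast storage; a real deployment would run this logic inside a Storm topology fed by a Kafka consumer.

```python
short_term = {"user:1": "premium", "user:2": "free"}   # fast-access data

def process(message):
    # each message is "user_id,action"; enrich it from short-term storage
    user_id, action = message.split(",")
    tier = short_term.get(user_id, "unknown")          # lookup in the fast store
    return {"user": user_id, "action": action, "tier": tier}

stream = ["user:1,click", "user:3,view"]               # stand-in for a Kafka topic
decisions = [process(m) for m in stream]
```

The resulting decisions are what would be pushed to external systems to notify the end user.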
End User
The end user will use some stack for visualizing the data. It can also include a service for querying the data, in most cases against the short-term storage.
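A query service for the visualization layer can be as small as one function over the short-term data. The metric names and the in-memory dict below are hypothetical; a real stack would expose this over HTTP in front of the fast storage.

```python
metrics = {"clicks": [10, 12, 9], "errors": [1, 0, 2]}   # stand-in for short-term data

def query(metric_name, agg="sum"):
    """Aggregate one metric from short-term storage for the visualization layer."""
    values = metrics.get(metric_name, [])
    if agg == "sum":
        return sum(values)
    if agg == "avg":
        return sum(values) / len(values) if values else 0
    raise ValueError("unsupported aggregation: " + agg)

total_clicks = query("clicks")            # 31
avg_errors = query("errors", agg="avg")   # 1.0
```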
The next part is coming...