Publish Date : 4/10/2017   Journal Name : Springer/The Journal of Supercomputing   Pages : 17
Atrak: A MapReduce based warehouse for big data

Abstract

As data warehouse volumes grow, single-node solutions can no longer analyse the data.
Shared-nothing architectures such as MapReduce therefore become necessary. However,
inter-node data segmentation in MapReduce creates node connectivity issues, network
congestion, improper use of node memory capacity, and inefficient use of processing
power. In addition, dimensions and measures cannot be changed without modifying
previously stored data, and big dimensions are difficult to manage.
In this paper, a method called Atrak is proposed. It uses a unified data format to make
Mapper nodes independent of one another, which addresses the data management problems
described above. The proposed method applies to star schema data warehouse models with
distributive measures. Atrak increases query execution speed through node independence
and the proper use of MapReduce. The proposed method was compared with established
methods such as Hive, Spark-SQL, HadoopDB and Flink. Simulation results confirm that the
proposed method improves query execution speed. Data unification in MapReduce can also
be applied in other fields, such as data mining and graph processing.
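
The general idea behind node independence with distributive measures can be illustrated with a minimal Python sketch. It is not the Atrak implementation: the record layout, the sample data, and the function names (map_partition, reduce_partials, run_query) are assumptions made for illustration only. Each partition holds denormalized rows that already carry their dimension attributes, so a mapper can aggregate its own partition without reading data from any other node, and the partial sums can be merged afterwards because SUM is a distributive measure.

```python
from collections import Counter

# Hypothetical "unified" records: every fact row already carries its
# dimension attributes, so no cross-partition join is needed.
PARTITIONS = [
    [{"city": "Tehran", "year": 2016, "sales": 10},
     {"city": "Shiraz", "year": 2016, "sales": 5}],
    [{"city": "Tehran", "year": 2016, "sales": 7},
     {"city": "Tehran", "year": 2017, "sales": 3}],
]


def map_partition(rows, group_by, measure):
    """Aggregate one partition locally, using only that partition's rows."""
    partial = Counter()
    for row in rows:
        key = tuple(row[dim] for dim in group_by)
        partial[key] += row[measure]
    return partial


def reduce_partials(partials):
    """Merge per-partition sums; this is valid because SUM is distributive."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return dict(total)


def run_query(partitions, group_by, measure):
    """Run the map step on each partition independently, then merge."""
    return reduce_partials(
        map_partition(rows, group_by, measure) for rows in partitions
    )


result = run_query(PARTITIONS, ["city"], "sales")
print(result)
```

Holistic measures such as the median cannot be merged from partial results this way, which is consistent with the abstract restricting the method to distributive measures.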


Authors : Mohammad Hossein Barkhordari, Mahdi Niamanesh