Traditional data analytics tools are designed to deal with the asymmetrical type of data i.e., structured, semi-structured, and unstructured. The diverse behavior of data produced by different sources requires the selection of suitable tools. The restriction of recourses to deal with a huge volume of data is a challenge for these tools, which affects the performances of the tool's execution time. Therefore, in the present paper, we proposed a time optimization model, shares common HDFS (Hadoop Distributed File System) between three Name-node (Master Node), three Data-node, and one Client-node. These nodes work under the DeMilitarized zone (DMZ) to maintain symmetry. Machine learning jobs are explored from an independent platform to realize this model. In the first node (Name-node 1), Mahout is installed with all machine learning libraries through the maven repositories. The second node (Name-node 2), R connected to Hadoop, is running through the shiny-server. Splunk is configured in the third node (Name-node 3) and is used to analyze the logs. Experiments are performed between the proposed and legacy model to evaluate the response time, execution time, and throughput. K-means clustering, Navies Bayes, and recommender algorithms are run on three different data sets, i.e., movie rating, newsgroup, and Spam SMS data set, representing structured, semi-structured, and unstructured data, respectively. The selection of tools defines data independence, e.g., Newsgroup data set to run on Mahout as others cannot be compatible with this data. It is evident from the outcome of the data that the performance of the proposed model establishes the hypothesis that our model overcomes the limitation of the resources of the legacy model. In addition, the proposed model can process any kind of algorithm on different sets of data, which resides in its native formats.

Performance evaluation of an independent time optimized infrastructure for big data analytics that maintains symmetry / Vats, S.; Sagar, B. B.; Singh, K.; Ahmadian, A.; Pansera, B. A.. - In: SYMMETRY. - ISSN 2073-8994. - 12:8(2020), p. 1274. [10.3390/SYM12081274]

Performance evaluation of an independent time optimized infrastructure for big data analytics that maintains symmetry

Pansera B. A.
2020-01-01

Abstract

Traditional data analytics tools are designed to deal with the asymmetrical type of data i.e., structured, semi-structured, and unstructured. The diverse behavior of data produced by different sources requires the selection of suitable tools. The restriction of recourses to deal with a huge volume of data is a challenge for these tools, which affects the performances of the tool's execution time. Therefore, in the present paper, we proposed a time optimization model, shares common HDFS (Hadoop Distributed File System) between three Name-node (Master Node), three Data-node, and one Client-node. These nodes work under the DeMilitarized zone (DMZ) to maintain symmetry. Machine learning jobs are explored from an independent platform to realize this model. In the first node (Name-node 1), Mahout is installed with all machine learning libraries through the maven repositories. The second node (Name-node 2), R connected to Hadoop, is running through the shiny-server. Splunk is configured in the third node (Name-node 3) and is used to analyze the logs. Experiments are performed between the proposed and legacy model to evaluate the response time, execution time, and throughput. K-means clustering, Navies Bayes, and recommender algorithms are run on three different data sets, i.e., movie rating, newsgroup, and Spam SMS data set, representing structured, semi-structured, and unstructured data, respectively. The selection of tools defines data independence, e.g., Newsgroup data set to run on Mahout as others cannot be compatible with this data. It is evident from the outcome of the data that the performance of the proposed model establishes the hypothesis that our model overcomes the limitation of the resources of the legacy model. In addition, the proposed model can process any kind of algorithm on different sets of data, which resides in its native formats.
2020
Big data
Hadoop
Machine Learning
MapReduce
Response time
Symmetrical streaming
File in questo prodotto:
File Dimensione Formato  
Articolo_n_37_symmetry-868807-7.pdf

accesso aperto

Tipologia: Versione Editoriale (PDF)
Licenza: Creative commons
Dimensione 1.38 MB
Formato Adobe PDF
1.38 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12318/81334
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 43
  • ???jsp.display-item.citation.isi??? ND
social impact