An Enhancement of Big Data Classification with Minimum Consistent Subset and Virtual Machine Mapping

Gayathri Devi S

Abstract


Big data classification plays a crucial role in science and engineering fields, including medicine, education, and business. One of the difficult issues in big data classification is the design and development of highly parallelized learning algorithms. Many algorithms have been parallelized, and the decision tree model is one of them. Information entropy and ambiguity are used as the uncertainty measures for splitting decision tree (DT) nodes. To overcome the over-partitioning problem in DT induction, extreme learning machines (ELMs) are embedded as leaf nodes when the gain ratios of all available splits are less than a given threshold. To improve the parallel computation of the ELM-Tree model, optimization algorithms have been used to determine optimal cut points for attributes to divide the dataset. However, finding optimal cut points for all attributes is an unnecessary overhead. In this paper, a minimum consistent subset (MCS) is introduced to select an optimal subset, and optimal cut points for that subset, to avoid unnecessary computation and improper resource utilization. MCS makes use of a hyper surface model, which can be used directly to classify large databases. The hyper surface representation, which is rectangular in shape and defined by upper and lower bounds, holds the samples and feature subsets. To specify the boundary, the optimal region is determined by optimization algorithms such as the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and the Firefly Algorithm (FA). MCS is implemented using the MapReduce framework, one of the current and powerful parallel programming models. The proposed model also determines the exact number of mappers and reducers by using a Neural Network (NN) based Virtual Machine Mapping approach for ELM-Tree classification in the MapReduce framework.
The performance of the proposed classifier is evaluated on three datasets. The results show that the MCS-based ELM-Tree with Virtual Machine Mapping performs far better in accuracy, precision, and computation time than the ELM-Tree model without MCS and Virtual Machine Mapping.
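
The leaf-node rule described in the abstract (embed an ELM when the gain ratios of all available splits fall below a threshold) can be illustrated with a minimal sketch. It assumes the standard C4.5-style gain ratio for a binary split on one numeric attribute; the function names, the candidate cut points, and the threshold value are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, cut):
    """Gain ratio of the binary split `value <= cut` on one numeric attribute."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    p_left, p_right = len(left) / n, len(right) / n
    gain = entropy(labels) - p_left * entropy(left) - p_right * entropy(right)
    split_info = -(p_left * math.log2(p_left) + p_right * math.log2(p_right))
    return gain / split_info


def should_become_leaf(values, labels, cuts, threshold):
    """True when every candidate cut has a gain ratio below the threshold.

    Assumption: in the ELM-Tree setting, such a node would stop splitting
    and an ELM would be trained on its samples instead.
    """
    return all(gain_ratio(values, labels, c) < threshold for c in cuts)


# Toy data (hypothetical): one attribute, two classes.
values = [1, 2, 3, 4]
labels = ["a", "a", "b", "b"]
print(gain_ratio(values, labels, 2))
print(should_become_leaf(values, labels, [1, 3], 0.5))
```

In this toy case the cut at 2 separates the classes perfectly, so a node offered that cut would keep splitting, whereas a node restricted to the weaker cuts at 1 and 3 would become a leaf under a threshold of 0.5.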

Keywords


ELM-Tree model, Minimum Consistent Subset, Hyper Surface, Virtual Machine Mapping, Dynamic Data Partition, MapReduce, Genetic Algorithm, Particle Swarm Optimization, Firefly Algorithm.
