Abstract
Outliers, regarded as noisy data in statistics, have turned out to be an important problem that is being researched in diverse fields and application domains. Many outlier detection techniques have been developed for specific application domains, while others are more generic. Outlier detection aims to find patterns in data that do not conform to expected behaviour. It is used extensively in a wide variety of applications, such as military surveillance for enemy activities, intrusion detection in cyber security, fraud detection for credit cards, insurance or health care, and fault detection in safety-critical systems. In our work, we identify the need for an outlier detection solution for large volumes of sensed data in order to optimize data mining. Sensed data is the output of sensor nodes and consists of the real values they record. Existing solutions provide outlier detection only for static datasets, using clustering algorithms on datasets of normal size. We have developed an outlier detection system that detects outliers in the Intel sensed dataset using the clustering algorithms DBSCAN and K-Means. The experimental study was performed using a Java application and a Hadoop system.
Key-Words / Index Terms
Outlier detection, Clustering, Hadoop
References
[1] M. Ester, H.P. Kriegel, J. Sander, X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", KDD-96 Proceedings, German, pp. 226-231, 1996.
[2] K. Narita, H. Kitagawa, “Outlier Detection for Transaction Databases Using Association Rules”, In Proceedings of the Ninth International Conference on Web-Age Information Management, Washington, pp. 373-380, 2008.
[3] J. Wang, X. Su, "An improved K-Means clustering algorithm," 2011 IEEE 3rd International Conference on Communication Software and Networks, China, pp. 44-46, 2011.
[4] M. Bhandarkar, "MapReduce programming with apache Hadoop", IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta-GA, pp.1-1, 2010.
[5] M. Ding, L. Zheng, Y. Lu, L. Li, S. Guo, and M. Guo, “More convenient more overhead: the performance evaluation of Hadoop streaming”, In Proceedings of the ACM Symposium on Research in Applied Computation (RACS), USA, pp. 307-313, 2011.
[6] W. Zhao, H. Ma and Q. He, “Parallel K-Means Clustering Based on MapReduce”, Cloud Computing: First International Conference, CloudCom 2009, Beijing, China, Springer Berlin Heidelberg, pp. 674-679, 2009.
[7] Feng Wang, Jie Qiu, Jie Yang, Bo Dong, Xinhui Li, Ying Li, “Hadoop high availability through metadata replication”, In Proceedings of the first international workshop on Cloud data management (CloudDB '09), ACM, USA, pp. 37-44, 2009.
[8] R. Leonardo, F, Cordeiro, "Clustering very large multi-dimensional datasets with MapReduce", ACM SIGKDD international conference on Knowledge discovery and data mining, USA, pp. 690-698, 2011.
Y. X. Fu, W. Z. Zhao, H. F. Ma, "Research on Parallel DBSCAN Algorithm Design Based on MapReduce", Advanced Materials Research, Vols.301, Issue.303, pp. 1133-1138, 2011.
[9] W. Zhao, H. Ma, Q. He, “Parallel K-Means Clustering Based on MapReduce”, In Proceedings of the 1st International Conference on Cloud Computing, Springer-Verlag, Berlin, pp. 674-679, 2009.
[10] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Sixth symposium on Operating Systems design and implementation (OSDI), San Francisco, CA, pp. 213-220, 2004.
[11] M.F. Hornick, E. Marcadé, S. Venkayala, "Java Data Mining: Strategy, Standard, and Practice: A Practical Guide for Architecture, Design, and Implementation", Morgan Kaufmann, Canada, pp. 1-544, 2010.