Journal of Geo-information Science ›› 2019, Vol. 21 ›› Issue (1): 128-136. doi: 10.12082/dqxxkx.2019.180221

Parallel Spatial Outliers Mining based on C-SOM and Spark

Miaoxin PAN1,2,3, Jiaxiang LIN4, Chongcheng CHEN1,*, Xiaoyan YE1

    1. Key Lab of Spatial Data Mining and Information Sharing of Ministry of Education, Spatial Information Research Center of Fujian, Fuzhou University, Fuzhou 350108, China
    2. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, China
    3. Fujian Provincial Engineering Technology Research Center for Public Service Big Data Mining and Application, Fuzhou 350117, China
    4. College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  • Received: 2018-05-03 Revised: 2018-07-03 Online: 2019-01-20 Published: 2019-01-20
  • Contact: Chongcheng CHEN E-mail: pan_miaoxin@qq.com; chencc@fzu.edu.cn
  • Supported by:
    Key Science and Technology Plan Projects of Fujian Province, No. 2015H0015; Fujian Provincial Education Department Foundation, No. JAT160125; Social Science Youth Projects of Fujian Province, No. FJ2017C084

Abstract:

Spatial outlier mining finds spatial objects whose non-spatial attribute values differ significantly from those of their neighborhood. As spatial data volumes explode, the traditional centralized processing mode, limited by single-machine performance bottlenecks and poor scalability, increasingly fails to meet application needs. In this paper, we propose a parallel spatial outlier mining algorithm and a prototype system, both based on Constrained Spatial Outlier Mining (C-SOM), that exploit the fast in-memory computing and scalability of the parallel computing framework Spark. The parallel algorithm uses C-SOM as its core and runs it concurrently, on a Spark cluster composed of multiple nodes, over one global dataset and many local datasets to obtain both global and local outliers. Datasets are divided into regional datasets according to administrative divisions: each regional dataset is treated as a local dataset, and the global dataset contains all the local datasets selected for mining. The lightweight prototype system implements the parallel algorithm on Spark and adopts a Browser/Server architecture to give users a concise, practical visual interface. Users select the regional datasets and set the C-SOM parameters in the interface; the system then executes the parallel algorithm on a Spark cluster and lists the global and local outliers with the largest outlier factor values for further analysis. Finally, we carry out experiments on soil geochemical investigation data from the eastern coastal zone of Fujian, China, and on a series of artificial datasets. The soil geochemical experiments validate the rationality and effectiveness of the parallel algorithm and its prototype system. The artificial dataset experiments show that, compared with a single-machine implementation, the parallel system can analyze many more datasets and is much more efficient when the number of datasets is sufficiently large. This study confirms the local instability of spatial outliers and demonstrates the rationality and effectiveness of the parallel algorithm and its prototype system in detecting global and local spatial outliers simultaneously.
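
The data flow described in the abstract (one global dataset plus several regional datasets, each scored concurrently and reduced to its top-ranked outliers) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the C-SOM outlier factor, so a simple neighborhood-difference score stands in for it, and a Python thread pool stands in for the Spark cluster. The region names, coordinates, attribute values, and function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

# Each record is (x, y, attribute_value). All data here is made up.

def outlier_scores(points, k=3):
    """Placeholder score (NOT the C-SOM outlier factor):
    |value - mean value of the k spatially nearest neighbours|."""
    scores = []
    for i, (x, y, v) in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]
        nearest = sorted(others, key=lambda p: dist((x, y), (p[0], p[1])))[:k]
        mean_v = sum(p[2] for p in nearest) / len(nearest)
        scores.append(((x, y, v), abs(v - mean_v)))
    return scores

def top_outliers(points, k=3, top_n=1):
    """Return the top_n (record, score) pairs with the largest scores."""
    ranked = sorted(outlier_scores(points, k), key=lambda s: s[1], reverse=True)
    return ranked[:top_n]

def mine_global_and_local(regions, k=3, top_n=1):
    """Score every regional (local) dataset and their combined (global)
    dataset concurrently. Each dataset needs at least two records."""
    tasks = dict(regions)
    tasks["__global__"] = [p for pts in regions.values() for p in pts]
    with ThreadPoolExecutor() as pool:  # stand-in for a Spark cluster
        futures = {name: pool.submit(top_outliers, pts, k, top_n)
                   for name, pts in tasks.items()}
        return {name: f.result() for name, f in futures.items()}

# Hypothetical regions: one record in region "A" deviates from its neighbours.
regions = {
    "A": [(0, 0, 10.0), (1, 0, 10.0), (0, 1, 10.0), (1, 1, 30.0)],
    "B": [(10, 10, 100.0), (11, 10, 100.0), (10, 11, 100.0), (11, 11, 100.0)],
}
result = mine_global_and_local(regions)
for name, outliers in result.items():
    print(name, outliers)
```

Treating the global dataset as just one more task alongside the regional ones keeps the scheduling uniform. On a real cluster the same shape would map onto distributing the dataset list and applying the scoring function to each element.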

Key words: C-SOM, Spark, parallel computing, spatial outlier, data mining