Journal of Geo-information Science, 2019, Vol. 21, Issue (1): 128-136. doi: 10.12082/dqxxkx.2019.180221

• Methods and Applications of Spatiotemporal Pattern Mining for Geographic Big Data •

A Parallel Spatial Outlier Mining Method Based on C-SOM and Spark and Its Application

Miaoxin PAN1,2,3, Jiaxiang LIN4, Chongcheng CHEN1,*, Xiaoyan YE1

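    1. Key Lab of Spatial Data Mining and Information Sharing of Ministry of Education, Spatial Information Research Center of Fujian, Fuzhou University, Fuzhou 350108
    2. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117
    3. Fujian Provincial Engineering Technology Research Center for Public Service Big Data Mining and Application, Fuzhou 350117
    4. College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002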
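  • Received: 2018-05-03 Revised: 2018-07-03 Online: 2019-01-20 Published: 2019-01-20
  • Corresponding author: Chongcheng CHEN E-mail: pan_miaoxin@qq.com; chencc@fzu.edu.cn
  • About the author:

    Miaoxin PAN (born 1987), female, PhD candidate, works mainly on big data mining and cloud computing. E-mail: pan_miaoxin@qq.com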

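  • Funding:
    Key Science and Technology Plan Project of Fujian Province (2015H0015); Fujian Provincial Education Department Foundation (JAT160125); Social Science Youth Project of Fujian Province (FJ2017C084)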

Parallel Spatial Outliers Mining based on C-SOM and Spark

Miaoxin PAN1,2,3, Jiaxiang LIN4, Chongcheng CHEN1,*, Xiaoyan YE1

    1. Key Lab of Spatial Data Mining and Information Sharing of Ministry of Education, Spatial Information Research Center of Fujian, Fuzhou University, Fuzhou 350108, China
    2. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, China
    3. Fujian Provincial Engineering Technology Research Center for Public Service Big Data Mining and Application, Fuzhou 350117, China
    4. College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
  • Received: 2018-05-03 Revised: 2018-07-03 Online: 2019-01-20 Published: 2019-01-20
  • Contact: Chongcheng CHEN E-mail: pan_miaoxin@qq.com; chencc@fzu.edu.cn
  • Supported by:
    Key Science and Technology Plan Projects of Fujian Province, No. 2015H0015; Fujian Provincial Education Department Foundation, No. JAT160125; Social Science Youth Projects of Fujian Province, No. FJ2017C084

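Summary:

Spatial outlier mining identifies the spatial objects in a spatial dataset whose non-spatial attribute values differ markedly from those of the other spatial objects in their neighborhood. With the rapid growth of spatial data volumes, the traditional centralized processing mode faces problems such as single-machine performance bottlenecks and poor scalability, and is gradually failing to meet application needs. Building on the Spark parallel computing framework and taking full advantage of its fast in-memory computing and scalability, this paper therefore proposes a parallel spatial outlier mining algorithm and prototype system based on Spark and on the constrained spatial outlier mining algorithm (C-SOM). With C-SOM at its core, the parallel algorithm runs C-SOM in parallel on multiple compute nodes over the global dataset and each local dataset to obtain global and local outliers. The lightweight prototype system implements the parallel algorithm on Spark and adopts a Browser/Server architecture, giving users a visual operating interface that is simple and practical. Finally, outlier analysis of soil chemical element survey data from the southeastern coast of Fujian Province and of synthetic data verifies the rationality, effectiveness, and efficiency of the parallel algorithm and prototype system.

Keywords: C-SOM, Spark, parallel computing, spatial outlier, data mining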

Abstract:

Spatial outlier mining finds the spatial objects whose non-spatial attribute values differ significantly from those of the other objects in their neighborhood. As spatial data volumes grow rapidly, the traditional centralized processing mode runs into single-machine performance bottlenecks and poor scalability, and it increasingly fails to meet application needs. In this paper, we propose a parallel spatial outlier mining algorithm and a prototype system based on Constrained Spatial Outlier Mining (C-SOM) and on the parallel computing framework Spark, making full use of Spark's fast in-memory computing and scalability. The parallel algorithm uses C-SOM as its core: it runs C-SOM concurrently on a multi-node Spark cluster over one global dataset and many local datasets to obtain both global and local outliers. The data are divided into regional datasets according to administrative divisions. Each regional dataset is treated as a local dataset, and the global dataset is the union of all local datasets selected for mining. The lightweight prototype system implements the parallel algorithm on Spark and adopts a Browser/Server architecture, giving users a concise and practical visual interface. Users select the regional datasets and set the C-SOM parameters in the interface. The prototype system then executes the parallel algorithm on a Spark cluster and lists the global and local outliers with the largest outlier factor values for further analysis. Finally, we carry out experiments on soil geochemical survey data from the eastern coastal zone of Fujian Province, China, and on a series of artificial datasets. The results on the soil geochemical datasets validate the rationality and effectiveness of the parallel algorithm and its prototype system. The results on the artificial datasets show that, compared with a single-machine implementation, the parallel system can analyze many more datasets and is much more efficient when the number of datasets is large enough. This study confirms the local instability of spatial outliers and demonstrates the rationality and effectiveness of the parallel algorithm and its prototype system for detecting global and local spatial outliers simultaneously.

Key words: C-SOM, Spark, parallel computing, spatial outlier, data mining
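
The global-plus-local mining pattern described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how C-SOM computes its outlier factor, so the score below is a simple stand-in (the absolute difference between an object's attribute value and the mean value of its k nearest spatial neighbours), and a Python thread pool stands in for the Spark cluster. The names `outlier_factors`, `top_outliers`, `mine_global_and_local`, the reserved task name `"global"`, and the sample data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist


def outlier_factors(points, k=3):
    """Stand-in score (NOT C-SOM): |value - mean value of the k nearest neighbours|.

    Each point is a tuple ((x, y), value, point_id).
    """
    scores = []
    for i, (xy, value, pid) in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]
        neighbours = sorted(others, key=lambda p: dist(xy, p[0]))[:k]
        if not neighbours:
            scores.append((pid, 0.0))
            continue
        mean = sum(p[1] for p in neighbours) / len(neighbours)
        scores.append((pid, abs(value - mean)))
    return scores


def top_outliers(points, k=3, top_n=1):
    """Return the top_n (point_id, score) pairs with the largest scores."""
    ranked = sorted(outlier_factors(points, k), key=lambda s: s[1], reverse=True)
    return ranked[:top_n]


def mine_global_and_local(regions, k=3, top_n=1):
    """Score the union of all regions and each region on its own, concurrently.

    `regions` maps a region name to its list of points. The key "global" is
    reserved here for the union of all selected regions (an assumption of
    this sketch).
    """
    tasks = {"global": [p for pts in regions.values() for p in pts]}
    tasks.update(regions)
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(top_outliers, pts, k, top_n)
                   for name, pts in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical data: two regions, each with one object that deviates locally.
regions = {
    "A": [((0, 0), 1.0, "a1"), ((0, 1), 1.1, "a2"),
          ((1, 0), 0.9, "a3"), ((1, 1), 5.0, "a4")],
    "B": [((10, 10), 20.0, "b1"), ((10, 11), 20.5, "b2"),
          ((11, 10), 19.5, "b3"), ((11, 11), 22.0, "b4")],
}
result = mine_global_and_local(regions, k=3, top_n=1)
print(result)
```

With these sample values, the top-ranked global object is `a4`, while region B on its own ranks `b4` first, an object that a top-1 global list would not report. On a real cluster, each entry of `tasks` would presumably become one unit of work distributed by Spark rather than a thread.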