Parallel Spatial Outliers Mining based on C-SOM and Spark

  • PAN Miaoxin 1,2,3
  • LIN Jiaxiang 4
  • CHEN Chongcheng 1,*
  • YE Xiaoyan 1
  • 1. Key Lab of Spatial Data Mining and Information Sharing of Ministry of Education, Spatial Information Research Center of Fujian, Fuzhou University, Fuzhou 350108, China
  • 2. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117, China
  • 3. Fujian Provincial Engineering Technology Research Center for Public Service Big Data Mining and Application, Fuzhou 350117, China
  • 4. College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
*Corresponding author: CHEN Chongcheng, E-mail:

Received date: 2018-05-03

  Revised date: 2018-07-03

  Online published: 2019-01-20

Supported by: Key Science and Technology Plan Projects of Fujian Province, No.2015H0015; Fujian Provincial Education Department Foundation, No.JAT160125; Social Science Youth Projects of Fujian Province, No.FJ2017C084

Copyright

All rights reserved by the Editorial Office of the Journal of Geo-information Science (《地球信息科学学报》).

Abstract

Spatial outlier mining finds the spatial objects whose non-spatial attribute values differ significantly from those of their neighborhoods. Faced with the explosive growth of spatial data and with problems such as single-machine performance bottlenecks and poor scalability, the traditional centralized processing mode can no longer meet application needs. In this paper, we propose a parallel spatial outlier mining algorithm and its prototype system, which are based on Constrained Spatial Outlier Mining (C-SOM) and make full use of the fast in-memory computing and scalability of the parallel computing framework Spark. The parallel algorithm uses C-SOM as its core algorithm and executes C-SOM concurrently on a Spark cluster composed of multiple nodes, over a global dataset and many local datasets, to obtain the global and the local outliers. Datasets are divided into multiple regional datasets according to administrative divisions; each regional dataset is treated as a local dataset, and the global dataset contains all of the selected local datasets to be mined. The lightweight prototype system implements the parallel algorithm on Spark and adopts the Browser/Server architecture to provide users with a concise and practical visual interface. Users select the regional datasets and set the parameters of the C-SOM algorithm on the interface; the prototype system then executes the parallel algorithm on a Spark cluster and lists both the global and the local outliers with the largest outlier factor values, so that users can make further analysis. Finally, we carry out experiments with soil geochemical survey data from the eastern coastal zone of Fujian, China, and with a series of synthetic datasets. The results on the soil geochemical datasets validate the rationality and effectiveness of the parallel algorithm and its prototype system. The results on the synthetic datasets show that, compared to a single-machine implementation, the parallel system can support analysis of far more datasets and is considerably more efficient once the number of datasets is large enough. This study confirms the local instability of spatial outliers and demonstrates the rationality and effectiveness of the parallel algorithm and its prototype system in detecting global and local spatial outliers simultaneously.

Cite this article

PAN Miaoxin , LIN Jiaxiang , CHEN Chongcheng , YE Xiaoyan . Parallel Spatial Outliers Mining based on C-SOM and Spark[J]. Journal of Geo-information Science, 2019 , 21(1) : 128 -136 . DOI: 10.12082/dqxxkx.2019.180221

1 Introduction

A spatial outlier is a spatial object whose non-spatial attribute values differ significantly from those of the other objects in its spatial neighborhood, even though they may not differ significantly from the entire population [1,2]. For example, with respect to the non-spatial attribute house age, a new house surrounded by old houses in a developing metropolitan area is a spatial outlier. Spatial outlier mining can discover unexpected, implicit knowledge in a dataset and is widely used in many geographic information systems and spatial databases, with applications in the ecological environment, traffic management, public safety, public health, climate, location-based services, and other fields [3].
Existing spatial outlier mining algorithms can be roughly divided into two categories: graph-based methods and quantitative test methods [4]. Graph-based methods highlight spatial outliers on top of spatial data visualization; representative algorithms include variogram clouds [5], pocket plots [6], the Moran scatterplot [1,4,7], and the spatial outlier detection methods for graph-structured datasets proposed by Shekhar et al. [8,9]. Quantitative test methods identify spatial outliers through precise tests and can be further divided into methods for one-dimensional attributes and methods for high-dimensional attributes [10]. Typical one-dimensional methods include Scatterplot [11], the Z-value, the iterative R-value, the iterative Z-value, the iterative ratio, the median, and the weighted Z-value [12]. Methods for high-dimensional attributes inherit many ideas from the one-dimensional methods; the main algorithms include methods based on the dissimilarity of variables of different scales, multi-attribute methods based on Mahalanobis distance, multi-attribute methods based on correlation coefficients, and density-based methods. Because traditional density-based outlier mining algorithms are relatively mature and offer good efficiency and accuracy for local outlier detection, they are widely used in spatial outlier mining practice. However, none of these methods considers the influence of constraints that exist in the real world (such as rivers and bridges) on the mining results; if such constraints are ignored during spatial data mining, the results are often erroneous or unreasonable. Reference [10] proposed a constrained spatial outlier mining algorithm (C-SOM), which builds a constrained Delaunay triangulation to express spatial proximity, defines an outlier factor for spatial objects following the density-based idea, and uses a rank-value plot of the outlier factors to find a reasonable number of outliers, achieving good results.
Traditional centralized data mining faces problems such as memory limits, slow processing, and insufficient disk capacity. With the rapid growth of collected spatial data, parallel and distributed computing has become an indispensable key technology for big data processing and analysis [13,14], for example grid computing [15], cloud computing [16], Hadoop [17], MapReduce [18,19], and Spark [20,21]; by decomposing a task into multiple sub-problems that can be executed concurrently on multiple interconnected nodes, these technologies break through the limits of computing and storage capacity. Academia and industry have proposed many distributed and parallel data mining algorithms. For distributed and parallel outlier mining, reference [22] proposed an efficient distributed outlier detection algorithm, and references [23] and [24] implemented parallel outlier mining based on MapReduce. Distributed and parallel algorithms specifically for spatial outlier mining are still rare. Some typical related studies are: reference [25], which stores and detects spatio-temporal data based on Hadoop and designs a distributed algorithm for concurrent outlier mining, and reference [26], which implements a distributed spatial outlier mining algorithm based on Spark. However, these algorithms do not mine outliers within local datasets; since spatial outliers exhibit local instability, local outlier mining is more meaningful in many practical applications. References [3] and [27] proposed grid-based parallel and distributed spatial outlier mining algorithms that detect global and local outliers simultaneously. Among the many parallel computing frameworks, Spark has been widely adopted for its speed, high fault tolerance, high scalability, and ease of use. This paper proposes a Spark-based parallel spatial outlier mining method that applies the constrained spatial outlier mining algorithm C-SOM to the global dataset and to each local dataset in parallel.

2 Spark Parallel Computing Framework

Spark is an open-source, fast, general-purpose, and scalable in-memory data analysis engine developed by the AMP Lab at the University of California, Berkeley. It retains the advantages of Hadoop MapReduce while overcoming its shortcomings: MapReduce programming is cumbersome and slow, whereas Spark is easy to program, fast, and flexible, keeping intermediate results in memory instead of frequently reading and writing HDFS as MapReduce does. For iterative computation, Spark is one to two orders of magnitude faster than MapReduce [28], which makes it better suited to iteration-heavy tasks such as machine learning and graph processing. Major big data companies, such as Cloudera and MapR, have announced plans to replace MapReduce with Spark.
The RDD (Resilient Distributed Dataset) is the programming foundation of Spark; the entire Spark ecosystem is built on operations over RDDs. An RDD is a read-only, distributed collection of data that is divided into multiple partitions stored across multiple nodes of a cluster. An RDD can be obtained in only two ways: (1) created programmatically from an in-memory collection or an external storage system; (2) derived from another RDD through a transformation. Operations on RDDs fall into two categories: transformations and actions. A transformation converts one RDD into another; it is a lazy operation that must be triggered by an action. An action outputs an RDD's data: it triggers all of the preceding transformations, submits a job to the Spark cluster, and delivers the resulting data out of the Spark computation.
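As a minimal illustration of the lazy transformation/action behavior described above (this snippet is not from the original paper; the data values are arbitrary), the following Scala program creates an RDD from an in-memory collection, chains two transformations, and only computes when the action is called:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Way 1: create an RDD from an in-memory collection, split into 3 partitions.
    val values = sc.parallelize(Seq(1.2, 3.4, 0.7, 9.9, 2.5), numSlices = 3)

    // Transformations are lazy: nothing is computed at this point.
    val shifted = values.map(_ - 2.0)        // way 2: derive a new RDD from an existing one
    val positives = shifted.filter(_ > 0.0)

    // The action triggers the whole lineage and returns the results to the driver.
    val result = positives.collect()
    println(result.mkString(", "))

    sc.stop()
  }
}
```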

3 The C-SOM Algorithm

Constrained spatial outlier mining (C-SOM) embeds constraints into the spatial data mining process, so that the similarity between objects is restricted both by spatial proximity and by the constraints, which distinguishes it from traditional spatial data mining [10]. The main procedure of C-SOM is as follows. First, using the quad-edge structure (QuadEdge) as the core data structure, a constrained Delaunay triangulation is built from the spatial attribute values to express spatial proximity, and the spatial neighborhood of each object is determined by computing its k-order neighbors in the Delaunay triangulation. Then, attribute constraints are embedded into the computation of attribute distances between spatial objects; the similarity between spatial objects is measured with the Euclidean distance, an outlier factor OF (Outlier Factor) measuring how outlying a spatial object is is defined following the density-based idea, and the outlier factor of each spatial object is determined by comparing the object with its neighborhood. Finally, the spatial objects with the largest outlier factors are taken as candidate outliers, which are analyzed and confirmed to obtain the final spatial outlier mining result. The detailed design and implementation of C-SOM can be found in reference [10].
An ordinary Delaunay triangulation and a Delaunay graph constrained by a river and a bridge are shown in Fig. 1(a) and (b). In Fig. 1(b), edges O1O2 and O2O3 represent the river and are obstacle segments, edge PQ represents the bridge and is a facility segment, edges PF, PG, QC, QD, and QE are facility paths, and the remaining edges are ordinary Delaunay edges. The detailed construction of the constrained Delaunay graph is described in reference [29].
Fig. 1 Common Delaunay triangulation and constrained Delaunay graph

To keep the definition of spatial proximity on the constrained Delaunay graph consistent with that of the ordinary Delaunay triangulation, the Delaunay distance $d_T'(A,B)$ on the constrained Delaunay graph is defined as in Eq. (1).
$$d_T'(A,B)=\begin{cases}d_T(A,B)=1, & A,B\in V_S\\[2pt] \tfrac{1}{2}, & (A\in V_S \wedge B\in V_{\mathrm{Facility}})\ \text{or}\ (A\in V_{\mathrm{Facility}} \wedge B\in V_S)\\[2pt] 0, & A,B\in V_{\mathrm{Facility}}\end{cases}\tag{1}$$
where $V_S$ denotes the set of spatial object vertices and $V_{\mathrm{Facility}}$ denotes the set of spatial facility vertices. If A and B are both in $V_S$ and a Delaunay graph edge exists between them, their distance is 1, as in an ordinary Delaunay triangulation; if one of A and B is in $V_S$ and the other is in $V_{\mathrm{Facility}}$, their distance is 0.5; if A and B are both in $V_{\mathrm{Facility}}$, their distance is 0.
For given spatial objects $A'$ and $B'$, if the minimum total Delaunay distance (computed according to $d_T'(A,B)$) of the edges traversed when moving from $A'$ to $B'$ along Delaunay graph edges is k, the Delaunay distance between the two objects is said to be k, denoted $d_T'(A',B')=k$. The set of all objects whose Delaunay distance from a given object P is k is called the k-order neighborhood of P, denoted Neighbor_k(P). Taking vertex C in Fig. 1 as an example, its 1-order neighborhood is Neighbor_1(C) = {G, F, E, D, I, B}. The objects $A'$ and $B'$ discussed here are objects in $V_S$; k-order neighborhoods are usually not computed for objects in $V_{\mathrm{Facility}}$, and any Delaunay distance computation involving facility vertices must follow the definition of $d_T'(A,B)$.
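As an illustrative sketch (not code from the paper), the edge weight of Eq. (1) can be assigned as below; the function name `delaunayDistance` and the parameter `facility` (the set of facility vertices) are assumed names, and every vertex is assumed to belong to either $V_S$ or $V_{\mathrm{Facility}}$:

```scala
// Weight d'_T(A, B) of a single constrained Delaunay edge, following Eq. (1).
// `facility` holds the facility vertices (V_Facility); all other vertices are spatial objects (V_S).
def delaunayDistance[V](a: V, b: V, facility: Set[V]): Double = {
  val aFacility = facility.contains(a)
  val bFacility = facility.contains(b)
  if (!aFacility && !bFacility) 1.0      // A, B in V_S
  else if (aFacility && bFacility) 0.0   // A, B in V_Facility
  else 0.5                               // exactly one of A, B is a facility vertex
}
```

Summing these weights along the shortest path between two spatial objects then gives the Delaunay distance used to define the k-order neighborhood.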
Attribute constraints specify the thematic attributes involved in spatial outlier mining and their weights. Let $A=\{A_1,A_2,\ldots,A_m\}$ be the $m$ ($m\ge 1$) thematic attributes of the spatial objects, $\omega=\{\omega_1,\omega_2,\ldots,\omega_m\}$ the corresponding attribute weights, NN($P_i$) the neighborhood of spatial object $P_i$, and $\left|NN(P_i)\right|$ the number of its neighbors. The object dissimilarity and the outlier factor are then defined as in Eqs. (2)-(4). The larger an object's outlier factor, the more it differs from its neighbors and the more likely it is to be an outlier.
Definition 1. Dissimilarity diff($P_i$, $P_j$) between spatial objects $P_i$ and $P_j$: the difference between $P_i$ and $P_j$ in their thematic attribute values, equal to the weighted Euclidean distance between the thematic attribute values of $P_i$ and $P_j$.
$$\mathrm{diff}(P_i,P_j)=\sqrt{\sum_{k=1}^{m}\omega_k\left[A_k(P_i)-A_k(P_j)\right]^2}\tag{2}$$
Definition 2. Dissimilarity diff($P_i$) between spatial object $P_i$ and its neighborhood NN($P_i$): the difference between $P_i$ and its neighborhood NN($P_i$) in thematic attribute values, equal to the average dissimilarity between $P_i$ and all of its neighbor objects.
$$\mathrm{diff}(P_i)=\frac{\sum_{O\in NN(P_i)}\mathrm{diff}(P_i,O)}{\left|NN(P_i)\right|}\tag{3}$$
Definition 3. Outlier factor OF($P_i$) of spatial object $P_i$: the average ratio of the dissimilarity of $P_i$ to the dissimilarities of its neighbor objects:
$$\mathrm{OF}(P_i)=\frac{\sum_{O\in NN(P_i)}\dfrac{\mathrm{diff}(P_i)}{\mathrm{diff}(O)}}{\left|NN(P_i)\right|}\tag{4}$$
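For concreteness, the following Scala sketch of Eqs. (2)-(4) is given; it is a minimal, non-authoritative rendering that assumes the neighborhoods have already been derived from the constrained Delaunay graph, and all type and function names are illustrative rather than taken from the paper:

```scala
// A spatial object: an id plus its thematic attribute vector A_1..A_m.
case class SpatialObject(id: Int, attrs: Array[Double])

// Eq. (2): weighted Euclidean distance between thematic attribute vectors.
def diff(p: SpatialObject, q: SpatialObject, weights: Array[Double]): Double =
  math.sqrt(
    weights.indices.map { k =>
      weights(k) * math.pow(p.attrs(k) - q.attrs(k), 2)
    }.sum)

// Eq. (3): average dissimilarity between an object and its neighbors.
def neighborhoodDiff(p: SpatialObject, nn: Seq[SpatialObject], w: Array[Double]): Double =
  nn.map(o => diff(p, o, w)).sum / nn.size

// Eq. (4): outlier factor = mean ratio of the object's dissimilarity to its neighbors'.
def outlierFactor(p: SpatialObject, nn: Seq[SpatialObject],
                  neighborhoods: Map[Int, Seq[SpatialObject]], w: Array[Double]): Double = {
  val dp = neighborhoodDiff(p, nn, w)
  nn.map(o => dp / neighborhoodDiff(o, neighborhoods(o.id), w)).sum / nn.size
}
```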

4 Parallel Outlier Mining Algorithm Based on C-SOM and Spark

4.1 Algorithm Design and Implementation

Because spatial datasets are geographically distributed and spatial outliers exhibit local, region-specific characteristics, spatial outliers often need to be characterized from both the local and the global perspective, that is, by mining the spatial outliers in each local dataset and in the global dataset. Parallel spatial outlier mining therefore needs both to mine spatial outliers in each local dataset and to mine the local instability characteristics from the perspective of the global dataset.
The basic process of Spark-based parallel spatial outlier mining with C-SOM as the core algorithm is shown in Fig. 2. (1) The storage system holds regional datasets divided by administrative region; for example, the region-1 dataset is the spatial dataset of Fuzhou and the region-2 dataset is that of Quanzhou. The spatial datasets of the regions to be mined are read into memory; each regional dataset serves as a local dataset, and these regional datasets are concatenated in memory region by region, one region's data followed by the next, to produce a global dataset containing all local datasets. (2) The global dataset and the local datasets are converted into RDDs for Spark to process, with each regional dataset as one partition and the global dataset as a separate partition. (3) Spark assigns the outlier mining tasks of the partitions to the Worker cluster; in Fig. 2, one Task corresponds to the mining task of one partition, and the Worker Nodes (WNs) in the cluster concurrently invoke the C-SOM algorithm to execute their assigned Tasks. (4) Each WN returns the result (outliers) of each Task to the Spark cluster manager node (Master).
Fig. 2 The general process of parallel spatial outlier mining based on Spark

The parallel spatial outlier mining algorithm based on C-SOM and Spark proceeds as follows. The algorithm first reads the data into memory according to the list of dataset names (names) supplied by the user, obtaining the list of local datasets divided by administrative region (data), and then merges them into the global dataset (total). Next, the global dataset and the local datasets are converted into RDDs through Spark, with the global dataset as one partition and each local dataset as its own partition. The C-SOM algorithm is then executed on the data of each partition: the constrained Delaunay graph of the partition's data is built according to the constraints (constraintMap); the neighborhood of each spatial object is obtained from the value of k and the constrained Delaunay graph; the outlier factor of each spatial object is computed on this basis; and, after sorting in descending order, the outlierNum spatial objects with the largest outlier factors are taken as the outliers of that dataset. Finally, the outliers of all partitions are collected at the Spark Master: each region yields outlierNum outliers and the whole area also yields outlierNum outliers, and the global outliers are, by and large, also outliers with relatively large outlier factors in the local mining results. All of these outliers are returned to the user for further analysis.
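The Scala sketch below is a minimal, hypothetical rendering of this procedure rather than the paper's implementation: `runCSOM` stands for the C-SOM routine of Section 3, `loadRegion` is an assumed dataset loader, and both are left as unimplemented placeholders; `constraintMap`, `k`, and `outlierNum` mirror the parameters named above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// One record of a regional dataset: id, coordinates, thematic attributes.
case class SpatialRecord(id: Int, x: Double, y: Double, attrs: Array[Double])

object ParallelCSOM {
  // Placeholder for the C-SOM routine of Section 3: builds the constrained Delaunay
  // graph, derives k-order neighborhoods, computes outlier factors, and returns the
  // outlierNum objects with the largest factors as (id, outlier factor) pairs.
  def runCSOM(data: Seq[SpatialRecord], constraintMap: Map[String, Seq[(Double, Double)]],
              k: Int, outlierNum: Int): Seq[(Int, Double)] = ???

  def loadRegion(name: String): Seq[SpatialRecord] = ???   // assumed I/O helper

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parallel-csom"))
    val names = Seq("fuzhou", "quanzhou")                   // datasets chosen by the user
    val constraintMap = Map.empty[String, Seq[(Double, Double)]]
    val (k, outlierNum) = (1, 12)

    // 1) Read each regional (local) dataset and concatenate them into the global dataset.
    val data: Seq[Seq[SpatialRecord]] = names.map(loadRegion)
    val total: Seq[SpatialRecord] = data.flatten

    // 2) One partition per local dataset, plus one partition for the global dataset.
    val partitions: Seq[(String, Seq[SpatialRecord])] = names.zip(data) :+ ("global" -> total)
    val rdd = sc.parallelize(partitions, numSlices = partitions.size)

    // 3) Each Worker Node runs C-SOM on its assigned partition concurrently.
    // 4) collect() gathers the per-partition outliers back at the Master.
    val outliers: Array[(String, Seq[(Int, Double)])] =
      rdd.map { case (name, records) => name -> runCSOM(records, constraintMap, k, outlierNum) }
         .collect()

    outliers.foreach { case (name, os) => println(s"$name: ${os.mkString(", ")}") }
    sc.stop()
  }
}
```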

4.2 Performance and Efficiency Analysis of the Algorithm

Suppose each local dataset to be mined contains m spatial records and there are n local datasets; let T(mn) be the computation time for the global dataset and T(m) the computation time for one local dataset. The total time of centralized mining is then T_1 = T(mn) + n × T(m). For the Spark-based parallel mining system, the Spark service must first be initialized, then the data and computing task of each partition are assigned to the corresponding WN, and each WN transmits its result to the Spark Master after finishing a task. Let GS be Spark's initialization and scheduling time, Comm the network transmission time of the data, and Comp the computation time of the outliers. GS is determined by the Spark framework itself; Comm covers the network transmission of the global dataset, the local datasets, and the results of each dataset, and is determined mainly by the data scale m × n. Suppose there are d WNs, that is, d machines can execute mining tasks simultaneously; then Comp is approximately T(mn) + (n/d) × T(m). The total time of parallel outlier mining is:
$$T_{\mathrm{Spark}}=GS+Comm+Comp=GS+Comm+T(mn)+(n/d)\times T(m)\tag{5}$$
For T_Spark < T_1, that is, GS + Comm + T(mn) + (n/d) × T(m) < T(mn) + n × T(m), we need GS + Comm < (1 - 1/d) × n × T(m). In theory, if the sum of Spark's initialization/scheduling time and the network communication time is smaller than the efficiency gain brought by parallel mining, the parallel algorithm takes less execution time. In addition, the distributed architecture removes the single-machine memory limit: by adding WNs, the system can support mining over more data.
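As a purely hypothetical worked example (the numbers below are assumed for illustration, not measured in this paper), the break-even point can be read directly from the inequality:

$$GS+Comm<\left(1-\tfrac{1}{d}\right)n\,T(m)\;\Rightarrow\; n>\frac{GS+Comm}{\left(1-\tfrac{1}{d}\right)T(m)}=\frac{60\ \text{s}}{\left(1-\tfrac{1}{3}\right)\times 1\ \text{s}}=90$$

That is, with d = 3 Worker Nodes, one second of C-SOM time per local dataset, and roughly one minute of fixed Spark and network overhead, the parallel system would start to outperform the single machine at around 90 local datasets.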

5 Parallel Outlier Mining Prototype System Based on C-SOM and Spark

To help data analysts analyze data quickly and to realize parallel outlier mining on Spark, a lightweight prototype system was designed. Its architecture is shown in Fig. 3. The system adopts the Browser/Server (B/S) architecture, so the client needs no installed application; opening a browser is enough to use the parallel data mining service. The server side is divided into a presentation layer, a business logic layer, a data layer, and a Spark layer. The presentation layer consists of JSP pages, which receive user requests and submit them to the Servlets in the business logic layer. A Servlet first reads the spatial datasets to be mined from the data layer and then submits the mining task to Spark, which performs the parallel outlier mining over the datasets. Spark comprises a Master node and multiple Worker Nodes. As the cluster manager, the Master receives the mining task submitted by the Servlet, decomposes it into multiple sub-tasks, dispatches them to the Worker Nodes for execution, collects the Worker Nodes' results, and returns them to the Servlet, which presents them to the user through the presentation layer.
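One plausible way (not necessarily the paper's) for the business logic layer to hand the mining job to the Spark cluster is Spark's SparkLauncher API; in the sketch below the jar path, main class, master URL, and the `submit` method are placeholders.

```scala
import org.apache.spark.launcher.SparkLauncher

object MiningJobSubmitter {
  // Hypothetical submission of the mining job from the business logic layer
  // (e.g. called inside a Servlet). All paths, class names, and URLs are placeholders.
  def submit(datasets: Seq[String]): Int = {
    val process = new SparkLauncher()
      .setAppResource("/opt/mining/parallel-csom.jar") // jar containing the mining job
      .setMainClass("ParallelCSOM")
      .setMaster("spark://master:7077")
      .addAppArgs(datasets: _*)                        // datasets selected on the web page
      .launch()                                        // returns a java.lang.Process
    process.waitFor()                                  // block until the Spark job finishes
  }
}
```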
Fig. 3 Architecture of the parallel spatial outlier mining prototype system based on Spark

Typical interfaces of the system are shown in Fig. 4. In Fig. 4(a), the user selects the datasets to be mined and sets the parameters of the constrained spatial outlier mining algorithm on the web page, then clicks the "Submit" button to start parallel spatial outlier mining. Fig. 4(b) shows part of the outlier mining results for the global dataset and the local datasets: the rows under "全局" (global) are the results for the dataset of the whole area, and the rows under "fuzhou" are those for the Fuzhou dataset. For each dataset, the 12 outliers with the largest outlier factors are returned in descending order of outlier factor; each row represents one outlier and contains its object ID, x coordinate, y coordinate, and outlier factor value. By analyzing these candidate outliers and their attribute values in combination with the actual situation, the user obtains the final outliers.
Fig. 4 Interfaces for dataset selection, parameter setting, and results in the prototype system

6 Case Study: Outlier Mining on Soil Data

The chemical elements in the Earth's surface soil are closely related to human survival and development. With the rapid industrial and agricultural development of the coastal areas of Fujian, China, soil pollution has become increasingly serious and poses a severe threat to human health. Soil survey and remediation activities analyze geochemical elements mainly by collecting in-situ samples and locating positions with abnormal element concentrations, especially heavy metals such as cadmium, mercury, arsenic, copper, lead, zinc, chromium, and nickel, as well as other elements harmful to human health; their concentration distributions reflect the actual extent of secondary anthropogenic pollution. The soil anomalies finally detected provide a scientific basis for subsequent analysis of the current state of environmental pollution.

6.1 Overview of the Study Area

The soil data in this case study come mainly from the eco-geochemical survey project of the coastal economic zone of Fujian Province carried out by the China Geological Survey, with samples collected strictly on a 2 km × 2 km grid. The study area covers Fuzhou and Quanzhou, two economically developed, densely populated, and highly urbanized regions of Fujian; the Fuzhou area mainly comprises Fuzhou's five districts and eight counties, and the Quanzhou area covers Hui'an and Shishi. The soil dataset has 61 attributes, including the x and y coordinates of the sampling points, sample identifiers (sample ID, analysis number, city, county/district, original sample number), and 54 soil chemical elements and indicators. In this case study, the x and y coordinates are used as the spatial attributes to determine the spatial proximity of the sampling points, and 13 harmful elements (arsenic, silver, beryllium, cadmium, chromium, copper, mercury, nickel, lead, antimony, selenium, thallium, and zinc) are selected as the non-spatial thematic attributes for anomaly analysis, in order to explore the impact of human activities on secondary soil pollution.
Because Fujian is a mountainous province, the soil parent material has a clear influence on soil formation over large areas. This paper therefore treats the soil parent material as a constraint that may affect soil chemical element values and indicators: parent material types are grouped by rock class, and polylines are used to represent the boundaries between rock classes. Besides parent material, the division of the area by rivers and the sea also affects soil distribution. Since the inland water area of Fuzhou and Quanzhou is small, its discontinuity effect on soil distribution is limited; therefore only the wider rivers and the sea are taken as constraints affecting soil chemical element contents, and these constraints are modeled and represented with polylines.

6.2 Experimental Procedure and Result Analysis

6.2.1 Parallel outlier mining on the soil data
Using prefecture-level administrative divisions, the study area dataset was divided into two local datasets, Fuzhou and Quanzhou, and parallel spatial outlier mining was performed with the lightweight prototype system based on C-SOM and Spark. On the web interface of the prototype system, the datasets to be mined, "Fuzhou surface soil chemical elements: 13 harmful elements" and "Quanzhou surface soil chemical elements: 13 harmful elements", and the spatial constraint "Fuzhou-Quanzhou river and parent rock constraints" were selected; the 1-order neighborhood was used as the spatial neighborhood of the geographic objects, and the 12 abnormal soil sampling points with the largest outlier factors were to be returned for each region, as shown in Fig. 4(a). After clicking the "Submit" button, the 12 abnormal soil sampling points with the largest outlier factors for the whole study area and for the two regions (Fuzhou and Quanzhou) are displayed, as shown in Fig. 4(b); the collated candidate soil outliers are listed in Table 1.
Tab. 1 Results of the parallel outlier mining for the experimental area

No. | Whole study area: Object ID (x, y) | OF | Fuzhou area: Object ID (x, y) | OF | Quanzhou area: Object ID (x, y) | OF
1 | 2270 (645 000, 2 849 000) | 11.351 | 2270 (645 000, 2 849 000) | 11.659 | 200 (673 000, 2 745 000) | 3.610
2 | 592 (767 000, 2 809 000) | 10.410 | 592 (767 000, 2 809 000) | 10.509 | 158 (687 000, 2 777 000) | 3.503
3 | 2436 (703 000, 2 773 000) | 7.604 | 590 (761 000, 2 811 000) | 6.355 | 42 (703 000, 2 773 000) | 3.045
4 | 590 (761 000, 2 811 000) | 6.619 | 2271 (649 000, 2 851 000) | 4.476 | 170 (669 000, 2 773 000) | 2.855
5 | 2271 (649 000, 2 851 000) | 4.494 | 1208 (777 000, 2 933 000) | 3.898 | 161 (677 000, 2 779 000) | 2.825
6 | 2564 (669 000, 2 773 000) | 4.391 | 2190 (689 000, 2 855 000) | 3.668 | 181 (689 000, 2 757 000) | 2.532
7 | 2552 (687 000, 2 777 000) | 4.333 | 582 (761 000, 2 817 000) | 3.222 | 176 (703 000, 2 763 000) | 2.474
8 | 1208 (777 000, 2 933 000) | 3.874 | 579 (757 000, 2 819 000) | 3.115 | 17 (673 000, 2 753 000) | 2.218
9 | 2190 (689 000, 2 855 000) | 3.649 | 585 (755 000, 2 813 000) | 3.089 | 2 (681 000, 2 771 000) | 2.168
10 | 2594 (673 000, 2 745 000) | 3.509 | 1869 (789 000, 2 831 000) | 3.029 | 162 (673 000, 2 767 000) | 1.988
11 | 582 (761 000, 2 817 000) | 3.204 | 1835 (671 000, 2 855 000) | 2.987 | 129 (679 000, 2 741 000) | 1.928
12 | 266 (737 000, 2 851 000) | 3.083 | 1467 (717 000, 2 907 000) | 2.818 | 174 (673 000, 2 779 000) | 1.877
In Table 1, the sampling points with object IDs 2270, 592, 590, 2271, 1208, 2190, and 582 in the Fuzhou area and those with object IDs 200, 158, 42, and 170 in the Quanzhou area (corresponding to object IDs 2594, 2552, 2436, and 2564 in the whole study area) are also detected as outliers when the whole study area is mined; among the global outliers, only the sampling point with object ID 266 does not appear among the local outliers. It can be seen that the global outliers are, by and large, also outliers with relatively large outlier factors in the local mining results, but they are not a simple union of the local outliers: the two differ slightly in the positions and ordering of the outliers. This to some extent confirms the local instability of spatial outliers and the reasonableness of taking the above soil sampling points as candidate outliers, and it also verifies that the prototype system based on C-SOM and Spark can perform parallel spatial outlier mining effectively.
6.2.2 Verification of the performance and efficiency of the parallel algorithm
This experiment compares the performance and efficiency of centralized outlier mining on a single machine with parallel outlier mining on multiple machines. The experimental environment consists of four identically configured virtual machines, each with a single-core, single-thread CPU and 2 GB of memory, built on one HP desktop (Intel Core i7-6700 quad-core 3.40 GHz CPU, 16 GB memory). The single-machine experiment runs on one of the virtual machines; the parallel experiment runs on a Spark cluster composed of the four virtual machines, one serving as the Master and the other three as Worker Nodes. n local datasets are generated randomly; the i-th local dataset lies in a square region with origin (i × 10 000, i × 10 000) and side length 10 000. Each local dataset contains 3000 spatial records, each consisting of an x coordinate, a y coordinate, and 13 non-spatial attributes. The time efficiency of single-machine centralized outlier mining and of Spark-cluster-based parallel outlier mining is shown in Fig. 5.
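A hypothetical Scala generator mirroring this synthetic data description is sketched below (it is not the paper's code; the record type repeats the shape of the SpatialRecord used in the Section 4.1 sketch, and attribute values are simply drawn uniformly at random):

```scala
import scala.util.Random

// Same record shape as in the Section 4.1 sketch.
case class SpatialRecord(id: Int, x: Double, y: Double, attrs: Array[Double])

// The i-th local dataset lies in a 10 000 x 10 000 square with origin (i*10 000, i*10 000)
// and contains 3000 records, each with x/y coordinates and 13 non-spatial attributes.
def syntheticLocalDataset(i: Int): Seq[SpatialRecord] = {
  val rng = new Random(i) // one seed per dataset, for reproducibility
  (0 until 3000).map { j =>
    SpatialRecord(
      id    = i * 3000 + j,
      x     = i * 10000 + rng.nextDouble() * 10000,
      y     = i * 10000 + rng.nextDouble() * 10000,
      attrs = Array.fill(13)(rng.nextDouble()))
  }
}

// n local datasets for the scalability experiment (n is varied in Fig. 5).
def syntheticDatasets(n: Int): Seq[Seq[SpatialRecord]] = (0 until n).map(syntheticLocalDataset)
```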
Fig. 5 Comparison of outlier mining efficiency between the single-machine implementation and the parallel system

Fig. 5 shows that: (1) when the number of local datasets n reaches 90, the single-machine system runs out of memory, whereas the parallel system still runs normally with up to 150 local datasets, which indicates that the parallel system can support outlier mining over larger data volumes; (2) when n is small, the parallel system takes more time than the single-machine system, but as n grows, the time cost of the parallel system increases more slowly than that of the single-machine system, and when n reaches 80 the parallel system already costs less time. This is mainly because the Spark-based parallel system incurs certain initialization/scheduling and network communication overheads: when n is small, the efficiency gain from parallel mining does not offset this extra overhead, but as n grows, the extra overhead increases slowly while the gain from parallel mining becomes more and more pronounced. In short, when n is large enough, the outlier mining efficiency of the parallel system exceeds that of the single-machine system.

7 Conclusions

This paper designed a parallel spatial outlier mining algorithm and its prototype system based on the constrained spatial outlier mining algorithm C-SOM and the parallel computing framework Spark, making full use of Spark's fast in-memory computing and scalability. With C-SOM as its core, the parallel algorithm executes C-SOM concurrently on the datasets assigned to multiple computing nodes to mine their outliers, then gathers the outliers obtained by all computing nodes and provides them to the user for analysis. The lightweight prototype system implements this parallel algorithm on Spark and adopts the B/S architecture, providing a simple and practical visual interface to facilitate data analysis.
Anomaly detection and analysis of the contents of 13 harmful elements in the soil geochemical survey data from coastal Fujian verified the correctness and effectiveness of the parallel spatial outlier mining algorithm based on C-SOM and Spark and of its prototype system. Comparative tests of spatial outlier mining on synthetic data between the single-machine system and the parallel system showed that the parallel system effectively improves mining efficiency: when the number of datasets is large enough, its time cost is lower than that of the single-machine system. Moreover, thanks to the scalability of the Spark framework, the parallel system removes the single-machine memory limit and can support mining over more data.

The authors have declared that no competing interests exist.

[1] Shekhar S, Lu C T, Zhang P. A unified approach to detecting spatial outliers[J]. GeoInformatica, 2003,7(2):139-166.
[2] Singh A K, Lalitha S. A novel spatial outlier detection technique[J]. Communications in Statistics: Theory and Methods, 2018,47(1):247-257.
[3] Chen C C, Lin J X, Wu X Z, et al. Parallel and distributed spatial outlier mining in grid: Algorithm, design and application[J]. Journal of Grid Computing, 2015,13(2):139-157.
[4] Lu C T, Chen D, Kou Y. Algorithms for spatial outlier detection[C]. Melbourne: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003.
[5] Haslett J, Bradley R, Craig P, et al. Dynamic graphics for exploring spatial data with application to locating global and local anomalies[J]. The American Statistician, 1991,45(3):234-242.
[6] Pannatier Y. Variowin: Software for spatial data analysis in 2D[J]. Statistics & Computing, 1996,11(7):531-534.
[7] Anselin L. Local indicators of spatial association: LISA[J]. Geographical Analysis, 1995,27(2):93-115.
[8] Shekhar S, Lu C T, Zhang P. Detecting graph-based spatial outliers[J]. Intelligent Data Analysis, 2002,6(5):451-468.
[9] Shekhar S, Lu C T, Zhang P. Detecting graph-based spatial outliers: Algorithms and applications (a summary of results)[C]. San Francisco: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[10] Lin J X. Research on distributed spatial outlier mining in the presence of constraints and its applications[D]. Fuzhou: Fuzhou University, 2010. (in Chinese)
[11] Anselin L. Exploratory spatial data analysis and geographic information systems[J]. New Tools for Spatial Analysis, 1994,17:45-54.
[12] Kou Y, Lu C T, Chen B. Spatial weighted outlier detection[C]. Philadelphia: Proceedings of the 6th SIAM International Conference on Data Mining, 2006.
[13] Tsai C F, Lin W C, Ke S W. Big data mining with parallel computing: A comparison of distributed and MapReduce methodologies[J]. Journal of Systems and Software, 2016,122:83-92.
[14] Gan W S, Lin J C W, Chao H C, et al. Data mining in distributed environment: A survey[J]. WIREs Data Mining and Knowledge Discovery, 2017,7(6):e1216.
[15] Luo P, Lu K, Shi Z Z, et al. Distributed data mining in grid computing environments[J]. Future Generation Computer Systems, 2007,23(1):84-91.
[16] Gkatzikis L, Koutsopoulos I. Migrate or not? Exploiting dynamic task migration in mobile cloud computing systems[J]. IEEE Wireless Communications, 2013,20(7):24-32.
[17] Apache. Hadoop[EB/OL].
[18] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters[J]. Communications of the ACM, 2008,51(1):107-113.
[19] Wu Q Y, Su K Y, Zou Z J. A MapReduce-based method for parallel calculation of bus passenger origin and destination from massive transit data[J]. Journal of Geo-information Science, 2018,20(5):647-655. (in Chinese)
[20] Apache. Spark[EB/OL].
[21] Jing W P, Huo S Q. A model of parallel mosaicking for massive remote sensing images based on self-defined RDD[J]. Journal of Geo-information Science, 2017,19(10):1346-1354. (in Chinese)
[22] Wang X T, Shen D R, Bai M, et al. BOD: An efficient algorithm for distributed outlier detection[J]. Chinese Journal of Computers, 2016,39(1):36-51. (in Chinese)
[23] Zhang J F, Li Y H, Qin X, et al. Related-subspace-based local outlier detection algorithm using MapReduce[J]. Journal of Software, 2015,26(5):1079-1095. (in Chinese)
[24] Ren Y. Parallel mining of distance-based outliers using MapReduce[J]. Computer Systems & Applications, 2018,27(2):151-156. (in Chinese)
[25] Yu D, Ping L, Li W. Spatio-temporal outlier detection based on cloud computing[J]. Journal of Computational Information Systems, 2014,10(13):5481-5488.
[26] Zhang W P, Liu J P, Chou A G, et al. A spatial outlier mining algorithm based on distributed computing[J]. Science of Surveying and Mapping, 2017,42(8):85-90. (in Chinese)
[27] Yao M J, Lin J X, Chen C C, et al. Service and application of grid based distributed spatial outliers mining[J]. Journal of Geo-information Science, 2011,13(3):383-390. (in Chinese)
[28] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]. San Jose: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
[29] Lin J X, Chen C C, Wu J W. CD-graph: Planar graph representation for spatial adjacency and neighbourhood relation with constraints[J]. International Journal of Geographical Information Science, 2013,27(10):1902-1923.
