地球信息科学理论与方法

一种面向栅格的空间-属性双重约束聚类方法

  • 刘敬一 , 1, 2 ,
  • 薛存金 , 2, 3, * ,
  • 樊彦国 1 ,
  • 孔凡萍 2 ,
  • 何亚文 1
展开
  • 1. 中国石油大学(华东)地球科学与技术学院,青岛 266580
  • 2. 中国科学院遥感与数字地球研究所 数字地球重点实验室,北京 100094
  • 3. 海南省地球观测重点实验室,三亚 572029
*通讯作者: 薛存金(1979-),男,博士,副研究员,主要从事海洋地理信息系统、海洋时空分析方法研究。E-mail:

作者简介:刘敬一(1992-),女,硕士生,研究方向为海洋时空聚类方法。E-mail:

收稿日期: 2016-07-12

  要求修回日期: 2016-11-02

  网络出版日期: 2017-04-20

基金资助

国家自然科学基金项目(41371385、41401439、41671401)

中国科学院青年促进会项目(2013113)

海洋动力遥感与声学重点实验室开放基金项目(KHYS1402)

A Raster-Oriented Clustering Method with Space-Attribute Constraints

  • LIU Jingyi , 1, 2 ,
  • XUE Cunjin , 2, 3, * ,
  • FAN Yanguo 1 ,
  • KONG Fanping 2 ,
  • HE Yawen 1
Expand
  • 1. School of Geosciences, China University of Petroleum (East of China), Qingdao 266580, China
  • 2. Key Laboratory of Digital Earth Science, Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing 100094, China
  • 3. Key Laboratory of Earth Observation, Sanya 572029, China
*Corresponding author: XUE Cunjin, E-mail:

Received date: 2016-07-12

  Request revised date: 2016-11-02

  Online published: 2017-04-20

Copyright

《地球信息科学学报》编辑部 所有

摘要

针对栅格数据,传统聚类方法大都基于专题属性进行聚类,分裂了栅格对象的空间特性与专题属性,而兼顾空间与专题属性的现有空间聚类方法又存在算法复杂、参数设置多等问题,因此本文提出了一种面向栅格的空间-属性双重约束聚类算法(A Raster-oriented Clustering Method with Space-Attribute Constraints, RoCMSAC)。RoCMSAC利用栅格数据空间邻域和空间连通特性,重新定义栅格簇的相似性度量准则,通过属性均质簇生成,空间相邻栅格簇合并和空间邻近栅格簇合并3个步骤对栅格数据进行空间-属性双重约束聚类。利用太平洋海域海表温度栅格数据对算法的可行性以及有效性进行验证,并与现有算法进行对比分析。通过实例验证与对比发现:① RoCMSAC方法能够保证栅格簇空间域的邻近性和属性域的均质性;② RoCMSAC方法可发现复杂形状的栅格簇,且算法时间复杂度低,需输入参数较少。

本文引用格式

刘敬一 , 薛存金 , 樊彦国 , 孔凡萍 , 何亚文 . 一种面向栅格的空间-属性双重约束聚类方法[J]. 地球信息科学学报, 2017 , 19(4) : 447 -456 . DOI: 10.3724/SP.J.1047.2017.0447

Abstract

For dealing with the raster datasets, most of the traditional clustering methods are based on the thematic attribute, which separate the integrities of spatial and thematic characteristics. However, the current clustering methods considering both spatial and thematic characteristics still have great problems such as complicated clusters, computational complexities and many input parameters, etc. Thus, this paper presents a Raster-oriented Clustering Method with Space-Attribute Constraints, named RoCMSAC. The core idea of RoCMSAC uses the spatial contiguities and the connectivity of raster datasets to redefine the similarity measure criterion. The RoCMSAC consists of three steps, i.e. the cluster generation with the homogeneous attributes, the cluster merging with the spatial contiguities and the cluster merging with the spatial vicinities. Finally, the feasibility and effectiveness of the algorithm are validated with the datasets of sea surface temperature in Pacific Ocean. The clusters from RoCMSAC are compared with those from K-Mean and DDBSC. The results show that: (1) RoCMSAC can detect any grid cluster with the complicated shape, which needs less time and fewer input parameters; (2) The clusters from RoCMSAC obtain both the proximity in spatial domain and the homogeneity in attribute one.

1 引言

对地观测技术的发展,使以卫星影像为主的栅格数据呈井喷式增长,这对数据的处理分析提出了新的挑战。聚类分析是数据挖掘分析的重要组成部分之一,不仅可以对数据进行预处理,还可以将不为人们认知的聚簇模式提取并进行科学分析,因此成为了处理分析大规模栅格数据的有效方法之一。聚类分析的目的是寻找隐藏在数据中的结构,按照某种相似性度量,尽可能把具有相同性质的数据归于同一类[1]。传统空间聚类方法大致可分为五类,即基于划分、基于层次、基于密度、基于网格、基于模型的聚类方法[2],而这些方法大都只考虑单一的空间位置聚类或专题属性聚类。由于地理空间对象具有空间和专题属性双重特性,这要求同一簇类既要在空间域上毗邻,又要在属性域上具有最大的相似度[3]。以海洋表面温度异常变化为例,中太平洋海洋温度异常升高可能是中太平洋厄尔尼诺(CP-El Nino)事件的指示器,而东太平洋海洋表面温度异常升高则是东太平洋厄尔尼诺(EP -El Nino)事件的指示器,引入空间因素来约束属性聚类具有重要意义。因此,面向栅格数据的聚类不仅要考虑专题属性特征的相似性,还应考虑对象的空间邻近性,即聚类结果中的各栅格簇在空间域上连续、在属性域上相近[4]
在解决上述双重聚类问题时,现有空间-属性双重约束聚类方法可归纳为3类:
(1)同一相似性度量准则,同时聚类。其将空间位置与非空间属性归一化,融合加权构造相似性度量公式,再利用常用聚类算法进行聚类。代表研究有坐标-属性一体化方法[5]、空间加权距离模糊C均值聚类方法[6]、核函数加权K-均值算法[7]、地图代数空间加权聚类算法[8]等。此类方法将空间坐标与属性加权纳入一个空间相似性度量公式中,其权重确定较为困难。同时,将空间坐标作为另一种专题属性处理,在一定程度上弱化了对象的空间特性。
(2)不同相似性度量准则,同时聚类。其在聚类的过程中分别考虑空间邻近与属性相近,大多在传统聚类算法的基础上进行扩展,较为常用的是DBSCAN(Density-Based Spatial Clustering of Applications with Noise)算法。相关研究有Birant等提出的利用双阈值约束空间与非空间信息的ST-DBSCAN算法[9];李光强等也基于相同的思想,利用空间与属性的双重距离来定义密度核心,提出了基于双重距离的空间聚类算法(Dual Distance Based Spatial Clustering,DDBSC)[3];Zhou等也在DBSCAN算法的基础上进行了双约束聚类方法的扩展探索[10]。这类方法受DBSCAN算法自身的局限,对参数设置较为敏感,算法的时间复杂度也较高,对空间分布不均匀和具有一定趋势性的专题属性数据的聚类效果不佳。
(3)空间、属性分别聚类后,再综合聚类。其从空间域和专题属性域2个方面分别进行聚类,然后综合生成最终聚类结果。代表研究成果有交互聚簇分类算法(Interlaced Clustering-Classification,ICC)[4],自适应确定参数的地理信息与空间优化约束算法(Bound INformation of Geography and Optimization spaces, BINGO)[11]。以及国内学者提出的基于Delaunay三角网的多层次约束的空间聚类方法[12],和基于空间点集Voronoi图的自组织双重聚类算法[13]等。此类方法会受所采用基础聚类算法的影响,同时针对大规模数据集会存在计算复杂度较高、参数设置较多等问题。
综上分析,目前双约束聚类算法大多面向矢量数据集,针对栅格数据集的双约束聚类问题,还需要深入探讨。由于栅格数据数据量较大,且专业性较强,对其进行空间-属性双重约束聚类,需要算法在保证聚类质量的前提下,具有较高的计算效率和较少的参数输入,且参数设置需要较少先验知识。基于此,本文从栅格数据的双重约束聚类出发,提出了面向栅格数据的空间-属性双重约束聚类算法(A Raster-oriented Clustering Method with Space-Attribute Constraints, RoCMSAC),实现聚类后栅格簇在空间域上邻近,在属性域上具有最大的相似度。

2 RoCMSAC方法

2.1 基本概念

设空间栅格数据集: P = { p 1 , p 2 , , p n } , n 1 ,n表示栅格数据行号与列号的乘积, p i ( i = 1,2 , , n ) 为栅格像元,Pi的空间属性为其行列号,非空间属性为其像元值。
定义1 栅格像元空间相邻:栅格数据集中 p i p j 为两栅格像元 ( i j ) ,若 p i p j 的β-邻域中(β一般取4、8等值[14]),则定义两栅格像元空间相邻,记为 p i ~ p j 图1(a)-(c))。
Fig. 1 Spatial adjacency and connectivity of pixels

图1 栅格像元空间相邻与空间连通示意图

定义2 空间连通:栅格数据集中 p i , p j , p k 为3个栅格像元 ( i j k ) ,若 p i ~ p j p j ~ p k ,则有 p i ~ p k 。若栅格子集中任意2个像元 p i p j ,至少存在一条空间邻近路径 p i ~ p k 1 , p k 1 ~ p k 2 ,…, p kn ~ p j ,则定义此栅格子集为空间连通(图1(d))。
定义3 栅格子簇:每一个栅格子簇中的栅格像元是空间连通的,且栅格像元间属性值(像元值)差异较小。
定义4 簇中心:由一个虚拟点表示,记为 O = ( x ̅ , y ̅ , a ̅ ) 。其中, x ̅ y ̅ 表示簇的空间位置中心, x ̅ 等于每个栅格像元的行号之和与簇中所有栅格像元个数的商, y ̅ 等于每个栅格像元的列号之和与簇中所有栅格像元个数的商; a ̅ 表示簇的属性中心,其等于簇中各栅格像元值的均值。
定义5 栅格子簇空间相邻:存在2个空间子簇 C 1 C 2 ,若 C 1 中栅格像元的β-邻域中存在 C 2 的栅格像元,则称这2个子簇空间相邻,否则两簇为空间独立(图2)。
Fig. 2 Spatial adjacency of clusters

图2 栅格簇空间相邻示意图

注:图(a)表示在合并前子簇1与子簇2、3、4空间相邻;图(b)表示子簇1与子簇3满足条件合并后新子簇与子簇2、4、5空间相邻

定义6 栅格子簇空间邻近:存在2个空间独立子簇 C 1 C 2 ,属性阈值 ξ a 表示2个空间邻近簇间属性距离的最大值,空间阈值 ξ s 表示2个空间邻近簇空间中心距离最大值, ξ a ξ s 为输入参数。若这2个独立子簇的属性距离小于 ξ a ,同时空间距离小于 ξ s ,则称这2个空间独立子簇空间邻近,记为 C 1 C 2 图3)。
Fig. 3 Spatial vicinity of grid clusters

图3 栅格簇空间邻近示意图

注:子簇1、2、3均为独立簇且属性距离在阈值ξa内,若空间阈值ξs=5,则有子簇1和子簇2为空间邻近,记子簇1➝子簇2

定义7 栅格簇:是一组栅格数据对象的集合,每一个栅格簇由满足聚类条件的空间邻近以及空间相邻的栅格子簇组成,簇中元素相似度高,簇间相似性差异较大。

2.2 算法设计

RoCMSAC算法主要包含3个核心步骤:① 生成属性均质簇。对数据进行纯属性聚类,利用“栅格像元空间相邻”的定义对栅格数据属性聚类结果进行遍历,生成同属性类空间连通簇;② 空间相邻簇合并。利用“栅格子簇空间相邻”的定义对栅格簇的相邻簇进行查询,对满足合并条件的相邻子簇进行合并;③ 空间邻近簇合并。利用“栅格子簇空间邻近”的定义对独立子簇进行进一步合并,对属性进行重新聚类划分,得到面向栅格数据的空间-属性双重约束聚类结果。具体方法流程如图4所示。
Fig. 4 The algorithm flow chart of RoCMSAC

图4 RoCMSAC算法流程

2.2.1 属性均质簇的生成算法
对栅格数据的属性进行聚类,可采用多种聚类算法,如K-均值,DBSCAN, SOM[15-17]算法等。本文采用经典聚类算法K-均值作为属性聚类基本算法,对栅格数据进行纯属性聚类。具体步骤为[15]
(1)选定期望的聚类个数K,并随机选择K个初始聚类中心。
(2)对所有数据计算它到各个聚类中心的距离,并分配到距离最近的中心。
(3)重新计算各个划分类的中心。
(4)迭代步骤(2)、(3),直到新的聚类中心与原中心相等或者二者差值小于一定阈值,算法结束。栅格数据每个栅格像元在属性聚类后,其属性被所属类别代替。设属性类别为 K = { 1,2 , , k } , k 2 ,继续进行生成属性均质簇。
(5)遍历栅格数据集,从第一个栅格开始,查询与其空间相邻的栅格像元,判断是否同属一个类别,如果是则进行合并、标记,合并完成后遍历第2个未被访问和标记的栅格像元,否则直接访问第2个未被访问和标记的栅格像元。
(6)重复步骤(5),直到所有的栅格像元全部被重新标记,生成各个属性类别的空间连通均质簇。
2.2.2 空间相邻簇合并算法
通过上述算法,聚类空间被划分为若干属性均质簇,一个属性类对应一个或多个空间连通簇。在此基础上,进行空间相邻栅格簇的合并,设子簇集为 C = { C 1 , C 2 , , C m } , m 2 , m 表示所有子簇的个数。每个子簇的簇中心表示为 O i = x i ¯ , y i ¯ , a i ¯ , i = 1,2 , , m ,属性阈值设为 ξ a 。空间相邻簇合并算法的步骤如下:
(1)遍历C中的子簇,对每个子簇的空间相邻栅格簇进行查询并存到数组N中,则第m个子簇的 t个空间相邻栅格簇集合表示为 N m = { C m 1 , C m 2 , ... , C m t } , t 0 ;
(2)计算每个子簇属性中心与其各个空间相邻栅格簇属性中心差异。采用绝对值距离进行衡量,即 D mt attri = a m ¯ - a t ¯ ,表示第m个子簇与其第t个空间相邻栅格簇的属性差异,将 D mt attri 小于阈值 ξ a 的空间相邻栅格簇进行属性差值排序,将属性差值最 小的空间相邻栅格簇与该子簇放入集合 M = ( C m , C m t , D mt attri ) ;
(3)遍历集合M,查找 D mt attri 最小的两个子簇进行合并,更新N、M,及各个子簇中心;
(4)循环执行步骤(2)、(3),直到没有新的合并簇产生,各个独立簇记为 C d = { C 1 d , C 2 d , , C z d } , z 1 ;
2.2.3 空间邻近簇合并算法
每个子簇与其满足条件的空间相邻栅格簇进行合并产生新簇后,第一次簇合并结束。合并结束后, C d 中的各个子簇是相互独立的,在大范围的栅格数据中,一定的空间阈值范围内,2个相互独立且属性接近的子簇可以被认为是空间邻近的,对空间邻近独立栅格子簇做进一步约束聚类,可以使栅格簇在空间上更为连续,聚类结果更加符合实际。因此,根据“栅格子簇空间邻近”的定义,进行空间邻近簇的合并,空间阈值设为 ξ s ,合并步骤如下:
(1)遍历 C d 中的每个独立子簇,采用绝对值距离计算簇间属性差异,将属性距离在阈值 ξ a 内的独立簇提取。
(2)对步骤(1)提取的独立子簇计算两两空间位置差异。采用欧式距离计算2个独立簇的空间差异,以 C d 中的子簇1和子簇2为例,2个簇的空间距离为 D 12 space = ( x 1 ¯ - x 2 ¯ ) 2 + ( y 1 ¯ - y 2 ¯ ) 2 ,若 D 12 space 小于 ξ s ,则将2个子簇进行合并,更新簇中心,否则不进行 合并。
(3)重复进行步骤(1)、(2),直到所有满足条件的独立子簇合并完成。
通过上述算法步骤,对每个栅格的类别归属进行了重新标记,根据标记完成新的聚类划分。

3 算法实例验证

3.1 试验区与数据

本文选择太平洋海域作为试验区域(100°E~60°W,50°S~50°N),1982年1月至2012年12月海表温度(Sea Surface Temperature,SST)为试验数据。SST数据空间分辨率为0.25º,时间分辨率为月,空间覆盖为全球。数据来源于美国国家海洋大气管理局(NOAA)的地球系统研究室(Earth System Research Laboratory, ESRL)的物理科学部(Physical Science Division)(http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.html)。

3.2 数据预处理

海表温度异常变化是全球气候变化的一个响应器和指标器,因此在进行聚类之前应该先去除其季节性成分。本文利用标准化月均距平算法-Z-score对原始SST栅格数据集进行处理[18-19],得到SST异常变化数据集,如图5所示。
Fig. 5 The monthly anomaly of SST in test area

图5 试验区域SST月均距平结果

注:图中标定的区域1、2、3、4是算法验证选定的对比区域

相关研究表明,对于太平洋区域的海表温度异常变化而言,SST≥0.5 ℃和SST≤-0.5 ℃是2个关注较多且有价值的范围[20-21]。因此,本文选取海表温度异常变化高于0.5 ℃和低于-0.5 ℃的栅格数据进行空间-属性双约束聚类处理,将大于-0.5 ℃并小于0.5 ℃的栅格数据划分为单独一类。

3.3 算法验证及对比分析

RoCMSAC方法需要确定属性阈值 ξ a 与空间阈值 ξ s 2个参数。参照文献[12],在确定β-邻域的值后,统计计算每个栅格像元与其β-邻域内像元属性最大差值,并计算这些差值的平均值,将其作为属性阈值 ξ a ;由于栅格数据的空间属性用行列号表示,2个栅格像元或栅格子簇的空间距离为其行列号或簇中心间的欧式距离,选择不同大小的 ξ s 能够反映不同的聚类程度。从极限来看,当 ξ s 取值很大时,RoCMSAC方法趋向于纯属性K-均值的聚类结果,当 ξ s 取值很小时,RoCMSAC方法的第三步合并就没有意义,经过大量的试验,当 ξ s 取值为2.5°(即10个栅格像素)时,其最终的聚类结果较好。设定β-邻域为8-邻域,属性阈值 ξ a 为0.2 ℃,空间阈值 ξ s 为2.5°,得到RoCMSAC聚类结果(图6),聚类后各栅格簇均值与标准差的统计结果如表1所示。由表1可知,RoCMSAC方法得到的聚类栅格簇标准差均较小,说明各簇簇内差异较小。结合图6表1可得,空间距离较近的簇,其簇间均值差异均较大,如簇3和簇11、簇2和簇7等,而均值差异较小的簇在空间上差异都较大,如簇2和簇3、簇5和簇6、簇17和簇18等,这说明各簇簇间差异较大,由此证明RoCMSAC方法的正确性。同时,与图5进行对比可知,RoCMSAC方法的聚类结果符合试验区域的海表温度异常变化属性空间分布的实际情况,聚类后栅格数据集被划分为空间邻近、属性相近的均质聚类簇,由此证明了RoCMSAC方法的有效性。
Fig. 6 Clustering results of RoCMSAC

图6 RoCMSAC方法聚类结果

Tab. 1 The mean and standard deviation of clusters of RoCMSAC

表1 RoCMSAC方法聚簇均值与标准差统计表

均值/℃ 标准差 簇号 均值/℃ 标准差
1 2.15 0.482 14 0.78 0.226
2 1.74 0.325 15 0.94 0.216
3 1.72 0.313 16 0.77 0.215
4 1.46 0.166 17 -1.10 0.279
5 1.78 0.281 18 -1.11 0.339
6 1.79 0.293 19 -1.03 0.247
7 0.83 0.212 20 -1.12 0.342
8 0.84 0.219 21 -1.36 0.449
9 0.68 0.133 22 -0.88 0.125
10 0.81 0.164 23 -1.04 0.271
11 0.88 0.215 24 -0.99 0.250
12 0.85 0.213 25 -1.01 0.155
13 0.77 0.183
除采用RoCMSAC方法外,本文还采用加权复合距离K-均值方法[6]和DDBSC方法[3]进行对比。加权复合距离K-均值聚类方法中的K值与RoCMSAC方法一致,属性与空间各自权重为0.5,聚类结果如图7所示。DDBSC方法的属性半径,空间半径以及为邻域个数阈值等参数参照文献[3]、[16]中的方法确定,其聚类结果如图8所示。
Fig. 7 Clustering results of k-mean with weighted distance

图7 加权复合距离K-均值聚类结果

Fig. 8 Clustering results of DDBSC

图8 DDBSC方法聚类结果

对比图6、7、8可知,3种方法在部分区域的聚类结果差异较大,为了对比各种方法,选取4个具有代表性的区域进行分析(图5),其中区域1为属性变化较小且不连续区域,区域2和区域3为属性变化较大的区域,区域4为属性渐变区域。图9-12展示了各种方法在4个对比区域的不同聚类结果,由左到右依次为RoCMSAC方法、加权复合距离K-均值方法和DDBSC方法聚类结果。其中“R-”、“K-”和“D-”分别表示RoCMSAC方法、加权复合距离K-均值方法和DDBSC方法的聚类结果簇的簇编号 前缀,簇编号与各方法的聚类结果图中的簇编号相对应。
Fig. 9 Comparisons of region 1

图9 区域1对比分析结果

Fig. 10 Comparison results of region 2

图10 区域2对比分析结果

Fig. 11 Comparison results of region 3

图11 区域3对比分析结果

Fig. 12 Comparison results of region 3

图12 区域4对比分析结果

区域1为属性变化较小且不连续区域。此区域中存在较多在空间上不相邻但空间邻近的子簇,将空间邻近、属性相近的子簇划分为同一簇类,并将空间差异较大的子簇划分为不同簇类更为符合实际。由图9所示,RoCMSAC方法和DDBSC方法将此区域划分为多个簇,而加权复合距离K-均值聚类方法将此区域划分为同一簇。由表2可知,R-1的均值远大于R-8、R-9,将其划分了一个簇与实际不符,同时R-7和R-8虽然在属性上差异较小,但是空间差异较大,也应分为不同簇类。由此可知,加权复合距离K-均值聚类方法在保证聚簇属性均质性和空间差异特性体现上效果较差。
Tab. 2 The mean and standard deviation of clusters in region 1

表2 区域1各簇均值与标准差统计表

RoCMSAC 加权复合距离K-均值 DDBSC
簇号 均值/℃ 标准差 簇号 均值/℃ 标准差 簇号 均值/℃ 标准差
R-7 0.83 0.211 K-7 1.34 0.485 D-1 0.81 0.135
D-7 0.80 0.121
D-8 0.84 0.108
R-8 0.84 0.219 D-11 0.83 0.188
D-12 0.85 0.165
D-19 0.84 0.122
R-9 0.61 0.133 D-13 0.66 0.101
D-16 0.67 0.082
R-1 2.15 0.386 噪 声
对比RoCMSAC方法和DDBSC方法,DDBSC方法的聚类结果较为破碎,而RoCMSAC方法的聚类结果在空间上较为紧凑。以D-13、D-16例,2个簇虽然没有空间相邻关系,但是在空间上十分邻近,且属性差异很小,结合表2图5可得,RoCMSAC方法的划分结果更合理。RoCMSAC方法由于考虑了空间相邻与空间邻近两重约束合并,使栅格簇在保证属性相近的基础上,比DDBSC方法的聚类结果在空间上更为连续与紧凑。
区域2和区域3为属性变化较大的区域。区域2中A、B和区域3中A、B、C虽然空间较为接近,但属性差异较大,应当分为不同的簇类。由图10、11所示,RoCMSAC方法和DDBSC方法的聚类结果较为符合实际,而加权复合距离K-均值方法将2对比区域均划分为同一簇类,与实际情况不符,这也使得其聚类结果的簇内标准差远远高于RoCMSAC方法和DDBSC方法(表3、4)。DDBSC方法将区域3的B部分定义为了噪声,这与参数的设置有关,在本文参数设置下,此部分无法满足DDBSC方法的聚簇条件,定义为噪声是合理的。由此可得,在属性差异较大的区域,RoCMSAC方法和DDBSC方法的聚类结果基本类似,2种方法均能对栅格数据进行较好的空间-属性双约束划分,而加权复合距离K-均值方法划分结果与实际偏离较大,无法保证聚簇属性均质性。
Tab. 3 The mean and standard deviation of clusters in region 2

表3 区域2各簇均值与标准差统计表

区域 RoCMSAC 加权复合距离K-均值 DDBSC
簇号 均值/℃ 标准差 簇号 均值/℃ 标准差 簇号 均值/℃ 标准差
A R-19 -1.04 0.271 K-2 -0.80 0.633 D-20 -1.16 0.223
B R-12 0.78 0.215 D-31 0.95 0.184
Tab. 4 The mean and standard deviation of clusters in region 3

表4 区域3各簇均值与标准差统计表

区域 RoCMSAC 加权复合距离K-均值 DDBSC
簇号 均值/℃ 标准差 簇号 均值/℃ 标准差 簇号 均值/℃ 标准差
A R-22 -1.04 0.271 K-6 -0.57 1.031 D-33 -1.08 0.275
B R-6 0.79 0.293 噪声
C R-16 0.78 0.215 D-42 0.85 0.124
区域4为属性渐变区域。由图12可知,RoCMSAC方法将其划分为2个簇,簇间的均值差异较大,簇内标准差较小;而DDBSC方法将此区域划分为一个簇,簇内标准差较大,这是因为在属性渐变区域,邻近栅格属性的差异不大,然而这种差异的积累会导致同一空间簇中实体间专题属性的差异增大,而DDBSC方法由于自身算法的局限性,无法识别这种渐变差异;加权复合距离K-均值方法将此区域划分为3个簇,这与权值的设定有很大关系,划分结果与实际情况有较大偏差。在属性渐变的区域,RoCMSAC方法的聚类结果更为符合实际,而DDBSC方法和加权复合距离K-均值方法的聚类结果不理想。
Tab. 5 The mean and standard deviation of clusters in region 4

表5 区域4各簇均值与标准差统计表

簇号 RoCMSAC 加权复合距离K-均值 DDBSC
均值/℃ 标准差 簇号 均值/℃ 标准差 簇号 均值/℃ 标准差
R-3 1.72 0.313 K-9 1.53 0.667 D-26 1.25 0.595
R-11 0.88 0.215 K-4 1.02 0.389
K-3 0.80 0.192
综上,加权复合距离K-均值聚类方法无法保证栅格簇属性均质特性,尤其在属性变化较大的区域,聚类效果较差。DDBSC方法在变化较为显著的区域能够较好地实现栅格数据的双重约束聚类,与RoCMSAC方法聚类结果基本一致,但其在属性渐变区域聚类效果不理想;其次,在不考虑空间索引的情况下,DDBSC方法的时间复杂度远高于RoCMSAC方法和基于加权复合距离K-均值聚类方法(表6)。本文提出的RoCMSAC方法在属性不连续区域、属性变化较大区域以及属性渐变区域均能够保证栅格簇属性均质性和空间邻近性,具有较好的聚类效果,且计算复杂度较小。
Tab. 6 Relation between operation time and data size of each method

表6 各方法运行时间与数据量关系表

数据量/103 运行时间/S
RoCMSAC 加权复合距离K-均值 DDBSC
5 0.99 0.85 7.23
10 1.85 1.69 28.76
15 3.02 2.61 35.02
20 4.20 3.34 47.86

4 结论与展望

本文结合栅格数据的空间邻域与空间连通特性,对聚类的相似性准则以及相关概念进行了重新定义,设计了包含属性均质簇生成、空间相邻栅格簇合并和空间邻近栅格簇合并3步的面向栅格的双重约束聚类方法RoCMSAC。通过实例验证和与其他方法进行对比,得出以下结论:① RoCMSAC方法能够有效解决栅格数据的空间-属性双约束聚类问题,发现形状较为复杂的空间栅格簇,可同时满足聚类结果在空间域上邻近和在属性域上相近;② RoCMSAC方法参数输入较少,参数确定所需专业背景知识较少,有利于用户实际使用;③ RoCMSAC方法的计算复杂度较小,在面向大规模的栅格数据处理时,具有计算优势。
在RoCMSAC算法进行中,属性与空间阈值参数应当根据不断的合并进行自适应的动态调整,今后的研究重点在于优化设计算法,使参数自适应确定、调整等方面。此外,面向时序栅格数据的聚类分析,在考虑空间与专题属性双重约束的同时还应加入时间信息,以便更好地挖掘连续渐变的地理对象和现象的时空过程聚簇特性[23]。因此,今后研究工作的另一重点是在RoCMSAC算法的基础上,拓展时空聚类方法。

The authors have declared that no competing interests exist.

[1]
Jain A K, Dubes R C.Algorithms for clustering data[M]. London: Prentice-Hall, 1988:1-334.

[2]
Omran M G H, Engelbrecht A P, Salman A. An overview of clustering methods[J]. Intelligent Data Analysis, 2007,11(6):583-605.Abstract Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the process of grouping similar objects into different groups, or more precisely, the partitioning of a data set into subsets, so that the data in each subset according to some defined distance measure. This paper covers about clustering algorithms, benefits and its applications. Paper concludes by discussing some limitations.

DOI

[3]
李光强,邓敏,程涛,等.一种基于双重距离的空间聚类方法[J].测绘学报,2008,4:482-488.传统聚类方法大都是基于空间位置或非空间属性的相似性来进行聚类,分裂了空间要素固有的二重特性,从而导致了许多实际应用中空间聚类结果难以同时满足空间位置毗邻和非空间属性相近。然而,兼顾两者特性的空间聚类方法又存在算法复杂、结果不确定以及不易扩展等问题。为此,本文通过引入直接可达和相连概念,提出了一种基于双重距离的空间聚类方法,并给出了基于双重距离空间聚类的算法,分析了算法的复杂度。通过实验进一步验证了基于双重距离空间聚类算法不仅能发现任意形状的类簇,而且具有很好的抗噪性。

DOI

[ Li G Q, Deng M, Cheng T, et al.A dual distance based spatial clustering method[J]. Acta Geodaetica Et Cartographica Sinica, 2008,4:482-488. ]

[4]
Lin C R, Liu K H, Chen M S.Dual clustering: integrating data clustering over optimization and constraint domains[J]. IEEE Trans Knowledge and Data Engineering, 2005,17(5):628-637.Spatial clustering has attracted a lot of research attention due to its various applications. In most conventional clustering problems, the similarity measurement mainly takes the geometric attributes into consideration. However, in many real applications, the nongeometric attributes are what users are concerned about. In the conventional spatial clustering, the input data set is partitioned into several compact regions and data points which are similar to one another in their nongeometric attributes may be scattered over different regions, thus making the corresponding objective difficult to achieve. To remedy this, we propose and explore in this paper a new clustering problem on two domains, called dual clustering, where one domain refers to the optimization domain and the other refers to the constraint domain. Attributes on the optimization domain are those involved in the optimization of the objective function, while those on the constraint domain specify the application dependent constraints. Our goal is to optimize the objective function in the optimization domain while satisfying the constraint specified in the constraint domain. We devise an efficient and effective algorithm, named Interlaced Clustering-Classification, abbreviated as ICC, to solve this problem. The proposed ICC algorithm combines the information in both domains and iteratively performs a clustering algorithm on the optimization domain and also a classification algorithm on the constraint domain to reach the target clustering effectively. The time and space complexities of the ICC algorithm are formally analyzed. Several experiments are conducted to provide the insights into the dual clustering problem and the proposed algorithm.

DOI

[5]
李新运,郑新奇,闫弘文.坐标与属性一体化的空间聚类方法研究[J].地理与地理信息科学,2004,20(2):38-40.通过空间坐标和属性特征的有机结合,定义了3种空间距离,给出了基于空间距离的K平均系统聚类算法,对山东省生态环境质量进行了聚类分析和类型分区.实践表明,空间聚类方法可以较好地反映出空间位置邻近和属性特征相似的空间聚类要求.

DOI

[ Li X Y, Zheng X Q, Yan H W.On spatial clustering of combination of coordinate and attribute[J]. Geography and Geo-Information Science, 2004,20(2):38-40. ]

[6]
王海起,张腾,彭佳琦,等.空间加权距离的GIS数据Fuzzy C-means聚类方法与应用分析[J].地球信息科学学报,2013,15(6):854-861.Fuzzy c-means聚类常采用普通欧式距离进行相似性度量,对于地理空间对象来说,聚类不仅应考虑属性特征的相似性,还应考虑对象的空间邻近性。本文基于普通欧式距离提出了多种形式的空间加权距离公式,不同的距离公式分别在两个坐标方向、各属性上进行加权,权重向量既可以度量空间位置特征、属性特征的作用大小,也可度量位置距离在X、Y空间方向上的各向同性或异性程度。权重向量的获取以空间对象相似性的模糊函数为评价目标,通过动态学习率的梯度下降算法优化计算,并将空间加权距离引入到fuzzy c-means聚类算法中以取代普通欧式距离。本文以空间数据集Meuse为应用实例,分别采用不同形式的空间加权距离进行FCM模糊聚类,类数取为2-10类,通过PC、PE和Xie-Beni等聚类有效性指标的比较表明:空间加权距离的聚类效果要优于普通距离,且在空间数据聚类分析中,除属性信息外位置等空间特征信息同样起到了重要作用。

DOI

[ Wang H Q, Zhang T, Peng J Q, et al.Fuzzy c-means clustering for GIS data based on spatial weighted distance[J]. Journal of Geo-Information Science, 2013,15(6):854-861. ]

[7]
Gärtner T.Kernels for Structured Data[C]. Inductive Logic Programming, International Conference, ILP 2002, Sydney, Australia, July 9-11, 2002. Revised Papers. 2002:216.

[8]
耿协鹏,杜晓初,胡鹏.基于栅格距离变换的扩展对象空间聚类方法[J].测绘学报,2009,38(2):162-167.空间聚类是空间分析和空间数据挖掘的重要方法和研究内容。在地图代数中,通过建立栅格坐标与距离平方对应的栅格平方平面!计算栅格空间的最短距离,实现栅格距离变换。以栅格空间距离变换为基础,通过提取特征等距线,揭示简单的空间点集聚类过程,并将这种算法扩展到点$线$面实体混合分布空间!以及加权距离以及障碍空间的空间聚类,算法分析表明该算法简单、合理。

DOI

[ Geng X P, Du X C, Hu P.Spatial clustering method based on raster distance transform for extended objects[J]. Acta Geodaetica Et Cartographica Sinica, 2009,38(2):162-167. ]

[9]
Birant D, Kut A.ST-DBSCAN: An algorithm for clustering spatial-temporal data[J]. Data & Knowledge Engineering, 2007,60(1):208-221.Abstract This paper presents a new density-based clustering algorithm, ST-DBSCAN, which is based on DBSCAN. We propose three marginal extensions to DBSCAN related with the identification of (i) core objects, (ii) noise objects, and (iii) adjacent clusters. In contrast to the existing density-based clustering algorithms, our algorithm has the ability of discovering clusters according to non-spatial, spatial and temporal values of the objects. In this paper, we also present a spatial–temporal data warehouse system designed for storing and clustering a wide range of spatial–temporal data. We show an implementation of our algorithm by using this data warehouse and present the data mining results.

DOI

[10]
Zhou J, Guan J, Li P.DCAD: A dual clustering algorithm for distributed spatial databases[J]. Geo-spatial Information Science, 2007,10(2):137-144.Spatial objects have two types of attributes: geometrical attributes and non-geometrical attributes, which belong to two different attribute domains (geometrical and non-geometrical domains). Although geometrically scattered in a geometrical domain, spatial objects may be similar to each other in a non-geometrical domain. Most existing clustering algorithms group spatial datasets into different compact regions in a geometrical domain without considering the aspect of a non-geometrical domain. However, many application scenarios require clustering results in which a cluster has not only high proximity in a geometrical domain, but also high similarity in a non-geometrical domain. This means constraints are imposed on the clustering goal from both geometrical and non-geometrical domains simultaneously. Such a clustering problem is called dual clustering. As distributed clustering applications become more and more popular, it is necessary to tackle the dual clustering problem in distributed databases. The DCAD algorithm is proposed to solve this problem. DCAD consists of two levels of clustering: local clustering and global clustering. First, clustering is conducted at each local site with a local clustering algorithm, and the features of local clusters are extracted. Second, local features from each site are sent to a central site where global clustering is obtained based on those features. Experiments on both artificial and real spatial datasets show that DCAD is effective and efficient.

DOI

[11]
Tai C H, Dai B R, Chen M S.Incremental clustering in ge-ography and optimization spaces[C]. Proceedings of PAK-DD’07, 2007:272-283.

[12]
刘启亮,邓敏,石岩,等.一种基于多约束的空间聚类方法[J].测绘学报,2011,40(4):509-516.空间聚类是挖掘空间分布模式与探测空间异常的重要手段,已成为空间数据挖掘与知识发现领域的一个主要研究方向。随着空间聚类技术研究与应用的深入,迫切需要发展能够普适性的空间聚类算法。该算法一方面能够适应海量、分布复杂的空间数据(如任意形状的空间簇、噪声点及空间密度变化),另一方面又能够综合考虑空间邻近与专题属性相似,且人为干预较少。为此,本文借助Delaunay三角网构建空间邻近关系的优势,通过施加不同层次、不同类型的约束,提出了一种空间聚类的新算法。通过实验分析与比较发现,该算法可以探测复杂结构的空间簇,对噪声点稳健,并且能够同时顾及实体间空间位置与专题属性的相似性,从而验证了本文算法的有效性与优越性。

DOI

[ Liu Q L, Deng M, Shi Y, et al.A novel spatial clustering method based on multi-constraints[J]. Acta Geodaetica Et Cartographica Sinica, 2011,40(4):509-516. ]

[13]
焦利民,洪晓峰,刘耀林.空间和属性双重约束下的自组织空间聚类研究[J].武汉大学学报.信息科学版,2011,36(7):862-866.形式化定义了双重聚类的聚类准则及其判定方法,提出了双重聚类的两步法求解思路和自组织双重聚类算法。通过实例验证了该算法的可行性,自组织双重聚类可以发现非空间属性的聚集、延伸等空间分布特征,可以发现任意复杂形状的聚类,并降低了人为影响。

[ Jiao L M, Hong X F, Liu Y L.Self-organizing spatial clustering under spatial and attribute constraints[J]. Geomatics and Information Science of Wuhan University, 2011,36(7):862-866. ]

[14]
Oberle A P.GIS Concepts and ArcGIS methods[J]. Journal of Geography, 2004,103(6):271-271.People have been leaving rural environments and moving into urban environments. By 2007, the most people in the world will live in cities (United Nations 2002). Mexico illustrates this world trend closely. Mexico now publishes data on the Internet that can be used to study the movement of people within the country. A lesson is presented with selected new data connecting regional migration and job opportunities. Students create choropleth maps, analyze patterns, and make generalizations about patterns and processes.

DOI

[15]
Macqueen J.Some methods for classification and analysis of multivariate observations[A]. In 5th BerKeley Symp. Math. Statist. Prob[C]. 1967:281-297.

[16]
Ester M, Kriegel H P, Jiirg S, et al.A density based algorithm for discovering clusters in large spatial databases[A]. Proceedings of International Conference on Knowledge Discovery & Data Mining[C]. 1996:226-231.

[17]
Samsonova E V, KoK JNIjzerman A P. TreeSOM: Cluster analysis in the self-organizing map[J]. Neural NetworKs the Official Journal of the International Neural NetworK Society, 2006,19(6-7):935-949.ABSTRACT Clustering problems arise in various domains of science and engineering. A large number of methods have been developed to date. The Kohonen self-organizing map (SOM) is a popular tool that maps a high-dimensional space onto a small number of dimensions by placing similar elements close together, forming clusters. Cluster analysis is often left to the user. In this paper we present the method TreeSOM and a set of tools to perform unsupervised SOM cluster analysis, determine cluster confidence and visualize the result as a tree facilitating comparison with existing hierarchical classifiers. We also introduce a distance measure for cluster trees that allows one to select a SOM with the most confident clusters.

DOI PMID

[18]
Steinbach M, Klooster S, Torregrosa A, et al.Clustering Earth Science Data: Goals, Issues And Results[J]. Proc of the Fourth Kdd WorKshop on Mining Scientific Datasets, 2001.This paper reports on recent work applying data mining to the task of finding interesting patterns in earth science data derived from global observing satellites, terrestrial observations, and ecosystem models. Patterns are "interesting" if ecosystem scientists can use them to better understand and predict changes in the global carbon cycle and climate system. The initial goal of the work reported here (which is only part of the overall project) is to use clustering to divide the land and ocean areas of the earth into disjoint regions in an automatic, but meaningful, way that enables the direct or indirect discovery of interesting patterns. Finding "meaningful" clusters requires an approach that is aware of various issues related to the spatial and temporal nature of earth science data: the "proper" measure of similarity between time series, removing seasonality from the data to allow detection of non-seasonal patterns, and the presence of spatial and temporal autocorrelation (i.e., measured values that are close in time and space tend to be highly correlated, or similar). While we have techniques to handle some of these spatiotemporal issues (e.g., removing seasonality) and some issues are not a problem (e.g., spatial autocorrelation actually helps our clustering), other issues require more study (e.g., temporal autocorrelation and its effect on time series similarity). Nonetheless, by using the Kmeans as our clustering algorithm and taking linear correlation as our measure of similarity between time series, we have been able to find some interesting ecosystem patterns, including some that are well known to earth scientists and some that require further investigation.

[19]
Xue C, Song W, Qin L, et al.A spatiotemporal mining frameworK for abnormal association patterns in marine environments with a time series of remote sensing images[J]. International Journal of Applied Earth Observation and Geoinformation, 2015,38:105-114.A spatiotemporal mining framework is a novel tool for the analysis of marine association patterns using multiple remote sensing images. From data pretreatment, to algorithm design, to association rule mining and pattern visualization, this paper outlines a spatiotemporal mining framework for abnormal association patterns in marine environments, including pixel-based and object-based mining models. Within this framework, some key issues are also addressed. In the data pretreatment phase, we propose an algorithm for extracting abnormal objects or pixels over marine surfaces, and construct a mining transaction table with object-based and pixel-based strategies. In the mining algorithm phase, a recursion method to construct a direct association pattern tree is addressed with an asymmetric mutual information table, and a recursive mining algorithm to find frequent items. In the knowledge visualization phase, a “ Dimension–Attributes ” visualization framework is used to display spatiotemporal association patterns. Finally, spatiotemporal association patterns for marine environmental parameters in the Pacific Ocean are identified, and the results prove the effectiveness and the efficiency of the proposed mining framework.

DOI

[20]
McPhaden M J, ZebiaK S E, Glantz M H. ENSO as an integrating concept in earth science[J]. science, 2006,314(5806):1740-1745.The El Ni09o-Southern Oscillation (ENSO) cycle of alternating warm El Ni09o and cold La Ni09a events is the dominant year-to-year climate signal on Earth. ENSO originates in the tropical Pacific through interactions between the ocean and the atmosphere, but its environmental and socioeconomic impacts are felt worldwide. Spurred on by the powerful 1997-1998 El Ni09o, efforts to understand the causes and consequences of ENSO have greatly expanded in the past few years. These efforts reveal the breadth of ENSO's influence on the Earth system and the potential to exploit its predictability for societal benefit. However, many intertwined issues regarding ENSO dynamics, impacts, forecasting, and applications remain unresolved. Research to address these issues will not only lead to progress across a broad range of scientific disciplines but also provide an opportunity to educate the public and policy makers about the importance of climate variability and change in the modern world.

DOI PMID

[21]
Wolter K, Timlin M S.El Niño/Southern Oscillation behaviour since 1871 as diagnosed in an extended multivariate ENSO index (MEI. ext)[J]. International Journal of Climatology, 2011,31(7):1074-1087.Abstract El Ni09o/Southern Oscillation (ENSO) remains the most important coupled ocean–atmosphere phenomenon to cause global climate variability on seasonal to interannual time scales. This paper addresses the need for a reliable ENSO index that allows for the historical definition of ENSO events in the instrumental record back to 1871. The Multivariate ENSO Index (MEI) was originally defined as the first seasonally varying principal component of six atmosphere–ocean (COADS) variable fields in the tropical Pacific basin. It provides for a more complete and flexible description of the ENSO phenomenon than single variable ENSO indices such as the SOI or Ni09o 3.4 SST. Here we describe our effort to boil the MEI concept down to its most essential components (based on SLP, SST) to enable historical analyses that more than double its period of record to 1871–2005. The new MEI.ext confirms that ENSO activity went through a lull in the early- to mid-20th century, but was just about as prevalent one century ago as in recent decades. We diagnose strong relationships between peak amplitudes of ENSO events and their duration, as well as between their peak amplitudes and their spacing (periodicity). Our effort is designed to help with the assessment of ENSO conditions through as long a record as possible to be able to differentiate between ‘natural’ ENSO behaviour in all its rich facets, and the ‘Brave New World’ of this phenomenon under evolving GHG-related climate conditions. So far, none of the behaviour of recent ENSO events appears unprecedented, including duration, onset timing, and spacing in the last few decades compared to a full century before then. Copyright 08 2011 Royal Meteorological Society

DOI

[22]
Zhang Y, Wallace J M, Battisti D S.ENSO-like Interdecadal Variability: 1900-93[J]. Journal of Climate, 2010, 10(5):1004-1020.

[23]
薛存金,周成虎,苏奋振,等.面向过程的时空数据模型研究[J].测绘学报,2010,39(1):95-101.根据近20年来时空数据模型的研究现状、存在问题及其原因剖析,以连续渐变地理实体的表达、组织和存储为研究对象,提出面向过程的时空数据模型。根据连续渐变地理实体的内在特性,将其分级抽象为过程对象系列,进一步探讨过程对象及过程对象间逻辑关系,并设计其UML模型结构及物理存储结构。通过抽象的过程对象隐式地记录地理实体动态变化机制,及自定义的过程对象存储表提供演变机制的函数接口模式,实现连续渐变地理实体的过程化组织、存储与动态分析。最后,以海洋数据的过程化组织与分析为例,构建时空过程模型原型系统(海洋过程对象—关系数据库系统与功能分析平台),验证和评价该模型的实用性。

[ Xue C J, Zhou C H, Su F Z, et al.Research on process-oriented spatio-temporal dta model[J]. Acta Geodaetica Et Cartographica Sinica, 2010,39(1):95-101. ]

文章导航

/