GPU加速的改进PAM聚类算法研究与应用

周恩波; 毛善君; 李梅; 孙振明

doi:10.3724/SP.J.1047.2017.00782

地球信息科学学报 >

2017 , Vol. 19 >Issue 6: 782 - 791

DOI: https://doi.org/10.3724/SP.J.1047.2017.00782

地球信息科学理论与方法

GPU加速的改进PAM聚类算法研究与应用

周恩波 ^,¹ ,
毛善君 ^,¹^,^* ,
李梅 ¹ ,
孙振明 ²

展开

1. 北京大学遥感与地理信息系统研究所,北京 100871
2. 中国矿业大学(北京)资源与安全工程学院,北京 100083

*通讯作者：毛善君（1964-）,男,四川成都人,博士,教授,主要从事地理信息科学、数字矿山研究。E-mail: sjmao@pku.edu.cn

作者简介：周恩波（1992-）,男,江西丰城人,研究生,主要从事时空大数据分析与云计算方面研究。E-mail: enbozhou@pku.edu.cn

收稿日期: 2016-10-11

要求修回日期: 2017-03-22

网络出版日期: 2017-06-20

基金资助

国家重点研发计划重点专项（2016YFC0801800）

收起

Research and Application of Accelerating Improved PAM Clustering Algorithm by GPU

ZHOU Enbo ^,¹ ,
MAO Shanjun ^,¹^,^* ,
LI Mei ¹ ,
SUN Zhenming ²

Expand

1. Institute of Remote Sensing and Geographical Information System, Peking University, Beijing 100871, China
2. Resources and Safety Engineering, China University of Mining and Technology (Beijing), Beijing 100083, China;

*Corresponding author: MAO Shanjun, E-mail: sjmao@pku.edu.cn

Received date: 2016-10-11

Request revised date: 2017-03-22

Online published: 2017-06-20

Copyright

《地球信息科学学报》编辑部所有

Fold

摘要

空间聚类是空间数据挖掘的重要方法,而K-Medoids是一种常用的空间聚类算法。K-Medoids聚类算法存在初始点选择问题,而且计算复杂。为了提高算法的有效性和时间效率,本文结合模拟退火算法思想,改进了传统的K-Medoids算法PAM,提出一种基于GPU计算的并行模拟退火PAM算法。类比矩阵乘法运算,定义了一种新的矩阵计算方法,可以有效减少数据在GPU全局内存和共享内存之间的传输,提高了算法在GPU中的执行效率。利用模拟退火算法搜索聚类中心点,保证了聚类结果的全局最优性。基于不同的数据集,将串行和并行模拟退火PAM算法以及已有的遗传PAM算法进行比较,结果表明并行模拟退火PAM算法聚类结果正确,且时间效率高。最后,应用本文改进算法对贵州省安监系统的安全监管隐患数据进行聚类分析,发现了隐患聚集中心,相关结果对政府的决策具有一定的实际应用价值。

关键词： K-Medoids; 模拟退火; GPU; 并行计算; 空间聚类分析

本文引用格式

周恩波 , 毛善君 , 李梅 , 孙振明 . GPU加速的改进PAM聚类算法研究与应用[J]. 地球信息科学学报, 2017 , 19(6) : 782 -791 . DOI: 10.3724/SP.J.1047.2017.00782

Abstract

Spatial clustering is one of the most important methods in spatial data mining. As a common but powerful spatial clustering algorithm, K-Medoids is applied in many fields such as generalization of spatial entity information, spatial point pattern analysis and epidemiology application. However, K-Medoids algorithm meets two main challenges innately as follow. At first, K-Medoids has selection problem of the initial medoids. Different initial medoids may not attain the same clustering results which could lead to a non-optimal results sometimes. Furthermore, time efficiency of the algorithm is not satisfactory because there exist quantities of iterations to find the most suitable partition. Existing studies on the K-Medoids algorithm don’t take the validness and time efficiency into consideration at the same time. Optimal methods like the Genetic Algorithm are applied to improve the validness of K-Medoids but the time efficiency is not acceptable when dealing with growing data. The MapReduce model is utilized to handle with data of high volume which can’t adapt to some circumstances short of computer clusters. In order to improve the result validity and time efficiency of the algorithm, this paper revised the traditional K-Medoids algorithm of Partitioning Around Medoids (PAM) combining with the idea of the Simulate Anneal Arithmetic (SAA) and proposed a parallel Simulate Anneal Partitioning Around Medoids (SAPAM) algorithm which was implemented efficiently in Graphics Processing Units (GPUs). SAA algorithm is used to search for the initial medoids which promises the validness of the algorithm. The stochastic factor introduced in SAA algorithm gives the possibility of eliminating the local optima to attain the global optimal clustering results of PAM. To accelerate the clustering process, we design the parallel SAPAM algorithm to utilize quantities of GPU’s threads which execute the program at the same time. By analogy with the matrix multiplication, a new matrix computation method is defined to reduce the time consumption of data transfer between GPU’s global memory and shared memory. The matrix computation method reuses data in the shared memory of GPU and computes the distances between medoids and many points at a time which improve the algorithm’s performance evidently. To validate the proposed algorithm, we generated eight datasets with different attributes and sizes randomly and conducted experiments on the eight datasets to compare the proposed parallel SAPAM algorithm with the traditional PAM algorithm, sequential SAPAM algorithm and the parallel genetic K-Medoids algorithm. The experiment results showed that SAPAM algorithm attained more accurate clustering results compared with the traditional PAM and the parallel genetic K-Medoids algorithm. Besides, the proposed algorithm performed better than the sequential SAPAM algorithm and the parallel genetic K-Medoids algorithm in time efficiency. According to the results, our GPU-based SAPAM algorithm was four to eight times faster than the traditional PAM algorithm. The results demonstrate that the proposed method can execute efficiently and attain a valid result. Finally, SAPAM algorithm was applied to analyze the safety monitoring data of Guizhou province to get the clustering pattern of the safety threats. The clustering results show us several clusters of the safety threats which may provide some practical application value to the governor.

Key words： K-Medoids; Simulate Anneal Arithmetic; GPU; Parallel Computing; Spatial Clustering Analysis

1 引言

聚类分析将特征相近的空间实体划分为不同的组,使不同组间的差别尽可能大,同一组内的差别尽可能小^[1]。空间聚类是数据挖掘的重要组成内容,具有广泛的实际应用,如空间实体信息概括综合、空间点模式分析和流行病学分析等^[2]。K-Medoids是常用的空间聚类算法,利用中心点（Medoids）作为聚类中心,有效排除了K-Means算法中异常值的影响^[3]。Kaufman和Rousseeuw在1990年提出的PAM（Partitioning Around Medoids）被认为是最有效K-Medoids算法之一^[4],但PAM算法迭代次数较多,时间效率不高,且算法存在初始点选择问题,影响了算法准确性和时间效率。

针对PAM算法存在的问题,众多专家学者进行了研究：Kaufman等^[4]在PAM的基础上提出了CLARA算法,其利用PAM对抽样实体聚类,输出最好的聚类结果,有效地降低了时间复杂度;Lucasius等^[5]通过实验发现,CLARA算法在类别数较多时,聚类性能不高而且CLARA聚类结果严重依赖于样本抽取;Ng等^[6]提出一种基于抽样的K-Medoids,称为CLARANS。其每一步搜索均使用随机样本,并且搜索范围不局限在局部区域,算法结果较CLARA更准确;Vander Laan等^[7]提出一种最大化silhouette的算法,改善了PAM在小类簇上聚类结果;Zhang等^[8]利用中心点的不规则三角网搜寻最优中心点,在聚类精度不变的情况下提高了算法效率。

大数据时代来临,给时空数据处理带来了新要求。为处理海量时空数据,一些学者改进了K-Medoids,减少算法迭代次数和使算法并行化。Yue 等^[9]、张雪萍等^[10]、李静滨等^[11]和王永贵等^[12]等根据MapReduce模型修改了K-Medoids,并在开源分布式系统Hadoop上实现。李静滨^[13]在OpenMP平台上设计实现一种适合多核平台的并行算法。这些方法需要服务器集群,在某些应用场景不适用,且未解决初始点选择问题。He等^[14]提出一种基于MapReduce的高效密度聚类算法,该算法不存在初始点选择问题,需要人为选择阈值,降低了算法的自动性。

提高聚类算法的有效性是近期另外一个研究热点。王家耀等^[15]、Sheng等^[16]提出一种遗传K-均值算法,利用遗传算法解决K-均值易陷入局部最优的问题。Zhang等^[17]在遗传K-Medoids算法基础上,提出了一种考虑障碍物限制的聚类算法。Saha等^[18]提出一种基于模拟退火的类别数据聚类算法。张小鹏等^[19]结合模拟退火和K-Means算法,提出一种改进的聚类算法。这些算法具有鲁棒性,但处理大量数据时效率不高,甚至无法得到结果。

综上所述,现有研究没有兼顾K-Medoids算法的时间效率和有效性,无法适应大数据时代空间聚类的需求,聚类结果的有效性也需要提高。本文试图提出一种鲁棒且能高效执行的PAM改进算法。算法利用模拟退火思想解决PAM算法初始点选择问题,并行化算法提高时间效率,设计了并行模拟退火PAM算法,并在CUDA（Compute Unified Device Architecture）平台上实现。为验证所提算法,本文设计了几组实验,将并行模拟退火PAM算法、串行模拟退火PAM算法以及并行遗传K-Medoids进行比较。实验结果表明,并行模拟退火PAM算法在处理大规模数据时效率较高,并且能得到正确的结果。

2 算法基础

2.1 PAM算法

作为经典聚类算法,PAM算法原理简单,聚类效果好,应用广泛。考虑一个含n个对象的数据集,参数k表示类别数目,代价函数S表示各类中对象到中心对象的距离平方和,PAM算法的主要流程^[3]如图1所示。

View original graphic|Download|PPT slide

Fig. 1 Traditional PAM algorithm procedures

图1 经典PAM算法流程

PAM算法首先从数据集中随机选择k个对象作为初始中心,并将其他对象划分到最近的中心所在的类,计算此种划分的代价函数S。然后,在各类中寻找能使该类代价S减小的对象,临时替换原有中心,重新划分整个数据集,计算新划分下的代价函数

S ′

。若

ΔS = S ′ - S < 0

,用新的中心替换原有中心,重复以上过程,否则终止算法,输出结果。

相比K-Means算法,PAM算法用类中心替换原有类均值,有效消除了离群点对聚类结果的影响。因为搜索类中心需要大量迭代计算,PAM算法时间效率不高,且聚类结果受初始中心影响,无法保证结果的有效性。

2.2 模拟退火算法

退火过程^[20]从最高温度T₀开始降温。当温度为T时,系统处于热均衡状态,根据玻尔兹曼分布可以得出此时系统具有能量E的概率为：

P (E = k) = 1 Z (T) exp - k k b T

（1）

式中：Z（T）是正态分布函数;k_b是玻尔兹曼常数。当温度T下降,玻尔兹曼分布的范围集中于最低能量的区间,因为高能量状态意味着系统不稳定。当温度T很低时,系统能量也很低。只要降温过程足够慢,最终状态将具有最低能量。

类比退火过程,提出一般的模拟退火算法,算法步骤如下：

（1）从初始温度T₀开始,选择一组初始参数值,初始能量函数值为E;

（2）在参数空间中,随机选取一个靠近之状态点的新点,计算新的能量函数值;

（3）比较2个参数的对应的能量函数值,应用Metropolis准则进行处理。

令Δ=E_new-E_old,当且仅当位于（0,1）内的随机数U满足式（2）,改变系统当前状态,式（2）中的T表示当前系统的温度。式（2）等价于式（3）。

U ≤ ex p (- Δ / T),

（2）

E new ≤ E old - TlogU,

（3）

将E_new-E_old和一个均值为T的随机变量进行比较可以发现,如果函数值比之前的能量函数值低,就移动到新状态,而且在任意温度下,系统状态都可能向上移动（即能量函数增大）。当系统能量函数值降低时,接受新状态。当系统能量升高时,由于引入随机因素,以一定概率接受新状态,这个过程可以用流程图2表示。

View original graphic|Download|PPT slide

Fig. 2 Key procedures of SAA

图2 模拟退火的关键步骤

（4）无论系统状态是否移动,重复步骤（2）、（3）。在每一步,比较新状态的能量函数值和原来状态的函数值,直到按照某种标准,系统达到均衡状态。

（5）在某一温度下,一旦系统达到均衡状态,按照事先确定的降温幅度降温,随后从步骤（2）再开始进行,直到按照某种标准系统达到稳定,降温结束,系统处于“冷冻”状态。

2.3 CUDA

CUDA是由英伟达公司研制的并行计算平台和编程模型,其编程模式采用CPU和GPU的异构模型,CPU作为Host,GPU作为Device。在一个系统中,只存在一个Host,多个Device协同工作以提供计算力。CPU中存在多个逻辑控制单元,适合于事务处理和串行计算,GPU擅长提供计算力,适用于并行计算。一个CUDA程序由在CPU中顺序执行的函数和在GPU中并行执行的函数组成。在GPU中并行执行的函数称为核函数。CUDA程序的两部分协同工作,确保程序高效执行。

GPU中执行的核函数以Grid形式组织,每个Grid由若干个Block组成,而每个Block由许多线程组成。核函数的执行以Block为单位,不同的Block可以同时执行,但是他们之间不能通信。CUDA使用共享内存和同步机制来实现同一Block内的线程通信,有效地减少了数据传输。CUDA的架构见图3^[21]。

View original graphic|Download|PPT slide

Fig. 3 The architecture of CUDA

图3 CUDA的体系架构

3 基于GPU的模拟退火PAM算法

针对传统PAM算法的初始中心点选择和时间效率等问题,本文提出一种结合PAM和模拟退火算法的新方法,设计了并行流程,并在CUDA上实现。模拟退火算法在聚类中引入了随机因素,能有效地避免结果陷入局部最优,保证聚类结果的正确性,而算法并行化则提高了时间效率。

3.1 模拟退火PAM算法

模拟退火PAM算法见表1。与退火过程类似,首先初始化一些参数,包括初始温度、止温度和降温系数等。每一次冷却过程均设置一次循环,增加算法随机性。每次循环中,首先找到非中心点,替换原有中心点。产生新中心点的伪代码如表2 所示。

Tab. 1 The proposed new algorithm

表1 提出的新算法

步骤	过程
1	初始化初始温度T₀,终止温度T_E,状态温度T_K,内层循环次数InnerLoop和冷却系数DecRate等参数
2	随机选取k个初始中心点,计算目标函数值F(k)
3	设置循环变量MaxInnerLoop=0
4	计算每个对象和中心点之间的距离,将目标划入最近的类中并重新产生中心点
5	计算新的目标函数值F(k+1),如果F(k+1)<F(k),接受新状态,否则计算P=exp(-(F(k+1)-F(k))/T_K),并产生一个0到1之间的随机数α,若P>α,接受新状态,否则拒绝新状态并且返回前一状态
6	如果MaxInnerLoop<InnerLoop,MaxInnerLoop加1并转到第4步,否则转向第7步
7	如果T_K<T_E,算法终止,输出结果,否则条令T_K=T_K×DecRate

Tab. 2 The pseudocode which generates new medoids

表2 产生新中心点的伪码

输入参数：待聚类的数据数组data,数据总数dataCount,老中心点数组oldCenter,新中心点数组newCenter,聚类数目clusterCount 输出参数：无
void GetNewCenters(data, dataCount, oldCenter, newCenter, clusterCount){ 定义循环变量i; for (i = 0; i < clusterCount; i++){ do{ 产生一个0到dataCount之间的随机整数r; } while (第r个对象data[r]不在第i个老中心点oldCenter[i]所在的类,或者第r个对象data[r]是中心点); newCenter[i] = data[r]; } }

产生新中心点之后,计算新状态下目标函数值,目标函数定义见式（4）。

F (k) = ∑ i = 1 k ∑ p - m ki 2

（4）

式中：m_ki是第i个类在第k次循环中的中心点。

如果目标函数值F（k+1）<F（k）,接受新的对象划分;否则,产生一个0到1之间的随机数字α,并计算概率P。根据Metropolis准则,如果P>α,接受新的划分;如果P<α,拒绝新状态并且返回前一状态;如果循环次数未达到设定值,继续迭代。

在寻找聚类中心的迭代过程中,系统状态可能陷入局部最优状态,导致最终的聚类结果不理想。模拟退火的过程引入随机因素,使系统具有一定概率跳出局部最优,达到全局最优状态,得到较好的聚类结果。

3.2 算法并行化

从表1可看出,模拟退火PAM算法存在多次迭代,聚类过程慢,限制了算法的适用范围。为提高算法效率,考虑并行化算法,并在CUDA上实现。模拟退火PAM算法耗时的步骤主要是每次循环中将对象分类和计算目标函数。考虑到CUDA的架构,可将划分对象和计算代价函数2个步骤在GPU中执行,剩余部分主要包括逻辑控制,在CPU中执行。算法不同步骤的执行安排充分利用了GPU强大的计算能力和CPU逻辑控制单元的优势。

为了在GPU中实现不同对象的划分,定义一种类似于矩阵乘法的矩阵计算方法。在GPU中,对象以n×d维矩阵M的形式表示,其中n是对象个数,d是对象属性的个数。同时,中心点也以d×k维矩阵N表示,其中k是中心点的个数。

M = m 11 m 12 ⋯ m 1 d m 21 m 22 ⋯ m 2 d ⋮ ⋮ ⋮ ⋮ m n 1 m n 2 ⋯ m nd

（5）

N = n 11 n 12 ⋯ n 1 k n 21 n 22 ⋯ n 2 k ⋮ ⋮ ⋮ ⋮ n d 1 n d 2 ⋯ n dk

（6）

定义一个新的运算符⊕,C=M⊕N是一个n×k矩阵,其中

c ij = ∑ m = 1 d (m im - n mj) 2

（7）

当得到矩阵C,那么第i个对象属于

min {c i 1, c i 2, c i 3, …, c ik}

的第二个下标表示的类。例如,c_is=

min {c i 1, c i 2, c i 3, …, c ik}

,那么第i个对象属于第s类。

在GPU中执行的函数称为核函数,对象划分核函数的伪码如表3所示。在GPU的共享内存和全局内存之间往往存在大量的数据传输,为了减少数据传输,使用共享内存来重用GPU中的数据。

Tab. 3 The pseudocode which partitions the objects

表3 对象划分的伪码

输入参数：待聚类的数据数组data,中心点数组centers,聚类结果数组result,数据总数dataCount,聚类数目clusterCount,数据字段数dimensionCount
输出参数：无

__global__ void allocationKernel(data, centers, result, dataCount, clusterCount, dimensionCount){
在共享内存中声明存储数据的数组da[];
在共享内存中声明存储非中心点对象到各中心点的距离的数组dis[];
//对数据矩阵中每一行循环,blockIdx.x表示GPU中Block序号,NUM_BLOCKS表示定义的Block总数
for (int ij = blockIdx.x; ij < n; ij += NUM_BLOCKS){
//将矩阵中的每一行从全局内存中拷贝到共享内存中以重用数据,减少数据传输量
// threadIdx.x表示一个Block内的线程序号,blockDim.x表示一个Block中线程总数
for (int i = threadIdx.x; i < d; i += blockDim.x)
da[i] = data[ij*d + i];
__syncthreads();//线程同步
//使用同一个block中的线程计算a⊕b
for (int j = threadIdx.x; j < k; j += blockDim.x)
calculate dis[j];
__syncthreads();//线程同步
使用每个Block中序号为1的线程计算对象分配结果;
}
}

计算目标函数值的核函数也利用了大量线程来得到离差平方和,考虑到论文篇幅限制,此部分代码未给出。

4 实验与分析

4.1 实验环境和数据

实验所用电脑的详细配置参数如表4所示。

Tab. 4 The experimental computer environment

表4 实验用计算机配置

参数	值
CPU	Intel(R) Core(TM) i7-4710HQ CPU @2.50GHz(2501 MHz)
内存	8G
GPU	NVIDIA GeForce GTX 860M
GPU 内存	4095 MB

为验证提出的算法,随机产生了8组数据集,其描述信息见表5,数据内前2个字段表示对象的地理位置。数据集1、3、4、6和8聚类数目相同,对象个数不同。数据集2和3、5和6以及7和8对象个数相同,聚类数目不同。图4给出了数据集1的分布以及聚类结果,其中图4（a）为数据集1分布,图4（b）表示模拟退火PAM聚类结果,图4（c）为并行遗传PAM聚类结果,图4（d）表示传统PAM聚类结果。比较图4（b）和4（d）可以发现,模拟退火过程使结果得到了改善。

View original graphic|Download|PPT slide

Fig. 4 The distribution of dataset 1 and clustering results

图4 数据集1分布及聚类结果

Tab. 5 The five datasets used in the experiments

表5 实验所用得5组数据集

数据集	对象数/个	字段数/个	聚类数目/个
1	10 000	5	4
2	20 000	5	3
3	20 000	5	4
4	50 000	5	4
5	100 000	5	6
6	100 000	5	4
7	200 000	5	5
8	200 000	5	4

4.2 实验结果比较与分析

为验证本文算法,实现了串行模拟退火PAM和并行模拟退火PAM算法,并根据文献[22]-[23]设计实现了并行遗传K-Medoids算法。基于表5中的数据,比较分析了传统PAM、串行模拟退火PAM、并行模拟退火PAM以及并行遗传K-Medoids 4种算法,结果分析如下。

4.2.1 数据集内结果比较

从聚类结果来看,模拟退火PAM结果优于传统PAM,如图4（b）和4（d）所示。模拟退火PAM算法的串并行程序在同一数据集上结果完全相同,如图4（b）所示。并行模拟退火PAM和遗传PAM算法的结果略有不同,从图4（b）和4（c）中可看出二者结果差异不大。现有的空间聚类有效性评价方法大致包括外部评价法、内部评价法和相对评价法^[3]。由于实验数据没有预先指定的正确聚类结果,故采用内部法评价,即比较不同方法结果的簇内方差,方差小者较优,簇内方差见式（4）。利用簇内方差比较传统PAM、并行遗传K-Medoids和模拟退火PAM,结果见表6。从表6可以发现,相比传统PAM,模拟退火PAM结果的簇内方差更小,聚类结果更优;相比并行遗传K-Medoids,模拟退火PAM在数据集5和7上簇内方差略大,其他数据集均小于并行遗传K-Medoids,二者聚类结果相当。

Tab. 6 Clustering results comparison among traditional PAM, parallel GA K-Medoids and SAPAM

表6 传统PAM、并行遗传K-Medoids和模拟退火PAM聚类结果比较

数据集	聚类结果平方误差
数据集	传统PAM	并行遗传K-Medoids	模拟退火PAM
1	2.09529E+07	1.77726E+07	1.74360E+07
2	3.69280E+07	3.24828E+07	3.24792E+07
3	6.27123E+07	3.29482E+07	3.27136E+07
4	9.01593E+07	8.59827E+07	8.40918E+07
5	2.61621E+08	1.97264E+08	1.99169E+08
6	4.19342E+08	2.70153E+08	2.69891E+08
7	1.10902E+09	7.21299E+08	7.22798E+08
8	1.01737E+09	8.89928E+08	8.83764E+08

从时间效率来看,并行模拟退火PAM算法显著优于串行模拟退火PAM算法,略优于并行遗传PAM算法。图5显示并行模拟退火PAM在各数据集上的运行时间均比串行算法小,加速比t_串行/t_并行均大于4,可见GPU并行加速效果较好。相比于并行遗传PAM算法,并行模拟退火PAM算法在小数据集上效率略低,在大数据集上则较优,这与不同算法优化策略有关。

View original graphic|Download|PPT slide

Fig. 5 The comparison of the result

图5 实验结果比较

4.2.2 数据集间结果比较

串并行模拟退火PAM聚类和并行遗传PAM程序运行时间均随数据集大小变化而变化。图6给出了3种算法时间-数据集大小曲线,图7给出了串并行模拟退火PAM算法的加速比-数据集大小曲线。图6显示3种算法的运行时间均随数据集增大而增大。串行算法运行时间与数据集大小大致成正比,而并行算法运行时间增长较慢,呈现非线性变化。随着数据集增大,多线程并行以及共享内存中数据重用的时间收益掩盖了数据传输时间损耗。由此可以看出,GPU计算具有规模效应,即时间效率随数据集增大而提高。图7显示加速比随数据集增大而增大,数据集较大时增速减缓,表明当GPU计算能力存在饱和现象,加速比具有极限。

View original graphic|Download|PPT slide

Fig. 6 The curve of time-dataset size

图6 时间-数据集大小曲线

View original graphic|Download|PPT slide

Fig. 7 The curve of speedup ratio-dataset size

图7 加速比-数据集大小曲线

4.3 算法应用

企业安全生产事关社会稳定,人民生命财产安全,各类安全事故和隐患是中国安监部门监管工作的重点。为了更好地开展安全监管工作,给相关部门提供决策支持,有必要对事故和隐患数据进行分析,以发现隐患空间分布的内在规律。鉴于此,对贵州省安监局2010-2016年的30 047条隐患数据（表7）进行空间聚类分析。表7中ID表示隐患唯一标识,lng字段表示隐患发生位置的经度,lat表示隐患发生地点的纬度。

Tab. 7 The safety threats data of Guizhou province between 2010 and 2016 （only part fields）

表7 贵州省2010年至2016年安全隐患数据（部分字段）

ID	lng	lat	隐患地点	隐患级别	隐患描述
0	106.11130770827423	27.10817188123285	永贵能源开发有限责任公司黔西县甘棠镇新田煤矿	1	1402回风顺槽工作面前探支护使用不可靠（前探梁使用单环,且其上部未用背板铺设严实）
1	105.3180498760137	27.487599915062724	毕节市七星关区对坡镇先明煤矿	1	井底水仓温度传感器未送有资质部门进行安全性能校检
2	107.48110251895538	27.077516930920083	瓮安县成功磷化有限公司	1	现场职业危害告知卡破损,未及时更新
…	…	...	…	…	…
30046	105.27592194191595	26.154829632942278	贵州路鑫喜义工矿股份有限公司六枝特区中寨乡凸山田煤矿	1	顶板管理

首先将隐患地理坐标进行转换与投影,参考坐标系为Xian_1980_3_Degree_GK_CM_108E,利用本文并行模拟退火PAM算法对隐患数据聚类,距离采用欧式距离度量。实验结果如图8所示,其中图8（a）、（b）、（c）和（d）分别是隐患数据在类别数k为4、5、6和7时的聚类结果。从图8可看出,本文算法能正确地对安全隐患数据进行聚类。当聚类数目为4时,隐患主要集中为贵州省东北部、中部、西部以及西南部;当聚类数目为5时,中部的隐患聚集区分裂为2个;当聚类数目增加到6个时,西南和中部的隐患聚集区进一步分裂,达到5个;最后当聚类数目增加到7个时,东北聚集区继续减小。随着聚类类别数k的增大,贵州省东北部的类簇一直保持稳定,说明该区域的隐患位置较为集中,存在一些大的安全隐患集中区。而当k逐渐增大时,西南和中部的类簇逐渐分裂为几个类簇,说明西南部的隐患多分散聚集于小中心。通过进一步的计算分析可以发现隐患中心,促进隐患中心区域企业单位进行整顿整改。根据聚类结果,还可以对各区域的安全状况进行评价等。

View original graphic|Download|PPT slide

Fig. 8 The clustering result of safety threats data of Guizhou Province

图8 贵州省安全隐患数据不同类别数的聚类结果

5 结论

针对空间聚类算法PAM的初始中心点选择和收敛慢等问题,本文提出一种基于GPU的改进算法。主要研究内容包括：结合模拟退火和传统PAM算法,利用模拟退火算法引入的随机特性,保证聚类结果的有效性;设计了算法并行化策略,并在CUDA上实现,有效提高了时间效率;提出了一种矩阵计算方法,重复利用GPU共享内存中的数据,减少数据在CPU和GPU内存间传输的时间损耗。

为验证算法有效性与时间效率,本文进行了一系列实验,并比较了串并行模拟退火PAM算法和同类的并行遗传PAM算法。实验结果表明,并行模拟退火PAM算法能正确聚类且时间效率较高。从实验结果可以发现,GPU的加速效果随着数据集的增大而呈现非线性提高,即具有规模效应;GPU加速具有饱和效应,加速比可能存在最大值;GPU的加速效应取决于GPU多线程带来的时间收益和数据传输时间滞后效应,为提高加速效果,需要减少数据传输并利用GPU多线程来掩盖时间损耗。最后,应用所提算法分析贵州省安全隐患数据,结果进一步证明算法的有效性。

今后,将研究程序进一步优化策略,减少线程间同步和数据传输的时间损失,并尝试将所提算法集成到现有GIS平台中。此外,结合GPU计算和分布式计算,利用GPU集群提高算法时间效率也是未来研究方向之一。

The authors have declared that no competing interests exist.

参考文献

原文顺序 | 文献年度倒序 | 文中引用次数倒序

[1]	Han J, Kamber M, Tung A K H. Spatial clustering methods in data mining: A survey[M]// In H. J. Miller & J. Han (Eds.), Geographic data mining and knowledge discovery. London and New York: Taylor and Francis, 2001.

[2]

李德仁,李德毅.论空间数据挖掘和知识发现的理论与方法[J].武汉大学学报·信息科学版,2002,27(3):221-233.

首先分析了空间数据挖掘和知识发现(SDMKD)的内涵和外延;然后分别研究了用于SDMKD的概率论、证据理论、空间统计学、规则归纳、聚类分析、空间分析、模糊集、云理论、粗集、神经网络、遗传算法、可视化、决策树、空间在线数据挖掘等理论和方法及其进展;最后展望了SDMKD的发展前景.

DOI

[ Li D

, Li D

.Theories and technologies of spatial data mining and knowledge discovery[J]. Geomatics and Information Science of Wuhan University, 2002,27(3):221-233. ]

[3]	邓敏,刘启亮,李光强,等.空间聚类分析及应用[M].北京:科学出版社,2011:43-155. [ Deng M, Liu Q L, Li G Q, et al.Spatial clustering analysis and application[M]. Beijing: Science Press, 2011:43-155. ]

[4]	Kaufman L, Rousseeuw P J.Finding groups in data: an introduction to cluster analysis[M]. New York: John Wiley & Sons, 2009:68-163.

[5]

Lucasius C

, Dane A

, Kateman

.On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasiblity and comparison[J]. Analytica Chimica Acta, 1993,282(3):647-669.

A novel approach to the problem of κ-medoid clustering of large data sets is presented, using a genetic algorithm. Genetic algorithms comprise a family of optimization methods based loosely upon principles of natural evolution. They have proven to be especially suited to tackle complex, large-scale optimization problems efficiently, including a rapidly growing variety of problems of practical utility. Our pilot study lays emphasis on the feasibility of GCA - our genetic algorithm for κ-medoid clustering of large datasets - and provides some background information to elucidate differences with traditional approaches. The experimental part of this study is done on the basis of artificial data sets and includes a comparison with CLARA - another approach to κ-medoid clustering of large data sets, introduced recently

DOI

[6]

Ng R

, Han

.CLARANS: A method for clustering objects for spatial data mining[J]. IEEE transactions on knowledge and data engineering, 2002,14(5):1003-1016.

Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, the paper investigates how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, the paper develops two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.

DOI

[7]

Van der Laan

, Pollard

, Bryan

. A new partitioning around medoids algorithm[J]. Journal of Statistical Computation and Simulation, 2003,73(8):575-584.

Kaufman and Rousseeuw (1990) proposed a clustering algorithm Partitioning Around Medoids (PAM) which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common context that many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM does have problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing a criteria "Average Silhouette" defined by Kaufman and Rousseeuw (1990). We also propose a fast-to-compute approximation of "Average Silhouette". We implement these two new partitioning around medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.

DOI

[8]	Zhang Q, Couloigner I.A new and efficient k-medoid algorithm for spatial clustering[C]//International Conference on Computational Science and Its Applications. Springer Berlin Heidelberg, 2005:181-189.

[9]	Yue J, Mao S, Li M, et al.An efficient PAM spatial clustering algorithm based on MapReduce[C]. Proceedings of 2014 22^nd International Conference on Geoinformatics, 2014:1-6.

[10]

张雪萍,龚康莉,赵广才.基于MapReduce 的 K-Medoids 并行算法[J].计算机应用,2013,33(4):1023-1025.

为了解决传统K-Medoids聚类算法在处理海量数据信息时所面临的内存容量和CPU处理速度的瓶颈问题，在深入研究K-Medoids算法的基础之上，提出了基于MapReduce编程模型的K-Medoids并行化算法思想。Map函数部分的主要任务是计算每个数据对象到簇类中心点的距离并（重新）分配其所属的聚类簇；Reduce函数部分的主要任务是根据Map部分得到的中间结果，计算出新簇类的中心点，然后作为中心点集给下一次MapReduce过程使用。实验结果表明：运行在Hadoop集群上的基于MapReduce的K-Medoids并行化算法具有较好的聚类结果和可扩展性，对于较大的数据集，该算法得到的加速比更接近于线性。

DOI

[ Zhang X

, Gong K

, Zhao G

.Parallel K-Medoids algorithm based on MapReduce[J]. Journal of Computer Applications, 2013,33(4):1023-1025. ]

[11]	李静滨,杨柳,陈宁江.基于MapReduce的改进K-Medoids并行算法[J].广西大学学报:自然科学版,2014,39(2): 341-345. [ Li J B, Yang L, Chen N J.An improved parallel K-Medoids algorithm based on MapReduce[J]. Journal of Guangxi University: Nat Sci Ed, 2014,39(2):341-345. ]

[12]

王永贵,戴伟,武超.一种基于Hadoop的高效K-Medoids并行算法[J].计算机工程与应用,2015,51(16):47-54.

针对传统K-Medoids算法对初始聚类中心敏感、收敛速度慢,以及在大数据环境下所面临的内存容量和CPU处理速度的瓶颈问题,从改进初始中心选择方案和中心替换策略入手,利用Hadoop分布式计算平台结合基于Top K的并行随机采样策略,实现了一种高效稳定的K-Medoids并行算法,并且通过调整Hadoop平台,实现算法的进一步优化。实验证明,改进的K-Medoids算法不仅有良好的加速比,其收敛性和聚类精度均得到了改善。

DOI

[ Wang Y

, Dai

, Wu

.Highly efficient parallel algorithm of K-Medoids based on Hadoop platform[J]. Computer Engineering and Applications, 2015,51(16):47-54. ]

[13]

李静滨,杨柳,华蓓.基于多核平台并行K-Medoids算法研究[J].计算机应用研究,2011,28(2):498-500.

分析K-Medoids算法的内在并行性,设计一个适合多核平台的并行算法,并利用OpenMP进行实验。实验结果表明,并行算法对多核环境有很好的适应性,在双核及四核计算机上均获得了较好的加速比与运行效率。

DOI

[ Li J

, Yang

, Hua

.Research on parallel K-Medoids algorithm based on multi-core platform[J]. Application Research of Computers, 2011,28(2):498-500. ]

[14]	He Y, Tan H, Luo W, et al.MR-DBSCAN: An efficient parallel density-based clustering algorithm using MapReduce[C]. Proceedings of Parallel and Distributed Systems (ICPADS), 2011:473-480.

[15]

王家耀,张雪萍,周海燕. 一个用于空间聚类分析的遗传K-均值算法[J].计算机工程,2006,32(3):188-190.

空间数据挖掘是数据挖掘的一个新的分支，空间聚类分析是空间数据挖掘中的一个重要研究课题。本文在分析遗传算法及K-均值算法的优越性和不足的基础上，设计了一种遗传K-均值空间聚类分析算法，该算法兼顾了局部收敛和全局收敛性能。实验表明，其结果优于传统K-均值聚类方法及单纯的遗传算法聚类。

DOI

[ Wang J

, Zhang X

, Zhou H

.A Genetic K-means Algorithm for Spatial Clustering[J]. Computer Engineering, 2006,32(3):188-190. ]

[16]

Sheng

, Liu

.A genetic k-medoids clustering algorithm[J]. Journal of Heuristics, 2006,12(6):447-466.

ABSTRACT We propose a hybrid genetic algorithm for k-medoids clustering. A novel heuristic operator is designed and integrated with the genetic algorithm to fine-tune the search. Further, variable length individuals that encode different number of medoids (clusters) are used for evolution with a modified Davies-Bouldin index as a measure of the fitness of the corresponding partitionings. As a result the proposed algorithm can efficiently evolve appropriate partitionings while making no a priori assumption about the number of clusters present in the datasets. In the experiments, we show the effectiveness of the proposed algorithm and compare it with other related clustering methods.

DOI

[17]	Zhang X, Wang J, Wu F, et al.A novel spatial clustering with obstacles constraints based on genetic algorithms and K-medoids[C]. Proceedings of Sixth International Conference on Intelligent Systems Design and Applications, 2006,1:605-610.

[18]	Saha I, Mukhopadhyay A.Genetic algorithm and simulated annealing based approaches to categorical data clustering[C]. Proceedings of 2008 IEEE Region 10 and the Third international Conference on Industrial and Information Systems. IEEE, 2008:1-6.

[19]	张小朋,钱海忠,岳辉丽,等.基于模拟退火的空间聚类算法[J].测绘科学技术学报,2010,27(4):306-309. [ Zhang X P, Qian H Z, Yue H L, et al.Simulated-annealing-based spatial clustering algorithm[J]. Journal of Geomatics Science and Technology, 2010,27(4):306-309. ]

[20]	Brooks S P, Morgan B J T. Optimization using simulated annealing[J]. The Statistician, 1995,44(2):241-257.

[21]	张舒,褚艳利. GPU高性能运算之CUDA[M].北京:中国水利水电出版社,2009:14-43. [ Zhang S,Chu Y L.GPU high-performance computation: CUDA [M]. Beijing: China Water&Power Press, 2009:14-43. ]

[22]

戴文华,焦翠珍,何婷婷三.基于并行遗传算法的 K-means 聚类研究[J].计算机科学,2008,35(6):171-174.

针对传统K-means聚类算法对初始聚类中心的选择敏感,以及聚类数K难以确定的问题,提出一种基于并行遗传算法的K-means聚类方法.该方法采用一种新型的可变长染色体编码方案,随机选择样本点作为初始聚类中心形成染色体,然后结合K-means算法的高效性和并行遗传算法的全局优化能力,通过种群内的遗传、变异和种群间的并行进化、联姻,有效地避免了局部最优解的出现,同时得到了优化的聚类数目和聚类结果.实验表明该方法是一种精确高效的聚类方法.

DOI

[ Dai W

, Jiao C

, He T

.Research of K-means clustering method based on parallel genetic algorithm[J]. Computer Science, 2008,35(6):171-174. ]

[23]

贾瑞玉,管玉勇,李亚龙.基于MapReduce模型的并行遗传k-means聚类算法[J].计算机工程与设计,2014,35(2):657-660.

为了提高遗传k-means算法时间效率和聚类结果的正确率,利用遗传算法的粗粒度并行化设计思想,提出了在Hadoop平台下将遗传k-means算法进行并行化设计。将各个子种群编号作为个体区分,个体所包含的各个聚类中心和其适应度作为值共同作为个体的输入;在并行化过程中,设计了较优的种群迁移策略来避免早熟现象的发生。实验对不同的数据集进行处理,实验结果表明,并行化的遗传k-means算法在处理较大数据集时比传统的串行算法在时间上和最后的结果上都具有明显的优越性。

DOI

[ Jia R

, Guan Y

, Li Y

.Parallel K-Means clustering algorithm based on MapReduce model[J]. Computer Engineering and Design, 2014,35(2):657-660. ]

Options

文章导航

模态框（Modal）标题

摘要

本文引用格式

Abstract

1 引言

2 算法基础

2.1 PAM算法

Fig. 1 Traditional PAM algorithm procedures

2.2 模拟退火算法

Fig. 2 Key procedures of SAA

2.3 CUDA

Fig. 3 The architecture of CUDA

3 基于GPU的模拟退火PAM算法

3.1 模拟退火PAM算法

Tab. 1 The proposed new algorithm

Tab. 2 The pseudocode which generates new medoids

3.2 算法并行化

Tab. 3 The pseudocode which partitions the objects

4 实验与分析

4.1 实验环境和数据

Tab. 4 The experimental computer environment

Fig. 4 The distribution of dataset 1 and clustering results

Tab. 5 The five datasets used in the experiments

4.2 实验结果比较与分析

Tab. 6 Clustering results comparison among traditional PAM, parallel GA K-Medoids and SAPAM

Fig. 5 The comparison of the result

Fig. 6 The curve of time-dataset size

Fig. 7 The curve of speedup ratio-dataset size

4.3 算法应用

Tab. 7 The safety threats data of Guizhou province between 2010 and 2016 （only part fields）

Fig. 8 The clustering result of safety threats data of Guizhou Province

5 结论

参考文献