地球信息科学学报 ›› 2017, Vol. 19 ›› Issue (6): 782-791.doi: 10.3724/SP.J.1047.2017.00782

• 地球信息科学理论与方法 • 上一篇    下一篇

GPU加速的改进PAM聚类算法研究与应用

周恩波1(), 毛善君1,*(), 李梅1, 孙振明2   

  1. 1. 北京大学遥感与地理信息系统研究所,北京 100871
    2. 中国矿业大学(北京)资源与安全工程学院,北京 100083
  • 收稿日期:2016-10-11 修回日期:2017-03-22 出版日期:2017-06-20 发布日期:2017-06-20
  • 通讯作者: 毛善君 E-mail:enbozhou@pku.edu.cn;sjmao@pku.edu.cn
  • 作者简介:

    作者简介:周恩波(1992-),男,江西丰城人,研究生,主要从事时空大数据分析与云计算方面研究。E-mail: enbozhou@pku.edu.cn

  • 基金资助:
    国家重点研发计划重点专项(2016YFC0801800)

Research and Application of Accelerating Improved PAM Clustering Algorithm by GPU

ZHOU Enbo1(), MAO Shanjun1,*(), LI Mei1, SUN Zhenming2   

  1. 1. Institute of Remote Sensing and Geographical Information System, Peking University, Beijing 100871, China
    2. Resources and Safety Engineering, China University of Mining and Technology (Beijing), Beijing 100083, China;
  • Received:2016-10-11 Revised:2017-03-22 Online:2017-06-20 Published:2017-06-20
  • Contact: MAO Shanjun E-mail:enbozhou@pku.edu.cn;sjmao@pku.edu.cn

摘要:

空间聚类是空间数据挖掘的重要方法,而K-Medoids是一种常用的空间聚类算法。K-Medoids聚类算法存在初始点选择问题,而且计算复杂。为了提高算法的有效性和时间效率,本文结合模拟退火算法思想,改进了传统的K-Medoids算法PAM,提出一种基于GPU计算的并行模拟退火PAM算法。类比矩阵乘法运算,定义了一种新的矩阵计算方法,可以有效减少数据在GPU全局内存和共享内存之间的传输,提高了算法在GPU中的执行效率。利用模拟退火算法搜索聚类中心点,保证了聚类结果的全局最优性。基于不同的数据集,将串行和并行模拟退火PAM算法以及已有的遗传PAM算法进行比较,结果表明并行模拟退火PAM算法聚类结果正确,且时间效率高。最后,应用本文改进算法对贵州省安监系统的安全监管隐患数据进行聚类分析,发现了隐患聚集中心,相关结果对政府的决策具有一定的实际应用价值。

关键词: K-Medoids, 模拟退火, GPU, 并行计算, 空间聚类分析

Abstract:

Spatial clustering is one of the most important methods in spatial data mining. As a common but powerful spatial clustering algorithm, K-Medoids is applied in many fields such as generalization of spatial entity information, spatial point pattern analysis and epidemiology application. However, K-Medoids algorithm meets two main challenges innately as follow. At first, K-Medoids has selection problem of the initial medoids. Different initial medoids may not attain the same clustering results which could lead to a non-optimal results sometimes. Furthermore, time efficiency of the algorithm is not satisfactory because there exist quantities of iterations to find the most suitable partition. Existing studies on the K-Medoids algorithm don’t take the validness and time efficiency into consideration at the same time. Optimal methods like the Genetic Algorithm are applied to improve the validness of K-Medoids but the time efficiency is not acceptable when dealing with growing data. The MapReduce model is utilized to handle with data of high volume which can’t adapt to some circumstances short of computer clusters. In order to improve the result validity and time efficiency of the algorithm, this paper revised the traditional K-Medoids algorithm of Partitioning Around Medoids (PAM) combining with the idea of the Simulate Anneal Arithmetic (SAA) and proposed a parallel Simulate Anneal Partitioning Around Medoids (SAPAM) algorithm which was implemented efficiently in Graphics Processing Units (GPUs). SAA algorithm is used to search for the initial medoids which promises the validness of the algorithm. The stochastic factor introduced in SAA algorithm gives the possibility of eliminating the local optima to attain the global optimal clustering results of PAM. To accelerate the clustering process, we design the parallel SAPAM algorithm to utilize quantities of GPU’s threads which execute the program at the same time. By analogy with the matrix multiplication, a new matrix computation method is defined to reduce the time consumption of data transfer between GPU’s global memory and shared memory. The matrix computation method reuses data in the shared memory of GPU and computes the distances between medoids and many points at a time which improve the algorithm’s performance evidently. To validate the proposed algorithm, we generated eight datasets with different attributes and sizes randomly and conducted experiments on the eight datasets to compare the proposed parallel SAPAM algorithm with the traditional PAM algorithm, sequential SAPAM algorithm and the parallel genetic K-Medoids algorithm. The experiment results showed that SAPAM algorithm attained more accurate clustering results compared with the traditional PAM and the parallel genetic K-Medoids algorithm. Besides, the proposed algorithm performed better than the sequential SAPAM algorithm and the parallel genetic K-Medoids algorithm in time efficiency. According to the results, our GPU-based SAPAM algorithm was four to eight times faster than the traditional PAM algorithm. The results demonstrate that the proposed method can execute efficiently and attain a valid result. Finally, SAPAM algorithm was applied to analyze the safety monitoring data of Guizhou province to get the clustering pattern of the safety threats. The clustering results show us several clusters of the safety threats which may provide some practical application value to the governor.

Key words: K-Medoids, Simulate Anneal Arithmetic, GPU, Parallel Computing, Spatial Clustering Analysis