地球信息科学学报 ›› 2015, Vol. 17 ›› Issue (5): 538-546.doi: 10.3724/SP.J.1047.2015.00538

• 本期要文(可全文下载) • 上一篇    下一篇

基于密度的线数据分组算法研究

魏海涛1,2, 杜云艳2, 许开辉1,2, 吴笛2, 易嘉伟2, 莫洋2, 刘张2   

  1. 1. 山东科技大学测绘科学与工程学院, 青岛266510;
    2. 中国科学院地理科学与资源研究所资源与环境信息系统国家重点实验室, 北京100101
  • 收稿日期:2014-12-26 修回日期:2015-02-03 出版日期:2015-05-10 发布日期:2015-05-10
  • 通讯作者: 杜云艳(1973-),女,副研究员,研究方向为海洋信息技术及应用。E-mail:duyy@lreis.ac.cn E-mail:duyy@lreis.ac.cn
  • 作者简介:魏海涛(1979-),女,博士生,研究方向为海洋数据并行计算。E-mail:Weiht@lresi.ac.cn
  • 基金资助:

    海洋公益性专项项目"海洋环境信息云计算与云服务体系框架应用研究"(201105033);海洋预报综合信息系统(MiFSIS)研究应用项目(201105017)。

A Line Grouping Algorithm Based on Density

WEI Haitao1,2, DU Yunyan2, XU Kaihui1,2, WU Di2, YI Jiawei2, MO Yang2, LIU Zhang2   

  1. 1. Shandong University of Science and Technology, Qingdao 266510, China;
    2. LREIS, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China
  • Received:2014-12-26 Revised:2015-02-03 Online:2015-05-10 Published:2015-05-10

摘要:

目前,地理空间数据面临着由于数据量膨胀和计算量高速增长而引起算法效率低的问题,采用"分而治之"的数据分组策略提高运算效率已成为研究的热点。面向分布不均匀的线数据,本文提出了基于密度的线数据分组算法(简称LGAD)。首先,算法通过查找高密度区提取样本线段,保证了分组算法的起点落到高密区;其次,考虑线空间拓扑关系的复杂性,引用水平、垂直和夹角距离度量线段间距离,创建样本线段与其他线段的距离矩阵;最后,以距离矩阵和最优选择方法实现数据负载均衡分组。实验结果显示,对数据分组和分组后数据进行线段聚类的2个过程中,该算法体现了较好的时间优势,与串行计算相比,在分组数为2-12 时,平均比率达4.3,提高了应用的响应速度,具有较好的实际意义。

关键词: 分而治之, 负载均衡, 并行计算, 分布不均匀, 线数据分组

Abstract:

Parallel computing provides a promising solution to accelerate complicated spatial data processing, which is becoming increasingly computational intense. Partitioning large datasets into workload-balanced subgroups remains a challenge, particularly for unevenly distributed spatial data. In this study, a density-based data grouping algorithm was developed to tackle the partition problem for large line data. The algorithm includes three procedures: (1) extracting representative segment samples based on data density distribution; (2) generating a distance matrix between segment samples and the rest of the data by using three line distance measurements into calculations; (3) grouping line segments with data load balanced. Experiments show that the algorithm is able to partition large line data efficiently and evenly into equally sized sub-groups. The speed-up ratios of parallel interpolation save up to 65% of the execution time in comparison with consequential interpolation. A high efficiency of parallel computing was achieved when the datasets were divided into an optimal number of child data groups.

Key words: divide and rule, balanced workloads, parallel computing, unevenly distributed data, line segments