A Model of Parallel Mosaicking for Massive Remote Sensing Images Based on Self-defined RDD

JING Weipeng; HUO Shuaiqi

doi:10.3724/SP.J.1047.2017.01346

Journal of Geo-information Science, 2017, Vol. 19, Issue 10: 1346-1354

Remote Sensing Science and Application Technology

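College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China

About the author: JING Weipeng (1979-), male, Ph.D., associate professor. His research interests include parallel computing, distributed computing, and spatial data mining. E-mail: nefujwp@163.com

Received date: 2017-02-28

Revision requested date: 2017-06-09

Online published: 2017-10-20

Funding

Key Project of the Natural Science Foundation of Heilongjiang Province (ZD201403)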

*Corresponding author: JING Weipeng, E-mail: nefujwp@163.com

Copyright

All rights reserved by the Editorial Office of the Journal of Geo-information Science

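Summary

Image mosaicking is an important part of remote sensing image processing and plays an important role in the analysis of cross-regional remote sensing images. To address the low utilization of compute nodes and the frequent data I/O of traditional parallel algorithms for remote sensing images, this paper builds on the Spark distributed in-memory computing framework, exploits Spark's suitability for iterative data processing, and proposes a parallel mosaicking method based on a self-defined Spark RDD (Resilient Distributed Dataset). The method first performs image overlapping region estimation with the phase correlation method on multiple nodes of the cluster, so that overlapping region estimation is computed in parallel across nodes. Then, by overriding the compute and getPartitions methods of the Spark RDD, it defines an RDD for remote sensing image processing and implements the three key steps of image mosaicking (overlapping region estimation, image registration, and image fusion) as Transformation-type operators of the self-defined RDD. Finally, the self-defined RDD is created through implicit conversion, and its operators are called to process image mosaicking in parallel. The experimental results show that, compared with the traditional MPI-based parallel mosaicking algorithm, the method effectively improves the efficiency of mosaicking large volumes of image data while preserving the mosaicking quality.

Keywords: remote sensing images; parallel mosaicking; Spark; phase correlation method; self-defined RDD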

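Cite this article

JING Weipeng, HUO Shuaiqi. A Model of Parallel Mosaicking for Massive Remote Sensing Images Based on Self-defined RDD[J]. Journal of Geo-information Science, 2017, 19(10): 1346-1354. DOI: 10.3724/SP.J.1047.2017.01346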

Abstract

Image mosaicking is an important part of remote sensing image processing and plays a vital role in the analysis of trans-regional remote sensing images. To solve the problems of low node utilization and frequent data I/O in traditional parallel algorithms for remote sensing images, we propose a parallel mosaicking algorithm based on a self-defined RDD (Resilient Distributed Dataset) built on the Spark distributed in-memory computing framework. We take full advantage of Spark, which is well suited to iterative data processing, and build a parallel mosaicking model for remote sensing images through operations on Spark RDDs. First, exploiting the logical separability and data independence of the Fourier transform and inverse Fourier transform in the phase correlation method, we improve the traditional phase correlation method by executing a single instruction on multiple nodes in parallel across the cluster, which parallelizes image overlapping region estimation over multiple nodes. Then, we override the compute and getPartitions methods of RDD to define an RDD for remote sensing image processing, and implement the three key steps of image mosaicking (overlapping region estimation, image registration, and image fusion) as transformation-type operators of the self-defined RDD. These transformation-type operators perform no calculation during parallel mosaicking until the final mosaicked image has to be written to disk or to the file system, which reduces the time consumed by parallel mosaicking. Finally, parallel image mosaicking is realized by calling the operators of the self-defined RDD, which is created through implicit conversion.
The experimental results show that, compared with the parallel mosaicking algorithm based on MPI, the parallel mosaicking algorithm for massive remote sensing images based on self-defined RDD effectively improves the efficiency of mosaicking large data volumes while guaranteeing the mosaicking quality.

Key words: remote sensing images; parallel mosaicking; Spark; phase correlation method; self-defined RDD

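1 Introduction

With the continuous development of Earth observation technology, remote sensing image mosaicking, an important part of remote sensing image processing, plays an important role in computer vision, military applications, geological exploration, and other fields [1]. In essence, remote sensing image mosaicking stitches multiple aerial or satellite images into a single seamless image [2]. It yields an overall remote sensing image of a large area, which facilitates ground-object observation and information analysis over large regions.

A remote sensing image mosaicking algorithm mainly consists of four steps: image preprocessing, overlapping region estimation, image registration, and image fusion [3-4].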

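In 1996, Richard Szeliski proposed a panoramic mosaicking algorithm for video images, which laid the theoretical foundation of image mosaicking [5]. Subsequently, Ref. [6] proposed a cylindrical panoramic image mosaicking algorithm that projects the images to be mosaicked onto a cylinder; it gives good results, but it is strongly affected by the image scale space and is unstable. Ref. [7] improved on the algorithm in Ref. [6] and proposed the more robust SIFT algorithm, which is, however, computationally expensive and time-consuming. Building on Ref. [7], Ref. [8] used a multi-resolution method to simplify image fusion and reduce the mosaicking time, which gave image mosaicking a new impetus. All of these algorithms are computationally intensive; as the resolution of Earth observation increases, traditional serial methods can no longer meet the demand for mosaicking massive remote sensing imagery [9-10].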

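To address these problems, Ref. [11] proposed a parallel mosaicking algorithm for massive aerial digital images based on a cluster computing system. The algorithm combines the principles of aerial digital photogrammetry with parallel computing, so it applies only to parallel computing in aerial digital photogrammetry and is therefore limited. Ref. [12] proposed a fine-grained parallel mosaicking algorithm for remote sensing images on cluster systems, which mosaics images pairwise through two-dimensional interpolation but cannot mosaic several images simultaneously. To mosaic multiple images simultaneously, Ref. [13] adapted a video image mosaicking algorithm [14] to remote sensing images and obtained the mosaic from the feathering weights of the images, which effectively solved that problem. However, these methods require frequent data transfers during mosaicking, which lengthens the I/O time of the cluster system and reduces the efficiency of the parallel algorithm. To address this, Ref. [15] proposed a parallel mosaicking algorithm based on dynamic task allocation and multi-threaded parallel I/O; multi-threading reduces the I/O time, but the minimum-spanning-tree allocation strategy it uses tends to make the processing order of the remote sensing images non-unique. Building on Ref. [15], Ref. [16] created a dynamic task tree based on a DAG to order the processing of the remote sensing images, further improving parallel mosaicking efficiency. In all of the parallel mosaicking algorithms mentioned above, however, the processing flow is essentially the same as in the serial algorithm, with data parallelism applied only separately to stages such as image registration and image fusion; as a result, they are prone to low node utilization and frequent data I/O.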

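In view of these problems, this paper builds on the Spark distributed in-memory computing framework, exploits Spark's suitability for iterative data processing, constructs a parallel mosaicking model for remote sensing images, and proposes a parallel mosaicking method for massive remote sensing images based on a self-defined RDD.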

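2 The Spark distributed computing framework

Spark adopts the Master-Slave model of distributed computing [17]. The Master is the node in the cluster that runs the Master process (the ClusterManager), and a Slave is a node that runs a Worker process. The Master acts as the controller of the whole cluster and is responsible for its normal operation; a Worker acts as a compute node that receives commands from the master node and reports its status; an Executor is responsible for executing tasks; the Client is the user-side client that submits applications; and the Driver is responsible for the execution of the whole application. The basic architecture is shown in Fig. 1.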

Fig. 1 The architecture of the Spark cluster

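After a Spark cluster is deployed, the Master process and the Worker processes must be started on the master node and the slave nodes, respectively, to control the cluster. During the execution of a Spark application, the Driver and the Workers are the two key roles. The Driver program is the starting point of the application logic and is responsible for job scheduling, that is, for distributing Tasks, while the Workers manage the compute nodes and create Executors that process tasks in parallel. After the client submits an application, the Master picks a Worker on which to start the Driver; the Driver requests resources from the Master or the resource manager and then converts the application into an RDD DAG; the DAGScheduler converts the RDD DAG into a directed acyclic graph of Stages and submits it to the TaskScheduler, which submits the tasks to the Executors for execution. During task execution, the other components cooperate to ensure that the whole application runs smoothly [18].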

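The core data structure of Spark is the Resilient Distributed Dataset (RDD). An RDD is a logically centralized entity whose data are partitioned across multiple machines in the cluster; controlling how the RDD partitions are placed on the machines reduces data shuffling between machines [19]. During execution, an RDD passes through a series of Transformation operators, and the computation is triggered by an Action operator. Logically, each transformation converts an RDD into a new RDD, and during the conversion the RDD is divided into many partitions distributed over the nodes of the cluster.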

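3 Parallel implementation of overlapping region estimation

Traditional mosaicking algorithms perform overlapping region estimation serially, so their processing efficiency is low. To address this, and based on the logical separability and data independence of the computational steps of the phase correlation method (such as the Fourier transform and the inverse Fourier transform), this paper improves the traditional phase correlation method by executing a single instruction on multiple nodes in parallel, so that the estimation is processed in parallel on multiple nodes of the cluster.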

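3.1 Image overlapping region estimation with the phase correlation method

The phase correlation method [20] is the traditional algorithm for image overlapping region estimation; it is a frequency-domain correlation technique based on the Fourier power spectrum. Consider two images $I_1(x,y)$ and $I_2(x,y)$, where $(x,y)$ are the coordinates of a pixel. If $I_2(x,y)$ is $I_1(x,y)$ translated by $(x_0,y_0)$, then:

$$I_2(x, y) = I_1(x - x_0, y - y_0) \tag{1}$$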

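Let $G(u,v)$ and $F(u,v)$ be the Fourier transforms of $I_2(x,y)$ and $I_1(x,y)$, respectively, where $(u,v)$ are the coordinates of a point in the transformed image. Then:

$$G(u, v) = F(u, v)\, e^{-j 2\pi (u x_0 + v y_0)} \tag{2}$$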

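The phase difference between the two images is then obtained from their cross-power spectrum, which is defined as:

$$\frac{F^*(u, v)\, G(u, v)}{\left| F^*(u, v)\, G(u, v) \right|} = e^{-j 2\pi (u x_0 + v y_0)} \tag{3}$$

where $F^*(u,v)$ is the complex conjugate of $F(u,v)$. Substituting Eq. (2) gives $F^*(u,v)\,G(u,v) = |F(u,v)|^2 e^{-j 2\pi (u x_0 + v y_0)}$, so dividing by the magnitude leaves only the phase term.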

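Finally, applying the inverse Fourier transform to the cross-power spectrum in Eq. (3) produces an impulse function at $(x_0, y_0)$:

$$\vartheta(x - x_0, y - y_0) = \mathcal{F}^{-1}\!\left( e^{-j 2\pi (u x_0 + v y_0)} \right) \tag{4}$$

The peak value of this impulse function serves as a measure of the similarity between the images. The coordinates of the peak give the relative translation $x_0$ and $y_0$ between the two images, from which the approximate overlapping region of the images is estimated.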
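
The steps of Eqs. (1) to (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the names `dft2` and `phase_correlation` are hypothetical, the test image is random data, the shift is circular (real scenes overlap only partially), and a naive discrete Fourier transform replaces the FFT so that the block needs nothing beyond the standard library. The spectrum is normalized as conj(F)·G, the ordering under which the peak lands at $(x_0, y_0)$ for the shift convention of Eq. (1).

```python
import cmath
import random


def dft2(grid, inverse=False):
    """Naive 2-D discrete Fourier transform; adequate only for tiny demo images."""
    rows, cols = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = 2 * cmath.pi * (u * x / rows + v * y / cols)
                    total += grid[x][y] * cmath.exp(sign * 1j * angle)
            out[u][v] = total / (rows * cols) if inverse else total
    return out


def phase_correlation(img1, img2):
    """Estimate (x0, y0) such that img2(x, y) = img1(x - x0, y - y0)."""
    rows, cols = len(img1), len(img1[0])
    F, G = dft2(img1), dft2(img2)
    cross = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            c = F[u][v].conjugate() * G[u][v]
            if abs(c) > 1e-12:           # keep only the phase of each frequency
                cross[u][v] = c / abs(c)
    surface = dft2(cross, inverse=True)  # impulse located at the shift
    return max(
        ((x, y) for x in range(rows) for y in range(cols)),
        key=lambda p: surface[p[0]][p[1]].real,
    )


# Hypothetical 8x8 test image and a circularly shifted copy of it.
rng = random.Random(0)
img1 = [[rng.random() for _ in range(8)] for _ in range(8)]
x0, y0 = 3, 5
img2 = [[img1[(x - x0) % 8][(y - y0) % 8] for y in range(8)] for x in range(8)]

print(phase_correlation(img1, img2))  # → (3, 5)
```

For a pure circular shift the correlation surface is a single impulse, so the peak is unambiguous; with real imagery the peak is broader and its height indicates how reliable the estimated overlap is.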

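3.2 Image overlapping region estimation with the improved phase correlation method

When a traditional mosaicking algorithm uses the phase correlation method for overlapping region estimation, it can process only one pair of images on a single node at a time, which limits computational efficiency. This paper improves the phase correlation method described above. Let $n$ be the number of nodes in the cluster; for the images $I_{i1}(x,y)$ and $I_{j2}(x,y)$ on each node:

$$\sum_{j=1}^{n} I_{j2}(x, y) = \sum_{i=1}^{n} I_{i1}(x - x_0, y - y_0) \tag{5}$$

where $I_{j2}(x,y)$ is $I_{i1}(x,y)$ translated by $(x_0, y_0)$.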

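Applying the operations of Eqs. (2) and (3) to the images $I_{i1}(x,y)$ and $I_{j2}(x,y)$ on each node yields the impulse functions of the images:

$$\sum_{k=1}^{n} \vartheta_k(x - x_0, y - y_0) = \sum_{k=1}^{n} \mathcal{F}_k^{-1}\!\left( e^{-j 2\pi (u x_0 + v y_0)} \right) \tag{6}$$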

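For each $\vartheta_k(x - x_0, y - y_0)$ in Eq. (6), the coordinates of its peak give the translation $(x_0, y_0)$ between the images on node $k$ $(k = 1, \cdots, n)$, from which the overlapping region of the images on node $k$ is obtained. In this way the overlapping regions of the remote sensing images are computed on the individual nodes, which realizes multi-node parallel operation.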
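
The property that makes this parallelization possible is that each image pair is processed independently of every other pair. A minimal Python sketch of that structure follows, under several assumptions: a thread pool stands in for the cluster nodes (it shows the task layout, not real multi-node speedup), one-dimensional rows stand in for images to keep the block short, and the helper names `dft`, `estimate_shift`, and `make_pair` as well as the data are hypothetical.

```python
import cmath
import random
from concurrent.futures import ThreadPoolExecutor


def dft(signal, inverse=False):
    """Naive 1-D discrete Fourier transform for short demo signals."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def estimate_shift(pair):
    """Phase correlation on one pair; every call is independent of the others."""
    first, second = pair
    cross = []
    for f, g in zip(dft(first), dft(second)):
        c = f.conjugate() * g
        cross.append(c / abs(c) if abs(c) > 1e-12 else 0j)
    surface = dft(cross, inverse=True)
    return max(range(len(surface)), key=lambda t: surface[t].real)


def make_pair(seed, shift, n=16):
    """Hypothetical data: a random row and a circularly shifted copy of it."""
    rng = random.Random(seed)
    first = [rng.random() for _ in range(n)]
    second = [first[(t - shift) % n] for t in range(n)]
    return first, second


# One pair per simulated node k = 1..n, as in Eqs. (5) and (6).
pairs = [make_pair(seed=k, shift=s) for k, s in enumerate([2, 5, 7, 11])]

with ThreadPoolExecutor(max_workers=4) as pool:
    shifts = list(pool.map(estimate_shift, pairs))

print(shifts)  # → [2, 5, 7, 11]
```

Because `estimate_shift` shares no state between calls, the same function could be handed to any data-parallel runtime; the thread pool is used here only because it is available in the standard library.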

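4 A self-defined RDD for remote sensing image mosaicking

To further improve the efficiency of multi-node parallel overlapping region estimation, this paper implements overlapping region estimation in the Spark cluster as an operator of a self-defined RDD and completes the whole mosaicking process through operations on that RDD.

As the core data structure of Spark, the RDD determines Spark's scheduling order through its dependency relationships, and parallel computation is carried out through operations on RDDs [21-22]. Because the RDD operators include no methods for image processing, this paper extends RDD and builds an RDD for remote sensing image processing that carries out the whole parallel mosaicking process. This involves two parts: (1) defining an RDD for remote sensing image processing; and (2) calling the methods of the self-defined RDD in the application through implicit conversion to complete the mosaicking.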

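4.1 Implementation of the self-defined RDD

An RDD in Spark is an abstract collection of data, implemented as an abstract class; it can simply be thought of as a data collection that provides many operation interfaces [23]. An RDD has many operation methods (such as map, filter, and flatMap); calling them converts the RDD into a new RDD and thereby performs the corresponding operation [24]. Two of these methods are the most fundamental: (1) compute, which computes the data of each partition of the RDD; and (2) getPartitions, which defines the partitioning strategy of the RDD.

The mosaicking method in this paper inherits from Spark's RDD class and overrides its compute and getPartitions methods to improve mosaicking efficiency. The definition process is shown in Fig. 2.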

Fig. 2 Self-defined RDD implementation details

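The self-defined RDD inherits from RDD with the extends keyword and overrides the compute and getPartitions methods. The overridden compute method calls the iterator method of the parent RDD to pull the data of the corresponding partition of the parent RDD. The iterator method returns an iterator object whose elements are the records of that partition of the parent RDD. The iterator holding the image data is then returned, so that the self-defined RDD obtains the image data passed on from the parent RDD for subsequent operations. The overridden getPartitions method calls the partitions method of the parent RDD and returns the parent's partitions. If the number of partitions is not specified during execution, the system uses the default partitioning strategy, whose default minimum number of partitions is 2 (in a cluster environment). In addition, the image mosaicking methods are added to the self-defined RDD, namely the three operators for overlapping region estimation, image registration, and image fusion.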
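
The delegation pattern described above can be modelled in plain Python. The sketch below is an analogy, not Spark code: `MiniRDD`, `SourceRDD`, and `ImageRDD` are hypothetical classes, and `get_partitions`, `compute`, and `iterator` only mirror the roles of the Spark methods named in the text. The records are placeholder strings rather than images.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a dataset split into numbered partitions."""

    def get_partitions(self):
        raise NotImplementedError

    def compute(self, split):
        raise NotImplementedError

    def iterator(self, split):
        # In this toy model, iterating a partition simply computes it.
        return self.compute(split)


class SourceRDD(MiniRDD):
    """Initial dataset; two partitions by default, echoing the default above."""

    def __init__(self, records, num_partitions=2):
        self._parts = [records[i::num_partitions] for i in range(num_partitions)]

    def get_partitions(self):
        return list(range(len(self._parts)))

    def compute(self, split):
        return iter(self._parts[split])


class ImageRDD(MiniRDD):
    """Self-defined dataset: reuses the parent's partitions and pulls its records."""

    def __init__(self, parent):
        self.parent = parent

    def get_partitions(self):
        return self.parent.get_partitions()   # same partitioning as the parent

    def compute(self, split):
        return self.parent.iterator(split)    # records of the matching partition


source = SourceRDD(["tile_a", "tile_b", "tile_c", "tile_d", "tile_e"])
images = ImageRDD(source)
layout = {p: list(images.iterator(p)) for p in images.get_partitions()}
print(layout)  # → {0: ['tile_a', 'tile_c', 'tile_e'], 1: ['tile_b', 'tile_d']}
```

Because `ImageRDD` adds no partitioning logic of its own, any mosaicking methods attached to it operate on exactly the records and partitions of its parent.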

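During remote sensing image mosaicking, the Spark cluster decides when to start computing the application according to the type of the operators. RDD operators are of two types, Transformation and Action. A Transformation operator is a chained, logical operation that records how the RDD evolves and does not actually trigger computation, whereas an Action is what actually triggers the accumulated Transformations to be computed [25].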

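Transformation operators are evaluated lazily: no matter how many Transformation operations are applied, the RDD performs no actual computation, and the computation is triggered only when an Action is executed. In image mosaicking, moreover, every step requires a large amount of computation. For these two reasons, this paper designs all three operators of the parallel mosaicking algorithm as Transformation operators: no computation is performed while the mosaicking pipeline is being built, and the actual work is triggered only when the final mosaicked image has to be written to disk or to the file system. This design effectively reduces the time consumed by parallel image mosaicking.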
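
Lazy evaluation of this kind can be illustrated with a small Python model. It is a sketch under assumptions, not Spark's mechanism: `LazyDataset`, `transform`, `collect`, and `step` are hypothetical names, and the three operators are placeholders that only tag a string and record that they ran.

```python
calls = []  # records every time a step actually runs


def step(name):
    """Build a placeholder operator that logs its own execution."""
    def apply(record):
        calls.append(name)
        return f"{record}>{name}"
    return apply


class LazyDataset:
    """Toy dataset whose transformations are recorded, not executed."""

    def __init__(self, records, pending=()):
        self._records = records
        self._pending = tuple(pending)

    def transform(self, fn):
        # Transformation: return a new dataset with one more pending operator.
        return LazyDataset(self._records, self._pending + (fn,))

    def collect(self):
        # Action: only now are the pending operators applied to the records.
        results = []
        for record in self._records:
            for fn in self._pending:
                record = fn(record)
            results.append(record)
        return results


mosaic = (
    LazyDataset(["img1", "img2"])
    .transform(step("overlap"))
    .transform(step("register"))
    .transform(step("fuse"))
)
print(len(calls))   # → 0
result = mosaic.collect()
print(result)       # → ['img1>overlap>register>fuse', 'img2>overlap>register>fuse']
print(len(calls))   # → 6
```

Nothing runs while the three operators are chained; all six operator calls happen inside `collect`, which plays the role of the Action.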

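4.2 Creating the self-defined RDD through implicit conversion

When a Spark cluster is used for remote sensing image mosaicking, the whole process is carried out through operations on RDDs, so the mosaicking methods of the self-defined RDD have to be called through implicit conversion. Implicit conversion means that when an object calls a method that does not belong to it, the compiler of the Spark program (written in Scala) looks for an implicit conversion in scope that makes the call valid. Such conversions are declared with the implicit keyword: when the type of an object does not match, the declarations marked implicit in the code are tried. An important use of implicit conversion is to extend RDD with enhanced functionality [26].

When the self-defined RDD is used for remote sensing image mosaicking, its image mosaicking methods are called through implicit conversion. The main process is as follows: the textFile method of the SparkContext object reads the remote sensing image data from HDFS and converts them into the initial RDD; the initial RDD then calls the method, declared with implicit, that creates the self-defined RDD, which produces the self-defined RDD object on which the whole mosaicking process is carried out. The details of creating the self-defined RDD are given in Algorithm 1: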

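Algorithm 1:

Initialization: create a SparkConf object conf; create a SparkContext object sc with conf as the constructor argument; call the textFile method of sc to create the initial RDD

Phase 1: add the operation methods to the self-defined RDD
Iterator[BufferedImage] ← compute(split: Partition, context: TaskContext)  // calls the iterator method of the parent RDD and returns
// an iterator object whose elements are of type BufferedImage
Array[Partition] ← firstParent[BufferedImage].partitions  // calls the partitions method of the parent RDD and returns the parent's partitions
RDD[BufferedImage] ← Image overlap region estimation  // overlapping region estimation method
RDD[BufferedImage] ← Image registration  // image registration method
RDD[BufferedImage] ← Image fusion  // image fusion method
Phase 2: declare and import the implicit conversion
self-definedRDD[rdd] ← exchange(rdd: RDD[String])  // the exchange method of the conversion class is marked with the implicit keyword; it takes an RDD as its parameter
// and returns the self-defined RDD
import RDDtoSelf-definedRDD.exchange  // imports the declared implicit conversion method into the program
Phase 3: create the self-defined RDD object
imageRDD ← fileRDD.exchange  // the initial RDD calls the exchange method to create the self-defined RDD object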

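In Algorithm 1, the textFile function takes two parameters, FilePath and Partition, where FilePath is the storage path of the remote sensing image data in HDFS and Partition is the number of partitions. With fileRDD as the initial RDD, the exchange method marked with implicit is defined in RDDtoSelf-definedRDD; it takes an RDD as its parameter and returns the self-definedRDD, that is, imageRDD in Algorithm 1.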

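5 Parallel mosaicking of remote sensing images based on the self-defined RDD

With overlapping region computation parallelized and an RDD defined for remote sensing image mosaicking, this paper optimizes the traditional parallel mosaicking algorithm and proposes a parallel mosaicking algorithm for remote sensing images based on the self-defined RDD. The optimized algorithm is shown in Fig. 3.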

Fig. 3 Parallel mosaicking algorithm based on self-defined RDD

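As Fig. 3 shows, the client submits the preprocessed remote sensing image data to the Master node of the Spark cluster, and the Master node distributes the mosaicking tasks to the Worker nodes. Each Worker node computes image overlapping regions in parallel and then performs image registration and image fusion, which improves the efficiency of overlapping region estimation. Throughout the mosaicking job, each Worker node starts Executor processes that execute the three steps (overlapping region estimation, image registration, and image fusion) in a loop until the final mosaicked image is produced. This makes full use of the processing capacity of the nodes and thus raises node utilization. Thanks to Spark's in-memory computing, the intermediate data produced at each stage before the final mosaicked image is obtained are kept, through RDD persistence, in the memory of the Executor processes and need not be written to disk. When the next stage is processed, the required data are read directly from the memory of the Worker node hosting the Executor process, which removes the need for frequent data reads and writes.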

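The following describes in detail how remote sensing image mosaicking is carried out in the Spark distributed computing framework through operations on the self-defined RDD. It covers two aspects: (1) the processing steps of parallel mosaicking of remote sensing images in a Spark cluster; and (2) the execution logic of parallel mosaicking of remote sensing images in a Spark cluster.

5.1 Processing steps of parallel remote sensing image mosaicking in a Spark cluster

Combining the implementation details of the self-defined RDD with the description of the parallel mosaicking algorithm in Fig. 3, the detailed processing steps of parallel remote sensing image mosaicking in a Spark cluster are as follows: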

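(1) After the client submits the remote sensing image mosaicking job to the Spark cluster, a corresponding Driver process is started. The Driver process requests from the cluster's resource manager the resources needed to run the Spark job, that is, the number of Executor processes. Executor processes are started on the Worker nodes, and each of them occupies memory and CPU. The resource parameters are specified when the job is submitted with Spark's spark-submit script.

(2) Once the resources needed by the job have been obtained, the Driver process splits the Spark job code into several Stages. In the mosaicking program, the number of remote sensing images in each Partition does not change during overlapping region estimation and image registration; that is, the number of parent-RDD Partitions on which a Partition of the child RDD depends does not change with the size of the RDD data, so these two steps belong to one Stage. Image fusion, by contrast, merges several images into one; that is, the number of parent-RDD Partitions on which the child RDD depends may change with the size of the RDD data, so it is placed in a different Stage from the two steps above.

(3) Spark creates a batch of Tasks for each Stage and assigns them to the Executor processes for execution. A Task is the smallest unit of computation in Spark and executes the code for overlapping region estimation, image registration, and image fusion in the mosaicking program. Every Task runs exactly the same code, but different Tasks process different data. Running Tasks on the processing nodes of the Spark cluster is what realizes the parallel processing of remote sensing image mosaicking.

(4) When all Tasks of the Stage containing overlapping region estimation and image registration have finished, the Driver schedules the Stage containing image fusion, whose Tasks take as input the output of the Stage containing overlapping region estimation and image registration. This cycle repeats until all data have been processed and the final mosaicked image is produced.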
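
The Stage split described in step (2) can be sketched as a small Python function. This is a simplified illustration, not Spark's scheduler: the function name `split_into_stages` and the labels `"narrow"` and `"wide"` are assumptions introduced here, where `"narrow"` marks an operator whose partitions depend on a fixed number of parent partitions and `"wide"` marks one whose dependency count may vary with the data.

```python
def split_into_stages(operators):
    """Group (name, dependency) operators into stages.

    A 'wide' operator starts a new stage; 'narrow' operators stay in the
    current one.
    """
    stages = [[]]
    for name, dependency in operators:
        if dependency == "wide" and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


# The three mosaicking operators, labelled as described in step (2).
operators = [
    ("overlap_estimation", "narrow"),
    ("registration", "narrow"),
    ("fusion", "wide"),
]
print(split_into_stages(operators))
# → [['overlap_estimation', 'registration'], ['fusion']]
```

Overlapping region estimation and registration end up in one stage and fusion in the next, which matches the order in which step (4) says the Driver schedules them.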

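5.2 Execution logic of parallel remote sensing image mosaicking in a Spark cluster

During parallel mosaicking of remote sensing images in a Spark cluster, the whole execution flow logically forms a directed acyclic graph; once an Action operator is triggered, the scheduler schedules the tasks in the graph for computation. The resulting directed acyclic graph is shown in Fig. 4.

In Fig. 4, A, B, C, D, and E are RDDs of different types, and the squares inside an RDD represent its partitions. Data are read from HDFS into Spark to form RDD A; RDD A becomes the self-defined RDD B through implicit conversion; RDD B is converted into RDD C by overlapping region estimation; RDD C is converted into RDD D by image registration; and RDD D is converted into RDD E by image fusion. The three operators (overlapping region estimation, image registration, and image fusion) are then executed repeatedly in this order until the final mosaicked image is produced, which is output and saved to HDFS with the saveAsTextFile function. Because all three operators of the mosaicking process are transformations, and transformations are evaluated lazily, the RDD conversions are not executed immediately, so the data are not processed and saved after each step. Only when an action occurs, namely the output of the final mosaicked image in Fig. 4, is the computation actually triggered. Once the Action operator is triggered, all the accumulated operators form a directed acyclic graph, and the scheduler schedules the tasks in the graph for computation.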

Fig. 4 Directed acyclic graph of parallel mosaicking of remote sensing images

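6 Experimental results and analysis

6.1 Experimental environment

The experimental environment is a Spark cluster of five nodes, one Master and four Workers. Four of the nodes are Sugon I450-G10 tower servers, each with one six-core 2.1 GHz Intel Xeon E5-2620 processor, 8 GB of memory, and a 300 GB hard disk. Each node runs Red Hat 6.2 with Linux kernel 2.6.32, Hadoop 2.5.2, and Spark 1.2.0. To compare mosaicking efficiency, an MPI parallel cluster of five nodes is also used; its servers have the same hardware configuration as the Spark cluster, with gcc 4.4.7 and mpich 3.0.4.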

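6.2 Mosaicking results

This experiment mosaics 15 TM remote sensing images from band 5 of Landsat 5 with a resolution of 30 m and a total data volume of 2 GB. The mosaicking result is shown in Fig. 5.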

Fig. 5 Mosaicking result of the Spark-based parallel mosaicking algorithm

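The result shows that, because the mosaicking algorithm processes the overlapping regions of the images with a fusion technique, the mosaicking seams in the overlapping regions are eliminated and a continuous mosaicked view of the whole area is obtained, so the mosaicking quality is good.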

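6.3 Processing efficiency analysis

6.3.1 Speedup comparison experiment

This experiment uses ETM+ remote sensing data from band 8 of Landsat 7 with a resolution of 15 m and a data volume of 72 GB. It compares the proposed parallel mosaicking technique based on the self-defined RDD with an MPI-based parallel mosaicking program and examines how their speedups differ as the number of processes increases. The speedup is defined as:

$$\mathrm{speedup} = \frac{T_1}{T_p} \tag{7}$$

where $T_1$ is the running time on a single processor and $T_p$ is the parallel running time on $p$ processors. The trend is shown in Fig. 6.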
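
As a worked instance of Eq. (7), the short Python sketch below computes a speedup curve from a table of running times. The timings are hypothetical values chosen for easy arithmetic, not measurements from this experiment, and the function name `speedup` is an assumption.

```python
def speedup(t1, tp):
    """Eq. (7): single-processor time divided by p-processor time."""
    if tp <= 0:
        raise ValueError("parallel running time must be positive")
    return t1 / tp


# Hypothetical timings in seconds, keyed by process count p (not measured values).
timings = {1: 1200.0, 2: 640.0, 4: 400.0, 8: 300.0}
t1 = timings[1]
curve = {p: speedup(t1, tp) for p, tp in timings.items()}
print(curve)  # → {1: 1.0, 2: 1.875, 4: 3.0, 8: 4.0}
```

A curve that flattens as p grows, as in this made-up table, is the pattern a speedup plot makes visible when overheads start to dominate.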

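The results in Fig. 6 show that the parallel mosaicking process based on the self-defined RDD achieves a better speedup than the traditional MPI-based parallel mosaicking program. There are two reasons. (1) The proposed algorithm first parallelizes image overlapping region estimation and adds the mosaicking steps to the operators of the self-defined RDD; the nodes of the cluster call these operators to generate RDDs of different types, and the whole mosaicking process is completed through operations on RDDs. This makes full use of the parallel computing capability of RDDs and improves the efficiency of parallel operations within the cluster. (2) The intermediate images produced during mosaicking are not written back to the file system; through the Transformation operations of the RDD they serve directly as the input of the next processing stage. Many intermediate images are produced before the final image is generated, and frequently writing them back to the file system and reading them back into memory for subsequent operations would consume a great deal of I/O time. By exploiting Spark's in-memory computing and keeping the intermediate images out of the file system, the method saves part of the time otherwise spent on file transfer.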

Fig. 6 Speedup comparison (with increasing number of processes)

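6.3.2 Processing time and throughput comparison experiment

This experiment uses ETM+ remote sensing data from band 8 of Landsat 7 with a resolution of 15 m, with the data volume increasing from 0.7 GB to 126.5 GB. It compares the proposed parallel mosaicking technique based on the self-defined RDD with the MPI-based parallel mosaicking program and examines how performance indicators such as processing time and data throughput change as the data size grows. The throughput is defined as:

$$F = \frac{M}{T} \tag{8}$$

where $F$ is the throughput, $M$ is the data volume, and $T$ is the processing time.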
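
Eq. (8) can be evaluated the same way. In the Python sketch below, the data volumes and processing times are hypothetical illustrative values, not results from this experiment, and the function name `throughput` is an assumption.

```python
def throughput(data_gb, seconds):
    """Eq. (8): F = M / T, here in GB per second."""
    if seconds <= 0:
        raise ValueError("processing time must be positive")
    return data_gb / seconds


# Hypothetical (data volume in GB, processing time in s) pairs, not measured values.
runs = [(2.0, 50.0), (8.0, 100.0), (32.0, 200.0)]
rates = [throughput(m, t) for m, t in runs]
print(rates)  # → [0.04, 0.08, 0.16]
```

In this made-up table the processing time grows more slowly than the data volume, so the throughput rises with data size.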

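As the data volume increases, the processing time and throughput of the MPI-based parallel mosaicking program and of the proposed program based on the self-defined RDD change as shown in Figs. 7 and 8.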

Fig. 7 Running time comparison (with increasing data size)

Fig. 8 Throughput comparison (with increasing data size)

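The results in Figs. 7 and 8 show that, when the data volume is small, the data throughput of the proposed parallel mosaicking algorithm is lower than that of the traditional MPI-based parallel mosaicking program. The main reason may be that, when there is little data to process, the number of tasks that the Spark cluster can execute concurrently exceeds the number of tasks to be processed, which leaves some compute nodes idle. As the data volume increases, more tasks can run in parallel, the parallelism of the cluster rises, and the data throughput becomes substantially higher than that of the MPI-based mosaicking program.

The experimental analysis shows that the proposed parallel mosaicking algorithm for remote sensing images based on the self-defined RDD effectively improves parallel processing efficiency while preserving the mosaicking quality.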

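7 Conclusion

Because of the large data volume and the complex processing involved, remote sensing image mosaicking is time-consuming. To address the problems of traditional parallel mosaicking algorithms, and drawing on the characteristics of the Spark distributed in-memory computing framework, this paper proposes a parallel mosaicking algorithm based on a self-defined RDD, which completes the whole mosaicking process by calling the mosaicking operators of a self-defined RDD obtained through implicit conversion. While preserving the quality of the mosaicked image, the algorithm achieves better processing performance than the traditional MPI-based parallel mosaicking program. Future work will optimize Spark's internal scheduling mechanism to further improve the parallel processing efficiency of image mosaicking.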

The authors have declared that no competing interests exist.

References

[1] Franklin S, Wulder M. Remote sensing methods in medium spatial resolution satellite data land cover classification of large areas[J]. Progress in Physical Geography, 2002,26(2):173-205.

[2] Scheidt, Ramsey, Lancaster. Radiometric normalization and image mosaic generation of ASTER thermal infrared data: An application to extensive sand sheets and dune fields[J]. Remote Sensing of Environment, 2008,112(3):920-933.

[3] Bielski C, Grazzini J, Soille P. Automated morphological image composition for mosaicing large image data sets[C]. Geoscience and Remote Sensing Symposium, 2007. IGARSS 2007. IEEE International, 2007:4068-4071.

[4] Liu, Gong, Shi P. Study on image automatic registration based on layered multi-template matching[J]. Computer Applications, 2005,25(2):193-203.

[5] Szeliski R. Video mosaics for virtual environments[J]. IEEE Computer Graphics & Applications, 1996,16(2):22-30.

[6] Shum H Y, He L W. Rendering with concentric mosaics[C]. Proceedings of Siggraph'99, Los Angeles, California, 1999:8-13.

[7] Lowe D G. Object recognition from local scale-invariant features[C]. Proceedings of the 7th International Conference on Computer Vision, 1999:1150-1157.

[8] Brown M, Lowe D G. Recognizing panoramas[C]. Proceedings of IEEE International Conference on Computer Vision, Washington, 2003:1218-1225.

[9] Afek, Brand. Mosaicking of orthorectified aerial images[J]. Photogrammetric Engineering & Remote Sensing, 1998,64(2):115-125.

[10] Zhu S L, Qian Z B. The seam-line removal under mosaicking of remotely sensed images[J]. Journal of Remote Sensing, 2002,6(3):183-187. (in Chinese)

[11] Zhang J Q, Ke T, Sun M W. Parallel processing of mass aerial digital images based on cluster computer-the application of parallel computing in aerial digital photogrammetry[J]. Computer Engineering and Applications, 2008,44(13):12-15. (in Chinese)

[12] An X H, Wang X G, Du Z H, et al. Fine-grained parallel algorithm for remote sensing image mosaics for cluster system[J]. Journal of Tsinghua University (Science and Technology), 2002,42(10):1389-1392. (in Chinese)

[13] Chen C, Tan Y H, Li H T, et al. A fast and automatic parallel algorithm of remote sensing image mosaic[J]. Microelectronics and Computer, 2011,28(3):59-62. (in Chinese)

[14] Zhao. Flexible image blending for image mosaicing with reduced artifacts[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2006,20(4):609-628.

[15] Wang Y Y, Ma Y, Liu D S. Optimization of image mosaic algorithm based on parallel I/O and dynamic grouping[J]. Remote Sensing Information, 2012(2):3-8. (in Chinese)

[16] …, Wang, Zomaya A, et al. Task-tree based large-scale mosaicking for massive remote sensed imageries with dynamic DAG scheduling[J]. IEEE Transactions on Parallel & Distributed Systems, 2014,25(8):2126-2137.

[17] Tabbb Y, Medouri A, Tetouan M. Towards a next generation of scientific computing in the cloud[J]. International of Computer Science, 2012,9(6):177-183.

[18] Xia J L. Big data processing technology with Spark[M]. Beijing: Electronic Industry Press, 2015. (in Chinese)

[19] Gao Y J. Data processing with Spark: technology, application and performance optimization[M]. Beijing: China Machine Press, 2014. (in Chinese)

[20] Kuglin C D. The phase correlation image alignment method[J]. Proc. Intl Conf. Cybernetics & Society, 1975:163-165.

[21] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Usenix Conference on Networked Systems Design and Implementation. 2012:141-146.

[22] Zaharia M, Das T, Li H, et al. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters[C]//Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing. USENIX Association, 2012:10-10.

[23] Engle C, Lupher A, Xin R, et al. Shark: fast data analysis using coarse-grained distributed memory[C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012:689-692.

[24] Xin R S, Rosen J, Zaharia M, et al. Shark: SQL and rich analytics at scale[C]//Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013:13-24.

[25] Ramirez-Gallego S, Garcia S, Mourino-Talin H, et al. Distributed entropy minimization discretizer for big data analysis under apache spark[C]//Trustcom/bigdatase/ispa. IEEE, 2015.

[26] Han Z, Zhang Y. Spark: A big data processing platform based on memory computing[C]//International Symposium on Parallel Architectures. IEEE, 2015:172-176.
