Journal of Geo-information Science ›› 2015, Vol. 17 ›› Issue (2): 185-190. doi: 10.3724/SP.J.1047.2015.00185


A Topic Crawler for Actively Discovering Geospatial Web Services

SHEN Ping1, GUI Zhipeng2,*, YOU Lan1,3, HU Kai1, WU Huayi1

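  1. State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
    2. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
    3. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China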
  • Received: 2014-11-14 Revised: 2014-12-21 Online: 2015-02-10 Published: 2015-02-10
  • Corresponding author: GUI Zhipeng E-mail: shenping@whu.edu.cn; Zhipeng.Gui@whu.edu.cn
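  • About the author:

    SHEN Ping (1991-), female, from Hubei, master's student. Her research interests are online search of geographic information resources and geospatial information services. E-mail: shenping@whu.edu.cn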

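  • Supported by:
    National Natural Science Foundation of China, General Program (41371372); Exploratory Research and Development Fund of the School of Remote Sensing and Information Engineering, Wuhan University, "Research on Optimization Methods for Spatial Information Cloud Computing Based on Spatiotemporal Computing Feature Mining"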

A Topic Crawler for Discovering Geospatial Web Services

SHEN Ping1, GUI Zhipeng2,*, YOU Lan1,3, HU Kai1, WU Huayi1

  1. State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
    2. School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
    3. School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
  • Received: 2014-11-14 Revised: 2014-12-21 Online: 2015-02-10 Published: 2015-02-10
  • Contact: GUI Zhipeng E-mail: shenping@whu.edu.cn; Zhipeng.Gui@whu.edu.cn
  • About author:

    SHEN Ping, E-mail: shenping@whu.edu.cn

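Abstract (Chinese version):

Geospatial web services have become an important source of geographic data in distributed environments, and finding them among massive web resources is the foundation for sharing and interoperating geographic data. At present, active search for geospatial web services mainly relies on the interfaces of general-purpose search engines or on general-purpose crawlers, but both approaches suffer from low search efficiency and poor usability of the search results. To address this problem, this paper designs a topic crawler for searching geospatial web services. The algorithm improves on best-first search: it determines link priority by jointly considering the topic relevance of the web page content and the topic relevance of the link text, crawls links related to geospatial web services first, and discards irrelevant links in irrelevant pages to reduce wasted crawling and thereby improve search efficiency. In addition, the paper identifies geospatial web services through keyword matching combined with capabilities document detection, which effectively selects the usable services and raises the usability of the search results. Finally, taking OGC WMS as an example, a prototype system of the crawler algorithm is implemented and tested; the experiments show that the algorithm is effective and feasible.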

Keywords (Chinese version): topic crawler, geospatial web services, best-first search, capabilities document detection
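
The recognition step described in the abstract above (keyword matching followed by capabilities document detection) could be sketched roughly as follows for OGC WMS. This is only an illustration, not the authors' implementation: the keyword list `URL_KEYWORDS` and all function names are assumptions, and the network request itself is left out, so the check operates on a response body that has already been downloaded. The root element names `WMS_Capabilities` (WMS 1.3.0) and `WMT_MS_Capabilities` (WMS 1.1.x) and the `SERVICE`/`REQUEST` parameters come from the WMS specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Root element names of WMS capabilities documents (WMS 1.3.0 and 1.1.x).
WMS_ROOT_NAMES = {"WMS_Capabilities", "WMT_MS_Capabilities"}
# Hypothetical keyword list for the cheap pre-filter; not taken from the paper.
URL_KEYWORDS = ("wms", "getcapabilities", "mapserver", "geoserver")


def is_candidate(url: str) -> bool:
    """Keyword matching: keep only URLs that mention a WMS-related keyword."""
    lowered = url.lower()
    return any(keyword in lowered for keyword in URL_KEYWORDS)


def build_getcapabilities_url(url: str) -> str:
    """Turn a candidate URL into a WMS GetCapabilities request URL."""
    parts = urlsplit(url)
    # Keep vendor parameters (e.g. map=...), drop existing OGC request parameters.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in {"service", "request", "version"}
    ]
    kept += [("SERVICE", "WMS"), ("REQUEST", "GetCapabilities")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def is_wms_capabilities(xml_text: str) -> bool:
    """Capabilities detection: does the response parse as a WMS capabilities document?"""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    # Strip a namespace prefix such as "{http://www.opengis.net/wms}".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in WMS_ROOT_NAMES
```

A crawler would call `is_candidate` first, fetch the URL returned by `build_getcapabilities_url`, and accept the service only if `is_wms_capabilities` returns true for the response body. HTML pages and OGC exception reports are rejected because their root elements do not match.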

Abstract:

In the Internet era, geospatial web services (GWSs) are the primary means of sharing and interoperating geographical data. After more than ten years of development and the wide adoption of specifications, an increasing number of geospatial web services have been published and are available for public online access. To obtain those geographical data, an effective approach is needed to locate and discover GWSs among massive web resources. Currently, the most widely used methods for GWS discovery in practice are based either on the Google Search API or on a generic web crawler. However, these approaches have shortcomings, such as relatively inefficient search performance, irrelevant results, and low precision in GWS identification. To partially address these issues, this paper develops a topic crawler that harvests GWSs using a modified Best First Search strategy. The core of the proposed algorithm is to combine the topic relevance of the link text with the topic relevance of the webpage text to predict the crawling priority of each unvisited URL. Priority thresholds are then used to filter out irrelevant URLs and narrow the search range. Moreover, a capabilities document detection step is added to the GWS recognition process to improve search precision. Finally, we use the most widely adopted GWS specification, the Web Map Service (WMS) proposed by the Open Geospatial Consortium (OGC), as a case study. Two groups of experiments were conducted to compare the proposed method with a generic web crawler. The experimental results verify the feasibility of the proposed algorithm.

Key words: topic crawler, Geospatial Web Services, Best First Search, capability detection
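
The link-prioritization idea in the abstract (combine page-text relevance and link-text relevance, then drop URLs below a priority threshold) can be illustrated with a small best-first frontier. The abstract does not give the scoring formula, so everything specific here is an assumption made for illustration: the term set `TOPIC_TERMS`, the term-overlap `relevance` measure, the weighted sum in `link_priority`, and the default `weight` and `threshold` values.

```python
import heapq
import re

# Hypothetical topic vocabulary; the paper's actual term list is not given here.
TOPIC_TERMS = {"wms", "ogc", "geospatial", "capabilities"}


def relevance(text: str) -> float:
    """Fraction of topic terms that occur in the text (a crude stand-in measure)."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & TOPIC_TERMS) / len(TOPIC_TERMS)


def link_priority(page_text: str, link_text: str, weight: float = 0.5) -> float:
    """Assumed combination: weighted sum of page relevance and link-text relevance."""
    return weight * relevance(page_text) + (1 - weight) * relevance(link_text)


class Frontier:
    """Best-first URL queue that rejects low-priority and already-seen URLs."""

    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal priorities pop in insertion order

    def push(self, url: str, priority: float) -> bool:
        if url in self._seen or priority < self.threshold:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        negated, _, url = heapq.heappop(self._heap)
        return url, -negated

    def __len__(self) -> int:
        return len(self._heap)
```

In a crawl loop, each link extracted from a fetched page would be scored with `link_priority` and offered to `Frontier.push`; the next page to fetch is whatever `Frontier.pop` returns. Raising `threshold` narrows the search range at the cost of possibly missing services reachable only through weakly related pages.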