Journal of Geo-information Science ›› 2016, Vol. 18 ›› Issue (4): 435-442. doi: 10.3724/SP.J.1047.2016.00435


A Link-Analysis-Based Method for Extracting Core Toponyms from Web Page Text

ZHONG Xiang, GAO Yong*, WU Lun

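  1. Institute of Remote Sensing and Geographical Information System, Peking University, Beijing 100871, China
  • Received: 2015-08-03 Revised: 2015-08-26 Online: 2016-04-20 Published: 2016-04-19
  • Corresponding author: GAO Yong E-mail: zhongxiang0902@sina.com; gaoyong@pku.edu.cn
  • About the author:

    ZHONG Xiang (1991-), male, from Yiyang, Hunan Province, master's student; research interests: text-based spatial data mining and geographic information retrieval. E-mail: zhongxiang0902@sina.com

  • Funding:
    National Natural Science Foundation of China (41271385)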

Extract Core Toponyms from Web Page Text Based on Link Analysis

ZHONG Xiang, GAO Yong*, WU Lun

  1. Institute of Remote Sensing and Geographical Information System, Peking University, Beijing 100871, China
  • Received: 2015-08-03 Revised: 2015-08-26 Online: 2016-04-20 Published: 2016-04-19
  • Contact: GAO Yong E-mail: zhongxiang0902@sina.com; gaoyong@pku.edu.cn

Summary:

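This paper addresses the rich geospatial information contained in web page text on the Internet. It extracts the toponym entities found in web page text and proposes a toponym co-occurrence network model that takes the frequency of toponyms in web pages into account and expresses both the co-occurrence of toponyms and the way associations propagate between them. On this basis, a link-analysis-based method for extracting core toponyms from web page text is proposed: the PageRank algorithm computes the link weight of each toponym in the co-occurrence network, and core toponyms are extracted from the network built from the web page text, so that important toponyms with salient focal or navigation-hub characteristics can be found among vast web resources. Finally, experiments on two corpora, People's Daily and the sports section of Sina News, demonstrate the effectiveness of the method.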

Keywords: toponym, toponym co-occurrence, link analysis, complex network, geographical information retrieval

Abstract:

Geographical information has grown explosively with the emergence of the Internet, which also offers new ways of obtaining geospatial data alongside traditional GIS methods. To exploit the abundant geospatial information on the web, we propose a toponym co-occurrence network model: toponym entities are extracted from web page texts with natural language processing methods and then normalized, so that the web pages can be analyzed comprehensively. The network is a weighted directed graph in which every vertex represents a distinct toponym and the co-occurrence of two toponyms is represented by an edge. Toponym frequency is taken into account in the edge weights, so the network expresses both the co-occurrence relationships of toponyms and the way associations propagate between them. On this basis, we present a method for extracting core toponyms from web page texts based on link analysis: the PageRank algorithm computes the link weight of every toponym in the co-occurrence network, and each toponym is ranked by its PageRank score. In this way the importance of each toponym is quantified, and core toponyms with salient focal features or navigation-hub features can be found among vast web resources. A case study on data extracted from People’s Daily and Sina News Sport web pages shows that the proposed method is feasible and effective in practice and that it can also be applied to geographical information retrieval. The results show that the core toponyms of the co-occurrence network differ across web page themes and that, when the time sequence is taken into account, they may also differ within a single theme.

Key words: toponym, toponym co-occurrence, link analysis, complex network, geographical information retrieval
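
The pipeline outlined in the abstract (build a weighted directed co-occurrence graph, then rank toponyms with PageRank) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how edge directions and weights are derived, so the rule used here (an edge a → b gains the in-document frequency of b whenever a and b appear in the same document) is an assumption, as are the function names `build_cooccurrence_graph`, `pagerank`, and `core_toponyms`, the damping factor 0.85, and the toy input. Toponym extraction and normalization are assumed to have been done already.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence_graph(docs):
    """Build a weighted directed graph from per-document toponym lists.

    Assumed weighting (not taken from the paper): for every ordered pair
    (src, dst) of distinct toponyms in one document, the edge src -> dst
    gains the number of times dst occurs in that document.
    """
    graph = {}
    for toponyms in docs:
        counts = Counter(toponyms)
        for name in counts:
            graph.setdefault(name, {})  # keep isolated toponyms as vertices
        for src, dst in permutations(counts, 2):
            graph[src][dst] = graph[src].get(dst, 0.0) + counts[dst]
    return graph


def pagerank(graph, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank by power iteration; scores sum to 1."""
    nodes = sorted(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iterations):
        # Vertices without outgoing edges spread their score uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for src in nodes:
            if out_weight[src] == 0:
                continue
            for dst, weight in graph[src].items():
                new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def core_toponyms(docs, k=3):
    """Return the k toponyms with the highest PageRank score."""
    scores = pagerank(build_cooccurrence_graph(docs))
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy input: each inner list holds the toponyms found in one web page.
    pages = [
        ["Beijing", "Beijing", "Shanghai"],
        ["Beijing", "Guangzhou"],
        ["Beijing", "Shanghai", "Beijing"],
    ]
    print(core_toponyms(pages, k=2))
```

In this sketch a toponym scores highly when many other toponyms co-occur with it and when it is frequent in the pages where they meet, which is one plausible reading of the focal and hub characteristics mentioned in the abstract. A graph library such as NetworkX could replace the hand-written power iteration.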