Journal of Geo-information Science, 2015, Vol. 17, Issue (4): 416-422. doi: 10.3724/SP.J.1047.2015.00416

A Pattern Matching Method for Extracting Road Traffic Information Contained in Internet Texts

QIU Peiyuan1,2, ZHANG Hengcai1,*, LU Feng1

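  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101
    2. University of Chinese Academy of Sciences, Beijing 100101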
  • Received: 2014-05-04 Revised: 2014-06-25 Published: 2015-04-10 Released: 2015-04-10
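  • About the author:

    QIU Peiyuan (born 1986), male, from Qingdao, Shandong; PhD candidate whose research focuses on Internet spatial information search. E-mail: qiupy@lreis.ac.cn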

  • Funding:
    National High Technology Research and Development Program of China ("863" Program) (2012AA12A211, 2013AA120305)

A Pattern Matching Method for Extracting Road Traffic Information from Internet Texts

QIU Peiyuan1,2, ZHANG Hengcai1,*, LU Feng1

  1. State Key Lab of Resources and Environmental Information System, IGSNRR, CAS, Beijing 100101, China
    2. University of Chinese Academy of Sciences, Beijing 100101, China
  • Received: 2014-05-04 Revised: 2014-06-25 Online: 2015-04-10 Published: 2015-04-10
  • Contact: ZHANG Hengcai
  • About author:

    *The author: SHEN Jingwei, E-mail:jingweigis@163.com

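Abstract (Chinese version):

Internet pages and social network texts contain rich road traffic information and are an effective complement to other traffic information collection platforms. However, traffic information in natural-language text mostly describes the spatial location of traffic events by linear referencing or by landmark and direction, and event elements are frequently missing or implicit, which considerably hinders automated extraction. Considering that natural-language expressions of traffic information, although free and casual, follow relatively fixed patterns, this paper proposes a pattern matching method for extracting road traffic information from Internet texts. First, a pattern library is built from the linguistic features of road traffic event descriptions. Then, Internet texts and extraction patterns are both expressed as part-of-speech sequences of feature words, and the DTW distance is used to measure sequence similarity and thereby match extraction patterns. Finally, structured road traffic information is obtained under the guidance of the matched extraction pattern and the filling rules. Experiments on Shanghai urban traffic portal websites and microblog platforms show that the proposed method extracts road traffic information with precision and recall above 90% and 80%, respectively. This indicates that the method can effectively extract the road traffic information contained in Internet texts, is relatively simple to implement, is easy to extend, and is usable in practice.

Key words (Chinese version): Internet text, road traffic information, pattern matching, DTW distance, information extraction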

Abstract:

Internet pages and microblog messages often contain a large amount of road traffic information and can serve as an important data source for urban road traffic information collection. However, current information extraction technology for Chinese natural-language text is not well suited to extracting road traffic information from Internet texts, for two reasons: (1) locations in these texts are usually described by linear referencing; and (2) some information elements are missing or left implicit. In this paper, we propose a pattern matching method for extracting road traffic information from Internet texts. The method focuses on obtaining the location and event elements of road traffic information, because these are the elements most often affected by the two issues above. First, an extraction pattern is defined as a sequence in which each item has two parts: the part of speech (POS) of a road traffic feature word and an information attribute type. An extraction pattern library is then built from the linguistic features of road traffic event descriptions. Second, the preprocessed Internet text and the extraction patterns are both represented as POS sequences. Third, sequence similarity measured by dynamic time warping (DTW) is used in pattern matching to find the most suitable extraction pattern in the library for a given text. Finally, the elements and attributes of the traffic information are extracted from the text under the guidance of the matched pattern. To supply missing or implicit elements, filling rules based on the syntactic structure of the expression are introduced into the extraction process.
In an experiment on Internet texts about road traffic in Shanghai, drawn mainly from official traffic information websites and the Sina microblog platform, the precision and recall of road traffic information extraction exceed 90% and 80%, respectively. The results verify the effectiveness of the proposed approach. The method meets practical requirements, since its accuracy is higher than the average in real-world public traffic services, and it could effectively extract structured road traffic information from the websites of other cities by using the corresponding road lexicons.

Key words: Internet text, road traffic information, pattern matching, DTW, information extraction
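
The pattern matching step described in the abstract (representing a text and each extraction pattern as POS sequences, then choosing the pattern with the smallest DTW distance) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag names (`ROAD`, `FROM`, `TO`, `NEAR`, `LANDMARK`, `EVENT`), the three-entry pattern library, the 0/1 local cost, and the function names `dtw_distance` and `best_pattern` are all hypothetical, since the abstract does not specify the tag set, the cost function, or any normalization.

```python
from typing import Dict, List, Sequence, Tuple


def dtw_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Classic dynamic time warping distance between two tag sequences.

    Assumption: the local cost is 0 when two tags are equal and 1 otherwise.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] holds the cheapest cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def best_pattern(
    text_tags: Sequence[str], library: Dict[str, List[str]]
) -> Tuple[str, float]:
    """Return the name and DTW distance of the closest pattern in the library."""
    scored = [(name, dtw_distance(text_tags, tags)) for name, tags in library.items()]
    return min(scored, key=lambda item: item[1])


# Hypothetical pattern library; the real library is built from the
# linguistic features of road traffic event descriptions.
PATTERNS: Dict[str, List[str]] = {
    "road_from_to_event": ["ROAD", "FROM", "ROAD", "TO", "ROAD", "EVENT"],
    "road_near_landmark_event": ["ROAD", "NEAR", "LANDMARK", "EVENT"],
    "road_event": ["ROAD", "EVENT"],
}

# Hypothetical tag sequence for a text such as
# "<road> from <road> to <road>: congested, slow moving".
text_tags = ["ROAD", "FROM", "ROAD", "TO", "ROAD", "EVENT", "EVENT"]

name, distance = best_pattern(text_tags, PATTERNS)
print(name, distance)
```

Because DTW allows one tag in a pattern to align with several consecutive tags in the text, the repeated `EVENT` tag above does not penalize the match. That tolerance for sequences of different lengths is one plausible reason to prefer DTW over a position-by-position comparison. Extracting the elements along the alignment path and applying the filling rules are not shown here.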