Journal of Geo-information Science (地球信息科学学报), 2016, Vol. 18, Issue 7: 886-893. doi: 10.3724/SP.J.1047.2016.00886

• Theory and Methods of Geo-information Science •

蕴含地理事件微博客消息的自动识别方法

仇培元1,2, 陆锋1, 张恒才1,*, 余丽1,2

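  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101
  2. University of Chinese Academy of Sciences, Beijing 100101
  • Received: 2015-09-07  Revised: 2015-11-03  Published: 2016-07-15  Released: 2016-07-15
  • Corresponding author: ZHANG Hengcai  E-mail: qiupy@lreis.ac.cn; zhanghc@lreis.ac.cn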
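  • About the author:

    QIU Peiyuan (born 1986), male, PhD candidate; his research interest is Internet-based spatial information search. E-mail: qiupy@lreis.ac.cn

  • Funding:
    National High Technology Research and Development Program of China ("863" Program) (2013AA120305); National Natural Science Foundation of China (41401460)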

Automatic Identification Method of Micro-blog Messages Containing Geographical Events

QIU Peiyuan1,2, LU Feng1, ZHANG Hengcai1,*, YU Li1,2

  1. State Key Lab of Resources and Environmental Information System, IGSNRR, CAS, Beijing 100101, China
  2. University of Chinese Academy of Sciences, Beijing 100101, China
  • Received: 2015-09-07  Revised: 2015-11-03  Online: 2016-07-15  Published: 2016-07-15
  • Contact: ZHANG Hengcai  E-mail: qiupy@lreis.ac.cn; zhanghc@lreis.ac.cn

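Summary:

Micro-blog texts contain many types of geographical event information, which can compensate for the shortcomings of traditional fixed-point monitoring and improve the quality of emergency response. However, because large-scale annotated corpora are generally scarce, supervised learning cannot be used to identify micro-blog texts that contain geographical event information. This paper therefore proposes an automatic method for identifying micro-blog messages that contain geographical events, which improves identification by using quickly acquired corpus resources. The method exploits the ability of topic models to extract the set of topics in a document: candidate corpus texts are filtered by topic, so that a geographical event corpus is extracted automatically. In addition, a distributed-representation word vector model is introduced into the event relevance calculation, and the semantic information implicit in the word vectors enriches the context of short micro-blog texts, further improving the identification of event messages. Experiments with Sina Weibo as the data source show that, for message texts taken from event-related Weibo topics, the proposed method reaches an F-1 value of 71.41%, which is 10.79% higher than the classic supervised learning method based on the SVM model. On a dataset of 5 million micro-blogs that simulates the real Weibo environment, the identification precision reaches 60%.

Keywords: micro-blog, geographical event, event text identification, topic model, word vector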
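
The topic-based corpus filtering described in this abstract can be illustrated with a minimal Python sketch. It assumes that a topic model (for example LDA) has already assigned each fetched page a dominant topic and produced top keywords per topic. The abstract does not say how a topic is judged to be event-related, so the seed-word overlap rule, the `min_hits` parameter, and all function names and toy data below are hypothetical.

```python
from typing import Dict, List, Set


def select_event_topics(topic_keywords: Dict[int, List[str]],
                        seed_words: Set[str],
                        min_hits: int = 2) -> Set[int]:
    """Return ids of topics whose keywords contain at least min_hits seed words.

    The overlap rule is an assumption made for illustration only.
    """
    return {
        topic_id
        for topic_id, keywords in topic_keywords.items()
        if len(seed_words.intersection(keywords)) >= min_hits
    }


def build_corpus(pages: List[str],
                 page_topics: List[int],
                 event_topics: Set[int]) -> List[str]:
    """Keep only the pages whose dominant topic was judged event-related."""
    return [page for page, topic in zip(pages, page_topics)
            if topic in event_topics]


# Toy data standing in for topic-model output (hypothetical).
topic_keywords = {
    0: ["rainstorm", "flood", "rainfall", "warning"],
    1: ["concert", "ticket", "singer", "stage"],
}
pages = ["Heavy rainfall flooded the underpass.",
         "The singer announced a new tour."]
page_topics = [0, 1]  # dominant topic id of each page
seed_words = {"rainstorm", "rainfall", "flood"}

event_topics = select_event_topics(topic_keywords, seed_words)
corpus = build_corpus(pages, page_topics, event_topics)
print(corpus)
```

In this sketch only the pages under topic 0 are kept, because its keywords share three words with the seed set while topic 1 shares none.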

Abstract:

Micro-blogs contain many types of geographical event information, which can compensate for the shortcomings of traditional fixed-point monitoring technologies and improve the quality of emergency response. Identifying the micro-blog messages that contain geographical event information is the prerequisite for fully utilizing this data source. Trigger-based methods and supervised machine learning methods are commonly adopted to identify event-related texts. Of the two, supervised machine learning methods perform better than trigger-based ones on unrestricted texts. Unfortunately, the lack of large-scale tagged corpora means that supervised machine learning methods cannot be applied to identify messages related to geographical events. In this paper, we propose an automatic method for recognizing micro-blogs that are related to geographical events, based on the topic model and word vectors. This method can achieve a satisfactory identification result by rapidly increasing the corpus scale. Firstly, because the topic model is able to extract topics from documents, the web pages fetched by a search engine are grouped by topic, and the corpus is obtained by combining the pages under those topics that are judged, from each topic's keywords, to be related to geographical events. Secondly, the distributed representation word vector model is introduced to compensate for the lack of context in micro-blogs, which is caused by their character count limit. These word vectors incorporate the contextual semantic information of the training corpus during vector generation. Thirdly, the correlation between a micro-blog message and the given geographical event is calculated and used to determine whether the message contains the specified geographical event. In addition, some heuristic rules are used to correct erroneous correlations for very short messages.
Experiments with the rainstorm as the target geographical event are conducted to validate the feasibility of this approach. The test conducted on Sina topic micro-blogs shows that the F-1 score of identification reaches 71.41%, which is 10.79% higher than that of the traditional machine learning algorithm based on the Support Vector Machine. Provided that the precision loss is limited, the recall rate rises as the corpus scale increases. The recognition precision reaches 60% on a dataset of five million micro-blog texts that simulates the actual data content and environment. The recognized event-related micro-blogs could be used to extract detailed information elements in the future.

Key words: micro-blog, geographical event, event text identification, topic model, word vector
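
The correlation step described in the abstract can be illustrated with a minimal Python sketch. It assumes that a message and the event are each represented by the mean of their word vectors and compared by cosine similarity. The abstract specifies neither the correlation formula nor the heuristic rules, so the averaging, the `threshold` and `min_tokens` parameters, the short-message rule, and the two-dimensional toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(tokens: Sequence[str], vectors: Dict[str, Vector]) -> Vector:
    """Average the vectors of in-vocabulary tokens; return [] if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 when either vector is empty or has zero norm."""
    if not a or not b:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_event_message(tokens: Sequence[str],
                     event_words: Sequence[str],
                     vectors: Dict[str, Vector],
                     threshold: float = 0.7,
                     min_tokens: int = 3) -> bool:
    """Decide whether a tokenized message is related to the event."""
    if len(tokens) < min_tokens:
        # Hypothetical heuristic for very short messages: the averaged vector
        # is unreliable, so require an explicit event word instead.
        return any(t in event_words for t in tokens)
    score = cosine(mean_vector(tokens, vectors),
                   mean_vector(event_words, vectors))
    return score >= threshold


# Toy 2-dimensional vectors standing in for trained word vectors (hypothetical).
vectors = {
    "rainstorm": [0.9, 0.1], "flood": [0.8, 0.2], "street": [0.6, 0.4],
    "concert": [0.1, 0.9], "ticket": [0.2, 0.8], "tonight": [0.3, 0.7],
}
event_words = ["rainstorm", "flood"]

related = is_event_message(["flood", "street", "tonight"], event_words, vectors)
unrelated = is_event_message(["concert", "ticket", "tonight"], event_words, vectors)
print(related, unrelated)
```

Averaging lets a message that never names the event still score highly when its words lie close to the event words in vector space, which is one way the word vectors could compensate for the missing context in short texts.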