

Automatic Identification Method of Micro-blog Messages Containing Geographical Events

  • 1. State Key Lab of Resources and Environmental Information System, IGSNRR, CAS, Beijing 100101, China
  • 2. University of Chinese Academy of Sciences, Beijing 100101, China
*Corresponding author: ZHANG Hengcai, E-mail:

Micro-blogs contain many types of geographical event information, which can compensate for the shortcomings of traditional fixed-point monitoring technologies and improve the quality of emergency response. Identifying the micro-blog messages that contain geographical event information is the prerequisite for fully utilizing this data source. Trigger-based methods and supervised machine learning methods are commonly adopted to identify event-related texts. For unrestricted text, supervised machine learning methods perform better than trigger-based ones. However, the lack of large-scale tagged corpora prevents supervised machine learning methods from being applied to geographical event-related messages. In this paper, we propose an automatic method, based on a topic model and word vectors, for recognizing micro-blogs related to geographical events. The method achieves a satisfactory identification result by rapidly increasing the corpus scale. First, because a topic model can extract topics from documents, the web pages fetched by a search engine are grouped by topic, the keywords of each topic are judged for relevance to geographical events, and the pages under the relevant topics are combined into the corpus. Second, a distributed-representation word vector model is introduced to compensate for the lack of context in micro-blogs, which is caused by their character limit. During vector generation, the word vectors incorporate contextual semantic information from the training corpus. Third, the correlation between a micro-blog message and the given geographical event is calculated and used to determine whether the message contains that event. In addition, heuristic rules are used to correct erroneous correlations for very short messages.
Experiments with rainstorms as the target geographical event were conducted to validate the feasibility of this approach. A test on Sina topic micro-blogs shows that the F-1 of identification reaches 71.41%, which is 10.79 percentage points higher than a traditional machine learning algorithm based on the Support Vector Machine. Provided that the loss of precision is limited, recall rises as the corpus scale increases. Recognition precision reaches 60% on a dataset of five million micro-blog texts that simulates actual data content and environment. The recognized event-related micro-blogs can be used to extract detailed information elements in the future.

1 Introduction


2 Identification method

Fig.1 Flowchart of the identification method

2.1 Corpus extraction

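Because the information categories in the web page text collection are unknown and the texts are unlabeled, supervised methods are difficult to apply, so an unsupervised topic model is used to select the texts related to geographical events. A topic model generates a set of topics from a document collection according to semantic associations, and yields the topic probability distribution of each document and the term probability distribution of each topic. The candidate corpus can therefore be partitioned by topic, and the topics related to geographical events can be selected from the high-probability keywords of each topic, which yields the geographical event text corpus. Latent Dirichlet Allocation (LDA) is one of the most widely used topic models, and most current topic model research is related to it [18]. LDA was proposed by Blei et al. [19] as a three-layer Bayesian probabilistic model that extends Probabilistic Latent Semantic Indexing (PLSI). The model assumes that every word in a document is produced by a generative process in which "a topic is chosen with a certain probability, and a word is then chosen from that topic with a certain probability", with both probability distributions governed by Dirichlet priors. Because the information categories in the web page text collection are hard to determine without prior manual inspection, this study uses the Hierarchical LDA (HLDA) model to extract the texts related to the target geographical event. HLDA is an improvement by Blei et al. on the LDA model; it builds a tree-structured hierarchy among topics and automatically estimates the number of topics at each level through the Chinese Restaurant Process (CRP) [20].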
Fig.2 An example of topic extraction and correlation interpretation from the candidate corpuses about rainstorm
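
The topic-based filtering step in Section 2.1 can be illustrated with a minimal Python sketch. It assumes a flat document-topic probability matrix and assigns each document to its most probable topic; both are simplifications for illustration, since HLDA places documents on paths of a topic tree. The names `select_event_corpus`, `doc_topic_probs`, and `relevant_topics` are hypothetical, and the probabilities are toy values, not results from the paper.

```python
def select_event_corpus(doc_topic_probs, relevant_topics):
    """Return indices of documents whose most probable topic was judged relevant.

    doc_topic_probs: one list of topic probabilities per document (toy data here).
    relevant_topics: topic indices a human marked as event-related after
                     reading each topic's high-probability keywords.
    """
    selected = []
    for doc_id, probs in enumerate(doc_topic_probs):
        # Assumption: a document belongs to its single most probable topic.
        dominant = max(range(len(probs)), key=probs.__getitem__)
        if dominant in relevant_topics:
            selected.append(doc_id)
    return selected


# Toy document-topic distributions for four pages and three topics.
doc_topic_probs = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.30, 0.10, 0.60],
    [0.55, 0.40, 0.05],
]
relevant_topics = {0, 2}  # hypothetical manual judgement
event_docs = select_event_corpus(doc_topic_probs, relevant_topics)
print(event_docs)
```

Only the topics need manual inspection in this scheme; the documents themselves are never labeled one by one.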


2.2 Word vector construction

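In natural language computation, words are usually represented by word vectors. A distributed representation model incorporates the contextual semantics of each word in the corpus text during word vector generation, which makes it possible to compare the semantic similarity or relatedness of words. Bengio et al. introduced distributed-representation word vectors in their Neural Network Language Model (NNLM) [21], which became the basis of subsequent research. Later, the CBOW and Skip-gram models proposed by Mikolov et al. [22] removed the hidden layer of the NNLM, trading some accuracy for a large gain in training efficiency. The output of the Skip-gram model performs well in semantic similarity computation, so the word vector set is built from the event-related corpus with this model. Fig.3 shows an example of word relatedness results; the list under each column contains the 10 words most related to the head word.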
Fig.3 An example of related words computation based on the word vector
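
To make the Skip-gram idea concrete, the sketch below generates the (center word, context word) pairs that a Skip-gram model is trained to predict. It is an illustration of the training signal only, not the word2vec toolkit used in the paper; the function name `skipgram_pairs`, the window size, and the English toy tokens are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy segmented sentence (hypothetical); real input would be NLPIR-segmented text.
tokens = ["heavy", "rain", "floods", "road"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Words that share many context words receive similar vectors during training, which is what lets the model surface related words such as those listed in Fig.3.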


2.3 Event message identification

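2.3.1 Correlation calculation
If the word vectors of words $w_i$ and $w_j$ are $w_i = (\mathrm{vec}_{i1}, \mathrm{vec}_{i2}, \dots, \mathrm{vec}_{ik}, \dots, \mathrm{vec}_{in})$ and $w_j = (\mathrm{vec}_{j1}, \mathrm{vec}_{j2}, \dots, \mathrm{vec}_{jk}, \dots, \mathrm{vec}_{jn})$, the inter-word correlation $\mathrm{rel}_{word}(w_i, w_j)$, based on the cosine of the angle between the vectors, can be computed by Eq. (1).
$$\mathrm{rel}_{word}(w_i, w_j) = \frac{\sum_{k=1}^{n} \mathrm{vec}_{ik}\,\mathrm{vec}_{jk}}{\sqrt{\sum_{k=1}^{n} \mathrm{vec}_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} \mathrm{vec}_{jk}^{2}}} \qquad (1)$$
If the micro-blog message text is $text = \{w_1, w_2, \dots, w_n\}$ and the core word set is $\{keyw_1, keyw_2, \dots, keyw_m\}$, the correlation $\mathrm{rel}_{event}(text, topic)$ between the message text and the target geographical event $topic$ is computed by Eq. (2).
$$\mathrm{rel}_{event}(text, topic) = \frac{\sum_{k=1}^{n} \max_{1 \le g \le m} \mathrm{rel}_{word}(w_k, keyw_g)}{n} \qquad (2)$$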
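
Eqs. (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's Java implementation: the function names mirror the symbols in the equations, the two-dimensional vectors are toy values, and returning 0 for a zero-length vector or an empty input is an assumption the paper does not specify.

```python
import math


def rel_word(vec_i, vec_j):
    """Eq. (1): cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(vec_i, vec_j))
    norm_i = math.sqrt(sum(a * a for a in vec_i))
    norm_j = math.sqrt(sum(b * b for b in vec_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: zero vectors are treated as unrelated
    return dot / (norm_i * norm_j)


def rel_event(text_vecs, key_vecs):
    """Eq. (2): mean over message words of the best match against any core word."""
    if not text_vecs or not key_vecs:
        return 0.0  # assumption: empty input yields zero correlation
    best = [max(rel_word(w, k) for k in key_vecs) for w in text_vecs]
    return sum(best) / len(text_vecs)


# Toy vectors (hypothetical values, not trained embeddings).
vectors = {"rainstorm": [1.0, 0.0], "flood": [0.6, 0.8], "lunch": [0.0, 1.0]}
key_vecs = [vectors["rainstorm"]]
message = [vectors["flood"], vectors["lunch"]]
score = rel_event(message, key_vecs)
print(round(score, 3))
```

Because Eq. (2) averages over all words in the message, a single unrelated word can pull the score of a very short message down sharply, which is the kind of error the heuristic rules in Section 2.3.2 are meant to correct.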
2.3.2 Heuristic rule constraints
Tab.1 Some instances of part-of-speech patterns


Pattern     Occurrences
v n         327
n v         170
n n         72
m q n       19
n m q       18
n d v       16
a n         16
v m n       15
n a         11
v b n       10
m n p v     10
v u n       10


2.3.3 Classification threshold

3 Experimental analysis

3.1 Experimental setup

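Rainstorm events are used as the target geographical event to verify the identification method. The event training corpus comes from Baidu search results: related web pages were collected with 24 groups of keywords such as "北京暴雨" (Beijing rainstorm), "广东暴雨" (Guangdong rainstorm), "上海暴雨" (Shanghai rainstorm), and "成都暴雨" (Chengdu rainstorm), and after deduplication and body-text extraction, 10,041 web page texts were obtained as the candidate corpus. The test data come from two sources. (1) The labeled micro-blog dataset. A crawler fetched the micro-blog messages under seven rainstorm-related micro-blog topics: "#北京暴雨#" (Beijing), "#广东暴雨#" (Guangdong), "#成都暴雨#" (Chengdu), "#南京暴雨#" (Nanjing), "#上海暴雨#" (Shanghai), "#天津暴雨#" (Tianjin), and "#重庆暴雨#" (Chongqing). After the topic tags were removed, the messages were read manually and those related to rainstorm events were labeled. Then 500 related messages and 500 unrelated messages were randomly selected to form the experimental dataset. (2) The five-million micro-blog dataset. This is the Sina Weibo dataset released by Dr. Zhang Huaping of Beijing Institute of Technology, which contains 4,993,581 micro-blog messages. The extraction algorithm is implemented in Java: word segmentation calls the NLPIR 2015 toolkit (http://ictclas.nlpir.org/), the HLDA algorithm calls the Mallet toolkit (http://mallet.cs.umass.edu/;https://github.com/chyikwei/topicModels), and skip-gram word vector generation calls the Google word2vec toolkit (https://code.google.com/p/word2vec/).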
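Three metrics are used to evaluate the method: precision $P$, recall $R$, and the F-measure. They are computed as in Eqs. (3)-(5).
$$P = \frac{\text{number of correctly identified related messages}}{\text{number of messages identified as related}} \qquad (3)$$
$$R = \frac{\text{number of correctly identified related messages}}{\text{number of related messages that should be identified}} \qquad (4)$$
$$F\text{-}\beta = \frac{(\beta^{2} + 1) \times P \times R}{\beta^{2} P + R} \qquad (5)$$
The F-measure gives an overall evaluation of the identification method based on precision and recall. Here $\beta$ adjusts the relative weight of precision and recall; it is usually set to 1, which treats precision and recall as equally important (Eq. (6)).
$$F\text{-}1 = \frac{2 \times P \times R}{P + R} \qquad (6)$$
Because practical applications give priority to the reliability of messages, that is, the precision of the identification result, the F-measure at $\beta = 0.5$ is also examined (Eq. (7)).
$$F\text{-}0.5 = \frac{1.25 \times P \times R}{0.25 \times P + R} \qquad (7)$$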
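
As a quick check of Eqs. (3)-(7), the following Python sketch computes precision, recall, and the F-measure, and applies them to the precision and recall that the paper reports for its method in Section 3.2.1 (P = 69.71%, R = 73.20%). The function names are illustrative assumptions; returning 0 when a denominator is 0 is also an assumption, since the paper does not discuss that case.

```python
def precision_recall(true_pos, n_predicted, n_actual):
    """Eqs. (3) and (4): precision and recall from raw counts."""
    p = true_pos / n_predicted if n_predicted else 0.0  # assumption: 0 when nothing is predicted
    r = true_pos / n_actual if n_actual else 0.0        # assumption: 0 when nothing is relevant
    return p, r


def f_beta(p, r, beta=1.0):
    """Eq. (5): F-measure; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    denom = b2 * p + r
    if denom == 0:
        return 0.0
    return (b2 + 1) * p * r / denom


# Precision and recall reported for the proposed method, in percent.
p, r = 69.71, 73.20
print(round(f_beta(p, r, beta=1.0), 2))  # 71.41, Eq. (6)
print(round(f_beta(p, r, beta=0.5), 2))  # 70.38, Eq. (7)
```

Both values agree with the F-1 and F-0.5 figures reported for the proposed method.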

3.2 Experimental results

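3.2.1 Classification threshold calculation
The proposed method for calculating the classification threshold is validated on the rainstorm event training corpus and the labeled micro-blog data. The classification threshold calculated from the rainstorm event training corpus is 0.505, and the corresponding identification result is P = 69.71%, R = 73.20%, F-1 = 71.41%, and F-0.5 = 70.38%. The identification results for classification thresholds in [0.1, 0.8] were computed in turn and are shown in Fig.4, where the numbers in the figure are the F-0.5 values for each threshold.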
Fig.4 Identification results under different thresholds
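
The threshold comparison behind Fig.4 can be sketched as a simple sweep: classify each message as related when its correlation score reaches the threshold, then compute F-0.5 per threshold. This is an illustration under stated assumptions, not the paper's procedure for deriving the 0.505 threshold from the training corpus. The scores and labels are toy values, the names are hypothetical, and the `>=` comparison is an assumption.

```python
def f_beta(p, r, beta=0.5):
    """F-measure of Eq. (5); returns 0 when the denominator is 0 (assumption)."""
    b2 = beta * beta
    denom = b2 * p + r
    return (b2 + 1) * p * r / denom if denom else 0.0


def evaluate(scores, labels, threshold, beta=0.5):
    """F-measure when messages with score >= threshold are classified as related."""
    predicted = [s >= threshold for s in scores]
    true_pos = sum(1 for hit, y in zip(predicted, labels) if hit and y)
    n_predicted = sum(predicted)
    n_actual = sum(labels)
    p = true_pos / n_predicted if n_predicted else 0.0
    r = true_pos / n_actual if n_actual else 0.0
    return f_beta(p, r, beta)


# Toy correlation scores and manual labels (hypothetical data).
scores = [0.72, 0.61, 0.55, 0.48, 0.42, 0.33]
labels = [True, True, False, True, False, False]

thresholds = [k / 10 for k in range(1, 9)]  # 0.1, 0.2, ..., 0.8
results = {t: evaluate(scores, labels, t) for t in thresholds}
best = max(results, key=results.get)
print(best, round(results[best], 4))
```

A sweep like this needs labeled messages, so it is useful for checking a threshold after the fact rather than for choosing one when no labeled data exist.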


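3.2.2 Comparison of identification performance
Using the rainstorm event training corpus and the labeled micro-blog data, the proposed method is compared with an existing supervised learning method. The supervised baseline is an identification method based on the Support Vector Machine (SVM) [24]. Following the work in [8] and [24]-[26], the identification features are: the number of words in the micro-blog message, the frequency of each word, the number of nouns, the number of stop words, the number of event words, and the number of numerals. In the experiment, the test data were randomly divided into 5 groups; 4 groups served as training data for the SVM model and the remaining group as test data, and the average over the cross-validation runs was taken as the final identification result, shown in Tab.2.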
Tab.2 Performance of the identification approach for micro-blogs containing rainstorm events


Method            Precision (%)   Recall (%)   F-1 (%)   F-0.5 (%)
Proposed method   69.71           73.20        71.41     70.38
SVM method        68.48           54.88        60.62     65.00
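
The five-fold evaluation protocol used for the SVM baseline can be sketched as follows. Only the splitting and averaging are shown; SVM training and feature extraction are out of scope, so `score_fn` is a hypothetical callback standing in for "train on these indices, test on those, return a metric". The function names and the fixed random seed are assumptions.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for k random, disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validated_mean(n_items, score_fn, k=5, seed=0):
    """Average score_fn(train, test) over the k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_splits(n_items, k, seed)]
    return sum(scores) / len(scores)


# 1,000 labeled messages, as in the labeled micro-blog dataset (500 + 500).
splits = list(k_fold_splits(1000, k=5))
print([(len(train), len(test)) for train, test in splits])
```

With 1,000 messages, each run trains on 800 and tests on 200, and every message is used for testing exactly once.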
Fig.5 Identification results using different scales of candidate corpuses
3.2.3 Open-environment experiment

3.3 Discussion

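(1) The proposed method still requires some manual involvement, mainly in the corpus extraction stage, where each topic is judged from its keywords as related or unrelated to the target geographical event. However, the number of topics is far smaller than the number of texts that would otherwise need labeling: in the experiment, 51 topics were extracted from the rainstorm candidate corpus of 10,041 texts, and this gap will widen as corpus resources grow. The method can therefore substantially reduce manual cost, allow the corpus to be updated quickly, and meet the need to identify messages about new types of geographical events.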

4 Conclusions


