Journal of Geo-information Science ›› 2019, Vol. 21 ›› Issue (8): 1152-1160. doi: 10.12082/dqxxkx.2019.190046

• Theory and Methods of Geo-information Science •

Application and Comparison of Topic Models in Social Media-Based Disaster Classification

SU Kai1, CHENG Changxiu1,*, Nikita Murzintcev2, ZHANG Ting1

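  1. Center for Geodata and Analysis, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China
  2. Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China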
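  • Received: 2019-01-25 Revised: 2019-04-24 Online: 2019-08-25 Published: 2019-08-25
  • Corresponding author: CHENG Changxiu E-mail: chengcx@bnu.edu.cn
  • About the author: SU Kai (1994-), male, from Hangzhou, Zhejiang, master's student; research interest: disaster big data analysis. E-mail: sukai_silence@163.com
  • Supported by:
    National Key Research and Development Program of China (2017YFB0504102); Fundamental Research Funds for the Central Universities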

Application and Comparison of Topic Model in Identifying Latent Topics from Disaster-Related Tweets

SU Kai1, CHENG Changxiu1,*, Nikita Murzintcev2, ZHANG Ting1

  1. Center for Geodata and Analysis, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China
  2. Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
  • Received: 2019-01-25 Revised: 2019-04-24 Online: 2019-08-25 Published: 2019-08-25
  • Contact: CHENG Changxiu E-mail: chengcx@bnu.edu.cn
  • Supported by:
    National Key Research and Development Program of China (2017YFB0504102); Fundamental Research Funds for the Central Universities

Abstract:

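Countries along the "Belt and Road" are prone to natural disasters, and most are developing countries with underdeveloped economies and weak disaster resistance. When a disaster occurs, mining and analyzing related Twitter data supports emergency rescue, disaster assessment, and disaster reduction and prevention, and provides important support for China's international rescue and relief work. Without requiring an empirical corpus, topic models can rapidly aggregate information that is valuable for disaster relief and assessment from massive volumes of disaster-related tweets. This paper applies the BTM and LDA models to fine-grained topic clustering of tweets related to Typhoon Haiyan (2013), analyzes the accuracy of the two models, and tests their ability to distinguish similar disaster topics. Based on the tweets in the "demand-related" topic class, it also uses place-name matching to analyze the spatial distribution of the degree of demand for supplies, medical care, and other needs in the Philippines during Typhoon Haiyan. The results show that: (1) when distinguishing short texts with similar topics, the overall accuracy of BTM was 0.598 while that of LDA was only 0.321, indicating that BTM is more accurate than LDA in identifying topics in Typhoon Haiyan tweets; (2) BTM identifies relatively fine-grained disaster topics such as "disaster location-related" and "blessing-related" well; (3) preliminary verification shows that the spatial distribution of the degree of demand for supplies, medical care, and other needs, generated from "demand-related" topic texts, is basically consistent with the actual demand.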

Keywords: topic model, BTM, LDA, tweet, topic classification, natural disaster, emergency management

Abstract:

From 1990 to 2010, natural disasters occurred with increasing frequency in countries along the "One Belt and One Road", most of which are developing countries with underdeveloped economies and weak disaster resistance. When disasters happen, people in those countries tweet about them in real time, and the tweets contain important information for emergency rescue, disaster assessment, and disaster reduction and prevention. Mining and analyzing relevant tweets can therefore provide strong support for China's international rescue and relief work. However, Twitter data is fragmented and unstructured, and the topics that tweets cover are numerous and miscellaneous, so rapidly screening out relevant information from tweets is a research challenge. Without an empirical corpus, a topic model can rapidly aggregate information that is valuable for disaster relief and assessment from a large number of disaster-related tweets. In this paper, the BTM and LDA models, which are widely used in natural language processing, were adopted to cluster Typhoon Haiyan-related tweets into fine-grained topics. We then verified and compared the accuracy of the two models and tested their ability to distinguish similar disaster topics. In addition, based on the "demand-related" tweets obtained from topic categorization, we used place-name matching to analyze the spatial distribution of the degree of demand for materials and medical care in the Philippines during Typhoon Haiyan. The results show that: (1) In classifying Typhoon Haiyan-related tweets into fine-grained topics, the overall accuracy of BTM was 0.598, while that of LDA was only 0.321, indicating that BTM outperformed LDA. (2) The F1-measure values of BTM for "disaster location-related" and "blessing-related" tweets were 0.8 and 0.78, respectively, indicating that BTM identifies tweets of those two topics well. (3) After preliminary verification, the spatial distribution of material and medical needs generated from "demand-related" tweets was basically consistent with the actual demand. Our findings can help quickly obtain first-hand disaster information from Twitter when China lacks relevant data on disasters occurring in the "One Belt and One Road" region, and thus provide data support for China's international rescue work. In addition, our methodology can be used to study domestic microblogs during disasters.
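
The overall accuracy and per-topic F1-measure quoted in results (1) and (2) are standard classification metrics. As a minimal sketch, assuming each tweet has one manually assigned reference topic and one model-assigned topic, they could be computed as follows. The function names `overall_accuracy` and `f1_for_topic`, the topic labels, and the toy data are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def overall_accuracy(truth: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of tweets whose predicted topic equals the reference topic."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)


def f1_for_topic(truth: Sequence[str], predicted: Sequence[str], topic: str) -> float:
    """F1-measure (harmonic mean of precision and recall) for one topic."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == topic and p == topic for t, p in pairs)  # true positives
    fp = sum(t != topic and p == topic for t, p in pairs)  # false positives
    fn = sum(t == topic and p != topic for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct hit: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only (not data from the study).
reference = ["location", "location", "blessing", "demand", "demand", "blessing"]
assigned = ["location", "demand", "blessing", "demand", "location", "blessing"]

print(overall_accuracy(reference, assigned))
print(f1_for_topic(reference, assigned, "location"))
```

Reporting a per-topic F1 next to the overall accuracy is useful here because a model can score well overall while still confusing two closely related topics.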

Key words: Topic model, BTM, LDA, Tweet, Topic categorization, Natural hazard, Emergency management
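
The abstract mentions place-name matching on "demand-related" tweets but does not describe the procedure. One simple way such a step could work is a gazetteer lookup that counts, for each place, the number of demand-related tweets naming it. The sketch below rests on that assumption: the gazetteer entries, the example tweets, and the names `build_pattern` and `count_demand_by_place` are hypothetical, and the paper's actual measure of "demand degree" may differ.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical gazetteer: lowercase name variant -> canonical place name.
GAZETTEER: Dict[str, str] = {
    "tacloban": "Tacloban",
    "tacloban city": "Tacloban",
    "cebu": "Cebu",
    "leyte": "Leyte",
}


def build_pattern(gazetteer: Dict[str, str]) -> "re.Pattern[str]":
    """Compile one case-insensitive regex matching any gazetteer name."""
    # Longest names first so "tacloban city" is preferred over "tacloban".
    names = sorted(gazetteer, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def count_demand_by_place(tweets: Iterable[str], gazetteer: Dict[str, str]) -> Counter:
    """Count tweets per canonical place; a place counts once per tweet."""
    pattern = build_pattern(gazetteer)
    counts: Counter = Counter()
    for text in tweets:
        places = {gazetteer[m.group(1).lower()] for m in pattern.finditer(text)}
        counts.update(places)
    return counts


# Made-up example tweets, for illustration only.
demand_tweets = [
    "Need water and food in Tacloban City, please help",
    "Medical teams needed in Leyte and Tacloban",
    "Cebu hospital is asking for medicine",
    "Praying for everyone affected",
]

for place, n in count_demand_by_place(demand_tweets, GAZETTEER).most_common():
    print(place, n)
```

The per-place counts could then be joined to coordinates or administrative boundaries to map where demand concentrates.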