Journal of Geo-information Science, 2019, Vol. 21, Issue (9): 1392-1401. doi: 10.12082/dqxxkx.2019.190005

• Theory and Methods of Geographic Information Science •

A Method for Filtering Open Geo-entity Relations Based on General Knowledge Bases

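GAO Jialiang1,2, YU Li3,*, QIU Peiyuan1, LU Feng1,2,4

  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101
  2. University of Chinese Academy of Sciences, Beijing 100049
  3. National Science Library, Chinese Academy of Sciences, Beijing 100190
  4. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023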
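  • Received: 2019-01-02  Revised: 2019-05-23  Publication date: 2019-09-25  Release date: 2019-09-24
  • About the author: GAO Jialiang (1994-), male, from Linyi, Shandong, PhD candidate; his research focuses on natural language processing and geographic knowledge graphs. E-mail: gaojl@lreis.ac.cn
  • Funding:
    Key Program of the National Natural Science Foundation of China (41631177)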

A Knowledge-based Method for Filtering Geo-entity Relations

GAO Jialiang1,2, YU Li3,*, QIU Peiyuan1, LU Feng1,2,4

  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  3. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
  4. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China
  • Received: 2019-01-02  Revised: 2019-05-23  Online: 2019-09-25  Published: 2019-09-24
  • Contact: YU Li
  • Supported by:
    National Natural Science Foundation of China (41631177)

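Abstract (Chinese version):

Text data offers a vast resource for geographic knowledge services. Extracting geo-entity relations from text is a core technique for building geographic knowledge graphs, and it directly affects the quality of geographic knowledge reasoning and services. Because text data inevitably contains noise, geo-entity relations extracted from text require quality evaluation and information filtering. This paper proposes a method for filtering geo-entity relations based on general knowledge bases, which selects the high-quality results from relations that have already been extracted. First, "ontology knowledge", "fact knowledge", and "synonym knowledge" are used to build a geographic relation knowledge base that serves as the reference data for filtering. Then, a distributed vector representation model is used to measure the semantic similarity between the extracted geo-entity relations and the reference data, so as to improve the richness and freshness of geographic knowledge graphs. Experimental results show that, compared with the widely used Stanford OpenIE tool, the proposed method reduces the MSE (Mean Square Error) over the confidence intervals [0, 0.2] and [0.8, 1] from 59.27% to 3.94%, and raises the AUC (Area Under the ROC Curve) from 0.51 to 0.89.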
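
The two evaluation measures quoted above can be made concrete with a small sketch. The abstract does not say how the MSE is restricted to confidence intervals, so this example assumes it is the mean squared difference between a triple's confidence and its 0/1 correctness label, taken over the triples whose confidence falls in the listed closed intervals. The AUC is computed from its standard rank interpretation. The function names `auc` and `mse_in_intervals` and the toy scores are hypothetical and are not taken from the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via its rank interpretation.

    Equals the probability that a randomly chosen correct triple (label 1)
    receives a higher confidence than a randomly chosen incorrect one
    (label 0); ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mse_in_intervals(scores, labels, intervals):
    """Mean squared error between confidence and 0/1 label.

    Assumption: only triples whose confidence lies in one of the closed
    intervals (lo, hi) are counted. Returns None if no triple qualifies.
    """
    picked = [
        (s, y)
        for s, y in zip(scores, labels)
        if any(lo <= s <= hi for lo, hi in intervals)
    ]
    if not picked:
        return None
    return sum((s - y) ** 2 for s, y in picked) / len(picked)


if __name__ == "__main__":
    # Toy confidences and correctness labels (hypothetical data).
    toy_scores = [0.9, 0.1, 0.8, 0.3]
    toy_labels = [1, 0, 1, 0]
    print(auc(toy_scores, toy_labels))
    print(mse_in_intervals(toy_scores, toy_labels, [(0.0, 0.2), (0.8, 1.0)]))
```

Under this reading, a low MSE in the extreme intervals means that triples given very high or very low confidence are almost always correct or incorrect, respectively.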

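Keywords: text data, geo-entity relation extraction, geographic knowledge graph construction, general knowledge bases, open relation extraction, geographic information quality evaluation, information filtering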

Abstract:

Knowledge graphs (KGs) are crucial resources for geographical knowledge services. Because web text contains a vast amount of geographical knowledge, extracting geo-entity relations from it has become a core technology for constructing geographical KGs, and it directly affects the quality of geographical knowledge services. However, web text inevitably contains noise, and geographical knowledge can be sparsely distributed; both greatly restrict the quality of geo-entity relation extraction. Here, we propose a method for filtering geo-entity relations based on existing knowledge bases (KBs). Specifically, ontology knowledge, fact knowledge, and synonym knowledge are integrated to generate geo-related knowledge. The extracted geo-entity relations and the geo-related knowledge are then converted into vectors, and the maximum similarity between the vector of an extracted relation triple and the vectors of the geo-related knowledge is taken as the confidence value of that triple. The method takes full advantage of existing KBs to assess the quality of geographical information in web text, which helps improve the richness and freshness of geographical KGs. Compared with the Stanford OpenIE method, our method decreases the mean square error (MSE) from 0.62 to 0.06 in the confidence interval [0.7, 1] and improves the area under the receiver operating characteristic (ROC) curve (AUC) from 0.51 to 0.89.
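
The confidence scoring described above can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses tiny hand-made vectors in place of real embeddings; the abstract specifies neither. The names `cosine`, `confidence`, and `filter_triples`, the clamping of negative similarities to zero, and the 0.8 threshold are all hypothetical choices made for illustration, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def confidence(triple_vec, reference_vecs):
    """Maximum similarity between one extracted triple and the reference knowledge."""
    if not reference_vecs:
        return 0.0
    best = max(cosine(triple_vec, ref) for ref in reference_vecs)
    # Assumption: negative similarities are clamped so the score stays in [0, 1].
    return max(0.0, best)


def filter_triples(triple_vecs, reference_vecs, threshold=0.8):
    """Keep the triples whose confidence reaches the (hypothetical) threshold.

    triple_vecs maps a triple identifier to its vector; the result is a list
    of (identifier, confidence) pairs in the mapping's iteration order.
    """
    kept = []
    for name, vec in triple_vecs.items():
        score = confidence(vec, reference_vecs)
        if score >= threshold:
            kept.append((name, score))
    return kept


if __name__ == "__main__":
    # Toy 2-D vectors standing in for embeddings of reference knowledge
    # and of extracted triples (hypothetical data).
    reference = [[1.0, 0.0], [0.0, 1.0]]
    extracted = {"triple_a": [1.0, 0.0], "triple_b": [1.0, 1.0]}
    for name, score in filter_triples(extracted, reference):
        print(name, round(score, 3))
```

Taking the maximum rather than the mean means that a single close match in the reference knowledge is enough to give an extracted triple a high confidence.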

Key words: text data, geo-entity relation extraction, geographical KG construction, general knowledge bases, open relation extraction, evaluation of geographic information quality, information filtering